| Author |
Topic |
|
Beach_m2000
Starting Member
9 Posts |
Posted - 2011-01-19 : 02:45:04
|
| Hello,We are building large database of feature vectors. Each vector has 7 elements (dimensions), where each element is a number between 0 and 255 (8 bits).I have a VERY simple query to return the K nearest vectors to a test vector T (T1..T7), but it does not scale very well. The database may have many million vector entries.Here is the query:SELECT TOP (K) * FROM myTableORDER BY SQRT(POWER(V1 – T1,2) + POWER(V2 – T2,2) + POWER(V3 – T3,2) + POWER(V4 – T4,2) + POWER(V5 – T5,2) + POWER(V6 – T6,2) + POWER(V7 – T7,2));MyTable contains a unique ID and the vector elements V1..V7.We have not decided what DB engine to use, but most likely it will be MS SQL or MySQL.We can trade off accuracy for performance, as long as the accuracy is known.Has anyone worked on SQL implementations of a approximate kNN algorithm that can point me in the right direction?Regards,Stein Erik |
|
|
Lumbago
Norsk Yak Master
3271 Posts |
Posted - 2011-01-19 : 04:04:17
|
| Another fellow norwegian?? Soo cool :)Unfortunately I have no clue whatsoever what this is all about but do you have a small working sample of how this works? I'm almost positive that I'm not even remotely close to being able to answer your question, but it would be interesting to see how this works. Can you provide the table definition and some working inserts in the following format -> http://weblogs.sqlteam.com/brettk/archive/2005/05/25/5276.aspx- LumbagoMy blog (yes, I have a blog now! just not that much content yet) -> www.thefirstsql.com |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2011-01-19 : 04:21:38
|
Remove the SQRT part. It's only needed to get the actual distance.If you only is interested in getting the nearest, the actual distance is of no importance.What I mean is that is doen't matter if the distance is 7 or 49, right? As long as all distances are without the SQRT part. N 56°04'39.26"E 12°55'05.63" |
 |
|
|
Beach_m2000
Starting Member
9 Posts |
Posted - 2011-01-19 : 05:01:59
|
quote: Originally posted by Peso Remove the SQRT part. It's only needed to get the actual distance.If you only is interested in getting the nearest, the actual distance is of no importance.What I mean is that is doen't matter if the distance is 7 or 49, right? As long as all distances are without the SQRT part.
Thanks! That helps the calculation speed per comparison, but not the number of compares ;-)! |
 |
|
|
Lumbago
Norsk Yak Master
3271 Posts |
Posted - 2011-01-19 : 05:25:28
|
I think I might have a solution for you actually but I'm not 100% sure and it does involve some precalculations to be done and some extra storage for each vector. The idea is basically to precalculate a "checksum" for the vector as a single numeric column and then index this column. I got the idea from a project I did where I needed to sort ip addresses and I *think* it might work. From what I understand your vectors and your test vectors look like this:CREATE TABLE vectors ( ID int IDENTITY(1, 1) PRIMARY KEY, V1 smallint, V2 smallint, V3 smallint, V4 smallint, V5 smallint, V6 smallint, V7 smallint)INSERT INTO vectors (V1, V2, V3, V4, V5, V6, V7)VALUES (0, 20, 40, 60, 80, 100, 120), (5, 25, 45, 65, 85, 105, 125), (10, 30, 50, 70, 90, 110, 130), (15, 35, 55, 75, 95, 115, 135)--> Test vectorsDECLARE @T1 smallint = 3, @T2 smallint = 23, @T3 smallint = 43, @T4 smallint = 63, @T5 smallint = 83, @T6 smallint = 103, @T7 smallint = 123SELECT *FROM vectorsORDER BY POWER(V1 - @T1,2) + POWER(V2 - @T2,2) + POWER(V3 - @T3,2) + POWER(V4 - @T4,2) + POWER(V5 - @T5,2) + POWER(V6 - @T6,2) + POWER(V7 - @T7,2) Now, by creating a function like this:CREATE FUNCTION dbo.KNNChecksum ( @V1 smallint, @V2 smallint, @V3 smallint, @V4 smallint, @V5 smallint, @V6 smallint, @V7 smallint)RETURNS DECIMAL(28, 9)ASBEGIN RETURN CONVERT(decimal(28, 0), @V1) + CONVERT(decimal(28, 0), @V2 * 256) + CONVERT(decimal(28, 0), @V3 * POWER(CONVERT(decimal(28, 0), 256), 2)) + CONVERT(decimal(28, 0), @V4 * POWER(CONVERT(decimal(28, 0), 256), 3)) + CONVERT(decimal(28, 0), @V5 * POWER(CONVERT(decimal(28, 0), 256), 4)) + CONVERT(decimal(28, 0), @V6 * POWER(CONVERT(decimal(28, 0), 256), 5)) + CONVERT(decimal(28, 0), @V7 * POWER(CONVERT(decimal(28, 0), 256), 6))END Can you then do the following and get the correct result? ->ALTER TABLE vectors ADD VectorChecksum decimal(28, 0)UPDATE vectors SET VectorChecksum = dbo.KNNChecksum(V1, V2, V3, V4, V5, V6, V7)DECLARE @TestChecksum decimal(28, 0) = dbo.KNNChecksum(@T1, @T2, @T3, @T4, @T5, @T6, @T7)SELECT TOP 1 *FROM vectorsORDER BY ABS(VectorChecksum - @TestChecksum) - LumbagoMy blog (yes, I have a blog now! just not that much content yet) -> www.thefirstsql.com |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2011-01-19 : 05:40:09
|
Here is an alternative approach. Try it. It creates a "bounding box" to compare with.CREATE TABLE #Result ( v1 TINYINT NOT NULL, v2 TINYINT NOT NULL, v3 TINYINT NOT NULL, v4 TINYINT NOT NULL, v5 TINYINT NOT NULL, v6 TINYINT NOT NULL, v7 TINYINT NOT NULL, Dist INT )CREATE UNIQUE CLUSTERED INDEX UCX_Peso ON #Result (v1, v2, v3, v4, v5, v6, v7) WITH (IGNORE_DUP_KEY = ON, FILLFACTOR = 90)DECLARE @t1 TINYINT = 2, @t2 TINYINT = 7, @t3 TINYINT = 24, @t4 TINYINT = 12, @t5 TINYINT = 2, @t6 TINYINT = 16, @t7 TINYINT = 17, @WantedRecords INT = 22, @Loop SMALLINT = 0WHILE (SELECT COUNT(*) FROM #Result < @WantedRecords) OR @Loop >= 255 BEGIN INSERT #Result ( v1, v2, v3, v4, v5, v6, v7 ) SELECT v1, v2, v3, v4, v5, v6, v7 FROM dboo.SourceTable WHERE v1 BETWEEN @t1 - @Loop AND @t1 + @Loop AND v2 BETWEEN @t2 - @Loop AND @t2 + @Loop AND v3 BETWEEN @t3 - @Loop AND @t3 + @Loop AND v4 BETWEEN @t4 - @Loop AND @t4 + @Loop AND v5 BETWEEN @t5 - @Loop AND @t5 + @Loop AND v6 BETWEEN @t6 - @Loop AND @t6 + @Loop AND v7 BETWEEN @t7 - @Loop AND @t7 + @Loop SET @Loop += 1 ENDUPDATE #ResultSET Dist = SQRT(POWER(v1 - @t1, 2) + POWER(v2 - @t2, 2) + POWER(v3 - @t3, 2) + POWER(v4 - @t4, 2) + POWER(v5 - @t5, 2) + POWER(v6 - @t6, 2) + POWER(v7 - @t7, 2))SELECT TOP(@WantedRecords) v1, v2, v3, v4, v5, v6, v7FROM #ResultORDER BY Dist N 56°04'39.26"E 12°55'05.63" |
 |
|
|
Lumbago
Norsk Yak Master
3271 Posts |
Posted - 2011-01-19 : 06:08:17
|
I screwed up the function a little bit...I reversed the order of the "powers". Here is the right way to do it:CREATE FUNCTION dbo.KNNChecksum ( @V1 smallint, @V2 smallint, @V3 smallint, @V4 smallint, @V5 smallint, @V6 smallint, @V7 smallint)RETURNS DECIMAL(28, 0)ASBEGIN RETURN CONVERT(decimal(28, 0), @V1 * POWER(CONVERT(decimal(28, 0), 256), 6)) + CONVERT(decimal(28, 0), @V2 * POWER(CONVERT(decimal(28, 0), 256), 5)) + CONVERT(decimal(28, 0), @V3 * POWER(CONVERT(decimal(28, 0), 256), 4)) + CONVERT(decimal(28, 0), @V4 * POWER(CONVERT(decimal(28, 0), 256), 3)) + CONVERT(decimal(28, 0), @V5 * POWER(CONVERT(decimal(28, 0), 256), 2)) + CONVERT(decimal(28, 0), @V6 * 256) + CONVERT(decimal(28, 0), @V7)END And then I think if you index this checksum-column then this query will be more efficient than the last one:SELECT TOP 1 * FROM ( SELECT TOP 1 * FROM vectors WHERE VectorChecksum <= @TestChecksum UNION ALL SELECT TOP 1 * FROM vectors WHERE VectorChecksum >= @TestChecksum ) AS aORDER BY ABS(VectorChecksum - @TestChecksum) - LumbagoMy blog (yes, I have a blog now! just not that much content yet) -> www.thefirstsql.com |
 |
|
|
Beach_m2000
Starting Member
9 Posts |
Posted - 2011-01-19 : 06:25:13
|
quote: Originally posted by Lumbago I think I might have a solution for you actually but I'm not 100% sure and it does involve some precalculations to be done and some extra storage for each vector. The idea is basically to precalculate a "checksum" for the vector as a single numeric column and then index this column. I got the idea from a project I did where I needed to sort ip addresses and I *think* it might work. From what I understand your vectors and your test vectors look like this:CREATE TABLE vectors ( ID int IDENTITY(1, 1) PRIMARY KEY, V1 smallint, V2 smallint, V3 smallint, V4 smallint, V5 smallint, V6 smallint, V7 smallint)INSERT INTO vectors (V1, V2, V3, V4, V5, V6, V7)VALUES (0, 20, 40, 60, 80, 100, 120), (5, 25, 45, 65, 85, 105, 125), (10, 30, 50, 70, 90, 110, 130), (15, 35, 55, 75, 95, 115, 135)--> Test vectorsDECLARE @T1 smallint = 3, @T2 smallint = 23, @T3 smallint = 43, @T4 smallint = 63, @T5 smallint = 83, @T6 smallint = 103, @T7 smallint = 123SELECT *FROM vectorsORDER BY POWER(V1 - @T1,2) + POWER(V2 - @T2,2) + POWER(V3 - @T3,2) + POWER(V4 - @T4,2) + POWER(V5 - @T5,2) + POWER(V6 - @T6,2) + POWER(V7 - @T7,2) Now, by creating a function like this:CREATE FUNCTION dbo.KNNChecksum ( @V1 smallint, @V2 smallint, @V3 smallint, @V4 smallint, @V5 smallint, @V6 smallint, @V7 smallint)RETURNS DECIMAL(28, 9)ASBEGIN RETURN CONVERT(decimal(28, 0), @V1) + CONVERT(decimal(28, 0), @V2 * 256) + CONVERT(decimal(28, 0), @V3 * POWER(CONVERT(decimal(28, 0), 256), 2)) + CONVERT(decimal(28, 0), @V4 * POWER(CONVERT(decimal(28, 0), 256), 3)) + CONVERT(decimal(28, 0), @V5 * POWER(CONVERT(decimal(28, 0), 256), 4)) + CONVERT(decimal(28, 0), @V6 * POWER(CONVERT(decimal(28, 0), 256), 5)) + CONVERT(decimal(28, 0), @V7 * POWER(CONVERT(decimal(28, 0), 256), 6))END Can you then do the following and get the correct result? ->ALTER TABLE vectors ADD VectorChecksum decimal(28, 0)UPDATE vectors SET VectorChecksum = dbo.KNNChecksum(V1, V2, V3, V4, V5, V6, V7)DECLARE @TestChecksum decimal(28, 0) = dbo.KNNChecksum(@T1, @T2, @T3, @T4, @T5, @T6, @T7)SELECT TOP 1 *FROM vectorsORDER BY ABS(VectorChecksum - @TestChecksum) - LumbagoMy blog (yes, I have a blog now! just not that much content yet) -> www.thefirstsql.com
Lumbago,Thanks for your help.If I understand your code correctly, you create one large number to compare instead of seven, and you pre-calculated the vectors stored in the database. I see that this will save some computations, but this will have one very unwanted feature as far as I can tell. The point is that when you subtract both "checksums", a difference of 1 in V1 will yield a small difference, while a difference in V7 will yield a much larger difference. This is not what we want.Also, this code does not solve the main problem, namely that you will have to compare all vectors in the database with the test vector.Please let me know if I have misunderstood you in any way, and thanks for helping out.Stein Erik |
 |
|
|
Beach_m2000
Starting Member
9 Posts |
Posted - 2011-01-19 : 06:27:35
|
quote: Originally posted by Peso Here is an alternative approach. Try it. It creates a "bounding box" to compare with.CREATE TABLE #Result ( v1 TINYINT NOT NULL, v2 TINYINT NOT NULL, v3 TINYINT NOT NULL, v4 TINYINT NOT NULL, v5 TINYINT NOT NULL, v6 TINYINT NOT NULL, v7 TINYINT NOT NULL, Dist INT )CREATE UNIQUE CLUSTERED INDEX UCX_Peso ON #Result (v1, v2, v3, v4, v5, v6, v7) WITH (IGNORE_DUP_KEY = ON, FILLFACTOR = 90)DECLARE @t1 TINYINT = 2, @t2 TINYINT = 7, @t3 TINYINT = 24, @t4 TINYINT = 12, @t5 TINYINT = 2, @t6 TINYINT = 16, @t7 TINYINT = 17, @WantedRecords INT = 22, @Loop SMALLINT = 0WHILE (SELECT COUNT(*) FROM #Result < @WantedRecords) OR @Loop >= 255 BEGIN INSERT #Result ( v1, v2, v3, v4, v5, v6, v7 ) SELECT v1, v2, v3, v4, v5, v6, v7 FROM dboo.SourceTable WHERE v1 BETWEEN @t1 - @Loop AND @t1 + @Loop AND v2 BETWEEN @t2 - @Loop AND @t2 + @Loop AND v3 BETWEEN @t3 - @Loop AND @t3 + @Loop AND v4 BETWEEN @t4 - @Loop AND @t4 + @Loop AND v5 BETWEEN @t5 - @Loop AND @t5 + @Loop AND v6 BETWEEN @t6 - @Loop AND @t6 + @Loop AND v7 BETWEEN @t7 - @Loop AND @t7 + @Loop SET @Loop += 1 ENDUPDATE #ResultSET Dist = SQRT(POWER(v1 - @t1, 2) + POWER(v2 - @t2, 2) + POWER(v3 - @t3, 2) + POWER(v4 - @t4, 2) + POWER(v5 - @t5, 2) + POWER(v6 - @t6, 2) + POWER(v7 - @t7, 2))SELECT TOP(@WantedRecords) v1, v2, v3, v4, v5, v6, v7FROM #ResultORDER BY Dist N 56°04'39.26"E 12°55'05.63"
Peso,Thank you. Looks promising.Could you explain the following statement for me:CREATE UNIQUE CLUSTERED INDEX UCX_Peso ON #Result (v1, v2, v3, v4, v5, v6, v7) WITH (IGNORE_DUP_KEY = ON, FILLFACTOR = 90)Regards,Stein Erik |
 |
|
|
Lumbago
Norsk Yak Master
3271 Posts |
Posted - 2011-01-19 : 06:29:32
|
| Please look at the updated function in my previous post along with my updated query. I messed up the order of the powers so now a change in V1 will result in a large change in the checksum and a change in V7 will only mean a small change in the checksum. I also think the updated query will be very efficient given that you have an index on the checksum column.- LumbagoMy blog (yes, I have a blog now! just not that much content yet) -> www.thefirstsql.com |
 |
|
|
Lumbago
Norsk Yak Master
3271 Posts |
Posted - 2011-01-19 : 06:32:23
|
| Another thing: are the vectors unique? In that case you can make set the index to unique which will prevent a full index scan, making the index *very* efficient.- LumbagoMy blog (yes, I have a blog now! just not that much content yet) -> www.thefirstsql.com |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2011-01-19 : 06:38:13
|
quote: Originally posted by Beach_m2000 Thank you. Looks promising.Could you explain the following statement for me:CREATE UNIQUE CLUSTERED INDEX UCX_Peso ON #Result (v1, v2, v3, v4, v5, v6, v7) WITH (IGNORE_DUP_KEY = ON, FILLFACTOR = 90)Regards,Stein Erik
It creates a unique index on the #Results table.When identical recods are inserted in next iteration, they are discared.The loop starts with an exact match for all points in vector.Every loop increments the "fuzzy distance" with 1 point up and down.Now we search all records again within this "fuzzy box". If there were matches for the previous loop, we don't want them inserted again. Does this make sense? N 56°04'39.26"E 12°55'05.63" |
 |
|
|
Beach_m2000
Starting Member
9 Posts |
Posted - 2011-01-19 : 07:12:39
|
quote: Originally posted by Peso
quote: Originally posted by Beach_m2000 Thank you. Looks promising.Could you explain the following statement for me:CREATE UNIQUE CLUSTERED INDEX UCX_Peso ON #Result (v1, v2, v3, v4, v5, v6, v7) WITH (IGNORE_DUP_KEY = ON, FILLFACTOR = 90)Regards,Stein Erik
It creates a unique index on the #Results table.When identical recods are inserted in next iteration, they are discared.The loop starts with an exact match for all points in vector.Every loop increments the "fuzzy distance" with 1 point up and down.Now we search all records again within this "fuzzy box". If there were matches for the previous loop, we don't want them inserted again. Does this make sense? N 56°04'39.26"E 12°55'05.63"
It may work, but it does not really give the "distance".To illustrate in 2D:Let's say we have to points <6,6> and <7,1>. The distance from the origin is 8.49 and 7.07 respectively. However, since your bounding box starts at 0 and increases equally along each direction, you may end up in a situation where a point will be excluded even if it is closer.In 7D this becomes even more problematic, since you may exclude points that are identical in 6 of the 7 dimensions.Would it be possibly to relax/reduce the bounding box requirements as we go through the search?SE |
 |
|
|
Lumbago
Norsk Yak Master
3271 Posts |
Posted - 2011-01-19 : 07:21:09
|
| Have you considered my approach again with the updated function...?- LumbagoMy blog (yes, I have a blog now! just not that much content yet) -> www.thefirstsql.com |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2011-01-19 : 07:30:21
|
quote: Originally posted by Beach_m2000 Would it be possibly to relax/reduce the bounding box requirements as we go through the search?
Sure. My suggestion is just a stub.Feel free to change as you like. N 56°04'39.26"E 12°55'05.63" |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2011-01-19 : 07:33:46
|
Have you considered using a SQLCLR?It will be REALLY fast. N 56°04'39.26"E 12°55'05.63" |
 |
|
|
Beach_m2000
Starting Member
9 Posts |
Posted - 2011-01-19 : 08:01:35
|
quote: Originally posted by Lumbago Have you considered my approach again with the updated function...?- LumbagoMy blog (yes, I have a blog now! just not that much content yet) -> www.thefirstsql.com
Yes, but I cannot see how it will work.If we convert it to 2D for simplicity, here is an example.<1,7> -> checksum = 1 * 1 + 256 * 7 = 1793<7,1> -> checksum = 1 * 7 + 256 * 1 = 263If we then calculate the "distance" from <0,0> we get two vastly different numbers for two points that are the same distance.In fact, using this implementation, the point <255,0> is closer to <0,0> than <0,1>, which I am sure you understand cannot be true ;-)!SE |
 |
|
|
Lumbago
Norsk Yak Master
3271 Posts |
Posted - 2011-01-19 : 08:23:53
|
| You understanding of my checksum is wrong...see this example:SELECT dbo.KNNChecksum(255, 0, 0, 0, 0, 0, 0) => 71776119061217280 which is the equivalent of 255 * (256^6)SELECT dbo.KNNChecksum(0, 255, 0, 0, 0, 0, 0) => 280375465082880 equivalent of 255 * (256^5)SELECT dbo.KNNChecksum(0, 0, 0, 0, 0, 1, 10) => 266 equivalent of 1 * (256^1) + 10You don't use 256 * 6, you use 256 ^ 6 which changes the whole perspective.- LumbagoMy blog-> www.thefirstsql.com |
 |
|
|
Beach_m2000
Starting Member
9 Posts |
Posted - 2011-01-19 : 08:34:11
|
quote: Originally posted by Lumbago You understanding of my checksum is wrong...see this example:SELECT dbo.KNNChecksum(255, 0, 0, 0, 0, 0, 0) => 71776119061217280 which is the equivalent of 255 * (256^6)SELECT dbo.KNNChecksum(0, 255, 0, 0, 0, 0, 0) => 280375465082880 equivalent of 255 * (256^5)SELECT dbo.KNNChecksum(0, 0, 0, 0, 0, 1, 10) => 266 equivalent of 1 * (256^1) + 10You don't use 256 * 6, you use 256 ^ 6 which changes the whole perspective.- LumbagoMy blog-> www.thefirstsql.com
OK, so how will this implementation value the distance between V_1 and T versus V_2 and T?Given:V_1 (V1 = 0, V2 = 0, V3 = 0, V4 = 0, V5 = 0, V6 = 0, V7 = 100)andV_2 V_1 (V1 = 0, V2 = 0, V3 = 0, V4 = 0, V5 = 0, V6 = 100, V7 = 0)andT is (T1 = 0, T2 = 0, T3 = 0, T4 = 0, T5 = 0, T6 = 0, T7 = 0)They should be identical.SE |
 |
|
|
Lumbago
Norsk Yak Master
3271 Posts |
Posted - 2011-01-19 : 09:08:37
|
Crap...I feel stupid now and I finally get why it doesn't work. Drawing up a simple x and y axis on my notebook made it all the more clear. Will have to rethink my strategy about this problem And btw: feel free to elaborate on what this is all being used for! I have a hard time imagining what a 7 dimension model looks like :)- LumbagoMy blog-> www.thefirstsql.com |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2011-01-19 : 11:06:26
|
A SQLCLR would have to do one pass only over the dataset and for a few million records... 10-15 seconds maybe?The SQLCLR could keep the K records you want in memory and do an internal merge sort for speed. N 56°04'39.26"E 12°55'05.63" |
 |
|
|
Next Page
|
|
|