Nearest neighbor (approximate) algorithm help

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

General SQL Server Forums

New to SQL Server Programming

Nearest neighbor (approximate) algorithm help

Author

Topic

Beach_m2000
Starting Member

9 Posts

Posted - 2011-01-19 : 02:45:04

Hello,

We are building large database of feature vectors. Each vector has 7 elements (dimensions), where each element is a number between 0 and 255 (8 bits).

I have a VERY simple query to return the K nearest vectors to a test vector T (T1..T7), but it does not scale very well. The database may have many million vector entries.

Here is the query:

SELECT TOP (K) * FROM myTable
ORDER BY SQRT(POWER(V1 – T1,2) + POWER(V2 – T2,2) + POWER(V3 – T3,2) + POWER(V4 – T4,2) + POWER(V5 – T5,2) + POWER(V6 – T6,2) + POWER(V7 – T7,2));

MyTable contains a unique ID and the vector elements V1..V7.

We have not decided what DB engine to use, but most likely it will be MS SQL or MySQL.

We can trade off accuracy for performance, as long as the accuracy is known.

Has anyone worked on SQL implementations of a approximate kNN algorithm that can point me in the right direction?

Regards,

Stein Erik

Lumbago
Norsk Yak Master

3271 Posts

Posted - 2011-01-19 : 04:04:17

Another fellow norwegian?? Soo cool :)

Unfortunately I have no clue whatsoever what this is all about but do you have a small working sample of how this works? I'm almost positive that I'm not even remotely close to being able to answer your question, but it would be interesting to see how this works. Can you provide the table definition and some working inserts in the following format -> http://weblogs.sqlteam.com/brettk/archive/2005/05/25/5276.aspx

- Lumbago

My blog (yes, I have a blog now! just not that much content yet)
-> www.thefirstsql.com

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2011-01-19 : 04:21:38

Remove the SQRT part. It's only needed to get the actual distance.
If you only is interested in getting the nearest, the actual distance is of no importance.
What I mean is that is doen't matter if the distance is 7 or 49, right? As long as all distances are without the SQRT part.

N 56°04'39.26"
E 12°55'05.63"

Beach_m2000
Starting Member

9 Posts

Posted - 2011-01-19 : 05:01:59

quote:
Originally posted by Peso

Remove the SQRT part. It's only needed to get the actual distance.
If you only is interested in getting the nearest, the actual distance is of no importance.
What I mean is that is doen't matter if the distance is 7 or 49, right? As long as all distances are without the SQRT part.

Thanks! That helps the calculation speed per comparison, but not the number of compares ;-)!

Lumbago
Norsk Yak Master

3271 Posts

Posted - 2011-01-19 : 05:25:28

I think I might have a solution for you actually but I'm not 100% sure and it does involve some precalculations to be done and some extra storage for each vector. The idea is basically to precalculate a "checksum" for the vector as a single numeric column and then index this column. I got the idea from a project I did where I needed to sort ip addresses and I *think* it might work. From what I understand your vectors and your test vectors look like this:

CREATE TABLE vectors (
	ID int IDENTITY(1, 1) PRIMARY KEY, V1 smallint, V2 smallint, 
	V3 smallint, V4 smallint, V5 smallint, V6 smallint, V7 smallint)

INSERT INTO vectors (V1, V2, V3, V4, V5, V6, V7)
VALUES
	(0, 20, 40, 60, 80, 100, 120),
	(5, 25, 45, 65, 85, 105, 125),
	(10, 30, 50, 70, 90, 110, 130),
	(15, 35, 55, 75, 95, 115, 135)

--> Test vectors
DECLARE 
	@T1 smallint = 3,
	@T2 smallint = 23,
	@T3 smallint = 43,
	@T4 smallint = 63,
	@T5 smallint = 83,
	@T6 smallint = 103,
	@T7 smallint = 123

SELECT *
FROM vectors
ORDER BY POWER(V1 - @T1,2) + POWER(V2 - @T2,2) + POWER(V3 - @T3,2) + POWER(V4 - @T4,2) + POWER(V5 - @T5,2) + POWER(V6 - @T6,2) + POWER(V7 - @T7,2)

Now, by creating a function like this:

CREATE FUNCTION dbo.KNNChecksum (
	@V1 smallint, @V2 smallint, @V3 smallint, @V4 smallint,
	@V5 smallint, @V6 smallint, @V7 smallint)
RETURNS DECIMAL(28, 9)
AS

BEGIN 	
	RETURN CONVERT(decimal(28, 0), @V1) + 
		CONVERT(decimal(28, 0), @V2 * 256) + 
		CONVERT(decimal(28, 0), @V3 * POWER(CONVERT(decimal(28, 0), 256), 2)) +
		CONVERT(decimal(28, 0), @V4 * POWER(CONVERT(decimal(28, 0), 256), 3)) + 
		CONVERT(decimal(28, 0), @V5 * POWER(CONVERT(decimal(28, 0), 256), 4)) + 
		CONVERT(decimal(28, 0), @V6 * POWER(CONVERT(decimal(28, 0), 256), 5)) + 
		CONVERT(decimal(28, 0), @V7 * POWER(CONVERT(decimal(28, 0), 256), 6))
END

Can you then do the following and get the correct result? ->

ALTER TABLE vectors ADD VectorChecksum decimal(28, 0)

UPDATE vectors SET VectorChecksum = dbo.KNNChecksum(V1, V2, V3, V4, V5, V6, V7)
DECLARE @TestChecksum decimal(28, 0) = dbo.KNNChecksum(@T1, @T2, @T3, @T4, @T5, @T6, @T7)

SELECT TOP 1 *
FROM vectors
ORDER BY ABS(VectorChecksum - @TestChecksum)

- Lumbago

My blog (yes, I have a blog now! just not that much content yet)
-> www.thefirstsql.com

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2011-01-19 : 05:40:09

Here is an alternative approach. Try it. It creates a "bounding box" to compare with.

CREATE TABLE	#Result
		(
			v1 TINYINT NOT NULL,
			v2 TINYINT NOT NULL,
			v3 TINYINT NOT NULL,
			v4 TINYINT NOT NULL,
			v5 TINYINT NOT NULL,
			v6 TINYINT NOT NULL,
			v7 TINYINT NOT NULL,
			Dist INT
		)

CREATE UNIQUE CLUSTERED INDEX UCX_Peso ON #Result (v1, v2, v3, v4, v5, v6, v7) WITH (IGNORE_DUP_KEY = ON, FILLFACTOR = 90)

DECLARE	@t1 TINYINT = 2,
	@t2 TINYINT = 7,
	@t3 TINYINT = 24,
	@t4 TINYINT = 12,
	@t5 TINYINT = 2,
	@t6 TINYINT = 16,
	@t7 TINYINT = 17,
	@WantedRecords INT = 22,
	@Loop SMALLINT = 0

WHILE (SELECT COUNT(*) FROM #Result < @WantedRecords) OR @Loop >= 255
	BEGIN
		INSERT	#Result
			(
				v1,
				v2,
				v3,
				v4,
				v5,
				v6,
				v7
			)
		SELECT	v1,
			v2,
			v3,
			v4,
			v5,
			v6,
			v7
		FROM	dboo.SourceTable
		WHERE	v1 BETWEEN @t1 - @Loop AND @t1 + @Loop
			AND v2 BETWEEN @t2 - @Loop AND @t2 + @Loop
			AND v3 BETWEEN @t3 - @Loop AND @t3 + @Loop
			AND v4 BETWEEN @t4 - @Loop AND @t4 + @Loop
			AND v5 BETWEEN @t5 - @Loop AND @t5 + @Loop
			AND v6 BETWEEN @t6 - @Loop AND @t6 + @Loop
			AND v7 BETWEEN @t7 - @Loop AND @t7 + @Loop

		SET	@Loop += 1
	END

UPDATE	#Result
SET	Dist = SQRT(POWER(v1 - @t1, 2) + POWER(v2 - @t2, 2) + POWER(v3 - @t3, 2) + POWER(v4 - @t4, 2) + POWER(v5 - @t5, 2) + POWER(v6 - @t6, 2) + POWER(v7 - @t7, 2))

SELECT		TOP(@WantedRecords)
		v1,
		v2,
		v3,
		v4,
		v5,
		v6,
		v7
FROM		#Result
ORDER BY	Dist

N 56°04'39.26"
E 12°55'05.63"

Lumbago
Norsk Yak Master

3271 Posts

Posted - 2011-01-19 : 06:08:17

I screwed up the function a little bit...I reversed the order of the "powers". Here is the right way to do it:

CREATE FUNCTION dbo.KNNChecksum (
	@V1 smallint, @V2 smallint, @V3 smallint, @V4 smallint,
	@V5 smallint, @V6 smallint, @V7 smallint)
RETURNS DECIMAL(28, 0)
AS

BEGIN 	
	RETURN 	CONVERT(decimal(28, 0), @V1 * POWER(CONVERT(decimal(28, 0), 256), 6)) + 
		CONVERT(decimal(28, 0), @V2 * POWER(CONVERT(decimal(28, 0), 256), 5)) + 
		CONVERT(decimal(28, 0), @V3 * POWER(CONVERT(decimal(28, 0), 256), 4)) + 
		CONVERT(decimal(28, 0), @V4 * POWER(CONVERT(decimal(28, 0), 256), 3)) + 
		CONVERT(decimal(28, 0), @V5 * POWER(CONVERT(decimal(28, 0), 256), 2)) + 
		CONVERT(decimal(28, 0), @V6 * 256) + 
		CONVERT(decimal(28, 0), @V7)
END

And then I think if you index this checksum-column then this query will be more efficient than the last one:

SELECT TOP 1 * 
FROM (
	SELECT TOP 1 * FROM vectors WHERE VectorChecksum <= @TestChecksum
	UNION ALL
	SELECT TOP 1 * FROM vectors WHERE VectorChecksum >= @TestChecksum
	) AS a
ORDER BY ABS(VectorChecksum - @TestChecksum)

- Lumbago

My blog (yes, I have a blog now! just not that much content yet)
-> www.thefirstsql.com

Beach_m2000
Starting Member

9 Posts

Posted - 2011-01-19 : 06:25:13

quote:

Originally posted by Lumbago

I think I might have a solution for you actually but I'm not 100% sure and it does involve some precalculations to be done and some extra storage for each vector. The idea is basically to precalculate a "checksum" for the vector as a single numeric column and then index this column. I got the idea from a project I did where I needed to sort ip addresses and I *think* it might work. From what I understand your vectors and your test vectors look like this:

CREATE TABLE vectors (
	ID int IDENTITY(1, 1) PRIMARY KEY, V1 smallint, V2 smallint, 
	V3 smallint, V4 smallint, V5 smallint, V6 smallint, V7 smallint)

INSERT INTO vectors (V1, V2, V3, V4, V5, V6, V7)
VALUES
	(0, 20, 40, 60, 80, 100, 120),
	(5, 25, 45, 65, 85, 105, 125),
	(10, 30, 50, 70, 90, 110, 130),
	(15, 35, 55, 75, 95, 115, 135)

--> Test vectors
DECLARE 
	@T1 smallint = 3,
	@T2 smallint = 23,
	@T3 smallint = 43,
	@T4 smallint = 63,
	@T5 smallint = 83,
	@T6 smallint = 103,
	@T7 smallint = 123

SELECT *
FROM vectors
ORDER BY POWER(V1 - @T1,2) + POWER(V2 - @T2,2) + POWER(V3 - @T3,2) + POWER(V4 - @T4,2) + POWER(V5 - @T5,2) + POWER(V6 - @T6,2) + POWER(V7 - @T7,2)

Now, by creating a function like this:

CREATE FUNCTION dbo.KNNChecksum (
	@V1 smallint, @V2 smallint, @V3 smallint, @V4 smallint,
	@V5 smallint, @V6 smallint, @V7 smallint)
RETURNS DECIMAL(28, 9)
AS

BEGIN 	
	RETURN CONVERT(decimal(28, 0), @V1) + 
		CONVERT(decimal(28, 0), @V2 * 256) + 
		CONVERT(decimal(28, 0), @V3 * POWER(CONVERT(decimal(28, 0), 256), 2)) +
		CONVERT(decimal(28, 0), @V4 * POWER(CONVERT(decimal(28, 0), 256), 3)) + 
		CONVERT(decimal(28, 0), @V5 * POWER(CONVERT(decimal(28, 0), 256), 4)) + 
		CONVERT(decimal(28, 0), @V6 * POWER(CONVERT(decimal(28, 0), 256), 5)) + 
		CONVERT(decimal(28, 0), @V7 * POWER(CONVERT(decimal(28, 0), 256), 6))
END

Can you then do the following and get the correct result? ->

ALTER TABLE vectors ADD VectorChecksum decimal(28, 0)

UPDATE vectors SET VectorChecksum = dbo.KNNChecksum(V1, V2, V3, V4, V5, V6, V7)
DECLARE @TestChecksum decimal(28, 0) = dbo.KNNChecksum(@T1, @T2, @T3, @T4, @T5, @T6, @T7)

SELECT TOP 1 *
FROM vectors
ORDER BY ABS(VectorChecksum - @TestChecksum)

- Lumbago

My blog (yes, I have a blog now! just not that much content yet)
-> www.thefirstsql.com

Lumbago,

Thanks for your help.

If I understand your code correctly, you create one large number to compare instead of seven, and you pre-calculated the vectors stored in the database. I see that this will save some computations, but this will have one very unwanted feature as far as I can tell. The point is that when you subtract both "checksums", a difference of 1 in V1 will yield a small difference, while a difference in V7 will yield a much larger difference. This is not what we want.

Also, this code does not solve the main problem, namely that you will have to compare all vectors in the database with the test vector.

Please let me know if I have misunderstood you in any way, and thanks for helping out.

Stein Erik

Beach_m2000
Starting Member

9 Posts

Posted - 2011-01-19 : 06:27:35

quote:

Originally posted by Peso

Here is an alternative approach. Try it. It creates a "bounding box" to compare with.

CREATE TABLE	#Result
		(
			v1 TINYINT NOT NULL,
			v2 TINYINT NOT NULL,
			v3 TINYINT NOT NULL,
			v4 TINYINT NOT NULL,
			v5 TINYINT NOT NULL,
			v6 TINYINT NOT NULL,
			v7 TINYINT NOT NULL,
			Dist INT
		)

CREATE UNIQUE CLUSTERED INDEX UCX_Peso ON #Result (v1, v2, v3, v4, v5, v6, v7) WITH (IGNORE_DUP_KEY = ON, FILLFACTOR = 90)

DECLARE	@t1 TINYINT = 2,
	@t2 TINYINT = 7,
	@t3 TINYINT = 24,
	@t4 TINYINT = 12,
	@t5 TINYINT = 2,
	@t6 TINYINT = 16,
	@t7 TINYINT = 17,
	@WantedRecords INT = 22,
	@Loop SMALLINT = 0

WHILE (SELECT COUNT(*) FROM #Result < @WantedRecords) OR @Loop >= 255
	BEGIN
		INSERT	#Result
			(
				v1,
				v2,
				v3,
				v4,
				v5,
				v6,
				v7
			)
		SELECT	v1,
			v2,
			v3,
			v4,
			v5,
			v6,
			v7
		FROM	dboo.SourceTable
		WHERE	v1 BETWEEN @t1 - @Loop AND @t1 + @Loop
			AND v2 BETWEEN @t2 - @Loop AND @t2 + @Loop
			AND v3 BETWEEN @t3 - @Loop AND @t3 + @Loop
			AND v4 BETWEEN @t4 - @Loop AND @t4 + @Loop
			AND v5 BETWEEN @t5 - @Loop AND @t5 + @Loop
			AND v6 BETWEEN @t6 - @Loop AND @t6 + @Loop
			AND v7 BETWEEN @t7 - @Loop AND @t7 + @Loop

		SET	@Loop += 1
	END

UPDATE	#Result
SET	Dist = SQRT(POWER(v1 - @t1, 2) + POWER(v2 - @t2, 2) + POWER(v3 - @t3, 2) + POWER(v4 - @t4, 2) + POWER(v5 - @t5, 2) + POWER(v6 - @t6, 2) + POWER(v7 - @t7, 2))

SELECT		TOP(@WantedRecords)
		v1,
		v2,
		v3,
		v4,
		v5,
		v6,
		v7
FROM		#Result
ORDER BY	Dist

N 56°04'39.26"
E 12°55'05.63"

Peso,

Thank you. Looks promising.

Could you explain the following statement for me:

CREATE UNIQUE CLUSTERED INDEX UCX_Peso ON #Result (v1, v2, v3, v4, v5, v6, v7) WITH (IGNORE_DUP_KEY = ON, FILLFACTOR = 90)

Regards,

Stein Erik

Lumbago
Norsk Yak Master

3271 Posts

Posted - 2011-01-19 : 06:29:32

Please look at the updated function in my previous post along with my updated query. I messed up the order of the powers so now a change in V1 will result in a large change in the checksum and a change in V7 will only mean a small change in the checksum. I also think the updated query will be very efficient given that you have an index on the checksum column.

- Lumbago

My blog (yes, I have a blog now! just not that much content yet)
-> www.thefirstsql.com

Lumbago
Norsk Yak Master

3271 Posts

Posted - 2011-01-19 : 06:32:23

Another thing: are the vectors unique? In that case you can make set the index to unique which will prevent a full index scan, making the index *very* efficient.

- Lumbago

My blog (yes, I have a blog now! just not that much content yet)
-> www.thefirstsql.com

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2011-01-19 : 06:38:13

quote:
Originally posted by Beach_m2000

Thank you. Looks promising.

Could you explain the following statement for me:

CREATE UNIQUE CLUSTERED INDEX UCX_Peso ON #Result (v1, v2, v3, v4, v5, v6, v7) WITH (IGNORE_DUP_KEY = ON, FILLFACTOR = 90)

Regards,

Stein Erik

It creates a unique index on the #Results table.
When identical recods are inserted in next iteration, they are discared.

The loop starts with an exact match for all points in vector.
Every loop increments the "fuzzy distance" with 1 point up and down.
Now we search all records again within this "fuzzy box". If there were matches for the previous loop, we don't want them inserted again.
Does this make sense?

N 56°04'39.26"
E 12°55'05.63"

Beach_m2000
Starting Member

9 Posts

Posted - 2011-01-19 : 07:12:39

quote:
Originally posted by Peso

quote:
Originally posted by Beach_m2000

Thank you. Looks promising.

Could you explain the following statement for me:

CREATE UNIQUE CLUSTERED INDEX UCX_Peso ON #Result (v1, v2, v3, v4, v5, v6, v7) WITH (IGNORE_DUP_KEY = ON, FILLFACTOR = 90)

Regards,

Stein Erik
It creates a unique index on the #Results table.
When identical recods are inserted in next iteration, they are discared.

The loop starts with an exact match for all points in vector.
Every loop increments the "fuzzy distance" with 1 point up and down.
Now we search all records again within this "fuzzy box". If there were matches for the previous loop, we don't want them inserted again.
Does this make sense?

N 56°04'39.26"
E 12°55'05.63"

It may work, but it does not really give the "distance".

To illustrate in 2D:

Let's say we have to points <6,6> and <7,1>. The distance from the origin is 8.49 and 7.07 respectively. However, since your bounding box starts at 0 and increases equally along each direction, you may end up in a situation where a point will be excluded even if it is closer.

In 7D this becomes even more problematic, since you may exclude points that are identical in 6 of the 7 dimensions.

Would it be possibly to relax/reduce the bounding box requirements as we go through the search?

SE

Lumbago
Norsk Yak Master

3271 Posts

Posted - 2011-01-19 : 07:21:09

Have you considered my approach again with the updated function...?

- Lumbago

My blog (yes, I have a blog now! just not that much content yet)
-> www.thefirstsql.com

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2011-01-19 : 07:30:21

quote:
Originally posted by Beach_m2000

Would it be possibly to relax/reduce the bounding box requirements as we go through the search?

Sure. My suggestion is just a stub.
Feel free to change as you like.

N 56°04'39.26"
E 12°55'05.63"

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2011-01-19 : 07:33:46

Have you considered using a SQLCLR?
It will be REALLY fast.

N 56°04'39.26"
E 12°55'05.63"

Beach_m2000
Starting Member

9 Posts

Posted - 2011-01-19 : 08:01:35

quote:
Originally posted by Lumbago

Have you considered my approach again with the updated function...?

- Lumbago

My blog (yes, I have a blog now! just not that much content yet)
-> www.thefirstsql.com

Yes, but I cannot see how it will work.

If we convert it to 2D for simplicity, here is an example.

<1,7> -> checksum = 1 * 1 + 256 * 7 = 1793
<7,1> -> checksum = 1 * 7 + 256 * 1 = 263

If we then calculate the "distance" from <0,0> we get two vastly different numbers for two points that are the same distance.

In fact, using this implementation, the point <255,0> is closer to <0,0> than <0,1>, which I am sure you understand cannot be true ;-)!

SE

Lumbago
Norsk Yak Master

3271 Posts

Posted - 2011-01-19 : 08:23:53

You understanding of my checksum is wrong...see this example:

SELECT dbo.KNNChecksum(255, 0, 0, 0, 0, 0, 0) => 71776119061217280 which is the equivalent of 255 * (256^6)
SELECT dbo.KNNChecksum(0, 255, 0, 0, 0, 0, 0) => 280375465082880 equivalent of 255 * (256^5)
SELECT dbo.KNNChecksum(0, 0, 0, 0, 0, 1, 10) => 266 equivalent of 1 * (256^1) + 10

You don't use 256 * 6, you use 256 ^ 6 which changes the whole perspective.

- Lumbago

My blog-> www.thefirstsql.com

Beach_m2000
Starting Member

9 Posts

Posted - 2011-01-19 : 08:34:11

quote:
Originally posted by Lumbago

You understanding of my checksum is wrong...see this example:

SELECT dbo.KNNChecksum(255, 0, 0, 0, 0, 0, 0) => 71776119061217280 which is the equivalent of 255 * (256^6)
SELECT dbo.KNNChecksum(0, 255, 0, 0, 0, 0, 0) => 280375465082880 equivalent of 255 * (256^5)
SELECT dbo.KNNChecksum(0, 0, 0, 0, 0, 1, 10) => 266 equivalent of 1 * (256^1) + 10

You don't use 256 * 6, you use 256 ^ 6 which changes the whole perspective.

- Lumbago

My blog-> www.thefirstsql.com

OK, so how will this implementation value the distance between V_1 and T versus V_2 and T?

Given:
V_1 (V1 = 0, V2 = 0, V3 = 0, V4 = 0, V5 = 0, V6 = 0, V7 = 100)
and
V_2 V_1 (V1 = 0, V2 = 0, V3 = 0, V4 = 0, V5 = 0, V6 = 100, V7 = 0)
and
T is (T1 = 0, T2 = 0, T3 = 0, T4 = 0, T5 = 0, T6 = 0, T7 = 0)

They should be identical.

SE

Lumbago
Norsk Yak Master

3271 Posts

Posted - 2011-01-19 : 09:08:37

Crap...I feel stupid now and I finally get why it doesn't work. Drawing up a simple x and y axis on my notebook made it all the more clear. Will have to rethink my strategy about this problem

And btw: feel free to elaborate on what this is all being used for! I have a hard time imagining what a 7 dimension model looks like :)

- Lumbago

My blog-> www.thefirstsql.com

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2011-01-19 : 11:06:26

A SQLCLR would have to do one pass only over the dataset and for a few million records... 10-15 seconds maybe?
The SQLCLR could keep the K records you want in memory and do an internal merge sort for speed.

N 56°04'39.26"
E 12°55'05.63"

Next Page

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources