Efficient way of ignoring similar items

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2008 Forums

Transact-SQL (2008)

Efficient way of ignoring similar items

Author

Topic

mattt
Posting Yak Master

194 Posts

Posted - 2012-09-25 : 10:35:51

Hi,

Imagine you had a database of films. A lot of the content would be very closely related material - HD and SD versions of the same film, and sequels in the same franchise.

I work on a recommendations system. If we wanted to use that system to return "similar" films to a seed film it'd make sense to remove items in the results that are either closely related to the seed film or to each other.

Finding "closely related" doesn't have to be exact. I've been grabbing the first ten characters of the title and assuming anything else with the same first ten characters is "related" and excluding it. Doing that to make sure nothing "closely related" to the seed appears in the results is easy.

However, pulling the same trick with the other results is much more difficult. Let's say that the seed film is Terminator. We can remove Terminator HD and the Terminator sequels easily. However, a likely result might be The Expendables. Once we've got one of those items in the result set I need to try and make sure that no more "closely related" items to the Expendables appear in that result set.

I need to do this efficiently - response times need to be sub-second and the data tables involved are very large. No way I can think of to do this is remotely efficient. Is there one?

Cheers,
Matt

lazerath
Constraint Violating Yak Guru

343 Posts

Posted - 2012-09-25 : 11:37:36

I would consider keeping a physical representation of "closely related" relationships in the form of an intersection table with OriginalFilmID + RelatedFilmID. You can then break out the algorithm to find these related films so it isn't a factor in your response times. Since you only need to update this table when the Film list changes, this gives the added benefit in that you can use multiple methods and the methods can be more exact than first 10 characters, which I would guess is very rudamentary.

Once you have this intersection table, you can use it to feed the algorithm that returns results. I'd consider doing the heavy lifting in the middle tier because you could scale it horizontally and relieve some pressure on your db server. I'll have to think about how I'd structure the query in the DB layer to be most efficient, but it should be doable. I'll get back to you on that.

lazerath
Constraint Violating Yak Guru

343 Posts

Posted - 2012-09-25 : 13:40:47

The hardest part of this problem was setting up an example. I had to simplify the model and the "Recommendation" algorithm and take some liberties to ensure relevant results, but I think I have something that illustrates the technique.

Overall, the goal is to rank the recommendation list. Once you have that, you can associate it with the closely related films and then remove items from the list that relate to a film with a higher rank. This operation should be fast if you have the right indexes in place, but it also depends on how complex your recommendation algorithm is.

Hope this helps.

-- *********************************************************
-- Begin Sample Data Initialization
-- *********************************************************

-- Cleanup our temp tables
IF OBJECT_ID('tempdb..#Film') IS NOT NULL DROP TABLE #Film;
IF OBJECT_ID('tempdb..#FilmCloseRelation') IS NOT NULL DROP TABLE #FilmCloseRelation;
IF OBJECT_ID('tempdb..#FilmRange') IS NOT NULL DROP TABLE #FilmRange;


-- Declare our #Film table, which is the anchor of our sample model
CREATE TABLE #Film 
			(
			FilmID INT NOT NULL PRIMARY KEY,
			RangeID CHAR(1) NOT NULL
			);
		
-- Declare our #FilmCloseRelation table, which shows the application of the concept
CREATE TABLE #FilmCloseRelation 
			(
			OriginalFilmID INT NOT NULL,
			RelatedFilmID INT NOT NULL 
PRIMARY KEY	(
			OriginalFilmID,
			RelatedFilmID
			)
			);

-- The following CTE helps initialize #Film by producing a list of 10 number from 0 to 9
WITH		cteN10(N)
AS	
( -- - - - -
			SELECT	spt.number
			FROM	master..spt_values
			AS		spt
			WHERE	spt.[type]
			=		'P'
			AND		spt.Number < 10
) -- - - - -
-- With a cartesian product between 5 N10s, we take our 10 numbers to 100k rows and add a RangeID (important later)
,			cteN100K(N,RangeID)
AS 
( -- - - - -
			SELECT 1, RIGHT(CONVERT(VARCHAR(36),NEWID()),1)
			FROM cteN10 a, cteN10 b, cteN10 c, cteN10 d, cteN10 e 
) -- - - - -
-- Initialize #Film with 100k rows & a RangeID
INSERT		#Film
SELECT		ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),
			RangeID
FROM		cteN100K;

-- Check for our work table for Film Grouping and do cleanup if exists
IF OBJECT_ID('tempdb..#FilmRange') IS NOT NULL DROP TABLE #FilmRange;

-- Populate our work table for Film Grouping
SELECT		RowID=IDENTITY(INT,1,1),
			f.FilmID, 
			f.RangeID,
			CONVERT(INT,NULL) as RangeAnchor
INTO		#FilmRange
FROM		#Film 
AS			f
ORDER BY	NEWID()

DECLARE		@LastRowID INT;

-- Set the RangeAnchor so that a random selection of Films have the 
-- same anchor with a higher distribution toward lower numbers
UPDATE		f
SET			@LastRowID
=			CASE WHEN f.RangeID IN('0','1','2','3','4','5','6') THEN f.RowID ELSE @LastRowID END
,			RangeAnchor
=			@LastRowID
FROM		#FilmRange
AS			f;

-- Remove Singles
WITH		cteFilmRange
AS
(
SELECT		f.FilmID,
			f.RangeAnchor,
			COUNT(*) OVER(PARTITION BY f.RangeAnchor) AS AnchorCount
FROM		#FilmRange
AS			f
)
DELETE		cteFilmRange
WHERE		AnchorCount
=			1;

-- Establish our Close Relationships
INSERT		#FilmCloseRelation
SELECT		f1.FilmID,
			f2.FilmID
FROM		#FilmRange
AS			f1
JOIN		#FilmRange
AS			f2
ON			f1.RangeAnchor
=			f2.RangeAnchor
AND			f1.FilmID
<>			f2.FilmID

-- Cleanup our temp work table
IF OBJECT_ID('tempdb..#FilmRange') IS NOT NULL DROP TABLE #FilmRange;


-- Establish top 50 Ranked Film Recommendation example recordset 

IF OBJECT_ID('tempdb..#FilmRecommendation') IS NOT NULL DROP TABLE #FilmRecommendation;


-- First 2 CTEs are only here to guarantee us some results that have related films for testing purposes
WITH		cteFilmRelationCount
AS
(
SELECT		fcr.OriginalFilmID,
			fcr.RelatedFilmID,
			COUNT(*) OVER(PARTITION BY fcr.OriginalFilmID) AS FilmCount
FROM		#FilmCloseRelation
AS			fcr
)
,			cteFilmsWithMostRelatives
AS
(
SELECT		TOP 2000
			c.RelatedFilmID AS FilmID
FROM		cteFilmRelationCount
AS			c
ORDER BY	c.FilmCount DESC
)
-- this establishes our 50 recommended films ranked for the user
,			cteSampleRankedFilms
AS
(
SELECT		TOP 50
			f.FilmID,
			NEWID() AS RowGUID
FROM		#Film
AS			f
-- Ensure we have results that overlap !
WHERE EXISTS(
SELECT		*
FROM		cteFilmsWithMostRelatives
AS			c
WHERE		f.FilmID
=			c.FilmID
			)
ORDER BY	RowGUID
)
SELECT		ROW_NUMBER() OVER(ORDER BY c.RowGUID) AS RowRank,
			c.FilmID
INTO		#FilmRecommendation
FROM		cteSampleRankedFilms
AS			c

-- *********************************************************
-- Begin Example
-- *********************************************************

-- Let's see our "raw" film recommendations
SELECT		*
FROM		#FilmRecommendation;

/*
RowRank	FilmID
1	61858
2	14407
3	59252
4	39041
5	83699
6	70928
7	90108
8	66062
9	20946
10	40182
11	96523
12	37204
13	6395
14	74072
15	13848
16	6757
17	13833
18	68940
19	47687
20	48654
21	56144
22	6989
23	82832
24	12388
25	2028
26	42166
27	58338
28	2936
29	73601
30	23663
31	81294
32	23103
33	14118
34	56732
35	50951
36	181
37	28840
38	22929
39	4507
40	5364
41	9134
42	9008
43	56916
44	56323
45	41927
46	25565
47	46884
48	94443
49	1916
50	9981
*/

-- let's see what relations exist within our raw film recommendations
SELECT		* 
FROM		#FilmRecommendation 
AS			fr1
LEFT JOIN	#FilmCloseRelation
AS			fcr
ON			fr1.FilmID
=			fcr.OriginalFilmID
LEFT JOIN	#FilmRecommendation 
AS			fr2
ON			fcr.RelatedFilmID
=			fr2.FilmID

/*
RowRank	FilmID	OriginalFilmID	RelatedFilmID	RowRank	FilmID
1	61858	61858	181	36	181
1	61858	61858	1512	NULL	NULL
1	61858	61858	1916	49	1916
1	61858	61858	6602	NULL	NULL
1	61858	61858	6989	22	6989
1	61858	61858	9981	50	9981
1	61858	61858	13833	17	13833
1	61858	61858	23663	30	23663
1	61858	61858	26608	NULL	NULL
1	61858	61858	35256	NULL	NULL
1	61858	61858	40261	NULL	NULL
1	61858	61858	42285	NULL	NULL
1	61858	61858	46884	47	46884
1	61858	61858	50173	NULL	NULL
1	61858	61858	53535	NULL	NULL
1	61858	61858	56546	NULL	NULL
1	61858	61858	65570	NULL	NULL
1	61858	61858	66062	8	66062
1	61858	61858	74072	14	74072
1	61858	61858	77171	NULL	NULL
1	61858	61858	86561	NULL	NULL
1	61858	61858	87351	NULL	NULL
1	61858	61858	91239	NULL	NULL
2	14407	14407	6757	16	6757
2	14407	14407	9252	NULL	NULL
2	14407	14407	13848	15	13848
2	14407	14407	14423	NULL	NULL
2	14407	14407	16022	NULL	NULL
2	14407	14407	18040	NULL	NULL
2	14407	14407	18508	NULL	NULL
2	14407	14407	23103	32	23103
2	14407	14407	25565	46	25565
2	14407	14407	31160	NULL	NULL
2	14407	14407	39041	4	39041
2	14407	14407	41927	45	41927
2	14407	14407	61368	NULL	NULL
2	14407	14407	67025	NULL	NULL
2	14407	14407	72138	NULL	NULL
2	14407	14407	73601	29	73601
2	14407	14407	88247	NULL	NULL
2	14407	14407	93993	NULL	NULL
3	59252	59252	4231	NULL	NULL
3	59252	59252	6395	13	6395
3	59252	59252	9419	NULL	NULL
3	59252	59252	12388	24	12388
3	59252	59252	17526	NULL	NULL
3	59252	59252	20946	9	20946
3	59252	59252	29415	NULL	NULL
3	59252	59252	30204	NULL	NULL
3	59252	59252	47066	NULL	NULL
3	59252	59252	56144	21	56144
3	59252	59252	58338	27	58338
3	59252	59252	60757	NULL	NULL
3	59252	59252	63069	NULL	NULL
3	59252	59252	69012	NULL	NULL
3	59252	59252	72043	NULL	NULL
3	59252	59252	83614	NULL	NULL
4	39041	39041	6757	16	6757
4	39041	39041	9252	NULL	NULL
4	39041	39041	13848	15	13848
4	39041	39041	14407	2	14407     -- This row will be culled out since it is related to a higher RowRank
4	39041	39041	14423	NULL	NULL
4	39041	39041	16022	NULL	NULL
4	39041	39041	18040	NULL	NULL
4	39041	39041	18508	NULL	NULL
4	39041	39041	23103	32	23103
4	39041	39041	25565	46	25565
4	39041	39041	31160	NULL	NULL
4	39041	39041	41927	45	41927
4	39041	39041	61368	NULL	NULL
4	39041	39041	67025	NULL	NULL
4	39041	39041	72138	NULL	NULL
4	39041	39041	73601	29	73601
4	39041	39041	88247	NULL	NULL
4	39041	39041	93993	NULL	NULL
5	83699	83699	3043	NULL	NULL
5	83699	83699	20185	NULL	NULL
5	83699	83699	23327	NULL	NULL
5	83699	83699	28049	NULL	NULL
5	83699	83699	40182	10	40182
5	83699	83699	45557	NULL	NULL
5	83699	83699	50755	NULL	NULL
5	83699	83699	59840	NULL	NULL
5	83699	83699	68949	NULL	NULL
5	83699	83699	70928	6	70928
5	83699	83699	72462	NULL	NULL
5	83699	83699	81294	31	81294
5	83699	83699	85868	NULL	NULL
5	83699	83699	90108	7	90108
5	83699	83699	90911	NULL	NULL
5	83699	83699	94299	NULL	NULL
*/


-- Now for the finalle, let's cull out films that are related to films higher up in our list.
SELECT		* 
FROM		#FilmRecommendation 
AS			fr1
LEFT JOIN	(
			#FilmCloseRelation
AS			fcr
JOIN		#FilmRecommendation 
AS			fr2
ON			fcr.RelatedFilmID
=			fr2.FilmID
			)
ON			fr1.FilmID
=			fcr.OriginalFilmID
AND			fr2.RowRank
<			fr1.RowRank
WHERE		fcr.OriginalFilmID
IS			NULL;

/*
-- All recommended films without a related film with a higher rank
-- Notice #4 isn't in the list because it was related to #2

RowRank	FilmID	OriginalFilmID	RelatedFilmID	RowRank	FilmID
1	61858	NULL	NULL	NULL	NULL
2	14407	NULL	NULL	NULL	NULL
3	59252	NULL	NULL	NULL	NULL
5	83699	NULL	NULL	NULL	NULL
11	96523	NULL	NULL	NULL	NULL
12	37204	NULL	NULL	NULL	NULL
18	68940	NULL	NULL	NULL	NULL
25	2028	NULL	NULL	NULL	NULL
*/

The temp tables at the end could be replaced with CTEs. They are only there so we can look at the contents to illustrate the technique.

robvolk
Most Valuable Yak

15732 Posts

Posted - 2012-09-25 : 14:03:30

I agree with lazerath's premise, you need to make the relationship explicit, rather than matching titles. Another thing is to separate the presentation (HD, SD, streaming, etc.) from the title, that has nothing to do with your recommendation engine.

You also have several ways to consider Terminator and The Expendables related: they're both action films (genre), they both have Arnold in them (actor). But they don't have the same subject (sci-fi/time travel/robots vs. present-day mercenaries) or other factors (SFX, box office, release year, director) that could also guide a recommendation. I know Netflix's recommendation engine is heavily biased towards user's viewing statistics...what they watched, how they rated it, etc.

mattt
Posting Yak Master

194 Posts

Posted - 2012-09-27 : 09:04:49

quote:
Originally posted by lazerath

The hardest part of this problem was setting up an example. I had to simplify the model and the "Recommendation" algorithm and take some liberties to ensure relevant results, but I think I have something that illustrates the technique.

Thanks for this, it's very useful. I just need to work on populated the relatedness table now.

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources