Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2008 Forums
 Transact-SQL (2008)
 Efficient way of ignoring similar items

Author  Topic 

mattt
Posting Yak Master

194 Posts

Posted - 2012-09-25 : 10:35:51
Hi,

Imagine you had a database of films. A lot of the content would be very closely related material - HD and SD versions of the same film, and sequels in the same franchise.

I work on a recommendations system. If we wanted to use that system to return "similar" films to a seed film it'd make sense to remove items in the results that are either closely related to the seed film or to each other.

Finding "closely related" doesn't have to be exact. I've been grabbing the first ten characters of the title and assuming anything else with the same first ten characters is "related" and excluding it. Doing that to make sure nothing "closely related" to the seed appears in the results is easy.

However, pulling the same trick with the other results is much more difficult. Let's say that the seed film is Terminator. We can remove Terminator HD and the Terminator sequels easily. However, a likely result might be The Expendables. Once we've got one of those items in the result set I need to try and make sure that no more "closely related" items to the Expendables appear in that result set.

I need to do this efficiently - response times need to be sub-second and the data tables involved are very large. No way I can think of to do this is remotely efficient. Is there one?

Cheers,
Matt

lazerath
Constraint Violating Yak Guru

343 Posts

Posted - 2012-09-25 : 11:37:36
I would consider keeping a physical representation of "closely related" relationships in the form of an intersection table with OriginalFilmID + RelatedFilmID. You can then break out the algorithm to find these related films so it isn't a factor in your response times. Since you only need to update this table when the Film list changes, this gives the added benefit in that you can use multiple methods and the methods can be more exact than first 10 characters, which I would guess is very rudamentary.

Once you have this intersection table, you can use it to feed the algorithm that returns results. I'd consider doing the heavy lifting in the middle tier because you could scale it horizontally and relieve some pressure on your db server. I'll have to think about how I'd structure the query in the DB layer to be most efficient, but it should be doable. I'll get back to you on that.
Go to Top of Page

lazerath
Constraint Violating Yak Guru

343 Posts

Posted - 2012-09-25 : 13:40:47
The hardest part of this problem was setting up an example. I had to simplify the model and the "Recommendation" algorithm and take some liberties to ensure relevant results, but I think I have something that illustrates the technique.

Overall, the goal is to rank the recommendation list. Once you have that, you can associate it with the closely related films and then remove items from the list that relate to a film with a higher rank. This operation should be fast if you have the right indexes in place, but it also depends on how complex your recommendation algorithm is.

Hope this helps.

-- *********************************************************
-- Begin Sample Data Initialization
-- *********************************************************

-- Cleanup our temp tables
IF OBJECT_ID('tempdb..#Film') IS NOT NULL DROP TABLE #Film;
IF OBJECT_ID('tempdb..#FilmCloseRelation') IS NOT NULL DROP TABLE #FilmCloseRelation;
IF OBJECT_ID('tempdb..#FilmRange') IS NOT NULL DROP TABLE #FilmRange;


-- Declare our #Film table, which is the anchor of our sample model
CREATE TABLE #Film
(
FilmID INT NOT NULL PRIMARY KEY,
RangeID CHAR(1) NOT NULL
);

-- Declare our #FilmCloseRelation table, which shows the application of the concept
CREATE TABLE #FilmCloseRelation
(
OriginalFilmID INT NOT NULL,
RelatedFilmID INT NOT NULL
PRIMARY KEY (
OriginalFilmID,
RelatedFilmID
)
);

-- The following CTE helps initialize #Film by producing a list of 10 number from 0 to 9
WITH cteN10(N)
AS
( -- - - - -
SELECT spt.number
FROM master..spt_values
AS spt
WHERE spt.[type]
= 'P'
AND spt.Number < 10
) -- - - - -
-- With a cartesian product between 5 N10s, we take our 10 numbers to 100k rows and add a RangeID (important later)
, cteN100K(N,RangeID)
AS
( -- - - - -
SELECT 1, RIGHT(CONVERT(VARCHAR(36),NEWID()),1)
FROM cteN10 a, cteN10 b, cteN10 c, cteN10 d, cteN10 e
) -- - - - -
-- Initialize #Film with 100k rows & a RangeID
INSERT #Film
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)),
RangeID
FROM cteN100K;

-- Check for our work table for Film Grouping and do cleanup if exists
IF OBJECT_ID('tempdb..#FilmRange') IS NOT NULL DROP TABLE #FilmRange;

-- Populate our work table for Film Grouping
SELECT RowID=IDENTITY(INT,1,1),
f.FilmID,
f.RangeID,
CONVERT(INT,NULL) as RangeAnchor
INTO #FilmRange
FROM #Film
AS f
ORDER BY NEWID()

DECLARE @LastRowID INT;

-- Set the RangeAnchor so that a random selection of Films have the
-- same anchor with a higher distribution toward lower numbers
UPDATE f
SET @LastRowID
= CASE WHEN f.RangeID IN('0','1','2','3','4','5','6') THEN f.RowID ELSE @LastRowID END
, RangeAnchor
= @LastRowID
FROM #FilmRange
AS f;

-- Remove Singles
WITH cteFilmRange
AS
(
SELECT f.FilmID,
f.RangeAnchor,
COUNT(*) OVER(PARTITION BY f.RangeAnchor) AS AnchorCount
FROM #FilmRange
AS f
)
DELETE cteFilmRange
WHERE AnchorCount
= 1;

-- Establish our Close Relationships
INSERT #FilmCloseRelation
SELECT f1.FilmID,
f2.FilmID
FROM #FilmRange
AS f1
JOIN #FilmRange
AS f2
ON f1.RangeAnchor
= f2.RangeAnchor
AND f1.FilmID
<> f2.FilmID

-- Cleanup our temp work table
IF OBJECT_ID('tempdb..#FilmRange') IS NOT NULL DROP TABLE #FilmRange;


-- Establish top 50 Ranked Film Recommendation example recordset

IF OBJECT_ID('tempdb..#FilmRecommendation') IS NOT NULL DROP TABLE #FilmRecommendation;


-- First 2 CTEs are only here to guarantee us some results that have related films for testing purposes
WITH cteFilmRelationCount
AS
(
SELECT fcr.OriginalFilmID,
fcr.RelatedFilmID,
COUNT(*) OVER(PARTITION BY fcr.OriginalFilmID) AS FilmCount
FROM #FilmCloseRelation
AS fcr
)
, cteFilmsWithMostRelatives
AS
(
SELECT TOP 2000
c.RelatedFilmID AS FilmID
FROM cteFilmRelationCount
AS c
ORDER BY c.FilmCount DESC
)
-- this establishes our 50 recommended films ranked for the user
, cteSampleRankedFilms
AS
(
SELECT TOP 50
f.FilmID,
NEWID() AS RowGUID
FROM #Film
AS f
-- Ensure we have results that overlap !
WHERE EXISTS(
SELECT *
FROM cteFilmsWithMostRelatives
AS c
WHERE f.FilmID
= c.FilmID
)
ORDER BY RowGUID
)
SELECT ROW_NUMBER() OVER(ORDER BY c.RowGUID) AS RowRank,
c.FilmID
INTO #FilmRecommendation
FROM cteSampleRankedFilms
AS c

-- *********************************************************
-- Begin Example
-- *********************************************************

-- Let's see our "raw" film recommendations
SELECT *
FROM #FilmRecommendation;

/*
RowRank FilmID
1 61858
2 14407
3 59252
4 39041
5 83699
6 70928
7 90108
8 66062
9 20946
10 40182
11 96523
12 37204
13 6395
14 74072
15 13848
16 6757
17 13833
18 68940
19 47687
20 48654
21 56144
22 6989
23 82832
24 12388
25 2028
26 42166
27 58338
28 2936
29 73601
30 23663
31 81294
32 23103
33 14118
34 56732
35 50951
36 181
37 28840
38 22929
39 4507
40 5364
41 9134
42 9008
43 56916
44 56323
45 41927
46 25565
47 46884
48 94443
49 1916
50 9981
*/

-- let's see what relations exist within our raw film recommendations
SELECT *
FROM #FilmRecommendation
AS fr1
LEFT JOIN #FilmCloseRelation
AS fcr
ON fr1.FilmID
= fcr.OriginalFilmID
LEFT JOIN #FilmRecommendation
AS fr2
ON fcr.RelatedFilmID
= fr2.FilmID

/*
RowRank FilmID OriginalFilmID RelatedFilmID RowRank FilmID
1 61858 61858 181 36 181
1 61858 61858 1512 NULL NULL
1 61858 61858 1916 49 1916
1 61858 61858 6602 NULL NULL
1 61858 61858 6989 22 6989
1 61858 61858 9981 50 9981
1 61858 61858 13833 17 13833
1 61858 61858 23663 30 23663
1 61858 61858 26608 NULL NULL
1 61858 61858 35256 NULL NULL
1 61858 61858 40261 NULL NULL
1 61858 61858 42285 NULL NULL
1 61858 61858 46884 47 46884
1 61858 61858 50173 NULL NULL
1 61858 61858 53535 NULL NULL
1 61858 61858 56546 NULL NULL
1 61858 61858 65570 NULL NULL
1 61858 61858 66062 8 66062
1 61858 61858 74072 14 74072
1 61858 61858 77171 NULL NULL
1 61858 61858 86561 NULL NULL
1 61858 61858 87351 NULL NULL
1 61858 61858 91239 NULL NULL
2 14407 14407 6757 16 6757
2 14407 14407 9252 NULL NULL
2 14407 14407 13848 15 13848
2 14407 14407 14423 NULL NULL
2 14407 14407 16022 NULL NULL
2 14407 14407 18040 NULL NULL
2 14407 14407 18508 NULL NULL
2 14407 14407 23103 32 23103
2 14407 14407 25565 46 25565
2 14407 14407 31160 NULL NULL
2 14407 14407 39041 4 39041
2 14407 14407 41927 45 41927
2 14407 14407 61368 NULL NULL
2 14407 14407 67025 NULL NULL
2 14407 14407 72138 NULL NULL
2 14407 14407 73601 29 73601
2 14407 14407 88247 NULL NULL
2 14407 14407 93993 NULL NULL
3 59252 59252 4231 NULL NULL
3 59252 59252 6395 13 6395
3 59252 59252 9419 NULL NULL
3 59252 59252 12388 24 12388
3 59252 59252 17526 NULL NULL
3 59252 59252 20946 9 20946
3 59252 59252 29415 NULL NULL
3 59252 59252 30204 NULL NULL
3 59252 59252 47066 NULL NULL
3 59252 59252 56144 21 56144
3 59252 59252 58338 27 58338
3 59252 59252 60757 NULL NULL
3 59252 59252 63069 NULL NULL
3 59252 59252 69012 NULL NULL
3 59252 59252 72043 NULL NULL
3 59252 59252 83614 NULL NULL
4 39041 39041 6757 16 6757
4 39041 39041 9252 NULL NULL
4 39041 39041 13848 15 13848
4 39041 39041 14407 2 14407 -- This row will be culled out since it is related to a higher RowRank
4 39041 39041 14423 NULL NULL
4 39041 39041 16022 NULL NULL
4 39041 39041 18040 NULL NULL
4 39041 39041 18508 NULL NULL
4 39041 39041 23103 32 23103
4 39041 39041 25565 46 25565
4 39041 39041 31160 NULL NULL
4 39041 39041 41927 45 41927
4 39041 39041 61368 NULL NULL
4 39041 39041 67025 NULL NULL
4 39041 39041 72138 NULL NULL
4 39041 39041 73601 29 73601
4 39041 39041 88247 NULL NULL
4 39041 39041 93993 NULL NULL
5 83699 83699 3043 NULL NULL
5 83699 83699 20185 NULL NULL
5 83699 83699 23327 NULL NULL
5 83699 83699 28049 NULL NULL
5 83699 83699 40182 10 40182
5 83699 83699 45557 NULL NULL
5 83699 83699 50755 NULL NULL
5 83699 83699 59840 NULL NULL
5 83699 83699 68949 NULL NULL
5 83699 83699 70928 6 70928
5 83699 83699 72462 NULL NULL
5 83699 83699 81294 31 81294
5 83699 83699 85868 NULL NULL
5 83699 83699 90108 7 90108
5 83699 83699 90911 NULL NULL
5 83699 83699 94299 NULL NULL
*/


-- Now for the finalle, let's cull out films that are related to films higher up in our list.
SELECT *
FROM #FilmRecommendation
AS fr1
LEFT JOIN (
#FilmCloseRelation
AS fcr
JOIN #FilmRecommendation
AS fr2
ON fcr.RelatedFilmID
= fr2.FilmID
)
ON fr1.FilmID
= fcr.OriginalFilmID
AND fr2.RowRank
< fr1.RowRank
WHERE fcr.OriginalFilmID
IS NULL;

/*
-- All recommended films without a related film with a higher rank
-- Notice #4 isn't in the list because it was related to #2

RowRank FilmID OriginalFilmID RelatedFilmID RowRank FilmID
1 61858 NULL NULL NULL NULL
2 14407 NULL NULL NULL NULL
3 59252 NULL NULL NULL NULL
5 83699 NULL NULL NULL NULL
11 96523 NULL NULL NULL NULL
12 37204 NULL NULL NULL NULL
18 68940 NULL NULL NULL NULL
25 2028 NULL NULL NULL NULL
*/


The temp tables at the end could be replaced with CTEs. They are only there so we can look at the contents to illustrate the technique.
Go to Top of Page

robvolk
Most Valuable Yak

15732 Posts

Posted - 2012-09-25 : 14:03:30
I agree with lazerath's premise, you need to make the relationship explicit, rather than matching titles. Another thing is to separate the presentation (HD, SD, streaming, etc.) from the title, that has nothing to do with your recommendation engine.

You also have several ways to consider Terminator and The Expendables related: they're both action films (genre), they both have Arnold in them (actor). But they don't have the same subject (sci-fi/time travel/robots vs. present-day mercenaries) or other factors (SFX, box office, release year, director) that could also guide a recommendation. I know Netflix's recommendation engine is heavily biased towards user's viewing statistics...what they watched, how they rated it, etc.
Go to Top of Page

mattt
Posting Yak Master

194 Posts

Posted - 2012-09-27 : 09:04:49
quote:
Originally posted by lazerath

The hardest part of this problem was setting up an example. I had to simplify the model and the "Recommendation" algorithm and take some liberties to ensure relevant results, but I think I have something that illustrates the technique.



Thanks for this, it's very useful. I just need to work on populated the relatedness table now.
Go to Top of Page
   

- Advertisement -