Please start any new threads on our new
site at https://forums.sqlteam.com. We've got lots of great SQL Server
experts to answer whatever question you can come up with.
Author |
Topic |
mattt
Posting Yak Master
194 Posts |
Posted - 2012-09-25 : 10:35:51
|
Hi,Imagine you had a database of films. A lot of the content would be very closely related material - HD and SD versions of the same film, and sequels in the same franchise.I work on a recommendations system. If we wanted to use that system to return "similar" films to a seed film it'd make sense to remove items in the results that are either closely related to the seed film or to each other.Finding "closely related" doesn't have to be exact. I've been grabbing the first ten characters of the title and assuming anything else with the same first ten characters is "related" and excluding it. Doing that to make sure nothing "closely related" to the seed appears in the results is easy.However, pulling the same trick with the other results is much more difficult. Let's say that the seed film is Terminator. We can remove Terminator HD and the Terminator sequels easily. However, a likely result might be The Expendables. Once we've got one of those items in the result set I need to try and make sure that no more "closely related" items to the Expendables appear in that result set.I need to do this efficiently - response times need to be sub-second and the data tables involved are very large. No way I can think of to do this is remotely efficient. Is there one?Cheers,Matt |
|
lazerath
Constraint Violating Yak Guru
343 Posts |
Posted - 2012-09-25 : 11:37:36
|
I would consider keeping a physical representation of "closely related" relationships in the form of an intersection table with OriginalFilmID + RelatedFilmID. You can then break out the algorithm to find these related films so it isn't a factor in your response times. Since you only need to update this table when the Film list changes, this gives the added benefit in that you can use multiple methods and the methods can be more exact than first 10 characters, which I would guess is very rudamentary.Once you have this intersection table, you can use it to feed the algorithm that returns results. I'd consider doing the heavy lifting in the middle tier because you could scale it horizontally and relieve some pressure on your db server. I'll have to think about how I'd structure the query in the DB layer to be most efficient, but it should be doable. I'll get back to you on that. |
 |
|
lazerath
Constraint Violating Yak Guru
343 Posts |
Posted - 2012-09-25 : 13:40:47
|
The hardest part of this problem was setting up an example. I had to simplify the model and the "Recommendation" algorithm and take some liberties to ensure relevant results, but I think I have something that illustrates the technique.Overall, the goal is to rank the recommendation list. Once you have that, you can associate it with the closely related films and then remove items from the list that relate to a film with a higher rank. This operation should be fast if you have the right indexes in place, but it also depends on how complex your recommendation algorithm is.Hope this helps.-- *********************************************************-- Begin Sample Data Initialization-- *********************************************************-- Cleanup our temp tablesIF OBJECT_ID('tempdb..#Film') IS NOT NULL DROP TABLE #Film;IF OBJECT_ID('tempdb..#FilmCloseRelation') IS NOT NULL DROP TABLE #FilmCloseRelation;IF OBJECT_ID('tempdb..#FilmRange') IS NOT NULL DROP TABLE #FilmRange;-- Declare our #Film table, which is the anchor of our sample modelCREATE TABLE #Film ( FilmID INT NOT NULL PRIMARY KEY, RangeID CHAR(1) NOT NULL ); -- Declare our #FilmCloseRelation table, which shows the application of the conceptCREATE TABLE #FilmCloseRelation ( OriginalFilmID INT NOT NULL, RelatedFilmID INT NOT NULL PRIMARY KEY ( OriginalFilmID, RelatedFilmID ) );-- The following CTE helps initialize #Film by producing a list of 10 number from 0 to 9WITH cteN10(N)AS ( -- - - - - SELECT spt.number FROM master..spt_values AS spt WHERE spt.[type] = 'P' AND spt.Number < 10) -- - - - --- With a cartesian product between 5 N10s, we take our 10 numbers to 100k rows and add a RangeID (important later), cteN100K(N,RangeID)AS ( -- - - - - SELECT 1, RIGHT(CONVERT(VARCHAR(36),NEWID()),1) FROM cteN10 a, cteN10 b, cteN10 c, cteN10 d, cteN10 e ) -- - - - --- Initialize #Film with 100k rows & a RangeIDINSERT #FilmSELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)), RangeIDFROM cteN100K;-- Check for our work table for Film Grouping and do cleanup if existsIF OBJECT_ID('tempdb..#FilmRange') IS NOT NULL DROP TABLE #FilmRange;-- Populate our work table for Film GroupingSELECT RowID=IDENTITY(INT,1,1), f.FilmID, f.RangeID, CONVERT(INT,NULL) as RangeAnchorINTO #FilmRangeFROM #Film AS fORDER BY NEWID()DECLARE @LastRowID INT;-- Set the RangeAnchor so that a random selection of Films have the -- same anchor with a higher distribution toward lower numbersUPDATE fSET @LastRowID= CASE WHEN f.RangeID IN('0','1','2','3','4','5','6') THEN f.RowID ELSE @LastRowID END, RangeAnchor= @LastRowIDFROM #FilmRangeAS f;-- Remove SinglesWITH cteFilmRangeAS(SELECT f.FilmID, f.RangeAnchor, COUNT(*) OVER(PARTITION BY f.RangeAnchor) AS AnchorCountFROM #FilmRangeAS f)DELETE cteFilmRangeWHERE AnchorCount= 1;-- Establish our Close RelationshipsINSERT #FilmCloseRelationSELECT f1.FilmID, f2.FilmIDFROM #FilmRangeAS f1JOIN #FilmRangeAS f2ON f1.RangeAnchor= f2.RangeAnchorAND f1.FilmID<> f2.FilmID-- Cleanup our temp work tableIF OBJECT_ID('tempdb..#FilmRange') IS NOT NULL DROP TABLE #FilmRange;-- Establish top 50 Ranked Film Recommendation example recordset IF OBJECT_ID('tempdb..#FilmRecommendation') IS NOT NULL DROP TABLE #FilmRecommendation;-- First 2 CTEs are only here to guarantee us some results that have related films for testing purposesWITH cteFilmRelationCountAS(SELECT fcr.OriginalFilmID, fcr.RelatedFilmID, COUNT(*) OVER(PARTITION BY fcr.OriginalFilmID) AS FilmCountFROM #FilmCloseRelationAS fcr), cteFilmsWithMostRelativesAS(SELECT TOP 2000 c.RelatedFilmID AS FilmIDFROM cteFilmRelationCountAS cORDER BY c.FilmCount DESC)-- this establishes our 50 recommended films ranked for the user, cteSampleRankedFilmsAS(SELECT TOP 50 f.FilmID, NEWID() AS RowGUIDFROM #FilmAS f-- Ensure we have results that overlap !WHERE EXISTS(SELECT *FROM cteFilmsWithMostRelativesAS cWHERE f.FilmID= c.FilmID )ORDER BY RowGUID)SELECT ROW_NUMBER() OVER(ORDER BY c.RowGUID) AS RowRank, c.FilmIDINTO #FilmRecommendationFROM cteSampleRankedFilmsAS c-- *********************************************************-- Begin Example-- *********************************************************-- Let's see our "raw" film recommendationsSELECT *FROM #FilmRecommendation;/*RowRank FilmID1 618582 144073 592524 390415 836996 709287 901088 660629 2094610 4018211 9652312 3720413 639514 7407215 1384816 675717 1383318 6894019 4768720 4865421 5614422 698923 8283224 1238825 202826 4216627 5833828 293629 7360130 2366331 8129432 2310333 1411834 5673235 5095136 18137 2884038 2292939 450740 536441 913442 900843 5691644 5632345 4192746 2556547 4688448 9444349 191650 9981*/-- let's see what relations exist within our raw film recommendationsSELECT * FROM #FilmRecommendation AS fr1LEFT JOIN #FilmCloseRelationAS fcrON fr1.FilmID= fcr.OriginalFilmIDLEFT JOIN #FilmRecommendation AS fr2ON fcr.RelatedFilmID= fr2.FilmID/*RowRank FilmID OriginalFilmID RelatedFilmID RowRank FilmID1 61858 61858 181 36 1811 61858 61858 1512 NULL NULL1 61858 61858 1916 49 19161 61858 61858 6602 NULL NULL1 61858 61858 6989 22 69891 61858 61858 9981 50 99811 61858 61858 13833 17 138331 61858 61858 23663 30 236631 61858 61858 26608 NULL NULL1 61858 61858 35256 NULL NULL1 61858 61858 40261 NULL NULL1 61858 61858 42285 NULL NULL1 61858 61858 46884 47 468841 61858 61858 50173 NULL NULL1 61858 61858 53535 NULL NULL1 61858 61858 56546 NULL NULL1 61858 61858 65570 NULL NULL1 61858 61858 66062 8 660621 61858 61858 74072 14 740721 61858 61858 77171 NULL NULL1 61858 61858 86561 NULL NULL1 61858 61858 87351 NULL NULL1 61858 61858 91239 NULL NULL2 14407 14407 6757 16 67572 14407 14407 9252 NULL NULL2 14407 14407 13848 15 138482 14407 14407 14423 NULL NULL2 14407 14407 16022 NULL NULL2 14407 14407 18040 NULL NULL2 14407 14407 18508 NULL NULL2 14407 14407 23103 32 231032 14407 14407 25565 46 255652 14407 14407 31160 NULL NULL2 14407 14407 39041 4 390412 14407 14407 41927 45 419272 14407 14407 61368 NULL NULL2 14407 14407 67025 NULL NULL2 14407 14407 72138 NULL NULL2 14407 14407 73601 29 736012 14407 14407 88247 NULL NULL2 14407 14407 93993 NULL NULL3 59252 59252 4231 NULL NULL3 59252 59252 6395 13 63953 59252 59252 9419 NULL NULL3 59252 59252 12388 24 123883 59252 59252 17526 NULL NULL3 59252 59252 20946 9 209463 59252 59252 29415 NULL NULL3 59252 59252 30204 NULL NULL3 59252 59252 47066 NULL NULL3 59252 59252 56144 21 561443 59252 59252 58338 27 583383 59252 59252 60757 NULL NULL3 59252 59252 63069 NULL NULL3 59252 59252 69012 NULL NULL3 59252 59252 72043 NULL NULL3 59252 59252 83614 NULL NULL4 39041 39041 6757 16 67574 39041 39041 9252 NULL NULL4 39041 39041 13848 15 138484 39041 39041 14407 2 14407 -- This row will be culled out since it is related to a higher RowRank4 39041 39041 14423 NULL NULL4 39041 39041 16022 NULL NULL4 39041 39041 18040 NULL NULL4 39041 39041 18508 NULL NULL4 39041 39041 23103 32 231034 39041 39041 25565 46 255654 39041 39041 31160 NULL NULL4 39041 39041 41927 45 419274 39041 39041 61368 NULL NULL4 39041 39041 67025 NULL NULL4 39041 39041 72138 NULL NULL4 39041 39041 73601 29 736014 39041 39041 88247 NULL NULL4 39041 39041 93993 NULL NULL5 83699 83699 3043 NULL NULL5 83699 83699 20185 NULL NULL5 83699 83699 23327 NULL NULL5 83699 83699 28049 NULL NULL5 83699 83699 40182 10 401825 83699 83699 45557 NULL NULL5 83699 83699 50755 NULL NULL5 83699 83699 59840 NULL NULL5 83699 83699 68949 NULL NULL5 83699 83699 70928 6 709285 83699 83699 72462 NULL NULL5 83699 83699 81294 31 812945 83699 83699 85868 NULL NULL5 83699 83699 90108 7 901085 83699 83699 90911 NULL NULL5 83699 83699 94299 NULL NULL*/-- Now for the finalle, let's cull out films that are related to films higher up in our list.SELECT * FROM #FilmRecommendation AS fr1LEFT JOIN ( #FilmCloseRelationAS fcrJOIN #FilmRecommendation AS fr2ON fcr.RelatedFilmID= fr2.FilmID )ON fr1.FilmID= fcr.OriginalFilmIDAND fr2.RowRank< fr1.RowRankWHERE fcr.OriginalFilmIDIS NULL;/*-- All recommended films without a related film with a higher rank-- Notice #4 isn't in the list because it was related to #2RowRank FilmID OriginalFilmID RelatedFilmID RowRank FilmID1 61858 NULL NULL NULL NULL2 14407 NULL NULL NULL NULL3 59252 NULL NULL NULL NULL5 83699 NULL NULL NULL NULL11 96523 NULL NULL NULL NULL12 37204 NULL NULL NULL NULL18 68940 NULL NULL NULL NULL25 2028 NULL NULL NULL NULL*/ The temp tables at the end could be replaced with CTEs. They are only there so we can look at the contents to illustrate the technique. |
 |
|
robvolk
Most Valuable Yak
15732 Posts |
Posted - 2012-09-25 : 14:03:30
|
I agree with lazerath's premise, you need to make the relationship explicit, rather than matching titles. Another thing is to separate the presentation (HD, SD, streaming, etc.) from the title, that has nothing to do with your recommendation engine.You also have several ways to consider Terminator and The Expendables related: they're both action films (genre), they both have Arnold in them (actor). But they don't have the same subject (sci-fi/time travel/robots vs. present-day mercenaries) or other factors (SFX, box office, release year, director) that could also guide a recommendation. I know Netflix's recommendation engine is heavily biased towards user's viewing statistics...what they watched, how they rated it, etc. |
 |
|
mattt
Posting Yak Master
194 Posts |
Posted - 2012-09-27 : 09:04:49
|
quote: Originally posted by lazerath The hardest part of this problem was setting up an example. I had to simplify the model and the "Recommendation" algorithm and take some liberties to ensure relevant results, but I think I have something that illustrates the technique.
Thanks for this, it's very useful. I just need to work on populated the relatedness table now. |
 |
|
|
|
|
|
|