Levenshtein performance 'tuning' advise

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

General SQL Server Forums

New to SQL Server Programming

Levenshtein performance 'tuning' advise

Author

Topic

WonderTiger
Starting Member

4 Posts

Posted - 2014-11-05 : 04:46:13

Hello,

First off let me introduce myself. I'm Mike, 19 years old and living in the Netherlands. I'm currently working as trainee on a C#/SQL application. I'm totally new to SQL so that's why I need some help from you SQL guru's :).

So let me first explain my problem. The goal of the application is to match different files (e.g. Excel, CSV, TXT, database formats, etc.) with eachother. For example, we have a supplier list that contains product descriptions in the first file. The second file (from the administrative system) contains the same description but every description could be a little different compared to the product descriptions in the first file (e.g. extra dots, semicolons, letters, numbers, etc.). Now both of these files are imported to a SQL database table. To improve the matching the user first can remove, insert, change characters from a column (or multiple columns). After these edits the files should have a better match, however it still isn't going to be a 100% match.

This is where the levenshtein algorithm comes into play. With the levenshtein algorithm I'm able to get an approximate match between two strings, e.g. '10000' has 80% similarity to '00000'. This algorithm would be very useful to do the final match between my two imported files. So I made a levenshtein UDF from a dynamic link library. This function has two nvarchar's as parameter. The result from the function is a float that has a range from 0 - 100%. This function is then used in a query that looks like this:

SELECT T1.*, T2.* FROM myTable_1 T1 LEFT JOIN myTable_2 T2 
ON (master.dbo.fn_Levenshtein(T1.[column_T1], T2.[column_T2]) >= percentage);

This query is working great! It returns the correct results I'm expecting. So what is the problem then? Performance! To match two tables with 6000 records the query takes around 1 minute and 30 seconds to finish. This is to be expected though, as the query has to iterate through 6000*6000 = 12.000.000 rows. Now I need some help to lower this number of iterations.

I was thinking about two solutions. The first solution I was thinking of is to stop iterating through the rows when two results are returned from the levenshtein comparison. So when in the first 50 rows 2 results are already found with the desired percentage, then break the iteration so we could skip the other 5950 rows.

I also was thinking about the LIKE statement. Not to use that particular function, but how the LIKE mechanism is working. As I could see in the executing plan the LIKE statement is only iterating through the row count of the table (6000 times) and not the +- 12.000.000 I get in my levenshtein function.

If anyone can help me out with 'tuning' the performance of my query I would really appreciate that!

I'll hope everything I explained is clear to understand, if not please tell me so I can try to explain myself even better.

gbritton
Master Smack Fu Yak Hacker

2780 Posts

Posted - 2014-11-05 : 09:03:27

You might want to look into SQL Server's full text search options instead. Like Levenshtein, you can get ranked matches. However, SQL will do it internally and much faster.

In general, scalar UDFs in SQL perform badly.

WonderTiger
Starting Member

4 Posts

Posted - 2014-11-06 : 03:10:36

Thanks for your answer. I tried the full text search function, but this isn't exactly what I want. As the LIKE operator, the CONTAINS or FREETEXT operators will search the columns on the exact word you give as second parameter. E.g. if i have the word 'Green' and my search condition is 'Geen' then no results are returned. Besides that, the CONTAINS and FREETEXT operators won't accept a column as second parameter as you can see here --> http://msdn.microsoft.com/en-us/library/ms187787.aspx. I could probably make a workaround, however then still I won't get the function I achieved with the Levenshtein algorithm.

gbritton
Master Smack Fu Yak Hacker

2780 Posts

Posted - 2014-11-06 : 09:00:01

Well, to say that "green" is a close match to "geen" is not generally useful, IMO. You live in the Netherlands, so mag ik vragen, " is het waar dat 'groen' dichtbij 'geen' is, wat betekenis betreft? Volgens mij niet!"

What I'm really saying is that closeness depends on the linguistic context -- not just the combination of letters and numbers. If you are looking at product descriptions -- especially if the suppliers may use Dutch or English -- then to say "green" is close to "geen" is misleading at best and completely wrong at worst.

Not to say that SQL's full text search is necessarily better in that regard, but at least it supports linguistic contextualization through the thesaurus, though you would need to identify the source language of the text you are searching.

Note that you can get around the limitation you identified using dynamic SQL, though it leads to a somewhat-uglier solution.

WonderTiger
Starting Member

4 Posts

Posted - 2014-11-07 : 10:26:45

Green and Geen were just an example. In the real tables I will compare product descriptions like 'DFE911' Compared to 'DFD911' or 'DFE 9.11' etc.

gbritton
Master Smack Fu Yak Hacker

2780 Posts

Posted - 2014-11-07 : 11:01:22

OK that makes more sense. So the descriptions are really codes.

WonderTiger
Starting Member

4 Posts

Posted - 2014-11-10 : 02:33:44

Let me explain my wishes a bit better. Lets say we have these two tables:

We first going to match on the part numbers. This is because this column has the highest number of 100% matches, except the last one.
As you can see the part numbers differ between the two tables. Then only on the last record has to be matched on the discription from the whole second table. We only want to show the matches higher then, for example, 80% similarity.

Currently I have done it this way:

SELECT T1.*, T3.*, master.dbo.fn_Levenshtein(T1.Description, T3.Description, 80) AS percentage FROM  Table_1 AS T1
LEFT join Table_2 AS T2 ON (T1.Partnumber = T2.Partnumber)
LEFT join Table_2 AS T3 ON (master.dbo.fn_Levenshtein(T1.Description, T3.Description) >= 80)
WHERE T2.Partnumber IS NULL;

The query returns this as result:

So this is working as expected, however it is still slow whenever there aren't many 100% matches. I would really appreciate if someone could help me speeding up this query.

gbritton
Master Smack Fu Yak Hacker

2780 Posts

Posted - 2014-11-10 : 08:54:59

your query is about as good as you can get it and still produce the results you want. As I mentioned before, scalar functions (which is what Levenshtein is) are notorious for turning speedy queries into slow ones.

Another option would be to exclude the function from the query and process the data (including the Levens. matching) in C# instead of SQL.

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources