Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2005 Forums
 Transact-SQL (2005)
 Match r out of n columns

Author  Topic 

LaurieCox

158 Posts

Posted - 2011-02-24 : 16:03:24
Hi,

I have two tables (ddl and sample data at end of post) and I want to find matches between the two tables where the data in at least r out n columns match.

If say I wanted at least three of the columns to match I could code it this way:
[CODE]
SELECT PN.ID AS PROGID
,BH.ID AS BILLID
from #ProgressNotes PN
,#BillingHistory BH
where PN.PATID = BH.PATID -- I will always match on PATID
and (
(PN.Serv_Start_Date = BH.Serv_Start_Date
and PN.serv_start_time = BH.serv_start_time
and PN.bill_duration = BH.bill_duration)

or (PN.Serv_Start_Date = BH.Serv_Start_Date
and PN.serv_start_time = BH.serv_start_time
and PN.bill_program_value = BH.bill_program_value)
)
--... for every possible group of three ...
order by PN.ID,BH.ID
[/CODE]
... but that is a lot* of typing. And what happens if I decide that four columns have to match (or only two)? Or what if the number of total columns to check for matching is not 6 but 8?
quote:
* I think the number of "or" clauses = (n!)/(r!(n-r)!)
Where:
n = number of columns
r = number of columns that must match
and each "or" clause would have r "and" clauses

And once you start throwing factorials around things can get ugly really fast ...So unless I can come up with a different approach to this problem, it is pretty much hopeless.

From the sample data (below) the code above gives this output:
[CODE]
PROGID BILLID
====== ======
NOTE1 BILL5
[/CODE]
If I completed the code (and made no typos in the sample data) the output should be:
[CODE]
PROGID BILLID
====== ======
NOTE1 BILL1
NOTE1 BILL2
NOTE1 BILL5
NOTE2 BILL4
[/CODE]

Thanks,

Laurie


Table DDL and sample data:
[CODE]
create table #ProgressNotes
(ID varchar(5),
PATID varchar(20),
Serv_start_date datetime,
serv_start_time varchar(10),
bill_duration int,
bill_program_value varchar(40),
serv_clinician varchar(20),
bill_location varchar(40))

insert into #ProgressNotes
SELECT 'NOTE1', '10000', '2010-05-11', '01:19 PM', 0, 'MIA SUPPORT',
'008409', 'Jail' UNION ALL
SELECT 'NOTE2', '20000', '2010-09-05', '08:20 PM', 0, 'MIA CRISIS',
'008475', 'Home'

create table #BillingHistory
(ID varchar(5),
PATID varchar(20),
Serv_start_date datetime,
serv_start_time varchar(10),
bill_duration int,
bill_program_value varchar(40),
serv_clinician varchar(20),
bill_location varchar(40))

insert into #BillingHistory
SELECT 'BILL1', '10000', '2010-05-11', '01:19 PM', 20, 'DDD SUPPORT',
'008409', 'Other' UNION ALL
SELECT 'BILL2', '10000', '2010-05-12', '03:20 PM', 10, 'MIA SUPPORT',
'008409', 'Jail' UNION ALL
SELECT 'BILL3', '10000', '2010-05-11', '03:20 PM', 0, 'INQUIRY',
'009999', 'Home' UNION ALL
SELECT 'BILL4', '20000', '2010-09-05', '08:11 PM', 0, 'INQUIRY',
'008700', 'Home' UNION ALL
SELECT 'BILL5', '10000', '2010-05-11', '01:19 PM', 10, 'MIA SUPPORT',
'001111', 'Other'
[/CODE]

robvolk
Most Valuable Yak

15732 Posts

Posted - 2011-02-24 : 16:29:39
This doesn't quite do what you want...or maybe it does...but anyway it might give you an idea:

;with col6(PATID,Serv_start_date,serv_start_time,bill_duration,bill_program_value,serv_clinician,bill_location)
as (select PATID,Serv_start_date,serv_start_time,bill_duration,bill_program_value,serv_clinician,bill_location
from #ProgressNotes
intersect
select PATID,Serv_start_date,serv_start_time,bill_duration,bill_program_value,serv_clinician,bill_location
from #BillingHistory),
col5(PATID,Serv_start_date,serv_start_time,bill_duration,bill_program_value,serv_clinician)
as (select PATID,Serv_start_date,serv_start_time,bill_duration,bill_program_value,serv_clinician
from #ProgressNotes
intersect
select PATID,Serv_start_date,serv_start_time,bill_duration,bill_program_value,serv_clinician
from #BillingHistory),
col4(PATID,Serv_start_date,serv_start_time,bill_duration,bill_program_value)
as (select PATID,Serv_start_date,serv_start_time,bill_duration,bill_program_value
from #ProgressNotes
intersect
select PATID,Serv_start_date,serv_start_time,bill_duration,bill_program_value
from #BillingHistory),
col3(PATID,Serv_start_date,serv_start_time,bill_duration)
as (select PATID,Serv_start_date,serv_start_time,bill_duration
from #ProgressNotes
intersect
select PATID,Serv_start_date,serv_start_time,bill_duration
from #BillingHistory),
col2(PATID,Serv_start_date,serv_start_time)
as (select PATID,Serv_start_date,serv_start_time
from #ProgressNotes
intersect
select PATID,Serv_start_date,serv_start_time
from #BillingHistory),
col1(PATID,Serv_start_date)
as (select PATID,Serv_start_date
from #ProgressNotes
intersect
select PATID,Serv_start_date
from #BillingHistory),
all_CTEs(Match, PATID) AS (select 6, PATID from col6
union all select 5, PATID from col5
union all select 4, PATID from col4
union all select 3, PATID from col3
union all select 2, PATID from col2
union all select 1, PATID from col1)
select PATID, MAX(Match) MatchedColumns FROM all_CTEs GROUP BY PATID
Each CTE compares a different set of columns, then all_CTEs unions them and finds the largest set of matched columns for each PATID.

If you need to work all combinations for n columns, then you'll need n! CTEs plus an all_CTE that unions them. It's a quick copy/paste/remove-extra-columns operation.

edit: it's probably not n!, but you can work out the math.
Go to Top of Page

LaurieCox

158 Posts

Posted - 2011-02-25 : 14:23:01
Hi robvolk,

Thanks for the reply. I got pulled off to do something else today, but I look forward to trying to figure out your code. I will have to read up on cte to figure out what is going on there.

Anyway, thanks again for your help.

Laurie
Go to Top of Page
   

- Advertisement -