Saturday, February 25, 2012

Matching algorithm

I have a db full of names and identifiers. Was wondering if anyone could direct me towards a good dupe eliminating algorithm or bit of logic to look at?
Thanks!What will be your basis for eliminating duplicates ?|||Originally posted by rnealejr
What will be your basis for eliminating duplicates ?

Name dupes and identifiers, some names may be similar, some addresses may be simialr. Each row will have approximately 10 identifiers, but some will be nnull. It is fairly dirty data.|||But how do you determine which rows to delete (like keying off the most recent timestamp) or will it be a visual comparison (so will you need to route these duplicates to a holding table) ?|||Originally posted by rnealejr
But how do you determine which rows to delete (like keying off the most recent timestamp) or will it be a visual comparison (so will you need to route these duplicates to a holding table) ?

Yes I wil need to move them to a holding table. It will be compared visually using all.|||You should post more info about your table of duplicities.
Generic query looks like this

select d1.*
into ClearTable
from DuplTable d1
join
(
select IdCol1,IdCol2,...
from DuplTable d2
group by IdCol1,IdCol2,...
having count(*)=1
) x on (d1.IdCol1=x.IdCol1 or (d1.IdCol1 is null and x.IdCol1 is null))
and (d1.IdCol2=x.IdCol2 or (d1.IdCol2 is null and x.IdCol2 is null))
.
.
.
select d1.*
into HoldingTable
from DuplTable d1
join
(
select IdCol1,IdCol2,...
from DuplTable d2
group by IdCol1,IdCol2,...
having count(*)>1
) x on (d1.IdCol1=x.IdCol1 or (d1.IdCol1 is null and x.IdCol1 is null))
and (d1.IdCol2=x.IdCol2 or (d1.IdCol2 is null and x.IdCol2 is null))
.
.
.
select * from ClearTable
select * from HoldingTable|||Originally posted by ispaleny
You should post more info about your table of duplicities.


Awesome, I am going to try that format. My original table is pretty dirty. It is Lname, fname, SSN, address, city, state, country, phone number, etc., about 5 more identifiers. It is around 3 million records, some complete, some not. Spellings are different in some cases, middle names are there in other cases.

Thanks for your help.
:)

No comments:

Post a Comment