and other columns.
I want to assign group number according to this business logic.
1. Records with equal SSN and (similar first name or last name) belong
to the same group.
John Smith 1234
Smith John 1234
S John 1234
J Smith 1234
John Smith and Smith John falls in the same group Number as long as
they have similar SSN.
This is because I have a record of equal SSN but the first name and
last name is switched because of people who make error inserting last
name as first name and vice versa. John Smith and Smith John will have
equal group Name if they have equal SSN.
2. There are records with equal SSN but different first name and last
name. These belong to different group numbers.
Equal SSN doesn't guarantee equal group number, at least one of the
first name or last name should be the same. John Smith and Dan Brown
with equal SSN=1234 shouldn't fall in the same group number.
Sample data:
Id Fname lname SSN grpNum
1 John Smith 1234 1
2 Smith John 1234 1
3 S John 1234 1
4 J Smith 1234 1
5 J S 1234 1
6 Dan Brown 1234 2
7 John Smith 1111 3
I have tried this code for 65,000 rows. It took 20 minute. I have to
run it for 21 million row data. I now that this is not an efficient
code.
INSERT into temp_FnLnSSN_grp
SELECT c1.fname, c1.lname, c1.ssn AS ssn, c3.tu_id,
(SELECT 1 + count(*)
FROM distFLS AS c2
WHERE c2.ssn < c1.ssn
or (c2.ssn = c1.ssn and (substring(c2.fname,1,1) =
substring(c1.fname,1,1) or substring(c2.lname,1,1) =
substring(c1.lname,1,1)
or substring(c2.fname,1,1) =
substring(c1.lname,1,1) or substring(c2.lname,1,1) =
substring(c1.fname,1,1))
)) AS group_number
FROM distFLS AS c1
JOIN tu_people_data AS c3
ON (c1.ssn = c3.ssn and
c1.fname = c3.fname and
c1.lname= c3.lname)
dist FLS is distinct First Name, last Name and SSN table from the
people table.
I have posted part of this question, schema one w

this thread.
http://groups.google.com/group/comp...6eb380b5f2e6de6Basically, this is just a query that sorts or groups on a CASE function.
However, the catch is how we want to classify different names as "similar".
I would say that two rows should be considered similar if they have the same
SSN and the names start with the same first letter. Instead of a group
number, let's do a group code which conists of those 2 characters. Since
fname and lname may be transposed, the lowest of the 2 characters will be
encoded first followed by the highest character.
fname lname SSN grpCode
-- -- -- --
J S 1234 JS
J Smith 1234 JS
S John 1234 JS
John Smith 1111 JS
John Smith 1234 JS
Smith John 1234 JS
Dan Brown 1234 BD
select lname, fname, SSN, grpCode
from
(
select
fname,
lname,
SSN,
-- Here we calculate the grpCode:
case
when left(fname,1) <= left(lname,1) then left(fname,1)
else left(lname,1)
end as grpCode
--
from
distFLS
) as x
order by
SSN,
grpCode,
fname,
lname
<jacob.dba@.gmail.com> wrote in message
news:1143482451.181115.64620@.v46g2000cwv.googlegroups.com...
>I have a table with first name, last name, SSN(social security number)
> and other columns.
> I want to assign group number according to this business logic.
> 1. Records with equal SSN and (similar first name or last name) belong
> to the same group.
> John Smith 1234
> Smith John 1234
> S John 1234
> J Smith 1234
> John Smith and Smith John falls in the same group Number as long as
> they have similar SSN.
> This is because I have a record of equal SSN but the first name and
> last name is switched because of people who make error inserting last
> name as first name and vice versa. John Smith and Smith John will have
> equal group Name if they have equal SSN.
> 2. There are records with equal SSN but different first name and last
> name. These belong to different group numbers.
> Equal SSN doesn't guarantee equal group number, at least one of the
> first name or last name should be the same. John Smith and Dan Brown
> with equal SSN=1234 shouldn't fall in the same group number.
>
> Sample data:
> Id Fname lname SSN grpNum
> 1 John Smith 1234 1
> 2 Smith John 1234 1
> 3 S John 1234 1
> 4 J Smith 1234 1
> 5 J S 1234 1
> 6 Dan Brown 1234 2
> 7 John Smith 1111 3
>
> I have tried this code for 65,000 rows. It took 20 minute. I have to
> run it for 21 million row data. I now that this is not an efficient
> code.
>
> INSERT into temp_FnLnSSN_grp
> SELECT c1.fname, c1.lname, c1.ssn AS ssn, c3.tu_id,
> (SELECT 1 + count(*)
> FROM distFLS AS c2
> WHERE c2.ssn < c1.ssn
> or (c2.ssn = c1.ssn and (substring(c2.fname,1,1) =
> substring(c1.fname,1,1) or substring(c2.lname,1,1) =
> substring(c1.lname,1,1)
> or substring(c2.fname,1,1) =
> substring(c1.lname,1,1) or substring(c2.lname,1,1) =
> substring(c1.fname,1,1))
> )) AS group_number
> FROM distFLS AS c1
> JOIN tu_people_data AS c3
> ON (c1.ssn = c3.ssn and
> c1.fname = c3.fname and
> c1.lname= c3.lname)
>
> dist FLS is distinct First Name, last Name and SSN table from the
> people table.
>
> I have posted part of this question, schema one w

> this thread.
>
> http://groups.google.com/group/comp...6eb380b5f2e6de6
>|||The group code calculation returns only with one letter.
I have added this code on it.
-- Here we calculate the grpCode:
case
when left(fname,1) <= left(lname,1) then left(fname,1) +
left(lname,1)
else left(lname,1) +left(fname,1)
end as grpCode
--|||I didn't run it on my end.
Thanks.
<jacob.dba@.gmail.com> wrote in message
news:1143487218.339033.116420@.v46g2000cwv.googlegroups.com...
> The group code calculation returns only with one letter.
> I have added this code on it.
> -- Here we calculate the grpCode:
> case
> when left(fname,1) <= left(lname,1) then left(fname,1) +
> left(lname,1)
> else left(lname,1) +left(fname,1)
> end as grpCode
> --
>|||I fogot to mention that some of the records have middle name entered
in place of first name or last name.
fname mname lname ssn
John coleman smith 1234
john smith coleman 1234
john S coleman 1234
John C Smith 1234
John Smith 1234
John-coleman Smith 1234
Smith John 1234
During the grouping process I am concerned only about fname,lname,
ssn.(no need of middle name). If there is other suggestion to include
columns I am happy to accept.
I have the idea to assign groups if one of the initial of the names is
similar with the others considering that the SSN is the same. that
means if SSN is equal and if J or S or C are there as an initial in the
names, we can say they are in the same group.|||Just revise the case function as needed, but the concept is the same.
<jacob.dba@.gmail.com> wrote in message
news:1143490139.170226.286890@.g10g2000cwb.googlegroups.com...
> I fogot to mention that some of the records have middle name entered
> in place of first name or last name.
> fname mname lname ssn
> John coleman smith 1234
> john smith coleman 1234
> john S coleman 1234
> John C Smith 1234
> John Smith 1234
> John-coleman Smith 1234
> Smith John 1234
> During the grouping process I am concerned only about fname,lname,
> ssn.(no need of middle name). If there is other suggestion to include
> columns I am happy to accept.
> I have the idea to assign groups if one of the initial of the names is
> similar with the others considering that the SSN is the same. that
> means if SSN is equal and if J or S or C are there as an initial in the
> names, we can say they are in the same group.
>|||Consider using Integration Services as that tool has a Fuzzy Lookup and
Fuzzy Grouping tasks that were specifically designed for this type of work.
<jacob.dba@.gmail.com> wrote in message
news:1143482451.181115.64620@.v46g2000cwv.googlegroups.com...
>I have a table with first name, last name, SSN(social security number)
> and other columns.
> I want to assign group number according to this business logic.
> 1. Records with equal SSN and (similar first name or last name) belong
> to the same group.
> John Smith 1234
> Smith John 1234
> S John 1234
> J Smith 1234
> John Smith and Smith John falls in the same group Number as long as
> they have similar SSN.
> This is because I have a record of equal SSN but the first name and
> last name is switched because of people who make error inserting last
> name as first name and vice versa. John Smith and Smith John will have
> equal group Name if they have equal SSN.
> 2. There are records with equal SSN but different first name and last
> name. These belong to different group numbers.
> Equal SSN doesn't guarantee equal group number, at least one of the
> first name or last name should be the same. John Smith and Dan Brown
> with equal SSN=1234 shouldn't fall in the same group number.
>
> Sample data:
> Id Fname lname SSN grpNum
> 1 John Smith 1234 1
> 2 Smith John 1234 1
> 3 S John 1234 1
> 4 J Smith 1234 1
> 5 J S 1234 1
> 6 Dan Brown 1234 2
> 7 John Smith 1111 3
>
> I have tried this code for 65,000 rows. It took 20 minute. I have to
> run it for 21 million row data. I now that this is not an efficient
> code.
>
> INSERT into temp_FnLnSSN_grp
> SELECT c1.fname, c1.lname, c1.ssn AS ssn, c3.tu_id,
> (SELECT 1 + count(*)
> FROM distFLS AS c2
> WHERE c2.ssn < c1.ssn
> or (c2.ssn = c1.ssn and (substring(c2.fname,1,1) =
> substring(c1.fname,1,1) or substring(c2.lname,1,1) =
> substring(c1.lname,1,1)
> or substring(c2.fname,1,1) =
> substring(c1.lname,1,1) or substring(c2.lname,1,1) =
> substring(c1.fname,1,1))
> )) AS group_number
> FROM distFLS AS c1
> JOIN tu_people_data AS c3
> ON (c1.ssn = c3.ssn and
> c1.fname = c3.fname and
> c1.lname= c3.lname)
>
> dist FLS is distinct First Name, last Name and SSN table from the
> people table.
>
> I have posted part of this question, schema one w

> this thread.
>
> http://groups.google.com/group/comp...6eb380b5f2e6de6
>
No comments:
Post a Comment