Thursday, March 22, 2012

At my wits' end: LIKE

(SQL Server 2005, express edition)

I have a list of table names that I need to translate according to a
naming convention. I'm doing this using pattern matching in a LIKE
clause.

In one specific case I get no match where I believe that there should
be one. I must be missing something obvious here, but what?

I have boiled it down to this example (the real one is more complex):

Matching on the first four characters I get a match:

select 'yes'
where 'TBAAA243_D_AFTBEL' like 'TBAA%';

--
yes

(1 row(s) affected)

That is fine, just as I would have expected. But if I try to match
only on the first 3 characters, I get this:

select 'yes'
where 'TBAAA243_D_AFTBEL' like 'TBA%';

(0 row(s) affected)

I have also tried the same on Enterprise Edition and get the same
strange result. Language is set to us_english.

What am I missing here?

Any help appreciated, before I tear out the very last of my remaining
hair.

Bo Brunsgaard

|||Have you applied any service packs? If not, try installing Express SP2
(http://www.microsoft.com/downloads/...displaylang=en).
I get the correct results on my SP2 Developer Edition instance:

select 'yes'
where 'TBAAA243_D_AFTBEL' like 'TBA%';

--
yes

(1 row(s) affected)

--
Hope this helps.

Dan Guzman
SQL Server MVP

|||On 16 Apr., 13:46, "Dan Guzman" <guzma...@.nospam-online.sbcglobal.net>
wrote:
> Have you applied any service packs? If not, try installing Express SP2
> (http://www.microsoft.com/downloads/...711d5d-725...).
> I get the correct results on my SP2 Developer Edition instance:

I upgraded to SP2, but the problem persisted. It turns out that it is
hidden deep inside the finer points of the database collation. I
thought this was kind of interesting in a low-intensity way, so here's
the story:

Our databases are running a collation of Danish_Norwegian_CS_AS (we
are a Danish company).

In Danish we have three special phonemes that are represented in
writing as the letters Æ, Ø and Å. These three letters are
alphabetically placed as the last three letters of the alphabet.

The last one turns out to be the culprit (if it doesn't show up properly,
imagine an upper-case A with a small circle superimposed on it).

Using the letter Å for the phoneme [å] is a fairly recent addition to
Danish (around the 1950s). Traditionally it was written as "AA". For
instance, my surname can be written as either "Brunsgård" or
"Brunsgaard", but is still considered the same name.

So in Danish, "AA" can be either the traditional writing of the
phoneme [å] OR just two "A"s which happen to be consecutive.

The Danish_Norwegian_CS_AS collation recognizes "AA" as "Å". This is
usually really neat for sorting. Consider the last names "Ågård" and
"Aagaard" - these should be sorted together at the end of a list, and
using any Danish_Norwegian collation will ensure just that.

Consider:

create table taDanishDemo
(
nameInDanish varchar(30)
collate Danish_Norwegian_CS_AS

, nameInEnglish varchar(30)
collate Latin1_General_CS_AS
)
;

Let us insert a couple of rows which contain a case of consecutive
"A"s:

insert
into taDanishDemo (nameInDanish,nameInEnglish)
select 'TBAAA','TBAAA'
union all
select 'TBABA','TBABA'
;

Retrieving the rows ordered will now yield different results depending
on whether we order on the Danish or the Latin1 collated column:

select nameInEnglish
from taDanishDemo
order by nameInEnglish;

nameInEnglish
----------
TBAAA
TBABA

Under the Latin1 collation the "AA" is considered just two consecutive
"A"s and ordered at the beginning of the list.
But under the Danish collation, the "AA" is considered the traditional
writing of [å], and placed at the end of the list:

select nameInDanish
from taDanishDemo
order by nameInDanish;

nameInDanish
----------
TBABA
TBAAA

So far, so good.

What threw me completely is that this also affects how the string "AA"
is interpreted by the LIKE operator.

select nameInDanish
from taDanishDemo
where nameInDanish like 'TBA%'

nameInDanish
----------
TBABA

The row containing "TBAAA" isn't returned. Trying to match "AA" with an
"A" plus a wildcard will yield no match under the Danish collation, since
SQL Server interprets this as trying to match "Å" with "A"!

But under the Latin1 collation "AA" does match "A" and a wildcard, as "AA"
is just two "A"s:

select nameInEnglish
from taDanishDemo
where nameInEnglish like 'TBA%'

nameInEnglish
----------
TBAAA
TBABA
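
If the Danish "AA" rule gets in the way for one particular comparison, one possible workaround is to override the collation for that expression only. This is a sketch, not something tested in the thread: it assumes the taDanishDemo table defined above and picks Latin1_General_CS_AS as the replacement collation. The first query simply reports which collation the current database uses.

```sql
-- Which collation does the current database use?
select databasepropertyex(db_name(), 'Collation') as database_collation;

-- Assumption: Latin1_General_CS_AS is an acceptable collation for this match.
-- The COLLATE clause applies to this comparison only; the column definition
-- stays Danish_Norwegian_CS_AS.
select nameInDanish
from taDanishDemo
where nameInDanish collate Latin1_General_CS_AS like 'TBA%';

-- The same idea applied to the literal from the original question.
select 'yes'
where 'TBAAA243_D_AFTBEL' collate Latin1_General_CS_AS like 'TBA%';
```

With the override in place, "AA" should be compared as two separate "A"s, so both demo rows are expected to match. Overriding the collation on a column may stop SQL Server from using an index on it, so this is probably best kept to the few comparisons that need it.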

I'm still not really sure whether this is a useful feature, an
unintended side effect or a bug :-)

Bo Brunsgaard

|||> Our databases are running a collation of Danish_Norwegian_CS_AS (we
> are a Danish company).


I'm glad you were able to identify the root cause. I briefly considered a
possible collation issue but didn't think that would explain your symptoms
since I didn't know that collation rules considered consecutive characters.
Thanks a lot for the detailed analysis.

--
Hope this helps.

Dan Guzman
SQL Server MVP

Sunday, March 11, 2012

Assigning Group Numbers for millions of row

I have a table with first name, last name, SSN (social security number)
and other columns.
I want to assign a group number according to this business logic:
1. Records with equal SSN and a similar first name or last name belong
to the same group.
John Smith 1234
Smith John 1234
S John 1234
J Smith 1234
John Smith and Smith John fall in the same group as long as
they have the same SSN.
This is because I have records with equal SSN where the first name and
last name are switched, since people sometimes enter the last
name as the first name and vice versa.
2. There are records with equal SSN but different first name and last
name. These belong to different groups.
Equal SSN doesn't guarantee an equal group number; at least one of the
first name or last name should be the same. John Smith and Dan Brown
with equal SSN=1234 shouldn't fall in the same group.
Sample data:
Id  Fname  lname  SSN   grpNum
1   John   Smith  1234  1
2   Smith  John   1234  1
3   S      John   1234  1
4   J      Smith  1234  1
5   J      S      1234  1
6   Dan    Brown  1234  2
7   John   Smith  1111  3
I have tried this code on 65,000 rows. It took 20 minutes. I have to
run it on 21 million rows. I know that this is not efficient
code.
INSERT INTO temp_FnLnSSN_grp
SELECT c1.fname, c1.lname, c1.ssn AS ssn, c3.tu_id,
       (SELECT 1 + count(*)
        FROM distFLS AS c2
        WHERE c2.ssn < c1.ssn
           OR (c2.ssn = c1.ssn
               AND (   substring(c2.fname,1,1) = substring(c1.fname,1,1)
                    OR substring(c2.lname,1,1) = substring(c1.lname,1,1)
                    OR substring(c2.fname,1,1) = substring(c1.lname,1,1)
                    OR substring(c2.lname,1,1) = substring(c1.fname,1,1))
              )) AS group_number
FROM distFLS AS c1
JOIN tu_people_data AS c3
  ON (c1.ssn = c3.ssn AND
      c1.fname = c3.fname AND
      c1.lname = c3.lname)
distFLS is a table of distinct first name, last name and SSN from the
people table.
I posted part of this question, with the schema, one week ago. Please refer
to this thread:
http://groups.google.com/group/comp...6eb380b5f2e6de6

|||Basically, this is just a query that sorts or groups on a CASE function.
However, the catch is how we want to classify different names as "similar".
I would say that two rows should be considered similar if they have the same
SSN and the names start with the same first letters. Instead of a group
number, let's use a group code which consists of those 2 characters. Since
fname and lname may be transposed, the lower of the 2 characters is
encoded first, followed by the higher character.
fname  lname  SSN   grpCode
-----  -----  ----  -------
J      S      1234  JS
J      Smith  1234  JS
S      John   1234  JS
John   Smith  1111  JS
John   Smith  1234  JS
Smith  John   1234  JS
Dan    Brown  1234  BD
select lname, fname, SSN, grpCode
from
(
    select
        fname,
        lname,
        SSN,
        -- Here we calculate the grpCode:
        case
            when left(fname,1) <= left(lname,1) then left(fname,1)
            else left(lname,1)
        end as grpCode
        --
    from
        distFLS
) as x
order by
    SSN,
    grpCode,
    fname,
    lname
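
If a numeric group number is still wanted rather than a code, one way to get it in a single set-based pass might be DENSE_RANK over (SSN, grpCode). This is only a sketch under assumptions: it needs SQL Server 2005 or later, it uses the two-character code described above (lower initial first), and the output column name group_number is borrowed from the original poster's query. distFLS and its columns come from the thread.

```sql
-- Sketch: number each distinct (SSN, grpCode) pair without a correlated subquery.
select fname, lname, SSN, grpCode,
       dense_rank() over (order by SSN, grpCode) as group_number
from
(
    select
        fname,
        lname,
        SSN,
        -- Two-character code, lower initial first, so transposed names agree.
        case
            when left(fname,1) <= left(lname,1)
                then left(fname,1) + left(lname,1)
            else left(lname,1) + left(fname,1)
        end as grpCode
    from
        distFLS
) as x
order by
    SSN,
    grpCode,
    fname,
    lname;
```

The numbers follow the sort order of SSN and grpCode, so they will not necessarily match the grpNum values in the hand-written sample table, but rows sharing an SSN and a pair of initials get the same number.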
|||The group code calculation returns only one letter.
I have added this code to it:
-- Here we calculate the grpCode:
case
    when left(fname,1) <= left(lname,1) then left(fname,1) + left(lname,1)
    else left(lname,1) + left(fname,1)
end as grpCode
--

|||I didn't run it on my end.
Thanks.
|||I forgot to mention that some of the records have a middle name entered
in place of the first name or last name.
fname         mname    lname    ssn
John          coleman  smith    1234
john          smith    coleman  1234
john          S        coleman  1234
John          C        Smith    1234
John                   Smith    1234
John-coleman           Smith    1234
Smith                  John     1234
During the grouping process I am concerned only with fname, lname and
ssn (no need for the middle name). If there are suggestions to include other
columns, I am happy to accept them.
My idea is to assign rows to the same group if one of the initials of the
names matches the others, given that the SSN is the same. That
means if the SSN is equal and J, S or C appears as an initial in the
names, we can say they are in the same group.

|||Just revise the case function as needed, but the concept is the same.
|||Consider using Integration Services, as that tool has Fuzzy Lookup and
Fuzzy Grouping transformations that were specifically designed for this type of work.

Assigning group numbers for millions of data

(jacob.dba@.gmail.com) writes:
> I want to assign group number according to this business logic.
> 1. Records with equal SSN and (similar first name or last name) belong
> to the same group.
> John Smith 1234
> Smith John 1234
> S John 1234
> J Smith 1234
> John Smith and Smith John falls in the same group Number as long as
> they have similar SSN.
> This is because I have a record of equal SSN but the first name and
> last name is switched because of people who make error inserting last
> name as first name and vice versa. John Smith and Smith John will have
> equal group Name if they have equal SSN.
> 2. There are records with equal SSN but different first name and last
> name. These belong to different group numbers.
> Equal SSN doesn't guarantee equal group number, at least one of the
> first name or last name should be the same. John Smith and Dan Brown
> with equal SSN=1234 shouldn't fall in the same group number.

What if you have both John Smith and Southerland Jane? Are they the
same person or not?

This looks like a very difficult task, and the fact that you have
800 million rows certainly does not help to make it easier.

I think you need to scrap the idea you got from Itzik. My gut feeling
says that it will not scale.

Here is a very simple-minded solution where I've assumed that as
long as any combination of initials match, it's the same group.

-- Demo copy of the people table.
CREATE TABLE [TU_People_Data] (
   [tu_id] [bigint] NOT NULL ,
   [count_id] [int] NOT NULL ,
   [fname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL ,
   [lname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL ,
   [ssn] [int] NULL ,
   CONSTRAINT [PK_tu_bulk_people] PRIMARY KEY CLUSTERED
   (
      [tu_id],
      [count_id]
   ) ON [PRIMARY]
) ON [PRIMARY]
GO
-- One row per distinct (ssn, fname, lname) with its sorted initials.
CREATE TABLE #initials (ssn      int         NOT NULL,
                        fname    varchar(32) NOT NULL,
                        lname    varchar(32) NOT NULL,
                        initials char(2)     NOT NULL)
go
-- One row per (ssn, initials); ident serves as the group number.
-- IDENTITY is needed because the INSERT below does not supply ident.
CREATE TABLE #ssnmania (ident    int IDENTITY NOT NULL,
                        ssn      int          NOT NULL,
                        initials char(2)      NOT NULL,
                        PRIMARY KEY(ssn, initials))
go
INSERT #initials (ssn, fname, lname, initials)
   SELECT DISTINCT ssn, fname, lname,
          CASE WHEN fname < lname
               THEN substring(fname, 1, 1) + substring(lname, 1, 1)
               ELSE substring(lname, 1, 1) + substring(fname, 1, 1)
          END
   FROM   TU_People_Data
go
INSERT #ssnmania (ssn, initials)
   SELECT DISTINCT ssn, initials
   FROM   #initials
go
-- Attach the group number to every name row.
SELECT i.ssn, i.fname, i.lname, i.initials, groupno = s.ident
FROM   #initials i
JOIN   #ssnmania s ON i.ssn = s.ssn
                  AND s.initials = i.initials
go
DROP TABLE #initials, #ssnmania, TU_People_Data

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pr...oads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodin...ions/books.mspx

|||Thanks Erland.
I tried this procedure this morning and it solves half of my
problem.
Let me start by answering your question.
> What if you have both John Smith and Southerland Jane? Are the
> same person or not?
If they have the same SSN, they are considered to be in the
same group.
I am willing to take the chance, since John Smith, Southerland Jane and
Jack Sam sharing one SSN has a slim chance of occurring. If they do exist,
they are grouped under one group number.

Regarding your solution:
In my table some of the rows for one person look like this:
   fname         mname    lname    ssn   initials
1. John          Coleman  Smith    1111  JS
2. John          Smith    Coleman  1111  CJ
3. Coleman       John     Smith    1111  CS
4. John-coleman           Smith    1111  JS
5. Smith                  John     1111  JS
6. John                   Smith    2222  JS
7. J                      Smith    1111  JS
8. Jack                   Sam      3333  JS
You can see that all of these can be put in the same
group (except the 6th and 8th). I see the SSN as the major factor in
identifying the groups.
So once the SSN is the same, the initials have to be one or two of the
three: J, S or C.

|||(jacob.dba@.gmail.com) writes:
> I have tried this procedure in the morning and it solves half of my
> problem.

And the other half is? :-) I did not include the middle initial, because
I did not see that post until later.

But I guess that you could extend the logic that I posted to handle
the middle initial as well.

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pr...oads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodin...ions/books.mspx

|||You need a function that takes the first character of the first name,
last name, and middle initial, and sorts them. Call it "SortInit".
So, pass it "Sam Alfred Jones" and it passes back "AJS". Likewise,
"Jones Alfred Sam" is returned as "AJS".

Then create your temp table and populate it with SSN and SortInit().
Then alter your temp table and add an identity column.
Then make your "temp table" a permanent one, as your business rules
will change, and fundamentally what you are doing is looking for
"duplicate rows" and grouping them, and this is almost always a
multiple-pass project.
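
A minimal sketch of what such a function could look like, offered as one possible reading of the suggestion rather than the poster's actual code. The name SortInit comes from the post; the dbo schema, the parameter names, the varchar(32) lengths and the handling of missing name parts (they are simply dropped from the result) are all assumptions.

```sql
CREATE FUNCTION dbo.SortInit
(
    @fname varchar(32),
    @mname varchar(32),
    @lname varchar(32)
)
RETURNS varchar(3)
AS
BEGIN
    -- First character of each name part, upper-cased. A NULL or empty part
    -- becomes a single space, which sorts first and is trimmed off at the end.
    DECLARE @a char(1), @b char(1), @c char(1), @t char(1);
    SET @a = UPPER(LEFT(ISNULL(@fname, ''), 1));
    SET @b = UPPER(LEFT(ISNULL(@mname, ''), 1));
    SET @c = UPPER(LEFT(ISNULL(@lname, ''), 1));

    -- Three compare-and-swap steps put the three characters in order.
    IF @a > @b BEGIN SET @t = @a; SET @a = @b; SET @b = @t; END;
    IF @b > @c BEGIN SET @t = @b; SET @b = @c; SET @c = @t; END;
    IF @a > @b BEGIN SET @t = @a; SET @a = @b; SET @b = @t; END;

    RETURN LTRIM(@a + @b + @c);
END;
GO

-- Both calls from the post's example should give the same code.
SELECT dbo.SortInit('Sam', 'Alfred', 'Jones') AS code1,   -- AJS
       dbo.SortInit('Jones', 'Alfred', 'Sam') AS code2;   -- AJS
```

Note that under this sketch a row without a middle name gets a two-letter code ("John Smith" gives "JS") while "John Coleman Smith" gives "CJS", so the two would still land in different groups. Matching on "one or two of the three initials", as asked for earlier in the thread, would need a further pass.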

Saturday, February 25, 2012

ASPNETDB.mdf Resolved

Hi,

I thought I had this corrected, but in fact don't. According to http://msdn2.microsoft.com/en-us/library/ms228037.aspx

SQL Express is supposed to automatically generate a copy of ASPNETDB.mdf in the App_Data folder of the Express edition development suites (I'm using Visual Web Developer 2005).

In my case, it doesn't, and I can't figure out how to trigger it manually. I've read every post I can find, especially http://forums.microsoft.com/msdn/showpost.aspx?postid=98346&siteid=1

However, that mainly applies to deploying a database that already exists. I have already tried deleting the files as suggested and they do reappear in the appropriate folder, but I'm still not getting the ASPNETDB.mdf file in my apps.

Any help on this would be greatly appreciated. I've had a post up on the Visual Web Developer forum, but folk are staying away in droves. I also tried uninstalling and reinstalling everything, all the way down to IIS 5.1.

biobot

|||You install the database by executing Aspnet_regsql.exe
in your %windir%\Microsoft.NET\Framework\<Your .NET Framework version> folder.

You can find installation instructions and other information in the following article:

How To: Use Role Manager in ASP.Net 2.0

http://msdn2.microsoft.com/en-us/library/ms998314.aspx

-Sue

|||

Sue,

Thank you! It took some improvising, but I am now getting the aspnetdb.mdf auto-generating as it should. (There are some differences between the SQLEXPRESS Management Suite and Enterprise Manager, apparently).

I never would have thought to look up how to use Role manager to solve this problem!!?

Also, it is great to get a response in complete, contextually relevant sentences!

Best regards,

Larry

|||Open Visual Web Developer Express (VWD). Open the website in VWD. Click on the 'Website' menu. At the bottom of the Website menu is 'ASP.NET Configuration'; open it. It opens the 'ASP.NET Web Site Administration Tool'. Under the 'Security' tab, add a user.

Done; ASPNETDB.mdf is up and configured for your website.

|||Doesn't work in a shared environment.
