Showing posts with label columns. Show all posts
Showing posts with label columns. Show all posts

Tuesday, March 20, 2012

Association Rules Model for problem

What would be the right design approach for the following problem?

I have a single table called SelectionFactors, which has the following columns and sample data:

ProjectID Factor FactorValue 1000 Countries USA 1000 Countries Canada 1000 Countries France 1000 Languages English 1000 Languages French 1000 Company Type Consulting 1000 Company Type Software 2000 Countries India 2000 Countries China 2000 Countries USA 2000 Languages English 2000 Languages Chinese (Simplified) 2000 Languages Chinese (Traditional) 2000 Languages Spanish 2000 Company Type Retail 2000 Company Type Dairy Products

The problem is to allow a descriptive analysis of the data to find patterns in the users selections. For instance,

if Languages->English is selected, what are the counts of projects for other Factor->Factor Value combinations?

Countries->USA = 2, Countries->Canada=1, Company Type->Consulting=1 and so on.

Since all the data is in this single table, are both the case and nested tables the same? What are the keys and inputs? I only need a descriptive analysis (no prediction) and ALL possible combinations MUST be part of the results; how should the model be designed?

Thank you,

Anna.

This is a good problem. What you want to do is to create a model with three nested tables all marked predict and input. Event though you want descriptive analysis, you need to mark them "predict" such that rules will be formed (only items marked predict will show on the right hand side of rules).

The way you will create such a model is to use the "Named Query" feature of the data source view(DSV). You can right-click on the DSV and select "New Named Query" to create these.

The named queries you will want to create are these:

Projects: SELECT DISTINCT ProjectID FROM SelectionFactors

Companies: SELECT ProjectID, FactorValue FROM SelectionFactors WHERE Factor='Companies'

Languages: SELECT ProjectID, FactorValue FROM SelectionFactors WHERE Factor='Languages'

Company Type: SELECT ProjectID, FactorValue FROM SelectionFactors WHERE Factor='Company Type'

You will then need to relate the ProjectID in each of the transaction named queries (Companies, etc) to the ProjectID in the Projects named query. The system will prompt you to set ProjectID as the key of the Projects named query . At this point, the named queries are equivalent to tables as far as Analysis Services is concerned.

When you create the model, you will make the Projects table the case table and the other tables nested tables. ProjectID will be the key of the Projects table and FactorValue will be the key of the other tables. Make sure to mark FactorValue as Key, Input, and Output.

When you process your model you will see rules relating projects, companies, and company types (both between factos and in the same factor).

Note that you may need to adjust the MINIMUM_SUPPORT and MINIMUM_PROBABILITY parameters to get the results you need (in fact, it's recommended)

HTH

-Jamie

|||

Thanks for your response.

However, I should have mentioned that there is an open-ended number of "Factors". New factors are added often and the number of factors is nearly 100.

The type of questions I am trying to answer using this model is:

"Among projects that have Languages-> English, how many (distinct projects) have Company Type->Consulting, Countries->USA" and so on.

Ideally, I would like to provide this analysis in a free, OLAP environment where the user can select one factor-> factor value combination and then see all the other related combinations.

I currently have a Cube with SelectionFactors serving as both the Fact and the dimension table with a single measure - DISTINCT COUNT of projectIDs.

Thanks once again.

|||

Given the information in my previous post, what are my design options?

Thanks.

|||

My question for you, then, is if you have a cube for this, what are you not getting that you need?

If the factor/factorvalue can be viewed as a complete unit, and you don't need to provide analysis between factors - e.g. given Language X, which Countries and Company Types are prevalent, you could create a calculated column in the DSV that combines the factor and factor value.

|||

Perhaps you can help me understand why I am unable to get the expected results. I am fairly new to OLAP and Data Mining and would appreciate any pointers and guidance.

The full picture of my problem is:

Dimension Project Profile - Project ID, (Other Project related attributes)

Dimension Selection Factors - Factor, Factor Value

Fact Selection Factors - Project ID, Factor, Factor Value

Project Cube Measure - DISTINCT COUNT of Project IDs

My understanding is that this is a problem of intersection of factors, which I cannot achieve with a single dimension. In fact there is this other thread http://forums.microsoft.com/msdn/showpost.aspx?postid=856304&siteid=1 that seems to come very close to my problem.

In my case, the following MDX returns the projects that have the intersection of specific factor values.

SELECT {[Measures].[Project Count]} ON COLUMNS,

INTERSECT

(

NONEMPTY

(

[Project Profile].[Project ID].Children,

(

[Measures].[Project Count],

[Selection Factor].[Selection Factors].[Factor].&[Annual Revenue].&[$500 million - $3 billion in revenues]

)

),

NONEMPTY

(

[Project Profile].[Project ID].Children,

(

[Measures].[Selection Project Count],

[Selection Factor].[Selection Factors].[Factor].&[Concurrent Users].&[1,000–10,000 users]

)

)

)

ON ROWS

FROM PROJECTCUBE

The questions are:

1. How do I get all the other Factor->Factor Values that are part of the selection factors for the projects returned by this MDX. This, if I understand correctly, is Basket analysis.

and

2. Is there a generic way to perform this analysis through cube design rather than writing MDX for each case?

Thanks once more.|||

The following MDX gives me the contents of the "baskets" of projects that have the selected items. However, the counts of the projects is not restricted to the subset retrieved through the Intersect. Is there a way to get the counts to only be within the subset returned by the Intersect?

With this MDX, I can get very close to what I need. The only thing I am missing is to get the counts (perhaps through a calculated measure). Help on this will be much appreciated!

__

SELECT [Measures].[Project Count] ON COLUMNS,

NONEMPTY

(

([Selection Factor].[Factor].Children, [Selection Factor].[Factor Value].Children),

INTERSECT

(

NONEMPTY

(

[Project Profile].[Project ID].Children,

(

[Measures].[Selection Project Count],

[Selection Factor].[Selection Factors].[Factor].&[Annual Revenue].&[$500 million - $3 billion in revenues]

)

),

NONEMPTY

(

[Project Profile].[Project ID].Children,

(

[Measures].[Selection Project Count],

[Selection Factor].[Selection Factors].[Factor].&[Concurrent Users].&[1,000–10,000 users]

)

)

)

)

ON ROWS

FROM PROJECTCUBE

|||

In order to obtain counts restricted to the subset returned by the Intersect, this MDX takes the constraint out of the ROWS into the WHERE clause.

I would still like to know how to design an Association rules model that can "automate" the generation of results by case. Hopefully, the MDX can point directly to the expected results.

SELECT [Measures].[Project Count] ON COLUMNS,

NONEMPTYCROSSJOIN([Selection Factor].[Factor].Children, [Selection Factor].[Factor Value].Children)

ON ROWS

FROM PROJECTCUBE

WHERE

NONEMPTY(

INTERSECT

(

NONEMPTY

(

[Project Profile].[Project ID].Children,

(

[Measures].[Project Count],

[Selection Factor].[Selection Factors].[Factor].&[Annual Revenue].&[$500 million - $3 billion in revenues]

)

),

NONEMPTY

(

[Project Profile].[Project ID].Children,

(

[Measures].[Project Count],

[Selection Factor].[Selection Factors].[Factor].&[Concurrent Users].&[1,000–10,000 users]

)

)

)

)

sql

Monday, March 19, 2012

Assistance on SUM() statement

I have a table #changes with two columns: familyID, Versiontime

For each record in the #changes table I need to compute the income from another table FamilyIncome with colunms: familyID, IncomeType, EffectiveDate, Amt. Example Below:

FamID, IncType, EffDate, Amt
100, 10, 01/01/2003, $50.00
100, 20, 01/01/2003, $50.00
100, 30, 01/01/2003, $50.00
100, 10, 02/02/3003, $100.00
100, 20, 02/02/3003, $100.00
100, 30, 02/02/3003, $100.00
100, 40, 02/02/3003, $100.00
100, 20, 03/03/3003, $75.00
100, 30, 03/03/3003, $75.00
100, 40, 03/03/3003, $75.00

So if I'm looking for the Incomes on the following dates (which are in the #changes table) it should be:

01/02/2003 - $150.00 (The three records effective on 01/01/2003)
02/10/2003 - $400.00 (The four records effective on 02/02/2003)
04/01/2003 - $325.00 (One record from 02/02/2003 is still effective (Type 10) plus the three records effective on 03/03/2003)

Any help is greatly appreciated,

BrentCan you post the DDL?

But it seems like a join between the two and a GROUP by with a SUM

something like

SELECT EFF_DATE, SUM(AMT)
FROM myTable1 a myTable2 b
ON a.key = b.key
GROUP BY EFF_DATE|||Brett,
I have tried a couple variations on this theme and so far they do not deliver the desired results:

1. Select familyID, VersionTime, (Select Sum(Amt) from FamilyIncome where FamilyIncome.familyid = #changes.familyid and EFFECTIVEDATE < cast(#changes.versiontime as datetime)+1) as Income
from #changes
group by familyid, cast(versiontime as datetime),#changes.versiontime
order by familyid, versiontime desc
THIS CODE WORKS FOR THE FIRST DATE AND RETURNS THE $150.00 DESIRED, BUT ON ANY FUTURE DATES IT ADDS THE NEW INCOME AND KEEPS A RUNNING TOTAL (I.E. $550.00 INSTEAD OF $400.00)

2. Select familyID, VersionTime, (Select Sum(Amt) from FamilyIncome where FamilyIncome.familyid = #changes.familyid
HAVING EFFECTIVEDATE < cast(#changes.versiontime as datetime)+1) as Income
from #changes
group by familyid, cast(versiontime as datetime),#changes.versiontime
order by familyid, versiontime desc
THIS CODE ONLY RETURNS A RESULT SET FOR THE LAST DATE IN THE SEQUENCE AND THEN IT RETURNS $225.00 (ONLY THE RECORDS WITH A 03/03/2003) DATE)
At this time I'm just trying to isolate the effdate calculations. Also whoever reads this all the dates should be 2003 the 3003 for the year is a typo.

Thanks,

Brent
Originally posted by Brett Kaiser
Can you post the DDL?

But it seems like a join between the two and a GROUP by with a SUM

something like

SELECT EFF_DATE, SUM(AMT)
FROM myTable1 a myTable2 b
ON a.key = b.key
GROUP BY EFF_DATE|||I don't understand your results...where do these dates come from:

01/02/2003 - $150.00 (The three records effective on 01/01/2003)
02/10/2003 - $400.00 (The four records effective on 02/02/2003)
04/01/2003 - $325.00 (One record from 02/02/2003

They don't exists in your data...|||Originally posted by Brett Kaiser
I don't understand your results...where do these dates come from:

They don't exists in your data...

Those dates come out of the #changes table which I only included the fields not an example. What I have is 43,000 records in the #changes table that are familyID's and Dates on which I need to perform several calculations (size of the family, income, fee schedule, etc) right now I'm hung up on getting the income which I need in order to determine the fee (fee is based on family size and income)

Brent

Assing rows to a table

I have a table where the columns are products and the
rows are customers. It stores sales info by customer and
product. I would like to add a row at the bottom af my
customer list that would be used for totaling each
column. Does anyone know how I can do this?
hi john,
if i understand you correctly you are looking for COMPUTE BY clause of
SELECT statement.
you can check the resultset of following query on northwind database.
use northwind
go
select b.customerid, a.unitprice
from [order details]a join orders b
on a.orderid =b.orderid
order by b.customerid
compute sum (a.unitprice ) by b.customerid
go
if this is not what you want pls post relevent table structure, sample
records, and expected result set out of it.Also specify whether you can
looking for in INSERT /UPDATE/SELECT statment.
Vishal Parkar
vgparkar@.yahoo.co.in | vgparkar@.hotmail.com

assigning weight to columns with fulltext (is it possible?)

Hello all,
I have a table which has several fulltext enabled columns.
Title, Summary, Description
I was wondering whether it is possible to assign some kind of weight
to the columns? For example, I find it more relevant if the keywords
are in the title than in the description. Therefore I would like
records that have the matching keywords in the title to come up first.
I'm thinking the way to achieve this is by assigning weight to the
columns, but am not sure how to go about doing this.
Any advice is much appreciated.
PS. I did have something going a while ago where I used CONTAINSTABLE
and severals join plus some variables which contained the weight and
used that on the RANK value. Thing was, it was REALLY really slow on
10,000+ records.
On Dec 13, 7:09 pm, Pacific Fox - Web Design Brisbane
<tacofl...@.gmail.com> wrote:
> Hello all,
> I have a table which has several fulltext enabled columns.
> Title, Summary, Description
> I was wondering whether it is possible to assign some kind of weight
> to the columns? For example, I find it more relevant if the keywords
> are in the title than in the description. Therefore I would like
> records that have the matching keywords in the title to come up first.
> I'm thinking the way to achieve this is by assigning weight to the
> columns, but am not sure how to go about doing this.
> Any advice is much appreciated.
|||Unfortunately the join approach is the only solution to something like this.
http://www.zetainteractive.com - Shift Happens!
Looking for a SQL Server replication book?
http://www.nwsu.com/0974973602.html
Looking for a FAQ on Indexing Services/SQL FTS
http://www.indexserverfaq.com
"Pacific Fox - Web Design Brisbane" <tacofleur@.gmail.com> wrote in message
news:facf10c4-82e5-4fac-9752-3921c1504650@.d4g2000prg.googlegroups.com...
> PS. I did have something going a while ago where I used CONTAINSTABLE
> and severals join plus some variables which contained the weight and
> used that on the RANK value. Thing was, it was REALLY really slow on
> 10,000+ records.
> On Dec 13, 7:09 pm, Pacific Fox - Web Design Brisbane
> <tacofl...@.gmail.com> wrote:
>

Sunday, March 11, 2012

Assigning Group Numbers for millions of row

I have a table with first name, last name, SSN(social security number)
and other columns.
I want to assign group number according to this business logic.
1. Records with equal SSN and (similar first name or last name) belong
to the same group.
John Smith 1234
Smith John 1234
S John 1234
J Smith 1234
John Smith and Smith John falls in the same group Number as long as
they have similar SSN.
This is because I have a record of equal SSN but the first name and
last name is switched because of people who make error inserting last
name as first name and vice versa. John Smith and Smith John will have
equal group Name if they have equal SSN.
2. There are records with equal SSN but different first name and last
name. These belong to different group numbers.
Equal SSN doesn't guarantee equal group number, at least one of the
first name or last name should be the same. John Smith and Dan Brown
with equal SSN=1234 shouldn't fall in the same group number.
Sample data:
Id Fname lname SSN grpNum
1 John Smith 1234 1
2 Smith John 1234 1
3 S John 1234 1
4 J Smith 1234 1
5 J S 1234 1
6 Dan Brown 1234 2
7 John Smith 1111 3
I have tried this code for 65,000 rows. It took 20 minute. I have to
run it for 21 million row data. I now that this is not an efficient
code.
INSERT into temp_FnLnSSN_grp
SELECT c1.fname, c1.lname, c1.ssn AS ssn, c3.tu_id,
(SELECT 1 + count(*)
FROM distFLS AS c2
WHERE c2.ssn < c1.ssn
or (c2.ssn = c1.ssn and (substring(c2.fname,1,1) =
substring(c1.fname,1,1) or substring(c2.lname,1,1) =
substring(c1.lname,1,1)
or substring(c2.fname,1,1) =
substring(c1.lname,1,1) or substring(c2.lname,1,1) =
substring(c1.fname,1,1))
)) AS group_number
FROM distFLS AS c1
JOIN tu_people_data AS c3
ON (c1.ssn = c3.ssn and
c1.fname = c3.fname and
c1.lname= c3.lname)
dist FLS is distinct First Name, last Name and SSN table from the
people table.
I have posted part of this question, schema one w ago. Please refer
this thread.
http://groups.google.com/group/comp...6eb380b5f2e6de6Basically, this is just a query that sorts or groups on a CASE function.
However, the catch is how we want to classify different names as "similar".
I would say that two rows should be considered similar if they have the same
SSN and the names start with the same first letter. Instead of a group
number, let's do a group code which conists of those 2 characters. Since
fname and lname may be transposed, the lowest of the 2 characters will be
encoded first followed by the highest character.
fname lname SSN grpCode
-- -- -- --
J S 1234 JS
J Smith 1234 JS
S John 1234 JS
John Smith 1111 JS
John Smith 1234 JS
Smith John 1234 JS
Dan Brown 1234 BD
select lname, fname, SSN, grpCode
from
(
select
fname,
lname,
SSN,
-- Here we calculate the grpCode:
case
when left(fname,1) <= left(lname,1) then left(fname,1)
else left(lname,1)
end as grpCode
--
from
distFLS
) as x
order by
SSN,
grpCode,
fname,
lname
<jacob.dba@.gmail.com> wrote in message
news:1143482451.181115.64620@.v46g2000cwv.googlegroups.com...
>I have a table with first name, last name, SSN(social security number)
> and other columns.
> I want to assign group number according to this business logic.
> 1. Records with equal SSN and (similar first name or last name) belong
> to the same group.
> John Smith 1234
> Smith John 1234
> S John 1234
> J Smith 1234
> John Smith and Smith John falls in the same group Number as long as
> they have similar SSN.
> This is because I have a record of equal SSN but the first name and
> last name is switched because of people who make error inserting last
> name as first name and vice versa. John Smith and Smith John will have
> equal group Name if they have equal SSN.
> 2. There are records with equal SSN but different first name and last
> name. These belong to different group numbers.
> Equal SSN doesn't guarantee equal group number, at least one of the
> first name or last name should be the same. John Smith and Dan Brown
> with equal SSN=1234 shouldn't fall in the same group number.
>
> Sample data:
> Id Fname lname SSN grpNum
> 1 John Smith 1234 1
> 2 Smith John 1234 1
> 3 S John 1234 1
> 4 J Smith 1234 1
> 5 J S 1234 1
> 6 Dan Brown 1234 2
> 7 John Smith 1111 3
>
> I have tried this code for 65,000 rows. It took 20 minute. I have to
> run it for 21 million row data. I now that this is not an efficient
> code.
>
> INSERT into temp_FnLnSSN_grp
> SELECT c1.fname, c1.lname, c1.ssn AS ssn, c3.tu_id,
> (SELECT 1 + count(*)
> FROM distFLS AS c2
> WHERE c2.ssn < c1.ssn
> or (c2.ssn = c1.ssn and (substring(c2.fname,1,1) =
> substring(c1.fname,1,1) or substring(c2.lname,1,1) =
> substring(c1.lname,1,1)
> or substring(c2.fname,1,1) =
> substring(c1.lname,1,1) or substring(c2.lname,1,1) =
> substring(c1.fname,1,1))
> )) AS group_number
> FROM distFLS AS c1
> JOIN tu_people_data AS c3
> ON (c1.ssn = c3.ssn and
> c1.fname = c3.fname and
> c1.lname= c3.lname)
>
> dist FLS is distinct First Name, last Name and SSN table from the
> people table.
>
> I have posted part of this question, schema one w ago. Please refer
> this thread.
>
> http://groups.google.com/group/comp...6eb380b5f2e6de6
>|||The group code calculation returns only with one letter.
I have added this code on it.
-- Here we calculate the grpCode:
case
when left(fname,1) <= left(lname,1) then left(fname,1) +
left(lname,1)
else left(lname,1) +left(fname,1)
end as grpCode
--|||I didn't run it on my end.
Thanks.
<jacob.dba@.gmail.com> wrote in message
news:1143487218.339033.116420@.v46g2000cwv.googlegroups.com...
> The group code calculation returns only with one letter.
> I have added this code on it.
> -- Here we calculate the grpCode:
> case
> when left(fname,1) <= left(lname,1) then left(fname,1) +
> left(lname,1)
> else left(lname,1) +left(fname,1)
> end as grpCode
> --
>|||I fogot to mention that some of the records have middle name entered
in place of first name or last name.
fname mname lname ssn
John coleman smith 1234
john smith coleman 1234
john S coleman 1234
John C Smith 1234
John Smith 1234
John-coleman Smith 1234
Smith John 1234
During the grouping process I am concerned only about fname,lname,
ssn.(no need of middle name). If there is other suggestion to include
columns I am happy to accept.
I have the idea to assign groups if one of the initial of the names is
similar with the others considering that the SSN is the same. that
means if SSN is equal and if J or S or C are there as an initial in the
names, we can say they are in the same group.|||Just revise the case function as needed, but the concept is the same.
<jacob.dba@.gmail.com> wrote in message
news:1143490139.170226.286890@.g10g2000cwb.googlegroups.com...
> I fogot to mention that some of the records have middle name entered
> in place of first name or last name.
> fname mname lname ssn
> John coleman smith 1234
> john smith coleman 1234
> john S coleman 1234
> John C Smith 1234
> John Smith 1234
> John-coleman Smith 1234
> Smith John 1234
> During the grouping process I am concerned only about fname,lname,
> ssn.(no need of middle name). If there is other suggestion to include
> columns I am happy to accept.
> I have the idea to assign groups if one of the initial of the names is
> similar with the others considering that the SSN is the same. that
> means if SSN is equal and if J or S or C are there as an initial in the
> names, we can say they are in the same group.
>|||Consider using Integration Services as that tool has a Fuzzy Lookup and
Fuzzy Grouping tasks that were specifically designed for this type of work.
<jacob.dba@.gmail.com> wrote in message
news:1143482451.181115.64620@.v46g2000cwv.googlegroups.com...
>I have a table with first name, last name, SSN(social security number)
> and other columns.
> I want to assign group number according to this business logic.
> 1. Records with equal SSN and (similar first name or last name) belong
> to the same group.
> John Smith 1234
> Smith John 1234
> S John 1234
> J Smith 1234
> John Smith and Smith John falls in the same group Number as long as
> they have similar SSN.
> This is because I have a record of equal SSN but the first name and
> last name is switched because of people who make error inserting last
> name as first name and vice versa. John Smith and Smith John will have
> equal group Name if they have equal SSN.
> 2. There are records with equal SSN but different first name and last
> name. These belong to different group numbers.
> Equal SSN doesn't guarantee equal group number, at least one of the
> first name or last name should be the same. John Smith and Dan Brown
> with equal SSN=1234 shouldn't fall in the same group number.
>
> Sample data:
> Id Fname lname SSN grpNum
> 1 John Smith 1234 1
> 2 Smith John 1234 1
> 3 S John 1234 1
> 4 J Smith 1234 1
> 5 J S 1234 1
> 6 Dan Brown 1234 2
> 7 John Smith 1111 3
>
> I have tried this code for 65,000 rows. It took 20 minute. I have to
> run it for 21 million row data. I now that this is not an efficient
> code.
>
> INSERT into temp_FnLnSSN_grp
> SELECT c1.fname, c1.lname, c1.ssn AS ssn, c3.tu_id,
> (SELECT 1 + count(*)
> FROM distFLS AS c2
> WHERE c2.ssn < c1.ssn
> or (c2.ssn = c1.ssn and (substring(c2.fname,1,1) =
> substring(c1.fname,1,1) or substring(c2.lname,1,1) =
> substring(c1.lname,1,1)
> or substring(c2.fname,1,1) =
> substring(c1.lname,1,1) or substring(c2.lname,1,1) =
> substring(c1.fname,1,1))
> )) AS group_number
> FROM distFLS AS c1
> JOIN tu_people_data AS c3
> ON (c1.ssn = c3.ssn and
> c1.fname = c3.fname and
> c1.lname= c3.lname)
>
> dist FLS is distinct First Name, last Name and SSN table from the
> people table.
>
> I have posted part of this question, schema one w ago. Please refer
> this thread.
>
> http://groups.google.com/group/comp...6eb380b5f2e6de6
>

Assigning group numbers for millions of data

I have a table with first name, last name, SSN(social security number)
and other columns.
I want to assign group number according to this business logic.
1. Records with equal SSN and (similar first name or last name) belong
to the same group.
John Smith 1234
Smith John 1234
S John 1234
J Smith 1234
John Smith and Smith John falls in the same group Number as long as
they have similar SSN.
This is because I have a record of equal SSN but the first name and
last name is switched because of people who make error inserting last
name as first name and vice versa. John Smith and Smith John will have
equal group Name if they have equal SSN.
2. There are records with equal SSN but different first name and last
name. These belong to different group numbers.
Equal SSN doesn't guarantee equal group number, at least one of the
first name or last name should be the same. John Smith and Dan Brown
with equal SSN=1234 shouldn't fall in the same group number.

Sample data:
Id Fname lname SSN grpNum
1 John Smith 1234 1
2 Smith John 1234 1
3 S John 1234 1
4 J Smith 1234 1
5 J S 1234 1
6 Dan Brown 1234 2
7 John Smith 1111 3

I have tried this code for 65,000 rows. It took 20 minute. I have to
run it for 21 million row data. I now that this is not an efficient
code.

INSERT into temp_FnLnSSN_grp
SELECT c1.fname, c1.lname, c1.ssn AS ssn, c3.tu_id,
(SELECT 1 + count(*)
FROM distFLS AS c2
WHERE c2.ssn < c1.ssn
or (c2.ssn = c1.ssn and (substring(c2.fname,1,1) =
substring(c1.fname,1,1) or substring(c2.lname,1,1) =
substring(c1.lname,1,1)
or substring(c2.fname,1,1) =
substring(c1.lname,1,1) or substring(c2.lname,1,1) =
substring(c1.fname,1,1))
)) AS group_number
FROM distFLS AS c1
JOIN tu_people_data AS c3
ON (c1.ssn = c3.ssn and
c1.fname = c3.fname and
c1.lname= c3.lname)

dist FLS is distinct First Name, last Name and SSN table from the
people table.

I have posted part of this question, schema one week ago. Please refer
this thread.

http://groups.google.com/group/comp...6eb380b5f2e6de6I forgot to mention that some of the records have middle name entered
in place of first name or last name.
fname mname lname ssn

John coleman smith 1234
john smith coleman 1234
john S coleman 1234
John C Smith 1234
John Smith 1234
John-coleman Smith 1234
Smith John 1234

During the grouping process I am concerned only about fname,lname,
ssn.(no need of middle name). If there is other suggestion to include
columns I am happy to accept.
I have the idea to assign groups if one of the initial of the names is

similar with the others considering that the SSN is the same. that
means if SSN is equal and if J or S or C are there as an initial in the

names, we can say they are in the same group.

Reply

jacob.dba@.gmail.com wrote:
> I have a table with first name, last name, SSN(social security number)
> and other columns.
> I want to assign group number according to this business logic.
> 1. Records with equal SSN and (similar first name or last name) belong
> to the same group.
> John Smith 1234
> Smith John 1234
> S John 1234
> J Smith 1234
> John Smith and Smith John falls in the same group Number as long as
> they have similar SSN.
> This is because I have a record of equal SSN but the first name and
> last name is switched because of people who make error inserting last
> name as first name and vice versa. John Smith and Smith John will have
> equal group Name if they have equal SSN.
> 2. There are records with equal SSN but different first name and last
> name. These belong to different group numbers.
> Equal SSN doesn't guarantee equal group number, at least one of the
> first name or last name should be the same. John Smith and Dan Brown
> with equal SSN=1234 shouldn't fall in the same group number.
> Sample data:
> Id Fname lname SSN grpNum
> 1 John Smith 1234 1
> 2 Smith John 1234 1
> 3 S John 1234 1
> 4 J Smith 1234 1
> 5 J S 1234 1
> 6 Dan Brown 1234 2
> 7 John Smith 1111 3
>
> I have tried this code for 65,000 rows. It took 20 minute. I have to
> run it for 21 million row data. I now that this is not an efficient
> code.
>
> INSERT into temp_FnLnSSN_grp
> SELECT c1.fname, c1.lname, c1.ssn AS ssn, c3.tu_id,
> (SELECT 1 + count(*)
> FROM distFLS AS c2
> WHERE c2.ssn < c1.ssn
> or (c2.ssn = c1.ssn and (substring(c2.fname,1,1) =
> substring(c1.fname,1,1) or substring(c2.lname,1,1) =
> substring(c1.lname,1,1)
> or substring(c2.fname,1,1) =
> substring(c1.lname,1,1) or substring(c2.lname,1,1) =
> substring(c1.fname,1,1))
> )) AS group_number
> FROM distFLS AS c1
> JOIN tu_people_data AS c3
> ON (c1.ssn = c3.ssn and
> c1.fname = c3.fname and
> c1.lname= c3.lname)
>
> dist FLS is distinct First Name, last Name and SSN table from the
> people table.
> I have posted part of this question, schema one week ago. Please refer
> this thread.
> http://groups.google.com/group/comp...6eb380b5f2e6de6|||(jacob.dba@.gmail.com) writes:
> I want to assign group number according to this business logic.
> 1. Records with equal SSN and (similar first name or last name) belong
> to the same group.
> John Smith 1234
> Smith John 1234
> S John 1234
> J Smith 1234
> John Smith and Smith John falls in the same group Number as long as
> they have similar SSN.
> This is because I have a record of equal SSN but the first name and
> last name is switched because of people who make error inserting last
> name as first name and vice versa. John Smith and Smith John will have
> equal group Name if they have equal SSN.
> 2. There are records with equal SSN but different first name and last
> name. These belong to different group numbers.
> Equal SSN doesn't guarantee equal group number, at least one of the
> first name or last name should be the same. John Smith and Dan Brown
> with equal SSN=1234 shouldn't fall in the same group number.

What if you have both John Smith and Southerland Jane? Are the
same person or not?

This looks like a very difficult task, and the fact that you have
800 million rows certainly does not help to make it easier.

I think you need to scrap the idea you got from Itzik. My gut feeling
say that it will not scale.

Here is a very simple-minded solution where I've assumed that as
long as any combination of initials match, it's the same group.

CREATE TABLE [TU_People_Data] (
[tu_id] [bigint] NOT NULL ,
[count_id] [int] NOT NULL ,
[fname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL ,
[lname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL ,
[ssn] [int] NULL ,
CONSTRAINT [PK_tu_bulk_people] PRIMARY KEY CLUSTERED
(
[tu_id],
[count_id]
) ON [PRIMARY]
) ON [PRIMARY]
GO
CREATE TABLE #initials (ssn int NOT NULL,
fname varchar(32) NOT NULL,
lname varchar(32) NOT NULL,
initials char(2) NOT NULL)
go
CREATE TABLE #ssnmania (ident int NOT NULL,
ssn int NOT NULL,
initials char(2) NOT NULL,
PRIMARY KEY(ssn, initials))
go
INSERT #initals (ssn, fname, lname, initials)
SELECT DISTINCT ssn, fname, lname,
CASE WHEN fname < lname
THEN substring(fname, 1, 1) + substring(lname, 1, 1)
ELSE substring(lname, 1, 1) + substring(fname, 1, 1)
END
FROM TU_People_Data
go
INSERT #ssnmania (ssn, initials)
SELECT DISTINCT ssn, initials
FROM #initials
go
SELECT i.ssn, i.fname, i.lname, i.initials, groupno = s.ident
FROM #initials i
JOIN #ssnmania s ON i.ssn = s.ssn
AND s.initials = i.initials
go
DROP TABLE #initials, #ssnmania, TU_People_Data

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pr...oads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodin...ions/books.mspx|||Thanks Erland.
I have tried this procedure in the morning and it solves half of my
problem.
let me start by answering your question.
>What if you have both John Smith and Southerland Jane? Are the
> same person or not?
If these guys' SSN is the same, they are considered to be in the the
same group.
I am willing to take the chance that John Smith, Southerland Jane and
Jack Sam with similar SSN has slim chance to occur. if they exist,
they are gouped in one group number.
>>regarding your solution
In my table some of the rows for one person are displayed like this.
1.John Coleman Smith 1111 JS
2.John Smith Coleman 1111 CJ
3.Coleman John Smith 1111 CS
4.John-coleman Smith 1111 JS
5. Smith John 1111 JS
6.John Smith 2222 JS
7.J Smith 1111 JS
8 Jack Sam 3333 JS
you can see that all this guys can be grouped in the same group
name(except the 6th and 8th). I see that SSN is the major factor to
identify the groups.
So once SSN is the same then the intitals has to be one or two of the
three J or S or C.

Erland Sommarskog wrote:
> (jacob.dba@.gmail.com) writes:
> > I want to assign group number according to this business logic.
> > 1. Records with equal SSN and (similar first name or last name) belong
> > to the same group.
> > John Smith 1234
> > Smith John 1234
> > S John 1234
> > J Smith 1234
> > John Smith and Smith John falls in the same group Number as long as
> > they have similar SSN.
> > This is because I have a record of equal SSN but the first name and
> > last name is switched because of people who make error inserting last
> > name as first name and vice versa. John Smith and Smith John will have
> > equal group Name if they have equal SSN.
> > 2. There are records with equal SSN but different first name and last
> > name. These belong to different group numbers.
> > Equal SSN doesn't guarantee equal group number, at least one of the
> > first name or last name should be the same. John Smith and Dan Brown
> > with equal SSN=1234 shouldn't fall in the same group number.
> What if you have both John Smith and Southerland Jane? Are the
> same person or not?
> This looks like a very difficult task, and the fact that you have
> 800 million rows certainly does not help to make it easier.
> I think you need to scrap the idea you got from Itzik. My gut feeling
> say that it will not scale.
> Here is a very simple-minded solution where I've assumed that as
> long as any combination of initials match, it's the same group.
>
> CREATE TABLE [TU_People_Data] (
> [tu_id] [bigint] NOT NULL ,
> [count_id] [int] NOT NULL ,
> [fname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL ,
> [lname] [varchar] (32) COLLATE Latin1_General_CI_AS NULL ,
> [ssn] [int] NULL ,
> CONSTRAINT [PK_tu_bulk_people] PRIMARY KEY CLUSTERED
> (
> [tu_id],
> [count_id]
> ) ON [PRIMARY]
> ) ON [PRIMARY]
> GO
> CREATE TABLE #initials (ssn int NOT NULL,
> fname varchar(32) NOT NULL,
> lname varchar(32) NOT NULL,
> initials char(2) NOT NULL)
> go
> CREATE TABLE #ssnmania (ident int NOT NULL,
> ssn int NOT NULL,
> initials char(2) NOT NULL,
> PRIMARY KEY(ssn, initials))
> go
> INSERT #initals (ssn, fname, lname, initials)
> SELECT DISTINCT ssn, fname, lname,
> CASE WHEN fname < lname
> THEN substring(fname, 1, 1) + substring(lname, 1, 1)
> ELSE substring(lname, 1, 1) + substring(fname, 1, 1)
> END
> FROM TU_People_Data
> go
> INSERT #ssnmania (ssn, initials)
> SELECT DISTINCT ssn, initials
> FROM #initials
> go
> SELECT i.ssn, i.fname, i.lname, i.initials, groupno = s.ident
> FROM #initials i
> JOIN #ssnmania s ON i.ssn = s.ssn
> AND s.initials = i.initials
> go
> DROP TABLE #initials, #ssnmania, TU_People_Data
>
>
> --
> Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se
> Books Online for SQL Server 2005 at
> http://www.microsoft.com/technet/pr...oads/books.mspx
> Books Online for SQL Server 2000 at
> http://www.microsoft.com/sql/prodin...ions/books.mspx|||(jacob.dba@.gmail.com) writes:
> I have tried this procedure in the morning and it solves half of my
> problem.

And the other half is? :-) I did not include the middle initial, because
I did not see that post until later.

But I guess that you could extend the logic that I posted to handle
the middle initial as well.

--
Erland Sommarskog, SQL Server MVP, esquel@.sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/pr...oads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodin...ions/books.mspx|||you need a function that takes the first character from first name,
last name, and middle initial, and sorts them. Call it "SortInit"
So, pass "Sam Alfred Jones" and it passes back "AJS". Likewise,
"Jones Alfred Sam" is returned as "AJS".

then, create your temp table and populate it with SSN and Sortinit().
then alter table on your temp table and add an identity column.
Then make your "temp table" a permanent one, as your business rules
will change, and fundamentally what you are doing is looking for
"duplicate rows" and grouping them, and this is almost always a
multiple pass project.

Wednesday, March 7, 2012

Assign ID to distinct names

I have a table called NAMES with 2 columns - id, name. I have a bunch
of names in the table but id is null for all. I need to develop a
script that will assign a unique id to each distinct name.
Before:
ID Name
- Tom
- Tom
- Lee
- Lee
- Lee
- Jim
- Jim
After:
ID Name
1 Tom
1 Tom
2 Lee
2 Lee
2 Lee
3 Jim
3 Jim
DDL:
create table NAMES (
id int null,
name varchar(20) not null
)
insert NAMES values (null, 'Tom');
insert NAMES values (null, 'Tom');
insert NAMES values (null, 'Lee');
insert NAMES values (null, 'Lee');
insert NAMES values (null, 'Lee');
insert NAMES values (null, 'Jim');
insert NAMES values (null, 'Jim');
Any help is appreciated. Thanks."Green" <subhash.daga@.gmail.com> wrote in message
news:1140881551.847258.142380@.i40g2000cwc.googlegroups.com...
>I have a table called NAMES with 2 columns - id, name. I have a bunch
> of names in the table but id is null for all. I need to develop a
> script that will assign a unique id to each distinct name.
> Before:
> ID Name
> - Tom
> - Tom
> - Lee
> - Lee
> - Lee
> - Jim
> - Jim
> After:
> ID Name
> 1 Tom
> 1 Tom
> 2 Lee
> 2 Lee
> 2 Lee
> 3 Jim
> 3 Jim
> DDL:
> create table NAMES (
> id int null,
> name varchar(20) not null
> )
> insert NAMES values (null, 'Tom');
> insert NAMES values (null, 'Tom');
> insert NAMES values (null, 'Lee');
> insert NAMES values (null, 'Lee');
> insert NAMES values (null, 'Lee');
> insert NAMES values (null, 'Jim');
> insert NAMES values (null, 'Jim');
> Any help is appreciated. Thanks.
>
What's the point of duplicating the names? Try:
CREATE TABLE names2 (
id INTEGER NOT NULL
CONSTRAINT pk_names2 PRIMARY KEY,
name varchar(20) NOT NULL
CONSTRAINT ak1_names2 UNIQUE);
INSERT INTO names2 (id, name)
SELECT COUNT(DISTINCT N2.name), N1.name
FROM names AS N1
JOIN names AS N2
ON N1.name <= N2.name
GROUP BY N1.name ;
Result:
id name
-- --
1 Tom
2 Lee
3 Jim
David Portas, SQL Server MVP
Whenever possible please post enough code to reproduce your problem.
Including CREATE TABLE and INSERT statements usually helps.
State what version of SQL Server you are using and specify the content
of any error messages.
SQL Server Books Online:
http://msdn2.microsoft.com/library/ms130214(en-US,SQL.90).aspx
--|||Thanks. In reality, the table has much more columns and already has an
existing primary key. The ID column described above is not apart of the
pk.
Any other solutions please. I'd like to avoid using cursors.
Thanks.|||Green wrote:
> Thanks. In reality, the table has much more columns and already has an
> existing primary key. The ID column described above is not apart of
> the pk.
> Any other solutions please. I'd like to avoid using cursors.
> Thanks.
Get a list of the unique name values and put them in a temp table with
an int idenitity column. Update the original table from the temp table:
Create Table names (id int null, name varchar(50))
go
insert names values (null, 'Tom');
insert names values (null, 'Tom');
insert names values (null, 'Lee');
insert names values (null, 'Lee');
insert names values (null, 'Lee');
insert names values (null, 'Jim');
insert names values (null, 'Jim');
go
Create Table #names (id int identity not null, name varchar(50))
go
insert into #names (
name )
Select distinct name from names
go
select * from #Names
go
Update names
Set names.id = t.id
from #names t
where names.name = t.name
go
select * from names
go
drop table #names
go
David Gugick - SQL Server MVP
Quest Software|||> Thanks. In reality, the table has much more columns and already has an
> existing primary key. The ID column described above is not apart of the
> pk.
In that case adding the ID column based on the name would create a
transitive dependency in violation of the standard Normal Forms. Do I take
it that you really want to put name into a related table? Accurate DDL would
be a help here.
David Portas, SQL Server MVP
Whenever possible please post enough code to reproduce your problem.
Including CREATE TABLE and INSERT statements usually helps.
State what version of SQL Server you are using and specify the content
of any error messages.
SQL Server Books Online:
http://msdn2.microsoft.com/library/ms130214(en-US,SQL.90).aspx
--|||In sql 2005 you can do
declare @.NAMES table (
id int null,
name varchar(20) not null
)
insert @.NAMES values (null, 'Tom');
insert @.NAMES values (null, 'Tom');
insert @.NAMES values (null, 'Lee');
insert @.NAMES values (null, 'Lee');
insert @.NAMES values (null, 'Lee');
insert @.NAMES values (null, 'Jim');
insert @.NAMES values (null, 'Jim');
select *
,dense_rank() OVER( order by name)
FROM @.NAMES n1
Result:
id name
-- -- --
NULL Jim 1
NULL Jim 1
NULL Lee 2
NULL Lee 2
NULL Lee 2
NULL Tom 3
NULL Tom 3
farmer
"Green" <subhash.daga@.gmail.com> wrote in message
news:1140881551.847258.142380@.i40g2000cwc.googlegroups.com...
>I have a table called NAMES with 2 columns - id, name. I have a bunch
> of names in the table but id is null for all. I need to develop a
> script that will assign a unique id to each distinct name.
> Before:
> ID Name
> - Tom
> - Tom
> - Lee
> - Lee
> - Lee
> - Jim
> - Jim
> After:
> ID Name
> 1 Tom
> 1 Tom
> 2 Lee
> 2 Lee
> 2 Lee
> 3 Jim
> 3 Jim
> DDL:
> create table NAMES (
> id int null,
> name varchar(20) not null
> )
> insert NAMES values (null, 'Tom');
> insert NAMES values (null, 'Tom');
> insert NAMES values (null, 'Lee');
> insert NAMES values (null, 'Lee');
> insert NAMES values (null, 'Lee');
> insert NAMES values (null, 'Jim');
> insert NAMES values (null, 'Jim');
> Any help is appreciated. Thanks.
>|||to get exact result like yours
select *
,dense_rank() OVER( order by name desc)
FROM @.NAMES n1
"Farmer" <someone@.somewhere.com> wrote in message
news:%23TiCAqkOGHA.140@.TK2MSFTNGP12.phx.gbl...
> In sql 2005 you can do
> declare @.NAMES table (
> id int null,
> name varchar(20) not null
> )
> insert @.NAMES values (null, 'Tom');
> insert @.NAMES values (null, 'Tom');
> insert @.NAMES values (null, 'Lee');
> insert @.NAMES values (null, 'Lee');
> insert @.NAMES values (null, 'Lee');
> insert @.NAMES values (null, 'Jim');
> insert @.NAMES values (null, 'Jim');
>
> select *
> ,dense_rank() OVER( order by name)
> FROM @.NAMES n1
>
> Result:
> id name
> -- -- --
> NULL Jim 1
> NULL Jim 1
> NULL Lee 2
> NULL Lee 2
> NULL Lee 2
> NULL Tom 3
> NULL Tom 3
>
> farmer
>
> "Green" <subhash.daga@.gmail.com> wrote in message
> news:1140881551.847258.142380@.i40g2000cwc.googlegroups.com...
>|||Thanks all. I used Gugick's suggestion and that worked well. I find
Farmer's solution quite interesting. Maybe I'll try that nexttime.
Thanks.

Assign Aggregate count results to variable

I have an aggregate transform that outputs two columns, a group by (DT_STR) and a count(column name) (DT_UI8). The results are put into a Recordset Destination. When I attempt to map these columns to variables in a Foreach Loop Container (using a Foreach ADO Enumerator), I get the error:

Error: 0xC001F009 at ExtractNNRPersonUpdates: The type of the value being assigned to variable "User::NNRPersonCount" differs from the current variable type. Variables may not change type during execution. Variable types are strict, except for variables of type Object.

Error: 0xC001C012 at Foreach Household Loop Container: ForEach Variable Mapping number 2 to variable "User::NNRPersonCount" cannot be applied.

The variable mentioned in the error message is setup as a UInt64. I've tried all other integer data types and nothing works. I also tried changing the data type of the count coming out of the Aggregate transform but received a warning stating this isn't possible.

Any idea what I may be doing wrong?

I had a lot of trouble with this error, and ended up doing explicit casts in a script where I was doing my comparison, using the CType function.

eg:

Row.newid = rowIndex + CType(Me.Variables.MyCounter, ULong)

Dylan.

|||Out of desperation, I began trying all of the data types. When I set it to DBNull, it worked. I was then able to cast it to an integer in a Script Task. Doesn't seem right though because the values are never null.

Assign Aggregate count results to variable

I have an aggregate transform that outputs two columns, a group by (DT_STR) and a count(column name) (DT_UI8). The results are put into a Recordset Destination. When I attempt to map these columns to variables in a Foreach Loop Container (using a Foreach ADO Enumerator), I get the error:

Error: 0xC001F009 at ExtractNNRPersonUpdates: The type of the value being assigned to variable "User::NNRPersonCount" differs from the current variable type. Variables may not change type during execution. Variable types are strict, except for variables of type Object.

Error: 0xC001C012 at Foreach Household Loop Container: ForEach Variable Mapping number 2 to variable "User::NNRPersonCount" cannot be applied.

The variable mentioned in the error message is setup as a UInt64. I've tried all other integer data types and nothing works. I also tried changing the data type of the count coming out of the Aggregate transform but received a warning stating this isn't possible.

Any idea what I may be doing wrong?

I had a lot of trouble with this error, and ended up doing explicit casts in a script where I was doing my comparison, using the CType function.

eg:

Row.newid = rowIndex + CType(Me.Variables.MyCounter, ULong)

Dylan.

|||Out of desperation, I began trying all of the data types. When I set it to DBNull, it worked. I was then able to cast it to an integer in a Script Task. Doesn't seem right though because the values are never null.

Sunday, February 19, 2012

asp/net sql update against ntext datatype

Guys Ihave a table that among the other columns has a column of ntext.

I also have another table that has another ntext column...
due to normalizations I need to merge the tables and set the column on the first table plus the value from the second table...(based on some where clause...

How can this be done?
if data type is varchar it is no problem..but due to text datatype I am unable to perform this update...

please help.

FrankUnfortunately you cannot concatenate an ntext field. I think the best you are going to be able to do is something like this:


SELECT
CAST(CAST(myText AS nVarchar(4000)) + CAST(myText AS nVarchar(4000)) AS ntext)
FROM
Test

However this will obviously truncate both columns at 4000 before putting them together.

Terri|||but the length exceeds 8000.

that is actually the problem.|||My only suggestion then is to run a VB.NET (or C#) program that selects in the data from both sources, concatenates the nText data, then updates the nText field in the desired database table. I don't think you are going to be able to do this within just Transact-SQL itself.

Terri|||well that requires application code..it must be a way of doing it too...in sql..
thanks anyway.|||There is no way to concatenate nText fields to each other. If you find a way around it, let us know.

Quoting directly from the Books Online:

+ (String Concatenation)
An operator in a string expression that concatenates two or more character or binary strings, columns, or a combination of strings and column names into one expression (a string operator).

Syntax
expression + expression

Arguments
expression

Is any valid Microsoft® SQL Server? expression of any of the data types in the character and binary data type category,except the image, ntext, or text data types. Both expressions must be of the same data type, or one expression must be able to be implicitly converted to the data type of the other expression.

Terri

Monday, February 13, 2012

ASP.NET SQL CONTAINS statement

I have an SQL db that I need to be able to search and display. I haveabout seven different columns and would like results to be returned toa datagrid based upon the search criteria entered by the user.
How do I construct a SELECT statement such as below using the ASP.NET 1.1+?
SELECT * FROM db.table WHERE CONTAINS ("searchable term")
I can get a column searched and return the results already, but if Imodify to search all columns and use the CONTAINS it will fail. Isthere a way to do this easily?
Thanks,
TRKneller

Your query is failing because you are using CONTAINS Microsoft Proprietry FULL TEXT search predicate when you need to use LIKE which is ANSI SQL used to search table column based data. Full Text is used for Text and NText data but it is an add on to SQL Server dependent on Microsoft Search Service and the Catalog must be populated to get search results. SQL Server creates an Arithmetic Pointer to Text and NText data in your file system. Run a search for LIKE and FULL Text in the BOL(books online). Hope this helps.