Showing posts with label rules. Show all posts
Showing posts with label rules. Show all posts

Tuesday, March 20, 2012

Association Rules: discretization

Hi there,

it has been a long i'm trying to execute Microsoft Association Rules on my database.

I solved memory leak problem now, but i still can't understand output rules.

Database contain all the italian student who took a degree last year. Here in Italy, they have to compile a summary where they speak about universitary experience. ie: they talk about experience with teachers (pointage from 1 to 5); they says if they want to continue in the universitary field or not, and so on.

Most of the rules, says:

Int_Stud=1-2, RapDoc>4

Int_Stud is the column where i store student intention to continue university. 1 means they want to go on, 2 means they do not want to continue to study. So, this rules has no sense, because it relates all the student (in my mind): the one who wants to continue university and the one who do not want to.

I think problem is that visual studio 2005 and analysis service has no understanding of Int_Stud world, they've no idea that Int_Stud can have just 2 values and that they're opposite each other. Is there a solution to this problem? Can i discretize this column?

Even if I know not to have perfect english, I hope to be understandable

I think you have to modify your data source in sense of replacing 1 with "Continue" and 2 with "NotContinue" or something like that. I think the system consider 1 and 2 as a number. This replacing conduct to understandability too.|||

do u mean i have to modify value in database? I mean, i've something similar to 90 column in the same situation of Int_Stud ... does exist another solution? Moreover, most of fact table - i worked on - had only numeric values in (no text) ...|||

In this discretized situations I think u have to make necessary transformations. Let's think you have to present this project to your big boss. You have to explain in a legend what you mean with Int_stud that have value 1 or 2 etc. I don't have to be in your position ... Smile

|||

ggciubuc wrote:

In this discretized situations I think u have to make necessary transformations. Let's think you have to present this project to your big boss. You have to explain in a legend what you mean with Int_stud that have value 1 or 2 etc. I don't have to be in your position ...

I continue not understand what you mean when you speak about transformations .. why can't i make a discretization procedure on my numeric value? Can't i divided them into bucket?

|||

think you have to modify your data source in sense of replacing 1 with "Continue" and 2 with "NotContinue" or something like that.

|||

ggciubuc wrote:

think you have to modify your data source in sense of replacing 1 with "Continue" and 2 with "NotContinue" or something like that.

so, you mean that i've to modify my source data by changing numeric value with text one? Why? This will be text analysis, won't it?

I used weka one time, and it has a tool which allows to discretize numeric value by dividing it into bin (and you can set your preference too)

|||

You don't have to replace int values with text to make association rules model work correctly. All you need to do is change the column from Discretized to Discrete. (Open Mining Structure tab and look at the Content property of the column)

|||

My ideea is to make things more readable and understandable. I should do in this way.

|||

Tatyana Yakushev wrote:

You don't have to replace int values with text to make association rules model work correctly. All you need to do is change the column from Discretized to Discrete. (Open Mining Structure tab and look at the Content property of the column)

thanks so much .. it works exactly as i wanted to Smile

Association Rules Model for problem

What would be the right design approach for the following problem?

I have a single table called SelectionFactors, which has the following columns and sample data:

ProjectID Factor FactorValue 1000 Countries USA 1000 Countries Canada 1000 Countries France 1000 Languages English 1000 Languages French 1000 Company Type Consulting 1000 Company Type Software 2000 Countries India 2000 Countries China 2000 Countries USA 2000 Languages English 2000 Languages Chinese (Simplified) 2000 Languages Chinese (Traditional) 2000 Languages Spanish 2000 Company Type Retail 2000 Company Type Dairy Products

The problem is to allow a descriptive analysis of the data to find patterns in the users selections. For instance,

if Languages->English is selected, what are the counts of projects for other Factor->Factor Value combinations?

Countries->USA = 2, Countries->Canada=1, Company Type->Consulting=1 and so on.

Since all the data is in this single table, are both the case and nested tables the same? What are the keys and inputs? I only need a descriptive analysis (no prediction) and ALL possible combinations MUST be part of the results; how should the model be designed?

Thank you,

Anna.

This is a good problem. What you want to do is to create a model with three nested tables all marked predict and input. Event though you want descriptive analysis, you need to mark them "predict" such that rules will be formed (only items marked predict will show on the right hand side of rules).

The way you will create such a model is to use the "Named Query" feature of the data source view(DSV). You can right-click on the DSV and select "New Named Query" to create these.

The named queries you will want to create are these:

Projects: SELECT DISTINCT ProjectID FROM SelectionFactors

Companies: SELECT ProjectID, FactorValue FROM SelectionFactors WHERE Factor='Companies'

Languages: SELECT ProjectID, FactorValue FROM SelectionFactors WHERE Factor='Languages'

Company Type: SELECT ProjectID, FactorValue FROM SelectionFactors WHERE Factor='Company Type'

You will then need to relate the ProjectID in each of the transaction named queries (Companies, etc) to the ProjectID in the Projects named query. The system will prompt you to set ProjectID as the key of the Projects named query . At this point, the named queries are equivalent to tables as far as Analysis Services is concerned.

When you create the model, you will make the Projects table the case table and the other tables nested tables. ProjectID will be the key of the Projects table and FactorValue will be the key of the other tables. Make sure to mark FactorValue as Key, Input, and Output.

When you process your model you will see rules relating projects, companies, and company types (both between factos and in the same factor).

Note that you may need to adjust the MINIMUM_SUPPORT and MINIMUM_PROBABILITY parameters to get the results you need (in fact, it's recommended)

HTH

-Jamie

|||

Thanks for your response.

However, I should have mentioned that there is an open-ended number of "Factors". New factors are added often and the number of factors is nearly 100.

The type of questions I am trying to answer using this model is:

"Among projects that have Languages-> English, how many (distinct projects) have Company Type->Consulting, Countries->USA" and so on.

Ideally, I would like to provide this analysis in a free, OLAP environment where the user can select one factor-> factor value combination and then see all the other related combinations.

I currently have a Cube with SelectionFactors serving as both the Fact and the dimension table with a single measure - DISTINCT COUNT of projectIDs.

Thanks once again.

|||

Given the information in my previous post, what are my design options?

Thanks.

|||

My question for you, then, is if you have a cube for this, what are you not getting that you need?

If the factor/factorvalue can be viewed as a complete unit, and you don't need to provide analysis between factors - e.g. given Language X, which Countries and Company Types are prevalent, you could create a calculated column in the DSV that combines the factor and factor value.

|||

Perhaps you can help me understand why I am unable to get the expected results. I am fairly new to OLAP and Data Mining and would appreciate any pointers and guidance.

The full picture of my problem is:

Dimension Project Profile - Project ID, (Other Project related attributes)

Dimension Selection Factors - Factor, Factor Value

Fact Selection Factors - Project ID, Factor, Factor Value

Project Cube Measure - DISTINCT COUNT of Project IDs

My understanding is that this is a problem of intersection of factors, which I cannot achieve with a single dimension. In fact there is this other thread http://forums.microsoft.com/msdn/showpost.aspx?postid=856304&siteid=1 that seems to come very close to my problem.

In my case, the following MDX returns the projects that have the intersection of specific factor values.

SELECT {[Measures].[Project Count]} ON COLUMNS,

INTERSECT

(

NONEMPTY

(

[Project Profile].[Project ID].Children,

(

[Measures].[Project Count],

[Selection Factor].[Selection Factors].[Factor].&[Annual Revenue].&[$500 million - $3 billion in revenues]

)

),

NONEMPTY

(

[Project Profile].[Project ID].Children,

(

[Measures].[Selection Project Count],

[Selection Factor].[Selection Factors].[Factor].&[Concurrent Users].&[1,000–10,000 users]

)

)

)

ON ROWS

FROM PROJECTCUBE

The questions are:

1. How do I get all the other Factor->Factor Values that are part of the selection factors for the projects returned by this MDX. This, if I understand correctly, is Basket analysis.

and

2. Is there a generic way to perform this analysis through cube design rather than writing MDX for each case?

Thanks once more.|||

The following MDX gives me the contents of the "baskets" of projects that have the selected items. However, the counts of the projects is not restricted to the subset retrieved through the Intersect. Is there a way to get the counts to only be within the subset returned by the Intersect?

With this MDX, I can get very close to what I need. The only thing I am missing is to get the counts (perhaps through a calculated measure). Help on this will be much appreciated!

__

SELECT [Measures].[Project Count] ON COLUMNS,

NONEMPTY

(

([Selection Factor].[Factor].Children, [Selection Factor].[Factor Value].Children),

INTERSECT

(

NONEMPTY

(

[Project Profile].[Project ID].Children,

(

[Measures].[Selection Project Count],

[Selection Factor].[Selection Factors].[Factor].&[Annual Revenue].&[$500 million - $3 billion in revenues]

)

),

NONEMPTY

(

[Project Profile].[Project ID].Children,

(

[Measures].[Selection Project Count],

[Selection Factor].[Selection Factors].[Factor].&[Concurrent Users].&[1,000–10,000 users]

)

)

)

)

ON ROWS

FROM PROJECTCUBE

|||

In order to obtain counts restricted to the subset returned by the Intersect, this MDX takes the constraint out of the ROWS into the WHERE clause.

I would still like to know how to design an Association rules model that can "automate" the generation of results by case. Hopefully, the MDX can point directly to the expected results.

SELECT [Measures].[Project Count] ON COLUMNS,

NONEMPTYCROSSJOIN([Selection Factor].[Factor].Children, [Selection Factor].[Factor Value].Children)

ON ROWS

FROM PROJECTCUBE

WHERE

NONEMPTY(

INTERSECT

(

NONEMPTY

(

[Project Profile].[Project ID].Children,

(

[Measures].[Project Count],

[Selection Factor].[Selection Factors].[Factor].&[Annual Revenue].&[$500 million - $3 billion in revenues]

)

),

NONEMPTY

(

[Project Profile].[Project ID].Children,

(

[Measures].[Project Count],

[Selection Factor].[Selection Factors].[Factor].&[Concurrent Users].&[1,000–10,000 users]

)

)

)

)

sql

Association Rules algorithm [support,confidence]

In assotiation rules each rule has a [support, confidence] part. In Microsoft Association Rules there is a [probability,importance] measure in each rule and importance can be greater that 1.

I found the following in msdn but i'm not sure if i understood correctly.

MINIMUM_PROBABILITY: Specifies the minimum probability that a rule is true. For example, setting this value to 0.5 specifies that no rule with less than fifty percent probability is generated.
The default is 0.4.

MAXIMUM_SUPPORT: Specifies the maximum number of cases in which an itemset can have support. If this value is less than 1, the value represents a percentage of the total cases. Values greater than 1 represent the absolute number of cases that can contain the itemset.
The default is 1.

My questions are
1) Can i explain the [probability,importance] in [support,confidence]? If yes, how?
2) What importance>1 means?

Thank you in advance.

MS Association Rules still has [support, confidence] like any other AR implementation. If you take a rule of the form

A -> B

Support is the number of times the itemset A occurs, and confidence is the probability that B occurs when A occurs. Importance is a different measure that indicates the lift of the rule e.g. the increase in probability of the target over the probability of the target at random, which is why it can be greater than 1.

In the DM viewers, the support is indicated on the Itemset page.

|||

If i have a rule A,B => C and an itemset A,B,C with support 2000 then the support of the rule is 2000? How can i calculate the confidence then?

Thank you for your reply.

|||

If your rule is A,B=>C then:

Confidence == Probability == Numberof(A,B,C) / Numberof(A,B)

|||That was very helpful thank you :)|||Hi all,

I am trying to make an algorithm that also takes into account how "important" an association rule is. I found that "importance" is something that I am missing.

Ho is importance calculated ?

greetz,
Alex|||

The formula, with some details, is discussed in this thread:

http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=290186&SiteID=1

Association Rules

1) I use the identifier of transaction and attribute in one table.

Do I can to build a association rules structure without the use of the nested tables?

I tried - did not turn out...

2) As it is necessary to use a main and child table, can not build a prediction query.

When I try to add the predict column in a criteria/argument (Field=PredictSupport), i'll given message:

"Nested table column cannot be used as an argument in a data mining function."

I can not use other columns, because they are not predicable.

You have to have a nested table, but you can use the same table as the case and nested tables. Just mark the transaction key as the key for the "case" table, and the transaction item as the key for the "nested" table.

In your prediction query you would predict the nested table ( you are essentially predicting which rows will be there). For example, if your model was like this:

TransID LONG KEY,
Products TABLE PREDICT
(
ProductName TEXT KEY
)

You would issue a statement like

SELECT Predict(Products, 5) FROM MyModel PREDICTION JOIN ....

|||

Yes, turned out to build a model! Thanks!
But at an attempt to take advantage of builder of queries, there is an error:

Unable to retrieve the list of distinct states for mining model column Utp Doc Id.

ADDITIONAL INFORMATION:

Error (Data mining): The specified DMX column was not found in the context at line 1, column 17. (Microsoft SQL Server 2005 Analysis Services)


Without a builder - a query works :)
And yet question - as possible to apply a function, as though, PredictSupport? She needs a column, and for me does not turn out to use her from a table...

P.S.: I am sorry for my English =)

Association Prediction by Rules Still Returns Itemsets

If I use this code with an association model, it still returns itemsets for me - when it should be returning only nodes with rules associated with them (according to sqlserverdatamining.com). If I try adding 'AND $PROBABILITY > .25' to the where clause, it returns 0 results for every query I try. Any clue why this may be happening?

Code Snippet

SELECT FLATTENED
(SELECT * FROM PredictAssociation([Product],20,
INCLUDE_NODE_ID,INCLUDE_STATISTICS)
WHERE $NODEID<>'')
FROM
[ProductRecommend]
PREDICTION JOIN
OPENQUERY([ds],
'SELECT
[PRODUCTCLASSID],[DESCRIPTION]
FROM
[Product_Table]
WHERE
[PRODUCTCLASSID] = ''1234'' AND [DESCRIPTION] = ''DESC''
') AS t

ON
[ProductRecommend].[Product].[PRODUCTCLASSID] = t.[PRODUCTCLASSID] AND
[ProductRecommend].[Product].[DESCRIPTION] = t.[DESCRIPTION]

This query returns more relevant results than those lacking the filtering by $NODEID, however the results should have higher probabilities than .047! Please help! Thanks!

Okay, I just reconstructed the same query using my data in a relational mining model (instead of OLAP) and got reasonable results. Would anyone know how to fix this for OLAP or be able to point me in a direction where I could go to learn how to do it? Thanks.

|||What is the MINIMUM_PROBABILITY value for the OLAP mining model?
Is it different than the one from the relational model?

One more thing (which you probably know already) -- the query you posted initially executes one prediction (PredictAssociation ... 20) for each row in the data source query (does not group together multiple input rows belonging to the same transaction).
Basically, only rules having a single item on the left hand side will be used in prediction.