Tuesday, March 20, 2012

Association Rules Model for problem

What would be the right design approach for the following problem?

I have a single table called SelectionFactors, which has the following columns and sample data:

ProjectID Factor FactorValue 1000 Countries USA 1000 Countries Canada 1000 Countries France 1000 Languages English 1000 Languages French 1000 Company Type Consulting 1000 Company Type Software 2000 Countries India 2000 Countries China 2000 Countries USA 2000 Languages English 2000 Languages Chinese (Simplified) 2000 Languages Chinese (Traditional) 2000 Languages Spanish 2000 Company Type Retail 2000 Company Type Dairy Products

The problem is to allow a descriptive analysis of the data to find patterns in the users selections. For instance,

if Languages->English is selected, what are the counts of projects for other Factor->Factor Value combinations?

Countries->USA = 2, Countries->Canada=1, Company Type->Consulting=1 and so on.

Since all the data is in this single table, are both the case and nested tables the same? What are the keys and inputs? I only need a descriptive analysis (no prediction) and ALL possible combinations MUST be part of the results; how should the model be designed?

Thank you,

Anna.

This is a good problem. What you want to do is to create a model with three nested tables all marked predict and input. Event though you want descriptive analysis, you need to mark them "predict" such that rules will be formed (only items marked predict will show on the right hand side of rules).

The way you will create such a model is to use the "Named Query" feature of the data source view(DSV). You can right-click on the DSV and select "New Named Query" to create these.

The named queries you will want to create are these:

Projects: SELECT DISTINCT ProjectID FROM SelectionFactors

Companies: SELECT ProjectID, FactorValue FROM SelectionFactors WHERE Factor='Companies'

Languages: SELECT ProjectID, FactorValue FROM SelectionFactors WHERE Factor='Languages'

Company Type: SELECT ProjectID, FactorValue FROM SelectionFactors WHERE Factor='Company Type'

You will then need to relate the ProjectID in each of the transaction named queries (Companies, etc) to the ProjectID in the Projects named query. The system will prompt you to set ProjectID as the key of the Projects named query . At this point, the named queries are equivalent to tables as far as Analysis Services is concerned.

When you create the model, you will make the Projects table the case table and the other tables nested tables. ProjectID will be the key of the Projects table and FactorValue will be the key of the other tables. Make sure to mark FactorValue as Key, Input, and Output.

When you process your model you will see rules relating projects, companies, and company types (both between factos and in the same factor).

Note that you may need to adjust the MINIMUM_SUPPORT and MINIMUM_PROBABILITY parameters to get the results you need (in fact, it's recommended)

HTH

-Jamie

|||

Thanks for your response.

However, I should have mentioned that there is an open-ended number of "Factors". New factors are added often and the number of factors is nearly 100.

The type of questions I am trying to answer using this model is:

"Among projects that have Languages-> English, how many (distinct projects) have Company Type->Consulting, Countries->USA" and so on.

Ideally, I would like to provide this analysis in a free, OLAP environment where the user can select one factor-> factor value combination and then see all the other related combinations.

I currently have a Cube with SelectionFactors serving as both the Fact and the dimension table with a single measure - DISTINCT COUNT of projectIDs.

Thanks once again.

|||

Given the information in my previous post, what are my design options?

Thanks.

|||

My question for you, then, is if you have a cube for this, what are you not getting that you need?

If the factor/factorvalue can be viewed as a complete unit, and you don't need to provide analysis between factors - e.g. given Language X, which Countries and Company Types are prevalent, you could create a calculated column in the DSV that combines the factor and factor value.

|||

Perhaps you can help me understand why I am unable to get the expected results. I am fairly new to OLAP and Data Mining and would appreciate any pointers and guidance.

The full picture of my problem is:

Dimension Project Profile - Project ID, (Other Project related attributes)

Dimension Selection Factors - Factor, Factor Value

Fact Selection Factors - Project ID, Factor, Factor Value

Project Cube Measure - DISTINCT COUNT of Project IDs

My understanding is that this is a problem of intersection of factors, which I cannot achieve with a single dimension. In fact there is this other thread http://forums.microsoft.com/msdn/showpost.aspx?postid=856304&siteid=1 that seems to come very close to my problem.

In my case, the following MDX returns the projects that have the intersection of specific factor values.

SELECT {[Measures].[Project Count]} ON COLUMNS,

INTERSECT

(

NONEMPTY

(

[Project Profile].[Project ID].Children,

(

[Measures].[Project Count],

[Selection Factor].[Selection Factors].[Factor].&[Annual Revenue].&[$500 million - $3 billion in revenues]

)

),

NONEMPTY

(

[Project Profile].[Project ID].Children,

(

[Measures].[Selection Project Count],

[Selection Factor].[Selection Factors].[Factor].&[Concurrent Users].&[1,000–10,000 users]

)

)

)

ON ROWS

FROM PROJECTCUBE

The questions are:

1. How do I get all the other Factor->Factor Values that are part of the selection factors for the projects returned by this MDX. This, if I understand correctly, is Basket analysis.

and

2. Is there a generic way to perform this analysis through cube design rather than writing MDX for each case?

Thanks once more.|||

The following MDX gives me the contents of the "baskets" of projects that have the selected items. However, the counts of the projects is not restricted to the subset retrieved through the Intersect. Is there a way to get the counts to only be within the subset returned by the Intersect?

With this MDX, I can get very close to what I need. The only thing I am missing is to get the counts (perhaps through a calculated measure). Help on this will be much appreciated!

__

SELECT [Measures].[Project Count] ON COLUMNS,

NONEMPTY

(

([Selection Factor].[Factor].Children, [Selection Factor].[Factor Value].Children),

INTERSECT

(

NONEMPTY

(

[Project Profile].[Project ID].Children,

(

[Measures].[Selection Project Count],

[Selection Factor].[Selection Factors].[Factor].&[Annual Revenue].&[$500 million - $3 billion in revenues]

)

),

NONEMPTY

(

[Project Profile].[Project ID].Children,

(

[Measures].[Selection Project Count],

[Selection Factor].[Selection Factors].[Factor].&[Concurrent Users].&[1,000–10,000 users]

)

)

)

)

ON ROWS

FROM PROJECTCUBE

|||

In order to obtain counts restricted to the subset returned by the Intersect, this MDX takes the constraint out of the ROWS into the WHERE clause.

I would still like to know how to design an Association rules model that can "automate" the generation of results by case. Hopefully, the MDX can point directly to the expected results.

SELECT [Measures].[Project Count] ON COLUMNS,

NONEMPTYCROSSJOIN([Selection Factor].[Factor].Children, [Selection Factor].[Factor Value].Children)

ON ROWS

FROM PROJECTCUBE

WHERE

NONEMPTY(

INTERSECT

(

NONEMPTY

(

[Project Profile].[Project ID].Children,

(

[Measures].[Project Count],

[Selection Factor].[Selection Factors].[Factor].&[Annual Revenue].&[$500 million - $3 billion in revenues]

)

),

NONEMPTY

(

[Project Profile].[Project ID].Children,

(

[Measures].[Project Count],

[Selection Factor].[Selection Factors].[Factor].&[Concurrent Users].&[1,000–10,000 users]

)

)

)

)

sql

No comments:

Post a Comment