Posted to user@mahout.apache.org by pr...@nokia.com on 2010/11/09 17:20:05 UTC

Deriving associations from frequent patterns

Hello all,
I am new to Mahout. I have just started looking into Mahout to replace our current FP-growth implementation with the parallel FP-growth that Mahout provides, since we have started running into scalability issues. I looked at the PFPGrowth documentation and noticed that it produces only the top K frequent patterns, not the associations, and associations are what we need. So I was thinking of implementing a simple AssociationGenerator that takes the frequent patterns output as input. However, I am not sure what the best way is to generate associations from the frequent patterns produced by Mahout.

I have the following sample output from mahout.

Key: 46485: Value: ([46485],936), ([46705, 46485],355)
Key: 46705: Value: ([46705],2526)

We are interested only in item sets of size 2, since we need associations with exactly one antecedent and one consequent.

I was planning to calculate associations with confidence as follows:
For each key above as A {
        for each two-item set as [A,C] {
                confidence(A->C) = support([A,C]) / support(A);
                add association (A, C, confidence(A->C)) to the list;
        }
}
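
A minimal Python sketch of this loop, using the usual definition confidence(A -> C) = support([A,C]) / support(A). The function name pair_rules and the dictionary layout are illustrative assumptions, not part of Mahout; the counts are the ones from the sample output above.

```python
from typing import Dict, FrozenSet, List, Tuple


def pair_rules(support: Dict[FrozenSet[int], int]) -> List[Tuple[int, int, float]]:
    """Return (antecedent, consequent, confidence) for every two-item set.

    `support` maps an item set to its count. This layout is an assumption
    made for the sketch; it is not a Mahout data structure.
    """
    rules = []
    for itemset, pair_count in support.items():
        if len(itemset) != 2:
            continue
        a, b = sorted(itemset)
        # An item set is unordered, so one pair yields a rule in each direction.
        for antecedent, consequent in ((a, b), (b, a)):
            base = support.get(frozenset([antecedent]))
            if base:  # skip the rule if the single-item count is unknown
                rules.append((antecedent, consequent, pair_count / base))
    return rules


# Counts taken from the sample output above.
support = {
    frozenset([46485]): 936,
    frozenset([46705]): 2526,
    frozenset([46705, 46485]): 355,
}

for antecedent, consequent, conf in pair_rules(support):
    print(f"{antecedent} -> {consequent}: {conf:.3f}")
```

Because the lookup goes through one shared map rather than through the list stored under a single key, it does not matter which key the pair was listed under.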

Keeping the above requirement and pseudocode in mind, my questions are as follows:
1. Is the above algorithm efficient?
2. In the first pattern list, [46705, 46485] occurs 355 times, but why is the same pattern not repeated in the second list (for key 46705)? Because of this, calculating confidence(46705 -> 46485) becomes difficult. As you can see from the pseudocode above, I was planning to read the patterns for each feature and calculate the confidence of every association that has that feature as its antecedent. But when I read feature 46705, I cannot calculate confidence(46705 -> 46485), since the item set is not listed under that feature.
3. Has anyone implemented associations from the generated frequent patterns?


Thanks
Praveen


RE: Deriving associations from frequent patterns

Posted by pr...@nokia.com.
Hello all,
Sorry to bother everyone, but I still could not make any progress in generating item-to-item associations from the frequent patterns generated by Mahout PFP. I am still trying to understand the semantics of the generated frequent patterns and the best way to generate associations from them.

For others' sake, I would like to repeat my questions:
1. Why are the frequent patterns not generated in both directions? The example below has the frequent pattern ([46705, 46840],698) on the first line but no [46840, 46705] on the second line. So I cannot build associations by looping through the products.
>> Key: 46705: Value: ([46705],2526), ([46705, 46840],698)
>> Key: 46485: Value: ([46485],936), ([46705, 46485],355)

2. Could someone give a high-level description of an algorithm for generating associations from the generated frequent patterns?
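
One possible approach, sketched in Python: read every pattern from every key into a single support map first, then look up both directions of a pair in that map. The regular expression assumes the output has been dumped as text lines shaped like the two sample lines above; load_supports is a hypothetical helper name, and keeping the largest count when an item set appears under several keys is an assumption, not documented Mahout behaviour.

```python
import re
from typing import Dict, FrozenSet, Iterable

# Matches one "([item, item, ...],count)" entry in a dumped output line.
PATTERN = re.compile(r"\(\[([\d,\s]+)\],\s*(\d+)\)")


def load_supports(lines: Iterable[str]) -> Dict[FrozenSet[int], int]:
    """Collect every item set from every key into one support map."""
    supports: Dict[FrozenSet[int], int] = {}
    for line in lines:
        for items, count in PATTERN.findall(line):
            itemset = frozenset(int(i) for i in items.split(","))
            # Assumption: if an item set shows up under several keys,
            # keep the largest count seen.
            supports[itemset] = max(supports.get(itemset, 0), int(count))
    return supports


lines = [
    "Key: 46705: Value: ([46705],2526), ([46705, 46840],698)",
    "Key: 46485: Value: ([46485],936), ([46705, 46485],355)",
]
supports = load_supports(lines)

pair = frozenset([46705, 46485])
# The pair is listed only under key 46485, yet both directions are available.
print(supports[pair] / supports[frozenset([46705])])  # confidence(46705 -> 46485)
print(supports[pair] / supports[frozenset([46485])])  # confidence(46485 -> 46705)
```

Note that the pair [46705, 46840] can only produce a rule with antecedent 46705 here, because the single-item count for 46840 is not in these two lines.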

Thanks
Praveen

-----Original Message-----
From: Peddi Praveen (Nokia-MS/Boston) 
Sent: Wednesday, November 10, 2010 7:44 AM
To: user@mahout.apache.org
Subject: Re: Deriving associations from frequent patterns

Ok thanks Anil.

Please let me know if you need anything else from me regarding my original question of calculating association rules and what can be done to make the output have necessary information.

Praveen


Re: Deriving associations from frequent patterns

Posted by pr...@nokia.com.
Ok thanks Anil.

Please let me know if you need anything else from me regarding my original question of calculating association rules and what can be done to make the output have necessary information.

Praveen

On Nov 9, 2010, at 11:17 PM, ext Robin Anil <ro...@gmail.com> wrote:

> g is the number of groups in which features get divided so that the total
> size of transactions in bytes is almost equal in each reducer. See the
> PFPGrowth paper. With g=1 you get the original fpgrowth. I usually suggest a
> g size == numfeatures / (10 or 20) so as to make parallel fpgrowth scalable
> and still get similar results as the sequential one.
> 
> Robin

Re: Deriving associations from frequent patterns

Posted by Robin Anil <ro...@gmail.com>.
g is the number of groups in which features get divided so that the total
size of transactions in bytes is almost equal in each reducer. See the
PFPGrowth paper. With g=1 you get the original fpgrowth. I usually suggest a
g size == numfeatures / (10 or 20) so as to make parallel fpgrowth scalable
and still get similar results as the sequential one.
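
As a rough illustration of this rule of thumb (numfeatures divided by 10 or 20), here is a small Python sketch. The function name suggested_groups and the feature count of 5000 are made up for the example; they are not part of Mahout.

```python
def suggested_groups(num_features: int, features_per_group: int = 10) -> int:
    """Rule-of-thumb value for g: num_features / (10 or 20), never below 1."""
    return max(1, num_features // features_per_group)


# Hypothetical data set with 5000 distinct features.
print(suggested_groups(5000))      # 500
print(suggested_groups(5000, 20))  # 250
```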

Robin

On Wed, Nov 10, 2010 at 12:23 AM, <pr...@nokia.com> wrote:

> The results are obviously different. This raises another question. Are the
> frequent patterns supposed to change with different values of g?

RE: Deriving associations from frequent patterns

Posted by pr...@nokia.com.
Hi Anil,
Here is the result for the same features with g=1
Key: 46705: Value: ([46705],2526), ([46705, 46840],698)
Key: 46485: Value: ([46485],936), ([46705, 46485],355), ([46840, 46485],329), ([46847, 46485],211), ([46705, 46840, 46485],207), ([46485, 46815],175), ([46485, 46852],159), ([46840, 46847, 46485],130), ([46705, 46847, 46485],126), ([46705, 46485, 46815],105), ([46840, 46485, 46815],97), ([46840, 46485, 46852],96), ([46847, 46485, 46815],94), ([46705, 46485, 46852],93), ([46705, 46840, 46847, 46485],92), ([20975, 46485],92), ([16794, 46485],80), ([46847, 46485, 46852],76), ([46705, 46840, 46485, 46815],75), ([46485, 46852, 46815],75), ([46705, 46840, 46485, 46852],69), ([20924, 46485],68), ([46705, 46847, 46485, 46815],67), ([46840, 46847, 46485, 46815],66), ([20975, 46705, 46840, 46485],65), ([46840, 46847, 46485, 46852],56), ([20975, 46705, 46485],55), ([20975, 46840, 46485],54), ([46705, 46840, 46847, 46485, 46815],53)

Full Result for same features when g=500 is:
Key: 46705: Value: ([46705],2526)
Key: 46485: Value: ([46485],936), ([46705, 46485],355), ([46840, 46485],329), ([46847, 46485],211), ([46705, 46840, 46485],205), ([46840, 46847, 46485],127), ([46705, 46847, 46485],124), ([20975, 46485],92), ([46705, 46840, 46847, 46485],90), ([20975, 46705, 46485],55), ([20975, 46840, 46485],54), ([21243, 46485],47), ([20975, 46705, 46840, 46485],43), ([39140, 46485],37), ([20975, 46847, 46485],31), ([20975, 46840, 46847, 46485],27), ([20975, 46705, 46847, 46485],26), ([20975, 46705, 46840, 46847, 46485],23), ([27984, 46705, 46485],23), ([21243, 46840, 46485],22), ([21243, 46705, 46485],21), ([39140, 46840, 46485],19), ([21243, 46847, 46485],18), ([39140, 46705, 46485],15), ([21243, 46705, 46840, 46485],14), ([6942, 46485],14), ([21243, 46840, 46847, 46485],13), ([39140, 46847, 46485],13), ([39140, 46840, 46847, 46485],11), ([20975, 39140, 46485],11), ([20975, 21243, 46485],11), ([39140, 46705, 46840, 46485],10), ([27984, 46705, 46840, 46847, 46485],9), ([39140, 46705, 46847, 46485],9), ([20975, 27984, 46705, 46485],8), ([39140, 46705, 46840, 46847, 46485],7), ([20975, 27984, 46705, 46840, 46485],7), ([21243, 46705, 46847, 46485],7), ([20975, 39140, 46840, 46485],7), ([6942, 46705, 46485],7), ([21243, 46705, 46840, 46847, 46485],6), ([20975, 21243, 46840, 46847, 46485],6), ([21243, 27984, 46485],6), ([39140, 27984, 46485],6), ([6942, 46840, 46485],6), ([20975, 27984, 46705, 46847, 46485],5), ([39140, 27984, 46847, 46485],5), ([20975, 39140, 46705, 46485],5), ([21243, 39140, 46485],5), ([4873, 46485],5) 

The results are obviously different. This raises another question. Are the frequent patterns supposed to change with different values of g?
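
To see exactly where two runs differ, a small comparison like the following Python sketch may help. It assumes each run has already been loaded into a map from item set to count; diff_runs is a hypothetical helper, and the maps hold only a few entries copied from the two results above.

```python
from typing import Dict, FrozenSet

Supports = Dict[FrozenSet[int], int]


def diff_runs(a: Supports, b: Supports):
    """Return item sets found only in a, only in b, and shared ones whose counts differ."""
    only_a = sorted(sorted(s) for s in a.keys() - b.keys())
    only_b = sorted(sorted(s) for s in b.keys() - a.keys())
    changed = {
        tuple(sorted(s)): (a[s], b[s])
        for s in a.keys() & b.keys()
        if a[s] != b[s]
    }
    return only_a, only_b, changed


# A few entries for key 46485, copied from the g=1 and g=500 results above.
run_g1 = {
    frozenset([46705, 46485]): 355,
    frozenset([46705, 46840, 46485]): 207,
    frozenset([46485, 46815]): 175,
}
run_g500 = {
    frozenset([46705, 46485]): 355,
    frozenset([46705, 46840, 46485]): 205,
    frozenset([21243, 46485]): 47,
}

only_g1, only_g500, changed = diff_runs(run_g1, run_g500)
print(only_g1)    # item sets reported only with g=1
print(only_g500)  # item sets reported only with g=500
print(changed)    # item sets whose counts differ between the runs
```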

Praveen

-----Original Message-----
From: ext Robin Anil [mailto:robin.anil@gmail.com] 
Sent: Tuesday, November 09, 2010 1:11 PM
To: user@mahout.apache.org
Subject: Re: Deriving associations from frequent patterns

Can you try with g=1 and tell me the result?

On Tue, Nov 9, 2010 at 11:37 PM, <pr...@nokia.com> wrote:

> Here is the command I used to run PFPGrowth. I am still using only a 
> single machine. I will be setting up a Hadoop cluster soon.
>
> $ hadoop jar mahout-examples-0.4-job.jar
> org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver      -i downloads-input
>  -o reco-patterns-output      -k 50      -method mapreduce      -g 10
>  -regex '[\ ]' -s 500
>
> -----Original Message-----
> From: ext Robin Anil [mailto:robin.anil@gmail.com]
> Sent: Tuesday, November 09, 2010 1:01 PM
> To: user@mahout.apache.org
> Subject: Re: Deriving associations from frequent patterns
>
> On Tue, Nov 9, 2010 at 11:20 PM, <pr...@nokia.com> wrote:
>
> > Hi Anil,
> > 1. I am not sure if I understand your answer to #1 (or were you 
> > asking me a question?). Could you please clarify? The sample patterns I 
> > gave are only a small subset of the output. I included only those 
> > two features for simplicity.
> >
>  Oh. Never mind. Let me see
>
>
> > 2. I am sending the gzipped sample transaction file (1M downloads) 
> > to your private email since I am not sure if I can attach files to 
> > the mailing list.
> > Please check your email for the sample file.
> >
> > Praveen
> >
> > -----Original Message-----
> > From: ext Robin Anil [mailto:robin.anil@gmail.com]
> > Sent: Tuesday, November 09, 2010 12:40 PM
> > To: user@mahout.apache.org
> > Subject: Re: Deriving associations from frequent patterns
> >
> > On Tue, Nov 9, 2010 at 9:50 PM, <pr...@nokia.com> wrote:
> >
> > > Hello all,
> > > I am new to mahout. I have just started looking into mahout to 
> > > replace our current fpgrowth implementation with the parallel fp 
> > > growth that Mahout provides, since we started having scalability issues. I 
> > > looked at PFPGrowth documentation and I noticed that it only 
> > > produces top K frequent patterns but not the associations and what 
> > > we need is associations. So I was thinking of implementing a 
> > > simple AssociationGenerator given the frequent patterns output. 
> > > However I am not sure what is the best way to generate 
> > > associations given the frequent patterns produced by mahout.
> > >
> > > I have the following sample output from mahout.
> > >
> > > Key: 46485: Value: ([46485],936), ([46705, 46485],355)
> > > Key: 46705: Value: ([46705],2526)
> > >
> > > We are interested only in item set size of 2 since we need only 1 
> > > ANTECEDENT to 1 CONSEQUENT ASSOCIATIONS ONLY.
> > >
> > > I was planning to calculate associations with confidence as follows:
> > > For each key above as A {
> > >        for each two-item set as [A,C] {
> > >                confidence (A->C) = support(A->C)/support(C);
> > >                add association (A, C, confidence(A->C) to the list;
> > >        }
> > > }
> > >
> > > Keeping the above requirement and pseudo code n mind, my questions 
> > > as
> > > follows:
> > > 1. Is the above algorithm efficient?
> > >
> > You are running it over a set of Top K patterns. Its small. doesnt 
> > matter if its inefficient or not
> >
> > > 2. In the first pattern, [46705, 46485] occurred 355 times but in 
> > > second pattern why is the same pattern not repeated. Because of 
> > > this calculating confidence (46705 -> 46485) becomes difficult. As 
> > > you can see from above code, I was planning to read patterns for 
> > > each feature and calculate confidence of all association with antecedent.
> > > But when I read feature 46705, I cannot calculate confidence of
> > > (46705 ->
> > > 46485) since the item set is not included with the feature.
> > >
> > Good question. I guess the partitioning is screwing this up as there 
> > are other K-1 patterns in the list > 355. Can you give a sample to test.
> >
> > > 3. Has anyone implemented associations from the generated frequent 
> > > patterns.
> > >
> > Nope
> >
> > >
> > >
> > > Thanks
> > > Praveen
> > >
> > >
> >
>

RE: Deriving associations from frequent patterns

Posted by pr...@nokia.com.
Hi Anil,
Sorry, the params in my previous email were not correct. Here is the correct command.

> $ hadoop jar mahout-examples-0.4-job.jar \
>     org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver \
>     -i downloads-input -o reco-patterns-output \
>     -k 50 -method mapreduce -g 500 -regex '[\ ]' -s 5

I used 500 for g and 5 for s.
When I used 10 for g, the job took about 8 mins, and with g=500 it took just over 2 mins.

I just ran with g=1 and it took 19 mins (compared to 8 and 2 in the previous runs). I thought g was useful only when there is a cluster. Why does increasing g make it faster even on a single machine? How do I calculate the optimal value based on my data size?

I will send you the output again in a separate email.

Thanks
Praveen


Re: Deriving associations from frequent patterns

Posted by Robin Anil <ro...@gmail.com>.
Can you try with g=1 and tell me the result?

On Tue, Nov 9, 2010 at 11:37 PM, <pr...@nokia.com> wrote:

> Here is the command I used to run PFPGrowth. I am still using only single
> machine. Will be setting up hadoop cluster soon.
>
> $ hadoop jar mahout-examples-0.4-job.jar
> org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver      -i downloads-input
>  -o reco-patterns-output      -k 50      -method mapreduce      -g 10
>  -regex '[\ ]' -s 500
>

RE: Deriving associations from frequent patterns

Posted by pr...@nokia.com.
Here is the command I used to run PFPGrowth. I am still using only a single machine. I will be setting up a Hadoop cluster soon.

$ hadoop jar mahout-examples-0.4-job.jar org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver -i downloads-input -o reco-patterns-output -k 50 -method mapreduce -g 10 -regex '[\ ]' -s 500


Re: Deriving associations from frequent patterns

Posted by Robin Anil <ro...@gmail.com>.
On Tue, Nov 9, 2010 at 11:20 PM, <pr...@nokia.com> wrote:

> Hi Anil,
> 1. I am not sure if I understand your answer to #1 (or were you asking me a
> question?). Could you pls clarify? The sample patterns I gave is only a
> small subset from the output. I included only those two features for
> simplicity.
>
 Oh. Never mind. Let me see


> 2. I am sending the gzipped sample transaction file (1M downloads) to your
> private email since I am not sure if I can attach files to the mailing list.
> Please check your email for the sample file.
>
> Praveen
>

RE: Deriving associations from frequent patterns

Posted by pr...@nokia.com.
Hi Anil,
1. I am not sure I understand your answer to #1 (or were you asking me a question?). Could you please clarify? The sample patterns I gave are only a small subset of the output. I included only those two features for simplicity.
2. I am sending the gzipped sample transaction file (1M downloads) to your private email, since I am not sure whether I can attach files to the mailing list. Please check your email for the sample file.

Praveen


Re: Deriving associations from frequent patterns

Posted by Robin Anil <ro...@gmail.com>.
On Tue, Nov 9, 2010 at 9:50 PM, <pr...@nokia.com> wrote:

> Hello all,
> I am new to mahout. I have just started looking into mahout to replace our
> current fpgrowth implementation with a parallel fp growth that Mahout since
> we started having scalability issues. I looked at PFPGrowth documentation
> and I noticed that it only produces top K frequent patterns but not the
> associations and what we need is associations. So I was thinking of
> implementing a simple AssociationGenerator given the frequent patterns
> output. However I am not sure what is the best way to generate associations
> given the frequent patterns produced by mahout.
>
> I have the following sample output from mahout.
>
> Key: 46485: Value: ([46485],936), ([46705, 46485],355)
> Key: 46705: Value: ([46705],2526)
>
> We are interested only in item set size of 2 since we need only 1
> ANTECEDENT to 1 CONSEQUENT ASSOCIATIONS ONLY.
>
> I was planning to calculate associations with confidence as follows:
> For each key above as A {
>        for each two-item set as [A,C] {
>                confidence (A->C) = support(A->C)/support(C);
>                add association (A, C, confidence(A->C) to the list;
>        }
> }
>
> Keeping the above requirement and pseudo code n mind, my questions as
> follows:
> 1. Is the above algorithm efficient?
>
You are running it over a set of top-K patterns. It's small, so it doesn't
matter whether it's efficient or not.

> 2. In the first pattern, [46705, 46485] occurred 355 times but in second
> pattern why is the same pattern not repeated. Because of this calculating
> confidence (46705 -> 46485) becomes difficult. As you can see from above
> code, I was planning to read patterns for each feature and calculate
> confidence of all association with antecedent. But when I read feature
> 46705, I cannot calculate confidence of (46705 -> 46485) since the item set
> is not included with the feature.
>
Good question. I guess the partitioning is screwing this up, as there are K-1
other patterns in the list with support > 355. Can you give me a sample to test?

> 3. Has anyone implemented associations from the generated frequent
> patterns.
>
Nope

>
>
> Thanks
> Praveen
>
>
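
One possible way around the pattern that is missing under key 46705 is to ignore the per-key grouping and index every pattern by its item set, so that [46705, 46485] is found no matter which key it was listed under. The following is only a minimal Python sketch, assuming the patterns have been dumped as text lines in the format quoted above; the name parse_patterns and the regular expression are illustrative and are not part of Mahout.

```python
import re

# Matches one "([item, item, ...],support)" group in a dumped pattern line.
PATTERN_RE = re.compile(r"\(\[([^\]]*)\],\s*(\d+)\)")

def parse_patterns(lines):
    """Build a global {frozenset(items): support} index from dumped lines."""
    supports = {}
    for line in lines:
        for items_text, count in PATTERN_RE.findall(line):
            items = frozenset(int(tok) for tok in items_text.split(","))
            # The same item set may be listed under several keys;
            # keep the largest count seen for it.
            supports[items] = max(supports.get(items, 0), int(count))
    return supports

# The two sample lines from the thread.
sample = [
    "Key: 46485: Value: ([46485],936), ([46705, 46485],355)",
    "Key: 46705: Value: ([46705],2526)",
]
supports = parse_patterns(sample)
print(supports[frozenset({46705, 46485})])  # prints 355
```

With such an index, the support of the pair is available when processing either item, even though the output lists it only under 46485.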

RE: Deriving associations from frequent patterns

Posted by pr...@nokia.com.
Hi Ted,
Thanks for your response. This is what we are currently doing:

1. We get product download information as a triplet (userId, productId, downloaded_time).
2. We generate an ARFF transaction file that maps each userId to all of its downloaded product IDs.
3. We feed that into WEKA's FPGrowth implementation to get association rules in the form (antecedent, consequent, confidence), where the antecedent and the consequent each contain exactly one item.
4. We use these association rules to recommend products based on past download history (after applying all filters, of course).
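
Step 2 above can be sketched for the Mahout case as follows. This is a hedged illustration rather than our actual code: it assumes the PFPGrowth job is run with a single-space splitter, as in the -regex '[\ ]' option of the commands in this thread, and the function name build_transactions and the sample records are made up.

```python
from collections import defaultdict

def build_transactions(downloads):
    """Group (user_id, product_id, downloaded_time) triplets into one line per user."""
    by_user = defaultdict(set)
    for user_id, product_id, _downloaded_time in downloads:
        by_user[user_id].add(product_id)  # a set drops repeated downloads
    # One transaction per line, items separated by a single space.
    return [" ".join(str(p) for p in sorted(items))
            for _user, items in sorted(by_user.items())]

downloads = [  # made-up sample data
    ("u1", 46485, "2010-11-01T10:00:00"),
    ("u1", 46705, "2010-11-02T09:30:00"),
    ("u2", 46705, "2010-11-02T11:15:00"),
    ("u1", 46485, "2010-11-03T08:00:00"),
]
for line in build_transactions(downloads):
    print(line)
```

The download time is ignored here because frequent-pattern mining only looks at which items occur together in a transaction.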

The drawback of the above implementation is that WEKA loads all transactions into memory, and we hit the memory limit once the total number of downloads reached 60 to 80 million. However, we have more than 200M transactions to process, and the number is growing fast. Our goal is to find a highly scalable solution.

Looking at Mahout's PFPGrowth implementation, I thought I could do exactly the same thing in a distributed fashion, except that I would need to generate the association rules myself. The formula I mentioned for calculating confidence is from here: http://en.wikipedia.org/wiki/Association_rule_learning. When I looked at the output generated by PFPGrowth, I thought I could simply process it feature by feature and calculate these rules (as mentioned in my previous email); however, I don't think that would work with the way Mahout's PFPGrowth generates the output.
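
For reference, the definition on that page divides the support of the pair by the support of the antecedent: confidence(A -> C) = support({A, C}) / support({A}). A minimal Python sketch of generating one-to-one rules in both directions is below. It assumes the supports have already been collected into a single lookup keyed by item set rather than read feature by feature; the function name pairwise_rules and the min_confidence parameter are illustrative, not part of Mahout.

```python
def pairwise_rules(supports, min_confidence=0.0):
    """Return (antecedent, consequent, confidence) for every two-item set."""
    rules = []
    for items, pair_support in supports.items():
        if len(items) != 2:
            continue
        a, b = sorted(items)
        for antecedent, consequent in ((a, b), (b, a)):
            antecedent_support = supports.get(frozenset({antecedent}))
            if not antecedent_support:
                continue  # single-item support missing from the top-K output
            confidence = pair_support / antecedent_support
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, confidence))
    return rules

# Supports taken from the sample output quoted in this thread.
supports = {
    frozenset({46485}): 936,
    frozenset({46705}): 2526,
    frozenset({46485, 46705}): 355,
}
for antecedent, consequent, confidence in pairwise_rules(supports):
    print(f"{antecedent} -> {consequent}: {confidence:.4f}")
```

On the sample numbers this gives roughly 0.38 for 46485 -> 46705 and 0.14 for 46705 -> 46485, which shows why the direction of the rule matters.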

Hope I clarified all your questions.

Praveen


Re: Deriving associations from frequent patterns

Posted by Ted Dunning <te...@gmail.com>.
Praveen,

Could you define what you mean by association?  Do you mean temporal
sequence?  A causal relationship?  Or simply a cooccurrence?

There is considerable confusion in the world about these terms.  To answer
your question, it will be important to be sure what you
mean by your question.

Assuming that you are looking at cooccurrence, possibly with a temporal
ordering, your measure of confidence is anything but a measure of
confidence.  More correctly, you are estimating the conditional probability
P(B|A).  The estimate you are using, however,
is subject to substantial error when counts are too small.

In many applications, you get much better results if you refrain from
estimating conditional probabilities at all and satisfy yourself with
simply separating those conditional probabilities that differ from the
marginal probabilities (i.e. where P(B) != P(B | A) ).  This is a much
simpler task and helps you avoid over-fitting.  In the Luduan system, for
instance, I used a multinomial generalized log-likelihood ratio
test (often called G^2) to find interesting query terms and then used
general corpus frequencies to weight the terms.  The results
were substantially better than methods that tried to weight the terms using
conditional probabilities because the Luduan approach
could avoid over-fitting better.
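
To make the G^2 idea concrete for a single pair of items, here is a minimal standalone Python sketch of the statistic on a 2x2 table of co-occurrence counts, computed as 2 * sum(observed * ln(observed / expected)). It does not use Mahout's implementation, the function name g2 is illustrative, and the total number of transactions n is a hypothetical value because the thread does not state it.

```python
import math

def g2(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 table of co-occurrence counts.

    k11: transactions with both A and B    k12: A without B
    k21: B without A                       k22: neither
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing (limit of x * ln x)
                expected = rows[i] * cols[j] / total
                stat += o * math.log(o / expected)
    return 2.0 * stat

n = 100_000            # hypothetical total number of transactions
both = 355             # support of [46705, 46485] from the sample output
a_only = 936 - both    # 46485 without 46705
b_only = 2526 - both   # 46705 without 46485
neither = n - both - a_only - b_only
print(g2(both, a_only, b_only, neither))
```

A table whose cells match the product of its margins scores 0, and larger scores indicate a stronger departure from independence, so pairs can be ranked or thresholded by this score instead of by an estimated conditional probability.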

This G^2 test is available in Mahout, but I think that the PFPGrowth
algorithm inherently imposes something like it during the winnowing of
patterns.

Another powerful method is to use spectral techniques to find cliques in a
large graph.  This can have very dramatic results and very
good scaling relative to iterative item-set growth techniques.

If you could say more about what you are trying to do at a high level, we
could probably help you find capabilities in Mahout that
suit your needs.
