
Posted to user@mahout.apache.org by 林泽桢 <la...@gmail.com> on 2012/08/22 05:22:14 UTC

a bug of fpgrowth?

Hello, when I use FP-Growth to mine association rules, the results always
come out wrong, and I am confused.

I then read the source code, and I think I found a bug at line 102
of FrequentPatternMaxHeap.java: " least.compareTo(frequentPattern) < 0 "
should be changed to " least.compareTo(frequentPattern) > 0 ". The former
filters out many of the frequent patterns that come afterwards.

After this modification the results are better, but when running on a
400 MB file with maxHeapSize=1000 and minSupport=2, FP-Growth takes more
than 10 hours, and sometimes it spends 2 hours computing a single feature.
Is something else wrong?

Thanks for your help.

Re: a bug of fpgrowth?

Posted by tom pierce <tc...@apache.org>.

Hello,

Could you try re-running FP-Growth with the '-2' flag, and let us know 
if you have more success?

This uses an alternate implementation of the FPGrowth algorithm; I have 
had problems similar to what you are seeing when using the default 
implementation.

I am skeptical of the change you suggest. That line seems to be related 
to maintaining a "least" pointer, and the logic seems right to me on the 
surface.
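
To make the role of a "least" pointer concrete, here is a minimal sketch of a
bounded top-k collection. It is not Mahout's FrequentPatternMaxHeap; the class
TopKPatterns, the Pattern type, and ordering by support alone are all
assumptions made for illustration. It only shows why the direction of the
comparison against "least" decides which patterns survive once the heap is full.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; this is not Mahout's implementation.
public class TopKPatterns {

    // A pattern reduced to a label and a support count (assumed shape).
    public static final class Pattern {
        public final String items;
        public final long support;

        public Pattern(String items, long support) {
            this.items = items;
            this.support = support;
        }
    }

    private static final Comparator<Pattern> BY_SUPPORT =
        Comparator.comparingLong((Pattern p) -> p.support);

    private final int k;
    // Min-heap on support: peek() is always the weakest retained pattern.
    private final PriorityQueue<Pattern> heap = new PriorityQueue<>(BY_SUPPORT);

    public TopKPatterns(int k) {
        this.k = k;
    }

    // Returns true if the pattern was kept.
    public boolean offer(Pattern candidate) {
        if (heap.size() < k) {
            heap.add(candidate);
            return true;
        }
        Pattern least = heap.peek();
        // Admit the candidate only if it is strictly stronger than "least".
        // Reversing this test would keep the weakest patterns instead.
        if (least != null && BY_SUPPORT.compare(candidate, least) > 0) {
            heap.poll();
            heap.add(candidate);
            return true;
        }
        return false;
    }

    // Retained patterns, strongest first.
    public List<Pattern> sortedDescending() {
        List<Pattern> out = new ArrayList<>(heap);
        out.sort(BY_SUPPORT.reversed());
        return out;
    }
}
```

With k = 2 and supports offered in the order 5, 3, 4, 1, this sketch keeps the
patterns with support 5 and 4. Whether "<" or ">" is the correct test in the
real class depends on how its compareTo orders patterns, which this sketch does
not reproduce.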

It is interesting to hear that this change improves the diversity of 
patterns you receive. The default implementation of FPGrowth will often 
"mine" the same pattern several times. There is also a limit on how many 
patterns will be returned (k). Together, these can limit the number of 
unique patterns found.
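
As a rough sketch of why duplicates and the limit k interact, the following
compares taking the first k mined patterns as-is with taking the first k
distinct ones. The class UniquePatternLimit and its methods are hypothetical
and are not part of Mahout; patterns are modeled simply as sets of item names.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; this is not Mahout code.
public class UniquePatternLimit {

    // Keeps the first k patterns exactly as mined, duplicates included.
    public static List<Set<String>> firstK(List<Set<String>> mined, int k) {
        int end = Math.max(0, Math.min(k, mined.size()));
        return new ArrayList<>(mined.subList(0, end));
    }

    // Keeps the first k distinct patterns, preserving mining order.
    public static List<Set<String>> firstKUnique(List<Set<String>> mined, int k) {
        Set<Set<String>> seen = new LinkedHashSet<>();
        for (Set<String> pattern : mined) {
            if (seen.size() >= k) {
                break;
            }
            // Set equality ignores item order, so {a, b} and {b, a} collapse.
            seen.add(new TreeSet<>(pattern));
        }
        return new ArrayList<>(seen);
    }

    // Number of distinct patterns in a list.
    public static int distinctCount(List<Set<String>> patterns) {
        return new LinkedHashSet<>(patterns).size();
    }
}
```

If the same pattern is mined three times before a second one appears, then
with k = 3 the first method returns only one distinct pattern, while the
second returns both.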

The '-2' flag should eliminate duplicate patterns. You can also try 
increasing k if you want to find more patterns.

-tom
