Posted to user@mahout.apache.org by 林泽桢 <la...@gmail.com> on 2012/08/22 05:22:14 UTC
a bug of fpgrowth?
Hello, I am using FPGrowth to generate association rules, but the results
always come out wrong, and I am confused.
I then read the source code, and I think I found a bug at line #102
of FrequentPatternMaxHeap.java: " least.compareTo(frequentPattern) <
0 " should be changed to " least.compareTo(frequentPattern) > 0 ". The
former filters out many of the frequent patterns that come later.
After this modification the results are better, but on a 400 MB input file
with maxHeapSize=1000 and minSupport=2, FPGrowth takes more than 10 hours,
and it sometimes spends 2 hours computing a single feature. Is something
else wrong?
Thanks for your help.
Re: a bug of fpgrowth?
Posted by tom pierce <tc...@apache.org>.
Hello,
Could you try re-running FP-Growth with the '-2' flag, and let us know
if you have more success?
This uses an alternate implementation of the FPGrowth algorithm; I have
had problems similar to what you are seeing when using the default
implementation.
I am skeptical of the change you suggest. That line seems to be related
to maintaining a "least" pointer, and the logic seems right to me on the
surface.
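To make the "least" question concrete, here is a minimal sketch of a bounded top-k collection in Java. It is not Mahout's FrequentPatternMaxHeap: the class names SupportedPattern and TopKPatterns are hypothetical, and the ordering (lower support sorts first) is an assumption that may not match Mahout's Pattern.compareTo. Under that assumed ordering, " least.compareTo(candidate) < 0 " means the candidate is stronger than the weakest retained pattern, while " > 0 " means the candidate is weaker. Which direction is right at a given line depends on the real compareTo and on the surrounding branch, neither of which is reproduced here.

```java
import java.util.PriorityQueue;

// Hypothetical stand-in for a mined pattern; NOT Mahout's Pattern class.
final class SupportedPattern implements Comparable<SupportedPattern> {
    final String items;
    final long support;

    SupportedPattern(String items, long support) {
        this.items = items;
        this.support = support;
    }

    // Assumed ordering: a pattern with lower support sorts first.
    @Override
    public int compareTo(SupportedPattern other) {
        return Long.compare(support, other.support);
    }
}

// Keeps at most maxSize patterns, preferring those with higher support.
final class TopKPatterns {
    private final int maxSize;
    // Min-heap: the head is always the weakest retained pattern.
    private final PriorityQueue<SupportedPattern> queue =
            new PriorityQueue<SupportedPattern>();

    TopKPatterns(int maxSize) {
        this.maxSize = maxSize;
    }

    // Weakest retained pattern, or null when nothing is retained yet.
    SupportedPattern least() {
        return queue.peek();
    }

    int size() {
        return queue.size();
    }

    // Returns true if the candidate was retained.
    boolean insert(SupportedPattern candidate) {
        if (queue.size() < maxSize) {
            queue.add(candidate);
            return true;
        }
        // Full: admit the candidate only if it is strictly stronger than
        // the weakest retained pattern, and evict that weakest pattern.
        if (least().compareTo(candidate) < 0) {
            queue.poll();
            queue.add(candidate);
            return true;
        }
        return false;
    }
}
```

In this sketch the heap head serves as the "least" pointer, so there is no separate field to keep in sync. A hand-maintained pointer would need the opposite test (" > 0 " under this ordering) when it is updated to track a new minimum.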
It is interesting to hear that this change improves the diversity of
patterns you receive. The default implementation of FPGrowth will often
"mine" the same pattern several times. There is also a limit on how many
patterns will be returned (k). Together, these can limit the number of
unique patterns found.
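The interaction between repeated patterns and the limit k can be illustrated with a small Java sketch. It only shows the arithmetic; it does not model how Mahout mines or stores patterns, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical illustration; not part of Mahout.
final class PatternLimitDemo {
    // Keep the first k patterns as mined (duplicates included),
    // then count how many distinct patterns survive.
    static int uniqueWithoutDedup(List<String> mined, int k) {
        List<String> kept = mined.subList(0, Math.min(k, mined.size()));
        return new LinkedHashSet<String>(kept).size();
    }

    // Drop duplicates first, then keep at most k patterns.
    static int uniqueWithDedup(List<String> mined, int k) {
        List<String> distinct =
                new ArrayList<String>(new LinkedHashSet<String>(mined));
        return Math.min(k, distinct.size());
    }
}
```

For example, if the mined sequence is "a b", "a b", "a b", "a c", "b c" and k is 3, the first method keeps a single unique pattern, while the second keeps three.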
The '-2' flag should eliminate duplicate patterns. You can also try
increasing k if you want to find more patterns.
-tom