You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by "tom pierce (Commented) (JIRA)" <ji...@apache.org> on 2011/12/04 00:20:41 UTC

[jira] [Commented] (MAHOUT-890) Performance issue in FPGrowth

    [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162230#comment-13162230 ] 

tom pierce commented on MAHOUT-890:
-----------------------------------

I've prepared a patch which replaces the current FPTree/FPGrowth implementation, and also streamlines the mapreduce workflow a bit. I'm attaching it as MAHOUT-890.patch.

One caveat is that sequential FPGrowth no longer mines "all patterns that include item X for each item".  To duplicate this functionality, you'll need to postfilter the results.  In other words, if [A, B] is currently a frequent pattern, you would find it in the old results twice, once each in a list for A and B.  In the new version it will only appear in one list.  I've updated the tests to reflect this.

Based on the answer keys in some of the Parallel FPGrowth tests, the postfilter that should have been doing this fixup for the parallel version has not been  working correctly in all cases, so perhaps this functionality could be dropped or redone for both MR and sequential workflows (for example, in PFPGrowthTest.java, [D, A, E] is not in the answer key for A, but it is for D and E).  




                
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>         Attachments: MAHOUT-890.patch, addSynth.patch, logtrees.patch, simpleFPG.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira