You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by "tom pierce (Created) (JIRA)" <ji...@apache.org> on 2011/11/20 04:39:51 UTC

[jira] [Created] (MAHOUT-890) Performance issue in FPGrowth

Performance issue in FPGrowth
-----------------------------

                 Key: MAHOUT-890
                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
             Project: Mahout
          Issue Type: Bug
          Components: Frequent Itemset/Association Rule Mining
    Affects Versions: 0.6
            Reporter: tom pierce


I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.

FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 

Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Re: [jira] [Updated] (MAHOUT-890) Performance issue in FPGrowth

Posted by Robin Anil <ro...@gmail.com>.
Thanks tom, The patch looks good, I will take an indepth look after new
year and commit it.

Happy 2012.

On Fri, Dec 30, 2011 at 3:40 PM, tom pierce (Updated) (JIRA) <
jira@apache.org> wrote:

>
>     [
> https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> tom pierce updated MAHOUT-890:
> ------------------------------
>
>     Attachment: MAHOUT-890-2.patch
>
> This patch (MAHOUT-890-2) adds the new implementation (under fpgrowth2)
> alongside the old with a minimal number of boxed primitives in the parallel
> version.  This patch depends on MAHOUT-920, MAHOUT-921 and MAHOUT-927.
>
> Tests are included that check the new implementation against known-good
> output and the existing implementation.
>
> The sequential implementation does not have a post-filter to create the
> complete per-item itemsets (though all the same itemsets are found - the
> new implementation returns de-duped sets).  Otherwise the results should be
> interchangeable.
>
> > Performance issue in FPGrowth
> > -----------------------------
> >
> >                 Key: MAHOUT-890
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
> >             Project: Mahout
> >          Issue Type: Bug
> >          Components: Frequent Itemset/Association Rule Mining
> >    Affects Versions: 0.6
> >            Reporter: tom pierce
> >         Attachments: MAHOUT-890-2.patch, MAHOUT-890.patch,
> addSynth.patch, logtrees.patch, simpleFPG.patch, smallexample.dat
> >
> >
> > I've encountered a dataset which indicates there is probably a
> performance bug lurking in the FPGrowth implementation.  This set may be a
> bit of an unusual target for FPG - there's a relatively modest number
> itemsets, and many items with a Zipfy distribution.  I am attaching a patch
> (addSynth.patch) to add a similar dataset as
> core/src/test/resources/FPGsynth.dat.
> > FPGsynth.dat can take minutes or a few hours to process, depending on
> how it is grouped out to machines.  If run in sequential mode, or with "-g
> 50" it will take considerable time.  Most reducers/"anchor items" are
> processed quickly, but a small number take a handful of minutes, and one or
> two take a long time.  If you experiment with this data, I suggest using
>  '-s 50 -regex "[ ]+"'.
> > Digging into this, I've found that the tree pruning code sometimes
> creates surprising trees.  One oddity I've observed is 0-count nodes,
> sometimes with non-zero children.  The other is that sometimes subtrees
> seem to get repeated.  I'm attaching a sample input file (smallexample.dat,
> use the whitespace regex with this one, too) and a patch which adds some
> logging in pruneFPTree and growthBottomUp which will print out some
> interesting trees when run with the smallexample.dat input.
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA
> administrators:
> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
>
>

[jira] [Commented] (MAHOUT-890) Performance issue in FPGrowth

Posted by "Roger Dai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167059#comment-13167059 ] 

Roger Dai commented on MAHOUT-890:
----------------------------------

I did not notice the replicated trees, but I met the similar problem when using mahout-0.5.
Some features did took much longer amount of time to mine. 
I printed out the stack and found found that the traverBottomUp method had been invoked for many times.
                
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>         Attachments: MAHOUT-890.patch, addSynth.patch, logtrees.patch, simpleFPG.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Assigned] (MAHOUT-890) Performance issue in FPGrowth

Posted by "Robin Anil (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil reassigned MAHOUT-890:
---------------------------------

    Assignee: Robin Anil
    
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>            Assignee: Robin Anil
>             Fix For: 0.6
>
>         Attachments: MAHOUT-890-2.patch, MAHOUT-890-3.patch, MAHOUT-890.patch, addSynth.patch, logtrees.patch, simpleFPG.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-890) Performance issue in FPGrowth

Posted by "tom pierce (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13166452#comment-13166452 ] 

tom pierce commented on MAHOUT-890:
-----------------------------------

Sure - I can do this.  I can also separate out the changes in the MR
logic (for example, a group file is no longer used- membership is
calculated on the fly) from the separate implementation.

I'll make new tickets for those changes, unless I hear otherwise.

Also, given that complete per-item sets were called out - should I
assume that'll be a showstopper because people are relying on the
existing functionality?  Would it be safe to only worry about the MR
case?


On Thu, Dec 8, 2011 at 8:49 PM, Robin Anil (Commented) (JIRA)

                
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>         Attachments: MAHOUT-890.patch, addSynth.patch, logtrees.patch, simpleFPG.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-890) Performance issue in FPGrowth

Posted by "Sean Owen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13153770#comment-13153770 ] 

Sean Owen commented on MAHOUT-890:
----------------------------------

Do you have a suggested fix, or is it more of an observation? I didn't see a change in the patches.
                
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>         Attachments: addSynth.patch, logtrees.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-890) Performance issue in FPGrowth

Posted by "tom pierce (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162230#comment-13162230 ] 

tom pierce commented on MAHOUT-890:
-----------------------------------

I've prepared a patch which replaces the current FPTree/FPGrowth implementation, and also streamlines the mapreduce workflow a bit. I'm attaching it as MAHOUT-890.patch.

One caveat is that sequential FPGrowth no longer mines "all patterns that include item X for each item".  To duplicate this functionality, you'll need to postfilter the results.  In other words, if [A, B] is currently a frequent pattern, you would find it in the old results twice, once each in a list for A and B.  In the new version it will only appear in one list.  I've updated the tests to reflect this.

Based on the answer keys in some of the Parallel FPGrowth tests, the postfilter that should have been doing this fixup for the parallel version has not been  working correctly in all cases, so perhaps this functionality could be dropped or redone for both MR and sequential workflows (for example, in PFPGrowthTest.java, [D, A, E] is not in the answer key for A, but it is for D and E).  




                
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>         Attachments: MAHOUT-890.patch, addSynth.patch, logtrees.patch, simpleFPG.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-890) Performance issue in FPGrowth

Posted by "Robin Anil (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186346#comment-13186346 ] 

Robin Anil commented on MAHOUT-890:
-----------------------------------

Strike that, it works in the order 920, 921, 927, 890-3
                
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>         Attachments: MAHOUT-890-2.patch, MAHOUT-890-3.patch, MAHOUT-890.patch, addSynth.patch, logtrees.patch, simpleFPG.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAHOUT-890) Performance issue in FPGrowth

Posted by "tom pierce (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

tom pierce updated MAHOUT-890:
------------------------------

    Attachment: logtrees.patch
                smallexample.dat
                addSynth.patch
    
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>         Attachments: addSynth.patch, logtrees.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-890) Performance issue in FPGrowth

Posted by "Grant Ingersoll (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174819#comment-13174819 ] 

Grant Ingersoll commented on MAHOUT-890:
----------------------------------------

Jeff, I haven't seen a new patch since the last round of discussions, or did I miss something?  Perhaps it's on another issue?  If so, it should be linked here.
                
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>         Attachments: MAHOUT-890.patch, addSynth.patch, logtrees.patch, simpleFPG.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-890) Performance issue in FPGrowth

Posted by "Robin Anil (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13166458#comment-13166458 ] 

Robin Anil commented on MAHOUT-890:
-----------------------------------

It wont be a complete showstopper, but that functionality is easy to replicate. If the core fpgrowth is correct, then all that is needed is to mine at each level in the tree. corresponding to a pattern. Thats is easy enough to replicate.
                
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>         Attachments: MAHOUT-890.patch, addSynth.patch, logtrees.patch, simpleFPG.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-890) Performance issue in FPGrowth

Posted by "tom pierce (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13155489#comment-13155489 ] 

tom pierce commented on MAHOUT-890:
-----------------------------------

I'm attaching another patch which adds a second FP-Growth implementation.  This code has a few difference with the current FPG implementation.  It does not avoid creating objects or recursion, so it may be more heap/stack hungry.  It also does not output lists of top-k freq patterns for each item; instead it outputs top-k frequent patterns containing items more frequent than the current target item (note that top-k sets per item can be post-processed from this data).  The union of all per-item itemsets should be the same, and I've included a test which demonstrates this for the retail.dat dataset.

This code handles the FPGsynth.dat dataset without trouble, and I've added a flag that lets you try out the new code by adding a "-2" to the commandline, in either MR or sequential mode (running "mahout fpg" will use the existing code by default).  I hope people will try them out against each other.

If this implementation is adopted we can add some itemset pruning (per the Li paper), and we can add a post-processor to generate top-k per item patterns, if there are users relying on that functionality.  I deferred these for now, so that running the implementations side-by-side would be very straightforward.




                
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>         Attachments: addSynth.patch, logtrees.patch, simpleFPG.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-890) Performance issue in FPGrowth

Posted by "Robin Anil (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13165747#comment-13165747 ] 

Robin Anil commented on MAHOUT-890:
-----------------------------------

Would you be willing to move the patch over to fpgrowth2 and to continue to work on the rewrite as a separate version until features are migrated.?
                
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>         Attachments: MAHOUT-890.patch, addSynth.patch, logtrees.patch, simpleFPG.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAHOUT-890) Performance issue in FPGrowth

Posted by "Robin Anil (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-890:
------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)
    
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>            Assignee: Robin Anil
>             Fix For: 0.6
>
>         Attachments: MAHOUT-890-2.patch, MAHOUT-890-3.patch, MAHOUT-890.patch, addSynth.patch, logtrees.patch, simpleFPG.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAHOUT-890) Performance issue in FPGrowth

Posted by "tom pierce (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

tom pierce updated MAHOUT-890:
------------------------------

    Attachment: simpleFPG.patch
    
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>         Attachments: addSynth.patch, logtrees.patch, simpleFPG.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-890) Performance issue in FPGrowth

Posted by "Jeff Hammerbacher (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174809#comment-13174809 ] 

Jeff Hammerbacher commented on MAHOUT-890:
------------------------------------------

What is holding up progress on this issue?
                
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>         Attachments: MAHOUT-890.patch, addSynth.patch, logtrees.patch, simpleFPG.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-890) Performance issue in FPGrowth

Posted by "Grant Ingersoll (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13153800#comment-13153800 ] 

Grant Ingersoll commented on MAHOUT-890:
----------------------------------------

Thinking about how to test this, it sure would be nice if we could mark tests as long running or something, so they don't necessarily run every time.
                
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>         Attachments: addSynth.patch, logtrees.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-890) Performance issue in FPGrowth

Posted by "tom pierce (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13153822#comment-13153822 ] 

tom pierce commented on MAHOUT-890:
-----------------------------------

There's no fix patch up there (yet); it has been difficult for me to trace through this code.  It doesn't quite line up with my reading of the papers it's based on, and it is not always clear what assumptions and guarantees callers/callees are making.

I thought it would be helpful to show example cases that trigger the problem I'm having; maybe someone else who knows this code better could quickly see what is wrong.  I'll probably submit a patch in a day or two with a naive "straight from the paper" implementation.   

I love the suggestion to have a tests-long target or something similar (long-test-disable property?), especially if you can scope it to different packages.  


                
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>         Attachments: addSynth.patch, logtrees.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-890) Performance issue in FPGrowth

Posted by "Robin Anil (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162249#comment-13162249 ] 

Robin Anil commented on MAHOUT-890:
-----------------------------------

As I see from the implementation, the mining is done exactly once. This is quite drastic a deviation from earlier implementation which was mining top K patterns for each level of the tree. I understand that the reason for creating a new implementation was due to the surprising trees that you found in the old version and also because you wanted to make the code more testable. There could be things that can be replaced in a layer. If the problem was in the core-fpgrowth, we can simply replace the new fptree and fpgrowth. part. 

Alternately, we can check the code in under fpgrowth2 and continue the rewrite and move the functionality over bit by bit.

Looking at the code, I see a lot of List<Integer> and Integer[] in the new code. In mahout we went away from the boxed versions due to performance reasons. There is an IntArray in Mahout Math which is used by the earlier code, I would suggest using that.


                
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>         Attachments: MAHOUT-890.patch, addSynth.patch, logtrees.patch, simpleFPG.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-890) Performance issue in FPGrowth

Posted by "tom pierce (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162849#comment-13162849 ] 

tom pierce commented on MAHOUT-890:
-----------------------------------

Thanks for the feedback.  I agree with your thinking on tests, but I'm not sure what makes the most sense here.  I have an example set where the current implementation seems to produce correct output, it just takes an unreasonable amount of time (4+ hours on a small set).  I'll be happy to provide a unit test, though.  Unit testing the tree manipulations on a smaller scale is problematic because the intent and invariants are not obvious.  

For what it's worth, I actually spent more time than I'd like to admit trying to understand and fix the current implementation before resorting to a rewrite.  
My motivation for creating an alternate implementation was that I have a real dataset (of reasonable size) where the current implementation effectively fails to complete.  I've submitted a bug report, example data that demonstrates the long-execution problem, a toy example that can demonstrate evidence of tree-manipulation behavior that is almost certainly bad (but it is still not clear to me how to fix it), and I've also shown a naive implementation that completes quickly on the same data.  

On the topic of top-k patterns, I do not understand why it is desirable to repeatedly mine the same candidates/patterns.  The current implementation will find pattern "a b c" 3 times, when it could be found once and post-processed to create complete per-item lists.  It seems needlessly inefficient (especially if you don't care about per-item lists).  The way the code is structured, it is done this way even in mapreduce mode, even though the input itemsets are pruned in such a way that postprocessing is necessary anyway.  

You're right that there are a lot of boxed primitives in the patch; I also didn't pack the tree into an array representation.  I aimed for simple/quick to get it working; the boxed primitives in particular should be pretty easy to eliminate.

                
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>         Attachments: MAHOUT-890.patch, addSynth.patch, logtrees.patch, simpleFPG.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAHOUT-890) Performance issue in FPGrowth

Posted by "tom pierce (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

tom pierce updated MAHOUT-890:
------------------------------

    Attachment: MAHOUT-890-2.patch

This patch (MAHOUT-890-2) adds the new implementation (under fpgrowth2) alongside the old with a minimal number of boxed primitives in the parallel version.  This patch depends on MAHOUT-920, MAHOUT-921 and MAHOUT-927.

Tests are included that check the new implementation against known-good output and the existing implementation.

The sequential implementation does not have a post-filter to create the complete per-item itemsets (though all the same itemsets are found - the new implementation returns de-duped sets).  Otherwise the results should be interchangeable.
                
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>         Attachments: MAHOUT-890-2.patch, MAHOUT-890.patch, addSynth.patch, logtrees.patch, simpleFPG.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAHOUT-890) Performance issue in FPGrowth

Posted by "Robin Anil (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-890:
------------------------------

    Fix Version/s: 0.6
    
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>            Assignee: Robin Anil
>             Fix For: 0.6
>
>         Attachments: MAHOUT-890-2.patch, MAHOUT-890-3.patch, MAHOUT-890.patch, addSynth.patch, logtrees.patch, simpleFPG.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-890) Performance issue in FPGrowth

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186544#comment-13186544 ] 

Hudson commented on MAHOUT-890:
-------------------------------

Integrated in Mahout-Quality #1307 (See [https://builds.apache.org/job/Mahout-Quality/1307/])
    MAHOUT-890 MAHOUT-920 MAHOUT-921 MAHOUT-927 FpGrowth2 Implementation and fixes to existing Mapreduce classes in pfpgrowth

robinanil : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1231700
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/fpm/pfpgrowth/FPGrowthDriver.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/fpm/pfpgrowth/MultiTransactionTreeIterator.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/fpm/pfpgrowth/PFPGrowth.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/fpm/pfpgrowth/ParallelFPGrowthCombiner.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/fpm/pfpgrowth/ParallelFPGrowthMapper.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/fpm/pfpgrowth/ParallelFPGrowthReducer.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/fpm/pfpgrowth/TransactionSortingMapper.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/fpm/pfpgrowth/TransactionSortingReducer.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/fpm/pfpgrowth/TransactionTree.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/fpm/pfpgrowth/TransactionTreeIterator.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/fpm/pfpgrowth/fpgrowth/FrequentPatternMaxHeap.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/fpm/pfpgrowth/fpgrowth/Pattern.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/fpm/pfpgrowth/fpgrowth2
* /mahout/trunk/core/src/main/java/org/apache/mahout/fpm/pfpgrowth/fpgrowth2/FPGrowthIds.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/fpm/pfpgrowth/fpgrowth2/FPGrowthObj.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/fpm/pfpgrowth/fpgrowth2/FPTree.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/fpm/pfpgrowth/FPGrowthRetailDataTest2.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/fpm/pfpgrowth/FPGrowthRetailDataTestVs.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/fpm/pfpgrowth/FPGrowthSyntheticDataTest.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/fpm/pfpgrowth/FPGrowthTest2.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/fpm/pfpgrowth/PFPGrowthRetailDataTest.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/fpm/pfpgrowth/PFPGrowthRetailDataTest2.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/fpm/pfpgrowth/PFPGrowthRetailDataTestVs.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/fpm/pfpgrowth/PFPGrowthSynthDataTest2.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/fpm/pfpgrowth/PFPGrowthTest.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/fpm/pfpgrowth/PFPGrowthTest2.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/fpm/pfpgrowth/TransactionTreeTest.java
* /mahout/trunk/core/src/test/resources/FPGsynth.dat

                
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>            Assignee: Robin Anil
>             Fix For: 0.6
>
>         Attachments: MAHOUT-890-2.patch, MAHOUT-890-3.patch, MAHOUT-890.patch, addSynth.patch, logtrees.patch, simpleFPG.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAHOUT-890) Performance issue in FPGrowth

Posted by "tom pierce (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

tom pierce updated MAHOUT-890:
------------------------------

    Attachment: MAHOUT-890-3.patch

I decided to add a couple tests based on the synthetic data, and found two bounds-checking issues that the retail and toy sets failed to trigger.  MAHOUT-890-3 includes a fix for those, a couple extra tests, the synthetic test data, and a few less tabs.
                
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>         Attachments: MAHOUT-890-2.patch, MAHOUT-890-3.patch, MAHOUT-890.patch, addSynth.patch, logtrees.patch, simpleFPG.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAHOUT-890) Performance issue in FPGrowth

Posted by "tom pierce (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

tom pierce updated MAHOUT-890:
------------------------------

    Attachment: MAHOUT-890.patch
    
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>         Attachments: MAHOUT-890.patch, addSynth.patch, logtrees.patch, simpleFPG.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-890) Performance issue in FPGrowth

Posted by "Grant Ingersoll (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162239#comment-13162239 ] 

Grant Ingersoll commented on MAHOUT-890:
----------------------------------------

Would it make sense to first establish tests for correctness and then consider replacing the functionality?  I haven't reviewed the patch yet, but it seems kind of drastic to replace first without actually having tests in place that establish correctness.  I realize that perhaps we have "manual" tests, but I think it would be better to first have unit tests in place.
                
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>         Attachments: MAHOUT-890.patch, addSynth.patch, logtrees.patch, simpleFPG.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-890) Performance issue in FPGrowth

Posted by "Robin Anil (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186345#comment-13186345 ] 

Robin Anil commented on MAHOUT-890:
-----------------------------------

The Patch fails when applying. Can you send me a new one
                
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>         Attachments: MAHOUT-890-2.patch, MAHOUT-890-3.patch, MAHOUT-890.patch, addSynth.patch, logtrees.patch, simpleFPG.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-890) Performance issue in FPGrowth

Posted by "tom pierce (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13174825#comment-13174825 ] 

tom pierce commented on MAHOUT-890:
-----------------------------------

There are patches in other issues - it would be great if they moved forward, so I could be sure I won't need to rebase later patches.

See: 

https://issues.apache.org/jira/browse/MAHOUT-920
https://issues.apache.org/jira/browse/MAHOUT-921
https://issues.apache.org/jira/browse/MAHOUT-927
                
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>         Attachments: MAHOUT-890.patch, addSynth.patch, logtrees.patch, simpleFPG.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAHOUT-890) Performance issue in FPGrowth

Posted by "tom pierce (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

tom pierce updated MAHOUT-890:
------------------------------

    Status: Patch Available  (was: Open)
    
> Performance issue in FPGrowth
> -----------------------------
>
>                 Key: MAHOUT-890
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-890
>             Project: Mahout
>          Issue Type: Bug
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.6
>            Reporter: tom pierce
>         Attachments: MAHOUT-890.patch, addSynth.patch, logtrees.patch, simpleFPG.patch, smallexample.dat
>
>
> I've encountered a dataset which indicates there is probably a performance bug lurking in the FPGrowth implementation.  This set may be a bit of an unusual target for FPG - there's a relatively modest number itemsets, and many items with a Zipfy distribution.  I am attaching a patch (addSynth.patch) to add a similar dataset as core/src/test/resources/FPGsynth.dat.
> FPGsynth.dat can take minutes or a few hours to process, depending on how it is grouped out to machines.  If run in sequential mode, or with "-g 50" it will take considerable time.  Most reducers/"anchor items" are processed quickly, but a small number take a handful of minutes, and one or two take a long time.  If you experiment with this data, I suggest using  '-s 50 -regex "[ ]+"'. 
> Digging into this, I've found that the tree pruning code sometimes creates surprising trees.  One oddity I've observed is 0-count nodes, sometimes with non-zero children.  The other is that sometimes subtrees seem to get repeated.  I'm attaching a sample input file (smallexample.dat, use the whitespace regex with this one, too) and a patch which adds some logging in pruneFPTree and growthBottomUp which will print out some interesting trees when run with the smallexample.dat input.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira