You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2011/08/21 21:42:27 UTC

[jira] [Updated] (MAHOUT-629) FP Growth performance improvement

     [ https://issues.apache.org/jira/browse/MAHOUT-629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-629:
-----------------------------

    Fix Version/s:     (was: 0.6)

This still isn't in a position to commit without a patch.

> FP Growth performance improvement
> ---------------------------------
>
>                 Key: MAHOUT-629
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-629
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.4, 0.5
>            Reporter: Jaroslaw Odzga
>            Assignee: Robin Anil
>         Attachments: FPGrowth.java, performance_improvement.txt
>
>
> Instead of calculating patterns ending with given attribute multiple times they can be calculated just once. Depending on for how many features patterns are generated, speedup can be huge. More feature included - greater speedup. For test data set (88162 real life 'basket' transactions), if all features were selected (i.e. we want to generate patterns for all items in transactions), patterns generation time dropped from 1h 15min to 8sec. For parallel fpgrowth, where the number of requested features is limited the speedup is not that dramatic, but still noticeable. Basically work done is always smaller than before the patch (as patterns for each item are calculated at most once).
> The improvement is a variation of base algorithm in situation where we want to generate patterns for only subset of items (let's call this subset A). Given that items are ordered by descending frequency it is enough to calculate only patterns ending on any item with frequency smaller or equal to the most frequent item in the subset A. The heaps for each item are initialized upfront and merged after processing every item.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira