You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by Tom Pierce <tc...@cloudera.com> on 2011/11/14 20:58:24 UTC
FP Growth oddities
Hi,
I've been playing around a bit with the FPGrowth implementation, and I
have some questions.
Is it intentional that all frequent itemsets including a given item
are mined for each (frequent) item? For example, consider a simple
example from FPGrowthTest.java:
X (occurs 12 times)
Y (occurs 4 times)
X Y (occurs 10 times)
The expected result is:
[(Y,([Y],14), ([X, Y],10)), (X,([X],22), ([X, Y],10))]
(This can be read as "When considering patterns with Y, [Y] and [X. Y]
were found to occur 14 and 10 times, respectively..")
Notice [X, Y] is found twice; a key efficiency improvement of FPGrowth
is that you can avoid this by only mining for frequent sets that
include the current base item and more-frequent items. Seems like
that is not taken advantage of here... Is there a reason for this, or
is this a bug?
Related to this is the grouping strategy - currently, when creating
batches of items to assign to reducers, the N most-frequent items form
a group, then the next N form a group, and so on. Intuitively, it
makes more sense to round-robin the examples as I would expect the
"hardest" items to (roughly) band together in frequency-order. Was
there a reason behind this choice?
I am also a little confused by the reporting of patterns (similar to
Vipul Pandey in MAHOUT-617). To again use an example from the tests:
[(Z,([X, Y, Z],11)), (Y,([Y],25), ([X, Y],21), ([X, Y, Z],11)),
(X,([X],33), ([X, Y],21), ([X, Y, Z],11))]
What is the rationale for not reporting [X, Z], for example? I
understand how you could generate these later, but why would you have
to?
I have also found two bugs; one is that the default number of groups
is mis-advertised. The other is that sometimes children are
multiply-added to nodes in conditional FP trees (so a node is its own
sibling). These are in Jira with patches as MAHOUT-885 and
MAHOUT-886.
In general, this code seems like it's more complicated (and difficult
to follow) than it could be. Part of that is simply due to the
object-avoidance, but I suspect that a lot is related to planned
improvements (for example I don't see where FP-Bonsai pruning is
implemented, but comments indicate someone is/was working on it).
Would folks be open to a major refactoring in here, or are people
actively working on fleshing bits of this out with new functionality?
-tom
Re: FP Growth oddities
Posted by Robin Anil <ro...@gmail.com>.
On Mon, Nov 14, 2011 at 1:58 PM, Tom Pierce <tc...@cloudera.com> wrote:
> Hi,
>
> I've been playing around a bit with the FPGrowth implementation, and I
> have some questions.
>
> Is it intentional that all frequent itemsets including a given item
> are mined for each (frequent) item? For example, consider a simple
> example from FPGrowthTest.java:
>
> X (occurs 12 times)
> Y (occurs 4 times)
> X Y (occurs 10 times)
>
> The expected result is:
>
> [(Y,([Y],14), ([X, Y],10)), (X,([X],22), ([X, Y],10))]
>
> (This can be read as "When considering patterns with Y, [Y] and [X. Y]
> were found to occur 14 and 10 times, respectively..")
>
> Notice [X, Y] is found twice; a key efficiency improvement of FPGrowth
> is that you can avoid this by only mining for frequent sets that
> include the current base item and more-frequent items. Seems like
> that is not taken advantage of here... Is there a reason for this, or
> is this a bug?
>
No, the mining here is to lookup top N patterns for each attribute
>
> Related to this is the grouping strategy - currently, when creating
> batches of items to assign to reducers, the N most-frequent items form
> a group, then the next N form a group, and so on. Intuitively, it
> makes more sense to round-robin the examples as I would expect the
> "hardest" items to (roughly) band together in frequency-order. Was
> there a reason behind this choice?
>
I cant recall the rationale, but the basic idea was this type of
partitioning allowed all patterns to be generated. Round robin will miss
some
See the paper http://infolab.stanford.edu/~echang/recsys08-69.pdf
>
> I am also a little confused by the reporting of patterns (similar to
> Vipul Pandey in MAHOUT-617). To again use an example from the tests:
>
> [(Z,([X, Y, Z],11)), (Y,([Y],25), ([X, Y],21), ([X, Y, Z],11)),
> (X,([X],33), ([X, Y],21), ([X, Y, Z],11))]
>
> What is the rationale for not reporting [X, Z], for example? I
> understand how you could generate these later, but why would you have
> to?
>
Ah. XZ - 11 is a redundant pattern. We try to mine only closed subsets. so
XYZ-11 already has that information.
>
> I have also found two bugs; one is that the default number of groups
> is mis-advertised. The other is that sometimes children are
> multiply-added to nodes in conditional FP trees (so a node is its own
> sibling). These are in Jira with patches as MAHOUT-885 and
> MAHOUT-886.
>
I will check this out. Seems like a bug.
>
> In general, this code seems like it's more complicated (and difficult
> to follow) than it could be. Part of that is simply due to the
> object-avoidance, but I suspect that a lot is related to planned
> improvements (for example I don't see where FP-Bonsai pruning is
> implemented, but comments indicate someone is/was working on it).
> Would folks be open to a major refactoring in here, or are people
> actively working on fleshing bits of this out with new functionality?
>
The current implementation tries to minimize memory usage by a lot. More
objects you create more performance hit it takes. But I am open to
refactoring the code to make it more readable.
FP Bonsai is implemented as a single function that prunes the nodes before
mining. You will find it as a small function there. Its not a significant
piece of code. Its already checked in I believe
>
> -tom
>