You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by Tom Pierce <tc...@cloudera.com> on 2011/11/14 20:58:24 UTC

FP Growth oddities

Hi,

I've been playing around a bit with the FPGrowth implementation, and I
have some questions.

Is it intentional that all frequent itemsets including a given item
are mined for each (frequent) item?  For example, consider a simple
example from FPGrowthTest.java:

X      (occurs 12 times)
Y      (occurs 4 times)
X Y   (occurs 10 times)

The expected result is:

[(Y,([Y],14), ([X, Y],10)), (X,([X],22), ([X, Y],10))]

(This can be read as "When considering patterns with Y, [Y] and [X. Y]
were found to occur 14 and 10 times, respectively..")

Notice [X, Y] is found twice; a key efficiency improvement of FPGrowth
is that you can avoid this by only mining for frequent sets that
include the current base item and more-frequent items.  Seems like
that is not taken advantage of here...  Is there a reason for this, or
is this a bug?

Related to this is the grouping strategy - currently, when creating
batches of items to assign to reducers, the N most-frequent items form
a group, then the next N form a group, and so on.  Intuitively, it
makes more sense to round-robin the examples as I would expect the
"hardest" items to (roughly) band together in frequency-order.  Was
there a reason behind this choice?

I am also a little confused by the reporting of patterns (similar to
Vipul Pandey in MAHOUT-617).  To again use an example from the tests:

[(Z,([X, Y, Z],11)), (Y,([Y],25), ([X, Y],21), ([X, Y, Z],11)),
(X,([X],33), ([X, Y],21), ([X, Y, Z],11))]

What is the rationale for not reporting [X, Z], for example?  I
understand how you could generate these later, but why would you have
to?

I have also found two bugs; one is that the default number of groups
is mis-advertised.  The other is that sometimes children are
multiply-added to nodes in conditional FP trees (so a node is its own
sibling).  These are in Jira with patches as MAHOUT-885 and
MAHOUT-886.

In general, this code seems like it's more complicated (and difficult
to follow) than it could be.  Part of that is simply due to the
object-avoidance, but I suspect that a lot is related to planned
improvements (for example I don't see where FP-Bonsai pruning is
implemented, but comments indicate someone is/was working on it).
Would folks be open to a major refactoring in here, or are people
actively working on fleshing bits of this out with new functionality?

-tom

Re: FP Growth oddities

Posted by Robin Anil <ro...@gmail.com>.
On Mon, Nov 14, 2011 at 1:58 PM, Tom Pierce <tc...@cloudera.com> wrote:

> Hi,
>
> I've been playing around a bit with the FPGrowth implementation, and I
> have some questions.
>
> Is it intentional that all frequent itemsets including a given item
> are mined for each (frequent) item?  For example, consider a simple
> example from FPGrowthTest.java:
>
> X      (occurs 12 times)
> Y      (occurs 4 times)
> X Y   (occurs 10 times)
>
> The expected result is:
>
> [(Y,([Y],14), ([X, Y],10)), (X,([X],22), ([X, Y],10))]
>
> (This can be read as "When considering patterns with Y, [Y] and [X. Y]
> were found to occur 14 and 10 times, respectively..")
>
> Notice [X, Y] is found twice; a key efficiency improvement of FPGrowth
> is that you can avoid this by only mining for frequent sets that
> include the current base item and more-frequent items.  Seems like
> that is not taken advantage of here...  Is there a reason for this, or
> is this a bug?
>
No, the mining here is to lookup top N patterns for each attribute

>
> Related to this is the grouping strategy - currently, when creating
> batches of items to assign to reducers, the N most-frequent items form
> a group, then the next N form a group, and so on.  Intuitively, it
> makes more sense to round-robin the examples as I would expect the
> "hardest" items to (roughly) band together in frequency-order.  Was
> there a reason behind this choice?
>
I cant recall the rationale, but the basic idea was this type of
partitioning allowed all patterns to be generated. Round robin will miss
some
See the paper http://infolab.stanford.edu/~echang/recsys08-69.pdf



>
> I am also a little confused by the reporting of patterns (similar to
> Vipul Pandey in MAHOUT-617).  To again use an example from the tests:
>
> [(Z,([X, Y, Z],11)), (Y,([Y],25), ([X, Y],21), ([X, Y, Z],11)),
> (X,([X],33), ([X, Y],21), ([X, Y, Z],11))]
>
> What is the rationale for not reporting [X, Z], for example?  I
> understand how you could generate these later, but why would you have
> to?
>
Ah. XZ - 11 is a redundant pattern. We try to mine only closed subsets. so
XYZ-11 already has that information.

>
> I have also found two bugs; one is that the default number of groups
> is mis-advertised.  The other is that sometimes children are
> multiply-added to nodes in conditional FP trees (so a node is its own
> sibling).  These are in Jira with patches as MAHOUT-885 and
> MAHOUT-886.
>
I will check this out. Seems like a bug.

>
> In general, this code seems like it's more complicated (and difficult
> to follow) than it could be.  Part of that is simply due to the
> object-avoidance, but I suspect that a lot is related to planned
> improvements (for example I don't see where FP-Bonsai pruning is
> implemented, but comments indicate someone is/was working on it).
> Would folks be open to a major refactoring in here, or are people
> actively working on fleshing bits of this out with new functionality?
>
The current implementation tries to minimize memory usage by a lot. More
objects you create more performance hit it takes. But I am open to
refactoring the code to make it more readable.

FP Bonsai is implemented as a single function that prunes the nodes before
mining. You will find it as a small function there. Its not a significant
piece of code. Its already checked in I believe

>
> -tom
>