Posted to user@mahout.apache.org by Dave Fry <df...@upstreamsoftware.com> on 2011/12/02 03:51:09 UTC

Frequent itemset mining

Hi!  I apologize for the newbie question, I'm just getting started with
Mahout.

On the "Overview" page on Mahout's website:
https://cwiki.apache.org/confluence/display/MAHOUT/Overview

It lists these as the four primary targeted use cases for Mahout:
1) Recommendation mining takes users' behavior and from that tries to find
items users might like.
2) Clustering takes e.g. text documents and groups them into groups of
topically related documents.
3) Classification learns from existing categorized documents what
documents of a specific category look like and is able to assign unlabelled
documents to the (hopefully) correct category.
4) Frequent itemset mining takes a set of item groups (terms in a query
session, shopping cart content) and identifies which individual items
usually appear together.

But, based on the Mahout documentation that I've read through, I can't seem
to find a clear mapping from that use case description to where in the
Mahout distribution I should be looking.  I've found several leads for use
case #1, but #4 seems to be a bit of a mystery (and searches for "frequent
itemset mining" don't seem to lead me to where I need to go.)

Basically, I'm looking for the answer to the question "Which items appear
most often with item X in browse histories and shopping carts?".  (As
opposed to "Based on what I know about your preferences, here are the items
that I predict you would be most likely to browse/add to your cart".)

Any help is appreciated!
Thanks,
Dave

Re: Frequent itemset mining

Posted by Ted Dunning <te...@gmail.com>.

How is this different from the fairly obvious Pig query to count
co-occurring items?

On Thu, Dec 1, 2011 at 6:51 PM, Dave Fry <df...@upstreamsoftware.com> wrote:

> Basically, I'm looking for the answer to the question "Which items appear
> most often with item X in browse histories and shopping carts?".
>
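
For readers without Pig at hand, the counting being described can be sketched in a few lines of Python. This is only an illustration of plain co-occurrence counting, not a Mahout or Pig API; the function name and the sample baskets are made up for the example.

```python
from collections import Counter


def cooccurrence_counts(baskets, target):
    """Count how often each other item shares a basket with `target`."""
    counts = Counter()
    for basket in baskets:
        items = set(basket)  # ignore repeats within a single basket
        if target in items:
            counts.update(items - {target})
    return counts


# Hypothetical sample data: each inner list is one cart or browse session.
baskets = [
    ["milk", "bread", "eggs"],
    ["milk", "bread"],
    ["bread", "jam"],
    ["milk", "eggs"],
]

# The two items seen most often together with "milk".
top = cooccurrence_counts(baskets, "milk").most_common(2)
print(top)
```

On data that fits on one machine this is probably all that is needed; the same grouping and counting is what a Pig or MapReduce job would distribute.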

Re: Frequent itemset mining

Posted by tom pierce <tc...@apache.org>.

I found some very non-intuitive performance behavior in FPG when I was
trying it out, though I never quite tracked down why it was happening. 

I actually wound up contributing an alternate implementation; would you
re-try your example and add the "-2" flag, which selects the other
implementation?  I'd be curious to hear if that resolves your issue.

Thanks,
-tom

On 06/06/2012 02:00 AM, Sean Owen wrote:
> It wouldn't surprise me, though I don't know this implementation or
> your setup. Locally, you're not really running Hadoop -- it's all
> local, and there is no HDFS to replicate and such. You are saving the
> big overhead of shuffling data across machines, and the overhead of
> starting new workers. For small input, the overhead can indeed be most
> of the run time.

Re: Frequent itemset mining

Posted by Sean Owen <sr...@gmail.com>.

It wouldn't surprise me, though I don't know this implementation or
your setup. Locally, you're not really running Hadoop -- it's all
local, and there is no HDFS to replicate and such. You are saving the
big overhead of shuffling data across machines, and the overhead of
starting new workers. For small input, the overhead can indeed be most
of the run time.

On Wed, Jun 6, 2012 at 3:19 AM, Alex Kozlov <al...@cloudera.com> wrote:
> The documentation says:
>
> Running parallel FPGrowth is as easy as changing the flag to -method
> mapreduce and adding the number of groups parameter, e.g. -g 20 for 20
> groups. First, let's run the above sample test in map-reduce mode:
>
> bin/mahout fpg \
>     -i core/src/test/resources/retail.dat \
>     -o patterns \
>     -k 50 \
>     -method mapreduce \
>     -regex '[\ ]' \
>     -s 2
>
> The above test took 102 seconds on a dual-core laptop, vs. 609 seconds in
> sequential mode (with 5 gigs of RAM allocated). In a separate test,
> the first 1000 lines of retail.dat took 20 seconds in map/reduce vs. 30
> seconds in sequential mode.
>
> Running the example above, I get times more like hours (with both the
> sequential and mapreduce methods) on 48GB boxes.  Am I doing something
> wrong?  Should it be minutes instead of seconds?
> --
> Alex K

Re: Frequent itemset mining

Posted by Alex Kozlov <al...@cloudera.com>.

The documentation says:

Running parallel FPGrowth is as easy as changing the flag to -method
mapreduce and adding the number of groups parameter, e.g. -g 20 for 20
groups. First, let's run the above sample test in map-reduce mode:

bin/mahout fpg \
     -i core/src/test/resources/retail.dat \
     -o patterns \
     -k 50 \
     -method mapreduce \
     -regex '[\ ]' \
     -s 2

The above test took 102 seconds on a dual-core laptop, vs. 609 seconds in
sequential mode (with 5 gigs of RAM allocated). In a separate test,
the first 1000 lines of retail.dat took 20 seconds in map/reduce vs. 30
seconds in sequential mode.

Running the example above, I get times more like hours (with both the
sequential and mapreduce methods) on 48GB boxes.  Am I doing something
wrong?  Should it be minutes instead of seconds?
--
Alex K
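
To reproduce the smaller "first 1000 lines" comparison, something like the following Python sketch could be used to cut the input down and to print the two command lines. The flags are copied from the documentation quoted above; the value "sequential" for -method, the helper names, and the file names are assumptions for illustration only.

```python
import itertools
import shlex


def head(src_path, dst_path, n):
    """Copy the first n lines of src_path to dst_path; return lines written."""
    written = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in itertools.islice(src, n):
            dst.write(line)
            written += 1
    return written


def fpg_command(input_path, output_dir, method):
    """Build the argument list for the fpg run quoted above.

    Flags come from the quoted documentation; passing "sequential"
    as the -method value is an assumption.
    """
    return ["bin/mahout", "fpg",
            "-i", input_path,
            "-o", output_dir,
            "-k", "50",
            "-method", method,
            "-regex", "[\\ ]",
            "-s", "2"]


# Print both variants so they can be pasted into a shell and timed.
for method in ("sequential", "mapreduce"):
    print(shlex.join(fpg_command("retail-1000.dat", "patterns-" + method, method)))
```

Calling head("core/src/test/resources/retail.dat", "retail-1000.dat", 1000) first would create the truncated input; timing each printed command on the same machine then gives a like-for-like comparison with the numbers in the documentation.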

On Mon, Dec 5, 2011 at 12:50 PM, Isabel Drost <is...@apache.org> wrote:

> On 02.12.2011 Tom Pierce wrote:
> > These programs are actually exposed through the main mahout program; if you
> > run:
> >
> > $MAHOUT_HOME/bin/mahout fpg
> >
> > it will run the Frequent Pattern Growth algorithm (aka frequent itemset
> > mining).
>
> There is also quite a bit of documentation on the wiki:
>
> https://cwiki.apache.org/MAHOUT/parallel-frequent-pattern-mining.html (also
> includes a link to the original research publication).
>
> Isabel
>
>

Re: Frequent itemset mining

Posted by Isabel Drost <is...@apache.org>.

On 02.12.2011 Tom Pierce wrote:
> These programs are actually exposed through the main mahout program; if you
> run:
> 
> $MAHOUT_HOME/bin/mahout fpg
> 
> it will run the Frequent Pattern Growth algorithm (aka frequent itemset
> mining).

There is also quite a bit of documentation on the wiki:

https://cwiki.apache.org/MAHOUT/parallel-frequent-pattern-mining.html (also
includes a link to the original research publication).

Isabel

Re: Frequent itemset mining

Posted by Dave Fry <df...@upstreamsoftware.com>.

Awesome, thank you!!

2011/12/2 Tom Pierce <tc...@cloudera.com>

> These programs are actually exposed through the main mahout program; if you
> run:
>
> $MAHOUT_HOME/bin/mahout fpg
>
> it will run the Frequent Pattern Growth algorithm (aka frequent itemset
> mining).
>
> Running the command above will show you what parameters are
> required/available, including a switch to run in mapreduce or
> sequential (i.e. single machine) mode.  Most params should be
> straightforward, but this info may be helpful:
>
> The input is expected to be plain text with one itemset per line.
>
> The splitterPattern/regex will be used to split a line into items;
> it defaults to a 'comma with optional whitespace' pattern.
>
> If you run in mapreduce mode, the "output" directory will have several
> subdirs; you'll want to look in the frequentpatterns subdir and run:
>
> $MAHOUT_HOME/bin/mahout seqdumper -s ${OUT}/frequentpatterns/part-r-00000
>
> on each of the "part*" files in that directory to see the frequent
> patterns.
>
> -tom

Re: Frequent itemset mining

Posted by Tom Pierce <tc...@cloudera.com>.

These programs are actually exposed through the main mahout program; if you run:

$MAHOUT_HOME/bin/mahout fpg

it will run the Frequent Pattern Growth algorithm (aka frequent itemset mining).

Running the command above will show you what parameters are
required/available, including a switch to run in mapreduce or
sequential (i.e. single machine) mode.  Most params should be
straightforward, but this info may be helpful:

The input is expected to be plain text with one itemset per line.

The splitterPattern/regex will be used to split a line into items;
it defaults to a 'comma with optional whitespace' pattern.
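
As a rough illustration of that input format, here is a small Python sketch that splits one-itemset-per-line text into items. The regex is an assumed stand-in for the 'comma with optional whitespace' default (the exact pattern Mahout compiles may differ), and the function name and sample lines are hypothetical.

```python
import re

# Assumed stand-in for the 'comma with optional whitespace' default;
# the exact pattern Mahout uses may differ.
SPLITTER = re.compile(r"\s*,\s*")


def parse_transactions(lines, splitter=SPLITTER):
    """Turn text lines (one itemset per line) into lists of item strings."""
    transactions = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        transactions.append([item for item in splitter.split(line) if item])
    return transactions


sample = ["milk, bread ,eggs", "", "bread,jam"]
print(parse_transactions(sample))
# [['milk', 'bread', 'eggs'], ['bread', 'jam']]

# Space-separated input needs a different pattern, which is what the
# -regex option is for. The item IDs below are made-up sample values.
print(parse_transactions(["39 73 74"], re.compile(r"[ ]")))
# [['39', '73', '74']]
```

Checking a few input lines this way before a long run can catch a mismatched splitter, which would otherwise make every line look like a single item.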

If you run in mapreduce mode, the "output" directory will have several
subdirs; you'll want to look in the frequentpatterns subdir and run:

$MAHOUT_HOME/bin/mahout seqdumper -s ${OUT}/frequentpatterns/part-r-00000

on each of the "part*" files in that directory to see the frequent patterns.
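
To give a feel for what a "frequent pattern" with a minimum support threshold is, here is a tiny brute-force Python sketch. It is deliberately not FP-Growth and not Mahout's implementation: it simply enumerates small itemsets and keeps those seen in at least min_support transactions. All names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force frequent itemsets (not FP-Growth).

    Counts every itemset of up to max_size items and keeps those that
    occur in at least min_support transactions.
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # canonical order, no repeats
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical sample data: one list per shopping cart.
transactions = [
    ["milk", "bread", "eggs"],
    ["milk", "bread"],
    ["bread", "jam"],
    ["milk", "eggs"],
]

result = frequent_itemsets(transactions, min_support=2)
# Print the most frequent itemsets first, ties in alphabetical order.
for itemset, n in sorted(result.items(), key=lambda kv: (-kv[1], kv[0])):
    print(n, itemset)
```

Enumerating itemsets like this blows up quickly as transactions get longer, which is the cost that FP-Growth is designed to avoid on real data.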

-tom

2011/12/2 戴清灏 <ro...@gmail.com>:
> For the sequential implementation, fpgrowth.java might be the first place to look.
> For the parallel implementation, pfpgrowth.java might be.
> There are 5 steps in total, and 4 of them are mapreduce.

Re: Frequent itemset mining

Posted by 戴清灏 <ro...@gmail.com>.

For the sequential implementation, fpgrowth.java might be the first place to look.
For the parallel implementation, pfpgrowth.java might be.
There are 5 steps in total, and 4 of them are mapreduce.

Sent from my mobile phone
On 2011-12-2 at 12:48 PM, "Dave Fry" <df...@upstreamsoftware.com> wrote:

> That would be fantastic, thank you!
>
> In the meantime, can you direct me to where in the source I should start
> looking?  (i.e., which class would be the entry point I'm looking for?)

Re: Frequent itemset mining

Posted by Dave Fry <df...@upstreamsoftware.com>.

That would be fantastic, thank you!

In the meantime, can you direct me to where in the source I should start
looking?  (i.e., which class would be the entry point I'm looking for?)

2011/12/1 戴清灏 <ro...@gmail.com>

> There is indeed a lack of documentation on frequent pattern mining usage.
> You are not the first one to point out the need for it.
> I would be pleased to write some, since I've read almost all of its
> source code.

Re: Frequent itemset mining

Posted by 戴清灏 <ro...@gmail.com>.

There is indeed a lack of documentation on frequent pattern mining usage.
You are not the first one to point out the need for it.
I would be pleased to write some, since I've read almost all of its
source code.

On Friday, December 2, 2011, Dave Fry wrote:

> Hi!  I apologize for the newbie question, I'm just getting started with
> Mahout.
>
> On the "Overview" page on Mahout's website:
> https://cwiki.apache.org/confluence/display/MAHOUT/Overview
>
> It lists these as the four primary targeted use cases for Mahout:
> 1) Recommendation mining takes users' behavior and from that tries to find
> items users might like.
> 2) Clustering takes e.g. text documents and groups them into groups of
> topically related documents.
> 3) Classification learns from existing categorized documents what
> documents of a specific category look like and is able to assign unlabelled
> documents to the (hopefully) correct category.
> 4) Frequent itemset mining takes a set of item groups (terms in a query
> session, shopping cart content) and identifies which individual items
> usually appear together.
>
> But, based on the Mahout documentation that I've read through, I can't seem
> to find a clear mapping from that use case description to where in the
> Mahout distribution I should be looking.  I've found several leads for use
> case #1, but #4 seems to be a bit of a mystery (and searches for "frequent
> itemset mining" don't seem to lead me to where I need to go.)
>
> Basically, I'm looking for the answer to the question "Which items appear
> most often with item X in browse histories and shopping carts?".  (As
> opposed to "Based on what I know about your preferences, here are the items
> that I predict you would be most likely to browse/add to your cart".)
>
> Any help is appreciated!
> Thanks,
> Dave
>


-- 
Regards,
Q