Posted to dev@mahout.apache.org by Neal Clark <nc...@uvic.ca> on 2010/04/09 23:11:53 UTC

Mahout GSoC 2010: Association Mining

Hello,

I just wanted to introduce myself. I am an MSc. Computer Science
student at the University of Victoria. My research over the past year
has focused on developing and implementing an Apriori-based
frequent item-set mining algorithm for mining large data sets at low
minimum-support thresholds.

https://docs.google.com/Doc?docid=0ATkk_-6ZolXnZGZjeGYzNzNfOTBjcjJncGpkaA&hl=en

The main finding of the above report is that support levels as low as
0.001% on the webdocs (1.4 GB) dataset can be handled efficiently:
on a 100-core cluster, all frequent k=2 pairs can be calculated in
approximately 6 minutes.

I have an optimized Hadoop implementation of the k=2 algorithm
for generating frequent pairs, and I am currently extending the work to
item-sets of arbitrary length. The analysis of the extended approach will
be complete within the next two weeks.
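
To make the k=2 step concrete, here is a minimal single-machine sketch
of the pair-counting and pruning logic (all names are illustrative; the
real implementation described above runs this as a Hadoop job):

import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Minimal in-memory sketch of the k=2 (frequent pair) counting step.
public class PairCountSketch {

  // Count all pairs across the transactions, then prune those below
  // the minimum support count (the Apriori pruning step).
  public static Map<Long, Integer> frequentPairs(List<int[]> transactions,
                                                 int minSupport) {
    Map<Long, Integer> counts = new HashMap<Long, Integer>();
    for (int[] t : transactions) {
      Arrays.sort(t);
      for (int i = 0; i < t.length; i++) {
        for (int j = i + 1; j < t.length; j++) {
          // Pack the item pair (a, b), a < b, into a single long key.
          long key = ((long) t[i] << 32) | (t[j] & 0xffffffffL);
          Integer c = counts.get(key);
          counts.put(key, c == null ? 1 : c + 1);
        }
      }
    }
    Iterator<Map.Entry<Long, Integer>> it = counts.entrySet().iterator();
    while (it.hasNext()) {
      if (it.next().getValue() < minSupport) {
        it.remove();
      }
    }
    return counts;
  }
}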

Would you be interested in moving forward with such an implementation
as a GSoC project? If so, any comments/feedback would be very much
appreciated. If you are interested, I can create a proposal and submit
it to your issue tracker when it comes back online.

Thanks,

Neal.

Re: Mahout GSoC 2010: Association Mining

Posted by Neal Clark <nc...@gmail.com>.
I am interested in contributing. The next two weeks will be a little
busy, as it is the end of term, but I would be more than happy to work
on this project over the summer.

Perhaps you can also give me some advice on how to accomplish a few
tasks. Currently I am using NLineInputFormat to create and configure the
necessary map tasks. Each map task reads the entire transaction file
from Hadoop's DistributedCache. Since each map task processes a
portion of the count table, it can emit the final pairs and
confidence values without a reduce phase, so no extra intermediate data
is created.
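
Roughly, the job wiring looks like the sketch below (assuming the new
org.apache.hadoop.mapreduce API; PairMapper, the paths, and the split
size are placeholders rather than my actual code):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class AprioriK2Driver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Ship the transaction file to every node; each map task reads it all.
    DistributedCache.addCacheFile(new URI("/data/transactions.dat"), conf);

    Job job = new Job(conf, "apriori-k2");
    job.setJarByClass(AprioriK2Driver.class);
    job.setMapperClass(PairMapper.class); // loads transactions in setup()
    job.setNumReduceTasks(0);             // map-only: emit final pairs directly
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1000); // count-table lines per task
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path("/data/count-table"));
    FileOutputFormat.setOutputPath(job, new Path("/out/frequent-pairs"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}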

If the number of nodes and/or the input data were sufficiently large, it
would be useful to be able to both duplicate and split the input data
across the cluster. Adding a reduce phase is quite costly, as the
candidate item-sets, not just the frequent ones, must be sent to the
reducer; my tests have shown that there are at least an order of
magnitude more candidates than frequent item-sets. However, it is trivial
to throw a reduce phase on the end of the job to aggregate the counts
if necessary. This could be configured via a job parameter.
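
The toggle itself would only be a few lines in the driver; a sketch,
where "apriori.aggregate", "apriori.reducers", and CountSumReducer are
hypothetical names:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class AggregationToggle {

  // Hypothetical reducer that sums the partial support counts per pair.
  public static class CountSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text pair, Iterable<IntWritable> counts, Context ctx)
        throws java.io.IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) {
        sum += c.get();
      }
      ctx.write(pair, new IntWritable(sum));
    }
  }

  // Choose between map-only output and an aggregating reduce phase,
  // driven by a (hypothetical) job parameter.
  public static void configure(Job job) {
    boolean aggregate =
        job.getConfiguration().getBoolean("apriori.aggregate", false);
    if (aggregate) {
      job.setReducerClass(CountSumReducer.class);
      job.setNumReduceTasks(job.getConfiguration().getInt("apriori.reducers", 1));
    } else {
      job.setNumReduceTasks(0); // map-only: nothing shuffled to reducers
    }
  }
}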

Do you have any ideas on how to best distribute the configuration and
input data? I have considered writing a custom InputSplit or InputFormat
to split the input data; however, it is not clear how the
configuration data would then be distributed. I suppose keeping
NLineInputFormat and reading the transactions directly from HDFS is an
option, but then we would lose the benefit of Hadoop's locality
awareness.
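
For reference, the HDFS-direct option would look roughly like this
sketch (the "apriori.transactions.path" property and the class name are
made up for illustration):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Each map task opens the transaction file from HDFS in setup() rather
// than reading a local DistributedCache copy.
public class HdfsTransactionMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    Path transactions = new Path(conf.get("apriori.transactions.path"));
    FileSystem fs = transactions.getFileSystem(conf);
    BufferedReader in =
        new BufferedReader(new InputStreamReader(fs.open(transactions)));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        // Parse the transaction and cache it in memory for pair counting.
      }
    } finally {
      in.close();
    }
  }
}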

Thanks,

Neal.

Re: Mahout GSoC 2010: Association Mining

Posted by Robin Anil <ro...@gmail.com>.
Like Ted said, it's a bit late for a GSoC proposal, but I am excited at the
possibility of improving the frequent pattern mining package. Check out the
current Parallel FPGrowth implementation in the code; you can find more
explanation of its usage on the Mahout wiki. Apriori should be trivially
parallelizable without the extra memory problem of PFPGrowth and should
scale well for large datasets. You can contribute it separately from GSoC,
and the Apache community always welcomes such contributions. The wiki should
help you get started on Mahout development, with the correct code style and
practices. Let me know if you have any doubts or thoughts.

Robin

Re: Mahout GSoC 2010: Association Mining

Posted by Ted Dunning <te...@gmail.com>.
Neal, I think that this might well be a useful contribution to Mahout, but,
if I am not mistaken, the deadline for student proposals for
GSoC has just passed.

That likely means that making this contribution an official GSoC project is
not possible.  I am sure that the Mahout community would welcome you as a
contributor even without official Google status.  If you would like to do
this, go ahead and propose what you want to do (when JIRA comes back or just
by email discussion) and you can get started.
