Posted to common-user@hadoop.apache.org by Raymond Jennings III <ra...@yahoo.com> on 2010/01/13 22:30:58 UTC

Apriori and association rules with Hadoop

Is it doable to create an Apriori app using map-reduce? I am just starting out, and it's not clear to me how to create the next candidate sets based on a previous run. Does anyone have any experience with this?


      

Re: Apriori and association rules with Hadoop

Posted by Enis Soztutar <en...@gmail.com>.
Hi,

Unfortunately, the code base is not owned by me; I would need my previous
employer's permission to release it as open source.
On the other hand, with a decent knowledge of Hadoop, one can implement the
algorithm in a relatively short amount of time. I am also not sure about Hive,
but I think you can express the algorithm in Pig.

Anyway, I would love to see your dissertation when it is done. Could you
please send it to the mailing list?



Re: Apriori and association rules with Hadoop

Posted by Rob Stewart <ro...@googlemail.com>.
Hi Enis,

It is both a coincidence and excellent that you have managed to implement
the Apriori algorithm in Java MapReduce. As it happens, I am writing a
dissertation on Hadoop, in particular its query languages (JAQL, Hive and
Pig). One part is to compare the scope of each language and the ease of
programming. I know it is not possible to express the Apriori algorithm in
Hive or Pig alone, though it may be achieved by wrapping PigServer within
some Java code.

I'm unsure about the sensitivity of your code. Are you in a position to
release the source of your Apriori implementation? It would be of huge help
to me, and most probably to others too.

Many regards,


Rob Stewart



Re: Apriori and association rules with Hadoop

Posted by Enis Soztutar <en...@gmail.com>.
We have successfully implemented sequential Apriori (a variant of GSP) on
MapReduce. The candidate generation, counting and candidate selection steps
are all parallel. Each iteration runs two chained jobs.

A rough outline of the algorithm looks like this:
  L1 = generateL1() // a word-count-like job that finds the frequent 1-item patterns

  repeat {
     countCandidates()
     generateCandidates()
  } until no new candidates are generated
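
Just to illustrate the shape of the driver (this is not the actual code from
this thread), the loop above could be chained roughly like this in Java, where
generateL1, runCountJob, runGenerateJob and hasCandidates are hypothetical
helpers that configure, submit and inspect the individual jobs:

  // Illustrative sketch only; the helpers below are hypothetical wrappers
  // around the real job setup and HDFS output checks.
  Configuration conf = new Configuration();
  Path candidates = generateL1(conf);                // word-count-like seed job
  while (true) {
    Path frequent = runCountJob(conf, candidates);   // countCandidates()
    Path next = runGenerateJob(conf, frequent);      // generateCandidates()
    if (!hasCandidates(conf, next)) break;           // no new candidates -> stop
    candidates = next;
  }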

countCandidates() {
    the candidate set is already in HDFS
    every task JVM reads the candidate set into memory (a hash-tree can be used here)
    input: the data set, stored in sequence files
    in the mapper, for each candidate C that is contained in the input, emit <C, 1>
    in the reducer, sum up the values for candidate C
    emit the candidates with support greater than the threshold
}
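
To make this step concrete, here is a minimal Java sketch of the counting job
using the org.apache.hadoop.mapreduce API (a sketch only, not the
implementation described in this thread). It assumes, for illustration, that
each input value is one transaction/sequence of space-separated items, that
the current candidate set is a plain text file on HDFS with one candidate per
line, and that its path and the support threshold are passed through the
hypothetical apriori.candidates and apriori.min.support properties:

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.io.InputStreamReader;
  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class CandidateCounting {

    public static class CountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);
      private final List<String> candidates = new ArrayList<String>();

      @Override
      protected void setup(Context context) throws IOException {
        // Every task reads the current candidate set into memory;
        // a hash-tree over the candidates would make the matching faster.
        Configuration conf = context.getConfiguration();
        Path path = new Path(conf.get("apriori.candidates"));  // hypothetical property
        BufferedReader in = new BufferedReader(
            new InputStreamReader(path.getFileSystem(conf).open(path)));
        String line;
        while ((line = in.readLine()) != null) {
          if (!line.trim().isEmpty()) {
            candidates.add(line.trim());
          }
        }
        in.close();
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        List<String> sequence = Arrays.asList(value.toString().trim().split("\\s+"));
        // For each candidate contained in this record, emit <C, 1>.
        for (String c : candidates) {
          if (containsInOrder(sequence, Arrays.asList(c.split("\\s+")))) {
            context.write(new Text(c), ONE);
          }
        }
      }

      // True if all items of c occur in s in order (a simple subsequence test).
      private static boolean containsInOrder(List<String> s, List<String> c) {
        int i = 0;
        for (String item : s) {
          if (i < c.size() && item.equals(c.get(i))) {
            i++;
          }
        }
        return i == c.size();
      }
    }

    public static class CountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

      @Override
      protected void reduce(Text candidate, Iterable<IntWritable> counts, Context context)
          throws IOException, InterruptedException {
        int support = 0;
        for (IntWritable c : counts) {
          support += c.get();
        }
        // Only candidates that reach the minimum support survive this pass.
        if (support >= context.getConfiguration().getInt("apriori.min.support", 2)) {
          context.write(candidate, new IntWritable(support));
        }
      }
    }
  }

In a real job the candidate file would more likely be shipped to the tasks
through the DistributedCache rather than read directly from HDFS in setup().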

generateCandidates() {
   mapper:
      for each frequent pattern P = [p1, p2, ..., pn]
         emit < [p2, ..., pn], p1 >      // value flagged as a "first" item
         emit < [p1, ..., pn-1], pn >    // value flagged as a "last" item

   reducer:  // keys are sub-patterns [k1, ..., kn-1] of length n-1,
             // values are the individual flagged items
      for each "first" value v_f
         for each "last" value v_l
            emit [ v_f, k1, ..., kn-1, v_l ]   // a candidate of length n+1
}
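
And a corresponding sketch of the candidate-generation job, again purely
illustrative: it assumes the frequent patterns from the counting job arrive
one per line as space-separated items, and it encodes the first/last flag by
prefixing the value with "F:" or "L:":

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  public class CandidateGeneration {

    public static class GenMapper extends Mapper<LongWritable, Text, Text, Text> {

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // A frequent pattern P = [p1, p2, ..., pn], one per input line.
        String[] p = value.toString().trim().split("\\s+");
        if (p.length == 0 || p[0].isEmpty()) {
          return;
        }
        String suffix = join(p, 1, p.length);       // [p2, ..., pn]
        String prefix = join(p, 0, p.length - 1);   // [p1, ..., pn-1]
        // Flag the dangling item so the reducer can tell firsts from lasts.
        context.write(new Text(suffix), new Text("F:" + p[0]));
        context.write(new Text(prefix), new Text("L:" + p[p.length - 1]));
      }

      private static String join(String[] items, int from, int to) {
        StringBuilder sb = new StringBuilder();
        for (int i = from; i < to; i++) {
          if (sb.length() > 0) {
            sb.append(' ');
          }
          sb.append(items[i]);
        }
        return sb.toString();
      }
    }

    public static class GenReducer extends Reducer<Text, Text, Text, NullWritable> {

      @Override
      protected void reduce(Text subPattern, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        List<String> firsts = new ArrayList<String>();
        List<String> lasts = new ArrayList<String>();
        for (Text v : values) {
          String s = v.toString();
          if (s.startsWith("F:")) {
            firsts.add(s.substring(2));
          } else {
            lasts.add(s.substring(2));
          }
        }
        // Every (first, last) pair around the shared sub-pattern yields a
        // candidate of length n+1: [v_f, sub-pattern ..., v_l].
        String middle = subPattern.toString();
        for (String f : firsts) {
          for (String l : lasts) {
            String candidate = middle.isEmpty() ? f + " " + l : f + " " + middle + " " + l;
            context.write(new Text(candidate), NullWritable.get());
          }
        }
      }
    }
  }

In practice one would typically also prune the new candidates whose length-n
sub-patterns are not all frequent before the next counting pass, which is the
usual Apriori pruning step.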

Hope this helps,

Enis


Re: Apriori and association rules with Hadoop

Posted by Alex Kozlov <al...@cloudera.com>.
Hi Raymond, there are a few projects that implement Apriori on Hadoop (see
http://issues.apache.org/jira/browse/MAHOUT-108). While the itemset counting
is trivial in Map/Reduce, the real smarts of the algorithm are in the
candidate selection implementation. There are ways to do the selection using
Map/Reduce, but it depends on the actual algorithm, which may need
adjustments for the Map/Reduce framework as well.

Can you provide more details?

