Posted to dev@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2014/07/06 00:19:02 UTC

Mahout V2

I compared spark-itemsimilarity to the Hadoop version on sample data that is 8.7M, 49,290 x 139,738, using my little 2-machine cluster and got the following speedup.

Platform        Elapsed Time
Mahout Hadoop   0:20:37
Mahout Spark    0:02:19

This isn’t quite apples to apples because the Spark version does all the dictionary management, which usually means two extra jobs tacked on before and after the Hadoop job. I’ve now run the complete pipeline on both Hadoop and Spark, and not only is the Spark version faster, the old Hadoop way also required keeping track of 10x more intermediate data and wiring up many more jobs to get the pipeline working. Now it’s just one job. You don’t need to worry about ID translation anymore, and you get over 10x faster completion: this is one of those times when speed meets ease-of-use.
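The "dictionary management" above is the translation between application-specific IDs and the contiguous integer indices the matrix math needs. As a rough illustration only (not Mahout's actual dictionary classes; the names here are made up), it boils down to a small bidirectional dictionary:

```scala
import scala.collection.mutable

// Minimal bidirectional ID dictionary: external string IDs in,
// contiguous integer row/column indices out, and back again for output.
// Illustrative sketch only -- not Mahout's actual dictionary code.
class IdDictionary {
  private val toIndex = mutable.HashMap[String, Int]()
  private val toId = mutable.ArrayBuffer[String]()

  // Look up an ID's index, assigning the next free index if unseen.
  def index(id: String): Int = toIndex.getOrElseUpdate(id, {
    toId += id
    toId.size - 1
  })

  // Translate an internal index back to the original application ID.
  def id(index: Int): String = toId(index)

  def size: Int = toId.size
}

object IdDictionaryDemo {
  def main(args: Array[String]): Unit = {
    val users = new IdDictionary
    println(users.index("user-abc")) // first ID gets index 0
    println(users.index("user-xyz")) // next gets 1
    println(users.index("user-abc")) // repeat lookup is stable: still 0
    println(users.id(1))             // back to "user-xyz" for output
  }
}
```

With Hadoop, keeping mappings like this consistent across jobs was what required those extra MapReduce passes before and after the main job; in the Spark version the dictionaries ride along with the rest of the single pipeline.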

Re: Mahout V2

Posted by Pat Ferrel <pa...@gmail.com>.
And in the dictionary management, which will affect memory usage and maybe speed a little.

It was pretty cool to do this all in the Mahout shell. I built the equivalent of the Hadoop recommender in about 10 lines of Scala in the shell, and the output had application-specific IDs too. Users are going to love this.
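For anyone wondering what those few lines compute: the heart of itemsimilarity is cooccurrence counting downsampled by the log-likelihood ratio (LLR) test, which keeps item pairs that cooccur anomalously often and discounts pairs that cooccur merely because both items are popular. Below is a self-contained toy sketch of that scoring, not the actual Mahout shell session or its DRM-based API; the object names and the tiny dataset are invented for illustration.

```scala
// Toy item-item similarity via the log-likelihood ratio (LLR) statistic,
// the test spark-itemsimilarity uses to keep only significant cooccurrences.
// Self-contained sketch; not the Mahout API.
object LlrDemo {
  def xLogX(x: Long): Double = if (x == 0) 0.0 else x * math.log(x.toDouble)

  // Entropy-style term used by the G^2 (LLR) statistic.
  def entropy(counts: Long*): Double =
    xLogX(counts.sum) - counts.map(xLogX).sum

  // 2x2 contingency table of user counts:
  // k11: users with both items, k12/k21: one item only, k22: neither.
  def llr(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
    val rowEntropy = entropy(k11 + k12, k21 + k22)
    val colEntropy = entropy(k11 + k21, k12 + k22)
    val matrixEntropy = entropy(k11, k12, k21, k22)
    math.max(0.0, 2.0 * (rowEntropy + colEntropy - matrixEntropy))
  }

  // Tiny made-up interaction set: user -> items interacted with.
  val interactions: Map[String, Set[String]] = Map(
    "u1" -> Set("iPad", "iPhone", "Surface"),
    "u2" -> Set("iPad", "iPhone"),
    "u3" -> Set("iPad", "iPhone", "Surface"),
    "u4" -> Set("Galaxy", "Nexus", "Surface"),
    "u5" -> Set("Galaxy", "Nexus", "Surface"),
    "u6" -> Set("Surface")
  )
  val n: Long = interactions.size.toLong

  // Score one item pair from its 2x2 contingency table.
  def score(a: String, b: String): Double = {
    val sets = interactions.values
    val withA = sets.count(_.contains(a)).toLong
    val withB = sets.count(_.contains(b)).toLong
    val both = sets.count(s => s.contains(a) && s.contains(b)).toLong
    llr(both, withA - both, withB - both, n - withA - withB + both)
  }

  def main(args: Array[String]): Unit = {
    val phones = score("iPad", "iPhone")
    val surface = score("iPad", "Surface")
    println(f"LLR(iPad, iPhone)  = $phones%.3f")  // high: genuine cooccurrence
    println(f"LLR(iPad, Surface) = $surface%.3f") // low: Surface is just popular
  }
}
```

The production job does this over a distributed matrix (essentially A'A on the interaction matrix, LLR-filtered per item pair), but the statistic it applies per pair is the one sketched here.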

On Jul 5, 2014, at 5:26 PM, Sebastian Schelter <ss...@googlemail.com> wrote:

Nice. There is even still a huge potential for optimization in the spark
bindings.

-s


Re: Mahout V2

Posted by Sebastian Schelter <ss...@googlemail.com>.
Nice. There is even still a huge potential for optimization in the spark
bindings.

-s

Re: Mahout V2

Posted by Andrew Musselman <an...@gmail.com>.
Crazy awesome.
