Posted to user@mahout.apache.org by Arnau Sanchez <py...@gmail.com> on 2016/09/22 08:49:49 UTC

spark-itemsimilarity slower than itemsimilarity

I've been using the Mahout itemsimilarity job for a while, with good results. I read that the new spark-itemsimilarity job is typically faster, by a factor of 10, so I wanted to give it a try. I must be doing something wrong because, on the same EMR infrastructure and the same data, the Spark job is slower than the old one (16 min vs 6 min). I took a small sample dataset (766k rating pairs) to compare numbers; here is the result:

Input ratings: http://download.zaudera.com/public/ratings

Infrastructure: emr-4.7.2 (spark 1.6.2, mahout 0.12.2)

Old itemsimilarity:

$ mahout itemsimilarity --input ratings --output itemsimilarity --booleanData TRUE --maxSimilaritiesPerItem 10 --similarityClassname SIMILARITY_COOCCURRENCE
[5m54s]

(logs: http://download.zaudera.com/public/itemsimilarity.out)

New spark-itemsimilarity:

$ mahout spark-itemsimilarity --input ratings --output spark-itemsimilarity --maxSimilaritiesPerItem 10 --master yarn-client
[15m51s]

(logs: http://download.zaudera.com/public/spark-itemsimilarity.out)

Any ideas? Thanks!

Re: spark-itemsimilarity slower than itemsimilarity

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Except for reading the input, it now takes ~5 minutes to train.


On Sep 30, 2016, at 5:12 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:

Yeah, I bet Sebastian is right. I see no reason not to try running with --master local[4] or some number of cores on localhost. This will avoid all serialization. With times that low and data that small, there is no benefit to separate machines.

We are using this with ~1TB of data. Using Mahout as a lib, we also set the partitioning to 4x the number of cores in the cluster with --conf spark.parallelism=384; on 3 very large machines it completes in 42 minutes. We also have 11 events, so 11 different input matrices.

I'm testing a way to train on only what is used in the model creation: take pairRDDs as input and filter out every interaction that is not from a user in the primary matrix. With 1TB this reduces the data to a very manageable size and reduces the size of the BiMaps dramatically (they require memory). It may not have as pronounced an effect on different data but will make the computation more efficient. With the Universal Recommender in the PredictionIO template we still use all user input in the queries, so you lose no precision and get real-time user interactions in your queries. If you want a more modern recommender than the old Mahout MapReduce, I'd strongly suggest you consider it, or at least use it as a model for building a recommender with Mahout. actionml.com/docs/ur


On Sep 30, 2016, at 10:39 AM, Sebastian <ss...@apache.org> wrote:

Hi Arnau,

I don't think that you can expect any speedups in your setup; your input data is way too small, and I think you are running only two concurrent tasks. Maybe you should try a larger sample of your data and more machines.

At the moment, it seems to me that the overheads of running in a distributed setting (task scheduling, serialization...) totally dominate the computation.

Best,
Sebastian

On 30.09.2016 11:11, Arnau Sanchez wrote:
> Hi!
> 
> Here you go: "ratings-clean" contains only pairs of (user, product) for those products with 4 or more user interactions (770k -> 465k):
> 
> https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0
> 
> The results:
> 
> 1 part of 465k:   3m41.361s
> 5 parts of 100k:  4m20.785s
> 24 parts of 20k: 10m44.375s
> 47 parts of 10k: 17m39.385s
> 
> On Fri, 30 Sep 2016 00:09:13 +0200 Sebastian <ss...@apache.org> wrote:
> 
>> Hi Arnau,
>> 
>> I had a look at your ratings file and it's kind of strange. It's pretty
>> tiny (770k ratings, 8MB), but it has more than 250k distinct items. Out
>> of these, only 50k have more than 3 interactions.
>> 
>> So I think the first thing that you should do is throw out all the items
>> with so few interactions. Item similarity computations are pretty
>> sensitive to the number of unique items; maybe that's why you don't see
>> much difference in the run times.
>> 
>> -s
>> 
>> 
>> On 29.09.2016 22:17, Arnau Sanchez wrote:
>>> --input ratings --output spark-itemsimilarity --maxSimilaritiesPerItem 10



Re: spark-itemsimilarity slower than itemsimilarity

Posted by Pat Ferrel <pa...@occamsmachete.com>.
Yeah, I bet Sebastian is right. I see no reason not to try running with --master local[4] or some number of cores on localhost. This will avoid all serialization. With times that low and data that small, there is no benefit to separate machines.
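For comparison, a local-mode run of the command from earlier in the thread would look something like this (local[4] is just an example core count; the other flags are the ones already used above):

$ mahout spark-itemsimilarity --input ratings --output spark-itemsimilarity \
    --maxSimilaritiesPerItem 10 --master local[4]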

We are using this with ~1TB of data. Using Mahout as a lib, we also set the partitioning to 4x the number of cores in the cluster with --conf spark.parallelism=384; on 3 very large machines it completes in 42 minutes. We also have 11 events, so 11 different input matrices.
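Presumably this refers to Spark's spark.default.parallelism setting. As a rough sketch, when submitting your own application that uses Mahout as a library, such a value is typically passed on the spark-submit command line (the class and jar names below are placeholders, not from this thread):

$ spark-submit --master yarn-client \
    --conf spark.default.parallelism=384 \
    --class com.example.TrainRecommender my-recommender-assembly.jar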

I'm testing a way to train on only what is used in the model creation: take pairRDDs as input and filter out every interaction that is not from a user in the primary matrix. With 1TB this reduces the data to a very manageable size and reduces the size of the BiMaps dramatically (they require memory). It may not have as pronounced an effect on different data but will make the computation more efficient. With the Universal Recommender in the PredictionIO template we still use all user input in the queries, so you lose no precision and get real-time user interactions in your queries. If you want a more modern recommender than the old Mahout MapReduce, I'd strongly suggest you consider it, or at least use it as a model for building a recommender with Mahout. actionml.com/docs/ur


On Sep 30, 2016, at 10:39 AM, Sebastian <ss...@apache.org> wrote:

Hi Arnau,

I don't think that you can expect any speedups in your setup; your input data is way too small, and I think you are running only two concurrent tasks. Maybe you should try a larger sample of your data and more machines.

At the moment, it seems to me that the overheads of running in a distributed setting (task scheduling, serialization...) totally dominate the computation.

Best,
Sebastian

On 30.09.2016 11:11, Arnau Sanchez wrote:
> Hi!
> 
> Here you go: "ratings-clean" contains only pairs of (user, product) for those products with 4 or more user interactions (770k -> 465k):
> 
> https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0
> 
> The results:
> 
> 1 part of 465k:   3m41.361s
> 5 parts of 100k:  4m20.785s
> 24 parts of 20k: 10m44.375s
> 47 parts of 10k: 17m39.385s
> 
> On Fri, 30 Sep 2016 00:09:13 +0200 Sebastian <ss...@apache.org> wrote:
> 
>> Hi Arnau,
>> 
>> I had a look at your ratings file and it's kind of strange. It's pretty
>> tiny (770k ratings, 8MB), but it has more than 250k distinct items. Out
>> of these, only 50k have more than 3 interactions.
>> 
>> So I think the first thing that you should do is throw out all the items
>> with so few interactions. Item similarity computations are pretty
>> sensitive to the number of unique items; maybe that's why you don't see
>> much difference in the run times.
>> 
>> -s
>> 
>> 
>> On 29.09.2016 22:17, Arnau Sanchez wrote:
>>> --input ratings --output spark-itemsimilarity --maxSimilaritiesPerItem 10


Re: spark-itemsimilarity slower than itemsimilarity

Posted by Sebastian <ss...@apache.org>.
Hi Arnau,

I don't think that you can expect any speedups in your setup; your input 
data is way too small, and I think you are running only two concurrent tasks. 
Maybe you should try a larger sample of your data and more machines.

At the moment, it seems to me that the overheads of running in a 
distributed setting (task scheduling, serialization...) totally dominate 
the computation.

Best,
Sebastian

On 30.09.2016 11:11, Arnau Sanchez wrote:
> Hi!
>
> Here you go: "ratings-clean" contains only pairs of (user, product) for those products with 4 or more user interactions (770k -> 465k):
>
> https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0
>
> The results:
>
> 1 part of 465k:   3m41.361s
> 5 parts of 100k:  4m20.785s
> 24 parts of 20k: 10m44.375s
> 47 parts of 10k: 17m39.385s
>
> On Fri, 30 Sep 2016 00:09:13 +0200 Sebastian <ss...@apache.org> wrote:
>
>> Hi Arnau,
>>
>> I had a look at your ratings file and it's kind of strange. It's pretty
>> tiny (770k ratings, 8MB), but it has more than 250k distinct items. Out
>> of these, only 50k have more than 3 interactions.
>>
>> So I think the first thing that you should do is throw out all the items
>> with so few interactions. Item similarity computations are pretty
>> sensitive to the number of unique items; maybe that's why you don't see
>> much difference in the run times.
>>
>> -s
>>
>>
>> On 29.09.2016 22:17, Arnau Sanchez wrote:
>>>  --input ratings --output spark-itemsimilarity --maxSimilaritiesPerItem 10

Re: spark-itemsimilarity slower than itemsimilarity

Posted by Arnau Sanchez <py...@gmail.com>.
Hi!

Here you go: "ratings-clean" contains only pairs of (user, product) for those products with 4 or more user interactions (770k -> 465k):

https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0

The results:

1 part of 465k:   3m41.361s
5 parts of 100k:  4m20.785s
24 parts of 20k: 10m44.375s
47 parts of 10k: 17m39.385s

On Fri, 30 Sep 2016 00:09:13 +0200 Sebastian <ss...@apache.org> wrote:

> Hi Arnau,
> 
> I had a look at your ratings file and it's kind of strange. It's pretty 
> tiny (770k ratings, 8MB), but it has more than 250k distinct items. Out 
> of these, only 50k have more than 3 interactions.
> 
> So I think the first thing that you should do is throw out all the items 
> with so few interactions. Item similarity computations are pretty 
> sensitive to the number of unique items; maybe that's why you don't see 
> much difference in the run times.
> 
> -s
> 
> 
> On 29.09.2016 22:17, Arnau Sanchez wrote:
> >  --input ratings --output spark-itemsimilarity --maxSimilaritiesPerItem 10  

Re: spark-itemsimilarity slower than itemsimilarity

Posted by Sebastian <ss...@apache.org>.
Hi Arnau,

I had a look at your ratings file and it's kind of strange. It's pretty 
tiny (770k ratings, 8MB), but it has more than 250k distinct items. Out 
of these, only 50k have more than 3 interactions.

So I think the first thing that you should do is throw out all the items 
with so few interactions. Item similarity computations are pretty 
sensitive to the number of unique items; maybe that's why you don't see 
much difference in the run times.
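For example, a rough sketch with awk (this assumes one comma-separated user,item pair per line with the item in the second field; file names are just illustrative):

$ awk -F',' '{n[$2]++} END {for (i in n) if (n[i] >= 4) print i}' ratings > items-to-keep
$ awk -F',' 'NR==FNR {keep[$1]; next} ($2 in keep)' items-to-keep ratings > ratings-clean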

-s


On 29.09.2016 22:17, Arnau Sanchez wrote:
>  --input ratings --output spark-itemsimilarity --maxSimilaritiesPerItem 10

Re: spark-itemsimilarity slower than itemsimilarity

Posted by Arnau Sanchez <py...@gmail.com>.
A Dropbox link now:

https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0

And here is the script I use to test different sizes/partitions (example: 10 parts of 10k):

#!/bin/sh
set -e -u

# Clean up local and HDFS output from previous runs
mkdir -p ratings-split
rm -rf ratings-split/part*
hdfs dfs -rm -r ratings-split spark-itemsimilarity temp
# Take the first N ratings and split them into fixed-size part files (here: 100k lines into 10k-line parts)
cat ~/input_ratings/part* 2>/dev/null | head -n100k | split -l10k -d - "ratings-split/part-"
# Copy the split input to HDFS and time the Spark job
hdfs dfs -mkdir -p ratings-split
hdfs dfs -copyFromLocal ratings-split
time mahout spark-itemsimilarity --input ratings-split/ --output spark-itemsimilarity \
  --maxSimilaritiesPerItem 10 --master yarn-client |& tee spark-itemsimilarity.out

Thanks!

On Thu, 29 Sep 2016 19:46:03 +0200 Arnau Sanchez <py...@gmail.com> wrote:

> Hi Sebastian,
> 
> That's weird, it works here. Anyway, a Dropbox link:
> 
> https://www.dropbox.com/sh/ex0d74scgvw11oc/AACXPNl17iQnHZZOeMMogLbfa?dl=0
> 
> Thanks!
> 
> On Thu, 29 Sep 2016 18:50:23 +0200 Sebastian <ss...@apache.org> wrote:
> 
> > Hi Arnau,
> > 
> > The links to your logfiles unfortunately don't work for me. Are you sure 
> > you set up Spark correctly? That can be a bit tricky in YARN settings; 
> > sometimes one machine just sits idle...
> > 
> > Best,
> > Sebastian
> > 
> > On 25.09.2016 18:01, Pat Ferrel wrote:  
> > > AWS EMR is usually not very well suited for Spark. Spark gets most of its speed from in-memory calculations, so to see speed gains you have to have enough memory. Partitioning will also help in many cases. If you read in data from a single file, that partitioning will usually follow the calculation throughout all intermediate steps. If the data is from a single file the partition count may be 1, and therefore it will only use one machine. The most recent Mahout snapshot (therefore the next release) allows you to pass in the partitioning for each event pair (this is only in the library use, not the CLI). To get this effect in the current release, try splitting the input into multiple files.
> > >
> > > I'm probably the one who reported the 10x speedup; that test used input from Kafka DStreams, which causes very small default partition sizes. Other comparisons for other calculations give a similar speedup. There is little question about Spark being much faster when used the way it is meant to be.
> > >
> > > I use Mahout as a library all the time in the Universal Recommender implemented in Apache PredictionIO. As a library we get greater control than the CLI. The CLI is really only a proof of concept, not really meant for production.
> > >
> > > BTW there is a significant algorithm benefit of the code behind spark-itemsimilarity that is probably more important than the speed increase and that is Correlated Cross-Occurrence, which allows the use of many indicators of user taste, not just the primary/conversion event, which is all any other CF-style recommender that I know of can use.
> > >
> > >
> > > On Sep 22, 2016, at 1:49 AM, Arnau Sanchez <py...@gmail.com> wrote:
> > >
> > > I've been using the Mahout itemsimilarity job for a while, with good results. I read that the new spark-itemsimilarity job is typically faster, by a factor of 10, so I wanted to give it a try. I must be doing something wrong because, on the same EMR infrastructure and the same data, the Spark job is slower than the old one (16 min vs 6 min). I took a small sample dataset (766k rating pairs) to compare numbers; here is the result:
> > >
> > > Input ratings: http://download.zaudera.com/public/ratings
> > >
> > > Infrastructure: emr-4.7.2 (spark 1.6.2, mahout 0.12.2)
> > >
> > > Old itemsimilarity:
> > >
> > > $ mahout itemsimilarity --input ratings --output itemsimilarity --booleanData TRUE --maxSimilaritiesPerItem 10 --similarityClassname SIMILARITY_COOCCURRENCE
> > > [5m54s]
> > >
> > > (logs: http://download.zaudera.com/public/itemsimilarity.out)
> > >
> > > New spark-itemsimilarity:
> > >
> > > $ mahout spark-itemsimilarity --input ratings --output spark-itemsimilarity --maxSimilaritiesPerItem 10 --master yarn-client
> > > [15m51s]
> > >
> > > (logs: http://download.zaudera.com/public/spark-itemsimilarity.out)
> > >
> > > Any ideas? Thanks!
> > >    

Re: spark-itemsimilarity slower than itemsimilarity

Posted by Sebastian <ss...@apache.org>.
Hi Arnau,

The links to your logfiles unfortunately don't work for me. Are you sure 
you set up Spark correctly? That can be a bit tricky in YARN settings; 
sometimes one machine just sits idle...
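One quick sanity check, assuming the YARN command-line tools are available on the master node, is to look at what the cluster is actually doing while the job runs:

$ yarn node -list        # how many NodeManagers are alive
$ yarn application -list # is the Spark application actually running on YARN?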

Best,
Sebastian

On 25.09.2016 18:01, Pat Ferrel wrote:
> AWS EMR is usually not very well suited for Spark. Spark gets most of its speed from in-memory calculations, so to see speed gains you have to have enough memory. Partitioning will also help in many cases. If you read in data from a single file, that partitioning will usually follow the calculation throughout all intermediate steps. If the data is from a single file the partition count may be 1, and therefore it will only use one machine. The most recent Mahout snapshot (therefore the next release) allows you to pass in the partitioning for each event pair (this is only in the library use, not the CLI). To get this effect in the current release, try splitting the input into multiple files.
>
> I'm probably the one who reported the 10x speedup; that test used input from Kafka DStreams, which causes very small default partition sizes. Other comparisons for other calculations give a similar speedup. There is little question about Spark being much faster when used the way it is meant to be.
>
> I use Mahout as a library all the time in the Universal Recommender implemented in Apache PredictionIO. As a library we get greater control than the CLI. The CLI is really only a proof of concept, not really meant for production.
>
> BTW there is a significant algorithm benefit of the code behind spark-itemsimilarity that is probably more important than the speed increase and that is Correlated Cross-Occurrence, which allows the use of many indicators of user taste, not just the primary/conversion event, which is all any other CF-style recommender that I know of can use.
>
>
> On Sep 22, 2016, at 1:49 AM, Arnau Sanchez <py...@gmail.com> wrote:
>
> I've been using the Mahout itemsimilarity job for a while, with good results. I read that the new spark-itemsimilarity job is typically faster, by a factor of 10, so I wanted to give it a try. I must be doing something wrong because, on the same EMR infrastructure and the same data, the Spark job is slower than the old one (16 min vs 6 min). I took a small sample dataset (766k rating pairs) to compare numbers; here is the result:
>
> Input ratings: http://download.zaudera.com/public/ratings
>
> Infrastructure: emr-4.7.2 (spark 1.6.2, mahout 0.12.2)
>
> Old itemsimilarity:
>
> $ mahout itemsimilarity --input ratings --output itemsimilarity --booleanData TRUE --maxSimilaritiesPerItem 10 --similarityClassname SIMILARITY_COOCCURRENCE
> [5m54s]
>
> (logs: http://download.zaudera.com/public/itemsimilarity.out)
>
> New spark-itemsimilarity:
>
> $ mahout spark-itemsimilarity --input ratings --output spark-itemsimilarity --maxSimilaritiesPerItem 10 --master yarn-client
> [15m51s]
>
> (logs: http://download.zaudera.com/public/spark-itemsimilarity.out)
>
> Any ideas? Thanks!
>

Re: spark-itemsimilarity slower than itemsimilarity

Posted by Arnau Sanchez <py...@gmail.com>.
The scaling issues with EMR+Spark may explain the weird performance I am seeing with Mahout's spark-itemsimilarity. I compared the running times with different partitions: the more partitions I feed the job, the more parallel processes it creates on the nodes and the more RAM it uses (some 100GB in total), which looks promising... but the longer the elapsed time. With an input of 100k ratings:

1 part of 100k  -> 1m50.080s
5 parts of 20k  -> 2m03.376s
10 parts of 10k -> 2m48.722s
20 parts of 5k  -> 4m22.107s

This also happens for bigger inputs. With >= 1M ratings I simply cannot get the job to finish, it's so slow.

> > If you use the temp Spark strategy

We run an EMR cluster 24/7. With Spark we expected faster data processing, so we could serve more up-to-date recommendations.

Thanks for your hints, Pat, much appreciated.

On Wed, 28 Sep 2016 15:23:27 -0700 Pat Ferrel <pa...@occamsmachete.com> wrote:

> The problem with EMR is that the Spark driver often needs to be as big as the executors, and EMR does not handle that. EMR worked fine for Hadoop MapReduce because the driver usually did not have to be scaled vertically. I suppose you could say EMR would work, but it does not solve the whole scaling problem.
> 
> What we use at my consulting group is custom orchestration code written in Terraform that creates a driver of the same size as the Spark executors. We then install Spark using Docker and Docker Swarm. Since the driver machine needs to have your application on it, this will always require some custom installation. HDFS is used for storage and needs to be permanent, whereas Spark may only be needed periodically. We save money by creating the Spark executors and driver, using them to write to the permanent HDFS, and then destroying them. This way, model creation with spark-itemsimilarity, or with some app that uses Mahout as a library, is only paid for at some small duty cycle. 
> 
> If you use the temp Spark strategy, create a model every week, and need 4 r3.8xlarge instances to do it in 1 hour, you only pay 1/168th of what you would for a permanent cluster. This brings the cost into a quite reasonable range. You are very unlikely to need machines that large anyway, but you could afford them if you only pay for the time they are actually used.
> 
>  
> On Sep 26, 2016, at 12:30 AM, Arnau Sanchez <py...@gmail.com> wrote:
> 
> On Sun, 25 Sep 2016 09:01:43 -0700 Pat Ferrel <pa...@occamsmachete.com> wrote:
> 
> > AWS EMR is usually not very well suited for Spark.  
> 
> What infrastructure would you recommend? Some EC2 instances provide lots of memory (though maybe not at the most competitive price: r3.8xlarge, 244 GB RAM).
> 
> My fault, I forgot to specify my original EMR setup: MASTER m3.xlarge (15 GB), 2 CORE r3.xlarge (30.5 GB), 2 TASK c4.xlarge (7.5 GB).
> 
> > If the data is from a single file the partition may be 1 and therefore it will only use one machine.   
> 
> Indeed, I saw that with MR itemsimilarity too; it yielded different times (and results) for different partitions. I'll do more tests on that. 
> 
> > The CLI is really only a proof of concept, not really meant for production.  
> 
> Noted.
> 
> > BTW there is a significant algorithm benefit of the code behind spark-itemsimilarity that is probably more important than the speed increase and that is Correlated Cross-Occurrence  
> 
> Great! I have yet to compare improvements in the recommendations themselves; I'll keep this in mind.
> 
> Thanks for your help.

Re: spark-itemsimilarity slower than itemsimilarity

Posted by Pat Ferrel <pa...@occamsmachete.com>.
The problem with EMR is that the Spark driver often needs to be as big as the executors, and EMR does not handle that. EMR worked fine for Hadoop MapReduce because the driver usually did not have to be scaled vertically. I suppose you could say EMR would work, but it does not solve the whole scaling problem.

What we use at my consulting group is custom orchestration code written in Terraform that creates a driver of the same size as the Spark executors. We then install Spark using Docker and Docker Swarm. Since the driver machine needs to have your application on it, this will always require some custom installation. HDFS is used for storage and needs to be permanent, whereas Spark may only be needed periodically. We save money by creating the Spark executors and driver, using them to write to the permanent HDFS, and then destroying them. This way, model creation with spark-itemsimilarity, or with some app that uses Mahout as a library, is only paid for at some small duty cycle. 

If you use the temp Spark strategy, create a model every week, and need 4 r3.8xlarge instances to do it in 1 hour, you only pay 1/168th of what you would for a permanent cluster. This brings the cost into a quite reasonable range. You are very unlikely to need machines that large anyway, but you could afford them if you only pay for the time they are actually used.
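To make the arithmetic explicit: a week has 7 × 24 = 168 hours, so paying for the cluster 1 hour per week comes to 1/168 (about 0.6%) of the cost of keeping the same machines running all the time.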

 
On Sep 26, 2016, at 12:30 AM, Arnau Sanchez <py...@gmail.com> wrote:

On Sun, 25 Sep 2016 09:01:43 -0700 Pat Ferrel <pa...@occamsmachete.com> wrote:

> AWS EMR is usually not very well suited for Spark.

What infrastructure would you recommend? Some EC2 instances provide lots of memory (though maybe not at the most competitive price: r3.8xlarge, 244 GB RAM).

My fault, I forgot to specify my original EMR setup: MASTER m3.xlarge (15 GB), 2 CORE r3.xlarge (30.5 GB), 2 TASK c4.xlarge (7.5 GB).

> If the data is from a single file the partition may be 1 and therefore it will only use one machine. 

Indeed, I saw that with MR itemsimilarity too; it yielded different times (and results) for different partitions. I'll do more tests on that. 

> The CLI is really only a proof of concept, not really meant for production.

Noted.

> BTW there is a significant algorithm benefit of the code behind spark-itemsimilarity that is probably more important than the speed increase and that is Correlated Cross-Occurrence

Great! I have yet to compare improvements in the recommendations themselves; I'll keep this in mind.

Thanks for your help.


Re: spark-itemsimilarity slower than itemsimilarity

Posted by Arnau Sanchez <py...@gmail.com>.
On Sun, 25 Sep 2016 09:01:43 -0700 Pat Ferrel <pa...@occamsmachete.com> wrote:

> AWS EMR is usually not very well suited for Spark.

What infrastructure would you recommend? Some EC2 instances provide lots of memory (though maybe not at the most competitive price: r3.8xlarge, 244 GB RAM).

My fault, I forgot to specify my original EMR setup: MASTER m3.xlarge (15 GB), 2 CORE r3.xlarge (30.5 GB), 2 TASK c4.xlarge (7.5 GB).

> If the data is from a single file the partition may be 1 and therefore it will only use one machine. 

Indeed, I saw that with MR itemsimilarity too; it yielded different times (and results) for different partitions. I'll do more tests on that. 

> The CLI is really only a proof of concept, not really meant for production.

Noted.

> BTW there is a significant algorithm benefit of the code behind spark-itemsimilarity that is probably more important than the speed increase and that is Correlated Cross-Occurrence

Great! I have yet to compare improvements in the recommendations themselves; I'll keep this in mind.

Thanks for your help.

Re: spark-itemsimilarity slower than itemsimilarity

Posted by Pat Ferrel <pa...@occamsmachete.com>.
AWS EMR is usually not very well suited for Spark. Spark gets most of its speed from in-memory calculations, so to see speed gains you have to have enough memory. Partitioning will also help in many cases. If you read in data from a single file, that partitioning will usually follow the calculation throughout all intermediate steps. If the data is from a single file the partition count may be 1, and therefore it will only use one machine. The most recent Mahout snapshot (therefore the next release) allows you to pass in the partitioning for each event pair (this is only in the library use, not the CLI). To get this effect in the current release, try splitting the input into multiple files.
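A quick way to see how many input files (and therefore, for small files, roughly how many initial partitions) the job starts from is to list the input path used in the commands above:

$ hdfs dfs -ls ratings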

I'm probably the one who reported the 10x speedup; that test used input from Kafka DStreams, which causes very small default partition sizes. Other comparisons for other calculations give a similar speedup. There is little question about Spark being much faster when used the way it is meant to be.

I use Mahout as a library all the time in the Universal Recommender implemented in Apache PredictionIO. As a library we get greater control than the CLI. The CLI is really only a proof of concept, not really meant for production.

BTW there is a significant algorithm benefit of the code behind spark-itemsimilarity that is probably more important than the speed increase and that is Correlated Cross-Occurrence, which allows the use of many indicators of user taste, not just the primary/conversion event, which is all any other CF-style recommender that I know of can use.


On Sep 22, 2016, at 1:49 AM, Arnau Sanchez <py...@gmail.com> wrote:

I've been using the Mahout itemsimilarity job for a while, with good results. I read that the new spark-itemsimilarity job is typically faster, by a factor of 10, so I wanted to give it a try. I must be doing something wrong because, on the same EMR infrastructure and the same data, the Spark job is slower than the old one (16 min vs 6 min). I took a small sample dataset (766k rating pairs) to compare numbers; here is the result:

Input ratings: http://download.zaudera.com/public/ratings

Infrastructure: emr-4.7.2 (spark 1.6.2, mahout 0.12.2)

Old itemsimilarity:

$ mahout itemsimilarity --input ratings --output itemsimilarity --booleanData TRUE --maxSimilaritiesPerItem 10 --similarityClassname SIMILARITY_COOCCURRENCE
[5m54s]

(logs: http://download.zaudera.com/public/itemsimilarity.out)

New spark-itemsimilarity:

$ mahout spark-itemsimilarity --input ratings --output spark-itemsimilarity --maxSimilaritiesPerItem 10 --master yarn-client
[15m51s]

(logs: http://download.zaudera.com/public/spark-itemsimilarity.out)

Any ideas? Thanks!