Posted to user@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2015/03/26 20:49:14 UTC

Re: mahout 1.0 on EMR with spark item-similarity

Finally getting to Yarn. Paul, were you trying to run spark-itemsimilarity with the spark-submit script? That shouldn’t work: the job is a standalone app that doesn’t require spark-submit, nor is it likely to work with it.

Were you able to run on Yarn? How?

On Jan 29, 2015, at 9:15 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:

There are two indices (Guava HashBiMaps) that map your application IDs into and out of Mahout IDs (HashBiMap<int, string>). There is one copy of each (row/user IDs and column/item IDs) per physical machine, which all local tasks consult. They are Spark broadcast values. These will grow linearly as the number of items and users grows and as the size of your IDs, treated as strings, grows. The hashmaps have some overhead, but in large collections the main cost is the size of the application IDs stored as strings; Mahout’s IDs are ints.
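
For scale intuition, a rough back-of-envelope with assumed numbers (not from this thread): with 10M users and 20-character IDs, each ID kept as a Java String costs on the order of 100 bytes once the object header, the backing char array, and HashBiMap entry overhead are counted, so the user dictionary alone is roughly

   10,000,000 IDs x ~100 bytes ≈ 1 GB

of broadcast data per physical machine, before the (usually much smaller) item dictionary is added. Shorter application IDs shrink this linearly.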

On Jan 22, 2015, at 8:04 AM, Pasmanik, Paul <Pa...@danteinc.com> wrote:

I was able to get Spark and Mahout installed on the EMR cluster as bootstrap actions and was able to run the spark-itemsimilarity job via an EMR step with some modifications to the mahout script (defining SPARK_HOME and making sure CLASSPATH is not picked up from the invoking script, which is Amazon's script-runner).

I was only able to run this job using yarn-client (yarn-cluster is not able to submit to the resource manager).

In yarn-client mode the driver program runs in the client process and submits jobs to executors via the YARN resource manager, so my question is: how much memory does this driver need?
Will the memory requirement vary with the size of the input to spark-itemsimilarity?

Thanks. 


-----Original Message-----
From: Pasmanik, Paul [mailto:Paul.Pasmanik@danteinc.com] 
Sent: Thursday, January 15, 2015 12:46 PM
To: user@mahout.apache.org
Subject: mahout 1.0 on EMR with spark

Has anyone tried running mahout 1.0 on EMR with Spark?
I've used the instructions at https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark to get an EMR cluster running Spark. I am now able to deploy an EMR cluster with Spark using the AWS Java APIs.
EMR allows running a custom script as a bootstrap action, which I can use to install Mahout.
What I am trying to figure out is whether I would need to build Mahout every time I start the EMR cluster, or whether I should keep pre-built artifacts and develop a script similar to what awslabs uses to install Spark.

Thanks.







Re: mahout 1.0 on EMR with spark item-similarity

Posted by Pat Ferrel <pa...@occamsmachete.com>.
I gave you the wrong schema entries in the advice about queries. Check the Solr documentation, which always trumps my guesses.

To use token phrases do the following:

   <fieldType name="indicator" class="solr.TextField" omitNorms="false">
      <!-- This simple tokenizer splits the text on spaces (and other punctuation) to separate item-id tokens -->
      <tokenizer class="solr.StandardTokenizerFactory"/>
   </fieldType>
   <field name="purchase" stored="true" type="indicator" multiValued="false" indexed="true"/>
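
With that fieldType in place, a user's purchase history can be indexed and queried roughly like this (item IDs invented for illustration). Since the field is tokenized, the default OR operator matches each item token independently and no phrase syntax is needed:

   indexed field:  purchase = "iphone ipad macbook"
   query:          q=purchase:(iphone ipad)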

> On Apr 16, 2015, at 9:15 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> 
> To do a query with something like “ipad iphone” we’ll need to set up Solr like this, which is a bit of a guess since I use a DB, not the raw output files:
> 
>    <fieldType name="indicator" class="solr.TextField" omitNorms="false"/> <!-- NOTICE: NO TOKENIZER OR ANALYZER USED -->
>    <field name="purchase" stored="true" type="indicator" multiValued="true" indexed="true"/>

My bad, this is for multi-valued fields; see above for space-delimited token fields. I believe the above should use class="solr.StrField" as well.


Re: mahout 1.0 on EMR with spark item-similarity

Posted by Pat Ferrel <pa...@occamsmachete.com>.
OK, this is cool. Almost there!

In order to answer the question you have to decide how you will persist the indicators. The output of spark-itemsimilarity can be indexed directly, but that requires phrase searches on text fields, and it requires each item-id string to survive as a single token so it isn’t broken up by the analyzer used for Solr phrase queries. The better way is to store the indicators as multi-valued fields in Solr or a DB. Many ways to slice this.

If you want to index the output of Mahout directly, we will assume the item IDs are already clean tokens and so contain no spaces, periods, commas, or other punctuation that would break a phrase, so we can encode the user history as a single string of space-separated item tokens.

To do a query with something like “ipad iphone” we’ll need to set up Solr like this, which is a bit of a guess since I use a DB, not the raw output files:

    <fieldType name="indicator" class="solr.TextField" omitNorms="false"/> <!-- NOTICE: NO TOKENIZER OR ANALYZER USED -->
    <field name="purchase" stored="true" type="indicator" multiValued="true" indexed="true"/>

“OR” is the default query operator, so unless you’ve changed that it should be fine. You need it because if you add multiple fields in the future you want them ORed, as well as the terms within each field. The query would be something like:

q=purchase:("iphone ipad")

So you are applying the items the user has purchased only to the purchase indicator field. As you add cross-cooccurrence actions the fieldType will stay the same and you will add a field for “views”.

q=purchase:("iphone ipad") view:("history of items viewed")
And so on.
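
As a concrete (made-up) example against a Solr core named indicators, with both fields populated; note that the quotes and spaces would need URL-encoding in a real request:

   http://localhost:8983/solr/indicators/select?q=purchase:("iphone ipad") view:("case stylus")&fl=id,score&rows=10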

You’ll also need to index them in Solr as CSV files with no header. The output is tab-delimited by default, so more correctly a TSV.
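
For reference, the output rows look roughly like the following, with a tab after the key item and the similar items as space-delimited itemID:strength pairs (the IDs and strengths here are invented; <tab> stands for a literal tab character):

   ipad<tab>iphone:4.72 macbook:2.51
   iphone<tab>ipad:4.72 case:3.10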

This can be set up to use multi-valued fields, but you’d have to store the output of spark-itemsimilarity in Solr or a DB. I’d actually recommend this for several reasons, including that it is faster than HDFS, but it requires you to write storage code and customize the Solr config differently.

Other answers below:


> On Apr 16, 2015, at 7:35 AM, Pasmanik, Paul <Pa...@danteinc.com> wrote:
> 
> Thanks, Pat.
> I have a question regarding the search on the multi-valued field.
> So, once I have indexed the results of spark-itemsimilarity for the purchase field as a multi-valued field in Solr, what kind of search do I perform using the user's purchase history? Is it a phrase search (purchase items separated by spaces, something like purchase: "iphone ipad", with or without a high slop value), or is it an OR query using each purchased item (purchase: ("iphone" OR "ipad")), or something totally different?
> 
> My understanding is that if I have a document whose purchase field has values 1,2,3,4,5 and another document that has values 3,4,5, and my purchase history has 1,2,4, then the first document should rank higher.

Yes. The longer answer is that Solr with omitNorms="false" will TF-IDF weight terms (in this case individual indicators). So very frequently preferred items will be down-weighted, on the theory that because you and I both like "motherhood and apple pie", it doesn’t say much about the similarity of our taste: everyone likes those. So actual results will depend on the frequency of item preferences in the corpus. This down-weighting is a good thing, since otherwise the popular things would always be recommended.
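
For the curious, Lucene's classic similarity (what Solr used at the time) weights a matching term roughly as follows; this is my paraphrase, so check the Lucene docs:

   tf(t in d) = sqrt(termFreq)
   idf(t)     = 1 + ln(numDocs / (docFreq + 1))
   weight     ~ tf * idf^2 * fieldNorm

A very popular indicator has a large docFreq and therefore a small idf, which is exactly the down-weighting described above.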


Re: mahout 1.0 on EMR with spark item-similarity

Posted by Pat Ferrel <pa...@occamsmachete.com>.
We are working on a release, which will be 0.10.0, so give it a try if you can. It fixes one problem that you may encounter, an out-of-range index in a vector. You may not see it.

1) The search engine must be able to take one query with multiple fields and apply each field in the query to a separate field in the index. Solr and ES work; I'm not sure about Amazon's.
2) Config and experiment seem good.
3) It is good practice to save your interactions in something like a DB so they can be replayed to create indicators if needed and to maintain a time range of data (see the sketch after this list). I use a key-value store like the search engine itself or a NoSQL DB. The value is that you can treat the collection as an item catalog and so put metadata in columns or doc fields. This metadata can be used to match context later, so if you are on a “men’s clothes” page you may want “people who bought this also bought these” but biased or filtered by the “men’s clothes” category.
4) Tags can be used for CF or for content-based recs, and CF would generally be better. In the case you ask about, the query is items favored, since spark-rowsimilarity will produce similar items (similar in their tags, not in the users who preferred them). So the query is items. Extending this, text tokens (bag-of-words) can be used to calculate content-based indicators and recs that are personalized, not just "more items like this”. But again, CF data would be preferable if available.
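
A minimal sketch of the interaction record in point 3; the field names and values are invented:

   (timestamp, userID, action, itemID, category)
   2015-04-07T10:00:00Z, u123, purchase, ipad-mini, tablets
   2015-04-07T10:02:31Z, u123, view, ipad-case, accessories

Replaying records like these, filtered by action and time range, regenerates the tuples spark-itemsimilarity expects as input, and the category column supports the context matching described above.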

As with your cluster investment, I’d start small with clear usage-based indicators and build up as needed based on your application data.

Let us know how it goes.






RE: mahout 1.0 on EMR with spark item-similarity

Posted by "Pasmanik, Paul" <Pa...@danteinc.com>.
Thanks, Pat.
We are only running an EMR cluster with 1 master and 1 core node right now, on EMR AMI 3.2.3, which has Hadoop 2.4.0. We are using the default configuration for Spark (via the AWS script for Spark), which I believe sets the number of instances to 2. Spark version is 1.1.0h (https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/VersionInformation.md).
We are not in production yet; we are experimenting right now.

I have a question about the choice of search engine for serving recommendations.
I know the Practical Machine Learning book and the Mahout docs talk about Solr. Do you see any issues with using Elasticsearch or AWS CloudSearch?
Also, looking at the content-based indicator example on the intro-cooccurrence-spark Mahout page, I see that the spark-rowsimilarity job is used to produce an item-to-items matrix, but then it says to use tags associated with purchases in the query for tags, like this:
Query:
  field: purchase; q:user's-purchase-history
  field: view; q:user's view-history
  field: tags; q:user's-tags-associated-with-purchases

So, we are not providing the actual tags in the tags field query, are we?

Thanks






Re: mahout 1.0 on EMR with spark item-similarity

Posted by Pat Ferrel <pa...@occamsmachete.com>.
OK, this seems fine. So you used "-ma yarn-client"; I’ve verified that this works in other cases.

BTW we are nearing a new release. It fixes one cooccurrence problem that you may run into if you are using lots of small files to contain the initial interaction input. This happens often when using Spark Streaming for input.

If you want to try the source on GitHub, make sure to compile with -DskipTests, since there is a failing test unrelated to the Spark code. Be aware that jar names have changed, if that matters.
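
For example, from the root of the cloned repo (assuming the standard Maven setup the project uses):

   mvn clean install -DskipTests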

Can you report the cluster versions of Spark and Hadoop, as well as how many nodes?

Thanks
  





RE: mahout 1.0 on EMR with spark item-similarity

Posted by "Pasmanik, Paul" <Pa...@danteinc.com>.
Pat, I was not using the spark-submit script. I am using mahout spark-itemsimilarity exactly as specified in http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html

So, what I did is create a bootstrap action that installs Spark and Mahout on the EMR cluster. Then I used the AWS Java APIs to create an EMR job step, which can call a script (Amazon provides a script-runner that can run any script). So I basically build a command (mahout spark-itemsimilarity <parameters>) and pass it to the script-runner, which runs it. One of the parameters is -ma, so I pass in yarn-client.

We use the AWS Java API to programmatically start the EMR cluster (triggered by a Quartz job) with whatever parameters the job needs.
I used the instructions here: https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark to install Spark as a bootstrap action. I built mahout-1.0 locally and uploaded a package to S3. I also created a bash script to copy that package from S3 to EMR, unpack it, and remove the Mahout 0.9 version that is part of the EMR AMI. Then I used another bootstrap action to invoke that script and install Mahout. I also had to make changes to the mahout script:
added SPARK_HOME=/home/hadoop/spark (this is where I installed Spark on EMR), and
modified CLASSPATH=${CLASSPATH}:$MAHOUT_CONF_DIR to CLASSPATH=$MAHOUT_CONF_DIR to avoid including the classpath passed in by Amazon's script-runner, since it contains a path to the 2.11 version of Scala (installed on EMR by Amazon) that conflicts with the Spark/Mahout 2.10.x version.
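
A condensed sketch of that bootstrap flow; the bucket name, package name, and install paths are invented for illustration:

   #!/bin/bash
   # fetch the pre-built Mahout package from S3 (hypothetical bucket/key)
   aws s3 cp s3://my-bucket/mahout-1.0-SNAPSHOT.tar.gz /home/hadoop/
   tar -xzf /home/hadoop/mahout-1.0-SNAPSHOT.tar.gz -C /home/hadoop/
   # drop the Mahout 0.9 that ships with the EMR AMI (hypothetical path)
   rm -rf /home/hadoop/mahout-0.9
   # point the mahout script at the Spark install and stop inheriting the caller's CLASSPATH
   export SPARK_HOME=/home/hadoop/spark
   sed -i 's|CLASSPATH=${CLASSPATH}:$MAHOUT_CONF_DIR|CLASSPATH=$MAHOUT_CONF_DIR|' \
       /home/hadoop/mahout/bin/mahout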
 
 