
Posted to user@mahout.apache.org by Rohit Jain <ro...@gmail.com> on 2016/05/03 13:32:52 UTC

Mahout rowSimilarity

Hello Everyone,
I have products, and each product has a set of associated tags. To find
similar products I am using the Mahout spark-rowsimilarity algorithm in
the following manner.

$MAHOUT_HOME/mahout spark-rowsimilarity -i hdfs://0.0.0.0:9000/wtrousers -o
hdfs://0.0.0.0:9000/s_trousers_out1/ -D:spark.io.compression.=lzf -ma
spark://0.0.0.0:7077
To run this command I need to export the data from my database to a flat
file. Is there any way to use this command, or to write Java code, so that
it works directly on the database?

-- 
Thanks & Regards,

*Rohit Jain*
Web developer | Consultant
Mob +91 8097283931

Re: Mahout rowSimilarity

Posted by Pat Ferrel <pa...@occamsmachete.com>.

Here is an example that takes a PairRDD, which is an RDD of pairs of strings. Each pair is expected to hold a row-id and a column-id, so this method inputs the sparse matrix one element at a time. If the row-id is a user-id and the column-id is an item-id, it will turn them into an IndexedDatasetSpark, which is essentially two BiMaps (one for users, one for items) and a DRM (distributed row matrix). Once you have the IndexedDataset, pass it to SimilarityAnalysis.
https://github.com/apache/mahout/blob/master/spark/src/main/scala/org/apache/mahout/sparkbindings/indexeddataset/IndexedDatasetSpark.scala#L68
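
As an illustration of the structure described above, here is a minimal plain-Scala sketch that turns (row-id, column-id) string pairs into two ID dictionaries and a set of sparse rows. It does not use Mahout or Spark: `PairsToIndexed`, `Indexed`, and `build` are hypothetical names, and plain `Map`s stand in for the BiMaps and the DRM.

```scala
// Plain-Scala illustration only; this is not Mahout's implementation.
object PairsToIndexed {
  // Hypothetical stand-in for an indexed dataset:
  // two ID dictionaries plus the sparse rows of the matrix.
  final case class Indexed(
      rowIds: Map[String, Int],    // external row-id -> matrix row index
      columnIds: Map[String, Int], // external column-id -> matrix column index
      rows: Map[Int, Set[Int]])    // row index -> indices of non-zero columns

  def build(pairs: Seq[(String, String)]): Indexed = {
    // Assign indices in order of first appearance.
    val rowIds = pairs.map(_._1).distinct.zipWithIndex.toMap
    val columnIds = pairs.map(_._2).distinct.zipWithIndex.toMap
    // Each input pair contributes one non-zero element to the matrix.
    val rows = pairs
      .groupBy { case (row, _) => rowIds(row) }
      .map { case (rowIndex, group) =>
        rowIndex -> group.map { case (_, col) => columnIds(col) }.toSet
      }
    Indexed(rowIds, columnIds, rows)
  }
}
```

For example, `PairsToIndexed.build(Seq(("p1", "cotton"), ("p1", "slim"), ("p2", "cotton")))` assigns index 0 to p1 and index 1 to p2, and records that row 0 has non-zero entries in columns 0 and 1.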



Re: Mahout rowSimilarity

Posted by Rohit Jain <ro...@gmail.com>.

I am still searching for an answer. It would be great if somebody
could help me with this :)


Re: Mahout rowSimilarity

Posted by Rohit Jain <ro...@gmail.com>.

And if so, could you please explain what exactly you mean by "You
can then just write some simple pre processing code that converts your
database files to the appropriate format for Mahout and read it in as an
indexed dataset."


Re: Mahout rowSimilarity

Posted by Rohit Jain <ro...@gmail.com>.

Hello Nikaash,
So you mean I need to first read the data from my MongoDB using Scala's
Mongo driver, then convert it into indexed datasets, and then process it
using row similarity?


Re: Mahout rowSimilarity

Posted by Nikaash Puri <ni...@gmail.com>.

Hi Rohit,

This would be a good place to start: https://github.com/apache/mahout/blob/master/spark/src/main/scala/org/apache/mahout/drivers/RowSimilarityDriver.scala

This bit of code, in particular, shows how to call spark-rowsimilarity from Scala:

val rowSimilarityIDS = SimilarityAnalysis.rowSimilarityIDS(indexedDataset,…)

You can then just write some simple pre processing code that converts your database files to the appropriate format for Mahout and read it in as an indexed dataset.
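
One way to picture that preprocessing step is a small formatter that renders each database record as one text line per row. The sketch below is plain Scala with hypothetical names (`RowFormat`, `formatRow`, `formatAll`); the tab and space delimiters are assumptions for illustration, so check the driver's options for the delimiters it actually expects.

```scala
// Plain-Scala sketch; the delimiters below are assumptions, not confirmed defaults.
object RowFormat {
  // Render one row as "<rowId><rowDelim><col1><elementDelim><col2>...".
  def formatRow(
      rowId: String,
      columnIds: Seq[String],
      rowDelim: String = "\t",
      elementDelim: String = " "): String =
    rowId + rowDelim + columnIds.mkString(elementDelim)

  // Render every (row-id, column-ids) record using the default delimiters.
  def formatAll(rows: Seq[(String, Seq[String])]): Seq[String] =
    rows.map { case (rowId, columnIds) => formatRow(rowId, columnIds) }
}
```

Here `RowFormat.formatRow("p1", Seq("cotton", "slim"))` yields the product ID, a tab, and then the two tags separated by a space.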

This is another great end-to-end example that achieves a similar result using spark-itemsimilarity: https://mahout.apache.org/users/environment/how-to-build-an-app.html

Let me know if you need more help.

Thank you,
Nikaash Puri

Re: Mahout rowSimilarity

Posted by Rohit Jain <ro...@gmail.com>.

Hello Pat,
Can you please explain this in a little more detail? I didn't understand
how to go about it.


Re: Mahout rowSimilarity

Posted by Pat Ferrel <pa...@occamsmachete.com>.

Sure, but at least some of the code would be Scala. There are examples in Mahout that take PairRDDs as input, but anything that constructs an IndexedDataset would be fine. I use this code in a system that creates an RDD from HBase. Think of the task as one of creating a Spark RDD from your DB content.
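
To make the "RDD from your DB content" idea concrete, here is a minimal plain-Scala sketch of the first half of that task: flattening product records into (product-id, tag) string pairs. The `Product` case class and `toPairs` are hypothetical; the real record shape depends on whatever the database query returns, and nothing here touches Spark or Mahout.

```scala
// Plain-Scala sketch; Product is a hypothetical record shape.
object ProductPairs {
  final case class Product(id: String, tags: Seq[String])

  // Flatten each product into one (product-id, tag) pair per distinct,
  // non-empty tag. Products without usable tags produce no pairs.
  def toPairs(products: Seq[Product]): Seq[(String, String)] =
    for {
      product <- products
      tag <- product.tags.map(_.trim).filter(_.nonEmpty).distinct
    } yield (product.id, tag)
}
```

With a SparkContext `sc` in scope, `sc.parallelize(ProductPairs.toPairs(products))` would give an RDD of string pairs. For tables too large to collect on the driver, a Spark connector for the database is probably a better fit than `parallelize`.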
