Posted to user@spark.apache.org by Kevin Mellott <ke...@gmail.com> on 2016/09/19 20:49:01 UTC

Similar Items

Hi all,

I'm trying to write a Spark application that will detect similar items (in
this case products) based on their descriptions. I've got an ML pipeline
that transforms the product data to TF-IDF representation, using the
following components.

   - *RegexTokenizer* - strips out non-word characters, results in a list
   of tokens
   - *StopWordsRemover* - removes common "stop words", such as "the",
   "and", etc.
   - *HashingTF* - assigns a numeric "hash" to each token and calculates
   the term frequency
   - *IDF* - computes the inverse document frequency

After this pipeline evaluates, I'm left with a SparseVector that represents
the inverse document frequency of tokens for each product. As a next step,
I'd like to be able to compare each vector to one another, to detect
similarities.
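
(For reference, a minimal sketch of such a pipeline in Scala; the column
names and the token pattern are assumptions, not the actual code:)

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover, HashingTF, IDF}

// Assumes a DataFrame `products` with a "description" column.
val tokenizer = new RegexTokenizer()
  .setInputCol("description").setOutputCol("tokens")
  .setPattern("\\W")                          // split on non-word characters
val remover = new StopWordsRemover()
  .setInputCol("tokens").setOutputCol("filtered")
val hashingTF = new HashingTF()
  .setInputCol("filtered").setOutputCol("tf")
val idf = new IDF()
  .setInputCol("tf").setOutputCol("features")

val pipeline = new Pipeline().setStages(Array(tokenizer, remover, hashingTF, idf))
// val tfidf = pipeline.fit(products).transform(products)
// "features" then holds the TF-IDF weights as a SparseVector per product.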

Does anybody know of a straightforward way to do this in Spark? I tried
creating a UDF (that used the Breeze linear algebra methods internally);
however, that did not scale well.
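
(The brute-force route amounts to roughly the sketch below; it is not the
original UDF, plain arrays stand in for the Breeze internals, and the
`tfidf`, `id`, and `features` names are assumptions. Scoring every pair is
O(n^2), which is why it stops scaling:)

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Cosine similarity of two TF-IDF vectors.
val cosine = udf { (a: Vector, b: Vector) =>
  val (x, y) = (a.toArray, b.toArray)
  var dot = 0.0; var na = 0.0; var nb = 0.0
  var i = 0
  while (i < x.length) {
    dot += x(i) * y(i); na += x(i) * x(i); nb += y(i) * y(i); i += 1
  }
  if (na == 0.0 || nb == 0.0) 0.0 else dot / math.sqrt(na * nb)
}

// Assumed usage: `tfidf` has columns "id" and "features".
// val pairs = tfidf.as("a")
//   .join(tfidf.as("b"), col("a.id") < col("b.id"))   // every distinct pair
//   .select(col("a.id"), col("b.id"),
//           cosine(col("a.features"), col("b.features")).as("similarity"))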

Thanks,
Kevin

Re: Similar Items

Posted by Peter Figliozzi <pe...@gmail.com>.
Related question: is there anything that does scalable matrix
multiplication on Spark? For example, we have that long list of vectors
and want to construct the similarity matrix v * t(v) (in R: v %*% t(v)).
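
One option, sketched here under the assumption that the vectors live in
an RDD[(Long, Vector)], is Spark's distributed BlockMatrix, which
supports transpose and multiply:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, IndexedRow, IndexedRowMatrix}
import org.apache.spark.rdd.RDD

// vectors: RDD[(Long, Vector)] is an assumed input (row id, row vector).
// The result is the dense n x n Gram matrix v * t(v), so this is only
// feasible when n is modest.
def gramMatrix(vectors: RDD[(Long, Vector)]): BlockMatrix = {
  val mat = new IndexedRowMatrix(vectors.map { case (id, vec) => IndexedRow(id, vec) })
    .toBlockMatrix()
  mat.multiply(mat.transpose)
}

For anything large, RowMatrix.columnSimilarities (DIMSUM sampling) is the
scalable built-in, though it computes similarities between columns, so the
matrix would need to be laid out with the items as columns.
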
Thanks,
Pete



On Mon, Sep 19, 2016 at 3:49 PM, Kevin Mellott <ke...@gmail.com>
wrote:

> Hi all,
>
> I'm trying to write a Spark application that will detect similar items (in
> this case products) based on their descriptions. I've got an ML pipeline
> that transforms the product data to TF-IDF representation, using the
> following components.
>
>    - *RegexTokenizer* - strips out non-word characters, results in a list
>    of tokens
>    - *StopWordsRemover* - removes common "stop words", such as "the",
>    "and", etc.
>    - *HashingTF* - assigns a numeric "hash" to each token and calculates
>    the term frequency
>    - *IDF* - computes the inverse document frequency
>
> After this pipeline evaluates, I'm left with a SparseVector that
> represents the inverse document frequency of tokens for each product. As a
> next step, I'd like to be able to compare each vector to one another, to
> detect similarities.
>
> Does anybody know of a straightforward way to do this in Spark? I tried
> creating a UDF (that used the Breeze linear algebra methods internally);
> however, that did not scale well.
>
> Thanks,
> Kevin
>

Re: Similar Items

Posted by Nick Pentreath <ni...@gmail.com>.
Sorry, the original repo: https://github.com/karlhigley/spark-neighbors

On Wed, 21 Sep 2016 at 13:09 Nick Pentreath <ni...@gmail.com>
wrote:

> I should also point out another library I had not come across before:
> https://github.com/sethah/spark-neighbors
>
>
> On Tue, 20 Sep 2016 at 21:03 Kevin Mellott <ke...@gmail.com>
> wrote:
>
>> Using the Soundcloud implementation of LSH, I was able to process a 22K
>> product dataset in a mere 65 seconds! Thanks so much for the help!
>>
>> On Tue, Sep 20, 2016 at 1:15 PM, Kevin Mellott <kevin.r.mellott@gmail.com
>> > wrote:
>>
>>> Thanks Nick - those examples will help a ton!!
>>>
>>> On Tue, Sep 20, 2016 at 12:20 PM, Nick Pentreath <
>>> nick.pentreath@gmail.com> wrote:
>>>
>>>> A few options include:
>>>>
>>>> https://github.com/marufaytekin/lsh-spark - I've used this a bit and
>>>> it seems quite scalable too from what I've looked at.
>>>> https://github.com/soundcloud/cosine-lsh-join-spark - not used this
>>>> but looks like it should do exactly what you need.
>>>> https://github.com/mrsqueeze/spark-hash
>>>>
>>>>
>>>> On Tue, 20 Sep 2016 at 18:06 Kevin Mellott <ke...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks for the reply, Nick! I'm typically analyzing around 30-50K
>>>>> products at a time (as an isolated set of products). Within this set of
>>>>> products (which represents all products for a particular supplier), I am
>>>>> also analyzing each category separately. The largest categories typically
>>>>> have around 10K products.
>>>>>
>>>>> That being said, when calculating IDFs for the 10K product set we come
>>>>> out with roughly 12K unique tokens. In other words, our vectors are 12K
>>>>> columns wide (although they are being represented using SparseVectors). We
>>>>> have a step that is attempting to locate all documents that share the same
>>>>> tokens, and for those items we will calculate the cosine similarity.
>>>>> However, the part that attempts to identify documents with shared tokens is
>>>>> the bottleneck.
>>>>>
>>>>> For this portion, we map our data down to the individual tokens
>>>>> contained by each document. For example:
>>>>>
>>>>> DocumentId | Description
>>>>> -----------|---------------------
>>>>> 1          | Easton Hockey Stick
>>>>> 2          | Bauer Hockey Gloves
>>>>>
>>>>> In this case, we'd map to the following:
>>>>>
>>>>> (1, 'Easton')
>>>>> (1, 'Hockey')
>>>>> (1, 'Stick')
>>>>> (2, 'Bauer')
>>>>> (2, 'Hockey')
>>>>> (2, 'Gloves')
>>>>>
>>>>> Our goal is to aggregate this data as follows; however, our code that
>>>>> currently does this does not perform well. In the realistic 12K product
>>>>> scenario, this resulted in 430K document/token tuples.
>>>>>
>>>>> ((1, 2), ['Hockey'])
>>>>>
>>>>> This then tells us that documents 1 and 2 need to be compared to one
>>>>> another (via cosine similarity) because they both contain the token
>>>>> 'hockey'. I will investigate the methods that you recommended to see if
>>>>> they may resolve our problem.
>>>>>
>>>>> Thanks,
>>>>> Kevin
>>>>>
>>>>> On Tue, Sep 20, 2016 at 1:45 AM, Nick Pentreath <
>>>>> nick.pentreath@gmail.com> wrote:
>>>>>
>>>>>> How many products do you have? How large are your vectors?
>>>>>>
>>>>>> It could be that SVD / LSA could be helpful. But if you have many
>>>>>> products then trying to compute all-pair similarity with brute force is not
>>>>>> going to be scalable. In this case you may want to investigate hashing
>>>>>> (LSH) techniques.
>>>>>>
>>>>>>
>>>>>> On Mon, 19 Sep 2016 at 22:49, Kevin Mellott <
>>>>>> kevin.r.mellott@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I'm trying to write a Spark application that will detect similar
>>>>>>> items (in this case products) based on their descriptions. I've got an ML
>>>>>>> pipeline that transforms the product data to TF-IDF representation, using
>>>>>>> the following components.
>>>>>>>
>>>>>>>    - *RegexTokenizer* - strips out non-word characters, results in
>>>>>>>    a list of tokens
>>>>>>>    - *StopWordsRemover* - removes common "stop words", such as
>>>>>>>    "the", "and", etc.
>>>>>>>    - *HashingTF* - assigns a numeric "hash" to each token and
>>>>>>>    calculates the term frequency
>>>>>>>    - *IDF* - computes the inverse document frequency
>>>>>>>
>>>>>>> After this pipeline evaluates, I'm left with a SparseVector that
>>>>>>> represents the inverse document frequency of tokens for each product. As a
>>>>>>> next step, I'd like to be able to compare each vector to one another, to
>>>>>>> detect similarities.
>>>>>>>
>>>>>>> Does anybody know of a straightforward way to do this in Spark? I
>>>>>>> tried creating a UDF (that used the Breeze linear algebra methods
>>>>>>> internally); however, that did not scale well.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Kevin
>>>>>>>
>>>>>>
>>>>>
>>>
>>

Re: Similar Items

Posted by Nick Pentreath <ni...@gmail.com>.
I should also point out another library I had not come across before:
https://github.com/sethah/spark-neighbors

On Tue, 20 Sep 2016 at 21:03 Kevin Mellott <ke...@gmail.com>
wrote:

> Using the Soundcloud implementation of LSH, I was able to process a 22K
> product dataset in a mere 65 seconds! Thanks so much for the help!
>
> On Tue, Sep 20, 2016 at 1:15 PM, Kevin Mellott <ke...@gmail.com>
> wrote:
>
>> Thanks Nick - those examples will help a ton!!
>>
>> On Tue, Sep 20, 2016 at 12:20 PM, Nick Pentreath <
>> nick.pentreath@gmail.com> wrote:
>>
>>> A few options include:
>>>
>>> https://github.com/marufaytekin/lsh-spark - I've used this a bit and it
>>> seems quite scalable too from what I've looked at.
>>> https://github.com/soundcloud/cosine-lsh-join-spark - not used this but
>>> looks like it should do exactly what you need.
>>> https://github.com/mrsqueeze/spark-hash
>>>
>>>
>>> On Tue, 20 Sep 2016 at 18:06 Kevin Mellott <ke...@gmail.com>
>>> wrote:
>>>
>>>> Thanks for the reply, Nick! I'm typically analyzing around 30-50K
>>>> products at a time (as an isolated set of products). Within this set of
>>>> products (which represents all products for a particular supplier), I am
>>>> also analyzing each category separately. The largest categories typically
>>>> have around 10K products.
>>>>
>>>> That being said, when calculating IDFs for the 10K product set we come
>>>> out with roughly 12K unique tokens. In other words, our vectors are 12K
>>>> columns wide (although they are being represented using SparseVectors). We
>>>> have a step that is attempting to locate all documents that share the same
>>>> tokens, and for those items we will calculate the cosine similarity.
>>>> However, the part that attempts to identify documents with shared tokens is
>>>> the bottleneck.
>>>>
>>>> For this portion, we map our data down to the individual tokens
>>>> contained by each document. For example:
>>>>
>>>> DocumentId | Description
>>>> -----------|---------------------
>>>> 1          | Easton Hockey Stick
>>>> 2          | Bauer Hockey Gloves
>>>>
>>>> In this case, we'd map to the following:
>>>>
>>>> (1, 'Easton')
>>>> (1, 'Hockey')
>>>> (1, 'Stick')
>>>> (2, 'Bauer')
>>>> (2, 'Hockey')
>>>> (2, 'Gloves')
>>>>
>>>> Our goal is to aggregate this data as follows; however, our code that
>>>> currently does this does not perform well. In the realistic 12K product
>>>> scenario, this resulted in 430K document/token tuples.
>>>>
>>>> ((1, 2), ['Hockey'])
>>>>
>>>> This then tells us that documents 1 and 2 need to be compared to one
>>>> another (via cosine similarity) because they both contain the token
>>>> 'hockey'. I will investigate the methods that you recommended to see if
>>>> they may resolve our problem.
>>>>
>>>> Thanks,
>>>> Kevin
>>>>
>>>> On Tue, Sep 20, 2016 at 1:45 AM, Nick Pentreath <
>>>> nick.pentreath@gmail.com> wrote:
>>>>
>>>>> How many products do you have? How large are your vectors?
>>>>>
>>>>> It could be that SVD / LSA could be helpful. But if you have many
>>>>> products then trying to compute all-pair similarity with brute force is not
>>>>> going to be scalable. In this case you may want to investigate hashing
>>>>> (LSH) techniques.
>>>>>
>>>>>
>>>>> On Mon, 19 Sep 2016 at 22:49, Kevin Mellott <ke...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'm trying to write a Spark application that will detect similar
>>>>>> items (in this case products) based on their descriptions. I've got an ML
>>>>>> pipeline that transforms the product data to TF-IDF representation, using
>>>>>> the following components.
>>>>>>
>>>>>>    - *RegexTokenizer* - strips out non-word characters, results in a
>>>>>>    list of tokens
>>>>>>    - *StopWordsRemover* - removes common "stop words", such as
>>>>>>    "the", "and", etc.
>>>>>>    - *HashingTF* - assigns a numeric "hash" to each token and
>>>>>>    calculates the term frequency
>>>>>>    - *IDF* - computes the inverse document frequency
>>>>>>
>>>>>> After this pipeline evaluates, I'm left with a SparseVector that
>>>>>> represents the inverse document frequency of tokens for each product. As a
>>>>>> next step, I'd like to be able to compare each vector to one another, to
>>>>>> detect similarities.
>>>>>>
>>>>>> Does anybody know of a straightforward way to do this in Spark? I
>>>>>> tried creating a UDF (that used the Breeze linear algebra methods
>>>>>> internally); however, that did not scale well.
>>>>>>
>>>>>> Thanks,
>>>>>> Kevin
>>>>>>
>>>>>
>>>>
>>
>

Re: Similar Items

Posted by Kevin Mellott <ke...@gmail.com>.
Using the Soundcloud implementation of LSH, I was able to process a 22K
product dataset in a mere 65 seconds! Thanks so much for the help!

On Tue, Sep 20, 2016 at 1:15 PM, Kevin Mellott <ke...@gmail.com>
wrote:

> Thanks Nick - those examples will help a ton!!
>
> On Tue, Sep 20, 2016 at 12:20 PM, Nick Pentreath <nick.pentreath@gmail.com
> > wrote:
>
>> A few options include:
>>
>> https://github.com/marufaytekin/lsh-spark - I've used this a bit and it
>> seems quite scalable too from what I've looked at.
>> https://github.com/soundcloud/cosine-lsh-join-spark - not used this but
>> looks like it should do exactly what you need.
>> https://github.com/mrsqueeze/spark-hash
>>
>>
>> On Tue, 20 Sep 2016 at 18:06 Kevin Mellott <ke...@gmail.com>
>> wrote:
>>
>>> Thanks for the reply, Nick! I'm typically analyzing around 30-50K
>>> products at a time (as an isolated set of products). Within this set of
>>> products (which represents all products for a particular supplier), I am
>>> also analyzing each category separately. The largest categories typically
>>> have around 10K products.
>>>
>>> That being said, when calculating IDFs for the 10K product set we come
>>> out with roughly 12K unique tokens. In other words, our vectors are 12K
>>> columns wide (although they are being represented using SparseVectors). We
>>> have a step that is attempting to locate all documents that share the same
>>> tokens, and for those items we will calculate the cosine similarity.
>>> However, the part that attempts to identify documents with shared tokens is
>>> the bottleneck.
>>>
>>> For this portion, we map our data down to the individual tokens
>>> contained by each document. For example:
>>>
>>> DocumentId | Description
>>> -----------|---------------------
>>> 1          | Easton Hockey Stick
>>> 2          | Bauer Hockey Gloves
>>>
>>> In this case, we'd map to the following:
>>>
>>> (1, 'Easton')
>>> (1, 'Hockey')
>>> (1, 'Stick')
>>> (2, 'Bauer')
>>> (2, 'Hockey')
>>> (2, 'Gloves')
>>>
>>> Our goal is to aggregate this data as follows; however, our code that
>>> currently does this does not perform well. In the realistic 12K product
>>> scenario, this resulted in 430K document/token tuples.
>>>
>>> ((1, 2), ['Hockey'])
>>>
>>> This then tells us that documents 1 and 2 need to be compared to one
>>> another (via cosine similarity) because they both contain the token
>>> 'hockey'. I will investigate the methods that you recommended to see if
>>> they may resolve our problem.
>>>
>>> Thanks,
>>> Kevin
>>>
>>> On Tue, Sep 20, 2016 at 1:45 AM, Nick Pentreath <
>>> nick.pentreath@gmail.com> wrote:
>>>
>>>> How many products do you have? How large are your vectors?
>>>>
>>>> It could be that SVD / LSA could be helpful. But if you have many
>>>> products then trying to compute all-pair similarity with brute force is not
>>>> going to be scalable. In this case you may want to investigate hashing
>>>> (LSH) techniques.
>>>>
>>>>
>>>> On Mon, 19 Sep 2016 at 22:49, Kevin Mellott <ke...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm trying to write a Spark application that will detect similar items
>>>>> (in this case products) based on their descriptions. I've got an ML
>>>>> pipeline that transforms the product data to TF-IDF representation, using
>>>>> the following components.
>>>>>
>>>>>    - *RegexTokenizer* - strips out non-word characters, results in a
>>>>>    list of tokens
>>>>>    - *StopWordsRemover* - removes common "stop words", such as "the",
>>>>>    "and", etc.
>>>>>    - *HashingTF* - assigns a numeric "hash" to each token and
>>>>>    calculates the term frequency
>>>>>    - *IDF* - computes the inverse document frequency
>>>>>
>>>>> After this pipeline evaluates, I'm left with a SparseVector that
>>>>> represents the inverse document frequency of tokens for each product. As a
>>>>> next step, I'd like to be able to compare each vector to one another, to
>>>>> detect similarities.
>>>>>
>>>>> Does anybody know of a straightforward way to do this in Spark? I
>>>>> tried creating a UDF (that used the Breeze linear algebra methods
>>>>> internally); however, that did not scale well.
>>>>>
>>>>> Thanks,
>>>>> Kevin
>>>>>
>>>>
>>>
>

Re: Similar Items

Posted by Kevin Mellott <ke...@gmail.com>.
Thanks Nick - those examples will help a ton!!

On Tue, Sep 20, 2016 at 12:20 PM, Nick Pentreath <ni...@gmail.com>
wrote:

> A few options include:
>
> https://github.com/marufaytekin/lsh-spark - I've used this a bit and it
> seems quite scalable too from what I've looked at.
> https://github.com/soundcloud/cosine-lsh-join-spark - not used this but
> looks like it should do exactly what you need.
> https://github.com/mrsqueeze/spark-hash
>
>
> On Tue, 20 Sep 2016 at 18:06 Kevin Mellott <ke...@gmail.com>
> wrote:
>
>> Thanks for the reply, Nick! I'm typically analyzing around 30-50K
>> products at a time (as an isolated set of products). Within this set of
>> products (which represents all products for a particular supplier), I am
>> also analyzing each category separately. The largest categories typically
>> have around 10K products.
>>
>> That being said, when calculating IDFs for the 10K product set we come
>> out with roughly 12K unique tokens. In other words, our vectors are 12K
>> columns wide (although they are being represented using SparseVectors). We
>> have a step that is attempting to locate all documents that share the same
>> tokens, and for those items we will calculate the cosine similarity.
>> However, the part that attempts to identify documents with shared tokens is
>> the bottleneck.
>>
>> For this portion, we map our data down to the individual tokens contained
>> by each document. For example:
>>
>> DocumentId | Description
>> -----------|---------------------
>> 1          | Easton Hockey Stick
>> 2          | Bauer Hockey Gloves
>>
>> In this case, we'd map to the following:
>>
>> (1, 'Easton')
>> (1, 'Hockey')
>> (1, 'Stick')
>> (2, 'Bauer')
>> (2, 'Hockey')
>> (2, 'Gloves')
>>
>> Our goal is to aggregate this data as follows; however, our code that
>> currently does this does not perform well. In the realistic 12K product
>> scenario, this resulted in 430K document/token tuples.
>>
>> ((1, 2), ['Hockey'])
>>
>> This then tells us that documents 1 and 2 need to be compared to one
>> another (via cosine similarity) because they both contain the token
>> 'hockey'. I will investigate the methods that you recommended to see if
>> they may resolve our problem.
>>
>> Thanks,
>> Kevin
>>
>> On Tue, Sep 20, 2016 at 1:45 AM, Nick Pentreath <nick.pentreath@gmail.com
>> > wrote:
>>
>>> How many products do you have? How large are your vectors?
>>>
>>> It could be that SVD / LSA could be helpful. But if you have many
>>> products then trying to compute all-pair similarity with brute force is not
>>> going to be scalable. In this case you may want to investigate hashing
>>> (LSH) techniques.
>>>
>>>
>>> On Mon, 19 Sep 2016 at 22:49, Kevin Mellott <ke...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm trying to write a Spark application that will detect similar items
>>>> (in this case products) based on their descriptions. I've got an ML
>>>> pipeline that transforms the product data to TF-IDF representation, using
>>>> the following components.
>>>>
>>>>    - *RegexTokenizer* - strips out non-word characters, results in a
>>>>    list of tokens
>>>>    - *StopWordsRemover* - removes common "stop words", such as "the",
>>>>    "and", etc.
>>>>    - *HashingTF* - assigns a numeric "hash" to each token and
>>>>    calculates the term frequency
>>>>    - *IDF* - computes the inverse document frequency
>>>>
>>>> After this pipeline evaluates, I'm left with a SparseVector that
>>>> represents the inverse document frequency of tokens for each product. As a
>>>> next step, I'd like to be able to compare each vector to one another, to
>>>> detect similarities.
>>>>
>>>> Does anybody know of a straightforward way to do this in Spark? I tried
>>>> creating a UDF (that used the Breeze linear algebra methods internally);
>>>> however, that did not scale well.
>>>>
>>>> Thanks,
>>>> Kevin
>>>>
>>>
>>

Re: Similar Items

Posted by Nick Pentreath <ni...@gmail.com>.
A few options include:

https://github.com/marufaytekin/lsh-spark - I've used this a bit and it
seems quite scalable too from what I've looked at.
https://github.com/soundcloud/cosine-lsh-join-spark - not used this but
looks like it should do exactly what you need.
https://github.com/mrsqueeze/spark-hash
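
(The common idea behind these cosine-LSH libraries is sign random
projection; the sketch below only illustrates that idea and is not any of
the above libraries' APIs. The names and sizes in it are assumptions.)

import org.apache.spark.ml.linalg.SparseVector
import scala.util.Random

// Each random hyperplane contributes one bit: the sign of the dot product.
// Vectors that agree on all k bits land in the same bucket and become
// candidate pairs for an exact cosine-similarity check.
def randomHyperplanes(dim: Int, k: Int, seed: Long): Array[Array[Double]] = {
  val rng = new Random(seed)
  Array.fill(k)(Array.fill(dim)(rng.nextGaussian()))
}

def signature(v: SparseVector, planes: Array[Array[Double]]): String =
  planes.map { p =>
    val dot = v.indices.zip(v.values).map { case (i, x) => p(i) * x }.sum
    if (dot >= 0) "1" else "0"
  }.mkString

// Assumed usage: docs is an RDD[(Long, SparseVector)]; dim matches HashingTF.
// val planes  = randomHyperplanes(dim = 262144, k = 16, seed = 42L)
// val buckets = docs.map { case (id, vec) => (signature(vec, planes), id) }
//                   .groupByKey()   // only compare ids that share a bucket

Real implementations add multiple hash tables (banding) so that near
neighbours that differ on a bit or two still collide in at least one bucket.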


On Tue, 20 Sep 2016 at 18:06 Kevin Mellott <ke...@gmail.com>
wrote:

> Thanks for the reply, Nick! I'm typically analyzing around 30-50K products
> at a time (as an isolated set of products). Within this set of products
> (which represents all products for a particular supplier), I am also
> analyzing each category separately. The largest categories typically have
> around 10K products.
>
> That being said, when calculating IDFs for the 10K product set we come out
> with roughly 12K unique tokens. In other words, our vectors are 12K columns
> wide (although they are being represented using SparseVectors). We have a
> step that is attempting to locate all documents that share the same tokens,
> and for those items we will calculate the cosine similarity. However, the
> part that attempts to identify documents with shared tokens is the
> bottleneck.
>
> For this portion, we map our data down to the individual tokens contained
> by each document. For example:
>
> DocumentId | Description
> -----------|---------------------
> 1          | Easton Hockey Stick
> 2          | Bauer Hockey Gloves
>
> In this case, we'd map to the following:
>
> (1, 'Easton')
> (1, 'Hockey')
> (1, 'Stick')
> (2, 'Bauer')
> (2, 'Hockey')
> (2, 'Gloves')
>
> Our goal is to aggregate this data as follows; however, our code that
> currently does this does not perform well. In the realistic 12K product
> scenario, this resulted in 430K document/token tuples.
>
> ((1, 2), ['Hockey'])
>
> This then tells us that documents 1 and 2 need to be compared to one
> another (via cosine similarity) because they both contain the token
> 'hockey'. I will investigate the methods that you recommended to see if
> they may resolve our problem.
>
> Thanks,
> Kevin
>
> On Tue, Sep 20, 2016 at 1:45 AM, Nick Pentreath <ni...@gmail.com>
> wrote:
>
>> How many products do you have? How large are your vectors?
>>
>> It could be that SVD / LSA could be helpful. But if you have many
>> products then trying to compute all-pair similarity with brute force is not
>> going to be scalable. In this case you may want to investigate hashing
>> (LSH) techniques.
>>
>>
>> On Mon, 19 Sep 2016 at 22:49, Kevin Mellott <ke...@gmail.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I'm trying to write a Spark application that will detect similar items
>>> (in this case products) based on their descriptions. I've got an ML
>>> pipeline that transforms the product data to TF-IDF representation, using
>>> the following components.
>>>
>>>    - *RegexTokenizer* - strips out non-word characters, results in a
>>>    list of tokens
>>>    - *StopWordsRemover* - removes common "stop words", such as "the",
>>>    "and", etc.
>>>    - *HashingTF* - assigns a numeric "hash" to each token and
>>>    calculates the term frequency
>>>    - *IDF* - computes the inverse document frequency
>>>
>>> After this pipeline evaluates, I'm left with a SparseVector that
>>> represents the inverse document frequency of tokens for each product. As a
>>> next step, I'd like to be able to compare each vector to one another, to
>>> detect similarities.
>>>
>>> Does anybody know of a straightforward way to do this in Spark? I tried
>>> creating a UDF (that used the Breeze linear algebra methods internally);
>>> however, that did not scale well.
>>>
>>> Thanks,
>>> Kevin
>>>
>>
>

Re: Similar Items

Posted by Nick Pentreath <ni...@gmail.com>.
How many products do you have? How large are your vectors?

It could be that SVD / LSA could be helpful. But if you have many products
then trying to compute all-pair similarity with brute force is not going to
be scalable. In this case you may want to investigate hashing (LSH)
techniques.
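
(For reference, a minimal sketch of the SVD / LSA route using mllib's
RowMatrix; the input RDD and the choice of k are assumptions:)

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// tfidf: RDD[Vector] of TF-IDF features (assumed). Projects each document
// into a k-dimensional latent (LSA) space; similarities can then be
// computed on these much shorter rows of U.
def lsaProjection(tfidf: RDD[Vector], k: Int): RowMatrix = {
  val svd = new RowMatrix(tfidf).computeSVD(k, computeU = true)
  svd.U
}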


On Mon, 19 Sep 2016 at 22:49, Kevin Mellott <ke...@gmail.com>
wrote:

> Hi all,
>
> I'm trying to write a Spark application that will detect similar items (in
> this case products) based on their descriptions. I've got an ML pipeline
> that transforms the product data to TF-IDF representation, using the
> following components.
>
>    - *RegexTokenizer* - strips out non-word characters, results in a list
>    of tokens
>    - *StopWordsRemover* - removes common "stop words", such as "the",
>    "and", etc.
>    - *HashingTF* - assigns a numeric "hash" to each token and calculates
>    the term frequency
>    - *IDF* - computes the inverse document frequency
>
> After this pipeline evaluates, I'm left with a SparseVector that
> represents the inverse document frequency of tokens for each product. As a
> next step, I'd like to be able to compare each vector to one another, to
> detect similarities.
>
> Does anybody know of a straightforward way to do this in Spark? I tried
> creating a UDF (that used the Breeze linear algebra methods internally);
> however, that did not scale well.
>
> Thanks,
> Kevin
>