Posted to user@spark.apache.org by Russell Jurney <ru...@gmail.com> on 2016/11/15 18:06:10 UTC

Spark ML DataFrame API - need cosine similarity, how to convert to RDD Vectors?

I have two dataframes with common feature vectors and I need to get the
cosine similarity of one against the other. It looks like this is possible
in the RDD based API, mllib, but not in ml.

So, how do I convert my sparse dataframe vectors into something spark mllib
can use? I've searched, but haven't found anything.

Thanks!
-- 
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com relato.io
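
For reference, the quantity being asked about can be sketched without Spark. The following is a minimal illustration, not Spark's implementation: it assumes each sparse vector is given as a sorted array of indices plus a matching array of values, and the class and method names (SparseCosine, dot, norm, cosine) are hypothetical.

```java
final class SparseCosine {
    /** Dot product of two sparse vectors whose index arrays are sorted ascending. */
    static double dot(int[] ia, double[] va, int[] ib, double[] vb) {
        double sum = 0.0;
        int i = 0, j = 0;
        // Walk both index arrays in step; only shared indices contribute.
        while (i < ia.length && j < ib.length) {
            if (ia[i] == ib[j]) {
                sum += va[i] * vb[j];
                i++;
                j++;
            } else if (ia[i] < ib[j]) {
                i++;
            } else {
                j++;
            }
        }
        return sum;
    }

    /** Euclidean norm computed from the stored (non-zero) values. */
    static double norm(double[] values) {
        double sumSquares = 0.0;
        for (double v : values) {
            sumSquares += v * v;
        }
        return Math.sqrt(sumSquares);
    }

    /** Cosine similarity; returns 0.0 when either vector has zero norm (a convention chosen for this sketch). */
    static double cosine(int[] ia, double[] va, int[] ib, double[] vb) {
        double na = norm(va);
        double nb = norm(vb);
        if (na == 0.0 || nb == 0.0) {
            return 0.0;
        }
        return dot(ia, va, ib, vb) / (na * nb);
    }

    public static void main(String[] args) {
        // a = (1, 0, 2, 0) and b = (2, 0, 4, 0) point in the same direction.
        int[] ia = {0, 2};
        double[] va = {1.0, 2.0};
        int[] ib = {0, 2};
        double[] vb = {2.0, 4.0};
        System.out.println(cosine(ia, va, ib, vb)); // close to 1.0
    }
}
```

Vectors that share no non-zero index get a similarity of 0, which is why the merge-style loop only needs to visit stored entries.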

Re: Spark ML DataFrame API - need cosine similarity, how to convert to RDD Vectors?

Posted by Yanbo Liang <yb...@gmail.com>.
Hi Russell,

Do you want to use RowMatrix.columnSimilarities to calculate cosine
similarities?
If so, you should use the following steps:

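import org.apache.spark.mllib
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

val dataset: DataFrame = ??? // placeholder: your input DataFrame with an ml.linalg.Vector "features" column
// Convert the type of the features column from ml.linalg.Vector to mllib.linalg.Vector
val oldDataset: DataFrame = MLUtils.convertVectorColumnsFromML(dataset, "features")
// Convert from DataFrame to RDD
val oldRDD: RDD[mllib.linalg.Vector] =
  oldDataset.select(col("features")).rdd.map { row => row.getAs[mllib.linalg.Vector](0) }
// Generate RowMatrix (nRows and nCols are the dimensions of your data, defined elsewhere)
val mat: RowMatrix = new RowMatrix(oldRDD, nRows, nCols)
mat.columnSimilarities()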
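
Note that columnSimilarities compares the columns of the matrix with each other, not its rows. To make that concrete, here is a small plain-Java sketch of the same kind of computation on a dense array. It is only an illustration under simplifying assumptions (dense input, brute force, no sampling), not Spark's code, and the names ColumnCosine and columnSimilarities are hypothetical.

```java
final class ColumnCosine {
    /**
     * Returns sims[i][j] for i < j: the cosine similarity between column i
     * and column j of the given row-major matrix. Other entries stay 0.
     */
    static double[][] columnSimilarities(double[][] rows) {
        int nCols = rows[0].length;
        // gram[i][j] accumulates the dot product of column i and column j.
        double[][] gram = new double[nCols][nCols];
        for (double[] row : rows) {
            for (int i = 0; i < nCols; i++) {
                for (int j = i; j < nCols; j++) {
                    gram[i][j] += row[i] * row[j];
                }
            }
        }
        double[][] sims = new double[nCols][nCols];
        for (int i = 0; i < nCols; i++) {
            for (int j = i + 1; j < nCols; j++) {
                // gram[i][i] is the squared norm of column i.
                double denom = Math.sqrt(gram[i][i] * gram[j][j]);
                sims[i][j] = denom == 0.0 ? 0.0 : gram[i][j] / denom;
            }
        }
        return sims;
    }

    public static void main(String[] args) {
        // Three rows, three columns: column 2 is twice column 0, column 1 is orthogonal to both.
        double[][] rows = {
            {1.0, 0.0, 2.0},
            {2.0, 0.0, 4.0},
            {0.0, 3.0, 0.0}
        };
        double[][] sims = columnSimilarities(rows);
        System.out.println("cos(col0, col2) = " + sims[0][2]);
        System.out.println("cos(col0, col1) = " + sims[0][1]);
    }
}
```

If the vectors you want to compare are the rows of your DataFrame, the matrix would need to be transposed first so that each of those vectors becomes a column.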

Please feel free to let me know whether it can satisfy your requirements.


Thanks
Yanbo

On Wed, Nov 16, 2016 at 9:26 AM, Russell Jurney <ru...@gmail.com>
wrote:

> Asher, can you cast like that? Does that casting work? That is my
> confusion: I don't know what a DataFrame Vector turns into in terms of an
> RDD type.
>
> I'll try this, thanks.
>
> On Tue, Nov 15, 2016 at 11:25 AM, Asher Krim <ak...@hubspot.com> wrote:
>
>> What language are you using? For Java, you might convert the dataframe to
>> an rdd using something like this:
>>
>> df
>>     .toJavaRDD()
>>     .map(row -> (SparseVector)row.getAs(row.fieldIndex("columnName")));
>>
>> On Tue, Nov 15, 2016 at 1:06 PM, Russell Jurney <russell.jurney@gmail.com>
>> wrote:
>>
>>> I have two dataframes with common feature vectors and I need to get the
>>> cosine similarity of one against the other. It looks like this is possible
>>> in the RDD based API, mllib, but not in ml.
>>>
>>> So, how do I convert my sparse dataframe vectors into something spark
>>> mllib can use? I've searched, but haven't found anything.
>>>
>>> Thanks!
>>> --
>>> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com relato.io
>>>
>>
>>
>>
>> --
>> Asher Krim
>> Senior Software Engineer
>>
>
>
>
> --
> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com relato.io
>

Re: Spark ML DataFrame API - need cosine similarity, how to convert to RDD Vectors?

Posted by Russell Jurney <ru...@gmail.com>.
Asher, can you cast like that? Does that casting work? That is my
confusion: I don't know what a DataFrame Vector turns into in terms of an
RDD type.

I'll try this, thanks.

On Tue, Nov 15, 2016 at 11:25 AM, Asher Krim <ak...@hubspot.com> wrote:

> What language are you using? For Java, you might convert the dataframe to
> an rdd using something like this:
>
> df
>     .toJavaRDD()
>     .map(row -> (SparseVector)row.getAs(row.fieldIndex("columnName")));
>
> On Tue, Nov 15, 2016 at 1:06 PM, Russell Jurney <ru...@gmail.com>
> wrote:
>
>> I have two dataframes with common feature vectors and I need to get the
>> cosine similarity of one against the other. It looks like this is possible
>> in the RDD based API, mllib, but not in ml.
>>
>> So, how do I convert my sparse dataframe vectors into something spark
>> mllib can use? I've searched, but haven't found anything.
>>
>> Thanks!
>> --
>> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com relato.io
>>
>
>
>
> --
> Asher Krim
> Senior Software Engineer
>



-- 
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com relato.io

Re: Spark ML DataFrame API - need cosine similarity, how to convert to RDD Vectors?

Posted by Asher Krim <ak...@hubspot.com>.
What language are you using? For Java, you might convert the dataframe to
an rdd using something like this:

df
    .toJavaRDD()
    .map(row -> (SparseVector)row.getAs(row.fieldIndex("columnName")));

On Tue, Nov 15, 2016 at 1:06 PM, Russell Jurney <ru...@gmail.com>
wrote:

> I have two dataframes with common feature vectors and I need to get the
> cosine similarity of one against the other. It looks like this is possible
> in the RDD based API, mllib, but not in ml.
>
> So, how do I convert my sparse dataframe vectors into something spark
> mllib can use? I've searched, but haven't found anything.
>
> Thanks!
> --
> Russell Jurney twitter.com/rjurney russell.jurney@gmail.com relato.io
>



-- 
Asher Krim
Senior Software Engineer