Posted to dev@spark.apache.org by Asher Krim <ak...@hubspot.com> on 2017/01/04 07:58:22 UTC

Re: ml word2vec findSynonyms return type

The jira: https://issues.apache.org/jira/browse/SPARK-17629

Adding new methods could result in method clutter. Changing behavior of
non-experimental classes is unfortunate (ml Word2Vec was marked
Experimental until Spark 2.0). Neither option is great. If I had to pick, I
would rather change the existing methods to keep the class simpler moving
forward.
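
For what it's worth, callers who do want a DataFrame would lose very little
under the proposed return type: rebuilding one from the Array is a one-liner.
A rough sketch, assuming a SparkSession is available (here via getOrCreate)
and a fitted model named `model` (both names illustrative, not from the thread):

// Rough sketch: rebuilding a DataFrame on the caller's side from the
// proposed Array[(String, Double)] return type. `model` is an assumed
// fitted Word2VecModel; names are illustrative only.
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder().getOrCreate()
import spark.implicits._

val synonyms: Array[(String, Double)] = model.findSynonyms("spark", 5)
val synonymsDF = synonyms.toSeq.toDF("word", "similarity")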


On Sat, Dec 31, 2016 at 8:29 AM, Felix Cheung <fe...@hotmail.com>
wrote:

> Could you link to the JIRA here?
>
> What you suggest makes sense to me. Though we might want to maintain
> compatibility and add a new method instead of changing the return type of
> the existing one.
>
>
> _____________________________
> From: Asher Krim <ak...@hubspot.com>
> Sent: Wednesday, December 28, 2016 11:52 AM
> Subject: ml word2vec findSynonyms return type
> To: <de...@spark.apache.org>
> Cc: <ma...@gmail.com>, Joseph Bradley <
> joseph@databricks.com>
>
>
>
> Hey all,
>
> I would like to propose changing the return type of `findSynonyms` in ml's
> Word2Vec
> <https://github.com/apache/spark/blob/branch-2.1/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala#L233-L248>
> :
>
> def findSynonyms(word: String, num: Int): DataFrame = {
>   val spark = SparkSession.builder().getOrCreate()
>   spark.createDataFrame(wordVectors.findSynonyms(word, num)).toDF("word", "similarity")
> }
>
> I find it very strange that the results are parallelized before being
> returned to the user. The results are already on the driver to begin with,
> and I can imagine that for most use cases (and definitely for ours) the
> synonyms are collected right back to the driver. This incurs both an added
> cost of shipping data to and from the cluster and a more cumbersome
> interface than necessary.
>
> Can we change it to just the following?
>
> def findSynonyms(word: String, num: Int): Array[(String, Double)] = {
>   wordVectors.findSynonyms(word, num)
> }
>
> If the user wants the results parallelized, they can still do so on their
> own.
>
> (I had brought this up a while back in Jira. It was suggested that the
> mailing list would be a better forum to discuss it, so here we are.)
>
> Thanks,
> --
> Asher Krim
> Senior Software Engineer
>
>

Re: ml word2vec findSynonyms return type

Posted by Asher Krim <ak...@hubspot.com>.
It took me a while, but I finally got around to this:
https://github.com/apache/spark/pull/16811/files
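
Roughly, the additive direction Joseph suggested looks something like the
sketch below: keep the existing DataFrame method and expose the driver-local
results through a second method. The method name `findSynonymsArray` and the
body are illustrative only; the linked PR is the authoritative change.

// Sketch of the additive approach inside ml's Word2VecModel (illustrative,
// not the PR verbatim): a new method returns the driver-local array, and
// the existing DataFrame method delegates to it.
def findSynonymsArray(word: String, num: Int): Array[(String, Double)] = {
  wordVectors.findSynonyms(word, num)
}

def findSynonyms(word: String, num: Int): DataFrame = {
  val spark = SparkSession.builder().getOrCreate()
  spark.createDataFrame(findSynonymsArray(word, num)).toDF("word", "similarity")
}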

On Fri, Jan 6, 2017 at 4:03 AM, Asher Krim <ak...@hubspot.com> wrote:

> Felix - I'm not sure I understand your example about pipeline models;
> could you elaborate? I'm talking about the `findSynonyms` methods, which
> AFAIK have nothing to do with pipeline models.
>
> Joseph - Cool, thanks, I'll PR something in the next few days (and reopen
> SPARK-17629 <https://issues.apache.org/jira/browse/SPARK-17629>)
>
> On Fri, Jan 6, 2017 at 12:33 AM, Joseph Bradley <jo...@databricks.com>
> wrote:
>
>> We returned a DataFrame since it is a nicer API, but I agree forcing RDD
>> operations is not ideal.  I'd be OK with adding a new method, but I agree
>> with Felix that we cannot break the API for something like this.
>>
>> On Thu, Jan 5, 2017 at 12:44 PM, Felix Cheung <fe...@hotmail.com>
>> wrote:
>>
>>> Given how Word2Vec is used in the pipeline model in the new ml
>>> implementation, we might need to keep the current behavior?
>>>
>>>
>>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/Word2VecExample.scala
>>>
>>>
>>> _____________________________
>>> From: Asher Krim <ak...@hubspot.com>
>>> Sent: Tuesday, January 3, 2017 11:58 PM
>>> Subject: Re: ml word2vec findSynonyms return type
>>> To: Felix Cheung <fe...@hotmail.com>
>>> Cc: <ma...@gmail.com>, Joseph Bradley <
>>> joseph@databricks.com>, <de...@spark.apache.org>
>>>
>>>
>>>
>>> The jira: https://issues.apache.org/jira/browse/SPARK-17629
>>>
>>> Adding new methods could result in method clutter. Changing behavior of
>>> non-experimental classes is unfortunate (ml Word2Vec was marked
>>> Experimental until Spark 2.0). Neither option is great. If I had to pick, I
>>> would rather change the existing methods to keep the class simpler moving
>>> forward.
>>>
>>>
>>> On Sat, Dec 31, 2016 at 8:29 AM, Felix Cheung <felixcheung_m@hotmail.com
>>> > wrote:
>>>
>>>> Could you link to the JIRA here?
>>>>
>>>> What you suggest makes sense to me. Though we might want to maintain
>>>> compatibility and add a new method instead of changing the return type of
>>>> the existing one.
>>>>
>>>>
>>>> _____________________________
>>>> From: Asher Krim <ak...@hubspot.com>
>>>> Sent: Wednesday, December 28, 2016 11:52 AM
>>>> Subject: ml word2vec findSynonyms return type
>>>> To: <de...@spark.apache.org>
>>>> Cc: <ma...@gmail.com>, Joseph Bradley <
>>>> joseph@databricks.com>
>>>>
>>>>
>>>>
>>>> Hey all,
>>>>
>>>> I would like to propose changing the return type of `findSynonyms` in
>>>> ml's Word2Vec
>>>> <https://github.com/apache/spark/blob/branch-2.1/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala#L233-L248>
>>>> :
>>>>
>>>> def findSynonyms(word: String, num: Int): DataFrame = {
>>>>   val spark = SparkSession.builder().getOrCreate()
>>>>   spark.createDataFrame(wordVectors.findSynonyms(word, num)).toDF("word", "similarity")
>>>> }
>>>>
>>>> I find it very strange that the results are parallelized before being
>>>> returned to the user. The results are already on the driver to begin with,
>>>> and I can imagine that for most use cases (and definitely for ours) the
>>>> synonyms are collected right back to the driver. This incurs both an added
>>>> cost of shipping data to and from the cluster and a more cumbersome
>>>> interface than necessary.
>>>>
>>>> Can we change it to just the following?
>>>>
>>>> def findSynonyms(word: String, num: Int): Array[(String, Double)] = {
>>>>   wordVectors.findSynonyms(word, num)
>>>> }
>>>>
>>>> If the user wants the results parallelized, they can still do so on
>>>> their own.
>>>>
>>>> (I had brought this up a while back in Jira. It was suggested that the
>>>> mailing list would be a better forum to discuss it, so here we are.)
>>>>
>>>> Thanks,
>>>> --
>>>> Asher Krim
>>>> Senior Software Engineer
>>>>
>>>>
>>>
>>>
>>
>>
>> --
>>
>> Joseph Bradley
>>
>> Software Engineer - Machine Learning
>>
>> Databricks, Inc.
>>
>>
>
>
>
> --
> Asher Krim
> Senior Software Engineer
>

Re: ml word2vec findSynonyms return type

Posted by Asher Krim <ak...@hubspot.com>.
Felix - I'm not sure I understand your example about pipeline models; could
you elaborate? I'm talking about the `findSynonyms` methods, which AFAIK
have nothing to do with pipeline models.

Joseph - Cool, thanks, I'll PR something in the next few days (and reopen
SPARK-17629 <https://issues.apache.org/jira/browse/SPARK-17629>)

On Fri, Jan 6, 2017 at 12:33 AM, Joseph Bradley <jo...@databricks.com>
wrote:

> We returned a DataFrame since it is a nicer API, but I agree forcing RDD
> operations is not ideal.  I'd be OK with adding a new method, but I agree
> with Felix that we cannot break the API for something like this.
>
> On Thu, Jan 5, 2017 at 12:44 PM, Felix Cheung <fe...@hotmail.com>
> wrote:
>
>> Given how Word2Vec is used in the pipeline model in the new ml
>> implementation, we might need to keep the current behavior?
>>
>>
>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/Word2VecExample.scala
>>
>>
>> _____________________________
>> From: Asher Krim <ak...@hubspot.com>
>> Sent: Tuesday, January 3, 2017 11:58 PM
>> Subject: Re: ml word2vec findSynonyms return type
>> To: Felix Cheung <fe...@hotmail.com>
>> Cc: <ma...@gmail.com>, Joseph Bradley <
>> joseph@databricks.com>, <de...@spark.apache.org>
>>
>>
>>
>> The jira: https://issues.apache.org/jira/browse/SPARK-17629
>>
>> Adding new methods could result in method clutter. Changing behavior of
>> non-experimental classes is unfortunate (ml Word2Vec was marked
>> Experimental until Spark 2.0). Neither option is great. If I had to pick, I
>> would rather change the existing methods to keep the class simpler moving
>> forward.
>>
>>
>> On Sat, Dec 31, 2016 at 8:29 AM, Felix Cheung <fe...@hotmail.com>
>> wrote:
>>
>>> Could you link to the JIRA here?
>>>
>>> What you suggest makes sense to me. Though we might want to maintain
>>> compatibility and add a new method instead of changing the return type of
>>> the existing one.
>>>
>>>
>>> _____________________________
>>> From: Asher Krim <ak...@hubspot.com>
>>> Sent: Wednesday, December 28, 2016 11:52 AM
>>> Subject: ml word2vec findSynonyms return type
>>> To: <de...@spark.apache.org>
>>> Cc: <ma...@gmail.com>, Joseph Bradley <
>>> joseph@databricks.com>
>>>
>>>
>>>
>>> Hey all,
>>>
>>> I would like to propose changing the return type of `findSynonyms` in
>>> ml's Word2Vec
>>> <https://github.com/apache/spark/blob/branch-2.1/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala#L233-L248>
>>> :
>>>
>>> def findSynonyms(word: String, num: Int): DataFrame = {
>>>   val spark = SparkSession.builder().getOrCreate()
>>>   spark.createDataFrame(wordVectors.findSynonyms(word, num)).toDF("word", "similarity")
>>> }
>>>
>>> I find it very strange that the results are parallelized before being
>>> returned to the user. The results are already on the driver to begin with,
>>> and I can imagine that for most use cases (and definitely for ours) the
>>> synonyms are collected right back to the driver. This incurs both an added
>>> cost of shipping data to and from the cluster and a more cumbersome
>>> interface than necessary.
>>>
>>> Can we change it to just the following?
>>>
>>> def findSynonyms(word: String, num: Int): Array[(String, Double)] = {
>>>   wordVectors.findSynonyms(word, num)
>>> }
>>>
>>> If the user wants the results parallelized, they can still do so on
>>> their own.
>>>
>>> (I had brought this up a while back in Jira. It was suggested that the
>>> mailing list would be a better forum to discuss it, so here we are.)
>>>
>>> Thanks,
>>> --
>>> Asher Krim
>>> Senior Software Engineer
>>>
>>>
>>
>>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
>



-- 
Asher Krim
Senior Software Engineer

Re: ml word2vec findSynonyms return type

Posted by Joseph Bradley <jo...@databricks.com>.
We returned a DataFrame since it is a nicer API, but I agree forcing RDD
operations is not ideal.  I'd be OK with adding a new method, but I agree
with Felix that we cannot break the API for something like this.
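
To spell out the round trip: with the current signature, a caller who only
wants the synonyms on the driver ends up writing something like the sketch
below (`model` stands for a fitted Word2VecModel and is not from the thread).

// Sketch of the round trip the current DataFrame return type forces on
// callers: driver-local results are parallelized, then immediately
// collected back. `model` is an assumed fitted Word2VecModel.
val synonyms: Array[(String, Double)] = model
  .findSynonyms("spark", 5)                           // returns a DataFrame
  .collect()                                          // ships rows back to the driver
  .map(row => (row.getString(0), row.getDouble(1)))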

On Thu, Jan 5, 2017 at 12:44 PM, Felix Cheung <fe...@hotmail.com>
wrote:

> Given how Word2Vec is used in the pipeline model in the new ml
> implementation, we might need to keep the current behavior?
>
>
> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/Word2VecExample.scala
>
>
> _____________________________
> From: Asher Krim <ak...@hubspot.com>
> Sent: Tuesday, January 3, 2017 11:58 PM
> Subject: Re: ml word2vec findSynonyms return type
> To: Felix Cheung <fe...@hotmail.com>
> Cc: <ma...@gmail.com>, Joseph Bradley <
> joseph@databricks.com>, <de...@spark.apache.org>
>
>
>
> The jira: https://issues.apache.org/jira/browse/SPARK-17629
>
> Adding new methods could result in method clutter. Changing behavior of
> non-experimental classes is unfortunate (ml Word2Vec was marked
> Experimental until Spark 2.0). Neither option is great. If I had to pick, I
> would rather change the existing methods to keep the class simpler moving
> forward.
>
>
> On Sat, Dec 31, 2016 at 8:29 AM, Felix Cheung <fe...@hotmail.com>
> wrote:
>
>> Could you link to the JIRA here?
>>
>> What you suggest makes sense to me. Though we might want to maintain
>> compatibility and add a new method instead of changing the return type of
>> the existing one.
>>
>>
>> _____________________________
>> From: Asher Krim <ak...@hubspot.com>
>> Sent: Wednesday, December 28, 2016 11:52 AM
>> Subject: ml word2vec findSynonyms return type
>> To: <de...@spark.apache.org>
>> Cc: <ma...@gmail.com>, Joseph Bradley <
>> joseph@databricks.com>
>>
>>
>>
>> Hey all,
>>
>> I would like to propose changing the return type of `findSynonyms` in
>> ml's Word2Vec
>> <https://github.com/apache/spark/blob/branch-2.1/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala#L233-L248>
>> :
>>
>> def findSynonyms(word: String, num: Int): DataFrame = {
>>   val spark = SparkSession.builder().getOrCreate()
>>   spark.createDataFrame(wordVectors.findSynonyms(word, num)).toDF("word", "similarity")
>> }
>>
>> I find it very strange that the results are parallelized before being
>> returned to the user. The results are already on the driver to begin with,
>> and I can imagine that for most use cases (and definitely for ours) the
>> synonyms are collected right back to the driver. This incurs both an added
>> cost of shipping data to and from the cluster and a more cumbersome
>> interface than necessary.
>>
>> Can we change it to just the following?
>>
>> def findSynonyms(word: String, num: Int): Array[(String, Double)] = {
>>   wordVectors.findSynonyms(word, num)
>> }
>>
>> If the user wants the results parallelized, they can still do so on their
>> own.
>>
>> (I had brought this up a while back in Jira. It was suggested that the
>> mailing list would be a better forum to discuss it, so here we are.)
>>
>> Thanks,
>> --
>> Asher Krim
>> Senior Software Engineer
>>
>>
>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.


Re: ml word2vec findSynonyms return type

Posted by Felix Cheung <fe...@hotmail.com>.
Given how Word2Vec is used in the pipeline model in the new ml implementation, we might need to keep the current behavior?


https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/Word2VecExample.scala
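
Condensed, the usage pattern in that example stays inside the DataFrame API
from end to end, roughly like the sketch below (paraphrased, not the example
verbatim; `spark` is an assumed SparkSession already in scope).

// Condensed sketch of DataFrame-centric Word2Vec usage in spark.ml
// (paraphrased from the linked example, not verbatim).
import org.apache.spark.ml.feature.Word2Vec

val documentDF = spark.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" ")
).map(Tuple1.apply)).toDF("text")

val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)

val model = word2Vec.fit(documentDF)
val result = model.transform(documentDF)   // result is itself a DataFrame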


_____________________________
From: Asher Krim <ak...@hubspot.com>>
Sent: Tuesday, January 3, 2017 11:58 PM
Subject: Re: ml word2vec findSynonyms return type
To: Felix Cheung <fe...@hotmail.com>>
Cc: <ma...@gmail.com>>, Joseph Bradley <jo...@databricks.com>>, <de...@spark.apache.org>>


The jira: https://issues.apache.org/jira/browse/SPARK-17629

Adding new methods could result in method clutter. Changing behavior of non-experimental classes is unfortunate (ml Word2Vec was marked Experimental until Spark 2.0). Neither option is great. If I had to pick, I would rather change the existing methods to keep the class simpler moving forward.


On Sat, Dec 31, 2016 at 8:29 AM, Felix Cheung <fe...@hotmail.com>> wrote:
Could you link to the JIRA here?

What you suggest makes sense to me. Though we might want to maintain compatibility and add a new method instead of changing the return type of the existing one.


_____________________________
From: Asher Krim <ak...@hubspot.com>>
Sent: Wednesday, December 28, 2016 11:52 AM
Subject: ml word2vec findSynonyms return type
To: <de...@spark.apache.org>>
Cc: <ma...@gmail.com>>, Joseph Bradley <jo...@databricks.com>>



Hey all,

I would like to propose changing the return type of `findSynonyms` in ml's Word2Vec <https://github.com/apache/spark/blob/branch-2.1/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala#L233-L248>:

def findSynonyms(word: String, num: Int): DataFrame = {
  val spark = SparkSession.builder().getOrCreate()
  spark.createDataFrame(wordVectors.findSynonyms(word, num)).toDF("word", "similarity")
}


I find it very strange that the results are parallelized before being returned to the user. The results are already on the driver to begin with, and I can imagine that for most use cases (and definitely for ours) the synonyms are collected right back to the driver. This incurs both an added cost of shipping data to and from the cluster and a more cumbersome interface than necessary.

Can we change it to just the following?

def findSynonyms(word: String, num: Int): Array[(String, Double)] = {
  wordVectors.findSynonyms(word, num)
}

If the user wants the results parallelized, they can still do so on their own.

(I had brought this up a while back in Jira. It was suggested that the mailing list would be a better forum to discuss it, so here we are.)

Thanks,
--
Asher Krim
Senior Software Engineer