Posted to dev@spark.apache.org by Asher Krim <ak...@hubspot.com> on 2017/02/05 20:24:16 UTC
Re: ml word2vec findSynonyms return type

It took me a while, but I finally got around to this:
https://github.com/apache/spark/pull/16811/files

On Fri, Jan 6, 2017 at 4:03 AM, Asher Krim <ak...@hubspot.com> wrote:

> Felix - I'm not sure I understand your example about pipeline models,
> could you elaborate? I'm talking about the `findSynonyms` methods, which
> AFAIK have nothing to do with pipeline models.
>
> Joseph - Cool, thanks, I'll PR something in the next few days (and reopen
> SPARK-17629 <https://issues.apache.org/jira/browse/SPARK-17629>)
>
> On Fri, Jan 6, 2017 at 12:33 AM, Joseph Bradley <jo...@databricks.com>
> wrote:
>
>> We returned a DataFrame since it is a nicer API, but I agree forcing RDD
>> operations is not ideal.  I'd be OK with adding a new method, but I agree
>> with Felix that we cannot break the API for something like this.
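
A minimal sketch of the "add a new method" option, under stated assumptions: `SynonymModelSketch` is a hypothetical stand-in for the ml model, `lookup` stands in for `wordVectors.findSynonyms`, and the name `findSynonymsLocal` is made up for illustration. None of these names come from this thread. The existing method keeps its signature, so callers that rely on the DataFrame are not broken.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical stand-in for the ml Word2Vec model. `lookup` plays the role of
// wordVectors.findSynonyms and is an assumption made for illustration only.
class SynonymModelSketch(lookup: (String, Int) => Array[(String, Double)]) {

  // Existing method: same signature and DataFrame return type as before.
  def findSynonyms(word: String, num: Int): DataFrame = {
    val spark = SparkSession.builder().getOrCreate()
    spark.createDataFrame(findSynonymsLocal(word, num).toSeq).toDF("word", "similarity")
  }

  // New method (hypothetical name): returns the driver-local result directly,
  // without building a DataFrame.
  def findSynonymsLocal(word: String, num: Int): Array[(String, Double)] =
    lookup(word, num)
}
```

The trade-off is the one raised elsewhere in the thread: the API stays compatible at the cost of one more public method on the class.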
>>
>> On Thu, Jan 5, 2017 at 12:44 PM, Felix Cheung <fe...@hotmail.com>
>> wrote:
>>
>>> Given how Word2Vec is used in the pipeline model in the new ml
>>> implementation, we might need to keep the current behavior?
>>>
>>>
>>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/Word2VecExample.scala
>>>
>>>
>>> _____________________________
>>> From: Asher Krim <ak...@hubspot.com>
>>> Sent: Tuesday, January 3, 2017 11:58 PM
>>> Subject: Re: ml word2vec findSynonyms return type
>>> To: Felix Cheung <fe...@hotmail.com>
>>> Cc: <ma...@gmail.com>, Joseph Bradley <joseph@databricks.com>, <de...@spark.apache.org>
>>>
>>>
>>>
>>> The jira: https://issues.apache.org/jira/browse/SPARK-17629
>>>
>>> Adding new methods could result in method clutter. Changing behavior of
>>> non-experimental classes is unfortunate (ml Word2Vec was marked
>>> Experimental until Spark 2.0). Neither option is great. If I had to pick, I
>>> would rather change the existing methods to keep the class simpler moving
>>> forward.
>>>
>>>
>>> On Sat, Dec 31, 2016 at 8:29 AM, Felix Cheung <felixcheung_m@hotmail.com> wrote:
>>>
>>>> Could you link to the JIRA here?
>>>>
>>>> What you suggest makes sense to me. Though we might want to maintain
>>>> compatibility and add a new method instead of changing the return type of
>>>> the existing one.
>>>>
>>>>
>>>> _____________________________
>>>> From: Asher Krim <ak...@hubspot.com>
>>>> Sent: Wednesday, December 28, 2016 11:52 AM
>>>> Subject: ml word2vec findSynonyms return type
>>>> To: <de...@spark.apache.org>
>>>> Cc: <ma...@gmail.com>, Joseph Bradley <joseph@databricks.com>
>>>>
>>>>
>>>>
>>>> Hey all,
>>>>
>>>> I would like to propose changing the return type of `findSynonyms` in
>>>> ml's Word2Vec
>>>> <https://github.com/apache/spark/blob/branch-2.1/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala#L233-L248>:
>>>>
>>>> def findSynonyms(word: String, num: Int): DataFrame = {
>>>>   val spark = SparkSession.builder().getOrCreate()
>>>>   spark.createDataFrame(wordVectors.findSynonyms(word, num)).toDF("word", "similarity")
>>>> }
>>>>
>>>> I find it very strange that the results are parallelized before being
>>>> returned to the user. The results are already on the driver to begin with,
>>>> and I can imagine that for most use cases (and definitely for ours) the
>>>> synonyms are collected right back to the driver. This incurs both the added
>>>> cost of shipping data to and from the cluster and a more cumbersome
>>>> interface than needed.
>>>>
>>>> Can we change it to just the following?
>>>>
>>>> def findSynonyms(word: String, num: Int): Array[(String, Double)] = {
>>>>   wordVectors.findSynonyms(word, num)
>>>> }
>>>>
>>>> If the user wants the results parallelized, they can still do so on
>>>> their own.
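
A minimal sketch of what each direction looks like on the caller's side, assuming the DataFrame has the `word` (String) and `similarity` (Double) columns shown above. `SynonymConversions`, `toDataFrame`, and `toLocal` are hypothetical helper names used only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

// Hypothetical helpers, not part of Spark.
object SynonymConversions {

  // With an Array return type: callers who do want a DataFrame build it themselves.
  def toDataFrame(spark: SparkSession, synonyms: Array[(String, Double)]): DataFrame =
    spark.createDataFrame(synonyms.toSeq).toDF("word", "similarity")

  // With the DataFrame return type: callers who want local results
  // have to collect the rows back to the driver and unpack them.
  def toLocal(synonyms: DataFrame): Array[(String, Double)] =
    synonyms.collect().map {
      case Row(word: String, similarity: Double) => (word, similarity)
    }
}
```

With the current API, most callers would end up writing something like `toLocal`. With the proposed one, only callers who actually need a distributed result would pay for `toDataFrame`.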
>>>>
>>>> (I had brought this up a while back in Jira. It was suggested that the
>>>> mailing list would be a better forum to discuss it, so here we are.)
>>>>
>>>> Thanks,
>>>> --
>>>> Asher Krim
>>>> Senior Software Engineer
>>>>
>>>>
>>>
>>>
>>
>>
>> --
>>
>> Joseph Bradley
>>
>> Software Engineer - Machine Learning
>>
>> Databricks, Inc.
>>
>>
>
>
>
> --
> Asher Krim
> Senior Software Engineer
>