Posted to issues@spark.apache.org by "Yan Facai (颜发才 JIRA)" <ji...@apache.org> on 2017/07/08 06:29:01 UTC

[jira] [Comment Edited] (SPARK-21341) Spark 2.1.1: I want to be able to serialize wordVectors on Word2VecModel

    [ https://issues.apache.org/jira/browse/SPARK-21341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16078987#comment-16078987 ] 

Yan Facai (颜发才) edited comment on SPARK-21341 at 7/8/17 6:28 AM:
-----------------------------------------------------------------

Hi, [~zsellami].
I guess that since wordVectors is in fact an mllib model, which might be removed in the future, it is marked private and transient. More interestingly, the word vectors are saved in the data folder as a DataFrame, see:

{code}
override protected def saveImpl(path: String): Unit = {
  DefaultParamsWriter.saveMetadata(instance, path, sc)

  val wordVectors = instance.wordVectors.getVectors
  val dataSeq = wordVectors.toSeq.map { case (word, vector) => Data(word, vector) }
  val dataPath = new Path(path, "data").toString
  sparkSession.createDataFrame(dataSeq)
    .repartition(calculateNumberOfPartitions)
    .write
    .parquet(dataPath)
}
{code}
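
For illustration, here is a rough sketch of how that data folder could be read back into a word -> vector map. It is only a guess at the idea, not the actual reader in Spark; the model path is made up and the float element type is an assumption:

{code}
// Illustrative only, not the actual Spark reader: read the "data" folder written
// by saveImpl above back into a word -> vector map. Assumes vectors are stored
// as arrays of floats; the model path is made up.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val modelPath = "/tmp/word2vec-model"

val vectors: Map[String, Array[Float]] =
  spark.read.parquet(s"$modelPath/data")
    .select("word", "vector")
    .collect()
    .map(row => row.getString(0) -> row.getSeq[Float](1).toArray)
    .toMap
{code}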

In all, the developers did try to save the word vectors; however, it seems to be broken in the pipeline as you said.

So, could you share some example code to reproduce the bug?
I'd like to dig deeper.
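
For reference, here is my rough guess at what such a reproduction might look like, pieced together only from the description in the issue. The toy data, column names, and paths are all made up, so it may not match what you actually ran:

{code}
// Rough reproduction sketch (untested): fit a tiny pipeline with Word2Vec, then
// Java-serialize it through an RDD as described in the issue, reload it, and
// call transform. Paths and data are made up.
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("w2v-repro").getOrCreate()
val sc = spark.sparkContext

// Toy input: a single array-of-strings column.
val docs = spark.createDataFrame(Seq(
  "Hi I heard about Spark".split(" "),
  "I wish Java could use case classes".split(" ")
).map(Tuple1.apply)).toDF("text")

val word2Vec = new Word2Vec()
  .setInputCol("text")
  .setOutputCol("result")
  .setVectorSize(3)
  .setMinCount(0)

val fittedPipeline = new Pipeline().setStages(Array(word2Vec)).fit(docs)

// Persist via Java serialization instead of the ML writer.
sc.parallelize(Seq(fittedPipeline), numSlices = 1).saveAsObjectFile("/tmp/w2v-pipeline-obj")
val restored = sc.objectFile[PipelineModel]("/tmp/w2v-pipeline-obj").first()

// If the report is right, this throws a NullPointerException from
// Word2VecModel.transform, because the @transient wordVectors field is
// null after deserialization.
restored.transform(docs).show()
{code}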


> Spark 2.1.1: I want to be able to serialize wordVectors on Word2VecModel 
> -------------------------------------------------------------------------
>
>                 Key: SPARK-21341
>                 URL: https://issues.apache.org/jira/browse/SPARK-21341
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.1.1
>            Reporter: Zied Sellami
>
> I am using sparkContext.saveAsObjectFile to save a complex object containing a PipelineModel with a Word2Vec ML Transformer. When I load the object and call myPipelineModel.transform, Word2VecModel raises a null pointer exception at line 292 of Word2Vec.scala ("wordVectors.getVectors"). I resolved the problem by removing the @transient annotation on val wordVectors and the @transient lazy val on the getVectors function.
> - Why are these 2 vals transient?
> - Is there any solution, such as adding a boolean option on the Word2Vec Transformer, to force the serialization of wordVectors?
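
As a possible workaround (untested here), the fitted pipeline could be persisted with the built-in ML writer instead of Java serialization; that path goes through the saveImpl shown in the comment above and rebuilds the word vectors on load. fittedPipeline and docs refer to the reproduction sketch earlier in this message, and the path is made up:

{code}
// Possible workaround (untested): persist the fitted pipeline with the ML writer
// instead of saveAsObjectFile, so the word vectors go through the parquet-based
// saveImpl shown above and are rebuilt on load. fittedPipeline and docs come from
// the reproduction sketch earlier; the path is made up.
import org.apache.spark.ml.PipelineModel

fittedPipeline.write.overwrite().save("/tmp/w2v-pipeline-ml")
val reloaded = PipelineModel.load("/tmp/w2v-pipeline-ml")
reloaded.transform(docs).show()
{code}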


