Posted to issues@spark.apache.org by "Yan Facai (颜发才 JIRA)" <ji...@apache.org> on 2017/07/08 06:29:01 UTC
[jira] [Comment Edited] (SPARK-21341) Spark 2.1.1: I want to be able to serialize wordVectors on Word2VecModel
[ https://issues.apache.org/jira/browse/SPARK-21341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16078987#comment-16078987 ]
Yan Facai (颜发才) edited comment on SPARK-21341 at 7/8/17 6:28 AM:
-----------------------------------------------------------------
Hi, [~zsellami].
I guess that wordVectors is marked private and transient because it is in fact an mllib model, which might be removed in the future. More interestingly, the word vectors are saved in the data folder as a DataFrame; see:
{code}
override protected def saveImpl(path: String): Unit = {
  DefaultParamsWriter.saveMetadata(instance, path, sc)

  val wordVectors = instance.wordVectors.getVectors
  val dataSeq = wordVectors.toSeq.map { case (word, vector) => Data(word, vector) }
  val dataPath = new Path(path, "data").toString
  sparkSession.createDataFrame(dataSeq)
    .repartition(calculateNumberOfPartitions)
    .write
    .parquet(dataPath)
}
{code}
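For comparison, here is a minimal sketch of the public ML persistence path (write/save and load), which is the route that goes through a writer like the one above. The object name Word2VecPersistSketch, the toy sentences, the column names and the temporary directory are all made up for illustration, and it assumes Spark ML is on the classpath and a local master is acceptable:

```scala
import java.nio.file.Files

import org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.sql.SparkSession

// Hypothetical sketch: fit a tiny Word2Vec model, save it with the ML writer,
// load it back, and compare the number of word vectors before and after.
object Word2VecPersistSketch {
  def run(): (Long, Long) = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("word2vec-persist-sketch")
      .getOrCreate()

    // Toy corpus: one row per sentence, each an array of tokens.
    val docs = spark.createDataFrame(Seq(
      "spark ml word2vec model".split(" "),
      "save and load the model".split(" ")
    ).map(Tuple1.apply)).toDF("text")

    val model = new Word2Vec()
      .setInputCol("text")
      .setOutputCol("result")
      .setVectorSize(3)
      .setMinCount(0)
      .fit(docs)

    // Illustrative location only; any writable path would do.
    val path = Files.createTempDirectory("w2v").resolve("model").toString

    // Metadata goes under <path>/metadata, the word vectors under <path>/data.
    model.write.overwrite().save(path)
    val loaded = Word2VecModel.load(path)

    // getVectors returns a DataFrame with one row per vocabulary word.
    val counts = (model.getVectors.count(), loaded.getVectors.count())
    spark.stop()
    counts
  }

  def main(args: Array[String]): Unit = {
    val (before, after) = run()
    println(s"vectors before save: $before, after load: $after")
  }
}
```

If this path works while the object-file route fails, that would suggest the problem is specific to plain Java serialization of the model rather than to the writer.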
In short, the developers did try to save the wordVectors; however, as you said, saving seems to be broken in a pipeline.
Could you give some example code that reproduces the bug?
I'd like to dig deeper.
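In the meantime, the null pointer error described in the issue can be illustrated without Spark. The sketch below is not Spark's code: Vectors, Model and TransientDemo are hypothetical stand-ins that only mimic the shape of the real class (a @transient val wrapped by a @transient lazy val), and the round trip uses plain Java serialization, which I assume is what the object file route relies on:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for the wrapped mllib model (deliberately not Serializable).
class Vectors(val data: Map[String, Array[Float]]) {
  def getVectors: Map[String, Array[Float]] = data
}

// Hypothetical stand-in for the ml model: the wrapped model is a @transient val,
// and the public accessor is a @transient lazy val derived from it.
class Model(@transient private val wordVectors: Vectors) extends Serializable {
  @transient lazy val getVectors: Map[String, Array[Float]] = wordVectors.getVectors
}

object TransientDemo {
  // Round-trip an object through plain Java serialization, in memory.
  def roundTrip[T <: AnyRef](obj: T): T = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val model = new Model(new Vectors(Map("spark" -> Array(0.1f, 0.2f))))
    val restored = roundTrip(model)

    // The original instance still holds its wrapped model.
    println(s"original has vectors: ${model.getVectors.nonEmpty}")

    // The transient field is skipped on write, so it is null in the copy and
    // the lazy val fails when it is first evaluated.
    val failed =
      try { restored.getVectors; false }
      catch { case _: NullPointerException => true }
    println(s"restored copy throws NullPointerException: $failed")
  }
}
```

Because the transient field is never written, the restored copy should fail on first access to getVectors, which matches the symptom reported at "wordVectors.getVectors".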
> Spark 2.1.1: I want to be able to serialize wordVectors on Word2VecModel
> -------------------------------------------------------------------------
>
> Key: SPARK-21341
> URL: https://issues.apache.org/jira/browse/SPARK-21341
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.1.1
> Reporter: Zied Sellami
>
> I am using sparkContext.saveAsObjectFile to save a complex object containing a pipelineModel with a Word2Vec ML Transformer. When I load the object and call myPipelineModel.transform, Word2VecModel raises a null pointer error at line 292 of Word2Vec.scala ("wordVectors.getVectors"). I resolved the problem by removing the @transient annotation on val wordVectors and the @transient lazy val on the getVectors function.
> - Why are these two vals transient?
> - Could a boolean option be added to the Word2Vec Transformer to force the serialization of wordVectors?