Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2017/06/25 11:39:02 UTC

[jira] [Resolved] (SPARK-21207) ML/MLLIB Save Word2Vec Yarn Cluster

     [ https://issues.apache.org/jira/browse/SPARK-21207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-21207.
-------------------------------
          Resolution: Invalid
    Target Version/s:   (was: 2.0.1)

Questions go on the mailing list, like user@
http://spark.apache.org/contributing.html

> ML/MLLIB Save Word2Vec Yarn Cluster 
> ------------------------------------
>
>                 Key: SPARK-21207
>                 URL: https://issues.apache.org/jira/browse/SPARK-21207
>             Project: Spark
>          Issue Type: Question
>          Components: ML, MLlib, PySpark, YARN
>    Affects Versions: 2.0.1
>         Environment: OS : CentOS Linux release 7.3.1611 (Core) 
> Cluster nodes:
> * vendor_id	: GenuineIntel
> * cpu family	: 6
> * model		: 79
> * model name	: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
>            Reporter: offvolt
>
> Hello everyone,
> I have a question about the ML and MLlib Word2Vec implementations, because I have a problem saving a model on a YARN cluster.
> I already have working code with Word2Vec from MLlib:
> from pyspark import SparkContext
> from pyspark.mllib.feature import Word2Vec
> from pyspark.mllib.feature import Word2VecModel
> # placeholder values for illustration; the real paths and parameters
> # are defined elsewhere in my job
> pathCorpus = "hdfs:///user/test/corpus.txt"
> pathModel = "hdfs:///user/test/w2v_mllib.model"
> k = 100      # vector size
> itera = 5    # number of iterations
> sc = SparkContext()
> inp = sc.textFile(pathCorpus).map(lambda row: row.split(" "))
> word2vec = Word2Vec().setVectorSize(k).setNumIterations(itera)
> model = word2vec.fit(inp)
> model.save(sc, pathModel)
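> (For reference, a model saved this way can be read back with the standard MLlib loader; a minimal sketch, assuming the same sc and pathModel as above:)
> from pyspark.mllib.feature import Word2VecModel
> # reload the model directory written by model.save(sc, pathModel)
> sameModel = Word2VecModel.load(sc, pathModel)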
> This code works well on a YARN cluster when I use spark-submit like this:
> spark-submit --conf spark.driver.maxResultSize=2G --master yarn --deploy-mode cluster  --driver-memory 16G --executor-memory 10G --num-executors 10 --executor-cores 4 MyCode.py
> But I want to use the new ML library, so I do this:
> from pyspark import SparkContext
> from pyspark.sql import SQLContext
> from pyspark.sql.functions import split
> from pyspark.ml.feature import Word2Vec
> from pyspark.ml.feature import Word2VecModel
> # placeholder values for illustration, matching the MLlib version above
> corpusPath = "hdfs:///user/test/corpus.txt"
> k = 100         # vector size
> minCount = 5    # minimum token frequency
> itera = 5       # number of iterations
> pathModel = "hdfs:///user/test/w2v.model"
> sc = SparkContext(appName='Test_App')
> sqlContext = SQLContext(sc)
> raw_text = sqlContext.read.text(corpusPath).select(split("value", " ")).toDF("words")
> # use one fewer partition than the input (guarded so it is at least 1)
> numPart = max(raw_text.rdd.getNumPartitions() - 1, 1)
> word2Vec = Word2Vec(vectorSize=k, inputCol="words", outputCol="features", minCount=minCount, maxIter=itera).setNumPartitions(numPart)
> model = word2Vec.fit(raw_text)
> model.findSynonyms("Paris", 20).show()
> model.save(pathModel)
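> (Again for reference, the saved ML model should load back with Word2VecModel.load; a minimal sketch, assuming the pathModel above:)
> from pyspark.ml.feature import Word2VecModel
> # reload the model directory written by model.save(pathModel)
> sameModel = Word2VecModel.load(pathModel)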
> This code works in local mode, but when I try to deploy it in cluster mode (as above) I have a problem: when one node writes to the HDFS folder, the others cannot write into it, so in the end I get an empty folder instead of the many Parquet files I get with MLlib. I don't understand why it works with MLlib but not with ML, since I submit my code with the same configuration.
> Do you have an idea how I can solve this problem?
> I hope I was clear enough.
> Thanks,



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org