Posted to issues@spark.apache.org by "L. C. Hsieh (Jira)" <ji...@apache.org> on 2019/12/06 00:34:00 UTC

[jira] [Resolved] (SPARK-24666) Word2Vec generate infinity vectors when numIterations are large

     [ https://issues.apache.org/jira/browse/SPARK-24666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

L. C. Hsieh resolved SPARK-24666.
---------------------------------
    Fix Version/s: 3.1.0
                   2.4.5
       Resolution: Fixed

Issue resolved by pull request 26722
[https://github.com/apache/spark/pull/26722]

> Word2Vec generate infinity vectors when numIterations are large
> ---------------------------------------------------------------
>
>                 Key: SPARK-24666
>                 URL: https://issues.apache.org/jira/browse/SPARK-24666
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 2.3.1, 2.4.4
>         Environment:  2.0.X, 2.1.X, 2.2.X, 2.3.X, 2.4.X
>            Reporter: ZhongYu
>            Assignee: L. C. Hsieh
>            Priority: Critical
>             Fix For: 2.4.5, 3.1.0
>
>
> We found that Word2Vec generates vectors with very large absolute values when numIterations is large. If numIterations is large enough (>20), the vector values may be *Infinity* (or *-Infinity*), which makes the vectors useless.
> Under normal conditions, vector values mostly fall within -1.0 to 1.0 when numIterations = 1.
> The bug appears on Spark 2.0.X, 2.1.X, 2.2.X, 2.3.X, and 2.4.X.
> This bug was already reported in https://issues.apache.org/jira/browse/SPARK-5261 , but the fix seems to be missing.
> Other people's reports:
> [https://stackoverflow.com/questions/49741956/infinity-vectors-in-spark-mllib-word2vec]
> [http://apache-spark-user-list.1001560.n3.nabble.com/word2vec-outputs-Infinity-Infinity-vectors-with-increasing-iterations-td29020.html]
> =======================================================
> Here is the code to reproduce the issue. You can download title.akas.tsv from [https://datasets.imdbws.com/] and upload it to HDFS.
>  
> {code:java}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.ml.feature.Word2Vec
> case class Sentences(name: String, words: Array[String])
> import spark.implicits._
> // IMDb raw data: title.akas.tsv, downloaded from https://datasets.imdbws.com/
> val dataset = spark.read
>   .option("header", "true").option("sep", "\t")
>   .option("quote", "").option("nullValue", "\\N")
>   .csv("/tmp/word2vec/title.akas.tsv")
>   .filter("region = 'US' or language = 'en'")
>   .select("title")
>   .as[String]
>   .map(s => Sentences(s, s.split(' ')))
>   .persist()
> println("Training model...")
> val word2Vec = new Word2Vec()
>   .setInputCol("words")
>   .setOutputCol("vector")
>   .setVectorSize(64)
>   .setWindowSize(4)
>   .setNumPartitions(50)
>   .setMinCount(5)
>   .setMaxIter(20)
> val model = word2Vec.fit(dataset)
> model.getVectors.show()
> {code}
> When maxIter is set to 30, you get the following result.
> {code:java}
> scala> model.getVectors.show()
> +-------------+--------------------+
> |         word|              vector|
> +-------------+--------------------+
> |     Unspoken|[-Infinity,-Infin...|
> |       Talent|[Infinity,-Infini...|
> |    Hourglass|[1.09657520526310...|
> |Nickelodeon's|[2.20436549446219...|
> |      Priests|[-1.9625896848389...|
> |    Religion:|[-3.8815759928213...|
> |           Bu|[-7.9722236466752...|
> |      Totoro:|[-4.1829056206528...|
> |     Trouble,|[2.51985378203136...|
> |       Hatter|[8.49108115961009...|
> |          '79|[-5.4560309784650...|
> |         Vile|[-1.2059769646379...|
> |         9/11|[Infinity,-Infini...|
> |      Santino|[6.30405421282099...|
> |      Motives|[1.96207712570869...|
> |          '13|[-1.7641987324084...|
> |       Fierce|[-Infinity,Infini...|
> |       Stover|[5.10057474120744...|
> |          'It|[1.08629989605664...|
> |        Butts|[Infinity,Infinit...|
> +-------------+--------------------+
> only showing top 20 rows
> {code}
> In this case, setting maxIter to 20 may not produce Infinity, but it still produces very large absolute values. The outcome depends on the training data sample and other configuration.
> {code:java}
> scala> model.getVectors.show(2,false)
> +--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
> |word    |vector                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
> +--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
> |Unspoken|[-8.345756381631837E26,-4.521902763541592E26,-2.3382486258889084E27,-1.0244081299466769E27,-2.0078509112460803E27,-1.6760533100889865E27,-2.582670788770659E27,-3.38100521565687E26,1.7553847873565714E27,-1.170131062449021E27,-1.6565472801835883E27,-1.5594244347657445E27,-2.5150639513558596E26,1.949539129915606E27,-7.580918216717454E26,1.2361994783015613E27,-3.152053008864166E27,-8.185652662597534E26,-5.4443628225426E25,2.245579525466733E26,-1.97655047590181E27,2.8597275293150673E26,-1.1006336920210832E27,1.6166580407985987E27,1.5272882143409825E26,-1.0115330404529906E27,-1.8895683222101184E27,2.6156506156954E27,-1.698058504881491E27,-1.5132098806248563E27,3.7327358519511804E27,1.3356636582642166E27,2.3614379909704805E26,8.96912646624494E26,1.5518857669716535E27,-3.05221863964144E27,4.399680909202177E26,-2.607914789100649E27,-1.4080384994067242E27,2.7666078487221474E27,6.946950108699123E26,-1.1122679059344192E27,-2.3621557537823886E27,9.433206702172274E26,-2.3704690372536228E27,2.5086034219659006E27,2.0173186657484236E27,-1.8448836672357273E27,-1.5081404202054957E27,2.641836064055936E26,-5.613083015733733E26,-2.1296579720982533E26,-1.6550184140347592E27,-1.9152898718506886E27,1.25699596863538E27,-2.0774912070471012E27,-1.5454685136432914E27,-2.479843324641509E27,1.5560216745669318E27,-2.2176656540799786E27,-9.628781296451031E26,1.3663974096305426E27,1.6326327735924786E27,-1.9533865304335714E27]|
> |Talent  |[1.3996313289146157E31,-2.216329024373106E31,1.0729251707928603E31,-4.007120754159977E31,-7.217488429248302E30,3.579654497535965E31,2.7979270365837212E31,4.333613174196825E31,3.2947832174019738E31,-1.770444782887265E31,-1.1996572271408077E31,1.9686960444755403E31,-5.211369239778517E31,4.559579301984929E31,8.789691017490939E30,-3.3896103915518896E31,-2.842517781869879E31,3.653230690058367E31,1.6690004323711066E31,-1.1803405268246773E31,4.577673536512265E31,3.9686553942166427E31,-2.0779652882517364E31,9.553626958941078E29,-1.1967228014988571E31,2.667234660143298E31,-5.082234231802067E29,-5.053934698852727E31,2.911363689445293E31,4.57440169967406E31,2.296044625777839E31,3.4719839372636273E31,-4.753091634806606E30,-2.2139650908254315E31,5.747913246328898E31,-4.027332301367786E31,-3.3981312029599884E30,-3.235915541756495E31,-3.690297564613571E31,3.6645060993927487E31,2.32138854666024E31,-4.79833731565554E31,2.4538652976104142E31,4.91394707312416E30,2.2888500664401483E31,8.433142525511996E30,-2.3447174299865074E31,-3.9894235308718024E31,1.6571656530599892E31,3.743449438983912E31,5.619889452742693E31,2.0932366809902723E31,-2.2306515916821173E30,-4.2788883664425833E30,-8.754273117753689E30,-3.8767150140313846E30,-3.7649840346087072E31,-3.604430948638639E31,5.083292737026576E31,2.92915351645125E31,5.971055806972711E31,1.4773152095869043E31,5.12252479772471E31,3.035571146004139E31]                     |
> +--------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
> only showing top 2 rows
> {code}
>  
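The symptom described above (components that are Infinity, -Infinity, or implausibly large) can also be checked programmatically instead of by eye. The following is a minimal sketch in plain Scala, not part of Spark or of the fix in pull request 26722. The object name `VectorSanity`, its methods, and the threshold `1e6` are all hypothetical choices for illustration; the report itself only states that healthy values are mostly within -1.0 to 1.0.

```scala
// Hypothetical helper, not part of Spark: flags word vectors that look like
// the ones in the report (Infinity, NaN, or very large magnitudes).
object VectorSanity {
  // Assumed threshold for illustration only; tune it for your own data.
  val DefaultMaxAbs: Double = 1e6

  /** True if any component is NaN, infinite, or larger in magnitude than maxAbs. */
  def isDegenerate(values: Array[Double], maxAbs: Double = DefaultMaxAbs): Boolean =
    values.exists(v => v.isNaN || v.isInfinite || math.abs(v) > maxAbs)

  /** Returns the words whose vectors are degenerate, preserving input order. */
  def degenerateWords(vectors: Seq[(String, Array[Double])],
                      maxAbs: Double = DefaultMaxAbs): Seq[String] =
    vectors.collect { case (word, values) if isDegenerate(values, maxAbs) => word }
}
```

In spark-shell, one plausible way to feed this helper is to collect `model.getVectors` to the driver and convert each row's "vector" column to an `Array[Double]` before calling `VectorSanity.degenerateWords`. That assumes the vocabulary is small enough to fit in driver memory.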



