Posted to issues@spark.apache.org by "Nick Pentreath (JIRA)" <ji...@apache.org> on 2017/09/15 13:18:00 UTC

[jira] [Resolved] (SPARK-21958) Attempting to save large Word2Vec model hangs driver in constant GC.

     [ https://issues.apache.org/jira/browse/SPARK-21958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Pentreath resolved SPARK-21958.
------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.3.0

Issue resolved by pull request 19191
[https://github.com/apache/spark/pull/19191]

> Attempting to save large Word2Vec model hangs driver in constant GC.
> --------------------------------------------------------------------
>
>                 Key: SPARK-21958
>                 URL: https://issues.apache.org/jira/browse/SPARK-21958
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.2.0
>         Environment: Running spark on yarn, hadoop 2.7.2 provided by the cluster
>            Reporter: Travis Hegner
>              Labels: easyfix, patch, performance
>             Fix For: 2.3.0
>
>
> In the new version of Word2Vec, the model saving was modified to estimate an appropriate number of partitions based on the Kryo buffer size. This is a great improvement, but there is a caveat for very large models.
> Each {{(word, vector)}} tuple is transformed into a local case class, {{Data(word, vector)}}... I can only assume this is for the Kryo serialization process. The new version of the code iterates over the entire vocabulary to do this transformation in the driver's heap (the old version wrapped the entire datum), only to have the result then distributed to the cluster to be written into its parquet files.
> With extremely large vocabularies (~2 million docs, with uni-grams, bi-grams, and tri-grams), that local driver transformation causes the driver to hang indefinitely in GC, presumably because it is generating millions of short-lived objects faster than they can be collected.
> Perhaps I'm overlooking something, but it seems to me that since the result is distributed over the cluster to be saved _after_ the transformation anyway, we may as well distribute it _first_, allowing the cluster resources to do the transformation more efficiently, and then write the parquet file from there.
> I have a patch implemented, and am in the process of testing it at scale. I will open a pull request when I feel that the patch is successfully resolving the issue, and after making sure that it passes unit tests.
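
For archive readers, below is a minimal sketch of the two save paths contrasted in the description above. It is illustrative only: the {{Data}} case class, the output paths, the vocabulary map, and the partition count are stand-ins, not the actual Word2VecModel writer code nor the change made in the linked pull request.

{code:scala}
import org.apache.spark.sql.SparkSession

// Stand-in for the per-word record described above; the real writer's
// Data case class may differ.
case class Data(word: String, vector: Array[Float])

object Word2VecSaveSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("w2v-save-sketch").getOrCreate()
    import spark.implicits._

    // Stand-in for the driver-local vocabulary map; in a large model this
    // holds millions of (word, vector) entries.
    val wordVectors: Map[String, Array[Float]] = Map("spark" -> Array(0.1f, 0.2f))
    val numPartitions = 8 // stand-in for the Kryo-buffer-based estimate

    // Driver-heavy pattern described in the report: every Data object is
    // materialized on the driver before the collection is handed to Spark.
    val localRows = wordVectors.toSeq.map { case (w, v) => Data(w, v) }
    spark.createDataFrame(localRows)
      .repartition(numPartitions)
      .write.mode("overwrite").parquet("/tmp/w2v-driver-side")

    // Distribute-first alternative suggested in the report: parallelize the
    // raw (word, vector) pairs and build the Data objects on the executors.
    spark.sparkContext.parallelize(wordVectors.toSeq, numPartitions)
      .map { case (w, v) => Data(w, v) }
      .toDF()
      .write.mode("overwrite").parquet("/tmp/w2v-distributed")

    spark.stop()
  }
}
{code}

The second path keeps only the raw pairs on the driver and moves object construction onto the executors, which is the direction the description argues for.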



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org