Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2017/09/16 15:27:00 UTC

[jira] [Resolved] (SPARK-22039) Spark 2.1.1 driver OOM when using Interaction for large-scale sparse vectors

     [ https://issues.apache.org/jira/browse/SPARK-22039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-22039.
----------------------------------
    Resolution: Invalid

Questions should go to the mailing list. Let's start this on the mailing list first rather than filing it as a JIRA for now.

> Spark 2.1.1 driver OOM when using Interaction for large-scale sparse vectors
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-22039
>                 URL: https://issues.apache.org/jira/browse/SPARK-22039
>             Project: Spark
>          Issue Type: Question
>          Components: ML
>    Affects Versions: 2.1.1
>            Reporter: wuhaibo
>
> I'm working on large-scale logistic regression for CTR prediction, and when I use Interaction for feature engineering, the driver OOMs. In detail, I interact userid (one-hot encoded, 300,000 dimensions, sparse) with the base features (60 dimensions, dense); driver memory is set to 40 GB. (See the first sketch below.)
> So I tried to debug it remotely, and I found that Spark's Interaction creates a very large schema and that a lot of the work is done on the driver.
> There are two questions:
> From reading the source, I found that Interaction is implemented with sparse vectors, so it should not need this much memory, and I don't see why it needs to do this work on the driver. The interaction result is an 18,000,000-dimension sparse DataFrame; why is a schema of 18,000,000 StructFields so big? This is the dump file taken as the schema was being created; because it is too big, I could not dump all of it:
> https://i.stack.imgur.com/h0XBf.jpg
> So I implemented the interaction with the RDD API instead (see the second sketch below), and the job finished in 5 minutes, so I am wondering whether something is wrong here.
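
For context, the setup described above corresponds to a pipeline along the following lines. This is a minimal sketch against the Spark 2.1.x ML API; all column names (useridIndex, f1..f60, useridVec, baseVec, crossed) are hypothetical, and it assumes userid has already been mapped to a numeric index (e.g. with StringIndexer):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{Interaction, OneHotEncoder, VectorAssembler}

    // One-hot encode the user id: ~300,000 categories, so each row becomes a
    // sparse vector with a single nonzero entry.
    val encoder = new OneHotEncoder()
      .setInputCol("useridIndex")
      .setOutputCol("useridVec")

    // Pack the ~60 dense base features into a single vector column.
    val assembler = new VectorAssembler()
      .setInputCols((1 to 60).map(i => s"f$i").toArray)
      .setOutputCol("baseVec")

    // Cross the two vectors: 300,000 * 60 = 18,000,000 output dimensions.
    val interaction = new Interaction()
      .setInputCols(Array("useridVec", "baseVec"))
      .setOutputCol("crossed")

    val pipeline = new Pipeline().setStages(Array(encoder, assembler, interaction))
    // val crossedDf = pipeline.fit(df).transform(df)

One plausible reading of the heap dump, based on the 2.1.x source rather than anything confirmed in this report: each output row stays sparse, but Interaction also generates per-dimension ML attribute metadata for its output column, and with 18,000,000 dimensions that metadata is constructed on the driver.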
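
And a minimal sketch of the RDD-style reimplementation the reporter mentions, which computes the crossed features per row and never materializes a per-dimension schema on the driver. The cross function, its block layout, and the (userid vector, base features) row shape are assumptions for illustration, not the reporter's actual code:

    import org.apache.spark.ml.linalg.{SparseVector, Vector, Vectors}

    // Cross a sparse one-hot vector with a dense base-feature array by hand.
    // Only nonzero user slots contribute, so each output row carries
    // user.indices.length * base.length nonzeros (60 for a true one-hot input).
    def cross(user: SparseVector, base: Array[Double]): Vector = {
      val d = base.length
      val nnz = user.indices.length * d
      val indices = new Array[Int](nnz)
      val values = new Array[Double](nnz)
      var k = 0
      for (i <- user.indices.indices; j <- 0 until d) {
        // Block layout: output slot (userIndex * d + j) holds userValue * base(j).
        indices(k) = user.indices(i) * d + j
        values(k) = user.values(i) * base(j)
        k += 1
      }
      Vectors.sparse(user.size * d, indices, values)
    }

    // Hypothetical usage over an RDD[(SparseVector, Array[Double])]:
    // val crossed = rows.map { case (u, b) => cross(u, b) }

Because the result is just an RDD of sparse vectors, no 18,000,000-field StructType or attribute metadata ever exists, which is consistent with the 5-minute runtime the reporter saw.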



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org