Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2017/01/03 22:01:02 UTC
[jira] [Commented] (SPARK-19007) Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
[ https://issues.apache.org/jira/browse/SPARK-19007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15796313#comment-15796313 ]
Joseph K. Bradley commented on SPARK-19007:
-------------------------------------------
From discussion on the linked PR:
This JIRA uncovers a few issues:
* Setting the storage level used by predErrorCheckpointer within LDA. See [SPARK-19063] for work on that.
* Number of RDDs persisted by PeriodicCheckpointer: It currently persists 3 at a time. This is because RDDs may be materialized later than checkpointer.update() is called. Now that I look again, it may be possible to maintain 2 instead of 3 cached RDDs in the checkpointer's persistedQueue, but I'd want to check this more carefully. Lower priority because it is a more minor improvement.
* 2 RDDs remain cached by PeriodicCheckpointer: At the end of training, there are 2, not 1, RDDs cached. This could be fixed by adding a finalize() method to trait LDAOptimizer which can clean up the extra cached RDD. Unfortunately, now that the trait is public, we cannot change it. This fix will need to wait until we move the implementation to the spark.ml package, at which time we can fix the API.
I'll leave this issue open until we are ready to create tasks for the 2nd and 3rd items.
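The queue behavior discussed in the second item can be illustrated with a minimal, self-contained mock (hypothetical class and method names; the real PeriodicCheckpointer lives in org.apache.spark.mllib.impl and manages actual RDDs and checkpoint files):

```scala
import scala.collection.mutable

// Stand-in for an RDD: only tracks whether it is currently cached.
class MockDataset(val id: Int) {
  var cached = false
  def persist(): Unit = cached = true
  def unpersist(): Unit = cached = false
}

// Mock of the persistedQueue logic: keep at most `maxPersisted` datasets
// cached, unpersisting the oldest when a new one arrives. The real
// checkpointer keeps 3 rather than 2 because an older RDD may still be
// needed until newer ones are actually materialized.
class MockCheckpointer(maxPersisted: Int = 3) {
  private val persistedQueue = mutable.Queue.empty[MockDataset]
  def update(ds: MockDataset): Unit = {
    ds.persist()
    persistedQueue.enqueue(ds)
    while (persistedQueue.size > maxPersisted) {
      persistedQueue.dequeue().unpersist()
    }
  }
  def numPersisted: Int = persistedQueue.size
}
```

After repeated updates, only the last 3 datasets remain cached, which mirrors why 2 (not 1) RDDs can still be cached when training ends and no further update() calls occur.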
> Speedup and optimize the GradientBoostedTrees in the "data>memory" scene
> ------------------------------------------------------------------------
>
> Key: SPARK-19007
> URL: https://issues.apache.org/jira/browse/SPARK-19007
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.0.1, 2.0.2, 2.1.0
> Environment: A CDH cluster of 3 Red Hat servers (120 GB memory, 40 cores, 43 TB disk per server).
> Reporter: zhangdenghui
> Priority: Minor
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> Test data: 80 GB of CTR training data from Criteo Labs (http://criteolabs.wpengine.com/downloads/download-terabyte-click-logs/); I used 1 of the 24 days' data. Some features needed to be replaced by newly generated continuous features; the way to generate the new features follows the approach described in the XGBoost paper.
> Resource allocated: Spark on YARN, 20 executors, 8 GB memory and 2 cores per executor.
> Parameters: numIterations 10, maxDepth 8; the remaining parameters are defaults.
> I tested the GradientBoostedTrees algorithm in MLlib using the 80 GB CTR data mentioned above.
> It took 1.5 hours in total, and I found many task failures after 6 or 7 GBT rounds. Without these task failures and retries it could be much faster, saving about half the time. I think the failures are caused by the RDD named predError in the while loop of the boost method in GradientBoostedTrees.scala: the lineage of predError grows after every GBT round, which eventually causes failures like this:
> (ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 10.2 GB of 10 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.).
> I tried increasing spark.yarn.executor.memoryOverhead, but the memory needed is too much (even adding half again as much memory did not solve the problem), so I do not think that is a proper fix.
> Setting the predError checkpoint interval smaller would also cut the lineage, but it increases IO cost a lot.
> I tried another way to solve this problem: I persisted the predError RDD every round, used pre_predError to record the previous round's RDD, and unpersisted it because it is no longer needed.
> Finally, with my method it took about 45 minutes, with no task failures and no extra memory added.
> So when the data is much larger than memory, this small improvement can speed up GradientBoostedTrees by 1-2x.
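The persist-and-swap pattern the reporter describes can be sketched with a small runnable mock (hypothetical names; in actual Spark code, predError would be an RDD inside GradientBoostedTrees.boost, and persist/unpersist would be real RDD operations):

```scala
// Stand-in for an RDD: only tracks whether it is currently cached.
class MockRDD(val name: String) {
  var cached = false
  def persist(): MockRDD = { cached = true; this }
  def unpersist(): Unit = cached = false
}

// Simulate the boosting loop: persist each round's prediction-error RDD
// and unpersist the previous round's, so at most one stays cached.
def boostLoop(numIterations: Int): (MockRDD, Seq[MockRDD]) = {
  val all = Seq.newBuilder[MockRDD]
  var predError = new MockRDD("round-0").persist()
  all += predError
  var prePredError = predError
  for (i <- 1 until numIterations) {
    predError = new MockRDD(s"round-$i").persist() // new error for this round
    all += predError
    prePredError.unpersist()                       // previous round is no longer needed
    prePredError = predError
  }
  (predError, all.result())
}
```

Note that persisting does not by itself truncate lineage (only checkpointing does); the benefit reported here is that each round's predError is kept materialized, so failed tasks do not recompute the whole growing chain.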
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org