Posted to dev@spark.apache.org by WangJianfei <wa...@otcaix.iscas.ac.cn> on 2016/11/16 01:02:27 UTC

Re: Reduce the memory usage if we do sample first in GradientBoostedTrees if subsamplingRate < 1.0

With predError.zip(input) we get an RDD of pairs, so we could only sample on predError or input individually; but if we did that, we could no longer use zip (zip requires the same number of elements in each partition). Thank you!
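To illustrate the constraint, here is a minimal sketch (the names predError, input, and subsamplingRate follow this thread's code; it assumes a live SparkContext and is untested):

```scala
// RDD.zip requires both RDDs to have the same number of partitions
// *and* the same number of elements in each partition. Sampling only
// one side of the pair breaks that invariant:
val sampledInput = input.sample(withReplacement = false, fraction = subsamplingRate)
// predError.zip(sampledInput)  // would fail at runtime: partitions no longer line up

// Sampling the already-zipped RDD keeps each (error, point) pair together:
val sampledPairs = predError.zip(input)
  .sample(withReplacement = false, fraction = subsamplingRate)
```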




------------------ Original Message ------------------
From: "Joseph Bradley [via Apache Spark Developers List]" <ml...@n3.nabble.com>
Sent: Wednesday, November 16, 2016, 3:54 AM
To: "WangJianfei" <wa...@otcaix.iscas.ac.cn>

Subject: Re: Reduce the memory usage if we do sample first in GradientBoostedTrees if subsamplingRate < 1.0



Thanks for the suggestion.  That would be faster, but less accurate in most cases.  It's generally better to use a new random sample on each iteration, based on literature and results I've seen.

Joseph


On Fri, Nov 11, 2016 at 5:13 AM, WangJianfei <[hidden email]> wrote:
When we train the model, we use the data with a subsamplingRate, so if the
 subsamplingRate < 1.0, we can do a sample first to reduce the memory usage.
 See the code below in GradientBoostedTrees.boost():
 
  while (m < numIterations && !doneLearning) {
    // Update data with pseudo-residuals
    val data = predError.zip(input).map { case ((pred, _), point) =>
      LabeledPoint(-loss.gradient(pred, point.label), point.features)
    }

    timer.start(s"building tree $m")
    logDebug("###################################################")
    logDebug("Gradient boosting tree iteration " + m)
    logDebug("###################################################")
    val dt = new DecisionTreeRegressor().setSeed(seed + m)
    val model = dt.train(data, treeStrategy)
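A hypothetical sketch of the change proposed in this thread, using the variable names from the quoted loop (untested): sample the zipped (prediction error, point) pairs before the residual map, so only the subsample is materialized for tree building.

```scala
// Inside the boosting loop: sample the zipped pairs first, then map to
// pseudo-residuals, so downstream stages only see the subsample.
// Seeding with seed + m still draws a fresh sample each iteration,
// which is what Joseph's reply recommends for accuracy.
val data = predError.zip(input)
  .sample(withReplacement = false, fraction = subsamplingRate, seed = seed + m)
  .map { case ((pred, _), point) =>
    LabeledPoint(-loss.gradient(pred, point.label), point.features)
  }
val model = new DecisionTreeRegressor().setSeed(seed + m).train(data, treeStrategy)
```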
 
 
 
 
 
 --
 View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Reduce-the-memory-usage-if-we-do-same-first-in-GradientBoostedTrees-if-subsamplingRate-1-0-tp19826.html
 Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
 



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Reduce-the-memory-usage-if-we-do-same-first-inGradientBoostedTrees-if-subsamplingRate-1-0-tp19904.html