Posted to issues@spark.apache.org by "Jianfei Wang (JIRA)" <ji...@apache.org> on 2016/11/16 14:17:59 UTC
[jira] [Issue Comment Deleted] (SPARK-18463) I think it's necessary
to have an overridden method of sample
[ https://issues.apache.org/jira/browse/SPARK-18463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jianfei Wang updated SPARK-18463:
---------------------------------
Comment: was deleted
(was: OK, so what you mean is that rdd1.zip(rdd2).sample() won't use extra memory to store rdd1.zip(rdd2)? Thank you.)
> I think it's necessary to have an overridden method of sample
> -------------------------------------------------------------
>
> Key: SPARK-18463
> URL: https://issues.apache.org/jira/browse/SPARK-18463
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Reporter: Jianfei Wang
>
> Currently, to sample from two paired RDDs we have to write:
> rdd3 = rdd1.zip(rdd2).sample()
> If we could sample the two RDDs directly, for example with
> sample(rdd1, rdd2), we could reduce memory usage.
> There are use cases for this in Spark MLlib, such as in GradientBoostedTrees:
> while (m < numIterations && !doneLearning) {
>   // Update data with pseudo-residuals (residual errors)
>   val data = predError.zip(input).map { case ((pred, _), point) =>
>     LabeledPoint(-loss.gradient(pred, point.label), point.features)
>   }
>   val dt = new DecisionTreeRegressor().setSeed(seed + m)
>   val model = dt.train(data, treeStrategy)
>   // ... (rest of the loop body omitted)
> }
> When data is used to train the model, it is sampled.
> So we could implement a method sample(rdd1, rdd2) to reduce memory usage in such cases.
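
For illustration, here is a minimal sketch of what a sample(rdd1, rdd2) helper could look like when built on the existing RDD.zip and RDD.sample calls. The object name ZippedSample and the helper's signature are assumptions made for this example; they are not part of Spark's API and are not taken from the issue. Since zip and sample are both lazy, per-partition transformations, the zipped pairs should be streamed through the sampler rather than stored, unless the intermediate RDD is explicitly cached.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object ZippedSample {
  // Hypothetical helper (not a Spark API): zip two RDDs and sample the pairs.
  // It only composes the existing lazy transformations zip and sample.
  def sample[A: ClassTag, B: ClassTag](
      rdd1: RDD[A],
      rdd2: RDD[B],
      withReplacement: Boolean,
      fraction: Double,
      seed: Long): RDD[(A, B)] =
    rdd1.zip(rdd2).sample(withReplacement, fraction, seed)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("zipped-sample").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // zip requires both RDDs to have the same number of partitions
      // and the same number of elements in each partition.
      val rdd1 = sc.parallelize(1 to 1000, 4)
      val rdd2 = rdd1.map(_ * 2)

      val sampled =
        sample(rdd1, rdd2, withReplacement = false, fraction = 0.1, seed = 42L)

      // Each sampled pair keeps the element-wise correspondence.
      assert(sampled.collect().forall { case (a, b) => b == 2 * a })
      println(s"sampled ${sampled.count()} of 1000 pairs")
    } finally {
      sc.stop()
    }
  }
}
```

Whether a dedicated method would save memory compared with this composition is the open question of the issue; the sketch only shows the behaviour such a method would need to reproduce.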