Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/11/16 10:34:58 UTC
[jira] [Commented] (SPARK-18463) I think it's necessary to have an
overridden method of sample
[ https://issues.apache.org/jira/browse/SPARK-18463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15670092#comment-15670092 ]
Sean Owen commented on SPARK-18463:
-----------------------------------
I don't understand what this is proposing. The example you cite shows no sampling. You can't sample, then zip, two RDDs because they won't sample the same pairs.
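The point about sampling breaking the pairing can be illustrated outside Spark with plain Scala collections (a minimal sketch; `sample` and the seeds here are stand-ins, not Spark's RDD API). Sampling each side independently keeps different indices, so zipping the results pairs unrelated elements, while sampling after the zip keeps corresponding elements together:

```scala
import scala.util.Random

object IndependentSamplingDemo {
  // Bernoulli-sample a sequence: keep each element with probability p.
  def sample[A](xs: Seq[A], p: Double, rng: Random): Seq[A] =
    xs.filter(_ => rng.nextDouble() < p)

  def main(args: Array[String]): Unit = {
    val left  = (1 to 10).map(i => s"l$i")
    val right = (1 to 10).map(i => s"r$i")

    // Sampling each side with its own RNG keeps different indices, so the
    // zipped result pairs unrelated elements (or the lengths differ).
    val sLeft  = sample(left,  0.5, new Random(1))
    val sRight = sample(right, 0.5, new Random(2))
    println(sLeft.zip(sRight))

    // Zipping first and then sampling the pairs preserves the alignment.
    val pairs = sample(left.zip(right), 0.5, new Random(1))
    println(pairs)
  }
}
```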
> I think it's necessary to have an overridden method of sample
> --------------------------------------------------------------
>
> Key: SPARK-18463
> URL: https://issues.apache.org/jira/browse/SPARK-18463
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Reporter: Jianfei Wang
>
> Currently, in this situation:
> rdd3 = rdd1.zip(rdd2).sample()
> if we could sample the two RDDs directly, such as
> sample(rdd1, rdd2), we could reduce the memory usage.
> There are some use cases in Spark MLlib, such as in GradientBoostedTrees:
> while (m < numIterations && !doneLearning) {
> // Update data with pseudo-residuals (residual errors)
> val data = predError.zip(input).map { case ((pred, _), point) =>
> LabeledPoint(-loss.gradient(pred, point.label), point.features)
> }
> val dt = new DecisionTreeRegressor().setSeed(seed + m)
> val model = dt.train(data, treeStrategy)
> When we use the data to train the model, we will take a sample,
> so we could implement a method sample(rdd1, rdd2) to reduce the memory usage in such cases.
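One way to read the proposal is sampling both sides with a single shared set of keep/drop decisions, so corresponding elements stay paired without materializing the zipped pairs first. The sketch below is hypothetical and uses plain Scala collections, not Spark's RDD API; `sampleTogether` and its parameters are illustrative names only:

```scala
import scala.util.Random

object SharedSeedSampling {
  // Hypothetical sketch of the proposed sample(rdd1, rdd2): draw the
  // keep/drop decisions once and apply them to both sides, so the
  // corresponding elements are kept together without first building
  // the zipped collection of pairs.
  def sampleTogether[A, B](xs: Seq[A], ys: Seq[B], p: Double, seed: Long): (Seq[A], Seq[B]) = {
    require(xs.length == ys.length, "both sides must align element-for-element")
    val rng  = new Random(seed)
    val keep = Seq.fill(xs.length)(rng.nextDouble() < p)
    (xs.zip(keep).collect { case (x, true) => x },
     ys.zip(keep).collect { case (y, true) => y })
  }
}
```

In a distributed setting this alignment is harder to guarantee: the keep/drop decisions would have to be reproduced per partition on both RDDs, which is presumably why the issue asks for library support rather than doing it by hand.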
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)