Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/11/16 10:34:58 UTC
[jira] [Commented] (SPARK-18463) I think it's necessary to have an
overridden method of sample
[ https://issues.apache.org/jira/browse/SPARK-18463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15670092#comment-15670092 ]
Sean Owen commented on SPARK-18463:
-----------------------------------
I don't understand what this is proposing. The example you cite shows no sampling. You can't sample, then zip, two RDDs because they won't sample the same pairs.
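The point about sampling breaking the pairing can be illustrated outside Spark with plain Scala collections (a minimal sketch; `sample` and the seeds here are stand-ins, not Spark's RDD API). Sampling each side independently keeps different indices, so zipping the results pairs unrelated elements, while sampling after the zip keeps corresponding elements together:

```scala
import scala.util.Random

object IndependentSamplingDemo {
  // Bernoulli-sample a sequence: keep each element with probability p.
  def sample[A](xs: Seq[A], p: Double, rng: Random): Seq[A] =
    xs.filter(_ => rng.nextDouble() < p)

  def main(args: Array[String]): Unit = {
    val left  = (1 to 10).map(i => s"l$i")
    val right = (1 to 10).map(i => s"r$i")

    // Sampling each side with its own RNG keeps different indices, so the
    // zipped result pairs unrelated elements (or the lengths differ).
    val sLeft  = sample(left,  0.5, new Random(1))
    val sRight = sample(right, 0.5, new Random(2))
    println(sLeft.zip(sRight))

    // Zipping first and then sampling the pairs preserves the alignment.
    val pairs = sample(left.zip(right), 0.5, new Random(1))
    println(pairs)
  }
}
```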
> I think it's necessary to have an overridden method of sample
> --------------------------------------------------------------
>
> Key: SPARK-18463
> URL: https://issues.apache.org/jira/browse/SPARK-18463
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Reporter: Jianfei Wang
>
> Currently, in this situation:
> rdd3 = rdd1.zip(rdd2).sample()
> if we could sample the two RDDs directly, such as
> sample(rdd1, rdd2), we could reduce the memory usage.
> There are some use cases in Spark MLlib, such as in GradientBoostedTrees:
> while (m < numIterations && !doneLearning) {
> // Update data with pseudo-residuals (residual errors)
> val data = predError.zip(input).map { case ((pred, _), point) =>
> LabeledPoint(-loss.gradient(pred, point.label), point.features)
> }
> val dt = new DecisionTreeRegressor().setSeed(seed + m)
> val model = dt.train(data, treeStrategy)
> When we use the data to train the model, we will take a sample,
> so we could implement a method sample(rdd1, rdd2) to reduce the memory usage in such cases.
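One way to read the proposal is sampling both sides with a single shared set of keep/drop decisions, so corresponding elements stay paired without materializing the zipped pairs first. The sketch below is hypothetical and uses plain Scala collections, not Spark's RDD API; `sampleTogether` and its parameters are illustrative names only:

```scala
import scala.util.Random

object SharedSeedSampling {
  // Hypothetical sketch of the proposed sample(rdd1, rdd2): draw the
  // keep/drop decisions once and apply them to both sides, so the
  // corresponding elements are kept together without first building
  // the zipped collection of pairs.
  def sampleTogether[A, B](xs: Seq[A], ys: Seq[B], p: Double, seed: Long): (Seq[A], Seq[B]) = {
    require(xs.length == ys.length, "both sides must align element-for-element")
    val rng  = new Random(seed)
    val keep = Seq.fill(xs.length)(rng.nextDouble() < p)
    (xs.zip(keep).collect { case (x, true) => x },
     ys.zip(keep).collect { case (y, true) => y })
  }
}
```

In a distributed setting this alignment is harder to guarantee: the keep/drop decisions would have to be reproduced per partition on both RDDs, which is presumably why the issue asks for library support rather than doing it by hand.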
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)