Posted to commits@spark.apache.org by sr...@apache.org on 2016/03/04 11:01:59 UTC

spark git commit: [SPARK-13646][MLLIB] QuantileDiscretizer counts dataset twice in get…

Repository: spark
Updated Branches:
  refs/heads/master dd83c209f -> 27e88faa0


[SPARK-13646][MLLIB] QuantileDiscretizer counts dataset twice in get…

## What changes were proposed in this pull request?

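It reuses the already-computed `totalSamples` when calculating the sampling fraction instead of calling `dataset.count()` a second time, so the DataFrame is counted only once.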
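
As a rough illustration of why this matters, the sketch below reproduces the fraction arithmetic from the patched line without depending on Spark. `CountingDataset` and `SampleFractionSketch` are hypothetical stand-ins invented for this example (they are not part of Spark), and the values passed for `numBins` and `minSamplesRequired` are assumptions chosen only to make the arithmetic easy to follow.

```scala
// Hypothetical stand-in for a DataFrame: count() is instrumented so we can
// see how many times a full count over the data is requested.
final class CountingDataset(rows: Seq[Double]) {
  var countCalls: Int = 0
  def count(): Long = { countCalls += 1; rows.length.toLong }
}

object SampleFractionSketch {
  // Same expression as the patched line: the fraction is derived from the
  // already-computed totalSamples rather than from a second count().
  def sampleFraction(totalSamples: Long, numBins: Int, minSamplesRequired: Int): Double = {
    require(totalSamples > 0, "requires non-empty input dataset")
    val requiredSamples = math.max(numBins * numBins, minSamplesRequired)
    math.min(requiredSamples.toDouble / totalSamples, 1.0)
  }

  def main(args: Array[String]): Unit = {
    val dataset = new CountingDataset(Seq.fill(50000)(0.0))
    val totalSamples = dataset.count() // the only count() in this sketch
    // Assumed illustrative values: 10 bins, at least 10000 samples.
    val fraction = sampleFraction(totalSamples, numBins = 10, minSamplesRequired = 10000)
    println(s"fraction=$fraction, count() calls=${dataset.countCalls}")
  }
}
```

On a real DataFrame each `count()` is an action that runs a job over the data, so removing the second call should save one pass over the input; the sketch only models that cost as a call counter.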

Author: Abou Haydar Elias <ab...@gmail.com>
Author: Elie A <ab...@gmail.com>

Closes #11491 from eliasah/quantile-discretizer-patch.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/27e88faa
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/27e88faa
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/27e88faa

Branch: refs/heads/master
Commit: 27e88faa058c1364d0e99fffc0c5cb64ef817bd3
Parents: dd83c20
Author: Abou Haydar Elias <ab...@gmail.com>
Authored: Fri Mar 4 10:01:52 2016 +0000
Committer: Sean Owen <so...@cloudera.com>
Committed: Fri Mar 4 10:01:52 2016 +0000

----------------------------------------------------------------------
 .../scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala    | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/27e88faa/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
index d75b3ef..18896fc 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
@@ -118,7 +118,7 @@ object QuantileDiscretizer extends DefaultParamsReadable[QuantileDiscretizer] wi
     require(totalSamples > 0,
       "QuantileDiscretizer requires non-empty input dataset but was given an empty input.")
     val requiredSamples = math.max(numBins * numBins, minSamplesRequired)
-    val fraction = math.min(requiredSamples.toDouble / dataset.count(), 1.0)
+    val fraction = math.min(requiredSamples.toDouble / totalSamples, 1.0)
     dataset.sample(withReplacement = false, fraction, new XORShiftRandom(seed).nextInt()).collect()
   }
 

