Posted to issues@spark.apache.org by "Barry Becker (JIRA)" <ji...@apache.org> on 2018/08/03 20:31:00 UTC
[jira] [Commented] (SPARK-21986) QuantileDiscretizer picks wrong split point for data with lots of 0's
[ https://issues.apache.org/jira/browse/SPARK-21986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568718#comment-16568718 ]
Barry Becker commented on SPARK-21986:
--------------------------------------
Here are a couple more test cases that show the problem:
{code:scala}
test("Quantile discretizer on data that is only -1 and 1 (and mostly -1)") {
  verify(Seq(-1, -1, 1, -1, -1, -1, 1, -1, -1, 1, -1),
    Seq(Double.NegativeInfinity, -1, Double.PositiveInfinity))
}

test("Quantile discretizer on data that is only -1, 0, and 1 (and mostly -1)") {
  verify(Seq(-1, -1, 1, -1, -1, -1, 1, 0, -1, -1, -1, 1, -1),
    Seq(Double.NegativeInfinity, -1, Double.PositiveInfinity))
}

test("Quantile discretizer on data that is only -1, 0, and 1") { // this one is ok
  verify(Seq(-1, -1, 1, -1, -1, -1, 1, 0, -1, 1, -1),
    Seq(Double.NegativeInfinity, -1, 0, Double.PositiveInfinity))
}
{code}
If the bins were defined as (low, high] instead of [low, high), then I believe all the cases would be correct. Alternatively, if a very small epsilon were added to each cut, or if each cut simply selected the next distinct value, they would also all be correct.
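The closure suggestion can be sketched outside of Spark. This is a minimal illustration, not Spark code: the two bucket functions below are invented for the sketch, applied to the splits Spark actually produced for the twelve-value data from the issue description.

```scala
// Splits Spark produced for the twelve-value data with nine zeros.
val data = Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0).map(_.toDouble)
val splits = Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity)

// Spark's current convention: bucket i covers [splits(i), splits(i+1)).
def bucketLowInclusive(x: Double): Int =
  splits.sliding(2).indexWhere { case Seq(lo, hi) => x >= lo && x < hi }

// The suggested alternative: bucket i covers (splits(i), splits(i+1)].
def bucketHighInclusive(x: Double): Int =
  splits.sliding(2).indexWhere { case Seq(lo, hi) => x > lo && x <= hi }

// [low, high): all twelve values land in bucket 1 and bucket 0 stays empty.
println(data.map(bucketLowInclusive).distinct)   // List(1)
// (low, high]: the nine zeros go to bucket 0 and 40/45/46 to bucket 1.
println(data.map(bucketHighInclusive).count(_ == 0))  // 9
```

With the same splits, flipping the interval closure is enough to separate the zeros from the positive values, which is the commenter's point.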
> QuantileDiscretizer picks wrong split point for data with lots of 0's
> ---------------------------------------------------------------------
>
> Key: SPARK-21986
> URL: https://issues.apache.org/jira/browse/SPARK-21986
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 2.1.1
> Reporter: Barry Becker
> Priority: Minor
>
> I have some simple test cases to help illustrate (see below).
> I discovered this with data that had 96,000 rows, but can reproduce with much smaller data that has roughly the same distribution of values.
> If I have data like
> Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
> and ask for 3 buckets, then it does the right thing and yields splits of
> Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity)
> However, if I add just one more zero, such that I have data like
> Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
> then it will do the wrong thing and give splits of
> Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity)
> I'm not bothered that it gave fewer buckets than asked for (that is to be expected), but I am bothered that it picked 0.0 instead of 40 as the single split point.
> As it stands, one bucket contains all of the data and the other contains none of it.
> Am I interpreting something wrong?
> Here are my 2 test cases in scala:
> {code:scala}
> class QuantileDiscretizerSuite extends FunSuite {
>
>   test("Quantile discretizer on data with lots of 0") {
>     verify(Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0),
>       Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity))
>   }
>
>   test("Quantile discretizer on data with one less 0") {
>     verify(Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0),
>       Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity))
>   }
>
>   def verify(data: Seq[Int], expectedSplits: Seq[Double]): Unit = {
>     // The second column is unused; it is only there so createDataFrame infers a schema.
>     val theData: Seq[(Int, Double)] = data.map(x => (x, 0.0))
>     val df = SPARK_SESSION.sqlContext.createDataFrame(theData).toDF("rawCol", "unused")
>     val qb = new QuantileDiscretizer()
>       .setInputCol("rawCol")
>       .setOutputCol("binnedColumn")
>       .setRelativeError(0.0)
>       .setNumBuckets(3)
>       .fit(df)
>     assertResult(expectedSplits) { qb.getSplits }
>   }
> }
> {code}
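The collapse to a single 0.0 split described above can be reproduced with a naive exact quantile outside of Spark. The rank rule below is an assumption for illustration (Spark's real implementation uses an approximate quantile sketch), but it shows how nine zeros out of twelve values pull both the 1/3 and 2/3 quantile candidates onto 0.0, so the two candidate splits deduplicate to one, while dropping a single zero moves the 2/3 candidate onto 40:

```scala
// Naive rank-based quantile; an assumed stand-in, not Spark's algorithm.
def naiveQuantile(sorted: Seq[Double], p: Double): Double =
  sorted(math.min(sorted.length - 1, math.ceil(p * sorted.length).toInt))

val twelve = Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0).map(_.toDouble).sorted
val eleven = Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0).map(_.toDouble).sorted

// Twelve values: both interior split candidates land on 0.0 and collapse to one split.
println((naiveQuantile(twelve, 1.0 / 3), naiveQuantile(twelve, 2.0 / 3)))  // (0.0,0.0)
// Eleven values: the 2/3 candidate moves onto 40, giving the expected second split.
println((naiveQuantile(eleven, 1.0 / 3), naiveQuantile(eleven, 2.0 / 3)))  // (0.0,40.0)
```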
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)