Posted to issues@spark.apache.org by "Barry Becker (JIRA)" <ji...@apache.org> on 2018/08/03 20:31:00 UTC
[jira] [Commented] (SPARK-21986) QuantileDiscretizer picks wrong split point for data with lots of 0's
[ https://issues.apache.org/jira/browse/SPARK-21986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568718#comment-16568718 ]
Barry Becker commented on SPARK-21986:
--------------------------------------
Here are a couple more test cases that show the problem:
{code:scala}
test("Quantile discretizer on data that is only -1 and 1 (and mostly -1)") {
  verify(Seq(-1, -1, 1, -1, -1, -1, 1, -1, -1, 1, -1),
    Seq(Double.NegativeInfinity, -1, Double.PositiveInfinity))
}

test("Quantile discretizer on data that is only -1, 0, and 1 (and mostly -1)") {
  verify(Seq(-1, -1, 1, -1, -1, -1, 1, 0, -1, -1, -1, 1, -1),
    Seq(Double.NegativeInfinity, -1, Double.PositiveInfinity))
}

test("Quantile discretizer on data that is only -1, 0, and 1") { // this one is ok
  verify(Seq(-1, -1, 1, -1, -1, -1, 1, 0, -1, 1, -1),
    Seq(Double.NegativeInfinity, -1, 0, Double.PositiveInfinity))
}
{code}
If the bins were defined as (low, high] instead of [low, high), then I believe all the cases would be correct. Alternatively, if a very small epsilon were added to each cut, or if each cut simply selected the next distinct value, they would also all be correct.
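The closure suggestion can be sketched outside of Spark. This is a minimal illustration, not Spark code: the two bucket functions below are invented for the sketch, applied to the splits Spark actually produced for the twelve-value data from the issue description.

```scala
// Splits Spark produced for the twelve-value data with nine zeros.
val data = Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0).map(_.toDouble)
val splits = Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity)

// Spark's current convention: bucket i covers [splits(i), splits(i+1)).
def bucketLowInclusive(x: Double): Int =
  splits.sliding(2).indexWhere { case Seq(lo, hi) => x >= lo && x < hi }

// The suggested alternative: bucket i covers (splits(i), splits(i+1)].
def bucketHighInclusive(x: Double): Int =
  splits.sliding(2).indexWhere { case Seq(lo, hi) => x > lo && x <= hi }

// [low, high): all twelve values land in bucket 1 and bucket 0 stays empty.
println(data.map(bucketLowInclusive).distinct)   // List(1)
// (low, high]: the nine zeros go to bucket 0 and 40/45/46 to bucket 1.
println(data.map(bucketHighInclusive).count(_ == 0))  // 9
```

With the same splits, flipping the interval closure is enough to separate the zeros from the positive values, which is the commenter's point.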
> QuantileDiscretizer picks wrong split point for data with lots of 0's
> ---------------------------------------------------------------------
>
> Key: SPARK-21986
> URL: https://issues.apache.org/jira/browse/SPARK-21986
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 2.1.1
> Reporter: Barry Becker
> Priority: Minor
>
> I have some simple test cases to help illustrate (see below).
> I discovered this with data that had 96,000 rows, but can reproduce with much smaller data that has roughly the same distribution of values.
> If I have data like
> Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
> and ask for 3 buckets, then it does the right thing and yields splits of
> Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity)
> However, if I add just one more zero, such that I have data like
> Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
> then it will do the wrong thing and give splits of
> Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity)
> I'm not bothered that it gave fewer buckets than asked for (that is to be expected), but I am bothered that it picked 0.0 instead of 40 as the single split point.
> As it stands, one bucket contains all of the data and the other contains none of it.
> Am I interpreting something wrong?
> Here are my 2 test cases in scala:
> {code:scala}
> class QuantileDiscretizerSuite extends FunSuite {
>
>   test("Quantile discretizer on data with lots of 0") {
>     verify(Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0),
>       Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity))
>   }
>
>   test("Quantile discretizer on data with one less 0") {
>     verify(Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0),
>       Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity))
>   }
>
>   def verify(data: Seq[Int], expectedSplits: Seq[Double]): Unit = {
>     // The second column is unused; it is only there so createDataFrame infers a schema.
>     val theData: Seq[(Int, Double)] = data.map(x => (x, 0.0))
>     val df = SPARK_SESSION.sqlContext.createDataFrame(theData).toDF("rawCol", "unused")
>     val qb = new QuantileDiscretizer()
>       .setInputCol("rawCol")
>       .setOutputCol("binnedColumn")
>       .setRelativeError(0.0)
>       .setNumBuckets(3)
>       .fit(df)
>     assertResult(expectedSplits) { qb.getSplits }
>   }
> }
> {code}
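The collapse to a single 0.0 split described above can be reproduced with a naive exact quantile outside of Spark. The rank rule below is an assumption for illustration (Spark's real implementation uses an approximate quantile sketch), but it shows how nine zeros out of twelve values pull both the 1/3 and 2/3 quantile candidates onto 0.0, so the two candidate splits deduplicate to one, while dropping a single zero moves the 2/3 candidate onto 40:

```scala
// Naive rank-based quantile; an assumed stand-in, not Spark's algorithm.
def naiveQuantile(sorted: Seq[Double], p: Double): Double =
  sorted(math.min(sorted.length - 1, math.ceil(p * sorted.length).toInt))

val twelve = Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0).map(_.toDouble).sorted
val eleven = Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0).map(_.toDouble).sorted

// Twelve values: both interior split candidates land on 0.0 and collapse to one split.
println((naiveQuantile(twelve, 1.0 / 3), naiveQuantile(twelve, 2.0 / 3)))  // (0.0,0.0)
// Eleven values: the 2/3 candidate moves onto 40, giving the expected second split.
println((naiveQuantile(eleven, 1.0 / 3), naiveQuantile(eleven, 2.0 / 3)))  // (0.0,40.0)
```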
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)