Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2017/09/19 08:58:00 UTC

[jira] [Resolved] (SPARK-21986) QuantileDiscretizer picks wrong split point for data with lots of 0's

     [ https://issues.apache.org/jira/browse/SPARK-21986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-21986.
-------------------------------
    Resolution: Not A Problem

> QuantileDiscretizer picks wrong split point for data with lots of 0's
> ---------------------------------------------------------------------
>
>                 Key: SPARK-21986
>                 URL: https://issues.apache.org/jira/browse/SPARK-21986
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.1.1
>            Reporter: Barry Becker
>            Priority: Minor
>
> I have some simple test cases to help illustrate (see below).
> I discovered this with data that had 96,000 rows, but can reproduce with much smaller data that has roughly the same distribution of values.
> If I have data like
>   Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
> and ask for 3 buckets, then it does the right thing and yields splits of 
> Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity)
> However, if I add just one more zero, such that I have data like
>  Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
> then it will do the wrong thing and give splits of 
>   Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity)
> I'm not bothered that it gave fewer buckets than asked for (that is to be expected), but I am bothered that it picked 0.0 instead of 40 as the one split point.
> The way it did it, now I have 1 bucket with all the data, and a second with none of the data.
> Am I interpreting something wrong?
> Here are my 2 test cases in scala:
> {code}
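> import org.apache.spark.ml.feature.QuantileDiscretizer
> import org.apache.spark.sql.SparkSession
> import org.scalatest.FunSuite
>
> class QuantileDiscretizerSuite extends FunSuite {
>   // Assumption: SPARK_SESSION was defined elsewhere in the reporter's code;
>   // a local session stands in for it here.
>   private lazy val SPARK_SESSION =
>     SparkSession.builder().master("local[2]").appName("QuantileDiscretizerSuite").getOrCreate()
>
>   // 9 zeros out of 12 values: only one split point (0.0) is returned.
>   test("Quantile discretizer on data with lots of 0") {
>     verify(Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0),
>       Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity))
>   }
>   // 8 zeros out of 11 values: two split points (0.0 and 40.0) are returned.
>   test("Quantile discretizer on data with one less 0") {
>     verify(Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0),
>       Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity))
>   }
>
>   // Fits a 3-bucket QuantileDiscretizer on `data` and checks the resulting splits.
>   def verify(data: Seq[Int], expectedSplits: Seq[Double]): Unit = {
>     // Pair each value with a dummy second column.
>     val theData: Seq[(Int, Double)] = data.map(x => (x, 0.0))
>     val df = SPARK_SESSION.sqlContext.createDataFrame(theData).toDF("rawCol", "unused")
>     val qb = new QuantileDiscretizer()
>       .setInputCol("rawCol")
>       .setOutputCol("binnedColumn")
>       .setRelativeError(0.0)
>       .setNumBuckets(3)
>       .fit(df)
>     assertResult(expectedSplits) { qb.getSplits.toSeq }
>   }
> }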
> {code}
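
The bucket counts the report complains about can be checked without Spark. The following is a minimal sketch in plain Scala, not Spark's implementation: it assumes each bucket is the half-open interval [lo, hi), with the last bucket also including its upper bound, which is the convention documented for Bucketizer. The names BucketCounts, bucketOf and counts are hypothetical and exist only for this illustration.

```scala
object BucketCounts {
  // Index of the bucket that `value` falls into, assuming half-open
  // intervals [splits(i), splits(i + 1)), where the last bucket also
  // includes its upper bound. NaN and out-of-range values are not handled.
  def bucketOf(value: Double, splits: Seq[Double]): Int = {
    val lastBucket = splits.length - 2
    if (value == splits.last) lastBucket
    else splits.sliding(2).indexWhere {
      case Seq(lo, hi) => value >= lo && value < hi
    }
  }

  // Number of values per bucket, in bucket order (empty buckets count as 0).
  def counts(data: Seq[Double], splits: Seq[Double]): Seq[Int] = {
    val byBucket = data.groupBy(bucketOf(_, splits))
    (0 until splits.length - 1).map(i => byBucket.get(i).map(_.size).getOrElse(0))
  }

  def main(args: Array[String]): Unit = {
    // The 12-value data set from the report: 9 zeros plus 40, 45 and 46.
    val data = Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0).map(_.toDouble)

    // Splits reported in the issue: every value is >= 0.0, so the first
    // bucket is empty and the second holds all 12 rows.
    val reported = Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity)
    println(counts(data, reported))

    // Splits the reporter hoped for: the 9 zeros land in the first bucket
    // and 40, 45, 46 land in the second.
    val hoped = Seq(Double.NegativeInfinity, 40.0, Double.PositiveInfinity)
    println(counts(data, hoped))
  }
}
```

Under these assumptions, a single split at 0.0 separates nothing, because no value in the data is below 0.0, while a single split at 40.0 would separate the zeros from the three non-zero values.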


