Posted to issues@spark.apache.org by "Nick Pentreath (JIRA)" <ji...@apache.org> on 2017/02/24 08:35:44 UTC

[jira] [Comment Edited] (SPARK-19714) Bucketizer Bug Regarding Handling Unbucketed Inputs

    [ https://issues.apache.org/jira/browse/SPARK-19714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15882216#comment-15882216 ] 

Nick Pentreath edited comment on SPARK-19714 at 2/24/17 8:35 AM:
-----------------------------------------------------------------

I agree that the parameter naming is perhaps misleading. At least the doc should be updated because "invalid" here actually means {{NaN}} or {{null}}. 

However, {{Bucketizer}} is doing what you tell it to, since you specify the splits yourself. Note that if you use {{QuantileDiscretizer}} to construct the {{Bucketizer}}, it adds {{+/- Infinity}} as the lower/upper bounds of the splits. You can do the same if you want anything below the lower bound or above the upper bound to be "valid". You will then have 2 more buckets.
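
For illustration, here is a minimal sketch of that workaround applied to the example from the report. It assumes a spark-shell session where {{spark}} is already defined; the output column name {{bucket}} is an arbitrary choice and not part of the original report.

{code}
import org.apache.spark.ml.feature.Bucketizer

// Same data as in the report: doubles 0.0 through 499.0.
val contDF = spark.range(500).selectExpr("cast(id as double) as id")

// Pad the user-chosen splits with infinite outer bounds so that
// every finite value falls into some bucket.
val splits = Array(
  Double.NegativeInfinity, 5.0, 10.0, 250.0, 500.0, Double.PositiveInfinity)

val bucketer = new Bucketizer()
  .setSplits(splits)
  .setInputCol("id")
  .setOutputCol("bucket")    // hypothetical column name, for illustration only
  .setHandleInvalid("skip")  // governs "invalid" values as described above, not out-of-range values

bucketer.transform(contDF).show()
{code}

With these splits, values below 5.0 should land in the extra first bucket instead of raising the out-of-bounds exception, and anything at or above 500.0 would go into the extra last bucket.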


was (Author: mlnick):
I agree that the parameter naming is perhaps misleading. At least the doc should be updated because "invalid" here actually means {{NaN}} or {{null}}. 

However {{Bucketizer}} is doing what you tell it to as the splits are specified by you. Note that if you used {{QuantileDiscretizer}} to construct the {{Bucketizer}} then it adds {{+/- Infinity}} as the lower/upper bounds of the splits. So you can do the same if you want anything below the lower bound to be included in the first bucket, and above the upper bound to be included in the last bucket.

> Bucketizer Bug Regarding Handling Unbucketed Inputs
> ---------------------------------------------------
>
>                 Key: SPARK-19714
>                 URL: https://issues.apache.org/jira/browse/SPARK-19714
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, MLlib
>    Affects Versions: 2.1.0
>            Reporter: Bill Chambers
>
> {code}
> val contDF = spark.range(500).selectExpr("cast(id as double) as id")
> import org.apache.spark.ml.feature.Bucketizer
> val splits = Array(5.0, 10.0, 250.0, 500.0)
> val bucketer = new Bucketizer()
>   .setSplits(splits)
>   .setInputCol("id")
>   .setHandleInvalid("skip")
> bucketer.transform(contDF).show()
> {code}
> You would expect this to handle the invalid buckets. However, it fails:
> {code}
> Caused by: org.apache.spark.SparkException: Feature value 0.0 out of Bucketizer bounds [5.0, 500.0].  Check your features, or loosen the lower/upper bound constraints.
> {code} 
> It seems strange that handleInvalid doesn't actually handle invalid inputs.
> Thoughts anyone?


