Posted to user@spark.apache.org by Pasquinell Urbani <pa...@exalitica.com> on 2016/07/11 22:28:14 UTC

QuantileDiscretizer not working properly with big dataframes

Hi all,

We have a dataframe with 2.5 million records and 13 features. We want
to perform a logistic regression on this data, but first we need to
discretize each column using QuantileDiscretizer. This should improve the
performance of the model by reducing the influence of outliers.

For small dataframes QuantileDiscretizer works perfectly (see the ml example:
https://spark.apache.org/docs/1.6.0/ml-features.html#quantilediscretizer),
but for large dataframes it tends to split the column into only the values 0
and 1, even though the number of buckets is set to 5. Here is my
code:

import org.apache.spark.ml.feature.QuantileDiscretizer

val discretizer = new QuantileDiscretizer()
  .setInputCol("C4")
  .setOutputCol("C4_Q")
  .setNumBuckets(5)

val result = discretizer.fit(df3).transform(df3)
result.show()
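A quick way to see the collapse is to count the rows per bucket (C4_Q is the
output column from the snippet above):

```scala
// Count how many rows fall into each bucket; with this problem, only
// buckets 0.0 and 1.0 show up instead of the expected five.
result.groupBy("C4_Q").count().orderBy("C4_Q").show()
```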

I found the same problem reported here:
https://issues.apache.org/jira/browse/SPARK-13444 , but there is no
solution posted yet.

Am I configuring the function incorrectly? Should I pre-process the
data (e.g. with z-scores)? Can somebody help me deal with this?

Regards

Re: QuantileDiscretizer not working properly with big dataframes

Posted by Yanbo Liang <yb...@gmail.com>.
Could you tell us which Spark version you used?
We have fixed this bug in Spark 1.6.2 and Spark 2.0; please upgrade to
one of these versions and retry.
If the issue still exists after that, please let us know.
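If you cannot upgrade right away, one possible workaround is to compute the
quantile splits yourself and feed them to Bucketizer. A sketch only, not
verified: it assumes a HiveContext (percentile_approx is a Hive UDAF), and it
reuses the df3 / C4 names from your example:

```scala
import org.apache.spark.ml.feature.Bucketizer

// Approximate 20/40/60/80% quantiles of C4 via Hive's percentile_approx.
df3.registerTempTable("df3")
val quantiles = sqlContext
  .sql("SELECT percentile_approx(C4, array(0.2, 0.4, 0.6, 0.8)) FROM df3")
  .first().getSeq[Double](0)

// Bucketizer splits must be strictly increasing: deduplicate (skewed data
// can repeat quantiles) and pad with infinities to cover the whole range.
val splits =
  (Double.NegativeInfinity +: quantiles.distinct :+ Double.PositiveInfinity).toArray

val bucketizer = new Bucketizer()
  .setInputCol("C4")
  .setOutputCol("C4_Q")
  .setSplits(splits)

val result = bucketizer.transform(df3)
```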

Thanks
Yanbo

2016-07-12 11:03 GMT-07:00 Pasquinell Urbani <
pasquinell.urbani@exalitica.com>:

> In the forum mentioned above the following solution is suggested:
>
> The problem is in lines 113 and 114 of QuantileDiscretizer.scala and can be
> fixed by changing line 113 like so:
> before: val requiredSamples = math.max(numBins * numBins, 10000)
> after: val requiredSamples = math.max(numBins * numBins, 10000.0)
>
> Is there another way?

Re: QuantileDiscretizer not working properly with big dataframes

Posted by Pasquinell Urbani <pa...@exalitica.com>.
In the forum mentioned above the following solution is suggested:

The problem is in lines 113 and 114 of QuantileDiscretizer.scala and can be
fixed by changing line 113 like so:
before: val requiredSamples = math.max(numBins * numBins, 10000)
after: val requiredSamples = math.max(numBins * numBins, 10000.0)
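The reason the literal matters (my reading of the linked JIRA; the variable
names below just mirror the snippet): with an Int literal, requiredSamples
stays an integer, so the sampling fraction requiredSamples / count is
computed with integer division and truncates to 0 for any table with more
than 10,000 rows, and almost nothing gets sampled. A minimal illustration in
plain Scala:

```scala
val numBins = 5
val count = 2500000L               // e.g. our 2.5 million rows

// Int literal: the division below is integer division and truncates,
// so the sampled fraction of the dataframe is effectively zero.
val intSamples = math.max(numBins * numBins, 10000)
val badFraction = intSamples / count      // 0

// Double literal: requiredSamples becomes a Double and the fraction survives.
val dblSamples = math.max(numBins * numBins, 10000.0)
val goodFraction = dblSamples / count     // 0.004
```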

Is there another way?

