Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/13 18:08:52 UTC

[GitHub] [arrow-datafusion] DataPsycho opened a new issue, #4195: It will be good to have Bucketizer and OneHotEncoder in DataFusion like PySpark

DataPsycho opened a new issue, #4195:
URL: https://github.com/apache/arrow-datafusion/issues/4195

Feature engineering for machine learning requires special transformations to handle categorical data, whether nominal or ordinal. It would be nice to have built-in OneHotEncoder and Bucketizer functions. Currently, it is possible to create such encoded values, but doing so takes a lot of boilerplate code with joins and when/else statements.

For Bucketizer: given a vector/list of range boundaries, it would create a new column in the data frame that bucketizes the continuous input column. An example can be found in the [PySpark API doc](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.Bucketizer.html).
For OneHotEncoder: given a column with `n` categories, it would create `n` or `n-1` columns, depending on a True/False parameter. Here is the [PySpark API doc](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.ml.feature.OneHotEncoder.html).
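
To make the requested behaviour concrete, here is a minimal plain-Python sketch of the two transformations. It is not DataFusion or PySpark code: the function names `bucketize` and `one_hot` and the `drop_last` flag are hypothetical, and the bucket rule (each bucket is `[lower, upper)`, with the last bucket also including its upper bound) is only loosely modeled on the PySpark Bucketizer.

```python
from bisect import bisect_right


def bucketize(values, splits):
    """Map each continuous value to the index of the bucket it falls into.

    `splits` holds the bucket boundaries in strictly increasing order, so
    len(splits) - 1 buckets are produced. Bucket i covers
    [splits[i], splits[i + 1]), except the last one, which also includes
    its upper bound. (Assumed rule for this sketch.)
    """
    if len(splits) < 2 or any(a >= b for a, b in zip(splits, splits[1:])):
        raise ValueError("splits must be strictly increasing with at least 2 entries")
    result = []
    for v in values:
        if v < splits[0] or v > splits[-1]:
            raise ValueError(f"value {v} is outside the split range")
        if v == splits[-1]:
            # The upper bound of the last bucket is inclusive.
            result.append(len(splits) - 2)
        else:
            result.append(bisect_right(splits, v) - 1)
    return result


def one_hot(values, drop_last=False):
    """Encode a categorical column as 0/1 indicator columns.

    Returns (categories, rows). With n distinct categories, each row has
    n entries, or n - 1 when `drop_last` is True (the last category in
    sorted order is then represented by an all-zero row).
    """
    categories = sorted(set(values))
    if drop_last:
        categories = categories[:-1]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


if __name__ == "__main__":
    print(bucketize([-5.0, 0.0, 3.2, 10.0], [float("-inf"), 0.0, 5.0, 10.0]))
    print(one_hot(["a", "b", "c", "a"], drop_last=True))
```

In a DataFusion implementation the same logic would presumably operate on Arrow arrays and add new columns to the data frame rather than return Python lists.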

Any alternative idea would also work. The new features could live under a new module, for example as `datafusion::ml::Bucketizer` and `datafusion::ml::OneHotEncoder`.
