Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2014/10/24 22:34:34 UTC
[jira] [Created] (SPARK-4081) Categorical feature indexing
Joseph K. Bradley created SPARK-4081:
----------------------------------------
Summary: Categorical feature indexing
Key: SPARK-4081
URL: https://issues.apache.org/jira/browse/SPARK-4081
Project: Spark
Issue Type: New Feature
Components: MLlib
Affects Versions: 1.1.0
Reporter: Joseph K. Bradley
Priority: Minor
DecisionTree and RandomForest require that categorical features and labels be indexed from 0 (0, 1, 2, ...). There is currently no code to help with indexing a dataset. This issue proposes a helper class that computes these indices and also decides which features to treat as categorical.
Proposed functionality:
* It converts a dataset of vectors whose feature types are unknown into a dataset in which some features are treated as continuous and others as categorical. The choice between continuous and categorical is based on a maxCategories parameter.
* It can also map categorical feature values to 0-based indices.
Usage:
{code}
val myData1: RDD[Vector] = ...
val myData2: RDD[Vector] = ...
val datasetIndexer = new DatasetIndexer(maxCategories)
datasetIndexer.fit(myData1)
val indexedData1: RDD[Vector] = datasetIndexer.transform(myData1)
datasetIndexer.fit(myData2)
val indexedData2: RDD[Vector] = datasetIndexer.transform(myData2)
val categoricalFeaturesInfo: Map[Int, Int] = datasetIndexer.getCategoricalFeaturesInfo()
{code}
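To make the two proposed steps concrete, here is a minimal sketch of one way they could work. It is not the proposed implementation: IndexerSketch and FittedIndexer are hypothetical names, plain Seq[Array[Double]] stands in for RDD[Vector] so that the sketch runs without Spark, and the decision rule (a feature is categorical when it has at most maxCategories distinct values) is an assumption, since the proposal only says the choice is based on maxCategories. Assigning indices in ascending value order and failing on values not seen during fitting are likewise illustrative choices.

```scala
// Hypothetical stand-in for the proposed DatasetIndexer. Plain Scala
// collections replace RDD[Vector] so the sketch runs without Spark.
object IndexerSketch {

  /** Fitted state: for each categorical feature, original value -> 0-based index. */
  final case class FittedIndexer(categoryMaps: Map[Int, Map[Double, Int]]) {

    // Feature index -> number of categories for that feature.
    def categoricalFeaturesInfo: Map[Int, Int] =
      categoryMaps.map { case (feature, values) => feature -> values.size }

    // Replace categorical values with their indices; continuous features pass through.
    def transform(data: Seq[Array[Double]]): Seq[Array[Double]] =
      data.map { row =>
        row.zipWithIndex.map { case (value, feature) =>
          categoryMaps.get(feature) match {
            case Some(mapping) => mapping(value).toDouble // throws on values unseen during fit
            case None          => value
          }
        }
      }
  }

  // Assumed rule: a feature is categorical if it has at most maxCategories
  // distinct values; otherwise it is left as continuous.
  def fit(data: Seq[Array[Double]], maxCategories: Int): FittedIndexer = {
    val numFeatures = data.headOption.map(_.length).getOrElse(0)
    val maps = (0 until numFeatures).flatMap { feature =>
      // Distinct values in ascending order, so index assignment is deterministic.
      val distinct = data.map(_(feature)).distinct.sortWith(_ < _)
      if (distinct.size <= maxCategories) Some(feature -> distinct.zipWithIndex.toMap)
      else None
    }.toMap
    FittedIndexer(maps)
  }
}
```

For example, fitting the rows (2.0, 0.5), (5.0, 1.7), (2.0, 3.2), (9.0, 4.1) with maxCategories = 3 treats feature 0 as categorical (values 2.0, 5.0, 9.0 become 0, 1, 2) and leaves feature 1 continuous, so categoricalFeaturesInfo is Map(0 -> 3). That feature-index-to-category-count shape matches the categoricalFeaturesInfo argument that DecisionTree.trainClassifier accepts.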