Posted to commits@spark.apache.org by sr...@apache.org on 2018/07/21 13:26:49 UTC

spark git commit: [SPARK-23231][ML][DOC] Add doc for string indexer ordering to user guide (also to RFormula guide)

Repository: spark
Updated Branches:
  refs/heads/master d7ae4247e -> 81af88687


[SPARK-23231][ML][DOC] Add doc for string indexer ordering to user guide (also to RFormula guide)

## What changes were proposed in this pull request?
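Add documentation for the `StringIndexer` ordering options (`stringOrderType`) to the user guide, and describe in the `RFormula` section how that ordering determines which category is dropped.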

## How was this patch tested?
Existing tests; this change only touches `docs/ml-features.md`.

Author: zhengruifeng3 <zh...@jd.com>
Author: zhengruifeng <ru...@foxmail.com>

Closes #21792 from zhengruifeng/doc_string_indexer_ordering.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/81af8868
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/81af8868
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/81af8868

Branch: refs/heads/master
Commit: 81af88687f97f70b30828ac63239129637852526
Parents: d7ae424
Author: zhengruifeng3 <zh...@jd.com>
Authored: Sat Jul 21 08:26:45 2018 -0500
Committer: Sean Owen <sr...@gmail.com>
Committed: Sat Jul 21 08:26:45 2018 -0500

----------------------------------------------------------------------
 docs/ml-features.md | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/81af8868/docs/ml-features.md
----------------------------------------------------------------------
diff --git a/docs/ml-features.md b/docs/ml-features.md
index ad6e718..882b895 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -585,7 +585,11 @@ for more details on the API.
 ## StringIndexer
 
 `StringIndexer` encodes a string column of labels to a column of label indices.
-The indices are in `[0, numLabels)`, ordered by label frequencies, so the most frequent label gets index `0`.
+The indices are in `[0, numLabels)`, and four ordering options are supported:
+"frequencyDesc": descending order by label frequency (most frequent label assigned 0),
+"frequencyAsc": ascending order by label frequency (least frequent label assigned 0),
+"alphabetDesc": descending alphabetical order, and "alphabetAsc": ascending alphabetical order 
+(default = "frequencyDesc").
 The unseen labels will be put at index numLabels if user chooses to keep them.
 If the input column is numeric, we cast it to string and index the string
 values. When downstream pipeline components such as `Estimator` or
@@ -1593,10 +1597,25 @@ Suppose `a` and `b` are double columns, we use the following simple examples to
 * `y ~ a + b + a:b - 1` means model `y ~ w1 * a + w2 * b + w3 * a * b` where `w1, w2, w3` are coefficients.
 
 `RFormula` produces a vector column of features and a double or string column of label. 
-Like when formulas are used in R for linear regression, string input columns will be one-hot encoded, and numeric columns will be cast to doubles.
-If the label column is of type string, it will be first transformed to double with `StringIndexer`.
+Like when formulas are used in R for linear regression, numeric columns will be cast to doubles.
+As to string input columns, they will first be transformed with [StringIndexer](ml-features.html#stringindexer) using ordering determined by `stringOrderType`,
+and the last category after ordering is dropped, then the doubles will be one-hot encoded.
+
+Suppose a string feature column containing values `{'b', 'a', 'b', 'a', 'c', 'b'}`, we set `stringOrderType` to control the encoding:
+~~~
+stringOrderType | Category mapped to 0 by StringIndexer |  Category dropped by RFormula
+----------------|---------------------------------------|---------------------------------
+'frequencyDesc' | most frequent category ('b')          | least frequent category ('c')
+'frequencyAsc'  | least frequent category ('c')         | most frequent category ('b')
+'alphabetDesc'  | last alphabetical category ('c')      | first alphabetical category ('a')
+'alphabetAsc'   | first alphabetical category ('a')     | last alphabetical category ('c')
+~~~
+
+If the label column is of type string, it will be first transformed to double with [StringIndexer](ml-features.html#stringindexer) using `frequencyDesc` ordering.
 If the label column does not exist in the DataFrame, the output label column will be created from the specified response variable in the formula.
 
+**Note:** The ordering option `stringOrderType` is NOT used for the label column. When the label column is indexed, it uses the default descending frequency ordering in `StringIndexer`.
+
 **Examples**
 
 Assume that we have a DataFrame with the columns `id`, `country`, `hour`, and `clicked`:

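The ordering rules documented in the diff above can be illustrated with a small plain-Python sketch. This is not Spark code and does not call the Spark API. The helper names `index_labels` and `dropped_category` are hypothetical, and breaking frequency ties alphabetically is an assumption of this sketch, not something stated in the patch. It reuses the example column `{'b', 'a', 'b', 'a', 'c', 'b'}` from the new RFormula table.

```python
from collections import Counter

ORDER_TYPES = ("frequencyDesc", "frequencyAsc", "alphabetDesc", "alphabetAsc")


def index_labels(values, string_order_type="frequencyDesc"):
    """Return {label: index} for one of the four documented ordering options.

    Hypothetical helper that mimics the documented behavior in plain Python.
    """
    counts = Counter(values)
    if string_order_type == "frequencyDesc":
        # Most frequent label gets index 0 (ties broken alphabetically: an assumption).
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
    elif string_order_type == "frequencyAsc":
        # Least frequent label gets index 0 (ties broken alphabetically: an assumption).
        ordered = sorted(counts, key=lambda label: (counts[label], label))
    elif string_order_type == "alphabetDesc":
        ordered = sorted(counts, reverse=True)
    elif string_order_type == "alphabetAsc":
        ordered = sorted(counts)
    else:
        raise ValueError(f"unsupported ordering: {string_order_type!r}")
    return {label: index for index, label in enumerate(ordered)}


def dropped_category(values, string_order_type="frequencyDesc"):
    """Return the last category after ordering, i.e. the one the doc says RFormula drops."""
    mapping = index_labels(values, string_order_type)
    return max(mapping, key=mapping.get)


if __name__ == "__main__":
    column = ["b", "a", "b", "a", "c", "b"]
    for order in ORDER_TYPES:
        print(order, index_labels(column, order), dropped_category(column, order))
```

For this column the sketch reproduces the four rows of the table in the patch: the category mapped to `0` and the category dropped are `b`/`c`, `c`/`b`, `c`/`a`, and `a`/`c` for `frequencyDesc`, `frequencyAsc`, `alphabetDesc`, and `alphabetAsc` respectively.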
