Posted to issues@spark.apache.org by "Xiangrui Meng (JIRA)" <ji...@apache.org> on 2015/05/28 21:06:19 UTC

[jira] [Created] (SPARK-7921) Change includeFirst to dropLast in OneHotEncoder

Xiangrui Meng created SPARK-7921:
------------------------------------

             Summary: Change includeFirst to dropLast in OneHotEncoder
                 Key: SPARK-7921
                 URL: https://issues.apache.org/jira/browse/SPARK-7921
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 1.4.0
            Reporter: Xiangrui Meng
            Assignee: Xiangrui Meng


Change includeFirst to dropLast and keep the default as true. There are a couple of benefits:

a. It is consistent with other tutorials on one-hot encoding (or dummy coding), e.g., http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm
b. It keeps the indices unmodified in the output vector. If we dropped the first category instead, all indices would be shifted by 1.
c. If users use StringIndexer, which assigns indices in descending order of label frequency, the last element is the least frequent one.
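
To make points (b) and (c) concrete, here is a minimal sketch in plain Python. It does not use Spark; the helper names (index_by_frequency, one_hot_drop_last, one_hot_drop_first) are hypothetical and only mimic the behavior described above, assuming labels are indexed by descending frequency.

```python
from collections import Counter


def index_by_frequency(labels):
    """Map each label to an index, most frequent label first.

    Hypothetical stand-in for a frequency-based string indexer.
    """
    ordered = [label for label, _ in Counter(labels).most_common()]
    return {label: i for i, label in enumerate(ordered)}


def one_hot_drop_last(index, num_categories):
    """Encode `index` in num_categories - 1 slots; the last category is all zeros."""
    vec = [0.0] * (num_categories - 1)
    if index < num_categories - 1:
        vec[index] = 1.0  # position in the vector equals the category index
    return vec


def one_hot_drop_first(index, num_categories):
    """Encode `index` in num_categories - 1 slots; the first category is all zeros."""
    vec = [0.0] * (num_categories - 1)
    if index > 0:
        vec[index - 1] = 1.0  # every position is shifted down by 1
    return vec


labels = ["a", "b", "a", "c", "a", "b"]  # "a" x3, "b" x2, "c" x1
mapping = index_by_frequency(labels)
n = len(mapping)

for label in ["a", "b", "c"]:
    i = mapping[label]
    print(label, i, one_hot_drop_last(i, n), one_hot_drop_first(i, n))
```

With the last category dropped, category index i stays at position i in the output vector, and the category encoded as all zeros ("c" here) is the least frequent one. With the first category dropped, category i lands at position i - 1.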



