Posted to issues@spark.apache.org by "hujiayin (JIRA)" <ji...@apache.org> on 2016/04/16 10:25:25 UTC

[jira] [Comment Edited] (SPARK-14623) add label binarizer

    [ https://issues.apache.org/jira/browse/SPARK-14623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15244061#comment-15244061 ] 

hujiayin edited comment on SPARK-14623 at 4/16/16 8:24 AM:
-----------------------------------------------------------

Hi Joseph, I think it is similar to combining StringIndexer and OneHotEncoder into one class. The difference is that LabelBinarizer collects all occurrences of the same element into one vector (one column per label), so each column records the positions at which that element appears in the input.

For example, 
Input is "yellow,green,red,green,0"
LabelBinarizer extracts the distinct labels from the input; here they are "0, green, red, yellow"
Output is
0, 0, 0, 1
0, 1, 0, 0
0, 0, 1, 0
0, 1, 0, 0
1, 0, 0, 0
The second column shows that the element "green" appears at positions 1 and 3 in the input (counting from 0). The 4 columns correspond to the 4 labels: column 0 represents label "0", column 1 represents label "green", and so on. If I understand correctly, StringIndexer returns the category index of a label, and OneHotEncoder returns the one-hot binary representation (a single 1) of that category index.
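
To make the example above concrete, here is a minimal sketch in plain Python of the mapping as I describe it. It is only an illustration, not the proposed Spark implementation: the function name binarize_labels is hypothetical, and sorting the distinct labels as strings is an assumption chosen because it reproduces the label order "0, green, red, yellow" shown above.

```python
def binarize_labels(values):
    """Return (labels, rows) for a list of string values.

    labels: the distinct values, sorted as strings (assumed ordering).
    rows:   one 0/1 row per input value, with a single 1 in the
            column of that value's label.
    """
    labels = sorted(set(values))
    column_of = {label: col for col, label in enumerate(labels)}
    rows = []
    for value in values:
        row = [0] * len(labels)
        row[column_of[value]] = 1
        rows.append(row)
    return labels, rows


values = "yellow,green,red,green,0".split(",")
labels, rows = binarize_labels(values)

# Reading one column top to bottom gives the positions of that label.
green_column = [row[labels.index("green")] for row in rows]
green_positions = [i for i, bit in enumerate(green_column) if bit == 1]

print(labels)
for row in rows:
    print(row)
print(green_positions)
```

Each row is the one-hot encoding of one input element, while each column, read top to bottom, marks where its label occurs in the input.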



> add label binarizer 
> --------------------
>
>                 Key: SPARK-14623
>                 URL: https://issues.apache.org/jira/browse/SPARK-14623
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: hujiayin
>            Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> It relates to https://issues.apache.org/jira/browse/SPARK-7445
> Map the labels to 0/1. 
> For example,
> Input:
> "yellow,green,red,green,0"
> The labels: "0, green, red, yellow"
> Output:
> 0, 0, 0, 1
> 0, 1, 0, 0
> 0, 0, 1, 0
> 0, 1, 0, 0
> 1, 0, 0, 0


