Posted to issues@spark.apache.org by "Vincent (JIRA)" <ji...@apache.org> on 2016/09/12 08:44:21 UTC

[jira] [Comment Edited] (SPARK-17498) StringIndexer.setHandleInvalid should have another option 'new'

    [ https://issues.apache.org/jira/browse/SPARK-17498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15483514#comment-15483514 ] 

Vincent edited comment on SPARK-17498 at 9/12/16 8:43 AM:
----------------------------------------------------------

Here is how we understand this issue (cc [~qhuang]); please correct me if I have misunderstood anything, [~miro.balaz].
// spark-shell session (toDF implicits are pre-imported); column names are illustrative
import org.apache.spark.ml.feature.StringIndexer
val df = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")), 2).toDF("id", "label")
val indexer = new StringIndexer().setInputCol("label").setOutputCol("labelIndex").fit(df)
When transform is called on a new DataFrame that contains unseen labels,
say,
val dfNew = sc.parallelize(Seq((0, "a"), (1, "b"), (2, "e"), (3, "d")), 2).toDF("id", "label")
indexer.transform(dfNew)
it should return 3 and 4 for the labels "d" and "e" instead of skipping/deleting the new incoming labels, and IndexToString should return NaN for these added indexes 3 and 4.
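
To make the proposed behavior concrete, here is a minimal plain-Scala sketch. It does not use or reproduce Spark's API: the object name NewLabelSketch and its methods are hypothetical, Option/None stands in for the NaN result, and giving unseen labels their new indexes in sorted order is an assumption chosen only so that "d" and "e" come out as 3 and 4.

```scala
// Plain-Scala sketch of the proposed 'new' policy; this is NOT Spark's API.
object NewLabelSketch {
  // Mimic fitting: labels ordered by descending frequency, so the most
  // frequent label gets index 0 (ties broken alphabetically, an assumption).
  def fit(labels: Seq[String]): Vector[String] =
    labels.groupBy(identity).toVector
      .sortBy { case (label, occurrences) => (-occurrences.size, label) }
      .map(_._1)

  // Hypothetical 'new' handling: unseen labels are not dropped; they receive
  // the next free indexes (assumed here: in sorted order).
  def transform(fitted: Vector[String], incoming: Seq[String]): Seq[Double] = {
    val unseen = incoming.distinct.filterNot(l => fitted.contains(l)).sorted
    val index = (fitted ++ unseen).zipWithIndex.toMap
    incoming.map(label => index(label).toDouble)
  }

  // Inverse direction: an index beyond the fitted labels has no known label,
  // so None is returned (standing in for the NaN result suggested above).
  def inverse(fitted: Vector[String], idx: Double): Option[String] =
    fitted.lift(idx.toInt)
}
```

With the training labels from the example above, fit yields the order a, c, b, and transform on the labels a, b, e, d yields 0.0, 2.0, 4.0, 3.0, so no row is dropped.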

[~yanboliang] [~srowen] [~josephkb] what do you think of this issue? Currently StringIndexer can either skip an unseen label or throw an error in this case. Do you think we should add the proposed 'new' handling option to StringIndexer?


> StringIndexer.setHandleInvalid should have another option 'new'
> ---------------------------------------------------------------
>
>                 Key: SPARK-17498
>                 URL: https://issues.apache.org/jira/browse/SPARK-17498
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Miroslav Balaz
>
> That would map any unseen label to the maximum known label index + 1; IndexToString would then map that index back to "<undef>", or to NA if Spark has something like that.
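
One possible reading of the issue description, sketched in plain Scala rather than with Spark's API: all unseen labels share a single extra index (maximum known index + 1), and the inverse mapping returns the "<undef>" placeholder for it. The object name UnseenBucketSketch and its methods are hypothetical and exist only for illustration.

```scala
// Plain-Scala sketch of the mapping described in the issue; NOT Spark's API.
object UnseenBucketSketch {
  // Placeholder string taken from the issue text.
  val Undefined = "<undef>"

  // Known labels keep their position as index; every unseen label shares
  // index labels.length, which is the maximum known index + 1.
  def toIndex(labels: Vector[String], label: String): Int = {
    val i = labels.indexOf(label)
    if (i >= 0) i else labels.length
  }

  // Inverse mapping: any index outside the known range maps to "<undef>".
  def toLabel(labels: Vector[String], index: Int): String =
    labels.lift(index).getOrElse(Undefined)
}
```

Under this reading, distinct unseen labels cannot be told apart after indexing, which keeps the index range fixed at one more than the number of fitted labels.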


