Posted to issues@spark.apache.org by "Levente Torok (JIRA)" <ji...@apache.org> on 2017/07/04 11:36:00 UTC

[jira] [Commented] (SPARK-11069) Add RegexTokenizer option to convert to lowercase

    [ https://issues.apache.org/jira/browse/SPARK-11069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16073508#comment-16073508 ] 

Levente Torok commented on SPARK-11069:
---------------------------------------

With this modification, in v1.6.x, there is no way to tokenize without lowercase conversion. So this modification is a step backwards.

So if the "toLowercase" option is not implemented, as is the case now, it is still better to have no conversion at all: anyone who wants lowercase can convert the text before tokenizing, but anyone who does not want it cannot tokenize without the conversion.
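
The asymmetry argued above can be illustrated with a minimal sketch in plain Python. It uses only the standard `re` module, not the Spark API; `tokenize` and `tokenize_forced_lowercase` are hypothetical helpers introduced here for illustration. A tokenizer that preserves case lets the caller lowercase beforehand, while one that always lowercases offers no way to get the original case back.

```python
import re

def tokenize(text, pattern=r"\s+", to_lowercase=False):
    """Hypothetical helper: split text on `pattern`, lowercasing first only if asked."""
    if to_lowercase:
        text = text.lower()
    # Drop empty strings produced by leading or trailing separators.
    return [tok for tok in re.split(pattern, text) if tok]

def tokenize_forced_lowercase(text, pattern=r"\s+"):
    """Hypothetical helper: a tokenizer whose lowercase conversion cannot be disabled."""
    return tokenize(text, pattern, to_lowercase=True)

text = "Apache Spark MLlib"

# Case-preserving tokenizer: the caller decides.
case_preserved = tokenize(text)
manual_lower = tokenize(text.lower())

# Forced conversion: the original case is lost.
forced = tokenize_forced_lowercase(text)
```

Here `manual_lower` and `forced` hold the same tokens, but only the case-preserving variant can also produce `case_preserved`.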



> Add RegexTokenizer option to convert to lowercase
> -------------------------------------------------
>
>                 Key: SPARK-11069
>                 URL: https://issues.apache.org/jira/browse/SPARK-11069
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Joseph K. Bradley
>            Assignee: yuhao yang
>            Priority: Minor
>             Fix For: 1.6.0
>
>
> Tokenizer converts strings to lowercase automatically, but RegexTokenizer does not.  It would be nice to add an option to RegexTokenizer to convert to lowercase.  Proposal:
> * call the Boolean Param "toLowercase"
> * set default to false (so behavior does not change)
> *Q*: Should conversion to lowercase happen before or after regex matching?
> * Before: This is simpler.
> * After: This gives the user full control since they can have the regex treat upper/lower case differently.
> --> I'd vote for conversion before matching.  If a user needs full control, they can convert to lowercase manually.

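The before/after question in the quoted proposal can be sketched as follows. This is an illustration in plain Python with the standard `re` module, not Spark code; the pattern and the sample text are assumptions chosen so that the regex is case-sensitive, which is the situation where the order of operations matters.

```python
import re

# Assumed example pattern: match capitalized words only (case-sensitive).
PATTERN = r"[A-Z][a-z]+"
text = "Spark ML and Apache Spark"

# Convert BEFORE matching: the regex never sees an uppercase letter,
# so a case-sensitive pattern finds nothing.
before = re.findall(PATTERN, text.lower())

# Convert AFTER matching: the regex sees the original case,
# and only the resulting tokens are lowercased.
after = [tok.lower() for tok in re.findall(PATTERN, text)]
```

With conversion before matching, `before` is empty; with conversion after matching, `after` contains the lowercased capitalized words. For a case-insensitive pattern such as `\s+` the two orders give the same tokens, which is why converting before matching is the simpler default.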


