Posted to issues@spark.apache.org by "yuhao yang (JIRA)" <ji...@apache.org> on 2016/11/14 17:44:58 UTC

[jira] [Comment Edited] (SPARK-18374) Incorrect words in StopWords/english.txt

    [ https://issues.apache.org/jira/browse/SPARK-18374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15664503#comment-15664503 ] 

yuhao yang edited comment on SPARK-18374 at 11/14/16 5:44 PM:
--------------------------------------------------------------

Thanks for the response. By default, _Tokenizer_ is equivalent to {noformat}split("\\s"){noformat} and _RegexTokenizer_ is equivalent to {noformat}split("\\s+"){noformat}, which means neither splits on apostrophes or quotes. _RegexTokenizer_ does support a custom regex pattern for splitting.
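
To make the point concrete, here is a rough sketch in plain Python (not Spark itself) that mimics the two split patterns above with the standard re module. The helper names tokenizer and regex_tokenizer and the sample sentence are hypothetical stand-ins for illustration, and the two-word stop list is just a subset of the entries named in the issue description.

```python
import re

def tokenizer(text):
    # Stand-in for split("\\s"): split on every single whitespace character.
    return re.split(r"\s", text)

def regex_tokenizer(text, pattern=r"\s+"):
    # Stand-in for split("\\s+"); the pattern can be customized.
    # Empty strings are dropped so only real tokens remain.
    return [t for t in re.split(pattern, text) if t]

sentence = "she won't say who won"

# Whitespace-only splitting keeps the apostrophe inside the token.
tokens = regex_tokenizer(sentence)
print(tokens)  # ['she', "won't", 'say', 'who', 'won']

# Two of the truncated entries reported in the issue (assumed subset).
stop_words = {"won", "wouldn"}

# "won't" does not match "won", but the valid word "won" is removed.
print([t for t in tokens if t not in stop_words])  # ['she', "won't", 'say', 'who']

# A customized pattern that also splits on apostrophes.
print(regex_tokenizer(sentence, r"[\s']+"))  # ['she', 'won', 't', 'say', 'who', 'won']
```

With whitespace-only splitting, a stop-list entry such as "won" never matches "won't" and only removes the legitimate word "won". It would only line up with the contraction if the tokenizer were configured to split on apostrophes.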


was (Author: yuhaoyan):
Thanks for the response. By default, _Tokenizer_ equals to _split("\\s")_ and _RegexTokenizer_ equals to _split("\\s+"), which means no split on apostrophes or quotes. _RegexTokenizer_ can surely support customized regex pattern for split.

> Incorrect words in StopWords/english.txt
> ----------------------------------------
>
>                 Key: SPARK-18374
>                 URL: https://issues.apache.org/jira/browse/SPARK-18374
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.0.1
>            Reporter: nirav patel
>
> I was double-checking english.txt, the list of stop words, because it seemed to be removing valid tokens like 'won'. I think the issue is that the english.txt list is missing the apostrophe character and every character after the apostrophe. So "won't" became "won" in that list, and "wouldn't" became "wouldn".
> Here are some incorrect tokens in this list:
> won
> wouldn
> ma
> mightn
> mustn
> needn
> shan
> shouldn
> wasn
> weren
> I think the ideal list should have both styles, i.e. both "won't" and "wont" should be part of english.txt, since some tokenizers might remove special characters. But 'won' obviously shouldn't be in this list.
> Here's the list of Snowball English stop words:
> http://snowball.tartarus.org/algorithms/english/stop.txt
