Posted to issues@spark.apache.org by "nirav patel (JIRA)" <ji...@apache.org> on 2016/11/09 02:08:58 UTC

[jira] [Created] (SPARK-18374) Incorrect words in StopWords/english.txt

nirav patel created SPARK-18374:
-----------------------------------

             Summary: Incorrect words in StopWords/english.txt
                 Key: SPARK-18374
                 URL: https://issues.apache.org/jira/browse/SPARK-18374
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.0.1
            Reporter: nirav patel


I was double-checking english.txt, the list of stop words, because it seemed to be removing valid tokens such as 'won'. I think the issue is that the entries in english.txt are missing the apostrophe and every character after it. So "won't" became "won" in that list, and "wouldn't" became "wouldn".

Here are some incorrect tokens in this list:

won
wouldn
ma
mightn
mustn
needn
shan
shouldn
wasn
weren
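
As a rough illustration of the hypothesis above, the following Python sketch cuts each contraction at its apostrophe and prints the result. The helper name `truncate_at_apostrophe` and the list of full forms are assumptions made for this example (the report only names "won't" and "wouldn't" explicitly), and this is not how english.txt was actually generated. "ma" is left out because its original form is not stated here.

```python
def truncate_at_apostrophe(word: str) -> str:
    """Drop the apostrophe and every character after it."""
    return word.split("'", 1)[0]


# Assumed full forms of the reported tokens (hypothetical input list).
contractions = [
    "won't", "wouldn't", "mightn't", "mustn't", "needn't",
    "shan't", "shouldn't", "wasn't", "weren't",
]

truncated = [truncate_at_apostrophe(w) for w in contractions]
print(truncated)
```

If the hypothesis holds, the printed tokens should match the entries listed above, apart from "ma".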

I think the ideal list should have both styles, i.e. both "won't" and "wont" should be part of english.txt, since some tokenizers might remove special characters. But 'won' obviously shouldn't be in this list.
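
A minimal sketch of that proposal, in plain Python rather than Spark: each contraction is stored both with and without its apostrophe, and the truncated form is not stored at all. The names `both_styles`, `stop_words`, and the sample tokens are hypothetical and only serve to illustrate the idea.

```python
def both_styles(contraction: str) -> set:
    """Return the contraction with and without its apostrophe."""
    return {contraction, contraction.replace("'", "")}


# Hypothetical stop-word set built from two of the contractions in the report.
stop_words = set()
for word in ["won't", "wouldn't"]:
    stop_words |= both_styles(word)

# Sample tokens: "won" is a valid word and should survive filtering.
tokens = ["she", "won", "and", "won't", "stop", "wont"]
kept = [t for t in tokens if t not in stop_words]
print(kept)
```

With such a list, "won't" and "wont" are filtered out regardless of how the tokenizer treats apostrophes, while "won" is kept.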



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
