Posted to solr-user@lucene.apache.org by Srinivasa Meenavalli <Sm...@zensar.com> on 2016/08/26 06:25:38 UTC

RE: Default stop word list

Hi Steven,

The list of stop words for a language is not fixed; there is no single universal list of stop words used by all natural language processing tools.
Ideally, stop words should be defined by search merchandisers based on their domain rather than relying on the default list.

https://en.wikipedia.org/wiki/Stop_words

You can add your own lang/stopwords_<languagecode>.txt file and reference it from the field type:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" expand="true" synonyms="synonyms.txt" ignoreCase="true"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
</fieldType>

Regards
Srinivas Meenavalli

-----Original Message-----
From: Steven White [mailto:swhite4141@gmail.com]
Sent: Friday, August 26, 2016 4:02 AM
To: solr-user@lucene.apache.org
Subject: Default stopword list

Hi everyone,

I'm curious: how was the current "default" stopword list, for English and other languages, determined?  And for English, why is "I" not in the stopword list?

Thanks in advance.

Steve

Re: Default stop word list

Posted by Emir Arnautovic <em...@sematext.com>.

I would partially agree with Walter - having more resources allows us to
include stopwords in the index and let the scoring model do its job. However,
there are other Solr features that can suffer from that approach: e.g.
if you use edismax with mm=80%, then for a query containing stopwords you can
end up with irrelevant results that are there only because they survived mm,
while relevant ones are missing because they did not contain the stopwords.
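
To illustrate the mm effect, here is a rough sketch in plain Java. It is a simplified model, not Solr's implementation: the class and method names are made up, documents are just whitespace-split strings, every query term counts as one optional clause, and the percentage is assumed to be rounded down.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified model of a minimum-should-match check; NOT Solr's implementation.
class MinShouldMatchSketch {

    // Number of query terms that must match, assuming the percentage is rounded down.
    static int required(int clauseCount, int mmPercent) {
        return (clauseCount * mmPercent) / 100;
    }

    // Counts how many query terms occur in the document (lowercased, split on whitespace).
    static int matches(List<String> queryTerms, String doc) {
        Set<String> docTerms =
                new HashSet<>(Arrays.asList(doc.toLowerCase(Locale.ROOT).split("\\s+")));
        int count = 0;
        for (String term : queryTerms) {
            if (docTerms.contains(term)) {
                count++;
            }
        }
        return count;
    }

    // True if the document matches at least the required number of query terms.
    static boolean survives(List<String> queryTerms, String doc, int mmPercent) {
        return matches(queryTerms, doc) >= required(queryTerms.size(), mmPercent);
    }

    public static void main(String[] args) {
        // Hypothetical query with stopwords kept: 4 clauses, so mm=80% requires 3 matches.
        List<String> query = Arrays.asList("the", "history", "of", "rome");

        String relevant = "ancient rome";             // on topic, but contains no stopwords
        String irrelevant = "the history of cheese";  // off topic, but matches 3 of 4 terms

        System.out.println("relevant survives mm:   " + survives(query, relevant, 80));
        System.out.println("irrelevant survives mm: " + survives(query, irrelevant, 80));
    }
}
```

In this toy example the off-topic document passes the mm check because of "the" and "of", while the on-topic one is dropped.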

I would say that the decision should depend on the field type - if it is some
description, I would include StopFilterFactory, but if it is some title,
then keeping stopwords in the index is one way of making sure extreme titles
can be found. An alternative is to index it in different ways - analyzed,
string, shingles... - and combine those fields to find the best match without
losing "to be or not to be".

Regards,
Emir


On 08.09.2016 18:21, Walter Underwood wrote:
> I recommend that you remove StopFilterFactory from every analysis chain.
>
> In the tf.idf scoring model, rare words are automatically weighted more than common words.
>
> I have an index with 11.6 million documents. “the” occurs in 9.9 million of those documents. “cat” occurs in 16,000 of those documents. (I just did searches to get the counts).
>
> This is the idf (inverse document frequency) formula for Solr:
>
> public float idf(int docFreq, int numDocs) {
>     return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);
> }
>
> With this formula, which uses the natural logarithm, “the” has an idf of about 1.16 and “cat” has an idf of about 7.59. (With a base-10 logarithm the values are 1.07 and 3.86.)
>
> The term “the” still counts for relevance, but it is dominated by the weight for “cat”.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)

-- 
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/

Re: Default stop word list

Posted by Walter Underwood <wu...@wunderwood.org>.

I recommend that you remove StopFilterFactory from every analysis chain.

In the tf.idf scoring model, rare words are automatically weighted more than common words.

I have an index with 11.6 million documents. “the” occurs in 9.9 million of those documents. “cat” occurs in 16,000 of those documents. (I just did searches to get the counts).

This is the idf (inverse document frequency) formula for Solr:

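public float idf(int docFreq, int numDocs) {
    return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);
}

With this formula, which uses the natural logarithm, “the” has an idf of about 1.16 and “cat” has an idf of about 7.59. (With a base-10 logarithm the values are 1.07 and 3.86.)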

The term “the” still counts for relevance, but it is dominated by the weight for “cat”.
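
For anyone who wants to check the numbers, here is a minimal sketch that evaluates the quoted formula for the counts above. The class name IdfDemo and the idfBase10 helper are hypothetical, added only to show how much the choice of logarithm changes the result; nothing here comes from the Solr source beyond the formula already quoted.

```java
// Sketch only: evaluates the idf formula quoted above for the counts in this message.
class IdfDemo {

    // The formula as quoted; Math.log is the natural logarithm.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Hypothetical variant using a base-10 logarithm, for comparison only.
    static float idfBase10(int docFreq, int numDocs) {
        return (float) (Math.log10(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        int numDocs = 11_600_000; // documents in the index
        int dfThe = 9_900_000;    // documents containing "the"
        int dfCat = 16_000;       // documents containing "cat"

        System.out.println("the: ln=" + idf(dfThe, numDocs)
                + " log10=" + idfBase10(dfThe, numDocs));
        System.out.println("cat: ln=" + idf(dfCat, numDocs)
                + " log10=" + idfBase10(dfCat, numDocs));
    }
}
```

Whichever logarithm is used, the weight for the rare term is several times the weight for the common one, which is the point of the comparison.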

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Sep 8, 2016, at 7:09 AM, Steven White <sw...@gmail.com> wrote:
> 
> Hi Walter and all.  Sorry for the late reply, I was out of town.
> 
> Are you saying the list of stop words in the stop word file should be removed?  I
> understand the issues I will run into because of the stop word list, but
> all along, my understanding has been that the purpose of the stop word file
> -- eliminating those words from the index -- is to improve relevancy
> ranking.  For example, if I index the word "the" instead of removing it,
> then when I send the search term "the cat" (without quotes), records
> with "the" will rank far higher than records with "cat" in my result set.
> In fact, records with "cat" may not even be on the first page.  Wasn't this
> why the stop word list was created?
> 
> If my understanding is correct, is there a way for me to rank lower records
> that have a hit due to a list of common words, such as stop words?  This
> way: (1) I can then get rid of all the stop words in the stop word
> file, (2) solve the issue of searching on "be with me", et al., and (3)
> prevent the ranking issue.
> 
> Steve

Re: Default stop word list

Posted by Steven White <sw...@gmail.com>.

Hi Walter and all.  Sorry for the late reply, I was out of town.

Are you saying the list of stop words in the stop word file should be removed?  I
understand the issues I will run into because of the stop word list, but
all along, my understanding has been that the purpose of the stop word file
-- eliminating those words from the index -- is to improve relevancy
ranking.  For example, if I index the word "the" instead of removing it,
then when I send the search term "the cat" (without quotes), records
with "the" will rank far higher than records with "cat" in my result set.
In fact, records with "cat" may not even be on the first page.  Wasn't this
why the stop word list was created?

If my understanding is correct, is there a way for me to rank lower records
that have a hit due to a list of common words, such as stop words?  This
way: (1) I can then get rid of all the stop words in the stop word
file, (2) solve the issue of searching on "be with me", et al., and (3)
prevent the ranking issue.

Steve

On Mon, Aug 29, 2016 at 9:18 PM, Walter Underwood <wu...@wunderwood.org>
wrote:

> Do not remove stop words. Want to search for “vitamin a”? That won’t work.
>
> Stop word removal is a hack left over from when we were running search
> engines in 64 kbytes of memory.
>
> Yes, common words are less important for search, but removing them is a
> brute force approach with severe side effects. Instead, we use a
> proportional approach with the tf.idf model. That puts a higher weight on
> rare words and a lower weight on common words.
>
> For some real-life examples of problems with stop words, you can read the
> list of movie titles that disappear with stemming and stop words. I
> discovered these when I was running search at Netflix.
>
>         • Being There (this is the first one I noticed)
>         • To Be and To Have (Être et Avoir)
>         • To Have and To Have Not
>         • Once and Again
>         • To Be or Not To Be (1942) (OK, it isn’t just a quote from Hamlet)
>         • To Be or Not To Be (1983)
>         • Now and Then, Here and There
>         • Be with Me
>         • I’ll Be There
>         • It Had to Be You
>         • You Should Not Be Here
>         • You Are Here
>
> https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>

Re: Default stop word list

Posted by Walter Underwood <wu...@wunderwood.org>.

Do not remove stop words. Want to search for “vitamin a”? That won’t work.

Stop word removal is a hack left over from when we were running search engines in 64 kbytes of memory.

Yes, common words are less important for search, but removing them is a brute force approach with severe side effects. Instead, we use a proportional approach with the tf.idf model. That puts a higher weight on rare words and a lower weight on common words.

For some real-life examples of problems with stop words, you can read the list of movie titles that disappear with stemming and stop words. I discovered these when I was running search at Netflix.

	• Being There (this is the first one I noticed)
	• To Be and To Have (Être et Avoir)
	• To Have and To Have Not
	• Once and Again
	• To Be or Not To Be (1942) (OK, it isn’t just a quote from Hamlet)
	• To Be or Not To Be (1983)
	• Now and Then, Here and There
	• Be with Me
	• I’ll Be There
	• It Had to Be You
	• You Should Not Be Here
	• You Are Here

https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Aug 29, 2016, at 5:39 PM, Steven White <sw...@gmail.com> wrote:
> 
> Thanks Shawn.  This is the best answer I have seen, much appreciated.
> 
> A follow-up question: I want to remove stop words from the list, but if I
> do, then search quality will degrade (and index size will grow, which is less of
> an issue).  For example, if I remove "a", then if someone searches for "For
> a Few Dollars More" (without quotes), chances are good that records with "a"
> that are not relevant to the user's search will land higher up.  How can I address
> this?  Can I set up my schema so that records that get hits against a list
> of words, let's say from the stop word list, are ranked lower?
> 
> Steve

Re: Default stop word list

Posted by Steven White <sw...@gmail.com>.

Thanks Shawn.  This is the best answer I have seen, much appreciated.

A follow-up question: I want to remove stop words from the list, but if I
do, then search quality will degrade (and index size will grow, which is less of
an issue).  For example, if I remove "a", then if someone searches for "For
a Few Dollars More" (without quotes), chances are good that records with "a"
that are not relevant to the user's search will land higher up.  How can I address
this?  Can I set up my schema so that records that get hits against a list
of words, let's say from the stop word list, are ranked lower?

Steve

On Sat, Aug 27, 2016 at 2:53 PM, Shawn Heisey <ap...@elyograg.org> wrote:

> On 8/27/2016 12:39 PM, Shawn Heisey wrote:
> > I personally think that stopword removal is more of a problem than a
> > solution.
>
> There actually is one thing that a stopword filter can do that has little
> to do with the purpose it was designed for.  You can make it impossible
> to search for certain words.
>
> Imagine that your original data contains the word "frisbee" but for some
> reason you do not want anybody to be able to locate results using that
> word.  You can create a stopword list containing just "frisbee" and any
> other variations that you want to limit like "frisbees", then place it
> as a filter on the index side of your analysis.  With this in place,
> searching for those terms will retrieve zero results.
>
> Thanks,
> Shawn
>
>

Re: Default stop word list

Posted by Shawn Heisey <ap...@elyograg.org>.

On 8/27/2016 12:39 PM, Shawn Heisey wrote:
> I personally think that stopword removal is more of a problem than a
> solution.

There actually is one thing that a stopword filter can do that has little
to do with the purpose it was designed for.  You can make it impossible
to search for certain words.

Imagine that your original data contains the word "frisbee" but for some
reason you do not want anybody to be able to locate results using that
word.  You can create a stopword list containing just "frisbee" and any
other variations that you want to limit like "frisbees", then place it
as a filter on the index side of your analysis.  With this in place,
searching for those terms will retrieve zero results.
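
A toy model of why this works, in plain Java rather than Lucene: the class below is hypothetical and only mimics an index-side stop filter by skipping blocked tokens while building a tiny inverted index. Since the blocked word never gets into the index, no query for it can match.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index with an index-side block list; not Lucene or Solr code.
class BlockedTermSketch {
    private final Set<String> blocked;
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    BlockedTermSketch(Set<String> blocked) {
        this.blocked = blocked;
    }

    // Tokenizes on whitespace and indexes every token that is not blocked.
    void index(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (blocked.contains(token)) {
                continue; // the index-side stop filter drops this token
            }
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Returns the ids of documents containing the term, or an empty set.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        BlockedTermSketch index =
                new BlockedTermSketch(new HashSet<>(Arrays.asList("frisbee", "frisbees")));
        index.index(1, "the dog caught the frisbee");
        index.index(2, "two frisbees on the roof");

        System.out.println("dog:     " + index.search("dog"));
        System.out.println("frisbee: " + index.search("frisbee"));
    }
}
```

The documents themselves are still indexed and can be found through their other words; only the blocked terms stop being searchable.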

Thanks,
Shawn

Re: Default stop word list

Posted by Shawn Heisey <ap...@elyograg.org>.

On 8/26/2016 7:13 AM, Steven White wrote:
> But what about the current "default" list that comes with Solr?  How was
> that list, for all supported languages, determined?

That list of stopwords was created from years of history with Lucene,
taking the expertise of many people and the wisdom of the Internet into
account.

> What I fear is this: when someone puts Solr into production, no one makes a
> change to that list, so if the list is not "valid" this will impact
> search, but if the list is valid, how was it determined, just by the
> development team of Solr / Lucene or with input from linguistic experts?

The list of stopwords that comes with Solr is a *starting point*.  The
person who sets Solr up should review the list and adjust it to their
needs ... or possibly remove the stopword filter entirely.

I personally think that stopword removal is more of a problem than a
solution.  In the long forgotten days of history, when computers had far
less processing power, storage, and memory than they do now ... removing
stopwords was a significant performance advantage, because it made the
indexes smaller.

With typical modern server configurations and small to medium sized
indexes, the performance benefit is minimal, and the removal can
sometimes cause significant disadvantages.

The classic example query related to stopwords (in English) is trying to
search for "to be or not to be" -- a phrase made up of words that almost
always appear in a stopword list, causing big problems.  A more relevant
example is searching an entertainment database for "the who".  That
search returns mostly irrelevant results when stopwords are removed. 
Imagine searching a music database for "the the" and not finding
anything at all relating to this band:

https://en.wikipedia.org/wiki/The_The
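
Here is a small sketch of what happens to those queries. The stopword set below is a short illustrative list I made up for the example, not the file that ships with Solr, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Shows what is left of a query after stopword removal; illustration only.
class StopwordQuerySketch {

    // Small illustrative stopword list (an assumption, not Solr's default file).
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "be", "not", "or", "the", "to"));

    // Lowercases the query, splits on whitespace, and drops every stopword.
    static List<String> removeStopwords(String query) {
        List<String> kept = new ArrayList<>();
        for (String token : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty() && !STOPWORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        for (String query : Arrays.asList("to be or not to be", "the who", "the the")) {
            System.out.println("\"" + query + "\" -> " + removeStopwords(query));
        }
    }
}
```

With this list, "to be or not to be" and "the the" are left with no terms at all, and "the who" is reduced to the single word "who", which matches far more than the band.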

Thanks,
Shawn

Re: Default stop word list

Posted by Steven White <sw...@gmail.com>.

But what about the current "default" list that comes with Solr?  How was
that list, for all supported languages, determined?

What I fear is this: when someone puts Solr into production, no one makes a
change to that list, so if the list is not "valid" this will impact
search, but if the list is valid, how was it determined, just by the
development team of Solr / Lucene or with input from linguistic experts?

Steve

On Fri, Aug 26, 2016 at 2:25 AM, Srinivasa Meenavalli <Smeenavali@zensar.com> wrote:

> Hi Steven,
>
> The list of stop words for a language is not fixed; there is no single
> universal list of stop words used by all natural language processing tools.
> Ideally, stop words should be defined by search merchandisers based on their
> domain rather than relying on the default list.
>
> https://en.wikipedia.org/wiki/Stop_words
>
> You can add your own lang/stopwords_<languagecode>.txt file and reference it from the field type:
>
> <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
>     <analyzer type="index">
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.EnglishPossessiveFilterFactory"/>
>       <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>       <filter class="solr.PorterStemFilterFactory"/>
>     </analyzer>
>     <analyzer type="query">
>       <tokenizer class="solr.StandardTokenizerFactory"/>
>       <filter class="solr.SynonymFilterFactory" expand="true" synonyms="synonyms.txt" ignoreCase="true"/>
>       <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.EnglishPossessiveFilterFactory"/>
>       <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>       <filter class="solr.PorterStemFilterFactory"/>
>     </analyzer>
> </fieldType>
>
> Regards
> Srinivas Meenavalli