Posted to solr-user@lucene.apache.org by Teague James <te...@insystechinc.com> on 2014/07/14 21:53:31 UTC

Of, To, and Other Small Words

Hello all,

I am working with Solr 4.9.0 and am searching for phrases that contain words
like "of" or "to" that Solr seems to be ignoring at index time. Here's what
I tried:

curl "http://localhost/solr/update?commit=true" -H "Content-Type: text/xml" \
  --data-binary '<add><doc><field name="id">100</field><field name="content">blah blah blah knowledge of science blah blah blah</field></doc></add>'

Then, using a browser:

http://localhost/solr/collection1/select?q="knowledge+of+science"&fq=id:100

I get zero hits. Search for "knowledge" or "science" and I'll get hits.
"knowledge of" or "of science" and I get zero hits. I don't want to use
proximity if I can avoid it, as this may introduce too many undesirable
results. Stopwords.txt is blank, yet clearly Solr is ignoring "of" and "to"
and possibly more words that I have not discovered through testing yet. Is
there some other configuration file that contains these small words? Is
there any way to force Solr to pay attention to them and not drop them from
the phrase? Any advice is appreciated! Thanks!

-Teague



RE: Of, To, and Other Small Words

Posted by Teague James <te...@insystechinc.com>.
Jack,

Thanks for replying and the suggestion. I replied to another suggestion with my field type and I do have <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />.  There's nothing in the stopwords.txt. I even cleaned out stopwords_en.txt just to be certain. Any other suggestions on how to control this behavior?

-Teague

-----Original Message-----
From: Jack Krupansky [mailto:jack@basetechnology.com] 
Sent: Monday, July 14, 2014 4:26 PM
To: solr-user@lucene.apache.org
Subject: Re: Of, To, and Other Small Words

Or, if you happen to leave off the "words" attribute of the stop filter (or misspell the attribute name), it will use the internal Lucene hardwired list of stop words.

-- Jack Krupansky

-----Original Message-----
From: Anshum Gupta
Sent: Monday, July 14, 2014 4:03 PM
To: solr-user@lucene.apache.org
Subject: Re: Of, To, and Other Small Words

Hi Teague,

The StopFilterFactory (which I think you're using) by default uses lang/stopwords_en.txt (which wouldn't be empty if you check).
What you're looking at is the stopwords.txt. You could either empty that file out or change the field type for your field.



-- 

Anshum Gupta
http://www.anshumgupta.net 


Re: Of, To, and Other Small Words

Posted by Jack Krupansky <ja...@basetechnology.com>.
Or, if you happen to leave off the "words" attribute of the stop filter (or 
misspell the attribute name), it will use the internal Lucene hardwired list 
of stop words.

-- Jack Krupansky
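
To illustrate the fallback Jack describes, here is a small Python simulation. It is not Solr or Lucene code: the function name stop_filter and its parameters are made up for this sketch, and passing words=None stands in for a missing or misspelled "words" attribute. The word list is, as far as I know, the default English set defined in Lucene's StopAnalyzer for the 4.x line.

```python
# Simulation only: mimics how a stop filter falls back to a built-in list
# when no word list is supplied. This is not Solr or Lucene code.

# Default English stop words as listed in Lucene's StopAnalyzer
# (ENGLISH_STOP_WORDS_SET) for the 4.x line.
LUCENE_DEFAULT_STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
})


def stop_filter(tokens, words=None, ignore_case=True):
    """Drop stop words; words=None stands in for a missing "words" attribute."""
    stop_set = LUCENE_DEFAULT_STOP_WORDS if words is None else frozenset(words)
    if ignore_case:
        stop_set = frozenset(w.lower() for w in stop_set)
    kept = []
    for token in tokens:
        key = token.lower() if ignore_case else token
        if key not in stop_set:
            kept.append(token)
    return kept


phrase = "knowledge of science".split()
print(stop_filter(phrase))            # no list configured: built-in defaults apply
print(stop_filter(phrase, words=[]))  # empty stopwords.txt: nothing is removed
```

With a configuration like Teague's (words="stopwords.txt" pointing at an empty file), the second call is the relevant one, which suggests the stop filter is not what removed "of" and "to".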



Re: Of, To, and Other Small Words

Posted by Jack Krupansky <ja...@basetechnology.com>.
Oops... forgot the link to the stop filter factory Javadoc:
http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/core/StopFilterFactory.html

-- Jack Krupansky

-----Original Message----- 
From: Jack Krupansky
Sent: Tuesday, July 15, 2014 7:42 AM
To: solr-user@lucene.apache.org
Subject: Re: Of, To, and Other Small Words

Yeah, this is another one of those places where the behavior of Solr is
defined but way down in the Lucene Javadoc, where no Solr user should ever
have to go!

It's also the kind of detail documented in my Solr Deep Dive e-book:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html

-- Jack Krupansky

-----Original Message----- 
From: Alexandre Rafalovitch
Sent: Tuesday, July 15, 2014 4:36 AM
To: solr-user
Subject: Re: Of, To, and Other Small Words

https://github.com/apache/lucene-solr/blob/lucene_solr_4_9_0/lucene/analysis/common/src/java/org/apache/lucene/analysis/core/StopAnalyzer.java#L51

If you don't set the attribute in the XML file, it falls back to the
default definitions.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On Tue, Jul 15, 2014 at 3:16 PM, Aman Tandon <am...@gmail.com>
wrote:
> Hi jack,
>
>
> it will use the internal *Lucene hardwired list* of stop words
>
>
> I was unaware of this. Could you please provide more information about
> it?
>
>
> With Regards
> Aman Tandon
>
>
> On Tue, Jul 15, 2014 at 7:21 AM, Alexandre Rafalovitch 
> <ar...@gmail.com>
> wrote:
>
>> You could try experimenting with CommonGramsFilterFactory and
>> CommonGramsQueryFilterFactory (slightly different). There are actually a lot
>> of cool analyzers bundled with Solr. You can find the full list on my site
>> at: http://www.solr-start.com/info/analyzers
>>
>> Regards,
>>    Alex.
>> Personal: http://www.outerthoughts.com/ and @arafalov
>> Solr resources: http://www.solr-start.com/ and @solrstart
>> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>>
>>
>> On Tue, Jul 15, 2014 at 8:42 AM, Teague James <te...@insystechinc.com>
>> wrote:
>> > Alex,
>> >
>> > Thanks! Great suggestion. I figured out that it was the EdgeNGramFilterFactory. Taking that out of the mix did it.
>> >
>> > -Teague
>> >
>> > -----Original Message-----
>> > From: Alexandre Rafalovitch [mailto:arafalov@gmail.com]
>> > Sent: Monday, July 14, 2014 9:14 PM
>> > To: solr-user
>> > Subject: Re: Of, To, and Other Small Words
>> >
>> > Have you tried the Admin UI's Analysis screen? It will show you what happens to the text as it progresses through the tokenizers and filters. No need to reindex.
>> >
>> > Regards,
>> >    Alex.
>> > Personal: http://www.outerthoughts.com/ and @arafalov
>> > Solr resources: http://www.solr-start.com/ and @solrstart
>> > Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>> >
>> >
>> > On Tue, Jul 15, 2014 at 8:10 AM, Teague James <te...@insystechinc.com> wrote:
>> >> Hi Anshum,
>> >>
>> >> Thanks for replying and suggesting this, but the field type I am using (a modified text_general) in my schema has the file set to 'stopwords.txt'.
>> >>
>> >>         <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>> >>                 <analyzer type="index">
>> >>                         <tokenizer class="solr.StandardTokenizerFactory"/>
>> >>                         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>> >>                         <!-- in this example, we will only use synonyms at query time
>> >>                         <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>-->
>> >>                         <filter class="solr.LowerCaseFilterFactory"/>
>> >>                         <!-- CHANGE: The NGramFilterFactory was added to provide partial word search. This can be changed to
>> >>                         EdgeNGramFilterFactory side="front" to only match front sided partial searches if matching any
>> >>                         part of a word is undesirable.-->
>> >>                         <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="10" />
>> >>                         <!-- CHANGE: The PorterStemFilterFactory was added to allow matches for 'cat' and 'cats' by searching for 'cat' -->
>> >>                         <filter class="solr.PorterStemFilterFactory"/>
>> >>                 </analyzer>
>> >>                 <analyzer type="query">
>> >>                         <tokenizer class="solr.StandardTokenizerFactory"/>
>> >>                         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>> >>                         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>> >>                         <filter class="solr.LowerCaseFilterFactory"/>
>> >>                         <!-- CHANGE: The PorterStemFilterFactory was added to allow matches for 'cat' and 'cats' by searching for 'cat' -->
>> >>                         <filter class="solr.PorterStemFilterFactory"/>
>> >>                 </analyzer>
>> >>         </fieldType>
>> >>
>> >> Just to be double sure I cleared the list in stopwords_en.txt, restarted Solr, re-indexed, and searched with still zero results. Any other suggestions on where I might be able to control this behavior?
>> >>
>> >> -Teague
>> >>
>> >>
>> 
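
A note on the field type quoted above: its index analyzer runs an n-gram filter with minGramSize="3", while the query analyzer has no n-gram filter at all. As far as I know, the n-gram filters in Lucene 4.x emit nothing for a token shorter than the minimum gram size, so two-letter words such as "of" and "to" would never reach the index, which would be consistent with Teague's finding that removing the n-gram filter fixed the phrase search. The Python sketch below only imitates that behavior. The names ngrams and index_terms are invented for illustration, tokenizing is reduced to lower-casing and splitting on whitespace, and stemming is left out.

```python
# Simulation only: imitates an n-gram token filter that drops tokens shorter
# than the minimum gram size. This is not Solr or Lucene code.

def ngrams(token, min_gram=3, max_gram=10):
    """Return every substring of token whose length is in [min_gram, max_gram]."""
    grams = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size > len(token):
                break  # longer grams from this start would run past the token
            grams.append(token[start:start + size])
    return grams


def index_terms(text, min_gram=3, max_gram=10):
    """Map each lower-cased whitespace token to the grams it would produce."""
    return {token: ngrams(token, min_gram, max_gram)
            for token in text.lower().split()}


for token, grams in index_terms("knowledge of science").items():
    status = "kept" if grams else "dropped (shorter than minGramSize)"
    print(f"{token!r}: {len(grams)} grams, {status}")
```

In this sketch "knowledge" and "science" each survive as one of their own grams, so single-word searches still match, while "of" produces no terms and a phrase query that requires it cannot match. Lowering the minimum gram size, or dropping the n-gram filter from the index analyzer as Teague did, would avoid that.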

Re: Of, To, and Other Small Words

Posted by Jack Krupansky <ja...@basetechnology.com>.
Yeah, this is another one of those places where the behavior of Solr is 
defined but way down in the Lucene Javadoc, where no Solr user should ever 
have to go!

It's also the kind of detail documented in my Solr Deep Dive e-book:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html

-- Jack Krupansky



Re: Of, To, and Other Small Words

Posted by Walter Underwood <wu...@wunderwood.org>.
If you want to keep stopwords, take the stopword filter out of your analysis chain.

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/




Re: Of, To, and Other Small Words

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
https://github.com/apache/lucene-solr/blob/lucene_solr_4_9_0/lucene/analysis/common/src/java/org/apache/lucene/analysis/core/StopAnalyzer.java#L51

If you don't set the attribute in the XML file, it falls back to the
default definitions.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On Tue, Jul 15, 2014 at 3:16 PM, Aman Tandon <am...@gmail.com> wrote:
> Hi jack,
>
>
> it will use the internal *Lucene hardwired list* of stop words
>
>
> I am unaware of this, could you please provide the more information about
> this.
>
>
> With Regards
> Aman Tandon
>
>
> On Tue, Jul 15, 2014 at 7:21 AM, Alexandre Rafalovitch <ar...@gmail.com>
> wrote:
>
>> You could try experimenting with CommonGramsFilterFactory and
>> CommonGramsQueryFilter (slightly different). There is actually a lot
>> of cool analyzers bundled with Solr. You can find full list on my site
>> at: http://www.solr-start.com/info/analyzers
>>
>> Regards,
>>    Alex.
>> Personal: http://www.outerthoughts.com/ and @arafalov
>> Solr resources: http://www.solr-start.com/ and @solrstart
>> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>>
>>
>> On Tue, Jul 15, 2014 at 8:42 AM, Teague James <te...@insystechinc.com>
>> wrote:
>> > Alex,
>> >
>> > Thanks! Great suggestion. I figured out that it was the
>> EdgeNGramFilterFactory. Taking that out of the mix did it.
>> >
>> > -Teague
>> >
>> > -----Original Message-----
>> > From: Alexandre Rafalovitch [mailto:arafalov@gmail.com]
>> > Sent: Monday, July 14, 2014 9:14 PM
>> > To: solr-user
>> > Subject: Re: Of, To, and Other Small Words
>> >
>> > Have you tried the Admin UI's Analyze screen. Because it will show you
>> what happens to the text as it progresses through the tokenizers and
>> filters. No need to reindex.

Re: Of, To, and Other Small Words

Posted by Aman Tandon <am...@gmail.com>.
Hi Jack,


> it will use the internal Lucene hardwired list of stop words


I was unaware of this. Could you please provide more information about
it?


With Regards
Aman Tandon



Re: Of, To, and Other Small Words

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
You could try experimenting with CommonGramsFilterFactory and
CommonGramsQueryFilterFactory (slightly different). There are actually a
lot of cool analyzers bundled with Solr. You can find the full list on my
site at: http://www.solr-start.com/info/analyzers
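
To give a feel for what the index-time common-grams filter produces, here is a rough Python model. It is only a sketch of the idea, not Lucene's code: the function name `common_grams`, the "_" separator argument, and the sample common-word set are assumptions made for illustration. Each token is kept, and a bigram is added whenever either word of an adjacent pair is a common word, so a phrase such as "knowledge of science" can still be matched on "knowledge_of" and "of_science".

```python
def common_grams(tokens, common_words, sep="_"):
    """Toy model of index-time common grams (illustrative, not Lucene's code)."""
    common = {w.lower() for w in common_words}
    out = []
    for i, tok in enumerate(tokens):
        # The original token is always kept.
        out.append(tok)
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            # Add a bigram when either neighbour is a common word.
            if tok.lower() in common or nxt.lower() in common:
                out.append(tok + sep + nxt)
    return out


tokens = "knowledge of science".split()
grams = common_grams(tokens, {"of", "to"})
print(grams)
```

As far as I understand it, the query-side variant then prefers the bigrams over the single words, which is why the two factories are configured separately for the index and query analyzers.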

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853



RE: Of, To, and Other Small Words

Posted by Teague James <te...@insystechinc.com>.
Alex,

Thanks! Great suggestion. I figured out that it was the EdgeNGramFilterFactory. Taking that out of the mix did it.
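
A likely reason, assuming the n-gram filter was configured with minGramSize="3" as in the field type posted earlier in this thread: a token shorter than the minimum gram size yields no grams at all, so "of" and "to" never reach the index, while the query analyzer (which has no n-gram filter) still asks for them. The Python below is a simplified model of that behavior, not Lucene's implementation; the helper name `ngrams` is made up for illustration, and lowercasing and stemming are ignored.

```python
def ngrams(token, min_gram, max_gram):
    """All substrings of token with length between min_gram and max_gram."""
    grams = []
    for size in range(min_gram, max_gram + 1):
        # When size > len(token) this range is empty, so short tokens yield nothing.
        for start in range(0, len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


# Index side: every token goes through the n-gram step (min 3, max 10).
indexed_terms = set()
for token in "blah blah blah knowledge of science blah blah blah".split():
    indexed_terms.update(ngrams(token, 3, 10))

# Query side: no n-gram step, so the phrase terms are looked up as-is.
query_terms = ["knowledge", "of", "science"]
missing = [t for t in query_terms if t not in indexed_terms]
print(missing)
```

In this model the only missing term is "of", which matches the symptom: single-word searches for "knowledge" or "science" hit, but any phrase containing "of" cannot.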

-Teague



Re: Of, To, and Other Small Words

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Have you tried the Admin UI's Analysis screen? It will show you
what happens to the text as it progresses through the tokenizers and
filters. No need to reindex.
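
If you would rather script it than click through the UI, the same information should be available over HTTP from the field analysis request handler. The Python sketch below only builds the request URL. It assumes the handler is registered at /analysis/field in solrconfig.xml (I believe the stock example config does this) and reuses the core URL and field type name from this thread; the helper name `build_analysis_url` is made up for illustration.

```python
from urllib.parse import urlencode


def build_analysis_url(core_url, field_type, index_text, query_text):
    """Build a field-analysis request URL (assumes /analysis/field is enabled)."""
    params = {
        "analysis.fieldtype": field_type,   # field type to analyze with
        "analysis.fieldvalue": index_text,  # text run through the index analyzer
        "analysis.query": query_text,       # text run through the query analyzer
        "analysis.showmatch": "true",       # ask Solr to mark matching tokens
        "wt": "json",
    }
    return core_url.rstrip("/") + "/analysis/field?" + urlencode(params)


url = build_analysis_url(
    "http://localhost/solr/collection1",
    "text_general",
    "blah blah blah knowledge of science blah blah blah",
    "knowledge of science",
)
print(url)
```

Open the printed URL in a browser or pass it to curl; the response lists the tokens after each tokenizer and filter stage for both analyzers.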

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853



RE: Of, To, and Other Small Words

Posted by Teague James <te...@insystechinc.com>.
Hi Anshum,

Thanks for replying and suggesting this, but the field type I am using (a modified text_general) in my schema has the file set to 'stopwords.txt'. 

	<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
		<analyzer type="index">
			<tokenizer class="solr.StandardTokenizerFactory"/>
			<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
			<!-- in this example, we will only use synonyms at query time
			<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>-->
			<filter class="solr.LowerCaseFilterFactory"/>
			<!-- CHANGE: The NGramFilterFactory was added to provide partial word search. This can be changed to
			EdgeNGramFilterFactory side="front" to only match front-sided partial searches if matching any
			part of a word is undesirable.-->
			<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="10" />
			<!-- CHANGE: The PorterStemFilterFactory was added to allow matches for 'cat' and 'cats' by searching for 'cat' -->
			<filter class="solr.PorterStemFilterFactory"/>
		</analyzer>
		<analyzer type="query">
			<tokenizer class="solr.StandardTokenizerFactory"/>
			<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
			<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
			<filter class="solr.LowerCaseFilterFactory"/>
			<!-- CHANGE: The PorterStemFilterFactory was added to allow matches for 'cat' and 'cats' by searching for 'cat' -->
			<filter class="solr.PorterStemFilterFactory"/>
		</analyzer>
	</fieldType> 

Just to be doubly sure, I cleared the list in stopwords_en.txt, restarted Solr, re-indexed, and searched again, still with zero results. Any other suggestions on where I might be able to control this behavior?

-Teague




Re: Of, To, and Other Small Words

Posted by Anshum Gupta <an...@anshumgupta.net>.
Hi Teague,

The StopFilterFactory (which I think you're using) by default uses
lang/stopwords_en.txt (which wouldn't be empty if you check).
What you're looking at is stopwords.txt. You could either empty
that file out or change the field type for your field.





-- 

Anshum Gupta
http://www.anshumgupta.net