Posted to solr-user@lucene.apache.org by Bogdan Vatkov <bo...@gmail.com> on 2010/01/03 02:20:09 UTC

Stopwords not working as expected

Hi,

I am using the default (example) configuration of Solr, in which stopword
removal seems to be enabled for both indexing and querying of fields of
type "text".
I have a custom field which is of the "text" type.
I have extended the stopwords.txt file with lots of words, but when I index
some documents the index still contains stopwords - I can see this with the
Luke tool.
Am I supposed to see these terms in the index after they are declared in the
stopwords.txt file?
What could be wrong?

Best regards,
Bogdan

Re: Stopwords not working as expected

Posted by Bogdan Vatkov <bo...@gmail.com>.

Hi Grant,

I have attached my Solr index & configuration (+stopwords.txt) + Mahout
vectors and clusters.
My mahout scripts are in the archive too.
You see in the stopwords file I have added stuff like "cluster", "gmail" and
you will find them as TopTerms in the mahout/email-clustering/clusters.txt
file.

Best regards,
Bogdan

Re: Stopwords not working as expected

Posted by Bogdan Vatkov <bo...@gmail.com>.

Yes, I can do that.
In the meantime I changed the Driver a little bit to apply the stopwords by
force:

in TFDFMapper
...
  private List<String> stopwords;

  public TFDFMapper(IndexReader reader, Weight weight, TermInfo termInfo,
                    File stopwordsFile) {
    this.reader = reader;
    this.weight = weight;
    this.termInfo = termInfo;
    this.numDocs = reader.numDocs();
    this.stopwords = getContents(stopwordsFile);
  }
...
  public void map(String term, int frequency, TermVectorOffsetInfo[] offsets,
                  int[] positions) {
    TermEntry entry = termInfo.getTermEntry(field, term);
    if (entry != null) {
      if (!stopwords.contains(term)) {
        vector.setQuick(entry.termIdx,
            weight.calculate(frequency, entry.docFreq, numTerms, numDocs));
      }
    }
  }


and in Driver:
...
  String stopwordsFile = cmdLine.getValue(stopwordsOpt).toString();
  VectorMapper mapper = new TFDFMapper(reader, weight, termInfo,
      new File(stopwordsFile));
I am currently waiting to see the result clusters.

But you are right: I will try to run a smaller set of docs so that I can
debug more easily (and share the docs).
I will come back shortly.
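
The workaround above keeps the stopwords in a List, so every contains() call
scans the whole list. A minimal sketch of building a Set instead is shown
below. The class and the parseStopwords helper are hypothetical names and are
not part of Mahout or Solr. The sketch assumes one word per line, with blank
lines and lines starting with '#' skipped, and it lowercases each word on the
assumption that the indexed terms were lowercased by the analyzer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class StopwordSetSketch {

  /** Hypothetical helper: turns the lines of a stopword file into a lookup set. */
  static Set<String> parseStopwords(Iterable<String> lines) {
    Set<String> words = new HashSet<>();
    for (String line : lines) {
      String word = line.trim();
      // Assumption: blank lines and '#' comment lines carry no stopword.
      if (word.isEmpty() || word.startsWith("#")) {
        continue;
      }
      words.add(word.toLowerCase(Locale.ROOT));
    }
    return words;
  }

  public static void main(String[] args) {
    Set<String> stopwords =
        parseStopwords(Arrays.asList("# custom list", "cluster", "Gmail", ""));
    System.out.println(stopwords.contains("gmail")); // prints true
    System.out.println(stopwords.contains("solr"));  // prints false
  }
}
```

For a real file the lines could come from java.nio.file.Files.readAllLines.
Note that terms in the index have also been stemmed, so a lookup like this
only matches stopwords whose stemmed form equals the listed word.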



On Sun, Jan 3, 2010 at 5:24 PM, Grant Ingersoll <gs...@apache.org> wrote:

>
> On Jan 3, 2010, at 9:13 AM, Bogdan Vatkov wrote:
>
> > Unfortunately it is all classified data I could not share, I will try to
> > debug
>
> Can you reproduce w/ generic documents?
>

-- 
Best regards,
Bogdan

Re: Stopwords not working as expected

Posted by Grant Ingersoll <gs...@apache.org>.

On Jan 3, 2010, at 9:13 AM, Bogdan Vatkov wrote:

> Unfortunately it is all classified data I could not share, I will try to
> debug

Can you reproduce w/ generic documents?

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search

Re: Stopwords not working as expected

Posted by Bogdan Vatkov <bo...@gmail.com>.

Unfortunately it is all classified data that I cannot share; I will try to
debug it myself.

On Sun, Jan 3, 2010 at 4:10 PM, Grant Ingersoll <gs...@apache.org> wrote:

> Is there anyway you could zip up a small document set and your Solr home
> and post somewhere?

-- 
Best regards,
Bogdan

Re: Stopwords not working as expected

Posted by Grant Ingersoll <gs...@apache.org>.

Is there any way you could zip up a small document set and your Solr home and post it somewhere?

On Jan 3, 2010, at 9:08 AM, Bogdan Vatkov wrote:

> Yesterday I had issues with mapping cluster results to dictionary entries -
> it happened that I was using different dictionary - therefore the result
> clusters shown really strange results.
> But once I fixed all the commands, input/output files, etc. I got very good
> result from clusterization POV (I mean clusters are quite correct having in
> mind the input documents) but unfortunately the clusters contained mostly
> words which I would like to stop - and which words I placed in the
> stopwords.txt in Solr (re-indexed, restarted Solr, etc.).
> 
> Where do you suggest I debug the vector creation? Seems Solr respects the
> stopwords but not the vector creation (then clustering).

Re: Stopwords not working as expected

Posted by Bogdan Vatkov <bo...@gmail.com>.

Yesterday I had issues with mapping cluster results to dictionary entries -
it turned out that I was using a different dictionary, so the resulting
clusters showed really strange results.
But once I fixed all the commands, input/output files, etc., I got very good
results from a clustering point of view (I mean the clusters are quite correct
given the input documents). Unfortunately the clusters contained mostly
words which I would like to stop, and which I had placed in the
stopwords.txt in Solr (re-indexed, restarted Solr, etc.).

Where do you suggest I debug the vector creation? It seems Solr respects the
stopwords but the vector creation (and then the clustering) does not.

On Sun, Jan 3, 2010 at 4:02 PM, Grant Ingersoll <gs...@apache.org> wrote:

>
> On Jan 3, 2010, at 8:58 AM, Bogdan Vatkov wrote:
>
> > I have stopwords.txt file with 1200+ words, i did not understand this
> with
> > the stemming - you mean my stopwords are somehow ignored due to some
> > stemming or ?
>
> No, stopword removal happens before stemming so it is possible that a word
> that was not stopped was then stemmed to a stopword.
>
> I thought you said yesterday you got it straightened out.

-- 
Best regards,
Bogdan

Re: Stopwords not working as expected

Posted by Grant Ingersoll <gs...@apache.org>.

On Jan 3, 2010, at 8:58 AM, Bogdan Vatkov wrote:

> I have stopwords.txt file with 1200+ words, i did not understand this with
> the stemming - you mean my stopwords are somehow ignored due to some
> stemming or ?

No, stopword removal happens before stemming so it is possible that a word that was not stopped was then stemmed to a stopword.

I thought you said yesterday you got it straightened out.
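
The ordering effect described above can be sketched with a toy analysis
chain. This is not Solr or Lucene code: the class name, the analyze method
and the toyStem function (which only strips a trailing "s") are assumptions
made for illustration, and the real Snowball stemmer behaves differently. The
sketch applies the stop filter before stemming, as in the posted config.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopThenStemSketch {

  /** Toy stemmer: strips one trailing "s". NOT the Snowball algorithm. */
  static String toyStem(String token) {
    return token.length() > 3 && token.endsWith("s")
        ? token.substring(0, token.length() - 1)
        : token;
  }

  /** Whitespace tokenizing, then case-insensitive stop filtering, then stemming. */
  static List<String> analyze(String text, Set<String> stopwords) {
    List<String> terms = new ArrayList<>();
    for (String token : text.trim().split("\\s+")) {
      if (token.isEmpty()) {
        continue;
      }
      String lower = token.toLowerCase(Locale.ROOT);
      // The stop filter only sees the unstemmed token.
      if (stopwords.contains(lower)) {
        continue;
      }
      terms.add(toyStem(lower));
    }
    return terms;
  }

  public static void main(String[] args) {
    Set<String> stopwords = new HashSet<>(Arrays.asList("cluster"));
    // "Cluster" is stopped, but "clusters" passes the stop filter and is
    // then stemmed to "cluster", so the stopword still reaches the index.
    System.out.println(analyze("Cluster of clusters", stopwords)); // prints [of, cluster]
  }
}
```

Under these assumptions, a stopword list would also have to contain the
inflected forms (or the stop filter would have to run on stemmed tokens) to
keep the stemmed term out of the index.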


Re: Stopwords not working as expected

Posted by Bogdan Vatkov <bo...@gmail.com>.

I have a stopwords.txt file with 1200+ words. I did not understand the point
about stemming - do you mean my stopwords are somehow ignored because of
stemming?

On Sun, Jan 3, 2010 at 3:53 PM, Grant Ingersoll <gs...@apache.org> wrote:

> Are you sure you have stopwords and it is not the result of stemming some
> other word?

-- 
Best regards,
Bogdan

Re: Stopwords not working as expected

Posted by Grant Ingersoll <gs...@apache.org>.

Are you sure you have stopwords and it is not the result of stemming some other word?


Re: Stopwords not working as expected

Posted by Bogdan Vatkov <bo...@gmail.com>.

My Solr config is like the default one:

   <field name="msg_body" type="text" termVectors="true" indexed="true"
          stored="true"/>

   <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"
                protected="protwords.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="0"
                catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"
                protected="protwords.txt"/>
      </analyzer>
    </fieldType>

Re: Stopwords not working as expected

Posted by Ted Dunning <te...@gmail.com>.

It is possible to do stop-word processing at index-time or at query-time.
It is generally good practice, except in extreme applications, to do it at
query time so that you have the use of the stop words in phrases.  Classic
examples are searching for "The Inc" (a company name) or "to be or not to be"
(a famous quote).

I can't comment on your Solr setup, but it is plausible that Solr is
stopping at query-time and leaving the stop words in your index to be found
by the vectorizer.  Perhaps Grant can comment more authoritatively on how
Solr works.
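
To illustrate the difference, here is a toy sketch of which terms end up in
an index when stop words are removed at index time versus only at query time.
It does not use Solr or Lucene; the class and the indexTerms method are
hypothetical, and the configuration posted in this thread lists the stop
filter in both analyzers, so this shows only the scenario described above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

final class QueryTimeStoppingSketch {

  /** Collects the distinct terms an index would hold for the given documents. */
  static SortedSet<String> indexTerms(List<String> docs, Set<String> indexStopwords) {
    SortedSet<String> terms = new TreeSet<>();
    for (String doc : docs) {
      for (String token : doc.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
        if (!token.isEmpty() && !indexStopwords.contains(token)) {
          terms.add(token);
        }
      }
    }
    return terms;
  }

  public static void main(String[] args) {
    List<String> docs = Arrays.asList("to be or not to be", "the cluster");
    Set<String> stopwords =
        new HashSet<>(Arrays.asList("to", "be", "or", "not", "the"));

    // Stopping only at query time: nothing is filtered while indexing, so
    // anything that reads terms straight from the index still sees stop words.
    System.out.println(indexTerms(docs, Collections.<String>emptySet()));
    // prints [be, cluster, not, or, the, to]

    // Stopping at index time: the stop words never reach the index.
    System.out.println(indexTerms(docs, stopwords));
    // prints [cluster]
  }
}
```

In the first case a phrase query such as "to be or not to be" can still
match, which is the benefit mentioned above, at the cost of leaving the stop
words visible to tools that read the index directly.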

On Sat, Jan 2, 2010 at 6:31 PM, Bogdan Vatkov <bo...@gmail.com> wrote:

> I am still not an expert in reading from Lucene index - is it possible that
> the Vector generation uses some "raw" reading of the Solr/Lucene index and
> thus getting the stopwords?
>

-- 
Ted Dunning, CTO
DeepDyve

Re: Stopwords not working as expected

Posted by Bogdan Vatkov <bo...@gmail.com>.

@Mahout experts: could you please elaborate on that?
It seems that I am successfully stopping quite a few words with the stopwords
mechanism in Solr (I do not get search results when querying for stopwords
through the localhost/solr/select interface), but this somehow has no effect
when the Solr index gets converted to vectors in the
org.apache.mahout.utils.vectors.lucene.Driver class.
As a result I get clusters which contain (and are even mainly driven by) the
stopwords...
I am still not an expert in reading from a Lucene index - is it possible that
the vector generation does some "raw" reading of the Solr/Lucene index and
thus picks up the stopwords?

Best regards,
Bogdan

On Sun, Jan 3, 2010 at 3:51 AM, Lance Norskog <go...@gmail.com> wrote:

> Fields are both stored and indexed. The stored copy is exactly what
> you sent in. The index is built with the "text" type's analysis stack
> and is not stored. This output has the stopwords removed. The output
> is not stored in one place, but parts of it are scattered around the
> Lucene index data structures.  When you search for one of these
> stopwords, you should not get any documents.
>
> On Sat, Jan 2, 2010 at 5:20 PM, Bogdan Vatkov <bo...@gmail.com> wrote:
> > Hi,
> >
> > I am using a default (example) configuration of Solr and there the
> > stopwording seems to be enabled for both indexing and querying of fields of
> > type "text".
> > I have a custom field which is of the "text" type.
> > I have extended the stopwords.txt file with lots of words but when I index
> > some documents the index contains stopwords - I can see this with the Luke
> > tool.
> > Am I supposed to see these terms in the index after they are declared in the
> > stopwords.txt file?
> > What could be wrong?
> >
> > Best regards,
> > Bogdan
> >
>
>
>
> --
> Lance Norskog
> goksron@gmail.com
>

-- 
Best regards,
Bogdan

Re: Stopwords not working as expected

Posted by Lance Norskog <go...@gmail.com>.

Fields are both stored and indexed. The stored copy is exactly what
you sent in. The index is built with the "text" type's analysis stack
and is not stored. This output has the stopwords removed. The output
is not stored in one place, but parts of it are scattered around the
Lucene index data structures.  When you search for one of these
stopwords, you should not get any documents.
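
The difference between the stored copy and the indexed terms can be sketched
with a toy index in plain Java. This is not how Lucene lays out its data
structures: StoredVsIndexedDemo, add, search and storedValue are hypothetical
names, and the analysis step is reduced to lowercasing, splitting on
non-word characters and dropping stopwords.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index; not Lucene's actual storage layout.
final class StoredVsIndexedDemo {
    private final Set<String> stopwords;
    private final List<String> stored = new ArrayList<>();               // verbatim copies
    private final Map<String, Set<Integer>> postings = new HashMap<>();  // term -> doc ids

    StoredVsIndexedDemo(Set<String> stopwords) {
        this.stopwords = stopwords;
    }

    // Stores the text unchanged and indexes only the analyzed, non-stopword terms.
    int add(String text) {
        int docId = stored.size();
        stored.add(text);
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty() || stopwords.contains(token)) {
                continue;
            }
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // Returns the ids of documents whose indexed terms include the given term.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    // Returns the stored copy exactly as it was sent in.
    String storedValue(int docId) {
        return stored.get(docId);
    }

    public static void main(String[] args) {
        StoredVsIndexedDemo index =
            new StoredVsIndexedDemo(new HashSet<>(Arrays.asList("the", "gmail")));
        int id = index.add("The cluster mail from gmail");

        System.out.println(index.search("gmail"));   // stopword: no documents
        System.out.println(index.search("cluster")); // ordinary term: finds the document
        System.out.println(index.storedValue(id));   // stored copy still contains "gmail"
    }
}
```

In this toy model a tool that displays the stored value will still show the
stopwords, while a tool that lists the indexed terms will not.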

On Sat, Jan 2, 2010 at 5:20 PM, Bogdan Vatkov <bo...@gmail.com> wrote:
> Hi,
>
> I am using a default (example) configuration of Solr and there the
> stopwording seems to be enabled for both indexing and querying of fields of
> type "text".
> I have a custom field which is of the "text" type.
> I have extended the stopwords.txt file with lots of words but when I index
> some documents the index contains stopwords - I can see this with the Luke
> tool.
> Am I supposed to see these terms in the index after they are declared in the
> stopwords.txt file?
> What could be wrong?
>
> Best regards,
> Bogdan
>

-- 
Lance Norskog
goksron@gmail.com