Posted to solr-user@lucene.apache.org by tushar kapoor <tu...@rediffmail.com> on 2008/12/05 11:18:15 UTC

Russian stopwords

I am trying to filter Russian stopwords but have not been successful.
I am using the following schema entry:

.....
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="0" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
......

Interestingly, Russian synonyms are working fine. Both English and Russian
synonyms are matched correctly in searches.

Also, if I add an English word to stopwords.txt, it gets filtered
correctly. It's only the Russian words that are not getting filtered as
stopwords.

Can someone explain this behaviour?

Thanks,
Tushar.
-- 
View this message in context: http://www.nabble.com/Russian-stopwords-tp20851093p20851093.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: Russian stopwords

Posted by Lance Norskog <go...@gmail.com>.

The default encoding on Windows is not UTF-8, which causes all sorts of
weirdness when you develop on Windows. On the upside, it has helped me find
all the places in my string-handling code that need an explicit encoding
name parameter, so it's not all bad.

Lance 
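
To illustrate the point about passing the encoding name explicitly, here is a
minimal Python sketch. It is not Solr code: read_stopwords and write_demo_file
are hypothetical helper names made up for this example. The idea is simply
that the reader names its encoding instead of relying on the platform default.

```python
import os
import tempfile


def read_stopwords(path, encoding="utf-8"):
    """Read one stopword per line, decoding with an explicitly named encoding.

    Hypothetical helper for illustration; without the encoding argument,
    open() would fall back to a platform-dependent default.
    """
    with open(path, "r", encoding=encoding) as handle:
        # Skip blank lines and strip surrounding whitespace.
        return [line.strip() for line in handle if line.strip()]


def write_demo_file(words):
    """Write the given words to a temporary UTF-8 file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write("\n".join(words) + "\n")
    return path
```

Reading back a file written by write_demo_file(["и", "the"]) with
read_stopwords returns the same two words on any platform, because neither
side depends on the default encoding.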


RE: Russian stopwords

Posted by tushar kapoor <tu...@rediffmail.com>.

Hi Steve,

You were right, it turned out to be an encoding issue, but a really weird
one. I was using Windows Notepad to save the stopwords file in UTF-8
encoding, whereas I was using EditPlus to save the synonyms file. That
was the only difference. The moment I switched to EditPlus for saving the
stopwords file, it started working for Russian, German and all other
languages.

Anyway, thanks for pointing me in the right direction.

Regards,
Tushar.
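
The thread does not say what Notepad did differently from EditPlus. One
possible cause, offered here only as an assumption, is a byte order mark
(BOM) that some editors prepend when saving as UTF-8. The Python sketch below
is a rough diagnostic, not part of Solr; describe_encoding is a hypothetical
helper name. It reports whether a file's bytes start with a UTF-8 BOM and
whether they decode as UTF-8 at all.

```python
import codecs


def describe_encoding(raw_bytes):
    """Classify raw file bytes as UTF-8 with BOM, plain UTF-8, or not UTF-8.

    Hypothetical diagnostic helper for illustration only.
    """
    if raw_bytes.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    try:
        raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return "not utf-8"
    return "utf-8"


def first_entry(raw_bytes):
    """Return the first line as a plain UTF-8 decoder sees it (BOM not stripped)."""
    return raw_bytes.decode("utf-8").splitlines()[0]
```

If a BOM is present, a plain UTF-8 decode keeps it as an invisible U+FEFF
character at the start of the first entry, so that entry no longer equals the
word as typed. To check a real file, pass in the bytes read from
open(path, "rb").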




RE: Russian stopwords

Posted by Steven A Rowe <sa...@syr.edu>.

Hi Tushar,

On 12/05/2008 at 5:18 AM, tushar kapoor wrote:
> I am trying to filter Russian stopwords but have not been
> successful with that.
[...]
>      <filter class="solr.StopFilterFactory" ignoreCase="true"
>              words="stopwords.txt"/>
>      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>              ignoreCase="true" expand="false"/>
[...]
> Interestingly, Russian synonyms are working fine. English and Russian
> synonyms get searched correctly.
>
> Also, if I add an English language word to stopwords.txt it
> gets filtered correctly. It's the Russian words that are not
> getting filtered as stopwords.

It might be an encoding issue - StopFilterFactory delegates stopword file reading to SolrResourceLoader.getLines(), which uses an InputStreamReader instantiated with the UTF-8 charset.  Is your stopwords.txt encoded as UTF-8?

It's strange that synonyms are working fine, though - SynonymFilterFactory reads in the synonyms file using the same mechanism as StopFilterFactory - is it possible that your synonyms file is encoded as UTF-8, but your stopwords file is encoded with a different encoding, perhaps KOI8-R?  Like UTF-8, KOI8-R includes the entirety of 7-bit ASCII, so English words would be properly decoded under UTF-8.

Steve
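
The KOI8-R hypothesis above can be sketched in a few lines of Python. This is
an illustration under stated assumptions, not Solr's actual code path:
decode_as_utf8 and survives_utf8_reader are hypothetical names, and the
sketch assumes the reader substitutes a replacement character for invalid
bytes rather than raising an error.

```python
def decode_as_utf8(raw_bytes):
    """Decode bytes as UTF-8, replacing invalid sequences with U+FFFD."""
    return raw_bytes.decode("utf-8", errors="replace")


def survives_utf8_reader(word, file_encoding):
    """Return True if `word`, saved in `file_encoding`, reads back intact as UTF-8.

    Hypothetical helper for illustration only.
    """
    return decode_as_utf8(word.encode(file_encoding)) == word
```

An English word saved as KOI8-R consists only of 7-bit ASCII bytes, so
survives_utf8_reader("the", "koi8_r") is true. A Russian word such as "это"
saved as KOI8-R consists of high bytes that are not valid UTF-8, so it comes
back as replacement characters and can never match a token in the index.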