
Posted to solr-user@lucene.apache.org by Daniel Brügge <da...@googlemail.com> on 2012/11/07 17:45:45 UTC

SolrCloud, Zookeeper and Stopwords with Umlaute or other special characters

Hi,

I am running a SolrCloud cluster on version 4.0.0. I have a stopwords file
that is in the correct encoding and contains German umlauts, e.g. 'ü'. I am
also running a standalone ZooKeeper, which holds this stopwords file. In my
schema I use the stopwords file in the standard way:

>
>     <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory"
>                 ignoreCase="true"
>                 words="my_stopwords.txt"
>                 enablePositionIncrements="true" />


While indexing, I noticed that all stopwords without umlauts are correctly
removed, but the ones with umlauts are still present.

Is this a problem with ZK or Solr?

Thanks & regards

Daniel

Re: SolrCloud, Zookeeper and Stopwords with Umlaute or other special characters

Posted by Daniel Brügge <da...@googlemail.com>.

Ah, I have fixed it. The files had to be imported into ZooKeeper with the
file.encoding system property set to UTF-8. Then it worked. Hooray. :)

e.g.

java -Dfile.encoding=UTF-8 -Dbootstrap_confdir=/home/me/myconfdir \
  -Dcollection.configName=config1 -DzkHost="zkhost:2181" -DnumShards=2 \
  -Dsolr.solr.home=/home/me/solr -jar start.jar
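
A minimal sketch of how a non-UTF-8 default charset could damage the file
on upload, written in Python for brevity. It assumes the upload step
decodes the file with the JVM's default charset and then stores the result
as UTF-8; that is an assumption about the mechanism, not something
confirmed in this thread. The names roundtrip and show_ascii are
hypothetical.

```python
def roundtrip(raw, default_charset):
    """Decode bytes with the platform default charset, then store as UTF-8.

    Bytes that are invalid in the default charset become U+FFFD.
    """
    text = raw.decode(default_charset, errors="replace")
    return text.encode("utf-8")


def show_ascii(data):
    """Render bytes the way an ASCII-only console might: non-ASCII as '?'."""
    return "".join(chr(b) if b < 128 else "?" for b in data)


original = "würde".encode("utf-8")        # 'ü' is the two bytes C3 BC

intact = roundtrip(original, "utf-8")     # comparable to -Dfile.encoding=UTF-8
damaged = roundtrip(original, "ascii")    # comparable to an ASCII default charset

print(intact == original)     # the UTF-8 round trip leaves the bytes unchanged
print(show_ascii(damaged))    # each byte of 'ü' became a three-byte U+FFFD
```

Under these assumptions the damaged word renders as w??????rde: two
replacement characters of three bytes each. That is consistent with the six
question marks in the 'get' output quoted in this thread, although the
match alone does not prove this is what happened.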



On Thu, Nov 8, 2012 at 2:09 PM, Daniel Brügge <daniel.bruegge@googlemail.com
> wrote:

> Weird, if i return the file contents in ZK with 'get' it returns me
>
> w??????rde          |  would
> w??????rden         |  would
>
> for example. So the Umlaute are not shown. Does anyone have an idea if
> this is because of Zookeepers cli or of the file contents itself?
>
> Thanks & regards.

Re: SolrCloud, Zookeeper and Stopwords with Umlaute or other special characters

Posted by Daniel Brügge <da...@googlemail.com>.

Weird: if I fetch the file contents from ZooKeeper with 'get', I get, for
example:

w??????rde          |  would
w??????rden         |  would

So the umlauts are not shown. Does anyone know whether this is caused by
ZooKeeper's CLI or by the file contents themselves?

Thanks & regards.
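
One way to separate the two possibilities is to look at the stored bytes
rather than at what a console prints. A minimal sketch, assuming the raw
bytes of the znode have already been fetched by some ZooKeeper client (not
shown here); diagnose is a hypothetical helper, and the sample byte strings
are made up for illustration.

```python
# UTF-8 encoding of U+FFFD, the Unicode replacement character.
REPLACEMENT = "\ufffd".encode("utf-8")   # b'\xef\xbf\xbd'


def diagnose(data):
    """Classify raw file bytes without relying on console rendering."""
    if REPLACEMENT in data:
        return "damaged: replacement characters are stored in the content"
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "not UTF-8: the content is in some other encoding"
    if any(ord(ch) > 127 for ch in text):
        return "intact: valid UTF-8 with non-ASCII characters"
    return "ASCII only: no umlauts, or they were replaced by literal '?'"


print(diagnose("würde".encode("utf-8")))
print(diagnose(b"w\xef\xbf\xbd\xef\xbf\xbdrde"))
print(diagnose(b"w\xfcrde"))
print(diagnose(b"w?rde"))
```

If the bytes come back as "intact", the question marks are only a display
problem of the CLI; any other result points at the stored content.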

On Thu, Nov 8, 2012 at 12:24 PM, Daniel Brügge <
daniel.bruegge@googlemail.com> wrote:

> I trust the 'file' command output. And if i can read there "UTF-8 Unicode"
> I believe that this is correct. Don't know if this is the 'correct answer'
> for you ;)
>
> BTW: It works locally, but not with ZK. So it's maybe more a ZK issue,
> which
> somehow destroys my file. Will check.

Re: SolrCloud, Zookeeper and Stopwords with Umlaute or other special characters

Posted by Daniel Brügge <da...@googlemail.com>.

I trust the output of the 'file' command: if it reports "UTF-8 Unicode",
I believe that is correct. I don't know if this is the 'correct answer'
for you ;)

BTW: It works locally, but not with ZK. So it may be more of a ZK issue
that somehow corrupts my file. I will check.

On Thu, Nov 8, 2012 at 12:12 PM, Robert Muir <rc...@gmail.com> wrote:

> On Wed, Nov 7, 2012 at 11:45 AM, Daniel Brügge
> <da...@googlemail.com> wrote:
> > Hi,
> >
> > i am running a SolrCloud cluster with the 4.0.0 version. I have a
> stopwords
> > file
> > which is in the correct encoding.
>
> What makes you think that?
>
> Note: "Because I can read it" is not the correct answer.
>
> Ensure any of your stopwords files etc are in UTF-8. This is often
> different from the encoding your computer uses by default if you open
> a file, start typing in it, and press save.
>

Re: SolrCloud, Zookeeper and Stopwords with Umlaute or other special characters

Posted by Robert Muir <rc...@gmail.com>.

On Wed, Nov 7, 2012 at 11:45 AM, Daniel Brügge
<da...@googlemail.com> wrote:
> Hi,
>
> i am running a SolrCloud cluster with the 4.0.0 version. I have a stopwords
> file
> which is in the correct encoding.

What makes you think that?

Note: "Because I can read it" is not the correct answer.

Ensure any of your stopwords files etc are in UTF-8. This is often
different from the encoding your computer uses by default if you open
a file, start typing in it, and press save.
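
A stricter check than reading the file by eye is to decode its bytes as
UTF-8 and see whether that fails. A minimal sketch; check_utf8 is a
hypothetical helper, and the byte order mark is reported only because it is
invisible in most editors, not because it is known to cause this problem.

```python
import codecs


def check_utf8(data):
    """Return a list of problems found in the raw bytes of a text file."""
    problems = []
    if data.startswith(codecs.BOM_UTF8):
        problems.append("starts with a UTF-8 byte order mark")
    try:
        data.decode("utf-8")   # strict decoding raises on any invalid byte
    except UnicodeDecodeError as exc:
        problems.append("invalid UTF-8 at byte offset %d" % exc.start)
    return problems


print(check_utf8("würde\n".encode("utf-8")))     # clean UTF-8
print(check_utf8("würde\n".encode("latin-1")))   # Latin-1 bytes are rejected
print(check_utf8(codecs.BOM_UTF8 + b"der\n"))    # BOM at the start
```

For a real file, the bytes would come from opening it in binary mode, for
example open("my_stopwords.txt", "rb").read().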

Re: SolrCloud, Zookeeper and Stopwords with Umlaute or other special characters

Posted by Daniel Brügge <da...@googlemail.com>.

When I look at the text_de fieldType provided in the example schema, I can
see:

>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
>                 words="lang/stopwords_de.txt" format="snowball"
>                 enablePositionIncrements="true"/>
>         <filter class="solr.GermanNormalizationFilterFactory"/>
>         <filter class="solr.GermanLightStemFilterFactory"/>


I tried this, and it removed the words with umlauts. It seems that this is
because of format="snowball". I had not used it because I thought I had one
word per line. But maybe some invisible characters got into my stopwords
file and broke it.

Thanks.

Daniel
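
To illustrate why the format attribute can matter, here is a simplified
sketch of the two ways a stopwords file can be read: one entry per line
versus the Snowball convention, where '|' starts a comment and a line may
hold several words. Both parsers are approximations written for this note,
not Solr's actual code, and the sample text is an assumption modeled on the
lines shown in the ZooKeeper 'get' output in this thread.

```python
def parse_plain(text):
    """One entry per line; lines starting with '#' are treated as comments."""
    words = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            words.add(line)
    return words


def parse_snowball(text):
    """Drop everything after '|', then take each whitespace-separated word."""
    words = set()
    for line in text.splitlines():
        before_comment = line.split("|", 1)[0]
        words.update(before_comment.split())
    return words


sample = "würde          |  would\nwürden         |  would\nder\n"

print("würde" in parse_snowball(sample))   # True: the comment is stripped
print("würde" in parse_plain(sample))      # False: the whole line is one entry
print("der" in parse_plain(sample))        # True: a bare word works either way
```

In this sketch, a line with a trailing '|' comment only yields a usable
stopword in the Snowball reading, while a bare word works in both.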

On Thu, Nov 8, 2012 at 10:36 AM, Daniel Brügge <
daniel.bruegge@googlemail.com> wrote:

> Yes, I did this and the Words with the Umlaute went through the
> Stopfilter. The ones without Umlaute were correctly removed.

Re: SolrCloud, Zookeeper and Stopwords with Umlaute or other special characters

Posted by Daniel Brügge <da...@googlemail.com>.

Yes, I did this, and the words with umlauts passed through the stop filter.
The ones without umlauts were correctly removed.
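
This symptom is what one would expect if the umlaut entries in the loaded
stopword set no longer match the tokens byte for byte. A minimal sketch of
a case-insensitive stop filter, not Solr's implementation; the damaged set
is a made-up example in which each byte of 'ü' was replaced by U+FFFD.

```python
def stop_filter(tokens, stopwords, ignore_case=True):
    """Return the tokens that are not in the stopword set."""
    if ignore_case:
        stopwords = {word.lower() for word in stopwords}
    kept = []
    for token in tokens:
        key = token.lower() if ignore_case else token
        if key not in stopwords:
            kept.append(token)
    return kept


tokens = ["Der", "Hund", "würde", "bellen"]

good = {"der", "würde"}
damaged = {"der", "w\ufffd\ufffdrde"}   # hypothetical result of a lossy load

print(stop_filter(tokens, good))      # both stopwords are removed
print(stop_filter(tokens, damaged))   # 'würde' survives, 'Der' is still removed
```

The plain ASCII entry keeps working in both cases, which matches the
observation that only the words with umlauts get through.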

On Thu, Nov 8, 2012 at 2:22 AM, Lance Norskog <go...@gmail.com> wrote:

> You can debug this with the 'Analysis' page in the Solr UI. You pick
> 'text_general' and then give words with umlauts in the text box for
> indexing and queries.
>
> Lance

Re: SolrCloud, Zookeeper and Stopwords with Umlaute or other special characters

Posted by Lance Norskog <go...@gmail.com>.

You can debug this with the 'Analysis' page in the Solr UI. Select the 'text_general' field type and enter words with umlauts in the text boxes for indexing and queries.

Lance
