Posted to solr-user@lucene.apache.org by Daniel Brügge <da...@googlemail.com> on 2012/11/07 17:45:45 UTC
SolrCloud, Zookeeper and Stopwords with Umlaute or other special characters
Hi,
I am running a SolrCloud cluster on version 4.0.0. I have a stopwords file
which is in the correct encoding. It contains German umlauts, e.g. 'ü'.
I am also running a standalone ZooKeeper which contains this stopwords
file. In my schema I am using the stopwords file in the standard way:
>
> <fieldType name="text_general" class="solr.TextField"
> positionIncrementGap="100">
> <analyzer type="index">
> <tokenizer class="solr.StandardTokenizerFactory"/>
> <filter class="solr.StopFilterFactory"
> ignoreCase="true"
> words="my_stopwords.txt"
> enablePositionIncrements="true" />
While indexing I noticed that all stopwords without umlauts are correctly
removed, but the ones with umlauts still exist.
Is this a problem with ZK or Solr?
Thanks & regards
Daniel
Re: SolrCloud, Zookeeper and Stopwords with Umlaute or other special characters
Posted by Daniel Brügge <da...@googlemail.com>.
Ah, I have fixed it. The files had to be imported into ZooKeeper with the
file.encoding system property set to UTF-8. Then it worked. Hooray. :)
e.g.
java -Dfile.encoding=UTF-8 -Dbootstrap_confdir=/home/me/myconfdir \
  -Dcollection.configName=config1 -DzkHost="zkhost:2181" -DnumShards=2 \
  -Dsolr.solr.home=/home/me/solr -jar start.jar
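
A minimal sketch of why the charset used while loading the file can decide whether a stopword matches. It is not Solr code: the sample words, the helper names and the "ascii" fallback charset are all assumptions for illustration, since the thread does not say which default charset the JVM had before file.encoding was set.

```python
# Hypothetical stand-in for a stopwords file that is stored as UTF-8 bytes.
STOPWORDS_UTF8 = "aber\nwürde\nwürden\n".encode("utf-8")


def load_stopwords(raw, charset):
    """Decode the raw file with the given charset, one stopword per line."""
    text = raw.decode(charset, errors="replace")
    return {line.strip() for line in text.splitlines() if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stopword set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopwords]


tokens = ["ich", "würde", "aber", "gehen"]

# Assumed wrong default charset: non-ASCII bytes become replacement characters.
wrong = load_stopwords(STOPWORDS_UTF8, "ascii")
# Charset forced to UTF-8, analogous to -Dfile.encoding=UTF-8.
right = load_stopwords(STOPWORDS_UTF8, "utf-8")

filtered_wrong = remove_stopwords(tokens, wrong)
filtered_right = remove_stopwords(tokens, right)
print(filtered_wrong)
print(filtered_right)
```

With the wrong charset only the pure-ASCII stopword is removed, which matches the symptom described in the first mail.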
On Thu, Nov 8, 2012 at 2:09 PM, Daniel Brügge <daniel.bruegge@googlemail.com> wrote:
> Weird, if I fetch the file contents from ZK with 'get', it returns
>
> w??????rde | would
> w??????rden | would
>
> for example. So the umlauts are not shown. Does anyone have an idea
> whether this is caused by ZooKeeper's CLI or by the file contents?
>
> Thanks & regards.
Re: SolrCloud, Zookeeper and Stopwords with Umlaute or other special characters
Posted by Daniel Brügge <da...@googlemail.com>.
Weird, if I fetch the file contents from ZK with 'get', it returns
w??????rde | would
w??????rden | would
for example. So the umlauts are not shown. Does anyone have an idea whether
this is caused by ZooKeeper's CLI or by the file contents themselves?
Thanks & regards.
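
One possible reading of the six question marks, sketched under assumptions: the two UTF-8 bytes of 'ü' are decoded with an ASCII-like charset, each becomes a replacement character, the result is stored as UTF-8 again (three bytes per replacement character), and a console that cannot show non-ASCII bytes prints one '?' per byte. The thread does not confirm this chain; the function names are hypothetical.

```python
def simulate_round_trip(word, wrong_charset="ascii"):
    """Encode as UTF-8, misread with another charset, store as UTF-8 again."""
    utf8_bytes = word.encode("utf-8")                              # 'ü' -> 2 bytes
    misread = utf8_bytes.decode(wrong_charset, errors="replace")   # 2 x U+FFFD
    return misread.encode("utf-8")                                 # 2 x 3 bytes


def render_like_ascii_console(data):
    """Show every non-ASCII byte as '?', as a plain ASCII console might."""
    return "".join(chr(b) if b < 128 else "?" for b in data)


stored = simulate_round_trip("würde")
shown = render_like_ascii_console(stored)
print(shown)
```

If this reading is right, the damage is in the stored bytes and not only in how the CLI displays them.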
On Thu, Nov 8, 2012 at 12:24 PM, Daniel Brügge <daniel.bruegge@googlemail.com> wrote:
> I trust the output of the 'file' command. If it reports "UTF-8 Unicode",
> I believe that is correct. Don't know if this is the 'correct answer'
> for you ;)
>
> BTW: It works locally, but not with ZK. So it is maybe more of a ZK
> issue, which somehow destroys my file. Will check.
Re: SolrCloud, Zookeeper and Stopwords with Umlaute or other special characters
Posted by Daniel Brügge <da...@googlemail.com>.
I trust the output of the 'file' command. If it reports "UTF-8 Unicode",
I believe that is correct. Don't know if this is the 'correct answer'
for you ;)
BTW: It works locally, but not with ZK. So it is maybe more of a ZK issue,
which somehow destroys my file. Will check.
On Thu, Nov 8, 2012 at 12:12 PM, Robert Muir <rc...@gmail.com> wrote:
> On Wed, Nov 7, 2012 at 11:45 AM, Daniel Brügge
> <da...@googlemail.com> wrote:
> > Hi,
> >
> > I am running a SolrCloud cluster with the 4.0.0 version. I have a
> > stopwords file which is in the correct encoding.
>
> What makes you think that?
>
> Note: "Because I can read it" is not the correct answer.
>
> Ensure any of your stopwords files etc are in UTF-8. This is often
> different from the encoding your computer uses by default if you open
> a file, start typing in it, and press save.
>
Re: SolrCloud, Zookeeper and Stopwords with Umlaute or other special characters
Posted by Robert Muir <rc...@gmail.com>.
On Wed, Nov 7, 2012 at 11:45 AM, Daniel Brügge
<da...@googlemail.com> wrote:
> Hi,
>
> I am running a SolrCloud cluster with the 4.0.0 version. I have a
> stopwords file which is in the correct encoding.
What makes you think that?
Note: "Because I can read it" is not the correct answer.
Ensure any of your stopwords files etc are in UTF-8. This is often
different from the encoding your computer uses by default if you open
a file, start typing in it, and press save.
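
A small sketch of one way to check the encoding more strictly than by eye: decode the raw bytes as UTF-8 in strict mode and list the lines that contain non-ASCII characters. The function name and the sample bytes are assumptions for illustration; a file that passes is valid UTF-8, although that alone does not prove it was meant as UTF-8.

```python
def check_utf8(raw):
    """Return (is_valid_utf8, [(line_number, line), ...] for non-ASCII lines)."""
    try:
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # At least one byte sequence is not legal UTF-8 (e.g. Latin-1 'ü' = 0xFC).
        return False, []
    non_ascii = [
        (number, line)
        for number, line in enumerate(text.splitlines(), 1)
        if any(ord(ch) > 127 for ch in line)
    ]
    return True, non_ascii


# Hypothetical sample contents in two different encodings.
utf8_result = check_utf8("aber\nwürde\n".encode("utf-8"))
latin1_result = check_utf8("aber\nwürde\n".encode("latin-1"))
print(utf8_result)
print(latin1_result)
```

For a real file the bytes could be read with `open("my_stopwords.txt", "rb").read()` and passed to `check_utf8`.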
Re: SolrCloud, Zookeeper and Stopwords with Umlaute or other special characters
Posted by Daniel Brügge <da...@googlemail.com>.
When I look at the text_de fieldType provided in the example schema, I can
see:
>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="lang/stopwords_de.txt" format="snowball"
> enablePositionIncrements="true"/>
> <filter class="solr.GermanNormalizationFilterFactory"/>
> <filter class="solr.GermanLightStemFilterFactory"/>
I tried this, and it removed the words with umlauts. It seems that this is
because of format="snowball". I hadn't used it because I thought I had one
word per line. But maybe some invisible characters got into my stopword
file and broke it.
Thanks.
Daniel
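
A rough sketch of how the two stopword formats differ, as I understand them: the default format treats each trimmed line as one entry (with '#' comment lines), while format="snowball" cuts everything after '|' and splits the rest on whitespace. This is an approximation written for illustration, not the Lucene implementation, and the sample lines are modelled on the 'get' output quoted elsewhere in this thread.

```python
def parse_wordset(text):
    """Approximate default format: one entry per line, '#' starts a comment line."""
    words = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            words.add(line)
    return words


def parse_snowball(text):
    """Approximate snowball format: '|' starts a comment, words split on whitespace."""
    words = set()
    for line in text.splitlines():
        content = line.split("|", 1)[0]
        words.update(content.split())
    return words


# Hypothetical file contents with trailing '| comment' parts.
sample = "würde | would\nwürden | would\naber\n"

as_wordset = parse_wordset(sample)
as_snowball = parse_snowball(sample)
print(sorted(as_wordset))
print(sorted(as_snowball))
```

Under these assumptions a line with a trailing '| ...' comment only yields the bare word when it is parsed as snowball, so the format setting may matter in addition to the encoding.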
On Thu, Nov 8, 2012 at 10:36 AM, Daniel Brügge <daniel.bruegge@googlemail.com> wrote:
> Yes, I did this, and the words with umlauts went through the stop
> filter. The ones without umlauts were correctly removed.
Re: SolrCloud, Zookeeper and Stopwords with Umlaute or other special characters
Posted by Daniel Brügge <da...@googlemail.com>.
Yes, I did this, and the words with umlauts went through the stop filter.
The ones without umlauts were correctly removed.
On Thu, Nov 8, 2012 at 2:22 AM, Lance Norskog <go...@gmail.com> wrote:
> You can debug this with the 'Analysis' page in the Solr UI. You pick
> 'text_general' and then give words with umlauts in the text box for
> indexing and queries.
>
> Lance
Re: SolrCloud, Zookeeper and Stopwords with Umlaute or other special characters
Posted by Lance Norskog <go...@gmail.com>.
You can debug this with the 'Analysis' page in the Solr UI. Pick 'text_general' and enter words with umlauts in the text boxes for indexing and queries.
Lance
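
The same check can also be scripted. The sketch below only builds a request URL for the field analysis handler; it assumes the handler is registered at /analysis/field in solrconfig.xml, and the host, port and collection name are hypothetical placeholders.

```python
from urllib.parse import urlencode


def analysis_url(base_url, field_type, text):
    """Build a field-analysis request URL for the given field type and text."""
    params = {
        "analysis.fieldtype": field_type,   # field type to run the analyzer of
        "analysis.fieldvalue": text,        # text analysed as at index time
        "wt": "json",
    }
    return base_url.rstrip("/") + "/analysis/field?" + urlencode(params)


# Hypothetical host and collection name.
url = analysis_url("http://localhost:8983/solr/collection1", "text_general", "würde")
print(url)
```

The umlaut is percent-encoded as UTF-8 in the URL; the response would show the tokens after each stage of the analyzer chain.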