Posted to solr-user@lucene.apache.org by Stefan Oestreicher <st...@netdoktor.at> on 2008/07/15 16:29:29 UTC

WordDelimiterFilter splits at non-ASCII chars

Hi,

As I understand it, the WordDelimiterFilter should split on case changes, word
delimiters and changes from letter to digit, but it should not
differentiate between ASCII and multibyte chars. However, it does. The word
"hälse" (German plural of "neck") gets split into "h", "ä" and "lse", which
unfortunately renders this filter quite unusable for me. Am I missing
something, or is this a bug?
I'm using Solr 1.3 built from trunk.

TIA,
 
Stefan Oestreicher

Re: WordDelimiterFilter splits at non-ASCII chars

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

Hi Stefan,

I wrote a test case for the problem you described, but it works fine for me. I
used the following definition:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="0" catenateWords="0" catenateNumbers="0"
catenateAll="0" splitOnCaseChange="0" preserveOriginal="0"/>

What configuration are you using? If it is different, please share it so
that I can test with it.

-- 
Regards,
Shalin Shekhar Mangar.
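
The splitting rules discussed above (word delimiters, letter/digit
transitions, optional case changes) can be illustrated with a rough sketch
in plain Java. This is not Solr's WordDelimiterFilter and covers only a
small part of what the real filter does; the class name SimpleWordSplitter
and its split method are hypothetical. The point it tries to show is that
when characters are classified with the Unicode-aware java.lang.Character
methods, "ä" counts as a letter like any other, so a correctly decoded
"hälse" stays in one piece.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for the rules described in this thread.
// It is NOT Solr's implementation and ignores most of the real options.
final class SimpleWordSplitter {
    private static final int LETTER = 0;
    private static final int DIGIT = 1;
    private static final int DELIMITER = 2;

    // Classify with Unicode-aware methods, so non-ASCII letters are letters.
    private static int kind(char c) {
        if (Character.isLetter(c)) return LETTER;
        if (Character.isDigit(c)) return DIGIT;
        return DELIMITER;
    }

    static List<String> split(String input, boolean splitOnCaseChange) {
        List<String> parts = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            int k = kind(c);
            if (k == DELIMITER) {
                // Delimiters end the current part and are dropped.
                flush(current, parts);
                continue;
            }
            if (current.length() > 0) {
                char prev = current.charAt(current.length() - 1);
                boolean kindChange = kind(prev) != k;
                boolean caseChange = splitOnCaseChange
                        && Character.isLowerCase(prev)
                        && Character.isUpperCase(c);
                if (kindChange || caseChange) {
                    flush(current, parts);
                }
            }
            current.append(c);
        }
        flush(current, parts);
        return parts;
    }

    private static void flush(StringBuilder current, List<String> parts) {
        if (current.length() > 0) {
            parts.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        // "h\u00e4lse" is "hälse", written with an escape to avoid
        // depending on the source file encoding.
        System.out.println(split("h\u00e4lse", true));
        System.out.println(split("PowerShot500", true));
    }
}
```

If a sketch like this keeps the word whole while the running system splits
it, the input reaching the filter is probably not the string that was typed.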

Re: WordDelimiterFilter splits at non-ASCII chars

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Jul 16, 2008, at 4:33 AM, Stefan Oestreicher wrote:
> Yes you're right. I was testing with analysis.jsp but it chokes on
> multibyte chars.
> I modified the jsp and set the encoding using
> request.setCharacterEncoding("UTF-8");
> and it's working fine. Bug in analysis.jsp?

Yeah, it's recently been fixed though:

r676183 | yonik | 2008-07-12 10:59:12 -0400 (Sat, 12 Jul 2008) | 1 line
SOLR-501: Fix admin/analysis.jsp UTF-8 input for some other servlet containers such as Tomcat

RE: WordDelimiterFilter splits at non-ASCII chars

Posted by Stefan Oestreicher <st...@netdoktor.at>.

Yes, you're right. I was testing with analysis.jsp, but it chokes on multibyte
chars.
I modified the JSP and set the encoding using
request.setCharacterEncoding("UTF-8");
and now it's working fine. Is this a bug in analysis.jsp?

thanks,
 
Stefan Oestreicher 

> -----Original Message-----
> From: yseeley@gmail.com [mailto:yseeley@gmail.com] On Behalf Of Yonik Seeley
> Sent: Tuesday, July 15, 2008 6:29 PM
> To: solr-user@lucene.apache.org
> Subject: Re: WordDelimiterFilter splits at non-ASCII chars
>
> On Tue, Jul 15, 2008 at 10:29 AM, Stefan Oestreicher
> <st...@netdoktor.at> wrote:
> > as I understand the WordDelimiterFilter should split on case changes,
> > word delimiters and changes from character to digit, but it should not
> > differentiate between ASCII and multibyte chars. It does however. The
> > word "hälse" (german plural of "neck") gets split into "h", "ä" and
> > "lse", which unfortunately renders this filter quite unusable for me.
> > Am i missing something or is this a bug?
> > I'm using solr 1.3 built from trunk.
>
> Look for charset issues in communicating with Solr.  I just tried this
> with the "text" field via Solr's analysis.jsp and it works fine.
> 
> -Yonik
>
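
One plausible mechanism behind the split reported in this thread is a
charset mismatch: the browser submits the form value as UTF-8 bytes, and
the servlet container decodes them as ISO-8859-1, the servlet default when
no request encoding has been set. The sketch below reproduces that
mismatch in plain Java. It is an illustration only, not code from Solr or
analysis.jsp, and the class and method names are made up. It shows that the
two UTF-8 bytes of "ä" turn into an uppercase letter followed by a
non-letter symbol, which is exactly the kind of input a word-splitting
filter would break apart.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Solr.
final class CharsetMismatchDemo {

    // Encode as UTF-8 (what the browser is assumed to send), then decode as
    // ISO-8859-1 (what a container may assume without an explicit
    // request.setCharacterEncoding("UTF-8") call).
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // List each char with its code point and classification so that the
    // letter / non-letter boundaries become visible.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("U+%04X letter=%b upper=%b%n",
                    (int) c, Character.isLetter(c), Character.isUpperCase(c)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String word = "h\u00e4lse"; // "hälse", escaped to be encoding-safe
        System.out.println("As typed:");
        System.out.print(describe(word));
        System.out.println("After the UTF-8 / ISO-8859-1 mismatch:");
        System.out.print(describe(misdecode(word)));
    }
}
```

Setting the request encoding before any parameter is read, as done in the
modified JSP above, avoids the second decoding step shown here.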

Re: WordDelimiterFilter splits at non-ASCII chars

Posted by Yonik Seeley <yo...@apache.org>.

On Tue, Jul 15, 2008 at 10:29 AM, Stefan Oestreicher
<st...@netdoktor.at> wrote:
> as I understand the WordDelimiterFilter should split on case changes, word
> delimiters and changes from character to digit, but it should not
> differentiate between ASCII and multibyte chars. It does however. The word
> "hälse" (german plural of "neck") gets split into "h", "ä" and "lse", which
> unfortunately renders this filter quite unusable for me. Am i missing
> something or is this a bug?
> I'm using solr 1.3 built from trunk.

Look for charset issues in communicating with Solr.  I just tried this
with the "text" field via Solr's analysis.jsp and it works fine.

-Yonik
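
When checking for charset issues on the client side, it can help to look at
what actually goes over the wire. The following sketch percent-encodes a
query value as UTF-8 using the standard java.net.URLEncoder. The class name
and the example URL (host, port, path and the "q" parameter) are
assumptions for illustration and may not match a given installation; the
container must also be configured to decode the URI as UTF-8 for the round
trip to work.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper; not part of Solr or SolrJ.
final class QueryEncodingDemo {

    // Percent-encode a parameter value as UTF-8.
    static String encodeUtf8(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Decode with the same charset to confirm the round trip is lossless.
    static String decodeUtf8(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String word = "h\u00e4lse"; // "hälse", escaped to be encoding-safe
        String encoded = encodeUtf8(word);
        // Assumed example URL; adjust host, port and path to your setup.
        System.out.println("http://localhost:8983/solr/select?q=" + encoded);
        System.out.println(decodeUtf8(encoded).equals(word));
    }
}
```

If the encoded value contains a single byte for "ä" (for example %E4)
instead of the two-byte UTF-8 sequence, the client is sending a different
charset than a UTF-8 configured server expects.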