Posted to solr-user@lucene.apache.org by George Aroush <ge...@aroush.net> on 2008/07/03 03:16:55 UTC

schema.xml for CJK, German, French, etc.

Hi Folks,

Has anyone created a schema.xml for languages other than English?  I'd like to
see a working example, mainly for CJK, German and French.  If you have one,
could you share it?

To get started, I created the following for German:

  <fieldtype name="myfieldtype" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
              generateNumberParts="1" catenateWords="1" catenateNumbers="1"
              catenateAll="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="German"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldtype>

Will those filters work on German text?

Thanks.

-- George

RE: schema.xml for CJK, German, French, etc.

Posted by George Aroush <ge...@aroush.net>.

Thanks Erik!

Trouble is, I don't know those languages well enough to judge whether my
setup is correct, especially for CJK.

It's less problematic for European languages, but then again, should I be
using those English filters with the German SnowballPorterFilterFactory?
That is, will WordDelimiterFilterFactory work with a German filter?  Etc.

It would be nice if folks shared their settings (a generic one for each
language) so that we can add them to the Solr wiki.

-- George

Re: schema.xml for CJK, German, French, etc.

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Jul 2, 2008, at 9:16 PM, George Aroush wrote:
> Has anyone created a schema.xml for languages other than English?

Indeed.

> I'd like to see a working example, mainly for CJK, German and French.
> If you have one, could you share it?
>
> To get started, I created the following for German:
>
>  <fieldtype name="myfieldtype" class="solr.TextField">
>    <analyzer>
>      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
>              generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>              catenateAll="0"/>
>      <filter class="solr.LowerCaseFilterFactory"/>
>      <filter class="solr.SnowballPorterFilterFactory" language="German"/>
>      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>    </analyzer>
>  </fieldtype>
>
> Will those filters work on German text?

One tip that will help: visit http://localhost:8983/solr/admin/analysis.jsp
and run some sample text through it to check that you're getting the
tokenization you want.  Solr's analysis introspection is quite nice and
easy to tinker with.

Removing stop words before lowercasing won't quite work, though: StopFilter
is case-sensitive by default and stop word lists are generally all
lowercase, so capitalized stop words would slip through.  Other than moving
the StopFilterFactory after the LowerCaseFilterFactory, the chain seems
reasonable.
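
A minimal sketch of that relocation, assuming the rest of the chain is kept
exactly as posted and that stopwords.txt holds lowercase German stop words
(an assumption; the file's contents are not shown in this thread).  Only the
position of the stop filter changes:

  <fieldtype name="myfieldtype" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- Word delimiter runs first, while the original case is still visible -->
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
              generateNumberParts="1" catenateWords="1" catenateNumbers="1"
              catenateAll="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- Stop filter moved here, so it sees tokens that are already lowercased -->
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.SnowballPorterFilterFactory" language="German"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldtype>

Running a capitalized stop word such as "Der" through analysis.jsp before and
after the change is a quick way to check whether the new order behaves as
intended.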

As always, though, more concrete recommendations depend on what you want
to do with these languages.

	Erik