Posted to solr-user@lucene.apache.org by Engy Morsy <En...@bibalex.org> on 2011/06/29 14:03:51 UTC

Encoding problem while indexing

I am indexing Arabic documents that contain Arabic diacritics and dotless characters (old Arabic characters). I am running Solr on Apache Tomcat, and I am using my own modified version of the AraMorph analyzer as the Arabic analyzer. On the development environment I managed to normalize the Arabic diacritics and dotless characters (the same concept as in solr.ArabicNormalizationFilterFactory), and I can verify that the analyzer works fine and produces the correct stems for Arabic words. The input text file used for testing is UTF-8 encoded.
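
For illustration, the core of my normalization step works roughly like this (a minimal sketch, not the actual AraMorph code; the method name and the diacritic range are my simplification):

    // Sketch only (hypothetical helper, not the real aramorph code):
    // drop Arabic diacritics (tashkeel, U+064B..U+0652) and tatweel
    // (U+0640) before stemming.
    public static String stripDiacritics(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            boolean diacritic = (c >= '\u064B' && c <= '\u0652') || c == '\u0640';
            if (!diacritic) {
                sb.append(c);
            }
        }
        return sb.toString();
    }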

However, when I build the AraMorph jar file and place it under the Solr lib directory, the diacritics and the dotless characters split the word. I made sure that server.xml sets URIEncoding="UTF-8" on the HTTP connector.
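
For reference, this is the relevant Connector element in conf/server.xml (the port and timeout values are the stock Tomcat defaults):

    <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443"
               URIEncoding="UTF-8"/>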

I also made sure that the text being sent to Solr via SolrJ is UTF-8 encoded, for example:

    solr.addBean(new Doc("4", new String("حِباًَ".getBytes("UTF8"))));
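
For what it's worth, a minimal round-trip check with explicit charsets on both sides prints true (new String(bytes) without a charset argument would decode with the platform default encoding, so I pass it explicitly here):

    String original = "حِباًَ";
    byte[] utf8 = original.getBytes("UTF-8");       // encode with explicit charset
    String decoded = new String(utf8, "UTF-8");     // decode with explicit charset
    System.out.println(original.equals(decoded));   // prints true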

But nothing is working.

I also tried the Analysis page in the Solr admin UI for both indexing and querying, and both show that the Arabic word is split whenever a diacritic or dotless character is found.

Do you have any idea what the problem might be?


Schema snippet:

<fieldType name="text" class="solr.TextField">
  <analyzer type="index" class="gpl.pierrick.brihaye.aramorph.lucene.ArabicNormalizeStemmer"/>
  <analyzer type="query" class="gpl.pierrick.brihaye.aramorph.lucene.ArabicNormalizeStemmer"/>
</fieldType>
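
For comparison, the equivalent chain built from the stock Solr factories would look roughly like this (the field type name text_ar is just an example):

    <fieldType name="text_ar" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.ArabicNormalizationFilterFactory"/>
        <filter class="solr.ArabicStemFilterFactory"/>
      </analyzer>
    </fieldType>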

I also added the following parameter to the JVM: -Dfile.encoding=UTF-8
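
For example, via JAVA_OPTS before starting Tomcat:

    export JAVA_OPTS="$JAVA_OPTS -Dfile.encoding=UTF-8"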

Thanks,
engy