Posted to solr-user@lucene.apache.org by bbarani <bb...@gmail.com> on 2011/04/07 16:37:11 UTC

SOLR support for unicode?

Hi,

We are trying to index heterogeneous data using SOLR. Some of the sources
have Unicode characters like Zone™, but SOLR is converting them to
Zone™. Any idea how to resolve this issue?

I am using SOLR on Jetty server...

Thanks,
Barani

--
View this message in context: http://lucene.472066.n3.nabble.com/SOLR-support-for-unicode-tp2790512p2790512.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: SOLR support for unicode?

Posted by Sivasakthivel <si...@gmail.com>.
Hi,

Thanks for your response. I am currently working on this issue.

When I ran the test_utf8.sh script, I got the following result:
Solr server is up.
HTTP GET is accepting UTF-8
HTTP POST is accepting UTF-8
HTTP POST defaults to UTF-8
ERROR: HTTP GET is not accepting UTF-8 beyond the basic multilingual plane
ERROR: HTTP POST is not accepting UTF-8 beyond the basic multilingual plane
ERROR: HTTP POST + URL params is not accepting UTF-8 beyond the basic multilingual plane
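
For context on the three ERROR lines: a character is "beyond the basic multilingual plane" when its code point is above U+FFFF. The sketch below is only an illustration (the helper name `describe` and the G clef sample character are not from this thread). It suggests that the "TM" sign itself sits inside the BMP, so these particular errors may be unrelated to the ™ problem.

```python
def describe(ch: str) -> dict:
    """Return the code point and encoded sizes of a single character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        # Anything above U+FFFF lies outside the Basic Multilingual Plane.
        "beyond_bmp": ord(ch) > 0xFFFF,
        "utf8_bytes": len(ch.encode("utf-8")),
        # UTF-16 needs a surrogate pair (two 16-bit units) outside the BMP.
        "utf16_units": len(ch.encode("utf-16-be")) // 2,
    }


tm = describe("\u2122")        # TRADE MARK SIGN
clef = describe("\U0001D11E")  # MUSICAL SYMBOL G CLEF (illustrative sample)

print(tm)
print(clef)
```

The en dash (U+2013) used in the test document is also inside the BMP, so the same reasoning applies to it.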

I also placed the "TM" symbol and the "–" symbol in one of the example XML docs,
indexed it with post.jar, and queried it with the "wt=python" param.

Input:
  Good unicode support: h&#xE9;llo (hello with an™ accent OLB – Account over the e)

Output:
Good unicode support: héllo (hello with an� accent OLB � Account over the e)
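
One possible explanation for the � characters, offered as an assumption rather than a diagnosis: the edited XML file was saved in a single-byte encoding such as Windows-1252 and then read as UTF-8. The "é" would survive in that case because it is written as the numeric character reference &#xE9;, which is plain ASCII in the file. A minimal sketch (the helper name `decode_as_utf8` is made up for illustration):

```python
def decode_as_utf8(raw: bytes) -> str:
    """Decode bytes as UTF-8, substituting U+FFFD for invalid sequences."""
    return raw.decode("utf-8", errors="replace")


text = "an\u2122 accent OLB \u2013 Account"

# Hypothetical scenario: the file was saved as Windows-1252 but read as UTF-8.
# The single bytes 0x99 (TM) and 0x96 (en dash) are not valid UTF-8 on their own.
wrong = decode_as_utf8(text.encode("cp1252"))

# The same text saved as UTF-8 round-trips cleanly.
right = decode_as_utf8(text.encode("utf-8"))

print(ascii(wrong))
print(ascii(right))
```

If this is what happened, re-saving the example doc as UTF-8, or writing the symbols as &#x2122; and &#x2013;, would be worth trying.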

Re: SOLR support for unicode?

Posted by Chris Hostetter <ho...@fucit.org>.
: 
: Thanks for your response..please find below the schema details corresponding
: to that field..

your message included nothing but a bunch of blank lines, probably because
your email editor thought you were trying to type in html (instead of xml)

before diving too deeply into your analyzer however, it's important to
sanity check that your servlet container is configured properly, and that
your client is actually sending the data encoded properly -- based on your
description of the problem it sounds like even the *stored* value of the
field contains a "?" character, which means that the analyzer probably isn't
the problem.

the exampledocs directory has a test_utf8.sh script which can be handy for
verifying that your servlet container seems to be behaving properly. you
can also try putting a "TM" symbol in one of the example XML docs and
indexing that with post.jar to see if that works for you.

if it does, then odds are your indexing code isn't doing what it should be 
encoding wise.

if using post.jar with a simple xml file in UTF-8 still doesn't give you the
expected outcome, please reply with the output of a query for your
test doc that uses the "wt=python" param ... the python response writer is
handy in these cases because it generates escape codes for everything
outside of the ascii range making it easy to see *exactly* what bytes
are in those stored fields.

-Hoss
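
To illustrate why escaped output helps here, the sketch below uses Python's built-in unicode_escape codec as a stand-in. It is not Solr's python response writer, and the exact escape format Solr emits may differ. The helper name `escape_non_ascii` and the three candidate values are assumptions chosen to show how different encoding mistakes become distinguishable once non-ASCII characters are escaped.

```python
def escape_non_ascii(value: str) -> str:
    """Render characters outside the ASCII range as \\xNN or \\uXXXX escapes."""
    return value.encode("unicode_escape").decode("ascii")


original = "Zone\u2122"

candidates = {
    # Stored correctly: a single TRADE MARK SIGN code point.
    "correct": original,
    # UTF-8 bytes misread as Windows-1252: three junk characters.
    "utf8_read_as_cp1252": original.encode("utf-8").decode("cp1252"),
    # Lossy conversion to ASCII somewhere upstream: a literal "?".
    "lossy_ascii": original.encode("ascii", errors="replace").decode("ascii"),
}

escaped = {name: escape_non_ascii(value) for name, value in candidates.items()}

for name, text in escaped.items():
    print(f"{name:22s} {text}")
```

A literal "?" in the escaped output would point at a lossy conversion before the data reached the index, while a run of \xNN escapes would point at bytes decoded with the wrong charset.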

Re: SOLR support for unicode?

Posted by bbarani <bb...@gmail.com>.
Hi,

Thanks for your response..please find below the schema details corresponding
to that field..



---------------------------------------------------------------------------

Field type details..

Thanks,
Barani

Re: SOLR support for unicode?

Posted by Jonathan Rochkind <ro...@jhu.edu>.
That's probably an issue with your analyzer.  Can you show us the field
definition from the schema.xml file, for the field that you are putting
this text in?

On 4/7/2011 10:37 AM, bbarani wrote:
> Hi,
>
> We are trying to index heterogeneous data using SOLR. Some of the sources
> have Unicode characters like Zone™, but SOLR is converting them to
> Zone™. Any idea how to resolve this issue?
>
> I am using SOLR on Jetty server...
>
> Thanks,
> Barani
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/SOLR-support-for-unicode-tp2790512p2790512.html
> Sent from the Solr - User mailing list archive at Nabble.com.