Posted to solr-user@lucene.apache.org by Ethan Gruber <ew...@gmail.com> on 2007/05/24 15:32:18 UTC

Difficulty posting unicode to solr index

Hi,

I am trying to post some Unicode XML documents to my Solr index.  They
are encoded in UTF-8.  When I query from the Solr admin page, I get
garbled text in return.  I decided to try a file that I know is supposed to
work, the utf8-example.xml found in the exampledocs folder.  This also did
not return proper Unicode results.  None of my coworkers have run into this
problem, but I believe there is one difference between their systems and
mine which could account for the error.  They're using Macs and thus posting
with post.sh, and I am running Windows and posting with a post.jar file.
Could post.jar not support Unicode?  Has anyone run into this problem before?

Thanks,
Ethan

Re: Difficulty posting unicode to solr index

Posted by Chris Hostetter <ho...@fucit.org>.
: error.  They're using Macs and thus posting with post.sh, and I am running
: Windows and posting with a post.jar file.  Could post.jar not support
: unicode?  Has anyone run into this problem before?

Which post.jar? (i recently committed a new version)

post.jar does all of the things it should do to ensure it sends Solr clean
UTF-8 data, but i have not personally tested (any version of) it on
windows.
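
For comparison, here is a minimal sketch of what sending "clean UTF-8 data" can look like from a script. It is not how post.jar is implemented. The update URL is an assumption based on the localhost:8983 host used elsewhere in this thread, and the function names are made up for illustration.

```python
import urllib.request

# Assumption: this update URL mirrors the localhost:8983 host used in this
# thread; adjust it for your own installation.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(xml_bytes, url=SOLR_UPDATE_URL):
    """Build a POST request that declares its body as UTF-8 XML."""
    # Fail early (UnicodeDecodeError) if the bytes are not valid UTF-8.
    xml_bytes.decode("utf-8")
    return urllib.request.Request(
        url,
        data=xml_bytes,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def post_file(path, url=SOLR_UPDATE_URL):
    """Send a file's bytes unchanged and return the HTTP status code."""
    # Binary mode keeps the platform's default encoding away from the data.
    with open(path, "rb") as f:
        request = build_update_request(f.read(), url)
    with urllib.request.urlopen(request) as response:
        return response.status
```

The two points that matter are reading the file as raw bytes and naming the charset in the Content-Type header, so neither side has to guess the encoding.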

is it possible the problem is not posting the data, but looking at the
responses from Solr in your browser?  (can other people who have good
success query the index after you send your update and see the complex
characters correctly?)
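
To illustrate the display-side failure described above, here is a small sketch of what happens when correct UTF-8 bytes are decoded with the wrong charset. The choice of cp1252 as the wrong charset is an assumption (it is a common Windows default); the helper name is hypothetical.

```python
def misread(text, wrong_encoding="cp1252"):
    """Encode text as UTF-8, then decode the bytes with the wrong charset."""
    return text.encode("utf-8").decode(wrong_encoding)


original = "This is an e acute: \u00e9"
garbled = misread(original)

# ascii() shows the code points without depending on the console encoding.
print(ascii(original))
print(ascii(garbled))
```

The stored data is identical in both cases. Only the decoding step differs, and the single accented character turns into two unrelated ones.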




-Hoss


Re: Difficulty posting unicode to solr index

Posted by Yonik Seeley <yo...@apache.org>.
On 5/25/07, Ethan Gruber <ew...@gmail.com> wrote:
> Posting utf8-example.xml is the first thing I tried when I ran into this
> problem, and like the other files I had been working with, query results
> return garbage characters instead of Unicode.

After posting utf8-example.xml, try this query:

http://localhost:8983/solr/select?indent=on&q=id%3AUTF8TEST&fl=features&wt=python

The python writer uses unicode escapes to keep the output in the ASCII
range, so it's an easy way to see exactly what Solr thinks those
characters are.
You should get:

{
 'responseHeader':{
  'status':0,
  'QTime':0,
  'params':{
	'wt':'python',
	'indent':'on',
	'q':'id:UTF8TEST',
	'fl':'features'}},
 'response':{'numFound':1,'start':0,'docs':[
	{
	 'features':[
	  'No accents here',
	  u'This is an e acute: \u00e9',
	  u'eaiou with circumflexes: \u00ea\u00e2\u00ee\u00f4\u00fb',
	  u'eaiou with umlauts: \u00eb\u00e4\u00ef\u00f6\u00fc',
	  'tag with escaped chars: <nicetag/>',
	  'escaped ampersand: Bonnie & Clyde']}]
 }}

If you do, that means that the problem is not getting the data into
solr, but the interpretation of what you get out.
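
The comparison can also be done mechanically. The sketch below parses a trimmed copy of the expected response with ast.literal_eval and lists the non-ASCII code points in each feature. The sample text is shortened from the output above, and the helper names are made up for illustration. With a live server, the body would come from the select URL given earlier.

```python
import ast

# Trimmed copy of the expected wt=python response. The raw string keeps the
# \uXXXX escapes as literal text, exactly as Solr would send them.
SAMPLE = r"""{
 'responseHeader':{'status':0,'QTime':0},
 'response':{'numFound':1,'start':0,'docs':[
  {'features':[
    'No accents here',
    u'This is an e acute: \u00e9',
    u'eaiou with umlauts: \u00eb\u00e4\u00ef\u00f6\u00fc']}]}}"""


def features_from_response(body):
    """Parse a python-writer response body and return the first doc's features."""
    data = ast.literal_eval(body)
    return data["response"]["docs"][0]["features"]


def non_ascii_code_points(text):
    """List the code points above the ASCII range, in order of appearance."""
    return ["U+%04X" % ord(ch) for ch in text if ord(ch) > 127]


for feature in features_from_response(SAMPLE):
    print(ascii(feature), non_ascii_code_points(feature))
```

If the index holds the right characters, the e acute shows up as a single U+00E9. Two code points in its place would suggest the data was already mangled on the way in.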

-Yonik

Re: Difficulty posting unicode to solr index

Posted by Ethan Gruber <ew...@gmail.com>.
Posting utf8-example.xml is the first thing I tried when I ran into this
problem, and like the other files I had been working with, query results
return garbage characters instead of Unicode.

Re: Difficulty posting unicode to solr index

Posted by Yonik Seeley <yo...@apache.org>.
On 5/25/07, Ethan Gruber <ew...@gmail.com> wrote:
> Yes, it's definitely encoded in UTF-8.  I'm going to attempt either today or
> Tuesday to post the files to a solr index that is online (as opposed to
> localhost as was my case a few days ago) using post.sh through SSH and let
> you know how it turns out.  That should definitely indicate whether or not
> the problem is with my files themselves or the post.jar file.

Why don't you try a file that we know is encoded in UTF-8,
solr/example/exampledocs/utf8-example.xml?

Try it first without modifying it (an editor can change the encoding a
file is stored in).
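
One way to check whether an editor has changed a file is to look at its raw bytes. This is a minimal sketch with hypothetical helper names. Note that a successful decode only shows the bytes are valid UTF-8, which is also true of plain ASCII, while Latin-1 text with accented characters will typically fail.

```python
import codecs


def check_utf8_bytes(raw):
    """Report whether raw bytes decode as UTF-8 and whether they start with a BOM."""
    try:
        raw.decode("utf-8")
        valid = True
    except UnicodeDecodeError:
        valid = False
    return {"valid_utf8": valid, "has_bom": raw.startswith(codecs.BOM_UTF8)}


def check_utf8_file(path):
    """Run the same check on a file, read in binary mode."""
    with open(path, "rb") as f:
        return check_utf8_bytes(f.read())
```

For example, check_utf8_file("utf8-example.xml") could be run before and after opening the file in an editor to see whether the result changes.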

-Yonik

Re: Difficulty posting unicode to solr index

Posted by Ethan Gruber <ew...@gmail.com>.
Yes, it's definitely encoded in UTF-8.  I'm going to attempt either today or
Tuesday to post the files to a solr index that is online (as opposed to
localhost as was my case a few days ago) using post.sh through SSH and let
you know how it turns out.  That should definitely indicate whether or not
the problem is with my files themselves or the post.jar file.

Re: Difficulty posting unicode to solr index

Posted by James liu <li...@gmail.com>.
How are you sure your file is encoded in UTF-8?

-- 
regards
jl