Posted to solr-user@lucene.apache.org by Bill Au <bi...@gmail.com> on 2009/07/28 21:26:02 UTC

µTorrent indexed as ÂµTorrent

I am using SolrJ to index the word µTorrent.  After a commit I was not able
to query for it.  It turns out that the document in my Solr index contains
the word ÂµTorrent instead of µTorrent.  Does anyone have any idea what's
going on?

Bill

Re: µTorrent indexed as ÂµTorrent

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Thu, Jul 30, 2009 at 6:34 PM, Bill Au<bi...@gmail.com> wrote:
>  FYI, it took me a while to discover that SolrJ by default uses a GET request for
> query, which uses ISO-8859-1.

That depends on the servlet container.  SolrJ GET requests are sent in
UTF-8.  Some servlet containers such as Tomcat need extra
configuration to treat URLs as UTF-8 instead of latin-1, but the
standard http://www.ietf.org/rfc/rfc3986.txt clearly specifies UTF-8.
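
A minimal sketch of the mismatch described above, using only the JDK and no
SolrJ or servlet container. The class and method names are hypothetical. The
sender percent-encodes the query term as UTF-8, and the receiver decodes the
same string once as UTF-8 and once as ISO-8859-1, which is what a container
that treats URLs as latin-1 would effectively do.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class UrlEncodingDemo {
    // Percent-encode text using the named charset.
    static String encode(String text, String charset) {
        try {
            return URLEncoder.encode(text, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Percent-decode text using the named charset.
    static String decode(String text, String charset) {
        try {
            return URLDecoder.decode(text, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String word = "\u00B5Torrent";             // micro sign + "Torrent"
        String sent = encode(word, "UTF-8");       // what a UTF-8 client sends
        System.out.println(sent);                  // prints %C2%B5Torrent

        // Receiver agrees on UTF-8: the term round-trips intact.
        System.out.println(decode(sent, "UTF-8").equals(word));       // prints true

        // Receiver assumes ISO-8859-1: each byte becomes its own character,
        // so the term gains a leading U+00C2 and no longer matches.
        System.out.println(decode(sent, "ISO-8859-1").equals(word));  // prints false
    }
}
```

If the last line prints false in an equivalent test against a real server, the
container's URL decoding is the likely culprit rather than the client.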

To test the servlet container configuration, check out
example/exampledocs/test_utf8.sh

-Yonik
http://www.lucidimagination.com

>  I had to explicitly use a POST to do query in
> SolrJ in order to get it to use UTF-8.
>
> Bill
>
> On Tue, Jul 28, 2009 at 5:27 PM, Robert Muir <rc...@gmail.com> wrote:
>
>> Bill, somewhere in the process I think you might be treating your
>> UTF-8 text as ISO-8859-1.
>>
>> Your character: 00B5 (µ)
>> Bits: 10110101
>>
>> UTF8-encoded: 11000010 10110101
>>
>> If you were to treat these bytes as ISO-8859-1 (e.g. when reading from a
>> file, or through wrong URL encoding) then it looks like:
>> 0xC2 (Â) followed by 0xB5 (µ)
>>
>>
>> On Tue, Jul 28, 2009 at 3:26 PM, Bill Au<bi...@gmail.com> wrote:
>> > I am using SolrJ to index the word µTorrent.  After a commit I was not able
>> > to query for it.  It turns out that the document in my Solr index contains
>> > the word ÂµTorrent instead of µTorrent.  Does anyone have any idea what's
>> > going on?
>> >
>> > Bill
>> >
>>
>>
>>
>> --
>> Robert Muir
>> rcmuir@gmail.com
>>
>

Re: µTorrent indexed as ÂµTorrent

Posted by Bill Au <bi...@gmail.com>.
Thanks, Robert.  That's exactly what my problem was.  Things work fine now
that I have made sure all my processing (indexing and querying) uses UTF-8.
FYI, it took me a while to discover that SolrJ by default uses a GET request
for queries, which uses ISO-8859-1.  I had to explicitly use a POST for
queries in SolrJ in order to get it to use UTF-8.

Bill

On Tue, Jul 28, 2009 at 5:27 PM, Robert Muir <rc...@gmail.com> wrote:

> Bill, somewhere in the process I think you might be treating your
> UTF-8 text as ISO-8859-1.
>
> Your character: 00B5 (µ)
> Bits: 10110101
>
> UTF8-encoded: 11000010 10110101
>
> If you were to treat these bytes as ISO-8859-1 (e.g. when reading from a
> file, or through wrong URL encoding) then it looks like:
> 0xC2 (Â) followed by 0xB5 (µ)
>
>
> On Tue, Jul 28, 2009 at 3:26 PM, Bill Au<bi...@gmail.com> wrote:
> > I am using SolrJ to index the word µTorrent.  After a commit I was not able
> > to query for it.  It turns out that the document in my Solr index contains
> > the word ÂµTorrent instead of µTorrent.  Does anyone have any idea what's
> > going on?
> >
> > Bill
> >
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>

Re: µTorrent indexed as ÂµTorrent

Posted by Robert Muir <rc...@gmail.com>.
Bill, somewhere in the process I think you might be treating your
UTF-8 text as ISO-8859-1.

Your character: 00B5 (µ)
Bits: 10110101

UTF8-encoded: 11000010 10110101

If you were to treat these bytes as ISO-8859-1 (e.g. when reading from a
file, or through wrong URL encoding) then it looks like:
0xC2 (Â) followed by 0xB5 (µ)
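
The byte arithmetic above can be reproduced with a short JDK-only sketch. The
class and method names are hypothetical, and this is an illustration of the
suspected failure mode, not a trace of what SolrJ or Solr actually did. It
encodes the word as UTF-8, decodes the two bytes 0xC2 0xB5 as if they were
ISO-8859-1, and then shows that the damage is reversible as long as no bytes
were lost along the way.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode text as UTF-8, then (wrongly) decode those bytes as ISO-8859-1.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo the damage: recover the original bytes, then decode them as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String word = "\u00B5Torrent";        // micro sign (U+00B5) + "Torrent"
        String garbled = misdecode(word);     // U+00C2, U+00B5, then "Torrent"

        // The single micro sign became two characters, so the string grew by one.
        System.out.println(garbled.length() - word.length());  // prints 1
        System.out.println((int) garbled.charAt(0));            // prints 194 (0xC2)
        System.out.println(repair(garbled).equals(word));       // prints true
    }
}
```

The string literals use `\u` escapes so that the example does not depend on
the encoding of the source file itself.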


On Tue, Jul 28, 2009 at 3:26 PM, Bill Au<bi...@gmail.com> wrote:
> I am using SolrJ to index the word µTorrent.  After a commit I was not able
> to query for it.  It turns out that the document in my Solr index contains
> the word ÂµTorrent instead of µTorrent.  Does anyone have any idea what's
> going on?
>
> Bill
>



-- 
Robert Muir
rcmuir@gmail.com