Posted to solr-user@lucene.apache.org by "Van Tassell, Kristian" <kr...@siemens.com> on 2012/02/01 15:17:43 UTC

UTF-8 support during indexing content

Hello everyone,

I have a question that I imagine has been asked many times before, so I apologize for the repeat.

I have a basic text field with the following text:
	the word ”stemming” in quotes

Uploading the data yields no errors; however, once it is indexed, the text looks like this:

the word �stemming� in quotes


Searching for the word stemming, without quotes or otherwise, does not return any hits.

Just some basic facts:

- I included the solr.CollationKeyFilterFactory filter on the fieldType.
- Updating the index is done via a "solr xml" document. I've confirmed that the document contains the right quote marks immediately prior to uploading.
- Updating the index is done via solrj, essentially:
	DirectXmlRequest up = new DirectXmlRequest( "/update", xml );
	solrServer.request( up );
	solrServer.commit();
- In the Solr admin UI, the characters surrounding the word stemming look like garbage (as shown above)


Thanks in advance for any details you can provide!
-Kristian

Re: UTF-8 support during indexing content

Posted by Chris Hostetter <ho...@fucit.org>.
: Subject: UTF-8 support during indexing content
: References: <8C...@webmail-m069.sysops.aol.com>
:  <13...@snape>
:  <8C...@webmail-m069.sysops.aol.com>
:  <13...@snape>
: In-Reply-To: <13...@snape>

https://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.


-Hoss

RE: UTF-8 support during indexing content

Posted by "Van Tassell, Kristian" <kr...@siemens.com>.
Travis and all,

This is solved and was not directly a Solr issue. I'll note the solution here in case anyone makes the same mistake. The documents are UTF-8 and the source documents are converted via XSLT. They look good up to that point. 

First off, based on some other recommendations I found, I changed the Tomcat <Connector> element to include the URIEncoding="UTF-8" setting.

The primary problem, however, was that the data (mydata below) was read in without an encoding designation.

DirectXmlRequest up = new DirectXmlRequest( "/update", mydata );

The file was previously read with FileReader, which takes no encoding argument and falls back to the platform default:

BufferedReader reader = new BufferedReader(new FileReader(filePath));

I've since changed this to name the encoding explicitly and am now getting the intended result:

InputStreamReader reader = new InputStreamReader(new FileInputStream(filePath), "UTF-8");
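
For anyone who wants the whole read step in one place, here is a minimal sketch of the fix above. The class name Utf8FileReading and the method readUtf8 are made up for illustration and are not part of solrj; StandardCharsets and try-with-resources assume Java 7 or later.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of solrj: reads a file as UTF-8
// no matter what the platform default encoding is.
class Utf8FileReading {

    static String readUtf8(String filePath) throws IOException {
        StringBuilder sb = new StringBuilder();
        // The charset is named explicitly; FileReader would use the platform default.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(filePath), StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }
}
```

The returned string could then be handed to DirectXmlRequest as the request body, in the same way as mydata above.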

Thanks,
Kristian



Re: UTF-8 support during indexing content

Posted by Travis Low <tl...@4centurion.com>.
Are you sure the input document is in UTF-8?  That looks like classic
ISO-8859-1-treated-as-UTF-8.

How did you confirm the document contains the right quote marks immediately
prior to uploading?  If you just visually inspected it, then use whatever
tool you viewed it in to see what the character set is.
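
One way to check this kind of mismatch outside of Solr is to decode a suspect byte by hand. The sketch below assumes the closing quote was stored as the single byte 0x94 (its Windows-1252 code, which is an assumption about the source file, not something confirmed in this thread) and shows what a UTF-8 decoder makes of it. The class name MojibakeDemo and the method decodeAsUtf8 are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows how a single-byte quote mark
// turns into the replacement character when decoded as UTF-8.
class MojibakeDemo {

    static String decodeAsUtf8(byte[] bytes) {
        // Malformed input is replaced with U+FFFD rather than throwing.
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumption: the closing quote as the single Windows-1252 byte 0x94.
        byte[] singleByteQuote = { (byte) 0x94 };
        String decoded = decodeAsUtf8(singleByteQuote);
        System.out.println(decoded.equals("\uFFFD"));

        // The same quote mark (U+201D) properly encoded as UTF-8 takes three bytes.
        byte[] utf8Quote = "\u201D".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8Quote.length);
    }
}
```

A lone 0x94 is not a valid UTF-8 sequence, so the decoder substitutes U+FFFD, which is the same placeholder glyph that shows up around the word stemming.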

cheers,
Travis
