Posted to solr-user@lucene.apache.org by "Van Tassell, Kristian" <kr...@siemens.com> on 2012/02/01 15:17:43 UTC
UTF-8 support during indexing content
Hello everyone,
I have a question that I imagine has been asked many times before, so I apologize for the repeat.
I have a basic text field with the following text:
the word ”stemming” in quotes
Uploading the data yields no errors; however, once it is indexed, the text looks like this:
the word �stemming� in quotes
Searching for the word stemming, without quotes or otherwise, does not return any hits.
Just some basic facts:
- I included the solr.CollationKeyFilterFactory filter on the fieldType.
- Updating the index is done via a "solr xml" document. I've confirmed that the document contains the right quote marks immediately prior to uploading.
- Updating the index is done via solrj, essentially:
DirectXmlRequest up = new DirectXmlRequest( "/update", xml );
solrServer.request( up );
solrServer.commit();
- In solr admin, the characters look like garbage, surrounding the word stemming (as shown above)
Thanks in advance for any details you can provide!
-Kristian
Re: UTF-8 support during indexing content
Posted by Chris Hostetter <ho...@fucit.org>.
: Subject: UTF-8 support during indexing content
: References: <8C...@webmail-m069.sysops.aol.com>
: <13...@snape>
: <8C...@webmail-m069.sysops.aol.com>
: <13...@snape>
: In-Reply-To: <13...@snape>
https://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists
When starting a new discussion on a mailing list, please do not reply to
an existing message, instead start a fresh email. Even if you change the
subject line of your email, other mail headers still track which thread
you replied to and your question is "hidden" in that thread and gets less
attention. It makes following discussions in the mailing list archives
particularly difficult.
-Hoss
RE: UTF-8 support during indexing content
Posted by "Van Tassell, Kristian" <kr...@siemens.com>.
Travis and all,
This is solved and was not directly a Solr issue. I'll note the solution here in case anyone makes the same mistake. The documents are UTF-8 and the source documents are converted via XSLT. They look good up to that point.
First off, based on some other recommendations I found, I changed the Tomcat <Connector> element to include the URIEncoding="UTF-8" setting.
The primary problem, however, was that the data (mydata below) was read in without an encoding designation.
DirectXmlRequest up = new DirectXmlRequest( "/update", mydata );
The stream was previously gathered incorrectly:
BufferedReader reader = new BufferedReader(new FileReader(filePath));
I've since changed this and am now getting the intended result.
InputStreamReader reader = new InputStreamReader(new FileInputStream(filePath), "UTF-8");
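For anyone who wants to try the fix in isolation, here is a minimal sketch of reading a file with an explicit UTF-8 decoder instead of relying on the platform default. The class name Utf8FileRead, the helper readUtf8, the temp file, and the field name "text" are made up for illustration and are not part of the original code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8FileRead {

    /** Reads the whole file as text, always decoding the bytes as UTF-8. */
    static String readUtf8(Path path) throws IOException {
        StringBuilder sb = new StringBuilder();
        // The charset is named explicitly, so the platform default is never consulted.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(path), StandardCharsets.UTF_8))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical Solr XML document containing the curly quote from the question.
        String xml = "<add><doc><field name=\"text\">"
                + "the word \u201Dstemming\u201D in quotes</field></doc></add>";
        Path tmp = Files.createTempFile("solr-doc", ".xml");
        Files.write(tmp, xml.getBytes(StandardCharsets.UTF_8));

        String mydata = readUtf8(tmp);
        // Reports whether the text survived the round trip unchanged.
        System.out.println("round trip intact: " + mydata.equals(xml));
        Files.delete(tmp);
    }
}
```

The resulting string is what would then be handed to DirectXmlRequest as shown above.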
Thanks,
Kristian
-----Original Message-----
From: Travis Low [mailto:tlow@4centurion.com]
Sent: Wednesday, February 01, 2012 8:27 AM
To: solr-user@lucene.apache.org
Subject: Re: UTF-8 support during indexing content
Are you sure the input document is in UTF-8? That looks like classic
ISO-8859-1-treated-as-UTF-8.
How did you confirm the document contains the right quote marks immediately
prior to uploading? If you just visually inspected it, then use whatever
tool you viewed it in to see what the character set is.
cheers,
Travis
Re: UTF-8 support during indexing content
Posted by Travis Low <tl...@4centurion.com>.
Are you sure the input document is in UTF-8? That looks like classic
ISO-8859-1-treated-as-UTF-8.
How did you confirm the document contains the right quote marks immediately
prior to uploading? If you just visually inspected it, then use whatever
tool you viewed it in to see what the character set is.
cheers,
Travis
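One way to check the character set without trusting a visual inspection is to decode the raw bytes strictly and see whether they are valid UTF-8. The sketch below is only an illustration: the class name Utf8Check and the sample bytes are invented, and it assumes the non-UTF-8 source is windows-1252 (often loosely called ISO-8859-1), where the right curly quote is the single byte 0x94.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {

    /** Returns true only if the bytes form a well-formed UTF-8 sequence. */
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The curly quote encoded as real UTF-8 (three bytes per quote).
        byte[] utf8 = "\u201Dstemming\u201D".getBytes(StandardCharsets.UTF_8);
        // The same quote as a single windows-1252 byte (0x94), which is not valid UTF-8.
        byte[] singleByte = { (byte) 0x94, 's', 't', 'e', 'm', (byte) 0x94 };

        System.out.println("UTF-8 bytes valid: " + isValidUtf8(utf8));
        System.out.println("single-byte quotes valid: " + isValidUtf8(singleByte));
    }
}
```

A check like this only covers one direction: it flags bytes that are not UTF-8, but it cannot tell whether valid UTF-8 was later decoded with the wrong charset, which is what the missing encoding on the reader would cause.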