Posted to solr-user@lucene.apache.org by Andreas Owen <ao...@conx.ch> on 2013/09/11 14:55:39 UTC

charset encoding

I'm using Solr 4.3.1 with Tika to index HTML pages. The HTML files are ISO-8859-1 (ANSI) encoded, and the "content-encoding" meta tag says the same. The server's HTTP header says it's UTF-8, and the Firefox Web Developer extension agrees.

When I index a page with special characters like ä, ö, ü, Solr outputs completely foreign signs, not the usual wrong characters with 1/4 or the flag in them. So it seems that it's not simply the normal UTF-8/ISO-8859-1 discrepancy. Has anyone got an idea what's wrong?
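
For reference, the two directions of a UTF-8/ISO-8859-1 mix-up can be reproduced in a few lines of Python. This is only a sketch: the sample string is made up, and none of this is Solr or Tika code.

```python
# Illustrative sample text; not taken from the indexed pages.
text = "äöü"

latin1_bytes = text.encode("iso-8859-1")  # one byte per character
utf8_bytes = text.encode("utf-8")         # two bytes per character

# UTF-8 bytes read as ISO-8859-1: each umlaut turns into two characters,
# e.g. "ü" becomes "Ã" followed by "¼".
utf8_read_as_latin1 = utf8_bytes.decode("iso-8859-1")

# ISO-8859-1 bytes read as UTF-8: the bytes are not valid UTF-8, so a
# lenient decoder substitutes U+FFFD (the replacement character).
latin1_read_as_utf8 = latin1_bytes.decode("utf-8", errors="replace")

print(repr(utf8_read_as_latin1))
print(repr(latin1_read_as_utf8))
```

Which of the two a given pipeline produces depends on which side declares the wrong charset, so comparing the indexed output against both can help narrow down where the mismatch is.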


Re: charset encoding

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Can you do a ServletFilter and modify things before they hit Solr?
Haven't tried this particular scenario myself, but it's something to
look at.

Regards,
  Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Wed, Mar 26, 2014 at 6:39 PM, Antoine LE FLOC'H <le...@gmail.com> wrote:
> Thank you for this. The workaround using "ie" works great.
>
> However, this parameter is read fairly early by Solr, before the request
> handlers are called, so it cannot be added to solrconfig.xml and used from there.
>
> Does anybody have an idea how we can force "ie" all the time by simply
> changing some Solr settings (without changing the query)?
>
> Thank you.

Re: charset encoding

Posted by Antoine LE FLOC'H <le...@gmail.com>.
Thank you for this. The workaround using "ie" works great.

However, this parameter is read fairly early by Solr, before the request
handlers are called, so it cannot be added to solrconfig.xml and used from there.

Does anybody have an idea how we can force "ie" all the time by simply
changing some Solr settings (without changing the query)?

Thank you.



On Thu, Sep 12, 2013 at 7:38 PM, Shawn Heisey <so...@elyograg.org> wrote:

> On 9/12/2013 11:17 AM, Andreas Owen wrote:
> > It was the HTTP header: as soon as I forced an ISO-8859-1 header, it worked.
>
> Glad you found a workaround!
>
> If you are in a situation where you cannot control the header of the
> request or modify the content itself to include charset information, or
> there's some reason you would rather not take that route, there will be
> another way with the next Solr release.
>
> https://issues.apache.org/jira/browse/SOLR-5082
>
> Solr 4.5 will support an "ie" (input encoding) parameter for the update
> request so you can inform Solr what charset encoding to expect.  The
> release process for Solr 4.5 has been started; it usually takes 2-3
> weeks to complete.
>
> Thanks,
> Shawn
>
>

Re: charset encoding

Posted by Shawn Heisey <so...@elyograg.org>.
On 9/12/2013 11:17 AM, Andreas Owen wrote:
> It was the HTTP header: as soon as I forced an ISO-8859-1 header, it worked.

Glad you found a workaround!

If you are in a situation where you cannot control the header of the
request or modify the content itself to include charset information, or
there's some reason you would rather not take that route, there will be
another way with the next Solr release.

https://issues.apache.org/jira/browse/SOLR-5082

Solr 4.5 will support an "ie" (input encoding) parameter for the update
request so you can inform Solr what charset encoding to expect.  The
release process for Solr 4.5 has been started; it usually takes 2-3
weeks to complete.
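
As a rough illustration of passing such a parameter, the Python sketch below appends "ie" to an update URL. Only the parameter name "ie" comes from this thread; the host, port, handler path, the "commit" parameter, and the helper name are assumptions for the example, and the exact semantics of "ie" are defined in SOLR-5082.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint for illustration only; adjust to the actual Solr setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update/extract"


def build_update_url(base_url, input_encoding, params=None):
    """Return base_url with the given params plus an "ie" parameter.

    Hypothetical helper: it only builds the URL and sends nothing.
    """
    query = dict(params or {})
    query["ie"] = input_encoding  # input encoding, as described above
    return base_url + "?" + urlencode(query)


url = build_update_url(SOLR_UPDATE_URL, "ISO-8859-1", {"commit": "true"})
print(url)
```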

Thanks,
Shawn


Re: charset encoding

Posted by Andreas Owen <ao...@conx.ch>.
It was the HTTP header: as soon as I forced an ISO-8859-1 header, it worked.
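
One possible reading of "forcing the header", assuming the page content is POSTed to Solr by a client script, is to declare the charset explicitly in the request's Content-Type. The Python sketch below only builds such a request and does not send it; the URL, the helper name, and the sample HTML are assumptions, not details from this thread.

```python
from urllib.request import Request


def build_extract_request(url, html_bytes, charset):
    """Build (but do not send) a POST whose Content-Type names the charset.

    Hypothetical helper for illustration.
    """
    return Request(
        url,
        data=html_bytes,
        headers={"Content-Type": f"text/html; charset={charset}"},
        method="POST",
    )


# Sample page encoded as ISO-8859-1, matching the declared charset.
html_bytes = "<html><body>äöü</body></html>".encode("iso-8859-1")
request = build_extract_request(
    "http://localhost:8983/solr/update/extract",  # assumed endpoint
    html_bytes,
    "ISO-8859-1",
)
print(request.get_header("Content-type"))
```

The point is simply that the declared charset and the actual byte encoding agree; if the pages are fetched from a web server instead, the equivalent fix would be on that server's response header.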

On 12. Sep 2013, at 9:44 AM, Andreas Owen wrote:

> Could it have something to do with the fact that the meta encoding tag is ISO-8859-1 but the HTTP header says UTF-8, and Firefox interprets it as UTF-8?


Re: charset encoding

Posted by Andreas Owen <ao...@conx.ch>.
Could it have something to do with the fact that the meta encoding tag is ISO-8859-1 but the HTTP header says UTF-8, and Firefox interprets it as UTF-8?

On 12. Sep 2013, at 8:36 AM, Andreas Owen wrote:

> No, Jetty; and yes, for Tomcat I've seen a couple of answers.


Re: charset encoding

Posted by Andreas Owen <ao...@conx.ch>.
No, Jetty; and yes, for Tomcat I've seen a couple of answers.

On 12. Sep 2013, at 3:12 AM, Otis Gospodnetic wrote:

> Using Tomcat by any chance? The ML archive has the solution. It may be on
> the Wiki, too.
> 
> Otis
> Solr & ElasticSearch Support
> http://sematext.com/


Re: charset encoding

Posted by Otis Gospodnetic <ot...@gmail.com>.
Using Tomcat by any chance? The ML archive has the solution. It may be on
the Wiki, too.

Otis
Solr & ElasticSearch Support
http://sematext.com/