Posted to solr-user@lucene.apache.org by Rui Pereira <ru...@gmail.com> on 2009/03/27 16:11:15 UTC

Encoding problem

I'm having problems with encoding in responses from search queries. The
encoding problem only occurs in the topologyname field; if an instancename
has accents, it is returned correctly. All my configurations use UTF-8.

<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
    <document name="topologies">
<entity query="SELECT DISTINCT '3141-' || Sub0.SUBID as id, 'Inventário' as
topologyname, 3141 as topologyid, Sub0.SUBID as instancekey, Sub0.NAME as
instancename FROM ...
              <field column="INSTANCEKEY" name="instancekey"/>
              <field column="ID" name="id"/>
              <field column="TOPOLOGYID" name="topologyid"/>
              <field column="INSTANCENAME" name="instancename"/>
              <field column="TOPOLOGYNAME" name="topologyname"/>...


As an example, I can have in the response the following result:

<doc>
<long name="instancekey">285</long>
<str name="instancename">Informática</str>
<long name="topologyid">3141</long>
<str name="topologyname">Inventário</str>
</doc>


Thanks in advance,
   Rui Pereira

Re: Encoding problem

Posted by aerox7 <am...@me.com>.
Hi,
I had the same problem with DataImportHandler: I have a UTF-8 MySQL
database, but it seems that DIH imports the data in a Latin encoding. So I
just use a Transformer to (re)encode my strings in UTF-8.
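
As a rough illustration of that kind of re-encoding (not the poster's actual
code, which was not included), the sketch below repairs a string whose UTF-8
bytes were wrongly decoded as ISO-8859-1. The class name MojibakeRepair, its
method names, and the choice of ISO-8859-1 as the wrong charset are all
assumptions made for the example; in DIH the same logic would presumably sit
inside a custom Transformer.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Solr or DIH.
class MojibakeRepair {

    // Undo a wrong ISO-8859-1 decode: recover the original bytes,
    // then decode them as UTF-8. Only valid for strings that really
    // were mis-decoded; a correctly decoded string would be damaged.
    static String reencode(String misdecoded) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    // Apply the repair to every String value in a row and leave
    // values of other types (numbers, dates) untouched.
    static Map<String, Object> repairRow(Map<String, Object> row) {
        Map<String, Object> fixed = new HashMap<>();
        for (Map.Entry<String, Object> e : row.entrySet()) {
            Object v = e.getValue();
            fixed.put(e.getKey(), v instanceof String ? reencode((String) v) : v);
        }
        return fixed;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        // The accented letter arrives as two characters (U+00C3, U+00A1).
        row.put("topologyname", "Invent\u00c3\u00a1rio");
        row.put("topologyid", 3141L);
        Map<String, Object> fixed = repairRow(row);
        // Compare against the correctly spelled value (U+00E1).
        System.out.println("Invent\u00e1rio".equals(fixed.get("topologyname"))); // prints true
    }
}
```

Note that this treats the symptom only: it must not be applied to fields that
already arrive correctly decoded, such as instancename in the original report.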




Re: Encoding problem

Posted by Rui Pereira <ru...@gmail.com>.
Thanks, I detected that same problem.
My system file encoding is CP 1252, and I was saving the data-config.xml file
in UTF-8. DIH was reading it using the default encoding.
One possible workaround was to use InputStream and OutputStream like DIH does,
but then the files won't be in UTF-8 if the system has a different encoding
(not really good for XML files).
I will get the latest 1.4 build and keep the files in UTF-8.
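
To see which default encoding a given JVM is using, something like the
following sketch could be run on the same machine and with the same options
as Solr. It relies only on the standard Charset.defaultCharset() call and the
file.encoding system property; the class name DefaultCharsetCheck and its
method are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic for illustration; not part of Solr.
class DefaultCharsetCheck {

    // True when the JVM's default charset is UTF-8.
    static boolean defaultIsUtf8() {
        return StandardCharsets.UTF_8.equals(Charset.defaultCharset());
    }

    public static void main(String[] args) {
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset().name());
        if (!defaultIsUtf8()) {
            // Code that decodes bytes without naming a charset will use
            // the default shown above, not UTF-8.
            System.out.println("Warning: default charset is not UTF-8");
        }
    }
}
```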

On Fri, Mar 27, 2009 at 9:37 PM, Shalin Shekhar Mangar <
shalinmangar@gmail.com> wrote:

> On Sat, Mar 28, 2009 at 12:51 AM, Shalin Shekhar Mangar <
> shalinmangar@gmail.com> wrote:
>
> >
> > I see that you are specifying the topologyname's value in the query
> > itself. It might be a bug in DataImportHandler because it reads the
> > data-config as a string from an InputStream. If your default platform
> > encoding is not UTF-8, this may be the cause.
> >
>
> I've opened SOLR-1090 to fix this issue.
>
> https://issues.apache.org/jira/browse/SOLR-1090
>
> --
> Regards,
> Shalin Shekhar Mangar.
>

Re: Encoding problem

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Sat, Mar 28, 2009 at 12:51 AM, Shalin Shekhar Mangar <
shalinmangar@gmail.com> wrote:

>
> I see that you are specifying the topologyname's value in the query itself.
> It might be a bug in DataImportHandler because it reads the data-config as a
> string from an InputStream. If your default platform encoding is not UTF-8,
> this may be the cause.
>

I've opened SOLR-1090 to fix this issue.

https://issues.apache.org/jira/browse/SOLR-1090

-- 
Regards,
Shalin Shekhar Mangar.

Re: Encoding problem

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Fri, Mar 27, 2009 at 8:41 PM, Rui Pereira <ru...@gmail.com> wrote:

> I'm having problems with encoding in responses from search queries. The
> encoding problem only occurs in the topologyname field, if a instancename
> has accents it is returned correctly. In all my configurations I have
> UTF-8.
>
> <?xml version="1.0" encoding="UTF-8"?>
> <dataConfig>
>    <document name="topologies">
> <entity query="SELECT DISTINCT '3141-' || Sub0.SUBID as id, 'Inventário' as
> topologyname, 3141 as topologyid, Sub0.SUBID as instancekey, Sub0.NAME as
> instancename FROM ...
>              <field column="INSTANCEKEY" name="instancekey"/>
>              <field column="ID" name="id"/>
>              <field column="TOPOLOGYID" name="topologyid"/>
>              <field column="INSTANCENAME" name="instancename"/>
>              <field column="TOPOLOGYNAME" name="topologyname"/>...
>
>
> As an example, I can have in the response the following result:
>
> <doc>
> <long name="instancekey">285</long>
> <str name="instancename">Informática</str>
> <long name="topologyid">3141</long>
> <str name="topologyname">Inventário</str>
> </doc>
>

I see that you are specifying the topologyname's value in the query itself.
It might be a bug in DataImportHandler because it reads the data-config as a
string from an InputStream. If your default platform encoding is not UTF-8,
this may be the cause.

Can you try running Solr's (or your servlet container's) Java process
with -Dfile.encoding=UTF-8 and see if that fixes the problem?
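
The suspected effect can be reproduced outside Solr. The sketch below is not
DIH's code; it simply decodes the same UTF-8 bytes twice, once with an
explicit UTF-8 charset and once with ISO-8859-1, which stands in here for a
non-UTF-8 platform default (an assumption; it maps the two bytes involved the
same way CP 1252 does). The class name ConfigDecodingDemo and the readAll
helper are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo for illustration; not taken from DataImportHandler.
class ConfigDecodingDemo {

    // Read the whole stream as text, decoding with the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // A fragment of the query as stored in a UTF-8 file
        // (\u00e1 is the accented "a").
        byte[] utf8Bytes = "'Invent\u00e1rio' as topologyname"
                .getBytes(StandardCharsets.UTF_8);

        String right = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.ISO_8859_1);

        // The accent survives only when the charset matches the file.
        System.out.println(right.contains("Invent\u00e1rio"));        // prints true
        // With the wrong charset it becomes two characters (U+00C3, U+00A1).
        System.out.println(wrong.contains("Invent\u00c3\u00a1rio"));  // prints true
    }
}
```

This would also explain why only topologyname is affected: its value is a
literal inside the config file, while instancename comes from the database
and never passes through that decoding step.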

-- 
Regards,
Shalin Shekhar Mangar.