Posted to solr-user@lucene.apache.org by Rui Pereira <ru...@gmail.com> on 2009/03/27 16:11:15 UTC
Encoding problem
I'm having problems with encoding in responses from search queries. The
encoding problem only occurs in the topologyname field; if an instancename
has accents, it is returned correctly. All my configurations use UTF-8.
<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
<document name="topologies">
<entity query="SELECT DISTINCT '3141-' || Sub0.SUBID as id, 'Inventário' as
topologyname, 3141 as topologyid, Sub0.SUBID as instancekey, Sub0.NAME as
instancename FROM ...
<field column="INSTANCEKEY" name="instancekey"/>
<field column="ID" name="id"/>
<field column="TOPOLOGYID" name="topologyid"/>
<field column="INSTANCENAME" name="instancename"/>
<field column="TOPOLOGYNAME" name="topologyname"/>...
As an example, I can have in the response the following result:
<doc>
<long name="instancekey">285</long>
<str name="instancename">Informática</str>
<long name="topologyid">3141</long>
<str name="topologyname">Inventário</str>
</doc>
Thanks in advance,
Rui Pereira
Re: Encoding problem
Posted by aerox7 <am...@me.com>.
Hi,
I had the same problem with DataImportHandler: I have a UTF-8 MySQL
database, but it seems that DIH imports the data as Latin... So I just use a
Transformer to (re)encode my strings in UTF-8.
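For readers who want to try the same workaround, here is a minimal sketch of the re-encoding step. It is not the poster's actual transformer. It assumes the strings were UTF-8 bytes wrongly decoded as ISO-8859-1 (the "Latin" case mentioned above). The class name ReencodeTransformer is hypothetical, and the transformRow(Map<String, Object>) shape follows what DIH is documented to accept for custom transformers, so check it against your Solr version before relying on it.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical class name; only the JDK is used here, no Solr classes.
public class ReencodeTransformer {

    // Undo one round of "UTF-8 bytes decoded as ISO-8859-1":
    // recover the original bytes, then decode them as UTF-8.
    static String reencode(String garbled) {
        byte[] rawBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    // Assumed DIH custom-transformer shape: receives one row, returns it.
    // Only String values are touched; numbers and other types pass through.
    public Object transformRow(Map<String, Object> row) {
        for (Map.Entry<String, Object> entry : row.entrySet()) {
            if (entry.getValue() instanceof String) {
                entry.setValue(reencode((String) entry.getValue()));
            }
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        // "Informática" as it looks after the wrong decoding, written with
        // Unicode escapes so the source file encoding does not matter.
        row.put("instancename", "Inform\u00c3\u00a1tica");
        row.put("instancekey", 285L);
        new ReencodeTransformer().transformRow(row);
        System.out.println("Inform\u00e1tica".equals(row.get("instancename")));
    }
}
```

Note that this only repairs strings that really were decoded with the wrong charset. Applying it to text that is already correct will damage it, so it is a workaround rather than a fix for the underlying read.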
Re: Encoding problem
Posted by Rui Pereira <ru...@gmail.com>.
Thanks, I detected that same problem.
My system file encoding is CP1252 and I was saving the data-config.xml file
in UTF-8. DIH was reading it using the default encoding.
One possible workaround was to use InputStream and OutputStream like DIH
does, but then the files won't be in UTF-8 if the system has a different
encoding (not really good for XML files).
I will get the latest 1.4 build and keep the files in UTF-8.
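The mismatch described above (a file saved as UTF-8 but decoded as CP1252) can be reproduced with the JDK alone. This is only an illustrative sketch, not DIH code: the class name MojibakeDemo and the helper decode are made up for the example, and "windows-1252" is the Java charset name assumed to correspond to CP1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only shows the effect of the wrong charset.
public class MojibakeDemo {

    // Decode raw bytes with an explicitly named charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "Inventário", written with a Unicode escape so the example does
        // not depend on the encoding javac assumes for this source file.
        String original = "Invent\u00e1rio";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // What a reader using the CP1252 platform default would produce:
        // the two-byte UTF-8 sequence for the accented letter becomes
        // two separate characters.
        String wrong = decode(utf8Bytes, Charset.forName("windows-1252"));

        // What a reader that is told the real encoding produces.
        String right = decode(utf8Bytes, StandardCharsets.UTF_8);

        System.out.println("chars when decoded as CP1252: " + wrong.length());
        System.out.println("chars when decoded as UTF-8:  " + right.length());
        System.out.println("UTF-8 decoding matches original: " + right.equals(original));
    }
}
```

This also fits the symptom in the original post: only the literal typed into data-config.xml goes through the file read, while values coming from the database take a different path.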
On Fri, Mar 27, 2009 at 9:37 PM, Shalin Shekhar Mangar <
shalinmangar@gmail.com> wrote:
> On Sat, Mar 28, 2009 at 12:51 AM, Shalin Shekhar Mangar <
> shalinmangar@gmail.com> wrote:
>
> >
> > I see that you are specifying the topologyname's value in the query
> itself.
> > It might be a bug in DataImportHandler because it reads the data-config
> as a
> > string from an InputStream. If your default platform encoding is not
> UTF-8,
> > this may be the cause.
> >
>
> I've opened SOLR-1090 to fix this issue.
>
> https://issues.apache.org/jira/browse/SOLR-1090
>
> --
> Regards,
> Shalin Shekhar Mangar.
>
Re: Encoding problem
Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Sat, Mar 28, 2009 at 12:51 AM, Shalin Shekhar Mangar <
shalinmangar@gmail.com> wrote:
>
> I see that you are specifying the topologyname's value in the query itself.
> It might be a bug in DataImportHandler because it reads the data-config as a
> string from an InputStream. If your default platform encoding is not UTF-8,
> this may be the cause.
>
I've opened SOLR-1090 to fix this issue.
https://issues.apache.org/jira/browse/SOLR-1090
--
Regards,
Shalin Shekhar Mangar.
Re: Encoding problem
Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
I see that you are specifying the topologyname's value in the query itself.
It might be a bug in DataImportHandler because it reads the data-config as a
string from an InputStream. If your default platform encoding is not UTF-8,
this may be the cause.
Can you try running Solr's (or your servlet container's) Java process
with -Dfile.encoding=UTF-8 and see if that fixes the problem?
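As a rough illustration of why the platform default matters when a config is read from an InputStream, the sketch below prints the JVM's default charset and reads a stream with UTF-8 named explicitly, so the result does not depend on file.encoding. It is not the SOLR-1090 patch; the class name ExplicitCharsetRead, the helper readUtf8, and the sample config string are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
public class ExplicitCharsetRead {

    // Read the whole stream as UTF-8, regardless of the platform default.
    static String readUtf8(InputStream in) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Shows what the JVM falls back to when no charset is named.
        System.out.println("default charset: " + Charset.defaultCharset());

        // Stand-in for a UTF-8 config file containing an accented literal.
        String sample = "<entity query=\"SELECT 'Invent\u00e1rio' AS topologyname\"/>";
        byte[] fileBytes = sample.getBytes(StandardCharsets.UTF_8);

        String text = readUtf8(new ByteArrayInputStream(fileBytes));
        System.out.println("literal preserved: " + text.contains("Invent\u00e1rio"));
    }
}
```

Running the same class with and without -Dfile.encoding=UTF-8 is a quick way to see which default your JVM is using; the explicit-charset read should give the same text in both cases.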
--
Regards,
Shalin Shekhar Mangar.