Posted to solr-user@lucene.apache.org by sunnyfr <jo...@gmail.com> on 2008/10/21 11:43:34 UTC
tomcat55/solr1.3 - Indexing data, doesnt take in consideration utf8!
Hi,
I'm running Solr 1.3 on Tomcat 5.5.
When I index some data and then query for ALL, my accented characters and
UTF-8 encoding are clearly not taken into account:
<doc>
<date name="created">2006-12-14T15:28:27Z</date>
<str name="description_ja">
Le 1er film de Goro Miyazaki (fils de Hayao)
<br />je suis allée ...
....
<str name="title_ja">渡邊 å‰å· vs 三田下田 1</str>
My MySQL database is in UTF-8: if I query the data manually from MySQL,
accents and even Japanese characters come back correctly.
I index my data with this data-config:
<dataSource type="JdbcDataSource"
driver="com.mysql.jdbc.Driver"
url="jdbc:mysql://master-spare.videos.com/videos"
user="solr"
password="pass"
batchSize="-1"
responseBuffering="adaptive"/>
My schema config file starts with: <?xml version="1.0" encoding="UTF-8" ?>
I've added this to my server.xml, because my localhost listens on port 8180:
<Connector port="8180" maxHttpHeaderSize="8192"
maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
enableLookups="false" redirectPort="8443" acceptCount="100"
connectionTimeout="20000" disableUploadTimeout="true"
URIEncoding="UTF-8" useBodyEncodingForURI="true" />
What can I check?
I'm using a Linux server.
If I run dpkg-reconfigure -plow locales, I get:
Generating locales...
fr_BE.UTF-8... up-to-date
fr_CA.UTF-8... up-to-date
fr_CH.UTF-8... up-to-date
fr_FR.UTF-8... up-to-date
fr_LU.UTF-8... up-to-date
Generation complete.
Could that be the problem? I would say no, but maybe I'm missing a package?
--
View this message in context: http://www.nabble.com/tomcat55-solr1.3---Indexing-data%2C-doesnt-take-in-consideration-utf8%21-tp20086167p20086167.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: tomcat55/solr1.3 - Indexing data, doesnt take in consideration utf8!
Posted by sunnyfr <jo...@gmail.com>.
It actually comes from the MySQL database variables:
| character_set_client     | latin1 |
| character_set_connection | latin1 |
so now I don't really know how to configure my datasource to use latin1
and not utf8.
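One way to test the latin1 hypothesis is to take a garbled sample from the Solr response and reverse the suspected misdecoding. The following is a minimal sketch in Python, not something from this thread: the function name undo_latin1_misdecoding is hypothetical, and it assumes the garbling came from UTF-8 bytes being read as latin1 (ISO-8859-1). If a Windows-1252 step was involved instead, the call fails with a UnicodeEncodeError rather than returning a wrong answer.

```python
def undo_latin1_misdecoding(garbled: str) -> str:
    """Reverse the assumed 'UTF-8 bytes read as latin1' misdecoding.

    Each character is mapped back to the single byte it came from,
    and the resulting byte string is decoded as UTF-8.
    Raises UnicodeEncodeError or UnicodeDecodeError when the text
    does not fit the assumption.
    """
    return garbled.encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    # Illustrative sample: the French word as it looks after one misdecoding.
    print(undo_latin1_misdecoding("all\u00c3\u00a9e"))
```

If the repaired text matches what MySQL shows when queried manually, that supports the idea that a latin1 interpretation happens somewhere between the database and Solr.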
Re: tomcat55/solr1.3 - Indexing data, doesnt take in consideration utf8!
Posted by sunnyfr <jo...@gmail.com>.
Hi Jerome,
I tried to chat with you on your website, but you weren't there.
OK, I tried what you suggested, and this is what my file looks like in gedit:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int
name="QTime">0</int><lst name="params"><str
name="q">ALL</str></lst></lst><result name="response" numFound="3"
start="0"><doc><date name="created">2006-10-10T05:29:32Z</date><str
name="description_ja">All Japan Women's Pro-wrestling
<br /><br />WWWA Champion Title Match
<br /><br />è±ç”°çœŸå¥ˆç¾Ž VS 井上京å
<br /><br /></str><int name="id">813343</int><str
name="language">JA</str><int name="rating_binrate">40</int><arr
name="spell"><str>Toyota Manami VS Inoue Kyoko</str></arr><int
name="stat_views">1422</int>....
and only this much shows up in OpenOffice:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int
name="QTime">0</int><lst name="params"><str
name="q">ALL</str></lst></lst><result name="response" numFound="3"
start="0"><doc><date name="created">2006-10-10T05:29:32Z</date><str
name="description_ja">All Japan Women's Pro-wrestling
:( don't know!
Jérôme Etévé wrote:
>
> Looks like you have a double encoding problem.
>
> It might be because you fetch UTF-8 binary data from mysql (I know
> that for instance the perl driver has an issue with that) and you then
> encode it a second time in UTF-8 when you post to solr.
>
> Make sure the string you're getting from mysql are actually proper
> unicode strings and not the raw UTF-8 encoded binary form.
>
> You may want to have a look at
> http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-charsets.html
> for the proper option to use with your connection.
>
> What you can try to check you're posting actual UTF-8 data to solr is
> to dump your xml post in a file (don't forget to set the input
> encoding to UTF-8 ). Then you can check if this file is readable with
> any UTF-8 aware editor.
>
> Cheers,
>
> Jerome.
>
>
> On Tue, Oct 21, 2008 at 10:43 AM, sunnyfr <jo...@gmail.com> wrote:
>>
>> Hi,
>>
>> I've solr 1.3 and tomcat55.
>> When I try to index a bit of data and I request ALL, obviously my accent
>> and
>> UTF8 encoding is not took in consideration.
>> <doc>
>> <date name="created">2006-12-14T15:28:27Z</date>
>> <str name="description_ja">
>> Le 1er film de Goro Miyazaki (fils de Hayao)
>> <br />je suis allÃ(c)e ...
>> ....
>> <str name="title_ja">渡邊 å‰ å· vs 三ç"°ä¸‹ç"° 1</str>
>>
>>
>> My database Mysql is well in UTF8, if I request data manually from mysql
>> I
>> will get accent even japan characters properly
>>
>> I index my data, my data-config is :
>> <dataSource type="JdbcDataSource"
>> driver="com.mysql.jdbc.Driver"
>> url="jdbc:mysql://master-spare.videos.com/videos"
>> user="solr"
>> password="pass"
>> batchSize="-1"
>> responseBuffering="adaptive"/>
>>
>> My schema config file start by : <?xml version="1.0" encoding="UTF-8" ?>
>>
>> I've add in my server.xml : because my localhost point on 8180
>> <Connector port="8180" maxHttpHeaderSize="8192"
>> maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
>> enableLookups="false" redirectPort="8443" acceptCount="100"
>> connectionTimeout="20000" disableUploadTimeout="true"
>> URIEncoding="UTF-8" useBodyEncodingForURI="true" />
>>
>> What can I check?
>> I'm using a linux server.
>> If I do dpkg-reconfigure -plow locales
>> Generating locales...
>> fr_BE.UTF-8... up-to-date
>> fr_CA.UTF-8... up-to-date
>> fr_CH.UTF-8... up-to-date
>> fr_FR.UTF-8... up-to-date
>> fr_LU.UTF-8... up-to-date
>> Generation complete.
>>
>> Would that be a problem, I would say no but maybe, do I miss a package???
>>
>>
>>
>> --
>> View this message in context:
>> http://www.nabble.com/tomcat55-solr1.3---Indexing-data%2C-doesnt-take-in-consideration-utf8%21-tp20086167p20086167.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
>
>
>
> --
> Jerome Eteve.
>
> Chat with me live at http://www.eteve.net
>
> jerome@eteve.net
>
>
Re: tomcat55/solr1.3 - Indexing data, doesnt take in consideration utf8!
Posted by Jérôme Etévé <je...@gmail.com>.
Looks like you have a double encoding problem.
It might be because you fetch UTF-8 binary data from MySQL (I know
that the Perl driver, for instance, has an issue with that) and then
encode it a second time in UTF-8 when you post to Solr.
Make sure the strings you're getting from MySQL are actually proper
Unicode strings and not the raw UTF-8 encoded binary form.
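To make the double encoding idea concrete, here is a small Python sketch of the general mechanism. It is an illustration only, not a diagnosis of where in this particular pipeline the problem occurs; the function name double_encode and the sample word are assumptions made for the example.

```python
def double_encode(text: str) -> bytes:
    """Simulate double encoding of a Unicode string."""
    utf8_bytes = text.encode("utf-8")        # correct UTF-8 bytes of the text
    misread = utf8_bytes.decode("latin-1")   # bytes wrongly treated as latin1: one character per byte
    return misread.encode("utf-8")           # the misread characters encoded to UTF-8 a second time


if __name__ == "__main__":
    garbled = double_encode("all\u00e9e")
    # What a UTF-8 aware client would display for the doubly encoded bytes.
    print(garbled.decode("utf-8"))
```

Each accented character, stored as two bytes in UTF-8, turns into two separate characters, which is the typical look of this kind of garbled output.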
You may want to have a look at
http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-charsets.html
for the proper option to use with your connection.
To check that you're posting actual UTF-8 data to Solr, you can
dump your XML post to a file (don't forget to set the input
encoding to UTF-8). Then you can check whether this file is readable with
any UTF-8 aware editor.
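Besides opening the dump in an editor, the check can be scripted. The sketch below is a rough heuristic in Python under stated assumptions: the function name inspect_utf8 is hypothetical, and the pattern only recognizes traces of UTF-8 bytes that were misread as latin1, so it can miss other kinds of damage and can occasionally flag legitimate text.

```python
import re

# Seen as latin1 characters, a UTF-8 lead byte (0xC2-0xF4) followed by
# continuation bytes (0x80-0xBF) is a typical trace of double encoding.
SUSPICIOUS = re.compile("[\u00c2-\u00f4][\u0080-\u00bf]+")


def inspect_utf8(data: bytes) -> tuple:
    """Return (is_valid_utf8, suspicious_sequences) for raw file contents."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Not UTF-8 at all, so there is nothing further to inspect.
        return False, []
    return True, SUSPICIOUS.findall(text)
```

For a dump saved under a hypothetical name such as post.xml, calling inspect_utf8(open("post.xml", "rb").read()) returns whether the bytes decode as UTF-8 and which character runs look doubly encoded. An empty list for a file full of accents suggests the post itself is fine.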
Cheers,
Jerome.
--
Jerome Eteve.
Chat with me live at http://www.eteve.net
jerome@eteve.net