Posted to solr-user@lucene.apache.org by Shalin Shekhar Mangar <sh...@gmail.com> on 2009/01/13 14:44:43 UTC

Re: DataImportHandler: UTF-8 and Mysql

On Mon, Jan 12, 2009 at 3:48 PM, gwk <gi...@eyefi.nl> wrote:

> 1. Posting UTF-8 data through the example post-script works and I get
>     the proper results back when I query using the admin page.
>     However, data imported through the DataImportHandler from a MySQL
>     database (the database contains correct data, it's a copy of a
>     production db and selecting through the client gives the correct
>     characters) I get "Ã³" instead of "ó". I've tried several
>     combinations of arguments to my datasource url
>     (useUnicode=true&characterEncoding=UTF-8) but it does not seem to
>     help. How do I get this to work correctly?


DataImportHandler does not change any encoding. It receives a Java string
object from the driver and adds it to Solr. So I'm guessing the problem is
in the database or in the driver. Did you create the tables with UTF-8
encoding? Try looking in the MySQL driver configuration parameters to force
UTF-8. Sorry, I can't be of much help here.
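
For reference, here is a minimal sketch of where those driver parameters would go, assuming MySQL Connector/J and a data-config.xml along the lines of the wiki example. The host, database name and credentials are placeholders, not values from this thread.

```xml
<dataConfig>
  <!-- Placeholder host, database and credentials; replace with your own. -->
  <!-- Inside an XML attribute, the "&" between URL parameters must be written as "&amp;". -->
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb?useUnicode=true&amp;characterEncoding=UTF-8"
              user="db_user"
              password="db_password"/>
  <!-- The <document> element with your entities goes here. -->
</dataConfig>
```

As far as I know, these parameters only affect how the driver talks to the server, so they are unlikely to help if the character set declared on the table or column is itself wrong.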


> 2. On the wikipage for DataImportHandler, the deletedPkQuery has no
>     real description, am I correct in assuming it should contain a
>     query which returns the ids of items which should be removed from
>     the index?


Yes, you are right. It should return the primary keys of the rows to be
deleted.
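
A rough sketch of how that could look for a delta-import, assuming a hypothetical table "item" with a primary key "ID", a "last_modified" timestamp and a "deleted" flag (none of these names come from your setup):

```xml
<!-- Hypothetical table and column names: item, ID, last_modified, deleted. -->
<entity name="item" pk="ID"
        query="select * from item where deleted = 0"
        deltaQuery="select ID from item where last_modified &gt; '${dataimporter.last_index_time}'"
        deletedPkQuery="select ID from item where deleted = 1">
  <!-- Map the primary key column to the Solr uniqueKey field (assumed to be "id"). -->
  <field column="ID" name="id"/>
</entity>
```

The column returned by deletedPkQuery should be the same one named in the entity's "pk" attribute.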


>
>  3. Another question concerning the DataImportHandler wikipage, I'm
>     not sure about the exact way the field-tag works. From the first
>     data-config.xml example for the full-import I can infer that the
>     "column"-attribute represents the column from the sql-query and
>     the "name"-attribute represents the name of the field in the
>     schema the column should map to. However, further on in the
>     RegexTransformer section there are column-attributes which do not
>     correspond to the sql-query result set, and it's the "sourceColName"
>     attribute which actually represents that data. That comes from the
>     RegexTransformer, I understand, but why then is the "column"
>     attribute used instead of the "name"-attribute? This has confused
>     me somewhat; any clarification would be greatly appreciated.
>

DataImportHandler reads by "column" from the resultset and writes by "name"
to Solr (or, if "name" is unspecified, by "column"). So "column" is compulsory
but "name" is optional.
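
As a small illustration of that rule, assuming a hypothetical table "item" with columns "ID" and "NAME":

```xml
<entity name="item" query="select ID, NAME from item">
  <!-- Read the result set column "ID" and write it to the Solr field "id". -->
  <field column="ID" name="id"/>
  <!-- No "name" given, so the value is written to Solr under the column name, "NAME". -->
  <field column="NAME"/>
</entity>
```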

The typical use-case for a RegexTransformer is when you want to read a field
(say "a"), process it (save it as "b") and then add it to Solr (by name
"c").

So you read by "sourceColName", process and save it as "column" and write to
Solr as "name". If "name" is unspecified, it will be written to Solr as
"column". The reason we use "column" and not "name" is that the user may want
to do something more with it, for example use that field in a template and
save that template to Solr. I know it is a bit confusing but it helps us to
keep DIH general enough.
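
To make the a/b/c case concrete, here is a sketch using the transformer's "splitBy" attribute. The table name and the comma separator are assumptions for illustration only.

```xml
<entity name="item" transformer="RegexTransformer" query="select a from item">
  <!-- Read column "a" from the result set, split it on commas,
       keep the result in the row as "b", and write it to the Solr field "c". -->
  <field column="b" sourceColName="a" splitBy="," name="c"/>
</entity>
```

Because splitting can produce several values, the Solr field "c" would need to be multiValued in the schema. If name="c" were left out, the values would be written to a Solr field called "b" instead.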

Hope that helps.

-- 
Regards,
Shalin Shekhar Mangar.

Re: DataImportHandler: UTF-8 and Mysql

Posted by gwk <gi...@eyefi.nl>.
Shalin Shekhar Mangar wrote:
> On Mon, Jan 12, 2009 at 3:48 PM, gwk <gi...@eyefi.nl> wrote:
>
>   
>> 1. Posting UTF-8 data through the example post-script works and I get
>>     the proper results back when I query using the admin page.
>>     However, data imported through the DataImportHandler from a MySQL
>>     database (the database contains correct data, it's a copy of a
>>     production db and selecting through the client gives the correct
>>     characters) I get "Ã³" instead of "ó". I've tried several
>>     combinations of arguments to my datasource url
>>     (useUnicode=true&characterEncoding=UTF-8) but it does not seem to
>>     help. How do I get this to work correctly?
>>     
>
>
> DataImportHandler does not change any encoding. It receives a Java string
> object from the driver and adds it to Solr. So I'm guessing the problem is
> in the database or in the driver. Did you create the tables with UTF-8
> encoding? Try looking in the MySQL driver configuration parameters to force
> UTF-8. Sorry, I can't be of much help here.
>
>
>   
I checked again and you were right: while the columns contained
UTF-8-encoded strings, the declared encoding of the columns was set to
latin1. I've fixed the database and now it's working correctly.
>> 2. On the wikipage for DataImportHandler, the deletedPkQuery has no
>>     real description, am I correct in assuming it should contain a
>>     query which returns the ids of items which should be removed from
>>     the index?
>>     
>
>
> Yes, you are right. It should return the primary keys of the rows to be
> deleted.
>
>
>   
>>  3. Another question concerning the DataImportHandler wikipage, I'm
>>     not sure about the exact way the field-tag works. From the first
>>     data-config.xml example for the full-import I can infer that the
>>     "column"-attribute represents the column from the sql-query and
>>     the "name"-attribute represents the name of the field in the
>>     schema the column should map to. However, further on in the
>>     RegexTransformer section there are column-attributes which do not
>>     correspond to the sql-query result set, and it's the "sourceColName"
>>     attribute which actually represents that data. That comes from the
>>     RegexTransformer, I understand, but why then is the "column"
>>     attribute used instead of the "name"-attribute? This has confused
>>     me somewhat; any clarification would be greatly appreciated.
>>
>>     
>
> DataImportHandler reads by "column" from the resultset and writes by "name"
> to Solr (or, if "name" is unspecified, by "column"). So "column" is compulsory
> but "name" is optional.
>
> The typical use-case for a RegexTransformer is when you want to read a field
> (say "a"), process it (save it as "b") and then add it to Solr (by name
> "c").
>
> So you read by "sourceColName", process and save it as "column" and write to
> Solr as "name". If "name" is unspecified, it will be written to Solr as
> "column". The reason we use "column" and not "name" is that the user may want
> to do something more with it, for example use that field in a template and
> save that template to Solr. I know it is a bit confusing but it helps us to
> keep DIH general enough.
>
> Hope that helps.
>
>   

Ok, that explains it for me, thanks for the clarification.