Posted to solr-user@lucene.apache.org by Joel Nylund <jn...@yahoo.com> on 2009/10/29 17:18:39 UTC

data import with transformer

Hi, I have been reading the Solr book and wiki, but I can't find any
examples similar to what I'm looking for.

I have a database field called category; this field needs some text
manipulation before it goes into the index.

Here is the Java code for what I'm trying to do:

// Categories look like this: "prefix category suffix".
// I want to turn them into "category": remove the prefix, the suffix,
// and the spaces before and after.
// (startString and endString are fields defined outside this method.)
  public static String getPrettyCategoryName(String categoryName)
     {
         String result;

         if (categoryName == null || categoryName.equals(""))
         {
             // nothing to do; just return what was passed in.
             result = categoryName;
         }
         else
         {
             result = categoryName.toLowerCase();

             // strip the leading prefix, if present
             if (result.startsWith(startString))
             {
                 result = result.substring(startString.length());
             }

             // strip the trailing suffix, if present
             if (result.endsWith(endString))
             {
                 result = result.substring(0,
                     result.length() - endString.length());
             }

             // capitalize the first character
             if (result.length() > 0)
             {
                 result = Character.toUpperCase(result.charAt(0))
                     + result.substring(1);
             }
         }

         return result;
     }


Can I have a transformer call a Java method?

It seems like I can, but how do I transform just one column? If
someone can point me to a complete example that transforms a column
using Java or JavaScript, I'm sure I can figure this out.


thanks
Joel


Re: character encoding issue

Posted by gwk <gi...@eyefi.nl>.
I had a similar problem when using the DataImportHandler on my database
a couple of months ago. This was an old MySQL database that was storing
UTF-8 in a latin1 table. PHP handles this fine, but 'proper' database
connectors coerce the data to the column's/table's/database's character
encoding, which causes Solr to import the data incorrectly. If this
is the cause, you can fix it with a couple of ALTER TABLE statements (see
the ALTER TABLE syntax page in the MySQL documentation, specifically the
'CONVERT TO' section), but you will have to test whether your PHP
application still works correctly.
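
To make the failure mode concrete, here is a minimal Java sketch of what
happens when UTF-8 bytes sit in a column declared as latin1 and a
charset-aware connector trusts that declaration. The class and method
names are made up for illustration, and ISO-8859-1 is used as a stand-in
for MySQL's latin1 (which is closer to cp1252), so treat it as an
approximation rather than what any particular driver does.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Solr or any JDBC driver.
class Latin1TableDemo {

    // Decode stored bytes using the column's declared encoding (latin1),
    // which is roughly what a charset-aware connector would do.
    static String readAsLatin1(byte[] storedBytes) {
        return new String(storedBytes, StandardCharsets.ISO_8859_1);
    }

    // Undo the mis-decoding: get the raw bytes back and decode them as UTF-8.
    static String repair(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00E9"; // e with acute accent
        byte[] stored = original.getBytes(StandardCharsets.UTF_8); // 0xC3 0xA9

        String misdecoded = readAsLatin1(stored);
        // One character went in, two come out (U+00C3 U+00A9).
        System.out.println(misdecoded.length()); // prints 2

        System.out.println(repair(misdecoded).equals(original)); // prints true
    }
}
```

Converting the table itself, as described above, removes the need for any
such repair step on the client side.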

Regards,

gwk



Re: character encoding issue

Posted by Jérôme Etévé <je...@gmail.com>.
Hi,

How do you post your data to Solr? If you post XML, it should be
properly encoded in UTF-8 (the XML default), regardless of what's in
the DB (which can be a mystery with MySQL).

At query time, if the XML writer is used, the response is encoded in
UTF-8. If the JSON writer is used, I think it's the same, because JSON
is Unicode-compliant by nature (JavaScript).

From what you describe, I would bet on a PHP problem. It seems PHP
takes the correct UTF-8 octets from Solr and displays them as Latin-1
(hence the strange characters). You need to
- either output your pages in UTF-8
- or decode the octets given by Solr to a Unicode string and let it be
encoded as Latin-1 for output (with the risk of losing characters that
cannot be encoded in Latin-1).
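
The second option and its risk can be sketched in a few lines. The
example is in Java rather than PHP, the class and method names are
invented for illustration, and the byte arrays simply stand in for
whatever Solr returns, so read it as a sketch of the idea and not as
client code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class illustrating UTF-8 to Latin-1 transcoding.
class TranscodeDemo {

    // Decode UTF-8 octets to a string, then re-encode that string as Latin-1.
    // Characters with no Latin-1 equivalent are replaced with '?'.
    static byte[] utf8ToLatin1(byte[] utf8Octets) {
        String text = new String(utf8Octets, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Check up front whether a string survives Latin-1 encoding intact.
    static boolean fitsInLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) exists in Latin-1.
        System.out.println(fitsInLatin1("\u00E9")); // prints true
        // U+2013 (en dash) does not, so it would be lost on output.
        System.out.println(fitsInLatin1("\u2013")); // prints false
    }
}
```

Outputting the pages in UTF-8 avoids this loss entirely, which is why it
is usually the simpler of the two options.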

I hope it helps.

J.




-- 
Jerome Eteve.
http://www.eteve.net
jerome@eteve.net

Re: character encoding issue

Posted by Jonathan Hendler <jo...@gmail.com>.
Hi Peter,

I have the same set of issues and will look for a response here.

Sometimes those odd characters are created at input time (for example,
during extraction from a Microsoft Office document by a third-party
tool). But the MySQL data looking OK in the browser might be because
the encoding declared in MySQL is not the same as that of the original
text. Say, for example, that the MySQL collation is Latin-1 and the
document was UTF-8. When a browser renders the page, it might assume the
characters are UTF-8, but Solr might be taking the table type literally
in the DIH (Latin1 Swedish, for example). It could also be that PHP
doesn't handle UTF-8 well, and it depends on your client.

I don't think it has anything to do with Jetty; I use Resin.

Hope that helps,

- Jonathan




character encoding issue

Posted by Peter Hedlund <pm...@virginia.edu>.
I'm having a problem with character encoding.  The data that I'm indexing with SOLR is being pulled from a MySQL database and then the index is being integrated into a PHP application.  When I display the text from the SOLR index it's full of strange characters (–, é, etc...).  However, when I bypass SOLR and access the data from the MySQL table directly and write to the browser I don't see any problems with em-dashes and accented characters.

Is this a JETTY issue or a SOLR issue or something else?  (It's not simply an issue of including <meta http-equiv="Content-Type" content="text/html;charset=UTF-8"> either)

Thanks for any help.

Peter Hedlund



Re: data import with transformer

Posted by Chantal Ackermann <ch...@btelligent.de>.
Another option is the RegexTransformer in DIH: 
http://wiki.apache.org/solr/DataImportHandler?highlight=%28regex%29#RegexTransformer
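
Before wiring a pattern into the DIH configuration, it can help to
prototype it in plain Java. The sketch below assumes the literal words
"prefix" and "suffix" from the example in the original question; the real
values are not given in this thread, and the class name is made up. The
wiki page linked above describes how to attach a pattern to a field.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex itself would carry over to DIH.
class CategoryRegexDemo {

    // Matches a leading "prefix" plus following whitespace, or
    // whitespace plus a trailing "suffix". Both values are placeholders.
    private static final Pattern AFFIXES =
        Pattern.compile("^prefix\\s+|\\s+suffix$");

    // Remove the prefix and suffix (and the spaces next to them).
    static String strip(String category) {
        return AFFIXES.matcher(category).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("prefix category suffix")); // prints category
    }
}
```

Unlike the Java method in the original question, this sketch does not
change the case of the result.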

Chantal


Re: data import with transformer

Posted by William Pierce <ev...@hotmail.com>.
I'd recommend two ways. The way I do it in my app is that I have written a
MySQL function to transform the column as part of the select statement. In
this approach, your select query would look like so:
   select  col1, col2, col3, spPrettyPrintCategory(category) as X, col4,
col5, .... from table where ....

  <field column="X" name="category" />

The <field> element maps the column "X" to the Solr field name,
which I am assuming is the same as your "category" name.

The second approach is to write a JavaScript transformer. The relevant
code is in the wiki:

<dataConfig>
        <script><![CDATA[
                function PrettyCategory(row)    {
                   //split on spaces
                    var pieces = row.get('category').split(' ');
                    // get the second element of this array...do a trim if needed...
                    var catname = pieces[1];
                    row.remove('category');
                    row.put('category', catname);
                    return row;
                }
        ]]></script>
        <document>
                <entity name="e" transformer="script:PrettyCategory"
                        query="select * from X">
                ....
                </entity>
        </document>
</dataConfig>

- Bill


Re: data import with transformer

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Thu, Oct 29, 2009 at 9:48 PM, Joel Nylund <jn...@yahoo.com> wrote:

> Can I have a transformer call a java method?
>
> It seems like I can, but how do I transform just one column? If someone can
> point me to a complete example that transforms a column using Java or
> JavaScript, I'm sure I can figure this out.
>
>
Sure, why not. You can either copy this method into your transformer or put
the jar into solr_home/lib and call it from your transformer. The row to be
transformed is a Map<String, Object>, so just look up "category" in the
map, transform its value, and put it back in the map.
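
A minimal sketch of such a transformer follows. It relies on my
understanding of the DIH wiki that a transformer class may simply expose a
transformRow(Map<String, Object>) method instead of extending the
Transformer base class; please verify that against your Solr version. The
class name and the START and END values are placeholders, since the real
prefix and suffix are not given in this thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical transformer; when deploying, it would likely need to be a
// public class in its own file so that Solr can load it.
class CategoryTransformer {

    // Placeholder values: the actual prefix and suffix are assumptions.
    private static final String START = "prefix ";
    private static final String END = " suffix";

    // Look up "category", transform its value, and put it back in the row.
    public Object transformRow(Map<String, Object> row) {
        Object value = row.get("category");
        if (value != null) {
            row.put("category", getPrettyCategoryName(value.toString()));
        }
        return row;
    }

    // Same logic as the method in the original question.
    static String getPrettyCategoryName(String categoryName) {
        if (categoryName == null || categoryName.equals("")) {
            return categoryName;
        }
        String result = categoryName.toLowerCase();
        if (result.startsWith(START)) {
            result = result.substring(START.length());
        }
        if (result.endsWith(END)) {
            result = result.substring(0, result.length() - END.length());
        }
        if (result.length() > 0) {
            result = Character.toUpperCase(result.charAt(0)) + result.substring(1);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("category", "Prefix Books Suffix");
        new CategoryTransformer().transformRow(row);
        System.out.println(row.get("category")); // prints Books
    }
}
```

Only the "category" entry is touched, so every other column in the row
passes through unchanged.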

-- 
Regards,
Shalin Shekhar Mangar.