Posted to solr-user@lucene.apache.org by Joel Nylund <jn...@yahoo.com> on 2009/10/29 17:18:39 UTC
data import with transformer
Hi, I have been reading the Solr book and wiki, but I can't find any
examples similar to what I'm looking for.
I have a database field called category; this field needs some text
manipulation before it goes into the index.
Here is the Java code for what I'm trying to do:
// Categories look like this: "prefix category suffix".
// I want to turn them into "category": remove the prefix and suffix and
// the spaces before and after.
// startString and endString are fields defined elsewhere (not shown in
// this post). The input is lowercased first, so they should be lowercase.
public static String getPrettyCategoryName(String categoryName)
{
    String result;
    if (categoryName == null || categoryName.equals(""))
    {
        // Nothing to do; just return what was passed in.
        result = categoryName;
    }
    else
    {
        result = categoryName.toLowerCase();
        if (result.startsWith(startString))
        {
            result = result.substring(startString.length());
        }
        if (result.endsWith(endString))
        {
            result = result.substring(0, result.length() - endString.length());
        }
        // Drop the spaces left around the category name.
        result = result.trim();
        if (result.length() > 0)
        {
            // Capitalize the first letter.
            result = Character.toUpperCase(result.charAt(0))
                    + result.substring(1);
        }
    }
    return result;
}
Can I have a transformer call a Java method?
It seems like I can, but how do I transform just one column? If
someone can point me to a complete example that transforms a column
using Java or JavaScript, I'm sure I can figure this out.
Thanks,
Joel
Re: character encoding issue
Posted by gwk <gi...@eyefi.nl>.
I had a similar problem when using the DataImportHandler on my database
a couple of months ago. This was an old MySQL database which was storing
UTF-8 in a latin1 table. PHP handles this fine, but 'proper' database
connectors coerce the data to the column's/table's/database's character
encoding, which causes Solr to import the data incorrectly. If this
is the cause, you can fix it with a couple of ALTER TABLE statements (see
the ALTER TABLE syntax page in the MySQL documentation, specifically the
'CONVERT TO' section), but you will have to test whether your PHP
application still works correctly.
Regards,
gwk
Re: character encoding issue
Posted by Jérôme Etévé <je...@gmail.com>.
Hi,
How do you post your data to Solr? If it's by posting XML, then it
should be properly encoded in UTF-8 (which is the XML default),
regardless of what's in the DB (which can be a mystery with MySQL).
At query time, if the XML writer is used, then the response is encoded
in UTF-8. If the JSON one is used, I think it's the same, because JSON is
Unicode-compliant by nature (JavaScript).
From what you say, I would bet on a PHP problem. It seems PHP
takes the correct UTF-8 octets from Solr and displays them as if they
were latin1-encoded (hence the strange characters). You need to
- either output your pages in UTF-8
- or decode the octets given by Solr to a Unicode string and let it be
encoded as latin1 for output (with the risk of losing characters that
cannot be encoded in latin1).
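The mismatch described above can be reproduced in a few lines. The following is a minimal Java sketch, not taken from the thread; the class and method names (MojibakeDemo, misreadUtf8AsLatin1, repair) are made up for illustration. It encodes a string as UTF-8, decodes the same bytes as latin1 (ISO-8859-1), and then reverses the mistake.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it is not part of Solr, PHP, or MySQL.
class MojibakeDemo {

    // Encode the text as UTF-8, then wrongly decode those bytes as latin1.
    // Each multi-byte UTF-8 sequence turns into several latin1 characters.
    static String misreadUtf8AsLatin1(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Undo the mistake: recover the original bytes, then decode them as UTF-8.
    // This works because latin1 maps every byte value to exactly one character.
    static String repair(String garbled) {
        byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }
}
```

Calling misreadUtf8AsLatin1 on an e with an acute accent (U+00E9) returns the two characters U+00C3 and U+00A9, which is the kind of garbling a page shows when UTF-8 output is labelled or treated as latin1.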
I hope it helps.
J.
2009/11/4 Jonathan Hendler <jo...@gmail.com>:
> Hi Peter,
>
> I have the same set of issues and will look for a response here.
>
> Sometimes those other chars can be created at the time of input (like
> extraction from a Microsoft Office doc by a third-party tool, for example).
> But MySQL looking OK in the browser might be because the encoding of MySQL
> was not the same as the original text. Say, for example, that the collation of
> MySQL is Latin, and the document was UTF-8. When a browser renders, it might
> assume chars are UTF-8, but SOLR might be taking the table type literally in
> the DIH (Latin1 Swedish for example). It could also be that PHP doesn't
> handle UTF-8 well, and it depends on your client.
>
> Don't think it has anything to do with Jetty - I use Resin.
>
> Hope that helps,
>
> - Jonathan
--
Jerome Eteve.
http://www.eteve.net
jerome@eteve.net
Re: character encoding issue
Posted by Jonathan Hendler <jo...@gmail.com>.
Hi Peter,
I have the same set of issues and will look for a response here.
Sometimes those other chars can be created at the time of input (like
extraction from a Microsoft Office doc by a third-party tool, for
example). But MySQL looking OK in the browser might be because the
encoding of MySQL was not the same as the original text. Say, for
example, that the collation of MySQL is Latin, and the document was
UTF-8. When a browser renders, it might assume chars are UTF-8, but
SOLR might be taking the table type literally in the DIH (Latin1
Swedish for example). It could also be that PHP doesn't handle UTF-8
well, and it depends on your client.
Don't think it has anything to do with Jetty - I use Resin.
Hope that helps,
- Jonathan
On Nov 4, 2009, at 8:48 AM, Peter Hedlund wrote:
> I'm having a problem with character encoding. The data that I'm
> indexing with SOLR is being pulled from a MySQL database and then
> the index is being integrated into a PHP application. When I
> display the text from the SOLR index it's full of strange characters
> (–, é, etc...). However, when I bypass SOLR and access the data
> from the MySQL table directly and write to the browser I don't see
> any problems with em-dashes and accented characters.
>
> Is this a JETTY issue or a SOLR issue or something else? (It's not
> simply an issue of including <meta http-equiv="Content-Type"
> content="text/html;charset=UTF-8"> either)
>
> Thanks for any help.
>
> Peter Hedlund
>
>
character encoding issue
Posted by Peter Hedlund <pm...@virginia.edu>.
I'm having a problem with character encoding. The data that I'm indexing with SOLR is being pulled from a MySQL database and then the index is being integrated into a PHP application. When I display the text from the SOLR index it's full of strange characters (–, é, etc...). However, when I bypass SOLR and access the data from the MySQL table directly and write to the browser I don't see any problems with em-dashes and accented characters.
Is this a JETTY issue or a SOLR issue or something else? (It's not simply an issue of including <meta http-equiv="Content-Type" content="text/html;charset=UTF-8"> either)
Thanks for any help.
Peter Hedlund
Re: data import with transformer
Posted by Chantal Ackermann <ch...@btelligent.de>.
Another option is the RegexTransformer in DIH:
http://wiki.apache.org/solr/DataImportHandler?highlight=%28regex%29#RegexTransformer
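RegexTransformer applies Java regular expressions, so a pattern can be tried out in plain Java before it goes into the DIH config. The sketch below is only an illustration: the class name and the PREFIX and SUFFIX values are hypothetical placeholders, since the thread never states the real prefix and suffix. As far as I recall, the pattern and the replacement would go into the transformer's regex and replaceWith field attributes; check the wiki page above for the exact attribute names.

```java
import java.util.regex.Pattern;

// Hypothetical helper for testing a prefix/suffix-stripping pattern.
class CategoryRegexSketch {

    // Placeholder values; substitute the real prefix and suffix.
    static final String PREFIX = "prefix";
    static final String SUFFIX = "suffix";

    // Matches either the leading prefix (with surrounding spaces)
    // or the trailing suffix (with surrounding spaces).
    static final Pattern STRIP = Pattern.compile(
            "^\\s*" + Pattern.quote(PREFIX) + "\\s*"
            + "|\\s*" + Pattern.quote(SUFFIX) + "\\s*$");

    // Remove both matches and trim whatever whitespace is left.
    static String stripPrefixAndSuffix(String category) {
        return STRIP.matcher(category).replaceAll("").trim();
    }
}
```

Unlike the Java method in the original question, this pattern is case-sensitive and does not capitalize the result.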
Chantal
Re: data import with transformer
Posted by William Pierce <ev...@hotmail.com>.
I'd recommend two ways. The way I do it in my app is that I have written a
MySQL function to transform the column as part of the select statement. In
this approach, your select query would look like so:
    select col1, col2, col3, spPrettyPrintCategory(category) as X, col4,
    col5, .... from table where ....
    <field column="X" name="category" />
The <field> element is used to map the column "X" to the Solr field name,
which I am assuming is the same as your "category" name.
The second approach is to write a JavaScript transformer. The relevant
code is in the wiki:
<dataConfig>
  <script><![CDATA[
    function PrettyCategory(row) {
      // Split on spaces.
      var pieces = row.get('category').split(' ');
      // Get the second element of this array; do a trim if needed.
      var catname = pieces[1];
      row.remove('category');
      row.put('category', catname);
      return row;
    }
  ]]></script>
  <document>
    <entity name="e" transformer="script:PrettyCategory"
            query="select * from X">
      ....
    </entity>
  </document>
</dataConfig>
- Bill
Re: data import with transformer
Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Thu, Oct 29, 2009 at 9:48 PM, Joel Nylund <jn...@yahoo.com> wrote:
> Can I have a transformer call a java method?
>
> It seems like I can, but how do I transform just one column? If someone can
> point me to a complete example that transforms a column using Java or
> JavaScript, I'm sure I can figure this out.
>
>
Sure, why not. You can either copy this method into your transformer or put
the jar into solr_home/lib and call it from your transformer. The row to be
transformed is a Map<String, Object>, so just look up "category" in the
map, transform its value, and put it back in the Map.
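The lookup, transform, and put-back steps described above could look roughly like this. It is a sketch under assumptions, not the definitive way to do it: the class name CategoryTransformer and the START and END values are hypothetical, and it relies on the DataImportHandler wiki's description that a custom transformer may be any class exposing a transformRow(Map<String, Object>) method, which should be verified against your Solr version.

```java
import java.util.Map;

// Hypothetical custom transformer. For real use with DIH the class would
// most likely need to be public, in its own file, and packaged in a jar
// under solr_home/lib.
class CategoryTransformer {

    // Placeholder values; the thread does not give the real prefix/suffix.
    private static final String START = "prefix";
    private static final String END = "suffix";

    // Called once per row; only the "category" column is touched.
    public Object transformRow(Map<String, Object> row) {
        Object value = row.get("category");
        if (value != null) {
            row.put("category", getPrettyCategoryName(value.toString()));
        }
        return row;
    }

    // Same logic as the method in the original question, plus a trim.
    static String getPrettyCategoryName(String categoryName) {
        if (categoryName == null || categoryName.isEmpty()) {
            return categoryName;
        }
        String result = categoryName.toLowerCase();
        if (result.startsWith(START)) {
            result = result.substring(START.length());
        }
        if (result.endsWith(END)) {
            result = result.substring(0, result.length() - END.length());
        }
        result = result.trim();
        if (!result.isEmpty()) {
            result = Character.toUpperCase(result.charAt(0)) + result.substring(1);
        }
        return result;
    }
}
```

The entity in the DIH config would then name the class in its transformer attribute by fully qualified name (for example a hypothetical com.example.CategoryTransformer). Because only the "category" entry of the map is replaced, all other columns pass through unchanged.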
--
Regards,
Shalin Shekhar Mangar.