You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Andy <an...@yahoo.com> on 2010/11/13 21:55:27 UTC
DIH for multilingual index & multiValued field?
I have a MySQL table:
CREATE TABLE documents (
id INT NOT NULL AUTO_INCREMENT,
language_code CHAR(2),
tags CHAR(30),
text TEXT,
PRIMARY KEY (id)
);
I have 2 questions about Solr DIH:
1) The "langauge_code" field indicates what language the "text" field is in. And depending on the language, I want to index "text" to different Solr fields.
# pseudo code
if langauge_code == "en":
index "text" to Solr field "text_en"
elif langauge_code == "fr":
index "text" to Solr field "text_fr"
elif langauge_code == "zh":
index "text" to Solr field "text_zh"
...
Can DIH handle a usecase like this? How do I configure it to do so?
2) The "tags" field needs to be indexed into a Solr multiValued field. Multiple values are stored in a string, separated by a comma. For example, if `tags` contains the string "blue, green, yellow" then I want to index the 3 values "blue", "green", "yellow" into a Solr multiValued field.
How do I do that with DIH?
Thanks.
Re: DIH for multilingual index & multiValued field?
Posted by Imran <im...@gmail.com>.
I think a custom transformer would be of help in these scenarios
http://wiki.apache.org/solr/DIHCustomTransformer
<http://wiki.apache.org/solr/DIHCustomTransformer>Cheers
-- Imran
On Sat, Nov 13, 2010 at 8:55 PM, Andy <an...@yahoo.com> wrote:
> I have a MySQL table:
>
> CREATE TABLE documents (
> id INT NOT NULL AUTO_INCREMENT,
> language_code CHAR(2),
> tags CHAR(30),
> text TEXT,
> PRIMARY KEY (id)
> );
>
> I have 2 questions about Solr DIH:
>
> 1) The "langauge_code" field indicates what language the "text" field is
> in. And depending on the language, I want to index "text" to different Solr
> fields.
>
> # pseudo code
>
> if langauge_code == "en":
> index "text" to Solr field "text_en"
> elif langauge_code == "fr":
> index "text" to Solr field "text_fr"
> elif langauge_code == "zh":
> index "text" to Solr field "text_zh"
> ...
>
> Can DIH handle a usecase like this? How do I configure it to do so?
>
> 2) The "tags" field needs to be indexed into a Solr multiValued field.
> Multiple values are stored in a string, separated by a comma. For example,
> if `tags` contains the string "blue, green, yellow" then I want to index the
> 3 values "blue", "green", "yellow" into a Solr multiValued field.
>
> How do I do that with DIH?
>
> Thanks.
>
>
>
>
Re: DIH for multilingual index & multiValued field?
Posted by Ken Stanley <do...@gmail.com>.
On Sat, Nov 13, 2010 at 5:59 PM, Ken Stanley <do...@gmail.com> wrote:
> CREATE TABLE documents (
> id INT NOT NULL AUTO_INCREMENT,
> language_code CHAR(2),
> tags CHAR(30),
> text TEXT,
> PRIMARY KEY (id)
> );
I apologize, but I couldn't leave the typo in my last post without a
follow up; it might cause confusion. I copied the OP's original table
definition and forgot to remove the tags field. My purposed definition
for the documents table should be:
CREATE TABLE documents (
id INT NOT NULL AUTO_INCREMENT,
language_code CHAR(2),
text TEXT,
PRIMARY KEY (id)
);
- Ken
Re: DIH for multilingual index & multiValued field?
Posted by Ken Stanley <do...@gmail.com>.
On Sat, Nov 13, 2010 at 4:56 PM, Ahmet Arslan <io...@yahoo.com> wrote:
> For (1) you probably need to write a custom transformer. Something like:
> public Object transformRow(Map<String, Object> row) {
> String language_code = row.get("language_code");
> String text = row.get("text");
> if("en".equals(language_code))
> row.put("text_en", text);
> else if if("fr".equals(language_code))
> row.put("text_fr", text);
>
> return row;
> }
>
>
> For (2), it doable with regex transformer.
>
> "<field column="mailId" splitBy="," sourceColName="emailids"/>
> The 'emailids' field in the table can be a comma separated value. So it ends up giving out one or more than one email ids and we expect the 'mailId' to be a multivalued field in Solr." [1]
>
> [1]http://wiki.apache.org/solr/DataImportHandler#RegexTransformer
>
In my opinion, I think that this is a bit of overkill. Since the DIH
supports multiple entities, with no real limit on the SQL queries, I
think that the easiest (and less involved) approach would be to create
three entities for the languages the OP wishes to index:
<entity name="english" query="SELECT * FROM documents WHERE
language_code='en'" transformer="RegexTransformer">
<field column="text_en" column="text" />
<field column="tags" column="tags" splitBy="," />
</entity>
<entity name="french" query="SELECT * FROM documents WHERE
language_code='fr'" transformer="RegexTransformer">
<field column="text_fr" column="text" />
<field column="tags" column="tags" splitBy="," />
</entity>
<entity name="chinese" query="SELECT * FROM documents WHERE
language_code='zh'" transformer="RegexTransformer">
<field column="text_zh" column="text" />
<field column="tags" column="tags" splitBy="," />
</entity>
But, I admit that depending on future growth of languages, as well as
other factors (i.e., needing more specific logic, etc), a programmatic
approach might be warranted.
I would recommend, however, that the database table be a little more
normalized. Your definition for tags is quite limiting, and could be
better served using a many-to-many relationship. Something like the
following might serve you well:
CREATE TABLE documents (
id INT NOT NULL AUTO_INCREMENT,
language_code CHAR(2),
tags CHAR(30),
text TEXT,
PRIMARY KEY (id)
);
CREATE TABLE document_tags (
id INT NOT NULL AUTO_INCREMENT,
tag CHAR(30),
PRIMARY KEY (id)
);
CREATE TABLE document_tag_lookup (
document_id INT NOT NULL,
tag_id INT NOT NULL,
PRIMARY KEY (document_id, tag_id)
);
Then in the DIH, you simply nest a second entity to look up the zero
or more tags that might be associated with your documents; take the
"english" entity from above:
<entity name="english" query="SELECT * FROM documents WHERE
language_code='en'" transformer="RegexTransformer">
<field name="text_en" column="text" />
<entity name="english_tags" query="SELECT * FROM document_tags dt
INNER JOIN document_tag_lookup dtl ON (dtl.tag_id = dt.id AND
dtl.document_id='${english.id}')">
<field name="tags" column="tag" />
</entity>
</entity>
This would allow for growth, and is easy to maintain. Additionally, if
you wanted to implement a custom transformer of your own, you could.
As an aside, a sort of compromise, you could also use the
ScriptTransformer [1] to create a Javascript function that can do your
language logic and create the necessary fields, and not have to worry
about maintaining any custom Java code.
[1] http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer
- Ken
Re: DIH for multilingual index & multiValued field?
Posted by Ahmet Arslan <io...@yahoo.com>.
For (1) you probably need to write a custom transformer. Something like:
public Object transformRow(Map<String, Object> row) {
String language_code = row.get("language_code");
String text = row.get("text");
if("en".equals(language_code))
row.put("text_en", text);
else if if("fr".equals(language_code))
row.put("text_fr", text);
return row;
}
For (2), it doable with regex transformer.
"<field column="mailId" splitBy="," sourceColName="emailids"/>
The 'emailids' field in the table can be a comma separated value. So it ends up giving out one or more than one email ids and we expect the 'mailId' to be a multivalued field in Solr." [1]
[1]http://wiki.apache.org/solr/DataImportHandler#RegexTransformer
--- On Sat, 11/13/10, Andy <an...@yahoo.com> wrote:
> From: Andy <an...@yahoo.com>
> Subject: DIH for multilingual index & multiValued field?
> To: solr-user@lucene.apache.org
> Date: Saturday, November 13, 2010, 10:55 PM
> I have a MySQL table:
>
> CREATE TABLE documents (
> id INT NOT NULL
> AUTO_INCREMENT,
> language_code CHAR(2),
> tags CHAR(30),
> text TEXT,
> PRIMARY KEY (id)
> );
>
> I have 2 questions about Solr DIH:
>
> 1) The "langauge_code" field indicates what language the
> "text" field is in. And depending on the language, I want to
> index "text" to different Solr fields.
>
> # pseudo code
>
> if langauge_code == "en":
> index "text" to Solr field
> "text_en"
> elif langauge_code == "fr":
> index "text" to Solr field
> "text_fr"
> elif langauge_code == "zh":
> index "text" to Solr field
> "text_zh"
> ...
>
> Can DIH handle a usecase like this? How do I configure it
> to do so?
>
> 2) The "tags" field needs to be indexed into a Solr
> multiValued field. Multiple values are stored in a string,
> separated by a comma. For example, if `tags` contains the
> string "blue, green, yellow" then I want to index the 3
> values "blue", "green", "yellow" into a Solr multiValued
> field.
>
> How do I do that with DIH?
>
> Thanks.
>
>
>
>