You are viewing a plain text version of this content. The canonical link for it is here.
Posted to solr-user@lucene.apache.org by Andy <an...@yahoo.com> on 2010/11/13 21:55:27 UTC

DIH for multilingual index & multiValued field?

I have a MySQL table:

    CREATE TABLE documents (
        id INT NOT NULL AUTO_INCREMENT,
        language_code CHAR(2),
        tags CHAR(30),
        text TEXT,
        PRIMARY KEY (id)
    );

I have 2 questions about Solr DIH:

1) The "langauge_code" field indicates what language the "text" field is in. And depending on the language, I want to index "text" to different Solr fields.

    # pseudo code

    if langauge_code == "en":
        index "text" to Solr field "text_en"
    elif langauge_code == "fr":
        index "text" to Solr field "text_fr"
    elif langauge_code == "zh":
        index "text" to Solr field "text_zh"
    ...

Can DIH handle a usecase like this? How do I configure it to do so?

2) The "tags" field needs to be indexed into a Solr multiValued field. Multiple values are stored in a string, separated by a comma. For example, if `tags` contains the string "blue, green, yellow" then I want to index the 3 values "blue", "green", "yellow" into a Solr multiValued field.

How do I do that with DIH?

Thanks.


      

Re: DIH for multilingual index & multiValued field?

Posted by Imran <im...@gmail.com>.
I think a custom transformer would be of help in these scenarios
http://wiki.apache.org/solr/DIHCustomTransformer

<http://wiki.apache.org/solr/DIHCustomTransformer>Cheers
-- Imran

On Sat, Nov 13, 2010 at 8:55 PM, Andy <an...@yahoo.com> wrote:

> I have a MySQL table:
>
>    CREATE TABLE documents (
>        id INT NOT NULL AUTO_INCREMENT,
>        language_code CHAR(2),
>        tags CHAR(30),
>        text TEXT,
>        PRIMARY KEY (id)
>    );
>
> I have 2 questions about Solr DIH:
>
> 1) The "langauge_code" field indicates what language the "text" field is
> in. And depending on the language, I want to index "text" to different Solr
> fields.
>
>    # pseudo code
>
>    if langauge_code == "en":
>        index "text" to Solr field "text_en"
>    elif langauge_code == "fr":
>        index "text" to Solr field "text_fr"
>    elif langauge_code == "zh":
>        index "text" to Solr field "text_zh"
>    ...
>
> Can DIH handle a usecase like this? How do I configure it to do so?
>
> 2) The "tags" field needs to be indexed into a Solr multiValued field.
> Multiple values are stored in a string, separated by a comma. For example,
> if `tags` contains the string "blue, green, yellow" then I want to index the
> 3 values "blue", "green", "yellow" into a Solr multiValued field.
>
> How do I do that with DIH?
>
> Thanks.
>
>
>
>

Re: DIH for multilingual index & multiValued field?

Posted by Ken Stanley <do...@gmail.com>.
On Sat, Nov 13, 2010 at 5:59 PM, Ken Stanley <do...@gmail.com> wrote:
>   CREATE TABLE documents (
>       id INT NOT NULL AUTO_INCREMENT,
>       language_code CHAR(2),
>       tags CHAR(30),
>       text TEXT,
>       PRIMARY KEY (id)
>   );

I apologize, but I couldn't leave the typo in my last post without a
follow up; it might cause confusion. I copied the OP's original table
definition and forgot to remove the tags field. My purposed definition
for the documents table should be:

  CREATE TABLE documents (
      id INT NOT NULL AUTO_INCREMENT,
      language_code CHAR(2),
      text TEXT,
      PRIMARY KEY (id)
  );

- Ken

Re: DIH for multilingual index & multiValued field?

Posted by Ken Stanley <do...@gmail.com>.
On Sat, Nov 13, 2010 at 4:56 PM, Ahmet Arslan <io...@yahoo.com> wrote:
> For (1) you probably need to write a custom transformer. Something like:
> public Object transformRow(Map<String, Object> row)     {
> String language_code = row.get("language_code");
> String text = row.get("text");
> if("en".equals(language_code))
>       row.put("text_en", text);
> else if if("fr".equals(language_code))
>       row.put("text_fr", text);
>
> return row;
> }
>
>
> For (2), it doable with regex transformer.
>
> "<field column="mailId" splitBy="," sourceColName="emailids"/>
> The 'emailids' field in the table can be a comma separated value. So it ends up giving out one or more than one email ids and we expect the 'mailId' to be a multivalued field in Solr." [1]
>
> [1]http://wiki.apache.org/solr/DataImportHandler#RegexTransformer
>

In my opinion, I think that this is a bit of overkill. Since the DIH
supports multiple entities, with no real limit on the SQL queries, I
think that the easiest (and less involved) approach would be to create
three entities for the languages the OP wishes to index:

<entity name="english" query="SELECT * FROM documents WHERE
language_code='en'" transformer="RegexTransformer">
    <field column="text_en" column="text" />
    <field column="tags" column="tags" splitBy="," />
</entity>

<entity name="french" query="SELECT * FROM documents WHERE
language_code='fr'" transformer="RegexTransformer">
    <field column="text_fr" column="text" />
    <field column="tags" column="tags" splitBy="," />
</entity>

<entity name="chinese" query="SELECT * FROM documents WHERE
language_code='zh'" transformer="RegexTransformer">
    <field column="text_zh" column="text" />
    <field column="tags" column="tags" splitBy="," />
</entity>

But, I admit that depending on future growth of languages, as well as
other factors (i.e., needing more specific logic, etc), a programmatic
approach might be warranted.

I would recommend, however, that the database table be a little more
normalized. Your definition for tags is quite limiting, and could be
better served using a many-to-many relationship. Something like the
following might serve you well:

   CREATE TABLE documents (
       id INT NOT NULL AUTO_INCREMENT,
       language_code CHAR(2),
       tags CHAR(30),
       text TEXT,
       PRIMARY KEY (id)
   );

   CREATE TABLE document_tags (
       id INT NOT NULL AUTO_INCREMENT,
       tag CHAR(30),
       PRIMARY KEY (id)
   );

   CREATE TABLE document_tag_lookup (
       document_id INT NOT NULL,
       tag_id INT NOT NULL,
       PRIMARY KEY (document_id, tag_id)
   );

Then in the DIH, you simply nest a second entity to look up the zero
or more tags that might be associated with your documents; take the
"english" entity from above:

<entity name="english" query="SELECT * FROM documents WHERE
language_code='en'" transformer="RegexTransformer">
    <field name="text_en" column="text" />

    <entity name="english_tags" query="SELECT * FROM document_tags dt
INNER JOIN document_tag_lookup dtl ON (dtl.tag_id = dt.id AND
dtl.document_id='${english.id}')">
        <field name="tags" column="tag" />
    </entity>
</entity>

This would allow for growth, and is easy to maintain. Additionally, if
you wanted to implement a custom transformer of your own, you could.
As an aside, a sort of compromise, you could also use the
ScriptTransformer [1] to create a Javascript function that can do your
language logic and create the necessary fields, and not have to worry
about maintaining any custom Java code.

[1] http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer

- Ken

Re: DIH for multilingual index & multiValued field?

Posted by Ahmet Arslan <io...@yahoo.com>.
For (1) you probably need to write a custom transformer. Something like:
public Object transformRow(Map<String, Object> row)     {
String language_code = row.get("language_code");
String text = row.get("text");
if("en".equals(language_code))
       row.put("text_en", text);
else if if("fr".equals(language_code))
       row.put("text_fr", text);

return row;
}


For (2), it doable with regex transformer. 

"<field column="mailId" splitBy="," sourceColName="emailids"/>
The 'emailids' field in the table can be a comma separated value. So it ends up giving out one or more than one email ids and we expect the 'mailId' to be a multivalued field in Solr." [1]

[1]http://wiki.apache.org/solr/DataImportHandler#RegexTransformer


--- On Sat, 11/13/10, Andy <an...@yahoo.com> wrote:

> From: Andy <an...@yahoo.com>
> Subject: DIH for multilingual index & multiValued field?
> To: solr-user@lucene.apache.org
> Date: Saturday, November 13, 2010, 10:55 PM
> I have a MySQL table:
> 
>     CREATE TABLE documents (
>         id INT NOT NULL
> AUTO_INCREMENT,
>         language_code CHAR(2),
>         tags CHAR(30),
>         text TEXT,
>         PRIMARY KEY (id)
>     );
> 
> I have 2 questions about Solr DIH:
> 
> 1) The "langauge_code" field indicates what language the
> "text" field is in. And depending on the language, I want to
> index "text" to different Solr fields.
> 
>     # pseudo code
> 
>     if langauge_code == "en":
>         index "text" to Solr field
> "text_en"
>     elif langauge_code == "fr":
>         index "text" to Solr field
> "text_fr"
>     elif langauge_code == "zh":
>         index "text" to Solr field
> "text_zh"
>     ...
> 
> Can DIH handle a usecase like this? How do I configure it
> to do so?
> 
> 2) The "tags" field needs to be indexed into a Solr
> multiValued field. Multiple values are stored in a string,
> separated by a comma. For example, if `tags` contains the
> string "blue, green, yellow" then I want to index the 3
> values "blue", "green", "yellow" into a Solr multiValued
> field.
> 
> How do I do that with DIH?
> 
> Thanks.
> 
> 
>       
>