Posted to dev@nutch.apache.org by "Matthias Agethle (JIRA)" <ji...@apache.org> on 2010/10/20 18:04:27 UTC

[jira] Created: (NUTCH-923) Multilingual support for Solr-index-mapping

Multilingual support for Solr-index-mapping
-------------------------------------------

                 Key: NUTCH-923
                 URL: https://issues.apache.org/jira/browse/NUTCH-923
             Project: Nutch
          Issue Type: Improvement
          Components: indexer
    Affects Versions: 1.2
            Reporter: Matthias Agethle
            Priority: Minor


It would be useful to extend the mapping possibilities when indexing to Solr.
One useful feature would be to take the detected language of the HTML page (for example from the language-identifier plugin) and send the content to the corresponding language-aware Solr fields.

The mapping file could be as follows:
<field dest="lang" source="lang"/>
<field dest="title_${lang}" source="title" />
so that the title field gets mapped to title_en for English pages and title_fr for French pages.
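
The placeholder expansion proposed above could be sketched as follows. This is Python for brevity, not Nutch code: the function name `expand_dest` and the dict-based document are hypothetical. Python's `string.Template` happens to use the same `${name}` syntax as the proposed mapping file.

```python
from string import Template


def expand_dest(dest_pattern, doc):
    """Expand ${field} placeholders in a mapping's dest attribute.

    `doc` is a hypothetical dict of already-extracted document fields.
    Returns None when a referenced field is missing, so the caller can
    skip the mapping instead of emitting a malformed field name.
    """
    try:
        return Template(dest_pattern).substitute(doc)
    except KeyError:
        return None


doc = {"lang": "en", "title": "Hello"}
print(expand_dest("title_${lang}", doc))                   # title_en
print(expand_dest("title_${lang}", {"title": "Bonjour"}))  # None
```

Note that this sketch does nothing to validate the value of `lang`; whatever string the document carries ends up in the field name.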

What do you think? Could this also be useful to others?
Or are there already other solutions out there?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924154#action_12924154 ] 

Andrzej Bialecki  commented on NUTCH-923:
-----------------------------------------

This doesn't solve the problem of a potentially unbounded number of fields. Compliance is one thing, and you can clean invalid characters out of field names, but sanity is another: if you have {{title_*}} in your Solr schema then theoretically you are allowed to create an unlimited number of fields with this prefix, and Solr won't complain.


[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923919#action_12923919 ] 

Markus Jelsma commented on NUTCH-923:
-------------------------------------

Andrzej is right. The LanguageIndexingFilter can return a value taken from the HTTP header, and that value can be garbage. But shouldn't the filter itself make sure that either `unknown` or a valid ISO-639-2 value is set?

This way client code can safely rely on the value of the lang field instead of sanitizing it. If more components come along that do something with the lang field, must each of them sanitize it on its own?
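
A minimal sketch of that idea: normalize the value once, at the source, so that downstream code only ever sees a known code or `unknown`. It is written in Python rather than as a Nutch filter. The name `normalize_lang` and the contents of `KNOWN_LANGS` are assumptions for illustration (the codes follow the two-letter style of the `title_en` / `title_fr` examples in the issue).

```python
import re

# Assumption: the set of language codes the deployment actually supports.
KNOWN_LANGS = frozenset({"de", "en", "fr", "it"})


def normalize_lang(raw):
    """Return a whitelisted language code, or "unknown" for anything else."""
    if not isinstance(raw, str):
        return "unknown"
    # Accept values such as "en-US" or " EN " by keeping the primary subtag.
    primary = re.split(r"[-_]", raw.strip().lower(), maxsplit=1)[0]
    return primary if primary in KNOWN_LANGS else "unknown"


print(normalize_lang("en-US"))             # en
print(normalize_lang("x" * 8192 + "\0"))   # unknown
```

Because the result is always drawn from a fixed set, the number of distinct values (and therefore of derived field names) is bounded by the size of the whitelist plus one.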


[jira] Assigned: (NUTCH-923) Multilingual support for Solr-index-mapping

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma reassigned NUTCH-923:
-----------------------------------

    Assignee: Markus Jelsma


[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping

Posted by "Matthias Agethle (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924145#action_12924145 ] 

Matthias Agethle commented on NUTCH-923:
----------------------------------------

What about querying Solr for the configured fields (perhaps one can do this using LukeRequestHandler, I'm not sure)?
When sending data to Solr, one could check whether each field exists in the Solr schema; if it does not, skip that field and log a warning.

The other thing that comes to my mind is: what are valid field names in Solr? Obviously letters, numbers and so on, but is there any validation in Solr?
One could use this to check whether a dynamically generated field name is compliant with Solr (and in this way exclude control characters in field names, as Andrzej mentioned).
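
Both checks could look roughly like the sketch below (Python, for illustration only). It assumes the set of schema field names has already been fetched from Solr by some other means; `accept_field`, `filter_document` and `schema_fields` are hypothetical names. The name pattern reflects the recommendation in Solr's documentation (letters, digits and underscores, not starting with a digit), which, as far as I know, Solr does not strictly enforce.

```python
import logging
import re

log = logging.getLogger("solr-mapping")

# Assumption: recommended (not enforced) shape of a Solr field name.
_FIELD_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def accept_field(name, schema_fields):
    """True if `name` is well-formed and present in the known schema."""
    return bool(_FIELD_NAME.fullmatch(name)) and name in schema_fields


def filter_document(doc, schema_fields):
    """Keep only accepted fields; log a warning for each dropped one."""
    kept = {}
    for name, value in doc.items():
        if accept_field(name, schema_fields):
            kept[name] = value
        else:
            log.warning("Dropping field %r: not accepted by schema check", name)
    return kept


schema = {"lang", "title_en", "title_fr"}
print(filter_document({"title_en": "Hello", "title_\x00zz": "junk"}, schema))
```

A schema that declares a wildcard dynamic field such as `title_*` would need extra handling, since an exact-name lookup as shown here would reject every generated name.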


[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923896#action_12923896 ] 

Andrzej Bialecki  commented on NUTCH-923:
-----------------------------------------

This sounds useful, though the implementation needs to keep the following in mind:
* you _assume_ that the lang field will have a nice predictable value, but unless you sanitize the values you can't assume anything... example: one page I saw had its language metadata set to a random string 8kB long with various control chars and '\0'-s.

* again, if you don't sanitize and control the total number of unique values in the source field, you could end up with a number of fields approaching infinity, and Solr would melt down...
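
One way to control the total number of unique values, sketched in Python under stated assumptions: a small helper that admits new values only until a configured limit is reached. The class name `BoundedValueSet` and the limit are hypothetical and not part of Nutch or Solr.

```python
class BoundedValueSet:
    """Admit at most `limit` distinct values; reject any further new ones."""

    def __init__(self, limit):
        self.limit = limit
        self.seen = set()

    def admit(self, value):
        if value in self.seen:
            return True               # already counted, always allowed
        if len(self.seen) >= self.limit:
            return False              # cap reached, refuse new values
        self.seen.add(value)
        return True


langs = BoundedValueSet(limit=2)
print([langs.admit(v) for v in ["en", "fr", "de", "en"]])
# [True, True, False, True]
```

A cap like this only bounds the damage; it does not tell good values from garbage, so junk that arrives first still occupies slots. In a distributed indexing job each task would also hold its own counter, so the effective limit is per task unless the state is shared.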


[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping

Posted by "Matthias Agethle (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924491#action_12924491 ] 

Matthias Agethle commented on NUTCH-923:
----------------------------------------

Perhaps something like Solr DIH could be a solution. Adding scriptable transformers would allow writing custom logic and would be much more flexible. This way one could also add default field values when no value is provided, etc.
E.g.
{code:xml}
<script><![CDATA[
    function addLanguage(row) {
        // Implementation
    }
]]></script>
<fields transformer="script:addLanguage">
    <field dest="lang" source="lang"/>
    <field dest="title" source="title"/>
</fields>
{code}

In the addLanguage script one could do all kinds of validation to limit the explosion of field names.


[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923947#action_12923947 ] 

Andrzej Bialecki  commented on NUTCH-923:
-----------------------------------------

My point was simply that if you want to build your data schema dynamically, based on the actual input data, then you need to be aware that this process is inherently risky. Today we could perhaps deal with "lang" and LanguageIdentifier, but tomorrow we may be dealing with dc.author or cc.license or something else, and then we will face the same issue, i.e. a potentially unlimited number of fields created based on data.

I don't have a good answer to this problem. On one hand this functionality is useful; on the other hand it's inherently risky in the presence of less-than-ideal data, which is always a possibility... Perhaps introducing some sort of validation mechanism would make this safer to use.


[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923879#action_12923879 ] 

Markus Jelsma commented on NUTCH-923:
-------------------------------------

This is a very useful feature. +1
