Posted to dev@nutch.apache.org by "Fengtan (JIRA)" <ji...@apache.org> on 2016/06/11 02:57:20 UTC

[jira] [Updated] (NUTCH-2278) Handle alpha-2 language codes consistently

     [ https://issues.apache.org/jira/browse/NUTCH-2278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fengtan updated NUTCH-2278:
---------------------------
    Attachment: NUTCH-2278.patch

Suggested patch.

> Handle alpha-2 language codes consistently
> ------------------------------------------
>
>                 Key: NUTCH-2278
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2278
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>    Affects Versions: 1.12
>            Reporter: Fengtan
>            Priority: Minor
>         Attachments: NUTCH-2278.patch
>
>
> The language-identifier plugin provides two extraction policies: 'detect' and 'identify'.
> However, the two policies handle [alpha-2|https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2] country codes differently:
> * 'identify' strips out the alpha-2 code, e.g. if the identified language is 'en-US' then it injects 'en' into the meta tags
> * 'detect' does not strip out the alpha-2 code, e.g. if the detected language is 'en-US' then it injects 'en-US' into the meta tags
> Any chance we can make this consistent and always strip out the alpha-2 code?
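
To illustrate the behaviour requested above, here is a minimal sketch of stripping the alpha-2 suffix so that both policies would yield the same value. It is not taken from the attached patch or from the Nutch code base: the class name LanguageCodes and the method stripRegion are hypothetical, and treating both '-' and '_' as separators is an assumption.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Nutch or of NUTCH-2278.patch.
final class LanguageCodes {

    private LanguageCodes() {
    }

    // Returns the primary language subtag in lower case, e.g. "en-US" -> "en".
    // Assumption: the code may use either '-' or '_' as the separator.
    static String stripRegion(String lang) {
        if (lang == null) {
            return null;
        }
        // Split at the first separator only and keep what precedes it.
        String primary = lang.trim().split("[-_]", 2)[0];
        return primary.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(stripRegion("en-US")); // en
        System.out.println(stripRegion("en"));    // en
    }
}
```

If both 'detect' and 'identify' passed their result through a helper of this kind before writing the meta tag, 'en-US' and 'en' would both be stored as 'en'.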



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)