Posted to dev@tika.apache.org by "Jan Høydahl (JIRA)" <ji...@apache.org> on 2010/08/21 20:12:17 UTC

[jira] Created: (TIKA-493) Support for macro languages

Support for macro languages
---------------------------

                 Key: TIKA-493
                 URL: https://issues.apache.org/jira/browse/TIKA-493
             Project: Tika
          Issue Type: New Feature
          Components: languageidentifier
    Affects Versions: 0.7
            Reporter: Jan Høydahl


Some languages have variants, and ISO codes exist both for the individual variants and for the macro language that groups them. There should be a way to tell whether the identified language is part of a "macro language" and, if so, to return that macro language, because different applications require different codes. For search, for example, it makes sense to tag the document with both the specific code and the macro code.

Example:
Norwegian: no
Norwegian bokmål: nb
Norwegian nynorsk: nn

The getLanguage() call should continue to return the most correct and specific ISO code (according to which language profile matched).

In addition, it should be possible to get the macro language.

Proposed implementation:
Add some new methods:

public boolean hasMacroLanguage()    // true | false
public String getMacroLanguage()         // In case of "nn" or "nb", result would be "no"

The definition of macro languages can be added in the property file introduced in TIKA-490.
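
As a rough illustration of the proposal, here is a minimal sketch of the lookup. Everything in it is an assumption made for the example: the class name MacroLanguages, the "macro." key prefix, and the idea that the TIKA-490 property file could carry entries such as macro.nn=no. The methods also take the detected code as a parameter, whereas the proposal above puts parameterless methods on the identifier itself.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/**
 * Hypothetical helper (not part of Tika): maps a specific language code
 * such as "nn" or "nb" to its macro language code such as "no".
 */
class MacroLanguages {

    // Assumed key prefix; the real TIKA-490 property file format may differ.
    private static final String PREFIX = "macro.";

    private final Map<String, String> macroByLanguage = new HashMap<>();

    /** Reads every "macro.<code>=<macro code>" entry from the given properties. */
    MacroLanguages(Properties props) {
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(PREFIX)) {
                String language = key.substring(PREFIX.length());
                macroByLanguage.put(language, props.getProperty(key).trim());
            }
        }
    }

    /** True if the given language code is a member of a macro language. */
    boolean hasMacroLanguage(String language) {
        return macroByLanguage.containsKey(language);
    }

    /** The macro language code, or null if the language has none. */
    String getMacroLanguage(String language) {
        return macroByLanguage.get(language);
    }

    public static void main(String[] args) {
        // Stand-in for loading the property file; the entries are assumed.
        Properties props = new Properties();
        props.setProperty("macro.nb", "no");
        props.setProperty("macro.nn", "no");

        MacroLanguages macros = new MacroLanguages(props);

        // Pretend the language identifier matched the nynorsk profile.
        String detected = "nn";
        System.out.println("specific: " + detected);
        if (macros.hasMacroLanguage(detected)) {
            System.out.println("macro: " + macros.getMacroLanguage(detected));
        }
    }
}
```

A search application could then index both values, keeping the specific code from getLanguage() and adding the macro code when one exists.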

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (TIKA-493) Support for macro languages

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler updated TIKA-493:
-----------------------------

    Priority: Minor  (was: Major)



[jira] Assigned: (TIKA-493) Support for macro languages

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler reassigned TIKA-493:
--------------------------------

    Assignee: Ken Krugler
