Posted to dev@tika.apache.org by "Jan Høydahl (JIRA)" <ji...@apache.org> on 2010/08/21 18:56:16 UTC

[jira] Created: (TIKA-492) Add language identification support for North Sami, Lule Sami and South Sami

Add language identification support for North Sami, Lule Sami and South Sami
----------------------------------------------------------------------------

                 Key: TIKA-492
                 URL: https://issues.apache.org/jira/browse/TIKA-492
             Project: Tika
          Issue Type: New Feature
          Components: languageidentifier
    Affects Versions: 0.7
            Reporter: Jan Høydahl


Currently there is one Norwegian language profile in Tika - "no". We need to distinguish between the two official Norwegian languages defined by ISO 639-1 codes "nb" and "nn". Those codes are recommended for use instead of the common "no" tag.

Proposed solution: remove the current language profile no.ngp and replace it with two new ones for nb and nn.

We must also add tests for Norwegian.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Commented: (TIKA-492) Add language identification support for North Sami, Lule Sami and South Sami

Posted by Oleg Tikhonov <ol...@gmail.com>.
Hi Ken,
I used Nutch's LanguageProfiler to produce a language profile.
You can find more about this here:
http://www.ibm.com/developerworks/opensource/tutorials/os-apache-tika/authors.html
(It's not self-promotion!)
Download the sources; using the Ant task, you'll be able to create a language profile.
If you need any help, please do not hesitate to ask.


BR,
Oleg.
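
For readers following the thread: a language profile of the kind discussed here is essentially a table of character n-gram counts built from sample text, and identification picks the profile closest to the n-gram counts of the input. The Java sketch below illustrates that idea only. It is not Tika's or Nutch's code; the class and method names (NgramProfileSketch, profile, distance, closest) are hypothetical, the choice of trigrams and of a Euclidean distance over relative frequencies is an assumption, and the two sample sentences are placeholders for a real training corpus.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class NgramProfileSketch {

    /** Counts overlapping character trigrams after lower-casing the text
     *  and collapsing every run of non-letters into a single '_'. */
    static Map<String, Integer> profile(String text) {
        String normalized = (" " + text + " ")
                .toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}]+", "_");
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            counts.merge(normalized.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Euclidean distance between the relative trigram frequencies of two profiles. */
    static double distance(Map<String, Integer> a, Map<String, Integer> b) {
        double totalA = Math.max(1, total(a));
        double totalB = Math.max(1, total(b));
        Set<String> ngrams = new HashSet<>(a.keySet());
        ngrams.addAll(b.keySet());
        double sum = 0.0;
        for (String ngram : ngrams) {
            double diff = a.getOrDefault(ngram, 0) / totalA
                        - b.getOrDefault(ngram, 0) / totalB;
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    private static int total(Map<String, Integer> counts) {
        int sum = 0;
        for (int c : counts.values()) {
            sum += c;
        }
        return sum;
    }

    /** Returns the key of the profile closest to the text, or null if there are no profiles. */
    static String closest(Map<String, Map<String, Integer>> profiles, String text) {
        Map<String, Integer> sample = profile(text);
        String best = null;
        double bestDistance = Double.MAX_VALUE;
        for (Map.Entry<String, Map<String, Integer>> entry : profiles.entrySet()) {
            double d = distance(sample, entry.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Placeholder training text; a usable profile needs far more data per language.
        Map<String, Map<String, Integer>> profiles = new HashMap<>();
        profiles.put("en", profile("the dog and the cat sleep in the garden"));
        profiles.put("de", profile("der hund und die katze schlafen im garten"));
        System.out.println(closest(profiles, "the cat sleeps"));
    }
}
```

With such tiny samples the result is only a demonstration; distinguishing closely related languages such as North, Lule and South Sami would presumably need much larger and more representative text collections.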

2010/8/24 Jan Høydahl (JIRA) <ji...@apache.org>

>
>    [
> https://issues.apache.org/jira/browse/TIKA-492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901900#action_12901900]
>
> Jan Høydahl commented on TIKA-492:
> ----------------------------------
>
> I'm in the process of gathering enough text content for the profiles.
>
> I also posted a question to the user list to ask what tool/process you use
> to generate profiles but have not seen an answer yet.
>
> > Add language identification support for North Sami, Lule Sami and South
> Sami
> >
> ----------------------------------------------------------------------------
> >
> >                 Key: TIKA-492
> >                 URL: https://issues.apache.org/jira/browse/TIKA-492
> >             Project: Tika
> >          Issue Type: New Feature
> >          Components: languageidentifier
> >    Affects Versions: 0.7
> >            Reporter: Jan Høydahl
> >            Assignee: Ken Krugler
> >            Priority: Minor
> >
> > We need to add support for the Sami languages.
> > According to the document "Requirements for support for Sami languages in
> > data processing" (http://www.samit.no/01-850-51.pdf), Tika will get "Basic
> > Level" support by detecting North Sami, Lule Sami and South Sami.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>


-- 
Best regards, Oleg.

[jira] Assigned: (TIKA-492) Add language identification support for North Sami, Lule Sami and South Sami

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler reassigned TIKA-492:
--------------------------------

    Assignee: Ken Krugler

> Add language identification support for North Sami, Lule Sami and South Sami
> ----------------------------------------------------------------------------
>
>                 Key: TIKA-492
>                 URL: https://issues.apache.org/jira/browse/TIKA-492
>             Project: Tika
>          Issue Type: New Feature
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>            Assignee: Ken Krugler
>
> We need to add support for the Sami languages.
> According to the document "Requirements for support for Sami languages in data processing" (http://www.samit.no/01-850-51.pdf), Tika will get "Basic Level" support by detecting North Sami, Lule Sami and South Sami.



[jira] Updated: (TIKA-492) Add language identification support for North Sami, Lule Sami and South Sami

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler updated TIKA-492:
-----------------------------

    Priority: Minor  (was: Major)

> Add language identification support for North Sami, Lule Sami and South Sami
> ----------------------------------------------------------------------------
>
>                 Key: TIKA-492
>                 URL: https://issues.apache.org/jira/browse/TIKA-492
>             Project: Tika
>          Issue Type: New Feature
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>            Assignee: Ken Krugler
>            Priority: Minor
>
> We need to add support for the Sami languages.
> According to the document "Requirements for support for Sami languages in data processing" (http://www.samit.no/01-850-51.pdf), Tika will get "Basic Level" support by detecting North Sami, Lule Sami and South Sami.



[jira] Updated: (TIKA-492) Add language identification support for North Sami, Lule Sami and South Sami

Posted by "Jan Høydahl (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/TIKA-492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jan Høydahl updated TIKA-492:
-----------------------------

    Description: 
We need to add support for the Sami languages.

According to the document "Requirements for support for Sami languages in data processing" (http://www.samit.no/01-850-51.pdf), Tika will get "Basic Level" support by detecting North Sami, Lule Sami and South Sami.

  was:
Currently there is one Norwegian language profile in Tika - "no". We need to distinguish between the two official Norwegian languages defined by ISO 639-1 codes "nb" and "nn". Those codes are recommended for use instead of the common "no" tag.

Proposed solution: remove the current language profile no.ngp and replace it with two new ones for nb and nn.

We must also add tests for Norwegian.


> Add language identification support for North Sami, Lule Sami and South Sami
> ----------------------------------------------------------------------------
>
>                 Key: TIKA-492
>                 URL: https://issues.apache.org/jira/browse/TIKA-492
>             Project: Tika
>          Issue Type: New Feature
>          Components: languageidentifier
>    Affects Versions: 0.7
>            Reporter: Jan Høydahl
>
> We need to add support for the Sami languages.
> According to the document "Requirements for support for Sami languages in data processing" (http://www.samit.no/01-850-51.pdf), Tika will get "Basic Level" support by detecting North Sami, Lule Sami and South Sami.



[jira] Commented: (TIKA-492) Add language identification support for North Sami, Lule Sami and South Sami

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901471#action_12901471 ] 

Ken Krugler commented on TIKA-492:
----------------------------------

Hi Jan,

Do you have profile files for these languages?

Thanks,

-- Ken




[jira] Commented: (TIKA-492) Add language identification support for North Sami, Lule Sami and South Sami

Posted by "Jan Høydahl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901900#action_12901900 ] 

Jan Høydahl commented on TIKA-492:
----------------------------------

I'm in the process of gathering enough text content for the profiles.

I also posted a question to the user list to ask what tool/process you use to generate profiles but have not seen an answer yet.




[jira] Commented: (TIKA-492) Add language identification support for North Sami, Lule Sami and South Sami

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/TIKA-492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901906#action_12901906 ] 

Ken Krugler commented on TIKA-492:
----------------------------------

Sorry, I must have missed that question. I think Jukka handled this previously, though Jerome and Chris did the original work in Nutch, then Jukka simplified things - see [TIKA-209].

I'd repost, and depending on the response I'd file an issue about documenting the creation of language profiles. 

