Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2012/06/15 15:59:42 UTC

[jira] [Created] (NUTCH-1397) language-identifier incorrectly handles double-barreled language properties

Lewis John McGibbney created NUTCH-1397:
-------------------------------------------

             Summary: language-identifier incorrectly handles double-barreled language properties
                 Key: NUTCH-1397
                 URL: https://issues.apache.org/jira/browse/NUTCH-1397
             Project: Nutch
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.5, nutchgora
            Reporter: Lewis John McGibbney
            Priority: Minor
             Fix For: 1.6, 2.1


Currently, when language-identifier is activated, it parses and identifies language-type=en but does not identify en-GB or en-US. This issue should correct that.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (NUTCH-1397) language-identifier incorrectly handles double-barreled language properties

Posted by "Ken Krugler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295683#comment-13295683 ] 

Ken Krugler commented on NUTCH-1397:
------------------------------------

Should this issue be filed against Tika, versus Nutch? Or is this specific to language identification that's still part of Nutch? Sorry, but I haven't been keeping up with the state of migrating functionality to Tika.
                

[jira] [Updated] (NUTCH-1397) language-identifier incorrectly handles double-barreled language properties

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1397:
----------------------------------------

    Fix Version/s:     (was: 2.1)
                   2.2
    

[jira] [Commented] (NUTCH-1397) language-identifier incorrectly handles double-barreled language properties

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295732#comment-13295732 ] 

Lewis John McGibbney commented on NUTCH-1397:
---------------------------------------------

Hi KEN, this is exactly what flashed through my mind when I opened the ticket. I hoped that one of you Tika guys would chime in and provide some commentary. Right enough, language-identification IS delegated to Tika since NUTCH-1075, so yes, I think you're right. I'll check the Tika Jira; if an issue doesn't exist, I will open one accordingly. Thanks.
                

[jira] [Commented] (NUTCH-1397) language-identifier incorrectly handles double-barreled language properties

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295740#comment-13295740 ] 

Lewis John McGibbney commented on NUTCH-1397:
---------------------------------------------

Aye Julien. I was just this minute looking through the open Tika issues surrounding language-detection, and yeah, I understand exactly where you're coming from. In all honesty this was not a biggie for me; I was just aware that something wasn't right w.r.t. the double-barreled scenario. I suppose if you were faceting by specific language, e.g. en-GB or en-US, then it would be handy, but for run-of-the-mill faceted search it is only minimally required (I assume). I suppose we can keep it open here in Nutch and, as you mention, if someone wishes (or I get round to it), the patch would fit into the parsing code on this side.
                

[jira] [Comment Edited] (NUTCH-1397) language-identifier incorrectly handles double-barreled language properties

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295732#comment-13295732 ] 

Lewis John McGibbney edited comment on NUTCH-1397 at 6/15/12 3:52 PM:
----------------------------------------------------------------------

Hi Ken, this is exactly what flashed through my mind when I opened the ticket. I hoped that one of you Tika guys would chime in and provide some commentary. Right enough, language-identification IS delegated to Tika since NUTCH-1075, so yes, I think you're right. I'll check the Tika Jira; if an issue doesn't exist, I will open one accordingly. Thanks.
                
      was (Author: lewismc):
    Hi KEN, this is exactly what flashed through my mind when I opened the ticket. I hoped that one of you Tika guys would chime in and provide some commentary. Right enough, language-identification IS delegated to Tika since NUTCH-1075, so yes, I think you're right. I'll check the Tika Jira; if an issue doesn't exist, I will open one accordingly. Thanks.
                  

[jira] [Commented] (NUTCH-1397) language-identifier incorrectly handles double-barreled language properties

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13295738#comment-13295738 ] 

Julien Nioche commented on NUTCH-1397:
--------------------------------------

Lewis, the language identification is a combination of parsing of the HTML (done in Nutch) with statistical guessing (from Tika). The parser component ignores compound values and returns only the main language code; the statistical component returns only the 2-letter code (and given how bad it is at it, I don't think it would be wise to try and get it to be more specific). In a nutshell, these compound language codes are not supported in Nutch. We could possibly store a separate value with the secondary code when available from the parsing, but not from the identifier.
Makes sense?
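
For illustration only, the "separate value with the secondary code" idea could be sketched roughly as below. This is a minimal sketch and not the Nutch or Tika code: the class name LanguageTag, its fields, and the parse method are all hypothetical, and it assumes the parsed HTML value looks like "en-GB" or "en_GB".

```java
import java.util.Locale;

/**
 * Hypothetical helper (not part of Nutch or Tika): splits a language value
 * taken from parsed HTML, such as "en-GB", into a main code and a secondary code.
 */
final class LanguageTag {
    /** Main language code in lower case, e.g. "en"; empty if the input was null or blank. */
    final String primary;
    /** Everything after the first separator, as written, e.g. "GB"; empty if absent. */
    final String secondary;

    private LanguageTag(String primary, String secondary) {
        this.primary = primary;
        this.secondary = secondary;
    }

    static LanguageTag parse(String raw) {
        if (raw == null) {
            return new LanguageTag("", "");
        }
        // Split once on the first '-' or '_' so that "en-GB" and "en_GB" both work.
        String[] parts = raw.trim().split("[-_]", 2);
        String primary = parts[0].toLowerCase(Locale.ROOT);
        String secondary = parts.length > 1 ? parts[1] : "";
        return new LanguageTag(primary, secondary);
    }
}
```

With something like this, the main code could still be stored exactly as today, and the secondary code kept in its own field only when the page declares one; the statistical guess would leave the secondary value empty.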
                