
Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2010/08/27 10:47:53 UTC

[jira] Created: (NUTCH-894) Move statistical language identification from indexing to parsing step

Move statistical language identification from indexing to parsing step
----------------------------------------------------------------------

                 Key: NUTCH-894
                 URL: https://issues.apache.org/jira/browse/NUTCH-894
             Project: Nutch
          Issue Type: Improvement
          Components: parser
    Affects Versions: 2.0
            Reporter: Julien Nioche
            Assignee: Julien Nioche
             Fix For: 2.0


The statistical identification of language is currently done as part of the indexing step, whereas the detection based on the HTTP header and HTML code is done during parsing.
We could keep the same logic, i.e. run the statistical detection only if the previous methods have found nothing, but do it as part of the parsing step. This would be useful for ParseFilters that need the language information, or for ScoringFilters, e.g. to focus the crawl on a set of languages.

Since the statistical models have been ported to Tika we should probably rely on them instead of maintaining our own.

Any thoughts on this?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-894) Move statistical language identification from indexing to parsing step

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916915#action_12916915 ] 

Julien Nioche commented on NUTCH-894:
-------------------------------------

Nice one, that's exactly what I had in mind.
+1 for committing


[jira] Commented: (NUTCH-894) Move statistical language identification from indexing to parsing step

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916899#action_12916899 ] 

Doğacan Güney commented on NUTCH-894:
-------------------------------------

+1 from me. 

If there are no objections in the next couple of days or so, I would like to commit this patch.


[jira] Updated: (NUTCH-894) Move statistical language identification from indexing to parsing step

Posted by "Sertan Alkan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sertan Alkan updated NUTCH-894:
-------------------------------

    Attachment: NUTCH-894.patch

I agree with merging language extraction into one plugin and delegating this work to Tika where possible; I am putting together a patch to do just this. This is mainly a housekeeping patch: it merges the two models in the parsing step and modifies the unit tests. Since we now rely on Tika for language identification, the patch removes our own identification code and its test cases along with the resources, so beware that it looks like a rather big diff.

The patch also introduces a new configuration option, lang.extraction.policy, which lets users control the language extraction. The default behaviour, configured in nutch-default.xml, stays the same: the plugin first tries to detect the language from headers and metadata and, if this fails, moves on to statistical identification. With the new option, users can prefer one mechanism over the other (identification only, for instance).
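
To make the ordering concrete, here is a minimal sketch of how such a policy could drive the extraction. It is not the code from the attached patch: the class name, the Step enum, the comma-separated value format with the tokens "detect" and "identify", and the toy identifier are all assumptions made for illustration. In the patch the statistical step is delegated to Tika; here it is only a stand-in function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.function.Function;

public class LanguageExtractionSketch {

    /** Hypothetical names for the two mechanisms discussed in this issue. */
    enum Step { DETECT, IDENTIFY }

    /**
     * Parses a comma-separated policy string, e.g. "detect,identify".
     * The value format is an assumption; unknown tokens raise
     * IllegalArgumentException (thrown by Enum.valueOf).
     */
    static List<Step> parsePolicy(String policy) {
        List<Step> steps = new ArrayList<>();
        for (String token : policy.split(",")) {
            String name = token.trim().toUpperCase(Locale.ROOT);
            if (!name.isEmpty()) {
                steps.add(Step.valueOf(name));
            }
        }
        return steps;
    }

    /**
     * Runs the steps in policy order and returns the first language found.
     *
     * @param declared   language taken from HTTP headers or HTML metadata, or null
     * @param text       parsed text handed to the statistical identifier
     * @param identifier stand-in for the statistical identifier
     */
    static Optional<String> extract(List<Step> policy, String declared, String text,
                                    Function<String, Optional<String>> identifier) {
        for (Step step : policy) {
            Optional<String> lang = (step == Step.DETECT)
                    ? Optional.ofNullable(declared).filter(s -> !s.isEmpty())
                    : identifier.apply(text);
            if (lang.isPresent()) {
                return lang;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Toy identifier for the demo only: it answers "en" for any non-empty text.
        Function<String, Optional<String>> toy =
                t -> t.isEmpty() ? Optional.empty() : Optional.of("en");

        List<Step> policy = parsePolicy("detect,identify");
        // A declared language is present, so the identifier is never consulted.
        System.out.println(extract(policy, "fr", "some text", toy));
        // Nothing declared, so the policy falls through to identification.
        System.out.println(extract(policy, null, "some text", toy));
    }
}
```

With a policy that lists only the identification step, the declared language is ignored entirely, which corresponds to the "only identification" case mentioned above.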

Any thoughts on the approach?


[jira] Commented: (NUTCH-894) Move statistical language identification from indexing to parsing step

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916907#action_12916907 ] 

Andrzej Bialecki  commented on NUTCH-894:
-----------------------------------------

+1, a nice clean up of our code base :)


[jira] Closed: (NUTCH-894) Move statistical language identification from indexing to parsing step

Posted by "Doğacan Güney (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-894.
-------------------------------

      Assignee: Doğacan Güney  (was: Julien Nioche)
    Resolution: Fixed

Committed as of rev. 1003608.
