Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2011/08/02 08:20:27 UTC

[jira] [Created] (NUTCH-1075) Delegate language identification to Tika

Delegate language identification to Tika
----------------------------------------

                 Key: NUTCH-1075
                 URL: https://issues.apache.org/jira/browse/NUTCH-1075
             Project: Nutch
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.4
            Reporter: Julien Nioche
            Assignee: Julien Nioche
             Fix For: 1.4


In 2.0, language identification is delegated to Tika and is done as part of the parsing step (rather than during indexing, as is currently done).
The attached patch is a backport from trunk which implements this and adds a new parameter to determine the strategy to use:

{code:xml} 
<property>
  <name>lang.extraction.policy</name>
  <value>detect,identify</value>
  <description>This determines when the plugin uses detection and
  statistical identification mechanisms. The order in which the
  detect and identify are written will determine the extraction
  policy. Default case (detect,identify)  means the plugin will
  first try to extract language info from page headers and metadata,
  if this is not successful it will try using tika language
  identification. Possible values are:
    detect
    identify
    detect,identify
    identify,detect
  </description>
</property>
{code} 
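
The fallback order described above can be sketched in a few lines of Java. This is only an illustration of how a comma-separated policy such as detect,identify might be interpreted, not the code from the patch: the class LangPolicySketch, the Page holder and the identifier function are all hypothetical names, and the statistical identifier is passed in as a plain function so that no Tika API is assumed.

```java
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch: these names do not come from the Nutch patch.
class LangPolicySketch {

    /** Minimal stand-in for a parsed page. */
    static final class Page {
        final String declaredLang; // language from headers/metadata, or null
        final String text;         // extracted text used for identification

        Page(String declaredLang, String text) {
            this.declaredLang = declaredLang;
            this.text = text;
        }
    }

    /** Turns null or blank strings into an empty Optional. */
    private static Optional<String> nonBlank(String s) {
        return Optional.ofNullable(s).map(String::trim).filter(v -> !v.isEmpty());
    }

    /** Applies the policy steps in the order written and returns the first result. */
    static Optional<String> extract(String policy, Page page,
                                    Function<String, String> identifier) {
        for (String step : policy.split(",")) {
            String name = step.trim();
            Optional<String> lang;
            if (name.equals("detect")) {
                // Use whatever language the page itself declares.
                lang = nonBlank(page.declaredLang);
            } else if (name.equals("identify")) {
                // Ask the (assumed) statistical identifier to guess from the text.
                lang = nonBlank(identifier.apply(page.text));
            } else {
                throw new IllegalArgumentException("Unknown policy step: " + name);
            }
            if (lang.isPresent()) {
                return lang;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Page page = new Page(null, "Dit is een korte tekst");
        // The lambda stands in for a real statistical identifier.
        System.out.println(extract("detect,identify", page, text -> "nl"));
    }
}
```

With the policy identify,detect the same method would consult the identifier first and fall back to the declared language only when the identifier returns nothing.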

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087655#comment-13087655 ] 

Julien Nioche commented on NUTCH-1075:
--------------------------------------

Have been able to reproduce the issue. The difference between parse-html and parse-tika is that the latter filters out any attributes found on the 'html' element, e.g.:

{code}
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="nl" lang="nl">
{code}

The language should, however, still be found when it is specified with meta elements etc.

Improving parse-tika should be done in a separate issue.
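
To illustrate the difference, here is a minimal sketch (not Nutch or Tika code) of the two detection sources mentioned above. The class HtmlLangDetectSketch and its regular expressions are assumptions made for illustration; in particular, the meta pattern assumes that http-equiv appears before content. The stripHtmlAttributes method only mimics the effect described for parse-tika.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: regex-based stand-ins, not how parse-html or parse-tika work.
class HtmlLangDetectSketch {

    // lang or xml:lang attribute on the opening <html> tag
    private static final Pattern HTML_LANG = Pattern.compile(
        "<html\\b[^>]*?\\s(?:xml:)?lang\\s*=\\s*[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);

    // <meta http-equiv="Content-Language" content="..."> (assumes this attribute order)
    private static final Pattern META_LANG = Pattern.compile(
        "<meta\\b[^>]*?http-equiv\\s*=\\s*[\"']content-language[\"'][^>]*?"
            + "content\\s*=\\s*[\"']([^\"']+)[\"']",
        Pattern.CASE_INSENSITIVE);

    private static Optional<String> firstGroup(Pattern p, String html) {
        Matcher m = p.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Language declared on the <html> element, if any. */
    static Optional<String> fromHtmlAttribute(String html) {
        return firstGroup(HTML_LANG, html);
    }

    /** Language declared in a Content-Language meta element, if any. */
    static Optional<String> fromMeta(String html) {
        return firstGroup(META_LANG, html);
    }

    /** Mimics a parser that drops every attribute of the <html> element. */
    static String stripHtmlAttributes(String html) {
        return html.replaceFirst("(?i)<html\\b[^>]*>", "<html>");
    }

    public static void main(String[] args) {
        String page = "<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"nl\" lang=\"nl\">"
            + "<head><meta http-equiv=\"Content-Language\" content=\"nl\"></head></html>";
        String stripped = stripHtmlAttributes(page);
        System.out.println("attribute, original: " + fromHtmlAttribute(page));
        System.out.println("attribute, stripped: " + fromHtmlAttribute(stripped));
        System.out.println("meta, stripped:      " + fromMeta(stripped));
    }
}
```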



>         Attachments: NUTCH-1075-v2.patch, NUTCH-1075.patch

[jira] [Closed] (NUTCH-1075) Delegate language identification to Tika

Posted by "Markus Jelsma (Closed) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-1075.
--------------------------------


Bulk close of resolved issues of 1.4. bulkclose-1.4-20111220
                
>         Attachments: NUTCH-1075-v2.patch, NUTCH-1075-v3.patch, NUTCH-1075.patch

[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087680#comment-13087680 ] 

Markus Jelsma commented on NUTCH-1075:
--------------------------------------

Yes, I can verify that!

>         Attachments: NUTCH-1075-v2.patch, NUTCH-1075-v3.patch, NUTCH-1075.patch

[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085691#comment-13085691 ] 

Markus Jelsma commented on NUTCH-1075:
--------------------------------------

Indeed, it does. However, I seem to be unable to retrieve the language anymore. All output is without the lang field. The plugin is enabled as language-identifier and compiled:

{code}
bin/nutch parsechecker -D lang.extraction.policy=identify http://www.groene.nl
{code}

{code}
Content Metadata: X-Runtime=46 ETag="0305dcdfd8a9f88f049403af666c8f29" Content-Length=7692 Set-Cookie=_groene.nl_session=BAh7CDoQX2NzcmZfdG9rZW4iMTREczFGMnZaV3g2ejBJNGI3QllNZDFlRWFBTTZlbEw0VGFXS2VEMGdNS289Og9zZXNzaW9uX2lkIiU1YTZjYzNhNjQwNDY0ZjE0NGIzMjA4OTgyNGUwYTFlMDoPaG9tZV92aWV3c2kA--d4aa63ef6783ae64e1ba09c92ba0c95297adf803; path=/; HttpOnly Connection=close Server=Apache X-Powered-By=Phusion Passenger (mod_rails/mod_rack) 2.2.14 Cache-Control=private, max-age=0, must-revalidate Status=200 Date=Tue, 16 Aug 2011 13:26:13 GMT Vary=Accept-Encoding Content-Encoding=gzip h1=null Content-Type=text/html; charset=utf-8 h2=klaver 
Parse Metadata: robots=noarchive caching.forbidden=content Content-Encoding=UTF-8 Content-Type=text/html; charset=utf-8 
{code}

Haven't got a clue yet. It also doesn't matter if I disable the boilerplate patches.

>         Attachments: NUTCH-1075.patch

[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082610#comment-13082610 ] 

Julien Nioche commented on NUTCH-1075:
--------------------------------------

Hi Lewis,

One way of testing would be to call o.a.n.parse.ParserChecker on some documents and make sure that the language metadata has been set properly. Otherwise, having some test classes would be good as well.

This will make our code a bit lighter and give us more flexibility, as we'll have more choice in the strategies to adopt, i.e. extract vs identify.

Thanks

Julien

>         Attachments: NUTCH-1075.patch

[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086301#comment-13086301 ] 

Julien Nioche commented on NUTCH-1075:
--------------------------------------

If you can't see it in the metadata displayed by ParserChecker, you definitely won't get it in IndexerChecker. Could there be something specific in your config? You've added the plugin to the list, haven't you?
Have you tried debugging in Eclipse to see whether you at least get to the parser class?

Thanks!


>         Attachments: NUTCH-1075-v2.patch, NUTCH-1075.patch

[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085728#comment-13085728 ] 

Julien Nioche commented on NUTCH-1075:
--------------------------------------

See https://issues.apache.org/jira/browse/NUTCH-623


>         Attachments: NUTCH-1075.patch

[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085676#comment-13085676 ] 

Markus Jelsma commented on NUTCH-1075:
--------------------------------------

Oh, I forgot. Give me a few minutes; I've got a whole lot of URLs with terrible language ID that I'm desperate to test with this neat improvement.

>         Attachments: NUTCH-1075.patch

[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086441#comment-13086441 ] 

Markus Jelsma commented on NUTCH-1075:
--------------------------------------

The clean 1.4 checkout I tested as well (see above) failed in the same fashion. I double-checked the runtime/local/libs and it indeed uses Tika 0.9 core.

I don't think my computer likes me anymore. If you manage to do so with a clean checkout + patch, there seems to be something really wrong on my system.

>         Attachments: NUTCH-1075-v2.patch, NUTCH-1075.patch

[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082595#comment-13082595 ] 

Lewis John McGibbney commented on NUTCH-1075:
---------------------------------------------

Hi Julien,

Would it be possible to add some info on how I can test this patch? It would be great to get this sorted out.

Thanks

>         Attachments: NUTCH-1075.patch

[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085790#comment-13085790 ] 

Markus Jelsma commented on NUTCH-1075:
--------------------------------------

I'll test whenever it's needed again.

>         Attachments: NUTCH-1075.patch

[jira] [Issue Comment Edited] (NUTCH-1075) Delegate language identification to Tika

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087680#comment-13087680 ] 

Markus Jelsma edited comment on NUTCH-1075 at 8/19/11 12:30 PM:
----------------------------------------------------------------

Yes, I can verify that!

edit:
About Tika stripping away lang attributes: I can understand that detection from attribs won't work anymore, but what about identification? That should not rely on attribs, should it?

      was (Author: markus17):
    Yes, I can verify that!
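
The distinction behind this question can be shown with a toy identifier that looks only at the text. The sketch below is not Tika's algorithm: it swaps in a simple stopword count, and the class name TextOnlyIdentifySketch and the word lists are made up for illustration. The point is only that an identifier of this kind never looks at markup attributes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Toy stand-in for statistical identification: counts stopword hits per language.
class TextOnlyIdentifySketch {

    // Tiny, hand-picked stopword lists (assumption, for illustration only).
    private static final Map<String, Set<String>> STOPWORDS = new LinkedHashMap<>();
    static {
        STOPWORDS.put("nl", new HashSet<>(
            Arrays.asList("de", "het", "een", "en", "van", "is", "niet")));
        STOPWORDS.put("en", new HashSet<>(
            Arrays.asList("the", "a", "and", "of", "is", "not", "to")));
    }

    /** Returns the language whose stopwords occur most often, if any occur at all. */
    static Optional<String> identify(String text) {
        // Split on anything that is not a letter; only the text is inspected.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        String best = null;
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : STOPWORDS.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        System.out.println(identify("De kat is niet van het huis en de hond"));
    }
}
```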
  
>         Attachments: NUTCH-1075-v2.patch, NUTCH-1075-v3.patch, NUTCH-1075.patch

[jira] [Updated] (NUTCH-1075) Delegate language identification to Tika

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1075:
---------------------------------

    Attachment: NUTCH-1075-v2.patch

Some files need to be deleted manually before running the tests: the patch empties them but does not delete them.

Improved the tests so that they cover identification as well as extraction.

>         Attachments: NUTCH-1075-v2.patch, NUTCH-1075.patch

[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085687#comment-13085687 ] 

Julien Nioche commented on NUTCH-1075:
--------------------------------------

Markus - have a look at https://issues.apache.org/jira/browse/NUTCH-1083
This will make your life easier when testing, since you can specify various values for 'lang.extraction.policy'.

>         Attachments: NUTCH-1075.patch

[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085732#comment-13085732 ] 

Markus Jelsma commented on NUTCH-1075:
--------------------------------------

Ah yes, I tried that too, but the plugin is called language-identifier in the current 1.4. If I use languageidentifier, it is not registered as a plugin (I checked the logs for it).
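
A small illustration of why the spelling matters, assuming (as far as I understand it) that plugin.includes is a regular expression which has to match the whole plugin id. The class and method names below are hypothetical and not part of Nutch; only the plugin ids come from this thread.

```java
import java.util.regex.Pattern;

public class PluginIncludesSketch {

    // Assumption: a plugin is activated only when its id is matched in full
    // by the plugin.includes expression.
    public static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String pluginId = "language-identifier"; // the id used in the current 1.4

        // Spelled with the hyphen: the id matches one of the alternatives.
        System.out.println(isIncluded("protocol-http|parse-tika|language-identifier", pluginId));

        // Spelled without the hyphen: no alternative matches the id.
        System.out.println(isIncluded("protocol-http|parse-tika|languageidentifier", pluginId));
    }
}
```

Under that assumption the first call reports a match and the second does not, which would be consistent with the plugin not showing up in the logs.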


[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087688#comment-13087688 ] 

Markus Jelsma commented on NUTCH-1075:
--------------------------------------

Yes, this v3 patch is very good!

- detect + identify work fine with parse-html
- identify works fine with parse-tika
- detect is broken with parse-tika because it strips attributes, which is to be fixed elsewhere

I was fooled by where I had expected the language attribute to be in the parsechecker output. With parse-html it appears at the end of the Parse Metadata line, whereas with parse-tika it appears somewhere else.

Thanks!


[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087685#comment-13087685 ] 

Julien Nioche commented on NUTCH-1075:
--------------------------------------

The identification should not be affected by the underlying HTML parser.
Let me know if the patch is OK and can be committed.
Thanks


[jira] [Resolved] (NUTCH-1075) Delegate language identification to Tika

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche resolved NUTCH-1075.
----------------------------------

    Resolution: Fixed

Committed revision 1159621.

Thanks for reviewing it!


[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

Posted by "Lewis John McGibbney (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13083166#comment-13083166 ] 

Lewis John McGibbney commented on NUTCH-1075:
---------------------------------------------

OK, so I attach a seed file including the URLs I tested against with the o.a.n.ParserChecker.

Steps to reproduce are as follows (run from the root of the branch-1.4 checkout):
{code}
ant clean
patch -p0 -i NUTCH-1075.patch
ant runtime
{code}

I then added the language-identifier plugin to the plugin.includes property in nutch-site.xml.

Some output from http://www.lemonde.fr:
{code}
Content Metadata: Age=38 Content-Length=62737 Expires=Thu, 11 Aug 2011 15:23:23 GMT Last-Modified=Thu, 11 Aug 2011 15:18:50 GMT X-Server=britpop Connection=close Server=Apache Cache-Control=private, max-age=60 Edge-Control=!no-store,max-age=1m X-CDN=Served by Cotendo Date=Thu, 11 Aug 2011 15:23:01 GMT Vary=Accept-Encoding Content-Encoding=gzip Accept-Ranges=bytes Content-Type=text/html 
Parse Metadata: CharEncodingForConversion=windows-1252 caching.forbidden=content OriginalCharEncoding=windows-1252 language=fr
{code}

Some output from http://www.groene.nl:
{code}
Content Metadata: X-Runtime=46 ETag="9791fa50616231029c669d43da0a7b09" Content-Length=7887 Set-Cookie=_groene.nl_session=BAh7CDoQX2NzcmZfdG9rZW4iMWpUY0hvYXpoRUpTNVNMU0NnTUs0Snk5TzdPSlF3UFNNWEd1LzdoYTVxUVU9Og9zZXNzaW9uX2lkIiVhYjEzNDBmYmQ0NjJhNGRhYzM2Y2MzOTI4NTk1YjA0MToPaG9tZV92aWV3c2kA--c0f0882428620024c071ff93b6f37e1e0c362008; path=/; HttpOnly Connection=close Server=Apache X-Powered-By=Phusion Passenger (mod_rails/mod_rack) 2.2.14 Cache-Control=private, max-age=0, must-revalidate Status=200 Date=Thu, 11 Aug 2011 15:29:16 GMT Vary=Accept-Encoding Content-Encoding=gzip Content-Type=text/html; charset=utf-8 
Parse Metadata: CharEncodingForConversion=utf-8 caching.forbidden=content OriginalCharEncoding=utf-8 language=nl
{code}

I am testing on more and more sites and will report back, but so far so good.
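
When checking many sites it may be easier to pull the language value out of the Parse Metadata line programmatically than to read it by eye, since the key does not have to be the last one on the line. A minimal sketch, assuming the line is a whitespace-separated list of key=value pairs as in the output above; the class and method names are hypothetical and not part of Nutch.

```java
import java.util.Optional;

public class ParseMetaSketch {

    private static final String KEY = "language=";

    // Scans every whitespace-separated token, so the position of the
    // language entry within the line does not matter.
    public static Optional<String> language(String parseMetadataLine) {
        for (String token : parseMetadataLine.trim().split("\\s+")) {
            if (token.startsWith(KEY)) {
                return Optional.of(token.substring(KEY.length()));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String line = "Parse Metadata: CharEncodingForConversion=windows-1252 "
                + "caching.forbidden=content OriginalCharEncoding=windows-1252 language=fr";
        System.out.println(language(line).orElse("(no language found)"));
    }
}
```

This only handles values without spaces, which holds for the Parse Metadata lines shown here but not for the Content Metadata lines.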


[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086285#comment-13086285 ] 

Markus Jelsma commented on NUTCH-1075:
--------------------------------------

Hi! I am not getting any language returned in my ParseMeta, regardless of the extraction policy used. There are no results in parsechecker and always unknown in indexchecker. I cannot confirm Lewis' output for those two URLs.

About the threshold: yes, I believe it would make sense to force identification when I only want identification.


[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085671#comment-13085671 ] 

Julien Nioche commented on NUTCH-1075:
--------------------------------------

Any more testers for this issue? Shall we commit it?


[jira] [Updated] (NUTCH-1075) Delegate language identification to Tika

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1075:
---------------------------------

    Attachment: NUTCH-1075.patch

Passes the unit tests but still requires some manual testing.


[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087084#comment-13087084 ] 

Julien Nioche commented on NUTCH-1075:
--------------------------------------

OK, at some point we will have to spend a bit of time on parse-tika, add some tests for HTML and make sure it behaves like the old parse-html.
I will have a look tomorrow and find out why it does not play well with the language identification.
Thanks


[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085723#comment-13085723 ] 

Markus Jelsma commented on NUTCH-1075:
--------------------------------------

I also see this on a clean 1.4-dev checkout:

{code}
test:
     [echo] Testing plugin: protocol-file
    [junit] WARNING: multiple versions of ant detected in path for junit 
    [junit]          jar:file:/usr/share/ant/lib/ant.jar!/org/apache/tools/ant/Project.class
    [junit]      and jar:file:/home/markus/projects/apache/nutch/branch-1.4/build/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
    [junit] Running org.apache.nutch.protocol.file.TestProtocolFile
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 1.356 sec
    [junit] Running org.apache.nutch.analysis.lang.TestLanguageIdentifier
    [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
    [junit] Test org.apache.nutch.analysis.lang.TestLanguageIdentifier FAILED
    [junit] Running org.apache.nutch.analysis.lang.TestNGramProfile
    [junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
    [junit] Test org.apache.nutch.analysis.lang.TestNGramProfile FAILED

{code}


[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076187#comment-13076187 ] 

Markus Jelsma commented on NUTCH-1075:
--------------------------------------

Cool! This would solve a lot of issues with the current stuff as discussed on the list. I'll be happy to test this with large datasets when I get back, next week I think.


[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086392#comment-13086392 ] 

Julien Nioche commented on NUTCH-1075:
--------------------------------------

I've checked out a clean version of 1.4, applied the patch and it works fine. Could you try and revert to Tika-core 0.9 and see if this changes the situation?



[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086318#comment-13086318 ] 

Markus Jelsma commented on NUTCH-1075:
--------------------------------------

Yes, it's in the list. When I reverse the patches I use the old language-identifier again, and then all works well. There are no funky changes in my checkout, except that I'm using a very recent tika-app-1.0-SNAPSHOT.jar instead of Tika core 0.9.

I just tried a clean 1.4-dev checkout with your latest patch and language-identifier|protocol-http|parse-tika|index-(basic|more|anchor) as plugins, to no avail. The language plugin is registered, and both the parser and indexing hooks are executed properly.

I've added some debugging and commenting to see what it's doing (no Eclipse), and it's clear that lang is always null in HTMLLanguageParser.filter().


[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085826#comment-13085826 ] 

Julien Nioche commented on NUTCH-1075:
--------------------------------------

Markus - the issue you're having with -D lang.extraction.policy=identify is due to the score obtained for nl being higher than the threshold for identifier.isReasonablyCertain, which is hardcoded in Tika.

We could make it so that this test is ignored when the policy is identify only, or add a param for it. I haven't found an example where the value returned was deemed reasonably certain. WDYT?
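
To make the second option concrete, here is a rough sketch of how the policy order and a certainty bypass could interact. It is not the code from the patch: the class, the method and the ignoreCertainty flag are hypothetical names, and the certain argument stands in for whatever identifier.isReasonablyCertain returns.

```java
public class ExtractionPolicySketch {

    /**
     * Walks the comma-separated policy in order and returns the first usable language.
     *
     * @param policy          e.g. "detect,identify" or "identify"
     * @param detected        language found in headers/metadata, or null
     * @param identified      language guessed statistically, or null
     * @param certain         whether the guess was deemed reasonably certain
     * @param ignoreCertainty hypothetical switch that accepts the guess anyway
     */
    public static String resolve(String policy, String detected, String identified,
                                 boolean certain, boolean ignoreCertainty) {
        for (String step : policy.split(",")) {
            String s = step.trim();
            if (s.equals("detect") && detected != null) {
                return detected;
            }
            if (s.equals("identify") && identified != null && (certain || ignoreCertainty)) {
                return identified;
            }
        }
        return null; // nothing usable: the language stays unknown
    }

    public static void main(String[] args) {
        // Identify-only policy with an uncertain guess: rejected unless the check is bypassed.
        System.out.println(resolve("identify", null, "nl", false, false));
        System.out.println(resolve("identify", null, "nl", false, true));
    }
}
```

With the flag off, an uncertain guess under the identify-only policy yields no language at all, which matches the behaviour described above; with the flag on, the guess is returned as is.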

> Delegate language identification to Tika
> ----------------------------------------
>
>                 Key: NUTCH-1075
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1075
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>             Fix For: 1.4
>
>         Attachments: NUTCH-1075.patch


[jira] [Updated] (NUTCH-1075) Delegate language identification to Tika

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1075:
---------------------------------

    Attachment: NUTCH-1075-v3.patch

Added parameter to bypass the check on isReasonablyCertain.

{code}
./nutch parsechecker -D lang.extraction.policy=identify http://www.groene.nl 
{code}

should now retrieve the right language code
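
For readers unfamiliar with lang.extraction.policy, the following Java sketch shows one way an ordered policy such as detect,identify could be resolved: each step is tried in the order written and the first one that yields a language wins. The class name, the extract method, and the two suppliers are hypothetical and are not taken from the patch.

```java
import java.util.Optional;
import java.util.function.Supplier;

public class ExtractionPolicySketch {

    // Tries each step of the comma-separated policy in order and returns
    // the first language code found, if any.
    static Optional<String> extract(String policy,
                                    Supplier<Optional<String>> detect,
                                    Supplier<Optional<String>> identify) {
        for (String step : policy.split(",")) {
            Optional<String> lang;
            switch (step.trim()) {
                case "detect":   // headers and metadata
                    lang = detect.get();
                    break;
                case "identify": // statistical identification
                    lang = identify.get();
                    break;
                default:
                    throw new IllegalArgumentException("Unknown policy step: " + step);
            }
            if (lang.isPresent()) {
                return lang;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed inputs: the page declares no language, the statistical
        // identifier guesses Dutch.
        Supplier<Optional<String>> noHeader = () -> Optional.empty();
        Supplier<Optional<String>> statistical = () -> Optional.of("nl");

        System.out.println(extract("identify", noHeader, statistical));        // prints Optional[nl]
        System.out.println(extract("detect", noHeader, statistical));          // prints Optional.empty
        System.out.println(extract("detect,identify", noHeader, statistical)); // prints Optional[nl]
    }
}
```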



> Delegate language identification to Tika
> ----------------------------------------
>
>                 Key: NUTCH-1075
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1075
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>             Fix For: 1.4
>
>         Attachments: NUTCH-1075-v2.patch, NUTCH-1075-v3.patch, NUTCH-1075.patch


[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13086984#comment-13086984 ] 

Markus Jelsma commented on NUTCH-1075:
--------------------------------------

Julien, I finally found the difference between our environments: I am using parse-tika and you guys tried parse-html. It fails miserably with parse-tika, whereas it all works out fine when using parse-html.

> Delegate language identification to Tika
> ----------------------------------------
>
>                 Key: NUTCH-1075
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1075
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>             Fix For: 1.4
>
>         Attachments: NUTCH-1075-v2.patch, NUTCH-1075.patch


[jira] [Commented] (NUTCH-1075) Delegate language identification to Tika

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085786#comment-13085786 ] 

Julien Nioche commented on NUTCH-1075:
--------------------------------------

Ah, sorry. It looked a lot like the error we were getting in 623, but this one is partly due to the fact that the patch empties some of the files instead of deleting them. Will look into that.

> Delegate language identification to Tika
> ----------------------------------------
>
>                 Key: NUTCH-1075
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1075
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Julien Nioche
>            Assignee: Julien Nioche
>             Fix For: 1.4
>
>         Attachments: NUTCH-1075.patch
