
Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2009/11/18 15:50:39 UTC

[jira] Created: (NUTCH-766) Tika parser

Tika parser
-----------

                 Key: NUTCH-766
                 URL: https://issues.apache.org/jira/browse/NUTCH-766
             Project: Nutch
          Issue Type: New Feature
            Reporter: Julien Nioche


Tika handles a lot of different formats under the bonnet and exposes them nicely via SAX events. What is described here is a tika-parser plugin which delegates the parsing to Tika but can still coexist with the existing parsing plugins, which is useful for formats that Tika handles only partially (or not at all). Some of the elements below have already been discussed on the mailing lists. Note that this is work in progress; your feedback is welcome.

Tika is already used by Nutch for its MimeType implementations. Tika comes as different jar files (core and parsers); in the work described here we decided to put the libs in 2 different places:
NUTCH_HOME/lib : tika-core.jar
NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
Since the core uses Tika only for its mime-type functionality, only tika-core needs to go at the main lib level, whereas the tika plugin needs tika-parsers.jar plus all the jars used internally by Tika.

Due to limitations in the way Tika loads its classes, we had to duplicate the TikaConfig class in the tika-plugin. This might be fixed in the future in Tika itself or avoided by refactoring the mimetype part of Nutch using extension points.

Unlike most other parsers, Tika handles more than one mime-type, which is why we use "*" as its mimetype value in the plugin descriptor and have modified ParserFactory.java so that it considers the tika parser as potentially suitable for all mime-types. In practice this means that the associations between a mime type and a parser plugin defined in parse-plugins.xml are needed only where we want to handle a mime type with a parser other than Tika.
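
As a rough illustration of the wildcard matching described above, here is a minimal sketch. It is not the actual ParserFactory patch: the class WildcardMimeLookup, its methods and the plugin-to-mimetype map are hypothetical names made up for this example, and only the "*" convention comes from the description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the plugin lookup; NOT the real ParserFactory code.
class WildcardMimeLookup {

    /** True when a plugin's declared mimetype accepts the document's type ("*" accepts anything). */
    static boolean accepts(String declaredType, String documentType) {
        return "*".equals(declaredType) || declaredType.equalsIgnoreCase(documentType);
    }

    /** Ids of all plugins whose declared mimetype accepts documentType, in declaration order. */
    static List<String> candidates(Map<String, String> declaredTypeByPlugin, String documentType) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : declaredTypeByPlugin.entrySet()) {
            if (accepts(entry.getValue(), documentType)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed plugin descriptors: plugin id -> declared mimetype.
        Map<String, String> plugins = new LinkedHashMap<>();
        plugins.put("parse-html", "text/html");
        plugins.put("parse-tika", "*");

        System.out.println(candidates(plugins, "text/html"));
        System.out.println(candidates(plugins, "application/pdf"));
    }
}
```

With a lookup of this shape, a type that no dedicated plugin declares still yields the wildcard parser as a candidate, which is why explicit entries in parse-plugins.xml become the exception rather than the rule.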

The general approach I chose was to convert the SAX events returned by the Tika parsers into DOM objects and reuse the utilities that come with the current HTML parser, i.e. link detection and metatag handling. This also means that we can use the HTMLParseFilters in exactly the same way. The main difference, though, is that HTMLParseFilters are no longer limited to HTML documents, as the XHTML tags returned by Tika can correspond to a different format for the original document. There is some duplication of code with the html-plugin, which will be resolved by either a) getting rid of the html-plugin altogether or b) exporting its jar and making the tika parser depend on it.
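
To make the SAX-to-DOM step concrete, here is a minimal sketch that uses only the JDK's own JAXP classes (an identity TransformerHandler writing into a DOMResult). It is not the code from the patch, which may build the DOM differently. The SAX events are emitted by hand in emitSampleEvents as a stand-in for what a Tika parser would produce; the class name, the sample URL and the extractLinks helper are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical illustration; not the plugin's actual DOM-building code.
class SaxToDomSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Stand-in for a Tika parser: emits a tiny XHTML document as SAX events. */
    static void emitSampleEvents(ContentHandler handler) throws SAXException {
        AttributesImpl none = new AttributesImpl();
        AttributesImpl href = new AttributesImpl();
        href.addAttribute("", "href", "href", "CDATA", "http://example.org/doc.pdf");

        handler.startDocument();
        handler.startElement(XHTML, "html", "html", none);
        handler.startElement(XHTML, "body", "body", none);
        handler.startElement(XHTML, "a", "a", href);
        char[] text = "a link".toCharArray();
        handler.characters(text, 0, text.length);
        handler.endElement(XHTML, "a", "a");
        handler.endElement(XHTML, "body", "body");
        handler.endElement(XHTML, "html", "html");
        handler.endDocument();
    }

    /** Routes the SAX events through an identity transformer into a DOM Document. */
    static Document toDom() {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().newDocument();

            SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.setResult(new DOMResult(doc));

            emitSampleEvents(handler);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("SAX to DOM conversion failed", e);
        }
    }

    /** DOM-based link detection of the kind an HTML utility could perform. */
    static List<String> extractLinks(Document doc) {
        List<String> links = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            links.add(((Element) anchors.item(i)).getAttribute("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(extractLinks(toDom()));
    }
}
```

Once the events are in a DOM tree, anything written against DOM nodes (link detection, metatag handling, parse filters) can run unchanged, whatever the format of the original document was.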

The following libraries are required in the lib/ directory of the tika-parser : 

      <library name="asm-3.1.jar"/>
      <library name="bcmail-jdk15-144.jar"/>
      <library name="commons-compress-1.0.jar"/>
      <library name="commons-logging-1.1.1.jar"/>
      <library name="dom4j-1.6.1.jar"/>
      <library name="fontbox-0.8.0-incubator.jar"/>
      <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
      <library name="hamcrest-core-1.1.jar"/>
      <library name="jce-jdk13-144.jar"/>
      <library name="jempbox-0.8.0-incubator.jar"/>
      <library name="metadata-extractor-2.4.0-beta-1.jar"/>
      <library name="mockito-core-1.7.jar"/>
      <library name="objenesis-1.0.jar"/>
      <library name="ooxml-schemas-1.0.jar"/>
      <library name="pdfbox-0.8.0-incubating.jar"/>
      <library name="poi-3.5-FINAL.jar"/>
      <library name="poi-ooxml-3.5-FINAL.jar"/>
      <library name="poi-scratchpad-3.5-FINAL.jar"/>
      <library name="tagsoup-1.2.jar"/>
      <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
      <library name="xml-apis-1.0.b2.jar"/>
      <library name="xmlbeans-2.3.0.jar"/>

There is a small test suite which needs to be improved. We will need to look at each individual format and check whether it is covered by Tika and, if so, to the same extent as by the existing plugin; the Wiki is probably the right place for this. The language identifier (which is an HTMLParseFilter) seemed to work fine.
 
Again, your comments are welcome. Please bear in mind that this is just a first step. 

Julien
http://www.digitalpebble.com





-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Java Heap Limit Exceeded

Posted by "Withanage, Dulip" <wi...@asia-europe.uni-heidelberg.de>.

Dear developers,

I have installed a Nutch system on a Linux enterprise server with 8GB RAM.
My Java VM has 4GB RAM when Nutch starts.

I have configured a web crawler to scan PDF documents (about 3000) on the intranet.
After about 100 PDF docs, there is always an OutOfMemory exception.

I tried the following trick.

In index.html, I generate links to a set of HTML pages (link1.html, link2.html, etc.).
Each link.html has links to 20 PDFs. But this trick also fails.

Can someone give me some ideas or point me to a place to read?


Best regards,

Dulip Withanage, M.Sc 


Cluster of Excellence 
Karl Jaspers Centre
Heidelberg

Fax: +49-6221 - 54 4012
e-mail: withanage@asia-europe.uni-heidelberg.de

[jira] Issue Comment Edited: (NUTCH-766) Tika parser

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803709#action_12803709 ] 

Chris A. Mattmann edited comment on NUTCH-766 at 1/22/10 2:38 PM:
------------------------------------------------------------------

{quote}
Sure, but it would be silly to block the whole Tika plugin because Tika does not support such or such format as well as the original Nutch plugins. As I explained above we can configure which parser to use for which mimetype and use the Tika-plugin by default. Hopefully the Tika implementation will get better and better and there will be no need for keeping the old plugins.
{quote}

+1, I'm going to agree on this one here Julien. Other communities ;) have convinced me of the need for backwards compat and unobtrusiveness when bringing in new functionality or results. +1 to at least in Nutch 1.1 leaving the old plugins (perhaps mentioning they should be deprecated and replaced by the Tika functionality) and then removing them in 1.2 or 1.3.

I got bogged down with my paid job, but I found some Apache time recently so this is tops on my list to tackle.

Cheers,
Chris





  
> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch


[jira] Commented: (NUTCH-766) Tika parser

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832398#action_12832398 ] 

Chris A. Mattmann commented on NUTCH-766:
-----------------------------------------

I'm going to hold off on committing this tonight. I've updated the docs per Andrzej, and I've also updated CHANGES.txt, but when running:

{code}
ant clean compile-core test
{code}

I'm seeing these messages during plugin testing for parse-tika:

{noformat}
2010-02-10 22:39:16,593 ERROR tika.TikaParser (TikaParser.java:getParse(63)) - Can't retrieve Tika parser for mime-type application/pdf
------------- ---------------- ---------------

Testcase: testIt took 2.684 sec
        FAILED
null
junit.framework.AssertionFailedError
        at org.apache.nutch.tika.TestPdfParser.testIt(TestPdfParser.java:79)
{noformat}

It seems that the TikaConfig is not being found? I was looking at TikaParser#setConf and it seems that a default config is being created for Tika, but maybe not being loaded correctly? I need to look into this more...

> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz


[jira] Commented: (NUTCH-766) Tika parser

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832564#action_12832564 ] 

Julien Nioche commented on NUTCH-766:
-------------------------------------

I had a closer look at the HTML parsing issue. What happens is that the association between the mime-type and the parser implementation is not explicitly set in parse-plugins.xml, so the ParserFactory goes through all the plugins and gets the ones with a matching mimetype (or * for Tika). The Tika parser takes no precedence over the default HTML parser, so the latter comes first in the list and is used for parsing.

Of course that does not happen if parse-html is not specified in plugin.includes or if an explicit mapping is set in parse-plugins.xml. I don't think we want to have to specify explicitly in every mapping that Tika should be used; the mappings should be reserved for the cases where another parser must be used instead of Tika.

What we could do, though, is that when no explicit mapping is set for a mimetype, Tika (or any parser marked as supporting any mimetype) is put first in the list of discovered parsers, so it remains the default choice unless an explicit mapping is set (even if another plugin is loaded and can handle the type).

Makes sense?
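
The proposed ordering rule can be sketched roughly as follows. This is only an illustration of the idea, not a patch against ParserFactory: the class ParserOrderingSketch, the Plugin holder and the shape of the explicit-mapping table are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the proposed ordering; NOT the real ParserFactory.
class ParserOrderingSketch {

    /** A discovered parser plugin and the mimetype it declares; "*" means any type. */
    static final class Plugin {
        final String id;
        final String declaredType;

        Plugin(String id, String declaredType) {
            this.id = id;
            this.declaredType = declaredType;
        }
    }

    /**
     * Returns plugin ids in the order they would be tried for mimeType.
     * An explicit mapping, if present, wins outright; otherwise wildcard
     * parsers are placed ahead of parsers that declare the exact type.
     */
    static List<String> order(String mimeType,
                              List<Plugin> discovered,
                              Map<String, List<String>> explicitMappings) {
        List<String> explicit = explicitMappings.get(mimeType);
        if (explicit != null) {
            return new ArrayList<>(explicit);
        }
        List<String> wildcard = new ArrayList<>();
        List<String> specific = new ArrayList<>();
        for (Plugin plugin : discovered) {
            if ("*".equals(plugin.declaredType)) {
                wildcard.add(plugin.id);
            } else if (plugin.declaredType.equalsIgnoreCase(mimeType)) {
                specific.add(plugin.id);
            }
        }
        wildcard.addAll(specific);
        return wildcard;
    }
}
```

Under this sketch, with parse-html and parse-tika both loaded and no mapping for text/html, parse-tika is tried first; adding an explicit text/html mapping to parse-html restores the old behaviour for that type only.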



> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, sample.tar.gz, TikaParser.java


[jira] Assigned: (NUTCH-766) Tika parser

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reassigned NUTCH-766:
---------------------------------------

    Assignee: Chris A. Mattmann

> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>         Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch


[jira] Work started: (NUTCH-766) Tika parser

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on NUTCH-766 started by Chris A. Mattmann.

> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>         Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch
>
>
> Tika handles a lot of different formats under the bonnet and exposes them nicely via SAX events. What is described here is a tika-parser plugin which delegates the pasring mechanism of Tika but can still coexist with the existing parsing plugins which is useful for formats partially handled by Tika (or not at all). Some of the elements below have already been discussed on the mailing lists. Note that this is work in progress, your feedback is welcome.
> Tika is already used by Nutch for its MimeType implementations. Tika comes as different jar files (core and parsers), in the work described here we decided to put the libs in 2 different places
> NUTCH_HOME/lib : tika-core.jar
> NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
> Tika being used by the core only for its Mimetype functionalities we only need to put tika-core at the main lib level whereas the tika plugin obviously needs the tika-parsers.jar + all the jars used internally by Tika
> Due to limitations in the way Tika loads its classes, we had to duplicate the TikaConfig class in the tika-plugin. This might be fixed in the future in Tika itself or avoided by refactoring the mimetype part of Nutch using extension points.
> Unlike most other parsers, Tika handles more than one Mime-type which is why we are using "*" as its mimetype value in the plugin descriptor and have modified ParserFactory.java so that it considers the tika parser as potentially suitable for all mime-types. In practice this means that the associations between a mime type and a parser plugin as defined in parse-plugins.xml are useful only for the cases where we want to handle a mime type with a different parser than Tika. 
> The general approach I chose was to convert the SAX events returned by the Tika parsers into DOM objects and reuse the utilities that come with the current HTML parser i.e. link detection,  metatag handling but also means that we can use the HTMLParseFilters in exactly the same way. The main difference though is that HTMLParseFilters are not limited to HTML documents anymore as the XHTML tags returned by Tika can correspond to a different format for the original document. There is a duplication of code with the html-plugin which will be resolved by either a) getting rid of the html-plugin altogether or b) exporting its jar and make the tika parser depend on it.
> The following libraries are required in the lib/ directory of the tika-parser : 
>       <library name="asm-3.1.jar"/>
>       <library name="bcmail-jdk15-144.jar"/>
>       <library name="commons-compress-1.0.jar"/>
>       <library name="commons-logging-1.1.1.jar"/>
>       <library name="dom4j-1.6.1.jar"/>
>       <library name="fontbox-0.8.0-incubator.jar"/>
>       <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
>       <library name="hamcrest-core-1.1.jar"/>
>       <library name="jce-jdk13-144.jar"/>
>       <library name="jempbox-0.8.0-incubator.jar"/>
>       <library name="metadata-extractor-2.4.0-beta-1.jar"/>
>       <library name="mockito-core-1.7.jar"/>
>       <library name="objenesis-1.0.jar"/>
>       <library name="ooxml-schemas-1.0.jar"/>
>       <library name="pdfbox-0.8.0-incubating.jar"/>
>       <library name="poi-3.5-FINAL.jar"/>
>       <library name="poi-ooxml-3.5-FINAL.jar"/>
>       <library name="poi-scratchpad-3.5-FINAL.jar"/>
>       <library name="tagsoup-1.2.jar"/>
>       <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
>       <library name="xml-apis-1.0.b2.jar"/>
>       <library name="xmlbeans-2.3.0.jar"/>
> There is a small test suite which needs to be improved. We will need to have a look at each individual format and check that it is covered by Tika and if so to the same extent; the Wiki is probably the right place for this. The language identifier (which is a HTMLParseFilter) seemed to work fine.
>  
> Again, your comments are welcome. Please bear in mind that this is just a first step. 
> Julien
> http://www.digitalpebble.com
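
For readers following the quoted description, the SAX-to-DOM step can be sketched roughly as below. This is only an illustration using the JDK's identity TransformerHandler with a DOMResult, not the code from the attached patch. The class SaxToDomSketch, the method names and the sample events are made up for the example; in the plugin the events would come from a Tika parser rather than from emitSampleEvents.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical sketch class, not part of Nutch or Tika.
class SaxToDomSketch {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Stand-in for a Tika parser: emits a tiny XHTML document as SAX events.
    static void emitSampleEvents(ContentHandler out) throws SAXException {
        out.startDocument();
        out.startPrefixMapping("", XHTML);
        out.startElement(XHTML, "html", "html", new AttributesImpl());
        out.startElement(XHTML, "body", "body", new AttributesImpl());
        AttributesImpl attrs = new AttributesImpl();
        attrs.addAttribute("", "href", "href", "CDATA", "http://example.org/");
        out.startElement(XHTML, "a", "a", attrs);
        char[] text = "Example".toCharArray();
        out.characters(text, 0, text.length);
        out.endElement(XHTML, "a", "a");
        out.endElement(XHTML, "body", "body");
        out.endElement(XHTML, "html", "html");
        out.endPrefixMapping("");
        out.endDocument();
    }

    // Collects the SAX events into a DOM Document via an identity transformer.
    static Document buildDom() {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().newDocument();
            SAXTransformerFactory stf =
                (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = stf.newTransformerHandler();
            handler.setResult(new DOMResult(doc));
            emitSampleEvents(handler);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("SAX to DOM conversion failed", e);
        }
    }

    // Simplified link detection on the DOM: collects the href of every <a>.
    static List<String> extractLinks(Document doc) {
        List<String> links = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            if (a.hasAttribute("href")) {
                links.add(a.getAttribute("href"));
            }
        }
        return links;
    }
}
```

Once the document is a DOM tree, DOM-based utilities such as link detection or parse filters can walk it regardless of whether the original input was HTML, PDF or something else.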

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-766) Tika parser

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834658#action_12834658 ] 

Hudson commented on NUTCH-766:
------------------------------

Integrated in Nutch-trunk #1071 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1071/])
    

> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, sample.tar.gz, TikaParser.java
>

[jira] Commented: (NUTCH-766) Tika parser

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832255#action_12832255 ] 

Chris A. Mattmann commented on NUTCH-766:
-----------------------------------------

{quote}
+1 to commit this...
{quote}

Awesome, Andrzej. Will do so tonight, PST, if I don't hear any objections between now and then...

Thanks!

Cheers,
Chris


> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz
>

[jira] Updated: (NUTCH-766) Tika parser

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated NUTCH-766:
------------------------------------

    Fix Version/s: 1.1

> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch
>

[jira] Commented: (NUTCH-766) Tika parser

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803670#action_12803670 ] 

Julien Nioche commented on NUTCH-766:
-------------------------------------

> I think the end result of this plugin should be replacing all Tika supported parsers (or the parsers we choose to replace) with the TikaParser and not to build parallel ways to parse the same formats.

That's how I see it - it's just that we have the option of choosing whether or not to use Tika for a given mimetype. It is used by default unless an association is created between a parser implementation and a mimetype in parse-plugins.xml.

> So I think we need to copy all of the existing test files and move & adapt the existing testcases fully before committing this. That is a good way of seeing that the parse result is what is expected and also of finding out about possible differences between the old and the Tika versions.

Sure, but it would be silly to block the whole Tika plugin because Tika does not support this or that format as well as the original Nutch plugins do. As I explained above, we can configure which parser to use for which mimetype and use the Tika plugin by default. Hopefully the Tika implementation will get better and better and there will be no need to keep the old plugins.

BTW http://wiki.apache.org/nutch/TikaPlugin lists the differences between the current version of Tika and the existing Nutch parsers

Even if we decide to keep using the old plugins for some of the formats to start with, we'd still be able to use the Tika plugin by default for the ones which already have the same coverage.
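
The selection rule described in this comment can be sketched as follows. This is a simplified illustration and not the actual ParserFactory code: the class ParserSelectionSketch, its methods and the plugin ids ("parse-tika", "parse-pdf") are assumptions made for the example, and the map stands in for the associations read from parse-plugins.xml.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch class, not part of Nutch.
class ParserSelectionSketch {
    // Assumed id of the Tika-based plugin, used when nothing else is configured.
    static final String DEFAULT_PARSER = "parse-tika";

    // Stand-in for the mimetype -> plugin associations from parse-plugins.xml.
    private final Map<String, String> explicitAssociations = new HashMap<>();

    // Registers an explicit association, i.e. "handle this mimetype with that plugin".
    void associate(String mimeType, String pluginId) {
        explicitAssociations.put(mimeType, pluginId);
    }

    // An explicit association wins; every other mimetype falls back to Tika.
    String parserFor(String mimeType) {
        return explicitAssociations.getOrDefault(mimeType, DEFAULT_PARSER);
    }
}
```

With only application/pdf associated to an older plugin, parserFor("application/pdf") would return that plugin's id, while any other mimetype would resolve to the Tika plugin.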


> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch
>

[jira] Commented: (NUTCH-766) Tika parser

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832588#action_12832588 ] 

Chris A. Mattmann commented on NUTCH-766:
-----------------------------------------

@Julien:

Sigh, no I didn't! :(

That's probably why! Thanks for the help. I'll try it later today. If that passes, my +1 to commit. 

@Sami, regarding your updates, would you be OK with me creating another issue to track them, attaching your diffs as patches against this issue, once committed to the trunk? That way we'll make sure they get into 1.1, but we won't block this issue anymore from getting in. Let me know what you think, thanks.

Cheers,
Chris


> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, sample.tar.gz, TikaParser.java
>

[jira] Updated: (NUTCH-766) Tika parser

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-766:
--------------------------------

    Attachment: NUTCH-766-v3.patch

Updated version of the plugin : uses Tika 0.6

> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz
>

[jira] Updated: (NUTCH-766) Tika parser

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sami Siren updated NUTCH-766:
-----------------------------

    Attachment: NutchTikaConfig.java

Extended TikaConfig that is able to load parsers and can be used with existing Tika classes. The call to super() cannot load the parsers, so the config is then processed again locally. This is a hack, and hopefully at some point we can drop the class altogether.

> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, sample.tar.gz
>
>

[jira] Updated: (NUTCH-766) Tika parser

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-766:
--------------------------------

    Attachment: NUTCH-766.tika.patch

patch for the Tika-plugin

> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>         Attachments: NUTCH-766.tika.patch
>
>

[jira] Commented: (NUTCH-766) Tika parser

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832454#action_12832454 ] 

Julien Nioche commented on NUTCH-766:
-------------------------------------

@Chris : I just did a fresh co from svn, applied the v3 patch, unzipped sample.tar.gz into the parse-tika directory and ran the test just as you did, but could not reproduce the problem. Could there be a difference between your version and the trunk?

@Sami :  

{quote} was there a reason not to use AutoDetect parser?  {quote} 
I suppose we could, as long as we give it a clue about the MimeType obtained from the Content. As you pointed out, there could be a duplication with the detection done by Mime-Util. I suppose one way to do it would be to add a new version of the method: getParse(Content content, MimeType type). That's an interesting point.
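
To make the idea concrete, here is a minimal sketch of such an overload. It uses stand-in types rather than the real Nutch Content and Tika MimeType classes, and a toy detector; all names are hypothetical. The point is only that a caller who already knows the type can pass it in and avoid a second detection.

```java
// Hypothetical stand-ins: these are NOT the Nutch Content or Tika MimeType classes.
class MimeHintSketch {

    static final class Content {
        final String url;
        final byte[] data;
        Content(String url, byte[] data) { this.url = url; this.data = data; }
    }

    static final class MimeType {
        final String name;
        MimeType(String name) { this.name = name; }
    }

    /** Counts how often detection runs, to make duplicated work visible. */
    static int detections = 0;

    /** Toy detector: recognises the PDF magic bytes, otherwise assumes HTML. */
    static MimeType detect(Content content) {
        detections++;
        byte[] d = content.data;
        boolean pdf = d.length >= 4
            && d[0] == '%' && d[1] == 'P' && d[2] == 'D' && d[3] == 'F';
        return new MimeType(pdf ? "application/pdf" : "text/html");
    }

    /** Existing-style entry point: the type has to be detected first. */
    static String getParse(Content content) {
        return getParse(content, detect(content));
    }

    /** Proposed-style overload: the caller supplies the type it already knows. */
    static String getParse(Content content, MimeType type) {
        return content.url + " parsed as " + type.name;
    }

    public static void main(String[] args) {
        Content c = new Content("http://example.org/doc", new byte[] {'%', 'P', 'D', 'F'});
        System.out.println(getParse(c, new MimeType("application/pdf")));
        System.out.println("detections run: " + detections);
    }
}
```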

{quote} Also was there a reason not to parse html with tika?  {quote} 
It is supposed to do so, if it does not then it's a bug which needs urgent fixing.

Regarding parsing package formats, I think the plan is that Tika will handle that in the future, but we could try to do that now if we find a relatively clean mechanism for doing so. BTW, could you please send a diff rather than the full code of the class you posted earlier? That would make the comparison much easier.




> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, sample.tar.gz, TikaParser.java
>
>

[jira] Commented: (NUTCH-766) Tika parser

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803709#action_12803709 ] 

Chris A. Mattmann commented on NUTCH-766:
-----------------------------------------

{quote}
Sure, but it would be silly to block the whole Tika plugin because Tika does not support such or such format as well as the original Nutch plugins. As I explained above we can configure which parser to use for which mimetype and use the Tika-plugin by default. Hopefully the Tika implementation will get better and better and there will be no need for keeping the old plugins.
{quote}

+1, I'm going to agree on this one here Julien. Other communities ;) have convinced me of the need for backwards compat and unobtrusiveness when bringing in new functionality or results. +1 to leaving the old plugins in place at least in Nutch 1.1 (perhaps mentioning that they should be deprecated and replaced by the Tika functionality) and then removing them in 1.2 or 1.3.
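
As a rough illustration of the selection rule quoted above (explicit mime-type mappings first, the Tika plugin for everything else), here is a small sketch. It is not the ParserFactory code from the patch: the class name and the plugin ids used in main are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of "explicit mapping wins, wildcard parser otherwise".
class ParserSelectionSketch {

    private final Map<String, String> explicit = new HashMap<>();
    private final String wildcardParser;

    /** @param wildcardParser plugin registered for "*", used when nothing else matches */
    ParserSelectionSketch(String wildcardParser) {
        this.wildcardParser = wildcardParser;
    }

    /** Registers an override, in the spirit of an entry in parse-plugins.xml. */
    ParserSelectionSketch map(String mimeType, String pluginId) {
        explicit.put(mimeType.toLowerCase(Locale.ROOT), pluginId);
        return this;
    }

    /** Returns the explicitly mapped plugin, or the wildcard plugin by default. */
    String parserFor(String mimeType) {
        String key = mimeType == null ? "" : mimeType.toLowerCase(Locale.ROOT);
        return explicit.getOrDefault(key, wildcardParser);
    }

    public static void main(String[] args) {
        // Plugin ids below are examples only.
        ParserSelectionSketch selection =
            new ParserSelectionSketch("parse-tika").map("application/pdf", "parse-pdf");
        System.out.println(selection.parserFor("application/pdf"));
        System.out.println(selection.parserFor("application/msword"));
    }
}
```

With this rule, keeping an old plugin for a format is a one-line mapping, and dropping the mapping later hands the format over to Tika without touching anything else.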

I got bogged down with my paid job, but I found some Apache time recently so this is tops on my list to tackle.

Cheers,
Chris



> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch
>
>

[jira] Commented: (NUTCH-766) Tika parser

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803664#action_12803664 ] 

Sami Siren commented on NUTCH-766:
----------------------------------

I took a brief look at the proposed patch, some comments:

The public API footprint of new classes should be smaller, e.g. use private, package private or protected methods/classes as much as possible.

I think the end result of this plugin should be replacing all Tika supported parsers (or the parsers we choose to replace) with the TikaParser, not building parallel ways to parse the same formats. So I think we need to copy all of the existing test files and move and adapt the existing test cases fully before committing this. That is a good way of checking that the parse result is what is expected and also of finding out about possible differences between the old and Tika versions.
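
A sketch of what such a comparison could look like, assuming each parser can be reduced to a function from raw input to extracted text. The class, the method names and the toy "parsers" in main are hypothetical; real test cases would call the existing plugin and the Tika plugin on the shared test files.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical harness: reports the samples on which two parsers disagree.
class ParseComparisonSketch {

    /** Collapses whitespace so that purely cosmetic differences are ignored. */
    static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    /** Returns the names of the samples whose extracted text differs. */
    static List<String> differences(Map<String, String> samples,
                                    Function<String, String> legacy,
                                    Function<String, String> candidate) {
        List<String> differing = new ArrayList<>();
        for (Map.Entry<String, String> sample : samples.entrySet()) {
            String expected = normalize(legacy.apply(sample.getValue()));
            String actual = normalize(candidate.apply(sample.getValue()));
            if (!expected.equals(actual)) {
                differing.add(sample.getKey());
            }
        }
        return differing;
    }

    public static void main(String[] args) {
        Map<String, String> samples = new LinkedHashMap<>();
        samples.put("a.html", "<p>Hello   world</p>");
        samples.put("b.html", "<b>foo</b><i>bar</i>");
        // Toy stand-ins for the two parsers: they treat tag boundaries differently.
        Function<String, String> legacy = s -> s.replaceAll("<[^>]+>", " ");
        Function<String, String> candidate = s -> s.replaceAll("<[^>]+>", "");
        System.out.println(differences(samples, legacy, candidate));
    }
}
```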


> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch
>
>

[jira] Commented: (NUTCH-766) Tika parser

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832250#action_12832250 ] 

Andrzej Bialecki  commented on NUTCH-766:
-----------------------------------------

+1 to commit this - please remember to update nutch-default.xml to switch to the tika plugin, perhaps add a comment about the deprecated parse-* plugins - most people look here and not in the parse-plugins, where this change is documented...

> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz
>
>

[jira] Commented: (NUTCH-766) Tika parser

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798718#action_12798718 ] 

Chris A. Mattmann commented on NUTCH-766:
-----------------------------------------

Hi Julien:

I have had a look and was trying to test it out but got sidetracked. Give me this week to try and put together a final reviewable/committable patch; otherwise, it's all yours.

Cheers,
Chris


> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch
>
>

[jira] Commented: (NUTCH-766) Tika parser

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804546#action_12804546 ] 

Chris A. Mattmann commented on NUTCH-766:
-----------------------------------------

Hi Sami:

{quote}
Chris, can you please explain to me how keeping two components doing identical work would be more backwards compatible than having only 1?
{quote}

Sure, it's more of a configuration backwards-compatibility issue. Folks who have gone to the trouble of customizing their Nutch configuration (nutch-site.xml, nutch-default.xml, or even parse-plugins.xml) would have to update it in their deployed environments if this patch removed the existing parsing plugins, i.e. declared that they no longer exist and that the tika-plugin must be used instead. Because of that, why don't we ease them into the upgrade by keeping the old plugins for at least one released version before they go away? That would make things easier from a configuration backwards-compat perspective.

HTH,
Chris


> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch
>
>

[jira] Updated: (NUTCH-766) Tika parser

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-766:
--------------------------------

    Attachment:     (was: NUTCH-766.tika.patch)

> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz
>
>

[jira] Issue Comment Edited: (NUTCH-766) Tika parser

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832564#action_12832564 ] 

Julien Nioche edited comment on NUTCH-766 at 2/11/10 5:22 PM:
--------------------------------------------------------------

I had a closer look at the HTML parsing issue. What happens is that the association between the mime-type and the parser implementation is not explicitly set in parse-plugins.xml, so the ParserFactory goes through all the plugins and picks the ones with a matching mimetype (or * for Tika). The Tika parser takes no precedence over the default HTML parser, so the latter comes first in the list and is used for parsing.

Of course that does not happen if parse-html is not specified in plugin.includes or if an explicit mapping is set in parse-plugins.xml. I don't think we want to have to specify explicitly that Tika should be used in all the mappings; explicit mappings should be reserved for the cases where another parser must be used instead of Tika.

What we could do, though, is this: when no explicit mapping is set for a mimetype, Tika (or any parser marked as supporting any mimetype) is put first in the list of discovered parsers, so it remains the default choice unless an explicit mapping is set (even if another plugin is loaded and can handle the type).

Makes sense?

The ParserFactory section of patch v3 can be replaced by:

Index: src/java/org/apache/nutch/parse/ParserFactory.java
===================================================================
--- src/java/org/apache/nutch/parse/ParserFactory.java	(revision 909059)
+++ src/java/org/apache/nutch/parse/ParserFactory.java	(working copy)
@@ -348,11 +348,23 @@
                 contentType)) {
           extList.add(extensions[i]);
         }
+        else if ("*".equals(extensions[i].getAttribute("contentType"))){
+          // default plugins get the priority
+          extList.add(0, extensions[i]);
+        }
       }
       
       if (extList.size() > 0) {
         if (LOG.isInfoEnabled()) {
-          LOG.info("The parsing plugins: " + extList +
+          StringBuffer extensionsIDs = new StringBuffer("[");
+          boolean isFirst = true;
+          for (Extension ext : extList){
+        	  if (!isFirst) extensionsIDs.append(" - ");
+        	  else isFirst=false;
+        	  extensionsIDs.append(ext.getId());
+          }
+    	  extensionsIDs.append("]");
+          LOG.info("The parsing plugins: " + extensionsIDs.toString() +
                    " are enabled via the plugin.includes system " +
                    "property, and all claim to support the content type " +
                    contentType + ", but they are not mapped to it  in the " +
@@ -369,7 +381,7 @@
 
   private boolean match(Extension extension, String id, String type) {
     return ((id.equals(extension.getId())) &&
-            (type.equals(extension.getAttribute("contentType")) ||
+            (type.equals(extension.getAttribute("contentType")) || extension.getAttribute("contentType").equals("*") ||
              type.equals(DEFAULT_PLUGIN)));
   }
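
The ordering rule introduced by the first hunk can be sketched in isolation as follows. This is a simplified illustration, not the ParserFactory code: ParserExt is a hypothetical stand-in for the Nutch Extension class, and only the "exact match appended, wildcard moved to the front" behaviour is reproduced.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a plugin extension: an id plus the
// contentType attribute declared in its plugin descriptor.
final class ParserExt {
  final String id;
  final String contentType;

  ParserExt(String id, String contentType) {
    this.id = id;
    this.contentType = contentType;
  }
}

final class WildcardOrderingSketch {

  // Returns the ids of the parsers that could handle contentType when no
  // explicit mapping exists. Catch-all ("*") parsers are put first.
  static List<String> candidates(List<ParserExt> loaded, String contentType) {
    List<String> ids = new ArrayList<>();
    for (ParserExt ext : loaded) {
      if (contentType.equals(ext.contentType)) {
        ids.add(ext.id);      // exact match: appended in discovery order
      } else if ("*".equals(ext.contentType)) {
        ids.add(0, ext.id);   // catch-all parser: moved to the front
      }
    }
    return ids;
  }

  public static void main(String[] args) {
    List<ParserExt> loaded = Arrays.asList(
        new ParserExt("parse-html", "text/html"),
        new ParserExt("parse-tika", "*"),
        new ParserExt("parse-pdf", "application/pdf"));
    // prints [parse-tika, parse-html]
    System.out.println(candidates(loaded, "text/html"));
  }
}
```

Writing the comparison as "*".equals(...) keeps the check safe when an extension declares no contentType attribute at all.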
   





  
> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, sample.tar.gz, TikaParser.java
>
>

[jira] Commented: (NUTCH-766) Tika parser

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832406#action_12832406 ] 

Sami Siren commented on NUTCH-766:
----------------------------------

I suggest that we drive this a bit further: currently this patch does not use Tika for package formats or for HTML.

Julien: was there a reason not to use AutoDetectParser? The only thing I could come up with was that the mime type detection would be done twice. We could get around this by implementing something similar to what the composite parser does (it uses a parser class (AutoDetectParser) from the context to do further parsing) to cover all supported package formats.

Also, was there a reason not to parse HTML with Tika?

I have a patch nearby that demonstrates some of the improvements; I will try to post it shortly.

> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz
>
>

[jira] Commented: (NUTCH-766) Tika parser

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805892#action_12805892 ] 

Julien Nioche commented on NUTCH-766:
-------------------------------------

Here is a slightly better version of the patch, which:
• fixes a small bug in the Tika parser (the API has changed slightly between 1.5beta and 1.5)
• fixes a bug with the TestParserFactory
• adds the tika-plugin to the list of plugins to be built in src/plugin/build.xml
• limits public exposure of methods and classes (see Sami's comment)
• modifies parse-plugins.xml: adds parse-tika and comments out the associations between some mime-types and the old parsers

I've also added an Ant script which uses Ivy to pull the dependencies and copy them into the lib dir. Obviously this won't be needed once the plugin is committed, but it should simplify the initial testing. All you need to do after applying the patch is:

cd src/plugin/parse-tika/
ant -f build-ivy.xml

I am also attaching the content of the sample directory as an archive; just unpack it into src/plugin/parse-tika/ before calling ant test-plugins.

Julien




> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch, NUTCH-766.v2, sample.tar.gz
>
>
> Tika handles a lot of different formats under the bonnet and exposes them nicely via SAX events. What is described here is a tika-parser plugin which delegates the pasring mechanism of Tika but can still coexist with the existing parsing plugins which is useful for formats partially handled by Tika (or not at all). Some of the elements below have already been discussed on the mailing lists. Note that this is work in progress, your feedback is welcome.
> Tika is already used by Nutch for its MimeType implementations. Tika comes as different jar files (core and parsers), in the work described here we decided to put the libs in 2 different places
> NUTCH_HOME/lib : tika-core.jar
> NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
> Tika being used by the core only for its Mimetype functionalities we only need to put tika-core at the main lib level whereas the tika plugin obviously needs the tika-parsers.jar + all the jars used internally by Tika
> Due to limitations in the way Tika loads its classes, we had to duplicate the TikaConfig class in the tika-plugin. This might be fixed in the future in Tika itself or avoided by refactoring the mimetype part of Nutch using extension points.
> Unlike most other parsers, Tika handles more than one Mime-type which is why we are using "*" as its mimetype value in the plugin descriptor and have modified ParserFactory.java so that it considers the tika parser as potentially suitable for all mime-types. In practice this means that the associations between a mime type and a parser plugin as defined in parse-plugins.xml are useful only for the cases where we want to handle a mime type with a different parser than Tika. 
> The general approach I chose was to convert the SAX events returned by the Tika parsers into DOM objects and reuse the utilities that come with the current HTML parser, i.e. link detection and metatag handling. It also means that we can use the HTMLParseFilters in exactly the same way. The main difference, though, is that HTMLParseFilters are no longer limited to HTML documents, as the XHTML tags returned by Tika can correspond to a different format for the original document. There is a duplication of code with the html-plugin which will be resolved by either a) getting rid of the html-plugin altogether or b) exporting its jar and making the tika parser depend on it.
> The following libraries are required in the lib/ directory of the tika-parser : 
>       <library name="asm-3.1.jar"/>
>       <library name="bcmail-jdk15-144.jar"/>
>       <library name="commons-compress-1.0.jar"/>
>       <library name="commons-logging-1.1.1.jar"/>
>       <library name="dom4j-1.6.1.jar"/>
>       <library name="fontbox-0.8.0-incubator.jar"/>
>       <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
>       <library name="hamcrest-core-1.1.jar"/>
>       <library name="jce-jdk13-144.jar"/>
>       <library name="jempbox-0.8.0-incubator.jar"/>
>       <library name="metadata-extractor-2.4.0-beta-1.jar"/>
>       <library name="mockito-core-1.7.jar"/>
>       <library name="objenesis-1.0.jar"/>
>       <library name="ooxml-schemas-1.0.jar"/>
>       <library name="pdfbox-0.8.0-incubating.jar"/>
>       <library name="poi-3.5-FINAL.jar"/>
>       <library name="poi-ooxml-3.5-FINAL.jar"/>
>       <library name="poi-scratchpad-3.5-FINAL.jar"/>
>       <library name="tagsoup-1.2.jar"/>
>       <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
>       <library name="xml-apis-1.0.b2.jar"/>
>       <library name="xmlbeans-2.3.0.jar"/>
> There is a small test suite which needs to be improved. We will need to have a look at each individual format and check that it is covered by Tika and if so to the same extent; the Wiki is probably the right place for this. The language identifier (which is a HTMLParseFilter) seemed to work fine.
>  
> Again, your comments are welcome. Please bear in mind that this is just a first step. 
> Julien
> http://www.digitalpebble.com
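
As a rough illustration of the SAX-to-DOM conversion described in the issue text above, here is a minimal sketch using only the JDK's JAXP classes. It is not the code from the patch: the class name SaxToDomSketch, the emitSampleXhtml stand-in for a Tika parser, and the countLinks helper are all assumptions made for the example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;

import org.w3c.dom.Document;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical sketch: collect SAX events into a DOM tree, then walk the DOM. */
public class SaxToDomSketch {

    /** Stand-in for a Tika parser (assumption): emits a tiny XHTML document as SAX events. */
    public static void emitSampleXhtml(ContentHandler handler) throws Exception {
        handler.startDocument();
        handler.startElement("", "html", "html", new AttributesImpl());
        handler.startElement("", "body", "body", new AttributesImpl());
        AttributesImpl attrs = new AttributesImpl();
        attrs.addAttribute("", "href", "href", "CDATA", "http://example.com/");
        handler.startElement("", "a", "a", attrs);
        char[] text = "example".toCharArray();
        handler.characters(text, 0, text.length);
        handler.endElement("", "a", "a");
        handler.endElement("", "body", "body");
        handler.endElement("", "html", "html");
        handler.endDocument();
    }

    /** Builds a DOM Document from the SAX events via an identity TransformerHandler. */
    public static Document toDom() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        // With no stylesheet, the handler copies incoming events to its result.
        TransformerHandler handler = factory.newTransformerHandler();
        handler.setResult(new DOMResult(doc));
        emitSampleXhtml(handler);
        return doc;
    }

    /** DOM-based utility in the spirit of link detection: counts anchor elements. */
    public static int countLinks(Document doc) {
        return doc.getElementsByTagName("a").getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("links found: " + countLinks(toDom()));
    }
}
```

Once the events are in a DOM tree, any DOM-based utility or filter can be applied regardless of the original document format, which is the point made in the description.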

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-766) Tika parser

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832583#action_12832583 ] 

Julien Nioche commented on NUTCH-766:
-------------------------------------

@Chris : did you do 

ant -f src/plugin/parse-tika/build-ivy.xml 

between steps 5 and 6? This is required in order to populate the lib directory automatically.

> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, sample.tar.gz, TikaParser.java

[jira] Commented: (NUTCH-766) Tika parser

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805661#action_12805661 ] 

Sami Siren commented on NUTCH-766:
----------------------------------

{quote}
Sure, it's more of a configuration backwards-compat issue. Some folks have gone to the trouble of customizing their nutch configuration (nutch-site.xml, or nutch-default.xml, or even parse-plugins). If we remove the parsing plugins (e.g., basically say they don't exist anymore and update your deployed configuration to use the tika-plugin), this patch would require a configuration update in their deployed environments. Because of that, why don't we ease them into that upgrade with at least one released version before the plugins go away? It would make it easier from a configuration backwards-compat perspective.
{quote}

Ok, so you mean that we need to have duplicate parser plugins because we don't want to ask people already using nutch to reconfigure the bits this involves now even though we have to do it later? How is postponing going to ease the task they need to do anyway at some point? I still don't understand the (longer term) benefit.

I am not strongly against the idea of keeping duplicate plugins; I mean, it's just another ~20M in the .job. What I am worried about is that history will repeat itself and we will end up with one more case of duplicate components (in this case many of them) doing the same work, and no interest in cleaning up afterwards. Doing it the way I suggested would guarantee that this will not happen.


> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch

[jira] Commented: (NUTCH-766) Tika parser

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798727#action_12798727 ] 

Julien Nioche commented on NUTCH-766:
-------------------------------------

Hi Chris, 

No worries, I'd rather wait for you to have a look at it. It's quite a big change and it would be better if someone else reviewed it. Being the author, I might miss something obvious.

Thanks

J.

> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch

[jira] Commented: (NUTCH-766) Tika parser

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832565#action_12832565 ] 

Chris A. Mattmann commented on NUTCH-766:
-----------------------------------------

Hi Julien:

{quote}
@Chris : I just did a fresh co from svn, applied the patch v3 and unzipped sample.tar.gz onto the directory parse-tika and ran the test just as you did but could not reproduce the problem. Could there be a difference between your version and the trunk? 
{quote}

I tried this process last night:

1. SVN up to r908832
2. download patch v3
3. download sample.tgz
4. apply patch v3 to r908832
5. untar sample.tgz into src/plugin/parse-tika, creating a sample folder in that dir
6. ant clean compile-core test

Any idea why I'm seeing the error?

Cheers,
Chris


> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, sample.tar.gz, TikaParser.java

[jira] Commented: (NUTCH-766) Tika parser

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804448#action_12804448 ] 

Sami Siren commented on NUTCH-766:
----------------------------------

>+1, I'm going to agree on this one here Julien. Other communities have convinced me of the need for backwards compat and unobtrusiveness when bringing in new functionality or results. +1 to leaving the old plugins in place at least in Nutch 1.1 (perhaps mentioning they should be deprecated and replaced by the Tika functionality) and then removing them in 1.2 or 1.3.

Chris, can you please explain to me how keeping two components doing identical work would be more backwards compatible than having only one?



> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch

[jira] Commented: (NUTCH-766) Tika parser

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803673#action_12803673 ] 

Sami Siren commented on NUTCH-766:
----------------------------------

> Sure, but it would be silly to block the whole Tika plugin because Tika does not support such or such format as well as the original Nutch plugins. As I explained above we can configure which parser to use for which mimetype and use the Tika-plugin by default. Hopefully the Tika implementation will get better and better and there will be no need for keeping the old plugins.

I meant test files for the parsers we replace, not all of them.

> BTW http://wiki.apache.org/nutch/TikaPlugin lists the differences between the current version of Tika and the existing Nutch parsers

OK, I had missed that one.
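
For readers following the quoted point about configuring which parser handles which mime-type and using the Tika plugin by default, the lookup could be sketched roughly as below. This is a hypothetical illustration, not the actual ParserFactory code: the class ParserSelectionSketch, its methods, and the "parse-html" association are assumptions; only the plugin name parse-tika comes from the thread.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of "explicit association first, Tika otherwise". */
public class ParserSelectionSketch {

    /** Plugin declared with "*" as its mime type; used when nothing else matches. */
    public static final String WILDCARD_PLUGIN = "parse-tika";

    /** Explicit mime-type associations, standing in for parse-plugins.xml (assumption). */
    private final Map<String, String> explicit = new HashMap<>();

    /** Records that a given mime type should be handled by a specific plugin. */
    public void associate(String mimeType, String pluginId) {
        explicit.put(mimeType, pluginId);
    }

    /** Returns the explicitly associated plugin, or the wildcard plugin if there is none. */
    public String pluginFor(String mimeType) {
        return explicit.getOrDefault(mimeType, WILDCARD_PLUGIN);
    }

    public static void main(String[] args) {
        ParserSelectionSketch selection = new ParserSelectionSketch();
        // Illustrative override: keep a dedicated parser for one mime type.
        selection.associate("text/html", "parse-html");
        System.out.println(selection.pluginFor("text/html"));
        System.out.println(selection.pluginFor("application/pdf"));
    }
}
```

With this shape, an entry is only needed for the mime types that should not go to Tika, which matches the role the description gives to parse-plugins.xml.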

> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch
>
>
> Tika handles a lot of different formats under the bonnet and exposes them nicely via SAX events. What is described here is a tika-parser plugin which delegates the pasring mechanism of Tika but can still coexist with the existing parsing plugins which is useful for formats partially handled by Tika (or not at all). Some of the elements below have already been discussed on the mailing lists. Note that this is work in progress, your feedback is welcome.
> Tika is already used by Nutch for its MimeType implementations. Tika comes as different jar files (core and parsers). In the work described here we decided to put the libs in 2 different places:
> NUTCH_HOME/lib : tika-core.jar
> NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
> Since Tika is used by the core only for its MimeType functionality, we only need to put tika-core at the main lib level, whereas the tika plugin needs tika-parsers.jar plus all the jars used internally by Tika.
> Due to limitations in the way Tika loads its classes, we had to duplicate the TikaConfig class in the tika-plugin. This might be fixed in the future in Tika itself or avoided by refactoring the mimetype part of Nutch using extension points.
> Unlike most other parsers, Tika handles more than one Mime-type which is why we are using "*" as its mimetype value in the plugin descriptor and have modified ParserFactory.java so that it considers the tika parser as potentially suitable for all mime-types. In practice this means that the associations between a mime type and a parser plugin as defined in parse-plugins.xml are useful only for the cases where we want to handle a mime type with a different parser than Tika. 
> The general approach I chose was to convert the SAX events returned by the Tika parsers into DOM objects and reuse the utilities that come with the current HTML parser, i.e. link detection and metatag handling. This also means that we can use the HTMLParseFilters in exactly the same way. The main difference, though, is that HTMLParseFilters are not limited to HTML documents anymore, as the XHTML tags returned by Tika can correspond to a different format for the original document. There is a duplication of code with the html-plugin which will be resolved by either a) getting rid of the html-plugin altogether or b) exporting its jar and making the tika parser depend on it.
> The following libraries are required in the lib/ directory of the tika-parser : 
>       <library name="asm-3.1.jar"/>
>       <library name="bcmail-jdk15-144.jar"/>
>       <library name="commons-compress-1.0.jar"/>
>       <library name="commons-logging-1.1.1.jar"/>
>       <library name="dom4j-1.6.1.jar"/>
>       <library name="fontbox-0.8.0-incubator.jar"/>
>       <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
>       <library name="hamcrest-core-1.1.jar"/>
>       <library name="jce-jdk13-144.jar"/>
>       <library name="jempbox-0.8.0-incubator.jar"/>
>       <library name="metadata-extractor-2.4.0-beta-1.jar"/>
>       <library name="mockito-core-1.7.jar"/>
>       <library name="objenesis-1.0.jar"/>
>       <library name="ooxml-schemas-1.0.jar"/>
>       <library name="pdfbox-0.8.0-incubating.jar"/>
>       <library name="poi-3.5-FINAL.jar"/>
>       <library name="poi-ooxml-3.5-FINAL.jar"/>
>       <library name="poi-scratchpad-3.5-FINAL.jar"/>
>       <library name="tagsoup-1.2.jar"/>
>       <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
>       <library name="xml-apis-1.0.b2.jar"/>
>       <library name="xmlbeans-2.3.0.jar"/>
> There is a small test suite which needs to be improved. We will need to have a look at each individual format and check that it is covered by Tika and if so to the same extent; the Wiki is probably the right place for this. The language identifier (which is a HTMLParseFilter) seemed to work fine.
>  
> Again, your comments are welcome. Please bear in mind that this is just a first step. 
> Julien
> http://www.digitalpebble.com
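
For readers following the thread, the SAX-to-DOM step in the quoted description can be sketched with JDK classes alone. This is not the code from the attached patches: the class name SaxToDomSketch and the method toDom are made up for illustration, and a plain JAXP SAX parser stands in for the Tika parser as the source of SAX events. The idea is only that an identity TransformerHandler is a ContentHandler which copies whatever events it receives into a DOM tree, which DOM-based utilities such as link detection could then walk.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical sketch: names are illustrative, not taken from the patch.
public class SaxToDomSketch {

    /** Replays the SAX events produced for the given XHTML string into a DOM Document. */
    public static Document toDom(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

        // An identity TransformerHandler copies incoming SAX events into its Result.
        SAXTransformerFactory factory = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.setResult(new DOMResult(doc));

        // Assumption: a plain SAX parser stands in for whatever emits the XHTML events.
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader reader = spf.newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xhtml)));
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document doc = toDom("<html><body><a href=\"http://example.org/\">link</a></body></html>");
        // Once the events are in a DOM, outlinks can be collected by walking the tree.
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            System.out.println(((Element) anchors.item(i)).getAttribute("href"));
        }
    }
}
```

Because the handler only sees SAX events, the same DOM-building step would apply whatever the original document format was, which is the point made above about HTMLParseFilters no longer being limited to HTML.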


[jira] Updated: (NUTCH-766) Tika parser

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-766:
--------------------------------

    Attachment:     (was: Nutch-766.ParserFactory.patch)

> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, sample.tar.gz
>
>

[jira] Resolved: (NUTCH-766) Tika parser

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved NUTCH-766.
-------------------------------------

    Resolution: Fixed

- Committed in r909268. I added comments in nutch-default.xml near the plugin.includes block that enables parse-tika. Sami, I'll create a new issue now to track your proposed updates to the Tika parser. I ran the unit tests with the patch I committed, and they all passed.

Thanks, Julien!

> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, sample.tar.gz, TikaParser.java
>
>

[jira] Commented: (NUTCH-766) Tika parser

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804558#action_12804558 ] 

Andrzej Bialecki  commented on NUTCH-766:
-----------------------------------------

I agree with Chris, +1 on keeping the old plugins in 1.1 with a prominent deprecation note, but I feel equally strongly that we should not prolong their life-cycle beyond what we can support, i.e. I'm +1 on removing them in 1.2/1.3. We simply don't have resources to maintain so many duplicate plugins, and instead we should direct our efforts to improve those in Tika.

> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch
>
>

[jira] Updated: (NUTCH-766) Tika parser

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-766:
--------------------------------

    Attachment: NUTCH-766.v2
                sample.tar.gz

new version of the patch + archive containing the binary docs used for testing

> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch, NUTCH-766.v2, sample.tar.gz
>
>

[jira] Commented: (NUTCH-766) Tika parser

Posted by "Chris A. Mattmann (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832866#action_12832866 ] 

Chris A. Mattmann commented on NUTCH-766:
-----------------------------------------

- I forgot to add the dependency libs; added in r909269. Thanks!

> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, sample.tar.gz, TikaParser.java
>
>

[jira] Commented: (NUTCH-766) Tika parser

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833333#action_12833333 ] 

Hudson commented on NUTCH-766:
------------------------------

Integrated in Nutch-trunk #1067 (See [http://hudson.zones.apache.org/hudson/job/Nutch-trunk/1067/])
    - 2nd part of  Tika parser
- fix for  Tika parser


> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, sample.tar.gz, TikaParser.java
>
>
> Tika handles a lot of different formats under the bonnet and exposes them nicely via SAX events. What is described here is a tika-parser plugin which delegates the pasring mechanism of Tika but can still coexist with the existing parsing plugins which is useful for formats partially handled by Tika (or not at all). Some of the elements below have already been discussed on the mailing lists. Note that this is work in progress, your feedback is welcome.
> Tika is already used by Nutch for its MimeType implementations. Tika comes as different jar files (core and parsers), in the work described here we decided to put the libs in 2 different places
> NUTCH_HOME/lib : tika-core.jar
> NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
> Tika being used by the core only for its Mimetype functionalities we only need to put tika-core at the main lib level whereas the tika plugin obviously needs the tika-parsers.jar + all the jars used internally by Tika
> Due to limitations in the way Tika loads its classes, we had to duplicate the TikaConfig class in the tika-plugin. This might be fixed in the future in Tika itself or avoided by refactoring the mimetype part of Nutch using extension points.
> Unlike most other parsers, Tika handles more than one Mime-type which is why we are using "*" as its mimetype value in the plugin descriptor and have modified ParserFactory.java so that it considers the tika parser as potentially suitable for all mime-types. In practice this means that the associations between a mime type and a parser plugin as defined in parse-plugins.xml are useful only for the cases where we want to handle a mime type with a different parser than Tika. 
> The general approach I chose was to convert the SAX events returned by the Tika parsers into DOM objects and reuse the utilities that come with the current HTML parser i.e. link detection,  metatag handling but also means that we can use the HTMLParseFilters in exactly the same way. The main difference though is that HTMLParseFilters are not limited to HTML documents anymore as the XHTML tags returned by Tika can correspond to a different format for the original document. There is a duplication of code with the html-plugin which will be resolved by either a) getting rid of the html-plugin altogether or b) exporting its jar and make the tika parser depend on it.
> The following libraries are required in the lib/ directory of the tika-parser : 
>       <library name="asm-3.1.jar"/>
>       <library name="bcmail-jdk15-144.jar"/>
>       <library name="commons-compress-1.0.jar"/>
>       <library name="commons-logging-1.1.1.jar"/>
>       <library name="dom4j-1.6.1.jar"/>
>       <library name="fontbox-0.8.0-incubator.jar"/>
>       <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
>       <library name="hamcrest-core-1.1.jar"/>
>       <library name="jce-jdk13-144.jar"/>
>       <library name="jempbox-0.8.0-incubator.jar"/>
>       <library name="metadata-extractor-2.4.0-beta-1.jar"/>
>       <library name="mockito-core-1.7.jar"/>
>       <library name="objenesis-1.0.jar"/>
>       <library name="ooxml-schemas-1.0.jar"/>
>       <library name="pdfbox-0.8.0-incubating.jar"/>
>       <library name="poi-3.5-FINAL.jar"/>
>       <library name="poi-ooxml-3.5-FINAL.jar"/>
>       <library name="poi-scratchpad-3.5-FINAL.jar"/>
>       <library name="tagsoup-1.2.jar"/>
>       <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
>       <library name="xml-apis-1.0.b2.jar"/>
>       <library name="xmlbeans-2.3.0.jar"/>
> There is a small test suite which needs to be improved. We will need to look at each individual format and check that it is covered by Tika and, if so, to the same extent as by the existing parser; the Wiki is probably the right place for this. The language identifier (which is an HTMLParseFilter) seemed to work fine.
>  
> Again, your comments are welcome. Please bear in mind that this is just a first step. 
> Julien
> http://www.digitalpebble.com
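
The SAX-to-DOM step in the description above can be illustrated with JDK classes alone. This is only a minimal sketch and not the code from the attached patch: it feeds SAX events from the JDK's own XML parser (standing in for a Tika parser) into an identity TransformerHandler that writes into a DOM Document, then walks the DOM for links. The class name SaxToDomSketch, the helper names toDom and links, and the sample markup are all made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical illustration class; not part of Nutch or Tika.
class SaxToDomSketch {

    /** Replays the SAX events produced for an XHTML string into a DOM Document. */
    static Document toDom(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();

            // An identity TransformerHandler is a SAX ContentHandler that
            // writes whatever events it receives into the given Result.
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.setResult(new DOMResult(doc));

            // The JDK SAX parser plays the role of the event source here;
            // in the plugin the events would come from a Tika parser instead.
            SAXParserFactory parserFactory = SAXParserFactory.newInstance();
            parserFactory.setNamespaceAware(true);
            XMLReader reader = parserFactory.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xhtml)));
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("could not build DOM from SAX events", e);
        }
    }

    /** Collects the href values of all <a> elements, as a DOM-based link detector might. */
    static List<String> links(Document doc) {
        List<String> hrefs = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element anchor = (Element) anchors.item(i);
            if (anchor.hasAttribute("href")) {
                hrefs.add(anchor.getAttribute("href"));
            }
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Sample</title></head><body>"
                + "<p>See <a href=\"http://example.org/a\">A</a> and "
                + "<a href=\"http://example.org/b\">B</a>.</p></body></html>";
        Document doc = toDom(xhtml);
        System.out.println(doc.getDocumentElement().getNodeName());
        System.out.println(links(doc));
    }
}
```

Once the events are in a DOM, any utility written against org.w3c.dom nodes can be applied regardless of the original document format, which is the property the description relies on for reusing the HTML parser utilities and HTMLParseFilters.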

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Closed: (NUTCH-766) Tika parser

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche closed NUTCH-766.
-------------------------------


Have added small improvement in revision 910187 (Prioritise default Tika parser when discovering plugins matching mime-type).
Thanks to Chris for testing and committing it + Andrzej and Sami for their comments and suggestions

> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, sample.tar.gz, TikaParser.java
>
>


[jira] Updated: (NUTCH-766) Tika parser

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sami Siren updated NUTCH-766:
-----------------------------

    Attachment: TikaParser.java

Modified parser that can process package formats too. To get rid of the mime type detection happening twice we have to extend AutoDetectParser so that it skips the initial detection but still does the detection for the rest of the content (in package formats).
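
The idea in this comment can be sketched without any Tika classes. The following is an assumption-laden illustration, not the attached TikaParser.java and not the AutoDetectParser API: the container type already known to the caller is trusted as-is, and detection runs only for the entries inside the package. The class SkipInitialDetectionSketch, its methods and the toy magic-byte check are all hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; it does not use or mirror the Tika API.
class SkipInitialDetectionSketch {

    private int detectionCalls = 0;

    /** Toy stand-in for mime type detection: looks only at a PDF magic prefix. */
    String detect(byte[] content) {
        detectionCalls++;
        int length = Math.min(content.length, 4);
        String head = new String(content, 0, length, StandardCharsets.US_ASCII);
        return head.equals("%PDF") ? "application/pdf" : "text/plain";
    }

    /**
     * Returns the type of the container followed by the type of each entry.
     * The container type is taken from the caller (it was detected earlier),
     * so detection runs once per embedded entry and never for the container.
     */
    List<String> resolveTypes(String knownContainerType, List<byte[]> entries) {
        List<String> types = new ArrayList<>();
        types.add(knownContainerType);
        for (byte[] entry : entries) {
            types.add(detect(entry));
        }
        return types;
    }

    int detectionCalls() {
        return detectionCalls;
    }

    public static void main(String[] args) {
        SkipInitialDetectionSketch sketch = new SkipInitialDetectionSketch();
        List<byte[]> entries = Arrays.asList(
                "%PDF-1.4".getBytes(StandardCharsets.US_ASCII),
                "hello".getBytes(StandardCharsets.US_ASCII));
        System.out.println(sketch.resolveTypes("application/zip", entries));
        System.out.println("detections run: " + sketch.detectionCalls());
    }
}
```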

> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java, sample.tar.gz, TikaParser.java
>
>


[jira] Updated: (NUTCH-766) Tika parser

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-766:
--------------------------------

    Attachment: Nutch-766.ParserFactory.patch

Patch for the ParserFactory to allow * as mimetype value for a parser plugin
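
How a "*" mimetype value might be resolved can be sketched as follows. This is not the code in Nutch-766.ParserFactory.patch: the class WildcardParserLookup, its methods and the plugin ids are hypothetical, and the ordering (explicitly mapped plugins first, wildcard plugins as fallback) is an assumption of the sketch rather than something the patch is known to do.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not the actual Nutch ParserFactory.
class WildcardParserLookup {

    static final String WILDCARD = "*";

    // Declared mime type -> ids of the plugins that declared it, in registration order.
    private final Map<String, List<String>> pluginsByType = new LinkedHashMap<>();

    /** Records that a plugin declares support for a mime type, or "*" for any type. */
    void register(String pluginId, String declaredMimeType) {
        pluginsByType
                .computeIfAbsent(declaredMimeType, key -> new ArrayList<>())
                .add(pluginId);
    }

    /**
     * Returns candidate plugins for a mime type: plugins that declared the
     * exact type first (assumed ordering), then plugins that declared "*".
     */
    List<String> candidatesFor(String mimeType) {
        List<String> candidates = new ArrayList<>(
                pluginsByType.getOrDefault(mimeType, Collections.emptyList()));
        for (String pluginId
                : pluginsByType.getOrDefault(WILDCARD, Collections.emptyList())) {
            if (!candidates.contains(pluginId)) {
                candidates.add(pluginId);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        WildcardParserLookup lookup = new WildcardParserLookup();
        lookup.register("parse-tika", WILDCARD);
        lookup.register("parse-html", "text/html");
        System.out.println(lookup.candidatesFor("text/html"));
        System.out.println(lookup.candidatesFor("application/pdf"));
    }
}
```

With this ordering, a wildcard plugin is offered for every mime type, while an explicit association still lets another parser take precedence for the types it declares.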

> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>         Attachments: Nutch-766.ParserFactory.patch, NUTCH-766.tika.patch
>
>
