Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2018/06/20 15:07:00 UTC

[jira] [Commented] (NUTCH-2603) Bring back legacy pre-Tika parsers and use them as back up parsers

    [ https://issues.apache.org/jira/browse/NUTCH-2603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16518239#comment-16518239 ] 

Sebastian Nagel commented on NUTCH-2603:
----------------------------------------

Hi [~ArkadiKosmynin], this contradicts my experience: I've used Nutch since version 0.9 (without parse-tika) to process PDFs and office documents as well. Most issues with these documents disappeared over time as Tika matured.

I've tried to reproduce any parsing issues using the recent Nutch master branch (with Tika 1.18) by randomly picking documents parsed with one of the "legacy" parsers.
- at least, some documents require authentication, e.g. [1|https://www.atnf.csiro.au/lists/at_meetings/2006/att-0064/CJ_SKA_Industry_Astro_update_Apr_06.ppt], [2|https://www.atnf.csiro.au/lists/at_meetings/2008/att-0001/ASKAP_Antennas_summary.doc], [3|https://svn.atnf.csiro.au/askap/ASKAPDesignEnhancements/PCB/790-0015-Bullant/Datasheets/Control.xls]. [~ArkadiKosmynin], could you provide some of these documents or remove the access restrictions?
- among the remaining URLs there is one systematic error: the server behind www.atnf.csiro.au regularly sends {{application/msword}} for plain-text documents ending in {{.doc}}, e.g. [update.doc|https://www.atnf.csiro.au/computing/software/gipsy/doc/update.doc]. Tika can parse this document; I've opened NUTCH-2606 to address this.
- Tika fails to parse MS Word 2.0 documents: [zenpap4.doc|http://www.atnf.csiro.au/pub/data/pmn/zenpap4.doc]. It's a known issue that Tika (resp. POI) cannot parse old MS Office documents, see TIKA-2107. However, I doubt that they have been successfully parsed using the "legacy" parsers:
-* testing with Nutch 1.0 and parse-msword I get a failed parse with the error:
{noformat}
Can't be handled as Microsoft document. java.io.IOException: Invalid header signature; read 867295287388775899, expected -2226271756974174256
{noformat}
-* similarly when using the recent Nutch master and Tika 1.18:
{noformat}
% bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-tika' -forceAs application/msword http://www.atnf.csiro.au/pub/data/pmn/zenpap4.doc
...
Status: failed(2,0): Invalid header signature; read 0x0C094078002DA5DB, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document
{noformat}
-* when forcing the OpenDocumentParser, the result is a successful but empty parse, both for recent Nutch/Tika as well as Nutch 1.0 and parse-oo:
{noformat}
% bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-tika' -forceAs application/vnd.oasis.opendocument.text http://www.atnf.csiro.au/pub/data/pmn/zenpap4.doc
...
Status: success(1,0)
Title: 
Outlinks: 0
{noformat}
resp.
{noformat}
% nutch-1.0/bin/nutch org.apache.nutch.parse.ParserChecker -forceAs application/vnd.oasis.opendocument.text http://www.atnf.csiro.au/pub/data/pmn/zenpap4.doc
...
Status: success(1,0)
Title: 
Outlinks: 0
...
{noformat}
The same happens when calling the main routine of the parse-oo plugin directly (with a local file):
{noformat}
% nutch-1.0/bin/nutch plugin parse-oo org.apache.nutch.parse.oo.OOParser .../zenpap4.doc
Version: 5
Status: success(1,0)
Title: 
Outlinks: 0
Content Metadata: 
Parse Metadata: 

Text: ''
{noformat}
I've opened TIKA-2675 to address this problem. [~ArkadiKosmynin], is it possible that the message in the attached public_docs.txt is also misleading?
{noformat}
arch.log.2018-06-15:2018-06-15 16:05:49,686 INFO  parse.ParseUtil - Successfully parsed [http://www.atnf.csiro.au/pub/data/pmn/zenpap4.doc] with [org.apache.nutch.parse.oo.OOParser@380d5de2]
{noformat}
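
As a side note, the two "Invalid header signature" messages quoted above (Nutch 1.0 with parse-msword, and Nutch master with Tika 1.18) appear to report the same two values, once as signed decimal and once as hexadecimal. A minimal Python sketch to check this, and to test whether a file starts with the OLE2 signature at all. This is not code from Nutch, Tika or POI; the function names are made up for illustration, and reading the first 8 bytes as a little-endian 64-bit integer is an assumption about how the quoted values were produced:

```python
import struct

# "Expected" value from the Tika 1.18 error message quoted above.
OLE2_SIGNATURE = 0xE11AB1A1E011CFD0


def to_signed64(value):
    """Interpret an unsigned 64-bit integer as a signed (two's complement) one."""
    return value - (1 << 64) if value >= (1 << 63) else value


def header_signature(first_bytes):
    """Return the first 8 bytes as an unsigned little-endian 64-bit integer."""
    if len(first_bytes) < 8:
        raise ValueError("need at least 8 bytes")
    return struct.unpack("<Q", first_bytes[:8])[0]


def looks_like_ole2(first_bytes):
    """True if the data starts with the OLE2 signature (hypothetical helper)."""
    return len(first_bytes) >= 8 and header_signature(first_bytes) == OLE2_SIGNATURE


# Hex values from the Tika 1.18 message, printed the way Nutch 1.0 reported them.
print(to_signed64(0xE11AB1A1E011CFD0))  # -2226271756974174256 (expected)
print(to_signed64(0x0C094078002DA5DB))  # 867295287388775899 (read)
```

To check a local copy of a document, pass its first bytes, e.g. {{looks_like_ole2(open(path, "rb").read(8))}}, where {{path}} is a placeholder for the downloaded file.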

These are the only problems I've seen so far. Please open separate issues for other problems. I strongly prefer fixing these issues over maintaining potentially buggy legacy parsers.

> Bring back legacy pre-Tika parsers and use them as back up parsers
> ------------------------------------------------------------------
>
>                 Key: NUTCH-2603
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2603
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.15
>            Reporter: Arkadi Kosmynin
>            Priority: Major
>         Attachments: public_docs.txt
>
>
> There are cases when legacy parsers successfully parse documents on which Tika fails. I am attaching a list of examples of such documents. Nutch allows use of more than one parser on a document, in a sequence, until the document has been parsed successfully. Thus, old parsers can be combined with Tika to achieve better parsing success rate, at least until Tika is perfect.


