Posted to dev@tika.apache.org by "Timo Boehme (JIRA)" <ji...@apache.org> on 2013/12/03 10:22:38 UTC

[jira] [Commented] (TIKA-1201) Add possibility for switching to pdfbox NonSequentialPDFParser

    [ https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13837482#comment-13837482 ] 

Timo Boehme commented on TIKA-1201:
-----------------------------------

Hi,
I would just like to clarify what can be expected from using the NonSequentialPDFParser. PDFBOX-1104 was only a starting point; the parser as it is implemented now can be found in issue PDFBOX-1199. While in principle the parser could be faster for extracting single pages, it currently parses the whole document because the other classes working on the parser output expect all objects to be available (on-demand parsing might be available in version 2). Thus it is (only) faster if the document contains unused objects (e.g. after the document was edited), which the 'old' parser still analyzes.
However, the real advantage of this parser is that it conforms much more closely to the PDF specification and has no problems with unused content in PDF files (where the 'old' one often failed).
Differences in behavior or results compared to the 'old' parser may arise if
- the document contains unused content (the 'old' parser may interpret/use it)
- the document is not a valid PDF (the 'new' parser needs correct XREF table entries, while the 'old' one finds the objects during parsing)
- the document was edited (the 'new' parser should correctly parse the latest version; the 'old' parser may not in every case)
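
To make the three cases above more concrete, here is a small toy model in Java. It is only a sketch and not PDFBox code: the "file" format, the class XrefLookupSketch and all method names are made up for illustration. It contrasts a parser that collects every object it meets while scanning with one that reads only the objects a cross-reference table points to.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy model only: this is NOT the PDF file format and NOT PDFBox code.
public class XrefLookupSketch {

    // Hypothetical "file": each object is written as "<id> obj <payload> endobj".
    // Object 2 appears twice because the document was edited; object 3 is a
    // leftover that the latest revision no longer references.
    static final String FILE =
        "1 obj catalog endobj\n" +
        "2 obj page-v1 endobj\n" +
        "3 obj orphan endobj\n" +
        "2 obj page-v2 endobj\n";

    static final Pattern OBJ = Pattern.compile("(\\d+) obj (\\S+) endobj");

    // 'Old' style: scan from the start and collect every object that is found,
    // including stale revisions and unused content.
    static List<String> scanSequentially(String file) {
        List<String> found = new ArrayList<>();
        Matcher m = OBJ.matcher(file);
        while (m.find()) {
            found.add(m.group(2));
        }
        return found;
    }

    // 'New' style: read only the objects the cross-reference table points to.
    // A wrong offset is reported as an error instead of being worked around.
    static Map<Integer, String> loadViaXref(String file, Map<Integer, Integer> xref) {
        Map<Integer, String> loaded = new LinkedHashMap<>();
        for (Map.Entry<Integer, Integer> entry : xref.entrySet()) {
            int id = entry.getKey();
            int offset = entry.getValue();
            Matcher m = OBJ.matcher(file);
            boolean ok = offset >= 0 && offset < file.length()
                && m.find(offset)
                && m.start() == offset
                && Integer.parseInt(m.group(1)) == id;
            if (!ok) {
                throw new IllegalStateException("bad xref entry for object " + id);
            }
            loaded.put(id, m.group(2));
        }
        return loaded;
    }

    // Cross-reference table of the latest revision: object id -> byte offset.
    static Map<Integer, Integer> latestXref(String file) {
        Map<Integer, Integer> xref = new LinkedHashMap<>();
        xref.put(1, file.indexOf("1 obj"));
        xref.put(2, file.lastIndexOf("2 obj")); // the edited version of object 2
        return xref;
    }

    public static void main(String[] args) {
        System.out.println("sequential: " + scanSequentially(FILE));
        System.out.println("via xref:   " + loadViaXref(FILE, latestXref(FILE)));
    }
}
```

In this model the sequential scan also returns the orphaned object and the outdated version of object 2, while the xref-driven lookup returns only the latest revision but fails as soon as an offset in the table is wrong.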

Thus it is highly recommended to use the 'new' parser, not because of its speed but because of its much better parsing capabilities.

> Add possibility for switching to pdfbox NonSequentialPDFParser
> --------------------------------------------------------------
>
>                 Key: TIKA-1201
>                 URL: https://issues.apache.org/jira/browse/TIKA-1201
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.4
>         Environment: all
>            Reporter: Hong-Thai Nguyen
>            Assignee: Tim Allison
>            Priority: Critical
>             Fix For: 1.5
>
>         Attachments: TIKA-1201.patch
>
>
> As discussed, we can improve PDF extraction by 45% with this new NonSequentialPDFParser, which also conforms better to the PDF specification. This parser will be integrated by default in pdfbox 2.0.
> ref.: 
> https://issues.apache.org/jira/browse/PDFBOX-1104
> http://pdfbox.apache.org/ideas.html
> We should provide an extended parser or add a parameter to the current PDFParser to call:
> {code}
> PDDocument.loadNonSeq(file, scratchFile);
> {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)