Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2013/12/03 17:26:36 UTC

[jira] [Comment Edited] (TIKA-1201) Add possibility for switching to pdfbox NonSequentialPDFParser

    [ https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13837169#comment-13837169 ] 

Tim Allison edited comment on TIKA-1201 at 12/3/13 4:25 PM:
------------------------------------------------------------

Basic parameter-based capability added in r1547250.  Users should be aware that metadata processing may differ between the NonSequentialPDFParser and the traditional parser.  (See TIKA-1203: the NonSequentialPDFParser fails to extract metadata from testAnnotations.pdf.)


was (Author: tallison@mitre.org):
Basic parameter-based capability added in r1547250.  User beware that there may be differences in metadata processing between the NonSequentialPDFParser and the traditional parser.  Will open issue to track failure to extract metadata from testAnnotations.pdf.

> Add possibility for switching to pdfbox NonSequentialPDFParser
> --------------------------------------------------------------
>
>                 Key: TIKA-1201
>                 URL: https://issues.apache.org/jira/browse/TIKA-1201
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.4
>         Environment: all
>            Reporter: Hong-Thai Nguyen
>            Assignee: Tim Allison
>            Priority: Critical
>             Fix For: 1.5
>
>         Attachments: TIKA-1201.patch
>
>
> As discussed, the new NonSequentialPDFParser can improve PDF extraction by 45% and conforms more closely to the PDF specification. This parser will be the default in pdfbox 2.0.
> ref.: 
> https://issues.apache.org/jira/browse/PDFBOX-1104
> http://pdfbox.apache.org/ideas.html
> We should provide an extended parser, or add a parameter to the current PDFParser, so that it calls:
> {code}
> PDDocument.loadNonSeq(file, scratchFile);
> {code}
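
A minimal sketch of how a parameter-based switch between the two loading paths might be structured. The names PdfConfig, setUseNonSequentialParser and chooseLoader are hypothetical and chosen for illustration only; they are not taken from r1547250 or from the attached patch. To stay self-contained, the sketch returns the name of the PDFBox entry point instead of calling PDFBox.

```java
// Sketch only: PdfConfig and chooseLoader are hypothetical names,
// not part of the Tika or PDFBox API.
class ParserSwitchSketch {

    /** Hypothetical configuration holder carrying the parser-selection flag. */
    static final class PdfConfig {
        // Default to the traditional parser so existing behavior is unchanged.
        private boolean useNonSequentialParser = false;

        boolean isUseNonSequentialParser() {
            return useNonSequentialParser;
        }

        void setUseNonSequentialParser(boolean value) {
            useNonSequentialParser = value;
        }
    }

    /** Returns the name of the PDFBox entry point the flag would select. */
    static String chooseLoader(PdfConfig config) {
        return config.isUseNonSequentialParser()
                ? "PDDocument.loadNonSeq"   // call quoted in the issue description
                : "PDDocument.load";        // traditional sequential loading
    }

    public static void main(String[] args) {
        PdfConfig config = new PdfConfig();
        System.out.println(chooseLoader(config));

        config.setUseNonSequentialParser(true);
        System.out.println(chooseLoader(config));
    }
}
```

Keeping the flag off by default means callers get the traditional parser unless they opt in, which matters given the metadata differences noted in the comment above.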



--
This message was sent by Atlassian JIRA
(v6.1#6144)