Posted to dev@tika.apache.org by "Hong-Thai Nguyen (JIRA)" <ji...@apache.org> on 2013/12/02 16:29:41 UTC
[jira] [Updated] (TIKA-1201) Add possibility for switching to pdfbox NonSequentialPDFParser
[ https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hong-Thai Nguyen updated TIKA-1201:
-----------------------------------
Summary: Add possibility for switching to pdfbox NonSequentialPDFParser (was: Add option for switching to pdfbox NonSequentialPDFParser)
> Add possibility for switching to pdfbox NonSequentialPDFParser
> --------------------------------------------------------------
>
> Key: TIKA-1201
> URL: https://issues.apache.org/jira/browse/TIKA-1201
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.4
> Environment: all
> Reporter: Hong-Thai Nguyen
> Priority: Critical
>
> As discussed, we can improve PDF extraction by 45% with the new NonSequentialPDFParser, which also conforms more closely to the PDF specification. This parser will be used by default in pdfbox 2.0.
> ref.:
> https://issues.apache.org/jira/browse/PDFBOX-1104
> http://pdfbox.apache.org/ideas.html
> We should provide an extended parser, or add a parameter to the current PDFParser, so that it calls:
> {code}
> PDDocument.loadNonSeq(file, scratchFile);
> {code}
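One way the "parameter" variant might look is sketched below. This is only an illustration of the switch, not Tika or PDFBox code: the class PdfLoadSwitch, its Config flag useNonSequentialParser, and the Loader interface are all hypothetical names. The loaders are passed in as plain functions so the sketch runs without PDFBox on the classpath; with PDFBox 1.8.x they would presumably wrap PDDocument.load(file, scratchFile) and PDDocument.loadNonSeq(file, scratchFile).

```java
import java.io.File;
import java.io.IOException;

// Hypothetical sketch: none of these names exist in Tika or PDFBox.
public class PdfLoadSwitch {

    /** Stand-in for "open this PDF"; a real loader would return a PDDocument. */
    public interface Loader<D> {
        D open(File file) throws IOException;
    }

    /** Hypothetical configuration with one flag, defaulting to the current behaviour. */
    public static final class Config {
        private boolean useNonSequentialParser = false;

        public boolean isUseNonSequentialParser() {
            return useNonSequentialParser;
        }

        public void setUseNonSequentialParser(boolean value) {
            this.useNonSequentialParser = value;
        }
    }

    /** Picks the loader according to the flag and opens the file with it. */
    public static <D> D open(File file, Config config,
                             Loader<D> sequential, Loader<D> nonSequential)
            throws IOException {
        Loader<D> chosen = config.isUseNonSequentialParser() ? nonSequential : sequential;
        return chosen.open(file);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: with PDFBox 1.8.x these two would be
        //   f -> PDDocument.load(f, scratchFile)
        //   f -> PDDocument.loadNonSeq(f, scratchFile)
        // Strings are used here only to make the selection visible.
        Loader<String> sequential = f -> "sequential:" + f.getName();
        Loader<String> nonSequential = f -> "nonSequential:" + f.getName();

        Config config = new Config();
        System.out.println(open(new File("a.pdf"), config, sequential, nonSequential));

        config.setUseNonSequentialParser(true);
        System.out.println(open(new File("a.pdf"), config, sequential, nonSequential));
    }
}
```

Keeping the flag off by default would leave existing behaviour unchanged for users who do not opt in, which seems the safer choice until pdfbox 2.0 makes the new parser the default.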
--
This message was sent by Atlassian JIRA
(v6.1#6144)