Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2015/07/15 12:49:04 UTC
[jira] [Commented] (TIKA-1679) Parse PDF file page by page
[ https://issues.apache.org/jira/browse/TIKA-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627891#comment-14627891 ]
Tim Allison commented on TIKA-1679:
-----------------------------------
To confirm, is the problem that an exception is thrown at page three and you are losing content from pages 4 and 5?
> Parse PDF file page by page
> ---------------------------
>
> Key: TIKA-1679
> URL: https://issues.apache.org/jira/browse/TIKA-1679
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.9
> Reporter: Raymond Wu
>
> I have a PDF file that contains 5 pages.
> Page 3 cannot be parsed by PDFBox, but the remaining pages are okay.
> So I tried to parse this file page by page.
> I modified the method PDF2XHTML.process() in PDF2XHTML.java as follows:
> public static void process(
>         PDDocument document, ContentHandler handler, Metadata metadata,
>         boolean extractAnnotationText, boolean enableAutoSpace,
>         boolean suppressDuplicateOverlappingText, boolean sortByPosition)
>         throws SAXException, TikaException {
>     try {
>         // Extract text using a dummy Writer as we override the
>         // key methods to output to the given content
>         // handler.
>         Writer dummyWriter = new Writer() {
>             @Override
>             public void write(char[] cbuf, int off, int len) {
>             }
>             @Override
>             public void flush() {
>             }
>             @Override
>             public void close() {
>             }
>         };
>         // Parse page by page
>         int nop = document.getNumberOfPages();
>         for (int i = 1; i <= nop; i++) {
>             PDF2XHTML pdf2XHTML = new PDF2XHTML(handler, metadata,
>                     extractAnnotationText, enableAutoSpace,
>                     suppressDuplicateOverlappingText, sortByPosition);
>             try {
>                 pdf2XHTML.setStartPage(i);
>                 pdf2XHTML.setEndPage(i);
>                 pdf2XHTML.writeText(document, dummyWriter);
>             } catch (Exception e) {
>                 // TODO ...
>             }
>         }
>     } catch (IOException e) {
>         if (e.getCause() instanceof SAXException) {
>             throw (SAXException) e.getCause();
>         } else {
>             throw new TikaException("Unable to extract PDF content", e);
>         }
>     }
> }
> With this change, the method can parse a PDF in which some pages are broken.
> I know it's not an optimized design,
> but it is enough to solve my problem.
> For every Tika release from 1.4 through 1.9, I have had to recompile with this change.
> So I'd like to see the parser improved.
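
The page-by-page idea in the quoted description can be sketched without any PDFBox or Tika dependency. What follows is only a minimal illustration, not the Tika implementation: `PageExtractor`, `Result`, and `extractAll` are hypothetical names introduced here, and the "broken page" is simulated. It shows the pattern of isolating each page in its own try/catch so that a failure on one page does not discard the pages after it, while still recording which pages failed.

```java
import java.util.ArrayList;
import java.util.List;

public class PageByPageSketch {

    /** Hypothetical stand-in for a per-page text extractor; not a Tika or PDFBox type. */
    interface PageExtractor {
        String extract(int pageNumber) throws Exception;
    }

    /** Holds the text that was recovered and the 1-based numbers of pages that failed. */
    static final class Result {
        final StringBuilder text = new StringBuilder();
        final List<Integer> failedPages = new ArrayList<>();
    }

    /** Extracts pages 1..numberOfPages, continuing past any page that throws. */
    static Result extractAll(int numberOfPages, PageExtractor extractor) {
        Result result = new Result();
        for (int page = 1; page <= numberOfPages; page++) {
            try {
                result.text.append(extractor.extract(page)).append('\n');
            } catch (Exception e) {
                // Record the failure instead of silently dropping it.
                result.failedPages.add(page);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Simulate a 5-page document whose third page cannot be parsed.
        Result r = extractAll(5, page -> {
            if (page == 3) {
                throw new IllegalStateException("simulated broken page");
            }
            return "text of page " + page;
        });
        System.out.print(r.text);
        System.out.println("failed pages: " + r.failedPages);
    }
}
```

In a real parser it would probably be preferable to report the failed pages to the caller (for example through metadata) rather than swallow them, and to let exceptions that signal the consumer wants to stop, such as a SAXException from the content handler, propagate instead of being caught per page.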
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)