Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2015/07/15 12:49:04 UTC
[jira] [Commented] (TIKA-1679) Parse PDF file page by page

    [ https://issues.apache.org/jira/browse/TIKA-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627891#comment-14627891 ] 

Tim Allison commented on TIKA-1679:
-----------------------------------

To confirm, is the problem that an exception is thrown at page 3 and you are losing content from pages 4 and 5?

> Parse PDF file page by page
> ---------------------------
>
>                 Key: TIKA-1679
>                 URL: https://issues.apache.org/jira/browse/TIKA-1679
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.9
>            Reporter: Raymond Wu
>
> I have a PDF file that contains 5 pages.
> Page 3 cannot be parsed by PDFBox, but the rest of the pages are okay.
> So I tried to parse this file page by page.
> I modified the method PDF2XHTML.process() in PDF2XHTML.java as follows:
> public static void process(
>             PDDocument document, ContentHandler handler, Metadata metadata,
>             boolean extractAnnotationText, boolean enableAutoSpace,
>             boolean suppressDuplicateOverlappingText, boolean sortByPosition)
>             throws SAXException, TikaException {
>         try {
>             // Extract text using a dummy Writer as we override the
>             // key methods to output to the given content
>             // handler.
>             Writer dummyWriter = new Writer() {
>                 @Override
>                 public void write(char[] cbuf, int off, int len) {
>                 }
>                 @Override
>                 public void flush() {
>                 }
>                 @Override
>                 public void close() {
>                 }
>             };
>             // Parse page by page
>             int nop = document.getNumberOfPages();
>             for (int i = 1; i <= nop; i++) {
>                 PDF2XHTML pdf2XHTML = new PDF2XHTML(handler, metadata,
>                         extractAnnotationText, enableAutoSpace,
>                         suppressDuplicateOverlappingText, sortByPosition);
>                 try {
>                     pdf2XHTML.setStartPage(i);
>                     pdf2XHTML.setEndPage(i);
>                     pdf2XHTML.writeText(document, dummyWriter);
>                 } catch (Exception e) {
>                     // TODO ...
>                 }
>             }
>         } catch (IOException e) {
>             if (e.getCause() instanceof SAXException) {
>                 throw (SAXException) e.getCause();
>             } else {
>                 throw new TikaException("Unable to extract PDF content", e);
>             }
>         }
>     }
> With this change, the method can parse a PDF in which some pages are broken.
> I know it's not an optimized design,
> but it is enough to solve my problem.
> For every Tika release from 1.4 through 1.9, I have had to recompile with this change.
> So I'd like to improve this parser.
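
The per-page isolation described in the quoted report can be sketched without any Tika or PDFBox dependency. This is only a minimal illustration under stated assumptions, not the code in PDF2XHTML: PageExtractor, Result, and extractAll are hypothetical names, and the simulated IOException on page 3 stands in for whatever the real parser throws on a broken page.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PageByPageSketch {

    /** Hypothetical stand-in for a per-page extraction call; not a Tika or PDFBox type. */
    public interface PageExtractor {
        String extract(int pageNumber) throws IOException;
    }

    /** Text of the pages that parsed, plus the numbers of the pages that failed. */
    public static final class Result {
        public final List<String> pages = new ArrayList<>();
        public final List<Integer> failedPages = new ArrayList<>();
    }

    /**
     * Extracts pages 1..pageCount one at a time. A failure on one page is
     * recorded and the loop moves on, so later pages are not lost.
     */
    public static Result extractAll(int pageCount, PageExtractor extractor) {
        Result result = new Result();
        for (int page = 1; page <= pageCount; page++) {
            try {
                result.pages.add(extractor.extract(page));
            } catch (IOException e) {
                // Record the failure instead of silently dropping it.
                result.failedPages.add(page);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Simulated document: 5 pages, page 3 cannot be parsed.
        PageExtractor flaky = page -> {
            if (page == 3) {
                throw new IOException("cannot parse page " + page);
            }
            return "text of page " + page;
        };
        Result r = extractAll(5, flaky);
        System.out.println("extracted: " + r.pages);
        System.out.println("failed: " + r.failedPages);
    }
}
```

Unlike the empty catch block in the quoted patch, this sketch records which pages failed, so a caller could report them rather than drop them silently. Whether a real fix should swallow, log, or rethrow per-page errors is a design choice this sketch does not settle.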



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)