Posted to dev@tika.apache.org by "Malik Hemani (JIRA)" <ji...@apache.org> on 2011/09/04 14:20:10 UTC

[jira] [Commented] (TIKA-100) Structured PDF parsing

    [ https://issues.apache.org/jira/browse/TIKA-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096863#comment-13096863 ] 

Malik Hemani commented on TIKA-100:
-----------------------------------

Since PDFTextStripper can extract text at the page level, here is one possible way to let Tika extract text from a single page or a range of pages (excuse the formatting lost in translation):

1. Add a new method to the Parser interface:
    void parse(
            InputStream stream, ContentHandler handler,
            Metadata metadata, ParseContext context, int startPage, int endPage)
            throws IOException, SAXException, TikaException;

2. Implement the method in the PDFParser class:
    public void parse(
            InputStream stream, ContentHandler handler,
            Metadata metadata, ParseContext context, int startPage, int endPage)
            throws IOException, SAXException, TikaException {
        PDDocument pdfDocument = PDDocument.load(stream, true);
        try {
            if (pdfDocument.isEncrypted()) {
                try {
                    String password = metadata.get(PASSWORD);
                    if (password == null) {
                        password = "";
                    }
                    pdfDocument.decrypt(password);
                } catch (Exception e) {
                    // Ignore
                }
            }
            metadata.set(Metadata.CONTENT_TYPE, "application/pdf");
            extractMetadata(pdfDocument, metadata);
            PDF2XHTML.process(pdfDocument, handler, metadata, startPage, endPage);
        } finally {
            pdfDocument.close();
        }
    }

3. Add a new method to the PDF2XHTML class:
    public static void process(
            PDDocument document, ContentHandler handler,
            Metadata metadata, int startPage, int endPage)
            throws SAXException, TikaException {
        try {
            // Extract text using a dummy Writer as we override the
            // key methods to output to the given content handler.
            PDF2XHTML pdf2XHTML = new PDF2XHTML(handler, metadata);

            // Set start and end page (1-based); values <= 0 leave the
            // stripper's defaults in place.
            if (startPage > 0) {
                pdf2XHTML.setStartPage(startPage);
            }

            if (endPage > 0) {
                pdf2XHTML.setEndPage(endPage);
            }

            pdf2XHTML.writeText(document, new Writer() {
                @Override
                public void write(char[] cbuf, int off, int len) {
                }
                @Override
                public void flush() {
                }
                @Override
                public void close() {
                }
            });
        } catch (IOException e) {
            if (e.getCause() instanceof SAXException) {
                throw (SAXException) e.getCause();
            } else {
                throw new TikaException("Unable to extract PDF content", e);
            }
        }
    }
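
The catch block in step 3 only works if a SAXException raised while writing to the content handler arrives as the cause of an IOException, presumably because the overridden PDFTextStripper methods may only throw IOException. Here is a minimal sketch of that wrap/unwrap pattern on a bare JDK. SaxCause, SaxAction, tunnel, rethrow and ExtractionException are made-up names for illustration (ExtractionException stands in for TikaException); none of them are Tika or PDFBox classes.

```java
import java.io.IOException;
import org.xml.sax.SAXException;

/** Hypothetical stand-in for TikaException so the sketch compiles on a bare JDK. */
class ExtractionException extends Exception {
    ExtractionException(String message, Throwable cause) {
        super(message, cause);
    }
}

/** Hypothetical helper illustrating the wrap/unwrap pattern; not Tika code. */
final class SaxCause {
    private SaxCause() {
    }

    /** A callback that may fail with a SAXException, e.g. a ContentHandler call. */
    interface SaxAction {
        void run() throws SAXException;
    }

    /** Inner side: tunnel a SAXException through a method limited to IOException. */
    static void tunnel(SaxAction action) throws IOException {
        try {
            action.run();
        } catch (SAXException e) {
            throw new IOException("SAX error while emitting content", e);
        }
    }

    /** Outer side: restore the SAXException, or report a genuine I/O failure. */
    static void rethrow(IOException e) throws SAXException, ExtractionException {
        if (e.getCause() instanceof SAXException) {
            throw (SAXException) e.getCause();
        }
        throw new ExtractionException("Unable to extract PDF content", e);
    }

    public static void main(String[] args) throws Exception {
        try {
            tunnel(() -> {
                throw new SAXException("handler failed");
            });
        } catch (IOException e) {
            try {
                rethrow(e);
            } catch (SAXException sax) {
                System.out.println("recovered: " + sax.getMessage());
            }
        }
    }
}
```

The point of the round trip is that the caller of process() sees the same SAXException the handler threw, while real I/O failures are reported as extraction errors.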

4. Example of a call to extract page 2 of a PDF:
    ...
    int startPage = 2;
    int endPage = 2;
    PDFParser parser = new PDFParser();
    parser.parse(input, textHandler, metadata, new ParseContext(), startPage, endPage);
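
To make the page-number convention explicit: PDFTextStripper counts pages from 1 and treats the end page as inclusive, and step 3 treats any value <= 0 as "not set". Below is a rough sketch of how a caller could resolve such arguments against a known page count before parsing. PageRange and resolve are made-up names, not part of Tika or PDFBox. Clamping the end page to the page count and rejecting an empty range are choices of this sketch, not something the proposal above does.

```java
/** Hypothetical helper, not part of Tika or PDFBox. */
final class PageRange {
    final int start; // first page to extract, 1-based
    final int end;   // last page to extract, 1-based and inclusive

    private PageRange(int start, int end) {
        this.start = start;
        this.end = end;
    }

    /**
     * Resolves the startPage/endPage arguments from the proposal against the
     * number of pages in the document. Values <= 0 mean "not set".
     */
    static PageRange resolve(int startPage, int endPage, int pageCount) {
        if (pageCount < 1) {
            throw new IllegalArgumentException("document has no pages");
        }
        // Unset start falls back to the first page.
        int start = startPage > 0 ? startPage : 1;
        // Unset end falls back to the last page; an oversized end is clamped.
        int end = endPage > 0 ? Math.min(endPage, pageCount) : pageCount;
        if (start > end) {
            throw new IllegalArgumentException(
                    "empty page range: " + start + " to " + end);
        }
        return new PageRange(start, end);
    }

    public static void main(String[] args) {
        PageRange single = resolve(2, 2, 10); // just page 2, as in step 4
        PageRange tail = resolve(5, 0, 10);   // page 5 through the last page
        System.out.println(single.start + "-" + single.end);
        System.out.println(tail.start + "-" + tail.end);
    }
}
```

With a check like this in front of the call in step 4, a reversed range such as (8, 4) is rejected up front instead of silently producing no text.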



> Structured PDF parsing
> ----------------------
>
>                 Key: TIKA-100
>                 URL: https://issues.apache.org/jira/browse/TIKA-100
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Jukka Zitting
>            Priority: Minor
>
> The PDF parser currently extracts and outputs document content as a single string. PDFBox could be used to support structuring at least down to page and paragraph (not sure how accurate) level.
