Posted to dev@tika.apache.org by "Malik Hemani (JIRA)" <ji...@apache.org> on 2011/09/04 14:20:10 UTC
[jira] [Commented] (TIKA-100) Structured PDF parsing
[ https://issues.apache.org/jira/browse/TIKA-100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096863#comment-13096863 ]
Malik Hemani commented on TIKA-100:
-----------------------------------
Since PDFTextStripper can extract text at the page level, here is one possible solution that would let Tika extract text for a single page or a range of pages (excuse the formatting lost in translation):
1. Add a new method to the Parser interface:
    void parse(
            InputStream stream, ContentHandler handler,
            Metadata metadata, ParseContext context, int startPage, int endPage)
            throws IOException, SAXException, TikaException;
2. Implement the method in the PDFParser class:
    public void parse(
            InputStream stream, ContentHandler handler,
            Metadata metadata, ParseContext context, int startPage, int endPage)
            throws IOException, SAXException, TikaException {
        PDDocument pdfDocument = PDDocument.load(stream, true);
        try {
            if (pdfDocument.isEncrypted()) {
                try {
                    String password = metadata.get(PASSWORD);
                    if (password == null) {
                        password = "";
                    }
                    pdfDocument.decrypt(password);
                } catch (Exception e) {
                    // Ignore
                }
            }
            metadata.set(Metadata.CONTENT_TYPE, "application/pdf");
            extractMetadata(pdfDocument, metadata);
            PDF2XHTML.process(pdfDocument, handler, metadata, startPage, endPage);
        } finally {
            pdfDocument.close();
        }
    }
3. Add a new method to the PDF2XHTML class:
    public static void process(
            PDDocument document, ContentHandler handler, Metadata metadata, int startPage, int endPage)
            throws SAXException, TikaException {
        try {
            // Extract text using a dummy Writer as we override the
            // key methods to output to the given content handler.
            PDF2XHTML pdf2XHTML = new PDF2XHTML(handler, metadata);
            // Set start and end page
            if (startPage > 0) {
                pdf2XHTML.setStartPage(startPage);
            }
            if (endPage > 0) {
                pdf2XHTML.setEndPage(endPage);
            }
            pdf2XHTML.writeText(document, new Writer() {
                @Override
                public void write(char[] cbuf, int off, int len) {
                }
                @Override
                public void flush() {
                }
                @Override
                public void close() {
                }
            });
        } catch (IOException e) {
            if (e.getCause() instanceof SAXException) {
                throw (SAXException) e.getCause();
            } else {
                throw new TikaException("Unable to extract PDF content", e);
            }
        }
    }
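A side note on the dummy Writer in step 3: on Java 11 or later the JDK ships an equivalent discarding Writer, so the anonymous subclass may not be needed. The sketch below is only an illustration with the standard library; the class name DiscardingWriterSketch and the method discardingWriter are made up for this example and are not part of Tika or PDFBox.

```java
import java.io.IOException;
import java.io.Writer;

// Hypothetical helper, not Tika or PDFBox code.
final class DiscardingWriterSketch {

    /** Returns a Writer that silently drops everything written to it. */
    static Writer discardingWriter() {
        // Writer.nullWriter() has been part of the JDK since Java 11.
        return Writer.nullWriter();
    }

    public static void main(String[] args) throws IOException {
        try (Writer sink = discardingWriter()) {
            // Stand-in for the text a stripper would normally write out.
            sink.write("text that would otherwise be written");
            sink.flush();
        }
    }
}
```

One difference to keep in mind: the Writer returned by Writer.nullWriter() throws IOException if it is written to after close(), whereas the anonymous Writer in step 3 never throws.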
4. Example of a call to extract page 2 of a PDF:
...
int startPage = 2;
int endPage = 2;
PDFParser parser = new PDFParser();
parser.parse(input, textHandler, metadata, new ParseContext(), startPage, endPage);
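The proposal treats a non-positive startPage or endPage as "not set" and leaves the stripper's defaults in place. A minimal, self-contained sketch of that convention follows. PageRangeSketch and resolve are hypothetical names for illustration only, and the fallback values 1 and Integer.MAX_VALUE are my assumption about what PDFTextStripper uses by default. The check that the end page is not before the start page is an extra that the steps above do not perform.

```java
// Hypothetical helper illustrating the page-range convention; not Tika code.
final class PageRangeSketch {
    final int start;
    final int end;

    private PageRangeSketch(int start, int end) {
        this.start = start;
        this.end = end;
    }

    /**
     * Resolves a 1-based, inclusive page range.
     * A value of 0 or less means "not set" and falls back to a default.
     */
    static PageRangeSketch resolve(int startPage, int endPage) {
        // Assumed defaults: first page, and "no upper limit".
        int start = startPage > 0 ? startPage : 1;
        int end = endPage > 0 ? endPage : Integer.MAX_VALUE;
        // Extra validation that the proposal above does not do.
        if (end < start) {
            throw new IllegalArgumentException(
                    "endPage " + end + " is before startPage " + start);
        }
        return new PageRangeSketch(start, end);
    }

    public static void main(String[] args) {
        PageRangeSketch single = resolve(2, 2);   // page 2 only, as in step 4
        PageRangeSketch whole = resolve(0, 0);    // nothing set: whole document
        System.out.println(single.start + ".." + single.end);
        System.out.println(whole.start + ".." + whole.end);
    }
}
```

Resolving the range up front would let a caller reject an inverted range early instead of passing it through to the stripper.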
> Structured PDF parsing
> ----------------------
>
> Key: TIKA-100
> URL: https://issues.apache.org/jira/browse/TIKA-100
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Jukka Zitting
> Priority: Minor
>
> The PDF parser currently extracts and outputs document content as a single string. PDFBox could be used to support structuring at least down to page and paragraph (not sure how accurate) level.