
Posted to dev@tika.apache.org by "Loris Bachert (JIRA)" <ji...@apache.org> on 2015/09/03 11:50:45 UTC

[jira] [Created] (TIKA-1729) OCR in PDF files

Loris Bachert created TIKA-1729:
-----------------------------------

             Summary: OCR in PDF files
                 Key: TIKA-1729
                 URL: https://issues.apache.org/jira/browse/TIKA-1729
             Project: Tika
          Issue Type: Bug
          Components: config, parser
    Affects Versions: 1.10, 1.9
         Environment: Windows 7, 64-bit, JDK 1.8.0_51 64 bit
            Reporter: Loris Bachert


As described in this [Stack Overflow post|http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files], I'm having trouble extracting text from scanned PDF files. By scanned PDF files I mean PDF files that consist only of images. Because each page is an image, I can't extract them using a custom ParsingEmbeddedDocumentExtractor. I also tried the setExtractInlineImages method of PDFParserConfig, but that didn't work either.
There was already a [ticket|https://issues.apache.org/jira/browse/TIKA-93] regarding OCR support, which includes the [PDF file|https://issues.apache.org/jira/secure/attachment/12627866/testOCR.pdf] I'm using for my tests.
Here is a JUnit test that demonstrates the issue:
{code:title=PDFOCRTest.java|borderStyle=solid}
@Test
public void testPDFOCRExtraction() throws IOException, SAXException, TikaException {
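	// filePath refers to the scanned PDF under test (the testOCR.pdf attachment linked above)
	File file = new File(filePath);
	
	// Very large write limit so the extracted text is not truncated
	BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
	Metadata metadata = new Metadata();
	
	// Ask the PDF parser to extract inline images
	PDFParserConfig config = new PDFParserConfig();
	config.setExtractInlineImages(true);
	ParseContext context = new ParseContext();
	context.set(PDFParserConfig.class, config);
	
	PDFParser pdfParser = new PDFParser();
	pdfParser.setPDFParserConfig(config);
	// try-with-resources closes the stream even if parsing throws
	try (InputStream stream = new FileInputStream(file)) {
		pdfParser.parse(stream, handler, metadata, context);
	}
	
	// The extracted text should not be empty
	String text = handler.toString();
	assertFalse(text.isEmpty());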
}
{code}


