Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/10/11 01:25:35 UTC

[jira] [Updated] (PDFBOX-1912) Optical Character Recognition (OCR)

     [ https://issues.apache.org/jira/browse/PDFBOX-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-1912:
--------------------------------
    Fix Version/s: 2.0.0

> Optical Character Recognition (OCR)
> -----------------------------------
>
>                 Key: PDFBOX-1912
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1912
>             Project: PDFBox
>          Issue Type: Wish
>          Components: Text extraction
>    Affects Versions: 2.0.0
>         Environment: JDK 6, C/C++
>            Reporter: John Hewson
>            Assignee: John Hewson
>              Labels: gsoc2014
>             Fix For: 2.0.0
>
>
> Brief explanation: The PDFBox library is widely used to extract text from PDF files. However, many PDF files embed text in a malformed manner, which makes the extracted text useless. There has recently been interest in extracting governmental data from PDF files; the PDF Liberation commons is a notable example (see https://github.com/pdfliberation for more details).
> Many end users of PDFBox have been using OCR tools such as Google's Tesseract (https://code.google.com/p/tesseract-ocr/), which are run on the final page image generated by PDFBox. We think that a more integrated OCR API in PDFBox could do a better job. PDFBox often has access to encoding and positioning information for individual glyphs, so even when the extracted text is meaningless, character-by-character or line-by-line OCR could be more accurate. PDFBox also has information such as image orientation, which could allow it to perform OCR better on content such as embedded landscape tables.
> There are existing JNI bindings for Tesseract available at https://code.google.com/p/tesseract-android-tools/
> Expected results: Extend PDFBox with an API that allows external OCR tools to be plugged in, and implement a Tesseract plug-in using either JNI or the command line via Runtime.exec.
> Knowledge Prerequisite: Java, JNI (C/C++)
> Mentor: John Hewson
> PMC Note: Tesseract is under the Apache License 2.0
> To learn more about PDFBox, please visit http://pdfbox.apache.org/
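
To make the "Expected results" item above more concrete, here is a minimal sketch of what a pluggable OCR interface with a command-line Tesseract implementation might look like. It is an illustration only, not the PDFBox API: the names OcrEngine, TesseractCliEngine, recognize and buildCommand are hypothetical. It assumes Java 7 or later (for java.nio.file and ProcessBuilder redirection), a tesseract executable on the PATH, and the classic command form "tesseract <image> <outputbase> -l <lang>", which writes the recognized text to <outputbase>.txt.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

/** Hypothetical plug-in point; not part of the PDFBox API. */
interface OcrEngine {
    /** Runs OCR on a rendered page image and returns the recognized text. */
    String recognize(Path image) throws IOException, InterruptedException;
}

/** Hypothetical plug-in that shells out to the tesseract executable. */
class TesseractCliEngine implements OcrEngine {
    private final String executable; // assumed to be on the PATH, e.g. "tesseract"
    private final String language;   // Tesseract language code, e.g. "eng"

    TesseractCliEngine(String executable, String language) {
        this.executable = executable;
        this.language = language;
    }

    /** Builds: tesseract <image> <outputBase> -l <language> */
    List<String> buildCommand(Path image, Path outputBase) {
        return Arrays.asList(executable, image.toString(),
                outputBase.toString(), "-l", language);
    }

    @Override
    public String recognize(Path image) throws IOException, InterruptedException {
        Path workDir = Files.createTempDirectory("ocr");
        Path outputBase = workDir.resolve("result");

        // Send the tool's console output to a log file so the child
        // process cannot block on a full pipe.
        Process process = new ProcessBuilder(buildCommand(image, outputBase))
                .redirectErrorStream(true)
                .redirectOutput(workDir.resolve("tesseract.log").toFile())
                .start();

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with status " + exitCode);
        }

        // Tesseract appends ".txt" to the output base name.
        byte[] text = Files.readAllBytes(workDir.resolve("result.txt"));
        return new String(text, StandardCharsets.UTF_8);
    }
}
```

A caller would render a page to an image file with PDFBox and then invoke something like new TesseractCliEngine("tesseract", "eng").recognize(imagePath). Keeping the interface this small is a design choice of the sketch: a JNI-based engine could implement the same method without callers changing.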



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)