Posted to dev@tika.apache.org by "Matthew Caruana Galizia (JIRA)" <ji...@apache.org> on 2017/01/11 12:28:58 UTC

[jira] [Commented] (TIKA-2235) Use Tesseract's recommended DPI for PDF images

    [ https://issues.apache.org/jira/browse/TIKA-2235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15818176#comment-15818176 ] 

Matthew Caruana Galizia commented on TIKA-2235:
-----------------------------------------------

Yes, I am already! Thanks for linking me to that. It's good that the pull request adds metadata support for JBIG2, but would it not be better to wait for the PDFBox 2.0.5 release (which I assume is coming soon) instead of adding TODOs?

> Use Tesseract's recommended DPI for PDF images
> ----------------------------------------------
>
>                 Key: TIKA-2235
>                 URL: https://issues.apache.org/jira/browse/TIKA-2235
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.14
>            Reporter: Matthew Caruana Galizia
>            Priority: Minor
>              Labels: ocr, pdf
>             Fix For: 2.0, 1.15
>
>
> From the [Tesseract wiki|https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality]:
> {quote}
> Tesseract works best on images which have a DPI of at least 300 dpi....
> {quote}
> PDFParserConfig is currently initialised with a value of 200 for ocrDPI.


