Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2015/04/07 06:19:09 UTC

[Tika Wiki] Update of "TesseractOCRStats" by ChrisMattmann

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.

The "TesseractOCRStats" page has been changed by ChrisMattmann:
https://wiki.apache.org/tika/TesseractOCRStats

New page:
Here are some stats contributed by Mark Kerzner and Amanda Towler from Hyperion Gray.

{{{
Total number of images to process: about 300,000
Average time per image: about 1 sec
Total run time required: about 10 days
Our run times on various batches: about 1 day total
OCR quality: decent
}}}

= Future Work =

 * Use Tika rather than calling Tesseract directly
 * Scale it up with Spark or Hadoop
 * Polish it so that other teams/projects can use it later
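
As a rough illustration of the batch approach and the scaling item above, here is a minimal Python sketch. It is not the code used for the run reported on this page. It assumes the `tesseract` binary is on the PATH, and it uses a local thread pool as a stand-in for Spark or Hadoop. The function names (`tesseract_cli`, `split_into_batches`, `ocr_all`) are hypothetical and chosen only for this example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def tesseract_cli(image_path):
    """OCR one image by shelling out to the `tesseract` binary.

    Assumption: `tesseract` is installed and on PATH. Passing `stdout`
    as the output base makes it print the recognized text instead of
    writing a .txt file.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def split_into_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def ocr_all(image_paths, ocr_fn=tesseract_cli, workers=4):
    """Apply ocr_fn to every path using a thread pool; return {path: text}.

    Threads are enough here because the heavy work happens in the
    external tesseract process, not in the Python interpreter.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = list(pool.map(ocr_fn, image_paths))
    return dict(zip(image_paths, texts))
```

A caller could split the full list of image paths with `split_into_batches` and pass each batch to `ocr_all`, so that a failed batch can be rerun on its own. The `ocr_fn` parameter is there so the Tesseract call can later be swapped for a Tika-based one without changing the batching logic.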