Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2015/04/07 06:19:09 UTC
[Tika Wiki] Update of "TesseractOCRStats" by ChrisMattmann
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.
The "TesseractOCRStats" page has been changed by ChrisMattmann:
https://wiki.apache.org/tika/TesseractOCRStats
New page:
Here are some stats contributed by Mark Kerzner and Amanda Towler from Hyperion Gray.
{{{
Total number of images to process: about 300,000
Average time per image: about 1 sec
Total run time required: about 10 days
Our run times on various batches: about 1 day total
OCR quality: decent
}}}
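As a rough cross-check of the figures above, the following sketch relates image count, per-image time, and total run time. The inputs are the rounded "about" values from the stats, and the function names are made up for illustration. The straight product of 300,000 images and 1 sec comes to roughly 3.5 days, while the reported 10 days would correspond to about 2.9 sec per image; the page does not say how the 10-day figure was derived, so it may include overhead that the per-image average leaves out.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def sequential_days(num_images: int, seconds_per_image: float) -> float:
    """Days needed to process every image one after another."""
    return num_images * seconds_per_image / SECONDS_PER_DAY


def implied_seconds_per_image(num_images: int, total_days: float) -> float:
    """Per-image time implied by a given total sequential run time."""
    return total_days * SECONDS_PER_DAY / num_images


if __name__ == "__main__":
    # Rounded values taken from the stats above.
    print(f"{sequential_days(300_000, 1.0):.2f} days at 1 sec/image")
    print(f"{implied_seconds_per_image(300_000, 10.0):.2f} sec/image for 10 days")
```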
= Future Work =
* Use Tika rather than calling Tesseract directly
* Scale it up with Spark or Hadoop
* Polish it a bit, with a view to other teams/projects using it later
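The first item in the list could look roughly like the sketch below. It assumes the third-party tika-python package (imported as `tika`), which starts or connects to a Tika server, and a Tesseract installation that Tika can find so that its OCR parser runs on image files. `tika_extract_text` and `ocr_batch` are hypothetical names chosen for this example; this is not the contributors' code. The extraction function is passed in as a parameter so that the batch loop can later be swapped onto another backend or distributed runner.

```python
from typing import Callable, Dict, Iterable


def tika_extract_text(path: str) -> str:
    """Extract text from one file through Tika.

    Assumes tika-python is installed and a Tika server is reachable.
    The import is done lazily so the rest of the module works without it.
    """
    from tika import parser

    parsed = parser.from_file(path)
    # 'content' can be None when Tika finds no text in the file.
    return (parsed.get("content") or "").strip()


def ocr_batch(
    paths: Iterable[str],
    extract: Callable[[str], str] = tika_extract_text,
) -> Dict[str, str]:
    """Run the extractor over each path and collect text keyed by path."""
    results: Dict[str, str] = {}
    for path in paths:
        results[path] = extract(path)
    return results
```

A call such as `ocr_batch(["scan1.png", "scan2.png"])` (placeholder file names) would return a dictionary mapping each path to its extracted text.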