Posted to dev@tika.apache.org by "Chris A. Mattmann (JIRA)" <ji...@apache.org> on 2016/07/07 06:40:11 UTC

[jira] [Resolved] (TIKA-2021) Improving accuracy of Tesseract parser

     [ https://issues.apache.org/jira/browse/TIKA-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved TIKA-2021.
-------------------------------------
    Resolution: Fixed

Great work [~Zarana Parekh] and thanks for the great review [~lewismc]!

{noformat}
LMC-053601:tika1.13 mattmann$ git commit -m "Fix to work if ImageMagick isn't present. Fix forbidden APIs."
[master 6f16480] Fix to work if ImageMagick isn't present. Fix forbidden APIs.
 2 files changed, 3 insertions(+), 3 deletions(-)
LMC-053601:tika1.13 mattmann$ git push -u origin master
Counting objects: 267, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (128/128), done.
Writing objects: 100% (267/267), 29.68 KiB | 0 bytes/s, done.
Total 267 (delta 93), reused 207 (delta 62)
remote: tika git commit: Fix to work if ImageMagick isn't present. Fix forbidden APIs.
remote: tika git commit: Merge branch 'TIKA-2021' of https://github.com/Zarana-Parekh/tika
remote: tika git commit: fix orthogonal changes
remote: tika git commit: formatting changes
remote: tika git commit: added check for non-UNIX OS
remote: tika git commit: formatting changes
remote: tika git commit: rebasing pom.xml for tika-bundle
remote: tika git commit: formatting chanages
remote: tika git commit: updated config file
remote: tika git commit: updated scope in pom.xml
remote: tika git commit: updated Javadoc for Tesseract config and parser
remote: tika git commit: updated property name, removed orthogonal changes
remote: tika git commit: added validation tests for new processing features
remote: tika git commit: optional processing enabled
remote: tika git commit: fix for TIKA-2021 contributed by Zarana Parekh
remote: tika git commit: fix for TIKA-2021 contributed by Zarana Parekh
To https://git-wip-us.apache.org/repos/asf/tika.git
   95b2cd1..6f16480  master -> master
Branch master set up to track remote branch master from origin.
LMC-053601:tika1.13 mattmann$ 
{noformat}


> Improving accuracy of Tesseract parser
> --------------------------------------
>
>                 Key: TIKA-2021
>                 URL: https://issues.apache.org/jira/browse/TIKA-2021
>             Project: Tika
>          Issue Type: Improvement
>          Components: ocr, parser
>            Reporter: Zarana Parekh
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.14
>
>
> The Tesseract OCR parser works well with images containing English text. However, there is room for improvement on alphanumeric and numeric content, which requires training Tesseract on the relevant cases in order to better extract content from images. Such customization can help in extracting serial numbers from images of counterfeit electronics and in other applications focussing on atypical textual content.


