Posted to dev@tika.apache.org by "Chris A. Mattmann (JIRA)" <ji...@apache.org> on 2016/07/07 06:40:11 UTC

[jira] [Updated] (TIKA-2021) Improving accuracy of Tesseract parser for Serial Number and Part Number (Numeric) Extraction

     [ https://issues.apache.org/jira/browse/TIKA-2021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-2021:
------------------------------------
    Summary: Improving accuracy of Tesseract parser for Serial Number and Part Number (Numeric) Extraction  (was: Improving accuracy of Tesseract parser)

> Improving accuracy of Tesseract parser for Serial Number and Part Number (Numeric) Extraction
> ---------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2021
>                 URL: https://issues.apache.org/jira/browse/TIKA-2021
>             Project: Tika
>          Issue Type: Improvement
>          Components: ocr, parser
>            Reporter: Zarana Parekh
>            Assignee: Chris A. Mattmann
>              Labels: memex
>             Fix For: 1.14
>
>
> The Tesseract OCR parser works well on images containing English text. However, there is room for improvement on alphanumeric and numeric content, which requires training Tesseract on the relevant cases in order to extract content from images more accurately. Such customization can help with extracting serial numbers from images of counterfeit electronics, and with other applications that focus on atypical textual content.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)