Posted to dev@tika.apache.org by "Hudson (JIRA)" <ji...@apache.org> on 2016/10/01 00:21:20 UTC

[jira] [Commented] (TIKA-2106) "hocr" case on Linux fails, but works on OSX. Related to TIKA-2093

    [ https://issues.apache.org/jira/browse/TIKA-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15537507#comment-15537507 ] 

Hudson commented on TIKA-2106:
------------------------------

FAILURE: Integrated in Jenkins build tika-2.x-windows #59 (See [https://builds.apache.org/job/tika-2.x-windows/59/])
TIKA-2106 -- need to lower case hocr/txt suffix, thanks to Eric Pugh. (tallison: rev 1ab6c81cef1497e81d030d99195df1e479e0644d)
* (edit) tika-parser-modules/tika-parser-multimedia-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java


> "hocr" case on Linux fails, but works on OSX.  Related to TIKA-2093
> -------------------------------------------------------------------
>
>                 Key: TIKA-2106
>                 URL: https://issues.apache.org/jira/browse/TIKA-2106
>             Project: Tika
>          Issue Type: Bug
>          Components: ocr
>         Environment: Bug in Linux, but fine in OSX.
>            Reporter: Eric Pugh
>            Assignee: Tim Allison
>
> We pass an output type, either TXT or HOCR, to the Tesseract command line. When we call the command line, we lowercase it to "txt" or "hocr". However, when we read the output back in, we don't lowercase it. On OSX the constructed file path "output.HOCR" is still found, but on Linux it isn't. This patch lowercases HOCR to hocr and TXT to txt in the constructed file path.
> I didn't write a unit test because I don't have a good Linux environment to test in, but I was able to put a patched version of the Tika Parser Jar into my Docker build to confirm that it works.
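
The fix described in the report can be sketched roughly as follows. This is a minimal illustration and not the actual TesseractOCRParser code: the class OcrOutputPath, the enum OutputType, and the helper outputFile are hypothetical names chosen for the example. It assumes the output type is held as an upper-case enum constant and that Tesseract writes its result to "<base>.<lower-case suffix>". The likely reason the bug was hidden on OSX is that its default filesystem is usually case-insensitive, whereas Linux filesystems are usually case-sensitive.

```java
import java.io.File;
import java.util.Locale;

public class OcrOutputPath {

    // Hypothetical stand-in for the parser's output type setting.
    enum OutputType { TXT, HOCR }

    // Builds the path of the file Tesseract is expected to have written.
    // The suffix is lower-cased so that it matches the suffix passed on
    // the command line, which matters on case-sensitive filesystems.
    static File outputFile(File base, OutputType type) {
        // Locale.ROOT avoids locale-specific case mappings.
        String suffix = type.name().toLowerCase(Locale.ROOT);
        return new File(base.getPath() + "." + suffix);
    }

    public static void main(String[] args) {
        File f = outputFile(new File("output"), OutputType.HOCR);
        System.out.println(f.getName());
    }
}
```

With this shape, the same lower-cased suffix is used both when invoking Tesseract and when reading its output back, so the two can no longer disagree on case.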



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)