Posted to dev@tika.apache.org by "Sandeepan (JIRA)" <ji...@apache.org> on 2017/02/08 13:56:41 UTC
[jira] [Created] (TIKA-2261) TikaOcr giving different result across platforms
Sandeepan created TIKA-2261:
-------------------------------
Summary: TikaOcr giving different result across platforms
Key: TIKA-2261
URL: https://issues.apache.org/jira/browse/TIKA-2261
Project: Tika
Issue Type: Bug
Affects Versions: 1.14
Reporter: Sandeepan
Hi,
I am using Tika to parse every type of file, and it works great for non-image files.
My local machine is a Mac, but I deploy on Ubuntu 14.04. On the command line, I get the same result on both platforms.
Example Command
tesseract 3.jpg ouput -l eng -psm 1 txt
But when I run OCR through Java code, the two platforms give very different results, and the quality is worse on Ubuntu.
Sample Code
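import java.io.FileInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
Metadata metadata = new Metadata();
try (FileInputStream in = new FileInputStream(path)) { // close the stream after parsing
    parser.parse(in, handler, metadata);
}
String parsedText = handler.toString();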
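One way to narrow this down might be to run the exact command-line invocation from inside the same JVM, bypassing Tika, and compare the output with what the shell produces. The following is only a sketch: `TesseractCommand`, `build`, and `run` are hypothetical names, not part of Tika, and it assumes `tesseract` is on the PATH seen by the Java process (which may differ from the shell's PATH).

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: reproduces the CLI example from Java.
final class TesseractCommand {

    // Builds: tesseract <image> <outputBase> -l <lang> -psm <psm> txt
    static List<String> build(String image, String outputBase, String lang, int psm) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(image);
        cmd.add(outputBase);
        cmd.add("-l");
        cmd.add(lang);
        cmd.add("-psm");
        cmd.add(Integer.toString(psm));
        cmd.add("txt");
        return cmd;
    }

    // Runs the command, forwarding tesseract's stdout/stderr to this process,
    // and returns the exit code.
    static int run(List<String> cmd) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd).inheritIO().start();
        return process.waitFor();
    }
}
```

Calling `TesseractCommand.run(TesseractCommand.build("3.jpg", "ouput", "eng", 1))` should write `ouput.txt`, as the command-line example does. If that file matches the shell result on Ubuntu while the Tika output does not, the difference is more likely in how Tika invokes tesseract than in the tesseract installation itself.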
On Mac:
$ tesseract -v
tesseract 3.04.01
leptonica-1.74.1
libjpeg 8d : libpng 1.6.28 : libtiff 4.0.7 : zlib 1.2.8
On Ubuntu:
ubuntu@ubuntu-4gb-postprocess:~$ tesseract -v
tesseract 3.04.01
leptonica-1.74.1
libjpeg 8d : libpng 1.6.28 : libtiff 4.0.7 : zlib 1.2.8
I am not able to figure out what the issue is.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)