Posted to dev@tika.apache.org by "Sandeepan (JIRA)" <ji...@apache.org> on 2017/02/09 04:18:41 UTC
[jira] [Updated] (TIKA-2261) TikaOcr giving different result across platforms
[ https://issues.apache.org/jira/browse/TIKA-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sandeepan updated TIKA-2261:
----------------------------
Attachment: 4.png
Output for this file on Mac vs. Ubuntu.
Only the first two lines are shown.
====
Mac
====
[-~-]
With the sorrow of living so great, the sorrow of punishment had to be piti-
less. We lived for the day and died for it.
===
Ubuntu
===
WiLh Lhe somw of living so greal. Lhe sorrow of punishmem had to he pikir
less. We lived fox Lhe day and died fox ir
> TikaOcr giving different result across platforms
> ------------------------------------------------
>
> Key: TIKA-2261
> URL: https://issues.apache.org/jira/browse/TIKA-2261
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.14
> Reporter: Sandeepan
> Attachments: 4.png
>
>
> Hi,
> I am using Tika to parse every type of file, and it works great for non-image files.
> My local machine is a Mac, but I deploy on Ubuntu 14.04. On the command line, I get the same result on both platforms.
> Example command:
> tesseract 3.jpg ouput -l eng -psm 1 txt
> But when I run it through Java code, I get very different results, and the quality is worse on Ubuntu.
> Sample code:
> AutoDetectParser parser = new AutoDetectParser();
> BodyContentHandler handler = new BodyContentHandler(-1);
> Metadata metadata = new Metadata();
> FileInputStream in = new FileInputStream(path);
> parser.parse(in, handler, metadata);
> parsedText = handler.toString();
> On Mac:
> $ tesseract -v
> tesseract 3.04.01
> leptonica-1.74.1
> libjpeg 8d : libpng 1.6.28 : libtiff 4.0.7 : zlib 1.2.8
> On Ubuntu:
> ubuntu@ubuntu-4gb-postprocess:~$ tesseract -v
> tesseract 3.04.01
> leptonica-1.74.1
> libjpeg 8d : libpng 1.6.28 : libtiff 4.0.7 : zlib 1.2.8
> I am not able to figure out what the issue is.
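
One way to narrow this down might be to run the same tesseract command from inside the JVM on both machines and compare the result with the shell output. The sketch below is only a diagnostic under assumptions: the class name OcrEnvCheck, the method buildCommand, and the output base name jvm_output are hypothetical and not part of Tika, and it assumes the tesseract binary is reachable through the PATH that the Java process sees. It uses only the JDK and does not go through Tika at all.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical diagnostic helper (not part of Tika): runs the tesseract CLI
// from inside the JVM with the same arguments as the shell command in the report.
class OcrEnvCheck {

    // Builds: tesseract <image> <outputBase> -l eng -psm 1 txt
    static List<String> buildCommand(String image, String outputBase) {
        return new ArrayList<>(Arrays.asList(
                "tesseract", image, outputBase, "-l", "eng", "-psm", "1", "txt"));
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Image path is taken from the first argument; "3.jpg" is the name used in the report.
        String image = args.length > 0 ? args[0] : "3.jpg";

        // Print the environment as the JVM sees it; it can differ from the login shell.
        System.out.println("PATH=" + System.getenv("PATH"));
        System.out.println("TESSDATA_PREFIX=" + System.getenv("TESSDATA_PREFIX"));

        // Run tesseract and forward its stdout/stderr to this process.
        ProcessBuilder pb = new ProcessBuilder(buildCommand(image, "jvm_output"));
        pb.inheritIO();
        int exit = pb.start().waitFor();
        System.out.println("tesseract exit code: " + exit);
    }
}
```

If jvm_output.txt matches the shell output on both machines, the binary and language data are probably the same in both environments, and the difference is more likely in how the image is handed to tesseract from the Java code path.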
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)