Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2021/08/06 21:59:00 UTC

[jira] [Commented] (TIKA-3517) Text extraction doesn't work for Pages and Numbers when Tesseract is disabled

    [ https://issues.apache.org/jira/browse/TIKA-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394998#comment-17394998 ] 

Tim Allison commented on TIKA-3517:
-----------------------------------

Sorry about this.  The answer will bring you no joy.

We don't handle the more modern versions of iWork files.  In Tika 1.x, the Numbers file is detected as "application/vnd.apple.unknown.13", and no text is returned.  There's clearly a bug in tika-app and tika-server that is preventing correct file type detection of iWork files.  I'll figure that out and fix it on this issue.  Once that is fixed, though, you'll get correct file type detection but still no text, because we don't support these versions of iWork. :(

You're getting text with OCR only because of the thumbnail images in the file; Tesseract is reading those, not the document content itself.
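
To make the thumbnail point concrete, here is a rough Python sketch (not Tika code) that counts the entries of a zip container by extension.  It assumes the newer Pages/Numbers files are zip archives holding .iwa parts plus .jpg previews, which matches the file listing in the report.  The entry names in build_sample() are made up for illustration and are not taken from the attached files.

```python
import io
import zipfile
from collections import Counter


def summarize_entries(zip_bytes: bytes) -> Counter:
    """Count the entries of a zip archive by lowercase file extension."""
    counts = Counter()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            base = name.rsplit("/", 1)[-1]
            ext = base.rsplit(".", 1)[-1].lower() if "." in base else ""
            counts[ext] += 1
    return counts


def build_sample() -> bytes:
    """Build an in-memory stand-in for an iWork-style container.

    The entry names are hypothetical; a real file will differ.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("Index/Document.iwa", b"\x00")
        archive.writestr("Index/Tables/DataList.iwa", b"\x00")
        archive.writestr("preview.jpg", b"\xff\xd8\xff")
        archive.writestr("preview-micro.jpg", b"\xff\xd8\xff")
    return buf.getvalue()


if __name__ == "__main__":
    for ext, count in sorted(summarize_entries(build_sample()).items()):
        print(f".{ext}: {count}")
```

On a real file you would pass the file's bytes to summarize_entries() instead of the sample.  If the only image entries are previews, any OCR text is coming from those previews, and the .iwa parts are what we currently can't parse.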

> Text extraction doesn't work for Pages and Numbers when Tesseract is disabled
> -----------------------------------------------------------------------------
>
>                 Key: TIKA-3517
>                 URL: https://issues.apache.org/jira/browse/TIKA-3517
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>         Environment: I tested this on RHEL7.  I got the same results whether I was using Tesseract 3 or Tesseract 4, but that doesn't really matter because the problems I'm having are when Tesseract is disabled.
>            Reporter: Chris Bryant
>            Priority: Major
>         Attachments: SSN.numbers, SSN.pages, no_ocr.xml
>
>
> When I run Tika to extract text from Mac Pages and Numbers files, the text extraction does not work if Tesseract is disabled.  I'm attaching sample files, including the config file I use to disable Tesseract.  I get the same results whether I run the server version (tika-server-standard-2.0.0.jar) or the command line app (tika-app-2.0.0.jar).
> The following commands extract text along with what appears to be a list of the .iwa and .jpg files inside the Pages and Numbers files:
> java -jar ~/tika-app-2.0.0.jar -t ~/SSN.pages
> java -jar ~/tika-app-2.0.0.jar -t ~/SSN.numbers
> However, when I run the following commands using the configuration file to disable Tesseract, all that is extracted is the list of .iwa and .jpg files and none of the actual text is extracted:
> java -jar ~/tika-app-2.0.0.jar --config=no_ocr.xml -t ~/SSN.pages
> java -jar ~/tika-app-2.0.0.jar --config=no_ocr.xml -t ~/SSN.numbers
>  
> I haven't seen similar problems with the other file types I've tested, including .docx, .pptx, .xlsx, .odt, .ods, .odp, and .pdf.  Those work fine whether or not Tesseract is disabled.
>  
> On a somewhat separate issue, I have been unable to get any text extracted from my test Keynote file at all, whether Tesseract is enabled or not.  I'm having difficulty uploading that file, so I'll see if I can add that later.
>  
>  


