Posted to dev@tika.apache.org by "GIOELE PERIN (Jira)" <ji...@apache.org> on 2021/11/23 09:13:00 UTC

[jira] [Created] (TIKA-3599) Command line tika extracts encoding of file in eml

GIOELE PERIN created TIKA-3599:
----------------------------------

             Summary: Command line tika extracts encoding of file in eml
                 Key: TIKA-3599
                 URL: https://issues.apache.org/jira/browse/TIKA-3599
             Project: Tika
          Issue Type: Bug
          Components: app
    Affects Versions: 2.1.0
         Environment: Windows 10 pro version 10.0.19043 Build 19043

Java:

openjdk version "1.8.0-262"
OpenJDK Runtime Environment (build 1.8.0-262-b10)
OpenJDK 64-Bit Server VM (build 25.71-b10, mixed mode)

OCR:

Tesseract 5
            Reporter: GIOELE PERIN
         Attachments: eml_test.eml, output.txt

Tika cannot extract the text from the attached .eml file. Instead, it returns what I think is the encoded content of the attachments.

This does not happen with all .eml files, but we have not been able to identify the cause of this behavior. The same message saved in .msg format is extracted correctly.

The extracted .txt file has the same size as the original .eml file.
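
A matching size suggests the input may be passing through largely unparsed. As a rough way to check that, the sketch below compares the two files using only the Java standard library (it runs on Java 8). The class name ExtractionCheck and its method names are made up for illustration and are not part of Tika. Because the output was produced with a shell redirect, line endings or character encoding may differ, so a byte-for-byte match is not guaranteed even if the content is otherwise unchanged.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: compares an input file with the extracted output.
class ExtractionCheck {

    /** Returns true when both files have the same length in bytes. */
    static boolean sameSize(Path a, Path b) throws IOException {
        return Files.size(a) == Files.size(b);
    }

    /** Returns true when both files contain exactly the same bytes. */
    static boolean sameBytes(Path a, Path b) throws IOException {
        // Reads both files fully into memory; fine for a single email, not for huge files.
        return Arrays.equals(Files.readAllBytes(a), Files.readAllBytes(b));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java ExtractionCheck <input.eml> <output.txt>");
            return;
        }
        Path input = Paths.get(args[0]);
        Path output = Paths.get(args[1]);
        System.out.println("same size:  " + sameSize(input, output));
        System.out.println("same bytes: " + sameBytes(input, output));
    }
}
```

If both checks report true, the output is a plain copy of the input, which would point toward the file not being parsed as an email at all; that reading is an assumption to verify, not a confirmed diagnosis.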

I will attach the .eml file and the output produced by Tika.

The command used is:
{code:java}
java -jar tika-app-2.1.0.jar path\to\eml_test.eml > output.txt
{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)