Posted to dev@tika.apache.org by "GIOELE PERIN (Jira)" <ji...@apache.org> on 2021/11/23 09:13:00 UTC
[jira] [Created] (TIKA-3599) Command line tika extracts encoding of file in eml
GIOELE PERIN created TIKA-3599:
----------------------------------
Summary: Command line tika extracts encoding of file in eml
Key: TIKA-3599
URL: https://issues.apache.org/jira/browse/TIKA-3599
Project: Tika
Issue Type: Bug
Components: app
Affects Versions: 2.1.0
Environment: Windows 10 pro version 10.0.19043 Build 19043
Java:
openjdk version "1.8.0-262"
OpenJDK Runtime Environment (build 1.8.0-262-b10)
OpenJDK 64-Bit Server VM (build 25.71-b10, mixed mode)
OCR:
Tesseract 5
Reporter: GIOELE PERIN
Attachments: eml_test.eml, output.txt
Tika does not extract the text from the attached .eml file. Instead, it returns what I think is the encoded form of the attachments.
This does not happen with all .eml files, but we have not been able to identify the cause. The same file saved in .msg format is extracted correctly.
The extracted .txt file has the same size as the original .eml file.
The .eml file and the output produced by Tika are attached.
The command used is:
{code:java}
java -jar tika-app-2.1.0.jar path\to\eml_test.eml > output.txt
{code}
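The symptom described above (output the same size as the input, apparently containing encoded attachment data) could be checked mechanically with something like the sketch below. It is only an illustration and not part of Tika: the class name PassThroughCheck and the method base64LineRatio are hypothetical, and the pattern assumes that base64 body lines in the message are 60 to 76 characters long. It prints both file sizes and the fraction of non-blank output lines that look like base64 payload.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: checks whether an extraction
// output looks like the raw .eml passed through unchanged.
public class PassThroughCheck {

    // Assumption: a base64 body line is 60-76 characters from the base64
    // alphabet, optionally followed by up to two '=' padding characters.
    private static final Pattern BASE64_LINE =
            Pattern.compile("[A-Za-z0-9+/]{60,76}={0,2}");

    // Returns the fraction of non-blank lines that look like base64 payload.
    public static double base64LineRatio(List<String> lines) {
        int nonBlank = 0;
        int base64 = 0;
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            nonBlank++;
            if (BASE64_LINE.matcher(trimmed).matches()) {
                base64++;
            }
        }
        return nonBlank == 0 ? 0.0 : (double) base64 / nonBlank;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java PassThroughCheck <input.eml> <output.txt>");
            return;
        }
        Path eml = Paths.get(args[0]);
        Path out = Paths.get(args[1]);

        // Similar sizes suggest the message was not actually parsed.
        System.out.println("input size:  " + Files.size(eml) + " bytes");
        System.out.println("output size: " + Files.size(out) + " bytes");

        // ISO-8859-1 maps every byte to a character, so reading never
        // fails on bytes that are not valid in some other charset.
        List<String> lines = Files.readAllLines(out, StandardCharsets.ISO_8859_1);
        System.out.println("base64-like line ratio: " + base64LineRatio(lines));
    }
}
```

A ratio close to 1 together with near-identical sizes would be consistent with the file being passed through as plain text rather than parsed as an email, although it would not by itself explain why.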