Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2021/11/29 14:59:00 UTC

[jira] [Updated] (TIKA-3599) Command line tika extracts encoding of file in eml

     [ https://issues.apache.org/jira/browse/TIKA-3599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-3599:
------------------------------
    Attachment: as_is.png
                fix_1.png

> Command line tika extracts encoding of file in eml
> --------------------------------------------------
>
>                 Key: TIKA-3599
>                 URL: https://issues.apache.org/jira/browse/TIKA-3599
>             Project: Tika
>          Issue Type: Bug
>          Components: app
>    Affects Versions: 2.1.0
>         Environment: Windows 10 pro version 10.0.19043 Build 19043
> Java:
> openjdk version "1.8.0-262"
> OpenJDK Runtime Environment (build 1.8.0-262-b10)
> OpenJDK 64-Bit Server VM (build 25.71-b10, mixed mode)
> OCR:
> Tesseract 5
>            Reporter: GIOELE PERIN
>            Priority: Major
>         Attachments: as_is.png, eml_test.eml, fix_1.png, output.txt
>
>
> Tika cannot extract the text from the attached .eml file. Instead, it returns what I think is the raw encoded content of the attachments.
> This does not happen with all .eml files, but we have not been able to identify the cause of this behavior. The same file saved in .msg format is extracted correctly.
> The extracted .txt file has the same size as the original .eml file.
> I will attach the .eml file and the output produced by Tika.
> The command used is:
> {code:java}
> java -jar tika-app-2.1.0.jar path\to\eml_test.eml > output.txt
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)