Posted to dev@tika.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2021/08/11 16:18:00 UTC
[jira] [Resolved] (TIKA-3515) Tika CLI -t should use UTF-8 as default output encoding
[ https://issues.apache.org/jira/browse/TIKA-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-3515.
-------------------------------
Fix Version/s: 2.0.1
Assignee: Tim Allison
Resolution: Fixed
> Tika CLI -t should use UTF-8 as default output encoding
> -------------------------------------------------------
>
> Key: TIKA-3515
> URL: https://issues.apache.org/jira/browse/TIKA-3515
> Project: Tika
> Issue Type: Improvement
> Affects Versions: 2.0.0, 1.27
> Environment: Windows 10, Liberica OpenJDK FULL x64 1.8.0_302
> Reporter: Luís Filipe Nassif
> Assignee: Tim Allison
> Priority: Minor
> Fix For: 2.0.1
>
> Attachments: Korean lessons_ Lesson 2 – Learnkorean.com.pdf, LIVE-Seoul-ntfs-utf-16-be.txt, LIVE-Seoul-ntfs-utf-16-le.txt, LIVE-Seoul-ntfs-utf-8.txt, LIVE-Seoul-ntfs-utf-8.txt_-x_output.xml, LIVE-Seoul-ntfs-utf-8_-t_output.txt, Screen Shot 2021-08-06 at 5.50.04 PM.png, Screen Shot 2021-08-06 at 5.50.21 PM.png, Screen Shot tika-app.png, image-2021-08-09-14-37-30-552.png, image-2021-08-09-14-38-26-763.png
>
>
> Some Korean characters are extracted as squares. The encodings of the plain text files are detected correctly. Maybe this is related to the content handler (just a guess). I'll attach the triggering files.
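
For illustration only, here is a minimal Java sketch of the general failure mode the issue title points at: text that is decoded correctly can still be damaged if it is written out with a charset that cannot represent it. This is not Tika's code. The class name EncodingSketch and the helper roundTrip are hypothetical, and ISO-8859-1 is only a stand-in for whatever legacy platform default charset the reporter's environment used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSketch {
    // Hypothetical helper: encode text with the given charset and decode it
    // back with the same charset, as a writer and reader that agree would do.
    static String roundTrip(String text, Charset charset) {
        // String.getBytes(Charset) substitutes the charset's replacement
        // byte for any character the charset cannot encode.
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "Seoul" in Hangul, written with Unicode escapes so the source
        // file's own encoding does not matter.
        String korean = "\uC11C\uC6B8";

        // UTF-8 can represent every Unicode character, so the text survives.
        System.out.println(roundTrip(korean, StandardCharsets.UTF_8).equals(korean));

        // ISO-8859-1 (assumed stand-in for a legacy default) has no Hangul,
        // so each character is replaced and the original text is lost.
        System.out.println(roundTrip(korean, StandardCharsets.ISO_8859_1).equals(korean));
    }
}
```

The first line printed is true and the second is false. Whether the replaced characters then show up as question marks or squares depends on the charset and on the viewer's font.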
--
This message was sent by Atlassian Jira
(v8.3.4#803005)