Posted to dev@tika.apache.org by "Luís Filipe Nassif (Jira)" <ji...@apache.org> on 2021/08/06 16:42:00 UTC

[jira] [Created] (TIKA-3515) Korean chars not extracted correctly

Luís Filipe Nassif created TIKA-3515:
----------------------------------------

             Summary: Korean chars not extracted correctly
                 Key: TIKA-3515
                 URL: https://issues.apache.org/jira/browse/TIKA-3515
             Project: Tika
          Issue Type: Bug
    Affects Versions: 2.0.0-BETA, 1.27
            Reporter: Luís Filipe Nassif
         Attachments: LIVE-Seoul-ntfs-utf-16-be.txt, LIVE-Seoul-ntfs-utf-16-le.txt, LIVE-Seoul-ntfs-utf-8.txt

Some Korean characters are extracted as squares, even though the encodings of the plain-text files are detected correctly. This may be related to the content handler, but that is only a guess. I'll attach the triggering files.
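
One way to narrow this down is to check whether the extracted string still contains Hangul code points or whether they were replaced with U+FFFD. The sketch below uses only the JDK and does not call Tika. The class name HangulCheck, its helper methods, and the sample string are hypothetical and are not taken from the attached files. If the Hangul count on real extracted text is correct and no replacement characters appear, the squares are more likely a display or font issue than a decoding one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper; not part of Tika.
public class HangulCheck {

    // Counts U+FFFD replacement characters, which usually indicate
    // that bytes could not be decoded with the chosen charset.
    static long countReplacementChars(String text) {
        return text.codePoints().filter(cp -> cp == 0xFFFD).count();
    }

    // Counts code points that belong to the Hangul script.
    static long countHangul(String text) {
        return text.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp)
                        == Character.UnicodeScript.HANGUL)
                .count();
    }

    // Encodes and decodes with the same charset, standing in for
    // "file on disk -> extracted string" in this sketch.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // Assumed sample: the two Hangul syllables U+C11C U+C6B8 ("Seoul").
        String sample = "\uC11C\uC6B8";
        Charset[] charsets = {
            StandardCharsets.UTF_8,
            StandardCharsets.UTF_16BE,
            StandardCharsets.UTF_16LE
        };
        for (Charset charset : charsets) {
            String decoded = roundTrip(sample, charset);
            System.out.printf("%s: hangul=%d, replacement=%d%n",
                    charset.name(),
                    countHangul(decoded),
                    countReplacementChars(decoded));
        }
    }
}
```

To apply the same check to a real case, the string passed to countHangul and countReplacementChars would be the text captured from the extraction step rather than the round-tripped sample.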



--
This message was sent by Atlassian Jira
(v8.3.4#803005)