Posted to dev@tika.apache.org by "Olivier M (JIRA)" <ji...@apache.org> on 2015/11/16 11:01:11 UTC
[jira] [Comment Edited] (TIKA-1794) TXTParser removes form feed characters
[ https://issues.apache.org/jira/browse/TIKA-1794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15006441#comment-15006441 ]
Olivier M edited comment on TIKA-1794 at 11/16/15 10:00 AM:
------------------------------------------------------------
Txt file with form feed character attached.
was (Author: maol):
Txt file with form feed character.
> TXTParser removes form feed characters
> --------------------------------------
>
> Key: TIKA-1794
> URL: https://issues.apache.org/jira/browse/TIKA-1794
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.11
> Environment: Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Reporter: Olivier M
> Priority: Minor
> Labels: parser, txt
> Attachments: form_feed.txt
>
>
> Just noticed that Apache Tika removes form feed characters (byte 0C in UTF-8) when parsing a text file.
> If I compare the hex bytes of the original file with the hex bytes of the extracted text, I can see that the 0C byte is replaced by EF BF BD, which is the UTF-8 encoding of the replacement character (U+FFFD).
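
The hex comparison described in the report can be sketched as follows. This is a minimal, standalone illustration in Python and does not call Tika: the helper names (hex_dump, form_feed_report) and the sample byte strings are hypothetical, and the "extracted" bytes are written by hand to mimic the reported behaviour rather than produced by a parser.

```python
# Byte patterns named in the report.
FORM_FEED = b"\x0c"                      # U+000C is a single byte in UTF-8
REPLACEMENT = "\ufffd".encode("utf-8")   # U+FFFD encodes as EF BF BD


def hex_dump(data: bytes) -> str:
    """Render bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


def form_feed_report(original: bytes, extracted: bytes) -> dict:
    """Count form feeds and replacement characters on both sides."""
    return {
        "form_feeds_in_original": original.count(FORM_FEED),
        "form_feeds_in_extracted": extracted.count(FORM_FEED),
        "replacement_chars_in_extracted": extracted.count(REPLACEMENT),
    }


# Hypothetical sample data: a two-page text file and the text as it would
# look if the form feed had been swapped for U+FFFD during extraction.
original = b"page 1\x0cpage 2\n"
extracted = "page 1\ufffdpage 2\n".encode("utf-8")

print(hex_dump(original))
print(hex_dump(extracted))
print(form_feed_report(original, extracted))
```

To check a real case, the two byte strings would be read from the attached form_feed.txt and from the parser output saved to a file. A form feed count that drops to zero while the replacement character count rises by the same amount would match the behaviour described above.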
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)