Posted to dev@tika.apache.org by "Hong-Thai Nguyen (JIRA)" <ji...@apache.org> on 2014/09/25 17:45:35 UTC

[jira] [Commented] (TIKA-1428) Microsoft Word 97 - 2003 (.doc) footnote references are Unicode Replacement Character

    [ https://issues.apache.org/jira/browse/TIKA-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14147880#comment-14147880 ] 

Hong-Thai Nguyen commented on TIKA-1428:
----------------------------------------

Thanks [~theoettheo]. Any chance you could provide a patch with a test case for this problem?

> Microsoft Word 97 - 2003 (.doc) footnote references are Unicode Replacement Character
> -------------------------------------------------------------------------------------
>
>                 Key: TIKA-1428
>                 URL: https://issues.apache.org/jira/browse/TIKA-1428
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.4, 1.6
>            Reporter: Theodor Sjöstedt
>            Priority: Minor
>         Attachments: TIKA-doc-footnotes-issue.png
>
>
> Footnotes from {{.doc}} documents are extracted, but the references to the footnotes are replaced by the Unicode Replacement Character (�).
> I have tried this in 1.4 and 1.6.
> In 1.4, both the reference in the text and the reference at the footnote are replaced.
> In 1.6, the reference in the text disappears completely.
> See the attached image for the original document, the 1.4 formatted text, and the 1.6 formatted text.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)