Posted to dev@tika.apache.org by "Mike Rodent (JIRA)" <ji...@apache.org> on 2017/02/14 17:43:41 UTC

[jira] [Issue Comment Deleted] (TIKA-2265) Problem with footnotes/endnotes in Tika.parseToString with MS Word (.docx) files

     [ https://issues.apache.org/jira/browse/TIKA-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Rodent updated TIKA-2265:
------------------------------
    Comment: was deleted

(was: I'm not that surprised to find that...

So I ran my tests again... same problem. The app code is as simple as can be: 

Tika tika = new Tika();
tika.setMaxStringLength(Integer.MAX_VALUE);
contents = tika.parseToString(documentFile);

and I'm still getting the anomaly. I also tried cutting out the top of my file (test shorter.docx), with the same result: the footnotes in the first line come out as "2", "3" and "4". This is the logged output:

# TIKA contents for file test shorter.docx: Tecum optime[footnoteRef:2], deinde etiam[footnoteRef:3] cum mediocri amico[footnoteRef:4]. [2: Sed quoniam et advesperascit et mihi ad villam revertendum est, nunc quidem hactenus; Quod si ita sit, cur opera philosophiae sit danda nescio.] [3: Si quae forte-possumus. Immo videri fortasse.] [4: Huius ego nunc auctoritatem [sequens idem faciam]. Confecta res esset. Primum Theophrasti, Strato, physicum se voluit; Ut proverbia non nulla veriora sint quam vestra dogmata.] 

Both files are uploaded. PS: the Tika version is 1.14.)

> Problem with footnotes/endnotes in Tika.parseToString with MS Word (.docx) files
> --------------------------------------------------------------------------------
>
>                 Key: TIKA-2265
>                 URL: https://issues.apache.org/jira/browse/TIKA-2265
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.14
>         Environment: N/A
>            Reporter: Mike Rodent
>            Assignee: Tim Allison
>            Priority: Minor
>              Labels: newbie
>
> A footnote numbered "1" in the real document appears to be output by Tika.parseToString() as "2", both in the footnote reference and in the corresponding footnote body text. Likewise, real footnote "2" becomes "3", "3" becomes "4", etc. I have not yet looked at the source code, but I can't imagine it would be difficult to correct this.
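
The renumbering described in the issue can be checked mechanically against the extracted text. The following is a minimal sketch in Java, assuming the in-text markers always take the form [footnoteRef:N] as in the logged output for test shorter.docx. The class and method names are hypothetical and are not part of Tika; the sketch only inspects a string and does not call the parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: inspects text already extracted
// by Tika.parseToString() and reports how the footnote references are numbered.
final class FootnoteNumberCheck {

    // Matches the in-text markers seen in the logged output, e.g. "[footnoteRef:2]".
    private static final Pattern REF = Pattern.compile("\\[footnoteRef:(\\d+)\\]");

    /** Returns the footnote reference numbers in the order they appear in the text. */
    static List<Integer> referenceNumbers(String extracted) {
        List<Integer> numbers = new ArrayList<>();
        Matcher m = REF.matcher(extracted);
        while (m.find()) {
            numbers.add(Integer.parseInt(m.group(1)));
        }
        return numbers;
    }

    /**
     * Returns how far the first reference is from 1 (0 means numbering starts at 1).
     * Also returns 0 if the text contains no references at all.
     */
    static int numberingOffset(String extracted) {
        List<Integer> numbers = referenceNumbers(extracted);
        return numbers.isEmpty() ? 0 : numbers.get(0) - 1;
    }

    public static void main(String[] args) {
        // Shortened sample taken from the logged output quoted in this thread.
        String sample = "Tecum optime[footnoteRef:2], deinde etiam[footnoteRef:3] "
                + "cum mediocri amico[footnoteRef:4].";
        System.out.println(referenceNumbers(sample)); // [2, 3, 4]
        System.out.println(numberingOffset(sample));  // 1
    }
}
```

An offset of 1 for a document whose first real footnote is "1" would match the behaviour reported here. The check says nothing about where in the parser the shift is introduced.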



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)