Posted to dev@tika.apache.org by "Jukka Zitting (Updated) (JIRA)" <ji...@apache.org> on 2011/10/07 10:18:29 UTC
[jira] [Updated] (TIKA-423) Parse docx and output to text file missing words
[ https://issues.apache.org/jira/browse/TIKA-423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting updated TIKA-423:
-------------------------------
Affects Version/s: 0.8, 0.9, 0.10
This is still a problem with Tika 0.10 and the latest trunk.
> Parse docx and output to text file missing words
> ------------------------------------------------
>
> Key: TIKA-423
> URL: https://issues.apache.org/jira/browse/TIKA-423
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.7, 0.8, 0.9, 0.10
> Environment: Windows and Mac
> Reporter: David Tran
> Labels: docx, missing_word, smart_tag, word
> Attachments: output.txt, tika_test.docx
>
>
> I created a Word document containing person names, country names, addresses, and a date, using Word 2007 on a Windows Server 2003 machine (via Remote Desktop). The same problem has also happened to someone else using Windows XP. Word tags some of these elements as "Smart Tags", and some of the words disappear from the output when Tika parses the document.
> So a text fragment like the one below in Word:
> Smart tags typically are names like George Washington, Marilyn Monroe, Napoleon Bonaparte, etc. But they are automatically generated by Word, so it can be difficult to control how they are
> Running Tika from the command line (on OS X) with java -jar /path/to/tika-app-0.7.jar -t /path/to/docx/document.docx > /path/to/output.txt results in something like:
> Smart tags typically are names like , , Napoleon Bonaparte, etc. But they are automatically generated by Word, so it can be difficult to control how they are
> Note the missing names George Washington and Marilyn Monroe; Marilyn Monroe was one that was tagged by Word.
> While I've only tried this with Tika 0.7, my understanding is that it has been an issue since 0.3 at least.
> Removing all Smart tags from the document using Autocorrect options in Word will result in the correct output.
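To check which tagged names are affected in a given document, one option is to read the smart-tag text straight out of the .docx package and compare it with Tika's text output. The sketch below is not part of Tika and does not use the Tika API; the class name SmartTagTextCheck and its methods are hypothetical. It assumes the document body is the word/document.xml entry of the .docx zip and that Word stores each tagged phrase as a w:smartTag element wrapping ordinary w:t text runs. Whether this nesting is what causes the missing words is an assumption, not something the report confirms.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical diagnostic helper; not part of Tika.
class SmartTagTextCheck {
    // WordprocessingML main namespace (the "w:" prefix in document.xml).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the text of every w:smartTag element in word/document.xml. */
    static List<String> smartTagTexts(byte[] docx) throws Exception {
        byte[] xml = null;
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(docx))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                if ("word/document.xml".equals(e.getName())) {
                    xml = zip.readAllBytes(); // reads only the current entry
                    break;
                }
            }
        }
        if (xml == null) {
            throw new IllegalArgumentException("no word/document.xml entry");
        }

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));

        List<String> texts = new ArrayList<>();
        NodeList tags = doc.getElementsByTagNameNS(W_NS, "smartTag");
        for (int i = 0; i < tags.getLength(); i++) {
            // Join all w:t runs below this smart tag. Nested smart tags are
            // reported twice: once within the outer tag and once on their own.
            NodeList runs = ((Element) tags.item(i)).getElementsByTagNameNS(W_NS, "t");
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < runs.getLength(); j++) {
                sb.append(runs.item(j).getTextContent());
            }
            texts.add(sb.toString());
        }
        return texts;
    }

    // args[0]: path to the .docx file, args[1]: path to Tika's text output.
    public static void main(String[] args) throws Exception {
        List<String> tagged = smartTagTexts(Files.readAllBytes(Paths.get(args[0])));
        String output = new String(Files.readAllBytes(Paths.get(args[1])),
                StandardCharsets.UTF_8);
        for (String t : tagged) {
            System.out.println((output.contains(t) ? "present: " : "MISSING: ") + t);
        }
    }
}
```

Run against the attached tika_test.docx and output.txt, this would list each smart-tagged phrase and whether it survived extraction. It needs Java 9 or later for InputStream.readAllBytes.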