Posted to dev@tika.apache.org by "Jukka Zitting (Updated) (JIRA)" <ji...@apache.org> on 2011/10/07 10:18:29 UTC

[jira] [Updated] (TIKA-423) Parse docx and output to text file missing words

     [ https://issues.apache.org/jira/browse/TIKA-423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated TIKA-423:
-------------------------------

    Affects Version/s: 0.8
                       0.9
                       0.10

This is still a problem with Tika 0.10 and the latest trunk.
                
> Parse docx and output to text file missing words
> ------------------------------------------------
>
>                 Key: TIKA-423
>                 URL: https://issues.apache.org/jira/browse/TIKA-423
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7, 0.8, 0.9, 0.10
>         Environment: Windows and Mac
>            Reporter: David Tran
>              Labels: docx, missing_word, smart_tag, word
>         Attachments: output.txt, tika_test.docx
>
>
> I created a Word document using Word 2007 on a Windows Server 2003 machine (via Remote Desktop); the same problem has also happened to someone else using Windows XP. The document contains person names, country names, addresses, and a date. Some of these elements are tagged as "Smart Tags" by Word, and in the output produced by Tika some of the words disappear.
> So a text fragment like the one below in Word:
> Smart tags typically are names like George Washington, Marilyn Monroe, Napoleon Bonaparte, etc. But they are automatically generated by Word, so it can be difficult to control how they are 
> Running Tika from the command line (on OS X):
>     java -jar /path/to/tika-app-0.7.jar -t /path/to/docx/document.docx > /path/to/output.txt
> results in something like:
> Smart tags typically are names like  , , Napoleon Bonaparte, etc. But they are automatically generated by Word, so it can be difficult to control how they are
> Note the missing names George Washington and Marilyn Monroe; Marilyn Monroe was one of the names that Word tagged.
> While I've only tried this with Tika 0.7, my understanding is that it has been an issue since 0.3 at least.
> Removing all Smart Tags from the document using the AutoCorrect options in Word results in the correct output.
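
The symptom described above can be illustrated without Tika. The following is a minimal sketch, assuming (the report does not confirm this) that the names are lost because Word wraps smart-tagged text in a w:smartTag element and the extractor only reads runs that are direct children of the paragraph. The XML fragment, the class name SmartTagSketch, and both helper methods are hypothetical and written by hand for illustration; this is not Tika's or POI's actual code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class SmartTagSketch {
    // WordprocessingML main namespace.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hand-written paragraph (an assumption, not taken from tika_test.docx):
    // "Marilyn Monroe" sits in a run nested inside w:smartTag.
    static final String PARAGRAPH =
        "<w:p xmlns:w=\"" + W_NS + "\">"
        + "<w:r><w:t xml:space=\"preserve\">names like </w:t></w:r>"
        + "<w:smartTag w:uri=\"urn:schemas-microsoft-com:office:smarttags\""
        + " w:element=\"PersonName\">"
        + "<w:r><w:t>Marilyn Monroe</w:t></w:r>"
        + "</w:smartTag>"
        + "<w:r><w:t>, Napoleon Bonaparte</w:t></w:r>"
        + "</w:p>";

    static Element parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement();
    }

    // Naive extraction: only w:r elements that are direct children of w:p.
    static String directRunsOnly(Element paragraph) {
        StringBuilder text = new StringBuilder();
        for (Node n = paragraph.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE
                    && W_NS.equals(n.getNamespaceURI())
                    && "r".equals(n.getLocalName())) {
                text.append(n.getTextContent());
            }
        }
        return text.toString();
    }

    // Extraction that also descends into wrappers such as w:smartTag:
    // collects every w:t descendant in document order.
    static String allRuns(Element paragraph) {
        StringBuilder text = new StringBuilder();
        NodeList texts = paragraph.getElementsByTagNameNS(W_NS, "t");
        for (int i = 0; i < texts.getLength(); i++) {
            text.append(texts.item(i).getTextContent());
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        Element p = parse(PARAGRAPH);
        System.out.println(directRunsOnly(p)); // names like , Napoleon Bonaparte
        System.out.println(allRuns(p));        // names like Marilyn Monroe, Napoleon Bonaparte
    }
}
```

Under that assumption, the first method drops the tagged name and leaves a stray comma, much like the output quoted in the report, while the second keeps it.
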
