Posted to dev@tika.apache.org by "Curt Arnold (JIRA)" <ji...@apache.org> on 2011/09/01 17:24:09 UTC

[jira] [Commented] (TIKA-207) MS word doc containing tracked changes produces incorrect text

    [ https://issues.apache.org/jira/browse/TIKA-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095346#comment-13095346 ] 

Curt Arnold commented on TIKA-207:
----------------------------------

I also ran into this problem, and at least the manifestation I saw can be addressed with a trivial fix.

In org.apache.tika.parser.microsoft.WordExtractor, the handleParagraph method contains a loop over the CharacterRuns in a paragraph, starting at line 162. If the body of the loop is guarded with:

if(!cr.isMarkedDeleted()) {

then all deleted text is suppressed. Adding this guard did not affect the unit tests. It was not sufficient, however, to fix the same problem for .docx files, so I will have to dig a little more to find where those are handled.
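
For illustration only, here is a minimal, self-contained sketch of what such a guard does. It is not the actual Tika code: the Run interface is a hypothetical stand-in for POI's CharacterRun, reduced to text() and isMarkedDeleted(), and the class name, the run() helper, and the extract() method are assumptions made up for this example. The sample runs reproduce the "Field.Index.TOKENIZED" case from the issue description.

```java
import java.util.Arrays;
import java.util.List;

public class TrackedChangesSketch {

    /** Hypothetical stand-in for POI's CharacterRun (only the two calls used here). */
    public interface Run {
        String text();
        boolean isMarkedDeleted();
    }

    /** Helper (an assumption for this sketch) that builds a fake run. */
    public static Run run(final String text, final boolean deleted) {
        return new Run() {
            public String text() { return text; }
            public boolean isMarkedDeleted() { return deleted; }
        };
    }

    /** Concatenates run text, optionally skipping runs marked as tracked deletions. */
    public static String extract(List<Run> runs, boolean skipDeleted) {
        StringBuilder sb = new StringBuilder();
        for (Run cr : runs) {
            // The guard from the comment above: only emit text that is not marked deleted.
            if (!skipDeleted || !cr.isMarkedDeleted()) {
                sb.append(cr.text());
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // TOKENIZED was replaced by ANALYZED with track changes enabled.
        List<Run> paragraph = Arrays.asList(
                run("Field.Index.", false),
                run("TOKENIZED", true),
                run("ANALYZED", false));

        System.out.println(extract(paragraph, false)); // without the guard
        System.out.println(extract(paragraph, true));  // with the guard
    }
}
```

Without the guard, the old and new text are concatenated as in the original report; with it, only the text that is not marked deleted is emitted.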


> MS word doc containing tracked changes produces incorrect text
> --------------------------------------------------------------
>
>                 Key: TIKA-207
>                 URL: https://issues.apache.org/jira/browse/TIKA-207
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.3
>         Environment: tika-0.3-standalone.jar
>            Reporter: Michael McCandless
>            Priority: Minor
>
> Spinoff from this discussion:
>   http://n2.nabble.com/getting-text-from-MS-Word-docs-with-tracked-changes...-td2463811.html
> When extracting text from an MS Word doc (2003 format) that has
> unapproved pending changes, the text from both old and new is glommed
> together.
> EG I had a doc that contained text "Field.Index.TOKENIZED", and I
> changed TOKENIZED to ANALYZED with track changes enabled, and
> then when I extract text (using TikaCLI) it produces this:
>   Field.Index.TOKENIZEDANALYZED
> So, first, it'd be nice to at least get whitespace inserted between
> old & new text.
> And, second, it'd be great to have an option to control whether it's
> old or new text that's indexed (or at least an option to only see
> "new" text, ie the current document).
> From the discussion above, it seems like POI may expose the
> fine-grained APIs to allow Tika to do this; it's just that Tika's not
> leveraging these APIs for MS Word docs.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Re: [jira] [Commented] (TIKA-207) MS word doc containing tracked changes produces incorrect text

Posted by Mark Kerzner <ma...@gmail.com>.
From this comment, I gather that one can tell whether an MS Word document has
"track changes" on. Is that true? Thank you.

Mark
