Posted to user@tika.apache.org by Michael McCandless <lu...@mikemccandless.com> on 2009/03/11 22:03:21 UTC

getting text from MS Word docs with tracked changes...

These may simply be POI issues, but I thought I'd start here.

When I use TikaCLI to extract text from an MS Word doc (in 2003
format) that has pending tracked changes, the resulting text has both
the old and the new text.

First question: is there some way to see only the new text?  (Besides
going and accepting/rejecting all changes in the doc).

Second question/issue: sometimes the old + new text is glommed
together without a separating space.  E.g., for LIA2, I had done a big
replace of Field.Index.TOKENIZED -> Field.Index.ANALYZED (it was
exhausting), and now I see text like this:

   Field.Index.TOKENIZEDANALYZED

coming out from TikaCLI.  Is there some way to get a space in there...?

Mike

Re: getting text from MS Word docs with tracked changes...

Posted by Michael McCandless <lu...@mikemccandless.com>.

OK thanks Jukka.  I'll open an issue to track this...

Mike

Jukka Zitting wrote:

> Hi,
>
> On Wed, Mar 11, 2009 at 10:03 PM, Michael McCandless
> <lu...@mikemccandless.com> wrote:
>> When I use TikaCLI to extract text from an MS Word doc (in 2003
>> format) that has pending tracked changes, the resulting text has both
>> the old and the new text.
>>
>> First question: is there some way to see only the new text?  (Besides
>> going and accepting/rejecting all changes in the doc).
>
> Currently the Word parser in Tika just leverages the WordExtractor
> class from POI without any extra settings:
>
>    WordExtractor extractor = new WordExtractor(filesystem);
>    for (String paragraph : extractor.getParagraphText()) {
>        xhtml.element("p", paragraph);
>    }
>
> It would be nice if Tika was able to express the structure of the
> underlying document in more detail, but that probably requires us to
> use lower level POI APIs, something we already do when parsing Excel
> spreadsheets.
>
>> Second question/issue: sometimes the old + new text is glommed
>> together without a separating space.  EG, for LIA2, I had done a big
>> replace of Field.Index.TOKENIZED -> Field.Index.ANALYZED (it was
>> exhausting), and now I see text like this:
>>
>>  Field.Index.TOKENIZEDANALYZED
>>
>> coming out from TikaCLI.  Is there some way to get a space in there...?
>
> That text is probably something we get directly from
> WordExtractor.getParagraphText(), so there isn't much we can do about
> it until we start using the more fine-grained APIs in POI.
>
> BR,
>
> Jukka Zitting

Re: getting text from MS Word docs with tracked changes...

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Wed, Mar 11, 2009 at 10:03 PM, Michael McCandless
<lu...@mikemccandless.com> wrote:
> When I use TikaCLI to extract text from an MS Word doc (in 2003
> format) that has pending tracked changes, the resulting text has both
> the old and the new text.
>
> First question: is there some way to see only the new text?  (Besides
> going and accepting/rejecting all changes in the doc).

Currently the Word parser in Tika just leverages the WordExtractor
class from POI without any extra settings:

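    // filesystem: the POIFSFileSystem opened on the .doc input
    WordExtractor extractor = new WordExtractor(filesystem);
    // Each paragraph string becomes one <p> element; revision marks are not consulted
    for (String paragraph : extractor.getParagraphText()) {
        xhtml.element("p", paragraph);
    }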

It would be nice if Tika were able to express the structure of the
underlying document in more detail, but that probably requires us to
use lower level POI APIs, something we already do when parsing Excel
spreadsheets.
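
As a rough sketch of what run-level parsing could give us for the first
question (this is not Tika or POI code): the Run class and its deleted
flag below are hypothetical stand-ins for whatever revision information
the lower level POI APIs expose. In POI the closest counterpart is
probably CharacterRun in the HWPF usermodel, but that mapping is an
assumption. The idea is simply to skip runs that carry a pending
deletion, which is what "accept all changes" would leave behind:

```java
import java.util.Arrays;
import java.util.List;

class TrackedChangesSketch {

    /** Hypothetical stand-in for a run of text plus its revision state. */
    static final class Run {
        final String text;
        final boolean deleted;   // true if the run is a pending tracked deletion

        Run(String text, boolean deleted) {
            this.text = text;
            this.deleted = deleted;
        }
    }

    /** Concatenates the runs that would survive if every change were accepted. */
    static String acceptedText(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        for (Run run : runs) {
            if (!run.deleted) {
                sb.append(run.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Run> paragraph = Arrays.asList(
                new Run("Field.Index.", false),
                new Run("TOKENIZED", true),    // old text, marked deleted
                new Run("ANALYZED", false));   // new text, marked inserted
        System.out.println(acceptedText(paragraph)); // prints Field.Index.ANALYZED
    }
}
```

Keeping the filter in one small method would make it easy to switch
between "old text", "new text" and "both" once the real flags are
available.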

> Second question/issue: sometimes the old + new text is glommed
> together without a separating space.  EG, for LIA2, I had done a big
> replace of Field.Index.TOKENIZED -> Field.Index.ANALYZED (it was
> exhausting), and now I see text like this:
>
>  Field.Index.TOKENIZEDANALYZED
>
> coming out from TikaCLI.  Is there some way to get a space in there...?

That text is probably something we get directly from
WordExtractor.getParagraphText(), so there isn't much we can do about
it until we start using the more fine-grained APIs in POI.
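
For illustration only, here is how the missing space could be added
once text is available run by run instead of as one paragraph string.
The Run and State types are hypothetical and do not exist in Tika or
POI; the sketch assumes each run can be classified as unchanged,
deleted or inserted. A space is emitted only where an inserted run
directly follows a deleted one, i.e. at a replacement boundary:

```java
import java.util.Arrays;
import java.util.List;

class RevisionSeparatorSketch {

    enum State { UNCHANGED, DELETED, INSERTED }

    /** Hypothetical run of text tagged with its tracked-change state. */
    static final class Run {
        final String text;
        final State state;

        Run(String text, State state) {
            this.text = text;
            this.state = state;
        }
    }

    /** Keeps old and new text, but separates a deletion from its replacement. */
    static String textWithSeparators(List<Run> runs) {
        StringBuilder sb = new StringBuilder();
        State previous = null;
        for (Run run : runs) {
            boolean replacement =
                    previous == State.DELETED && run.state == State.INSERTED;
            boolean needsSpace = sb.length() > 0
                    && !Character.isWhitespace(sb.charAt(sb.length() - 1));
            if (replacement && needsSpace) {
                sb.append(' ');
            }
            sb.append(run.text);
            previous = run.state;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Run> paragraph = Arrays.asList(
                new Run("Field.Index.", State.UNCHANGED),
                new Run("TOKENIZED", State.DELETED),
                new Run("ANALYZED", State.INSERTED));
        // prints Field.Index.TOKENIZED ANALYZED
        System.out.println(textWithSeparators(paragraph));
    }
}
```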

BR,

Jukka Zitting