Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2018/03/01 15:10:00 UTC
[jira] [Comment Edited] (TIKA-2593) docx with track change producing incorrect output

    [ https://issues.apache.org/jira/browse/TIKA-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382125#comment-16382125 ] 

Tim Allison edited comment on TIKA-2593 at 3/1/18 3:09 PM:
-----------------------------------------------------------

bq. I think I did figure it out. I need to set officeParserConfig.setUseSAXDocxExtractor(true);

Sorry for not responding sooner. IIRC, our regular DOM parser can't yet remove deleted content, so you do have to use the SAX docx extractor ({{setUseSAXDocxExtractor(true)}}).
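
For readers wondering why the extractor choice matters: in WordprocessingML, a tracked deletion is stored as {{w:delText}} inside a {{w:del}} element, while unchanged and inserted text sits in {{w:t}}. The sketch below is not Tika's code. It is a minimal, JDK-only illustration of how a streaming (SAX) handler can skip deleted text; the class name {{TrackedChangeText}} and the sample fragment are assumptions made up for this example, modelled on the sentence in the issue description.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; it does not use or mirror Tika internals.
public class TrackedChangeText {

    // Namespace of the main WordprocessingML vocabulary.
    public static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Hand-written fragment: one deleted run (w:del) and one inserted run (w:ins).
    public static final String SAMPLE =
        "<w:p xmlns:w=\"" + W_NS + "\">"
        + "<w:r><w:t xml:space=\"preserve\">This is a sample text. </w:t></w:r>"
        + "<w:del w:id=\"1\" w:author=\"a\"><w:r><w:delText>This part will</w:delText></w:r></w:del>"
        + "<w:r><w:t xml:space=\"preserve\"> be deleted. </w:t></w:r>"
        + "<w:ins w:id=\"2\" w:author=\"a\"><w:r><w:t>This is inserted.</w:t></w:r></w:ins>"
        + "</w:p>";

    /** Collects run text from the fragment; deleted text is kept only if includeDeleted is true. */
    public static String extract(String xml, boolean includeDeleted) throws Exception {
        StringBuilder out = new StringBuilder();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so localName and uri are populated
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            private boolean capture = false;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // w:t carries unchanged and inserted text; w:delText carries tracked deletions.
                boolean isText = localName.equals("t");
                boolean isDeletedText = localName.equals("delText");
                capture = W_NS.equals(uri) && (isText || (includeDeleted && isDeletedText));
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                capture = false;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (capture) {
                    out.append(ch, start, length);
                }
            }
        });
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract(SAMPLE, true));  // deleted text included
        System.out.println(extract(SAMPLE, false)); // deleted text skipped
    }
}
```

With {{includeDeleted}} set to false, the sample yields the desired output from the issue description, including the double space left where the deleted run used to be. In Tika itself, the corresponding switch is the combination discussed in this thread: {{setUseSAXDocxExtractor(true)}} together with {{setIncludeDeletedContent(false)}}.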

bq. But still doesn't work for officeParserConfig.setIncludeShapeBasedContent(false);

If {{setIncludeShapeBasedContent}} is set to false, are you saying that deleted content comes through?!



> docx with track change producing incorrect output
> -------------------------------------------------
>
>                 Key: TIKA-2593
>                 URL: https://issues.apache.org/jira/browse/TIKA-2593
>             Project: Tika
>          Issue Type: Bug
>          Components: core, handler
>    Affects Versions: 1.17
>            Reporter: Md
>            Priority: Major
>         Attachments: sample.docx
>
>
> I am using the following code to extract text from a docx file:
> {code:java}
> AutoDetectParser parser = new AutoDetectParser();
> ContentHandler contentHandler = new BodyContentHandler();
> InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> Metadata metadata = new Metadata();
> // The ParseContext carries per-parse configuration to the underlying parsers.
> ParseContext parseContext = new ParseContext();
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> // Ask the Office parsers to leave out text marked as deleted by track changes.
> officeParserConfig.setIncludeDeletedContent(false);
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> parser.parse(inputStream, contentHandler, metadata, parseContext);
> System.out.println(contentHandler.toString());
> {code}
> When I parse files with tracked changes, the output includes the deleted text along with the unchanged and inserted text. Is there a way to tell the parser to exclude the deleted text?
> Here is an example (deleted text is marked with -...-, inserted text with +...+):
> Input text: This is a sample text. -This part will- be deleted. +This is inserted.+
> Actual output: This is a sample text. This part will be deleted. This is inserted.
> Desired output: This is a sample text.  be deleted. This is inserted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)