Posted to dev@tika.apache.org by "Dave Meikle (JIRA)" <ji...@apache.org> on 2008/01/04 17:10:34 UTC

[jira] Commented: (TIKA-109) WordParser fails on some Word files

    [ https://issues.apache.org/jira/browse/TIKA-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555960#action_12555960 ] 

Dave Meikle commented on TIKA-109:
----------------------------------

The problem is that the current code does not follow the contract of the Word format. Comparing the start and end of a TextPiece with those of a CHPX will not work.

There are three options here:

a) we remove the code that checks whether text is marked as deleted and just loop over the text pieces, outputting each one's content - this will keep the text extraction fast but will include text marked as deleted in the output

b) we utilise POI to load the full document up, as the POI code will handle extracting the CHPX and PAPX required to make both the text and style available - this will take longer than option a but will allow text marked as deleted to be excluded from the output, as well as presenting the rest of the formatting options known by POI

c) we use the POI internal model to do the least amount of extraction needed to make the text and style available, and add the required code to use this.
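
To make the trade-off between options a and b concrete, here is a minimal sketch in Java. The Run class and the method names extractAll and extractVisible are hypothetical stand-ins of my own, not the POI API; in the real code the "marked as deleted" flag would have to come from the CHPX data that POI extracts.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model (NOT the POI API): each Run stands in for a
// span of text plus the single style flag under discussion.
public class DeletedTextSketch {

    static final class Run {
        final String text;
        final boolean markedDeleted;

        Run(String text, boolean markedDeleted) {
            this.text = text;
            this.markedDeleted = markedDeleted;
        }
    }

    // Option a: output every piece of text and ignore style information.
    // Fast, but text marked as deleted ends up in the output.
    static String extractAll(List<Run> runs) {
        StringBuilder out = new StringBuilder();
        for (Run run : runs) {
            out.append(run.text);
        }
        return out.toString();
    }

    // Option b: consult the style information and skip text marked as deleted.
    static String extractVisible(List<Run> runs) {
        StringBuilder out = new StringBuilder();
        for (Run run : runs) {
            if (!run.markedDeleted) {
                out.append(run.text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
                new Run("The quick ", false),
                new Run("brown ", true),   // marked as deleted in the document
                new Run("fox", false));

        System.out.println(extractAll(runs));     // The quick brown fox
        System.out.println(extractVisible(runs)); // The quick fox
    }
}
```

The sketch only shows the difference in output; it says nothing about the extra cost of loading the style data, which is the real price of option b.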

Whilst I am not strongly in favour of any particular approach, if we are required to exclude text marked as deleted from the output, I would suggest using approach b. I say this because it will allow us to utilise the existing POI code (on which we currently have a hard dependency anyway) to make this information available. If we use approach c, we are then maintaining this code separately from POI and will not benefit from any fixes/changes there.

That said, if the community would prefer to go with option c, I am happy to make the change.

Cheers,
Dave

> WordParser fails on some Word files
> -----------------------------------
>
>                 Key: TIKA-109
>                 URL: https://issues.apache.org/jira/browse/TIKA-109
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.1-incubating
>         Environment: Windows XP
> Java(TM) SE Runtime Environment (build 1.6.0_03-b05)
>            Reporter: Mats Norén
>         Attachments: fil6.doc
>
>
> WordParser fails on some Word files. A negative value is sent to TextPiece.substring in POI for some corner case in the algorithm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.