
Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2014/06/03 18:15:03 UTC

[jira] [Commented] (TIKA-1318) Use of Deprecated Word6Extractor.getParagraphText() Method

    [ https://issues.apache.org/jira/browse/TIKA-1318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14016840#comment-14016840 ] 

Nick Burch commented on TIKA-1318:
----------------------------------

It might make sense to switch this for a call to WordToHtmlConverter, which is able to work with a HWPFOldDocument. Currently, we're calling WordExtractor, which in turn passes the HWPFOldDocument to WordToTextConverter, so we lose out on a bit of formatting.

> Use of Deprecated Word6Extractor.getParagraphText() Method
> ----------------------------------------------------------
>
>                 Key: TIKA-1318
>                 URL: https://issues.apache.org/jira/browse/TIKA-1318
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.5
>            Reporter: Tyler Palsulich
>            Priority: Minor
>              Labels: deprecation
>             Fix For: 1.6
>
>
> org.apache.tika.parser.microsoft.WordExtractor.parseWord6() uses the deprecated Word6Extractor.getParagraphText() method. getParagraphText() is supposed to return a String[] with one element for each paragraph in the text. The replacement is getText(), which leaves the separation of paragraphs, cells, etc. implementation-specific. I'm not sure, at this point, how the POI WordExtractor separates them.


