Posted to dev@poi.apache.org by bu...@apache.org on 2015/07/02 12:39:52 UTC

[Bug 58093] Rework of getDocumentText() in HWPFDocument

https://bz.apache.org/bugzilla/show_bug.cgi?id=58093

Nick Burch <ap...@gagravarr.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |WONTFIX

--- Comment #1 from Nick Burch <ap...@gagravarr.org> ---
All sorts of control sequences and fields can come through in the text,
because the .doc format handles a lot of things that way.

If you don't want these and only want the text, use a utility method such as
https://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html#stripFields%28java.lang.String%29
to have them removed.
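
To illustrate what field stripping involves, here is a minimal standalone
sketch. It is not POI's implementation and may differ from it in edge cases.
It assumes that fields are delimited in the text by the control characters
0x13 (field begin), 0x14 (separator between the field code and its displayed
result) and 0x15 (field end). The class name FieldStripSketch is hypothetical.
With POI on the classpath you would call WordExtractor.stripFields(String),
the method linked above, rather than anything like this.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical illustration only; not part of Apache POI.
final class FieldStripSketch {
    // Assumed field delimiters in .doc text.
    static final char FIELD_BEGIN = (char) 0x13;
    static final char FIELD_SEPARATOR = (char) 0x14;
    static final char FIELD_END = (char) 0x15;

    // Drops field codes and the delimiter characters, keeps displayed results.
    static String stripFields(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: true while still inside its field code.
        Deque<Boolean> open = new ArrayDeque<>();
        int inCode = 0; // number of open fields that are still in their code part

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                open.push(Boolean.TRUE);
                inCode++;
            } else if (c == FIELD_SEPARATOR) {
                // Switch the innermost field from "code" to "result".
                if (!open.isEmpty() && open.peek()) {
                    open.pop();
                    open.push(Boolean.FALSE);
                    inCode--;
                }
            } else if (c == FIELD_END) {
                // A field with no separator has no displayed result at all.
                if (!open.isEmpty() && open.pop()) {
                    inCode--;
                }
            } else if (inCode == 0) {
                // Plain text, or result text whose enclosing fields all show results.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "Page \u0013 PAGE \u00141\u0015 of 3";
        System.out.println(stripFields(raw)); // prints "Page 1 of 3"
    }
}
```

The stack is there because fields can nest: text is emitted only when no
enclosing field is still in its code part.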

Note that the javadocs for
https://poi.apache.org/apidocs/org/apache/poi/hwpf/HWPFDocumentCore.html#getDocumentText%28%29
explicitly state that the fields are included in the returned text. Other
methods (e.g. via WordExtractor, or Apache Tika) are provided to give
content text only, for those who want it.

