Posted to user@poi.apache.org by Lawrence Tsang <tk...@gmail.com> on 2012/02/02 15:41:41 UTC

Parsing "hyperlinks", "equations" and "graphs" in MS Word 2003 document

Hi,

     I am new to Apache POI and am using the
"org.apache.poi.hwpf.Word2Forrest" class to extract text from an MS Word 2003
document. The Word document contains text as well as hyperlinks, equations
and graphs. The normal text is extracted correctly. However, when a hyperlink
is extracted, it looks something like this:

extracted hyperlink :
<p>“Java Native Access (JNA)” --- call DLL functions from Java,  HYPERLINK
"https://jna.dev.java.net/" https://jna.dev.java.net/.
</p>

original hyperlink :
“Java Native Access (JNA)” --- call DLL functions from Java,
https://jna.dev.java.net/.

     The hyperlink address is duplicated in the extracted text. Moreover,
when equations are extracted, something like "EMBED Equation.3" is
displayed in the extracted text. Furthermore, when graphs are extracted,
nothing is displayed in the extracted text.
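
     The stray 'HYPERLINK "..."' and "EMBED Equation.3" strings look like
Word field codes (the instruction part of a field) leaking into the output
next to the visible text. As a rough sketch only, assuming the text has
already been extracted into a String, they could be filtered afterwards with
plain regular expressions. The class name FieldCodeCleaner and the method
clean() are hypothetical names made up for this illustration and are not part
of Apache POI; the patterns only cover the two cases seen above.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Apache POI: removes two kinds of
// Word field-code residue from text that has already been extracted.
final class FieldCodeCleaner {

    // Matches the instruction part of a hyperlink field: HYPERLINK "target"
    // plus trailing whitespace. The visible link text that follows is kept.
    private static final Pattern HYPERLINK_CODE =
            Pattern.compile("HYPERLINK\\s+\"[^\"]*\"\\s*");

    // Matches an embedded-object placeholder such as "EMBED Equation.3".
    // Caveat: this would also remove ordinary prose that happens to contain
    // the word EMBED followed by another token.
    private static final Pattern EMBED_CODE =
            Pattern.compile("EMBED\\s+\\S+\\s*");

    static String clean(String extracted) {
        String withoutLinks = HYPERLINK_CODE.matcher(extracted).replaceAll("");
        return EMBED_CODE.matcher(withoutLinks).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "call DLL functions from Java, HYPERLINK "
                + "\"https://jna.dev.java.net/\" https://jna.dev.java.net/.";
        System.out.println(clean(sample));
    }
}
```

     This only hides the residue after the fact. It does not recover the
equation or graph content itself.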

     I would like to know whether this is the best Apache POI can do when
parsing an MS Word document. Is there some configuration we could change so
that Apache POI handles "hyperlinks", "equations" and "graphs" in a better
way?

     Thanks for any suggestions.

Lawrence

Re: Parsing "hyperlinks", "equations" and "graphs" in MS Word 2003 document

Posted by Lawrence Tsang <tk...@gmail.com>.
WordExtractor works. Thanks Nick.

On Thu, Feb 2, 2012 at 10:50 PM, Nick Burch <ni...@alfresco.com> wrote:

> On Thu, 2 Feb 2012, Lawrence Tsang wrote:
>
>> As a newbie of Apache POI, I use the "org.apache.poi.hwpf.Word2Forrest"
>> class to extract text in a MS Word 2003 document.
>>
>
> I wouldn't recommend using that class for text extraction, unless you
> really need it to come out in the Forrest format
>
> Instead, you should use one of:
>  * org.apache.poi.hwpf.extractor.WordExtractor
>  * org.apache.poi.hwpf.converter.WordToTextConverter (or HTML or Fo)
>  * Apache Tika
>
> Which one depends on whether you want plain text, clean HTML, HTML with
> full document styling, etc.
>
> Nick
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
> For additional commands, e-mail: user-help@poi.apache.org
>
>

Re: Parsing "hyperlinks", "equations" and "graphs" in MS Word 2003 document

Posted by Nick Burch <ni...@alfresco.com>.
On Thu, 2 Feb 2012, Lawrence Tsang wrote:
> As a newbie of Apache POI, I use the "org.apache.poi.hwpf.Word2Forrest" 
> class to extract text in a MS Word 2003 document.

I wouldn't recommend using that class for text extraction, unless you 
really need it to come out in the Forrest format

Instead, you should use one of:
  * org.apache.poi.hwpf.extractor.WordExtractor
  * org.apache.poi.hwpf.converter.WordToTextConverter (or HTML or Fo)
  * Apache Tika

Which one depends on whether you want plain text, clean HTML, HTML with
full document styling, etc.

Nick
