Posted to user@poi.apache.org by jetcat33 <ma...@live.com> on 2010/12/31 18:52:39 UTC

HWPF and XWPF: How to read newline?

I've created a program to read .doc and .docx text. I now want to search for
all newline characters (the ones created with shift+enter in Word) and
replace them with "<br>". For some reason, however, newline characters
aren't being read properly in HWPF and XWPF.

I use the following to read .doc:

WordExtractor wx = new WordExtractor(document);
String docText = wx.getText();

I use the following to read .docx:

XWPFWordExtractor wx = new XWPFWordExtractor(document);
String docxText = wx.getText();

Let's say I'm reading a Word document formatted as follows:

Bojo<br>the clown<p>Funny

(assume, instead of <br>, in Word I use the shift+enter line feed/new line,
and instead of <p>, in Word I use the regular enter carriage return)

Using HWPF, docText will print (using System.out.println): 
Bojo the clown
Funny

Using XWPF, docxText will print:
Bojothe clown
Funny

Notice how neither HWPF nor XWPF shows the "shift+enter" return, but both
reflect the normal "enter" return. Also notice that XWPF doesn't even show
an empty space for the "shift+enter" return, unlike HWPF, which at least
shows a whitespace character.

What is going on? Why can't I display the "shift+enter" character?


-- 
View this message in context: http://apache-poi.1045710.n5.nabble.com/HWPF-and-XWPF-How-to-read-newline-tp3323805p3323805.html
Sent from the POI - User mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org


Re: HWPF and XWPF: How to read newline?

Posted by Nick Burch <ni...@alfresco.com>.
On Fri, 31 Dec 2010, jetcat33 wrote:
> I've created a program to read .doc and .docx text. I now want to search for
> all newline characters (the ones created with shift+enter in Word) and
> replace them with "<br>". For some reason, however, newline characters
> aren't being read properly in HWPF and XWPF.

If you want to get an HTML version, then you'll probably want to use Apache
Tika. It uses POI internally, but returns an HTML version of the Word files
rather than just a plain text one.

Nick
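
For the .doc case, a plain string replacement may be enough. The binary Word
format stores a manual line break (shift+enter) as the vertical-tab character,
U+000B, and the "whitespace" printed by HWPF above is probably that character.
That is an assumption about this particular output, not something the thread
confirms. The sketch below uses only the JDK; the class name, method names,
and the sample string (a stand-in for the result of wx.getText()) are all
hypothetical. It does not help with the .docx output shown above, where the
extracted text contains nothing between "Bojo" and "the clown" to replace.

```java
public class LineBreakToHtml {

    // Assumption: the extractor leaves Word's manual line break in the text
    // as the vertical-tab character, U+000B.
    static final char MANUAL_LINE_BREAK = '\u000B';

    // Replace every manual line break with an HTML <br> tag.
    static String toHtml(String extractedText) {
        return extractedText.replace(String.valueOf(MANUAL_LINE_BREAK), "<br>");
    }

    // Make control characters visible, to check what the extractor returned.
    static String showControlChars(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x20) {
                sb.append(String.format("[U+%04X]", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for: String docText = wx.getText();
        String docText = "Bojo" + MANUAL_LINE_BREAK + "the clown\r\nFunny\r\n";

        System.out.println(showControlChars(docText));
        System.out.println(toHtml(docText));
    }
}
```

Running showControlChars on the real extractor output first is a cheap way to
confirm which character actually separates the two words before relying on
the replacement.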
