
Posted to user@poi.apache.org by pof <Me...@gmail.com> on 2009/06/11 08:46:51 UTC

docx parse example

Hi, I was wondering if someone could provide an example of how to extract the
plain text from a docx file using POI 3.5 beta5?

Cheers, Brett.
-- 
View this message in context: http://www.nabble.com/docx-parse-example-tp23976192p23976192.html
Sent from the POI - User mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@poi.apache.org
For additional commands, e-mail: user-help@poi.apache.org

Re: docx parse example

Posted by MSB <ma...@tiscali.co.uk>.

I hope this information is still up to date; please forgive me if it is not.

As far as I am aware, you cannot currently use HWPF to parse docx files. I
paid a quick visit to the project page at
http://poi.apache.org/hwpf/index.html and found this:

"HWPF is the name of our port of the Microsoft Word 97(-2007) file format to
pure Java. It does not support the new Word 2007 .docx file format, which is
not OLE2 based."

You do have options, however. One that I have looked at but never used in
anger is docx4j. The project's website is:

http://dev.plutext.org/blog/category/docx4j/

Another would be to use the UNO interface to drive the OpenOffice
application, whilst a third would be to write your own parser; the docx file
format is zipped XML after all, and if all you want is the raw text, this
option may be worth looking into.
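
To see the "zipped XML" structure for yourself, a minimal sketch along these
lines lists the entries inside a docx using only java.util.zip from the JDK.
The class name DocxEntryLister and the method listEntries are made up for
illustration; nothing here comes from POI or docx4j.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: lists the parts stored inside a docx (zip) archive.
class DocxEntryLister {

    static List<String> listEntries(String path) throws IOException {
        List<String> names = new ArrayList<>();
        // A docx is an ordinary zip archive, so ZipFile can open it directly.
        try (ZipFile zip = new ZipFile(path)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DocxEntryLister <file.docx>");
            return;
        }
        for (String name : listEntries(args[0])) {
            System.out.println(name);
        }
    }
}
```

For a typical Word document, the main body text should be in the entry named
word/document.xml.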

Yours

Mark B



Re: docx parse example

Posted by mstocker <ma...@gmail.com>.


I discovered it's fairly easy to get all (or at least most) of the text from
a DOCX with basic Java libraries. A docx file is just a zip file with a bunch
of XML files in it.

I posted an example of this on my blog at
http://www.maxstocker.com/blog.php?en=c6270d6e2bde17ae8c6f9659b3b863773

The basic steps are:

1) Open the docx as a ZipFile.
2) Get the XML file as the ZipEntry "word/document.xml".
3) Parse the XML document and collect all elements named "w:t".
4) Extract the text content from those elements.
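
The four steps above can be sketched with the JDK alone, roughly as follows.
This is not the code from the blog post, just a minimal sketch; the class
name DocxTextSketch and the method extractText are hypothetical. It assumes
the "w" prefix is bound to the standard WordprocessingML namespace, and it
concatenates the text runs without restoring paragraph breaks.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: pulls the raw text out of a docx without POI.
class DocxTextSketch {

    // Namespace that the "w:" prefix normally refers to in document.xml.
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static String extractText(String path) throws Exception {
        // Step 1: open the docx as a zip archive.
        try (ZipFile zip = new ZipFile(path)) {
            // Step 2: locate the main document part.
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found in " + path);
            }
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the namespace lookup below
            try (InputStream in = zip.getInputStream(entry)) {
                // Step 3: parse the XML and collect every w:t element.
                Document doc = factory.newDocumentBuilder().parse(in);
                NodeList runs = doc.getElementsByTagNameNS(W_NS, "t");
                // Step 4: concatenate the text content of those elements.
                StringBuilder text = new StringBuilder();
                for (int i = 0; i < runs.getLength(); i++) {
                    text.append(runs.item(i).getTextContent());
                }
                return text.toString();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java DocxTextSketch <file.docx>");
            return;
        }
        System.out.println(extractText(args[0]));
    }
}
```

Matching on the namespace rather than the literal string "w:t" is a design
choice here: it keeps working if a file happens to use a different prefix.
Text stored in other parts of the archive (headers, footers, footnotes) is
not picked up, which is probably why this gets "most" rather than all of the
text.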






Re: docx parse example

Posted by MSB <ma...@tiscali.co.uk>.

I was wrong.

If you download the current build of version 3.5, the poi-ooxml-3.5.beta6
archive contains a package called xwpf. I guess it holds the classes you can
use to parse docx files. I must admit that I do not know how to use them, but
it could be worth digging around a bit.

Yours

Mark B

