Posted to user@poi.apache.org by ma...@usal.es on 2006/08/15 02:37:09 UTC

Problem Extracting Text from MS Word

Hi all! I extract the text from MS Word documents using this code:

// Assumes `stream` is an open InputStream on the .doc file and
// `output` is an open OutputStream for the extracted text.
HWPFDocument wdoc = new HWPFDocument(stream);
Range r = wdoc.getRange();

for (int x = 0; x < r.numSections(); x++) {
    Section s = r.getSection(x);
    for (int y = 0; y < s.numParagraphs(); y++) {
        Paragraph p = s.getParagraph(y);
        for (int z = 0; z < p.numCharacterRuns(); z++) {
            // Each character run is a span of text with uniform formatting
            CharacterRun run = p.getCharacterRun(z);
            String text = run.text();

            // Write the run's text (uses the platform default encoding)
            output.write(text.getBytes());
        }
    }
}
output.close();
stream.close();

The problem is I also get text from internal information of MSWord, for
example, the hyperlinks like this:

   "4.1- Introducción PAGEREF _Toc142772733 \h 31
HYPERLINK \l "_Toc142772734" 4.2- Apple webobjects PAGEREF _Toc142772734
\h 32"


Can you give me any solution??

Thanks in advance.

---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/


Re: Problem Extracting Text from MS Word

Posted by Nick Burch <ni...@torchbox.com>.
On Tue, 15 Aug 2006, manumohedano@usal.es wrote:
> The problem is I also get text from internal information of MSWord, for
> example, the hyperlinks like this:
>
>   "4.1- Introducción PAGEREF _Toc142772733 \h 31
> HYPERLINK \l "_Toc142772734" 4.2- Apple webobjects PAGEREF _Toc142772734
> \h 32"
>
> Can you give me any solution??

Alas not really. It looks like these are stored in character runs, so 
they're being returned when you ask a paragraph for its runs.

You could try looking at the range type, and see if these problem runs 
have a different type you can exclude. Otherwise, patches to make hwpf
behave better are always appreciated :)

Nick
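
One possible workaround, sketched here as an illustration rather than as anything hwpf provides: in the Word binary format a field is delimited by three control characters, 0x13 (field begin), 0x14 (field separator) and 0x15 (field end). The instruction (for example HYPERLINK \l "_Toc142772734" or PAGEREF _Toc142772733 \h) sits between 0x13 and 0x14, and the visible result sits between 0x14 and 0x15. Assuming those markers are still present in the text that hwpf returns, the extracted text can be post-processed to drop the instructions and keep the results. The class and method names below are hypothetical and the code does not use POI at all.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper: removes Word field instructions from extracted text.
class FieldCodeStripper {
    private static final char FIELD_BEGIN = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    static String stripFieldCodes(String text) {
        StringBuilder visible = new StringBuilder(text.length());
        // One entry per open field: true while we are inside its instruction part.
        Deque<Boolean> inInstruction = new ArrayDeque<>();
        // Number of open fields that are currently in their instruction part.
        int hidden = 0;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case FIELD_BEGIN:
                    inInstruction.push(Boolean.TRUE);
                    hidden++;
                    break;
                case FIELD_SEPARATOR:
                    // Switch the innermost field from instruction to result.
                    if (!inInstruction.isEmpty() && inInstruction.peek()) {
                        inInstruction.pop();
                        inInstruction.push(Boolean.FALSE);
                        hidden--;
                    }
                    break;
                case FIELD_END:
                    // A field with no separator ends while still in its instruction.
                    if (!inInstruction.isEmpty() && inInstruction.pop()) {
                        hidden--;
                    }
                    break;
                default:
                    // Keep the character only if no enclosing field hides it.
                    if (hidden == 0) {
                        visible.append(c);
                    }
            }
        }
        return visible.toString();
    }

    public static void main(String[] args) {
        String sample = "4.2- Apple webobjects "
                + "\u0013 PAGEREF _Toc142772734 \\h \u001432\u0015";
        System.out.println(stripFieldCodes(sample));
    }
}
```

Because a field can start in one character run and end in another, it is probably safer to apply this to the text of a whole paragraph (or the whole range) than to each run separately. If the markers turn out not to be present in the returned text, this approach will not help and the runs would have to be filtered some other way.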