Posted to user@poi.apache.org by ma...@usal.es on 2006/08/15 02:37:09 UTC
Problem Extracting Text from MS Word
Hi all! I extract the text from MS Word documents using this code:
HWPFDocument wdoc = new HWPFDocument(stream);
Range r = wdoc.getRange();
for (int x = 0; x < r.numSections(); x++) {
    Section s = r.getSection(x);
    for (int y = 0; y < s.numParagraphs(); y++) {
        Paragraph p = s.getParagraph(y);
        for (int z = 0; z < p.numCharacterRuns(); z++) {
            // character run
            CharacterRun run = p.getCharacterRun(z);
            // character run text
            String text = run.text();
            byte[] b1 = text.getBytes();
            // show us the text
            output.write(b1);
        }
    }
}
output.close();
stream.close();
The problem is that I also get text from MS Word's internal information,
for example hyperlink fields like this:
"4.1- Introducción PAGEREF _Toc142772733 \h 31
HYPERLINK \l "_Toc142772734" 4.2- Apple webobjects PAGEREF _Toc142772734
\h 32"
Can you suggest a solution?
Thanks in advance.
---------------------------------------------------------------------
To unsubscribe, e-mail: poi-user-unsubscribe@jakarta.apache.org
Mailing List: http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project: http://jakarta.apache.org/poi/
Re: Problem Extracting Text from MS Word
Posted by Nick Burch <ni...@torchbox.com>.
On Tue, 15 Aug 2006, manumohedano@usal.es wrote:
> The problem is I also get text from internal information of MSWord, for
> example, the hyperlinks like this:
>
> "4.1- Introducción PAGEREF _Toc142772733 \h 31
> HYPERLINK \l "_Toc142772734" 4.2- Apple webobjects PAGEREF _Toc142772734
> \h 32"
>
> Can you give me any solution??
Alas not really. It looks like these are stored in character runs, so
they're being returned when you ask a paragraph for its runs.
You could try looking at the range type, and see if these problem runs
have a different type you can exclude. Otherwise, patches to make hwpf
behave better are always appreciated :)
Nick
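
One possible workaround, sketched here under some assumptions: in the binary
Word format, a field such as HYPERLINK or PAGEREF is delimited by three control
characters: 0x13 (field begin), 0x14 (separator between the field code and its
visible result) and 0x15 (field end). If those marks survive in the extracted
text, the field codes can be dropped after extraction while keeping the visible
result. The class and method names below (FieldCodeStripper, stripFieldCodes)
are hypothetical helpers, not part of POI, and the sketch assumes it is given
the whole text at once (for example from r.text()) rather than one character
run at a time, because a field usually spans several runs.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not a POI class: removes Word field codes from text.
public class FieldCodeStripper {
    private static final char FIELD_BEGIN = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    public static String stripFieldCodes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: TRUE while inside its code part,
        // FALSE once its separator has been seen (result part).
        Deque<Boolean> openFields = new ArrayDeque<>();
        int codeLevels = 0; // number of open fields still in their code part

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                openFields.push(Boolean.TRUE);
                codeLevels++;
            } else if (c == FIELD_SEPARATOR || c == FIELD_END) {
                // Stray marks with no open field are simply dropped.
                if (!openFields.isEmpty()) {
                    if (openFields.pop()) {
                        codeLevels--;
                    }
                    if (c == FIELD_SEPARATOR) {
                        openFields.push(Boolean.FALSE);
                    }
                }
            } else if (codeLevels == 0) {
                // Keep text only when no enclosing field is in its code part.
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this, the extraction loop could be replaced by something like
output.write(FieldCodeStripper.stripFieldCodes(r.text()).getBytes());
whether the marks are actually present depends on how the text was extracted,
so check a sample document first.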