Posted to user@poi.apache.org by Fernando Bernardino <fg...@gmail.com> on 2006/08/30 20:55:14 UTC
Advanced Text Extraction
Hello, how are you people?
I need to extract text from Word, PPT, PPS and XLS documents. This is working
fine, but when POI finds an embedded image, graphic or other object, the
extracted string gets an EMBED "tag" appended. So far this happens with
both "WordExtractor" and "HWPFDocument".
My problem: I need to create an XML file with a summary of the text to show
in the result page (software structure), and the XML parser can't validate
these tags because of the strange characters. Is there a way to exclude
them from the text extraction?
Ex.:
TextExtraction:
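import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

// inputStream: an already-open InputStream on the .doc file
POIFSFileSystem fileSystem = new POIFSFileSystem(inputStream);
HWPFDocument document = new HWPFDocument(fileSystem);
Range range = document.getRange();
StringBuilder wordDocText = new StringBuilder();
for (int i = 0; i < range.numParagraphs(); i++)
{
    // Append the raw text of each paragraph
    Paragraph paragraph = range.getParagraph(i);
    wordDocText.append(paragraph.text());
}
System.out.println(wordDocText.toString());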
Result (the strange characters don't show in the email body...):
--> EMBED Word.Picture.8
Documento de Projeto
Manual do Usuário
Web Publication
--> EMBED CorelDraw.Graphic.9
StackTrace from the Parser:
Caused by: net.sf.saxon.trans.DynamicError: org.xml.sax.SAXParseException:
An invalid XML character (Unicode: 0x14) was found in the CDATA section.
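One possible workaround, sketched here under assumptions rather than as
anything POI itself provides: filter the extracted string before it goes
into the XML, keeping only the characters that XML 1.0 allows (tab, line
feed, carriage return, and the ranges 0x20-0xD7FF, 0xE000-0xFFFD and
0x10000-0x10FFFF). The class and method names below are hypothetical.

```java
// Hypothetical helper, not part of POI: drops characters that XML 1.0 forbids.
final class XmlTextCleaner {

    // True if the code point matches the XML 1.0 "Char" production.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of text without the code points XML 1.0 rejects,
    // such as the 0x14 reported by the parser.
    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

It could be applied to the extracted text as
XmlTextCleaner.stripInvalidXmlChars(wordDocText.toString()). This only
removes the control characters the parser rejects; the visible
"EMBED Word.Picture.8" text would remain and would need separate handling.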
Thanks people! Any help is useful,
--
Fernando Bernardino