Posted to user@poi.apache.org by Fernando Bernardino <fg...@gmail.com> on 2006/08/30 20:55:14 UTC

Advanced Text Extraction

Hello, how are you people?

I need to extract text from Word, PPT, PPS, and XLS documents. This is working
fine, but when POI finds an embedded image, graphic, or other object, an
EMBED "tag" is appended to the string. So far this happens with
"WordExtractor" and "HWPFDocument".

My problem: I need to create an XML file with a summary of the text to show
in the result page (software structure), and the XML parser can't validate
these tags because of the strange characters. Is there a way to exclude
them from the text extraction?

Ex.:
TextExtraction:
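            // Needs org.apache.poi.poifs.filesystem.POIFSFileSystem,
            // org.apache.poi.hwpf.HWPFDocument, and
            // org.apache.poi.hwpf.usermodel.Range / Paragraph.
            // inputStream is an already opened InputStream on the .doc file.
            StringBuffer wordDocText = new StringBuffer();
            POIFSFileSystem fileSystem = new POIFSFileSystem(inputStream);
            HWPFDocument document = new HWPFDocument(fileSystem);
            Range range = document.getRange();
            // Append the raw text of every paragraph, field codes included
            for (int i = 0; i < range.numParagraphs(); i++)
            {
                Paragraph paragraph = range.getParagraph(i);
                wordDocText.append(paragraph.text());
            }
            System.out.println(wordDocText.toString());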
Result (the strange characters don't show in the email body...):
   -->     EMBED Word.Picture.8  
            Documento de Projeto
            Manual do Usuário
            Web Publication
   -->     EMBED CorelDraw.Graphic.9  


StackTrace from the Parser:
Caused by: net.sf.saxon.trans.DynamicError: org.xml.sax.SAXParseException:
An invalid XML character (Unicode: 0x14) was found in the CDATA section.
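
For illustration, here is a rough sketch of a post-processing workaround that
does not depend on POI at all. It assumes that the strange characters are the
Word field markers 0x13 (field begin), 0x14 (field separator) and 0x15 (field
end), and that the EMBED text is a field code sitting between 0x13 and 0x14;
the parser error above only confirms the 0x14. The class and method names are
made up for this example, and nested fields are not handled.

```java
public class ExtractedTextCleaner {

    // Assumed Word field markers; only 0x14 is confirmed by the parser error.
    private static final char FIELD_BEGIN = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    /**
     * Drops the marker characters and everything between a field begin and
     * the next field separator (or field end). Nested fields are not handled.
     */
    public static String stripFieldCodes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        boolean inFieldCode = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                inFieldCode = true;
            } else if (c == FIELD_SEPARATOR || c == FIELD_END) {
                inFieldCode = false;
            } else if (!inFieldCode) {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Removes every code point that the XML 1.0 Char production does not allow. */
    public static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean valid = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (valid) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    /** Applies both steps; intended to run on the text before it goes into the XML. */
    public static String clean(String text) {
        return stripInvalidXmlChars(stripFieldCodes(text));
    }
}
```

With this, the extracted text would be passed through
ExtractedTextCleaner.clean(wordDocText.toString()) before being written into
the CDATA section. Running only stripInvalidXmlChars keeps the EMBED text but
should still make the XML parseable.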


Thanks people! Any help is useful,

--
Fernando Bernardino