Posted to user@poi.apache.org by "Ahmed, Sana R (IS)" <Sa...@ngc.com> on 2009/08/24 22:01:07 UTC

POI WordExtractor Not Extracting Entire Document

Hi.
 
We are using POI 3.5 beta 6 in production to extract text from Office documents.  We came across a Word document that was not extracted completely: the extracted text is missing a couple of paragraphs from the middle of the document.
 
Here is a link to the document.  http://www.mediafire.com/?sharekey=2e6a7badb4ab32e07f7ec40ad
 
The following is the snippet of code we are using to extract the document.
 
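   // Read the Word document named by the first command-line argument
   WordExtractor we = new WordExtractor(new FileInputStream(args[0]));
   // Write the extracted text as UTF-8, converting "\n" to the platform line separator
   BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8"));
   bw.write(we.getText().replaceAll("\n", System.getProperty("line.separator")));
   bw.flush();
   bw.close();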
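 
To show exactly which paragraphs are dropped, something like the following could be used. It is only a rough sketch and does not call POI: it assumes a reference copy of the text is available (for example, the document saved as plain text from Word) and compares it paragraph by paragraph with the extracted output. The class and method names (MissingParagraphs, normalize, findMissing) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of POI: lists reference paragraphs
// that do not appear anywhere in the extracted text.
class MissingParagraphs {

    // Collapse runs of whitespace so spacing differences do not cause false mismatches.
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    // Returns the non-empty paragraphs of `reference`, in order, that are absent from `extracted`.
    static List<String> findMissing(String reference, String extracted) {
        Set<String> seen = new HashSet<String>();
        for (String p : extracted.split("\\r\\n|\\r|\\n")) {
            String n = normalize(p);
            if (n.length() > 0) {
                seen.add(n);
            }
        }
        List<String> missing = new ArrayList<String>();
        for (String p : reference.split("\\r\\n|\\r|\\n")) {
            String n = normalize(p);
            if (n.length() > 0 && !seen.contains(n)) {
                missing.add(n);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Sample strings for illustration only; in practice these would be the
        // reference text and the output of the extraction snippet above.
        String reference = "Intro\nMiddle one\nMiddle two\nEnd";
        String extracted = "Intro\r\nEnd\r\n";
        for (String p : findMissing(reference, extracted)) {
            System.out.println("MISSING: " + p);
        }
    }
}
```

The comparison is an exact match on whitespace-normalized paragraphs, so a paragraph that was extracted but split or merged differently would also be reported as missing.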
 
This is a major production problem, so please respond as soon as possible.  
 
Thanks!