Posted to users@pdfbox.apache.org by Shen Wang <fe...@gmail.com> on 2009/11/02 23:56:44 UTC

ExtractText.java doesn't work and a question about field charactersByArticle in PDFTextStripper

Hi guys,

I am trying to use ExtractText.java to extract the text from a PDF file.
However, it only gives me an empty .txt file. I traced through the
source code and got confused by PDFTextStripper.

ExtractText.java calls stripper.writeText, which in turn calls
the processPage method. In that method, the code works with the
charactersByArticle field, and my understanding is that it is meant to
store each article's text in that field. However, when it assigns the
field's elements, it sets each one to an empty ArrayList
("charactersByArticle.set( i, new ArrayList() );"). This line seems to
be the only place where charactersByArticle is ever modified. As a
result, charactersByArticle is nothing but a vector of empty ArrayLists.
Then, when the writePage method is called, it iterates through
charactersByArticle and finds no text. This is my understanding of why
the ExtractText example doesn't work for me. Please let me know if I
got something wrong or if you have any suggestions.
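
One thing I may be overlooking: a list that is reset to empty can still
be filled later through get(i).add(...), which would not show up as
another set(...) call. Below is a minimal sketch of that possibility. It
is not the PDFBox source; the class ArticleBufferSketch and the methods
startPage, onText, and extract are hypothetical names I made up, and
whether PDFTextStripper actually fills its lists this way is an
assumption on my part.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the pattern in question; NOT PDFBox code.
public class ArticleBufferSketch {
    // One inner list per article, mirroring the shape of charactersByArticle.
    private final List<List<String>> charactersByArticle = new ArrayList<>();

    // Reset step: every article slot becomes an empty list at page start.
    void startPage(int articleCount) {
        charactersByArticle.clear();
        for (int i = 0; i < articleCount; i++) {
            charactersByArticle.add(new ArrayList<>());
        }
    }

    // Callback step (assumed): text is appended to an existing inner list,
    // so the outer list is never touched by set(...) again.
    void onText(int articleIndex, String text) {
        charactersByArticle.get(articleIndex).add(text);
    }

    // Output step: iterate over the articles and concatenate their text.
    String writePage() {
        StringBuilder out = new StringBuilder();
        for (List<String> article : charactersByArticle) {
            for (String text : article) {
                out.append(text);
            }
        }
        return out.toString();
    }

    // Runs reset, callbacks, and output for a single one-article page.
    static String extract(String... fragments) {
        ArticleBufferSketch sketch = new ArticleBufferSketch();
        sketch.startPage(1);
        for (String fragment : fragments) {
            sketch.onText(0, fragment);
        }
        return sketch.writePage();
    }

    public static void main(String[] args) {
        System.out.println("[" + extract("Hel", "lo") + "]"); // callbacks fired
        System.out.println("[" + extract() + "]");            // no callbacks fired
    }
}
```

If the real code follows this pattern, an empty output would mean the
text callback never fired for my file, rather than that the lists can
never be filled.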

Thanks!

Felix