Posted to users@pdfbox.apache.org by Shen Wang <fe...@gmail.com> on 2009/11/02 23:56:44 UTC
ExtractText.java doesn't work and a question about field charactersByArticle
in PDFTextStripper
Hi guys,
I am trying to use ExtractText.java to extract the text from a PDF file.
However, it only gives me an empty .txt file. I traced through the source
code and got confused by PDFTextStripper.
ExtractText.java calls stripper.writeText, which in turn calls the
processPage method. In processPage, it works with the
charactersByArticle field and my understanding is that it wants to put
the article information into the field charactersByArticle. However,
when it sets the value of charactersByArticle, it actually sets each
entry to an empty ArrayList
("charactersByArticle.set( i, new ArrayList() );"). This line seems to
be the only place where the field charactersByArticle is ever modified.
As a result, charactersByArticle is nothing but a vector of empty
ArrayLists. Then, when the writePage method is called, it iterates
through charactersByArticle and finds no text. This is my understanding
of why the ExtractText example doesn't work for me. Please let me know
if I have gotten something wrong or if you have any suggestions.
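
To make the flow I am describing concrete, here is a minimal self-contained
sketch. It is not the PDFBox source: the class ArticleBufferSketch and the
methods resetForPage and onText are hypothetical stand-ins, and only the
field name charactersByArticle and the method name writePage are taken from
PDFTextStripper. It shows that a vector reset to empty lists yields no text
unless something later adds to the lists through get(i).add(...), which
would not show up as an assignment to the field. Whether PDFTextStripper
does that is an assumption I have not confirmed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

public class ArticleBufferSketch {
    // Stand-in for the charactersByArticle field: one list of text chunks per article.
    private final Vector<List<String>> charactersByArticle = new Vector<List<String>>();

    // Mirrors the reset described above: every slot becomes a fresh empty list.
    public void resetForPage(int articleCount) {
        charactersByArticle.setSize(articleCount);
        for (int i = 0; i < articleCount; i++) {
            charactersByArticle.set(i, new ArrayList<String>());
        }
    }

    // Hypothetical callback: adds to a list inside the vector
    // without ever reassigning the field itself.
    public void onText(int articleIndex, String text) {
        charactersByArticle.get(articleIndex).add(text);
    }

    // Mirrors the writePage step: walk every article and emit what was collected.
    public String writePage() {
        StringBuilder out = new StringBuilder();
        for (List<String> article : charactersByArticle) {
            for (String chunk : article) {
                out.append(chunk);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        ArticleBufferSketch sketch = new ArticleBufferSketch();
        sketch.resetForPage(2);
        // Nothing has been collected yet, so the page is empty.
        System.out.println("[" + sketch.writePage() + "]");
        sketch.onText(0, "Hello ");
        sketch.onText(1, "world");
        // The lists were filled in place, so the text now appears.
        System.out.println("[" + sketch.writePage() + "]");
    }
}
```

If the real code behaves like the first half of main, that would explain
the empty .txt file. If it behaves like the second half, the problem is
more likely that no text is reaching the lists for my particular PDF.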
Thanks!
Felix