Posted to dev@pdfbox.apache.org by Kai Dietrich <ma...@cleeus.de> on 2009/02/27 10:29:31 UTC

pdfextracttextbatch

Hello list,

first of all, thank you all for pdfbox. I'm currently using it to extract the 
text from a huge collection (40,000+) of PDF files and it works pretty well 
(apart from failing on some strange encodings, broken headers and the like). 
Now here comes the problem: my poor little box has been running at 100% CPU 
all night with a "find . -name '*.pdf' | xargs pdfextracttext" and I'm not 
even at 1 file/second. Most of the CPU load probably comes from starting and 
stopping the Java VM, which is a huge waste of time and energy. So what 
would be quite helpful is a tool which has the "for file in files" loop 
inside the VM. So, here is the tool :) It's an add-on to 
org.apache.pdfbox.ExtractText: it just removes the output-file parameter 
and treats all given files as input PDF files.
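
To make the idea concrete, here is a minimal sketch of such a loop, not the 
attached tool itself. The class name ExtractTextBatch, the TextExtractor 
hook and the rule "foo.pdf becomes foo.txt next to the input" are all 
assumptions made for illustration; the actual PDFBox call is left out, so 
the sketch compiles without pdfbox on the classpath.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class ExtractTextBatch {

    /** Hypothetical hook: turns one PDF file into its plain text. */
    interface TextExtractor {
        String extract(Path pdf) throws Exception;
    }

    /** foo.pdf -> foo.txt; names without a .pdf suffix just get .txt appended. */
    static Path outputFor(Path pdf) {
        String name = pdf.getFileName().toString();
        // Case-insensitive check for a trailing ".pdf".
        boolean isPdf = name.regionMatches(true, name.length() - 4, ".pdf", 0, 4);
        String base = isPdf ? name.substring(0, name.length() - 4) : name;
        return pdf.resolveSibling(base + ".txt");
    }

    /** Processes every input inside one VM and returns the files that failed. */
    static List<Path> run(List<Path> pdfs, TextExtractor extractor) {
        List<Path> failed = new ArrayList<>();
        for (Path pdf : pdfs) {
            try {
                String text = extractor.extract(pdf);
                Files.write(outputFor(pdf), text.getBytes(StandardCharsets.UTF_8));
            } catch (Exception e) {
                // One broken PDF must not abort the whole batch.
                System.err.println("Skipping " + pdf + ": " + e);
                failed.add(pdf);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<Path> pdfs = new ArrayList<>();
        for (String arg : args) {
            pdfs.add(Paths.get(arg));
        }
        // Placeholder only: a real tool would do the PDFBox extraction here.
        TextExtractor placeholder = pdf -> {
            throw new UnsupportedOperationException("plug in PDFBox text extraction");
        };
        List<Path> failed = run(pdfs, placeholder);
        System.exit(failed.isEmpty() ? 0 : 1);
    }
}
```

With a real extractor plugged in, something like 
"find . -name '*.pdf' -print0 | xargs -0 java ExtractTextBatch" would then 
start one VM per xargs batch instead of one per file. Catching the exception 
per file matters here because a single PDF with a broken header would 
otherwise end the whole run.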

The problem is, I can't get pdfbox 0.8.0 to work well because of some 
incompatibility with the fontbox package from my distro (Gentoo). And I think 
I'll just torture my little box a bit longer until the find-xargs-extracttext 
job is done. But I don't want to waste the idea and the code, so maybe 
someone who has a working developer setup could test it and improve the 
exception handling.

Greetings

Kai