Posted to dev@pdfbox.apache.org by Kai Dietrich <ma...@cleeus.de> on 2009/02/27 10:29:31 UTC
pdfextracttextbatch
Hello list,
First of all, thank you all for pdfbox. I'm currently using it to extract the
text from a huge collection (40,000+) of PDF files, and it works pretty well
(apart from failing on some strange encodings, broken headers and the like).
Then again, here comes the problem: My poor little box is running at 100% CPU
all night with a "find . -name '*.pdf' | xargs pdfextracttext" and I'm not
even at 1 file/second. Most of the CPU load probably comes from starting and
stopping the Java VM - which is a huge waste of time and energy. So what
would be quite helpful is a tool that has the "for file in files" loop
inside the VM. So, here comes the tool :) It's an add-on to
org.apache.pdfbox.ExtractText -- it just removes the output-file parameter
and treats every given argument as an input PDF file.
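
To make the idea concrete, here is a minimal sketch of such an in-VM batch loop. It is an illustration only, not the attached tool: the class name ExtractTextBatch, the TextExtractor interface and the "same name with .txt" output convention are all assumptions, and the actual pdfbox call is deliberately left as a pluggable stand-in so that the sketch compiles with the JDK alone.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ExtractTextBatch {

    /** Hypothetical stand-in for the real PDF-to-text call (e.g. a pdfbox wrapper). */
    public interface TextExtractor {
        String extract(Path pdf) throws Exception;
    }

    /** Assumed convention: foo.pdf -> foo.txt in the same directory. */
    public static Path outputFor(Path pdf) {
        String name = pdf.getFileName().toString();
        String base = name.toLowerCase(Locale.ROOT).endsWith(".pdf")
                ? name.substring(0, name.length() - 4)
                : name;
        return pdf.resolveSibling(base + ".txt");
    }

    /**
     * Processes every input inside one VM. A failure on one file is logged
     * and recorded, and the loop moves on to the next file.
     * Returns the list of inputs that could not be processed.
     */
    public static List<Path> run(List<Path> inputs, TextExtractor extractor) {
        List<Path> failed = new ArrayList<>();
        for (Path pdf : inputs) {
            try {
                String text = extractor.extract(pdf);
                Files.write(outputFor(pdf), text.getBytes(StandardCharsets.UTF_8));
            } catch (Exception e) {
                System.err.println("Skipping " + pdf + ": " + e);
                failed.add(pdf);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<Path> inputs = new ArrayList<>();
        for (String arg : args) {
            inputs.add(Paths.get(arg));
        }
        // Placeholder: plug in a real extractor here; this one always fails.
        TextExtractor placeholder = pdf -> {
            throw new UnsupportedOperationException("no PDF extractor configured");
        };
        List<Path> failed = run(inputs, placeholder);
        System.err.println(failed.size() + " of " + inputs.size() + " file(s) failed");
    }
}
```

With a real extractor plugged in, something like "find . -name '*.pdf' -print0 | xargs -0 java ExtractTextBatch" would start the VM once per large batch of files instead of once per file.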
The problem is, I can't get pdfbox 0.8.0 to work well because of some
incompatibility with the fontbox package from my distro (Gentoo). So I think
I'll just torture my little box a bit longer until the find-xargs-extracttext
job is done. But I don't want to waste the idea and the code, so maybe
someone who has a working developer setup could test it and improve the
exception handling.
Greetings
Kai