Posted to users@pdfbox.apache.org by David Green <da...@davidgreen.co.uk> on 2016/04/20 21:51:27 UTC

is it possible to batch extract text from pdf files within a tree of folders within a zip file ?

... and save the text files in the same tree structure on another drive?
This seems like a big ask.

-- 
Regards
David

Re: is it possible to batch extract text from pdf files within a tree of folders within a zip file ?

Posted by Tilman Hausherr <TH...@t-online.de>.
Am 20.04.2016 um 21:51 schrieb David Green:
> . . . and save the text files in the same tree structure on another drive ?

Sure, but this is not a PDFBox problem; it is about iterating through
a ZIP file. Read about ZipInputStream and ZipEntry.
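
As a rough illustration of that approach, here is a minimal sketch using only java.util.zip and java.nio.file. The class name ZipPdfTextExtractor, the TextExtractor interface and the extractAll method are hypothetical names made up for this example, not part of PDFBox. The PDF-to-text step is passed in as a callback so the ZIP handling stands on its own; the comment shows how PDFBox 2.x could plausibly be plugged in, which is an assumption and has not been tested here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Hypothetical helper (not part of PDFBox): mirrors the PDFs inside a ZIP as .txt files. */
class ZipPdfTextExtractor {

    /** Pluggable PDF-to-text step, so the ZIP walking stays independent of PDFBox. */
    interface TextExtractor {
        String extract(byte[] pdfBytes) throws IOException;
    }

    // Assumption: with PDFBox 2.x on the classpath, the extractor could look like
    //   bytes -> {
    //       try (PDDocument doc = PDDocument.load(bytes)) {
    //           return new PDFTextStripper().getText(doc);
    //       }
    //   }

    /** Walks the ZIP once and returns the number of text files written below outputRoot. */
    static int extractAll(Path zipFile, Path outputRoot, TextExtractor extractor) throws IOException {
        Path root = outputRoot.toAbsolutePath().normalize();
        int written = 0;
        try (ZipInputStream zip = new ZipInputStream(Files.newInputStream(zipFile))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                // Skip folders and anything that is not a PDF.
                if (entry.isDirectory() || !name.toLowerCase(Locale.ROOT).endsWith(".pdf")) {
                    continue;
                }
                // Same relative path as inside the ZIP, with .pdf swapped for .txt.
                String textName = name.substring(0, name.length() - 4) + ".txt";
                Path target = root.resolve(textName).normalize();
                // Refuse entry names such as "../x.pdf" that would land outside outputRoot.
                if (!target.startsWith(root)) {
                    throw new IOException("Entry escapes the output folder: " + name);
                }
                Files.createDirectories(target.getParent());
                String text = extractor.extract(readCurrentEntry(zip));
                Files.write(target, text.getBytes(StandardCharsets.UTF_8));
                written++;
            }
        }
        return written;
    }

    /** Reads the bytes of the current entry without closing the ZIP stream. */
    private static byte[] readCurrentEntry(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }
}
```

Because the output folder is just a Path, it can point at another drive. Each entry is buffered in memory before extraction, which keeps the sketch short but may not suit very large PDFs.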

Tilman

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


RE: is it possible to batch extract text from pdf files within a tree of folders within a zip file ?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
You might want to look at Apache Tika (which uses PDFBox) for that.

Let's say you have an <inputdir> that contains your zips.

java -jar tika-app.jar -J -t -i <inputdir> -o <outputdir>

See if that gets you close enough.


Re: is it possible to batch extract text from pdf files within a tree of folders within a zip file ?

Posted by Branden Visser <mr...@gmail.com>.
PDFBox can extract the text from the PDF files for you. However,
unpacking the zip file, locating the PDF documents, saving in a
different format and rezipping is, I believe, something you'll have to
handle with other libraries like commons-compress [1].

Hope that helps.

Branden

[1] https://commons.apache.org/proper/commons-compress/

