Posted to users@pdfbox.apache.org by David Green <da...@davidgreen.co.uk> on 2016/05/01 03:06:55 UTC

Re: is it possible to batch extract text from pdf files within a tree of folders within a zip file ?

Sorry for using the wrong forum.
Is there a Tika forum?

Your suggested command is working, after a fashion:
java -jar c:\jars\tika-app-1.12.jar -J -t -i f: -o g:
The directory structure is being reproduced, but the zip files seem to be
copied as zip files (I think).
The copied files retain the original filename (including the original zip
extension) with an additional json extension.
However, when I try to open one of them with B1 file archiver, it reports a
corrupt file.

Re: is it possible to batch extract text from pdf files within a tree of folders within a zip file ?

Posted by Tilman Hausherr <TH...@t-online.de>.
Am 01.05.2016 um 03:06 schrieb David Green:
> sorry for using wrong forum
> is there a tika forum ?

https://mail-archives.apache.org/mod_mbox/tika-user/




RE: is it possible to batch extract text from pdf files within a tree of folders within a zip file ?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
The command line I gave you outputs JSON files, not zip archives. If you open them in a text or JSON editor, you should see valid data. If they're corrupt, please let us know!

If you're able to process JSON files, you should be good to go. Otherwise, the recommendation to use Java's ZipFile API and do the unzipping yourself is probably the best option.

In Tika, we do have a -z option to extract embedded files, but that only extracts the first level of documents and it doesn't reproduce the original file structure. If you have zips within zips, you won't get the content.
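
For the "do the unzipping yourself" route, here is a minimal sketch of what that could look like. It is not taken from Tika or PDFBox; the class name ZipPdfExtractor and the method extractPdfs are made up for illustration. It uses ZipInputStream from java.util.zip rather than ZipFile, because a stream can be wrapped around a nested zip entry without writing a temporary file, which covers the zips-within-zips case. It assumes that PDFs can be recognised by a .pdf extension and that nested archives end in .zip. It only copies the PDFs out, keeping their folder paths; extracting the text from each copied file would be a separate step.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, for illustration only.
public class ZipPdfExtractor {

    /**
     * Copies every .pdf entry of the zip stream into outDir, keeping the
     * entry's folder path, and recurses into nested .zip entries.
     * Returns the paths of the files that were written.
     */
    public static List<Path> extractPdfs(InputStream in, Path outDir) throws IOException {
        List<Path> written = new ArrayList<>();
        Path root = outDir.toAbsolutePath().normalize();
        // Not closed here: for a nested zip this wraps the parent's stream,
        // which the caller still needs.
        ZipInputStream zin = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zin.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            Path target = root.resolve(entry.getName()).normalize();
            // Reject entry names such as "../../x" that would land outside outDir.
            if (!target.startsWith(root)) {
                throw new IOException("Entry escapes output directory: " + entry.getName());
            }
            String name = entry.getName().toLowerCase();
            if (name.endsWith(".pdf")) {
                Files.createDirectories(target.getParent());
                // Reads up to the end of the current entry; does not close zin.
                Files.copy(zin, target, StandardCopyOption.REPLACE_EXISTING);
                written.add(target);
            } else if (name.endsWith(".zip")) {
                // Nested archive: unpack into a folder named after the inner zip.
                written.addAll(extractPdfs(zin, target));
            }
        }
        return written;
    }

    // Usage: java ZipPdfExtractor <input.zip> <output-dir>
    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (Path p : extractPdfs(in, Paths.get(args[1]))) {
                System.out.println(p);
            }
        }
    }
}
```

To cover a whole tree of folders, the same method could be called for each zip file found while walking the input directory, with a matching output directory per zip.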

 