Posted to user@tika.apache.org by Jake Burns <ja...@threatwave.com> on 2018/08/27 19:55:48 UTC

Can't use recursive parsing.

I'm trying to parse a directory full of .eml files (and many have
attachments). Even though I use -J, I'm not seeing the results of OCR on
the attachments. I'm also not seeing anything extracted from PDFs. Finally,
tika-app is not recognizing a bunch of command flags.

I'm running Ubuntu 18.04 and have openjdk-8 (1.181) installed with the
latest maven (3.5.4).
I've also got the libtesseract-dev and tesseract-ocr-all packages
installed on my machine.

I downloaded Tika 1.18 and ran mvn clean install.  The build completes fine
and I see the tika-app jar at ~/tika-1.18/tika-app/target/tika-app-1.18.jar

I am able to run java -jar ~pathto/tika-app-1.18.jar -J -i
/mydirectoryoffiles/ -o /mytikaoutput/ and it works alright.

I am not able to pass any other flags to Tika, though, for example -r.
I'm not able to pass -z to extract attachments either.

I get stuff like this:
"INFO  about to start driver
BatchProcess:No config file set via -bc, relying on
tika-app-batch-config.xml or default-tika-batch-config.xml
INFO  BatchProcess: org.apache.commons.cli.UnrecognizedOptionException:
Unrecognized option: -z"

Can anyone tell me how I can parse a directory of .eml files and extract
the data from their attachments?

Re: Can't use recursive parsing.

Posted by Tim Allison <ta...@apache.org>.
> With a directory of 100,000 .eml files (many with attachments), is
> there a recommended way to parallelize or do batch parsing reliably?

If the -J -t options get you what you need with tika-app in batch
mode, it is running in parallel.  You can set the number of threads
with -numConsumers.  At some point, even on a decent-sized box, you'll
become I/O bound because Tika, for some file formats, creates quite a
few temp files.  If you are limited to a single machine but have
several SSDs, you could read from one, write to another, and use a
third for java.io.tmpdir.

>A lot of my messages will have timeouts
As of Tika 1.19-SNAPSHOT (not yet released), you can control tesseract
timeouts with a config file such as:
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/org/apache/tika/config/TIKA-2705-tesseract.xml

>Sometimes tika won't put any content at all in the output. It will just be filename.eml.json of 0 bytes, that happens when I run:

It is expected that tika-batch will create 0 byte .json files if
something catastrophic happened during processing -- OOM, permanent
hang, etc.  Check the logs to see what went catastrophically wrong.
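
As a starting point for matching log entries to failed inputs, here is a minimal Python sketch that lists the 0 byte .json files in an output directory. The function name is hypothetical, and it assumes the outputs keep the filename.eml.json naming described in this thread.

```python
from pathlib import Path


def find_empty_outputs(output_dir):
    """Return the sorted paths of all 0 byte .json files under output_dir."""
    root = Path(output_dir)
    return sorted(
        path
        for path in root.rglob("*.json")
        if path.is_file() and path.stat().st_size == 0
    )


# Example call (the directory is the one used in this thread):
# for path in find_empty_outputs("/mailout/"):
#     print(path)
```

Each listed name, minus the trailing .json, is the input file whose processing should be looked up in the logs.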

>java -Xmx12g -Xms12g -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider -Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true -jar ~/tika-1.18/tika-app/target/tika-app-1.18.jar -J -t -i /mailin/ -o /mailout/

It would be helpful to get the exact IO errors.  The system properties
you are setting go to the parent process, which monitors the child
process, and it is the child process that is doing the heavy
lifting/actual parsing.  To set the system props for the child
process, prefix with -J, as in:

java -Dlog4j.configuration=file:log4j_driver.xml -jar tika-app.jar
-JXX:-OmitStackTraceInFastThrow -JXmx6g
-JDlog4j.configuration=file:log4j.xml -bc
tika-batch-config-basic-test.xml -i /data2/docs/ -o
/data4/batch_runs/tika_1_19-poi4d -numConsumers 10 -c tika_config.xml

Sidenote: definitely include the -JXX:-OmitStackTraceInFastThrow to
make sure that you're getting complete stacktraces.
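
The prefixing rule above can be sketched as a small Python helper that assembles the command line. The function names and the jar path are hypothetical, and combining the forwarded options with -J -t in one invocation is an assumption; only the -J prefix, -i, -o and -numConsumers come from this thread.

```python
def child_jvm_args(jvm_options):
    """Rewrite ordinary JVM options into the -J-prefixed form, e.g. -Xmx6g -> -JXmx6g."""
    forwarded = []
    for option in jvm_options:
        if not option.startswith("-"):
            raise ValueError("JVM options must start with '-': %r" % option)
        # Drop the leading '-' and put '-J' in front of the rest.
        forwarded.append("-J" + option[1:])
    return forwarded


def build_batch_command(jar, input_dir, output_dir, child_options=(), num_consumers=None):
    """Assemble an argument list for a tika-app batch run (illustrative only)."""
    command = ["java", "-jar", jar]
    command += child_jvm_args(child_options)
    command += ["-J", "-t", "-i", input_dir, "-o", output_dir]
    if num_consumers is not None:
        command += ["-numConsumers", str(num_consumers)]
    return command


# Example: forward heap size and full stack traces to the child process.
command = build_batch_command(
    "tika-app.jar",  # hypothetical jar location
    "/mailin/",
    "/mailout/",
    child_options=["-Xmx6g", "-XX:-OmitStackTraceInFastThrow"],
    num_consumers=10,
)
print(" ".join(command))
```

Passing the list to subprocess.run, rather than a single string, avoids shell quoting problems with paths that contain spaces.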

>I think the majority of files tika doesn't parse is due to tesseractOCR timeouts.
To see how many exceptions and of what types, consider running
tika-eval in 'profile' mode.  This will work well given that you're
already using the -J option.  See:
https://wiki.apache.org/tika/TikaEval
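
For a quick first look before setting up tika-eval, a rough tally can also be scripted. This is a minimal Python sketch, not a replacement for tika-eval. It assumes each -J output file is a JSON list of metadata objects and that exceptions sit under keys beginning with "X-TIKA:EXCEPTION:", as in the embedded_exception key quoted in this thread. The function names and the labels for empty or unreadable files are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path

EXCEPTION_PREFIX = "X-TIKA:EXCEPTION:"


def exception_summaries(metadata):
    """Yield the first line of every exception value in one metadata object."""
    for key, value in metadata.items():
        if not key.startswith(EXCEPTION_PREFIX):
            continue
        # A value may be a single string or a list of strings.
        values = value if isinstance(value, list) else [value]
        for trace in values:
            lines = str(trace).splitlines()
            yield lines[0].strip() if lines else "(empty)"


def tally_exceptions(output_dir):
    """Count exception messages across all .json outputs under output_dir."""
    counts = Counter()
    for path in sorted(Path(output_dir).rglob("*.json")):
        if path.stat().st_size == 0:
            counts["(0-byte output)"] += 1
            continue
        try:
            records = json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            counts["(unreadable JSON)"] += 1
            continue
        if isinstance(records, dict):
            records = [records]
        if not isinstance(records, list):
            continue
        for metadata in records:
            if isinstance(metadata, dict):
                counts.update(exception_summaries(metadata))
    return counts


# Example call (the directory is the one used in this thread):
# for message, count in tally_exceptions("/mailout/").most_common():
#     print(count, message)
```

Grouping on the first line of each stack trace puts all "TesseractOCRParser timeout" entries in one bucket, which is enough to check whether OCR timeouts really dominate.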
In reply to the message posted by Jake Burns <ja...@threatwave.com> on Wed, Aug 29, 2018 at 9:57 AM (shown in full below).

Re: Can't use recursive parsing.

Posted by Jake Burns <ja...@threatwave.com>.
Thanks, I guess I'll refrain from using extra flags on the command line.

I think the majority of files tika doesn't parse is due to tesseractOCR 
timeouts.

If I run:

java -jar ~/tika-1.18/tika-app/target/tika-app-1.18.jar -J -t -i 
/mailin/ -o /mailout/

A lot of my messages will have timeouts like this where the
X-TIKA:content object should be:

X-TIKA:EXCEPTION:embedded_exception":"org.apache.tika.exception.TikaException: 
TesseractOCRParser timeout\n\tat 
org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:560)\n\tat 
org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:432)\n\tat 
org.apache.tika.parser.ocr.TesseractOCRParser.parse(TesseractOCRParser.java:286)\n\tat 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)\n\tat 
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)\n\tat 
org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)\n\tat 
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)\n\tat 
org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:318)\n\tat 
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)\n\tat 
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)\n\tat 
org.apache.tika.parser.mail.MailContentHandler.handleEmbedded(MailContentHandler.java:283)\n\tat 
org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:228)\n\tat 
org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)\n\tat 
org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:100)\n\tat 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)\n\tat 
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)\n\tat 
org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)\n\tat 
org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:159)\n\tat 
org.apache.tika.batch.FileResourceConsumer.parse(FileResourceConsumer.java:406)\n\tat 
org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer.processFileResource(RecursiveParserWrapperFSConsumer.java:104)\n\tat 
org.apache.tika.batch.FileResourceConsumer._processFileResource(FileResourceConsumer.java:181)\n\tat 
org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:115)\n\tat 
org.apache.tika.batch.FileResourceConsumer.call(FileResourceConsumer.java:50)\n\tat 
java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\n\tat 
java.util.concurrent.FutureTask.run(FutureTask.java:266)\n\tat 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat 
java.lang.Thread.run(Thread.java:748)\nCaused by: 
java.util.concurrent.TimeoutException\n\tat 
java.util.concurrent.FutureTask.get(FutureTask.java:205)\n\tat 
org.apache.tika.parser.ocr.TesseractOCRParser.doOCR(TesseractOCRParser.java:549)\n\t... 
32 
more\n","X-TIKA:digest:MD5":"79171517bfedab52b24bd1691a5ff544","X-TIKA:embedded_resource_path":"/CastleBrooks 
Ulana.jpg


Sometimes tika won't put any content at all in the output. It will just 
be filename.eml.json of 0 bytes, that happens when I run:

java -Xmx12g -Xms12g 
-Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider 
-Dorg.apache.pdfbox.rendering.UsePureJavaCMYKConversion=true -jar 
~/tika-1.18/tika-app/target/tika-app-1.18.jar -J -t -i /mailin/ -o 
/mailout/

Sometimes the tika processing just grinds to a halt with 
illegalIOexception too.

TL;DR -

I'm running 24 CPU Cores with 64 GB of RAM on SSDs.

With a directory of 100,000 .eml files (many with attachments), is there 
a recommended way to parallelize or do batch parsing reliably?


In reply to the message posted by Tim Allison on 08/28/2018 07:54 AM (shown in full below).


Re: Can't use recursive parsing.

Posted by Tim Allison <ta...@apache.org>.
Hi Jake,
In reverse order...

1) command flags:  right, sorry, we've only implemented text/metadata
extraction via batch-mode (triggered by -i and -o).  The -z option
currently only operates one file at a time.

2) "Even though I use -J, I'm not seeing the results of OCR on the
attachments" ... when you type 'tesseract' at the command line, does
that kick off tesseract, or is it not on your path?  Do you have a
custom installation?  If you run tika-app.jar -J against a single file
with an attachment that should be OCR'd, what values are you getting
for X-Parsed-By?  To help isolate whether tesseract is being called
at all, try running standalone tika-app.jar -J against, e.g.
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/testOCR.docx
or https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/testOCR.pdf

3) "I'm also not seeing anything extracted from PDFs" -- are the PDFs
image-only or do they actually contain text?  If image-only, once we
figure out whether tesseract is being called at all, that might solve
the problem, but also see:
https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29#OCR for
how to use a tika-config to turn on the extraction/OCR'ing of inline
images in PDFs.
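
The two checks in point 2 can be sketched in Python. This is a minimal sketch under assumptions: the key name "X-Parsed-By" and the list-of-metadata-objects layout of the -J output are assumptions about the JSON, the function names are hypothetical, and the parser class name is the one that appears in the stack trace quoted in this thread.

```python
import shutil

TESSERACT_PARSER = "org.apache.tika.parser.ocr.TesseractOCRParser"


def tesseract_on_path(executable="tesseract"):
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which(executable)


def ocr_was_invoked(records):
    """Return True if any metadata object lists the Tesseract parser under X-Parsed-By."""
    for metadata in records:
        parsed_by = metadata.get("X-Parsed-By", [])
        # A value may be a single string or a list of strings.
        if isinstance(parsed_by, str):
            parsed_by = [parsed_by]
        if TESSERACT_PARSER in parsed_by:
            return True
    return False


# Example: load one -J output file with json.load and pass the resulting list in.
print("tesseract found at:", tesseract_on_path())
```

If tesseract_on_path() returns None, Tika has nothing to call, so that is worth ruling out before looking at the parser list.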