Posted to user@tika.apache.org by kostali hassan <me...@gmail.com> on 2016/07/15 09:16:30 UTC

detect corrupt file and build a list of them before indexing in solr

I am looking to index MS Word and PDF files by uploading data with Solr Cell,
which uses Apache Tika.
I would like to use Tika to detect corrupt files before indexing and to get a
list of them, if that is possible.
I tried running java -jar tika-app.jar <input_dir> <output_dir>. In the
output_dir I get all the files from <input_dir> in XML format, and every
corrupt file has a size of 0 KB (empty).
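
A minimal sketch of how the empty outputs described above could be turned into a list, written in Python. It assumes each output file is named after its input with one extra extension appended (for example `report.pdf.xml`); that naming convention, the function name `find_empty_outputs`, and the default suffix are illustrative assumptions, not something confirmed in this thread.

```python
from pathlib import Path


def find_empty_outputs(output_dir, suffix=".xml"):
    """Return input-relative names whose extracted output is 0 bytes."""
    root = Path(output_dir)
    suspects = []
    for out_file in sorted(root.rglob("*" + suffix)):
        if out_file.is_file() and out_file.stat().st_size == 0:
            # Assumption: output name = input name + suffix.
            rel = out_file.relative_to(root).as_posix()
            suspects.append(rel[: -len(suffix)])
    return suspects
```

Calling `find_empty_outputs("output_dir")` would give the candidate list; a 0-byte output only says that extraction failed badly, not why.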

Re: detect corrupt file and build a list of them before indexing in solr

Posted by kostali hassan <me...@gmail.com>.
Thank you very much, Allison. Now I will write a script that uses these files
to get only the name of each corrupt file, without the cause of the errors.

Thank you again, and have a nice day.
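
One way such a script might look, sketched in Python. Rather than parsing the log files, whose layout depends on the logging configuration, it reads the .json outputs and reports every file whose metadata has a key starting with `X-TIKA:EXCEPTION:`. To my understanding that is the value behind `TikaCoreProperties.TIKA_META_EXCEPTION_PREFIX`, but treat it as an assumption and check it against your Tika version. The function name, the `.json` suffix convention and the folder layout are hypothetical.

```python
import json
from pathlib import Path

# Assumed value of TikaCoreProperties.TIKA_META_EXCEPTION_PREFIX.
EXCEPTION_PREFIX = "X-TIKA:EXCEPTION:"


def files_with_exceptions(output_dir, suffix=".json"):
    """List input names whose Tika output is empty or records an exception."""
    root = Path(output_dir)
    failed = []
    for out_file in sorted(root.rglob("*" + suffix)):
        name = out_file.relative_to(root).as_posix()[: -len(suffix)]
        text = out_file.read_text(encoding="utf-8", errors="replace")
        if not text.strip():
            failed.append(name)  # empty output: hang, OOM or crash
            continue
        try:
            data = json.loads(text)
        except ValueError:
            failed.append(name)  # truncated or unreadable output
            continue
        # Output may be one metadata object or a list of them.
        records = data if isinstance(data, list) else [data]
        if any(key.startswith(EXCEPTION_PREFIX)
               for rec in records if isinstance(rec, dict)
               for key in rec):
            failed.append(name)
    return failed
```

The result contains names only; the stack traces stay in the .json files and logs for anyone who needs the cause later.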

2016-07-15 18:38 GMT+01:00 Allison, Timothy B. <ta...@mitre.org>:

> Rename the shell script’s extension to end in .bat and you should be good
> to go.
>
> From: kostali hassan [mailto:med.has.kostali@gmail.com]
> Sent: Friday, July 15, 2016 1:26 PM
> To: user@tika.apache.org
> Subject: Re: detect corrupt file and build a list of them before
> indexing in solr
>
> I use tika-app 1.12.
>
> 2016-07-15 18:20 GMT+01:00 Allison, Timothy B. <ta...@mitre.org>:
>
> Can you share the shell script/bat file you’re using?
>
> From: kostali hassan [mailto:med.has.kostali@gmail.com]
> Sent: Friday, July 15, 2016 1:13 PM
> To: user@tika.apache.org
> Subject: Re: detect corrupt file and build a list of them before
> indexing in solr
>
> When I set inputDIR to d:\test, the log tells me: java.lang.RuntimeException:
> Crawler couldn't find this directory:D:\tika_batch_config\test
>
> The same happens if I set inputDIR to d:\Cvs; the log says:
> java.lang.RuntimeException:
> Crawler couldn't find this directory: D:\tika_batch_config\Cvs
>
> 2016-07-15 17:54 GMT+01:00 kostali hassan <me...@gmail.com>:
>
> I added this directory and it is still not working.
>
> 2016-07-15 17:42 GMT+01:00 Allison, Timothy B. <ta...@mitre.org>:
>
> Yes, the log tells you that the input directory wasn’t specified correctly:
>
> 1375 2016-07-15 17:33:17,354 [Thread-2] INFO
> org.apache.tika.batch.BatchProcessDriverCLI  - BatchProcess:
> java.lang.RuntimeException: Crawler couldn't find this
> directory:D:\tika_batch_config\test
>
> From: kostali hassan [mailto:med.has.kostali@gmail.com]
> Sent: Friday, July 15, 2016 12:40 PM
> To: user@tika.apache.org
> Subject: Re: detect corrupt file and build a list of them before
> indexing in solr
>
> Only -JXmx1g works, and the inputDIR is empty. I get these empty files in
> logs:
>
> batch-driver-warn.log
> batch-process-warn.log
> tika-batch-pdfbox.log
>
> I also get the attached files.
>
> 2016-07-15 16:36 GMT+01:00 Allison, Timothy B. <ta...@mitre.org>:
>
> Try changing the max heap to something that will work on your computer:
>
> -JXmx5g
>
> To (say):
>
> -JXmx1g
>
> From: kostali hassan [mailto:med.has.kostali@gmail.com]
> Sent: Friday, July 15, 2016 11:27 AM
> To: user@tika.apache.org
> Subject: Re: detect corrupt file and build a list of them before
> indexing in solr
>
> I get these files in the logs, and when I run the script it does not
> finish; it restarts all the time.
>
> 2016-07-15 13:19 GMT+01:00 Allison, Timothy B. <ta...@mitre.org>:
>
> Sorry, you’ll get 0 byte files for an error that caused Tika batch to do a
> restart (hang/OOM); and depending on cause, you may get an error logged in
> batch-process-error.xml.  If your OS kills the process or something truly
> catastrophic happens, the only trace you have is the 0 byte file.
>
> For regular caught exceptions, you can look in the .json file
> (key: TikaCoreProperties.TIKA_META_EXCEPTION_PREFIX+"runtime")
> for the stack trace, or you can look in the logs as described below.
>
> From: Allison, Timothy B. [mailto:tallison@mitre.org]
> Sent: Friday, July 15, 2016 8:11 AM
> To: user@tika.apache.org
> Subject: RE: detect corrupt file and build a list of them before
> indexing in solr
>
> Checking for 0 byte files is one option.  The other option is to configure
> the logs to capture exceptions.  I’ve attached the config files and the
> shell script that I use when running our large scale regression testing
> here:
> https://wiki.apache.org/tika/TikaBatchUsage?action=AttachFile&do=view&target=tika-batch-sh.zip
>
> To run those, unzip the folder, put the tika-app.jar in the bin/
> directory, update the shell script for your <input_dir> and your
> <output_dir> and you should be good to go.  You may need to create a “logs”
> directory.  Exceptions will be recorded in the batch-process-warn.log, and
> original file names are included along with stack traces.
>
> From: kostali hassan [mailto:med.has.kostali@gmail.com]
> Sent: Friday, July 15, 2016 5:17 AM
> To: user@tika.apache.org
> Subject: detect corrupt file and build a list of them before indexing
> in solr
>
> I am looking to index MS Word and PDF files by uploading data with Solr
> Cell, which uses Apache Tika.
>
> I would like to use Tika to detect corrupt files before indexing and to
> get a list of them, if that is possible.
>
> I tried running java -jar tika-app.jar <input_dir> <output_dir>. In the
> output_dir I get all the files from <input_dir> in XML format, and every
> corrupt file has a size of 0 KB (empty).

RE: detect corrupt file and build a list of them before indexing in solr

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Rename the shell script’s extension to end in .bat and you should be good to go.


Re: detect corrupt file and build a list of them before indexing in solr

Posted by kostali hassan <me...@gmail.com>.
I use tika-app 1.12.

RE: detect corrupt file and build a list of them before indexing in solr

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Can you share the shell script/bat file you’re using?

Re: detect corrupt file and build a list of them before indexing in solr

Posted by kostali hassan <me...@gmail.com>.
When I set inputDIR to d:\test, the log tells me: java.lang.RuntimeException:
Crawler couldn't find this directory:D:\tika_batch_config\test
The same happens if I set inputDIR to d:\Cvs; the log says: java.lang.RuntimeException:
Crawler couldn't find this directory: D:\tika_batch_config\Cvs

Re: detect corrupt file and build a list of them before indexing in solr

Posted by kostali hassan <me...@gmail.com>.
I added this directory and it is still not working.

RE: detect corrupt file and build a list of them before indexing in solr

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Yes, the log tells you that the input directory wasn’t specified correctly:

1375 2016-07-15 17:33:17,354 [Thread-2] INFO  org.apache.tika.batch.BatchProcessDriverCLI  - BatchProcess: java.lang.RuntimeException: Crawler couldn't find this directory:D:\tika_batch_config\test
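
The log line shows the crawler looking under `D:\tika_batch_config`. One possible explanation, which is an assumption and not confirmed in this thread, is that the configured value was treated as relative to the script's working directory. The Python sketch below only illustrates how a relative and an absolute Windows path resolve differently; `resolve_input_dir` is a hypothetical helper, not part of Tika.

```python
from pathlib import PureWindowsPath


def resolve_input_dir(working_dir, configured):
    """Show where a configured input directory ends up on Windows."""
    # An absolute path (drive + root) replaces the base when joined;
    # a relative one is appended to the working directory.
    return PureWindowsPath(working_dir) / configured


base = r"D:\tika_batch_config"
print(resolve_input_dir(base, "test"))       # relative: lands under base
print(resolve_input_dir(base, r"d:\test"))   # absolute: base is ignored
print(PureWindowsPath("test").is_absolute())
```

If the first form matches what the log reports, checking how the script builds the input path would be a reasonable next step.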

Re: detect corrupt file and build a list of them before indexing in solr

Posted by kostali hassan <me...@gmail.com>.
Only -JXmx1g works, and the inputDIR is empty. I get these empty files in
logs:
batch-driver-warn.log
batch-process-warn.log
tika-batch-pdfbox.log

I also get the attached files.

2016-07-15 16:36 GMT+01:00 Allison, Timothy B. <ta...@mitre.org>:

> Try changing the max heap to something that will work on your computer:
>
>
>
> -JXmx5g
>
>
>
> To (say):
>
>
>
> -JXmx1g
>
> *From:* kostali hassan [mailto:med.has.kostali@gmail.com]
> *Sent:* Friday, July 15, 2016 11:27 AM
> *To:* user@tika.apache.org
> *Subject:* Re: detect corrupt file and build a list of them before
> indexing in solr
>
>
>
> I get this files in the logs ; AND when I run the script he dont finich he
> restart all the time
>
>
>
> 2016-07-15 13:19 GMT+01:00 Allison, Timothy B. <ta...@mitre.org>:
>
> Sorry, you’ll get 0 byte files for an error that caused Tika batch to do a
> restart (hang/oom); and depending on cause, you may get an error logged in
> batch-process-error.xml.  If your OS kills the process or something truly
> catastrophic happens, the only trace you have is the 0 byte file.
>
>
>
>   For regular caught exceptions, you can look in the .json file (key: TikaCoreProperties.*TIKA_META_EXCEPTION_PREFIX*+*"runtime"*)
>
> for the stack trace, or you can look in the logs as described below.
>
>
>
> *From:* Allison, Timothy B. [mailto:tallison@mitre.org]
> *Sent:* Friday, July 15, 2016 8:11 AM
> *To:* user@tika.apache.org
> *Subject:* RE: detect corrupt file and build a list of them before
> indexing in solr
>
>
>
> Checking for 0 byte files is one option.  The other option is to configure
> the logs to capture exceptions.  I’ve attached the config files and the
> shell script that I use when running our large scale regression testing
> here:
> https://wiki.apache.org/tika/TikaBatchUsage?action=AttachFile&do=view&target=tika-batch-sh.zip
>
>
>
> To run those, unzip the folder, put the tika-app.jar in the bin/
> directory, update the shell script for your <input_dir> and your
> <output_dir> and you should be good to go.  You may need to create a “logs”
> directory.  Exceptions will be recorded in the batch-process-warn.log, and
> original file names are included along with stack traces.
>
>
>
> *From:* kostali hassan [mailto:med.has.kostali@gmail.com
> <me...@gmail.com>]
> *Sent:* Friday, July 15, 2016 5:17 AM
> *To:* user@tika.apache.org
> *Subject:* detect corrupt file and build a list of them before indexing
> in solr
>
>
>
> I am looking to index MS Word and PDF files by uploading data with Solr
> Cell using Apache Tika.
>
> I hope to use Tika to detect corrupt files before indexing and to get a
> list of the corrupted files, if that is possible.
>
> I tried running java -jar tika-app.jar <input_dir> <output_dir>. In the
> output_dir I get all the files of <input_dir> in XML format, and all the
> corrupt files have size 0 KB (empty).
>
>
>

RE: detect corrupt file and build a list of them before indexing in solr

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Try changing the max heap to something that will work on your computer:

-JXmx5g

To (say):

-JXmx1g
From: kostali hassan [mailto:med.has.kostali@gmail.com]
Sent: Friday, July 15, 2016 11:27 AM
To: user@tika.apache.org
Subject: Re: detect corrupt file and build a list of them before indexing in solr

I get these files in the logs, and when I run the script it does not finish; it restarts all the time.

2016-07-15 13:19 GMT+01:00 Allison, Timothy B. <ta...@mitre.org>:
Sorry, you’ll get 0 byte files for an error that caused Tika batch to do a restart (hang/oom); and depending on cause, you may get an error logged in batch-process-error.xml.  If your OS kills the process or something truly catastrophic happens, the only trace you have is the 0 byte file.


  For regular caught exceptions, you can look in the .json file (key: TikaCoreProperties.TIKA_META_EXCEPTION_PREFIX+"runtime")
for the stack trace, or you can look in the logs as described below.

From: Allison, Timothy B. [mailto:tallison@mitre.org]
Sent: Friday, July 15, 2016 8:11 AM
To: user@tika.apache.org
Subject: RE: detect corrupt file and build a list of them before indexing in solr

Checking for 0 byte files is one option.  The other option is to configure the logs to capture exceptions.  I’ve attached the config files and the shell script that I use when running our large scale regression testing here: https://wiki.apache.org/tika/TikaBatchUsage?action=AttachFile&do=view&target=tika-batch-sh.zip

To run those, unzip the folder, put the tika-app.jar in the bin/ directory, update the shell script for your <input_dir> and your <output_dir> and you should be good to go.  You may need to create a “logs” directory.  Exceptions will be recorded in the batch-process-warn.log, and original file names are included along with stack traces.

From: kostali hassan [mailto:med.has.kostali@gmail.com]
Sent: Friday, July 15, 2016 5:17 AM
To: user@tika.apache.org
Subject: detect corrupt file and build a list of them before indexing in solr

I am looking to index MS Word and PDF files by uploading data with Solr Cell using Apache Tika.
I hope to use Tika to detect corrupt files before indexing and to get a list of the corrupted files, if that is possible.
I tried running java -jar tika-app.jar <input_dir> <output_dir>. In the output_dir I get all the files of <input_dir> in XML format, and all the corrupt files have size 0 KB (empty).


Re: detect corrupt file and build a list of them before indexing in solr

Posted by kostali hassan <me...@gmail.com>.
I get these files in the logs, and when I run the script it does not
finish; it restarts all the time.

2016-07-15 13:19 GMT+01:00 Allison, Timothy B. <ta...@mitre.org>:

> Sorry, you’ll get 0 byte files for an error that caused Tika batch to do a
> restart (hang/oom); and depending on cause, you may get an error logged in
> batch-process-error.xml.  If your OS kills the process or something truly
> catastrophic happens, the only trace you have is the 0 byte file.
>
>
>
> For regular caught exceptions, you can look in the .json file (key:
> TikaCoreProperties.TIKA_META_EXCEPTION_PREFIX + "runtime") for the stack
> trace, or you can look in the logs as described below.
>
>
>
> *From:* Allison, Timothy B. [mailto:tallison@mitre.org]
> *Sent:* Friday, July 15, 2016 8:11 AM
> *To:* user@tika.apache.org
> *Subject:* RE: detect corrupt file and build a list of them before
> indexing in solr
>
>
>
> Checking for 0 byte files is one option.  The other option is to configure
> the logs to capture exceptions.  I’ve attached the config files and the
> shell script that I use when running our large scale regression testing
> here:
> https://wiki.apache.org/tika/TikaBatchUsage?action=AttachFile&do=view&target=tika-batch-sh.zip
>
>
>
> To run those, unzip the folder, put the tika-app.jar in the bin/
> directory, update the shell script for your <input_dir> and your
> <output_dir> and you should be good to go.  You may need to create a “logs”
> directory.  Exceptions will be recorded in the batch-process-warn.log, and
> original file names are included along with stack traces.
>
>
>
> *From:* kostali hassan [mailto:med.has.kostali@gmail.com]
> *Sent:* Friday, July 15, 2016 5:17 AM
> *To:* user@tika.apache.org
> *Subject:* detect corrupt file and build a list of them before indexing
> in solr
>
>
>
> I am looking to index MS Word and PDF files by uploading data with Solr
> Cell using Apache Tika.
>
> I hope to use Tika to detect corrupt files before indexing and to get a
> list of the corrupted files, if that is possible.
>
> I tried running java -jar tika-app.jar <input_dir> <output_dir>. In the
> output_dir I get all the files of <input_dir> in XML format, and all the
> corrupt files have size 0 KB (empty).
>

RE: detect corrupt file and build a list of them before indexing in solr

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Sorry, you’ll get 0 byte files for an error that caused Tika batch to do a restart (hang/oom); and depending on cause, you may get an error logged in batch-process-error.xml.  If your OS kills the process or something truly catastrophic happens, the only trace you have is the 0 byte file.


  For regular caught exceptions, you can look in the .json file (key: TikaCoreProperties.TIKA_META_EXCEPTION_PREFIX+"runtime")
for the stack trace, or you can look in the logs as described below.
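
A minimal sketch of the .json check described above, offered as an illustration rather than as part of Tika. It assumes that the output directory contains .json files, that each file holds either one metadata object or a list of them, and that TikaCoreProperties.TIKA_META_EXCEPTION_PREFIX has the string value "X-TIKA:EXCEPTION:". Verify that value against your Tika version. The function name files_with_exceptions is hypothetical.

```python
import json
import os

# Assumption: string value of TikaCoreProperties.TIKA_META_EXCEPTION_PREFIX.
# Check it against the Tika version you are running.
EXCEPTION_PREFIX = "X-TIKA:EXCEPTION:"


def files_with_exceptions(output_dir, prefix=EXCEPTION_PREFIX):
    """Map each .json output (relative path) to the exception keys it contains."""
    failures = {}
    for root, _dirs, files in os.walk(output_dir):
        for name in files:
            if not name.endswith(".json"):
                continue
            path = os.path.join(root, name)
            if os.path.getsize(path) == 0:
                continue  # 0 byte outputs need a separate check
            try:
                with open(path, encoding="utf-8") as fh:
                    data = json.load(fh)
            except ValueError:
                continue  # unreadable or truncated JSON; skipped in this sketch
            # Accept either a single metadata object or a list of them.
            records = data if isinstance(data, list) else [data]
            keys = sorted(
                {
                    key
                    for record in records
                    if isinstance(record, dict)
                    for key in record
                    if key.startswith(prefix)
                }
            )
            if keys:
                failures[os.path.relpath(path, output_dir)] = keys
    return failures
```

Calling files_with_exceptions("<output_dir>") returns a dictionary whose keys are the output files that recorded a caught exception; the matching metadata values hold the stack traces.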

From: Allison, Timothy B. [mailto:tallison@mitre.org]
Sent: Friday, July 15, 2016 8:11 AM
To: user@tika.apache.org
Subject: RE: detect corrupt file and build a list of them before indexing in solr

Checking for 0 byte files is one option.  The other option is to configure the logs to capture exceptions.  I’ve attached the config files and the shell script that I use when running our large scale regression testing here: https://wiki.apache.org/tika/TikaBatchUsage?action=AttachFile&do=view&target=tika-batch-sh.zip

To run those, unzip the folder, put the tika-app.jar in the bin/ directory, update the shell script for your <input_dir> and your <output_dir> and you should be good to go.  You may need to create a “logs” directory.  Exceptions will be recorded in the batch-process-warn.log, and original file names are included along with stack traces.

From: kostali hassan [mailto:med.has.kostali@gmail.com]
Sent: Friday, July 15, 2016 5:17 AM
To: user@tika.apache.org
Subject: detect corrupt file and build a list of them before indexing in solr

I am looking to index MS Word and PDF files by uploading data with Solr Cell using Apache Tika.
I hope to use Tika to detect corrupt files before indexing and to get a list of the corrupted files, if that is possible.
I tried running java -jar tika-app.jar <input_dir> <output_dir>. In the output_dir I get all the files of <input_dir> in XML format, and all the corrupt files have size 0 KB (empty).

RE: detect corrupt file and build a list of them before indexing in solr

Posted by "Allison, Timothy B." <ta...@mitre.org>.
Checking for 0 byte files is one option.  The other option is to configure the logs to capture exceptions.  I’ve attached the config files and the shell script that I use when running our large scale regression testing here: https://wiki.apache.org/tika/TikaBatchUsage?action=AttachFile&do=view&target=tika-batch-sh.zip

To run those, unzip the folder, put the tika-app.jar in the bin/ directory, update the shell script for your <input_dir> and your <output_dir> and you should be good to go.  You may need to create a “logs” directory.  Exceptions will be recorded in the batch-process-warn.log, and original file names are included along with stack traces.
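
A minimal sketch of the first option (checking for 0 byte files), assuming the batch run has already written its results under <output_dir>. The function name find_zero_byte_outputs is hypothetical and not part of Tika. An empty output only suggests that the corresponding input failed; it does not say why.

```python
import os


def find_zero_byte_outputs(output_dir):
    """Return the sorted relative paths of all empty files under output_dir."""
    empty = []
    for root, _dirs, files in os.walk(output_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getsize(path) == 0:
                # Store the path relative to output_dir so it can be mapped
                # back to the matching file under the input directory.
                empty.append(os.path.relpath(path, output_dir))
    return sorted(empty)
```

Printing each entry of find_zero_byte_outputs("<output_dir>") gives a candidate list of corrupt files; the output extension (for example .xml) has to be stripped to recover the original file name.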

From: kostali hassan [mailto:med.has.kostali@gmail.com]
Sent: Friday, July 15, 2016 5:17 AM
To: user@tika.apache.org
Subject: detect corrupt file and build a list of them before indexing in solr

I am looking to index MS Word and PDF files by uploading data with Solr Cell using Apache Tika.
I hope to use Tika to detect corrupt files before indexing and to get a list of the corrupted files, if that is possible.
I tried running java -jar tika-app.jar <input_dir> <output_dir>. In the output_dir I get all the files of <input_dir> in XML format, and all the corrupt files have size 0 KB (empty).