Posted to user@tika.apache.org by Olivier Tavard <ol...@francelabs.com> on 2018/10/11 13:58:54 UTC

Logging and filename

Hi,

I have a question about logging in Tika, and in Tika server specifically.
We use Tika server to index millions of files on a Windows file share. To be more precise, we use Apache ManifoldCF to crawl the files, and the text extraction is done by Tika server 1.19.
The spawnChild option is active. With very big files we get some OOMs, and the Tika server parent kills and restarts the child process as it should. It works great; I just wanted to know whether the Tika server child log could include the name of the file that caused the OOM. So far, in the Tika log I can find the error and the date of the error, but not the filename. I changed the log level to debug, but the filename did not appear either.

To find this information, I first have to find the date and time of the child restart in the Tika server log. Then I open the Apache ManifoldCF log and search it for the date and time found in the Tika log, to finally find the problematic file sent to Tika.
Did I miss something, and can the filename already be found in the Tika log? If Tika could add the filename to its own log, it would be very helpful for us.
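
For illustration, the manual correlation described above could be scripted roughly as follows. This is only a sketch under assumptions: it supposes that both logs start each line with a "YYYY-MM-DD HH:MM:SS" timestamp, and the restart marker text and the sample log lines are hypothetical, not actual Tika or ManifoldCF output.

```python
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp layout at the start of each line
TS_LENGTH = 19                   # length of "2018-10-11 13:58:54"


def parse_timestamp(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
    except ValueError:
        return None


def restart_times(tika_lines, marker):
    """Timestamps of Tika server log lines that contain the restart marker."""
    times = []
    for line in tika_lines:
        ts = parse_timestamp(line)
        if ts is not None and marker in line:
            times.append(ts)
    return times


def lines_before(crawler_lines, when, window):
    """Crawler log lines stamped within `window` before (or exactly at) `when`."""
    hits = []
    for line in crawler_lines:
        ts = parse_timestamp(line)
        if ts is not None and when - window <= ts <= when:
            hits.append(line.rstrip("\n"))
    return hits


def correlate(tika_lines, crawler_lines, marker, window=timedelta(seconds=60)):
    """Map each restart time to the crawler lines logged shortly before it."""
    return {
        ts: lines_before(crawler_lines, ts, window)
        for ts in restart_times(tika_lines, marker)
    }


if __name__ == "__main__":
    # Hypothetical sample lines; real log formats will differ.
    tika_log = ["2018-10-11 13:58:54 ERROR child restarted"]
    crawler_log = ["2018-10-11 13:58:30 sending big-file.pdf to Tika"]
    for when, candidates in correlate(tika_log, crawler_log, "child restarted").items():
        print(when, candidates)
```

The 60-second window is an arbitrary default; a file that takes longer than that to exhaust memory would need a wider one, which is one more reason to prefer having the filename in the Tika log itself.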

Thanks,
Best regards,
Olivier 

Re: Logging and filename

Posted by Olivier Tavard <ol...@francelabs.com>.
Hello,

Thanks for the fix, it works well!
 
Best regards,

Olivier 


> On 12 Oct 2018, at 18:41, Tim Allison <ta...@apache.org> wrote:
> 
> Except that it didn't fix anything!  I _think_ I got it right this
> time: https://issues.apache.org/jira/browse/TIKA-2754  Let me know
> what you find.
> 
> Thank you, again.
> 
> Cheers,
> 
>         Tim


Re: Logging and filename

Posted by Tim Allison <ta...@apache.org>.
Except that it didn't fix anything!  I _think_ I got it right this
time: https://issues.apache.org/jira/browse/TIKA-2754  Let me know
what you find.

Thank you, again.

Cheers,

         Tim
On Fri, Oct 12, 2018 at 5:44 AM Olivier Tavard
<ol...@francelabs.com> wrote:
>
> Hi,
>
> Thanks for the quick fix!
> The "path" parameter at the point where you made the commit (the parse method in the TikaResource class) is always set to "unpack/all" when I run the indexing on the file share. Shouldn't it normally be the file path? I do not understand why it has this value.
>
> Thanks,
> Best regards,
>
> Olivier

Re: Logging and filename

Posted by Olivier Tavard <ol...@francelabs.com>.
Hi,

Thanks for the quick fix!
The "path" parameter at the point where you made the commit (the parse method in the TikaResource class) is always set to "unpack/all" when I run the indexing on the file share. Shouldn't it normally be the file path? I do not understand why it has this value.

Thanks,
Best regards,

Olivier 


> On 11 Oct 2018, at 19:46, Tim Allison <ta...@apache.org> wrote:
> 
> Doh. Sorry.  I just added that in bf75e39.  Please let us know what
> else you find!
> 
> Aside from the unit tests, I haven't had a chance to try to break the
> -spawnChild option with our regression corpus.


Re: Logging and filename

Posted by Tim Allison <ta...@apache.org>.
Doh. Sorry.  I just added that in bf75e39.  Please let us know what
else you find!

Aside from the unit tests, I haven't had a chance to try to break the
-spawnChild option with our regression corpus.
On Thu, Oct 11, 2018 at 9:59 AM Olivier Tavard
<ol...@francelabs.com> wrote:
>
> Hi,
>
> I have a question about logging in Tika, and in Tika server specifically.
> We use Tika server to index millions of files on a Windows file share. To be more precise, we use Apache ManifoldCF to crawl the files, and the text extraction is done by Tika server 1.19.
> The spawnChild option is active. With very big files we get some OOMs, and the Tika server parent kills and restarts the child process as it should. It works great; I just wanted to know whether the Tika server child log could include the name of the file that caused the OOM. So far, in the Tika log I can find the error and the date of the error, but not the filename. I changed the log level to debug, but the filename did not appear either.
>
> To find this information, I first have to find the date and time of the child restart in the Tika server log. Then I open the Apache ManifoldCF log and search it for the date and time found in the Tika log, to finally find the problematic file sent to Tika.
> Did I miss something, and can the filename already be found in the Tika log? If Tika could add the filename to its own log, it would be very helpful for us.
>
> Thanks,
> Best regards,
> Olivier