Posted to dev@manifoldcf.apache.org by Cihad Guzel <cg...@gmail.com> on 2022/10/19 23:28:12 UTC

Tika Service Rmeta Connector Error

 Hi Julien,

I ran the Tika 2.x service using the official Tika image available on Docker
Hub. I am using MFC version 2.3. I activated the tika-service-rmeta connector
for MFC. I created a job in MFC for a folder with 5 files in it. But OCR
was not performed on some of the files. When I look at Solr, the content of
some files seems empty. I also got the error messages shown in the
attachment.

In the second test I made, this time I created 5 separate jobs to include
each of the 5 files one by one. When I ran these jobs, I did not encounter
any problems.

When I send these 5 files directly to the tika-service using curl it also
works correctly.

When I examine the Simple History Report, I see error messages for some
files as in the attached picture.

Could the Tika connector have a bug that causes an error when multiple
files are sent to Tika? Could it be related to this issue?
https://issues.apache.org/jira/browse/CONNECTORS-1733
[image: Screen Shot 2022-10-20 at 02.08.11.png]
Regards,
Cihad Güzel

RE: Tika Service Rmeta Connector Error

Posted by Julien Massiera <ju...@francelabs.com>.
Hi Cihad,

 

OCR processing takes a lot of resources and time, so when you send several files to Tika at the same time, you increase the processing time of each file, which results in timeouts on the connector side like the ones you experienced. By decreasing the number of files processed at once, you reduce the processing time of each file and therefore the probability of hitting a timeout (if you don't change the timeout value, of course). The timeout parameters of the Tika connector are there for that reason, and you used them well.
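
As a rough illustration of the point above, here is a toy model in Python. It is only a sketch under a strong assumption: the Tika server is modelled as sharing a fixed number of processing slots evenly between the files it receives at once. The function name, the slot count, and all the durations are hypothetical and are not taken from ManifoldCF or Tika.

```python
def timed_out_files(single_file_seconds, concurrent_files, server_slots, socket_timeout):
    """Return the files whose modelled processing time exceeds the socket timeout.

    Toy assumption (not measured): once more files are in flight than the
    server has slots, every file is slowed by concurrent_files / server_slots.
    """
    slowdown = max(1.0, concurrent_files / server_slots)
    return [
        name
        for name, seconds in single_file_seconds.items()
        if seconds * slowdown > socket_timeout
    ]


# Hypothetical OCR durations (seconds) when each file is processed alone.
files = {"scan1.pdf": 20, "scan2.pdf": 25, "scan3.pdf": 30, "scan4.pdf": 10, "scan5.pdf": 15}

# One job with 5 files in flight, default-like timeout.
print(timed_out_files(files, concurrent_files=5, server_slots=1, socket_timeout=60))
# Five jobs with one file each, same timeout.
print(timed_out_files(files, concurrent_files=1, server_slots=1, socket_timeout=60))
# One job with 5 files, larger socket timeout.
print(timed_out_files(files, concurrent_files=5, server_slots=1, socket_timeout=200))
```

In this model, only the first call reports timed-out files. Processing the files one by one, or raising the socket timeout, makes the timeouts disappear.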

Concerning the error: in any corpus of files there is a very high probability that some files are problematic for Tika and cause timeouts, and OCR processing is not the only thing that triggers that kind of problem. So a choice had to be made about how to deal with those errors: either raise an error in the Tika connector that stops the job, or accept that the error will happen often, log it in the Simple History, and ignore it so that the job keeps running. The second option was chosen because otherwise more than 90% of crawl jobs involving Tika in an enterprise environment would fail, and it would be nearly impossible to fix or filter out all the problematic files.

Concerning the Solr insertion, the Solr connector only raises an error if the Solr indexing cannot be done; this is not linked to any previous connector in the pipeline and never will be. In your case, when a file times out in Tika, its content and metadata cannot be retrieved from the Tika server, so the document is indexed as it is, without them. The ingestion itself works, so there is no error to raise.
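
The behaviour described in the last two paragraphs can be sketched as follows. This is a simplified illustration in Python, not the actual ManifoldCF code: `run_pipeline`, `ExtractionTimeout`, `Document`, and the history tuples are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


class ExtractionTimeout(Exception):
    """Hypothetical error raised when the extraction step times out."""


@dataclass
class Document:
    uri: str
    content: str = ""
    metadata: dict = field(default_factory=dict)


def run_pipeline(uris, extract, index, history):
    """Extract then index each document; extraction failures never stop the job."""
    for uri in uris:
        doc = Document(uri)
        try:
            doc.content, doc.metadata = extract(uri)
            history.append((uri, "extract", "OK"))
        except ExtractionTimeout as exc:
            # Log the failure and carry on with an empty document.
            history.append((uri, "extract", f"ERROR: {exc}"))
        # The indexing step only knows whether indexing itself succeeded.
        index(doc)
        history.append((uri, "document ingest", "OK"))


def fake_extract(uri):
    # Stand-in for the extraction call; one file is made to time out.
    if uri == "slow.pdf":
        raise ExtractionTimeout("read timed out")
    return f"text of {uri}", {"source": uri}


indexed, history = [], []
run_pipeline(["ok.pdf", "slow.pdf"], fake_extract, indexed.append, history)
for entry in history:
    print(entry)
```

Here `slow.pdf` gets an error entry for the extraction step and an OK entry for the ingest step, and it is indexed with empty content, which matches what shows up in the Simple History and in Solr.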

 

Cheers,

Julien 

 

 

From: Cihad Guzel <cg...@gmail.com>
Sent: Thursday, 20 October 2022 03:17
To: julien.massiera@francelabs.com
Cc: dev <de...@manifoldcf.apache.org>; user@manifoldcf.apache.org
Subject: Re: Tika Service Rmeta Connector Error

 

Hi,

The problem goes away when I increase the socket timeout on the MFC Tika connector edit page. I think "document ingest (Solr)" should not be OK when there is such a problem.

Regards,


Cihad Güzel

 

On Thu, 20 Oct 2022 at 02:28, Cihad Guzel <cguzelg@gmail.com> wrote:

 Hi Julien,

I ran the Tika 2.x service using the official Tika image available on Docker Hub. I am using MFC version 2.3. I activated the tika-service-rmeta connector for MFC. I created a job in MFC for a folder with 5 files in it. But OCR was not performed on some of the files. When I look at Solr, the content of some files seems empty. I also got the error messages shown in the attachment.

In the second test I made, this time I created 5 separate jobs to include each of the 5 files one by one. When I ran these jobs, I did not encounter any problems.

When I send these 5 files directly to the tika-service using curl it also works correctly.

When I examine the Simple History Report, I see error messages for some files as in the attached picture.

Could the Tika connector have a bug that causes an error when multiple files are sent to Tika? Could it be related to this issue? https://issues.apache.org/jira/browse/CONNECTORS-1733



Regards,


Cihad Güzel


Re: Tika Service Rmeta Connector Error

Posted by Cihad Guzel <cg...@gmail.com>.
Hi,

The problem goes away when I increase the socket timeout on the MFC Tika
connector edit page. I think "document ingest (Solr)" should not be OK when
there is such a problem.

Regards,
Cihad Güzel


On Thu, 20 Oct 2022 at 02:28, Cihad Guzel <cg...@gmail.com> wrote:

>  Hi Julien,
>
> I ran the Tika 2.x service using the official Tika image available on Docker
> Hub. I am using MFC version 2.3. I activated the tika-service-rmeta connector
> for MFC. I created a job in MFC for a folder with 5 files in it. But OCR
> was not performed on some of the files. When I look at Solr, the content of
> some files seems empty. I also got the error messages shown in the
> attachment.
>
> In the second test I made, this time I created 5 separate jobs to include
> each of the 5 files one by one. When I ran these jobs, I did not encounter
> any problems.
>
> When I send these 5 files directly to the tika-service using curl it also
> works correctly.
>
> When I examine the Simple History Report, I see error messages for some
> files as in the attached picture.
>
> Could the Tika connector have a bug that causes an error when multiple
> files are sent to Tika? Could it be related to this issue?
> https://issues.apache.org/jira/browse/CONNECTORS-1733
> [image: Screen Shot 2022-10-20 at 02.08.11.png]
> Regards,
> Cihad Güzel
>
