Posted to dev@manifoldcf.apache.org by Julien Massiera <ju...@francelabs.com> on 2019/06/05 14:40:57 UTC

Alternative approaches for jobs aborting on problematic docs

Hi Karl,

I don't know about other MCF users, but we have many use cases where
we need to crawl several million documents from different kinds of
repositories. With those, we sometimes struggle to manage situations
where crawl jobs suddenly stop because of problematic files, and the
only way to keep a job from aborting is to filter those files out.

From past discussions on the mailing list, I understand that from your
point of view it is preferable to stop a job when it encounters an
unknown and/or unexpected issue (or after several failed retries), so
that the issue is noticed and can be fixed.

Although I can understand your point of view, I do not think it covers
every behavior expected of MCF in production. As a matter of fact, we
have encountered several scenarios where customers would prefer the
crawl to keep moving on, while still giving us the ability to
investigate any file that may have been skipped (one argument being
that jobs are sometimes started on Friday evenings, and if one aborts
during the weekend, we lose up to 60 hours of crawling in the worst
case before the admin can check the status of the job).

Yet as of now this is not feasible, since jobs end up aborting when
they encounter problematic files that are not clearly identified.

We have brainstormed internally and have a proposal which we think can
satisfy both your view and ours, and which we hope you will find
acceptable:

Whenever a job encounters an error that is not clearly identified:
1. It immediately retries once;
2. If the retry succeeds, the crawl moves on as usual;
3. If it fails, the job moves this document to the current end of the
processing queue and crawls the remaining documents. The attempt
counter for this document is set to 2.
4. When the job reaches this document again, it retries. If it
succeeds, the crawl moves on as usual. If it fails, the document is
moved to the current end of the processing queue again, the attempt
counter is incremented by 1, and the delay before the next attempt is
doubled.
5. We iterate until the maximum number of attempts for the problematic
document has been reached. If the last attempt fails, abort the crawl.
With this behavior, a job is still aborted on critical errors, but at
least the maximum number of non-problematic documents will have been
crawled before the failure (a rough sketch of this retry/requeue logic
follows below).
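
To make this more concrete, here is a self-contained Java sketch of
what we have in mind. It is not actual MCF code: the names
(RetryQueueSketch, DocumentTask, Processor) are purely illustrative,
and the retry budget and delays are arbitrary placeholders.

import java.util.Deque;

/**
 * Illustrative only: requeue documents that fail with an unidentified
 * error, doubling the delay between attempts, and abort the job once a
 * retry budget is exhausted.
 */
public class RetryQueueSketch {

  /** Hypothetical stand-in for an entry in the crawl queue. */
  static final class DocumentTask {
    final String id;
    int attempts = 0;    // attempts already made on this document
    long delayMs = 0L;   // delay to respect before the next attempt
    DocumentTask(String id) { this.id = id; }
  }

  /** Hypothetical processing callback; throws on any error. */
  interface Processor {
    void process(String documentId) throws Exception;
  }

  static final int MAX_ATTEMPTS = 5;         // arbitrary retry budget per document
  static final long BASE_DELAY_MS = 60_000L; // arbitrary delay after the first requeue

  /** Returns when the queue is drained; throws to signal "abort the job". */
  static void crawl(Deque<DocumentTask> queue, Processor processor) throws Exception {
    while (!queue.isEmpty()) {
      DocumentTask doc = queue.pollFirst();
      Thread.sleep(doc.delayMs); // simplification: a real crawler would reschedule, not block
      try {
        processor.process(doc.id);          // steps 2/4: success, the crawl moves on as usual
      } catch (Exception error) {
        Exception lastError = error;
        if (doc.attempts == 0) {
          try {
            processor.process(doc.id);      // step 1: one immediate retry on the first failure
            continue;                       // the retry succeeded, move on
          } catch (Exception retryError) {
            lastError = retryError;
          }
          doc.attempts = 2;                 // step 3: initial attempt + immediate retry both failed
          doc.delayMs = BASE_DELAY_MS;
        } else {
          doc.attempts += 1;                // step 4: one more failed attempt
          doc.delayMs *= 2;                 // step 4: double the delay before the next attempt
        }
        if (doc.attempts >= MAX_ATTEMPTS) { // step 5: retry budget exhausted, abort the crawl
          throw new Exception("Aborting job, document keeps failing: " + doc.id, lastError);
        }
        queue.addLast(doc);                 // steps 3/4: requeue at the current end of the queue
      }
    }
  }
}

In MCF itself this logic would of course have to live in the framework's
worker threads and document queue rather than in a blocking loop; the
sketch is only meant to pin down the intended behavior.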

Another, more "direct" approach could be to simply have an optional
parameter for a job: a "skip errors" checkbox. This parameter would
tell the job to skip any error it encounters. This assumes we properly
log the errors in the log files and/or in the Simple History report, so
that we can still debug later on.
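
A minimal sketch of this second option, again with hypothetical names
(JobSettings, HistoryLogger) rather than real MCF classes: when the
checkbox is set, the error is recorded and the document is skipped;
otherwise it is rethrown and the job aborts as it does today.

/** Illustrative only: an optional "skip errors" behavior for a job. */
public class SkipErrorsSketch {

  /** Hypothetical job-level setting backing the proposed checkbox. */
  static final class JobSettings {
    final boolean skipErrors;
    JobSettings(boolean skipErrors) { this.skipErrors = skipErrors; }
  }

  /** Hypothetical sink standing in for the log file / Simple History. */
  interface HistoryLogger {
    void record(String documentId, String resultCode, String description);
  }

  /** Hypothetical processing callback; throws on any error. */
  interface Processor {
    void process(String documentId) throws Exception;
  }

  /** Processes or skips one document; throws to abort the job. */
  static void processOne(String documentId, JobSettings settings,
                         Processor processor, HistoryLogger history) throws Exception {
    try {
      processor.process(documentId);
    } catch (Exception e) {
      if (settings.skipErrors) {
        // Skip, but keep enough context to investigate the document later.
        history.record(documentId, "SKIPPED", e.toString());
        return;
      }
      throw e; // current behavior: an unidentified error aborts the job
    }
  }
}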

We would gladly welcome your thoughts on these 2 approaches.

Regards,
Julien

Re: Alternative approaches for jobs aborting on problematic docs

Posted by Karl Wright <da...@gmail.com>.
Generic SMB errors we can deal with differently, yes.
Non-existing/unreadable columns in JDBC sound much more fatal to me.
Indexing errors in Solr because of non-ASCII characters sound like a
true three-alarm fire, frankly, and we wouldn't want to just ignore
those.

Karl


Re: Alternative approaches for jobs aborting on problematic docs

Posted by Julien Massiera <ju...@francelabs.com>.
Hi Karl,

Sure, not all errors are the same, and we cannot deal with an OOM
error the same way as with a "file no longer exists" error, for
example.

The classes of errors that trigger frequent job abortions are generic
errors like:
- SmbException errors for the Windows shares connector
- problematic, non-existing or unreadable columns/blobs for the JDBC
connector
- more recently, insertion errors with the Solr output connector for
documents containing metadata with non-ASCII characters (the errors
occurred with Chinese/Japanese characters). The error mentioned a bad
HTTP request header, so it was most probably a 4xx/5xx HTTP error.

Do you think we can work out something to postpone/skip these classes
of errors? That would be great!
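
To illustrate what postponing/skipping could look like for these
classes, here is a standalone Java sketch. It does not use the real
connector APIs, and the mapping of each error class to a decision is
only a placeholder showing the shape of the idea, not an agreed
policy.

import java.io.IOException;
import java.sql.SQLException;

/** Illustrative only: a coarse mapping of error classes to a crawl decision. */
public class ErrorClassifierSketch {

  /** Hypothetical decision values a worker thread could act on. */
  enum Decision { RETRY_LATER, SKIP_DOCUMENT, ABORT_JOB }

  /**
   * Classify an error raised while fetching or indexing one document.
   * httpStatus is the status returned by the output connector (e.g.
   * Solr), or -1 when no HTTP call was involved.
   */
  static Decision classify(Throwable error, int httpStatus) {
    if (error instanceof OutOfMemoryError) {
      return Decision.ABORT_JOB;       // environment problem, not a document problem
    }
    if (error instanceof SQLException) {
      return Decision.SKIP_DOCUMENT;   // e.g. an unreadable column/blob in the JDBC connector
    }
    if (error instanceof IOException) {
      return Decision.RETRY_LATER;     // e.g. a transient SMB error on a Windows share
    }
    if (httpStatus >= 400 && httpStatus < 500) {
      return Decision.SKIP_DOCUMENT;   // e.g. Solr rejecting one document's metadata
    }
    if (httpStatus >= 500) {
      return Decision.RETRY_LATER;     // the target service itself is unhealthy
    }
    return Decision.ABORT_JOB;         // unknown error: keep today's conservative default
  }
}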

Regards,
Julien

-- 
Julien MASSIERA
Product Development Director
France Labs – The Search experts
Datafari – Winner of the 2018 Big Data trophy at the Digital Innovation Makers Summit
www.francelabs.com


Re: Alternative approaches for jobs aborting on problematic docs

Posted by Karl Wright <da...@gmail.com>.
Please let me note that there are *tons* of errors you can get when
crawling, from database errors to out-of-memory conditions to the actual
ones you care about, namely errors accessing the repository.  It is crucial
that the connector code separate these errors into those that are fatal,
those that can be retried, and those that indicate that the document should
be skipped.  It is simply not workable to try to insist that all errors are
the same.
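
To make that three-way split concrete (a standalone sketch with
hypothetical names, not the actual connector API), the framework-side
handling would look roughly like this:

/** Illustrative only: how a worker might act on the fatal / retryable / skippable split. */
public class OutcomeHandlerSketch {

  /** The three classes of errors a connector would report. */
  enum Outcome { FATAL, RETRYABLE, SKIPPABLE }

  /** Hypothetical scheduler that requeues a document for a later attempt. */
  interface RetryScheduler {
    void retryAfter(String documentId, long delayMs);
  }

  /** Hypothetical sink standing in for the job's activity history. */
  interface HistoryLogger {
    void record(String documentId, String resultCode, String description);
  }

  static void handle(String documentId, Outcome outcome, Exception cause,
                     RetryScheduler scheduler, HistoryLogger history) throws Exception {
    switch (outcome) {
      case FATAL:
        // Abort the whole job: continuing would only hide a real problem.
        throw new Exception("Fatal error on " + documentId, cause);
      case RETRYABLE:
        // Transient condition: try this document again later, keep crawling the rest.
        scheduler.retryAfter(documentId, 5 * 60 * 1000L); // arbitrary 5-minute delay
        break;
      case SKIPPABLE:
        // Document-level problem: record it so it can be investigated, then move on.
        history.record(documentId, "SKIPPED", String.valueOf(cause));
        break;
    }
  }
}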

The difficulty comes in what the default behavior is for certain classes of
errors that we've never seen before.  I'm perfectly fine with trying to
establish such a policy as you suggest in approach 1 for general classes of
errors that are seen.  But once again we need to catalog these and
enumerate at least what classes these are.  That's necessary on a
connector-by-connector basis.

The "brute force" approach of simply accepting all errors and continuing no
matter what will not work, because really it's the same problem and the
same bit of information you'd need to properly implement this.  There's no
shortcut I'm afraid.

Please let me know which errors you are seeing and for which connector and
let's work out how we handle them (or similar ones).

Karl

