You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/02/13 18:43:00 UTC

[jira] [Commented] (DRILL-7583) Remove STOP status in favor of fail-fast

    [ https://issues.apache.org/jira/browse/DRILL-7583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17036417#comment-17036417 ] 

ASF GitHub Bot commented on DRILL-7583:
---------------------------------------

paul-rogers commented on pull request #1981: DRILL-7583: Remove STOP status from operator outcome
URL: https://github.com/apache/drill/pull/1981
 
 
   Removes the `IterOutcome.STOP` status and revises the `kill` methods.
   
   ## Description
   
   Operators must handle both the "happy path" and failures. The "happy path" is defined by the `RecordBatch.IterOutcome` enum: each operator's `next()` method returns this outcome to describe what happened: batch, batch with new schema, or EOF.
   
   Operators originally handled the error path the same way: via an `IterOutcome` of `STOP`. Error handling was somewhat complex:
   
   * Set a failed flag in the operator,
   * Tell upstream operators they have failed,
   * Consuming data from the upstream until it returns `NONE`,
   * Originally, sometimes calling `close()`,
   * Returning a STOP status.
   
   Each operator implemented some variation or subset of the above. Early on, the error protocol was source of errors. I actually wrote a detailed analysis of the issues several years ago, and have been chipping away at fixing the problems ever since.
   
   Today, the error path mostly works; is simply a source of highly complex code. And, as it turns out, entirely unnecessary.
   
   Drill has long supported a "fail-fast" error path based on throwing an exception; relying on the fragment executor to clean up the operator stack. The "fail-fast" system is simple (throw an exception) and has a uniform way of cleaning up (call `close()` once on each operator.) Recent revisions have converted most operators to use the simpler fail-fast strategy based on throwing an exception instead of using the older STOP approach.
   
   The careful reader of those PRs will have noted that, as a result, no code returns the `STOP` status any longer.  This PR goes to the next logical step and removes the old, complex, and now-unused `STOP` based path.
   
   There is a related mechanism in operators: the `kill()` method, and its implementation, `killIncoming()`. The original purpose appears to be that step above: when an operator fails, fail all its children. Things got messy when an operator received a `STOP` status: should it call `kill()` on its child? On itself? Lots of fun bugs to fix back in the day.
   
   With `STOP` retired, the main purpose of the `kill()` method disappears. There is, however, a second use that we retain: cancelling upstream operators in the "normal" case. Think of the LIMIT operator: once a `LIMIT 10` has 10 rows, it need not ask for more. A Parquet reader, say, may still be busily reading columns in worker threads. The `LIMIT` wants to tell its upstream operators, "hey, I have all the rows I need, thanks. You can go ahead and stop reading more data." That is, we need a "normal case" cancellation.
   
   To make clear that `kill()` now does cancellation, this PR renames the method to `cancel()`.
   
   ## Documentation
   
   This is not a user-visible change, it is purely internal to the execution engine. (The one user benefit is that error messages might be a bit more precise because of earlier work.)
   
   ## Testing
   
   Reran all unit tests which revealed a few complexities that remain, such as the way operators handle the `cancel()` call. All tests pass. The next step is to run the full test suite as part of pre-commit.
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Remove STOP status in favor of fail-fast
> ----------------------------------------
>
>                 Key: DRILL-7583
>                 URL: https://issues.apache.org/jira/browse/DRILL-7583
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.17.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Major
>             Fix For: 1.18.0
>
>
> The original error solution was a complex process of a) setting a failed flag, b) telling all upstream operators they have failed, c) returning a {{STOP}} status.  Drill has long supported a "fail-fast" error path based on throwing an exception; relying on the fragment executor to clean up the operator stack. Recent revisions have converted most operators to use the simpler fail-fast strategy based on throwing an exception instead of using the older {{STOP}} approach. This change simply removes the old, now-unused {{STOP}} based path.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)