You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by abdelhakim deneche <ad...@gmail.com> on 2015/07/03 16:53:02 UTC

Review Request 36168: DRILL-3167: When a query fails, Foreman should wait for all fragments to finish cleaning up before sending a FAILED state to the client

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36168/
-----------------------------------------------------------

Review request for drill, Jacques Nadeau and Sudheesh Katkam.


Bugs: DRILL-3167
    https://issues.apache.org/jira/browse/DRILL-3167


Repository: drill-git


Description
-------

when a query fails Foreman moves to a FAILING state until all fragment return a terminal state. The Web UI will still display FAILED instead of FAILING

TestDrillbitResilience#failsWhenSendingFragments exposes a limitation to this approach: if an error occurs when setting up remote fragments we can't assume they will be able to return a terminal state. In this case the Foreman will not wait for them to finish and return FAILED to the client immediately.


Diffs
-----

  exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/profile/ProfileResources.java 6656bf6 
  exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/profile/ProfileWrapper.java dd26a76 
  exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java 716fb66 
  exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/QueryManager.java 9318233 
  exec/java-exec/src/main/resources/rest/profile/list.ftl cf92ede 
  exec/java-exec/src/main/resources/rest/profile/profile.ftl 46cdc83 
  exec/java-exec/src/test/java/org/apache/drill/exec/server/TestDrillbitResilience.java ce09f68 
  protocol/src/main/java/org/apache/drill/exec/proto/UserBitShared.java e76d748 
  protocol/src/main/java/org/apache/drill/exec/proto/beans/QueryResult.java 474e330 
  protocol/src/main/protobuf/UserBitShared.proto 0451fd2 

Diff: https://reviews.apache.org/r/36168/diff/


Testing
-------

unit tests are passing, ongoing cluster testing...


Thanks,

abdelhakim deneche


Re: Review Request 36168: DRILL-3167: When a query fails, Foreman should wait for all fragments to finish cleaning up before sending a FAILED state to the client

Posted by Jacques Nadeau <ja...@gmail.com>.

> On July 6, 2015, 6:35 p.m., Sudheesh Katkam wrote:
> > exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java, line 841
> > <https://reviews.apache.org/r/36168/diff/1/?file=998962#file998962line841>
> >
> >     If there is a large number of fragments being sent and the last one failed in setupNonRootFragments, we do not _wait for all fragments to finish cleaning up_, correct? It is a limitation then?
> 
> abdelhakim deneche wrote:
>     true and that is exactly the scenario described in the else-block comment.
>     
>     Unfortunately I couldn't find a straightforward way to fix that case (that was the reason I discarded the previous review request in the first place).
>     
>     One thing to keep in mind though, if we try to wait for all running fragments to finish (like we do for cancelled queries) there is a risk for the Foreman to wait forever if some of those fragments become unresponsive, or in this case the fragment didn't even start running on the remote fragment. That's exactly what I tried to avoid by using the "nonRootFragmentsSent" flag

We should strive to get to the point where we wait.  Our target is Foreman always knowing the correct state of its fragemnts and never being destroyed until everything else is already gone.  If there are edge cases we don't currently manage to achieve this, we should file bugs for those and get them fixed.


- Jacques


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36168/#review90514
-----------------------------------------------------------


On July 3, 2015, 4:13 p.m., abdelhakim deneche wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/36168/
> -----------------------------------------------------------
> 
> (Updated July 3, 2015, 4:13 p.m.)
> 
> 
> Review request for drill, Jacques Nadeau and Sudheesh Katkam.
> 
> 
> Bugs: DRILL-3167
>     https://issues.apache.org/jira/browse/DRILL-3167
> 
> 
> Repository: drill-git
> 
> 
> Description
> -------
> 
> when a query fails Foreman moves to a FAILING state until all fragment return a terminal state. The Web UI will still display FAILED instead of FAILING
> 
> TestDrillbitResilience#failsWhenSendingFragments exposes a limitation to this approach: if an error occurs when setting up remote fragments we can't assume they will be able to return a terminal state. In this case the Foreman will not wait for them to finish and return FAILED to the client immediately.
> 
> 
> Diffs
> -----
> 
>   exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/profile/ProfileResources.java 6656bf6 
>   exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/profile/ProfileWrapper.java dd26a76 
>   exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java 716fb66 
>   exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/QueryManager.java 9318233 
>   exec/java-exec/src/main/resources/rest/profile/list.ftl cf92ede 
>   exec/java-exec/src/main/resources/rest/profile/profile.ftl 46cdc83 
>   exec/java-exec/src/test/java/org/apache/drill/exec/server/TestDrillbitResilience.java ce09f68 
>   protocol/src/main/java/org/apache/drill/exec/proto/UserBitShared.java e76d748 
>   protocol/src/main/java/org/apache/drill/exec/proto/beans/QueryResult.java 474e330 
>   protocol/src/main/protobuf/UserBitShared.proto 0451fd2 
> 
> Diff: https://reviews.apache.org/r/36168/diff/
> 
> 
> Testing
> -------
> 
> unit tests are passing along with customer and tpch100
> 
> 
> Thanks,
> 
> abdelhakim deneche
> 
>


Re: Review Request 36168: DRILL-3167: When a query fails, Foreman should wait for all fragments to finish cleaning up before sending a FAILED state to the client

Posted by abdelhakim deneche <ad...@gmail.com>.

> On July 6, 2015, 6:35 p.m., Sudheesh Katkam wrote:
> > exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java, line 841
> > <https://reviews.apache.org/r/36168/diff/1/?file=998962#file998962line841>
> >
> >     If there is a large number of fragments being sent and the last one failed in setupNonRootFragments, we do not _wait for all fragments to finish cleaning up_, correct? It is a limitation then?

true and that is exactly the scenario described in the else-block comment.

Unfortunately I couldn't find a straightforward way to fix that case (that was the reason I discarded the previous review request in the first place).

One thing to keep in mind though, if we try to wait for all running fragments to finish (like we do for cancelled queries) there is a risk for the Foreman to wait forever if some of those fragments become unresponsive, or in this case the fragment didn't even start running on the remote fragment. That's exactly what I tried to avoid by using the "nonRootFragmentsSent" flag


- abdelhakim


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36168/#review90514
-----------------------------------------------------------


On July 3, 2015, 4:13 p.m., abdelhakim deneche wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/36168/
> -----------------------------------------------------------
> 
> (Updated July 3, 2015, 4:13 p.m.)
> 
> 
> Review request for drill, Jacques Nadeau and Sudheesh Katkam.
> 
> 
> Bugs: DRILL-3167
>     https://issues.apache.org/jira/browse/DRILL-3167
> 
> 
> Repository: drill-git
> 
> 
> Description
> -------
> 
> when a query fails Foreman moves to a FAILING state until all fragment return a terminal state. The Web UI will still display FAILED instead of FAILING
> 
> TestDrillbitResilience#failsWhenSendingFragments exposes a limitation to this approach: if an error occurs when setting up remote fragments we can't assume they will be able to return a terminal state. In this case the Foreman will not wait for them to finish and return FAILED to the client immediately.
> 
> 
> Diffs
> -----
> 
>   exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/profile/ProfileResources.java 6656bf6 
>   exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/profile/ProfileWrapper.java dd26a76 
>   exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java 716fb66 
>   exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/QueryManager.java 9318233 
>   exec/java-exec/src/main/resources/rest/profile/list.ftl cf92ede 
>   exec/java-exec/src/main/resources/rest/profile/profile.ftl 46cdc83 
>   exec/java-exec/src/test/java/org/apache/drill/exec/server/TestDrillbitResilience.java ce09f68 
>   protocol/src/main/java/org/apache/drill/exec/proto/UserBitShared.java e76d748 
>   protocol/src/main/java/org/apache/drill/exec/proto/beans/QueryResult.java 474e330 
>   protocol/src/main/protobuf/UserBitShared.proto 0451fd2 
> 
> Diff: https://reviews.apache.org/r/36168/diff/
> 
> 
> Testing
> -------
> 
> unit tests are passing along with customer and tpch100
> 
> 
> Thanks,
> 
> abdelhakim deneche
> 
>


Re: Review Request 36168: DRILL-3167: When a query fails, Foreman should wait for all fragments to finish cleaning up before sending a FAILED state to the client

Posted by Sudheesh Katkam <sk...@maprtech.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36168/#review90514
-----------------------------------------------------------



exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java (line 839)
<https://reviews.apache.org/r/36168/#comment143640>

    If there is a large number of fragments being sent and the last one failed in setupNonRootFragments, we do not _wait for all fragments to finish cleaning up_, correct? It is a limitation then?



exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java (line 1033)
<https://reviews.apache.org/r/36168/#comment143632>

    extra space?


- Sudheesh Katkam


On July 3, 2015, 4:13 p.m., abdelhakim deneche wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/36168/
> -----------------------------------------------------------
> 
> (Updated July 3, 2015, 4:13 p.m.)
> 
> 
> Review request for drill, Jacques Nadeau and Sudheesh Katkam.
> 
> 
> Bugs: DRILL-3167
>     https://issues.apache.org/jira/browse/DRILL-3167
> 
> 
> Repository: drill-git
> 
> 
> Description
> -------
> 
> when a query fails Foreman moves to a FAILING state until all fragment return a terminal state. The Web UI will still display FAILED instead of FAILING
> 
> TestDrillbitResilience#failsWhenSendingFragments exposes a limitation to this approach: if an error occurs when setting up remote fragments we can't assume they will be able to return a terminal state. In this case the Foreman will not wait for them to finish and return FAILED to the client immediately.
> 
> 
> Diffs
> -----
> 
>   exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/profile/ProfileResources.java 6656bf6 
>   exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/profile/ProfileWrapper.java dd26a76 
>   exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java 716fb66 
>   exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/QueryManager.java 9318233 
>   exec/java-exec/src/main/resources/rest/profile/list.ftl cf92ede 
>   exec/java-exec/src/main/resources/rest/profile/profile.ftl 46cdc83 
>   exec/java-exec/src/test/java/org/apache/drill/exec/server/TestDrillbitResilience.java ce09f68 
>   protocol/src/main/java/org/apache/drill/exec/proto/UserBitShared.java e76d748 
>   protocol/src/main/java/org/apache/drill/exec/proto/beans/QueryResult.java 474e330 
>   protocol/src/main/protobuf/UserBitShared.proto 0451fd2 
> 
> Diff: https://reviews.apache.org/r/36168/diff/
> 
> 
> Testing
> -------
> 
> unit tests are passing along with customer and tpch100
> 
> 
> Thanks,
> 
> abdelhakim deneche
> 
>


Re: Review Request 36168: DRILL-3167: When a query fails, Foreman should wait for all fragments to finish cleaning up before sending a FAILED state to the client

Posted by Sudheesh Katkam <sk...@maprtech.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36168/#review90724
-----------------------------------------------------------


Please add unit tests to ensure the behavior is as expected. Although you are making the change to fix a specific issue, it is a change in the state machine and this can introduce issues (and in the past, you have chased some of these too).

I can think of one: pause on one drillbit, fail on another, resume the pause and ensure you get the state *after* the resume.

- Sudheesh Katkam


On July 3, 2015, 4:13 p.m., abdelhakim deneche wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/36168/
> -----------------------------------------------------------
> 
> (Updated July 3, 2015, 4:13 p.m.)
> 
> 
> Review request for drill, Jacques Nadeau and Sudheesh Katkam.
> 
> 
> Bugs: DRILL-3167
>     https://issues.apache.org/jira/browse/DRILL-3167
> 
> 
> Repository: drill-git
> 
> 
> Description
> -------
> 
> when a query fails Foreman moves to a FAILING state until all fragment return a terminal state. The Web UI will still display FAILED instead of FAILING
> 
> TestDrillbitResilience#failsWhenSendingFragments exposes a limitation to this approach: if an error occurs when setting up remote fragments we can't assume they will be able to return a terminal state. In this case the Foreman will not wait for them to finish and return FAILED to the client immediately.
> 
> 
> Diffs
> -----
> 
>   exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/profile/ProfileResources.java 6656bf6 
>   exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/profile/ProfileWrapper.java dd26a76 
>   exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java 716fb66 
>   exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/QueryManager.java 9318233 
>   exec/java-exec/src/main/resources/rest/profile/list.ftl cf92ede 
>   exec/java-exec/src/main/resources/rest/profile/profile.ftl 46cdc83 
>   exec/java-exec/src/test/java/org/apache/drill/exec/server/TestDrillbitResilience.java ce09f68 
>   protocol/src/main/java/org/apache/drill/exec/proto/UserBitShared.java e76d748 
>   protocol/src/main/java/org/apache/drill/exec/proto/beans/QueryResult.java 474e330 
>   protocol/src/main/protobuf/UserBitShared.proto 0451fd2 
> 
> Diff: https://reviews.apache.org/r/36168/diff/
> 
> 
> Testing
> -------
> 
> unit tests are passing along with customer and tpch100
> 
> 
> Thanks,
> 
> abdelhakim deneche
> 
>


Re: Review Request 36168: DRILL-3167: When a query fails, Foreman should wait for all fragments to finish cleaning up before sending a FAILED state to the client

Posted by abdelhakim deneche <ad...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36168/
-----------------------------------------------------------

(Updated July 3, 2015, 4:13 p.m.)


Review request for drill, Jacques Nadeau and Sudheesh Katkam.


Bugs: DRILL-3167
    https://issues.apache.org/jira/browse/DRILL-3167


Repository: drill-git


Description
-------

when a query fails Foreman moves to a FAILING state until all fragment return a terminal state. The Web UI will still display FAILED instead of FAILING

TestDrillbitResilience#failsWhenSendingFragments exposes a limitation to this approach: if an error occurs when setting up remote fragments we can't assume they will be able to return a terminal state. In this case the Foreman will not wait for them to finish and return FAILED to the client immediately.


Diffs
-----

  exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/profile/ProfileResources.java 6656bf6 
  exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/profile/ProfileWrapper.java dd26a76 
  exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java 716fb66 
  exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/QueryManager.java 9318233 
  exec/java-exec/src/main/resources/rest/profile/list.ftl cf92ede 
  exec/java-exec/src/main/resources/rest/profile/profile.ftl 46cdc83 
  exec/java-exec/src/test/java/org/apache/drill/exec/server/TestDrillbitResilience.java ce09f68 
  protocol/src/main/java/org/apache/drill/exec/proto/UserBitShared.java e76d748 
  protocol/src/main/java/org/apache/drill/exec/proto/beans/QueryResult.java 474e330 
  protocol/src/main/protobuf/UserBitShared.proto 0451fd2 

Diff: https://reviews.apache.org/r/36168/diff/


Testing (updated)
-------

unit tests are passing along with customer and tpch100


Thanks,

abdelhakim deneche