You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by abdelhakim deneche <ad...@gmail.com> on 2015/07/03 16:53:02 UTC
Review Request 36168: DRILL-3167: When a query fails,
Foreman should wait for all fragments to finish cleaning up before sending a
FAILED state to the client
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36168/
-----------------------------------------------------------
Review request for drill, Jacques Nadeau and Sudheesh Katkam.
Bugs: DRILL-3167
https://issues.apache.org/jira/browse/DRILL-3167
Repository: drill-git
Description
-------
when a query fails Foreman moves to a FAILING state until all fragment return a terminal state. The Web UI will still display FAILED instead of FAILING
TestDrillbitResilience#failsWhenSendingFragments exposes a limitation to this approach: if an error occurs when setting up remote fragments we can't assume they will be able to return a terminal state. In this case the Foreman will not wait for them to finish and return FAILED to the client immediately.
Diffs
-----
exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/profile/ProfileResources.java 6656bf6
exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/profile/ProfileWrapper.java dd26a76
exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java 716fb66
exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/QueryManager.java 9318233
exec/java-exec/src/main/resources/rest/profile/list.ftl cf92ede
exec/java-exec/src/main/resources/rest/profile/profile.ftl 46cdc83
exec/java-exec/src/test/java/org/apache/drill/exec/server/TestDrillbitResilience.java ce09f68
protocol/src/main/java/org/apache/drill/exec/proto/UserBitShared.java e76d748
protocol/src/main/java/org/apache/drill/exec/proto/beans/QueryResult.java 474e330
protocol/src/main/protobuf/UserBitShared.proto 0451fd2
Diff: https://reviews.apache.org/r/36168/diff/
Testing
-------
unit tests are passing, ongoing cluster testing...
Thanks,
abdelhakim deneche
Re: Review Request 36168: DRILL-3167: When a query fails,
Foreman should wait for all fragments to finish cleaning up before sending a
FAILED state to the client
Posted by Jacques Nadeau <ja...@gmail.com>.
> On July 6, 2015, 6:35 p.m., Sudheesh Katkam wrote:
> > exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java, line 841
> > <https://reviews.apache.org/r/36168/diff/1/?file=998962#file998962line841>
> >
> > If there is a large number of fragments being sent and the last one failed in setupNonRootFragments, we do not _wait for all fragments to finish cleaning up_, correct? It is a limitation then?
>
> abdelhakim deneche wrote:
> true and that is exactly the scenario described in the else-block comment.
>
> Unfortunately I couldn't find a straightforward way to fix that case (that was the reason I discarded the previous review request in the first place).
>
> One thing to keep in mind though, if we try to wait for all running fragments to finish (like we do for cancelled queries) there is a risk for the Foreman to wait forever if some of those fragments become unresponsive, or in this case the fragment didn't even start running on the remote fragment. That's exactly what I tried to avoid by using the "nonRootFragmentsSent" flag
We should strive to get to the point where we wait. Our target is Foreman always knowing the correct state of its fragemnts and never being destroyed until everything else is already gone. If there are edge cases we don't currently manage to achieve this, we should file bugs for those and get them fixed.
- Jacques
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36168/#review90514
-----------------------------------------------------------
On July 3, 2015, 4:13 p.m., abdelhakim deneche wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/36168/
> -----------------------------------------------------------
>
> (Updated July 3, 2015, 4:13 p.m.)
>
>
> Review request for drill, Jacques Nadeau and Sudheesh Katkam.
>
>
> Bugs: DRILL-3167
> https://issues.apache.org/jira/browse/DRILL-3167
>
>
> Repository: drill-git
>
>
> Description
> -------
>
> when a query fails Foreman moves to a FAILING state until all fragment return a terminal state. The Web UI will still display FAILED instead of FAILING
>
> TestDrillbitResilience#failsWhenSendingFragments exposes a limitation to this approach: if an error occurs when setting up remote fragments we can't assume they will be able to return a terminal state. In this case the Foreman will not wait for them to finish and return FAILED to the client immediately.
>
>
> Diffs
> -----
>
> exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/profile/ProfileResources.java 6656bf6
> exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/profile/ProfileWrapper.java dd26a76
> exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java 716fb66
> exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/QueryManager.java 9318233
> exec/java-exec/src/main/resources/rest/profile/list.ftl cf92ede
> exec/java-exec/src/main/resources/rest/profile/profile.ftl 46cdc83
> exec/java-exec/src/test/java/org/apache/drill/exec/server/TestDrillbitResilience.java ce09f68
> protocol/src/main/java/org/apache/drill/exec/proto/UserBitShared.java e76d748
> protocol/src/main/java/org/apache/drill/exec/proto/beans/QueryResult.java 474e330
> protocol/src/main/protobuf/UserBitShared.proto 0451fd2
>
> Diff: https://reviews.apache.org/r/36168/diff/
>
>
> Testing
> -------
>
> unit tests are passing along with customer and tpch100
>
>
> Thanks,
>
> abdelhakim deneche
>
>
Re: Review Request 36168: DRILL-3167: When a query fails,
Foreman should wait for all fragments to finish cleaning up before sending a
FAILED state to the client
Posted by abdelhakim deneche <ad...@gmail.com>.
> On July 6, 2015, 6:35 p.m., Sudheesh Katkam wrote:
> > exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java, line 841
> > <https://reviews.apache.org/r/36168/diff/1/?file=998962#file998962line841>
> >
> > If there is a large number of fragments being sent and the last one failed in setupNonRootFragments, we do not _wait for all fragments to finish cleaning up_, correct? It is a limitation then?
true and that is exactly the scenario described in the else-block comment.
Unfortunately I couldn't find a straightforward way to fix that case (that was the reason I discarded the previous review request in the first place).
One thing to keep in mind though, if we try to wait for all running fragments to finish (like we do for cancelled queries) there is a risk for the Foreman to wait forever if some of those fragments become unresponsive, or in this case the fragment didn't even start running on the remote fragment. That's exactly what I tried to avoid by using the "nonRootFragmentsSent" flag
- abdelhakim
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36168/#review90514
-----------------------------------------------------------
On July 3, 2015, 4:13 p.m., abdelhakim deneche wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/36168/
> -----------------------------------------------------------
>
> (Updated July 3, 2015, 4:13 p.m.)
>
>
> Review request for drill, Jacques Nadeau and Sudheesh Katkam.
>
>
> Bugs: DRILL-3167
> https://issues.apache.org/jira/browse/DRILL-3167
>
>
> Repository: drill-git
>
>
> Description
> -------
>
> when a query fails Foreman moves to a FAILING state until all fragment return a terminal state. The Web UI will still display FAILED instead of FAILING
>
> TestDrillbitResilience#failsWhenSendingFragments exposes a limitation to this approach: if an error occurs when setting up remote fragments we can't assume they will be able to return a terminal state. In this case the Foreman will not wait for them to finish and return FAILED to the client immediately.
>
>
> Diffs
> -----
>
> exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/profile/ProfileResources.java 6656bf6
> exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/profile/ProfileWrapper.java dd26a76
> exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java 716fb66
> exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/QueryManager.java 9318233
> exec/java-exec/src/main/resources/rest/profile/list.ftl cf92ede
> exec/java-exec/src/main/resources/rest/profile/profile.ftl 46cdc83
> exec/java-exec/src/test/java/org/apache/drill/exec/server/TestDrillbitResilience.java ce09f68
> protocol/src/main/java/org/apache/drill/exec/proto/UserBitShared.java e76d748
> protocol/src/main/java/org/apache/drill/exec/proto/beans/QueryResult.java 474e330
> protocol/src/main/protobuf/UserBitShared.proto 0451fd2
>
> Diff: https://reviews.apache.org/r/36168/diff/
>
>
> Testing
> -------
>
> unit tests are passing along with customer and tpch100
>
>
> Thanks,
>
> abdelhakim deneche
>
>
Re: Review Request 36168: DRILL-3167: When a query fails,
Foreman should wait for all fragments to finish cleaning up before sending a
FAILED state to the client
Posted by Sudheesh Katkam <sk...@maprtech.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36168/#review90514
-----------------------------------------------------------
exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java (line 839)
<https://reviews.apache.org/r/36168/#comment143640>
If there is a large number of fragments being sent and the last one failed in setupNonRootFragments, we do not _wait for all fragments to finish cleaning up_, correct? It is a limitation then?
exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java (line 1033)
<https://reviews.apache.org/r/36168/#comment143632>
extra space?
- Sudheesh Katkam
On July 3, 2015, 4:13 p.m., abdelhakim deneche wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/36168/
> -----------------------------------------------------------
>
> (Updated July 3, 2015, 4:13 p.m.)
>
>
> Review request for drill, Jacques Nadeau and Sudheesh Katkam.
>
>
> Bugs: DRILL-3167
> https://issues.apache.org/jira/browse/DRILL-3167
>
>
> Repository: drill-git
>
>
> Description
> -------
>
> when a query fails Foreman moves to a FAILING state until all fragment return a terminal state. The Web UI will still display FAILED instead of FAILING
>
> TestDrillbitResilience#failsWhenSendingFragments exposes a limitation to this approach: if an error occurs when setting up remote fragments we can't assume they will be able to return a terminal state. In this case the Foreman will not wait for them to finish and return FAILED to the client immediately.
>
>
> Diffs
> -----
>
> exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/profile/ProfileResources.java 6656bf6
> exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/profile/ProfileWrapper.java dd26a76
> exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java 716fb66
> exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/QueryManager.java 9318233
> exec/java-exec/src/main/resources/rest/profile/list.ftl cf92ede
> exec/java-exec/src/main/resources/rest/profile/profile.ftl 46cdc83
> exec/java-exec/src/test/java/org/apache/drill/exec/server/TestDrillbitResilience.java ce09f68
> protocol/src/main/java/org/apache/drill/exec/proto/UserBitShared.java e76d748
> protocol/src/main/java/org/apache/drill/exec/proto/beans/QueryResult.java 474e330
> protocol/src/main/protobuf/UserBitShared.proto 0451fd2
>
> Diff: https://reviews.apache.org/r/36168/diff/
>
>
> Testing
> -------
>
> unit tests are passing along with customer and tpch100
>
>
> Thanks,
>
> abdelhakim deneche
>
>
Re: Review Request 36168: DRILL-3167: When a query fails,
Foreman should wait for all fragments to finish cleaning up before sending a
FAILED state to the client
Posted by Sudheesh Katkam <sk...@maprtech.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36168/#review90724
-----------------------------------------------------------
Please add unit tests to ensure the behavior is as expected. Although you are making the change to fix a specific issue, it is a change in the state machine and this can introduce issues (and in the past, you have chased some of these too).
I can think of one: pause on one drillbit, fail on another, resume the pause and ensure you get the state *after* the resume.
- Sudheesh Katkam
On July 3, 2015, 4:13 p.m., abdelhakim deneche wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/36168/
> -----------------------------------------------------------
>
> (Updated July 3, 2015, 4:13 p.m.)
>
>
> Review request for drill, Jacques Nadeau and Sudheesh Katkam.
>
>
> Bugs: DRILL-3167
> https://issues.apache.org/jira/browse/DRILL-3167
>
>
> Repository: drill-git
>
>
> Description
> -------
>
> when a query fails Foreman moves to a FAILING state until all fragment return a terminal state. The Web UI will still display FAILED instead of FAILING
>
> TestDrillbitResilience#failsWhenSendingFragments exposes a limitation to this approach: if an error occurs when setting up remote fragments we can't assume they will be able to return a terminal state. In this case the Foreman will not wait for them to finish and return FAILED to the client immediately.
>
>
> Diffs
> -----
>
> exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/profile/ProfileResources.java 6656bf6
> exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/profile/ProfileWrapper.java dd26a76
> exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java 716fb66
> exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/QueryManager.java 9318233
> exec/java-exec/src/main/resources/rest/profile/list.ftl cf92ede
> exec/java-exec/src/main/resources/rest/profile/profile.ftl 46cdc83
> exec/java-exec/src/test/java/org/apache/drill/exec/server/TestDrillbitResilience.java ce09f68
> protocol/src/main/java/org/apache/drill/exec/proto/UserBitShared.java e76d748
> protocol/src/main/java/org/apache/drill/exec/proto/beans/QueryResult.java 474e330
> protocol/src/main/protobuf/UserBitShared.proto 0451fd2
>
> Diff: https://reviews.apache.org/r/36168/diff/
>
>
> Testing
> -------
>
> unit tests are passing along with customer and tpch100
>
>
> Thanks,
>
> abdelhakim deneche
>
>
Re: Review Request 36168: DRILL-3167: When a query fails,
Foreman should wait for all fragments to finish cleaning up before sending a
FAILED state to the client
Posted by abdelhakim deneche <ad...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/36168/
-----------------------------------------------------------
(Updated July 3, 2015, 4:13 p.m.)
Review request for drill, Jacques Nadeau and Sudheesh Katkam.
Bugs: DRILL-3167
https://issues.apache.org/jira/browse/DRILL-3167
Repository: drill-git
Description
-------
when a query fails Foreman moves to a FAILING state until all fragment return a terminal state. The Web UI will still display FAILED instead of FAILING
TestDrillbitResilience#failsWhenSendingFragments exposes a limitation to this approach: if an error occurs when setting up remote fragments we can't assume they will be able to return a terminal state. In this case the Foreman will not wait for them to finish and return FAILED to the client immediately.
Diffs
-----
exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/profile/ProfileResources.java 6656bf6
exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/profile/ProfileWrapper.java dd26a76
exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/Foreman.java 716fb66
exec/java-exec/src/main/java/org/apache/drill/exec/work/foreman/QueryManager.java 9318233
exec/java-exec/src/main/resources/rest/profile/list.ftl cf92ede
exec/java-exec/src/main/resources/rest/profile/profile.ftl 46cdc83
exec/java-exec/src/test/java/org/apache/drill/exec/server/TestDrillbitResilience.java ce09f68
protocol/src/main/java/org/apache/drill/exec/proto/UserBitShared.java e76d748
protocol/src/main/java/org/apache/drill/exec/proto/beans/QueryResult.java 474e330
protocol/src/main/protobuf/UserBitShared.proto 0451fd2
Diff: https://reviews.apache.org/r/36168/diff/
Testing (updated)
-------
unit tests are passing along with customer and tpch100
Thanks,
abdelhakim deneche