You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Sailesh Mukil (Code Review)" <ge...@cloudera.org> on 2018/04/03 22:03:28 UTC

[Impala-ASF-CR] IMPALA-6792: Fail status reporting if coordinator refuses connections

Sailesh Mukil has uploaded this change for review. ( http://gerrit.cloudera.org:8080/9916


Change subject: IMPALA-6792: Fail status reporting if coordinator refuses connections
......................................................................

IMPALA-6792: Fail status reporting if coordinator refuses connections

The ReportExecStatusAux() function is run on a dedicated thread per
fragment instance. This thread will run until the fragment instance
completes executing.

On every attempt to send a report to the coordinator, it will attempt
to send up to 3 RPCs. If all 3 of them fail, then the fragment instance
will cancel itself.

However, there is one case where a failure to send the RPC will not
be considered a failed RPC. If when we attempt to obtain a new
connection, we end up creating a new connection
(via ClientCache::CreateClient()) instead of getting a previously
cached connection, and this new connection fails to even Open(),
it will not be counted as a RPC failure.

This patch counts such an error as a failed RPC too.

Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
---
M be/src/runtime/query-state.cc
1 file changed, 3 insertions(+), 2 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/16/9916/1
-- 
To view, visit http://gerrit.cloudera.org:8080/9916
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
Gerrit-Change-Number: 9916
Gerrit-PatchSet: 1
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>

[Impala-ASF-CR] IMPALA-6792: Fail status reporting if coordinator refuses connections

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change. ( http://gerrit.cloudera.org:8080/9916 )

Change subject: IMPALA-6792: Fail status reporting if coordinator refuses connections
......................................................................


Patch Set 2:

(2 comments)

> Patch Set 2: Code-Review+2
> 
> I'm fine with merging this as-is but I'd feel better if we followed up with a regression test.

I tried looking into adding a regression test, but it seems a little tricky to add. In the interest of time, I'll defer that now but I'll continue working on it either as a regression test or as part of a fault injection test suite.

http://gerrit.cloudera.org:8080/#/c/9916/2/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/9916/2/be/src/runtime/query-state.cc@260
PS2, Line 260:     if (i < 2) SleepForMs(100);
> I would feel more comfortable if the 100ms is under a flag as this might un
Done


http://gerrit.cloudera.org:8080/#/c/9916/2/be/src/runtime/query-state.cc@274
PS2, Line 274:                  << "Query " << PrintId(query_id()) << " may hang. See IMPALA-2990.";
> Can't connect to coordinator Vs. Can't report status?
The different errors would be:
1. "Can't obtain connection to coordinator"
2. "Can't reach the coordinator (but we have a connection)"

The third is part of our normal query lifecycle:
"ReportExecStatus RPC sent and coordinator asked us to cancel the fragment instances."

I've added log messages for all the above 3 cases.



-- 
To view, visit http://gerrit.cloudera.org:8080/9916
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
Gerrit-Change-Number: 9916
Gerrit-PatchSet: 2
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Apr 2018 18:50:17 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6792: Fail status reporting if coordinator refuses connections

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9916 )

Change subject: IMPALA-6792: Fail status reporting if coordinator refuses connections
......................................................................


Patch Set 3:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/9916/3/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/9916/3/be/src/runtime/query-state.cc@41
PS3, Line 41: DEFINE_int32(report_status_retry_interval, 100,
_ms to make the unit clear.


http://gerrit.cloudera.org:8080/#/c/9916/3/be/src/runtime/query-state.cc@42
PS3, Line 42:     "The interval to wait before retrying a failed status report RPC to the "
"interval in milliseconds"


http://gerrit.cloudera.org:8080/#/c/9916/3/be/src/runtime/query-state.cc@279
PS3, Line 279:                  << "Query " << PrintId(query_id()) << " may hang. See IMPALA-2990.";
Shouldn't we print client_status.GetDetail() here since we have it?


http://gerrit.cloudera.org:8080/#/c/9916/3/be/src/runtime/query-state.cc@282
PS3, Line 282:                  << "coordinator. (Eg: Connection timed out)"
Shouldn't we print rpc_status.GetDetail() here since we have it?


http://gerrit.cloudera.org:8080/#/c/9916/3/be/src/runtime/query-state.cc@288
PS3, Line 288:       LOG(INFO) << "Cancelling fragment instances as directed by the coordinator.";
Is it worth printing result_status.GetDetail() here since we have it? I guess usually it should be CANCELLED, but might be good to know for sure. Or does this just add noise?



-- 
To view, visit http://gerrit.cloudera.org:8080/9916
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
Gerrit-Change-Number: 9916
Gerrit-PatchSet: 3
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Apr 2018 19:01:27 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6792: Fail status reporting if coordinator refuses connections

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9916 )

Change subject: IMPALA-6792: Fail status reporting if coordinator refuses connections
......................................................................


Patch Set 2: Code-Review+2

I'm fine with merging this as-is but I'd feel better if we followed up with a regression test.


-- 
To view, visit http://gerrit.cloudera.org:8080/9916
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
Gerrit-Change-Number: 9916
Gerrit-PatchSet: 2
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Apr 2018 00:14:48 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6792: Fail status reporting if coordinator refuses connections

Posted by "Mostafa Mokhtar (Code Review)" <ge...@cloudera.org>.
Mostafa Mokhtar has posted comments on this change. ( http://gerrit.cloudera.org:8080/9916 )

Change subject: IMPALA-6792: Fail status reporting if coordinator refuses connections
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/9916/2/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/9916/2/be/src/runtime/query-state.cc@260
PS2, Line 260:     if (i < 2) SleepForMs(100);
I would feel more comfortable if the 100ms is under a flag as this might uncover false positives on large clusters with busy coordinators.


http://gerrit.cloudera.org:8080/#/c/9916/2/be/src/runtime/query-state.cc@274
PS2, Line 274:                  << "Query " << PrintId(query_id()) << " may hang. See IMPALA-2990.";
> Should we include the reason for the abort here? Otherwise the large number
Can't connect to coordinator Vs. Can't report status?



-- 
To view, visit http://gerrit.cloudera.org:8080/9916
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
Gerrit-Change-Number: 9916
Gerrit-PatchSet: 2
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Apr 2018 02:41:34 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6792: Fail status reporting if coordinator refuses connections

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9916 )

Change subject: IMPALA-6792: Fail status reporting if coordinator refuses connections
......................................................................


Patch Set 5: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/9916
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
Gerrit-Change-Number: 9916
Gerrit-PatchSet: 5
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Thu, 05 Apr 2018 02:17:08 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6792: Fail status reporting if coordinator refuses connections

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9916 )

Change subject: IMPALA-6792: Fail status reporting if coordinator refuses connections
......................................................................


Patch Set 4: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/9916
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
Gerrit-Change-Number: 9916
Gerrit-PatchSet: 4
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Apr 2018 19:40:15 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6792: Fail status reporting if coordinator refuses connections

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9916 )

Change subject: IMPALA-6792: Fail status reporting if coordinator refuses connections
......................................................................


Patch Set 2:

Can you think of a way to regression test this? It would be valuable to exercise this scenario.


-- 
To view, visit http://gerrit.cloudera.org:8080/9916
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
Gerrit-Change-Number: 9916
Gerrit-PatchSet: 2
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 03 Apr 2018 22:57:01 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6792: Fail status reporting if coordinator refuses connections

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change. ( http://gerrit.cloudera.org:8080/9916 )

Change subject: IMPALA-6792: Fail status reporting if coordinator refuses connections
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9916/1/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/9916/1/be/src/runtime/query-state.cc@269
PS1, Line 269: !client_status.o
> shouldn't we check client_status here too?
Good catch. Done.



-- 
To view, visit http://gerrit.cloudera.org:8080/9916
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
Gerrit-Change-Number: 9916
Gerrit-PatchSet: 2
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 03 Apr 2018 22:31:07 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6792: Fail status reporting if coordinator refuses connections

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change. ( http://gerrit.cloudera.org:8080/9916 )

Change subject: IMPALA-6792: Fail status reporting if coordinator refuses connections
......................................................................


Patch Set 4:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/9916/3/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/9916/3/be/src/runtime/query-state.cc@41
PS3, Line 41: DEFINE_int32(report_status_retry_interval_ms, 100,
> _ms to make the unit clear.
Done


http://gerrit.cloudera.org:8080/#/c/9916/3/be/src/runtime/query-state.cc@42
PS3, Line 42:     "The interval in milliseconds to wait before retrying a failed status report RPC to "
> "interval in milliseconds"
Done


http://gerrit.cloudera.org:8080/#/c/9916/3/be/src/runtime/query-state.cc@279
PS3, Line 279:                  << "). Query " << PrintId(query_id()) << " may hang. See IMPALA-2990.";
> Shouldn't we print client_status.GetDetail() here since we have it?
Good point. Done.


http://gerrit.cloudera.org:8080/#/c/9916/3/be/src/runtime/query-state.cc@282
PS3, Line 282:                  << "coordinator. (" << rpc_status.GetDetail()
> Shouldn't we print rpc_status.GetDetail() here since we have it?
Good point. Done.


http://gerrit.cloudera.org:8080/#/c/9916/3/be/src/runtime/query-state.cc@288
PS3, Line 288:       LOG(INFO) << "Cancelling fragment instances as directed by the coordinator. "
> Is it worth printing result_status.GetDetail() here since we have it? I gue
It mostly will be CANCELLED, but it can help in the case where somehow the coordinator cancels the query and leaves while this fragment instance is still running, which would make this an orphaned fragment instance. The error returned in that case would be something like "Could not find query", etc.

So I've added it.



-- 
To view, visit http://gerrit.cloudera.org:8080/9916
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
Gerrit-Change-Number: 9916
Gerrit-PatchSet: 4
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Apr 2018 19:13:22 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6792: Fail status reporting if coordinator refuses connections

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Tim Armstrong, Mostafa Mokhtar, Dan Hecht, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/9916

to look at the new patch set (#4).

Change subject: IMPALA-6792: Fail status reporting if coordinator refuses connections
......................................................................

IMPALA-6792: Fail status reporting if coordinator refuses connections

The ReportExecStatusAux() function is run on a dedicated thread per
fragment instance. This thread will run until the fragment instance
completes executing.

On every attempt to send a report to the coordinator, it will attempt
to send up to 3 RPCs. If all 3 of them fail, then the fragment instance
will cancel itself.

However, there is one case where a failure to send the RPC will not
be considered a failed RPC. If when we attempt to obtain a new
connection, we end up creating a new connection
(via ClientCache::CreateClient()) instead of getting a previously
cached connection, and this new connection fails to even Open(),
it will not be counted as a RPC failure.

This patch counts such an error as a failed RPC too.

This patch also clarifies some of the error log messages and introduces
a flag to control the sleep interval between failed ReportExecStatus RPC
retries.

Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
---
M be/src/runtime/query-state.cc
1 file changed, 25 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/16/9916/4
-- 
To view, visit http://gerrit.cloudera.org:8080/9916
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
Gerrit-Change-Number: 9916
Gerrit-PatchSet: 4
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-6792: Fail status reporting if coordinator refuses connections

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/9916 )

Change subject: IMPALA-6792: Fail status reporting if coordinator refuses connections
......................................................................


Patch Set 2: Code-Review+1

Please get a second review.


-- 
To view, visit http://gerrit.cloudera.org:8080/9916
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
Gerrit-Change-Number: 9916
Gerrit-PatchSet: 2
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 03 Apr 2018 22:32:01 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6792: Fail status reporting if coordinator refuses connections

Posted by "Lars Volker (Code Review)" <ge...@cloudera.org>.
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/9916 )

Change subject: IMPALA-6792: Fail status reporting if coordinator refuses connections
......................................................................


Patch Set 2:

(1 comment)

Just a thought passing by, no need to hold the change up for this.

http://gerrit.cloudera.org:8080/#/c/9916/2/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/9916/2/be/src/runtime/query-state.cc@274
PS2, Line 274:                  << "Query " << PrintId(query_id()) << " may hang. See IMPALA-2990.";
Should we include the reason for the abort here? Otherwise the large number of cases that possibly lead to this error could make debugging unnecessarily hard.



-- 
To view, visit http://gerrit.cloudera.org:8080/9916
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
Gerrit-Change-Number: 9916
Gerrit-PatchSet: 2
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Apr 2018 02:14:14 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6792: Fail status reporting if coordinator refuses connections

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Hello Lars Volker, Tim Armstrong, Mostafa Mokhtar, Dan Hecht, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/9916

to look at the new patch set (#3).

Change subject: IMPALA-6792: Fail status reporting if coordinator refuses connections
......................................................................

IMPALA-6792: Fail status reporting if coordinator refuses connections

The ReportExecStatusAux() function is run on a dedicated thread per
fragment instance. This thread will run until the fragment instance
completes executing.

On every attempt to send a report to the coordinator, it will attempt
to send up to 3 RPCs. If all 3 of them fail, then the fragment instance
will cancel itself.

However, there is one case where a failure to send the RPC will not
be considered a failed RPC. If when we attempt to obtain a new
connection, we end up creating a new connection
(via ClientCache::CreateClient()) instead of getting a previously
cached connection, and this new connection fails to even Open(),
it will not be counted as a RPC failure.

This patch counts such an error as a failed RPC too.

This patch also clarifies some of the error log messages and introduces
a flag to control the sleep interval between failed ReportExecStatus RPC
retries.

Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
---
M be/src/runtime/query-state.cc
1 file changed, 23 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/16/9916/3
-- 
To view, visit http://gerrit.cloudera.org:8080/9916
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
Gerrit-Change-Number: 9916
Gerrit-PatchSet: 3
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-6792: Fail status reporting if coordinator refuses connections

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9916 )

Change subject: IMPALA-6792: Fail status reporting if coordinator refuses connections
......................................................................


Patch Set 5:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2240/


-- 
To view, visit http://gerrit.cloudera.org:8080/9916
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
Gerrit-Change-Number: 9916
Gerrit-PatchSet: 5
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Apr 2018 22:23:57 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6792: Fail status reporting if coordinator refuses connections

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong, Dan Hecht, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/9916

to look at the new patch set (#2).

Change subject: IMPALA-6792: Fail status reporting if coordinator refuses connections
......................................................................

IMPALA-6792: Fail status reporting if coordinator refuses connections

The ReportExecStatusAux() function is run on a dedicated thread per
fragment instance. This thread will run until the fragment instance
completes executing.

On every attempt to send a report to the coordinator, it will attempt
to send up to 3 RPCs. If all 3 of them fail, then the fragment instance
will cancel itself.

However, there is one case where a failure to send the RPC will not
be considered a failed RPC. If when we attempt to obtain a new
connection, we end up creating a new connection
(via ClientCache::CreateClient()) instead of getting a previously
cached connection, and this new connection fails to even Open(),
it will not be counted as a RPC failure.

This patch counts such an error as a failed RPC too.

Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
---
M be/src/runtime/query-state.cc
1 file changed, 4 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/16/9916/2
-- 
To view, visit http://gerrit.cloudera.org:8080/9916
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
Gerrit-Change-Number: 9916
Gerrit-PatchSet: 2
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-6792: Fail status reporting if coordinator refuses connections

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change. ( http://gerrit.cloudera.org:8080/9916 )

Change subject: IMPALA-6792: Fail status reporting if coordinator refuses connections
......................................................................


Patch Set 5: Code-Review+2

Rebase, carry +2.


-- 
To view, visit http://gerrit.cloudera.org:8080/9916
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
Gerrit-Change-Number: 9916
Gerrit-PatchSet: 5
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 04 Apr 2018 22:23:36 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6792: Fail status reporting if coordinator refuses connections

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/9916 )

Change subject: IMPALA-6792: Fail status reporting if coordinator refuses connections
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9916/1/be/src/runtime/query-state.cc
File be/src/runtime/query-state.cc:

http://gerrit.cloudera.org:8080/#/c/9916/1/be/src/runtime/query-state.cc@269
PS1, Line 269: !rpc_status.ok()
shouldn't we check client_status here too?



-- 
To view, visit http://gerrit.cloudera.org:8080/9916
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
Gerrit-Change-Number: 9916
Gerrit-PatchSet: 1
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Tue, 03 Apr 2018 22:27:15 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6792: Fail status reporting if coordinator refuses connections

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change. ( http://gerrit.cloudera.org:8080/9916 )

Change subject: IMPALA-6792: Fail status reporting if coordinator refuses connections
......................................................................


Patch Set 1:

Something to add. This patch means that when previously we would have succeeded sending a report after many many iterations of trying, now we'll end up failing after 3 tries, which in all fairness was the original intent.

A comprehensive redesign of status reporting will be done as part of IMPALA-2990.


-- 
To view, visit http://gerrit.cloudera.org:8080/9916
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
Gerrit-Change-Number: 9916
Gerrit-PatchSet: 1
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Comment-Date: Tue, 03 Apr 2018 22:11:44 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6792: Fail status reporting if coordinator refuses connections

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/9916 )

Change subject: IMPALA-6792: Fail status reporting if coordinator refuses connections
......................................................................

IMPALA-6792: Fail status reporting if coordinator refuses connections

The ReportExecStatusAux() function is run on a dedicated thread per
fragment instance. This thread will run until the fragment instance
completes executing.

On every attempt to send a report to the coordinator, it will attempt
to send up to 3 RPCs. If all 3 of them fail, then the fragment instance
will cancel itself.

However, there is one case where a failure to send the RPC will not
be considered a failed RPC. If when we attempt to obtain a new
connection, we end up creating a new connection
(via ClientCache::CreateClient()) instead of getting a previously
cached connection, and this new connection fails to even Open(),
it will not be counted as a RPC failure.

This patch counts such an error as a failed RPC too.

This patch also clarifies some of the error log messages and introduces
a flag to control the sleep interval between failed ReportExecStatus RPC
retries.

Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
Reviewed-on: http://gerrit.cloudera.org:8080/9916
Reviewed-by: Sailesh Mukil <sa...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M be/src/runtime/query-state.cc
1 file changed, 25 insertions(+), 9 deletions(-)

Approvals:
  Sailesh Mukil: Looks good to me, approved
  Impala Public Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/9916
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: If668838f99f78b5ffa713488178b2eb5799ba220
Gerrit-Change-Number: 9916
Gerrit-PatchSet: 6
Gerrit-Owner: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mm...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>