You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org> on 2016/09/22 23:49:41 UTC

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Thomas Tauber-Marshall has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/4519

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................

IMPALA-4135: Thrift threaded server times-out connections during high load

During times of high load, Thrift's TThreadedServer can't keep up with the
rate of new socket connections, causing some to time out.

This patch creates TAcceptQueueThreadedServer, which is a modified version of
TThreadedServer that calls accept() and then hands the returned TTransport
off to a thread pool to handle setting up the connection. This ensures that
accept() is called as quickly as possible, preventing connections from timing
out while waiting.

This patch has been tested locally with the repro shown in IMPALA-4135, but
it still needs to be tested in a real cluster.

Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
---
M be/CMakeLists.txt
M be/src/rpc/thrift-server.cc
A be/src/server/CMakeLists.txt
A be/src/server/TAcceptQueueThreadedServer.cpp
A be/src/server/TAcceptQueueThreadedServer.h
5 files changed, 449 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4519/1
-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has uploaded a new patch set (#3).

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................

IMPALA-4135: Thrift threaded server times-out connections during high load

During times of high load, Thrift's TThreadedServer can't keep up with the
rate of new socket connections, causing some to time out.

This patch creates TAcceptQueueServer, which is a modified version of
TThreadedServer that calls accept() and then hands the returned TTransport
off to a thread pool to handle setting up the connection. This ensures that
accept() is called as quickly as possible, preventing connections from timing
out while waiting.

This patch has been tested locally with the repro shown in IMPALA-4135, but
it still needs to be tested in a real cluster.

Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
---
M be/src/common/global-flags.cc
M be/src/rpc/CMakeLists.txt
A be/src/rpc/TAcceptQueueServer.cpp
A be/src/rpc/TAcceptQueueServer.h
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-server.cc
6 files changed, 449 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4519/3
-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................


Patch Set 3:

(8 comments)

Some comments before I head off to Strata. I would run clang-format over the new thrift files. That way they'll be more readable to Impala developers, while keeping the structure and idioms of the thrift code.

http://gerrit.cloudera.org:8080/#/c/4519/3/be/src/rpc/TAcceptQueueServer.cpp
File be/src/rpc/TAcceptQueueServer.cpp:

PS3, Line 204: 1
this is an important enough parameter that I'd make a constant for it, and put your comments about only using thread on that. We don't want someone changing the number of threads without understanding the thread-safety implications.


PS3, Line 216: acceptPool.Offer returned false.
If this returns false it's because the queue is shut down. Better to say that clearly in the error message so users have an idea what this might mean. Also you can say that it's unexpected - we never shut down our servers so the queue should never be shut down.


http://gerrit.cloudera.org:8080/#/c/4519/3/be/src/rpc/TAcceptQueueServer.h
File be/src/rpc/TAcceptQueueServer.h:

Line 46:  * connections will time out while in the OS accept queue.
add a reference to IMPALA-4135 here.


PS3, Line 105: std::mutex big_lock_;
Is this used anywhere?


http://gerrit.cloudera.org:8080/#/c/4519/3/be/src/rpc/thrift-server-test.cc
File be/src/rpc/thrift-server-test.cc:

Line 174:   ThreadPool<int64_t> pool("group", "test", 256, 10000, [](int tid, const int64_t& item) {
Add a comment about what you're testing and the associated JIRA.


PS3, Line 178: ASSERT_OK(status);
Does this work in a threaded context - does the test fail if the thread hits an ASSERT?


PS3, Line 180: for (int i = 0; i < 1024 * 16; ++i) pool.Offer(i);
I'm a bit concerned about failing with ulimit errors, particularly because both client and server should be in the same process. Have you ever seen this repro with fewer concurrent connections - say 8K?


PS3, Line 173: TEST(ConcurrencyTest, ConnectTimeout) {
             :   ThreadPool<int64_t> pool("group", "test", 256, 10000, [](int tid, const int64_t& item) {
             :         using Client = ThriftClient<ImpalaInternalServiceClient>;
             :         Client* client = new Client("127.0.0.1", 22000, "", NULL, false);
             :         Status status = client->Open();
             :         ASSERT_OK(status);
             :       });
             :   for (int i = 0; i < 1024 * 16; ++i) pool.Offer(i);
             :   pool.DrainAndShutdown();
             : }
This relies on a running impala internal service - you probably had an Impala process running when you ran this? That won't be true in our builds. Use the same server code that all the other tests use to start a server in this process.


-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Internal Jenkins (Code Review)" <ge...@cloudera.org>.
Internal Jenkins has posted comments on this change.

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................


Patch Set 8: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has uploaded a new patch set (#4).

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................

IMPALA-4135: Thrift threaded server times-out connections during high load

During times of high load, Thrift's TThreadedServer can't keep up with the
rate of new socket connections, causing some to time out.

This patch creates TAcceptQueueServer, which is a modified version of
TThreadedServer that calls accept() and then hands the returned TTransport
off to a thread pool to handle setting up the connection. This ensures that
accept() is called as quickly as possible, preventing connections from timing
out while waiting.

This patch has been tested locally with the repro shown in IMPALA-4135, but
it still needs to be tested in a real cluster.

Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
---
M be/src/common/global-flags.cc
M be/src/rpc/CMakeLists.txt
A be/src/rpc/TAcceptQueueServer.cpp
A be/src/rpc/TAcceptQueueServer.h
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-server.cc
6 files changed, 457 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4519/4
-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Juan Yu (Code Review)" <ge...@cloudera.org>.
Juan Yu has posted comments on this change.

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/TAcceptQueueThreadedServer.cpp
File be/src/server/TAcceptQueueThreadedServer.cpp:

PS1, Line 211: Offer
This can return false, right?
shouldn't we handle the case it returns false?


-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................


Patch Set 4:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/4519/3/be/src/rpc/TAcceptQueueServer.cpp
File be/src/rpc/TAcceptQueueServer.cpp:

PS3, Line 204: 
> this is an important enough parameter that I'd make a constant for it, and 
Done


PS3, Line 216: 
> If this returns false it's because the queue is shut down. Better to say th
Done


PS3, Line 218: break;
> set stop_ to true to cleanup?
Done


http://gerrit.cloudera.org:8080/#/c/4519/3/be/src/rpc/TAcceptQueueServer.h
File be/src/rpc/TAcceptQueueServer.h:

Line 46:  *
> add a reference to IMPALA-4135 here.
Done


PS3, Line 105: 
> Is this used anywhere?
Done


http://gerrit.cloudera.org:8080/#/c/4519/3/be/src/rpc/thrift-server-test.cc
File be/src/rpc/thrift-server-test.cc:

Line 174:   // Test that a large number of concurrent connections will all succeed and not time out
> Add a comment about what you're testing and the associated JIRA.
Done


PS3, Line 178: _OK(server->Start(
> Does this work in a threaded context - does the test fail if the thread hit
Yes, although the test runs to completion before reporting the error.


PS3, Line 180: ThreadPool<int64_t> pool(
> I'm a bit concerned about failing with ulimit errors, particularly because 
Agreed, this test has a lot of potential for flakiness or other pain. Not sure how else to do an automatic test of this issue, though.

As written, it doesn't even actually fail on my machine without the fix, unless I add an artificial slowdown.


PS3, Line 173: TEST(ConcurrencyTest, ConnectTimeout) {
             :   // Test that a large number of concurrent connections will all succeed and not time out
             :   // waiting to be accepted.(IMPALA-4135)
             :   int port = GetServerPort();
             :   ThriftServer* server = new ThriftServer("DummyServer", MakeProcessor(), port);
             :   ASSERT_OK(server->Start());
             : 
             :   ThreadPool<int64_t> pool(
             :       "group", "test", 256, 10000, [port](int tid, const int64_t& item) {
             :  
> This relies on a running impala internal service - you probably had an Impa
Done


http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/TAcceptQueueThreadedServer.cpp
File be/src/server/TAcceptQueueThreadedServer.cpp:

PS1, Line 233: 
> Oops, sorry about that.
No problem.


-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Hello Henry Robinson,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4519

to look at the new patch set (#7).

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................

IMPALA-4135: Thrift threaded server times-out connections during high load

During times of high load, Thrift's TThreadedServer can't keep up with the
rate of new socket connections, causing some to time out.

This patch creates TAcceptQueueServer, which is a modified version of
TThreadedServer that calls accept() and then hands the returned TTransport
off to a thread pool to handle setting up the connection. This ensures that
accept() is called as quickly as possible, preventing connections from timing
out while waiting.

It also adds a metric, connection-setup-queue-size, to monitor the number
of accepted connections waiting to be processed.

Testing:
- New test added to thrift-server-test.
- Locally with the repro shown in IMPALA-4135.
- On the 16-node with a real repro query.
- Ran the stress test for a while.

Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
---
M be/src/common/global-flags.cc
M be/src/rpc/CMakeLists.txt
A be/src/rpc/TAcceptQueueServer.cpp
A be/src/rpc/TAcceptQueueServer.h
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-server.cc
M be/src/rpc/thrift-server.h
M common/thrift/metrics.json
8 files changed, 525 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4519/7
-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4519/9/be/src/common/global-flags.cc
File be/src/common/global-flags.cc:

Line 116:     "to a thread pool to finish setup, reducing the chances that connections time out "
> what's the rationale for making this optional and leaving the old logic in 
While we're confident in this implementation, it's not a bad idea to have a get-out-of-jail flag until we've seen it be successful in real-world deployments. 

This only controls which kind of object gets created, so keeping it doesn't add much complexity over removing it.


-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................


Patch Set 1:

Have you considered making TAcceptQueueThreadedServer inherit from TThreadedServer and just overloading the serve() functions, and adding the doAccept()?

The rest of the code looks mostly the same.

-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Marcel Kornacker (Code Review)" <ge...@cloudera.org>.
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................


Patch Set 9:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/4519/9/be/src/common/global-flags.cc
File be/src/common/global-flags.cc:

Line 116:     "to a thread pool to finish setup, reducing the chances that connections time out "
what's the rationale for making this optional and leaving the old logic in place?


http://gerrit.cloudera.org:8080/#/c/4519/9/be/src/rpc/TAcceptQueueServer.h
File be/src/rpc/TAcceptQueueServer.h:

Line 22: #ifndef _THRIFT_SERVER_TACCEPTQUEUESERVER_H_
change file name to reflect our naming conventions


-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................


Patch Set 8: Code-Review+2

Rebase (plus add a flag and disable the test), carry +2.

-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................


Patch Set 6: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4519/6//COMMIT_MSG
Commit Message:

PS6, Line 25: awhile
a while


-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has uploaded a new patch set (#5).

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................

IMPALA-4135: Thrift threaded server times-out connections during high load

During times of high load, Thrift's TThreadedServer can't keep up with the
rate of new socket connections, causing some to time out.

This patch creates TAcceptQueueServer, which is a modified version of
TThreadedServer that calls accept() and then hands the returned TTransport
off to a thread pool to handle setting up the connection. This ensures that
accept() is called as quickly as possible, preventing connections from timing
out while waiting.

This patch has been tested locally with the repro shown in IMPALA-4135, but
it still needs to be tested in a real cluster.

Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
---
M be/src/common/global-flags.cc
M be/src/rpc/CMakeLists.txt
A be/src/rpc/TAcceptQueueServer.cpp
A be/src/rpc/TAcceptQueueServer.h
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-server.cc
6 files changed, 459 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4519/5
-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has uploaded a new patch set (#2).

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................

IMPALA-4135: Thrift threaded server times-out connections during high load

During times of high load, Thrift's TThreadedServer can't keep up with the
rate of new socket connections, causing some to time out.

This patch creates TAcceptQueueServer, which is a modified version of
TThreadedServer that calls accept() and then hands the returned TTransport
off to a thread pool to handle setting up the connection. This ensures that
accept() is called as quickly as possible, preventing connections from timing
out while waiting.

This patch has been tested locally with the repro shown in IMPALA-4135, but
it still needs to be tested in a real cluster.

Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
---
M be/src/common/global-flags.cc
M be/src/rpc/CMakeLists.txt
A be/src/rpc/TAcceptQueueServer.cpp
A be/src/rpc/TAcceptQueueServer.h
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-server.cc
6 files changed, 451 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4519/2
-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................


Patch Set 5:

(3 comments)

This looks good to me. 

One last thing: we should think about how we're going to monitor this. Could you add metrics to this server to measure the queue size? You can add a "InitMetrics(MetricGroup*)" call to TAcceptQueueServer, called from ThriftServer. The most important metric would be a queue length one, incremented when the accept thread adds a connection to the queue, and decremented on the other side.

http://gerrit.cloudera.org:8080/#/c/4519/5//COMMIT_MSG
Commit Message:

PS5, Line 18: This patch has been tested locally with the repro shown in IMPALA-4135, but
            : it still needs to be tested in a real cluster.
no longer true.

Add a comment about what testing you did.


http://gerrit.cloudera.org:8080/#/c/4519/5/be/src/rpc/TAcceptQueueServer.h
File be/src/rpc/TAcceptQueueServer.h:

PS5, Line 20: // with the
            : // significant changes noted inline below.
strange line break in comment again


PS5, Line 45: thread pool
"separate thread, asynchronously. "


-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................


Patch Set 1:

(7 comments)

I think Sailesh's suggestion is a good one. Is there any state in TThreadedServer that can't be accessed by a subclass? 

 I agree that a separate library seems unnecessary. There doesn't seem any need for a separate top-level dir; I expected a new directory under rpc/. But the simplest is probably just to put the file in rpc/ and be done with it.

http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/rpc/thrift-server.cc
File be/src/rpc/thrift-server.cc:

Line 419:       server_.reset(new TAcceptQueueThreadedServer(processor_, server_socket,
let's make the choice of server configurable by a flag that we can turn off if needed.


http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/TAcceptQueueThreadedServer.cpp
File be/src/server/TAcceptQueueThreadedServer.cpp:

Line 144:     inputTransportFactory_->getTransport(client);
do you see much slowdown in synchronizing around these few lines that create protocols, transports and processors? If not, let's do that to make sure we rule out any possibility of race conditions.


Line 164:     shared_ptr<Thread> thread = shared_ptr<Thread>(threadFactory_->newThread(runnable));
Is this definitely thread-safe? I think this might be the bottleneck (or maybe it's start()?) so it's going to be useful to keep unsynchronized.


Line 200:   ThreadPool<shared_ptr<TTransport>> acceptPool("server-accept", "accept-worker", 1, 10000,
How did you arrive at 1 as the default number of worker threads? What happens as that number increases? If it *has* to be one, to keep doAccept() thread-safe, add a comment here to that effect.


Line 236:       acceptPool.DrainAndShutdown();
I'm not sure we ever actually stop the thrift servers, so it might be moot - but shouldn't you signal to the worker threads that they should shut down and therefore not create any new threads?


http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/TAcceptQueueThreadedServer.h
File be/src/server/TAcceptQueueThreadedServer.h:

PS1, Line 1: // This file was copied from apache::thrift::server::TThreadedServer.cpp v0.9.0, with the
           : // significant changes noted inline below.
           : /*
           :  * Licensed to the Apache Software Foundation (ASF) under one
           :  * or more contributor license agreements. See the NOTICE file
           :  * distributed with this work for additional information
           :  * regarding copyright ownership. The ASF licenses this file
           :  * to you under the Apache License, Version 2.0 (the
           :  * "License"); you may not use this file except in compliance
           :  * with the License. You may obtain a copy of the License at
           :  *
           :  *   http://www.apache.org/licenses/LICENSE-2.0
           :  *
           :  * Unless required by applicable law or agreed to in writing,
           :  * software distributed under the License is distributed on an
           :  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
           :  * KIND, either express or implied. See the License for the
           :  * specific language governing permissions and limitations
           :  * under the License.
           :  */
> I think this is fine, but if you haven't already, can you ask Jim or Henry 
AFAIK it's completely fine - copying from another ASF project is legit.

That said, I'd keep the license header as the first thing in the file (and if you subclass, so much the better).


PS1, Line 44: and effectively creating a new,
            :  * internally managed, accept queue.
not exactly - the connections have been accepted, the next state is transport set-up.


-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has uploaded a new patch set (#8).

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................

IMPALA-4135: Thrift threaded server times-out connections during high load

During times of high load, Thrift's TThreadedServer can't keep up with the
rate of new socket connections, causing some to time out.

This patch creates TAcceptQueueServer, which is a modified version of
TThreadedServer that calls accept() and then hands the returned TTransport
off to a thread pool to handle setting up the connection. This ensures that
accept() is called as quickly as possible, preventing connections from timing
out while waiting.

It also adds a metric, connection-setup-queue-size, to monitor the number
of accepted connections waiting to be processed.

A flag, --accepted_cnxn_queue_depth, controls the size of the accepted
connection buffer.

Testing:
- New test added to thrift-server-test. (Disabled by default, due to
  high ulimit -n requirement)
- Locally with the repro shown in IMPALA-4135.
- On the 16-node with a real repro query.
- Ran the stress test for a while.

Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
---
M be/src/common/global-flags.cc
M be/src/rpc/CMakeLists.txt
A be/src/rpc/TAcceptQueueServer.cpp
A be/src/rpc/TAcceptQueueServer.h
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-server.cc
M be/src/rpc/thrift-server.h
M common/thrift/metrics.json
8 files changed, 545 insertions(+), 14 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4519/8
-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Juan Yu (Code Review)" <ge...@cloudera.org>.
Juan Yu has posted comments on this change.

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................


Patch Set 1:

(2 comments)

code style, we use 4 space indent for continued line

http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/TAcceptQueueThreadedServer.cpp
File be/src/server/TAcceptQueueThreadedServer.cpp:

PS1, Line 233: if (stop_) {
shouldn't this be out of the while loop? to guarantee a proper cleanup.


PS1, Line 252: stop_ = false
why set stop_ back to false?


-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................


Patch Set 4:

(7 comments)

Looking pretty good.

http://gerrit.cloudera.org:8080/#/c/4519/4/be/src/rpc/TAcceptQueueServer.cpp
File be/src/rpc/TAcceptQueueServer.cpp:

PS4, Line 136: doAccept
doAccept() isn't a great name (I know it was from my patch!) because this isn't actually doing the work of accept, but instead establishing a connection. Perhaps call this SetupConnection() ?

You can use Impala's convention for method capitalisation; again makes it easier to see what was added.


PS4, Line 204: acceptPool
connection_handler_pool, or something similar.


PS4, Line 205: ACCEPT_THREAD_POOL_SIZE
This can be defined in this method as a constexpr.


http://gerrit.cloudera.org:8080/#/c/4519/4/be/src/rpc/TAcceptQueueServer.h
File be/src/rpc/TAcceptQueueServer.h:

PS4, Line 96: thread
strange line breaks in this comment.


PS4, Line 97: static const int ACCEPT_THREAD_POOL_SIZE = 1;
move into .cc file


http://gerrit.cloudera.org:8080/#/c/4519/4/be/src/rpc/thrift-server-test.cc
File be/src/rpc/thrift-server-test.cc:

PS4, Line 173: ConnectTimeout
Call this ManyConcurrentConnection or something.


Line 175:   // waiting to be accepted.(IMPALA-4135)
Mention that this does not always fail. How long does it take to run?


-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has submitted this change and it was merged.

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................


IMPALA-4135: Thrift threaded server times-out connections during high load

During times of high load, Thrift's TThreadedServer can't keep up with the
rate of new socket connections, causing some to time out.

This patch creates TAcceptQueueServer, which is a modified version of
TThreadedServer that calls accept() and then hands the returned TTransport
off to a thread pool to handle setting up the connection. This ensures that
accept() is called as quickly as possible, preventing connections from timing
out while waiting.

It also adds a metric, connection-setup-queue-size, to monitor the number
of accepted connections waiting to be processed.

A flag, --accepted_cnxn_queue_depth, controls the size of the accepted
connection buffer.

Testing:
- New test added to thrift-server-test. (Disabled by default, due to
  high ulimit -n requirement)
- Locally with the repro shown in IMPALA-4135.
- On the 16-node with a real repro query.
- Ran the stress test for a while.

Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Reviewed-on: http://gerrit.cloudera.org:8080/4519
Tested-by: Internal Jenkins
Reviewed-by: Henry Robinson <he...@cloudera.com>
---
M be/src/common/global-flags.cc
M be/src/rpc/CMakeLists.txt
A be/src/rpc/TAcceptQueueServer.cpp
A be/src/rpc/TAcceptQueueServer.h
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-server.cc
M be/src/rpc/thrift-server.h
M common/thrift/metrics.json
8 files changed, 545 insertions(+), 14 deletions(-)

Approvals:
  Henry Robinson: Looks good to me, approved
  Internal Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4519/6//COMMIT_MSG
Commit Message:

PS6, Line 25: a whil
> a while
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................


Patch Set 6:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/4519/5//COMMIT_MSG
Commit Message:

PS5, Line 18: It also adds a metric, connection-setup-queue-size, to monitor the number
            : of accepted connections waiting to be processe
> no longer true.
Done


http://gerrit.cloudera.org:8080/#/c/4519/5/be/src/rpc/TAcceptQueueServer.h
File be/src/rpc/TAcceptQueueServer.h:

PS5, Line 20: // significant changes noted inline below.
            : 
> strange line break in comment again
Done


PS5, Line 45: d then imme
> "separate thread, asynchronously. "
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Juan Yu (Code Review)" <ge...@cloudera.org>.
Juan Yu has posted comments on this change.

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/4519/3/be/src/rpc/TAcceptQueueServer.cpp
File be/src/rpc/TAcceptQueueServer.cpp:

PS3, Line 218: 
set stop_ to true to cleanup?


http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/TAcceptQueueThreadedServer.cpp
File be/src/server/TAcceptQueueThreadedServer.cpp:

PS1, Line 233: if (stop_) {
> Its not in the while loop.
Oops, sorry about that.


-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................


Patch Set 5:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/4519/4/be/src/rpc/TAcceptQueueServer.cpp
File be/src/rpc/TAcceptQueueServer.cpp:

PS4, Line 136: SetupCon
> doAccept() isn't a great name (I know it was from my patch!) because this i
Done


PS4, Line 204: rift code 
> connection_handler_pool, or something similar.
Done


PS4, Line 205: texpr int CONNECTION_SE
> This can be defined in this method as a constexpr.
Done


http://gerrit.cloudera.org:8080/#/c/4519/4/be/src/rpc/TAcceptQueueServer.h
File be/src/rpc/TAcceptQueueServer.h:

PS4, Line 96: New - 
> strange line breaks in this comment.
Done


PS4, Line 97: // up the connection and starting a thread to
> move into .cc file
Done


http://gerrit.cloudera.org:8080/#/c/4519/4/be/src/rpc/thrift-server-test.cc
File be/src/rpc/thrift-server-test.cc:

PS4, Line 173: ManyConcurrent
> Call this ManyConcurrentConnection or something.
Done


Line 175:   // waiting to be accepted. (IMPALA-4135)
> Mention that this does not always fail. How long does it take to run?
It takes about 2 seconds on my machine.


-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................


Patch Set 1:

(3 comments)

Thanks! Not a full review yet. I still need to spend time reading TAcceptQueueThreadedServer.cpp carefully.

http://gerrit.cloudera.org:8080/#/c/4519/1//COMMIT_MSG
Commit Message:

PS1, Line 18: This patch has been tested locally with the repro shown in IMPALA-4135
Can you add the test, modified if necessary?


http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/CMakeLists.txt
File be/src/server/CMakeLists.txt:

PS1, Line 24: add_library(ThriftAcceptQueueThreadedServer
            :     TAcceptQueueThreadedServer.cpp
            :   )
            : add_dependencies(ThriftAcceptQueueThreadedServer thrift-deps)
Was there a reason you wanted this as a separate library? I think it's fine to put this code in rpc/ since that's where we have our thrift-server.


http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/TAcceptQueueThreadedServer.h
File be/src/server/TAcceptQueueThreadedServer.h:

PS1, Line 1: // This file was copied from apache::thrift::server::TThreadedServer.cpp v0.9.0, with the
           : // significant changes noted inline below.
           : /*
           :  * Licensed to the Apache Software Foundation (ASF) under one
           :  * or more contributor license agreements. See the NOTICE file
           :  * distributed with this work for additional information
           :  * regarding copyright ownership. The ASF licenses this file
           :  * to you under the Apache License, Version 2.0 (the
           :  * "License"); you may not use this file except in compliance
           :  * with the License. You may obtain a copy of the License at
           :  *
           :  *   http://www.apache.org/licenses/LICENSE-2.0
           :  *
           :  * Unless required by applicable law or agreed to in writing,
           :  * software distributed under the License is distributed on an
           :  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
           :  * KIND, either express or implied. See the License for the
           :  * specific language governing permissions and limitations
           :  * under the License.
           :  */
I think this is fine, but if you haven't already, can you ask Jim or Henry if there's anything else we have to do for legal?


-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................


Patch Set 5:

The patch passed an exhaustive run on Jenkins:
http://sandbox.jenkins.cloudera.com/job/impala-umbrella-build-and-test/4911/

I have also now, with Mostafa's help, come up with a query that reliably reproduces the issue on the 16 node cluster and succeeds on the 16 node with this patch deployed:

 select /* +straight_join */ count(*) from
 tpch_300_parquet.orders A join /* +shuffle */  tpch_300_parquet.orders B on A.o_orderkey = B.o_orderkey 
 join /* +shuffle */   tpch_300_parquet.orders C on c.o_orderkey = B.o_orderkey
 join /* +shuffle */   tpch_300_parquet.orders D on d.o_orderkey = B.o_orderkey
 join /* +shuffle */   tpch_300_parquet.orders E on e.o_orderkey = B.o_orderkey
 join /* +shuffle */   tpch_300_parquet.orders F on f.o_orderkey = B.o_orderkey
 join /* +shuffle */   tpch_300_parquet.orders G on g.o_orderkey = B.o_orderkey
 join /* +shuffle */   tpch_300_parquet.orders H on h.o_orderkey = B.o_orderkey
 join /* +shuffle */   tpch_300_parquet.orders I on i.o_orderkey = B.o_orderkey
 join /* +shuffle */   tpch_300_parquet.orders J on j.o_orderkey = B.o_orderkey
 join /* +shuffle */   tpch_300_parquet.orders K on k.o_orderkey = B.o_orderkey
 join /* +shuffle */   tpch_300_parquet.orders L on l.o_orderkey = B.o_orderkey
 join /* +shuffle */   tpch_300_parquet.orders M on m.o_orderkey = B.o_orderkey
 join /* +shuffle */   tpch_300_parquet.orders N on n.o_orderkey = B.o_orderkey
 join /* +shuffle */   tpch_300_parquet.orders O on o.o_orderkey = B.o_orderkey
 join /* +shuffle */   tpch_300_parquet.orders P on p.o_orderkey = B.o_orderkey
 join /* +shuffle */   tpch_300_parquet.orders Q on q.o_orderkey = B.o_orderkey
 join /* +shuffle */   tpch_300_parquet.orders R on r.o_orderkey = B.o_orderkey
 join /* +shuffle */   tpch_300_parquet.orders S on s.o_orderkey = B.o_orderkey
 join /* +shuffle */   tpch_300_parquet.orders T on t.o_orderkey = B.o_orderkey
 join /* +shuffle */   tpch_300_parquet.orders U on u.o_orderkey = B.o_orderkey
 join /* +shuffle */   tpch_300_parquet.orders V on v.o_orderkey = B.o_orderkey
 where a.o_orderkey < 100000;

Currently working on testing it with Kerberos on.

-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has uploaded a new patch set (#6).

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................

IMPALA-4135: Thrift threaded server times-out connections during high load

During times of high load, Thrift's TThreadedServer can't keep up with the
rate of new socket connections, causing some to time out.

This patch creates TAcceptQueueServer, which is a modified version of
TThreadedServer that calls accept() and then hands the returned TTransport
off to a thread pool to handle setting up the connection. This ensures that
accept() is called as quickly as possible, preventing connections from timing
out while waiting.

It also adds a metric, connection-setup-queue-size, to monitor the number
of accepted connections waiting to be processed.

Testing:
- New test added to thrift-server-test.
- Locally with the repro shown in IMPALA-4135.
- On the 16-node with a real repro query.
- Ran the stress test for awhile.

Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
---
M be/src/common/global-flags.cc
M be/src/rpc/CMakeLists.txt
A be/src/rpc/TAcceptQueueServer.cpp
A be/src/rpc/TAcceptQueueServer.h
M be/src/rpc/thrift-server-test.cc
M be/src/rpc/thrift-server.cc
M be/src/rpc/thrift-server.h
M common/thrift/metrics.json
8 files changed, 525 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/19/4519/6
-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-4135: Thrift threaded server times-out connections during high load

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-4135: Thrift threaded server times-out connections during high load
......................................................................


Patch Set 1:

(12 comments)

The main problem with subclassing TThreadedServer is that TThreadedServer::Task is defined in the thrift cpp file, so we can't access it. We could just copy that part, but between that, the copy of serve() that we have to make, and boiler plate around the constructors that would be necessary, subclasses doesn't really save any copied code other than init() and stop(), so it seems cleaner to me to keep the relationships between the classes simple by just copying everything rather than subclassing.

I'm not sure what to do about formatting - currently, TAcceptQueueServer.h/cpp are written in Thrift's style (eg. wrapped lines indented 2 spaces rather than 4), though I may have gotten some of it wrong. Of course, I can switch it to Impala style if desired.

For the cluster repro - I've crafted a query that consistently gets the "timed out" error on the 16 node cluster without the fix. However, with the fix, after running somewhat longer, it fails with a 'memory limit exceeded' error. Obviously it would be preferable to have a query that both repros and runs to completion on a fixed cluster, so I'm working on that, but it does seem like the patch works.

http://gerrit.cloudera.org:8080/#/c/4519/1//COMMIT_MSG
Commit Message:

PS1, Line 18: This patch has been tested locally with the repro shown in IMPALA-4135
> Can you add the test, modified if necessary?
Done


http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/rpc/thrift-server.cc
File be/src/rpc/thrift-server.cc:

Line 419:       server_.reset(new TAcceptQueueThreadedServer(processor_, server_socket,
> let's make the choice of server configurable by a flag that we can turn off
Done


http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/CMakeLists.txt
File be/src/server/CMakeLists.txt:

PS1, Line 24: add_library(ThriftAcceptQueueThreadedServer
            :     TAcceptQueueThreadedServer.cpp
            :   )
            : add_dependencies(ThriftAcceptQueueThreadedServer thrift-deps)
> Was there a reason you wanted this as a separate library? I think it's fine
Done


http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/TAcceptQueueThreadedServer.cpp
File be/src/server/TAcceptQueueThreadedServer.cpp:

Line 144:     inputTransportFactory_->getTransport(client);
> do you see much slowdown in synchronizing around these few lines that creat
Most of this is fast and synchronizing it doesn't affect performance much.


Line 164:     shared_ptr<Thread> thread = shared_ptr<Thread>(threadFactory_->newThread(runnable));
> Is this definitely thread-safe? I think this might be the bottleneck (or ma
The bottleneck is 'thread->start()' below. I'm fairly confident that's thread safe, as essentially all it does is create a new boost::thread.


Line 200:   ThreadPool<shared_ptr<TTransport>> acceptPool("server-accept", "accept-worker", 1, 10000,
> How did you arrive at 1 as the default number of worker threads? What happe
I saw the same behavior as you, where using more than 1 thread here was actually slower than just 1. I don't know why that's happening (Michael's fix for IMPALA-4026 currently under review doesn't seem to fix it), but given that just using 1 thread was enough to solve the problem for the repro in the JIRA, and given that only using one prevents us from having to worry about thread safety, it seems like the best choice, assuming that it also works in a real cluster repro. I could also make it configurable.


PS1, Line 211: Offer
> This can return false, right?
Done


PS1, Line 233: if (stop_) {
> shouldn't this be out of the while loop? to guarantee a proper cleanup.
Its not in the while loop.


Line 236:       acceptPool.DrainAndShutdown();
> I'm not sure we ever actually stop the thrift servers, so it might be moot 
Done


PS1, Line 252: stop_ = false
> why set stop_ back to false?
I guess so that it can be restarted? (this was copied from the thrift code, and I think its better to leave it as close to the original behavior as possible)


http://gerrit.cloudera.org:8080/#/c/4519/1/be/src/server/TAcceptQueueThreadedServer.h
File be/src/server/TAcceptQueueThreadedServer.h:

PS1, Line 1: // This file was copied from apache::thrift::server::TThreadedServer.cpp v0.9.0, with the
           : // significant changes noted inline below.
           : /*
           :  * Licensed to the Apache Software Foundation (ASF) under one
           :  * or more contributor license agreements. See the NOTICE file
           :  * distributed with this work for additional information
           :  * regarding copyright ownership. The ASF licenses this file
           :  * to you under the Apache License, Version 2.0 (the
           :  * "License"); you may not use this file except in compliance
           :  * with the License. You may obtain a copy of the License at
           :  *
           :  *   http://www.apache.org/licenses/LICENSE-2.0
           :  *
           :  * Unless required by applicable law or agreed to in writing,
           :  * software distributed under the License is distributed on an
           :  * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
           :  * KIND, either express or implied. See the License for the
           :  * specific language governing permissions and limitations
           :  * under the License.
           :  */
> AFAIK it's completely fine - copying from another ASF project is legit.
Done


PS1, Line 44: and effectively creating a new,
            :  * internally managed, accept queue.
> not exactly - the connections have been accepted, the next state is transpo
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4519
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie50e728974ef31a9d49132a0b3f7cde2a4f3356d
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Juan Yu <jy...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes