Posted to dev@arrow.apache.org by Yibo Cai <yi...@arm.com> on 2020/06/15 10:44:15 UTC

Flight benchmark question

I'm evaluating the Flight benchmark [1] on a single host. I ran into one problem and would like to ask for help.

The Flight benchmark has a "num_threads" parameter [1] that sets the number of concurrent gets. Counter-intuitively, setting it to larger values drops performance: "arrow-flight-benchmark --num_threads=1" performs much better than "arrow-flight-benchmark --num_threads=2". There's a previous thread discussing this issue [2], which explains that it's better to spawn more servers on different ports than to have all threads go to a single server process.

I did another test with a standalone server, and the result is different.

1. spawn a standalone flight server
    $ ./arrow-flight-perf-server
    Server host: localhost
    Server port: 31337

2. test one flight benchmark to get baseline performance
    $ ./arrow-flight-benchmark --num_threads 1 --server_host localhost --records_per_stream=123456789
    ....
    Speed: 4717.28 MB/s

3. test two flight benchmarks concurrently, check scalability
    # run in one console
    $ ./arrow-flight-benchmark --num_threads 1 --server_host localhost --records_per_stream=123456789
    ....
    Speed: 4160.94 MB/s

    # run at *same time* in another console
    $ ./arrow-flight-benchmark --num_threads 1 --server_host localhost --records_per_stream=123456789
    ....
    Speed: 4154.65 MB/s

From this result, it looks like the Flight server has good multi-core scalability. The same behaviour is observed when testing across the network.
What is the difference between the above two tests, with and without a standalone server?

[1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/flight_benchmark.cc#L44
[2] https://lists.apache.org/thread.html/rd2aa01f460dd1092c60d1ba75087c2ce87c81ac543a246549b4713fb%40%3Cdev.arrow.apache.org%3E

Yibo

Re: Flight benchmark question

Posted by Yibo Cai <yi...@arm.com>.
On 6/17/20 8:33 PM, David Li wrote:
> Hey Yibo,
> 
> Thanks for investigating this! This is a great writeup.
> 
> There was a PR recently to let clients set gRPC options like this, so
> it can be enabled on a case-by-case basis:
> https://github.com/apache/arrow/pull/7406
> So we could add that to the benchmark or suggest it in documentation.

Thanks David, that's exactly what I want.

> 
> I think this benchmark is a bit of a pathological case for gRPC. gRPC
> will share sockets when all client options are exactly the same; it
> seems just adding TLS, for instance, would break that (unless you
> intentionally shared TLS credentials, which Flight doesn't):
> https://github.com/grpc/grpc/issues/15207. I believe grpc-java doesn't
> have this behavior (different Channel instances won't share
> connections).
> 
> Also, did you investigate the SO_ZEROCOPY flag gRPC now offers? I
> wonder if that might also help performance a bit.
> https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#ga1eb58c302eaf27a5d982b30402b8f84a

I did a quick try. The Ubuntu 4.15 kernel doesn't support SO_ZEROCOPY, so I upgraded to a 5.7 kernel.
Per my test on the same host, there's no obvious difference after setting this option. I'll do more tests over the network.
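
For reference, this is roughly how I set it, assuming the flag David means is gRPC's GRPC_ARG_TCP_TX_ZEROCOPY_ENABLED channel arg (placed next to the other channel args in client.cc, like the diff further down this thread; treat it as a sketch, not a tested patch):

    // Hypothetical: ask gRPC to use TCP zero-copy sends; it needs kernel
    // SO_ZEROCOPY support, otherwise gRPC falls back to regular sends.
    args.SetInt(GRPC_ARG_TCP_TX_ZEROCOPY_ENABLED, 1);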

> 
> Best,
> David
> 
> On 6/17/20, Chengxin Ma <cx...@protonmail.ch.invalid> wrote:
>> Hi Yibo,
>>
>>
>> Your discovery is impressive.
>>
>>
>> Did you consider the `num_streams` parameter [1] as well? If I understood
>> correctly, this parameter sets the number of conceptual concurrent streams
>> between the client and the server, while `num_threads` sets the size of the
>> thread pool that actually handles these streams [2]. By default, both
>> parameters are 4.
>>
>>
>> As for CPU usage, the parameter `records_per_batch` [3] has an impact as
>> well. If you increase the value of this parameter, you will probably see
>> that the data transfer speed increases while the server-side CPU usage
>> drops [4].
>> My guess is that as more records are put in one record batch, the total
>> number of batches decreases. CPU is only used for (de)serializing the
>> metadata (i.e. the schema) of each record batch, while the payload can be
>> transferred with zero cost [5].
>>
>>
>> [1] https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L43
>> [2] https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L230
>> [3] https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L46
>> [4] https://drive.google.com/file/d/1aH84DdenLr0iH-RuMFU3_q87nPE_HLmP/view?usp=sharing
>> [5] See "Optimizing Data Throughput over gRPC"
>> in https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/
>>
>>
>> Kind Regards
>> Chengxin
>>
>>
>> Sent with ProtonMail Secure Email.
>>
>> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
>> On Wednesday, June 17, 2020 8:35 AM, Yibo Cai <yi...@arm.com> wrote:
>>
>>> I found a way to achieve reasonable benchmark results with multiple threads.
>>> The diff is pasted below for a quick review or try.
>>> Tested on an E5-2650 with this change (speed in MB/s):
>>> num_threads = 1, speed = 1996
>>> num_threads = 2, speed = 3555
>>> num_threads = 4, speed = 5828
>>>
>>> When running `arrow_flight_benchmark`, I find there's only one TCP
>>> connection between client and server, no matter what `num_threads` is: all
>>> clients share one TCP connection. On the server side, I see only one thread
>>> processing network packets. On my machine, one client already saturates a
>>> CPU core, so throughput gets worse as `num_threads` increases, since that
>>> single server thread becomes the bottleneck.
>>>
>>> When running in standalone mode, the Flight clients are in different
>>> processes and have their own TCP connections to the server. There are
>>> separate server threads handling network traffic for each connection,
>>> without a central bottleneck.
>>>
>>> I was lucky to find the arg GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL [1] just
>>> before giving up. Setting that arg makes each client establish its own TCP
>>> connection to the server, similar to standalone mode.
>>>
>>> Actually, I'm not quite sure whether we should set this arg. Sharing one
>>> TCP connection is a reasonable configuration, and it's an advantage of
>>> gRPC [2].
>>>
>>> Per my test, most CPU cycles are spent in kernel mode doing networking and
>>> data transfer. Maybe a better solution is to leverage modern networking
>>> techniques like RDMA or a user-mode stack for higher performance.
>>>
>>> [1]
>>> https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#gaa49ebd41af390c78a2c0ed94b74abfbc
>>> [2] https://platformlab.stanford.edu/Seminar%20Talks/gRPC.pdf, page 5
>>>
>>> diff --git a/cpp/src/arrow/flight/client.cc b/cpp/src/arrow/flight/client.cc
>>> index d530093d9..6904640d3 100644
>>> --- a/cpp/src/arrow/flight/client.cc
>>> +++ b/cpp/src/arrow/flight/client.cc
>>> @@ -811,6 +811,9 @@ class FlightClient::FlightClientImpl {
>>>       args.SetInt(GRPC_ARG_INITIAL_RECONNECT_BACKOFF_MS, 100);
>>>       // Receive messages of any size
>>>       args.SetMaxReceiveMessageSize(-1);
>>> +    // Setting this arg enables each client to open its own TCP connection to the server,
>>> +    // not sharing one single connection, which becomes a bottleneck under high load.
>>> +    args.SetInt(GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL, 1);
>>>
>>>      if (options.override_hostname != "") {
>>>      args.SetSslTargetNameOverride(options.override_hostname);
>>>
>>>      On 6/15/20 10:00 PM, Wes McKinney wrote:
>>>
>>>
>>>> On Mon, Jun 15, 2020 at 8:43 AM Antoine Pitrou antoine@python.org wrote:
>>>>
>>>>> Le 15/06/2020 à 15:36, Wes McKinney a écrit :
>>>>>
>>>>>> When you have only a single server, all the gRPC traffic goes through
>>>>>> a common port and is handled by a common server, so if both client and
>>>>>> server are roughly IO bound you aren't going to get better performance
>>>>>> by hitting the server with multiple clients simultaneously, only worse
>>>>>> because the packets from different client requests are intermingled in
>>>>>> the TCP traffic on that port. I'm not a networking expert but this is
>>>>>> my best understanding of what is going on.
>>>>>
>>>>> Yibo Cai's experiment disproves that explanation, though.
>>>>> When I run a single client against the test server, I get ~4 GB/s. When
>>>>> I run 6 standalone clients against the same test server, I get ~8 GB/s
>>>>> aggregate. So there's something else going on that limits scalability
>>>>> when the benchmark executable runs all clients by itself (perhaps gRPC
>>>>> clients in a single process share some underlying structure or execution
>>>>> threads? I don't know).
>>>>
>>>> I see, thanks. OK then clearly something else is going on.
>>>>
>>>>>> I hope someone will implement the "multiple test servers" TODO in the
>>>>>> benchmark.
>>>>>
>>>>> I think that's a bad idea in any case, as running multiple servers on
>>>>> different ports is not a realistic expectation from users.
>>>>> Regards
>>>>> Antoine.
>>
>>
>>
> 

Re: Flight benchmark question

Posted by David Li <li...@gmail.com>.
Hey Yibo,

Thanks for investigating this! This is a great writeup.

There was a PR recently to let clients set gRPC options like this, so
it can be enabled on a case-by-case basis:
https://github.com/apache/arrow/pull/7406
So we could add that to the benchmark or suggest it in documentation.
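
As a rough sketch of what that could look like from the client side (the member name `generic_options` and its exact shape are my guess from the PR, so double-check against the merged code):

    #include <grpc/grpc.h>  // GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL
    #include <memory>

    #include "arrow/flight/client.h"
    #include "arrow/status.h"

    arrow::Status ConnectWithLocalSubchannelPool(
        std::unique_ptr<arrow::flight::FlightClient>* client) {
      arrow::flight::Location location;
      ARROW_RETURN_NOT_OK(
          arrow::flight::Location::ForGrpcTcp("localhost", 31337, &location));

      arrow::flight::FlightClientOptions options;
      // Assumed field from the PR: generic (key, value) gRPC channel args that
      // are forwarded to grpc::ChannelArguments, so the benchmark (or any
      // caller) can opt in per client instead of hard-coding it in client.cc.
      options.generic_options.emplace_back(GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL, 1);

      return arrow::flight::FlightClient::Connect(location, options, client);
    }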

I think this benchmark is a bit of a pathological case for gRPC. gRPC
will share sockets when all client options are exactly the same; it
seems just adding TLS, for instance, would break that (unless you
intentionally shared TLS credentials, which Flight doesn't):
https://github.com/grpc/grpc/issues/15207. I believe grpc-java doesn't
have this behavior (different Channel instances won't share
connections).

Also, did you investigate the SO_ZEROCOPY flag gRPC now offers? I
wonder if that might also help performance a bit.
https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#ga1eb58c302eaf27a5d982b30402b8f84a

Best,
David

On 6/17/20, Chengxin Ma <cx...@protonmail.ch.invalid> wrote:
> Hi Yibo,
>
>
> Your discovery is impressive.
>
>
> Did you consider the `num_streams` parameter [1] as well? If I understood
> correctly, this parameter sets the number of conceptual concurrent streams
> between the client and the server, while `num_threads` sets the size of the
> thread pool that actually handles these streams [2]. By default, both
> parameters are 4.
>
>
> As for CPU usage, the parameter `records_per_batch` [3] has an impact as
> well. If you increase the value of this parameter, you will probably see
> that the data transfer speed increases while the server-side CPU usage
> drops [4].
> My guess is that as more records are put in one record batch, the total
> number of batches decreases. CPU is only used for (de)serializing the
> metadata (i.e. the schema) of each record batch, while the payload can be
> transferred with zero cost [5].
>
>
> [1] https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L43
> [2] https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L230
> [3] https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L46
> [4] https://drive.google.com/file/d/1aH84DdenLr0iH-RuMFU3_q87nPE_HLmP/view?usp=sharing
> [5] See "Optimizing Data Throughput over gRPC"
> in https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/
>
>
> Kind Regards
> Chengxin
>
>
> Sent with ProtonMail Secure Email.
>
> ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
> On Wednesday, June 17, 2020 8:35 AM, Yibo Cai <yi...@arm.com> wrote:
>
>> I found a way to achieve reasonable benchmark results with multiple threads.
>> The diff is pasted below for a quick review or try.
>> Tested on an E5-2650 with this change (speed in MB/s):
>> num_threads = 1, speed = 1996
>> num_threads = 2, speed = 3555
>> num_threads = 4, speed = 5828
>>
>> When running `arrow_flight_benchmark`, I find there's only one TCP
>> connection between client and server, no matter what `num_threads` is: all
>> clients share one TCP connection. On the server side, I see only one thread
>> processing network packets. On my machine, one client already saturates a
>> CPU core, so throughput gets worse as `num_threads` increases, since that
>> single server thread becomes the bottleneck.
>>
>> When running in standalone mode, the Flight clients are in different
>> processes and have their own TCP connections to the server. There are
>> separate server threads handling network traffic for each connection,
>> without a central bottleneck.
>>
>> I was lucky to find the arg GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL [1] just
>> before giving up. Setting that arg makes each client establish its own TCP
>> connection to the server, similar to standalone mode.
>>
>> Actually, I'm not quite sure whether we should set this arg. Sharing one
>> TCP connection is a reasonable configuration, and it's an advantage of
>> gRPC [2].
>>
>> Per my test, most CPU cycles are spent in kernel mode doing networking and
>> data transfer. Maybe a better solution is to leverage modern networking
>> techniques like RDMA or a user-mode stack for higher performance.
>>
>> [1]
>> https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#gaa49ebd41af390c78a2c0ed94b74abfbc
>> [2] https://platformlab.stanford.edu/Seminar%20Talks/gRPC.pdf, page 5
>>
>> diff --git a/cpp/src/arrow/flight/client.cc b/cpp/src/arrow/flight/client.cc
>> index d530093d9..6904640d3 100644
>> --- a/cpp/src/arrow/flight/client.cc
>> +++ b/cpp/src/arrow/flight/client.cc
>> @@ -811,6 +811,9 @@ class FlightClient::FlightClientImpl {
>>       args.SetInt(GRPC_ARG_INITIAL_RECONNECT_BACKOFF_MS, 100);
>>       // Receive messages of any size
>>       args.SetMaxReceiveMessageSize(-1);
>> +    // Setting this arg enables each client to open its own TCP connection to the server,
>> +    // not sharing one single connection, which becomes a bottleneck under high load.
>> +    args.SetInt(GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL, 1);
>>
>>     if (options.override_hostname != "") {
>>     args.SetSslTargetNameOverride(options.override_hostname);
>>
>>     On 6/15/20 10:00 PM, Wes McKinney wrote:
>>
>>
>> > On Mon, Jun 15, 2020 at 8:43 AM Antoine Pitrou antoine@python.org wrote:
>> >
>> > > Le 15/06/2020 à 15:36, Wes McKinney a écrit :
>> > >
>> > > > When you have only a single server, all the gRPC traffic goes through
>> > > > a common port and is handled by a common server, so if both client and
>> > > > server are roughly IO bound you aren't going to get better performance
>> > > > by hitting the server with multiple clients simultaneously, only worse
>> > > > because the packets from different client requests are intermingled in
>> > > > the TCP traffic on that port. I'm not a networking expert but this is
>> > > > my best understanding of what is going on.
>> > >
>> > > Yibo Cai's experiment disproves that explanation, though.
>> > > When I run a single client against the test server, I get ~4 GB/s. When
>> > > I run 6 standalone clients against the same test server, I get ~8 GB/s
>> > > aggregate. So there's something else going on that limits scalability
>> > > when the benchmark executable runs all clients by itself (perhaps gRPC
>> > > clients in a single process share some underlying structure or execution
>> > > threads? I don't know).
>> >
>> > I see, thanks. OK then clearly something else is going on.
>> >
>> > > > I hope someone will implement the "multiple test servers" TODO in the
>> > > > benchmark.
>> > >
>> > > I think that's a bad idea in any case, as running multiple servers on
>> > > different ports is not a realistic expectation from users.
>> > > Regards
>> > > Antoine.
>
>
>

Re: Flight benchmark question

Posted by Chengxin Ma <cx...@protonmail.ch.INVALID>.
Hi Yibo,


Your discovery is impressive.


Did you consider the `num_streams` parameter [1] as well? If I understood correctly, this parameter sets the number of conceptual concurrent streams between the client and the server, while `num_threads` sets the size of the thread pool that actually handles these streams [2]. By default, both parameters are 4.


As for CPU usage, the parameter `records_per_batch` [3] has an impact as well. If you increase the value of this parameter, you will probably see that the data transfer speed increases while the server-side CPU usage drops [4].
My guess is that as more records are put in one record batch, the total number of batches decreases. CPU is only used for (de)serializing the metadata (i.e. the schema) of each record batch, while the payload can be transferred with zero cost [5].


[1] https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L43
[2] https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L230
[3] https://github.com/apache/arrow/blob/513d77bf5a21fe817994a4a87f68c52e8a453933/cpp/src/arrow/flight/flight_benchmark.cc#L46
[4] https://drive.google.com/file/d/1aH84DdenLr0iH-RuMFU3_q87nPE_HLmP/view?usp=sharing
[5] See "Optimizing Data Throughput over gRPC" in https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/


Kind Regards
Chengxin


Sent with ProtonMail Secure Email.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Wednesday, June 17, 2020 8:35 AM, Yibo Cai <yi...@arm.com> wrote:

> I found a way to achieve reasonable benchmark results with multiple threads. The diff is pasted below for a quick review or try.
> Tested on an E5-2650 with this change (speed in MB/s):
> num_threads = 1, speed = 1996
> num_threads = 2, speed = 3555
> num_threads = 4, speed = 5828
>
> When running `arrow_flight_benchmark`, I find there's only one TCP connection between client and server, no matter what `num_threads` is: all clients share one TCP connection. On the server side, I see only one thread processing network packets. On my machine, one client already saturates a CPU core, so throughput gets worse as `num_threads` increases, since that single server thread becomes the bottleneck.
>
> When running in standalone mode, the Flight clients are in different processes and have their own TCP connections to the server. There are separate server threads handling network traffic for each connection, without a central bottleneck.
>
> I was lucky to find the arg GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL [1] just before giving up. Setting that arg makes each client establish its own TCP connection to the server, similar to standalone mode.
>
> Actually, I'm not quite sure whether we should set this arg. Sharing one TCP connection is a reasonable configuration, and it's an advantage of gRPC [2].
>
> Per my test, most CPU cycles are spent in kernel mode doing networking and data transfer. Maybe a better solution is to leverage modern networking techniques like RDMA or a user-mode stack for higher performance.
>
> [1] https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#gaa49ebd41af390c78a2c0ed94b74abfbc
> [2] https://platformlab.stanford.edu/Seminar%20Talks/gRPC.pdf, page 5
>
> diff --git a/cpp/src/arrow/flight/client.cc b/cpp/src/arrow/flight/client.cc
> index d530093d9..6904640d3 100644
> --- a/cpp/src/arrow/flight/client.cc
> +++ b/cpp/src/arrow/flight/client.cc
> @@ -811,6 +811,9 @@ class FlightClient::FlightClientImpl {
> args.SetInt(GRPC_ARG_INITIAL_RECONNECT_BACKOFF_MS, 100);
> // Receive messages of any size
> args.SetMaxReceiveMessageSize(-1);
>
> +    // Setting this arg enables each client to open its own TCP connection to the server,
> +    // not sharing one single connection, which becomes a bottleneck under high load.
> +    args.SetInt(GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL, 1);
>
>     if (options.override_hostname != "") {
>     args.SetSslTargetNameOverride(options.override_hostname);
>
>     On 6/15/20 10:00 PM, Wes McKinney wrote:
>
>
> > On Mon, Jun 15, 2020 at 8:43 AM Antoine Pitrou antoine@python.org wrote:
> >
> > > Le 15/06/2020 à 15:36, Wes McKinney a écrit :
> > >
> > > > When you have only a single server, all the gRPC traffic goes through
> > > > a common port and is handled by a common server, so if both client and
> > > > server are roughly IO bound you aren't going to get better performance
> > > > by hitting the server with multiple clients simultaneously, only worse
> > > > because the packets from different client requests are intermingled in
> > > > the TCP traffic on that port. I'm not a networking expert but this is
> > > > my best understanding of what is going on.
> > >
> > > Yibo Cai's experiment disproves that explanation, though.
> > > When I run a single client against the test server, I get ~4 GB/s. When
> > > I run 6 standalone clients against the same test server, I get ~8 GB/s
> > > aggregate. So there's something else going on that limits scalability
> > > when the benchmark executable runs all clients by itself (perhaps gRPC
> > > clients in a single process share some underlying structure or execution
> > > threads? I don't know).
> >
> > I see, thanks. OK then clearly something else is going on.
> >
> > > > I hope someone will implement the "multiple test servers" TODO in the
> > > > benchmark.
> > >
> > > I think that's a bad idea in any case, as running multiple servers on
> > > different ports is not a realistic expectation from users.
> > > Regards
> > > Antoine.



Re: Flight benchmark question

Posted by Yibo Cai <yi...@arm.com>.
I found a way to achieve reasonable benchmark results with multiple threads. The diff is pasted below for a quick review or try.
Tested on an E5-2650 with this change (speed in MB/s):
num_threads = 1, speed = 1996
num_threads = 2, speed = 3555
num_threads = 4, speed = 5828

When running `arrow_flight_benchmark`, I find there's only one TCP connection between client and server, no matter what `num_threads` is: all clients share one TCP connection. On the server side, I see only one thread processing network packets. On my machine, one client already saturates a CPU core, so throughput gets worse as `num_threads` increases, since that single server thread becomes the bottleneck.

When running in standalone mode, the Flight clients are in different processes and have their own TCP connections to the server. There are separate server threads handling network traffic for each connection, without a central bottleneck.

I was lucky to find the arg GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL [1] just before giving up. Setting that arg makes each client establish its own TCP connection to the server, similar to standalone mode.

Actually, I'm not quite sure whether we should set this arg. Sharing one TCP connection is a reasonable configuration, and it's an advantage of gRPC [2].

Per my test, most CPU cycles are spent in kernel mode doing networking and data transfer. Maybe a better solution is to leverage modern networking techniques like RDMA or a user-mode stack for higher performance.

[1] https://grpc.github.io/grpc/core/group__grpc__arg__keys.html#gaa49ebd41af390c78a2c0ed94b74abfbc
[2] https://platformlab.stanford.edu/Seminar%20Talks/gRPC.pdf, page 5


diff --git a/cpp/src/arrow/flight/client.cc b/cpp/src/arrow/flight/client.cc
index d530093d9..6904640d3 100644
--- a/cpp/src/arrow/flight/client.cc
+++ b/cpp/src/arrow/flight/client.cc
@@ -811,6 +811,9 @@ class FlightClient::FlightClientImpl {
      args.SetInt(GRPC_ARG_INITIAL_RECONNECT_BACKOFF_MS, 100);
      // Receive messages of any size
      args.SetMaxReceiveMessageSize(-1);
+    // Setting this arg enables each client to open its own TCP connection to the server,
+    // not sharing one single connection, which becomes a bottleneck under high load.
+    args.SetInt(GRPC_ARG_USE_LOCAL_SUBCHANNEL_POOL, 1);
  
      if (options.override_hostname != "") {
        args.SetSslTargetNameOverride(options.override_hostname);
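
(One way to verify the effect: while the benchmark is running, `netstat -tn | grep 31337` on the client host should show one established TCP connection per client with this change, versus a single shared connection without it.)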


On 6/15/20 10:00 PM, Wes McKinney wrote:
> On Mon, Jun 15, 2020 at 8:43 AM Antoine Pitrou <an...@python.org> wrote:
>>
>>
>> Le 15/06/2020 à 15:36, Wes McKinney a écrit :
>>>
>>> When you have only a single server, all the gRPC traffic goes through
>>> a common port and is handled by a common server, so if both client and
>>> server are roughly IO bound you aren't going to get better performance
>>> by hitting the server with multiple clients simultaneously, only worse
>>> because the packets from different client requests are intermingled in
>>> the TCP traffic on that port. I'm not a networking expert but this is
>>> my best understanding of what is going on.
>>
>> Yibo Cai's experiment disproves that explanation, though.
>>
>> When I run a single client against the test server, I get ~4 GB/s.  When
>> I run 6 standalone clients against the *same* test server, I get ~8 GB/s
>> aggregate.  So there's something else going on that limits scalability
>> when the benchmark executable runs all clients by itself (perhaps gRPC
>> clients in a single process share some underlying structure or execution
>> threads? I don't know).
>>
> 
> I see, thanks. OK then clearly something else is going on.
> 
>>> I hope someone will implement the "multiple test servers" TODO in the
>>> benchmark.
>>
>> I think that's a bad idea *in any case*, as running multiple servers on
>> different ports is not a realistic expectation from users.
>>
>> Regards
>>
>> Antoine.

Re: Flight benchmark question

Posted by Wes McKinney <we...@gmail.com>.
On Mon, Jun 15, 2020 at 8:43 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> Le 15/06/2020 à 15:36, Wes McKinney a écrit :
> >
> > When you have only a single server, all the gRPC traffic goes through
> > a common port and is handled by a common server, so if both client and
> > server are roughly IO bound you aren't going to get better performance
> > by hitting the server with multiple clients simultaneously, only worse
> > because the packets from different client requests are intermingled in
> > the TCP traffic on that port. I'm not a networking expert but this is
> > my best understanding of what is going on.
>
> Yibo Cai's experiment disproves that explanation, though.
>
> When I run a single client against the test server, I get ~4 GB/s.  When
> I run 6 standalone clients against the *same* test server, I get ~8 GB/s
> aggregate.  So there's something else going on that limits scalability
> when the benchmark executable runs all clients by itself (perhaps gRPC
> clients in a single process share some underlying structure or execution
> threads? I don't know).
>

I see, thanks. OK then clearly something else is going on.

> > I hope someone will implement the "multiple test servers" TODO in the
> > benchmark.
>
> I think that's a bad idea *in any case*, as running multiple servers on
> different ports is not a realistic expectation from users.
>
> Regards
>
> Antoine.

Re: Flight benchmark question

Posted by Antoine Pitrou <an...@python.org>.
Le 15/06/2020 à 15:36, Wes McKinney a écrit :
> 
> When you have only a single server, all the gRPC traffic goes through
> a common port and is handled by a common server, so if both client and
> server are roughly IO bound you aren't going to get better performance
> by hitting the server with multiple clients simultaneously, only worse
> because the packets from different client requests are intermingled in
> the TCP traffic on that port. I'm not a networking expert but this is
> my best understanding of what is going on.

Yibo Cai's experiment disproves that explanation, though.

When I run a single client against the test server, I get ~4 GB/s.  When
I run 6 standalone clients against the *same* test server, I get ~8 GB/s
aggregate.  So there's something else going on that limits scalability
when the benchmark executable runs all clients by itself (perhaps gRPC
clients in a single process share some underlying structure or execution
threads? I don't know).

> I hope someone will implement the "multiple test servers" TODO in the
> benchmark.

I think that's a bad idea *in any case*, as running multiple servers on
different ports is not a realistic expectation from users.

Regards

Antoine.

Re: Flight benchmark question

Posted by Wes McKinney <we...@gmail.com>.
We had a _very_ similar discussion in April

https://lists.apache.org/thread.html/rd2aa01f460dd1092c60d1ba75087c2ce87c81ac543a246549b4713fb%40%3Cdev.arrow.apache.org%3E

When you have only a single server, all the gRPC traffic goes through
a common port and is handled by a common server, so if both client and
server are roughly IO bound you aren't going to get better performance
by hitting the server with multiple clients simultaneously, only worse
because the packets from different client requests are intermingled in
the TCP traffic on that port. I'm not a networking expert but this is
my best understanding of what is going on.

I hope someone will implement the "multiple test servers" TODO in the
benchmark.

- Wes

On Mon, Jun 15, 2020 at 5:44 AM Yibo Cai <yi...@arm.com> wrote:
>
> I'm evaluating the Flight benchmark [1] on a single host. I ran into one problem and would like to ask for help.
>
> The Flight benchmark has a "num_threads" parameter [1] that sets the number of concurrent gets. Counter-intuitively, setting it to larger values drops performance: "arrow-flight-benchmark --num_threads=1" performs much better than "arrow-flight-benchmark --num_threads=2". There's a previous thread discussing this issue [2], which explains that it's better to spawn more servers on different ports than to have all threads go to a single server process.
>
> I did another test with a standalone server, and the result is different.
>
> 1. spawn a standalone flight server
>     $ ./arrow-flight-perf-server
>     Server host: localhost
>     Server port: 31337
>
> 2. test one flight benchmark to get baseline performance
>     $ ./arrow-flight-benchmark --num_threads 1 --server_host localhost --records_per_stream=123456789
>     ....
>     Speed: 4717.28 MB/s
>
> 3. test two flight benchmarks concurrently, check scalability
>     # run in one console
>     $ ./arrow-flight-benchmark --num_threads 1 --server_host localhost --records_per_stream=123456789
>     ....
>     Speed: 4160.94 MB/s
>
>     # run at *same time* in another console
>     $ ./arrow-flight-benchmark --num_threads 1 --server_host localhost --records_per_stream=123456789
>     ....
>     Speed: 4154.65 MB/s
>
> From this result, it looks like the Flight server has good multi-core scalability. The same behaviour is observed when testing across the network.
> What is the difference between the above two tests, with and without a standalone server?
>
> [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/flight/flight_benchmark.cc#L44
> [2] https://lists.apache.org/thread.html/rd2aa01f460dd1092c60d1ba75087c2ce87c81ac543a246549b4713fb%40%3Cdev.arrow.apache.org%3E
>
> Yibo