You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/20 05:58:06 UTC

[GitHub] [arrow] cyb70289 opened a new pull request #12196: [RFC] ARROW-15282: [C++][FlightRPC] Support non-grpc data planes

cyb70289 opened a new pull request #12196:
URL: https://github.com/apache/arrow/pull/12196


   This patch decouples flightrpc data plane from grpc so we can leverage
   optimized data transfer libraries.
   
   The basic idea is to replace grpc stream with a data plane stream for
   FlightData transmission in DoGet/DoPut/DoExchange. There's no big change
   to current flight client and server implementations. Added a wrapper to
   support both grpc stream and data plane stream. By default, grpc stream
   is used, which goes the current grpc based code path. If a data plane is
   enabled (currently through environment variable), flight payload will go
   through the data plane stream instead. See client.cc and server.cc to
   review the changes.
   
   **About data plane implementation**
   
   - data_plane/types.{h,cc}
     Defines client/server data plane and data plane stream interfaces.
     It's the only exported api to other component ({client,server}.cc).
   - data_plane/serialize.{h,cc}
     De-Serialize FlightData manually as we bypass grpc. Luckly, we already
     implemented related functions to support payload zero-copy.
   - shm.cc
     A shared memory driver to verify the data plane approach. The code may
     be a bit hard to read, it's better to focus on data plane interface
     implementations at first before dive deep into details like shared
     memory, ipc and buffer management related code.
     Please note there are still many caveats in current code, see TODO and
     XXX in shm.cc for details.
   
   **To evaluate this patch**
   
   I tested shared memory data plane on Linux (x86, Arm) and MacOS (Arm).
   Build with `-DARROW_FLIGHT_DP_SHM=ON` to enable the shared memory data
   plane. Set `FLIGHT_DATAPLANE=shm` environment variable to run unit tests
   and benchmarks with the shared memory data plane enabled.
   
   ```
   Build: cmake -DARROW_FLIGHT_DP_SHM=ON -DARROW_FLIGHT=ON ....
   Test:  FLIGHT_DATAPLANE=shm release/arrow-flight-test
   Bench: FLIGHT_DATAPLANE=shm release/arrow-flight-benchmark \
          -num_streams=1|2|4 -num_threads=1|2|4
   ```
   
   Benchmark result (throughput, latency) on Xeon Gold 5218.
   Test case: DoGet, batch size = 128KiB
   
   | streams | grpc over unix socket | shared memory data plane |
   | ------- | --------------------- | ------------------------ |
   | 1       |  3324 MB/s,  35 us    |  7045 MB/s,  16 us       |
   | 2       |  6289 MB/s,  38 us    | 13311 MB/s,  17 us       |
   | 4       | 10037 MB/s,  44 us    | 25012 MB/s,  17 us       |


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkmgit commented on pull request #12196: [RFC] ARROW-15282: [C++][FlightRPC] Support non-grpc data planes

Posted by GitBox <gi...@apache.org>.
bkmgit commented on pull request #12196:
URL: https://github.com/apache/arrow/pull/12196#issuecomment-1018442427


   How were the tests run? Assume the [flight benchmark](https://github.com/cyb70289/arrow/blob/flight-data-plane/cpp/src/arrow/flight/flight_benchmark.cc) as described at https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/ was used.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on pull request #12196: [RFC] ARROW-15282: [C++][FlightRPC] Support non-grpc data planes

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on pull request #12196:
URL: https://github.com/apache/arrow/pull/12196#issuecomment-1017155532


   **NOTE:** The main purpose of this PR is to collect responses and seek for best approaches to support optimized data transfer methods other than grpc.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on a change in pull request #12196: [RFC] ARROW-15282: [C++][FlightRPC] Support non-grpc data planes

Posted by GitBox <gi...@apache.org>.
lidavidm commented on a change in pull request #12196:
URL: https://github.com/apache/arrow/pull/12196#discussion_r788828887



##########
File path: cpp/src/arrow/flight/data_plane/types.cc
##########
@@ -0,0 +1,127 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/flight/data_plane/types.h"
+#include "arrow/flight/data_plane/internal.h"
+#include "arrow/util/io_util.h"
+#include "arrow/util/logging.h"
+#include "arrow/util/make_unique.h"
+
+#ifdef GRPCPP_PP_INCLUDE
+#include <grpcpp/grpcpp.h>
+#else
+#include <grpc++/grpc++.h>
+#endif
+
+namespace arrow {
+namespace flight {
+namespace internal {
+
+namespace {
+
+// data plane registry (name -> data plane maker)
+struct Registry {
+  std::map<const std::string, DataPlaneMaker> makers;
+
+  // register all data planes on creation of registry singleton
+  Registry() {
+#ifdef FLIGHT_DP_SHM
+    Register("shm", GetShmDataPlaneMaker());
+#endif
+  }
+
+  static const Registry& instance() {
+    static const Registry registry;
+    return registry;
+  }
+
+  void Register(const std::string& name, const DataPlaneMaker& maker) {
+    DCHECK_EQ(makers.find(name), makers.end());
+    makers[name] = maker;
+  }
+
+  arrow::Result<DataPlaneMaker> GetDataPlaneMaker(const std::string& uri) const {
+    const std::string name = uri.substr(0, uri.find(':'));
+    auto it = makers.find(name);
+    if (it == makers.end()) {
+      return Status::Invalid("unknown data plane: ", name);
+    }
+    return it->second;
+  }
+};
+
+std::string GetGrpcMetadata(const grpc::ServerContext& context, const std::string& key) {
+  const auto client_metadata = context.client_metadata();
+  const auto found = client_metadata.find(key);
+  std::string token;
+  if (found == client_metadata.end()) {
+    DCHECK(false);
+    token = "";
+  } else {
+    token = std::string(found->second.data(), found->second.length());
+  }
+  return token;
+}
+
+// TODO(yibo): getting data plane uri from env var is bad, shall we extend
+// location to support two uri (control, data)? or any better approach to

Review comment:
       `grpc+tcp://localhost:1337/?data=shm` or something? Or maybe something like `grpc+tcp+shm://...` not sure what is semantically correct.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on pull request #12196: [RFC] ARROW-15282: [C++][FlightRPC] Support non-grpc data planes

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on pull request #12196:
URL: https://github.com/apache/arrow/pull/12196#issuecomment-1042576911


   Close this PR. Prefer https://github.com/apache/arrow/pull/12442.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #12196: [RFC] ARROW-15282: [C++][FlightRPC] Support non-grpc data planes

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12196:
URL: https://github.com/apache/arrow/pull/12196#issuecomment-1017145615


   <!--
     Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
     Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.
   -->
   
   Thanks for opening a pull request!
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/master/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW
   
   Opening JIRAs ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename pull request title in the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on pull request #12196: [RFC] ARROW-15282: [C++][FlightRPC] Support non-grpc data planes

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on pull request #12196:
URL: https://github.com/apache/arrow/pull/12196#issuecomment-1018535080


   > How were the tests run? Assume the [flight benchmark](https://github.com/cyb70289/arrow/blob/flight-data-plane/cpp/src/arrow/flight/flight_benchmark.cc) as described at https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/ was used.
   
   Do you mean how to run flightrpc benchmark?
   - To build it you have to enable -DARROW_FLIGHT=ON in cmake.
   - To run client and server on same host: just run "arrow-flight-benchmark -num_streams=1 -num_threads=1" (change 1 to 2,4,... for more streams). It spawns server at the background automatically.
   - To run across network, you run "arrow-flight-perf-server" on server and "arrow-flight-benchmark" on client, you will need to specify server ip, see the cmdline helps for more options.
   - To verify this PR please see the commit message above.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on pull request #12196: [RFC] ARROW-15282: [C++][FlightRPC] Support non-grpc data planes

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #12196:
URL: https://github.com/apache/arrow/pull/12196#issuecomment-1017583501


   I suppose the key thing is to consider the types of backends we want to support. If they look more like UCX or WebSockets and can entirely replace gRPC, that's one thing; if they look more like shared memory (or libfabrics? DPDK?) then this approach is probably easier.
   
   Or maybe there's an approach where we allow swapping gRPC out entirely, _but only for the data plane methods_. Since it doesn't really add value to reimplement GetFlightInfo in UCX. (That argument breaks down somewhat for WebSockets, where I think using gRPC at all requires proxying or some other deployment configuration.) 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 closed pull request #12196: [RFC] ARROW-15282: [C++][FlightRPC] Support non-grpc data planes

Posted by GitBox <gi...@apache.org>.
cyb70289 closed pull request #12196:
URL: https://github.com/apache/arrow/pull/12196


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] cyb70289 commented on pull request #12196: [RFC] ARROW-15282: [C++][FlightRPC] Support non-grpc data planes

Posted by GitBox <gi...@apache.org>.
cyb70289 commented on pull request #12196:
URL: https://github.com/apache/arrow/pull/12196#issuecomment-1018387095


   Thanks @lidavidm !
   
   Now looks to me UCX transport is the better way.
   
   My main concern of the data plane approach is that we have to build by ourselves the data transmission over raw data plane libraries. A robust, high performance communication system is hard enough and we'd better adopt mature frameworks like gRPC or UCX. 80% code of this PR is the shared memory driver, and it's still far from production quality (we need to handle cache management, flow control, and many other tricky things like race conditions).
   
   I think we can leave this PR open to see if there are other comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on pull request #12196: [RFC] ARROW-15282: [C++][FlightRPC] Support non-grpc data planes

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #12196:
URL: https://github.com/apache/arrow/pull/12196#issuecomment-1018506973


   > My main concern of the data plane approach is that we have to build by ourselves the data transmission over raw data plane libraries. A robust, high performance communication system is hard enough and we'd better adopt mature frameworks like gRPC or UCX.
   
   To be fair, even with UCX there is still a fair amount of work to get Flight working on top. But I see your point - there is a lot of code that can be reused in Flight, but something very low level like shared memory without a helper library still requires a lot of work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org