You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/09 22:03:08 UTC

[GitHub] [arrow] lidavidm opened a new pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize

lidavidm opened a new pull request #7387:
URL: https://github.com/apache/arrow/pull/7387


   As discussed in #7299, this makes IpcPayload and related functions public, and adds `arrow::ipc::GetPayloadSize`. It also adds a synthetic benchmark comparing it to `GetRecordBatchSize`. (Obviously, a very unfair comparison.)
   
   ```
   -------------------------------------------------------------------------
   Benchmark                                  Time           CPU Iterations
   -------------------------------------------------------------------------
   GetIpcPayloadSize/1/real_time             18 ns         18 ns   39779690   54.6317TB/s
   GetIpcPayloadSize/4/real_time             18 ns         18 ns   39345745   54.2457TB/s
   GetIpcPayloadSize/16/real_time            18 ns         18 ns   39393839   54.2546TB/s
   GetIpcPayloadSize/64/real_time            19 ns         19 ns   38973515   51.9554TB/s
   GetIpcPayloadSize/256/real_time           18 ns         18 ns   37427051    53.213TB/s
   GetIpcPayloadSize/1024/real_time          18 ns         18 ns   38466296    55.367TB/s
   GetIpcPayloadSize/4096/real_time          19 ns         19 ns   38322691    62.537TB/s
   GetIpcPayloadSize/8192/real_time          19 ns         19 ns   38556584   72.0611TB/s
   GetRecordBatchSize/1/real_time         10308 ns      10298 ns      67088    96.227GB/s
   GetRecordBatchSize/4/real_time         15564 ns      15547 ns      45002    63.744GB/s
   GetRecordBatchSize/16/real_time        30790 ns      30759 ns      22630   32.2385GB/s
   GetRecordBatchSize/64/real_time        87783 ns      87687 ns       8037   11.3322GB/s
   GetRecordBatchSize/256/real_time      307918 ns     307583 ns       2292   3.25852GB/s
   GetRecordBatchSize/1024/real_time    1215936 ns    1214092 ns        582   873.888MB/s
   GetRecordBatchSize/4096/real_time    5498273 ns    5486982 ns        126   221.522MB/s
   GetRecordBatchSize/8192/real_time   12711545 ns   12680450 ns         55   112.191MB/s
   ```
   
   This _doesn't_ meet the goal of ARROW-8487, which was to allow clients to limit the size of record batches sent via Flight. I tried adding `RecordBatchWriter::WritePayload`, but this needs some thought; in particular, when you have dictionaries, you need to assemble the dictionary payloads yourself, but can't do so because you can't get the writer's internal DictionaryMemo.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7387:
URL: https://github.com/apache/arrow/pull/7387#issuecomment-642170854


   But then why would the application write payloads directly?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou closed pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize

Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #7387:
URL: https://github.com/apache/arrow/pull/7387


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #7387:
URL: https://github.com/apache/arrow/pull/7387#issuecomment-642168091


   > Why is `WriteIpcPayload` not sufficient for that?
   
   `WriteIpcPayload` targets an `io::OutputStream` but in Flight, you get only a `RecordBatchWriter`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7387:
URL: https://github.com/apache/arrow/pull/7387#issuecomment-641610654


   https://issues.apache.org/jira/browse/ARROW-5377


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #7387:
URL: https://github.com/apache/arrow/pull/7387#issuecomment-642183620


   Reopened as #7398 since I rebased before reopening the PR.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7387:
URL: https://github.com/apache/arrow/pull/7387#issuecomment-642156864


   Why is `WriteIpcPayload` not sufficient for that?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #7387:
URL: https://github.com/apache/arrow/pull/7387#issuecomment-641971704


   Thanks for the review. What do you think about having `RecordBatchWriter::WritePayload(const IpcPayload&)` along with `DictionaryMemo* RecordBatchWriter::dictionary_memo`? I think that would be sufficient to let an application manage writing its own payloads to a regular stream (if it wanted to control the serialized size).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7387:
URL: https://github.com/apache/arrow/pull/7387#issuecomment-642536886


   Should I merge this first and then you'll rebase #7398?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7387:
URL: https://github.com/apache/arrow/pull/7387#issuecomment-642168852


   Who is "you"? The application?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize

Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7387:
URL: https://github.com/apache/arrow/pull/7387#issuecomment-642179329


   Ah, well, then let's reopen #7299 instead of violating API layers in `RecordBatchWriter`.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #7387:
URL: https://github.com/apache/arrow/pull/7387#issuecomment-642628014


   Yes, sounds good, thank you.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #7387:
URL: https://github.com/apache/arrow/pull/7387#issuecomment-642173870


   So, gRPC has a default limit on the max size of any message on the wire. Flight disables this limit to allow you to write large record batches. However, we've found that it's still useful to enforce a (less restrictive) limit on the server side to help control the memory usage of a Java Flight server (so this applies to clients sending data via DoPut).
   
   Once you do that though, a Flight client must then limit the sizes of record batches it writes. While there are things like `RecordBatch.nbytes`, these aren't perfect estimates (e.g. it doesn't account for slicing/it really measures the memory usage of the underlying buffers).
   
   Being able to write a payload would help by letting the application construct a payload, check its size, then write it. In the common case, the payload is under the size limit and can be written, avoiding duplicate serialization.
   
   (Of course, this would need to get exposed to Python to be useful...)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] lidavidm commented on pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize

Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #7387:
URL: https://github.com/apache/arrow/pull/7387#issuecomment-642169852


   Sorry, yes - the application only gets a RecordBatchWriter.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] pitrou commented on a change in pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize

Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7387:
URL: https://github.com/apache/arrow/pull/7387#discussion_r438018616



##########
File path: cpp/src/arrow/ipc/writer.h
##########
@@ -330,6 +306,11 @@ ARROW_EXPORT
 Status GetRecordBatchPayload(const RecordBatch& batch, const IpcWriteOptions& options,
                              IpcPayload* out);
 
+/// \brief Write an IPC payload to the given stream.
+/// \param[in] payload the payload to write
+/// \param[in] options options for serialization
+/// \param[out] metadata_length the length of the serialized metadata

Review comment:
       The docstring is missing the `dst` parameter.

##########
File path: cpp/src/arrow/ipc/read_write_benchmark.cc
##########
@@ -153,9 +153,38 @@ static void DecodeStream(benchmark::State& state) {  // NOLINT non-const referen
   state.SetBytesProcessed(int64_t(state.iterations()) * kTotalSize);
 }
 
+static void GetRecordBatchSize(benchmark::State& state) {  // NOLINT non-const reference

Review comment:
       I don't think it makes sense to maintain those benchmarks in the codebase.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org