You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/06/09 22:03:08 UTC
[GitHub] [arrow] lidavidm opened a new pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize
lidavidm opened a new pull request #7387:
URL: https://github.com/apache/arrow/pull/7387
As discussed in #7299, this makes IpcPayload and related functions public, and adds `arrow::ipc::GetPayloadSize`. It also adds a synthetic benchmark comparing it to `GetRecordBatchSize`. (Obviously, a very unfair comparison.)
```
-------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------
GetIpcPayloadSize/1/real_time 18 ns 18 ns 39779690 54.6317TB/s
GetIpcPayloadSize/4/real_time 18 ns 18 ns 39345745 54.2457TB/s
GetIpcPayloadSize/16/real_time 18 ns 18 ns 39393839 54.2546TB/s
GetIpcPayloadSize/64/real_time 19 ns 19 ns 38973515 51.9554TB/s
GetIpcPayloadSize/256/real_time 18 ns 18 ns 37427051 53.213TB/s
GetIpcPayloadSize/1024/real_time 18 ns 18 ns 38466296 55.367TB/s
GetIpcPayloadSize/4096/real_time 19 ns 19 ns 38322691 62.537TB/s
GetIpcPayloadSize/8192/real_time 19 ns 19 ns 38556584 72.0611TB/s
GetRecordBatchSize/1/real_time 10308 ns 10298 ns 67088 96.227GB/s
GetRecordBatchSize/4/real_time 15564 ns 15547 ns 45002 63.744GB/s
GetRecordBatchSize/16/real_time 30790 ns 30759 ns 22630 32.2385GB/s
GetRecordBatchSize/64/real_time 87783 ns 87687 ns 8037 11.3322GB/s
GetRecordBatchSize/256/real_time 307918 ns 307583 ns 2292 3.25852GB/s
GetRecordBatchSize/1024/real_time 1215936 ns 1214092 ns 582 873.888MB/s
GetRecordBatchSize/4096/real_time 5498273 ns 5486982 ns 126 221.522MB/s
GetRecordBatchSize/8192/real_time 12711545 ns 12680450 ns 55 112.191MB/s
```
This _doesn't_ meet the goal of ARROW-8487, which was to allow clients to limit the size of record batches sent via Flight. I tried adding `RecordBatchWriter::WritePayload`, but this needs some thought; in particular, when you have dictionaries, you need to assemble the dictionary payloads yourself, but can't do so because you can't get the writer's internal DictionaryMemo.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize
Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7387:
URL: https://github.com/apache/arrow/pull/7387#issuecomment-642170854
But then why would the application write payloads directly?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou closed pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize
Posted by GitBox <gi...@apache.org>.
pitrou closed pull request #7387:
URL: https://github.com/apache/arrow/pull/7387
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] lidavidm commented on pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize
Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #7387:
URL: https://github.com/apache/arrow/pull/7387#issuecomment-642168091
> Why is `WriteIpcPayload` not sufficient for that?
`WriteIpcPayload` targets an `io::OutputStream` but in Flight, you get only a `RecordBatchWriter`.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] github-actions[bot] commented on pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize
Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #7387:
URL: https://github.com/apache/arrow/pull/7387#issuecomment-641610654
https://issues.apache.org/jira/browse/ARROW-5377
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] lidavidm commented on pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize
Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #7387:
URL: https://github.com/apache/arrow/pull/7387#issuecomment-642183620
Reopened as #7398 since I rebased before reopening the PR.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize
Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7387:
URL: https://github.com/apache/arrow/pull/7387#issuecomment-642156864
Why is `WriteIpcPayload` not sufficient for that?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] lidavidm commented on pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize
Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #7387:
URL: https://github.com/apache/arrow/pull/7387#issuecomment-641971704
Thanks for the review. What do you think about having `RecordBatchWriter::WritePayload(const IpcPayload&)` along with `DictionaryMemo* RecordBatchWriter::dictionary_memo`? I think that would be sufficient to let an application manage writing its own payloads to a regular stream (if it wanted to control the serialized size).
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize
Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7387:
URL: https://github.com/apache/arrow/pull/7387#issuecomment-642536886
Should I merge this first and then you'll rebase #7398?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize
Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7387:
URL: https://github.com/apache/arrow/pull/7387#issuecomment-642168852
Who is "you"? The application?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize
Posted by GitBox <gi...@apache.org>.
pitrou commented on pull request #7387:
URL: https://github.com/apache/arrow/pull/7387#issuecomment-642179329
Ah, well, then let's reopen #7299 instead of violating API layers in `RecordBatchWriter`.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] lidavidm commented on pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize
Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #7387:
URL: https://github.com/apache/arrow/pull/7387#issuecomment-642628014
Yes, sounds good, thank you.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] lidavidm commented on pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize
Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #7387:
URL: https://github.com/apache/arrow/pull/7387#issuecomment-642173870
So, gRPC has a default limit on the max size of any message on the wire. Flight disables this limit to allow you to write large record batches. However, we've found that it's still useful to enforce a (less restrictive) limit on the server side to help control the memory usage of a Java Flight server (so this applies to clients sending data via DoPut).
Once you do that though, a Flight client must then limit the sizes of record batches it writes. While there are things like `RecordBatch.nbytes`, these aren't perfect estimates (e.g. it doesn't account for slicing/it really measures the memory usage of the underlying buffers).
Being able to write a payload would help by letting the application construct a payload, check its size, then write it. In the common case, the payload is under the size limit and can be written, avoiding duplicate serialization.
(Of course, this would need to get exposed to Python to be useful...)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] lidavidm commented on pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize
Posted by GitBox <gi...@apache.org>.
lidavidm commented on pull request #7387:
URL: https://github.com/apache/arrow/pull/7387#issuecomment-642169852
Sorry, yes - the application only gets a RecordBatchWriter.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] pitrou commented on a change in pull request #7387: ARROW-5377: [C++] Make IpcPayload public and add GetPayloadSize
Posted by GitBox <gi...@apache.org>.
pitrou commented on a change in pull request #7387:
URL: https://github.com/apache/arrow/pull/7387#discussion_r438018616
##########
File path: cpp/src/arrow/ipc/writer.h
##########
@@ -330,6 +306,11 @@ ARROW_EXPORT
Status GetRecordBatchPayload(const RecordBatch& batch, const IpcWriteOptions& options,
IpcPayload* out);
+/// \brief Write an IPC payload to the given stream.
+/// \param[in] payload the payload to write
+/// \param[in] options options for serialization
+/// \param[out] metadata_length the length of the serialized metadata
Review comment:
The docstring is missing the `dst` parameter.
##########
File path: cpp/src/arrow/ipc/read_write_benchmark.cc
##########
@@ -153,9 +153,38 @@ static void DecodeStream(benchmark::State& state) { // NOLINT non-const referen
state.SetBytesProcessed(int64_t(state.iterations()) * kTotalSize);
}
+static void GetRecordBatchSize(benchmark::State& state) { // NOLINT non-const reference
Review comment:
I don't think it makes sense to maintain those benchmarks in the codebase.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
users@infra.apache.org