Posted to issues@arrow.apache.org by "lidavidm (via GitHub)" <gi...@apache.org> on 2023/03/07 13:41:55 UTC

[GitHub] [arrow] lidavidm opened a new issue, #34485: [Format][FlightRPC] Transfer FlightData in pieces

lidavidm opened a new issue, #34485:
URL: https://github.com/apache/arrow/issues/34485

   ### Describe the enhancement requested
   
   gRPC presents a message-based interface (not a stream-based interface like HTTP). It also limits the size of individual messages by default. For Flight, this means that large record batches will by default cause gRPC clients to error.
   
   Flight clients generally override this setting to avoid this error, but some users may want to use the default gRPC client, and some users may want to keep this setting enabled to protect against misbehaving servers. (And some users may not know about this behavior.) Indeed, the Go Flight client doesn't unlock this setting by default.
   
   For maximal compatibility, Flight servers can reslice record batches to limit their size before sending them on the wire. However, this has a few problems. Estimating the size of a record batch takes some work, and even then you don't know the size of the IPC metadata or Protobuf metadata (which count toward the gRPC message size), so a fudge factor is necessary. Slicing is not necessarily free in all implementations, and it is not free once we get to IPC, where offset buffers may need to be rewritten. (This should be cheap, but it isn't free.) And a single row is unsliceable, but large string data or highly nested data may put a large amount of data in a single logical row.
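
   As a rough illustration of the reslicing approach and its fudge factor, here is a minimal sketch in plain Python. The function name `plan_slices` and the `overhead_bytes` default are hypothetical and not part of any Flight API; the size estimate is assumed to come from the caller.

```python
def plan_slices(num_rows, estimated_bytes, max_message_bytes,
                overhead_bytes=64 * 1024):
    """Return (offset, length) row ranges whose estimated size fits the limit.

    overhead_bytes is a hypothetical fudge factor reserved for the IPC and
    Protobuf metadata, whose exact size is not known before serialization.
    """
    budget = max_message_bytes - overhead_bytes
    if budget <= 0:
        raise ValueError("overhead leaves no room for data")
    if num_rows == 0:
        return []
    # Average row size, rounded up (ceiling division).
    bytes_per_row = max(1, -(-estimated_bytes // num_rows))
    # A single row cannot be split, so never go below one row per slice.
    rows_per_slice = max(1, budget // bytes_per_row)
    return [(offset, min(rows_per_slice, num_rows - offset))
            for offset in range(0, num_rows, rows_per_slice)]
```

   Note that a batch consisting of one oversized row still yields a single slice of one row, which is exactly the unsliceable case described above.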
   
   In C++, DoPut has a mechanism to work around this on the client by implementing an 'optimistic' check (where the size is checked after IPC serialization, and a distinct error is returned so the client can slice and try again), but it is error-prone to implement and isn't currently available to the server or across languages.
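
   The general shape of such an optimistic check can be sketched as follows. This is not the C++ implementation; `MessageTooLarge`, `send_optimistically`, and the `serialize`/`send` callbacks are hypothetical names used only for illustration, and rows are modeled as a plain Python list.

```python
class MessageTooLarge(Exception):
    """Hypothetical distinct error for a payload that cannot be made to fit."""


def send_optimistically(rows, serialize, send, max_message_bytes):
    """Serialize first, check the size afterwards, and split on failure."""
    pending = [rows]
    while pending:
        chunk = pending.pop()
        payload = serialize(chunk)
        if len(payload) <= max_message_bytes:
            send(payload)
            continue
        if len(chunk) <= 1:
            # A single row cannot be sliced any further.
            raise MessageTooLarge(f"{len(payload)} bytes exceeds the limit")
        mid = len(chunk) // 2
        # Push the second half first so the first half is sent first.
        pending.append(chunk[mid:])
        pending.append(chunk[:mid])
```

   The sketch serializes oversized chunks more than once, which is part of why the approach costs more than slicing up front.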
   
   An alternative may be to have Flight model the IPC structure in more detail. Currently, a FlightData message is effectively a RecordBatch modeled as an IPC message + a single body buffer. Instead, we could transfer the data as multiple messages, with each gRPC/Protobuf message carrying one or more buffers. That way the only limitation is that an individual buffer must fit in the message size limit. (So large string data may still be a problem.) (We could even fragment buffers, but then it becomes non-zero-copy on the reader. I don't think there's a way to work around that without implementing gRPC ourselves. We may have to accept that tradeoff if large string data is a problem, though.)
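
   To make the idea concrete, here is a minimal sketch of grouping a batch's buffers into size-limited messages. The function `pack_buffers` is hypothetical, it works on buffer sizes only, and it ignores per-message metadata overhead; no wire format is implied.

```python
def pack_buffers(buffer_sizes, max_message_bytes):
    """Group consecutive buffer indices into messages under the size limit."""
    messages, current, current_size = [], [], 0
    for index, size in enumerate(buffer_sizes):
        if size > max_message_bytes:
            # Without fragmenting buffers, one oversized buffer cannot be sent.
            raise ValueError(
                f"buffer {index} ({size} bytes) exceeds the message limit")
        if current and current_size + size > max_message_bytes:
            # Adding this buffer would overflow; start a new message.
            messages.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        messages.append(current)
    return messages
```

   Buffers stay whole and in order, so the reader can remain zero-copy; the `ValueError` branch corresponds to the large string data case that would require fragmenting a buffer.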
   
   This may also be an opportunity to solve #32276. We should know what the Protobuf metadata looks like and be able to finagle it so that the metadata always takes 8 bytes, or insert our own padding. (However, given gRPC-C++ returns messages as a list of byte slices with no guarantees on alignment as far as I'm aware, just fixing it on the wire format side may not be enough.)
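
   The padding arithmetic itself is simple; a minimal sketch, assuming we know the byte offset at which the body would otherwise start. The helper name `padding_for_alignment` is hypothetical, and where such padding would live in the Protobuf encoding is not decided here.

```python
def padding_for_alignment(offset, alignment=8):
    """Return how many padding bytes move offset up to the next multiple."""
    return (-offset) % alignment
```

   For example, a body that would start at offset 13 needs 3 bytes of padding to land on offset 16.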
   
   This also touches on prior issues about feature/version negotiation in Flight, since we would need some way to know to use the new format.
   
   ### Component(s)
   
   FlightRPC, Format


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] alamb commented on issue #34485: [Format][FlightRPC] Transfer FlightData in pieces

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #34485:
URL: https://github.com/apache/arrow/issues/34485#issuecomment-1739684988

   We frequently hit gRPC message size limits (the aforementioned 4MB limit in the default Go client).
   
   The Rust Flight API has a workaround for this limit that does the RecordBatch slicing mentioned above for users. See https://docs.rs/arrow-flight/47.0.0/arrow_flight/encode/struct.FlightDataEncoderBuilder.html#method.with_max_flight_data_size
   
   This works pretty well for us, but it does not handle the cases that @lidavidm mentions (a single large row or string value).

