Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/20 06:49:38 UTC

[GitHub] [arrow] jorgecarleitao commented on pull request #8968: ARROW-10979: [Rust] Basic Kafka Reader

jorgecarleitao commented on pull request #8968:
URL: https://github.com/apache/arrow/pull/8968#issuecomment-748571124


   Thanks a lot for this PR and for opening this discussion.
   
   I am trying to understand the rationale, too :) :
   
   * Kafka is event driven, Arrow format is batch processing
   * Kafka is event driven, Iterators are blocking
   * the current PR handles the payload as an opaque binary
   
   Wrt the first point, we could use micro-batches, but doesn't that defeat the purpose of the Arrow format? The whole idea is based on the notion of data locality, low metadata footprint, and batch processing. All of these are shredded in a micro-batch architecture.
   
   Wrt the second point, wouldn't it make sense to build a `stream` instead of an `iterator`? As it stands, we will be blocking the thread while waiting for something to happen on the topic, no?
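   
   To make the contrast concrete, here is a minimal sketch of the two consumption shapes (with a hypothetical `KafkaBatch` placeholder, not this PR's actual types), using the `futures` crate:
   
   ```rust
   use futures::{Stream, StreamExt};
   
   // Hypothetical message type standing in for whatever the reader yields.
   struct KafkaBatch {
       payload: Vec<u8>,
   }
   
   // Iterator shape: `next()` parks the calling thread until the broker
   // delivers a message, so nothing else can run on that thread meanwhile.
   fn consume_blocking(reader: impl Iterator<Item = KafkaBatch>) {
       for batch in reader {
           let _ = batch.payload; // handle the message
       }
   }
   
   // Stream shape: `.await` yields back to the executor while the topic is
   // idle, so a single thread can drive many consumers concurrently.
   async fn consume_async(mut reader: impl Stream<Item = KafkaBatch> + Unpin) {
       while let Some(batch) = reader.next().await {
           let _ = batch.payload; // handle the message
       }
   }
   ```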
   
   Wrt the third point, why add the complexity of Arrow if, in the end, the payload is an opaque binary? The user would still have to convert that payload to the Arrow format for compute (or is the idea to keep it opaque?), so, IMO, we are not really solving the problem: why would a user prefer a `RecordBatch` of N messages over a `Vec<KafkaBatch>` (or, even better, not block the thread and just use a stream of `KafkaBatch`)?
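   
   To illustrate that last point, here is a hypothetical sketch (assuming, for the sake of argument, that the payload happens to be JSON): even once the messages sit in a `Binary` column of a `RecordBatch`, each value still has to be decoded row by row before any Arrow compute can touch the actual fields:
   
   ```rust
   use arrow::array::{Array, BinaryArray};
   
   // Even with payloads packed into an Arrow Binary column, each value
   // still has to be parsed individually before compute is possible.
   fn parse_payloads(column: &BinaryArray) -> serde_json::Result<Vec<serde_json::Value>> {
       (0..column.len())
           .map(|i| serde_json::from_slice(column.value(i)))
           .collect()
   }
   ```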
   
   If anything, I would say that the architecture should be something like:
   
   ```
   stream of rows ------ stream of batches of size N ------ compute
   ```
   
   I.e., there should be a stream adapter (a chunk iterator) that maps rows to a batch, either in Arrow or in another format.
   
   One idea in this direction would be to reduce the scope of this PR to introducing a stream adapter that does exactly this: a generic struct that takes a stream of rows and returns a new stream of `RecordBatch`. But again, why bother with `Vec<KafkaBatch> -> RecordBatch`? A `StructArray` with a `Binary` field still needs to be parsed into an arrow array given a schema, and that is the most blocking operation of all of this. We would still need to agree upfront on the format of the payload, which is typically heterogeneous and use-case dependent (and seldom decided by the consumer). So, this brings us to the IPC discussion already initiated above.
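   
   As a rough sketch of such an adapter (hedged: `Row` is a hypothetical stand-in for a consumed message, and the grouping leans on `futures`' `StreamExt::chunks`; note the column is still the same opaque `Binary` discussed above, so the parsing problem remains):
   
   ```rust
   use std::sync::Arc;
   
   use arrow::array::{ArrayRef, BinaryArray};
   use arrow::datatypes::{DataType, Field, Schema};
   use arrow::record_batch::RecordBatch;
   use futures::{Stream, StreamExt};
   
   // Hypothetical row type: the raw payload of one consumed message.
   struct Row {
       payload: Vec<u8>,
   }
   
   // Generic adapter: group a stream of rows into chunks of `n`, then map
   // each chunk to a RecordBatch with a single opaque Binary column.
   fn into_record_batches(
       rows: impl Stream<Item = Row>,
       n: usize,
   ) -> impl Stream<Item = arrow::error::Result<RecordBatch>> {
       let schema = Arc::new(Schema::new(vec![Field::new(
           "payload",
           DataType::Binary,
           false,
       )]));
       rows.chunks(n).map(move |chunk| {
           let values: Vec<&[u8]> = chunk.iter().map(|r| r.payload.as_slice()).collect();
           let array: ArrayRef = Arc::new(BinaryArray::from(values));
           RecordBatch::try_new(schema.clone(), vec![array])
       })
   }
   ```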
   
   But maybe I am completely missing the point ^_^

