You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Matt Rudary <Ma...@twosigma.com> on 2022/04/27 17:02:38 UTC

Arrow, Flight, Streaming, and Watermarking

Hi,

We're looking at using Arrow as part of our solution to ship tabular data between different streaming systems, potentially implemented using different technologies, like Spark, Beam, Flink, etc. Some of these systems contain "watermarks" as a key concept. Briefly, a watermark is a promise that a certain data source will not produce any more events/rows with a timestamp earlier than a given time. For example, if I produce a batch of rows every 5 minutes, after I've finished sending the 12:00 data, I would send a watermark update of 12:04:59, thus letting downstream consumers know that no future row from me will have a timestamp before 12:05.

We would like to be able to propagate watermarks with our data, and I wondered if this list has any ideas of how to do this currently, or whether it is part of the roadmap for the Arrow compute api or similar. We'd like to be able to do this over Arrow Flight, but potentially also for other methods of shipping Arrow data, like pubsub feeds, file dumps, etc.

Thanks
Matt Rudary
Two Sigma

Re: Arrow, Flight, Streaming, and Watermarking

Posted by David Li <li...@apache.org>.
Hey Matt,

For Flight: for DoGet/DoPut/DoExchange, you can accomplish with the app_metadata fields built in to these methods. For instance, in DoGet/DoExchange, you could send some batches of data, then send a message with only an app_metadata field encoding the watermark. (The app_metadata field is a byte blob, and you can impart structure on it with JSON or Protobuf or the scheme of your choice.) The app_metadata field could also be appended to the side of a record batch as well.

For Arrow files: ARROW-16131 in Arrow 8.0.0 gives you the ability to add custom key-value metadata to individual batches, and Arrow already supported key-value metadata in files, so you could use either of these to record a watermark. 

Do these seem workable? Regarding the roadmap, I don't think we've seen requests for this previously; what sort of support would help in the compute API?

-David

On Wed, Apr 27, 2022, at 13:02, Matt Rudary wrote:
> Hi,
>
> We're looking at using Arrow as part of our solution to ship tabular 
> data between different streaming systems, potentially implemented using 
> different technologies, like Spark, Beam, Flink, etc. Some of these 
> systems contain "watermarks" as a key concept. Briefly, a watermark is 
> a promise that a certain data source will not produce any more 
> events/rows with a timestamp earlier than a given time. For example, if 
> I produce a batch of rows every 5 minutes, after I've finished sending 
> the 12:00 data, I would send a watermark update of 12:04:59, thus 
> letting downstream consumers know that no future row from me will have 
> a timestamp before 12:05.
>
> We would like to be able to propagate watermarks with our data, and I 
> wondered if this list has any ideas of how to do this currently, or 
> whether it is part of the roadmap for the Arrow compute api or similar. 
> We'd like to be able to do this over Arrow Flight, but potentially also 
> for other methods of shipping Arrow data, like pubsub feeds, file 
> dumps, etc.
>
> Thanks
> Matt Rudary
> Two Sigma