Posted to user@spark.apache.org by Arnaud Bailly <ar...@gmail.com> on 2016/07/04 15:23:29 UTC

How to handle update/deletion in Structured Streaming?

Hello,

I am interested in using the new Structured Streaming feature of Spark SQL
and am currently doing some experiments on code at HEAD. I would like to
have a better understanding of how deletion should be handled in a
structured streaming setting.

Given some incremental query computing an arbitrary aggregation over some
dataset, inserting new values is somewhat obvious: simply update the
aggregate computation tree with whatever new values are added to input
datasets/datastreams. But things are not so obvious for updates and
deletions: do they have a representation in the input datastreams? If I
have a query that aggregates some value over some key, and I delete all
instances of that key, I would expect the query to output a result removing
the key's aggregated value. The same is true for updates...

Thanks for any insights you might want to share.

Regards,
-- 
Arnaud Bailly

twitter: abailly
skype: arnaud-bailly
linkedin: http://fr.linkedin.com/in/arnaudbailly/

Re: How to handle update/deletion in Structured Streaming?

Posted by Tathagata Das <ta...@gmail.com>.
Input datasets that represent an input data stream support only appending of
new rows, since the stream is modeled as an unbounded table where new data in
the stream becomes new rows appended to that table. For transformed datasets
generated from the input dataset, rows can be updated and removed as rows are
added to the input dataset. To take a concrete example, if you are maintaining
a running word count dataset, every time the input dataset receives new rows,
the rows of the word count dataset get updated. Where this really matters is
when the transformed data needs to be written to an output sink, and that is
where the output modes decide how the updated/deleted rows are written to the
sink. Currently, Spark 2.0 supports only the Complete mode, where after any
update, ALL the rows (i.e. added, updated, and unchanged rows) of the word
count dataset are given to the sink for output. Future versions of Spark will
have the Update mode, where only the added/updated rows will be given to the
sink.
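
To make the running word count case concrete, here is a minimal sketch in
Scala following the Spark 2.0 Structured Streaming API; the socket source,
host/port, and console sink are placeholder choices for illustration only:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder
      .appName("StructuredWordCount")
      .getOrCreate()

    import spark.implicits._

    // Input stream: each line arriving on the socket is a new row
    // appended to an unbounded table of lines (append-only).
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Transformed dataset: the running word counts. Its rows are
    // updated as new rows are appended to the input.
    val wordCounts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Complete mode: after every trigger, ALL rows of the aggregate
    // (added, updated, and unchanged) are written to the sink.
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()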

On Mon, Jul 4, 2016 at 8:23 AM, Arnaud Bailly <ar...@gmail.com>
wrote:

> Hello,
>
> I am interested in using the new Structured Streaming feature of Spark SQL
> and am currently doing some experiments on code at HEAD. I would like to
> have a better understanding of how deletion should be handled in a
> structured streaming setting.
>
> Given some incremental query computing an arbitrary aggregation over some
> dataset, inserting new values is somewhat obvious: simply update the
> aggregate computation tree with whatever new values are added to input
> datasets/datastreams. But things are not so obvious for updates and
> deletions: do they have a representation in the input datastreams? If I
> have a query that aggregates some value over some key, and I delete all
> instances of that key, I would expect the query to output a result removing
> the key's aggregated value. The same is true for updates...
>
> Thanks for any insights you might want to share.
>
> Regards,
> --
> Arnaud Bailly
>
> twitter: abailly
> skype: arnaud-bailly
> linkedin: http://fr.linkedin.com/in/arnaudbailly/
>