Posted to user@flink.apache.org by Stephen Young <wi...@googlemail.com> on 2020/02/04 15:14:06 UTC

Best approach for recalculating statistics based on amended or deleted events?

I am currently looking into how Flink can support a live data collection platform. We want to collect certain data in real time. This data will be sent to Kafka, and we want to use Flink to calculate statistics and derived events from it.

An important thing we need to be able to handle is amendment or deletion events. For example, we may get an event that someone has performed an action, and from this we'd calculate how many of these actions they had taken in total. We'd also build calculations on top of that, for example top-10 rankings by these counts, or arbitrarily many layers of calculations beyond that. But some time later (this could be a few seconds or a week) we might receive an amendment event for that action, indicating that the action was taken by a different person or from a different location. We then need Flink to recalculate all of our downstream stats, i.e. the counts need to change and the rankings need to be adjusted.
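
To make this concrete, in SQL terms we are after something like the
following (all names are made up), where an amendment would
automatically propagate through both layers:

   -- per-user counts over the raw events
   SELECT user_id, COUNT(*) AS action_count
   FROM actions
   GROUP BY user_id

   -- top 10 users by that count, where user_counts stands for the
   -- result of the previous query
   SELECT user_id, action_count
   FROM (
     SELECT user_id, action_count,
            ROW_NUMBER() OVER (ORDER BY action_count DESC) AS row_num
     FROM user_counts
   )
   WHERE row_num <= 10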

From my research into Flink I can see there is a page about Dynamic Tables, and there is also some material about retraction support for the Table/SQL API. But it seems like this only describes how Flink models changes to aggregated data internally. I would like to be able to do something like calculate a count from a set of events, each with its own id, then retract one of those events by its id and have the count change automatically.

Is anything like this achievable with Flink? Thanks!

Re: Best approach for recalculating statistics based on amended or deleted events?

Posted by Timo Walther <tw...@apache.org>.
Hi Stephen,

What I meant was that the schema of the table might still contain a 
column that describes the change, e.g. "isRetract". We cannot apply it 
internally, but of course you can deal with this column in a SQL query.
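
For example, something along these lines could keep only the latest
version of each event and aggregate on top of it. This is just a rough,
untested sketch: the table and column names are made up, and proctime
is assumed to be a processing-time attribute on the source table.

   SELECT user_id, COUNT(*) AS action_count
   FROM (
     SELECT *,
            ROW_NUMBER() OVER (
              PARTITION BY action_id
              ORDER BY proctime DESC) AS row_num
     FROM actions
   )
   WHERE row_num = 1       -- keep only the latest version of each action
     AND NOT is_retract    -- drop actions whose latest version retracts
   GROUP BY user_id

Because the inner query keeps only the latest row per action_id, a
later amendment or retraction replaces the original row and the count
is updated automatically downstream.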

This Stack Overflow answer and the answer linked from it might also 
help you:

https://stackoverflow.com/questions/59360243/flink-sql-use-changelog-stream-to-update-rows-in-dynamic-table


Regards,
Timo

Re: Best approach for recalculating statistics based on amended or deleted events?

Posted by Stephen Young <wi...@googlemail.com>.
Are you able to advise any further, Timo? Thanks!


Re: Best approach for recalculating statistics based on amended or deleted events?

Posted by Stephen Young <wi...@googlemail.com>.
Hi Timo,

Thanks for replying to me so quickly!

We could do it with insert-only rows. When you say flags in the data, do you mean a field with a name like 'retracts', whose value is the id of the event/row we want to retract? How would that be possible with Flink?
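
In other words, something like these two events, where the second one
cancels the first (a made-up example):

   id | user  | action | retracts
   ---+-------+--------+---------
   e1 | alice | click  | NULL
   e2 | NULL  | NULL   | e1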

Thanks!


Re: Best approach for recalculating statistics based on amended or deleted events?

Posted by Timo Walther <tw...@apache.org>.
Hi Stephen,

the use cases you are describing sound like a perfect fit for Flink. 
Internally, Flink deals with insertions and deletions flowing through 
the system and can update chained aggregations and complex queries.

The biggest limitation at the moment is that we only support sources 
that emit insert-only rows. The community is currently working on 
designing how to expose the internal changelog processing capabilities 
through our APIs.

However, your use case might also work with insert-only rows and a query 
based on the flags in the data, correct?
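
For example, if every event carried a flag marking it as an insertion
or a retraction, a running count could be expressed directly over the
insert-only stream. Just a rough sketch; the table and column names
are made up:

   SELECT user_id,
          SUM(CASE WHEN is_retract THEN -1 ELSE 1 END) AS action_count
   FROM actions
   GROUP BY user_id

An amendment could then be modeled as a retraction of the old version
followed by an insertion of the new one.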

Regards,
Timo