Posted to user@flink.apache.org by Harshith Chennamaneni <hc...@hiya.com> on 2016/08/19 03:20:25 UTC

Flink Redshift table lookup and updates

Hi,

I've recently come across Flink and I'm trying to use it to solve a
problem I have.

I have a stream of user settings updates coming through a Kafka queue. I
need to store the most recent settings, along with a history of settings
for each user, in Redshift, which then feeds into analytics dashboards.

I've been contemplating using Flink for this problem. I wanted some
guidance from people experienced in Flink to help me decide whether Flink
is suited to this problem and, if so, what approach might work best. I am
considering the following approaches:

1. Create a secondary key-value database with each user's latest settings,
and look up those settings after grouping the stream with keyBy(userId) to
check whether a setting has changed and, if so, create a history record. I
came across this Stack Overflow thread that helps with this approach:
http://stackoverflow.com/questions/38866078/how-to-look-up-and-update-the-state-of-a-record-from-a-database-in-apache-flink

2. Pull the current snapshot of users from Redshift at program start and
keep it as state in the Flink program (the snapshot isn't huge, ~1GB).
Subsequently look up from this state and update it when processing events.
A rough sketch of what I have in mind follows below.
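
Roughly, I imagine something like this (just a sketch; SettingsUpdate and
HistoryRecord are simplified stand-ins for my actual types, and the exact
ValueStateDescriptor constructor varies a bit between Flink versions):

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;

    // Simplified stand-ins for my actual types:
    class SettingsUpdate { public String userId; public String settings; public long timestamp; }
    class HistoryRecord  { public String userId; public String settings; public long timestamp;
        HistoryRecord(String u, String s, long t) { userId = u; settings = s; timestamp = t; } }

    public class SettingsDiff extends RichFlatMapFunction<SettingsUpdate, HistoryRecord> {

        // Latest known settings for the current key (userId).
        private transient ValueState<String> latest;

        @Override
        public void open(Configuration parameters) {
            latest = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("latest-settings", String.class));
        }

        @Override
        public void flatMap(SettingsUpdate update, Collector<HistoryRecord> out) throws Exception {
            String previous = latest.value();
            // Only emit a history record (and update state) when something changed.
            if (previous == null || !previous.equals(update.settings)) {
                out.collect(new HistoryRecord(update.userId, update.settings, update.timestamp));
                latest.update(update.settings);
            }
        }
    }

    // Wiring: updates.keyBy(u -> u.userId).flatMap(new SettingsDiff())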

In both cases I plan to create a Redshift sink that batches updates to the
history as well as the latest state and persists them to Redshift in
batches (through S3 and the COPY command for the history, and through an
UPDATE with a join for the snapshot).
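
For the history side of the sink, the batching idea would look roughly like
this (a sketch only: S3Uploader is a hypothetical helper, the JDBC URL and
IAM role are placeholders, and error handling and checkpoint integration
are omitted):

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.List;

    public class RedshiftHistorySink extends RichSinkFunction<HistoryRecord> {

        private static final int BATCH_SIZE = 10_000;
        private transient List<HistoryRecord> buffer;
        private transient Connection connection;

        @Override
        public void open(Configuration parameters) throws Exception {
            buffer = new ArrayList<>();
            connection = DriverManager.getConnection("jdbc:redshift://..."); // placeholder URL
        }

        @Override
        public void invoke(HistoryRecord record) throws Exception {
            buffer.add(record);
            if (buffer.size() >= BATCH_SIZE) {
                flush();
            }
        }

        private void flush() throws Exception {
            // Stage the batch as a CSV file on S3, then COPY it into Redshift.
            String s3Path = S3Uploader.uploadAsCsv(buffer); // hypothetical helper
            try (Statement stmt = connection.createStatement()) {
                stmt.execute("COPY settings_history FROM '" + s3Path + "'"
                        + " IAM_ROLE 'arn:aws:iam::...:role/...' FORMAT AS CSV");
            }
            buffer.clear();
        }

        @Override
        public void close() throws Exception {
            if (buffer != null && !buffer.isEmpty()) {
                flush();
            }
            if (connection != null) {
                connection.close();
            }
        }
    }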

Is one of these designs a better fit for Flink? Is there an alternative I
should consider?

Thanks!

-H

Re: Flink Redshift table lookup and updates

Posted by Robert Metzger <rm...@apache.org>.
Hi Harshith,

Welcome to the Flink community ;)

I would recommend approach 2. Keeping the state in Flink and just
sending updates to the dashboard store should give you better performance
and consistency.
I don't know whether it's better to download the full state snapshot from
Redshift at the start, or to lazily load the required data once you need
it (and then use the state afterwards).
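
The lazy variant could look roughly like this, replacing the flatMap in
your sketch (fetchFromRedshift would be a single-row JDBC SELECT you'd
write yourself; everything here is a sketch, not tested code):

    @Override
    public void flatMap(SettingsUpdate update, Collector<HistoryRecord> out) throws Exception {
        String previous = latest.value();
        if (previous == null) {
            // First event for this user since the job started: fall back to Redshift.
            previous = fetchFromRedshift(update.userId); // hypothetical; may return null for new users
            if (previous != null) {
                latest.update(previous);
            }
        }
        if (previous == null || !previous.equals(update.settings)) {
            out.collect(new HistoryRecord(update.userId, update.settings, update.timestamp));
            latest.update(update.settings);
        }
    }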

Regards,
Robert
