You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Govindarajan Srinivasaraghavan <go...@gmail.com> on 2016/09/24 17:32:06 UTC

Questions on flink

Hi,

I'm working on apache flink for data streaming and I have few questions.
Any help is greatly appreciated. Thanks.

1) Are there any restrictions on creating tumbling windows. For example, if
I want to create a tumbling window per user id for 2 secs and let’s say if
I have more than 10 million user id's would that be a problem. (I'm using
keyBy user id and then creating a timeWindow for 2 secs)? How are these
windows maintained internally in flink?

2) I looked at rebalance for round robin partitioning. Let’s say I have a
cluster set up and if I have a parallelism of 1 for source and if I do a
rebalance, will my data be shuffled across machines to improve performance?
If so is there a specific port using which the data is transferred to other
nodes in the cluster?

3) Are there any limitations on state maintenance? I'm planning to maintain
some user id related data which could grow very large. I read about flink
using rocks db to maintain the state. Just wanted to check if there are any
limitations on how much data can be maintained?

4) Also where is the state maintained if the amount of data is less? (I
guess in JVM memory) If I have several machines on my cluster can every
node get the current state version?

5) I need a way to send external configuration changes to flink. Lets say
there is a new parameter that has to added or an external change which has
to be updated inside flink's state, how can this be done?

Thanks

Re: Questions on flink

Posted by Jamie Grier <ja...@data-artisans.com>.
Hi Govindarajan,

I've put some answers in-line below..

On Sat, Sep 24, 2016 at 7:32 PM, Govindarajan Srinivasaraghavan <
govindraghvan@gmail.com> wrote:

> Hi,
>
> I'm working on apache flink for data streaming and I have few questions.
> Any help is greatly appreciated. Thanks.
>
> 1) Are there any restrictions on creating tumbling windows. For example,
> if I want to create a tumbling window per user id for 2 secs and let’s say
> if I have more than 10 million user id's would that be a problem. (I'm
> using keyBy user id and then creating a timeWindow for 2 secs)? How are
> these windows maintained internally in flink?
>

That should not be a problem in general.  An important question may be how
many unique keys will you see in two seconds.  This is more important than
your total key cardinality of 10 Million and probably a *much* smaller
number unless your input message rate is really high.

>
> 2) I looked at rebalance for round robin partitioning. Let’s say I have a
> cluster set up and if I have a parallelism of 1 for source and if I do a
> rebalance, will my data be shuffled across machines to improve performance?
> If so is there a specific port using which the data is transferred to other
> nodes in the cluster?
>

Yes, rebalance() does a round-robin distribution of messages to other
machines in the cluster.  There is not a specific port used for each
TaskManager to communicate on but rather an available port is assigned at
runtime.  This is the default.  You can also set this to a specific port if
you have reason and a lot depends on how you will deploy -- via YARN or as
a standalone Flink cluster.


>
> 3) Are there any limitations on state maintenance? I'm planning to
> maintain some user id related data which could grow very large. I read
> about flink using rocks db to maintain the state. Just wanted to check if
> there are any limitations on how much data can be maintained?
>

Yes, there are limits.  The total data that can be maintained today is
determined by the fact that Flink has to periodically snapshot this data
and copy it to a persistent storage system such as HDFS whether you are
using RocksDB or not.  The aggregate bandwidth required to your storage
system (like HDFS) is your total Flink state size multiplied by your Flink
checkpoint interval.


> 4) Also where is the state maintained if the amount of data is less? (I
> guess in JVM memory) If I have several machines on my cluster can every
> node get the current state version?
>

I'm not exactly sure what you're asking here.  All data is check-pointed to
a persistent store which must be accessible from each machine in the
cluster.


> 5) I need a way to send external configuration changes to flink. Lets say
> there is a new parameter that has to added or an external change which has
> to be updated inside flink's state, how can this be done?
>

The typical way to do this is to consume that configuration as a stream and
hold the configuration internally in the state of a particular user
function.


>
> Thanks
>



-- 

Jamie Grier
data Artisans, Director of Applications Engineering
@jamiegrier <https://twitter.com/jamiegrier>
jamie@data-artisans.com

Re: Questions on flink

Posted by Jamie Grier <ja...@data-artisans.com>.
Hi Govindarajan,

I've put some answers in-line below..

On Sat, Sep 24, 2016 at 7:32 PM, Govindarajan Srinivasaraghavan <
govindraghvan@gmail.com> wrote:

> Hi,
>
> I'm working on apache flink for data streaming and I have few questions.
> Any help is greatly appreciated. Thanks.
>
> 1) Are there any restrictions on creating tumbling windows. For example,
> if I want to create a tumbling window per user id for 2 secs and let’s say
> if I have more than 10 million user id's would that be a problem. (I'm
> using keyBy user id and then creating a timeWindow for 2 secs)? How are
> these windows maintained internally in flink?
>

That should not be a problem in general.  An important question may be how
many unique keys will you see in two seconds.  This is more important than
your total key cardinality of 10 Million and probably a *much* smaller
number unless your input message rate is really high.

>
> 2) I looked at rebalance for round robin partitioning. Let’s say I have a
> cluster set up and if I have a parallelism of 1 for source and if I do a
> rebalance, will my data be shuffled across machines to improve performance?
> If so is there a specific port using which the data is transferred to other
> nodes in the cluster?
>

Yes, rebalance() does a round-robin distribution of messages to other
machines in the cluster.  There is not a specific port used for each
TaskManager to communicate on but rather an available port is assigned at
runtime.  This is the default.  You can also set this to a specific port if
you have reason and a lot depends on how you will deploy -- via YARN or as
a standalone Flink cluster.


>
> 3) Are there any limitations on state maintenance? I'm planning to
> maintain some user id related data which could grow very large. I read
> about flink using rocks db to maintain the state. Just wanted to check if
> there are any limitations on how much data can be maintained?
>

Yes, there are limits.  The total data that can be maintained today is
determined by the fact that Flink has to periodically snapshot this data
and copy it to a persistent storage system such as HDFS whether you are
using RocksDB or not.  The aggregate bandwidth required to your storage
system (like HDFS) is your total Flink state size multiplied by your Flink
checkpoint interval.


> 4) Also where is the state maintained if the amount of data is less? (I
> guess in JVM memory) If I have several machines on my cluster can every
> node get the current state version?
>

I'm not exactly sure what you're asking here.  All data is check-pointed to
a persistent store which must be accessible from each machine in the
cluster.


> 5) I need a way to send external configuration changes to flink. Lets say
> there is a new parameter that has to added or an external change which has
> to be updated inside flink's state, how can this be done?
>

The typical way to do this is to consume that configuration as a stream and
hold the configuration internally in the state of a particular user
function.


>
> Thanks
>



-- 

Jamie Grier
data Artisans, Director of Applications Engineering
@jamiegrier <https://twitter.com/jamiegrier>
jamie@data-artisans.com