Posted to user@flink.apache.org by Dmitry Golubets <dg...@gmail.com> on 2017/01/23 10:05:34 UTC

Count window on partition

Hi,

I'm looking for the right way to do the following scheme:

1. Read data
2. Split it into partitions for parallel processing
3. In every partition group data in N elements batches
4. Process these batches

My first attempt was:
*dataStream.keyBy(_.key).countWindow(..)*
But countWindow groups by key, whereas I want to group all the elements in
a partition.

Then I tried:
*dataStream.keyBy(_.key).countWindowAll(..)*
But apparently countWindowAll doesn't work on partitioned data.

So, my last version is:

*dataStream.keyBy(_.key.hashCode % 4).countWindow(..)*
But it looks kinda hacky with a hardcoded partition count.

So, what's the right way of doing it?

Best regards,
Dmitry

Re: Count window on partition

Posted by Fabian Hueske <fh...@gmail.com>.
Hi Dmitry,

the third version is the way to go, IMO.
You might want to have a larger number of partitions if you are planning to
later increase the parallelism of the job.
Also note that it is not guaranteed that the 4 keys are uniformly distributed
across 4 tasks. It might happen that one task ends up with two keys and another
with none.
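
For illustration, the derived key from the third version could be computed like
this (plain Scala sketch, no Flink APIs; the constant and helper name are made
up for the example):

```scala
// Sketch of the derived partition key from the third attempt, with the
// partition count pulled out of the hardcoded 4. Note that hashCode (and
// therefore %) can be negative on the JVM; Math.floorMod keeps the derived
// keys inside {0, ..., numPartitions - 1}, which also matters if the same
// function is later reused as a custom partitioner.
val numPartitions = 16 // pick comfortably above the planned max parallelism

def partitionKey(key: String): Int =
  Math.floorMod(key.hashCode, numPartitions)

// All derived keys land in [0, numPartitions):
val keys = List("a", "b", "flink", "window").map(partitionKey)
```

The stream definition then reads
dataStream.keyBy(e => partitionKey(e.key)).countWindow(..) without a magic
number inline.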

If you want more control, you can use partitionCustom (which does not
produce a KeyedStream) and a stateful Map or FlatMap function to do the
aggregation yourself.
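
The per-partition batching that such a stateful function would do could look
roughly like this (plain Scala sketch, independent of the Flink APIs; in a real
job the buffer would live inside a RichFlatMapFunction, in checkpointed state):

```scala
// Hypothetical sketch of batching every N elements per partition.
// The buffer is kept in plain memory here so the core idea is visible
// on its own; a real Flink operator would keep it in managed state.
final class CountBatcher[T](batchSize: Int) {
  private val buffer = scala.collection.mutable.ArrayBuffer.empty[T]

  // Returns a full batch every `batchSize` elements, None otherwise.
  def offer(element: T): Option[List[T]] = {
    buffer += element
    if (buffer.size == batchSize) {
      val batch = buffer.toList
      buffer.clear()
      Some(batch)
    } else None
  }
}

// Feed in 7 elements with a batch size of 3: two full batches come out,
// and one element stays buffered until the next batch fills up.
val batcher = new CountBatcher[Int](3)
val batches = (1 to 7).flatMap(batcher.offer)
// batches == Vector(List(1, 2, 3), List(4, 5, 6))
```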

Best, Fabian

2017-01-23 11:12 GMT+01:00 Kostas Kloudas <k....@data-artisans.com>:

> Hi Dmitry,
>
> In all cases, the result of the countWindow will also be grouped by key
> because of the keyBy() that you are using.
>
> If you want to have a non-keyed stream and then split it into count windows,
> remove the keyBy() and, instead of countWindow(), use countWindowAll(). This
> will have *parallelism 1*, but then you can repartition your stream so that
> the downstream operators have higher parallelism.
>
> Hope this helps,
> Kostas
>
> On Jan 23, 2017, at 11:05 AM, Dmitry Golubets <dg...@gmail.com> wrote:
>
> Hi,
>
> I'm looking for the right way to do the following scheme:
>
> 1. Read data
> 2. Split it into partitions for parallel processing
> 3. In every partition group data in N elements batches
> 4. Process these batches
>
> My first attempt was:
> *dataStream.keyBy(_.key).countWindow(..)*
> But countWindow groups by key, whereas I want to group all the elements in
> a partition.
>
> Then I tried:
> *dataStream.keyBy(_.key).countWindowAll(..)*
> But apparently countWindowAll doesn't work on partitioned data.
>
> So, my last version is:
>
> *dataStream.keyBy(_.key.hashCode % 4).countWindow(..)*
> But it looks kinda hacky with a hardcoded partition count.
>
> So, what's the right way of doing it?
>
> Best regards,
> Dmitry
>
>
>

Re: Count window on partition

Posted by Kostas Kloudas <k....@data-artisans.com>.
Hi Dmitry,

In all cases, the result of the countWindow will also be grouped by key because of
the keyBy() that you are using.

If you want to have a non-keyed stream and then split it into count windows, remove
the keyBy() and, instead of countWindow(), use countWindowAll(). This will have
parallelism 1, but then you can repartition your stream so that the downstream
operators have higher parallelism.
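
To make the shape concrete, here is the same idea on a plain Scala collection
rather than the actual DataStream API (grouped stands in for the parallelism-1
countWindowAll pass, and the round-robin assignment stands in for rebalance()
feeding parallel downstream operators):

```scala
// Plain-collection sketch of the countWindowAll-then-repartition shape
// (not the Flink API itself). One sequential pass cuts the elements into
// fixed-size count windows; the windows are then dealt out round-robin,
// the way rebalance() would feed higher-parallelism downstream operators.
val parallelism = 3
val windows = (1 to 12).grouped(4).toList // three count windows of 4 elements

val assignment = windows.zipWithIndex.map { case (window, i) =>
  (i % parallelism) -> window // subtask index -> the window it processes
}
```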

Hope this helps,
Kostas

> On Jan 23, 2017, at 11:05 AM, Dmitry Golubets <dg...@gmail.com> wrote:
> 
> Hi,
> 
> I'm looking for the right way to do the following scheme:
> 
> 1. Read data
> 2. Split it into partitions for parallel processing
> 3. In every partition group data in N elements batches
> 4. Process these batches
> 
> My first attempt was: dataStream.keyBy(_.key).countWindow(..)
> But countWindow groups by key, whereas I want to group all the elements in a partition.
> 
> Then I tried: dataStream.keyBy(_.key).countWindowAll(..)
> But apparently countWindowAll doesn't work on partitioned data.
> 
> So, my last version is: 
> dataStream.keyBy(_.key.hashCode % 4).countWindow(..)
> But it looks kinda hacky with a hardcoded partition count.
> 
> So, what's the right way of doing it?
> 
> Best regards,
> Dmitry