Posted to user@flink.apache.org by Dmitry Golubets <dg...@gmail.com> on 2017/01/23 10:05:34 UTC
Count window on partition
Hi,
I'm looking for the right way to implement the following scheme:
1. Read data
2. Split it into partitions for parallel processing
3. In every partition group data in N elements batches
4. Process these batches
My first attempt was:
dataStream.keyBy(_.key).countWindow(..)
But countWindow groups by key; I want to group all elements in a
partition instead.
Then I tried:
dataStream.keyBy(_.key).countWindowAll(..)
But apparently countWindowAll doesn't work on partitioned data.
So, my last version is:
dataStream.keyBy(_.key.hashCode % 4).countWindow(..)
But it looks kind of hacky with a hardcoded number of partitions.
So, what's the right way of doing it?
Best regards,
Dmitry
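A side note on the third attempt: hashCode can be negative on the JVM, so _.key.hashCode % 4 may produce values in -3..3, i.e. up to 7 distinct key groups rather than the intended 4. A minimal Scala sketch of a bucket function that stays non-negative:

```scala
// Plain `% 4` may yield -3..3 because hashCode can be negative;
// Math.floorMod always returns a value in 0..3 for modulus 4.
val bucket: String => Int = key => Math.floorMod(key.hashCode, 4)
```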
Re: Count window on partition
Posted by Fabian Hueske <fh...@gmail.com>.
Hi Dmitry,
the third version is the way to go, IMO.
You might want to use a larger number of partitions if you plan to
increase the job's parallelism later.
Also note that it is not guaranteed that 4 keys are uniformly distributed
to 4 tasks: one task may end up with two keys and another with none.
If you want more control, you can use partitionCustom (which does not
produce a KeyedStream) and a stateful Map or FlatMap function to do the
aggregation yourself.
Best, Fabian
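Fabian's partitionCustom suggestion could be sketched roughly as follows, assuming the Flink 1.x Scala DataStream API, a hypothetical Event(key: String) element type for dataStream, and a batch size of 100. The buffer here lives in plain memory per subtask; a real job would keep it in Flink operator state for fault tolerance.

```scala
import org.apache.flink.api.common.functions.{Partitioner, RichFlatMapFunction}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector
import scala.collection.mutable.ArrayBuffer

case class Event(key: String)

// Buffers batchSize elements per parallel subtask, then emits them as one batch.
class Batcher(batchSize: Int) extends RichFlatMapFunction[Event, Seq[Event]] {
  private val buffer = ArrayBuffer.empty[Event] // per-subtask; not checkpointed here
  override def flatMap(in: Event, out: Collector[Seq[Event]]): Unit = {
    buffer += in
    if (buffer.size >= batchSize) {
      out.collect(buffer.toSeq)
      buffer.clear()
    }
  }
}

val batches: DataStream[Seq[Event]] = dataStream
  .partitionCustom(new Partitioner[String] {
    // numPartitions follows the downstream operator's parallelism,
    // so nothing is hardcoded; floorMod keeps the bucket non-negative.
    override def partition(key: String, numPartitions: Int): Int =
      Math.floorMod(key.hashCode, numPartitions)
  }, _.key)
  .flatMap(new Batcher(100))
```

Because partitionCustom decides the target channel directly, the batching is per parallel instance rather than per key, which matches the original requirement.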
2017-01-23 11:12 GMT+01:00 Kostas Kloudas <k....@data-artisans.com>:
Re: Count window on partition
Posted by Kostas Kloudas <k....@data-artisans.com>.
Hi Dmitry,
In all cases, the result of the countWindow will also be grouped by key because of
the keyBy() that you are using.
If you want a non-keyed stream split into count windows, remove the keyBy()
and use countWindowAll() instead of countWindow(). This will run with
parallelism 1, but you can then repartition the stream so that downstream
operators run with higher parallelism.
Hope this helps,
Kostas
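Kostas' countWindowAll route might look like this: a sketch against the Flink 1.x Scala API, again assuming a hypothetical Event element type for dataStream and a hypothetical processBatch function for step 4.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow
import org.apache.flink.util.Collector

val results = dataStream
  .countWindowAll(100) // a single parallelism-1 task forms the 100-element batches
  .apply { (window: GlobalWindow, elems: Iterable[Event], out: Collector[Seq[Event]]) =>
    out.collect(elems.toSeq)
  }
  .rebalance           // round-robin the batches to all downstream subtasks
  .map(processBatch _) // batch processing now runs with full parallelism
```

The trade-off versus the partitionCustom approach is that every element funnels through one task to form batches, which can become a bottleneck at high rates, while batch processing itself stays parallel.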