Posted to user@flink.apache.org by madan <ma...@gmail.com> on 2018/11/05 10:37:18 UTC

Split one dataset into multiple

Hi,

I have a custom iterator which gives data for multiple entities. For
example, the iterator gives data for Department, Employee and Address. A
record's entity type is identified by a field value, and I need to apply a
different set of operations to each dataset. For example, Department data
may need aggregations, while Employee and Address data are simply joined
together after some filtering.

If I had a separate dataset for each entity type, the job would be easy. So
I am trying to split the incoming data into different datasets. What is the
best possible way to achieve this?

*The iterator can be read only once.
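
A rough illustration of the kind of tagged records described above; the tag
values, field layout, and class name below are made up for the example:

import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustration only: mixed records where the first field tags the entity type.
public class MixedIteratorExample {

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"DEPT", "10", "Sales"},
                new String[] {"EMP", "1", "Alice", "10"},
                new String[] {"ADDR", "1", "12 Main Street"});

        // Like the custom iterator in question: it can be consumed only once.
        Iterator<String[]> it = rows.iterator();
        while (it.hasNext()) {
            String[] row = it.next();
            System.out.println(row[0] + " record: " + Arrays.toString(row));
        }
    }
}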


-- 
Thank you,
Madan.

Re: Split one dataset into multiple

Posted by Fabian Hueske <fh...@gmail.com>.
Hi,

Multiple flatmaps on a single DataSet work perfectly fine. All functions
will see the full DataSet.

DataSet input = ...
DataSet out1 = input.flatMap(func1);
DataSet out2 = input.flatMap(func2);
...

Best, Fabian
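
A fuller sketch of this pattern, applied to the Department/Employee/Address
split from the original question, might look like the following. The
Tuple2-based records, the KeepTag helper, and the sample values are
assumptions made up for illustration:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class SplitWithFlatMaps {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Mixed input: f0 is the entity-type tag, f1 is the raw record payload.
        DataSet<Tuple2<String, String>> input = env.fromElements(
                Tuple2.of("DEPT", "10,Sales"),
                Tuple2.of("EMP", "1,Alice,10"),
                Tuple2.of("ADDR", "1,Main Street"));

        // Each flatMap sees the full DataSet and keeps only one entity type,
        // so every entity ends up in its own DataSet.
        DataSet<String> departments = input.flatMap(new KeepTag("DEPT"));
        DataSet<String> employees = input.flatMap(new KeepTag("EMP"));
        DataSet<String> addresses = input.flatMap(new KeepTag("ADDR"));

        // Each print() triggers its own execution of the plan.
        departments.print();
        employees.print();
        addresses.print();
    }

    // Forwards the payload of records whose tag matches and drops the rest.
    public static class KeepTag implements FlatMapFunction<Tuple2<String, String>, String> {

        private final String tag;

        public KeepTag(String tag) {
            this.tag = tag;
        }

        @Override
        public void flatMap(Tuple2<String, String> value, Collector<String> out) {
            if (tag.equals(value.f0)) {
                out.collect(value.f1);
            }
        }
    }
}

In a real job the departments/employees/addresses DataSets would then get
their own aggregations, joins, and filters instead of print().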


Re: Split one dataset into multiple

Posted by madan <ma...@gmail.com>.
Hi Fabian,

Is it multiple FlatmapFunctions on the same DataSet, or chaining of
FlatmapFunctions? My use case is something like the diagram below
(intermediate transformations might change):

[image: IMG_1613.jpg]


If we can have multiple FlatmapFunctions on the same DataSet, the above case
works fine, but to my understanding that is not possible (correct me if I am
wrong). And chaining multiple FlatmapFunctions might not work for my case.
Please guide me if I am missing something, and if possible please provide
some pseudo code.


Thank you,
Madan




-- 
Thank you,
Madan.

Re: Split one dataset into multiple

Posted by Fabian Hueske <fh...@gmail.com>.
You have to define a common type, like an n-ary Either type, and return that
from your source / operator.
The resulting DataSet can be consumed by multiple FlatmapFunctions, each
extracting and forwarding one of the result types.

Cheers, Fabian
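
Flink ships a binary org.apache.flink.types.Either; for the three entity
types in this thread, a small hand-rolled union type along the following
lines could play the role of the n-ary Either. All class and field names
below are assumptions made up for illustration:

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.util.Collector;

import java.io.Serializable;

public class NAryEitherSketch {

    // Stub entity POJOs.
    public static class Department implements Serializable { public int id; public String name; }
    public static class Employee implements Serializable { public int id; public String name; public int deptId; }
    public static class Address implements Serializable { public int empId; public String street; }

    // Three-way union: exactly one field is expected to be non-null per record.
    public static class EntityUnion implements Serializable {
        public Department department;
        public Employee employee;
        public Address address;
    }

    // Extracts one entity type from the union; the other extractors look the same.
    public static class ExtractDepartments implements FlatMapFunction<EntityUnion, Department> {
        @Override
        public void flatMap(EntityUnion value, Collector<Department> out) {
            if (value.department != null) {
                out.collect(value.department);
            }
        }
    }

    public static class ExtractEmployees implements FlatMapFunction<EntityUnion, Employee> {
        @Override
        public void flatMap(EntityUnion value, Collector<Employee> out) {
            if (value.employee != null) {
                out.collect(value.employee);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // In a real job these records would come from the one-shot iterator source.
        EntityUnion u1 = new EntityUnion();
        u1.department = new Department();
        u1.department.id = 10;
        u1.department.name = "Sales";

        EntityUnion u2 = new EntityUnion();
        u2.employee = new Employee();
        u2.employee.id = 1;
        u2.employee.name = "Alice";
        u2.employee.deptId = 10;

        DataSet<EntityUnion> all = env.fromElements(u1, u2);

        // Each entity type becomes its own DataSet with its own transformations.
        DataSet<Department> departments = all.flatMap(new ExtractDepartments());
        DataSet<Employee> employees = all.flatMap(new ExtractEmployees());

        departments.print();
        employees.print();
    }
}

Each extracted DataSet can then get the aggregations or joins specific to
that entity, as in the flatMap sketch earlier on this page.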


Re: Split one dataset into multiple

Posted by madan <ma...@gmail.com>.
Hi Vino,

Thank you for the suggestions. In my case I am using DataSet since the data
is limited, and split/select is not available in the DataSet API.
I suspect even hash partitioning will not work for me. By doing a hash
partition, I do not know which partition holds which entity data (Dept, Emp
in my example; and sometimes the hash might be the same for 2 different
entities). And on that partition I need to apply some other transformations
(based on the partition data), which is not possible using
MapPartitionFunction.

Please suggest if my understanding is wrong and the use case is achievable
(a little example would be of great help).

Thank you,
Madan


-- 
Thank you,
Madan.

Re: Split one dataset into multiple

Posted by vino yang <ya...@gmail.com>.
Hi madan,

I think you need to hash partition your records.
Flink supports hash partitioning of data.
The operator is keyBy.
If the value of your tag field is enumerable, you can also use split/select
to achieve your purpose.

Thanks, vino.
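
For reference, split/select belongs to the DataStream API (and, as noted
elsewhere in the thread, is not available on DataSet). A sketch of that
approach, with made-up tag values and record layout, might look like this:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.collector.selector.OutputSelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SplitStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.Collections;

public class SplitSelectSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Mixed input: f0 is the entity-type tag, f1 is the raw record payload.
        DataStream<Tuple2<String, String>> input = env.fromElements(
                Tuple2.of("DEPT", "10,Sales"),
                Tuple2.of("EMP", "1,Alice,10"),
                Tuple2.of("ADDR", "1,Main Street"));

        // Route each record to an output named after its tag field.
        SplitStream<Tuple2<String, String>> split = input.split(new OutputSelector<Tuple2<String, String>>() {
            @Override
            public Iterable<String> select(Tuple2<String, String> value) {
                return Collections.singletonList(value.f0);
            }
        });

        // Pick the named outputs back out as separate streams.
        DataStream<Tuple2<String, String>> departments = split.select("DEPT");
        DataStream<Tuple2<String, String>> employees = split.select("EMP");

        departments.print();
        employees.print();

        env.execute("split-select sketch");
    }
}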
