Posted to user@flink.apache.org by Saliya Ekanayake <es...@gmail.com> on 2016/02/25 17:38:30 UTC

Mapping two datasets

Hi,

I've two data sets like,

DataSet<T> a = ...
DataSet<T> b = ...

They have the same type and the same decomposition. I want to apply a map
operator that needs both *a* and *b*. For example,

a.map( i -> OP)

Within this OP I need the corresponding (*i*-th) element of *b* as well. Is
there a way to do this?

Thank you,
Saliya

-- 
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
Cell 812-391-4914
http://saliya.org

Re: Mapping two datasets

Posted by Saliya Ekanayake <es...@gmail.com>.
Thank you. Any thoughts on the ParallelIteratorInputFormat in Flink?




Re: Mapping two datasets

Posted by Márton Balassi <ba...@gmail.com>.
Hey Saliya,

I recommend using DataSetUtils.zipWithIndex for this task. [1] It comes
with flink-java.

[1]
https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/utils/DataSetUtils.java#L77
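
As a rough illustration of the suggestion above, the sketch below attaches an index to each element with DataSetUtils.zipWithIndex. The class name, the helper method index, and the sample values are assumptions made for the example. Note that the index an element receives depends on how the elements are spread over partitions, so two separately zipped data sets only line up if they are partitioned and ordered identically; that is worth verifying before relying on it.

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.DataSetUtils;

public class ZipWithIndexSketch {

    /** Pairs every element with a unique index in the range 0..count-1. */
    public static <T> DataSet<Tuple2<Long, T>> index(DataSet<T> input) {
        return DataSetUtils.zipWithIndex(input);
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Illustrative sample data, not from the thread.
        DataSet<String> a = env.fromElements("a0", "a1", "a2");

        // Each record becomes (index, element); print() triggers execution.
        index(a).print();
    }
}
```

Once both data sets are indexed this way, they can be joined on field 0, which holds the index.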


Re: Mapping two datasets

Posted by Saliya Ekanayake <es...@gmail.com>.
Thank you, Marton. That seems doable.

However, is there a way I can create a dummy indexed data set, that is, a way
to partition the index range across parallel tasks without any data? For
example, if I could have something like,

DataSet<IndexedSet> ds = ...

then I can implement a custom method to load required data for a split
within a map operation, which will be less expensive than a join for my
case.
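
One possible way to get such an index-only data set, offered as a sketch rather than something confirmed in this thread, is ExecutionEnvironment.generateSequence, which creates a DataSet<Long> whose range is split across the parallel tasks. The class name and the loadElement helper are hypothetical; loadElement stands in for whatever custom loading logic is needed.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class IndexRangeSketch {

    /** Hypothetical loader: stands in for reading element i from external storage. */
    static double loadElement(long i) {
        return i * 0.5;
    }

    /** Builds (index, loaded value) pairs for the indices 0..n-1. */
    public static DataSet<Tuple2<Long, Double>> loadByIndex(ExecutionEnvironment env, long n) {
        // generateSequence is inclusive on both ends and is produced in parallel,
        // so each parallel task receives only a part of the index range.
        DataSet<Long> indices = env.generateSequence(0, n - 1);

        return indices.map(new MapFunction<Long, Tuple2<Long, Double>>() {
            @Override
            public Tuple2<Long, Double> map(Long i) {
                // Load the data for this index inside the map, without any join.
                return new Tuple2<Long, Double>(i, loadElement(i));
            }
        });
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        loadByIndex(env, 8).print();
    }
}
```

If loading is expensive per element, mapPartition could be used instead of map so that each task loads its whole share of the range in one call.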

Thank you,
Saliya




Re: Mapping two datasets

Posted by Márton Balassi <ba...@gmail.com>.
Hey Saliya,

I would add a unique ID to both DataSets (the variable you referred to
as 'i'). Then you can join the two DataSets on the field containing 'i' and
do the mapping on the joined result.
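
A minimal sketch of this approach, assuming each element already carries its ID as field 0 of a Tuple2<Long, Double>. The class name, the sample values, and the addition that stands in for OP are illustrative assumptions.

```java
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class JoinOnIndexSketch {

    /** Joins a and b on the ID in field 0 and adds the two values (a stand-in for OP). */
    public static DataSet<Tuple2<Long, Double>> combine(
            DataSet<Tuple2<Long, Double>> a, DataSet<Tuple2<Long, Double>> b) {
        return a.join(b)
                .where(0)    // ID field of a
                .equalTo(0)  // ID field of b
                .with(new JoinFunction<Tuple2<Long, Double>, Tuple2<Long, Double>, Tuple2<Long, Double>>() {
                    @Override
                    public Tuple2<Long, Double> join(Tuple2<Long, Double> left, Tuple2<Long, Double> right) {
                        // Both elements with the same ID are available here.
                        return new Tuple2<Long, Double>(left.f0, left.f1 + right.f1);
                    }
                });
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<Long, Double>> a = env.fromElements(
                new Tuple2<Long, Double>(0L, 1.0),
                new Tuple2<Long, Double>(1L, 2.0),
                new Tuple2<Long, Double>(2L, 3.0));
        DataSet<Tuple2<Long, Double>> b = env.fromElements(
                new Tuple2<Long, Double>(0L, 10.0),
                new Tuple2<Long, Double>(1L, 20.0),
                new Tuple2<Long, Double>(2L, 30.0));

        // The order of the printed records is not guaranteed.
        combine(a, b).print();
    }
}
```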

Hope this helps,

Marton
