You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by Tarandeep Singh <ta...@gmail.com> on 2016/03/30 08:14:03 UTC

DataSetUtils zipWithIndex question

Hi,

I am looking at implementation of zipWithIndex in DataSetUtils-
https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/utils/DataSetUtils.java

It works in two phases/steps
1) Count number of elements in each partition (using mapPartition)
2) In second mapPartition, unique ID is assigned by calculating offset
using number of elements computed in step 1.

Is there any chance the second mapPartition won't get same number of
elements as first mapPartition (assuming data is in HDFS)?

Thanks
Tarandeep

Re: DataSetUtils zipWithIndex question

Posted by Flavio Pompermaier <po...@okkam.it>.

Ok, thanks for the clarification Till!

On Thu, Mar 31, 2016 at 2:14 PM, Till Rohrmann <tr...@apache.org> wrote:

> A partition is the portion of data each task receives. Thus, the degree of
> parallelism of your program/task decides how many different partitions you
> have. Depending on the upstream operators (and which data is send to which
> task), the partitions will most likely differ in size.
>
> Cheers,
> Till
>
> On Thu, Mar 31, 2016 at 2:11 PM, Flavio Pompermaier <po...@okkam.it>
> wrote:
>
>> Hi Till and Tarandeep,
>> I'm also interested in better understanding my knowledge about the
>> concept of a partition..
>> From what I know a partition is the portion of data assigned by the job
>> manager to each task manager..right?
>> Then, each partition is divided again at the task manager to maximize the
>> slot usage..is it correct?
>> In every case, there will be a case where at least one partition is
>> smaller than the others...am I wrong? Am I confusing some term..?
>>
>> Best,
>> Flavio
>>
>>
>> On Thu, Mar 31, 2016 at 1:56 PM, Till Rohrmann <tr...@apache.org>
>> wrote:
>>
>>> Hi Tarandeep,
>>>
>>> the number of elements in each partition should stay constant. In fact
>>> the elements in each partition should not change.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Wed, Mar 30, 2016 at 8:14 AM, Tarandeep Singh <ta...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am looking at implementation of zipWithIndex in DataSetUtils-
>>>>
>>>> https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/utils/DataSetUtils.java
>>>>
>>>> It works in two phases/steps
>>>> 1) Count number of elements in each partition (using mapPartition)
>>>> 2) In second mapPartition, unique ID is assigned by calculating offset
>>>> using number of elements computed in step 1.
>>>>
>>>> Is there any chance the second mapPartition won't get same number of
>>>> elements as first mapPartition (assuming data is in HDFS)?
>>>>
>>>> Thanks
>>>> Tarandeep
>>>>
>>>
>>>
>>
>

Re: DataSetUtils zipWithIndex question

Posted by Till Rohrmann <tr...@apache.org>.

A partition is the portion of data each task receives. Thus, the degree of
parallelism of your program/task decides how many different partitions you
have. Depending on the upstream operators (and which data is send to which
task), the partitions will most likely differ in size.

Cheers,
Till

On Thu, Mar 31, 2016 at 2:11 PM, Flavio Pompermaier <po...@okkam.it>
wrote:

> Hi Till and Tarandeep,
> I'm also interested in better understanding my knowledge about the concept
> of a partition..
> From what I know a partition is the portion of data assigned by the job
> manager to each task manager..right?
> Then, each partition is divided again at the task manager to maximize the
> slot usage..is it correct?
> In every case, there will be a case where at least one partition is
> smaller than the others...am I wrong? Am I confusing some term..?
>
> Best,
> Flavio
>
>
> On Thu, Mar 31, 2016 at 1:56 PM, Till Rohrmann <tr...@apache.org>
> wrote:
>
>> Hi Tarandeep,
>>
>> the number of elements in each partition should stay constant. In fact
>> the elements in each partition should not change.
>>
>> Cheers,
>> Till
>>
>> On Wed, Mar 30, 2016 at 8:14 AM, Tarandeep Singh <ta...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I am looking at implementation of zipWithIndex in DataSetUtils-
>>>
>>> https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/utils/DataSetUtils.java
>>>
>>> It works in two phases/steps
>>> 1) Count number of elements in each partition (using mapPartition)
>>> 2) In second mapPartition, unique ID is assigned by calculating offset
>>> using number of elements computed in step 1.
>>>
>>> Is there any chance the second mapPartition won't get same number of
>>> elements as first mapPartition (assuming data is in HDFS)?
>>>
>>> Thanks
>>> Tarandeep
>>>
>>
>>
>

Re: DataSetUtils zipWithIndex question

Posted by Flavio Pompermaier <po...@okkam.it>.

Hi Till and Tarandeep,
I'm also interested in better understanding my knowledge about the concept
of a partition..
>From what I know a partition is the portion of data assigned by the job
manager to each task manager..right?
Then, each partition is divided again at the task manager to maximize the
slot usage..is it correct?
In every case, there will be a case where at least one partition is smaller
than the others...am I wrong? Am I confusing some term..?

Best,
Flavio


On Thu, Mar 31, 2016 at 1:56 PM, Till Rohrmann <tr...@apache.org> wrote:

> Hi Tarandeep,
>
> the number of elements in each partition should stay constant. In fact the
> elements in each partition should not change.
>
> Cheers,
> Till
>
> On Wed, Mar 30, 2016 at 8:14 AM, Tarandeep Singh <ta...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am looking at implementation of zipWithIndex in DataSetUtils-
>>
>> https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/utils/DataSetUtils.java
>>
>> It works in two phases/steps
>> 1) Count number of elements in each partition (using mapPartition)
>> 2) In second mapPartition, unique ID is assigned by calculating offset
>> using number of elements computed in step 1.
>>
>> Is there any chance the second mapPartition won't get same number of
>> elements as first mapPartition (assuming data is in HDFS)?
>>
>> Thanks
>> Tarandeep
>>
>
>

Re: DataSetUtils zipWithIndex question

Posted by Till Rohrmann <tr...@apache.org>.

Hi Tarandeep,

the number of elements in each partition should stay constant. In fact the
elements in each partition should not change.

Cheers,
Till

On Wed, Mar 30, 2016 at 8:14 AM, Tarandeep Singh <ta...@gmail.com>
wrote:

> Hi,
>
> I am looking at implementation of zipWithIndex in DataSetUtils-
>
> https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/utils/DataSetUtils.java
>
> It works in two phases/steps
> 1) Count number of elements in each partition (using mapPartition)
> 2) In second mapPartition, unique ID is assigned by calculating offset
> using number of elements computed in step 1.
>
> Is there any chance the second mapPartition won't get same number of
> elements as first mapPartition (assuming data is in HDFS)?
>
> Thanks
> Tarandeep
>