Posted to dev@hama.apache.org by Thomas Jungblut <th...@googlemail.com> on 2011/11/14 07:38:30 UTC

#Task setting and IO

Hey,

I'm unclear on several points about how the number of tasks is set, and I
don't think it currently works correctly.

Let's make some scenarios:

1. User defines no input, only a number of tasks: "vanilla" Hama behaviour ->
check whether the number of tasks fits in the cluster, then run.

2. User defines input, no number of tasks and no partitioner -> this should
set the #bsptasks to what the split calculated. *What if this exceeds the
cluster capacity?*

3. User defines input, a number of tasks and a partitioner -> this should
partition the dataset via the partitioner into >number of tasks< files and
let the file input split assign the files to the tasks.

4. User already provides partitioned input (e.g. the output of a M/R
job) and nothing else -> what do you think this should do?
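
The four scenarios above could be sketched as a single decision method. This
is just an illustration of the cases as I listed them, not real Hama code;
the method name and parameters are made up:

```java
// Hypothetical sketch of the task-count decision described in the
// scenarios above. Not the actual Hama API.
public class TaskCountSketch {

  /**
   * Decide the number of BSP tasks for a job.
   *
   * @param requestedTasks  tasks requested by the user, or 0 if unset
   * @param splitCount      splits computed from the input, or 0 if no input
   * @param clusterCapacity maximum tasks the cluster can run
   */
  public static int decideTaskCount(int requestedTasks, int splitCount,
                                    int clusterCapacity) {
    if (splitCount == 0) {
      // Scenario 1: no input -> user-defined count must fit the cluster.
      if (requestedTasks > clusterCapacity) {
        throw new IllegalArgumentException(
            "requested tasks exceed cluster capacity");
      }
      return requestedTasks;
    }
    if (requestedTasks == 0) {
      // Scenario 2: input but no explicit task count -> follow the splits.
      // Open question: this may still exceed clusterCapacity.
      return splitCount;
    }
    // Scenario 3: input + task count + partitioner -> repartition into
    // exactly requestedTasks files, one per task.
    return requestedTasks;
  }
}
```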

Part 4 is the most important one, I guess, because a MapReduce job
partitions the data faster than our partitioner, especially for large inputs.
And I don't actually know whether all these steps are the way we want it.
What do you think?

-- 
Thomas Jungblut
Berlin <th...@gmail.com>

Re: #Task setting and IO

Posted by "Edward J. Yoon" <ed...@apache.org>.
> set the #bsptasks to what the split calculated. *What if this exceeds the
> cluster capacity?*
I think there are two options.

1) Fix the computeSplitSize() method to return a maximum split length, so
that the number of splits stays within the cluster capacity.

2) Or assign the split array (one or more splits) to each task.
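
Option 1 could look roughly like this. The method name mirrors
computeSplitSize() from the mail, but the signature and the default
one-split-per-block behaviour are assumptions for illustration:

```java
// Sketch of option 1: pick a split size large enough that the number of
// splits never exceeds the cluster capacity. Not the real Hama code.
public class SplitSizeSketch {

  /**
   * @param totalLength     total input length in bytes
   * @param blockSize       default split size (e.g. the HDFS block size)
   * @param clusterCapacity maximum number of tasks the cluster can run
   */
  public static long computeSplitSize(long totalLength, long blockSize,
                                      int clusterCapacity) {
    // Smallest split size that yields at most clusterCapacity splits
    // (ceiling division).
    long minForCapacity =
        (totalLength + clusterCapacity - 1) / clusterCapacity;
    // Never go below the default split size.
    return Math.max(blockSize, minForCapacity);
  }
}
```

With a 1000-byte input, a 100-byte block size and capacity 5, this returns
200, i.e. 5 splits instead of 10.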

On Mon, Nov 14, 2011 at 3:38 PM, Thomas Jungblut
<th...@googlemail.com> wrote:
> [original message quoted above]



-- 
Best Regards, Edward J. Yoon
@eddieyoon