Posted to hdfs-user@hadoop.apache.org by David Parks <da...@yahoo.com> on 2012/10/24 08:10:30 UTC

How do map tasks get assigned efficiently?

Even after reading O'Reilly's book on Hadoop, I don't feel like I have a clear
vision of how the map tasks get assigned.

 

They depend on splits, right?

 

But I have 3 jobs running. And splits will come from various sources: HDFS,
S3, and slow HTTP sources.

 

So I've got some concern as to how the map tasks will be distributed to
handle the data acquisition.

 

Can I do anything to ensure that I don't let the cluster go idle processing
slow HTTP downloads when the boxes could simultaneously be doing HTTP
downloads for one job and reading large files off HDFS for another job?

 

I'm imagining a scenario where the only map tasks that are running are all
blocking on splits requiring HTTP downloads and the splits coming from HDFS
are all queuing up behind it, when they'd run more efficiently in parallel
per node.

 

 


Re: How do map tasks get assigned efficiently?

Posted by Harsh J <ha...@cloudera.com>.
Hi David,

Two things help avoid this, I think:

1. Blocks are small in size, usually ranging from 128 MB to less
than a GB overall. Reading this contiguous but limited chunk of
data per process doesn't take too much time (and when it does, the
disk is not always the one to blame).

2. DNs support multiple disks (and we recommend JBOD configs) via the
dfs.datanode.data.dir config property (a comma-separated list of local
directories), and use round-robin block placement to store blocks
(when writing) across these disks. Although it is possible to have
several tasks reading from the same disk, that occurrence is rare at
runtime.

Even if you store a huge file, you still end up
reading it efficiently as the blocks are well distributed across the
cluster and across disks in each machine.

Regarding the original query on how splits really work: for HDFS, the
NN provides the MR framework with a list of hostnames to use when it
wants access to a specific block (an offset->length range within a
whole file). This helps MR schedule with a sense of data locality.

The data is shipped from the NN to the MR framework in the form of
InputSplit classes, which have an InputSplit.getLocations() API. If you
have a non-HDFS source and still need locality hints (remember - these
are mere hints, not enforcers), you can write your own InputFormat
class and return tweaked InputSplit objects carrying the desired
location hostnames, via InputFormat#getSplits, which the framework
calls on the client side. Hope this helps!
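
To make that concrete, here is a rough sketch against the newer
org.apache.hadoop.mapreduce API. The class names, URLs and hostnames
below are made up for illustration, and a real version would also need
a RecordReader that performs the actual fetch:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class UrlInputFormat extends InputFormat<Text, Text> {

  // One split per URL to fetch, carrying the hostnames we would *prefer*
  // to run on. getLocations() is a hint to the scheduler, not an enforcer.
  public static class UrlSplit extends InputSplit implements Writable {
    private String url;
    private String[] preferredHosts;

    public UrlSplit() {}  // no-arg constructor needed for deserialization

    public UrlSplit(String url, String[] preferredHosts) {
      this.url = url;
      this.preferredHosts = preferredHosts;
    }

    @Override
    public long getLength() { return 0; }  // size unknown until downloaded

    @Override
    public String[] getLocations() { return preferredHosts; }

    @Override
    public void write(DataOutput out) throws IOException {
      Text.writeString(out, url);
      out.writeInt(preferredHosts.length);
      for (String host : preferredHosts) Text.writeString(out, host);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
      url = Text.readString(in);
      preferredHosts = new String[in.readInt()];
      for (int i = 0; i < preferredHosts.length; i++) {
        preferredHosts[i] = Text.readString(in);
      }
    }
  }

  @Override
  public List<InputSplit> getSplits(JobContext context) {
    // Runs on the client at job submission; return one split per URL,
    // each with whatever host hint makes sense for your setup.
    List<InputSplit> splits = new ArrayList<InputSplit>();
    splits.add(new UrlSplit("http://example.com/feed-a", new String[] { "node01" }));
    splits.add(new UrlSplit("http://example.com/feed-b", new String[] { "node02" }));
    return splits;
  }

  @Override
  public RecordReader<Text, Text> createRecordReader(InputSplit split,
      TaskAttemptContext context) {
    // A real implementation would return a RecordReader that does the
    // HTTP download and emits records; omitted to keep the sketch short.
    throw new UnsupportedOperationException("RecordReader omitted in this sketch");
  }
}

You would then wire it into a job with
job.setInputFormatClass(UrlInputFormat.class). The scheduler treats the
returned hostnames as preferences when slots happen to be free on those
hosts, and falls back to any other node otherwise.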

On Thu, Oct 25, 2012 at 8:19 AM, David Parks <da...@yahoo.com> wrote:
> So the thing that just doesn’t click for me yet is this:
>
>
>
> On a typical computer, if I try to read two huge files off disk
> simultaneously it’ll just kill the disk performance. This seems like a risk.
>
>
>
> What’s preventing such disk contention in Hadoop?  Is HDFS smart enough to
> serialize major disk access?
>
>
>
>
>
> From: Michael Segel [mailto:michael_segel@hotmail.com]
> Sent: Wednesday, October 24, 2012 6:51 PM
> To: user@hadoop.apache.org
> Subject: Re: How do map tasks get assigned efficiently?
>
>
>
> So...
>
>
>
> Data locality only works when you actually have data on the cluster itself.
> Otherwise how can the data be local.
>
>
>
> Assuming 3X replication, and you're not doing a custom split and your input
> file is splittable...
>
>
>
> You will split along the block delineation.  So if your input file has 5
> blocks, you will have 5 mappers.
>
>
>
> Since there are 3 copies of the block, its possible that for that map task
> to run on the DN which has a copy of that block.
>
>
>
> So its pretty straight forward to a point.
>
>
>
> When your cluster starts to get a lot of jobs and a slot opens up, your job
> may not be data local.
>
>
>
> With HBase... YMMV
>
> With S3 the data isn't local so it doesn't matter which Data Node gets the
> job.
>
>
>
> HTH
>
>
>
> -Mike
>
>
>
> On Oct 24, 2012, at 1:10 AM, David Parks <da...@yahoo.com> wrote:
>
>
>
> Even after reading O’reillys book on hadoop I don’t feel like I have a clear
> vision of how the map tasks get assigned.
>
>
>
> They depend on splits right?
>
>
>
> But I have 3 jobs running. And splits will come from various sources: HDFS,
> S3, and slow HTTP sources.
>
>
>
> So I’ve got some concern as to how the map tasks will be distributed to
> handle the data acquisition.
>
>
>
> Can I do anything to ensure that I don’t let the cluster go idle processing
> slow HTTP downloads when the boxes could simultaneously be doing HTTP
> downloads for one job and reading large files off HDFS for another job?
>
>
>
> I’m imagining a scenario where the only map tasks that are running are all
> blocking on splits requiring HTTP downloads and the splits coming from HDFS
> are all queuing up behind it, when they’d run more efficiently in parallel
> per node.
>
>
>
>
>
>



-- 
Harsh J

RE: How do map tasks get assigned efficiently?

Posted by David Parks <da...@yahoo.com>.
So the thing that just doesn't click for me yet is this:

 

On a typical computer, if I try to read two huge files off disk
simultaneously, it'll just kill the disk performance. This seems like a risk.

 

What's preventing such disk contention in Hadoop?  Is HDFS smart enough to
serialize major disk access?

 

 

From: Michael Segel [mailto:michael_segel@hotmail.com] 
Sent: Wednesday, October 24, 2012 6:51 PM
To: user@hadoop.apache.org
Subject: Re: How do map tasks get assigned efficiently?

 

So... 

 

Data locality only works when you actually have data on the cluster itself.
Otherwise how can the data be local. 

 

Assuming 3X replication, and you're not doing a custom split and your input
file is splittable...

 

You will split along the block delineation.  So if your input file has 5
blocks, you will have 5 mappers.

 

Since there are 3 copies of the block, its possible that for that map task
to run on the DN which has a copy of that block. 

 

So its pretty straight forward to a point. 

 

When your cluster starts to get a lot of jobs and a slot opens up, your job
may not be data local. 

 

With HBase... YMMV 

With S3 the data isn't local so it doesn't matter which Data Node gets the
job. 

 

HTH

 

-Mike

 

On Oct 24, 2012, at 1:10 AM, David Parks <da...@yahoo.com> wrote:





Even after reading O'reillys book on hadoop I don't feel like I have a clear
vision of how the map tasks get assigned.

 

They depend on splits right?

 

But I have 3 jobs running. And splits will come from various sources: HDFS,
S3, and slow HTTP sources.

 

So I've got some concern as to how the map tasks will be distributed to
handle the data acquisition.

 

Can I do anything to ensure that I don't let the cluster go idle processing
slow HTTP downloads when the boxes could simultaneously be doing HTTP
downloads for one job and reading large files off HDFS for another job?

 

I'm imagining a scenario where the only map tasks that are running are all
blocking on splits requiring HTTP downloads and the splits coming from HDFS
are all queuing up behind it, when they'd run more efficiently in parallel
per node.

 

 

 


Re: How do map tasks get assigned efficiently?

Posted by Michael Segel <mi...@hotmail.com>.
So... 

Data locality only works when you actually have data on the cluster itself. Otherwise, how can the data be local?

Assuming 3X replication, and you're not doing a custom split and your input file is splittable...

The input will be split along block boundaries, so if your input file has 5 blocks, you will have 5 mappers.
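
As a back-of-the-envelope illustration (the numbers are made up, and
default min/max split sizes are assumed, so the split size works out to
the block size):

public class SplitMath {
  public static void main(String[] args) {
    // With the defaults, split size = max(minSplitSize, min(maxSplitSize, blockSize)),
    // which is simply the HDFS block size, so splits line up with blocks.
    long blockSize  = 128L * 1024 * 1024;   // e.g. a 128 MB block size
    long fileLength = 640L * 1024 * 1024;   // e.g. a 640 MB input file
    long splitSize  = blockSize;
    long numSplits  = (fileLength + splitSize - 1) / splitSize;
    System.out.println(numSplits + " splits -> " + numSplits + " map tasks");  // prints 5
  }
}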

Since there are 3 copies of the block, it's possible for that map task to run on a DN which has a copy of that block.

So it's pretty straightforward, up to a point.

When your cluster starts to get a lot of jobs and a slot opens up, your job may not be data local. 

With HBase... YMMV 
With S3 the data isn't local so it doesn't matter which Data Node gets the job. 

HTH

-Mike

On Oct 24, 2012, at 1:10 AM, David Parks <da...@yahoo.com> wrote:

> Even after reading O’reillys book on hadoop I don’t feel like I have a clear vision of how the map tasks get assigned.
>  
> They depend on splits right?
>  
> But I have 3 jobs running. And splits will come from various sources: HDFS, S3, and slow HTTP sources.
>  
> So I’ve got some concern as to how the map tasks will be distributed to handle the data acquisition.
>  
> Can I do anything to ensure that I don’t let the cluster go idle processing slow HTTP downloads when the boxes could simultaneously be doing HTTP downloads for one job and reading large files off HDFS for another job?
>  
> I’m imagining a scenario where the only map tasks that are running are all blocking on splits requiring HTTP downloads and the splits coming from HDFS are all queuing up behind it, when they’d run more efficiently in parallel per node.
>  
>  

