Posted to user@hadoop.apache.org by "Hider, Sandy" <Sa...@jhuapl.edu> on 2013/10/14 23:49:30 UTC

Identification of mapper slots

In Hadoop, under mapred-site.xml, I can set the maximum number of mappers. For the sake of this email I will call the number of concurrent mappers "mapper slots".

Is it possible to figure out from within the mapper which mapper slot it is running in?

On this project this is important because each mapper has to fork off a Matlab runtime compiled executable.  The executable is passed, at runtime, a cache directory to work in.  Setting up the cache in a new directory takes a long time, but the cache can be reused quickly on future calls if the executable is given the same location.  As it turns out, when multiple mappers try to use the same cache they crash the executable.  So ideally, if I could identify which mapper slot a mapper is running in, I could set up a cache for each slot, avoid the cache creation time, and still guarantee that no two mappers write to the same cache.

Thanks for taking the time to read this,

Sandy



Re: Identification of mapper slots

Posted by Rahul Jain <rj...@gmail.com>.
I assume you know the tradeoff here: if you depend upon the mapper slot # in
your implementation to speed it up, you lose code portability in the
long term....

That said, one way to achieve this is to use the JobConf API:

int partition = jobConf.getInt(JobContext.TASK_PARTITION, -1);

The framework assigns a unique partition # to each mapper; this allows them
to write to a distinct output file. Note that this is a global partition #,
not one local to each node.

Also, in case you have mappers and reducers using the same cache, then add a

jobConf.getBoolean(JobContext.TASK_ISMAP, true)

check to indicate whether you are executing in mapper or reducer context.
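To make the per-task cache idea concrete, here is a minimal sketch in plain Java (kept Hadoop-free so it runs standalone). The class name, the helper method, and the base path /tmp/matlab-cache are all made up for illustration; in a real mapper the partition number would come from jobConf.getInt(JobContext.TASK_PARTITION, -1) as above:

```java
import java.io.File;

public class CacheDirs {

    // Derive a cache directory that is unique to one map task, keyed on
    // the partition number the framework assigns to that task.
    // The base path is a placeholder; any node-local scratch dir would do.
    static File cacheDirFor(int partition) {
        if (partition < 0) {
            // -1 is the default we pass to getInt(), i.e. "not set"
            throw new IllegalArgumentException("task partition not set");
        }
        return new File("/tmp/matlab-cache", "partition-" + partition);
    }

    public static void main(String[] args) {
        // In a real mapper's setup() this value would be read from the
        // job configuration rather than hard-coded.
        int partition = 7;
        File dir = cacheDirFor(partition);
        dir.mkdirs();  // create if missing; the Matlab executable reuses it
        System.out.println(dir.getPath());
    }
}
```

One caveat, following Rahul's point above: since the partition # is global across the whole job rather than a per-node slot index, each task gets its own directory, so a warm cache is not actually shared between successive tasks on the same node; this only guarantees that no two concurrent mappers collide on one cache.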


-Rahul







On Mon, Oct 14, 2013 at 2:49 PM, Hider, Sandy <Sa...@jhuapl.edu> wrote:

>
> In Hadoop, under mapred-site.xml, I can set the maximum number of
> mappers. For the sake of this email I will call the number of concurrent
> mappers "mapper slots".
>
> Is it possible to figure out from within the mapper which mapper slot it
> is running in?
>
> On this project this is important because each mapper has to fork off a
> Matlab runtime compiled executable.  The executable is passed, at runtime,
> a cache directory to work in.  Setting up the cache in a new directory
> takes a long time, but the cache can be reused quickly on future calls if
> the executable is given the same location.  As it turns out, when multiple
> mappers try to use the same cache they crash the executable.  So ideally,
> if I could identify which mapper slot a mapper is running in, I could set
> up a cache for each slot, avoid the cache creation time, and still
> guarantee that no two mappers write to the same cache.
>
> Thanks for taking the time to read this,
>
> Sandy
>
