Posted to common-user@hadoop.apache.org by "N.N. Gesli" <nn...@gmail.com> on 2011/10/28 09:08:28 UTC

Map-Reduce in memory

Hello,

We have a 12-node Hadoop cluster running Hadoop 0.20.2-cdh3u0. Each
node has 8 cores and 144GB RAM (don't ask). So, I want to take advantage of
this huge RAM and run the map-reduce jobs mostly in memory, with no spill if
possible. We use Hive for most of the processes. I have set:
mapred.tasktracker.map.tasks.maximum = 16
mapred.tasktracker.reduce.tasks.maximum = 8
mapred.child.java.opts = 6144
mapred.reduce.parallel.copies = 20
mapred.job.shuffle.merge.percent = 1.0
mapred.job.reduce.input.buffer.percent = 0.25
mapred.inmem.merge.threshold = 0
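
For reference, these keys would normally live in mapred-site.xml on each node
(a sketch of the fragment, not our exact file). Note that the two
*.tasks.maximum keys are per-TaskTracker settings that need a TaskTracker
restart, and that mapred.child.java.opts usually carries JVM flags such as
-Xmx6144m rather than a bare number:

  <!-- mapred-site.xml fragment (sketch) -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>16</value>  <!-- per-TaskTracker map slots -->
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx6144m</value>  <!-- the JVM-flag form this key expects -->
  </property>
  <property>
    <name>mapred.job.shuffle.merge.percent</name>
    <value>1.0</value>  <!-- per-job; can also be set from the Hive session -->
  </property>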

One of my Hive queries produces six stages of map-reduce jobs. In the third
stage, when it queries a 200GB table, the last 14 reducers hang. I changed
mapred.task.timeout to 0 to see if they really hang. It has been 5 hours, so
something is terribly wrong in my setup. Parts of the log are below.
My questions:
* What should my configuration be to make reducers run in memory?
* Why does it keep waiting for map outputs?
* What does "dup hosts" mean?

Thank you!
N.Gesli



2011-10-27 16:35:24,304 WARN org.apache.hadoop.util.NativeCodeLoader: Unable
to load native-hadoop library for your platform... using builtin-java
classes where applicable

2011-10-27 16:35:24,611 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Initializing JVM Metrics with processName=SHUFFLE, sessionId=
2011-10-27 16:35:24,722 INFO org.apache.hadoop.mapred.ReduceTask:
ShuffleRamManager: MemoryLimit=1503238528,
MaxSingleShuffleLimit=375809632
2011-10-27 16:35:24,733 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Thread started: Thread for
merging on-disk files
2011-10-27 16:35:24,733 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Thread waiting: Thread for
merging on-disk files
2011-10-27 16:35:24,734 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Thread started: Thread for
merging in memory files
2011-10-27 16:35:24,735 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Thread started: Thread for
polling Map Completion Events
2011-10-27 16:35:24,735 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Need another 1308 map output(s)
where 0 is already in progress
2011-10-27 16:35:24,736 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Scheduled 0 outputs (0 slow hosts
and0 dup hosts)
2011-10-27 16:35:29,738 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Scheduled 12 outputs (0 slow
hosts and0 dup hosts)
2011-10-27 16:35:30,364 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Scheduled 5 outputs (0 slow hosts
and753 dup hosts)
2011-10-27 16:35:30,367 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Scheduled 1 outputs (0 slow hosts
and1182 dup hosts)
2011-10-27 16:35:30,368 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Scheduled 1 outputs (0 slow hosts
and1184 dup hosts)
2011-10-27 16:35:30,371 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Scheduled 2 outputs (0 slow hosts
and1073 dup hosts)

...

2011-10-27 16:36:04,284 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Scheduled 1 outputs (0 slow hosts
and958 dup hosts)

2011-10-27 16:36:04,310 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Scheduled 1 outputs (0 slow hosts
and958 dup hosts)
2011-10-27 16:36:07,721 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Scheduled 1 outputs (0 slow hosts
and950 dup hosts)
2011-10-27 16:36:16,455 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Scheduled 1 outputs (0 slow hosts
and948 dup hosts)
2011-10-27 16:36:16,464 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Scheduled 1 outputs (0 slow hosts
and948 dup hosts)
2011-10-27 16:36:24,736 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Need another 1061 map output(s)
where 12 is already in progress
2011-10-27 16:36:24,736 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Scheduled 0 outputs (0 slow hosts
and1049 dup hosts)
2011-10-27 16:37:24,737 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Need another 1061 map output(s)
where 12 is already in progress
2011-10-27 16:37:24,737 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Scheduled 0 outputs (0 slow hosts
and1049 dup hosts)
2011-10-27 16:38:24,738 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Need another 1061 map output(s)
where 12 is already in progress
2011-10-27 16:38:24,738 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Scheduled 0 outputs (0 slow hosts
and1049 dup hosts)
2011-10-27 16:39:24,739 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Need another 1061 map output(s)
where 12 is already in progress
2011-10-27 16:39:24,739 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Scheduled 0 outputs (0 slow hosts
and1049 dup hosts)
2011-10-27 16:40:24,740 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Need another 1061 map output(s)
where 12 is already in progress
2011-10-27 16:40:24,740 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Scheduled 0 outputs (0 slow hosts
and1049 dup hosts)

...

2011-10-27 21:55:25,070 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Need another 1061 map output(s)
where 12 is already in progress
2011-10-27 21:55:25,070 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Scheduled 0 outputs (0 slow hosts
and1049 dup hosts)
2011-10-27 21:56:25,071 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Need another 1061 map output(s)
where 12 is already in progress
2011-10-27 21:56:25,071 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Scheduled 0 outputs (0 slow hosts
and1049 dup hosts)
2011-10-27 21:57:25,072 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201110201507_0061_r_000007_1 Need another 1061 map output(s)
where 12 is already in progress


------------------------------

Re: Map-Reduce in memory

Posted by Michel Segel <mi...@hotmail.com>.
Hi, 
First, you have 8 physical cores. Hyper-threading makes the machine think it has 16. The trouble is that you really don't have 16 cores, so you need to be a little more conservative.

You don't mention HBase, so I'm going to assume you don't have it installed.
So in terms of tasks, allocate a core each to the DN and TT, leaving 6 cores, or 12 hyper-threaded cores. This leaves a little headroom for the other Linux processes...

Now you can split the remaining cores however you want.
You can even overlap a bit, since you are not going to be running all of your reducers at the same time.
So let's say 10 mappers and 4 reducers to start.
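
As a rough tally of that budget (the 10 map / 4 reduce split is just a
starting point, not a hard rule):

  8 physical cores (16 with hyper threading)
  - 1 core for the DataNode
  - 1 core for the TaskTracker
  = 6 cores (~12 hyper threaded) left for task slots,
    e.g. 10 map slots + 4 reduce slots, overlapping a little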

Since you have all that memory, you can bump up your DN and TT allocations.

With respect to your tuning... you need to change the settings one at a time...

Sent from a remote device. Please excuse any typos...

Mike Segel

On Nov 4, 2011, at 1:46 AM, "N.N. Gesli" <nn...@gmail.com> wrote:

> <snip>
> 
> <ngesli_reduce_log.txt>

Re: Map-Reduce in memory

Posted by "N.N. Gesli" <nn...@gmail.com>.
Thank you very much for your replies.

Michel, disk is 3TB (6x550GB; 50GB from each disk is reserved for local use,
basically for mapred.local.dir). You are right about the CPU; it is 8-core but
shows as 16. Does that mean it can handle 16 JVMs at a time? The CPU is a
little overloaded, but that is not a huge problem at this point.

I made io.sort.factor 200 and io.sort.mb 2000. I still got the same
error/timeout. I played with all related conf settings one by one. In the end,
changing mapred.job.shuffle.merge.percent from 1.0 back to 0.66 solved the
problem (at 1.0, the in-memory merge presumably cannot start until the shuffle
buffer is completely full, which the stalled fetchers never let happen).

However, the job is still taking a long time. There are 84 reducers, but only
one of them takes a very long time. I attached the log file of that reduce
task. The majority of the data gets spilled to disk. Even though I set
mapred.child.java.opts to 6144, the reduce task log shows

ShuffleRamManager: MemoryLimit=1503238528, MaxSingleShuffleLimit=375809632

as if memory were 2GB (70% of 2GB = 1503238528 bytes). Later in the
same log file there is also this line:

INFO ExecReducer: maximum memory = 6414139392

I am not using memory monitoring. Tasktrackers have this line in the log:

TaskTracker's totalMemoryAllottedForTasks is -1. TaskMemoryManager is disabled.

Why is ShuffleRamManager coming up with that number, as if the max memory
were 2GB? (See the sketch below.)
Why am I still getting that much spill even with these aggressive memory
settings?
Why is only one reducer taking that long?
What else can I change to make this job run in memory and finish faster?
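
The two ShuffleRamManager numbers are consistent with the 0.20-era shuffle
clamping its buffer at Integer.MAX_VALUE: the in-memory segments are
int-indexed byte arrays, so no heap size pushes the shuffle buffer past ~2GB.
A paraphrased sketch (not the verbatim Hadoop source) that reproduces both
figures from the log:

  // Paraphrased sketch of how the 0.20-era ShuffleRamManager appears to
  // derive its limits; not the verbatim Hadoop source.
  public class ShuffleLimitSketch {
      public static void main(String[] args) {
          long maxMemory = 6414139392L; // what ExecReducer reported
          float bufferPercent = 0.70f;  // mapred.job.shuffle.input.buffer.percent default
          float singleSegment = 0.25f;  // cap for any single in-memory map output

          // int-indexed buffers: clamp the usable heap to ~2GB first
          long clamped = Math.min(maxMemory, Integer.MAX_VALUE);
          long memoryLimit = (long) (bufferPercent * clamped);
          long maxSingle = (long) (memoryLimit * singleSegment);

          System.out.println("MemoryLimit=" + memoryLimit);         // 1503238528
          System.out.println("MaxSingleShuffleLimit=" + maxSingle); // 375809632
      }
  }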

Thank you.
-N.N.Gesli

On Fri, Oct 28, 2011 at 2:14 AM, Michel Segel <mi...@hotmail.com> wrote:

> <snip>

Re: Map-Reduce in memory

Posted by Michel Segel <mi...@hotmail.com>.
Uhm...
He has plenty of memory... Depending on what sort of M/R tasks, he could push it.
He didn't say how much disk...

I wouldn't start that high... Try 10 mappers and 2 reducers. Granted, it is a bit asymmetric, and you can bump up the reducers...

Watch your jobs in Ganglia and see what is happening...

Harsh, assuming he is using Intel, each core is hyper-threaded, so the box sees this as 2x CPUs.
8 cores look like 16.


Sent from a remote device. Please excuse any typos...

Mike Segel

On Oct 28, 2011, at 3:08 AM, Harsh J <ha...@cloudera.com> wrote:

> <snip>

Re: Map-Reduce in memory

Posted by Harsh J <ha...@cloudera.com>.
Hey N.N. Gesli,

(Inline)

On Fri, Oct 28, 2011 at 12:38 PM, N.N. Gesli <nn...@gmail.com> wrote:
> Hello,
>
> We have a 12-node Hadoop cluster running Hadoop 0.20.2-cdh3u0. Each
> node has 8 cores and 144GB RAM (don't ask). So, I want to take advantage of
> this huge RAM and run the map-reduce jobs mostly in memory, with no spill if
> possible. We use Hive for most of the processes. I have set:
> mapred.tasktracker.map.tasks.maximum = 16
> mapred.tasktracker.reduce.tasks.maximum = 8

This is *crazy* for an 8-core machine. Try to keep M+R slots well
below 8 instead - you're probably CPU-thrashed in this setup once a
large number of tasks get booted.

> mapred.child.java.opts = 6144

You can also raise io.sort.mb to 2000, and tweak io.sort.factor.
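
For context, io.sort.mb sizes the map-side sort buffer (a single byte array,
so it cannot usefully go much beyond ~2000) and io.sort.factor caps how many
streams one merge pass reads. A sketch of the spill arithmetic with those
values, assuming the 0.20 default spill threshold:

  io.sort.mb            = 2000   (map-side sort buffer, in MB)
  io.sort.spill.percent = 0.80   (default: a spill starts at 2000 * 0.80 = 1600 MB)
  io.sort.factor        = 200    (up to 200 spill segments merged per pass)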

Raising the child opts to ~6 GB looks a bit unnecessary, since most of
your tasks work on a record basis and would not care much about total
RAM. Perhaps use all that RAM for a service like HBase, which can
leverage caching nicely!

> One of my Hive queries produces six stages of map-reduce jobs. In the third
> stage, when it queries a 200GB table, the last 14 reducers hang. I
> changed mapred.task.timeout to 0 to see if they really hang. It has been 5
> hours, so something is terribly wrong in my setup. Parts of the log are below.

It is probably just your slot settings. You may be massively
over-subscribing your CPU resources with 16 map task slots + 8 reduce
task slots. In the worst case, that would mean 24 total JVMs competing over
8 available physical processors. Doesn't make sense to me at least -
make it more like 7 M / 2 R or so :)

> My questions:
> * What should my configuration be to make reducers run in memory?
> * Why does it keep waiting for map outputs?

It has to fetch map outputs to get some data to start with. And it
pulls the map outputs a few at a time - so as not to overload the network
during the shuffle phases of several reducers across the cluster.

> * What does "dup hosts" mean?

Duplicate hosts. Hosts it already knows about and has already
scheduled fetch work upon.
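
In the 0.20 shuffle, at most one fetch per host is in flight at a time, and
map outputs whose host is already busy are deferred and show up in that
count. A paraphrased sketch of the scheduling pass (not the verbatim
ReduceTask source), which also reproduces the log line's missing space:

  import java.util.*;

  // Paraphrased sketch of the 0.20-era copier scheduling pass; the real
  // code lives in ReduceTask, this is only an illustration of the idea.
  public class DupHostsSketch {
      public static void main(String[] args) {
          // hypothetical pending map outputs, keyed by the host serving them
          List<String> pendingHosts =
              Arrays.asList("node01", "node02", "node01", "node03", "node02");
          Set<String> hostsInFlight = new HashSet<String>();
          int scheduled = 0, dups = 0;

          for (String host : pendingHosts) {
              if (!hostsInFlight.add(host)) { // a copy to this host is running
                  dups++;                     // deferred to a later pass
                  continue;
              }
              scheduled++; // would be handed to a copier thread here
          }
          // prints: Scheduled 3 outputs (0 slow hosts and2 dup hosts)
          System.out.println("Scheduled " + scheduled
              + " outputs (0 slow hosts and" + dups + " dup hosts)");
      }
  }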

<snip>

-- 
Harsh J