Posted to mapreduce-user@hadoop.apache.org by Tim Robertson <ti...@gmail.com> on 2010/11/17 09:43:09 UTC

Help tuning a cluster - COPY slow

Hi all,

We have set up a small cluster (13 nodes) using CDH3.

We have been tuning it using TeraSort and Hive queries on our data,
and the copy phase is very slow, so I'd like to ask if anyone can look
over our config.

We have an unbalanced set of machines (all on a single switch):
- 10 of Intel @ 2.83GHz Quad, 8GB, 2x500G 7.2K SATA (3 mappers, 2 reducers)
- 3 of  Intel @ 2.53GHz Dual Quad, 24GB, 6x250GB 5.4K SATA (12
mappers, 12 reducers)

We monitored the load using top on the machines to settle on the number of
mappers and reducers without overloading them, and the map() and reduce()
phases are working very nicely - all our time is going into the copy.

The config:

io.sort.mb=400
io.sort.factor=100
mapred.reduce.parallel.copies=20
tasktracker.http.threads=80
mapred.compress.map.output=true/false (no noticeable difference)
mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec
mapred.output.compression.type=BLOCK
mapred.inmem.merge.threshold=0
mapred.job.reduce.input.buffer.percent=0.7
mapred.job.reuse.jvm.num.tasks=50
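
(These are cluster-wide defaults; for quick experiments they can also be
overridden per job at submission time - a sketch, assuming the job's main
class goes through ToolRunner so that -D is parsed; the jar and class names
are placeholders:

  hadoop jar my-job.jar com.example.MyJob -D mapred.reduce.parallel.copies=20 in out

Hive accepts the same properties via hive -hiveconf key=value, or SET
key=value; inside a session.)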

An example job:
(select basis_of_record,count(1) from occurrence_record group by
basis_of_record)
Map input records: 262,573,931; the maps finished in 2mins30 using 833 mappers
Reduce was at 24% when the maps finished at 2mins30, with all 55 reducers running
Map output records: 1,855
Map output bytes: 28,724
REDUCE COPY PHASE finished after 7mins01 secs
Reduce finished after 7mins17secs

Am I correct that 28,724 bytes emitted from the maps should not take 4mins30 to copy?
(That works out to roughly 106 bytes per second over the ~270-second copy tail, so the time is per-fetch overhead, not data transfer.)

We're running Puppet, so we can test changes quickly.

Any pointers on how we can debug / improve this are greatly appreciated!
Tim

Re: Help tuning a cluster - COPY slow

Posted by Tim Robertson <ti...@gmail.com>.
Just to close this thread.
Turns out it all came down to mapred.reduce.parallel.copies being
overridden to 5 on the Hive submission. Cranking that back up and
everything is happy again.
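
(That explains the numbers: each of the 55 reducers fetches map output from
all 833 maps - roughly 45,800 tiny HTTP fetches in total - so with only 5
copies in flight per reducer, per-connection latency dominates completely.
A sketch of pinning the value back per invocation; 20 simply mirrors the
cluster config earlier in the thread:

  hive -hiveconf mapred.reduce.parallel.copies=20 -e "select basis_of_record, count(1) from occurrence_record group by basis_of_record"

SET mapred.reduce.parallel.copies=20; inside a Hive session should do the
same.)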

Thanks for the ideas,

Tim




On Thu, Nov 18, 2010 at 11:04 AM, Tim Robertson
<ti...@gmail.com> wrote:
> Thanks again.
>
> We are getting closer to debugging this.  Our reference for all these
> tests was a simple GroupBy using Hive, but when I do a vanilla MR job
> on the tab file input to do the same group by, it flies through -
> almost exactly 2 times quicker.  Investigating further as it is not
> quite a fair test at the moment due to some config differences...
>
>
> On Thu, Nov 18, 2010 at 10:19 AM, Friso van Vollenhoven
> <fv...@xebia.com> wrote:
>> Do you have IPv6 enabled on the boxes? If DNS gives both IPv4 and IPv6 results for lookups, Java will try v6 first and then fall back to v4, which is an additional connect attempt. You can force Java to use only v4 by setting the system property java.net.preferIPv4Stack=true.
>>
>> Also, I am not sure whether Java does the same thing as nslookup when doing name lookups (I believe it has its own cache as well, but correct me if I'm wrong).
>>
>> You could try running something like strace (with the -T option, which shows time spent in system calls) to see whether network related system calls take a long time.
>>
>>
>>
>> Friso
>>
>>
>>
>>
>> On 17 nov 2010, at 22:20, Tim Robertson wrote:
>>
>>> I don't think so Aaron - but we use names not IPs in the config and on
>>> a node the following is instant:
>>>
>>> [root@c2n1 ~]# nslookup c1n1.gbif.org
>>> Server:               130.226.238.254
>>> Address:      130.226.238.254#53
>>>
>>> Non-authoritative answer:
>>> Name: c1n1.gbif.org
>>> Address: 130.226.238.171
>>>
>>> If I ssh onto an arbitrary machine in the cluster and pull a file
>>> using curl (e.g.
>>> http://c1n9.gbif.org:50075/streamFile?filename=%2Fuser%2Fhive%2Fwarehouse%2Feol_density2_4%2Fattempt_201011151423_0027_m_000000_0&delegation=null)
>>> it comes down at 110 MB/s with no delay on the DNS lookup.
>>>
>>> Is there a better test I can do? - I am not so much a network guy...
>>> Cheers,
>>> Tim
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Nov 17, 2010 at 10:08 PM, Aaron Kimball <ak...@gmail.com> wrote:
>>>> Tim,
>>>> Are there issues with DNS caching (or lack thereof), misconfigured
>>>> /etc/hosts, or other network-config gotchas that might be preventing network
>>>> connections between hosts from opening efficiently?
>>>> - Aaron
>>>>
>>>> On Wed, Nov 17, 2010 at 12:50 PM, Tim Robertson <ti...@gmail.com>
>>>> wrote:
>>>>>
>>>>> Thanks Friso,
>>>>>
>>>>> We've been trying to diagnose this all day and still have not found a solution.
>>>>> We're running Cacti, and IO wait is down at 0.5%; maps and reduces are tuned
>>>>> right down to 1 mapper and 1 reducer per machine, and the machine CPUs are
>>>>> almost idle with no swapping.
>>>>> Using curl to pull a file from a DataNode comes down at 110 MB/s.
>>>>>
>>>>> We are now upping things like the epoll limits.
>>>>>
>>>>> Any ideas are greatly appreciated at this stage!
>>>>> Tim
>>>>>
>>>>>
>>>>> On Wed, Nov 17, 2010 at 10:20 AM, Friso van Vollenhoven
>>>>> <fv...@xebia.com> wrote:
>>>>>> Hi Tim,
>>>>>> Getting 28K of map outputs to reducers should not take minutes. Reducers on
>>>>>> a properly set up (1Gb) network should be copying at multiple MB/s. I think
>>>>>> you need to get some more info.
>>>>>> Apart from top, you'll probably also want to look at iostat and vmstat. The
>>>>>> first will tell you something about disk utilization and the latter can tell
>>>>>> you whether the machines are using swap or not. This is very important. If
>>>>>> you are over-utilizing physical memory on the machines, things will be slow.
>>>>>> It's even better if you put something in place that allows you to get an
>>>>>> overall view of the resource usage across the cluster. Look at Ganglia
>>>>>> (http://ganglia.sourceforge.net/) or Cacti (http://www.cacti.net/) or
>>>>>> something similar.
>>>>>> Basically a job is either CPU bound, IO bound or network bound. You need to
>>>>>> be able to look at all three to see what the bottleneck is. Also, you can
>>>>>> run into churn when you saturate resources and processes are competing for
>>>>>> them (e.g. when you have two disks and 50 processes / threads reading from
>>>>>> them, things will be slow because the OS needs to switch between them a lot
>>>>>> and overall throughput will be less than what the disks can do; you can see
>>>>>> this when there is a lot of time in iowait, but overall throughput is low so
>>>>>> there are a lot of seeks going on).
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 17 nov 2010, at 09:43, Tim Robertson wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We have set up a small cluster (13 nodes) using CDH3.
>>>>>>
>>>>>> We have been tuning it using TeraSort and Hive queries on our data,
>>>>>> and the copy phase is very slow, so I'd like to ask if anyone can look
>>>>>> over our config.
>>>>>>
>>>>>> We have an unbalanced set of machines (all on a single switch):
>>>>>> - 10 of Intel @ 2.83GHz Quad, 8GB, 2x500G 7.2K SATA (3 mappers, 2
>>>>>> reducers)
>>>>>> - 3 of  Intel @ 2.53GHz Dual Quad, 24GB, 6x250GB 5.4K SATA (12
>>>>>> mappers, 12 reducers)
>>>>>>
>>>>>> We monitored the load using top on the machines to settle on the number of
>>>>>> mappers and reducers without overloading them, and the map() and reduce()
>>>>>> phases are working very nicely - all our time is going into the copy.
>>>>>>
>>>>>> The config:
>>>>>>
>>>>>> io.sort.mb=400
>>>>>> io.sort.factor=100
>>>>>> mapred.reduce.parallel.copies=20
>>>>>> tasktracker.http.threads=80
>>>>>> mapred.compress.map.output=true/false (no noticeable difference)
>>>>>> mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec
>>>>>> mapred.output.compression.type=BLOCK
>>>>>> mapred.inmem.merge.threshold=0
>>>>>> mapred.job.reduce.input.buffer.percent=0.7
>>>>>> mapred.job.reuse.jvm.num.tasks=50
>>>>>>
>>>>>> An example job:
>>>>>> (select basis_of_record,count(1) from occurrence_record group by
>>>>>> basis_of_record)
>>>>>> Map input records: 262,573,931; the maps finished in 2mins30 using 833 mappers
>>>>>> Reduce was at 24% when the maps finished at 2mins30, with all 55 reducers running
>>>>>> Map output records: 1,855
>>>>>> Map output bytes: 28,724
>>>>>> REDUCE COPY PHASE finished after 7mins01 secs
>>>>>> Reduce finished after 7mins17secs
>>>>>>
>>>>>> Am I correct that 28,724 bytes emitted from the maps should not take
>>>>>> 4mins30 to copy?
>>>>>>
>>>>>> We're running Puppet, so we can test changes quickly.
>>>>>>
>>>>>> Any pointers on how we can debug / improve this are greatly appreciated!
>>>>>> Tim
>>>>>>
>>>>>>
>>>>
>>>>
>>
>>
>

Re: Help tuning a cluster - COPY slow

Posted by Tim Robertson <ti...@gmail.com>.
Thanks again.

We are getting closer to debugging this.  Our reference for all these
tests was a simple GroupBy using Hive, but when I do a vanilla MR job
on the tab file input to do the same group by, it flies through -
almost exactly 2 times quicker.  Investigating further as it is not
quite a fair test at the moment due to some config differences...
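
(A quick way to run that vanilla group-by without writing a Java job - a
sketch using Hadoop Streaming; the jar path, the input path, and the
assumption that basis_of_record is the first tab-separated column are all
illustrative:

  hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.reduce.tasks=55 \
    -input /user/hive/warehouse/occurrence_record \
    -output /tmp/groupby-test \
    -mapper 'cut -f 1' \
    -reducer 'uniq -c'

Streaming hands each reducer its keys already sorted, so uniq -c yields a
count per distinct value.)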


On Thu, Nov 18, 2010 at 10:19 AM, Friso van Vollenhoven
<fv...@xebia.com> wrote:
> Do you have IPv6 enabled on the boxes? If DNS gives both IPv4 and IPv6 results for lookups, Java will try v6 first and then fall back to v4, which is an additional connect attempt. You can force Java to use only v4 by setting the system property java.net.preferIPv4Stack=true.
>
> Also, I am not sure whether Java does the same thing as nslookup when doing name lookups (I believe it has its own cache as well, but correct me if I'm wrong).
>
> You could try running something like strace (with the -T option, which shows time spent in system calls) to see whether network related system calls take a long time.
>
>
>
> Friso
>
>
>
>
> On 17 nov 2010, at 22:20, Tim Robertson wrote:
>
>> I don't think so Aaron - but we use names not IPs in the config and on
>> a node the following is instant:
>>
>> [root@c2n1 ~]# nslookup c1n1.gbif.org
>> Server:               130.226.238.254
>> Address:      130.226.238.254#53
>>
>> Non-authoritative answer:
>> Name: c1n1.gbif.org
>> Address: 130.226.238.171
>>
>> If I ssh onto an arbitrary machine in the cluster and pull a file
>> using curl (e.g.
>> http://c1n9.gbif.org:50075/streamFile?filename=%2Fuser%2Fhive%2Fwarehouse%2Feol_density2_4%2Fattempt_201011151423_0027_m_000000_0&delegation=null)
>> it comes down at 110 MB/s with no delay on the DNS lookup.
>>
>> Is there a better test I can do? - I am not so much a network guy...
>> Cheers,
>> Tim
>>
>>
>>
>>
>>
>> On Wed, Nov 17, 2010 at 10:08 PM, Aaron Kimball <ak...@gmail.com> wrote:
>>> Tim,
>>> Are there issues with DNS caching (or lack thereof), misconfigured
>>> /etc/hosts, or other network-config gotchas that might be preventing network
>>> connections between hosts from opening efficiently?
>>> - Aaron
>>>
>>> On Wed, Nov 17, 2010 at 12:50 PM, Tim Robertson <ti...@gmail.com>
>>> wrote:
>>>>
>>>> Thanks Friso,
>>>>
>>>> We've been trying to diagnose this all day and still have not found a solution.
>>>> We're running Cacti, and IO wait is down at 0.5%; maps and reduces are tuned
>>>> right down to 1 mapper and 1 reducer per machine, and the machine CPUs are
>>>> almost idle with no swapping.
>>>> Using curl to pull a file from a DataNode comes down at 110 MB/s.
>>>>
>>>> We are now upping things like the epoll limits.
>>>>
>>>> Any ideas are greatly appreciated at this stage!
>>>> Tim
>>>>
>>>>
>>>> On Wed, Nov 17, 2010 at 10:20 AM, Friso van Vollenhoven
>>>> <fv...@xebia.com> wrote:
>>>>> Hi Tim,
>>>>> Getting 28K of map outputs to reducers should not take minutes. Reducers on
>>>>> a properly set up (1Gb) network should be copying at multiple MB/s. I think
>>>>> you need to get some more info.
>>>>> Apart from top, you'll probably also want to look at iostat and vmstat. The
>>>>> first will tell you something about disk utilization and the latter can tell
>>>>> you whether the machines are using swap or not. This is very important. If
>>>>> you are over-utilizing physical memory on the machines, things will be slow.
>>>>> It's even better if you put something in place that allows you to get an
>>>>> overall view of the resource usage across the cluster. Look at Ganglia
>>>>> (http://ganglia.sourceforge.net/) or Cacti (http://www.cacti.net/) or
>>>>> something similar.
>>>>> Basically a job is either CPU bound, IO bound or network bound. You need to
>>>>> be able to look at all three to see what the bottleneck is. Also, you can
>>>>> run into churn when you saturate resources and processes are competing for
>>>>> them (e.g. when you have two disks and 50 processes / threads reading from
>>>>> them, things will be slow because the OS needs to switch between them a lot
>>>>> and overall throughput will be less than what the disks can do; you can see
>>>>> this when there is a lot of time in iowait, but overall throughput is low so
>>>>> there are a lot of seeks going on).
>>>>>
>>>>>
>>>>>
>>>>> On 17 nov 2010, at 09:43, Tim Robertson wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> We have set up a small cluster (13 nodes) using CDH3.
>>>>>
>>>>> We have been tuning it using TeraSort and Hive queries on our data,
>>>>> and the copy phase is very slow, so I'd like to ask if anyone can look
>>>>> over our config.
>>>>>
>>>>> We have an unbalanced set of machines (all on a single switch):
>>>>> - 10 of Intel @ 2.83GHz Quad, 8GB, 2x500G 7.2K SATA (3 mappers, 2
>>>>> reducers)
>>>>> - 3 of  Intel @ 2.53GHz Dual Quad, 24GB, 6x250GB 5.4K SATA (12
>>>>> mappers, 12 reducers)
>>>>>
>>>>> We monitored the load using top on the machines to settle on the number of
>>>>> mappers and reducers without overloading them, and the map() and reduce()
>>>>> phases are working very nicely - all our time is going into the copy.
>>>>>
>>>>> The config:
>>>>>
>>>>> io.sort.mb=400
>>>>> io.sort.factor=100
>>>>> mapred.reduce.parallel.copies=20
>>>>> tasktracker.http.threads=80
>>>>> mapred.compress.map.output=true/false (no noticeable difference)
>>>>> mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec
>>>>> mapred.output.compression.type=BLOCK
>>>>> mapred.inmem.merge.threshold=0
>>>>> mapred.job.reduce.input.buffer.percent=0.7
>>>>> mapred.job.reuse.jvm.num.tasks=50
>>>>>
>>>>> An example job:
>>>>> (select basis_of_record,count(1) from occurrence_record group by
>>>>> basis_of_record)
>>>>> Map input records: 262,573,931; the maps finished in 2mins30 using 833 mappers
>>>>> Reduce was at 24% when the maps finished at 2mins30, with all 55 reducers running
>>>>> Map output records: 1,855
>>>>> Map output bytes: 28,724
>>>>> REDUCE COPY PHASE finished after 7mins01 secs
>>>>> Reduce finished after 7mins17secs
>>>>>
>>>>> Am I correct that 28,724 bytes emitted from the maps should not take
>>>>> 4mins30 to copy?
>>>>>
>>>>> We're running Puppet, so we can test changes quickly.
>>>>>
>>>>> Any pointers on how we can debug / improve this are greatly appreciated!
>>>>> Tim
>>>>>
>>>>>
>>>
>>>
>
>

Re: Help tuning a cluster - COPY slow

Posted by Friso van Vollenhoven <fv...@xebia.com>.
Do you have IPv6 enabled on the boxes? If DNS gives both IPv4 and IPv6 results for lookups, Java will try v6 first and then fall back to v4, which is an additional connect attempt. You can force Java to use only v4 by setting the system property java.net.preferIPv4Stack=true.
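
(A sketch of wiring that in, assuming CDH3's hadoop-env.sh and the stock
mapred.child.java.opts property; the heap size shown is illustrative:

  # hadoop-env.sh - covers the DataNode/TaskTracker daemons
  export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true $HADOOP_OPTS"

  # mapred-site.xml - covers the child task JVMs:
  #   mapred.child.java.opts = -Xmx512m -Djava.net.preferIPv4Stack=true

Both are needed because the daemons and the task JVMs are separate
processes.)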

Also, I am not sure whether Java does the same thing as nslookup when doing name lookups (I believe it has its own cache as well, but correct me if I'm wrong).

You could try running something like strace (with the -T option, which shows time spent in system calls) to see whether network related system calls take a long time.
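
(For example, attached to a running reduce task - a sketch, where <pid> is
whatever ps ax | grep attempt_ turns up for the task JVM:

  strace -f -T -e trace=network -p <pid>

Long times on connect() would point at connection setup rather than data
transfer.)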



Friso




On 17 nov 2010, at 22:20, Tim Robertson wrote:

> I don't think so Aaron - but we use names not IPs in the config and on
> a node the following is instant:
> 
> [root@c2n1 ~]# nslookup c1n1.gbif.org
> Server:		130.226.238.254
> Address:	130.226.238.254#53
> 
> Non-authoritative answer:
> Name:	c1n1.gbif.org
> Address: 130.226.238.171
> 
> If I ssh onto an arbitrary machine in the cluster and pull a file
> using curl (e.g.
> http://c1n9.gbif.org:50075/streamFile?filename=%2Fuser%2Fhive%2Fwarehouse%2Feol_density2_4%2Fattempt_201011151423_0027_m_000000_0&delegation=null)
> it comes down at 110 MB/s with no delay on the DNS lookup.
> 
> Is there a better test I can do? - I am not so much a network guy...
> Cheers,
> Tim
> 
> 
> 
> 
> 
> On Wed, Nov 17, 2010 at 10:08 PM, Aaron Kimball <ak...@gmail.com> wrote:
>> Tim,
>> Are there issues with DNS caching (or lack thereof), misconfigured
>> /etc/hosts, or other network-config gotchas that might be preventing network
>> connections between hosts from opening efficiently?
>> - Aaron
>> 
>> On Wed, Nov 17, 2010 at 12:50 PM, Tim Robertson <ti...@gmail.com>
>> wrote:
>>> 
>>> Thanks Friso,
>>> 
>>> We've been trying to diagnose this all day and still have not found a solution.
>>> We're running Cacti, and IO wait is down at 0.5%; maps and reduces are tuned
>>> right down to 1 mapper and 1 reducer per machine, and the machine CPUs are
>>> almost idle with no swapping.
>>> Using curl to pull a file from a DataNode comes down at 110 MB/s.
>>>
>>> We are now upping things like the epoll limits.
>>>
>>> Any ideas are greatly appreciated at this stage!
>>> Tim
>>> 
>>> 
>>> On Wed, Nov 17, 2010 at 10:20 AM, Friso van Vollenhoven
>>> <fv...@xebia.com> wrote:
>>>> Hi Tim,
>>>> Getting 28K of map outputs to reducers should not take minutes. Reducers on
>>>> a properly set up (1Gb) network should be copying at multiple MB/s. I think
>>>> you need to get some more info.
>>>> Apart from top, you'll probably also want to look at iostat and vmstat. The
>>>> first will tell you something about disk utilization and the latter can tell
>>>> you whether the machines are using swap or not. This is very important. If
>>>> you are over-utilizing physical memory on the machines, things will be slow.
>>>> It's even better if you put something in place that allows you to get an
>>>> overall view of the resource usage across the cluster. Look at Ganglia
>>>> (http://ganglia.sourceforge.net/) or Cacti (http://www.cacti.net/) or
>>>> something similar.
>>>> Basically a job is either CPU bound, IO bound or network bound. You need to
>>>> be able to look at all three to see what the bottleneck is. Also, you can
>>>> run into churn when you saturate resources and processes are competing for
>>>> them (e.g. when you have two disks and 50 processes / threads reading from
>>>> them, things will be slow because the OS needs to switch between them a lot
>>>> and overall throughput will be less than what the disks can do; you can see
>>>> this when there is a lot of time in iowait, but overall throughput is low so
>>>> there are a lot of seeks going on).
>>>> 
>>>> 
>>>> 
>>>> On 17 nov 2010, at 09:43, Tim Robertson wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> We have set up a small cluster (13 nodes) using CDH3.
>>>> 
>>>> We have been tuning it using TeraSort and Hive queries on our data,
>>>> and the copy phase is very slow, so I'd like to ask if anyone can look
>>>> over our config.
>>>> 
>>>> We have an unbalanced set of machines (all on a single switch):
>>>> - 10 of Intel @ 2.83GHz Quad, 8GB, 2x500G 7.2K SATA (3 mappers, 2
>>>> reducers)
>>>> - 3 of  Intel @ 2.53GHz Dual Quad, 24GB, 6x250GB 5.4K SATA (12
>>>> mappers, 12 reducers)
>>>> 
>>>> We monitored the load using top on the machines to settle on the number of
>>>> mappers and reducers without overloading them, and the map() and reduce()
>>>> phases are working very nicely - all our time is going into the copy.
>>>>
>>>> The config:
>>>>
>>>> io.sort.mb=400
>>>> io.sort.factor=100
>>>> mapred.reduce.parallel.copies=20
>>>> tasktracker.http.threads=80
>>>> mapred.compress.map.output=true/false (no noticeable difference)
>>>> mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec
>>>> mapred.output.compression.type=BLOCK
>>>> mapred.inmem.merge.threshold=0
>>>> mapred.job.reduce.input.buffer.percent=0.7
>>>> mapred.job.reuse.jvm.num.tasks=50
>>>> 
>>>> An example job:
>>>> (select basis_of_record,count(1) from occurrence_record group by
>>>> basis_of_record)
>>>> Map input records: 262,573,931; the maps finished in 2mins30 using 833 mappers
>>>> Reduce was at 24% when the maps finished at 2mins30, with all 55 reducers running
>>>> Map output records: 1,855
>>>> Map output bytes: 28,724
>>>> REDUCE COPY PHASE finished after 7mins01 secs
>>>> Reduce finished after 7mins17secs
>>>>
>>>> Am I correct that 28,724 bytes emitted from the maps should not take
>>>> 4mins30 to copy?
>>>>
>>>> We're running Puppet, so we can test changes quickly.
>>>> 
>>>> Any pointers on how we can debug / improve this are greatly appreciated!
>>>> Tim
>>>> 
>>>> 
>> 
>> 


Re: Help tuning a cluster - COPY slow

Posted by Tim Robertson <ti...@gmail.com>.
I don't think so Aaron - but we use names not IPs in the config and on
a node the following is instant:

[root@c2n1 ~]# nslookup c1n1.gbif.org
Server:		130.226.238.254
Address:	130.226.238.254#53

Non-authoritative answer:
Name:	c1n1.gbif.org
Address: 130.226.238.171

If I ssh onto an arbitrary machine in the cluster and pull a file
using curl (e.g.
http://c1n9.gbif.org:50075/streamFile?filename=%2Fuser%2Fhive%2Fwarehouse%2Feol_density2_4%2Fattempt_201011151423_0027_m_000000_0&delegation=null)
it comes down at 110 MB/s with no delay on the DNS lookup.

Is there a better test I can do? - I am not so much a network guy...
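
(One test that takes HDFS and HTTP out of the picture entirely is raw TCP
throughput between two nodes - a sketch assuming iperf is installed on both
machines; the hostnames are from the cluster above:

  iperf -s                    # on c1n9
  iperf -c c1n9.gbif.org      # on c2n1

Anything well below ~940 Mbit/s on an otherwise idle gigabit link would
suggest a network-level problem.)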
Cheers,
Tim





On Wed, Nov 17, 2010 at 10:08 PM, Aaron Kimball <ak...@gmail.com> wrote:
> Tim,
> Are there issues with DNS caching (or lack thereof), misconfigured
> /etc/hosts, or other network-config gotchas that might be preventing network
> connections between hosts from opening efficiently?
> - Aaron
>
> On Wed, Nov 17, 2010 at 12:50 PM, Tim Robertson <ti...@gmail.com>
> wrote:
>>
>> Thanks Friso,
>>
>> We've been trying to diagnose this all day and still have not found a solution.
>> We're running Cacti, and IO wait is down at 0.5%; maps and reduces are tuned
>> right down to 1 mapper and 1 reducer per machine, and the machine CPUs are
>> almost idle with no swapping.
>> Using curl to pull a file from a DataNode comes down at 110 MB/s.
>>
>> We are now upping things like the epoll limits.
>>
>> Any ideas are greatly appreciated at this stage!
>> Tim
>>
>>
>> On Wed, Nov 17, 2010 at 10:20 AM, Friso van Vollenhoven
>> <fv...@xebia.com> wrote:
>> > Hi Tim,
>> > Getting 28K of map outputs to reducers should not take minutes. Reducers on
>> > a properly set up (1Gb) network should be copying at multiple MB/s. I think
>> > you need to get some more info.
>> > Apart from top, you'll probably also want to look at iostat and vmstat. The
>> > first will tell you something about disk utilization and the latter can tell
>> > you whether the machines are using swap or not. This is very important. If
>> > you are over-utilizing physical memory on the machines, things will be slow.
>> > It's even better if you put something in place that allows you to get an
>> > overall view of the resource usage across the cluster. Look at Ganglia
>> > (http://ganglia.sourceforge.net/) or Cacti (http://www.cacti.net/) or
>> > something similar.
>> > Basically a job is either CPU bound, IO bound or network bound. You need to
>> > be able to look at all three to see what the bottleneck is. Also, you can
>> > run into churn when you saturate resources and processes are competing for
>> > them (e.g. when you have two disks and 50 processes / threads reading from
>> > them, things will be slow because the OS needs to switch between them a lot
>> > and overall throughput will be less than what the disks can do; you can see
>> > this when there is a lot of time in iowait, but overall throughput is low so
>> > there are a lot of seeks going on).
>> >
>> >
>> >
>> > On 17 nov 2010, at 09:43, Tim Robertson wrote:
>> >
>> > Hi all,
>> >
>> > We have set up a small cluster (13 nodes) using CDH3.
>> >
>> > We have been tuning it using TeraSort and Hive queries on our data,
>> > and the copy phase is very slow, so I'd like to ask if anyone can look
>> > over our config.
>> >
>> > We have an unbalanced set of machines (all on a single switch):
>> > - 10 of Intel @ 2.83GHz Quad, 8GB, 2x500G 7.2K SATA (3 mappers, 2
>> > reducers)
>> > - 3 of  Intel @ 2.53GHz Dual Quad, 24GB, 6x250GB 5.4K SATA (12
>> > mappers, 12 reducers)
>> >
>> > We monitored the load using top on the machines to settle on the number of
>> > mappers and reducers without overloading them, and the map() and reduce()
>> > phases are working very nicely - all our time is going into the copy.
>> >
>> > The config:
>> >
>> > io.sort.mb=400
>> > io.sort.factor=100
>> > mapred.reduce.parallel.copies=20
>> > tasktracker.http.threads=80
>> > mapred.compress.map.output=true/false (no noticeable difference)
>> > mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec
>> > mapred.output.compression.type=BLOCK
>> > mapred.inmem.merge.threshold=0
>> > mapred.job.reduce.input.buffer.percent=0.7
>> > mapred.job.reuse.jvm.num.tasks=50
>> >
>> > An example job:
>> > (select basis_of_record,count(1) from occurrence_record group by
>> > basis_of_record)
>> > Map input records: 262,573,931; the maps finished in 2mins30 using 833 mappers
>> > Reduce was at 24% when the maps finished at 2mins30, with all 55 reducers running
>> > Map output records: 1,855
>> > Map output bytes: 28,724
>> > REDUCE COPY PHASE finished after 7mins01 secs
>> > Reduce finished after 7mins17secs
>> >
>> > Am I correct that 28,724 bytes emitted from the maps should not take
>> > 4mins30 to copy?
>> >
>> > We're running Puppet, so we can test changes quickly.
>> >
>> > Any pointers on how we can debug / improve this are greatly appreciated!
>> > Tim
>> >
>> >
>
>

Re: Help tuning a cluster - COPY slow

Posted by Aaron Kimball <ak...@gmail.com>.
Tim,

Are there issues with DNS caching (or lack thereof), misconfigured
/etc/hosts, or other network-config gotchas that might be preventing network
connections between hosts from opening efficiently?
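
(A few quick checks along those lines - a sketch; the hostnames are from the
thread:

  getent hosts c1n1.gbif.org       # what the resolver library returns (files + DNS)
  grep hosts /etc/nsswitch.conf    # lookup order, e.g. "hosts: files dns"
  time ssh c1n9.gbif.org true      # crude end-to-end connect latency

getent exercises the system resolver path, including /etc/hosts, unlike
nslookup, which queries DNS directly.)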

- Aaron

On Wed, Nov 17, 2010 at 12:50 PM, Tim Robertson
<ti...@gmail.com>wrote:

> Thanks Friso,
>
> We've been trying to diagnose this all day and still have not found a solution.
> We're running Cacti, and IO wait is down at 0.5%; maps and reduces are tuned
> right down to 1 mapper and 1 reducer per machine, and the machine CPUs are
> almost idle with no swapping.
> Using curl to pull a file from a DataNode comes down at 110 MB/s.
>
> We are now upping things like the epoll limits.
>
> Any ideas are greatly appreciated at this stage!
> Tim
>
>
> On Wed, Nov 17, 2010 at 10:20 AM, Friso van Vollenhoven
> <fv...@xebia.com> wrote:
> > Hi Tim,
> > Getting 28K of map outputs to reducers should not take minutes. Reducers on
> > a properly set up (1Gb) network should be copying at multiple MB/s. I think
> > you need to get some more info.
> > Apart from top, you'll probably also want to look at iostat and vmstat. The
> > first will tell you something about disk utilization and the latter can tell
> > you whether the machines are using swap or not. This is very important. If
> > you are over-utilizing physical memory on the machines, things will be slow.
> > It's even better if you put something in place that allows you to get an
> > overall view of the resource usage across the cluster. Look at Ganglia
> > (http://ganglia.sourceforge.net/) or Cacti (http://www.cacti.net/) or
> > something similar.
> > Basically a job is either CPU bound, IO bound or network bound. You need to
> > be able to look at all three to see what the bottleneck is. Also, you can
> > run into churn when you saturate resources and processes are competing for
> > them (e.g. when you have two disks and 50 processes / threads reading from
> > them, things will be slow because the OS needs to switch between them a lot
> > and overall throughput will be less than what the disks can do; you can see
> > this when there is a lot of time in iowait, but overall throughput is low so
> > there are a lot of seeks going on).
> >
> >
> >
> > On 17 nov 2010, at 09:43, Tim Robertson wrote:
> >
> > Hi all,
> >
> > We have set up a small cluster (13 nodes) using CDH3.
> >
> > We have been tuning it using TeraSort and Hive queries on our data,
> > and the copy phase is very slow, so I'd like to ask if anyone can look
> > over our config.
> >
> > We have an unbalanced set of machines (all on a single switch):
> > - 10 of Intel @ 2.83GHz Quad, 8GB, 2x500G 7.2K SATA (3 mappers, 2
> reducers)
> > - 3 of  Intel @ 2.53GHz Dual Quad, 24GB, 6x250GB 5.4K SATA (12
> > mappers, 12 reducers)
> >
> > We monitored the load using top on the machines to settle on the number of
> > mappers and reducers without overloading them, and the map() and reduce()
> > phases are working very nicely - all our time is going into the copy.
> >
> > The config:
> >
> > io.sort.mb=400
> > io.sort.factor=100
> > mapred.reduce.parallel.copies=20
> > tasktracker.http.threads=80
> > mapred.compress.map.output=true/false (no noticeable difference)
> > mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec
> > mapred.output.compression.type=BLOCK
> > mapred.inmem.merge.threshold=0
> > mapred.job.reduce.input.buffer.percent=0.7
> > mapred.job.reuse.jvm.num.tasks=50
> >
> > An example job:
> > (select basis_of_record,count(1) from occurrence_record group by
> > basis_of_record)
> > Map input records: 262,573,931; the maps finished in 2mins30 using 833 mappers
> > Reduce was at 24% when the maps finished at 2mins30, with all 55 reducers running
> > Map output records: 1,855
> > Map output bytes: 28,724
> > REDUCE COPY PHASE finished after 7mins01 secs
> > Reduce finished after 7mins17secs
> >
> > Am I correct that 28,724 bytes emitted from the maps should not take
> > 4mins30 to copy?
> >
> > We're running Puppet, so we can test changes quickly.
> >
> > Any pointers on how we can debug / improve this are greatly appreciated!
> > Tim
> >
> >
>

Re: Help tuning a cluster - COPY slow

Posted by Tim Robertson <ti...@gmail.com>.
Thanks Friso,

We've been trying to diagnose this all day and still have not found a solution.
We're running Cacti, and IO wait is down at 0.5%; maps and reduces are tuned
right down to 1 mapper and 1 reducer per machine, and the machine CPUs are
almost idle with no swapping.
Using curl to pull a file from a DataNode comes down at 110 MB/s.

We are now upping things like the epoll limits.
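
(For reference, the knob in question - a sketch; 4096 is an illustrative
value, and the fs.epoll sysctls only exist on 2.6.27-era and newer kernels:

  sysctl -w fs.epoll.max_user_instances=4096
  echo "fs.epoll.max_user_instances = 4096" >> /etc/sysctl.conf

The second line just persists the setting across reboots.)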

Any ideas are greatly appreciated at this stage!
Tim


On Wed, Nov 17, 2010 at 10:20 AM, Friso van Vollenhoven
<fv...@xebia.com> wrote:
> Hi Tim,
> Getting 28K of map outputs to reducers should not take minutes. Reducers on
> a properly set up (1Gb) network should be copying at multiple MB/s. I think
> you need to get some more info.
> Apart from top, you'll probably also want to look at iostat and vmstat. The
> first will tell you something about disk utilization and the latter can tell
> you whether the machines are using swap or not. This is very important. If
> you are over-utilizing physical memory on the machines, things will be slow.
> It's even better if you put something in place that allows you to get an
> overall view of the resource usage across the cluster. Look at Ganglia
> (http://ganglia.sourceforge.net/) or Cacti (http://www.cacti.net/) or
> something similar.
> Basically a job is either CPU bound, IO bound or network bound. You need to
> be able to look at all three to see what the bottleneck is. Also, you can
> run into churn when you saturate resources and processes are competing for
> them (e.g. when you have two disks and 50 processes / threads reading from
> them, things will be slow because the OS needs to switch between them a lot
> and overall throughput will be less than what the disks can do; you can see
> this when there is a lot of time in iowait, but overall throughput is low so
> there are a lot of seeks going on).
>
>
>
> On 17 nov 2010, at 09:43, Tim Robertson wrote:
>
> Hi all,
>
> We have set up a small cluster (13 nodes) using CDH3.
>
> We have been tuning it using TeraSort and Hive queries on our data,
> and the copy phase is very slow, so I'd like to ask if anyone can look
> over our config.
>
> We have an unbalanced set of machines (all on a single switch):
> - 10 of Intel @ 2.83GHz Quad, 8GB, 2x500G 7.2K SATA (3 mappers, 2 reducers)
> - 3 of  Intel @ 2.53GHz Dual Quad, 24GB, 6x250GB 5.4K SATA (12
> mappers, 12 reducers)
>
> We monitored the load using top on the machines to settle on the number of
> mappers and reducers without overloading them, and the map() and reduce()
> phases are working very nicely - all our time is going into the copy.
>
> The config:
>
> io.sort.mb=400
> io.sort.factor=100
> mapred.reduce.parallel.copies=20
> tasktracker.http.threads=80
> mapred.compress.map.output=true/false (no noticeable difference)
> mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec
> mapred.output.compression.type=BLOCK
> mapred.inmem.merge.threshold=0
> mapred.job.reduce.input.buffer.percent=0.7
> mapred.job.reuse.jvm.num.tasks=50
>
> An example job:
> (select basis_of_record,count(1) from occurrence_record group by
> basis_of_record)
> Map input records: 262,573,931; the maps finished in 2mins30 using 833 mappers
> Reduce was at 24% when the maps finished at 2mins30, with all 55 reducers running
> Map output records: 1,855
> Map output bytes: 28,724
> REDUCE COPY PHASE finished after 7mins01 secs
> Reduce finished after 7mins17secs
>
> Am I correct that 28,724 bytes emitted from the maps should not take 4mins30
> to copy?
>
> We're running Puppet, so we can test changes quickly.
>
> Any pointers on how we can debug / improve this are greatly appreciated!
> Tim
>
>

Re: Help tuning a cluster - COPY slow

Posted by Friso van Vollenhoven <fv...@xebia.com>.
Hi Tim,

Getting 28K of map outputs to reducers should not take minutes. Reducers on a properly set up (1Gb) network should be copying at multiple MB/s. I think you need to get some more info.
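
(For reference: gigabit Ethernet tops out around 110-118 MB/s of usable TCP
payload, so the 110 MB/s curl figure elsewhere in the thread is essentially
wire speed.)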

Apart from top, you'll probably also want to look at iostat and vmstat. The first will tell you something about disk utilization and the latter can tell you whether the machines are using swap or not. This is very important. If you are over-utilizing physical memory on the machines, things will be slow.
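
(Concretely - a sketch, where 5 is just a sampling interval in seconds:

  iostat -x 5    # watch await and %util per device
  vmstat 5       # watch si/so (swap in/out) and the wa column

Sustained nonzero si/so means the box is swapping; high %util with low
throughput suggests seek-bound disks.)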

It's even better if you put something in place that allows you to get an overall view of the resource usage across the cluster. Look at Ganglia (http://ganglia.sourceforge.net/) or Cacti (http://www.cacti.net/) or something similar.

Basically a job is either CPU bound, IO bound or network bound. You need to be able to look at all three to see what the bottleneck is. Also, you can run into churn when you saturate resources and processes are competing for them (e.g. when you have two disks and 50 processes / threads reading from them, things will be slow because the OS needs to switch between them a lot and overall throughput will be less than what the disks can do; you can see this when there is a lot of time in iowait, but overall throughput is low so there are a lot of seeks going on).




On 17 nov 2010, at 09:43, Tim Robertson wrote:

Hi all,

We have set up a small cluster (13 nodes) using CDH3.

We have been tuning it using TeraSort and Hive queries on our data,
and the copy phase is very slow, so I'd like to ask if anyone can look
over our config.

We have an unbalanced set of machines (all on a single switch):
- 10 of Intel @ 2.83GHz Quad, 8GB, 2x500G 7.2K SATA (3 mappers, 2 reducers)
- 3 of  Intel @ 2.53GHz Dual Quad, 24GB, 6x250GB 5.4K SATA (12
mappers, 12 reducers)

We monitored the load using top on the machines to settle on the number of
mappers and reducers without overloading them, and the map() and reduce()
phases are working very nicely - all our time is going into the copy.

The config:

io.sort.mb=400
io.sort.factor=100
mapred.reduce.parallel.copies=20
tasktracker.http.threads=80
mapred.compress.map.output=true/false (no noticeable difference)
mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec
mapred.output.compression.type=BLOCK
mapred.inmem.merge.threshold=0
mapred.job.reduce.input.buffer.percent=0.7
mapred.job.reuse.jvm.num.tasks=50

An example job:
(select basis_of_record,count(1) from occurrence_record group by
basis_of_record)
Map input records: 262,573,931; the maps finished in 2mins30 using 833 mappers
Reduce was at 24% when the maps finished at 2mins30, with all 55 reducers running
Map output records: 1,855
Map output bytes: 28,724
REDUCE COPY PHASE finished after 7mins01 secs
Reduce finished after 7mins17secs

Am I correct that 28,724 bytes emitted from the maps should not take 4mins30 to copy?

We're running Puppet, so we can test changes quickly.

Any pointers on how we can debug / improve this are greatly appreciated!
Tim