Posted to common-user@hadoop.apache.org by Ed Mazur <ma...@cs.umass.edu> on 2010/03/30 09:16:01 UTC

Single datanode setup

Hi,

I have a 12 node cluster where instead of running a DN on each compute
node, I'm running just one DN backed by a large RAID (with a
dfs.replication of 1). The compute node storage is limited, so the
idea behind this was to free up more space for intermediate job data.
So the cluster has that one node with the DN, a master node with the
JT/NN, and 10 compute nodes each with a TT. I am running 0.20.1+169.68
from Cloudera.
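
For reference, the relevant bits of my config look roughly like this
(the RAID mount point below is a placeholder, not the real path):

<!-- hdfs-site.xml on the DN node -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <!-- placeholder path for the RAID mount -->
  <value>/mnt/raid/hdfs/data</value>
</property>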

The problem is that MR job performance is now worse than when using a
traditional HDFS setup. A job that took 76 minutes before now takes
169 minutes. I've used this single DN setup before on a
similarly-sized cluster without any problems, so what can I do to find
the bottleneck?

-Loading data into HDFS was fast, under 30 minutes to load ~240GB, so
I'm thinking this is a DN <-> map task communication problem.

-With a traditional HDFS setup, map tasks were taking 10-30 seconds,
but they now take 45-90 seconds or more.

-I grep'd the DN logs to find how long the 67633152-byte HDFS reads
(map inputs) were taking. With the central DN, the reads were an order
of magnitude slower than with traditional HDFS (e.g. 82008147000 vs.
8238455000).
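
For what it's worth, I pulled those numbers out with something like
this (assuming the usual clienttrace log format, where duration is the
last field and looks like nanoseconds, so roughly 82 s vs. 8.2 s per
~64MB read):

grep 'op: HDFS_READ' /path/to/logs/hadoop-*-datanode-*.log \
  | grep 'bytes: 67633152' \
  | awk '{print $NF}' \
  | sort -n | tail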

-I tried increasing dfs.datanode.handler.count to 10, but this didn't
seem to have any effect.

-Could low memory be an issue? The machine the DN is running on only
has 2GB and there is less than 100MB free without the DN running. I
haven't observed any swapping going on though.
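
The checks I ran were along these lines:

free -m        # total/free/cached memory on the DN box
vmstat 5       # si/so columns stay at 0 if nothing is swapping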

-I looked at netstat during a job. I wasn't too sure what to look for,
but I didn't see any substantial send/receive buffering.
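
What I looked at was roughly this (50010 being the default DN data
transfer port):

netstat -tn | awk 'NR > 2 && ($4 ~ /:50010$/ || $5 ~ /:50010$/) {print $2, $3, $4, $5, $6}'
# columns 2 and 3 are Recv-Q / Send-Q; nothing substantial showed up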

I've tried everything I can think of, so I'd really appreciate any tips. Thanks.

Ed

Re: Single datanode setup

Posted by Steve Loughran <st...@apache.org>.
Ed Mazur wrote:
> Hi,
> 
> I have a 12 node cluster where instead of running a DN on each compute
> node, I'm running just one DN backed by a large RAID (with a
> dfs.replication of 1). The compute node storage is limited, so the
> idea behind this was to free up more space for intermediate job data.
> So the cluster has that one node with the DN, a master node with the
> JT/NN, and 10 compute nodes each with a TT. I am running 0.20.1+169.68
> from Cloudera.
> 
> The problem is that MR job performance is now worse than when using a
> traditional HDFS setup. A job that took 76 minutes before now takes
> 169 minutes. I've used this single DN setup before on a
> similarly-sized cluster without any problems, so what can I do to find
> the bottleneck?

I wouldn't use HDFS in this situation. Your network will be the 
bottleneck. If you have a SAN, high end filesystem and/or fast network, 
just use file:// URLs and let the underlying OS/network handle it. I 
know people who use alternate filesystems this way. Side benefit: the NN 
is no longer an SPOF. Just your storage array. But they never fail, right?

Having a single DN and NN is a waste of effort here. There's no 
locality, no replication, so no need for the replication and locality 
features of HDFS. Try mounting the filestore everywhere with NFS (or 
other protocol of choice), and skip HDFS entirely.
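
A minimal sketch of that, assuming the array is exported over NFS and
mounted at the same (placeholder) path on every node:

<!-- core-site.xml -->
<property>
  <name>fs.default.name</name>
  <value>file:///</value>
</property>

Job input/output paths then just point at the shared mount, e.g.
file:///mnt/shared/input and file:///mnt/shared/output.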

-Steve


Re: Single datanode setup

Posted by Ed Mazur <ma...@cs.umass.edu>.
I set dfs.datanode.max.xcievers to 4096, but this didn't seem to have
any effect on performance.
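
(That went into hdfs-site.xml on the DN, i.e. roughly:)

<property>
  <!-- note: the property name really is spelled "xcievers" -->
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
  <!-- takes effect after restarting the datanode -->
</property>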

Here are some benchmarks (not sure what typical values are):

----- TestDFSIO ----- : write
           Date & time: Tue Mar 30 04:53:18 EDT 2010
       Number of files: 10
Total MBytes processed: 10000
     Throughput mb/sec: 23.41355598064167
Average IO rate mb/sec: 25.179018020629883
 IO rate std deviation: 7.022948102609891
    Test exec time sec: 74.437

----- TestDFSIO ----- : read
           Date & time: Tue Mar 30 05:02:01 EDT 2010
       Number of files: 10
Total MBytes processed: 10000
     Throughput mb/sec: 10.735545929349373
Average IO rate mb/sec: 10.741226196289062
 IO rate std deviation: 0.24872891783558398
    Test exec time sec: 119.561

----- TestDFSIO ----- : write
           Date & time: Tue Mar 30 05:09:59 EDT 2010
       Number of files: 40
Total MBytes processed: 40000
     Throughput mb/sec: 3.3887489806219473
Average IO rate mb/sec: 5.173769950866699
 IO rate std deviation: 6.293246618896401
    Test exec time sec: 360.765

----- TestDFSIO ----- : read
           Date & time: Tue Mar 30 05:18:20 EDT 2010
       Number of files: 40
Total MBytes processed: 40000
     Throughput mb/sec: 2.345990558443698
Average IO rate mb/sec: 2.3469674587249756
 IO rate std deviation: 0.04731737036312141
    Test exec time sec: 477.568

I also used 40 files in the later benchmarks because I have 10 compute
nodes with mapred.tasktracker.map.tasks.maximum set to 4. Performance
degrades quite a bit going from 10 files to 40.

I set mapred.tasktracker.map.tasks.maximum to 1 and ran a MR job. This
got map completion times back down to the expected 15-30 seconds, but
did not change the overall running time.
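
(That's mapred-site.xml on each compute node, plus a TaskTracker
restart, i.e. roughly:)

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
</property>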

Does this just mean that the RAID isn't able to keep up with 10*4=40
parallel requests, but it is able to keep up with 10*1=10 parallel
requests? And if so, is there anything I can do to change this? I know
this isn't how HDFS is meant to be used, but this single DN/RAID setup
has worked for me in the past on a similarly-sized cluster.
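
If anyone is curious, something like this on the DN box during a job
should show whether the array itself is the limit (device name is a
placeholder):

iostat -x 5 sdb
# %util pinned near 100 with high await while 40 maps read would
# point at the array rather than the network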

Ed

On Tue, Mar 30, 2010 at 4:29 AM, Ankur C. Goel <ga...@yahoo-inc.com> wrote:
>
> M/R performance is known to be better with JBOD (just a bunch of disks) than with RAID.
>
> From your setup it looks like your single datanode must be running hot on I/O activity.
>
> The parameter dfs.datanode.handler.count only controls the number of datanode threads serving IPC requests.
> These are NOT used for actual block transfers. Try upping dfs.datanode.max.xcievers.
>
> You can then run the I/O benchmarks to measure the I/O throughput -
> hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
>
> -@nkur
>
> On 3/30/10 12:46 PM, "Ed Mazur" <ma...@cs.umass.edu> wrote:
>
> [original message quoted in full -- snipped]

Re: Single datanode setup

Posted by "Ankur C. Goel" <ga...@yahoo-inc.com>.
M/R performance is known to be better with JBOD (just a bunch of disks) than with RAID.

From your setup it looks like your single datanode must be running hot on I/O activity.

The parameter dfs.datanode.handler.count only controls the number of datanode threads serving IPC requests.
These are NOT used for actual block transfers. Try upping dfs.datanode.max.xcievers.

You can then run the I/O benchmarks to measure the I/O throughput -
hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
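
The read pass and cleanup use the same jar, e.g.:

hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -clean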

-@nkur

On 3/30/10 12:46 PM, "Ed Mazur" <ma...@cs.umass.edu> wrote:

[original message quoted in full -- snipped]