Posted to common-user@hadoop.apache.org by Guilherme Menezes <gu...@gmail.com> on 2008/09/21 18:40:16 UTC

Hadoop Cluster Size Scalability Numbers?

Hi,

I wonder if anyone is aware of any measurement of Hadoop scalability with
cluster size, e.g., read/write/append throughput on clusters of 5, 10, 30,
50, or 100 nodes, or something similar.
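We could of course collect some of these numbers ourselves; as far as I can
tell the TestDFSIO benchmark in the Hadoop test jar writes and then reads a
set of files and reports aggregate throughput. A rough sketch (file counts
and sizes picked arbitrarily, jar name depends on the release):

    # write 10 files of 1000 MB each, then read them back, then clean up
    hadoop jar hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
    hadoop jar hadoop-*-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
    hadoop jar hadoop-*-test.jar TestDFSIO -clean

Repeating the same pair of jobs at each cluster size would give us the kind
of scaling curve we are after, but published numbers would still be very
welcome.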

These numbers would help us plan the resources for a small academic
cluster. We are concerned that Hadoop may not work properly with too few
nodes (too many map/reduce tasks per node or too many replicas per node,
for example). We currently have 4 nodes (16GB of RAM, 6 * 750 GB disks,
Quad-Core AMD Opteron processor each). Our initial plan is to perform a Web
crawl for academic purposes (something between 500 million and 1 billion
pages), and we need to expand the number of nodes for that. In terms of
Hadoop performance, is it better to have a larger number of simpler nodes
(less memory, less processing power) than the ones we currently have?
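(The specific knobs we assume we would have to watch on a small cluster are
the replication factor and the number of task slots per node; if we have
understood the 0.18 property names correctly, these are the relevant
hadoop-site.xml entries, shown with what we believe are the defaults:

    <property>
      <name>dfs.replication</name>
      <value>3</value>  <!-- usually kept <= the number of data nodes -->
    </property>
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value>  <!-- concurrent map tasks per node -->
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>  <!-- concurrent reduce tasks per node -->
    </property>

Corrections welcome if these are not the right settings to worry about.)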

Thank you in advance for any information!

Guilherme Menezes

Re: Hadoop Cluster Size Scalability Numbers?

Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.

Allen Wittenauer wrote:
> On 9/21/08 2:51 PM, "Dmitry Pushkarev" <um...@stanford.edu> wrote:
>> Speaking about NFS-backup idea:
>> If I have secure NFS storage which is much slower than the network (3MB/s vs
>> the 100MB/s network we use between nodes), will it adversely affect
>> performance, or can I rely on NFS caching to do the job?
> 
>     I think Konstantin has some benchmarks in a JIRA somewhere that show
> that the current bottleneck isn't the fsimage/edits writes.

HADOOP-3860 has name-node benchmark numbers.
It concludes that for name-node operations the bottleneck is indeed the edits writes.
But another conclusion is that real-world clusters do not put enough load on the
name-node for it to reach that bottleneck.
For NFS in particular, I found that although it slows the name-node down,
the slowdown is less than 5%.
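(If you want to reproduce those numbers, the driver behind them is the
NNThroughputBenchmark class; the invocation below is only a sketch from
memory, so check HADOOP-3860 and the benchmark's usage message for the
exact class name and options in your version:

    # exercises name-node operations directly; no data-nodes or real I/O involved
    hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark \
        -op create -threads 10 -files 100000

It reports operations per second for the chosen operation, which is where
the edits-write bottleneck shows up.)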

>> And if nfs share dies, will it shutdown the namenode as well?
> 
>     In our experience, the name node continues.  But be warned that it will
> only put a message in the name node log that the NFS mount became
> unwritable.  There is a JIRA open to fix this, though.

The name-node treats NFS shares the same as local ones; it does not distinguish
between different storage directories. The name-node will continue to run as long
as at least one storage directory is available. So if you have one NFS share and
one local directory and the NFS share fails, the name-node will continue to run.
But if NFS was the only storage directory, the name-node will shut down.
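(Concretely, the storage directories come from the dfs.name.dir property,
which takes a comma-separated list; a sketch with one local directory and
one NFS mount, paths made up for illustration:

    <property>
      <name>dfs.name.dir</name>
      <value>/local/hadoop/name,/mnt/nfs/hadoop/name</value>
    </property>

The name-node writes the fsimage and edits log to every directory in the
list, so losing one of them still leaves a complete copy in the other.)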

--Konstantin

Re: Hadoop Cluster Size Scalability Numbers?

Posted by Allen Wittenauer <aw...@yahoo-inc.com>.
On 9/21/08 2:51 PM, "Dmitry Pushkarev" <um...@stanford.edu> wrote:
> Speaking about NFS-backup idea:
> If I have secure NFS storage which is much slower than the network (3MB/s vs
> the 100MB/s network we use between nodes), will it adversely affect
> performance, or can I rely on NFS caching to do the job?

    I think Konstantin has some benchmarks in a JIRA somewhere that show
that the current bottleneck isn't the fsimage/edits writes.
 
> And if nfs share dies, will it shutdown the namenode as well?

    In our experience, the name node continues.  But be warned that it will
only put a message in the name node log that the NFS mount became
unwritable.  There is a JIRA open to fix this, though.


RE: Hadoop Cluster Size Scalability Numbers?

Posted by Dmitry Pushkarev <um...@stanford.edu>.
Speaking about NFS-backup idea:
If I have secure NFS storage which is much slower than the network (3MB/s vs
the 100MB/s network we use between nodes), will it adversely affect
performance, or can I rely on NFS caching to do the job?

And if nfs share dies, will it shutdown the namenode as well?

-----Original Message-----
From: Allen Wittenauer [mailto:aw@yahoo-inc.com] 
Sent: Sunday, September 21, 2008 1:38 PM
To: core-user@hadoop.apache.org
Subject: Re: Hadoop Cluster Size Scalability Numbers?




On 9/21/08 9:40 AM, "Guilherme Menezes" <gu...@gmail.com> wrote:
> We currently have 4 nodes (16GB of RAM, 6 * 750 GB disks, Quad-Core AMD
> Opteron processor each). Our initial plan is to perform a Web crawl for
> academic purposes (something between 500 million and 1 billion pages),
> and we need to expand the number of nodes for that. In terms of Hadoop
> performance, is it better to have a larger number of simpler nodes (less
> memory, less processing power) than the ones we currently have?

    Your current boxes seem overpowered for crawling. If it were me, I'd
probably:

        a) turn the current four machines into a dedicated namenode, job
tracker, secondary name node, and oh-no-a-machine-just-died! backup node (set
up an NFS server on it and use it as a second, direct copy of the fsimage and
edits file if you don't have one).  With 16GB name nodes, you should be able
to store a lot of data.

        b) when you buy new nodes, I'd cut down on memory and CPU and just
turn them into your workhorses

    That said, I know little-to-nothing about crawling.  So, IMHO on the
above.


Re: Hadoop Cluster Size Scalability Numbers?

Posted by Guilherme Menezes <gu...@gmail.com>.
On Sun, Sep 21, 2008 at 5:37 PM, Allen Wittenauer <aw...@yahoo-inc.com> wrote:

>
>
>
> On 9/21/08 9:40 AM, "Guilherme Menezes" <gu...@gmail.com> wrote:
> > We currently have 4 nodes (16GB of RAM, 6 * 750 GB disks, Quad-Core AMD
> > Opteron processor each). Our initial plan is to perform a Web crawl for
> > academic purposes (something between 500 million and 1 billion pages),
> > and we need to expand the number of nodes for that. In terms of Hadoop
> > performance, is it better to have a larger number of simpler nodes (less
> > memory, less processing power) than the ones we currently have?
>
>     Your current boxes seem overpowered for crawling. If it were me, I'd
> probably:


Hi, thanks for the answer.

We'll have to perform some tests to check whether they really are overpowered
for our needs, but I suspect we would get a lot more parallelism if we had more
nodes (and therefore more disks). As soon as we reach a conclusion I'll post it here!
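(In case it matters for the comparison: each node's six disks can all be
handed to Hadoop by listing their mount points, so the per-node I/O
parallelism does get used. A sketch, with hypothetical mount points
/data1 .. /data6:

    <property>
      <name>dfs.data.dir</name>
      <value>/data1/dfs,/data2/dfs,/data3/dfs,/data4/dfs,/data5/dfs,/data6/dfs</value>
    </property>
    <property>
      <name>mapred.local.dir</name>
      <value>/data1/mapred,/data2/mapred,/data3/mapred,/data4/mapred,/data5/mapred,/data6/mapred</value>
    </property>

HDFS round-robins blocks across the dfs.data.dir directories, and map/reduce
intermediate data is spread across mapred.local.dir.)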


>
>
>        a) turn the current four machines into a dedicated namenode, job
> tracker, secondary name node, and oh-no-a-machine-just-died! backup node (set
> up an NFS server on it and use it as a second, direct copy of the fsimage and
> edits file if you don't have one).  With 16GB name nodes, you should be able
> to store a lot of data.


Very useful suggestions. To clarify the "oh-no-a-machine-just-died! backup
node", what is the difference between this kind of backup (NFS) and the
secondary name node backup, and why do we need both of them?
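(From the documentation, my current understanding is that the two are
complementary: an extra dfs.name.dir entry on NFS is a live, always
up-to-date second copy of the fsimage and edits log, while the secondary
name node only periodically merges the edits into a new fsimage checkpoint,
controlled by settings like the following (the directory path is made up;
3600 seconds is, I believe, the default period):

    <property>
      <name>fs.checkpoint.dir</name>
      <value>/local/hadoop/namesecondary</value>
    </property>
    <property>
      <name>fs.checkpoint.period</name>
      <value>3600</value>  <!-- seconds between checkpoints -->
    </property>

Please correct me if that picture is wrong.)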


>
>
>        b) when you buy new nodes, I'd cut down on memory and CPU and just
> turn them into your workhorses
>
>    That said, I know little-to-nothing about crawling.  So, IMHO on the
> above.


We are studying how Nutch works right now to better understand how crawling
is done with map-reduce. Maybe the nutch-users list would be a better place
to post these questions, but thanks anyway!

Re: Hadoop Cluster Size Scalability Numbers?

Posted by Allen Wittenauer <aw...@yahoo-inc.com>.


On 9/21/08 9:40 AM, "Guilherme Menezes" <gu...@gmail.com> wrote:
> We currently have 4 nodes (16GB of RAM, 6 * 750 GB disks, Quad-Core AMD
> Opteron processor each). Our initial plan is to perform a Web crawl for
> academic purposes (something between 500 million and 1 billion pages),
> and we need to expand the number of nodes for that. In terms of Hadoop
> performance, is it better to have a larger number of simpler nodes (less
> memory, less processing power) than the ones we currently have?

    Your current boxes seem overpowered for crawling. If it were me, I'd
probably:

        a) turn the current four machines into a dedicated namenode, job
tracker, secondary name node, and oh-no-a-machine-just-died! backup node (set
up an NFS server on it and use it as a second, direct copy of the fsimage and
edits file if you don't have one).  With 16GB name nodes, you should be able
to store a lot of data.

        b) when you buy new nodes, I'd cut down on memory and CPU and just
turn them into your workhorses

    That said, I know little-to-nothing about crawling.  So, IMHO on the
above.
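(To make (a) concrete: splitting the HDFS and MapReduce masters across
machines is, as far as I know, just a matter of pointing the two addresses
in hadoop-site.xml at different hosts; names and ports below are made-up
examples:

    <property>
      <name>fs.default.name</name>
      <value>hdfs://namenode-host:9000</value>
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <value>jobtracker-host:9001</value>
    </property>

The secondary name node host then goes into the conf/masters file so the
start-up scripts launch it on its own machine.)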