Posted to common-user@hadoop.apache.org by jagaran das <ja...@yahoo.co.in> on 2011/08/10 09:58:16 UTC
Namenode Scalability
In my current project we are planning to stream data to the Namenode (20-node cluster).
Data volume would be around 1 PB per day.
But there are applications that can publish data at 1GBPS.
A few queries:
1. Can a single Namenode handle such high-speed writes? Or does it become unresponsive when a GC cycle kicks in?
2. Can we have multiple federated Namenodes sharing the same slaves, so that we can distribute the writes accordingly?
3. Can multiple HBase region servers help us?
Please suggest how we can design the streaming part to handle data at this scale.
Regards,
Jagaran Das
Re: Namenode Scalability
Posted by Steve Loughran <st...@apache.org>.
On 10/08/2011 08:58, jagaran das wrote:
> In my current project we are planning to stream data to the Namenode (20-node cluster).
> Data volume would be around 1 PB per day.
> But there are applications that can publish data at 1GBPS.
That's Gigabyte/s or Gigabit/s?
>
> Few queries:
>
> 1. Can a single Namenode handle such high-speed writes? Or does it become unresponsive when a GC cycle kicks in?
see below
> 2. Can we have multiple federated Namenodes sharing the same slaves, so that we can distribute the writes accordingly?
that won't solve your problem
> 3. Can multiple HBase region servers help us?
no
>
> Please suggest how we can design the streaming part to handle such scale of data.
Data is written to datanodes, not namenodes. The NN is used to set up
the write chain and then just tracks node health -the data does not go
through it.
This changes your problem to one of:
-can the NN set up write chains at the speed you want, or do you need
to throttle back the file creation rate by writing bigger files?
-can the NN handle the (file x block count) volumes you expect?
-what is the network traffic of the data ingress?
-what is the total bandwidth of the replication traffic combined with
the data ingress traffic?
-do you have enough disks for the data?
-do your HDDs have enough bandwidth?
-do you want to do any work with the data, and what CPU/HDD/net load
does this generate?
-what impact will disk & datanode replication traffic have?
-how much of the backbone will you have to allocate to the rebalancer?
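Several of those bandwidth questions can be sanity-checked before buying anything. A back-of-envelope sketch, assuming the HDFS default 3x replication and treating 1 PB as 10^15 bytes (the 20-node count comes from the original question):

```python
# Back-of-envelope: can 20 datanodes absorb 1 PB/day?
PB = 10**15
DAY_S = 24 * 60 * 60
NODES = 20
REPLICATION = 3                 # HDFS default replication factor

ingest_bps = PB / DAY_S                        # bytes/s arriving from clients
cluster_write_bps = ingest_bps * REPLICATION   # bytes/s hitting disks, all replicas
per_node_write_bps = cluster_write_bps / NODES

print(f"ingest:              {ingest_bps / 1e9:.2f} GB/s (~{ingest_bps * 8 / 1e9:.0f} Gbit/s)")
print(f"cluster disk write:  {cluster_write_bps / 1e9:.2f} GB/s")
print(f"per-node disk write: {per_node_write_bps / 1e9:.2f} GB/s "
      f"(~{per_node_write_bps * 8 / 1e9:.0f} Gbit/s arriving at each NIC)")
```

That is roughly 11.6 GB/s of ingest and about 1.7 GB/s of sustained disk write per node before any read, shuffle, or rebalancer traffic, which is well beyond a 2011-era 10GbE NIC and a handful of spindles.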
At 1 PB/day, ignoring all network issues, you will reach the current
documented HDFS limits within four weeks. What are you going to do then,
or will you have processed the data down by that point?
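To get a feel for how fast the namespace grows, here is a hedged sketch: the 64 MB block size was the default of the era, and "roughly 1 GB of NN heap per million blocks" is a commonly quoted rule of thumb, not a documented limit. It also shows why writing bigger files throttles the file-creation rate at the NN:

```python
# Rough NN metadata growth at 1 PB/day, and how file size changes the load.
PB = 10**15
DAY_S = 86400
BLOCK = 64 * 2**20                      # 64 MB, the era's default block size

for file_size in (100 * 2**20, 2**30):  # 100 MB files vs 1 GB files
    files_per_day = PB // file_size
    print(f"{file_size / 2**20:.0f} MB files: "
          f"{files_per_day:,} creates/day (~{files_per_day / DAY_S:.0f}/s at the NN)")

blocks_per_day = PB // BLOCK
# Commonly quoted rule of thumb: ~1 GB of NN heap per million blocks.
print(f"~{blocks_per_day / 1e6:.1f}M new blocks/day "
      f"-> ~{blocks_per_day / 1e6:.0f} GB of extra NN heap per day")
```

Going from 100 MB files to 1 GB files cuts the create-RPC rate tenfold, but the block count (and hence heap growth) is set by the block size, so the namespace still grows by roughly 15 million blocks a day.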
I could imagine some experiments you could conduct against a namenode to
see what its limits are, but there are a lot of datacentre bandwidth and
computation details you have to worry about above and beyond datanode
performance issues.
Like Michel says, 1 PB/day sounds like a homework project, especially
if you haven't used Hadoop at smaller scale. If it is homework, once
you've done the work (and submitted it), it'd be nice to see the final
paper.
If it is something you plan to take live, well, there are lots of issues
to address, of which the NN is just one -and one you can test in
advance. Ramping up the cluster with different loads will teach you more
about the bottlenecks. Otherwise: there are people who know how to run
Hadoop at scale who, in exchange for money, will help you.
-steve
Re: Namenode Scalability
Posted by Steve Loughran <st...@apache.org>.
On 17/08/11 08:48, Dieter Plaetinck wrote:
> Hi,
>
> On Wed, 10 Aug 2011 13:26:18 -0500
> Michel Segel<mi...@hotmail.com> wrote:
>
>> This sounds more like a homework assignment than a real-world problem.
>
> Why? just wondering.
The question proposed a data rate comparable with Yahoo, Google and
Facebook --yet it was ingress rather than egress, which was even more
unusual. You'd have to be doing a web-scale search engine to need that
data rate -and if you were doing that, you'd need to know a lot more
about how Hadoop works (i.e. the limited role of the NN). You'd also
have to have addressed the entire network infrastructure, the cost of
the work on your external systems, DNS load, power budget. Oh, and
unless you were processing and discarding those PB at the rate of
ingress, you'd need to add a new Hadoop cluster at a rate of 1
cluster/month, which is not only expensive -I don't think datacentre
construction rates could handle it, even if your server vendor had set
up a construction/test pipeline to ship down an assembled and tested
containerised cluster every few weeks (which we can do, incidentally :)
>
>> I guess people don't race cars against trains or have two trains
>> traveling in different directions anymore... :-)
>
> huh?
Different Homework questions.
Re: Namenode Scalability
Posted by Dieter Plaetinck <di...@intec.ugent.be>.
Hi,
On Wed, 10 Aug 2011 13:26:18 -0500
Michel Segel <mi...@hotmail.com> wrote:
> This sounds more like a homework assignment than a real-world problem.
Why? just wondering.
> I guess people don't race cars against trains or have two trains
> traveling in different directions anymore... :-)
huh?
Re: Namenode Scalability
Posted by jagaran das <ja...@yahoo.co.in>.
What would cause the NameNode to have a GC issue?
- I am opening at most 5000 connections and writing continuously through those 5000 connections to 5000 files at a time.
- The volume of data I would write through those 5000 connections cannot be controlled, as it depends on the upstream applications that publish it.
Now, if the heap memory nears its full size (say M GB) and a major GC cycle kicks in, the NameNode could stop responding for some time.
This "stop the world" time should be roughly proportional to the heap size.
This may cause data to be backlogged in the streaming application's memory.
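To put a crude number on that heap, here is a hedged sizing sketch: the ~150 bytes per namespace object figure is the commonly quoted Hadoop rule of thumb, and the file and block counts below are hypothetical illustrations, not measurements:

```python
# Rough NN heap for a given namespace size: ~150 bytes per
# file/directory/block object is the commonly quoted rule of thumb.
BYTES_PER_OBJECT = 150

def nn_heap_gb(files, blocks_per_file):
    objects = files * (1 + blocks_per_file)   # one inode plus its blocks
    return objects * BYTES_PER_OBJECT / 2**30

# Hypothetical: one day of 1 GB files cut into 64 MB blocks,
# i.e. ~1M files of 16 blocks each (~17M namespace objects).
print(f"~{nn_heap_gb(1_000_000, 16):.1f} GB of heap for that day's metadata")
```

The 5000 concurrent writers are not themselves the heap problem; it is the accumulated file and block objects that grow the heap, and with it the worst-case full-GC pause.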
As for our architecture:
it has a cluster of JMS queues, and we have a multithreaded application that picks messages off the queue and streams them to the NameNode of a 20-node cluster
using the exposed FileSystem API.
BTW, in the real world, if you have a fast car you can race and win against a slow train; it all depends on what reference frame you are in :)
Regards,
Jagaran
________________________________
From: Michel Segel <mi...@hotmail.com>
To: "common-user@hadoop.apache.org" <co...@hadoop.apache.org>
Cc: "common-user@hadoop.apache.org" <co...@hadoop.apache.org>; jagaran das <ja...@yahoo.co.in>
Sent: Wednesday, 10 August 2011 11:26 AM
Subject: Re: Namenode Scalability
So many questions, why stop there?
First question... What would cause the name node to have a GC issue?
Second question... You're streaming 1PB a day. Is this a single stream of data?
Are you writing this to one file before processing, or are you processing the data directly on the ingestion stream?
Are you also filtering the data so that you are not saving all of the data?
This sounds more like a homework assignment than a real-world problem.
I guess people don't race cars against trains or have two trains traveling in different directions anymore... :-)
Sent from a remote device. Please excuse any typos...
Mike Segel
On Aug 10, 2011, at 12:07 PM, jagaran das <ja...@yahoo.co.in> wrote:
> To be precise, the projected data is around 1 PB.
> But the publishing rate is also around 1GBPS.
>
> Please suggest.
Re: Namenode Scalability
Posted by Michel Segel <mi...@hotmail.com>.
So many questions, why stop there?
First question... What would cause the name node to have a GC issue?
Second question... You're streaming 1PB a day. Is this a single stream of data?
Are you writing this to one file before processing, or are you processing the data directly on the ingestion stream?
Are you also filtering the data so that you are not saving all of the data?
This sounds more like a homework assignment than a real-world problem.
I guess people don't race cars against trains or have two trains traveling in different directions anymore... :-)
Sent from a remote device. Please excuse any typos...
Mike Segel
On Aug 10, 2011, at 12:07 PM, jagaran das <ja...@yahoo.co.in> wrote:
> To be precise, the projected data is around 1 PB.
> But the publishing rate is also around 1GBPS.
>
> Please suggest.
Re: Namenode Scalability
Posted by jagaran das <ja...@yahoo.co.in>.
To be precise, the projected data volume is around 1 PB per day.
But the publishing rate is also around 1GBPS.
Please suggest.
Re: Namenode Scalability
Posted by Mathias Herberts <ma...@gmail.com>.
Just curious, what are the tech specs of your datanodes, to accommodate 1 PB/day
on 20 nodes?
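To see why the specs matter, a rough sketch: the 2 TB drive size is a hypothetical figure for a large 2011-era disk, and 3x is the HDFS default replication:

```python
# What 1 PB/day means for 20 datanodes' disks, at 3x replication.
PB = 10**15
NODES = 20
REPLICATION = 3                 # HDFS default
DRIVE = 2 * 10**12              # hypothetical 2 TB drive (a big 2011 disk)

per_node_per_day = PB * REPLICATION / NODES
print(f"{per_node_per_day / 1e12:.0f} TB landing on each node per day, "
      f"i.e. ~{per_node_per_day / DRIVE:.0f} new 2 TB drives per node, per day")
```

That is 150 TB per node per day of raw capacity, which no 2011-era chassis could hold for even a single day, let alone week after week.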
On Aug 10, 2011 10:12 AM, "jagaran das" <ja...@yahoo.co.in> wrote:
> In my current project we are planning to stream data to the Namenode (20-node cluster).
> Data volume would be around 1 PB per day.
> But there are applications that can publish data at 1GBPS.
>
> A few queries:
>
> 1. Can a single Namenode handle such high-speed writes? Or does it become unresponsive when a GC cycle kicks in?
> 2. Can we have multiple federated Namenodes sharing the same slaves, so that we can distribute the writes accordingly?
> 3. Can multiple HBase region servers help us?
>
> Please suggest how we can design the streaming part to handle data at this scale.
>
> Regards,
> Jagaran Das