Posted to common-user@hadoop.apache.org by Adarsh Sharma <ad...@orkash.com> on 2010/11/11 12:02:30 UTC

Deficiency in Hadoop

Dear all,

Does anyone have experience with integrating Hadoop with SGE 
(Sun Grid Engine)? It is open-source too (sge-6.2u5).
Does SGE really overcome some of the deficiencies of Hadoop?
According to an article:

Instead, to set the stage, let's talk about what Hadoop doesn't do so 
well. I currently see two important deficiencies in Hadoop: it doesn't 
play well with others, and it has no real accounting framework. Pretty 
much every customer I've seen running Hadoop does it on a dedicated 
cluster. Why? Because the tasktrackers assume they own the machines on 
which they run. If there's anything on the cluster other than Hadoop, 
it's in direct competition with Hadoop. That wouldn't be such a big deal 
if Hadoop clusters didn't tend to be so huge. Folks are dedicating 
hundreds, thousands, or even tens of thousands of machines to their 
Hadoop applications. That's a lot of hardware to be walled off for a 
single purpose. Are those machines really being used? You may not be 
able to tell. You can monitor state in the moment, and you can grep 
through log files to find out about past usage (Gah!), but there's no 
historical accounting capability there.

So I want to know whether it is worthwhile to use SGE with Hadoop in a 
production cluster or not.
Please share your views.

Thanks in Advance
Adarsh Sharma




Re: Deficiency in Hadoop

Posted by Steve Loughran <st...@apache.org>.
On 11/11/10 13:09, Michael Segel wrote:

> The only time you'd want to look at a configurable cluster is if you're doing HoD and you don't need to persist your data sets for long periods of time.

We run virtual private Hadoop clouds against persistent storage for 
various other reasons:
  -it lets us reuse the same machines for other work.
  -it lets people play with Hadoop on a small sample of their data to see 
if it works for their app, without spending any money.

If it does work, that's when we say "you now need a real Hadoop 
cluster". One interesting trend now is that you can buy 1U servers with 
2*12 TB of HDD and 6-12 cores (*). This gives you 1 PB of raw storage 
and the compute to go with it, in under 50 servers, which is incredible 
given how many machines were needed for that even a few years back.
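
A quick back-of-the-envelope check of those figures; reading "2*12 TB" as 
twelve 2 TB drives per node is an assumption on my part:

public class RawStorageCheck {
    public static void main(String[] args) {
        // Assumption: "2*12 TB" above means 12 drives x 2 TB per 1U node,
        // i.e. 24 TB of raw disk per server, before HDFS replication.
        int tbPerNode = 12 * 2;      // 24 TB raw per node
        int servers = 42;            // "under 50 servers"
        System.out.println(tbPerNode * servers + " TB raw, roughly 1 PB");
    }
}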



> There are deficiencies in Hadoop and more in HBase. Yet even with those deficiencies, it's still a good tool set, and over time the deficiencies will be addressed.

+1

-Steve

RE: Deficiency in Hadoop

Posted by Michael Segel <mi...@hotmail.com>.


> Date: Thu, 11 Nov 2010 12:00:49 +0000
> From: stevel@apache.org
> To: common-user@hadoop.apache.org
> Subject: Re: Deficiency in Hadoop
> 

> A permanently allocated set of machines gives you permanent HDFS storage 
> at the cost of SATA HDDs. Once you go to any on-demand infrastructure 
> you need some persistent store, and it tends to lack locality and have a 
> higher cost/GB, usually because it is SAN-based.
> 

One small nit.
Remember that HBase is part of the hadoop infrastructure, and there data locality
is less important, depending on your use case. ;-) Here, locality means disk attached to the same machine versus somewhere else in the cloud.

But in general, I do agree that if you're looking at a virtual HoD model, your infrastructure cost is going to be higher.
SAN disk isn't cheap. Looking at the hardware, a virtual cloud will cost you more. The current price break point is at 32GB of RAM:
go beyond 32GB and it gets expensive, and Intel chips beyond the E5500 (quad-core, dual-socket machines) get expensive too.

Unless you're a Google, Yahoo, ..., density isn't a problem.

The only time you'd want to look at a configurable cluster is if you're doing HoD and you don't need to persist your data sets for long periods of time. 
Take Amazon's cloud (EC2) as an example.

I saw a presentation from Sun about a year ago. I wasn't impressed then; now that they are Oracle... still not impressed.

There are deficiencies in Hadoop and more in HBase. Yet even with those deficiencies, it's still a good tool set, and over time the deficiencies will be addressed.


Re: Deficiency in Hadoop

Posted by Adarsh Sharma <ad...@orkash.com>.
Steve Loughran wrote:
> [...]
Thanks a lot, Steve!
This explanation clears up my other doubts as well.

Best Regards
-Adarsh


Re: Deficiency in Hadoop

Posted by Steve Loughran <st...@apache.org>.
On 11/11/10 11:02, Adarsh Sharma wrote:
> Dear all,
>
> Does anyone have experience with integrating Hadoop with SGE
> (Sun Grid Engine)? It is open-source too (sge-6.2u5).
> Does SGE really overcome some of the deficiencies of Hadoop?
> According to an article:

That'll be DanT's posting
http://blogs.sun.com/templedf/entry/leading_the_herd

>
> Instead, to set the stage, let's talk about what Hadoop doesn't do so
> well. I currently see two important deficiencies in Hadoop: it doesn't
> play well with others, and it has no real accounting framework. Pretty
> much every customer I've seen running Hadoop does it on a dedicated
> cluster. Why? Because the tasktrackers assume they own the machines on
> which they run. If there's anything on the cluster other than Hadoop,
> it's in direct competition with Hadoop. That wouldn't be such a big deal
> if Hadoop clusters didn't tend to be so huge. Folks are dedicating
> hundreds, thousands, or even tens of thousands of machines to their
> Hadoop applications. That's a lot of hardware to be walled off for a
> single purpose. Are those machines really being used? You may not be
> able to tell. You can monitor state in the moment, and you can grep
> through log files to find out about past usage (Gah!), but there's no
> historical accounting capability there.
>
> So I want to know whether it is worthwhile to use SGE with Hadoop in
> a production cluster or not.
> Please share your views.
>

A permanently allocated set of machines gives you permanent HDFS storage 
at the cost of SATA HDDs. Once you go to any on-demand infrastructure 
you need some persistent store, and it tends to lack locality and have a 
higher cost/GB, usually because it is SAN-based.

What on-demand stuff is good for is sharing physical machines, 
because unless you can keep the CPU+RAM in your cluster busy, that's 
wasted CAPEX/OPEX budget.

One thing that's been discussed is to have a physical hadoop cluster, 
but have the TT's capacity reporting work well with other schedulers, 
via some plugin point:

https://issues.apache.org/jira/browse/MAPREDUCE-1603

This would let your cluster also accept work from other job execution 
frameworks and, when busy with that work, report fewer slots to the JT, 
though still serve up data to the rest of the hadoop workers.
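
For concreteness, here is a minimal sketch of what such a plugin point 
might look like. Every name below is made up for illustration; the JIRA 
discusses the idea but doesn't define an API. The rough idea is that the 
TT would consult an advisor like this before each heartbeat and advertise 
the advised counts instead of its configured maximums:

// Hypothetical sketch only: none of these types exist in Hadoop.
import java.lang.management.ManagementFactory;

public interface ExternalSlotAdvisor {
    /** Map slots this TT should advertise to the JT right now, given
     *  whatever an external scheduler (e.g. SGE) is running on the node. */
    int advisedMapSlots(int configuredMapSlots);

    /** The same question for reduce slots. */
    int advisedReduceSlots(int configuredReduceSlots);
}

/** Toy policy: halve the advertised slots whenever the node's load average
 *  is high, as a stand-in for asking the external scheduler what it runs here. */
class LoadBasedSlotAdvisor implements ExternalSlotAdvisor {
    private final double loadThreshold;

    LoadBasedSlotAdvisor(double loadThreshold) {
        this.loadThreshold = loadThreshold;
    }

    private boolean busy() {
        // getSystemLoadAverage() returns -1 where unsupported; treat that as idle.
        return ManagementFactory.getOperatingSystemMXBean()
                .getSystemLoadAverage() > loadThreshold;
    }

    public int advisedMapSlots(int configuredMapSlots) {
        return busy() ? configuredMapSlots / 2 : configuredMapSlots;
    }

    public int advisedReduceSlots(int configuredReduceSlots) {
        return busy() ? configuredReduceSlots / 2 : configuredReduceSlots;
    }
}

How an external scheduler such as SGE would feed real information into 
something like this is the part that still needs doing.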

Benefits:
  -cost of storage is HDFS rates
  -performance of a normal hadoop cluster
  -under-utilised hadoop cluster time can be used by other work 
schedulers, ones that don't need access to the Hadoop storage.

Costs:
  -HDFS security: can you lock it down?
  -your other workloads had better not expect SAN or low-latency 
interconnect like InfiniBand, unless you add them to the cluster too, 
which bumps up costs.

Nobody has implemented this yet, so volunteers to take up their IDE 
against Hadoop 0.23 would be welcome. And yes, I do mean 0.23, that's 
the schedule that would work.

-Steve