Posted to common-user@hadoop.apache.org by Bradford Stephens <br...@gmail.com> on 2009/05/05 04:44:31 UTC

What do we call Hadoop+HBase+Lucene+Zookeeper+etc....

Hey all,

I'm going to be speaking at OSCON about my company's experiences with
Hadoop and Friends, but I'm having a hard time coming up with a name
for the entire software ecosystem. I'm thinking of calling it the
"Apache CloudStack". Does this sound legit to you all? :) Is there
something more 'official'?

Cheers,
Bradford

Re: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....

Posted by Steve Loughran <st...@apache.org>.

Edward Capriolo wrote:
> 'cloud computing' is a hot term. According to the definition provided
> by wikipedia http://en.wikipedia.org/wiki/Cloud_computing,
> Hadoop+HBase+Lucene+Zookeeper, fits some of the criteria but not well.
> 
> Hadoop is scalable, with HOD it is dynamically scalable.
> 
> I do not think (Hadoop+HBase+Lucene+Zookeeper) can be used for
> 'utility computing', as managing the stack and getting started is
> quite a complex process.

Exactly. Which is why the Apache Clouds proposal emphasises

-Lightweight front end: low Wattage, stateless nodes for web GUI, bonded 
to the back end

-instrumentation for liveness and load monitoring. Hadoop has a lot of 
this, I'm trying to add more, but we want it everywhere.

-Resource Management: bringing up and tearing down nodes by asking the 
infrastructure. Some Apache projects have done this but only for EC2 and 
only for their layer of the stack. You need something that keeps track 
of everything and acts in your interests, not those of the datacentre 
provider

-Packaging for fully automated install/deploy on Linux systems (=rpm and 
deb)

-A development process in which the tools push the code out to a 
targeted infrastructure even for test runs

Hadoop and friends are part of this, they are a very interesting 
foundation, but they are only part of the story
> 
> Also this stack is best running on LAN network with high speed
> interlinks. Historically the "Cloud" is composed of WAN links. An
> implication of Cloud Computing is that different services would be
> running in different geographical locations which is not how hadoop is
> normally deployed.
> 
> I believe 'Apache Grid Stack' would be more fitting.
> 
> http://en.wikipedia.org/wiki/Grid_computing
> 
> Grid computing (or the use of computational grids) is the application
> of several computers to a single problem at the same time — usually to
> a scientific or technical problem that requires a great number of
> computer processing cycles or access to large amounts of data.

Classic Grid computing - OGSI/OGSA is something I want to steer clear 
of. Historically, you end up in WS-* and computer management politics. 
Furthermore, OGSA never had a good use case except "rewrite your apps 
for the cloud and they will be better". They (let's be fair, we) also 
focused too much on CPU scheduling, not on storage.

> Grid computing via the Wikipedia definition describes exactly what
> hadoop does. Without amazon S3 and EC2 hadoop does not fit well into a
> 'cloud computing' IMHO

To be precise: without a dynamic infrastructure provider that is more 
than just AWS: it could be Sun/Oracle, IBM/google, HP/Intel/Yahoo!, it 
could be your ops team and Eucalyptus.

The other hardware/service vendors are working on this infrastructure. 
Apache doesn't work at that level, but if we provide the code to run on 
all of them, we give users independence from any particular 
infrastructure provider.

Re: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....

Posted by Edward Capriolo <ed...@gmail.com>.

'cloud computing' is a hot term. According to the definition provided
by wikipedia http://en.wikipedia.org/wiki/Cloud_computing,
Hadoop+HBase+Lucene+Zookeeper, fits some of the criteria but not well.

Hadoop is scalable, with HOD it is dynamically scalable.

I do not think (Hadoop+HBase+Lucene+Zookeeper) can be used for
'utility computing', as managing the stack and getting started is
quite a complex process.

Also this stack is best running on LAN network with high speed
interlinks. Historically the "Cloud" is composed of WAN links. An
implication of Cloud Computing is that different services would be
running in different geographical locations which is not how hadoop is
normally deployed.

I believe 'Apache Grid Stack' would be more fitting.

http://en.wikipedia.org/wiki/Grid_computing

Grid computing (or the use of computational grids) is the application
of several computers to a single problem at the same time — usually to
a scientific or technical problem that requires a great number of
computer processing cycles or access to large amounts of data.

Grid computing via the Wikipedia definition describes exactly what
hadoop does. Without amazon S3 and EC2 hadoop does not fit well into a
'cloud computing' IMHO

RE: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....

Posted by Ricky Ho <rh...@adobe.com>.

The slide deck talks about possible bundling of various existing Apache technologies in distributed systems as well as some Java API to access Amazon cloud services.

What hasn't been discussed is the difference between a "traditional distributed architecture" and "the cloud".  They are close, but not close enough to be treated as the same.  In my opinion, some of the distributed technology in Apache needs to be enhanced in order to fit into the cloud more effectively.

Let me focus on some cloud characteristics that our existing Apache distributed technologies haven't been paying attention to: extreme elasticity, trust boundary, and cost awareness.

Extreme elasticity
===================
Most distributed technologies treat machine shutdown/startup as a relatively infrequent operation and haven't tried hard to minimize the cost of handling it.  Take Hadoop as an example: although it handles machine crashes gracefully, it doesn't handle a cloud-bursting scenario well (i.e. when a lot of machines are added to a Hadoop cluster).  You need to run a data redistribution task in the background, which slows down your existing jobs.

Another example is that many scripts in Hadoop rely on config files that specify each cluster member's IP address.  In a cloud environment, IP addresses are unstable, so we need a discovery mechanism and also need to rework the scripts.
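
To make the discovery idea concrete, here is a minimal sketch of what such a mechanism might look like. It is not Hadoop code: the class name, the TTL value and the node IDs are all hypothetical, and a real deployment would more likely build on something like ZooKeeper ephemeral nodes than on an in-memory dictionary. The point is only that members are looked up by a stable ID and their current address is learned from heartbeats rather than read from a static config file.

```python
import time


class MemberRegistry:
    """Toy in-memory membership registry keyed by a stable node ID.

    Hypothetical illustration only; not part of Hadoop or ZooKeeper.
    """

    def __init__(self, ttl_seconds=10.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._members = {}  # node_id -> (address, time of last heartbeat)

    def heartbeat(self, node_id, address):
        """Register a node or refresh it; the address may change between calls."""
        self._members[node_id] = (address, self.clock())

    def live_members(self):
        """Return {node_id: address} for nodes heard from within the TTL."""
        now = self.clock()
        return {
            node_id: address
            for node_id, (address, seen) in self._members.items()
            if now - seen <= self.ttl_seconds
        }
```

A script that currently iterates over a list of IP addresses would instead iterate over `live_members()`, so a node that comes back with a new address is picked up on its next heartbeat and a node that disappears simply ages out.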

Trust boundary
===============
Most distributed technologies assume a homogeneous environment (every member has the same degree of trust), which is not the case in a cloud environment.  Additional processing (cryptographic operations for data transfer and storage) may be necessary when dealing with machines running in the cloud.

Cost awareness
===============
For the same reason (they assume a homogeneous environment), schedulers are not aware of the cost involved when they move data across the cloud boundary, where bandwidth cost is relatively high.  The Hadoop MapReduce scheduler needs to be more sophisticated when deciding where to start Mappers and Reducers.  Similarly, when making replica placement decisions, HDFS needs to be aware of which machine is located in which cloud.
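
As a rough illustration of what "cost aware" could mean for replica placement, the sketch below ranks candidate nodes by the bandwidth cost of copying a block to them from the writer's cloud. Everything here is an assumption made for the example: the function names, the cloud labels and the per-GB prices are invented, traffic inside one cloud is treated as free, and this is not how HDFS actually chooses replica targets.

```python
def transfer_cost(src_cloud, dst_cloud, gigabytes, price_per_gb):
    """Cost of moving data; traffic within one cloud is treated as free."""
    if src_cloud == dst_cloud:
        return 0.0
    return gigabytes * price_per_gb[(src_cloud, dst_cloud)]


def place_replicas(writer_cloud, nodes, replicas, gigabytes, price_per_gb):
    """Pick the `replicas` nodes that are cheapest to copy to from the writer.

    `nodes` maps a node name to the label of the cloud it runs in.
    Ties are broken by node name so the result is deterministic.
    """
    ranked = sorted(
        nodes,
        key=lambda name: (
            transfer_cost(writer_cloud, nodes[name], gigabytes, price_per_gb),
            name,
        ),
    )
    return ranked[:replicas]


# Hypothetical cluster spanning a private data centre and EC2.
nodes = {"a1": "private", "a2": "private", "b1": "ec2"}
prices = {("private", "ec2"): 0.10, ("ec2", "private"): 0.10}  # made-up $/GB
print(place_replicas("ec2", nodes, 2, 1.0, prices))
```

A real policy would have to trade this cost against fault tolerance (for example, by requiring at least one replica outside the writer's cloud), which a pure cheapest-first ranking ignores.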

That said, I am not discounting the existing Apache technology.  In fact, we have already taken a good step.  We just need to go further.

Rgds,
Ricky

-----Original Message-----
From: Bradford Stephens [mailto:bradfordstephens@gmail.com] 
Sent: Tuesday, May 05, 2009 9:53 AM
To: core-user@hadoop.apache.org
Subject: Re: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....

I read through the deck and sent it around the company. Good stuff!
It's going to be a big help for trying to get the .NET Enterprise
people wrapping their heads around web-scale data.

I must admit "Apache Cloud Computing Edition" is sort of unwieldy to
say verbally, and frankly "Java Enterprise Edition" is a taboo phrase
at a lot of projects I've had. Guilt by association. I think I'll call
it "Apache Cloud Stack", and reference "Apache Cloud Computing
Edition" in my deck. When I think "Stack", I think of a suite of
software that provides all the pieces I need to solve my problem :)

On Tue, May 5, 2009 at 7:00 AM, Steve Loughran <st...@apache.org> wrote:
> Bradford Stephens wrote:
>>
>> Hey all,
>>
>> I'm going to be speaking at OSCON about my company's experiences with
>> Hadoop and Friends, but I'm having a hard time coming up with a name
>> for the entire software ecosystem. I'm thinking of calling it the
>> "Apache CloudStack". Does this sound legit to you all? :) Is there
>> something more 'official'?
>
> We've been using "Apache Cloud Computing Edition" for this, to emphasise
> this is the successor to Java Enterprise Edition, and that it is cross
> language and being built at apache. If you use the same term, even if you
> put a different stack outline than us, it gives the idea more legitimacy.
>
> The slides that Andrew linked to are all in SVN under
> http://svn.apache.org/repos/asf/labs/clouds/
>
> we have a space in the apache labs for "apache clouds", where we want to do
> more work integrating things, and bringing the idea of deploy and test on
> someone else's infrastructure mainstream across all the apache products. We
> would welcome your involvement -and if you send a draft of your slides out,
> will happily review them
>
> -steve
>

Re: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....

Posted by Steve Loughran <st...@apache.org>.

Bradford Stephens wrote:
> Hey all,
> 
> I'm going to be speaking at OSCON about my company's experiences with
> Hadoop and Friends, but I'm having a hard time coming up with a name
> for the entire software ecosystem. I'm thinking of calling it the
> "Apache CloudStack". Does this sound legit to you all? :) Is there
> something more 'official'?

We've been using "Apache Cloud Computing Edition" for this, to emphasise 
this is the successor to Java Enterprise Edition, and that it is cross 
language and being built at apache. If you use the same term, even if 
you put a different stack outline than us, it gives the idea more 
legitimacy.

The slides that Andrew linked to are all in SVN under
http://svn.apache.org/repos/asf/labs/clouds/

we have a space in the apache labs for "apache clouds", where we want to 
do more work integrating things, and bringing the idea of deploy and 
test on someone else's infrastructure mainstream across all the apache 
products. We would welcome your involvement -and if you send a draft of 
your slides out, will happily review them

-steve

Re: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....

Posted by Andrew Purtell <ap...@apache.org>.

Hi Bradford,

Your mail reminds me of something I recently came across:

http://svn.apache.org/repos/asf/labs/clouds/apache_cloud_computing_edition.pdf

Perhaps if you have slides accompanying your talk, you might
consider making them publicly available. I for one would
love to see them.

Best regards,

    - Andy

> From: Bradford Stephens
> Subject: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....
> Date: Monday, May 4, 2009, 7:44 PM
> Hey all,
> 
> I'm going to be speaking at OSCON about my company's
> experiences with Hadoop and Friends, but I'm having a
> hard time coming up with a name for the entire software
> ecosystem. I'm thinking of calling it the "Apache
> CloudStack". Does this sound legit to you all? :) Is
> there something more 'official'?
> 
> Cheers,
> Bradford

