Posted to general@hadoop.apache.org by Keith Wiley <kw...@keithwiley.com> on 2012/09/20 20:06:59 UTC

Choosing a Hadoop distribution

I'm tasked with creating a guide on how to choose a Hadoop distribution from the handful of common options.  I'm finding this rather perplexing.  While some of the vendors offer additional management software (Cloudera Manager is an example), I'm unclear whether those packages could be installed and run regardless of the underlying Hadoop distribution, or if they are exclusively compatible with their vendor's distribution (or if there's some crossover).  I'm also unclear on any other basis for comparison.  For example, Hortonworks originated HCatalog (to the best of my understanding), but that doesn't necessarily mean one needs to use the Hortonworks Hadoop distribution to use HCatalog, since it's just a public Apache project anyway at this point.  I'm sure similar statements could be made about MapR or Greenplum (although I think Greenplum's Hadoop uses MapR's M5 anyway, so again, the decision-making process in such a case seems baffling).

And then there's the option of installing the Apache version directly, always on the table I suppose.

Does anyone have any thoughts on what criteria might govern such a decision?  I'm not trying to get into an argument about which distribution is best; I'm not even looking for defenses or arguments for one distribution or another, but rather a notion of what the criteria for such a decision might be.

Thanks.

Cheers!

________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"It's a fine line between meticulous and obsessive-compulsive and a slippery
rope between obsessive-compulsive and debilitatingly slow."
                                           --  Keith Wiley
________________________________________________________________________________


Re: Choosing a Hadoop distribution

Posted by Ted Dunning <td...@maprtech.com>.
Forrester did this, more or less.

http://www.forrester.com/The+Forrester+Wave+Enterprise+Hadoop+Solutions+Q1+2012/fulltext/-/E-RES60755

They want money for this.

On Mon, Sep 24, 2012 at 3:29 AM, Christian Schäfer <sy...@yahoo.de> wrote:

> I think a good starting point for that distribution guide would be a
> feature matrix where all reasonable distributions could be compared.
>
> There could be metrics for cross-cutting concerns like performance,
> security, etc., referring to real benchmarks.
> From this one could derive (maybe with additional explanations) which
> distribution best fits a certain use case.
>
> Most important, though, is that this comparison is not biased but
> independent.
>
> regards
> Chris

Re: WG: Choosing a Hadoop distribution

Posted by Steve Loughran <st...@gmail.com>.
On 24 September 2012 14:41, Marcos Ortiz <ml...@uci.cu> wrote:

>
> On 09/24/2012 06:29 AM, Christian Schäfer wrote:
>
>> I think a good starting point for that distribution guide would be a
>> feature matrix where all reasonable distributions could be compared.
>>
> +1  for this idea
> I think that this feature matrix will be on the Hadoop wiki.
>
>
gets too controversial

I wouldn't be completely dismissive of Apache Hadoop 1.0.3; it went through
the large cluster QA by the QA team at Hortonworks (disclaimer: my
colleagues); the 1.x branch is going to be long-lived and is in use in
production.


>
>>
>> There could be metrics for cross-cutting concerns like performance,
>> security, etc., referring to real benchmarks.
>> From this one could derive (maybe with additional explanations) which
>> distribution best fits a certain use case.
>>
> Umm, this is tricky. How can we decide which is the best fit for a certain
> type of problem?
> My suggestion is to avoid this, because it will bring some heated
> discussions and that's not the idea.
> It's my personal opinion.
>

What would be good would be more traces of real-world cluster use, stuff
that can be fed into the GridMix 3 benchmarker
[http://developer.yahoo.com/blogs/hadoop/posts/2010/04/gridmix3_emulating_production/].
If your workload gets pulled into the performance tests used by the
Hadoop development teams...

Re: WG: Choosing a Hadoop distribution

Posted by Marcos Ortiz <ml...@uci.cu>.
On 09/24/2012 06:29 AM, Christian Schäfer wrote:
> I think a good starting point for that distribution guide would be a feature matrix where all reasonable distributions could be compared.
+1 for this idea
I think that this feature matrix will be on the Hadoop wiki.

>
> There could be metrics for cross-cutting concerns like performance, security, etc., referring to real benchmarks.
> From this one could derive (maybe with additional explanations) which distribution best fits a certain use case.
Umm, this is tricky. How can we decide which is the best fit for a
certain type of problem?
My suggestion is to avoid this, because it will bring some heated
discussions and that's not the idea.
It's my personal opinion.
>
> Most important, though, is that this comparison is not biased but independent.
>
> regards
> Chris

-- 

Marcos Luis Ortíz Valmaseda
*Data Engineer && Sr. System Administrator at UCI*
about.me/marcosortiz <http://about.me/marcosortiz>
My Blog <http://marcosluis2186.posterous.com>
Tumblr's blog <http://marcosortiz.tumblr.com/>
@marcosluis2186 <http://twitter.com/marcosluis2186>



10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSIDAD DE LAS CIENCIAS INFORMATICAS...
CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION

http://www.uci.cu
http://www.facebook.com/universidad.uci
http://www.flickr.com/photos/universidad_uci

WG: Choosing a Hadoop distribution

Posted by Christian Schäfer <sy...@yahoo.de>.
I think a good starting point for that distribution guide would be a feature matrix where all reasonable distributions could be compared.


There could be metrics for cross-cutting concerns like performance, security, etc., referring to real benchmarks.
From this one could derive (maybe with additional explanations) which distribution best fits a certain use case.

Most important, though, is that this comparison is not biased but independent.
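
A minimal sketch of what such a feature matrix could look like in practice. Everything in it is an assumption for illustration: the distribution names (dist_a, dist_b), the criteria, the weights, and the 0-5 scores are placeholders, not results from any real benchmark or vendor.

```python
# Hypothetical weights expressing one use case's priorities (assumed, not measured).
WEIGHTS = {"performance": 0.4, "security": 0.3, "management_tools": 0.2, "openness": 0.1}

# Placeholder scores on a 0-5 scale; replace with figures from real benchmarks.
FEATURE_MATRIX = {
    "dist_a": {"performance": 4, "security": 3, "management_tools": 5, "openness": 2},
    "dist_b": {"performance": 3, "security": 4, "management_tools": 3, "openness": 5},
}


def weighted_score(scores, weights):
    """Return the weighted sum of one distribution's per-criterion scores."""
    return sum(weights[criterion] * scores[criterion] for criterion in weights)


def rank(matrix, weights):
    """Return (name, score) pairs sorted from highest to lowest weighted score."""
    ranked = [(name, weighted_score(scores, weights)) for name, scores in matrix.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank(FEATURE_MATRIX, WEIGHTS):
        print(f"{name}: {score:.2f}")
```

The weights stand in for the use case: a team that cares mostly about openness would pass different weights and could get a different ranking from the same matrix, so the matrix itself can stay neutral while the conclusion depends on the reader's priorities.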

regards
Chris



Re: Choosing a Hadoop distribution

Posted by Keith Wiley <kw...@keithwiley.com>.
Thanks, that all seems quite reasonable I suppose.

Cheers!

On Sep 20, 2012, at 11:22 , Aaron Eng wrote:

>> I'm tasked with creating a guide that instructs on how to choose a Hadoop
>> distribution from the handful of common options.
>> Does anyone have any thoughts on what criteria might govern such a
>> decision?
> 
> What problem(s) are you trying to solve with Hadoop (and related projects)?
> What are your expectations of the technology?
> 
> The details beyond that level could take many, many pages to cover.
> 
> Not all Hadoop distributions are tested the same way, packaged with the
> same components, etc.  Not all components of a given Hadoop distribution
> work with other Hadoop distributions.  There are a lot of common things
> between distributions, which is probably why it's difficult to articulate how
> to choose one over another.  So when you look at the problem you are
> trying to solve and your expectations of the technology, many things may
> seem relatively equal and hence you may need to get into some significant
> level of detail to pick something that best solves your problem.  In some
> cases it may be very straightforward as to whether a distribution will meet
> your requirements.  In other cases, things may look relatively equal across
> the board until you drill down to a point where you find differentiation
> (or maybe you don't find it).  But those would be my criteria: articulate the
> problem and expectations and compare functionality until you find
> differentiation.


________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"And what if we picked the wrong religion?  Every week, we're just making God
madder and madder!"
                                           --  Homer Simpson
________________________________________________________________________________


Re: Choosing a Hadoop distribution

Posted by Nael Mohammad <li...@yahoo.com>.
Just recently VMware announced Project Serengeti, which is an open source
OVA based on Apache Hadoop with HDFS, MapReduce, Pig, and Hive, to name a
few. I believe it requires vSphere to use the OVA.


http://serengeti.cloudfoundry.com/

GitHub source:
https://github.com/vmware-serengeti


-nael



Re: Choosing a Hadoop distribution

Posted by Konstantin Boudnik <co...@apache.org>.
I would add a couple more points for your consideration (maybe this is just
me):
  - vendor lock-in:

    - when you pick software, make sure that you'd be able to move over to a
      different (yet similar) product offering if you need to.  You are asking
      about CDH's CM here: I don't think it would work with anything else but
      CDH (I am not working there, so I don't know for sure - but it seems
      like a reasonable assumption).

    - HW's HDP provides Ambari for cluster management needs; that is a
      completely open source technology that you can master if needed and most
      likely use with other stacks based on Hadoop (as far as I can see).

    - MapR has quite a few proprietary components in their stack, which
      might or might not be beneficial in your particular case: this is
      something you have to decide for yourself.

  - what is the roadmap of each candidate distribution? Will it have what you
    need in the future? A case in point is these guys
        http://www.magnatempusgroup.net/blog/2012/09/05/whats-cooking/
    who are seemingly bringing in-memory analytics into their upcoming
    release. You might want to follow a big Hadoop conference next month
    that's likely to have a number of interesting announcements (otherwise,
    what would be the point of such a conference ;)

These two would be pivotal points for me. Hope it helps,
  Cos


Re: Choosing a Hadoop distribution

Posted by hadoop <ha...@gmail.com>.
I have the same question.
Which version and which vendor should we choose?


--  
hadoop


On Friday, September 21, 2012 at 2:22 AM, Aaron Eng wrote:

> > I'm tasked with creating a guide that instructs on how to choose a Hadoop
> > distribution from the handful of common options.
> > Does anyone have any thoughts on what criteria might govern such a
> > decision?
>  
> What problem(s) are you trying to solve with Hadoop (and related projects)?
> What are your expectations of the technology?
>  
> The details beyond that level could take many, many pages to cover.
>  
> Not all Hadoop distributions are tested the same way, packaged with the
> same components, etc. Not all components of a given Hadoop distribution
> work with other Hadoop distributions. There are a lot of common things
> between distributions, which is probably why it's difficult to articulate how
> to choose one over another. So when you look at the problem you are
> trying to solve and your expectations of the technology, many things may
> seem relatively equal and hence you may need to get into some significant
> level of detail to pick something that best solves your problem. In some
> cases it may be very straightforward as to whether a distribution will meet
> your requirements. In other cases, things may look relatively equal across
> the board until you drill down to a point where you find differentiation
> (or maybe you don't find it). But those would be my criteria: articulate the
> problem and expectations and compare functionality until you find
> differentiation.


Re: Choosing a Hadoop distribution

Posted by Aaron Eng <ae...@maprtech.com>.
> I'm tasked with creating a guide that instructs on how to choose a Hadoop
> distribution from the handful of common options.
> Does anyone have any thoughts on what criteria might govern such a
> decision?

What problem(s) are you trying to solve with Hadoop (and related projects)?
What are your expectations of the technology?

The details beyond that level could take many, many pages to cover.

Not all Hadoop distributions are tested the same way, packaged with the
same components, etc.  Not all components of a given Hadoop distribution
work with other Hadoop distributions.  There are a lot of common things
between distributions, which is probably why it's difficult to articulate how
to choose one over another.  So when you look at the problem you are
trying to solve and your expectations of the technology, many things may
seem relatively equal and hence you may need to get into some significant
level of detail to pick something that best solves your problem.  In some
cases it may be very straightforward as to whether a distribution will meet
your requirements.  In other cases, things may look relatively equal across
the board until you drill down to a point where you find differentiation
(or maybe you don't find it).  But those would be my criteria: articulate the
problem and expectations and compare functionality until you find
differentiation.


