You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@bigtop.apache.org by jay vyas <ja...@gmail.com> on 2014/12/07 00:23:30 UTC

What will the next generation of bigtop look like?

hi bigtop !

I thought id start a thread a few vaguely related thoughts i have around
next couple iterations of bigtop.

1) Hive:  How will bigtop to evolve to support it, now that it is much
more  than a mapreduce query wrapper?

2) I wonder wether we should confirm cassandra interoperability of spark in
bigtop distros,

3) Also, as per , https://issues.apache.org/jira/browse/BIGTOP-1561 ---
What about presto ?  Who is interested in supporting it - packaging it
-testing it etc..?  (con) I don't know if its really ready to be in bigtop,
but (pro) i think if there is someone really dedicated to testing its
interop w/ the bigtop stack, that could be great news for us.

*** Now three concrete questions lead to a more interesting question ***

4) in general, i think bigtop can move in one of 3 directions.

  EXPAND ? : Expanding to include new components, with just basic interop,
and let folks evolve their own stacks on top of bigtop on their own.

  CONTRACT+FOCUS ?  Contracting to focus on a lean set of core components,
with super high quality.

  STAY THE COURSE ? Staying the same ~ a packaging platform for just
hadoop's direct ecosystem.

I am intrigued by the idea of A and B both have clear benefits and
costs...  would like to see the opinions of folks --- do we  lean in one
direction or another? What is the criteria for adding a new feature,
package, stack to bigtop?

... Or maybe im just overthinking it and should be spending this time
testing spark for 0.9 release....

Either way, looking forward to some feedback on these thoughts from the
bigtop community !

jay vyas

Re: What will the next generation of bigtop look like?

Posted by Konstantin Boudnik <co...@apache.org>.

On Tue, Dec 23, 2014 at 08:55AM, Andrew Purtell wrote:
> > So, in this sense commercials aren't releasing Apache software, but
> rather its derivatives. I don't see how Bigtop would compete in this field
> neither culturally nor resource wise.
> 
> This. From personal experience, once you start patching Apache releases you
> kick off a snowball of curation that gets larger with every upstream commit
> on each project. The only way to stop the ball rolling is to periodically
> nuke it with a rebase on a Apache release, reset that patch delta on as
> many components as possible to near 0. In practice Bigtop can't do this
> because we don't have the bandwidth. But in principle it is also good in my
> opinion: If the Apache releases cannot stand up a stable and cohesive stack
> then the ecosystem has issues, and Bigtop can provide a corrective
> influence as patches, JIRAs, email.

Same here. Although, when we were doing a Hadoop distribution at WANdisco it
was 100% Apache, e.g. Bigtop stack with a couple of our own components on top
of it. Otherwise, you risk a divergence or a huge effort to simply keep this
business of yours afloat.

Cos

> On Tue, Dec 23, 2014 at 12:36 AM, Konstantin Boudnik <co...@apache.org> wrote:
> >
> > Sorry for being a bit late to this discussion, but I will try to make a
> > very
> > sort summary of what I've read:
> >
> > 1. there's the intention to focus on the vertical value-add,
> >    rather than just a platform [Andrew, et all]
> > 2. Focus more on in-memory technologies (which seemed to be our trend ever
> > since
> >    we added Spark and now Ignite (incubating)). [Jay, Evans, RJ]
> > 3. while many data processing components aren't HDFS centric anymore, the
> >    storage layer still seems to be important for anything related to
> > Hadoop.
> >    Since, I don't think HDFS can be dropped tomorrow.
> >
> > And I don't see these three interfering with each other. To me they are
> > quite
> > complemektary.
> >
> > As for lower appeal of Bigtop stack to commercials as Evans alluded.
> > There's
> > that. And the main reason is that Bigtop releases are always based on
> > official
> > Apache release of upstream components. Wheres non of the Hadoop vendors can
> > say the same. Anything that Cloudera or HortonWorks put out there is
> > Hadoop 2.x
> > + N-patches, where N could be anywhere between 1 and 2000, but is never 0.
> >
> > So, in this sense commercials aren't releasing Apache software, but rather
> > its
> > derivatives. I don't see how Bigtop would compete in this field neither
> > culturally nor resource wise.
> >
> > Cos
> >
> > On Sat, Dec 06, 2014 at 06:23PM, jay vyas wrote:
> > > hi bigtop !
> > >
> > > I thought id start a thread a few vaguely related thoughts i have around
> > > next couple iterations of bigtop.
> > >
> > > 1) Hive:  How will bigtop to evolve to support it, now that it is much
> > > more  than a mapreduce query wrapper?
> > >
> > > 2) I wonder wether we should confirm cassandra interoperability of spark
> > in
> > > bigtop distros,
> > >
> > > 3) Also, as per , https://issues.apache.org/jira/browse/BIGTOP-1561 ---
> > > What about presto ?  Who is interested in supporting it - packaging it
> > > -testing it etc..?  (con) I don't know if its really ready to be in
> > bigtop,
> > > but (pro) i think if there is someone really dedicated to testing its
> > > interop w/ the bigtop stack, that could be great news for us.
> > >
> > > *** Now three concrete questions lead to a more interesting question ***
> > >
> > > 4) in general, i think bigtop can move in one of 3 directions.
> > >
> > >   EXPAND ? : Expanding to include new components, with just basic
> > interop,
> > > and let folks evolve their own stacks on top of bigtop on their own.
> > >
> > >   CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
> > components,
> > > with super high quality.
> > >
> > >   STAY THE COURSE ? Staying the same ~ a packaging platform for just
> > > hadoop's direct ecosystem.
> > >
> > > I am intrigued by the idea of A and B both have clear benefits and
> > > costs...  would like to see the opinions of folks --- do we  lean in one
> > > direction or another? What is the criteria for adding a new feature,
> > > package, stack to bigtop?
> > >
> > > ... Or maybe im just overthinking it and should be spending this time
> > > testing spark for 0.9 release....
> > >
> > > Either way, looking forward to some feedback on these thoughts from the
> > > bigtop community !
> > >
> > > jay vyas
> >
> 
> 
> -- 
> Best regards,
> 
>    - Andy
> 
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)

Re: What will the next generation of bigtop look like?

Posted by Konstantin Boudnik <co...@apache.org>.

On Tue, Dec 23, 2014 at 08:55AM, Andrew Purtell wrote:
> > So, in this sense commercials aren't releasing Apache software, but
> rather its derivatives. I don't see how Bigtop would compete in this field
> neither culturally nor resource wise.
> 
> This. From personal experience, once you start patching Apache releases you
> kick off a snowball of curation that gets larger with every upstream commit
> on each project. The only way to stop the ball rolling is to periodically
> nuke it with a rebase on a Apache release, reset that patch delta on as
> many components as possible to near 0. In practice Bigtop can't do this
> because we don't have the bandwidth. But in principle it is also good in my
> opinion: If the Apache releases cannot stand up a stable and cohesive stack
> then the ecosystem has issues, and Bigtop can provide a corrective
> influence as patches, JIRAs, email.

Same here. Although, when we were doing a Hadoop distribution at WANdisco it
was 100% Apache, e.g. Bigtop stack with a couple of our own components on top
of it. Otherwise, you risk a divergence or a huge effort to simply keep this
business of yours afloat.

Cos

> On Tue, Dec 23, 2014 at 12:36 AM, Konstantin Boudnik <co...@apache.org> wrote:
> >
> > Sorry for being a bit late to this discussion, but I will try to make a
> > very
> > sort summary of what I've read:
> >
> > 1. there's the intention to focus on the vertical value-add,
> >    rather than just a platform [Andrew, et all]
> > 2. Focus more on in-memory technologies (which seemed to be our trend ever
> > since
> >    we added Spark and now Ignite (incubating)). [Jay, Evans, RJ]
> > 3. while many data processing components aren't HDFS centric anymore, the
> >    storage layer still seems to be important for anything related to
> > Hadoop.
> >    Since, I don't think HDFS can be dropped tomorrow.
> >
> > And I don't see these three interfering with each other. To me they are
> > quite
> > complemektary.
> >
> > As for lower appeal of Bigtop stack to commercials as Evans alluded.
> > There's
> > that. And the main reason is that Bigtop releases are always based on
> > official
> > Apache release of upstream components. Wheres non of the Hadoop vendors can
> > say the same. Anything that Cloudera or HortonWorks put out there is
> > Hadoop 2.x
> > + N-patches, where N could be anywhere between 1 and 2000, but is never 0.
> >
> > So, in this sense commercials aren't releasing Apache software, but rather
> > its
> > derivatives. I don't see how Bigtop would compete in this field neither
> > culturally nor resource wise.
> >
> > Cos
> >
> > On Sat, Dec 06, 2014 at 06:23PM, jay vyas wrote:
> > > hi bigtop !
> > >
> > > I thought id start a thread a few vaguely related thoughts i have around
> > > next couple iterations of bigtop.
> > >
> > > 1) Hive:  How will bigtop to evolve to support it, now that it is much
> > > more  than a mapreduce query wrapper?
> > >
> > > 2) I wonder wether we should confirm cassandra interoperability of spark
> > in
> > > bigtop distros,
> > >
> > > 3) Also, as per , https://issues.apache.org/jira/browse/BIGTOP-1561 ---
> > > What about presto ?  Who is interested in supporting it - packaging it
> > > -testing it etc..?  (con) I don't know if its really ready to be in
> > bigtop,
> > > but (pro) i think if there is someone really dedicated to testing its
> > > interop w/ the bigtop stack, that could be great news for us.
> > >
> > > *** Now three concrete questions lead to a more interesting question ***
> > >
> > > 4) in general, i think bigtop can move in one of 3 directions.
> > >
> > >   EXPAND ? : Expanding to include new components, with just basic
> > interop,
> > > and let folks evolve their own stacks on top of bigtop on their own.
> > >
> > >   CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
> > components,
> > > with super high quality.
> > >
> > >   STAY THE COURSE ? Staying the same ~ a packaging platform for just
> > > hadoop's direct ecosystem.
> > >
> > > I am intrigued by the idea of A and B both have clear benefits and
> > > costs...  would like to see the opinions of folks --- do we  lean in one
> > > direction or another? What is the criteria for adding a new feature,
> > > package, stack to bigtop?
> > >
> > > ... Or maybe im just overthinking it and should be spending this time
> > > testing spark for 0.9 release....
> > >
> > > Either way, looking forward to some feedback on these thoughts from the
> > > bigtop community !
> > >
> > > jay vyas
> >
> 
> 
> -- 
> Best regards,
> 
>    - Andy
> 
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)

Re: What will the next generation of bigtop look like?

Posted by Andrew Purtell <ap...@apache.org>.

> So, in this sense commercials aren't releasing Apache software, but
rather its derivatives. I don't see how Bigtop would compete in this field
neither culturally nor resource wise.

This. From personal experience, once you start patching Apache releases you
kick off a snowball of curation that gets larger with every upstream commit
on each project. The only way to stop the ball rolling is to periodically
nuke it with a rebase on a Apache release, reset that patch delta on as
many components as possible to near 0. In practice Bigtop can't do this
because we don't have the bandwidth. But in principle it is also good in my
opinion: If the Apache releases cannot stand up a stable and cohesive stack
then the ecosystem has issues, and Bigtop can provide a corrective
influence as patches, JIRAs, email.


On Tue, Dec 23, 2014 at 12:36 AM, Konstantin Boudnik <co...@apache.org> wrote:
>
> Sorry for being a bit late to this discussion, but I will try to make a
> very
> sort summary of what I've read:
>
> 1. there's the intention to focus on the vertical value-add,
>    rather than just a platform [Andrew, et all]
> 2. Focus more on in-memory technologies (which seemed to be our trend ever
> since
>    we added Spark and now Ignite (incubating)). [Jay, Evans, RJ]
> 3. while many data processing components aren't HDFS centric anymore, the
>    storage layer still seems to be important for anything related to
> Hadoop.
>    Since, I don't think HDFS can be dropped tomorrow.
>
> And I don't see these three interfering with each other. To me they are
> quite
> complemektary.
>
> As for lower appeal of Bigtop stack to commercials as Evans alluded.
> There's
> that. And the main reason is that Bigtop releases are always based on
> official
> Apache release of upstream components. Wheres non of the Hadoop vendors can
> say the same. Anything that Cloudera or HortonWorks put out there is
> Hadoop 2.x
> + N-patches, where N could be anywhere between 1 and 2000, but is never 0.
>
> So, in this sense commercials aren't releasing Apache software, but rather
> its
> derivatives. I don't see how Bigtop would compete in this field neither
> culturally nor resource wise.
>
> Cos
>
> On Sat, Dec 06, 2014 at 06:23PM, jay vyas wrote:
> > hi bigtop !
> >
> > I thought id start a thread a few vaguely related thoughts i have around
> > next couple iterations of bigtop.
> >
> > 1) Hive:  How will bigtop to evolve to support it, now that it is much
> > more  than a mapreduce query wrapper?
> >
> > 2) I wonder wether we should confirm cassandra interoperability of spark
> in
> > bigtop distros,
> >
> > 3) Also, as per , https://issues.apache.org/jira/browse/BIGTOP-1561 ---
> > What about presto ?  Who is interested in supporting it - packaging it
> > -testing it etc..?  (con) I don't know if its really ready to be in
> bigtop,
> > but (pro) i think if there is someone really dedicated to testing its
> > interop w/ the bigtop stack, that could be great news for us.
> >
> > *** Now three concrete questions lead to a more interesting question ***
> >
> > 4) in general, i think bigtop can move in one of 3 directions.
> >
> >   EXPAND ? : Expanding to include new components, with just basic
> interop,
> > and let folks evolve their own stacks on top of bigtop on their own.
> >
> >   CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
> components,
> > with super high quality.
> >
> >   STAY THE COURSE ? Staying the same ~ a packaging platform for just
> > hadoop's direct ecosystem.
> >
> > I am intrigued by the idea of A and B both have clear benefits and
> > costs...  would like to see the opinions of folks --- do we  lean in one
> > direction or another? What is the criteria for adding a new feature,
> > package, stack to bigtop?
> >
> > ... Or maybe im just overthinking it and should be spending this time
> > testing spark for 0.9 release....
> >
> > Either way, looking forward to some feedback on these thoughts from the
> > bigtop community !
> >
> > jay vyas
>


-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: What will the next generation of bigtop look like?

Posted by Konstantin Boudnik <co...@apache.org>.

Sorry for being a bit late to this discussion, but I will try to make a very
sort summary of what I've read:

1. there's the intention to focus on the vertical value-add,
   rather than just a platform [Andrew, et all]
2. Focus more on in-memory technologies (which seemed to be our trend ever since
   we added Spark and now Ignite (incubating)). [Jay, Evans, RJ]
3. while many data processing components aren't HDFS centric anymore, the
   storage layer still seems to be important for anything related to Hadoop.
   Since, I don't think HDFS can be dropped tomorrow.

And I don't see these three interfering with each other. To me they are quite
complemektary.

As for lower appeal of Bigtop stack to commercials as Evans alluded. There's
that. And the main reason is that Bigtop releases are always based on official
Apache release of upstream components. Wheres non of the Hadoop vendors can
say the same. Anything that Cloudera or HortonWorks put out there is Hadoop 2.x
+ N-patches, where N could be anywhere between 1 and 2000, but is never 0.

So, in this sense commercials aren't releasing Apache software, but rather its
derivatives. I don't see how Bigtop would compete in this field neither
culturally nor resource wise.

Cos

On Sat, Dec 06, 2014 at 06:23PM, jay vyas wrote:
> hi bigtop !
> 
> I thought id start a thread a few vaguely related thoughts i have around
> next couple iterations of bigtop.
> 
> 1) Hive:  How will bigtop to evolve to support it, now that it is much
> more  than a mapreduce query wrapper?
> 
> 2) I wonder wether we should confirm cassandra interoperability of spark in
> bigtop distros,
> 
> 3) Also, as per , https://issues.apache.org/jira/browse/BIGTOP-1561 ---
> What about presto ?  Who is interested in supporting it - packaging it
> -testing it etc..?  (con) I don't know if its really ready to be in bigtop,
> but (pro) i think if there is someone really dedicated to testing its
> interop w/ the bigtop stack, that could be great news for us.
> 
> *** Now three concrete questions lead to a more interesting question ***
> 
> 4) in general, i think bigtop can move in one of 3 directions.
> 
>   EXPAND ? : Expanding to include new components, with just basic interop,
> and let folks evolve their own stacks on top of bigtop on their own.
> 
>   CONTRACT+FOCUS ?  Contracting to focus on a lean set of core components,
> with super high quality.
> 
>   STAY THE COURSE ? Staying the same ~ a packaging platform for just
> hadoop's direct ecosystem.
> 
> I am intrigued by the idea of A and B both have clear benefits and
> costs...  would like to see the opinions of folks --- do we  lean in one
> direction or another? What is the criteria for adding a new feature,
> package, stack to bigtop?
> 
> ... Or maybe im just overthinking it and should be spending this time
> testing spark for 0.9 release....
> 
> Either way, looking forward to some feedback on these thoughts from the
> bigtop community !
> 
> jay vyas

Re: What will the next generation of bigtop look like?

Posted by Jay Vyas <ja...@gmail.com>.

Thanks for the real world user story evans...and mostly in line w the thoughts of others I think.

Any other folks, or any disagreements on the idea of a leaner BigTop centered around :

{{ HDFS,yarn core, spark, zk, hbase, Kafka, solr, and ignite  }}

> On Dec 13, 2014, at 1:18 PM, Evans Ye <sa...@gmail.com> wrote:
> 
> Please allow me to chime in.
> Here's a real story brings another aspect that we probably can head toward.
> My company recently is going to upgrade our hadoop version. Here we got
> several distribution on the table to be chosen: CDH, HDP, BigTop. But it is
> a
> fact that BigTop is unlikely to be the one for decision maker to choose
> because
> its version set is a little bit too old compared to the others.
> My point is maybe one thing we can do is to release often, release faster.
> And there're things we can do to achieve this goal:
> 1) shrink the supported components and focus on the vital parts
>    (as this thread already mentioned)
> 2) establish an comprehensive auto testing system(like smoke tests) that
>    supports us to do fast release
> 3) Instead of cutting off components, maybe we can upgrade major components
>    more often than minors. For example in 0.8.1, we can only have Hadoop,
>    Spark,HBase,Kafka,Solr upgraded. After all, I believe most of the
>    companies running Hadoop cluster with limited core components.
>    Hence the lag of minors might not be a big problem.
> 
> The quick release cycle not only makes BigTop more attractive but also
> gives the community a vivid image. I think that is the crucial part for
> community to keep growing.
> 
> 2014-12-09 4:23 GMT+08:00 Konstantin Boudnik <co...@apache.org>:
>> 
>> First I want to address the RJ's question:
>> 
>> The most prominent downstream Bigtop Dependency would be any commercial
>> Hadoop distribution like HDP and CDH. The former is trying to
>> disguise their affiliation by pushing Ambari forward, and Cloudera's
>> seemingly
>> shifting her focus to compressed tarballs media (aka parcels) which
>> requires
>> a closed-source solutions like Cloudera Manager to deploy and control your
>> cluster, effectively rendering it useless if you ever decide to uninstall
>> the
>> control software. In the interest of full disclosure, I don't think parcels
>> have any chance to landslide the consensus in the industry from Linux
>> packaging towards something so obscure and proprietary as parcels are.
>> 
>> 
>> And now to my actual points....:
>> 
>> I do strongly believe the Bigtop was and is the only completely
>> transparent,
>> vendors' friendly, and 100% sticking to official ASF product releases way
>> of
>> building your stack from ground up, deploying and controlling it anyway you
>> want to. I agree with Roman's presentation on how this project can move
>> forward. However, I somewhat disagree with his view on the perspectives. It
>> might be a hard road to drive the opinion of the community.  But, it is a
>> high
>> road.
>> 
>> We are definitely small and mostly unsupported by commercial groups that
>> are
>> using the framework. Being a box of LEGO won't win us anything. If
>> anything,
>> the empirical evidences are against it as commercial distros have decided
>> to
>> move towards their own means of "vendor lock-in" (yes, you hear me
>> right - that's exactly what I said: all so called open-source companies
>> have
>> invented a way to lock-in their customers either with fancy "enterprise
>> features" that aren't adding but amending underlying stack; or with custom
>> set
>> of patches oftentimes rendering the cluster to become incompatible between
>> different vendors).
>> 
>> By all means, my money are on the second way, yet slightly modified (as
>> use-cases are coming from users, not developers):
>>  #2 start driving adoption of software stacks for the particular kind of
>> data workloads
>> 
>> This community has enough day-to-day practitioners on board to
>> accumulate a near-complete introspection of where the technology is moving.
>> And instead of wobbling in a backwash, let's see if we can be smart and
>> define
>> this landscape. After all, Bigtop has adopted Spark well before any of the
>> commercials have officially accepted it. We seemingly are moving more and
>> more into in-memory realm of data processing: Apache Ignite (Gridgain),
>> Tachyon, Spark. I don't know how much legs Hive got in it, but I am
>> doubtful,
>> that it can walk for much longer... May be it's just me.
>> 
>> In this thread http://is.gd/MV2BH9 we already discussed some of the
>> aspects
>> influencing the feature of this project. And we are de-facto working on the
>> implementation. In my opinion, Hadoop has been more or less commoditized
>> already. And it isn't a bad thing, but it means that the innovations are
>> elsewhere. E.g. Spark moving is moving beyond its ties with storage layer
>> via
>> Tachyon abstraction; GridGain simply doesn't care what's underlying storage
>> is. However, data needs to be stored somewhere before it can be processed.
>> And
>> HCFS seems to be fitting the bill ok. But, as I said already, I see the
>> real
>> action elsewhere. If I were to define the shape of our mid- to long'ish
>> term
>> roadmap it'd be something like that:
>> 
>>            ^   Dashboard/Visualization  ^
>>            |     OLTP/ML processing     |
>>            |    Caching/Acceleration    |
>>            |         Storage            |
>> 
>> And around this we can add/improve on deployment (R8???),
>> virtualization/containers/clouds.  In other words - let's focus on the
>> vertical part of the stack, instead of simply supporting the status quo.
>> 
>> Does Cassandra fits the Storage layer in that model? I don't know and most
>> important - I don't care. If there's an interest and manpower to have
>> Cassandra-based stack - sure, but perhaps let's do as a separate branch or
>> something, so we aren't over-complicating things. As Roman said earlier, in
>> this case it'd be great to engage Cassandra/DataStax people into this
>> project.
>> But something tells me they won't be eager to jump on board.
>> 
>> And finally, all this above leads to "how": how we can start reshaping the
>> stack into its next incarnation? Perhaps, Ubuntu model might be an answer
>> for
>> that, but we have discussed that elsewhere and dropped the idea as it
>> wasn't
>> feasible back in the day. Perhaps its time just came?
>> 
>> Apologies for a long post.
>>  Cos
>> 
>> 
>>> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
>>> Which other projects depend on BigTop?  How will the questions about the
>>> direction of BigTop affect those projects?
>>> 
>>> On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <ro...@shaposhnik.org>
>>> wrote:
>>> 
>>>> Hi!
>>>> 
>>>> On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <ja...@gmail.com>
>>>> wrote:
>>>>> hi bigtop !
>>>>> 
>>>>> I thought id start a thread a few vaguely related thoughts i have
>> around
>>>>> next couple iterations of bigtop.
>>>> 
>>>> I think in general I see two major ways for something like
>>>> Bigtop to evolve:
>>>>   #1 remain a 'box of LEGO bricks' with very little opinion on
>>>>        how these pieces need to be integrated
>>>>   #2 start driving oppinioned use-cases for the particular kind of
>>>>        bigdata workloads
>>>> 
>>>> #1 is sort of what all of the Linux distros have been doing for
>>>> the majority of time they existed. #2 is close to what CentOS
>>>> is doing with SIGs.
>>>> 
>>>> Honestly, given the size of our community so far and a total
>>>> lack of corporate backing (with a small exception of Cloudera
>>>> still paying for our EC2 time) I think #1 is all we can do. I'd
>>>> love to be wrong, though.
>>>> 
>>>>> 1) Hive:  How will bigtop to evolve to support it, now that it is
>> much
>>>> more
>>>>> than a mapreduce query wrapper?
>>>> 
>>>> I think Hive will remain a big part of Hadoop workloads for forseeable
>>>> future. What I'd love to see more of is rationalizing things like how
>>>> HCatalog, etc. need to be deployed.
>>>> 
>>>>> 2) I wonder wether we should confirm cassandra interoperability of
>> spark
>>>> in
>>>>> bigtop distros,
>>>> 
>>>> Only if there's a significant interest from cassandra community and
>> even
>>>> then my biggest fear is that with cassandra we're totally changing the
>>>> requirements for the underlying storage subsystem (nothing wrong with
>>>> that, its just that in Hadoop ecosystem everything assumes very
>> HDFS'ish
>>>> requirements for the scale-out storage).
>>>> 
>>>>> 4) in general, i think bigtop can move in one of 3 directions.
>>>>> 
>>>>>  EXPAND ? : Expanding to include new components, with just basic
>>>> interop,
>>>>> and let folks evolve their own stacks on top of bigtop on their own.
>>>>> 
>>>>>  CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
>>>> components,
>>>>> with super high quality.
>>>>> 
>>>>>  STAY THE COURSE ? Staying the same ~ a packaging platform for just
>>>>> hadoop's direct ecosystem.
>>>>> 
>>>>> I am intrigued by the idea of A and B both have clear benefits and
>>>> costs...
>>>>> would like to see the opinions of folks --- do we  lean in one
>> direction
>>>> or
>>>>> another? What is the criteria for adding a new feature, package,
>> stack to
>>>>> bigtop?
>>>>> 
>>>>> ... Or maybe im just overthinking it and should be spending this time
>>>>> testing spark for 0.9 release....
>>>> 
>>>> I'd love to know what other think, but for 0.9 I'd rather stay the
>> course.
>>>> 
>>>> Thanks,
>>>> Roman.
>>>> 
>>>> P.S. There are also market forces at play that may fundamentally change
>>>> the focus of what we're all working on in the year or so.
>>

Re: What will the next generation of bigtop look like?

Posted by Evans Ye <sa...@gmail.com>.

Please allow me to chime in.
Here's a real story brings another aspect that we probably can head toward.
My company recently is going to upgrade our hadoop version. Here we got
several distribution on the table to be chosen: CDH, HDP, BigTop. But it is
a
fact that BigTop is unlikely to be the one for decision maker to choose
because
its version set is a little bit too old compared to the others.
My point is maybe one thing we can do is to release often, release faster.
And there're things we can do to achieve this goal:
1) shrink the supported components and focus on the vital parts
    (as this thread already mentioned)
2) establish an comprehensive auto testing system(like smoke tests) that
    supports us to do fast release
3) Instead of cutting off components, maybe we can upgrade major components
    more often than minors. For example in 0.8.1, we can only have Hadoop,
    Spark,HBase,Kafka,Solr upgraded. After all, I believe most of the
    companies running Hadoop cluster with limited core components.
    Hence the lag of minors might not be a big problem.

The quick release cycle not only makes BigTop more attractive but also
gives the community a vivid image. I think that is the crucial part for
community to keep growing.

2014-12-09 4:23 GMT+08:00 Konstantin Boudnik <co...@apache.org>:
>
> First I want to address the RJ's question:
>
> The most prominent downstream Bigtop Dependency would be any commercial
> Hadoop distribution like HDP and CDH. The former is trying to
> disguise their affiliation by pushing Ambari forward, and Cloudera's
> seemingly
> shifting her focus to compressed tarballs media (aka parcels) which
> requires
> a closed-source solutions like Cloudera Manager to deploy and control your
> cluster, effectively rendering it useless if you ever decide to uninstall
> the
> control software. In the interest of full disclosure, I don't think parcels
> have any chance to landslide the consensus in the industry from Linux
> packaging towards something so obscure and proprietary as parcels are.
>
>
> And now to my actual points....:
>
> I do strongly believe the Bigtop was and is the only completely
> transparent,
> vendors' friendly, and 100% sticking to official ASF product releases way
> of
> building your stack from ground up, deploying and controlling it anyway you
> want to. I agree with Roman's presentation on how this project can move
> forward. However, I somewhat disagree with his view on the perspectives. It
> might be a hard road to drive the opinion of the community.  But, it is a
> high
> road.
>
> We are definitely small and mostly unsupported by commercial groups that
> are
> using the framework. Being a box of LEGO won't win us anything. If
> anything,
> the empirical evidences are against it as commercial distros have decided
> to
> move towards their own means of "vendor lock-in" (yes, you hear me
> right - that's exactly what I said: all so called open-source companies
> have
> invented a way to lock-in their customers either with fancy "enterprise
> features" that aren't adding but amending underlying stack; or with custom
> set
> of patches oftentimes rendering the cluster to become incompatible between
> different vendors).
>
> By all means, my money are on the second way, yet slightly modified (as
> use-cases are coming from users, not developers):
>   #2 start driving adoption of software stacks for the particular kind of
> data workloads
>
> This community has enough day-to-day practitioners on board to
> accumulate a near-complete introspection of where the technology is moving.
> And instead of wobbling in a backwash, let's see if we can be smart and
> define
> this landscape. After all, Bigtop has adopted Spark well before any of the
> commercials have officially accepted it. We seemingly are moving more and
> more into in-memory realm of data processing: Apache Ignite (Gridgain),
> Tachyon, Spark. I don't know how much legs Hive got in it, but I am
> doubtful,
> that it can walk for much longer... May be it's just me.
>
> In this thread http://is.gd/MV2BH9 we already discussed some of the
> aspects
> influencing the feature of this project. And we are de-facto working on the
> implementation. In my opinion, Hadoop has been more or less commoditized
> already. And it isn't a bad thing, but it means that the innovations are
> elsewhere. E.g. Spark moving is moving beyond its ties with storage layer
> via
> Tachyon abstraction; GridGain simply doesn't care what's underlying storage
> is. However, data needs to be stored somewhere before it can be processed.
> And
> HCFS seems to be fitting the bill ok. But, as I said already, I see the
> real
> action elsewhere. If I were to define the shape of our mid- to long'ish
> term
> roadmap it'd be something like that:
>
>             ^   Dashboard/Visualization  ^
>             |     OLTP/ML processing     |
>             |    Caching/Acceleration    |
>             |         Storage            |
>
> And around this we can add/improve on deployment (R8???),
> virtualization/containers/clouds.  In other words - let's focus on the
> vertical part of the stack, instead of simply supporting the status quo.
>
> Does Cassandra fits the Storage layer in that model? I don't know and most
> important - I don't care. If there's an interest and manpower to have
> Cassandra-based stack - sure, but perhaps let's do as a separate branch or
> something, so we aren't over-complicating things. As Roman said earlier, in
> this case it'd be great to engage Cassandra/DataStax people into this
> project.
> But something tells me they won't be eager to jump on board.
>
> And finally, all this above leads to "how": how we can start reshaping the
> stack into its next incarnation? Perhaps, Ubuntu model might be an answer
> for
> that, but we have discussed that elsewhere and dropped the idea as it
> wasn't
> feasible back in the day. Perhaps its time just came?
>
> Apologies for a long post.
>   Cos
>
>
> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
> > Which other projects depend on BigTop?  How will the questions about the
> > direction of BigTop affect those projects?
> >
> > On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <ro...@shaposhnik.org>
> > wrote:
> >
> > > Hi!
> > >
> > > On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <ja...@gmail.com>
> > > wrote:
> > > > hi bigtop !
> > > >
> > > > I thought id start a thread a few vaguely related thoughts i have
> around
> > > > next couple iterations of bigtop.
> > >
> > > I think in general I see two major ways for something like
> > > Bigtop to evolve:
> > >    #1 remain a 'box of LEGO bricks' with very little opinion on
> > >         how these pieces need to be integrated
> > >    #2 start driving oppinioned use-cases for the particular kind of
> > >         bigdata workloads
> > >
> > > #1 is sort of what all of the Linux distros have been doing for
> > > the majority of time they existed. #2 is close to what CentOS
> > > is doing with SIGs.
> > >
> > > Honestly, given the size of our community so far and a total
> > > lack of corporate backing (with a small exception of Cloudera
> > > still paying for our EC2 time) I think #1 is all we can do. I'd
> > > love to be wrong, though.
> > >
> > > > 1) Hive:  How will bigtop to evolve to support it, now that it is
> much
> > > more
> > > > than a mapreduce query wrapper?
> > >
> > > I think Hive will remain a big part of Hadoop workloads for forseeable
> > > future. What I'd love to see more of is rationalizing things like how
> > > HCatalog, etc. need to be deployed.
> > >
> > > > 2) I wonder wether we should confirm cassandra interoperability of
> spark
> > > in
> > > > bigtop distros,
> > >
> > > Only if there's a significant interest from cassandra community and
> even
> > > then my biggest fear is that with cassandra we're totally changing the
> > > requirements for the underlying storage subsystem (nothing wrong with
> > > that, its just that in Hadoop ecosystem everything assumes very
> HDFS'ish
> > > requirements for the scale-out storage).
> > >
> > > > 4) in general, i think bigtop can move in one of 3 directions.
> > > >
> > > >   EXPAND ? : Expanding to include new components, with just basic
> > > interop,
> > > > and let folks evolve their own stacks on top of bigtop on their own.
> > > >
> > > >   CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
> > > components,
> > > > with super high quality.
> > > >
> > > >   STAY THE COURSE ? Staying the same ~ a packaging platform for just
> > > > hadoop's direct ecosystem.
> > > >
> > > > I am intrigued by the idea of A and B both have clear benefits and
> > > costs...
> > > > would like to see the opinions of folks --- do we  lean in one
> direction
> > > or
> > > > another? What is the criteria for adding a new feature, package,
> stack to
> > > > bigtop?
> > > >
> > > > ... Or maybe im just overthinking it and should be spending this time
> > > > testing spark for 0.9 release....
> > >
> > > I'd love to know what other think, but for 0.9 I'd rather stay the
> course.
> > >
> > > Thanks,
> > > Roman.
> > >
> > > P.S. There are also market forces at play that may fundamentally change
> > > the focus of what we're all working on in the year or so.
> > >
>

Re: What will the next generation of bigtop look like?

Posted by jay vyas <ja...@gmail.com>.

Yeah, I think HDFS is pretty important, esp. if we want to use SolrCloud +
HBase, both of which need a good DFS
working under the hood.   So it sounds like the most important packages
I've heard so far are:

HDFS
yarn
Spark
HBase+Phoenix
Kafka
Solr

are the ones which seem to be most appealing.

On Fri, Dec 12, 2014 at 9:54 PM, Andrew Purtell <ap...@apache.org> wrote:
>
> Well if HDFS is dropped, and all packages that depend on it, the
> usefulness of Bigtop for me drops to zero.
>
> On Fri, Dec 12, 2014 at 6:24 PM, jay vyas <ja...@gmail.com>
> wrote:
>
>> youre right - this thread is just a discussion ... nothing at all has
>> been replaced, nor even proposed to be replaced.
>> I think the purpose of this thread is to discuss the **entire** bigtop
>> stack,  from a theoretical perspective... so there are no sacred cows :)...
>>
>>
>> On Fri, Dec 12, 2014 at 7:59 PM, Andrew Purtell <ap...@apache.org>
>> wrote:
>>
>>> > thanks for the input maybe yarn and HDFS should continue to stick
>>> around
>>>
>>> Huh? When/where was HDFS not sticking around? To be replaced with what?
>>>
>>>
>>> On Thu, Dec 11, 2014 at 6:57 PM, jay vyas <ja...@gmail.com>
>>> wrote:
>>>>
>>>> great feedback guys.  my thoughts:
>>>>
>>>> @andrew, yeah thats some very good points you make.  thanks for the
>>>> input maybe yarn and HDFS should continue to stick around.
>>>>
>>>> @RJ, i think python, and the mgmt tooling can be complimentary, but
>>>> more ** on top ** of bigtop, i.e. in a vendor product based on bigtop : or
>>>> a community repackaging of bigtop ---  managing them as part of bigtop
>>>> itself --  might be extra features are a little out of scope of bigtop ;
>>>> which is more around deployment and packaging / testing for the core of a
>>>> big data infrastructure, as opposed to an e2e solution.  but *at the least*
>>>> i think it makes sense to keep in mind the  ambaris and also the
>>>> ipythons/tableus  of the world when producing bigtop releases, b/c if those
>>>> represent use cases we can have good tests for (i.e. REST Apis and PySpark
>>>> tests and so on)...
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>> Best regards,
>>>
>>>    - Andy
>>>
>>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>>> (via Tom White)
>>>
>>
>>
>>
>> --
>> jay vyas
>>
>
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>


-- 
jay vyas

Re: What will the next generation of bigtop look like?

Posted by jay vyas <ja...@gmail.com>.

Yeah, I think HDFS is pretty important, esp. if we want to use SolrCloud +
HBase, both of which need a good DFS
working under the hood.   So it sounds like the most important packages
I've heard so far are:

HDFS
yarn
Spark
HBase+Phoenix
Kafka
Solr

are the ones which seem to be most appealing.

On Fri, Dec 12, 2014 at 9:54 PM, Andrew Purtell <ap...@apache.org> wrote:
>
> Well if HDFS is dropped, and all packages that depend on it, the
> usefulness of Bigtop for me drops to zero.
>
> On Fri, Dec 12, 2014 at 6:24 PM, jay vyas <ja...@gmail.com>
> wrote:
>
>> youre right - this thread is just a discussion ... nothing at all has
>> been replaced, nor even proposed to be replaced.
>> I think the purpose of this thread is to discuss the **entire** bigtop
>> stack,  from a theoretical perspective... so there are no sacred cows :)...
>>
>>
>> On Fri, Dec 12, 2014 at 7:59 PM, Andrew Purtell <ap...@apache.org>
>> wrote:
>>
>>> > thanks for the input maybe yarn and HDFS should continue to stick
>>> around
>>>
>>> Huh? When/where was HDFS not sticking around? To be replaced with what?
>>>
>>>
>>> On Thu, Dec 11, 2014 at 6:57 PM, jay vyas <ja...@gmail.com>
>>> wrote:
>>>>
>>>> great feedback guys.  my thoughts:
>>>>
>>>> @andrew, yeah thats some very good points you make.  thanks for the
>>>> input maybe yarn and HDFS should continue to stick around.
>>>>
>>>> @RJ, i think python, and the mgmt tooling can be complimentary, but
>>>> more ** on top ** of bigtop, i.e. in a vendor product based on bigtop : or
>>>> a community repackaging of bigtop ---  managing them as part of bigtop
>>>> itself --  might be extra features are a little out of scope of bigtop ;
>>>> which is more around deployment and packaging / testing for the core of a
>>>> big data infrastructure, as opposed to an e2e solution.  but *at the least*
>>>> i think it makes sense to keep in mind the  ambaris and also the
>>>> ipythons/tableus  of the world when producing bigtop releases, b/c if those
>>>> represent use cases we can have good tests for (i.e. REST Apis and PySpark
>>>> tests and so on)...
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>> Best regards,
>>>
>>>    - Andy
>>>
>>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>>> (via Tom White)
>>>
>>
>>
>>
>> --
>> jay vyas
>>
>
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>


-- 
jay vyas

Re: What will the next generation of bigtop look like?

Posted by Andrew Purtell <ap...@apache.org>.

Well if HDFS is dropped, and all packages that depend on it, the usefulness
of Bigtop for me drops to zero.

On Fri, Dec 12, 2014 at 6:24 PM, jay vyas <ja...@gmail.com>
wrote:

> youre right - this thread is just a discussion ... nothing at all has been
> replaced, nor even proposed to be replaced.
> I think the purpose of this thread is to discuss the **entire** bigtop
> stack,  from a theoretical perspective... so there are no sacred cows :)...
>
>
> On Fri, Dec 12, 2014 at 7:59 PM, Andrew Purtell <ap...@apache.org>
> wrote:
>
>> > thanks for the input maybe yarn and HDFS should continue to stick
>> around
>>
>> Huh? When/where was HDFS not sticking around? To be replaced with what?
>>
>>
>> On Thu, Dec 11, 2014 at 6:57 PM, jay vyas <ja...@gmail.com>
>> wrote:
>>>
>>> great feedback guys.  my thoughts:
>>>
>>> @andrew, yeah thats some very good points you make.  thanks for the
>>> input maybe yarn and HDFS should continue to stick around.
>>>
>>> @RJ, i think python, and the mgmt tooling can be complimentary, but more
>>> ** on top ** of bigtop, i.e. in a vendor product based on bigtop : or a
>>> community repackaging of bigtop ---  managing them as part of bigtop itself
>>> --  might be extra features are a little out of scope of bigtop ; which is
>>> more around deployment and packaging / testing for the core of a big data
>>> infrastructure, as opposed to an e2e solution.  but *at the least* i think
>>> it makes sense to keep in mind the  ambaris and also the ipythons/tableus
>>>  of the world when producing bigtop releases, b/c if those represent use
>>> cases we can have good tests for (i.e. REST Apis and PySpark tests and so
>>> on)...
>>>
>>>
>>>
>>>
>>>
>>>
>>
>> --
>> Best regards,
>>
>>    - Andy
>>
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> (via Tom White)
>>
>
>
>
> --
> jay vyas
>



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: What will the next generation of bigtop look like?

Posted by Andrew Purtell <ap...@apache.org>.

Well if HDFS is dropped, and all packages that depend on it, the usefulness
of Bigtop for me drops to zero.

On Fri, Dec 12, 2014 at 6:24 PM, jay vyas <ja...@gmail.com>
wrote:

> youre right - this thread is just a discussion ... nothing at all has been
> replaced, nor even proposed to be replaced.
> I think the purpose of this thread is to discuss the **entire** bigtop
> stack,  from a theoretical perspective... so there are no sacred cows :)...
>
>
> On Fri, Dec 12, 2014 at 7:59 PM, Andrew Purtell <ap...@apache.org>
> wrote:
>
>> > thanks for the input maybe yarn and HDFS should continue to stick
>> around
>>
>> Huh? When/where was HDFS not sticking around? To be replaced with what?
>>
>>
>> On Thu, Dec 11, 2014 at 6:57 PM, jay vyas <ja...@gmail.com>
>> wrote:
>>>
>>> great feedback guys.  my thoughts:
>>>
>>> @andrew, yeah thats some very good points you make.  thanks for the
>>> input maybe yarn and HDFS should continue to stick around.
>>>
>>> @RJ, i think python, and the mgmt tooling can be complimentary, but more
>>> ** on top ** of bigtop, i.e. in a vendor product based on bigtop : or a
>>> community repackaging of bigtop ---  managing them as part of bigtop itself
>>> --  might be extra features are a little out of scope of bigtop ; which is
>>> more around deployment and packaging / testing for the core of a big data
>>> infrastructure, as opposed to an e2e solution.  but *at the least* i think
>>> it makes sense to keep in mind the  ambaris and also the ipythons/tableus
>>>  of the world when producing bigtop releases, b/c if those represent use
>>> cases we can have good tests for (i.e. REST Apis and PySpark tests and so
>>> on)...
>>>
>>>
>>>
>>>
>>>
>>>
>>
>> --
>> Best regards,
>>
>>    - Andy
>>
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> (via Tom White)
>>
>
>
>
> --
> jay vyas
>



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: What will the next generation of bigtop look like?

Posted by jay vyas <ja...@gmail.com>.

youre right - this thread is just a discussion ... nothing at all has been
replaced, nor even proposed to be replaced.
I think the purpose of this thread is to discuss the **entire** bigtop
stack,  from a theoretical perspective... so there are no sacred cows :)...


On Fri, Dec 12, 2014 at 7:59 PM, Andrew Purtell <ap...@apache.org> wrote:

> > thanks for the input maybe yarn and HDFS should continue to stick around
>
> Huh? When/where was HDFS not sticking around? To be replaced with what?
>
>
> On Thu, Dec 11, 2014 at 6:57 PM, jay vyas <ja...@gmail.com>
> wrote:
>>
>> great feedback guys.  my thoughts:
>>
>> @andrew, yeah thats some very good points you make.  thanks for the input
>> maybe yarn and HDFS should continue to stick around.
>>
>> @RJ, i think python, and the mgmt tooling can be complimentary, but more
>> ** on top ** of bigtop, i.e. in a vendor product based on bigtop : or a
>> community repackaging of bigtop ---  managing them as part of bigtop itself
>> --  might be extra features are a little out of scope of bigtop ; which is
>> more around deployment and packaging / testing for the core of a big data
>> infrastructure, as opposed to an e2e solution.  but *at the least* i think
>> it makes sense to keep in mind the  ambaris and also the ipythons/tableus
>>  of the world when producing bigtop releases, b/c if those represent use
>> cases we can have good tests for (i.e. REST Apis and PySpark tests and so
>> on)...
>>
>>
>>
>>
>>
>>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>



-- 
jay vyas

Re: What will the next generation of bigtop look like?

Posted by jay vyas <ja...@gmail.com>.

youre right - this thread is just a discussion ... nothing at all has been
replaced, nor even proposed to be replaced.
I think the purpose of this thread is to discuss the **entire** bigtop
stack,  from a theoretical perspective... so there are no sacred cows :)...


On Fri, Dec 12, 2014 at 7:59 PM, Andrew Purtell <ap...@apache.org> wrote:

> > thanks for the input maybe yarn and HDFS should continue to stick around
>
> Huh? When/where was HDFS not sticking around? To be replaced with what?
>
>
> On Thu, Dec 11, 2014 at 6:57 PM, jay vyas <ja...@gmail.com>
> wrote:
>>
>> great feedback guys.  my thoughts:
>>
>> @andrew, yeah thats some very good points you make.  thanks for the input
>> maybe yarn and HDFS should continue to stick around.
>>
>> @RJ, i think python, and the mgmt tooling can be complimentary, but more
>> ** on top ** of bigtop, i.e. in a vendor product based on bigtop : or a
>> community repackaging of bigtop ---  managing them as part of bigtop itself
>> --  might be extra features are a little out of scope of bigtop ; which is
>> more around deployment and packaging / testing for the core of a big data
>> infrastructure, as opposed to an e2e solution.  but *at the least* i think
>> it makes sense to keep in mind the  ambaris and also the ipythons/tableus
>>  of the world when producing bigtop releases, b/c if those represent use
>> cases we can have good tests for (i.e. REST Apis and PySpark tests and so
>> on)...
>>
>>
>>
>>
>>
>>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>



-- 
jay vyas

Re: What will the next generation of bigtop look like?

Posted by Andrew Purtell <ap...@apache.org>.

> thanks for the input maybe yarn and HDFS should continue to stick around

Huh? When/where was HDFS not sticking around? To be replaced with what?


On Thu, Dec 11, 2014 at 6:57 PM, jay vyas <ja...@gmail.com>
wrote:
>
> great feedback guys.  my thoughts:
>
> @andrew, yeah thats some very good points you make.  thanks for the input
> maybe yarn and HDFS should continue to stick around.
>
> @RJ, i think python, and the mgmt tooling can be complimentary, but more
> ** on top ** of bigtop, i.e. in a vendor product based on bigtop : or a
> community repackaging of bigtop ---  managing them as part of bigtop itself
> --  might be extra features are a little out of scope of bigtop ; which is
> more around deployment and packaging / testing for the core of a big data
> infrastructure, as opposed to an e2e solution.  but *at the least* i think
> it makes sense to keep in mind the  ambaris and also the ipythons/tableus
>  of the world when producing bigtop releases, b/c if those represent use
> cases we can have good tests for (i.e. REST Apis and PySpark tests and so
> on)...
>
>
>
>
>
>

-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: What will the next generation of bigtop look like?

Posted by Andrew Purtell <ap...@apache.org>.

> thanks for the input maybe yarn and HDFS should continue to stick around

Huh? When/where was HDFS not sticking around? To be replaced with what?


On Thu, Dec 11, 2014 at 6:57 PM, jay vyas <ja...@gmail.com>
wrote:
>
> great feedback guys.  my thoughts:
>
> @andrew, yeah thats some very good points you make.  thanks for the input
> maybe yarn and HDFS should continue to stick around.
>
> @RJ, i think python, and the mgmt tooling can be complimentary, but more
> ** on top ** of bigtop, i.e. in a vendor product based on bigtop : or a
> community repackaging of bigtop ---  managing them as part of bigtop itself
> --  might be extra features are a little out of scope of bigtop ; which is
> more around deployment and packaging / testing for the core of a big data
> infrastructure, as opposed to an e2e solution.  but *at the least* i think
> it makes sense to keep in mind the  ambaris and also the ipythons/tableus
>  of the world when producing bigtop releases, b/c if those represent use
> cases we can have good tests for (i.e. REST Apis and PySpark tests and so
> on)...
>
>
>
>
>
>

-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: What will the next generation of bigtop look like?

Posted by jay vyas <ja...@gmail.com>.

great feedback guys.  my thoughts:

@andrew, yeah thats some very good points you make.  thanks for the input
maybe yarn and HDFS should continue to stick around.

@RJ, i think python, and the mgmt tooling can be complimentary, but more **
on top ** of bigtop, i.e. in a vendor product based on bigtop : or a
community repackaging of bigtop ---  managing them as part of bigtop itself
--  might be extra features are a little out of scope of bigtop ; which is
more around deployment and packaging / testing for the core of a big data
infrastructure, as opposed to an e2e solution.  but *at the least* i think
it makes sense to keep in mind the  ambaris and also the ipythons/tableus
 of the world when producing bigtop releases, b/c if those represent use
cases we can have good tests for (i.e. REST Apis and PySpark tests and so
on)...

Re: What will the next generation of bigtop look like?

Posted by Konstantin Boudnik <co...@apache.org>.

I want to agree with Andrew. While Spark is a huge step forward compare to
basic Hadoop it isn't a solution for everything and definitely isn't a
solution for fast processing of data sets that don't fit the memory. Oh, and
by the way let's not forget about the fact that ML/analytics on Hadoop isn't
the whole world of data processing. Say OLTP workloads command a way larger
market share that just ML. That's why I am very optimistic about Ignite
(incubating).

Cos

On Thu, Dec 11, 2014 at 02:04PM, Andrew Purtell wrote:
> The problem I see with a Spark-only stack is, in my experience, Spark falls
> apart as soon as the working set exceeds all available RAM on the cluster.
> (One is presented with a sea of exceptions.) We need Hadoop anyway for HDFS
> and Common (required by many many components), we get YARN and the MR
> runtime as part of this package, and Hadoop MR is still eminently useful
> when data sets and storage requirements are far beyond agg RAM.
> 
> We have an open JIRA for adding Kafka, it would be fantastic if someone
> picks it up and brings it over the finish line.
> 
> 
> On Thu, Dec 11, 2014 at 10:14 AM, RJ Nowling <rn...@gmail.com> wrote:
> 
> > GraphX, Streaming, MLlib, and Spark SQL are all part of Spark and would be
> > included in BigTop if Spark is included. They're also pretty well
> > integrated with each other.
> >
> > I'd like to throw out a radical idea, based on Andrew's comments: focus on
> > the vertical rather than the horizontal with a slimmed down, Spark-oriented
> > stack.  (This could be a subset of the current stack.)  Strat.io's work
> > provides a nice example of a pure Spark stack.
> >
> > Spark offers a smaller footprint, far less maintenance, functionality of
> > many Hadoop components in one (and better integration!), and is better
> > suited for diverse deployment situations (cloud, non-HDFS storage, etc.)
> >
> > A few other complementary components would be needed: Kafka would be
> > needed for HA with Spark streaming.  Tachyon.  Maybe offer Cassandra or
> > similar as an alternative storage option.    Combine this with dashboards
> > and visualization and high quality deployment options (Puppet, Docker,
> > etc.).  With the data generator and Spark implementation of BigPetStore, my
> > goal is to to expand BPS to provide high quality analytics examples,
> > oriented more towards data scientists.
> >
> > Just a thought...
> >
> > On Thu, Dec 11, 2014 at 12:39 PM, Andrew Purtell <ap...@apache.org>
> > wrote:
> >
> >> This is a really great post and I was nodding along with most of it.
> >>
> >> My personal view is Bigtop starts as a deployable stack of Apache
> >> ecosystem components for Big Data. Commodification of (Linux) deployable
> >> packages and basic install integration is the baseline.
> >>
> >> Bigtop packaging Spark components first is an unfortunately little known
> >> win of this community, but its still a win. Although replicating that
> >> success with choice of the 'next big thing' is going to be a hit or miss
> >> proposition unless one of us can figure out time travel, definitely we can
> >> make some observations and scour and/or influence the Apache project
> >> landscape to pick up coverage in the space:
> >>
> >> - Storage is commoditized. Nearly everyone bases the storage stack on
> >> HDFS. Everyone does so with what we'd call HCFS. Best to focus elsewhere.
> >>
> >> - Packaging is commoditized. It's a shame that vendors pursue misguided
> >> lock-in strategies but we have no control over that. It's still true that
> >> someone using HDP or CDH 4 can switch to Bigtop and vice versa without
> >> changing package management tools or strategy. As a user of Apache stack
> >> technologies I want long term sustainable package management so will vote
> >> with my feet for the commodity option, and won't be alone. Bigtop should
> >> provide this, and does, and it's mostly a solved problem.
> >>
> >> - Deployment is also a "solved" problem but unfortunately everyone solves
> >> it differently. :-) This is an area where Bigtop can provide real value,
> >> and does, with the Puppet scripts, with the containerization work. One
> >> function Bigtop can serve is as repository and example of Hadoop-ish
> >> production tooling.
> >>
> >> - YARN is a reasonably generic grid resource manager. We don't have the
> >> resources to stand up an alternate RM and all the tooling necessary with
> >> Mesos, but if Mesosphere made a contribution of that I suspect we'd take
> >> it. From the Bigtop perspective I think computation framework options are
> >> well handled, in that I don't see Bigtop or anyone else developing credible
> >> alternatives to MR and Spark for some time. Not sure there's enough oxygen.
> >> And we have Giraph (and is GraphX packaged with Spark?). To the extent
> >> Spark-on-YARN has rough edges in the Bigtop framework that's an area where
> >> contributors can produce value. Related, support for Hive on Spark, Pig on
> >> Spark (spork).
> >>
> >> - The Apache stack includes three streaming computation frameworks -
> >> Storm, Spark Streaming, Samza - but Bigtop has mostly missed the boat here.
> >> Spark streaming is included in the spark package (I think) but how well is
> >> it integrated? Samza is well integrated with YARN but we don't package it.
> >> There's also been Storm-on-YARN work out of Yahoo, not sure about what was
> >> upstreamed or might be available. Anyway, integration of stream computation
> >> frameworks into Bigtop's packaging and deployment/management scripts can
> >> produce value, especially if we provide multiple options, because vendors
> >> are choosing favorites.
> >>
> >> - Data access. We do have players differentiating themselves here. Bigtop
> >> provides two SQL options (Hive, Phoenix+HBase), can add a third, I see
> >> someone's proposed Presto packaging. I'm not sure from the Bigtop
> >> perspective we need to pursue additional alternatives, but if there were
> >> contributions, we might well take them. "Enterprise friendly API" (SQL) is
> >> half of the data access picture I think, the other half is access control.
> >> There are competing projects in incubation, Sentry and Ranger, with no
> >> shared purpose, which is a real shame. To the extent that Bigtop adopts a
> >> cross-component full-stack access control technology, or helps bring
> >> another alternative into incubation and adopts that, we can move the needle
> >> in this space. We'd offer a vendor neutral access control option devoid of
> >> lock-in risk, this would be a big deal for big-E enterprises.
> >>
> >> - Data management and provenance. Now we're moving up the value chain
> >> from storage and data access to the next layer. This is mostly greenfield /
> >> blue ocean space in the Apache stack. We have interesting options in
> >> incubation: Falcon, Taverna, NiFi. (I think the last one might be truly
> >> comprehensive.) All of these are higher level data management and
> >> processing workflows which include aspects of management and provenance.
> >> One or more could be adopted and refined. There are a lot of relevant
> >> integration opportunities up and down the stack that could be undertaken
> >> with shared effort of the Bigtop, framework, and component communities.
> >>
> >> - Machine learning. Moving further up the value chain, we have data and
> >> computation and workflow, now how do we derive the competitive advantage
> >> that all of the lower layer technologies are in place for? The new hotness
> >> is surfacing of insights out of scaled parallel statistical inference.
> >> Unfortunately this space doesn't present itself well to the toolbox
> >> approach. Bigtop provides Mahout and MLLib as part of Spark (right?), they
> >> themselves are toolkits with components of varying utility and maturity
> >> (and relevance). I think Bigtop could provide some value by curating ML
> >> frameworks that tie in with other Apache stack technologies. ML toolkits
> >> leave would-be users in the cold. One has to know what one is doing, and
> >> what to do is highly use case specific, this is why "data scientists" can
> >> command obscene salaries and only commercial vendors have the resources to
> >> focus on specific verticals.
> >>
> >> - Visualization and preparation. Moving further up, now we are almost
> >> touching directly the use case. We have data but we need to clean it,
> >> normalize, regularize, filter, slice and dice. Where there are reasonably
> >> generic open source tools, preferably at Apache, for data preparation and
> >> cleaning Bigtop could provide baseline value by packaging it, and
> >> additional value with deeper integration with Apache stack components. Data
> >> preparation is a concern hand in hand with data ingest, so we have an
> >> interesting feedback loop from the top back down to ingest tools/building
> >> blocks like Kafka and Flume. Data cleaning concerns might overlap with the
> >> workflow frameworks too. If there's a friendly licensed open source
> >> graphical front end to the data cleaning/munging/exploration process that
> >> is generic enough that would be a really interesting "acquisition".
> >> - We can also package visualization libraries and toolkits for building
> >> dashboards. Like with ML algorithms, a complete integration is probably out
> >> of scope because every instance would be use case and user specific.
> >>
> >>
> >>
> >> On Mon, Dec 8, 2014 at 12:23 PM, Konstantin Boudnik <co...@apache.org>
> >> wrote:
> >>
> >>> First I want to address the RJ's question:
> >>>
> >>> The most prominent downstream Bigtop Dependency would be any commercial
> >>> Hadoop distribution like HDP and CDH. The former is trying to
> >>> disguise their affiliation by pushing Ambari forward, and Cloudera's
> >>> seemingly
> >>> shifting her focus to compressed tarballs media (aka parcels) which
> >>> requires
> >>> a closed-source solutions like Cloudera Manager to deploy and control
> >>> your
> >>> cluster, effectively rendering it useless if you ever decide to
> >>> uninstall the
> >>> control software. In the interest of full disclosure, I don't think
> >>> parcels
> >>> have any chance to landslide the consensus in the industry from Linux
> >>> packaging towards something so obscure and proprietary as parcels are.
> >>>
> >>>
> >>> And now to my actual points....:
> >>>
> >>> I do strongly believe the Bigtop was and is the only completely
> >>> transparent,
> >>> vendors' friendly, and 100% sticking to official ASF product releases
> >>> way of
> >>> building your stack from ground up, deploying and controlling it anyway
> >>> you
> >>> want to. I agree with Roman's presentation on how this project can move
> >>> forward. However, I somewhat disagree with his view on the perspectives.
> >>> It
> >>> might be a hard road to drive the opinion of the community.  But, it is
> >>> a high
> >>> road.
> >>>
> >>> We are definitely small and mostly unsupported by commercial groups that
> >>> are
> >>> using the framework. Being a box of LEGO won't win us anything. If
> >>> anything,
> >>> the empirical evidences are against it as commercial distros have
> >>> decided to
> >>> move towards their own means of "vendor lock-in" (yes, you hear me
> >>> right - that's exactly what I said: all so called open-source companies
> >>> have
> >>> invented a way to lock-in their customers either with fancy "enterprise
> >>> features" that aren't adding but amending underlying stack; or with
> >>> custom set
> >>> of patches oftentimes rendering the cluster to become incompatible
> >>> between
> >>> different vendors).
> >>>
> >>> By all means, my money are on the second way, yet slightly modified (as
> >>> use-cases are coming from users, not developers):
> >>>   #2 start driving adoption of software stacks for the particular kind
> >>> of data workloads
> >>>
> >>> This community has enough day-to-day practitioners on board to
> >>> accumulate a near-complete introspection of where the technology is
> >>> moving.
> >>> And instead of wobbling in a backwash, let's see if we can be smart and
> >>> define
> >>> this landscape. After all, Bigtop has adopted Spark well before any of
> >>> the
> >>> commercials have officially accepted it. We seemingly are moving more and
> >>> more into in-memory realm of data processing: Apache Ignite (Gridgain),
> >>> Tachyon, Spark. I don't know how much legs Hive got in it, but I am
> >>> doubtful,
> >>> that it can walk for much longer... May be it's just me.
> >>>
> >>> In this thread http://is.gd/MV2BH9 we already discussed some of the
> >>> aspects
> >>> influencing the feature of this project. And we are de-facto working on
> >>> the
> >>> implementation. In my opinion, Hadoop has been more or less commoditized
> >>> already. And it isn't a bad thing, but it means that the innovations are
> >>> elsewhere. E.g. Spark moving is moving beyond its ties with storage
> >>> layer via
> >>> Tachyon abstraction; GridGain simply doesn't care what's underlying
> >>> storage
> >>> is. However, data needs to be stored somewhere before it can be
> >>> processed. And
> >>> HCFS seems to be fitting the bill ok. But, as I said already, I see the
> >>> real
> >>> action elsewhere. If I were to define the shape of our mid- to long'ish
> >>> term
> >>> roadmap it'd be something like that:
> >>>
> >>>             ^   Dashboard/Visualization  ^
> >>>             |     OLTP/ML processing     |
> >>>             |    Caching/Acceleration    |
> >>>             |         Storage            |
> >>>
> >>> And around this we can add/improve on deployment (R8???),
> >>> virtualization/containers/clouds.  In other words - let's focus on the
> >>> vertical part of the stack, instead of simply supporting the status quo.
> >>>
> >>> Does Cassandra fits the Storage layer in that model? I don't know and
> >>> most
> >>> important - I don't care. If there's an interest and manpower to have
> >>> Cassandra-based stack - sure, but perhaps let's do as a separate branch
> >>> or
> >>> something, so we aren't over-complicating things. As Roman said earlier,
> >>> in
> >>> this case it'd be great to engage Cassandra/DataStax people into this
> >>> project.
> >>> But something tells me they won't be eager to jump on board.
> >>>
> >>> And finally, all this above leads to "how": how we can start reshaping
> >>> the
> >>> stack into its next incarnation? Perhaps, Ubuntu model might be an
> >>> answer for
> >>> that, but we have discussed that elsewhere and dropped the idea as it
> >>> wasn't
> >>> feasible back in the day. Perhaps its time just came?
> >>>
> >>> Apologies for a long post.
> >>>   Cos
> >>>
> >>>
> >>> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
> >>> > Which other projects depend on BigTop?  How will the questions about
> >>> the
> >>> > direction of BigTop affect those projects?
> >>> >
> >>> > On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <roman@shaposhnik.org
> >>> >
> >>> > wrote:
> >>> >
> >>> > > Hi!
> >>> > >
> >>> > > On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <
> >>> jayunit100.apache@gmail.com>
> >>> > > wrote:
> >>> > > > hi bigtop !
> >>> > > >
> >>> > > > I thought id start a thread a few vaguely related thoughts i have
> >>> around
> >>> > > > next couple iterations of bigtop.
> >>> > >
> >>> > > I think in general I see two major ways for something like
> >>> > > Bigtop to evolve:
> >>> > >    #1 remain a 'box of LEGO bricks' with very little opinion on
> >>> > >         how these pieces need to be integrated
> >>> > >    #2 start driving oppinioned use-cases for the particular kind of
> >>> > >         bigdata workloads
> >>> > >
> >>> > > #1 is sort of what all of the Linux distros have been doing for
> >>> > > the majority of time they existed. #2 is close to what CentOS
> >>> > > is doing with SIGs.
> >>> > >
> >>> > > Honestly, given the size of our community so far and a total
> >>> > > lack of corporate backing (with a small exception of Cloudera
> >>> > > still paying for our EC2 time) I think #1 is all we can do. I'd
> >>> > > love to be wrong, though.
> >>> > >
> >>> > > > 1) Hive:  How will bigtop to evolve to support it, now that it is
> >>> much
> >>> > > more
> >>> > > > than a mapreduce query wrapper?
> >>> > >
> >>> > > I think Hive will remain a big part of Hadoop workloads for
> >>> forseeable
> >>> > > future. What I'd love to see more of is rationalizing things like how
> >>> > > HCatalog, etc. need to be deployed.
> >>> > >
> >>> > > > 2) I wonder wether we should confirm cassandra interoperability of
> >>> spark
> >>> > > in
> >>> > > > bigtop distros,
> >>> > >
> >>> > > Only if there's a significant interest from cassandra community and
> >>> even
> >>> > > then my biggest fear is that with cassandra we're totally changing
> >>> the
> >>> > > requirements for the underlying storage subsystem (nothing wrong with
> >>> > > that, its just that in Hadoop ecosystem everything assumes very
> >>> HDFS'ish
> >>> > > requirements for the scale-out storage).
> >>> > >
> >>> > > > 4) in general, i think bigtop can move in one of 3 directions.
> >>> > > >
> >>> > > >   EXPAND ? : Expanding to include new components, with just basic
> >>> > > interop,
> >>> > > > and let folks evolve their own stacks on top of bigtop on their
> >>> own.
> >>> > > >
> >>> > > >   CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
> >>> > > components,
> >>> > > > with super high quality.
> >>> > > >
> >>> > > >   STAY THE COURSE ? Staying the same ~ a packaging platform for
> >>> just
> >>> > > > hadoop's direct ecosystem.
> >>> > > >
> >>> > > > I am intrigued by the idea of A and B both have clear benefits and
> >>> > > costs...
> >>> > > > would like to see the opinions of folks --- do we  lean in one
> >>> direction
> >>> > > or
> >>> > > > another? What is the criteria for adding a new feature, package,
> >>> stack to
> >>> > > > bigtop?
> >>> > > >
> >>> > > > ... Or maybe im just overthinking it and should be spending this
> >>> time
> >>> > > > testing spark for 0.9 release....
> >>> > >
> >>> > > I'd love to know what other think, but for 0.9 I'd rather stay the
> >>> course.
> >>> > >
> >>> > > Thanks,
> >>> > > Roman.
> >>> > >
> >>> > > P.S. There are also market forces at play that may fundamentally
> >>> change
> >>> > > the focus of what we're all working on in the year or so.
> >>> > >
> >>>
> >>
> >>
> >>
> >> --
> >> Best regards,
> >>
> >>    - Andy
> >>
> >> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> >> (via Tom White)
> >>
> >
> >
> 
> 
> -- 
> Best regards,
> 
>    - Andy
> 
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)

Re: What will the next generation of bigtop look like?

Posted by Konstantin Boudnik <co...@apache.org>.

I want to agree with Andrew. While Spark is a huge step forward compare to
basic Hadoop it isn't a solution for everything and definitely isn't a
solution for fast processing of data sets that don't fit the memory. Oh, and
by the way let's not forget about the fact that ML/analytics on Hadoop isn't
the whole world of data processing. Say OLTP workloads command a way larger
market share that just ML. That's why I am very optimistic about Ignite
(incubating).

Cos

On Thu, Dec 11, 2014 at 02:04PM, Andrew Purtell wrote:
> The problem I see with a Spark-only stack is, in my experience, Spark falls
> apart as soon as the working set exceeds all available RAM on the cluster.
> (One is presented with a sea of exceptions.) We need Hadoop anyway for HDFS
> and Common (required by many many components), we get YARN and the MR
> runtime as part of this package, and Hadoop MR is still eminently useful
> when data sets and storage requirements are far beyond agg RAM.
> 
> We have an open JIRA for adding Kafka, it would be fantastic if someone
> picks it up and brings it over the finish line.
> 
> 
> On Thu, Dec 11, 2014 at 10:14 AM, RJ Nowling <rn...@gmail.com> wrote:
> 
> > GraphX, Streaming, MLlib, and Spark SQL are all part of Spark and would be
> > included in BigTop if Spark is included. They're also pretty well
> > integrated with each other.
> >
> > I'd like to throw out a radical idea, based on Andrew's comments: focus on
> > the vertical rather than the horizontal with a slimmed down, Spark-oriented
> > stack.  (This could be a subset of the current stack.)  Strat.io's work
> > provides a nice example of a pure Spark stack.
> >
> > Spark offers a smaller footprint, far less maintenance, functionality of
> > many Hadoop components in one (and better integration!), and is better
> > suited for diverse deployment situations (cloud, non-HDFS storage, etc.)
> >
> > A few other complementary components would be needed: Kafka would be
> > needed for HA with Spark streaming.  Tachyon.  Maybe offer Cassandra or
> > similar as an alternative storage option.    Combine this with dashboards
> > and visualization and high quality deployment options (Puppet, Docker,
> > etc.).  With the data generator and Spark implementation of BigPetStore, my
> > goal is to to expand BPS to provide high quality analytics examples,
> > oriented more towards data scientists.
> >
> > Just a thought...
> >
> > On Thu, Dec 11, 2014 at 12:39 PM, Andrew Purtell <ap...@apache.org>
> > wrote:
> >
> >> This is a really great post and I was nodding along with most of it.
> >>
> >> My personal view is Bigtop starts as a deployable stack of Apache
> >> ecosystem components for Big Data. Commodification of (Linux) deployable
> >> packages and basic install integration is the baseline.
> >>
> >> Bigtop packaging Spark components first is an unfortunately little known
> >> win of this community, but its still a win. Although replicating that
> >> success with choice of the 'next big thing' is going to be a hit or miss
> >> proposition unless one of us can figure out time travel, definitely we can
> >> make some observations and scour and/or influence the Apache project
> >> landscape to pick up coverage in the space:
> >>
> >> - Storage is commoditized. Nearly everyone bases the storage stack on
> >> HDFS. Everyone does so with what we'd call HCFS. Best to focus elsewhere.
> >>
> >> - Packaging is commoditized. It's a shame that vendors pursue misguided
> >> lock-in strategies but we have no control over that. It's still true that
> >> someone using HDP or CDH 4 can switch to Bigtop and vice versa without
> >> changing package management tools or strategy. As a user of Apache stack
> >> technologies I want long term sustainable package management so will vote
> >> with my feet for the commodity option, and won't be alone. Bigtop should
> >> provide this, and does, and it's mostly a solved problem.
> >>
> >> - Deployment is also a "solved" problem but unfortunately everyone solves
> >> it differently. :-) This is an area where Bigtop can provide real value,
> >> and does, with the Puppet scripts, with the containerization work. One
> >> function Bigtop can serve is as repository and example of Hadoop-ish
> >> production tooling.
> >>
> >> - YARN is a reasonably generic grid resource manager. We don't have the
> >> resources to stand up an alternate RM and all the tooling necessary with
> >> Mesos, but if Mesosphere made a contribution of that I suspect we'd take
> >> it. From the Bigtop perspective I think computation framework options are
> >> well handled, in that I don't see Bigtop or anyone else developing credible
> >> alternatives to MR and Spark for some time. Not sure there's enough oxygen.
> >> And we have Giraph (and is GraphX packaged with Spark?). To the extent
> >> Spark-on-YARN has rough edges in the Bigtop framework that's an area where
> >> contributors can produce value. Related, support for Hive on Spark, Pig on
> >> Spark (spork).
> >>
> >> - The Apache stack includes three streaming computation frameworks -
> >> Storm, Spark Streaming, Samza - but Bigtop has mostly missed the boat here.
> >> Spark streaming is included in the spark package (I think) but how well is
> >> it integrated? Samza is well integrated with YARN but we don't package it.
> >> There's also been Storm-on-YARN work out of Yahoo, not sure about what was
> >> upstreamed or might be available. Anyway, integration of stream computation
> >> frameworks into Bigtop's packaging and deployment/management scripts can
> >> produce value, especially if we provide multiple options, because vendors
> >> are choosing favorites.
> >>
> >> - Data access. We do have players differentiating themselves here. Bigtop
> >> provides two SQL options (Hive, Phoenix+HBase), can add a third, I see
> >> someone's proposed Presto packaging. I'm not sure from the Bigtop
> >> perspective we need to pursue additional alternatives, but if there were
> >> contributions, we might well take them. "Enterprise friendly API" (SQL) is
> >> half of the data access picture I think, the other half is access control.
> >> There are competing projects in incubation, Sentry and Ranger, with no
> >> shared purpose, which is a real shame. To the extent that Bigtop adopts a
> >> cross-component full-stack access control technology, or helps bring
> >> another alternative into incubation and adopts that, we can move the needle
> >> in this space. We'd offer a vendor neutral access control option devoid of
> >> lock-in risk, this would be a big deal for big-E enterprises.
> >>
> >> - Data management and provenance. Now we're moving up the value chain
> >> from storage and data access to the next layer. This is mostly greenfield /
> >> blue ocean space in the Apache stack. We have interesting options in
> >> incubation: Falcon, Taverna, NiFi. (I think the last one might be truly
> >> comprehensive.) All of these are higher level data management and
> >> processing workflows which include aspects of management and provenance.
> >> One or more could be adopted and refined. There are a lot of relevant
> >> integration opportunities up and down the stack that could be undertaken
> >> with shared effort of the Bigtop, framework, and component communities.
> >>
> >> - Machine learning. Moving further up the value chain, we have data and
> >> computation and workflow, now how do we derive the competitive advantage
> >> that all of the lower layer technologies are in place for? The new hotness
> >> is surfacing of insights out of scaled parallel statistical inference.
> >> Unfortunately this space doesn't present itself well to the toolbox
> >> approach. Bigtop provides Mahout and MLLib as part of Spark (right?), they
> >> themselves are toolkits with components of varying utility and maturity
> >> (and relevance). I think Bigtop could provide some value by curating ML
> >> frameworks that tie in with other Apache stack technologies. ML toolkits
> >> leave would-be users in the cold. One has to know what one is doing, and
> >> what to do is highly use case specific, this is why "data scientists" can
> >> command obscene salaries and only commercial vendors have the resources to
> >> focus on specific verticals.
> >>
> >> - Visualization and preparation. Moving further up, now we are almost
> >> touching directly the use case. We have data but we need to clean it,
> >> normalize, regularize, filter, slice and dice. Where there are reasonably
> >> generic open source tools, preferably at Apache, for data preparation and
> >> cleaning Bigtop could provide baseline value by packaging it, and
> >> additional value with deeper integration with Apache stack components. Data
> >> preparation is a concern hand in hand with data ingest, so we have an
> >> interesting feedback loop from the top back down to ingest tools/building
> >> blocks like Kafka and Flume. Data cleaning concerns might overlap with the
> >> workflow frameworks too. If there's a friendly licensed open source
> >> graphical front end to the data cleaning/munging/exploration process that
> >> is generic enough that would be a really interesting "acquisition".
> >> - We can also package visualization libraries and toolkits for building
> >> dashboards. Like with ML algorithms, a complete integration is probably out
> >> of scope because every instance would be use case and user specific.
> >>
> >>
> >>
> >> On Mon, Dec 8, 2014 at 12:23 PM, Konstantin Boudnik <co...@apache.org>
> >> wrote:
> >>
> >>> First I want to address the RJ's question:
> >>>
> >>> The most prominent downstream Bigtop Dependency would be any commercial
> >>> Hadoop distribution like HDP and CDH. The former is trying to
> >>> disguise their affiliation by pushing Ambari forward, and Cloudera's
> >>> seemingly
> >>> shifting her focus to compressed tarballs media (aka parcels) which
> >>> requires
> >>> a closed-source solutions like Cloudera Manager to deploy and control
> >>> your
> >>> cluster, effectively rendering it useless if you ever decide to
> >>> uninstall the
> >>> control software. In the interest of full disclosure, I don't think
> >>> parcels
> >>> have any chance to landslide the consensus in the industry from Linux
> >>> packaging towards something so obscure and proprietary as parcels are.
> >>>
> >>>
> >>> And now to my actual points....:
> >>>
> >>> I do strongly believe the Bigtop was and is the only completely
> >>> transparent,
> >>> vendors' friendly, and 100% sticking to official ASF product releases
> >>> way of
> >>> building your stack from ground up, deploying and controlling it anyway
> >>> you
> >>> want to. I agree with Roman's presentation on how this project can move
> >>> forward. However, I somewhat disagree with his view on the perspectives.
> >>> It
> >>> might be a hard road to drive the opinion of the community.  But, it is
> >>> a high
> >>> road.
> >>>
> >>> We are definitely small and mostly unsupported by commercial groups that
> >>> are
> >>> using the framework. Being a box of LEGO won't win us anything. If
> >>> anything,
> >>> the empirical evidences are against it as commercial distros have
> >>> decided to
> >>> move towards their own means of "vendor lock-in" (yes, you hear me
> >>> right - that's exactly what I said: all so called open-source companies
> >>> have
> >>> invented a way to lock-in their customers either with fancy "enterprise
> >>> features" that aren't adding but amending underlying stack; or with
> >>> custom set
> >>> of patches oftentimes rendering the cluster to become incompatible
> >>> between
> >>> different vendors).
> >>>
> >>> By all means, my money are on the second way, yet slightly modified (as
> >>> use-cases are coming from users, not developers):
> >>>   #2 start driving adoption of software stacks for the particular kind
> >>> of data workloads
> >>>
> >>> This community has enough day-to-day practitioners on board to
> >>> accumulate a near-complete introspection of where the technology is
> >>> moving.
> >>> And instead of wobbling in a backwash, let's see if we can be smart and
> >>> define
> >>> this landscape. After all, Bigtop has adopted Spark well before any of
> >>> the
> >>> commercials have officially accepted it. We seemingly are moving more and
> >>> more into in-memory realm of data processing: Apache Ignite (Gridgain),
> >>> Tachyon, Spark. I don't know how much legs Hive got in it, but I am
> >>> doubtful,
> >>> that it can walk for much longer... May be it's just me.
> >>>
> >>> In this thread http://is.gd/MV2BH9 we already discussed some of the
> >>> aspects
> >>> influencing the feature of this project. And we are de-facto working on
> >>> the
> >>> implementation. In my opinion, Hadoop has been more or less commoditized
> >>> already. And it isn't a bad thing, but it means that the innovations are
> >>> elsewhere. E.g. Spark moving is moving beyond its ties with storage
> >>> layer via
> >>> Tachyon abstraction; GridGain simply doesn't care what's underlying
> >>> storage
> >>> is. However, data needs to be stored somewhere before it can be
> >>> processed. And
> >>> HCFS seems to be fitting the bill ok. But, as I said already, I see the
> >>> real
> >>> action elsewhere. If I were to define the shape of our mid- to long'ish
> >>> term
> >>> roadmap it'd be something like that:
> >>>
> >>>             ^   Dashboard/Visualization  ^
> >>>             |     OLTP/ML processing     |
> >>>             |    Caching/Acceleration    |
> >>>             |         Storage            |
> >>>
> >>> And around this we can add/improve on deployment (R8???),
> >>> virtualization/containers/clouds.  In other words - let's focus on the
> >>> vertical part of the stack, instead of simply supporting the status quo.
> >>>
> >>> Does Cassandra fits the Storage layer in that model? I don't know and
> >>> most
> >>> important - I don't care. If there's an interest and manpower to have
> >>> Cassandra-based stack - sure, but perhaps let's do as a separate branch
> >>> or
> >>> something, so we aren't over-complicating things. As Roman said earlier,
> >>> in
> >>> this case it'd be great to engage Cassandra/DataStax people into this
> >>> project.
> >>> But something tells me they won't be eager to jump on board.
> >>>
> >>> And finally, all this above leads to "how": how we can start reshaping
> >>> the
> >>> stack into its next incarnation? Perhaps, Ubuntu model might be an
> >>> answer for
> >>> that, but we have discussed that elsewhere and dropped the idea as it
> >>> wasn't
> >>> feasible back in the day. Perhaps its time just came?
> >>>
> >>> Apologies for a long post.
> >>>   Cos
> >>>
> >>>
> >>> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
> >>> > Which other projects depend on BigTop?  How will the questions about
> >>> the
> >>> > direction of BigTop affect those projects?
> >>> >
> >>> > On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <roman@shaposhnik.org
> >>> >
> >>> > wrote:
> >>> >
> >>> > > Hi!
> >>> > >
> >>> > > On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <
> >>> jayunit100.apache@gmail.com>
> >>> > > wrote:
> >>> > > > hi bigtop !
> >>> > > >
> >>> > > > I thought id start a thread a few vaguely related thoughts i have
> >>> around
> >>> > > > next couple iterations of bigtop.
> >>> > >
> >>> > > I think in general I see two major ways for something like
> >>> > > Bigtop to evolve:
> >>> > >    #1 remain a 'box of LEGO bricks' with very little opinion on
> >>> > >         how these pieces need to be integrated
> >>> > >    #2 start driving oppinioned use-cases for the particular kind of
> >>> > >         bigdata workloads
> >>> > >
> >>> > > #1 is sort of what all of the Linux distros have been doing for
> >>> > > the majority of time they existed. #2 is close to what CentOS
> >>> > > is doing with SIGs.
> >>> > >
> >>> > > Honestly, given the size of our community so far and a total
> >>> > > lack of corporate backing (with a small exception of Cloudera
> >>> > > still paying for our EC2 time) I think #1 is all we can do. I'd
> >>> > > love to be wrong, though.
> >>> > >
> >>> > > > 1) Hive:  How will bigtop to evolve to support it, now that it is
> >>> much
> >>> > > more
> >>> > > > than a mapreduce query wrapper?
> >>> > >
> >>> > > I think Hive will remain a big part of Hadoop workloads for
> >>> forseeable
> >>> > > future. What I'd love to see more of is rationalizing things like how
> >>> > > HCatalog, etc. need to be deployed.
> >>> > >
> >>> > > > 2) I wonder wether we should confirm cassandra interoperability of
> >>> spark
> >>> > > in
> >>> > > > bigtop distros,
> >>> > >
> >>> > > Only if there's a significant interest from cassandra community and
> >>> even
> >>> > > then my biggest fear is that with cassandra we're totally changing
> >>> the
> >>> > > requirements for the underlying storage subsystem (nothing wrong with
> >>> > > that, its just that in Hadoop ecosystem everything assumes very
> >>> HDFS'ish
> >>> > > requirements for the scale-out storage).
> >>> > >
> >>> > > > 4) in general, i think bigtop can move in one of 3 directions.
> >>> > > >
> >>> > > >   EXPAND ? : Expanding to include new components, with just basic
> >>> > > interop,
> >>> > > > and let folks evolve their own stacks on top of bigtop on their
> >>> own.
> >>> > > >
> >>> > > >   CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
> >>> > > components,
> >>> > > > with super high quality.
> >>> > > >
> >>> > > >   STAY THE COURSE ? Staying the same ~ a packaging platform for
> >>> just
> >>> > > > hadoop's direct ecosystem.
> >>> > > >
> >>> > > > I am intrigued by the idea of A and B both have clear benefits and
> >>> > > costs...
> >>> > > > would like to see the opinions of folks --- do we  lean in one
> >>> direction
> >>> > > or
> >>> > > > another? What is the criteria for adding a new feature, package,
> >>> stack to
> >>> > > > bigtop?
> >>> > > >
> >>> > > > ... Or maybe im just overthinking it and should be spending this
> >>> time
> >>> > > > testing spark for 0.9 release....
> >>> > >
> >>> > > I'd love to know what other think, but for 0.9 I'd rather stay the
> >>> course.
> >>> > >
> >>> > > Thanks,
> >>> > > Roman.
> >>> > >
> >>> > > P.S. There are also market forces at play that may fundamentally
> >>> change
> >>> > > the focus of what we're all working on in the year or so.
> >>> > >
> >>>
> >>
> >>
> >>
> >> --
> >> Best regards,
> >>
> >>    - Andy
> >>
> >> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> >> (via Tom White)
> >>
> >
> >
> 
> 
> -- 
> Best regards,
> 
>    - Andy
> 
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)

Re: What will the next generation of bigtop look like?

Posted by jay vyas <ja...@gmail.com>.

great feedback guys.  my thoughts:

@andrew, yeah thats some very good points you make.  thanks for the input
maybe yarn and HDFS should continue to stick around.

@RJ, i think python, and the mgmt tooling can be complimentary, but more **
on top ** of bigtop, i.e. in a vendor product based on bigtop : or a
community repackaging of bigtop ---  managing them as part of bigtop itself
--  might be extra features are a little out of scope of bigtop ; which is
more around deployment and packaging / testing for the core of a big data
infrastructure, as opposed to an e2e solution.  but *at the least* i think
it makes sense to keep in mind the  ambaris and also the ipythons/tableus
 of the world when producing bigtop releases, b/c if those represent use
cases we can have good tests for (i.e. REST Apis and PySpark tests and so
on)...

Re: What will the next generation of bigtop look like?

Posted by Andrew Purtell <ap...@apache.org>.

The problem I see with a Spark-only stack is, in my experience, Spark falls
apart as soon as the working set exceeds all available RAM on the cluster.
(One is presented with a sea of exceptions.) We need Hadoop anyway for HDFS
and Common (required by many many components), we get YARN and the MR
runtime as part of this package, and Hadoop MR is still eminently useful
when data sets and storage requirements are far beyond agg RAM.

We have an open JIRA for adding Kafka, it would be fantastic if someone
picks it up and brings it over the finish line.


On Thu, Dec 11, 2014 at 10:14 AM, RJ Nowling <rn...@gmail.com> wrote:

> GraphX, Streaming, MLlib, and Spark SQL are all part of Spark and would be
> included in BigTop if Spark is included. They're also pretty well
> integrated with each other.
>
> I'd like to throw out a radical idea, based on Andrew's comments: focus on
> the vertical rather than the horizontal with a slimmed down, Spark-oriented
> stack.  (This could be a subset of the current stack.)  Strat.io's work
> provides a nice example of a pure Spark stack.
>
> Spark offers a smaller footprint, far less maintenance, functionality of
> many Hadoop components in one (and better integration!), and is better
> suited for diverse deployment situations (cloud, non-HDFS storage, etc.)
>
> A few other complementary components would be needed: Kafka would be
> needed for HA with Spark streaming.  Tachyon.  Maybe offer Cassandra or
> similar as an alternative storage option.    Combine this with dashboards
> and visualization and high quality deployment options (Puppet, Docker,
> etc.).  With the data generator and Spark implementation of BigPetStore, my
> goal is to to expand BPS to provide high quality analytics examples,
> oriented more towards data scientists.
>
> Just a thought...
>
> On Thu, Dec 11, 2014 at 12:39 PM, Andrew Purtell <ap...@apache.org>
> wrote:
>
>> This is a really great post and I was nodding along with most of it.
>>
>> My personal view is Bigtop starts as a deployable stack of Apache
>> ecosystem components for Big Data. Commodification of (Linux) deployable
>> packages and basic install integration is the baseline.
>>
>> Bigtop packaging Spark components first is an unfortunately little known
>> win of this community, but its still a win. Although replicating that
>> success with choice of the 'next big thing' is going to be a hit or miss
>> proposition unless one of us can figure out time travel, definitely we can
>> make some observations and scour and/or influence the Apache project
>> landscape to pick up coverage in the space:
>>
>> - Storage is commoditized. Nearly everyone bases the storage stack on
>> HDFS. Everyone does so with what we'd call HCFS. Best to focus elsewhere.
>>
>> - Packaging is commoditized. It's a shame that vendors pursue misguided
>> lock-in strategies but we have no control over that. It's still true that
>> someone using HDP or CDH 4 can switch to Bigtop and vice versa without
>> changing package management tools or strategy. As a user of Apache stack
>> technologies I want long term sustainable package management so will vote
>> with my feet for the commodity option, and won't be alone. Bigtop should
>> provide this, and does, and it's mostly a solved problem.
>>
>> - Deployment is also a "solved" problem but unfortunately everyone solves
>> it differently. :-) This is an area where Bigtop can provide real value,
>> and does, with the Puppet scripts, with the containerization work. One
>> function Bigtop can serve is as repository and example of Hadoop-ish
>> production tooling.
>>
>> - YARN is a reasonably generic grid resource manager. We don't have the
>> resources to stand up an alternate RM and all the tooling necessary with
>> Mesos, but if Mesosphere made a contribution of that I suspect we'd take
>> it. From the Bigtop perspective I think computation framework options are
>> well handled, in that I don't see Bigtop or anyone else developing credible
>> alternatives to MR and Spark for some time. Not sure there's enough oxygen.
>> And we have Giraph (and is GraphX packaged with Spark?). To the extent
>> Spark-on-YARN has rough edges in the Bigtop framework that's an area where
>> contributors can produce value. Related, support for Hive on Spark, Pig on
>> Spark (spork).
>>
>> - The Apache stack includes three streaming computation frameworks -
>> Storm, Spark Streaming, Samza - but Bigtop has mostly missed the boat here.
>> Spark streaming is included in the spark package (I think) but how well is
>> it integrated? Samza is well integrated with YARN but we don't package it.
>> There's also been Storm-on-YARN work out of Yahoo, not sure about what was
>> upstreamed or might be available. Anyway, integration of stream computation
>> frameworks into Bigtop's packaging and deployment/management scripts can
>> produce value, especially if we provide multiple options, because vendors
>> are choosing favorites.
>>
>> - Data access. We do have players differentiating themselves here. Bigtop
>> provides two SQL options (Hive, Phoenix+HBase), can add a third, I see
>> someone's proposed Presto packaging. I'm not sure from the Bigtop
>> perspective we need to pursue additional alternatives, but if there were
>> contributions, we might well take them. "Enterprise friendly API" (SQL) is
>> half of the data access picture I think, the other half is access control.
>> There are competing projects in incubation, Sentry and Ranger, with no
>> shared purpose, which is a real shame. To the extent that Bigtop adopts a
>> cross-component full-stack access control technology, or helps bring
>> another alternative into incubation and adopts that, we can move the needle
>> in this space. We'd offer a vendor neutral access control option devoid of
>> lock-in risk, this would be a big deal for big-E enterprises.
>>
>> - Data management and provenance. Now we're moving up the value chain
>> from storage and data access to the next layer. This is mostly greenfield /
>> blue ocean space in the Apache stack. We have interesting options in
>> incubation: Falcon, Taverna, NiFi. (I think the last one might be truly
>> comprehensive.) All of these are higher level data management and
>> processing workflows which include aspects of management and provenance.
>> One or more could be adopted and refined. There are a lot of relevant
>> integration opportunities up and down the stack that could be undertaken
>> with shared effort of the Bigtop, framework, and component communities.
>>
>> - Machine learning. Moving further up the value chain, we have data and
>> computation and workflow, now how do we derive the competitive advantage
>> that all of the lower layer technologies are in place for? The new hotness
>> is surfacing of insights out of scaled parallel statistical inference.
>> Unfortunately this space doesn't present itself well to the toolbox
>> approach. Bigtop provides Mahout and MLLib as part of Spark (right?), they
>> themselves are toolkits with components of varying utility and maturity
>> (and relevance). I think Bigtop could provide some value by curating ML
>> frameworks that tie in with other Apache stack technologies. ML toolkits
>> leave would-be users in the cold. One has to know what one is doing, and
>> what to do is highly use case specific, this is why "data scientists" can
>> command obscene salaries and only commercial vendors have the resources to
>> focus on specific verticals.
>>
>> - Visualization and preparation. Moving further up, now we are almost
>> touching directly the use case. We have data but we need to clean it,
>> normalize, regularize, filter, slice and dice. Where there are reasonably
>> generic open source tools, preferably at Apache, for data preparation and
>> cleaning Bigtop could provide baseline value by packaging it, and
>> additional value with deeper integration with Apache stack components. Data
>> preparation is a concern hand in hand with data ingest, so we have an
>> interesting feedback loop from the top back down to ingest tools/building
>> blocks like Kafka and Flume. Data cleaning concerns might overlap with the
>> workflow frameworks too. If there's a friendly licensed open source
>> graphical front end to the data cleaning/munging/exploration process that
>> is generic enough that would be a really interesting "acquisition".
>> - We can also package visualization libraries and toolkits for building
>> dashboards. Like with ML algorithms, a complete integration is probably out
>> of scope because every instance would be use case and user specific.
>>
>>
>>
>> On Mon, Dec 8, 2014 at 12:23 PM, Konstantin Boudnik <co...@apache.org>
>> wrote:
>>
>>> First I want to address the RJ's question:
>>>
>>> The most prominent downstream Bigtop Dependency would be any commercial
>>> Hadoop distribution like HDP and CDH. The former is trying to
>>> disguise their affiliation by pushing Ambari forward, and Cloudera's
>>> seemingly
>>> shifting her focus to compressed tarballs media (aka parcels) which
>>> requires
>>> a closed-source solutions like Cloudera Manager to deploy and control
>>> your
>>> cluster, effectively rendering it useless if you ever decide to
>>> uninstall the
>>> control software. In the interest of full disclosure, I don't think
>>> parcels
>>> have any chance to landslide the consensus in the industry from Linux
>>> packaging towards something so obscure and proprietary as parcels are.
>>>
>>>
>>> And now to my actual points....:
>>>
>>> I do strongly believe the Bigtop was and is the only completely
>>> transparent,
>>> vendors' friendly, and 100% sticking to official ASF product releases
>>> way of
>>> building your stack from ground up, deploying and controlling it anyway
>>> you
>>> want to. I agree with Roman's presentation on how this project can move
>>> forward. However, I somewhat disagree with his view on the perspectives.
>>> It
>>> might be a hard road to drive the opinion of the community.  But, it is
>>> a high
>>> road.
>>>
>>> We are definitely small and mostly unsupported by commercial groups that
>>> are
>>> using the framework. Being a box of LEGO won't win us anything. If
>>> anything,
>>> the empirical evidences are against it as commercial distros have
>>> decided to
>>> move towards their own means of "vendor lock-in" (yes, you hear me
>>> right - that's exactly what I said: all so called open-source companies
>>> have
>>> invented a way to lock-in their customers either with fancy "enterprise
>>> features" that aren't adding but amending underlying stack; or with
>>> custom set
>>> of patches oftentimes rendering the cluster to become incompatible
>>> between
>>> different vendors).
>>>
>>> By all means, my money are on the second way, yet slightly modified (as
>>> use-cases are coming from users, not developers):
>>>   #2 start driving adoption of software stacks for the particular kind
>>> of data workloads
>>>
>>> This community has enough day-to-day practitioners on board to
>>> accumulate a near-complete introspection of where the technology is
>>> moving.
>>> And instead of wobbling in a backwash, let's see if we can be smart and
>>> define
>>> this landscape. After all, Bigtop has adopted Spark well before any of
>>> the
>>> commercials have officially accepted it. We seemingly are moving more and
>>> more into in-memory realm of data processing: Apache Ignite (Gridgain),
>>> Tachyon, Spark. I don't know how much legs Hive got in it, but I am
>>> doubtful,
>>> that it can walk for much longer... May be it's just me.
>>>
>>> In this thread http://is.gd/MV2BH9 we already discussed some of the
>>> aspects
>>> influencing the feature of this project. And we are de-facto working on
>>> the
>>> implementation. In my opinion, Hadoop has been more or less commoditized
>>> already. And it isn't a bad thing, but it means that the innovations are
>>> elsewhere. E.g. Spark moving is moving beyond its ties with storage
>>> layer via
>>> Tachyon abstraction; GridGain simply doesn't care what's underlying
>>> storage
>>> is. However, data needs to be stored somewhere before it can be
>>> processed. And
>>> HCFS seems to be fitting the bill ok. But, as I said already, I see the
>>> real
>>> action elsewhere. If I were to define the shape of our mid- to long'ish
>>> term
>>> roadmap it'd be something like that:
>>>
>>>             ^   Dashboard/Visualization  ^
>>>             |     OLTP/ML processing     |
>>>             |    Caching/Acceleration    |
>>>             |         Storage            |
>>>
>>> And around this we can add/improve on deployment (R8???),
>>> virtualization/containers/clouds.  In other words - let's focus on the
>>> vertical part of the stack, instead of simply supporting the status quo.
>>>
>>> Does Cassandra fits the Storage layer in that model? I don't know and
>>> most
>>> important - I don't care. If there's an interest and manpower to have
>>> Cassandra-based stack - sure, but perhaps let's do as a separate branch
>>> or
>>> something, so we aren't over-complicating things. As Roman said earlier,
>>> in
>>> this case it'd be great to engage Cassandra/DataStax people into this
>>> project.
>>> But something tells me they won't be eager to jump on board.
>>>
>>> And finally, all this above leads to "how": how we can start reshaping
>>> the
>>> stack into its next incarnation? Perhaps, Ubuntu model might be an
>>> answer for
>>> that, but we have discussed that elsewhere and dropped the idea as it
>>> wasn't
>>> feasible back in the day. Perhaps its time just came?
>>>
>>> Apologies for a long post.
>>>   Cos
>>>
>>>
>>> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
>>> > Which other projects depend on BigTop?  How will the questions about
>>> the
>>> > direction of BigTop affect those projects?
>>> >
>>> > On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <roman@shaposhnik.org
>>> >
>>> > wrote:
>>> >
>>> > > Hi!
>>> > >
>>> > > On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <
>>> jayunit100.apache@gmail.com>
>>> > > wrote:
>>> > > > hi bigtop !
>>> > > >
>>> > > > I thought id start a thread a few vaguely related thoughts i have
>>> around
>>> > > > next couple iterations of bigtop.
>>> > >
>>> > > I think in general I see two major ways for something like
>>> > > Bigtop to evolve:
>>> > >    #1 remain a 'box of LEGO bricks' with very little opinion on
>>> > >         how these pieces need to be integrated
>>> > >    #2 start driving oppinioned use-cases for the particular kind of
>>> > >         bigdata workloads
>>> > >
>>> > > #1 is sort of what all of the Linux distros have been doing for
>>> > > the majority of time they existed. #2 is close to what CentOS
>>> > > is doing with SIGs.
>>> > >
>>> > > Honestly, given the size of our community so far and a total
>>> > > lack of corporate backing (with a small exception of Cloudera
>>> > > still paying for our EC2 time) I think #1 is all we can do. I'd
>>> > > love to be wrong, though.
>>> > >
>>> > > > 1) Hive:  How will bigtop to evolve to support it, now that it is
>>> much
>>> > > more
>>> > > > than a mapreduce query wrapper?
>>> > >
>>> > > I think Hive will remain a big part of Hadoop workloads for
>>> forseeable
>>> > > future. What I'd love to see more of is rationalizing things like how
>>> > > HCatalog, etc. need to be deployed.
>>> > >
>>> > > > 2) I wonder wether we should confirm cassandra interoperability of
>>> spark
>>> > > in
>>> > > > bigtop distros,
>>> > >
>>> > > Only if there's a significant interest from cassandra community and
>>> even
>>> > > then my biggest fear is that with cassandra we're totally changing
>>> the
>>> > > requirements for the underlying storage subsystem (nothing wrong with
>>> > > that, its just that in Hadoop ecosystem everything assumes very
>>> HDFS'ish
>>> > > requirements for the scale-out storage).
>>> > >
>>> > > > 4) in general, i think bigtop can move in one of 3 directions.
>>> > > >
>>> > > >   EXPAND ? : Expanding to include new components, with just basic
>>> > > interop,
>>> > > > and let folks evolve their own stacks on top of bigtop on their
>>> own.
>>> > > >
>>> > > >   CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
>>> > > components,
>>> > > > with super high quality.
>>> > > >
>>> > > >   STAY THE COURSE ? Staying the same ~ a packaging platform for
>>> just
>>> > > > hadoop's direct ecosystem.
>>> > > >
>>> > > > I am intrigued by the idea of A and B both have clear benefits and
>>> > > costs...
>>> > > > would like to see the opinions of folks --- do we  lean in one
>>> direction
>>> > > or
>>> > > > another? What is the criteria for adding a new feature, package,
>>> stack to
>>> > > > bigtop?
>>> > > >
>>> > > > ... Or maybe im just overthinking it and should be spending this
>>> time
>>> > > > testing spark for 0.9 release....
>>> > >
>>> > > I'd love to know what other think, but for 0.9 I'd rather stay the
>>> course.
>>> > >
>>> > > Thanks,
>>> > > Roman.
>>> > >
>>> > > P.S. There are also market forces at play that may fundamentally
>>> change
>>> > > the focus of what we're all working on in the year or so.
>>> > >
>>>
>>
>>
>>
>> --
>> Best regards,
>>
>>    - Andy
>>
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> (via Tom White)
>>
>
>


-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: What will the next generation of bigtop look like?

Posted by RJ Nowling <rn...@gmail.com>.

Good call on Hbase+Phoenix.  What is the use case for SOLR?  I saw that
SOLR can now run on jobs on Spark but I'm not too familiar with the use
cases.

Another avenue may be to add iPython (for notebooks), plotting libraries
like matplotlib and ggplot2, and the scientific python libraries (numpy,
scipy, scikit-learn).  This would approximate would Databricks is offering
as a hosted service and provide a nice toolkit for data scientists to
prototype with.  Since Spark supports Python (PySpark) it provides a user
friendly top to the stack.

Would a management tool like Ambari also be useful?

On Thu, Dec 11, 2014 at 4:10 PM, Jay Vyas <ja...@gmail.com>
wrote:

> Rj - that's not too radical, seems like a lot of folks are embracing that
> idiom.
>
> 1) I like featuring spark along with some persistence technology.
> Cassandra don't seem to have interest in BigTop however.  So maybe...
>
> Spark
> Tachyon
> Hbase+Phoenix
> SOLR
> Kafka
>
> Could be pretty effective.
>
> 2) visualization ? I think that is an afterthought, at least for now....
> It's a lot of work just to get the stack compiling.
>
> On Dec 11, 2014, at 1:14 PM, RJ Nowling <rn...@gmail.com> wrote:
>
> GraphX, Streaming, MLlib, and Spark SQL are all part of Spark and would be
> included in BigTop if Spark is included. They're also pretty well
> integrated with each other.
>
> I'd like to throw out a radical idea, based on Andrew's comments: focus on
> the vertical rather than the horizontal with a slimmed down, Spark-oriented
> stack.  (This could be a subset of the current stack.)  Strat.io's work
> provides a nice example of a pure Spark stack.
>
> Spark offers a smaller footprint, far less maintenance, functionality of
> many Hadoop components in one (and better integration!), and is better
> suited for diverse deployment situations (cloud, non-HDFS storage, etc.)
>
> A few other complementary components would be needed: Kafka would be
> needed for HA with Spark streaming.  Tachyon.  Maybe offer Cassandra or
> similar as an alternative storage option.    Combine this with dashboards
> and visualization and high quality deployment options (Puppet, Docker,
> etc.).  With the data generator and Spark implementation of BigPetStore, my
> goal is to to expand BPS to provide high quality analytics examples,
> oriented more towards data scientists.
>
> Just a thought...
>
> On Thu, Dec 11, 2014 at 12:39 PM, Andrew Purtell <ap...@apache.org>
> wrote:
>
>> This is a really great post and I was nodding along with most of it.
>>
>> My personal view is Bigtop starts as a deployable stack of Apache
>> ecosystem components for Big Data. Commodification of (Linux) deployable
>> packages and basic install integration is the baseline.
>>
>> Bigtop packaging Spark components first is an unfortunately little known
>> win of this community, but its still a win. Although replicating that
>> success with choice of the 'next big thing' is going to be a hit or miss
>> proposition unless one of us can figure out time travel, definitely we can
>> make some observations and scour and/or influence the Apache project
>> landscape to pick up coverage in the space:
>>
>> - Storage is commoditized. Nearly everyone bases the storage stack on
>> HDFS. Everyone does so with what we'd call HCFS. Best to focus elsewhere.
>>
>> - Packaging is commoditized. It's a shame that vendors pursue misguided
>> lock-in strategies but we have no control over that. It's still true that
>> someone using HDP or CDH 4 can switch to Bigtop and vice versa without
>> changing package management tools or strategy. As a user of Apache stack
>> technologies I want long term sustainable package management so will vote
>> with my feet for the commodity option, and won't be alone. Bigtop should
>> provide this, and does, and it's mostly a solved problem.
>>
>> - Deployment is also a "solved" problem but unfortunately everyone solves
>> it differently. :-) This is an area where Bigtop can provide real value,
>> and does, with the Puppet scripts, with the containerization work. One
>> function Bigtop can serve is as repository and example of Hadoop-ish
>> production tooling.
>>
>> - YARN is a reasonably generic grid resource manager. We don't have the
>> resources to stand up an alternate RM and all the tooling necessary with
>> Mesos, but if Mesosphere made a contribution of that I suspect we'd take
>> it. From the Bigtop perspective I think computation framework options are
>> well handled, in that I don't see Bigtop or anyone else developing credible
>> alternatives to MR and Spark for some time. Not sure there's enough oxygen.
>> And we have Giraph (and is GraphX packaged with Spark?). To the extent
>> Spark-on-YARN has rough edges in the Bigtop framework that's an area where
>> contributors can produce value. Related, support for Hive on Spark, Pig on
>> Spark (spork).
>>
>> - The Apache stack includes three streaming computation frameworks -
>> Storm, Spark Streaming, Samza - but Bigtop has mostly missed the boat here.
>> Spark streaming is included in the spark package (I think) but how well is
>> it integrated? Samza is well integrated with YARN but we don't package it.
>> There's also been Storm-on-YARN work out of Yahoo, not sure about what was
>> upstreamed or might be available. Anyway, integration of stream computation
>> frameworks into Bigtop's packaging and deployment/management scripts can
>> produce value, especially if we provide multiple options, because vendors
>> are choosing favorites.
>>
>> - Data access. We do have players differentiating themselves here. Bigtop
>> provides two SQL options (Hive, Phoenix+HBase), can add a third, I see
>> someone's proposed Presto packaging. I'm not sure from the Bigtop
>> perspective we need to pursue additional alternatives, but if there were
>> contributions, we might well take them. "Enterprise friendly API" (SQL) is
>> half of the data access picture I think, the other half is access control.
>> There are competing projects in incubation, Sentry and Ranger, with no
>> shared purpose, which is a real shame. To the extent that Bigtop adopts a
>> cross-component full-stack access control technology, or helps bring
>> another alternative into incubation and adopts that, we can move the needle
>> in this space. We'd offer a vendor neutral access control option devoid of
>> lock-in risk, this would be a big deal for big-E enterprises.
>>
>> - Data management and provenance. Now we're moving up the value chain
>> from storage and data access to the next layer. This is mostly greenfield /
>> blue ocean space in the Apache stack. We have interesting options in
>> incubation: Falcon, Taverna, NiFi. (I think the last one might be truly
>> comprehensive.) All of these are higher level data management and
>> processing workflows which include aspects of management and provenance.
>> One or more could be adopted and refined. There are a lot of relevant
>> integration opportunities up and down the stack that could be undertaken
>> with shared effort of the Bigtop, framework, and component communities.
>>
>> - Machine learning. Moving further up the value chain, we have data and
>> computation and workflow, now how do we derive the competitive advantage
>> that all of the lower layer technologies are in place for? The new hotness
>> is surfacing of insights out of scaled parallel statistical inference.
>> Unfortunately this space doesn't present itself well to the toolbox
>> approach. Bigtop provides Mahout and MLLib as part of Spark (right?), they
>> themselves are toolkits with components of varying utility and maturity
>> (and relevance). I think Bigtop could provide some value by curating ML
>> frameworks that tie in with other Apache stack technologies. ML toolkits
>> leave would-be users in the cold. One has to know what one is doing, and
>> what to do is highly use case specific, this is why "data scientists" can
>> command obscene salaries and only commercial vendors have the resources to
>> focus on specific verticals.
>>
>> - Visualization and preparation. Moving further up, now we are almost
>> touching directly the use case. We have data but we need to clean it,
>> normalize, regularize, filter, slice and dice. Where there are reasonably
>> generic open source tools, preferably at Apache, for data preparation and
>> cleaning Bigtop could provide baseline value by packaging it, and
>> additional value with deeper integration with Apache stack components. Data
>> preparation is a concern hand in hand with data ingest, so we have an
>> interesting feedback loop from the top back down to ingest tools/building
>> blocks like Kafka and Flume. Data cleaning concerns might overlap with the
>> workflow frameworks too. If there's a friendly licensed open source
>> graphical front end to the data cleaning/munging/exploration process that
>> is generic enough that would be a really interesting "acquisition".
>> - We can also package visualization libraries and toolkits for building
>> dashboards. Like with ML algorithms, a complete integration is probably out
>> of scope because every instance would be use case and user specific.
>>
>>
>>
>> On Mon, Dec 8, 2014 at 12:23 PM, Konstantin Boudnik <co...@apache.org>
>> wrote:
>>
>>> First I want to address the RJ's question:
>>>
>>> The most prominent downstream Bigtop Dependency would be any commercial
>>> Hadoop distribution like HDP and CDH. The former is trying to
>>> disguise their affiliation by pushing Ambari forward, and Cloudera's
>>> seemingly
>>> shifting her focus to compressed tarballs media (aka parcels) which
>>> requires
>>> a closed-source solutions like Cloudera Manager to deploy and control
>>> your
>>> cluster, effectively rendering it useless if you ever decide to
>>> uninstall the
>>> control software. In the interest of full disclosure, I don't think
>>> parcels
>>> have any chance to landslide the consensus in the industry from Linux
>>> packaging towards something so obscure and proprietary as parcels are.
>>>
>>>
>>> And now to my actual points....:
>>>
>>> I do strongly believe the Bigtop was and is the only completely
>>> transparent,
>>> vendors' friendly, and 100% sticking to official ASF product releases
>>> way of
>>> building your stack from ground up, deploying and controlling it anyway
>>> you
>>> want to. I agree with Roman's presentation on how this project can move
>>> forward. However, I somewhat disagree with his view on the perspectives.
>>> It
>>> might be a hard road to drive the opinion of the community.  But, it is
>>> a high
>>> road.
>>>
>>> We are definitely small and mostly unsupported by commercial groups that
>>> are
>>> using the framework. Being a box of LEGO won't win us anything. If
>>> anything,
>>> the empirical evidences are against it as commercial distros have
>>> decided to
>>> move towards their own means of "vendor lock-in" (yes, you hear me
>>> right - that's exactly what I said: all so called open-source companies
>>> have
>>> invented a way to lock-in their customers either with fancy "enterprise
>>> features" that aren't adding but amending underlying stack; or with
>>> custom set
>>> of patches oftentimes rendering the cluster to become incompatible
>>> between
>>> different vendors).
>>>
>>> By all means, my money are on the second way, yet slightly modified (as
>>> use-cases are coming from users, not developers):
>>>   #2 start driving adoption of software stacks for the particular kind
>>> of data workloads
>>>
>>> This community has enough day-to-day practitioners on board to
>>> accumulate a near-complete introspection of where the technology is
>>> moving.
>>> And instead of wobbling in a backwash, let's see if we can be smart and
>>> define
>>> this landscape. After all, Bigtop has adopted Spark well before any of
>>> the
>>> commercials have officially accepted it. We seemingly are moving more and
>>> more into in-memory realm of data processing: Apache Ignite (Gridgain),
>>> Tachyon, Spark. I don't know how much legs Hive got in it, but I am
>>> doubtful,
>>> that it can walk for much longer... May be it's just me.
>>>
>>> In this thread http://is.gd/MV2BH9 we already discussed some of the
>>> aspects
>>> influencing the feature of this project. And we are de-facto working on
>>> the
>>> implementation. In my opinion, Hadoop has been more or less commoditized
>>> already. And it isn't a bad thing, but it means that the innovations are
>>> elsewhere. E.g. Spark moving is moving beyond its ties with storage
>>> layer via
>>> Tachyon abstraction; GridGain simply doesn't care what's underlying
>>> storage
>>> is. However, data needs to be stored somewhere before it can be
>>> processed. And
>>> HCFS seems to be fitting the bill ok. But, as I said already, I see the
>>> real
>>> action elsewhere. If I were to define the shape of our mid- to long'ish
>>> term
>>> roadmap it'd be something like that:
>>>
>>>             ^   Dashboard/Visualization  ^
>>>             |     OLTP/ML processing     |
>>>             |    Caching/Acceleration    |
>>>             |         Storage            |
>>>
>>> And around this we can add/improve on deployment (R8???),
>>> virtualization/containers/clouds.  In other words - let's focus on the
>>> vertical part of the stack, instead of simply supporting the status quo.
>>>
>>> Does Cassandra fits the Storage layer in that model? I don't know and
>>> most
>>> important - I don't care. If there's an interest and manpower to have
>>> Cassandra-based stack - sure, but perhaps let's do as a separate branch
>>> or
>>> something, so we aren't over-complicating things. As Roman said earlier,
>>> in
>>> this case it'd be great to engage Cassandra/DataStax people into this
>>> project.
>>> But something tells me they won't be eager to jump on board.
>>>
>>> And finally, all this above leads to "how": how we can start reshaping
>>> the
>>> stack into its next incarnation? Perhaps, Ubuntu model might be an
>>> answer for
>>> that, but we have discussed that elsewhere and dropped the idea as it
>>> wasn't
>>> feasible back in the day. Perhaps its time just came?
>>>
>>> Apologies for a long post.
>>>   Cos
>>>
>>>
>>> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
>>> > Which other projects depend on BigTop?  How will the questions about
>>> the
>>> > direction of BigTop affect those projects?
>>> >
>>> > On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <roman@shaposhnik.org
>>> >
>>> > wrote:
>>> >
>>> > > Hi!
>>> > >
>>> > > On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <
>>> jayunit100.apache@gmail.com>
>>> > > wrote:
>>> > > > hi bigtop !
>>> > > >
>>> > > > I thought id start a thread a few vaguely related thoughts i have
>>> around
>>> > > > next couple iterations of bigtop.
>>> > >
>>> > > I think in general I see two major ways for something like
>>> > > Bigtop to evolve:
>>> > >    #1 remain a 'box of LEGO bricks' with very little opinion on
>>> > >         how these pieces need to be integrated
>>> > >    #2 start driving oppinioned use-cases for the particular kind of
>>> > >         bigdata workloads
>>> > >
>>> > > #1 is sort of what all of the Linux distros have been doing for
>>> > > the majority of time they existed. #2 is close to what CentOS
>>> > > is doing with SIGs.
>>> > >
>>> > > Honestly, given the size of our community so far and a total
>>> > > lack of corporate backing (with a small exception of Cloudera
>>> > > still paying for our EC2 time) I think #1 is all we can do. I'd
>>> > > love to be wrong, though.
>>> > >
>>> > > > 1) Hive:  How will bigtop to evolve to support it, now that it is
>>> much
>>> > > more
>>> > > > than a mapreduce query wrapper?
>>> > >
>>> > > I think Hive will remain a big part of Hadoop workloads for
>>> forseeable
>>> > > future. What I'd love to see more of is rationalizing things like how
>>> > > HCatalog, etc. need to be deployed.
>>> > >
>>> > > > 2) I wonder wether we should confirm cassandra interoperability of
>>> spark
>>> > > in
>>> > > > bigtop distros,
>>> > >
>>> > > Only if there's a significant interest from cassandra community and
>>> even
>>> > > then my biggest fear is that with cassandra we're totally changing
>>> the
>>> > > requirements for the underlying storage subsystem (nothing wrong with
>>> > > that, its just that in Hadoop ecosystem everything assumes very
>>> HDFS'ish
>>> > > requirements for the scale-out storage).
>>> > >
>>> > > > 4) in general, i think bigtop can move in one of 3 directions.
>>> > > >
>>> > > >   EXPAND ? : Expanding to include new components, with just basic
>>> > > interop,
>>> > > > and let folks evolve their own stacks on top of bigtop on their
>>> own.
>>> > > >
>>> > > >   CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
>>> > > components,
>>> > > > with super high quality.
>>> > > >
>>> > > >   STAY THE COURSE ? Staying the same ~ a packaging platform for
>>> just
>>> > > > hadoop's direct ecosystem.
>>> > > >
>>> > > > I am intrigued by the idea of A and B both have clear benefits and
>>> > > costs...
>>> > > > would like to see the opinions of folks --- do we  lean in one
>>> direction
>>> > > or
>>> > > > another? What is the criteria for adding a new feature, package,
>>> stack to
>>> > > > bigtop?
>>> > > >
>>> > > > ... Or maybe im just overthinking it and should be spending this
>>> time
>>> > > > testing spark for 0.9 release....
>>> > >
>>> > > I'd love to know what other think, but for 0.9 I'd rather stay the
>>> course.
>>> > >
>>> > > Thanks,
>>> > > Roman.
>>> > >
>>> > > P.S. There are also market forces at play that may fundamentally
>>> change
>>> > > the focus of what we're all working on in the year or so.
>>> > >
>>>
>>
>>
>>
>> --
>> Best regards,
>>
>>    - Andy
>>
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> (via Tom White)
>>
>
>

Re: What will the next generation of bigtop look like?

Posted by RJ Nowling <rn...@gmail.com>.

Good call on Hbase+Phoenix.  What is the use case for SOLR?  I saw that
SOLR can now run on jobs on Spark but I'm not too familiar with the use
cases.

Another avenue may be to add iPython (for notebooks), plotting libraries
like matplotlib and ggplot2, and the scientific python libraries (numpy,
scipy, scikit-learn).  This would approximate would Databricks is offering
as a hosted service and provide a nice toolkit for data scientists to
prototype with.  Since Spark supports Python (PySpark) it provides a user
friendly top to the stack.

Would a management tool like Ambari also be useful?

On Thu, Dec 11, 2014 at 4:10 PM, Jay Vyas <ja...@gmail.com>
wrote:

> Rj - that's not too radical, seems like a lot of folks are embracing that
> idiom.
>
> 1) I like featuring spark along with some persistence technology.
> Cassandra don't seem to have interest in BigTop however.  So maybe...
>
> Spark
> Tachyon
> Hbase+Phoenix
> SOLR
> Kafka
>
> Could be pretty effective.
>
> 2) visualization ? I think that is an afterthought, at least for now....
> It's a lot of work just to get the stack compiling.
>
> On Dec 11, 2014, at 1:14 PM, RJ Nowling <rn...@gmail.com> wrote:
>
> GraphX, Streaming, MLlib, and Spark SQL are all part of Spark and would be
> included in BigTop if Spark is included. They're also pretty well
> integrated with each other.
>
> I'd like to throw out a radical idea, based on Andrew's comments: focus on
> the vertical rather than the horizontal with a slimmed down, Spark-oriented
> stack.  (This could be a subset of the current stack.)  Strat.io's work
> provides a nice example of a pure Spark stack.
>
> Spark offers a smaller footprint, far less maintenance, functionality of
> many Hadoop components in one (and better integration!), and is better
> suited for diverse deployment situations (cloud, non-HDFS storage, etc.)
>
> A few other complementary components would be needed: Kafka would be
> needed for HA with Spark streaming.  Tachyon.  Maybe offer Cassandra or
> similar as an alternative storage option.    Combine this with dashboards
> and visualization and high quality deployment options (Puppet, Docker,
> etc.).  With the data generator and Spark implementation of BigPetStore, my
> goal is to to expand BPS to provide high quality analytics examples,
> oriented more towards data scientists.
>
> Just a thought...
>
> On Thu, Dec 11, 2014 at 12:39 PM, Andrew Purtell <ap...@apache.org>
> wrote:
>
>> This is a really great post and I was nodding along with most of it.
>>
>> My personal view is Bigtop starts as a deployable stack of Apache
>> ecosystem components for Big Data. Commodification of (Linux) deployable
>> packages and basic install integration is the baseline.
>>
>> Bigtop packaging Spark components first is an unfortunately little known
>> win of this community, but its still a win. Although replicating that
>> success with choice of the 'next big thing' is going to be a hit or miss
>> proposition unless one of us can figure out time travel, definitely we can
>> make some observations and scour and/or influence the Apache project
>> landscape to pick up coverage in the space:
>>
>> - Storage is commoditized. Nearly everyone bases the storage stack on
>> HDFS. Everyone does so with what we'd call HCFS. Best to focus elsewhere.
>>
>> - Packaging is commoditized. It's a shame that vendors pursue misguided
>> lock-in strategies but we have no control over that. It's still true that
>> someone using HDP or CDH 4 can switch to Bigtop and vice versa without
>> changing package management tools or strategy. As a user of Apache stack
>> technologies I want long term sustainable package management so will vote
>> with my feet for the commodity option, and won't be alone. Bigtop should
>> provide this, and does, and it's mostly a solved problem.
>>
>> - Deployment is also a "solved" problem but unfortunately everyone solves
>> it differently. :-) This is an area where Bigtop can provide real value,
>> and does, with the Puppet scripts, with the containerization work. One
>> function Bigtop can serve is as repository and example of Hadoop-ish
>> production tooling.
>>
>> - YARN is a reasonably generic grid resource manager. We don't have the
>> resources to stand up an alternate RM and all the tooling necessary with
>> Mesos, but if Mesosphere made a contribution of that I suspect we'd take
>> it. From the Bigtop perspective I think computation framework options are
>> well handled, in that I don't see Bigtop or anyone else developing credible
>> alternatives to MR and Spark for some time. Not sure there's enough oxygen.
>> And we have Giraph (and is GraphX packaged with Spark?). To the extent
>> Spark-on-YARN has rough edges in the Bigtop framework that's an area where
>> contributors can produce value. Related, support for Hive on Spark, Pig on
>> Spark (spork).
>>
>> - The Apache stack includes three streaming computation frameworks -
>> Storm, Spark Streaming, Samza - but Bigtop has mostly missed the boat here.
>> Spark streaming is included in the spark package (I think) but how well is
>> it integrated? Samza is well integrated with YARN but we don't package it.
>> There's also been Storm-on-YARN work out of Yahoo, not sure about what was
>> upstreamed or might be available. Anyway, integration of stream computation
>> frameworks into Bigtop's packaging and deployment/management scripts can
>> produce value, especially if we provide multiple options, because vendors
>> are choosing favorites.
>>
>> - Data access. We do have players differentiating themselves here. Bigtop
>> provides two SQL options (Hive, Phoenix+HBase), can add a third, I see
>> someone's proposed Presto packaging. I'm not sure from the Bigtop
>> perspective we need to pursue additional alternatives, but if there were
>> contributions, we might well take them. "Enterprise friendly API" (SQL) is
>> half of the data access picture I think, the other half is access control.
>> There are competing projects in incubation, Sentry and Ranger, with no
>> shared purpose, which is a real shame. To the extent that Bigtop adopts a
>> cross-component full-stack access control technology, or helps bring
>> another alternative into incubation and adopts that, we can move the needle
>> in this space. We'd offer a vendor neutral access control option devoid of
>> lock-in risk, this would be a big deal for big-E enterprises.
>>
>> - Data management and provenance. Now we're moving up the value chain
>> from storage and data access to the next layer. This is mostly greenfield /
>> blue ocean space in the Apache stack. We have interesting options in
>> incubation: Falcon, Taverna, NiFi. (I think the last one might be truly
>> comprehensive.) All of these are higher level data management and
>> processing workflows which include aspects of management and provenance.
>> One or more could be adopted and refined. There are a lot of relevant
>> integration opportunities up and down the stack that could be undertaken
>> with shared effort of the Bigtop, framework, and component communities.
>>
>> - Machine learning. Moving further up the value chain, we have data and
>> computation and workflow, now how do we derive the competitive advantage
>> that all of the lower layer technologies are in place for? The new hotness
>> is surfacing of insights out of scaled parallel statistical inference.
>> Unfortunately this space doesn't present itself well to the toolbox
>> approach. Bigtop provides Mahout and MLLib as part of Spark (right?), they
>> themselves are toolkits with components of varying utility and maturity
>> (and relevance). I think Bigtop could provide some value by curating ML
>> frameworks that tie in with other Apache stack technologies. ML toolkits
>> leave would-be users in the cold. One has to know what one is doing, and
>> what to do is highly use case specific, this is why "data scientists" can
>> command obscene salaries and only commercial vendors have the resources to
>> focus on specific verticals.
>>
>> - Visualization and preparation. Moving further up, now we are almost
>> touching directly the use case. We have data but we need to clean it,
>> normalize, regularize, filter, slice and dice. Where there are reasonably
>> generic open source tools, preferably at Apache, for data preparation and
>> cleaning Bigtop could provide baseline value by packaging it, and
>> additional value with deeper integration with Apache stack components. Data
>> preparation is a concern hand in hand with data ingest, so we have an
>> interesting feedback loop from the top back down to ingest tools/building
>> blocks like Kafka and Flume. Data cleaning concerns might overlap with the
>> workflow frameworks too. If there's a friendly licensed open source
>> graphical front end to the data cleaning/munging/exploration process that
>> is generic enough that would be a really interesting "acquisition".
>> - We can also package visualization libraries and toolkits for building
>> dashboards. Like with ML algorithms, a complete integration is probably out
>> of scope because every instance would be use case and user specific.
>>
>>
>>
>> On Mon, Dec 8, 2014 at 12:23 PM, Konstantin Boudnik <co...@apache.org>
>> wrote:
>>
>>> First I want to address the RJ's question:
>>>
>>> The most prominent downstream Bigtop Dependency would be any commercial
>>> Hadoop distribution like HDP and CDH. The former is trying to
>>> disguise their affiliation by pushing Ambari forward, and Cloudera's
>>> seemingly
>>> shifting her focus to compressed tarballs media (aka parcels) which
>>> requires
>>> a closed-source solutions like Cloudera Manager to deploy and control
>>> your
>>> cluster, effectively rendering it useless if you ever decide to
>>> uninstall the
>>> control software. In the interest of full disclosure, I don't think
>>> parcels
>>> have any chance to landslide the consensus in the industry from Linux
>>> packaging towards something so obscure and proprietary as parcels are.
>>>
>>>
>>> And now to my actual points....:
>>>
>>> I do strongly believe the Bigtop was and is the only completely
>>> transparent,
>>> vendors' friendly, and 100% sticking to official ASF product releases
>>> way of
>>> building your stack from ground up, deploying and controlling it anyway
>>> you
>>> want to. I agree with Roman's presentation on how this project can move
>>> forward. However, I somewhat disagree with his view on the perspectives.
>>> It
>>> might be a hard road to drive the opinion of the community.  But, it is
>>> a high
>>> road.
>>>
>>> We are definitely small and mostly unsupported by commercial groups that
>>> are
>>> using the framework. Being a box of LEGO won't win us anything. If
>>> anything,
>>> the empirical evidences are against it as commercial distros have
>>> decided to
>>> move towards their own means of "vendor lock-in" (yes, you hear me
>>> right - that's exactly what I said: all so called open-source companies
>>> have
>>> invented a way to lock-in their customers either with fancy "enterprise
>>> features" that aren't adding but amending underlying stack; or with
>>> custom set
>>> of patches oftentimes rendering the cluster to become incompatible
>>> between
>>> different vendors).
>>>
>>> By all means, my money are on the second way, yet slightly modified (as
>>> use-cases are coming from users, not developers):
>>>   #2 start driving adoption of software stacks for the particular kind
>>> of data workloads
>>>
>>> This community has enough day-to-day practitioners on board to
>>> accumulate a near-complete introspection of where the technology is
>>> moving.
>>> And instead of wobbling in a backwash, let's see if we can be smart and
>>> define
>>> this landscape. After all, Bigtop has adopted Spark well before any of
>>> the
>>> commercials have officially accepted it. We seemingly are moving more and
>>> more into in-memory realm of data processing: Apache Ignite (Gridgain),
>>> Tachyon, Spark. I don't know how much legs Hive got in it, but I am
>>> doubtful,
>>> that it can walk for much longer... May be it's just me.
>>>
>>> In this thread http://is.gd/MV2BH9 we already discussed some of the
>>> aspects
>>> influencing the feature of this project. And we are de-facto working on
>>> the
>>> implementation. In my opinion, Hadoop has been more or less commoditized
>>> already. And it isn't a bad thing, but it means that the innovations are
>>> elsewhere. E.g. Spark moving is moving beyond its ties with storage
>>> layer via
>>> Tachyon abstraction; GridGain simply doesn't care what's underlying
>>> storage
>>> is. However, data needs to be stored somewhere before it can be
>>> processed. And
>>> HCFS seems to be fitting the bill ok. But, as I said already, I see the
>>> real
>>> action elsewhere. If I were to define the shape of our mid- to long'ish
>>> term
>>> roadmap it'd be something like that:
>>>
>>>             ^   Dashboard/Visualization  ^
>>>             |     OLTP/ML processing     |
>>>             |    Caching/Acceleration    |
>>>             |         Storage            |
>>>
>>> And around this we can add/improve on deployment (R8???),
>>> virtualization/containers/clouds.  In other words - let's focus on the
>>> vertical part of the stack, instead of simply supporting the status quo.
>>>
>>> Does Cassandra fits the Storage layer in that model? I don't know and
>>> most
>>> important - I don't care. If there's an interest and manpower to have
>>> Cassandra-based stack - sure, but perhaps let's do as a separate branch
>>> or
>>> something, so we aren't over-complicating things. As Roman said earlier,
>>> in
>>> this case it'd be great to engage Cassandra/DataStax people into this
>>> project.
>>> But something tells me they won't be eager to jump on board.
>>>
>>> And finally, all this above leads to "how": how we can start reshaping
>>> the
>>> stack into its next incarnation? Perhaps, Ubuntu model might be an
>>> answer for
>>> that, but we have discussed that elsewhere and dropped the idea as it
>>> wasn't
>>> feasible back in the day. Perhaps its time just came?
>>>
>>> Apologies for a long post.
>>>   Cos
>>>
>>>
>>> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
>>> > Which other projects depend on BigTop?  How will the questions about
>>> the
>>> > direction of BigTop affect those projects?
>>> >
>>> > On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <roman@shaposhnik.org
>>> >
>>> > wrote:
>>> >
>>> > > Hi!
>>> > >
>>> > > On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <
>>> jayunit100.apache@gmail.com>
>>> > > wrote:
>>> > > > hi bigtop !
>>> > > >
>>> > > > I thought id start a thread a few vaguely related thoughts i have
>>> around
>>> > > > next couple iterations of bigtop.
>>> > >
>>> > > I think in general I see two major ways for something like
>>> > > Bigtop to evolve:
>>> > >    #1 remain a 'box of LEGO bricks' with very little opinion on
>>> > >         how these pieces need to be integrated
>>> > >    #2 start driving oppinioned use-cases for the particular kind of
>>> > >         bigdata workloads
>>> > >
>>> > > #1 is sort of what all of the Linux distros have been doing for
>>> > > the majority of time they existed. #2 is close to what CentOS
>>> > > is doing with SIGs.
>>> > >
>>> > > Honestly, given the size of our community so far and a total
>>> > > lack of corporate backing (with a small exception of Cloudera
>>> > > still paying for our EC2 time) I think #1 is all we can do. I'd
>>> > > love to be wrong, though.
>>> > >
>>> > > > 1) Hive:  How will bigtop to evolve to support it, now that it is
>>> much
>>> > > more
>>> > > > than a mapreduce query wrapper?
>>> > >
>>> > > I think Hive will remain a big part of Hadoop workloads for
>>> forseeable
>>> > > future. What I'd love to see more of is rationalizing things like how
>>> > > HCatalog, etc. need to be deployed.
>>> > >
>>> > > > 2) I wonder wether we should confirm cassandra interoperability of
>>> spark
>>> > > in
>>> > > > bigtop distros,
>>> > >
>>> > > Only if there's a significant interest from cassandra community and
>>> even
>>> > > then my biggest fear is that with cassandra we're totally changing
>>> the
>>> > > requirements for the underlying storage subsystem (nothing wrong with
>>> > > that, its just that in Hadoop ecosystem everything assumes very
>>> HDFS'ish
>>> > > requirements for the scale-out storage).
>>> > >
>>> > > > 4) in general, i think bigtop can move in one of 3 directions.
>>> > > >
>>> > > >   EXPAND ? : Expanding to include new components, with just basic
>>> > > interop,
>>> > > > and let folks evolve their own stacks on top of bigtop on their
>>> own.
>>> > > >
>>> > > >   CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
>>> > > components,
>>> > > > with super high quality.
>>> > > >
>>> > > >   STAY THE COURSE ? Staying the same ~ a packaging platform for
>>> just
>>> > > > hadoop's direct ecosystem.
>>> > > >
>>> > > > I am intrigued by the idea of A and B both have clear benefits and
>>> > > costs...
>>> > > > would like to see the opinions of folks --- do we  lean in one
>>> direction
>>> > > or
>>> > > > another? What is the criteria for adding a new feature, package,
>>> stack to
>>> > > > bigtop?
>>> > > >
>>> > > > ... Or maybe im just overthinking it and should be spending this
>>> time
>>> > > > testing spark for 0.9 release....
>>> > >
>>> > > I'd love to know what other think, but for 0.9 I'd rather stay the
>>> course.
>>> > >
>>> > > Thanks,
>>> > > Roman.
>>> > >
>>> > > P.S. There are also market forces at play that may fundamentally
>>> change
>>> > > the focus of what we're all working on in the year or so.
>>> > >
>>>
>>
>>
>>
>> --
>> Best regards,
>>
>>    - Andy
>>
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> (via Tom White)
>>
>
>

Re: What will the next generation of bigtop look like?

Posted by Jay Vyas <ja...@gmail.com>.

Rj - that's not too radical, seems like a lot of folks are embracing that idiom.

1) I like featuring spark along with some persistence technology.  Cassandra don't seem to have interest in BigTop however.  So maybe...

Spark
Tachyon
Hbase+Phoenix
SOLR
Kafka

Could be pretty effective.

2) visualization ? I think that is an afterthought, at least for now.... It's a lot of work just to get the stack compiling.  

> On Dec 11, 2014, at 1:14 PM, RJ Nowling <rn...@gmail.com> wrote:
> 
> GraphX, Streaming, MLlib, and Spark SQL are all part of Spark and would be included in BigTop if Spark is included. They're also pretty well integrated with each other.
> 
> I'd like to throw out a radical idea, based on Andrew's comments: focus on the vertical rather than the horizontal with a slimmed down, Spark-oriented stack.  (This could be a subset of the current stack.)  Strat.io's work provides a nice example of a pure Spark stack.
> 
> Spark offers a smaller footprint, far less maintenance, functionality of many Hadoop components in one (and better integration!), and is better suited for diverse deployment situations (cloud, non-HDFS storage, etc.) 
> 
> A few other complementary components would be needed: Kafka would be needed for HA with Spark streaming.  Tachyon.  Maybe offer Cassandra or similar as an alternative storage option.    Combine this with dashboards and visualization and high quality deployment options (Puppet, Docker, etc.).  With the data generator and Spark implementation of BigPetStore, my goal is to to expand BPS to provide high quality analytics examples, oriented more towards data scientists.
> 
> Just a thought...
> 
>> On Thu, Dec 11, 2014 at 12:39 PM, Andrew Purtell <ap...@apache.org> wrote:
>> This is a really great post and I was nodding along with most of it. 
>> 
>> My personal view is Bigtop starts as a deployable stack of Apache ecosystem components for Big Data. Commodification of (Linux) deployable packages and basic install integration is the baseline. 
>> 
>> Bigtop packaging Spark components first is an unfortunately little known win of this community, but its still a win. Although replicating that success with choice of the 'next big thing' is going to be a hit or miss proposition unless one of us can figure out time travel, definitely we can make some observations and scour and/or influence the Apache project landscape to pick up coverage in the space:
>> 
>> - Storage is commoditized. Nearly everyone bases the storage stack on HDFS. Everyone does so with what we'd call HCFS. Best to focus elsewhere.
>> 
>> - Packaging is commoditized. It's a shame that vendors pursue misguided lock-in strategies but we have no control over that. It's still true that someone using HDP or CDH 4 can switch to Bigtop and vice versa without changing package management tools or strategy. As a user of Apache stack technologies I want long term sustainable package management so will vote with my feet for the commodity option, and won't be alone. Bigtop should provide this, and does, and it's mostly a solved problem.
>> 
>> - Deployment is also a "solved" problem but unfortunately everyone solves it differently. :-) This is an area where Bigtop can provide real value, and does, with the Puppet scripts, with the containerization work. One function Bigtop can serve is as repository and example of Hadoop-ish production tooling.
>> 
>> - YARN is a reasonably generic grid resource manager. We don't have the resources to stand up an alternate RM and all the tooling necessary with Mesos, but if Mesosphere made a contribution of that I suspect we'd take it. From the Bigtop perspective I think computation framework options are well handled, in that I don't see Bigtop or anyone else developing credible alternatives to MR and Spark for some time. Not sure there's enough oxygen. And we have Giraph (and is GraphX packaged with Spark?). To the extent Spark-on-YARN has rough edges in the Bigtop framework that's an area where contributors can produce value. Related, support for Hive on Spark, Pig on Spark (spork). 
>> 
>> - The Apache stack includes three streaming computation frameworks - Storm, Spark Streaming, Samza - but Bigtop has mostly missed the boat here. Spark streaming is included in the spark package (I think) but how well is it integrated? Samza is well integrated with YARN but we don't package it. There's also been Storm-on-YARN work out of Yahoo, not sure about what was upstreamed or might be available. Anyway, integration of stream computation frameworks into Bigtop's packaging and deployment/management scripts can produce value, especially if we provide multiple options, because vendors are choosing favorites. 
>> 
>> - Data access. We do have players differentiating themselves here. Bigtop provides two SQL options (Hive, Phoenix+HBase), can add a third, I see someone's proposed Presto packaging. I'm not sure from the Bigtop perspective we need to pursue additional alternatives, but if there were contributions, we might well take them. "Enterprise friendly API" (SQL) is half of the data access picture I think, the other half is access control. There are competing projects in incubation, Sentry and Ranger, with no shared purpose, which is a real shame. To the extent that Bigtop adopts a cross-component full-stack access control technology, or helps bring another alternative into incubation and adopts that, we can move the needle in this space. We'd offer a vendor neutral access control option devoid of lock-in risk, this would be a big deal for big-E enterprises.
>> 
>> - Data management and provenance. Now we're moving up the value chain from storage and data access to the next layer. This is mostly greenfield / blue ocean space in the Apache stack. We have interesting options in incubation: Falcon, Taverna, NiFi. (I think the last one might be truly comprehensive.) All of these are higher level data management and processing workflows which include aspects of management and provenance. One or more could be adopted and refined. There are a lot of relevant integration opportunities up and down the stack that could be undertaken with shared effort of the Bigtop, framework, and component communities.
>> 
>> - Machine learning. Moving further up the value chain, we have data and computation and workflow, now how do we derive the competitive advantage that all of the lower layer technologies are in place for? The new hotness is surfacing of insights out of scaled parallel statistical inference. Unfortunately this space doesn't present itself well to the toolbox approach. Bigtop provides Mahout and MLLib as part of Spark (right?), they themselves are toolkits with components of varying utility and maturity (and relevance). I think Bigtop could provide some value by curating ML frameworks that tie in with other Apache stack technologies. ML toolkits leave would-be users in the cold. One has to know what one is doing, and what to do is highly use case specific, this is why "data scientists" can command obscene salaries and only commercial vendors have the resources to focus on specific verticals. 
>> 
>> - Visualization and preparation. Moving further up, now we are almost touching directly the use case. We have data but we need to clean it, normalize, regularize, filter, slice and dice. Where there are reasonably generic open source tools, preferably at Apache, for data preparation and cleaning Bigtop could provide baseline value by packaging it, and additional value with deeper integration with Apache stack components. Data preparation is a concern hand in hand with data ingest, so we have an interesting feedback loop from the top back down to ingest tools/building blocks like Kafka and Flume. Data cleaning concerns might overlap with the workflow frameworks too. If there's a friendly licensed open source graphical front end to the data cleaning/munging/exploration process that is generic enough that would be a really interesting "acquisition". 
>> - We can also package visualization libraries and toolkits for building dashboards. Like with ML algorithms, a complete integration is probably out of scope because every instance would be use case and user specific.
>> 
>> 
>> 
>>> On Mon, Dec 8, 2014 at 12:23 PM, Konstantin Boudnik <co...@apache.org> wrote:
>>> First I want to address the RJ's question:
>>> 
>>> The most prominent downstream Bigtop Dependency would be any commercial
>>> Hadoop distribution like HDP and CDH. The former is trying to
>>> disguise their affiliation by pushing Ambari forward, and Cloudera's seemingly
>>> shifting her focus to compressed tarballs media (aka parcels) which requires
>>> a closed-source solutions like Cloudera Manager to deploy and control your
>>> cluster, effectively rendering it useless if you ever decide to uninstall the
>>> control software. In the interest of full disclosure, I don't think parcels
>>> have any chance to landslide the consensus in the industry from Linux
>>> packaging towards something so obscure and proprietary as parcels are.
>>> 
>>> 
>>> And now to my actual points....:
>>> 
>>> I do strongly believe the Bigtop was and is the only completely transparent,
>>> vendors' friendly, and 100% sticking to official ASF product releases way of
>>> building your stack from ground up, deploying and controlling it anyway you
>>> want to. I agree with Roman's presentation on how this project can move
>>> forward. However, I somewhat disagree with his view on the perspectives. It
>>> might be a hard road to drive the opinion of the community.  But, it is a high
>>> road.
>>> 
>>> We are definitely small and mostly unsupported by commercial groups that are
>>> using the framework. Being a box of LEGO won't win us anything. If anything,
>>> the empirical evidences are against it as commercial distros have decided to
>>> move towards their own means of "vendor lock-in" (yes, you hear me
>>> right - that's exactly what I said: all so called open-source companies have
>>> invented a way to lock-in their customers either with fancy "enterprise
>>> features" that aren't adding but amending underlying stack; or with custom set
>>> of patches oftentimes rendering the cluster to become incompatible between
>>> different vendors).
>>> 
>>> By all means, my money are on the second way, yet slightly modified (as
>>> use-cases are coming from users, not developers):
>>>   #2 start driving adoption of software stacks for the particular kind of data workloads
>>> 
>>> This community has enough day-to-day practitioners on board to
>>> accumulate a near-complete introspection of where the technology is moving.
>>> And instead of wobbling in a backwash, let's see if we can be smart and define
>>> this landscape. After all, Bigtop has adopted Spark well before any of the
>>> commercials have officially accepted it. We seemingly are moving more and
>>> more into in-memory realm of data processing: Apache Ignite (Gridgain),
>>> Tachyon, Spark. I don't know how much legs Hive got in it, but I am doubtful,
>>> that it can walk for much longer... May be it's just me.
>>> 
>>> In this thread http://is.gd/MV2BH9 we already discussed some of the aspects
>>> influencing the feature of this project. And we are de-facto working on the
>>> implementation. In my opinion, Hadoop has been more or less commoditized
>>> already. And it isn't a bad thing, but it means that the innovations are
>>> elsewhere. E.g. Spark moving is moving beyond its ties with storage layer via
>>> Tachyon abstraction; GridGain simply doesn't care what's underlying storage
>>> is. However, data needs to be stored somewhere before it can be processed. And
>>> HCFS seems to be fitting the bill ok. But, as I said already, I see the real
>>> action elsewhere. If I were to define the shape of our mid- to long'ish term
>>> roadmap it'd be something like that:
>>> 
>>>             ^   Dashboard/Visualization  ^
>>>             |     OLTP/ML processing     |
>>>             |    Caching/Acceleration    |
>>>             |         Storage            |
>>> 
>>> And around this we can add/improve on deployment (R8???),
>>> virtualization/containers/clouds.  In other words - let's focus on the
>>> vertical part of the stack, instead of simply supporting the status quo.
>>> 
>>> Does Cassandra fits the Storage layer in that model? I don't know and most
>>> important - I don't care. If there's an interest and manpower to have
>>> Cassandra-based stack - sure, but perhaps let's do as a separate branch or
>>> something, so we aren't over-complicating things. As Roman said earlier, in
>>> this case it'd be great to engage Cassandra/DataStax people into this project.
>>> But something tells me they won't be eager to jump on board.
>>> 
>>> And finally, all this above leads to "how": how we can start reshaping the
>>> stack into its next incarnation? Perhaps, Ubuntu model might be an answer for
>>> that, but we have discussed that elsewhere and dropped the idea as it wasn't
>>> feasible back in the day. Perhaps its time just came?
>>> 
>>> Apologies for a long post.
>>>   Cos
>>> 
>>> 
>>> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
>>> > Which other projects depend on BigTop?  How will the questions about the
>>> > direction of BigTop affect those projects?
>>> >
>>> > On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <ro...@shaposhnik.org>
>>> > wrote:
>>> >
>>> > > Hi!
>>> > >
>>> > > On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <ja...@gmail.com>
>>> > > wrote:
>>> > > > hi bigtop !
>>> > > >
>>> > > > I thought id start a thread a few vaguely related thoughts i have around
>>> > > > next couple iterations of bigtop.
>>> > >
>>> > > I think in general I see two major ways for something like
>>> > > Bigtop to evolve:
>>> > >    #1 remain a 'box of LEGO bricks' with very little opinion on
>>> > >         how these pieces need to be integrated
>>> > >    #2 start driving oppinioned use-cases for the particular kind of
>>> > >         bigdata workloads
>>> > >
>>> > > #1 is sort of what all of the Linux distros have been doing for
>>> > > the majority of time they existed. #2 is close to what CentOS
>>> > > is doing with SIGs.
>>> > >
>>> > > Honestly, given the size of our community so far and a total
>>> > > lack of corporate backing (with a small exception of Cloudera
>>> > > still paying for our EC2 time) I think #1 is all we can do. I'd
>>> > > love to be wrong, though.
>>> > >
>>> > > > 1) Hive:  How will bigtop to evolve to support it, now that it is much
>>> > > more
>>> > > > than a mapreduce query wrapper?
>>> > >
>>> > > I think Hive will remain a big part of Hadoop workloads for forseeable
>>> > > future. What I'd love to see more of is rationalizing things like how
>>> > > HCatalog, etc. need to be deployed.
>>> > >
>>> > > > 2) I wonder wether we should confirm cassandra interoperability of spark
>>> > > in
>>> > > > bigtop distros,
>>> > >
>>> > > Only if there's a significant interest from cassandra community and even
>>> > > then my biggest fear is that with cassandra we're totally changing the
>>> > > requirements for the underlying storage subsystem (nothing wrong with
>>> > > that, its just that in Hadoop ecosystem everything assumes very HDFS'ish
>>> > > requirements for the scale-out storage).
>>> > >
>>> > > > 4) in general, i think bigtop can move in one of 3 directions.
>>> > > >
>>> > > >   EXPAND ? : Expanding to include new components, with just basic
>>> > > interop,
>>> > > > and let folks evolve their own stacks on top of bigtop on their own.
>>> > > >
>>> > > >   CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
>>> > > components,
>>> > > > with super high quality.
>>> > > >
>>> > > >   STAY THE COURSE ? Staying the same ~ a packaging platform for just
>>> > > > hadoop's direct ecosystem.
>>> > > >
>>> > > > I am intrigued by the idea of A and B both have clear benefits and
>>> > > costs...
>>> > > > would like to see the opinions of folks --- do we  lean in one direction
>>> > > or
>>> > > > another? What is the criteria for adding a new feature, package, stack to
>>> > > > bigtop?
>>> > > >
>>> > > > ... Or maybe im just overthinking it and should be spending this time
>>> > > > testing spark for 0.9 release....
>>> > >
>>> > > I'd love to know what other think, but for 0.9 I'd rather stay the course.
>>> > >
>>> > > Thanks,
>>> > > Roman.
>>> > >
>>> > > P.S. There are also market forces at play that may fundamentally change
>>> > > the focus of what we're all working on in the year or so.
>>> > >
>> 
>> 
>> 
>> -- 
>> Best regards,
>> 
>>    - Andy
>> 
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
>

Re: What will the next generation of bigtop look like?

Posted by Andrew Purtell <ap...@apache.org>.

The problem I see with a Spark-only stack is, in my experience, Spark falls
apart as soon as the working set exceeds all available RAM on the cluster.
(One is presented with a sea of exceptions.) We need Hadoop anyway for HDFS
and Common (required by many many components), we get YARN and the MR
runtime as part of this package, and Hadoop MR is still eminently useful
when data sets and storage requirements are far beyond agg RAM.

We have an open JIRA for adding Kafka, it would be fantastic if someone
picks it up and brings it over the finish line.


On Thu, Dec 11, 2014 at 10:14 AM, RJ Nowling <rn...@gmail.com> wrote:

> GraphX, Streaming, MLlib, and Spark SQL are all part of Spark and would be
> included in BigTop if Spark is included. They're also pretty well
> integrated with each other.
>
> I'd like to throw out a radical idea, based on Andrew's comments: focus on
> the vertical rather than the horizontal with a slimmed down, Spark-oriented
> stack.  (This could be a subset of the current stack.)  Strat.io's work
> provides a nice example of a pure Spark stack.
>
> Spark offers a smaller footprint, far less maintenance, functionality of
> many Hadoop components in one (and better integration!), and is better
> suited for diverse deployment situations (cloud, non-HDFS storage, etc.)
>
> A few other complementary components would be needed: Kafka would be
> needed for HA with Spark streaming.  Tachyon.  Maybe offer Cassandra or
> similar as an alternative storage option.    Combine this with dashboards
> and visualization and high quality deployment options (Puppet, Docker,
> etc.).  With the data generator and Spark implementation of BigPetStore, my
> goal is to to expand BPS to provide high quality analytics examples,
> oriented more towards data scientists.
>
> Just a thought...
>
> On Thu, Dec 11, 2014 at 12:39 PM, Andrew Purtell <ap...@apache.org>
> wrote:
>
>> This is a really great post and I was nodding along with most of it.
>>
>> My personal view is Bigtop starts as a deployable stack of Apache
>> ecosystem components for Big Data. Commodification of (Linux) deployable
>> packages and basic install integration is the baseline.
>>
>> Bigtop packaging Spark components first is an unfortunately little known
>> win of this community, but its still a win. Although replicating that
>> success with choice of the 'next big thing' is going to be a hit or miss
>> proposition unless one of us can figure out time travel, definitely we can
>> make some observations and scour and/or influence the Apache project
>> landscape to pick up coverage in the space:
>>
>> - Storage is commoditized. Nearly everyone bases the storage stack on
>> HDFS. Everyone does so with what we'd call HCFS. Best to focus elsewhere.
>>
>> - Packaging is commoditized. It's a shame that vendors pursue misguided
>> lock-in strategies but we have no control over that. It's still true that
>> someone using HDP or CDH 4 can switch to Bigtop and vice versa without
>> changing package management tools or strategy. As a user of Apache stack
>> technologies I want long term sustainable package management so will vote
>> with my feet for the commodity option, and won't be alone. Bigtop should
>> provide this, and does, and it's mostly a solved problem.
>>
>> - Deployment is also a "solved" problem but unfortunately everyone solves
>> it differently. :-) This is an area where Bigtop can provide real value,
>> and does, with the Puppet scripts, with the containerization work. One
>> function Bigtop can serve is as repository and example of Hadoop-ish
>> production tooling.
>>
>> - YARN is a reasonably generic grid resource manager. We don't have the
>> resources to stand up an alternate RM and all the tooling necessary with
>> Mesos, but if Mesosphere made a contribution of that I suspect we'd take
>> it. From the Bigtop perspective I think computation framework options are
>> well handled, in that I don't see Bigtop or anyone else developing credible
>> alternatives to MR and Spark for some time. Not sure there's enough oxygen.
>> And we have Giraph (and is GraphX packaged with Spark?). To the extent
>> Spark-on-YARN has rough edges in the Bigtop framework that's an area where
>> contributors can produce value. Related, support for Hive on Spark, Pig on
>> Spark (spork).
>>
>> - The Apache stack includes three streaming computation frameworks -
>> Storm, Spark Streaming, Samza - but Bigtop has mostly missed the boat here.
>> Spark streaming is included in the spark package (I think) but how well is
>> it integrated? Samza is well integrated with YARN but we don't package it.
>> There's also been Storm-on-YARN work out of Yahoo, not sure about what was
>> upstreamed or might be available. Anyway, integration of stream computation
>> frameworks into Bigtop's packaging and deployment/management scripts can
>> produce value, especially if we provide multiple options, because vendors
>> are choosing favorites.
>>
>> - Data access. We do have players differentiating themselves here. Bigtop
>> provides two SQL options (Hive, Phoenix+HBase), can add a third, I see
>> someone's proposed Presto packaging. I'm not sure from the Bigtop
>> perspective we need to pursue additional alternatives, but if there were
>> contributions, we might well take them. "Enterprise friendly API" (SQL) is
>> half of the data access picture I think, the other half is access control.
>> There are competing projects in incubation, Sentry and Ranger, with no
>> shared purpose, which is a real shame. To the extent that Bigtop adopts a
>> cross-component full-stack access control technology, or helps bring
>> another alternative into incubation and adopts that, we can move the needle
>> in this space. We'd offer a vendor neutral access control option devoid of
>> lock-in risk, this would be a big deal for big-E enterprises.
>>
>> - Data management and provenance. Now we're moving up the value chain
>> from storage and data access to the next layer. This is mostly greenfield /
>> blue ocean space in the Apache stack. We have interesting options in
>> incubation: Falcon, Taverna, NiFi. (I think the last one might be truly
>> comprehensive.) All of these are higher level data management and
>> processing workflows which include aspects of management and provenance.
>> One or more could be adopted and refined. There are a lot of relevant
>> integration opportunities up and down the stack that could be undertaken
>> with shared effort of the Bigtop, framework, and component communities.
>>
>> - Machine learning. Moving further up the value chain, we have data and
>> computation and workflow, now how do we derive the competitive advantage
>> that all of the lower layer technologies are in place for? The new hotness
>> is surfacing of insights out of scaled parallel statistical inference.
>> Unfortunately this space doesn't present itself well to the toolbox
>> approach. Bigtop provides Mahout and MLLib as part of Spark (right?), they
>> themselves are toolkits with components of varying utility and maturity
>> (and relevance). I think Bigtop could provide some value by curating ML
>> frameworks that tie in with other Apache stack technologies. ML toolkits
>> leave would-be users in the cold. One has to know what one is doing, and
>> what to do is highly use case specific, this is why "data scientists" can
>> command obscene salaries and only commercial vendors have the resources to
>> focus on specific verticals.
>>
>> - Visualization and preparation. Moving further up, now we are almost
>> touching directly the use case. We have data but we need to clean it,
>> normalize, regularize, filter, slice and dice. Where there are reasonably
>> generic open source tools, preferably at Apache, for data preparation and
>> cleaning Bigtop could provide baseline value by packaging it, and
>> additional value with deeper integration with Apache stack components. Data
>> preparation is a concern hand in hand with data ingest, so we have an
>> interesting feedback loop from the top back down to ingest tools/building
>> blocks like Kafka and Flume. Data cleaning concerns might overlap with the
>> workflow frameworks too. If there's a friendly licensed open source
>> graphical front end to the data cleaning/munging/exploration process that
>> is generic enough that would be a really interesting "acquisition".
>> - We can also package visualization libraries and toolkits for building
>> dashboards. Like with ML algorithms, a complete integration is probably out
>> of scope because every instance would be use case and user specific.
>>
>>
>>
>> On Mon, Dec 8, 2014 at 12:23 PM, Konstantin Boudnik <co...@apache.org>
>> wrote:
>>
>>> First I want to address the RJ's question:
>>>
>>> The most prominent downstream Bigtop Dependency would be any commercial
>>> Hadoop distribution like HDP and CDH. The former is trying to
>>> disguise their affiliation by pushing Ambari forward, and Cloudera's
>>> seemingly
>>> shifting her focus to compressed tarballs media (aka parcels) which
>>> requires
>>> a closed-source solutions like Cloudera Manager to deploy and control
>>> your
>>> cluster, effectively rendering it useless if you ever decide to
>>> uninstall the
>>> control software. In the interest of full disclosure, I don't think
>>> parcels
>>> have any chance to landslide the consensus in the industry from Linux
>>> packaging towards something so obscure and proprietary as parcels are.
>>>
>>>
>>> And now to my actual points....:
>>>
>>> I do strongly believe the Bigtop was and is the only completely
>>> transparent,
>>> vendors' friendly, and 100% sticking to official ASF product releases
>>> way of
>>> building your stack from ground up, deploying and controlling it anyway
>>> you
>>> want to. I agree with Roman's presentation on how this project can move
>>> forward. However, I somewhat disagree with his view on the perspectives.
>>> It
>>> might be a hard road to drive the opinion of the community.  But, it is
>>> a high
>>> road.
>>>
>>> We are definitely small and mostly unsupported by commercial groups that
>>> are
>>> using the framework. Being a box of LEGO won't win us anything. If
>>> anything,
>>> the empirical evidences are against it as commercial distros have
>>> decided to
>>> move towards their own means of "vendor lock-in" (yes, you hear me
>>> right - that's exactly what I said: all so called open-source companies
>>> have
>>> invented a way to lock-in their customers either with fancy "enterprise
>>> features" that aren't adding but amending underlying stack; or with
>>> custom set
>>> of patches oftentimes rendering the cluster to become incompatible
>>> between
>>> different vendors).
>>>
>>> By all means, my money are on the second way, yet slightly modified (as
>>> use-cases are coming from users, not developers):
>>>   #2 start driving adoption of software stacks for the particular kind
>>> of data workloads
>>>
>>> This community has enough day-to-day practitioners on board to
>>> accumulate a near-complete introspection of where the technology is
>>> moving.
>>> And instead of wobbling in a backwash, let's see if we can be smart and
>>> define
>>> this landscape. After all, Bigtop has adopted Spark well before any of
>>> the
>>> commercials have officially accepted it. We seemingly are moving more and
>>> more into in-memory realm of data processing: Apache Ignite (Gridgain),
>>> Tachyon, Spark. I don't know how much legs Hive got in it, but I am
>>> doubtful,
>>> that it can walk for much longer... May be it's just me.
>>>
>>> In this thread http://is.gd/MV2BH9 we already discussed some of the
>>> aspects
>>> influencing the feature of this project. And we are de-facto working on
>>> the
>>> implementation. In my opinion, Hadoop has been more or less commoditized
>>> already. And it isn't a bad thing, but it means that the innovations are
>>> elsewhere. E.g. Spark moving is moving beyond its ties with storage
>>> layer via
>>> Tachyon abstraction; GridGain simply doesn't care what's underlying
>>> storage
>>> is. However, data needs to be stored somewhere before it can be
>>> processed. And
>>> HCFS seems to be fitting the bill ok. But, as I said already, I see the
>>> real
>>> action elsewhere. If I were to define the shape of our mid- to long'ish
>>> term
>>> roadmap it'd be something like that:
>>>
>>>             ^   Dashboard/Visualization  ^
>>>             |     OLTP/ML processing     |
>>>             |    Caching/Acceleration    |
>>>             |         Storage            |
>>>
>>> And around this we can add/improve on deployment (R8???),
>>> virtualization/containers/clouds.  In other words - let's focus on the
>>> vertical part of the stack, instead of simply supporting the status quo.
>>>
>>> Does Cassandra fits the Storage layer in that model? I don't know and
>>> most
>>> important - I don't care. If there's an interest and manpower to have
>>> Cassandra-based stack - sure, but perhaps let's do as a separate branch
>>> or
>>> something, so we aren't over-complicating things. As Roman said earlier,
>>> in
>>> this case it'd be great to engage Cassandra/DataStax people into this
>>> project.
>>> But something tells me they won't be eager to jump on board.
>>>
>>> And finally, all this above leads to "how": how we can start reshaping
>>> the
>>> stack into its next incarnation? Perhaps, Ubuntu model might be an
>>> answer for
>>> that, but we have discussed that elsewhere and dropped the idea as it
>>> wasn't
>>> feasible back in the day. Perhaps its time just came?
>>>
>>> Apologies for a long post.
>>>   Cos
>>>
>>>
>>> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
>>> > Which other projects depend on BigTop?  How will the questions about
>>> the
>>> > direction of BigTop affect those projects?
>>> >
>>> > On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <roman@shaposhnik.org
>>> >
>>> > wrote:
>>> >
>>> > > Hi!
>>> > >
>>> > > On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <
>>> jayunit100.apache@gmail.com>
>>> > > wrote:
>>> > > > hi bigtop !
>>> > > >
>>> > > > I thought id start a thread a few vaguely related thoughts i have
>>> around
>>> > > > next couple iterations of bigtop.
>>> > >
>>> > > I think in general I see two major ways for something like
>>> > > Bigtop to evolve:
>>> > >    #1 remain a 'box of LEGO bricks' with very little opinion on
>>> > >         how these pieces need to be integrated
>>> > >    #2 start driving oppinioned use-cases for the particular kind of
>>> > >         bigdata workloads
>>> > >
>>> > > #1 is sort of what all of the Linux distros have been doing for
>>> > > the majority of time they existed. #2 is close to what CentOS
>>> > > is doing with SIGs.
>>> > >
>>> > > Honestly, given the size of our community so far and a total
>>> > > lack of corporate backing (with a small exception of Cloudera
>>> > > still paying for our EC2 time) I think #1 is all we can do. I'd
>>> > > love to be wrong, though.
>>> > >
>>> > > > 1) Hive:  How will bigtop to evolve to support it, now that it is
>>> much
>>> > > more
>>> > > > than a mapreduce query wrapper?
>>> > >
>>> > > I think Hive will remain a big part of Hadoop workloads for
>>> forseeable
>>> > > future. What I'd love to see more of is rationalizing things like how
>>> > > HCatalog, etc. need to be deployed.
>>> > >
>>> > > > 2) I wonder wether we should confirm cassandra interoperability of
>>> spark
>>> > > in
>>> > > > bigtop distros,
>>> > >
>>> > > Only if there's a significant interest from cassandra community and
>>> even
>>> > > then my biggest fear is that with cassandra we're totally changing
>>> the
>>> > > requirements for the underlying storage subsystem (nothing wrong with
>>> > > that, its just that in Hadoop ecosystem everything assumes very
>>> HDFS'ish
>>> > > requirements for the scale-out storage).
>>> > >
>>> > > > 4) in general, i think bigtop can move in one of 3 directions.
>>> > > >
>>> > > >   EXPAND ? : Expanding to include new components, with just basic
>>> > > interop,
>>> > > > and let folks evolve their own stacks on top of bigtop on their
>>> own.
>>> > > >
>>> > > >   CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
>>> > > components,
>>> > > > with super high quality.
>>> > > >
>>> > > >   STAY THE COURSE ? Staying the same ~ a packaging platform for
>>> just
>>> > > > hadoop's direct ecosystem.
>>> > > >
>>> > > > I am intrigued by the idea of A and B both have clear benefits and
>>> > > costs...
>>> > > > would like to see the opinions of folks --- do we  lean in one
>>> direction
>>> > > or
>>> > > > another? What is the criteria for adding a new feature, package,
>>> stack to
>>> > > > bigtop?
>>> > > >
>>> > > > ... Or maybe im just overthinking it and should be spending this
>>> time
>>> > > > testing spark for 0.9 release....
>>> > >
>>> > > I'd love to know what other think, but for 0.9 I'd rather stay the
>>> course.
>>> > >
>>> > > Thanks,
>>> > > Roman.
>>> > >
>>> > > P.S. There are also market forces at play that may fundamentally
>>> change
>>> > > the focus of what we're all working on in the year or so.
>>> > >
>>>
>>
>>
>>
>> --
>> Best regards,
>>
>>    - Andy
>>
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> (via Tom White)
>>
>
>


-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: What will the next generation of bigtop look like?

Posted by Jay Vyas <ja...@gmail.com>.

Rj - that's not too radical, seems like a lot of folks are embracing that idiom.

1) I like featuring spark along with some persistence technology.  Cassandra don't seem to have interest in BigTop however.  So maybe...

Spark
Tachyon
Hbase+Phoenix
SOLR
Kafka

Could be pretty effective.

2) visualization ? I think that is an afterthought, at least for now.... It's a lot of work just to get the stack compiling.  

> On Dec 11, 2014, at 1:14 PM, RJ Nowling <rn...@gmail.com> wrote:
> 
> GraphX, Streaming, MLlib, and Spark SQL are all part of Spark and would be included in BigTop if Spark is included. They're also pretty well integrated with each other.
> 
> I'd like to throw out a radical idea, based on Andrew's comments: focus on the vertical rather than the horizontal with a slimmed down, Spark-oriented stack.  (This could be a subset of the current stack.)  Strat.io's work provides a nice example of a pure Spark stack.
> 
> Spark offers a smaller footprint, far less maintenance, functionality of many Hadoop components in one (and better integration!), and is better suited for diverse deployment situations (cloud, non-HDFS storage, etc.) 
> 
> A few other complementary components would be needed: Kafka would be needed for HA with Spark streaming.  Tachyon.  Maybe offer Cassandra or similar as an alternative storage option.    Combine this with dashboards and visualization and high quality deployment options (Puppet, Docker, etc.).  With the data generator and Spark implementation of BigPetStore, my goal is to to expand BPS to provide high quality analytics examples, oriented more towards data scientists.
> 
> Just a thought...
> 
>> On Thu, Dec 11, 2014 at 12:39 PM, Andrew Purtell <ap...@apache.org> wrote:
>> This is a really great post and I was nodding along with most of it. 
>> 
>> My personal view is Bigtop starts as a deployable stack of Apache ecosystem components for Big Data. Commodification of (Linux) deployable packages and basic install integration is the baseline. 
>> 
>> Bigtop packaging Spark components first is an unfortunately little known win of this community, but its still a win. Although replicating that success with choice of the 'next big thing' is going to be a hit or miss proposition unless one of us can figure out time travel, definitely we can make some observations and scour and/or influence the Apache project landscape to pick up coverage in the space:
>> 
>> - Storage is commoditized. Nearly everyone bases the storage stack on HDFS. Everyone does so with what we'd call HCFS. Best to focus elsewhere.
>> 
>> - Packaging is commoditized. It's a shame that vendors pursue misguided lock-in strategies but we have no control over that. It's still true that someone using HDP or CDH 4 can switch to Bigtop and vice versa without changing package management tools or strategy. As a user of Apache stack technologies I want long term sustainable package management so will vote with my feet for the commodity option, and won't be alone. Bigtop should provide this, and does, and it's mostly a solved problem.
>> 
>> - Deployment is also a "solved" problem but unfortunately everyone solves it differently. :-) This is an area where Bigtop can provide real value, and does, with the Puppet scripts, with the containerization work. One function Bigtop can serve is as repository and example of Hadoop-ish production tooling.
>> 
>> - YARN is a reasonably generic grid resource manager. We don't have the resources to stand up an alternate RM and all the tooling necessary with Mesos, but if Mesosphere made a contribution of that I suspect we'd take it. From the Bigtop perspective I think computation framework options are well handled, in that I don't see Bigtop or anyone else developing credible alternatives to MR and Spark for some time. Not sure there's enough oxygen. And we have Giraph (and is GraphX packaged with Spark?). To the extent Spark-on-YARN has rough edges in the Bigtop framework that's an area where contributors can produce value. Related, support for Hive on Spark, Pig on Spark (spork). 
>> 
>> - The Apache stack includes three streaming computation frameworks - Storm, Spark Streaming, Samza - but Bigtop has mostly missed the boat here. Spark streaming is included in the spark package (I think) but how well is it integrated? Samza is well integrated with YARN but we don't package it. There's also been Storm-on-YARN work out of Yahoo, not sure about what was upstreamed or might be available. Anyway, integration of stream computation frameworks into Bigtop's packaging and deployment/management scripts can produce value, especially if we provide multiple options, because vendors are choosing favorites. 
>> 
>> - Data access. We do have players differentiating themselves here. Bigtop provides two SQL options (Hive, Phoenix+HBase), can add a third, I see someone's proposed Presto packaging. I'm not sure from the Bigtop perspective we need to pursue additional alternatives, but if there were contributions, we might well take them. "Enterprise friendly API" (SQL) is half of the data access picture I think, the other half is access control. There are competing projects in incubation, Sentry and Ranger, with no shared purpose, which is a real shame. To the extent that Bigtop adopts a cross-component full-stack access control technology, or helps bring another alternative into incubation and adopts that, we can move the needle in this space. We'd offer a vendor neutral access control option devoid of lock-in risk, this would be a big deal for big-E enterprises.
>> 
>> - Data management and provenance. Now we're moving up the value chain from storage and data access to the next layer. This is mostly greenfield / blue ocean space in the Apache stack. We have interesting options in incubation: Falcon, Taverna, NiFi. (I think the last one might be truly comprehensive.) All of these are higher level data management and processing workflows which include aspects of management and provenance. One or more could be adopted and refined. There are a lot of relevant integration opportunities up and down the stack that could be undertaken with shared effort of the Bigtop, framework, and component communities.
>> 
>> - Machine learning. Moving further up the value chain, we have data and computation and workflow, now how do we derive the competitive advantage that all of the lower layer technologies are in place for? The new hotness is surfacing of insights out of scaled parallel statistical inference. Unfortunately this space doesn't present itself well to the toolbox approach. Bigtop provides Mahout and MLLib as part of Spark (right?), they themselves are toolkits with components of varying utility and maturity (and relevance). I think Bigtop could provide some value by curating ML frameworks that tie in with other Apache stack technologies. ML toolkits leave would-be users in the cold. One has to know what one is doing, and what to do is highly use case specific, this is why "data scientists" can command obscene salaries and only commercial vendors have the resources to focus on specific verticals. 
>> 
>> - Visualization and preparation. Moving further up, now we are almost touching directly the use case. We have data but we need to clean it, normalize, regularize, filter, slice and dice. Where there are reasonably generic open source tools, preferably at Apache, for data preparation and cleaning Bigtop could provide baseline value by packaging it, and additional value with deeper integration with Apache stack components. Data preparation is a concern hand in hand with data ingest, so we have an interesting feedback loop from the top back down to ingest tools/building blocks like Kafka and Flume. Data cleaning concerns might overlap with the workflow frameworks too. If there's a friendly licensed open source graphical front end to the data cleaning/munging/exploration process that is generic enough that would be a really interesting "acquisition". 
>> - We can also package visualization libraries and toolkits for building dashboards. Like with ML algorithms, a complete integration is probably out of scope because every instance would be use case and user specific.
>> 
>> 
>> 
>>> On Mon, Dec 8, 2014 at 12:23 PM, Konstantin Boudnik <co...@apache.org> wrote:
>>> First I want to address the RJ's question:
>>> 
>>> The most prominent downstream Bigtop Dependency would be any commercial
>>> Hadoop distribution like HDP and CDH. The former is trying to
>>> disguise their affiliation by pushing Ambari forward, and Cloudera's seemingly
>>> shifting her focus to compressed tarballs media (aka parcels) which requires
>>> a closed-source solutions like Cloudera Manager to deploy and control your
>>> cluster, effectively rendering it useless if you ever decide to uninstall the
>>> control software. In the interest of full disclosure, I don't think parcels
>>> have any chance to landslide the consensus in the industry from Linux
>>> packaging towards something so obscure and proprietary as parcels are.
>>> 
>>> 
>>> And now to my actual points....:
>>> 
>>> I do strongly believe the Bigtop was and is the only completely transparent,
>>> vendors' friendly, and 100% sticking to official ASF product releases way of
>>> building your stack from ground up, deploying and controlling it anyway you
>>> want to. I agree with Roman's presentation on how this project can move
>>> forward. However, I somewhat disagree with his view on the perspectives. It
>>> might be a hard road to drive the opinion of the community.  But, it is a high
>>> road.
>>> 
>>> We are definitely small and mostly unsupported by commercial groups that are
>>> using the framework. Being a box of LEGO won't win us anything. If anything,
>>> the empirical evidences are against it as commercial distros have decided to
>>> move towards their own means of "vendor lock-in" (yes, you hear me
>>> right - that's exactly what I said: all so called open-source companies have
>>> invented a way to lock-in their customers either with fancy "enterprise
>>> features" that aren't adding but amending underlying stack; or with custom set
>>> of patches oftentimes rendering the cluster to become incompatible between
>>> different vendors).
>>> 
>>> By all means, my money are on the second way, yet slightly modified (as
>>> use-cases are coming from users, not developers):
>>>   #2 start driving adoption of software stacks for the particular kind of data workloads
>>> 
>>> This community has enough day-to-day practitioners on board to
>>> accumulate a near-complete introspection of where the technology is moving.
>>> And instead of wobbling in a backwash, let's see if we can be smart and define
>>> this landscape. After all, Bigtop has adopted Spark well before any of the
>>> commercials have officially accepted it. We seemingly are moving more and
>>> more into in-memory realm of data processing: Apache Ignite (Gridgain),
>>> Tachyon, Spark. I don't know how much legs Hive got in it, but I am doubtful,
>>> that it can walk for much longer... May be it's just me.
>>> 
>>> In this thread http://is.gd/MV2BH9 we already discussed some of the aspects
>>> influencing the feature of this project. And we are de-facto working on the
>>> implementation. In my opinion, Hadoop has been more or less commoditized
>>> already. And it isn't a bad thing, but it means that the innovations are
>>> elsewhere. E.g. Spark moving is moving beyond its ties with storage layer via
>>> Tachyon abstraction; GridGain simply doesn't care what's underlying storage
>>> is. However, data needs to be stored somewhere before it can be processed. And
>>> HCFS seems to be fitting the bill ok. But, as I said already, I see the real
>>> action elsewhere. If I were to define the shape of our mid- to long'ish term
>>> roadmap it'd be something like that:
>>> 
>>>             ^   Dashboard/Visualization  ^
>>>             |     OLTP/ML processing     |
>>>             |    Caching/Acceleration    |
>>>             |         Storage            |
>>> 
>>> And around this we can add/improve on deployment (R8???),
>>> virtualization/containers/clouds.  In other words - let's focus on the
>>> vertical part of the stack, instead of simply supporting the status quo.
>>> 
>>> Does Cassandra fits the Storage layer in that model? I don't know and most
>>> important - I don't care. If there's an interest and manpower to have
>>> Cassandra-based stack - sure, but perhaps let's do as a separate branch or
>>> something, so we aren't over-complicating things. As Roman said earlier, in
>>> this case it'd be great to engage Cassandra/DataStax people into this project.
>>> But something tells me they won't be eager to jump on board.
>>> 
>>> And finally, all this above leads to "how": how we can start reshaping the
>>> stack into its next incarnation? Perhaps, Ubuntu model might be an answer for
>>> that, but we have discussed that elsewhere and dropped the idea as it wasn't
>>> feasible back in the day. Perhaps its time just came?
>>> 
>>> Apologies for a long post.
>>>   Cos
>>> 
>>> 
>>> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
>>> > Which other projects depend on BigTop?  How will the questions about the
>>> > direction of BigTop affect those projects?
>>> >
>>> > On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <ro...@shaposhnik.org>
>>> > wrote:
>>> >
>>> > > Hi!
>>> > >
>>> > > On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <ja...@gmail.com>
>>> > > wrote:
>>> > > > hi bigtop !
>>> > > >
>>> > > > I thought id start a thread a few vaguely related thoughts i have around
>>> > > > next couple iterations of bigtop.
>>> > >
>>> > > I think in general I see two major ways for something like
>>> > > Bigtop to evolve:
>>> > >    #1 remain a 'box of LEGO bricks' with very little opinion on
>>> > >         how these pieces need to be integrated
>>> > >    #2 start driving oppinioned use-cases for the particular kind of
>>> > >         bigdata workloads
>>> > >
>>> > > #1 is sort of what all of the Linux distros have been doing for
>>> > > the majority of time they existed. #2 is close to what CentOS
>>> > > is doing with SIGs.
>>> > >
>>> > > Honestly, given the size of our community so far and a total
>>> > > lack of corporate backing (with a small exception of Cloudera
>>> > > still paying for our EC2 time) I think #1 is all we can do. I'd
>>> > > love to be wrong, though.
>>> > >
>>> > > > 1) Hive:  How will bigtop to evolve to support it, now that it is much
>>> > > more
>>> > > > than a mapreduce query wrapper?
>>> > >
>>> > > I think Hive will remain a big part of Hadoop workloads for forseeable
>>> > > future. What I'd love to see more of is rationalizing things like how
>>> > > HCatalog, etc. need to be deployed.
>>> > >
>>> > > > 2) I wonder wether we should confirm cassandra interoperability of spark
>>> > > in
>>> > > > bigtop distros,
>>> > >
>>> > > Only if there's a significant interest from cassandra community and even
>>> > > then my biggest fear is that with cassandra we're totally changing the
>>> > > requirements for the underlying storage subsystem (nothing wrong with
>>> > > that, its just that in Hadoop ecosystem everything assumes very HDFS'ish
>>> > > requirements for the scale-out storage).
>>> > >
>>> > > > 4) in general, i think bigtop can move in one of 3 directions.
>>> > > >
>>> > > >   EXPAND ? : Expanding to include new components, with just basic
>>> > > interop,
>>> > > > and let folks evolve their own stacks on top of bigtop on their own.
>>> > > >
>>> > > >   CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
>>> > > components,
>>> > > > with super high quality.
>>> > > >
>>> > > >   STAY THE COURSE ? Staying the same ~ a packaging platform for just
>>> > > > hadoop's direct ecosystem.
>>> > > >
>>> > > > I am intrigued by the idea of A and B both have clear benefits and
>>> > > costs...
>>> > > > would like to see the opinions of folks --- do we  lean in one direction
>>> > > or
>>> > > > another? What is the criteria for adding a new feature, package, stack to
>>> > > > bigtop?
>>> > > >
>>> > > > ... Or maybe im just overthinking it and should be spending this time
>>> > > > testing spark for 0.9 release....
>>> > >
>>> > > I'd love to know what other think, but for 0.9 I'd rather stay the course.
>>> > >
>>> > > Thanks,
>>> > > Roman.
>>> > >
>>> > > P.S. There are also market forces at play that may fundamentally change
>>> > > the focus of what we're all working on in the year or so.
>>> > >
>> 
>> 
>> 
>> -- 
>> Best regards,
>> 
>>    - Andy
>> 
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
>

Re: What will the next generation of bigtop look like?

Posted by RJ Nowling <rn...@gmail.com>.

GraphX, Streaming, MLlib, and Spark SQL are all part of Spark and would be
included in BigTop if Spark is included. They're also pretty well
integrated with each other.

I'd like to throw out a radical idea, based on Andrew's comments: focus on
the vertical rather than the horizontal with a slimmed down, Spark-oriented
stack.  (This could be a subset of the current stack.)  Strat.io's work
provides a nice example of a pure Spark stack.

Spark offers a smaller footprint, far less maintenance, functionality of
many Hadoop components in one (and better integration!), and is better
suited for diverse deployment situations (cloud, non-HDFS storage, etc.)

A few other complementary components would be needed: Kafka would be needed
for HA with Spark streaming.  Tachyon.  Maybe offer Cassandra or similar as
an alternative storage option.    Combine this with dashboards and
visualization and high quality deployment options (Puppet, Docker, etc.).
With the data generator and Spark implementation of BigPetStore, my goal is
to to expand BPS to provide high quality analytics examples, oriented more
towards data scientists.

Just a thought...

On Thu, Dec 11, 2014 at 12:39 PM, Andrew Purtell <ap...@apache.org>
wrote:

> This is a really great post and I was nodding along with most of it.
>
> My personal view is Bigtop starts as a deployable stack of Apache
> ecosystem components for Big Data. Commodification of (Linux) deployable
> packages and basic install integration is the baseline.
>
> Bigtop packaging Spark components first is an unfortunately little known
> win of this community, but its still a win. Although replicating that
> success with choice of the 'next big thing' is going to be a hit or miss
> proposition unless one of us can figure out time travel, definitely we can
> make some observations and scour and/or influence the Apache project
> landscape to pick up coverage in the space:
>
> - Storage is commoditized. Nearly everyone bases the storage stack on
> HDFS. Everyone does so with what we'd call HCFS. Best to focus elsewhere.
>
> - Packaging is commoditized. It's a shame that vendors pursue misguided
> lock-in strategies but we have no control over that. It's still true that
> someone using HDP or CDH 4 can switch to Bigtop and vice versa without
> changing package management tools or strategy. As a user of Apache stack
> technologies I want long term sustainable package management so will vote
> with my feet for the commodity option, and won't be alone. Bigtop should
> provide this, and does, and it's mostly a solved problem.
>
> - Deployment is also a "solved" problem but unfortunately everyone solves
> it differently. :-) This is an area where Bigtop can provide real value,
> and does, with the Puppet scripts, with the containerization work. One
> function Bigtop can serve is as repository and example of Hadoop-ish
> production tooling.
>
> - YARN is a reasonably generic grid resource manager. We don't have the
> resources to stand up an alternate RM and all the tooling necessary with
> Mesos, but if Mesosphere made a contribution of that I suspect we'd take
> it. From the Bigtop perspective I think computation framework options are
> well handled, in that I don't see Bigtop or anyone else developing credible
> alternatives to MR and Spark for some time. Not sure there's enough oxygen.
> And we have Giraph (and is GraphX packaged with Spark?). To the extent
> Spark-on-YARN has rough edges in the Bigtop framework that's an area where
> contributors can produce value. Related, support for Hive on Spark, Pig on
> Spark (spork).
>
> - The Apache stack includes three streaming computation frameworks -
> Storm, Spark Streaming, Samza - but Bigtop has mostly missed the boat here.
> Spark streaming is included in the spark package (I think) but how well is
> it integrated? Samza is well integrated with YARN but we don't package it.
> There's also been Storm-on-YARN work out of Yahoo, not sure about what was
> upstreamed or might be available. Anyway, integration of stream computation
> frameworks into Bigtop's packaging and deployment/management scripts can
> produce value, especially if we provide multiple options, because vendors
> are choosing favorites.
>
> - Data access. We do have players differentiating themselves here. Bigtop
> provides two SQL options (Hive, Phoenix+HBase), can add a third, I see
> someone's proposed Presto packaging. I'm not sure from the Bigtop
> perspective we need to pursue additional alternatives, but if there were
> contributions, we might well take them. "Enterprise friendly API" (SQL) is
> half of the data access picture I think, the other half is access control.
> There are competing projects in incubation, Sentry and Ranger, with no
> shared purpose, which is a real shame. To the extent that Bigtop adopts a
> cross-component full-stack access control technology, or helps bring
> another alternative into incubation and adopts that, we can move the needle
> in this space. We'd offer a vendor neutral access control option devoid of
> lock-in risk, this would be a big deal for big-E enterprises.
>
> - Data management and provenance. Now we're moving up the value chain from
> storage and data access to the next layer. This is mostly greenfield / blue
> ocean space in the Apache stack. We have interesting options in incubation:
> Falcon, Taverna, NiFi. (I think the last one might be truly comprehensive.)
> All of these are higher level data management and processing workflows
> which include aspects of management and provenance. One or more could be
> adopted and refined. There are a lot of relevant integration opportunities
> up and down the stack that could be undertaken with shared effort of the
> Bigtop, framework, and component communities.
>
> - Machine learning. Moving further up the value chain, we have data and
> computation and workflow, now how do we derive the competitive advantage
> that all of the lower layer technologies are in place for? The new hotness
> is surfacing of insights out of scaled parallel statistical inference.
> Unfortunately this space doesn't present itself well to the toolbox
> approach. Bigtop provides Mahout and MLLib as part of Spark (right?), they
> themselves are toolkits with components of varying utility and maturity
> (and relevance). I think Bigtop could provide some value by curating ML
> frameworks that tie in with other Apache stack technologies. ML toolkits
> leave would-be users in the cold. One has to know what one is doing, and
> what to do is highly use case specific, this is why "data scientists" can
> command obscene salaries and only commercial vendors have the resources to
> focus on specific verticals.
>
> - Visualization and preparation. Moving further up, now we are almost
> touching directly the use case. We have data but we need to clean it,
> normalize, regularize, filter, slice and dice. Where there are reasonably
> generic open source tools, preferably at Apache, for data preparation and
> cleaning Bigtop could provide baseline value by packaging it, and
> additional value with deeper integration with Apache stack components. Data
> preparation is a concern hand in hand with data ingest, so we have an
> interesting feedback loop from the top back down to ingest tools/building
> blocks like Kafka and Flume. Data cleaning concerns might overlap with the
> workflow frameworks too. If there's a friendly licensed open source
> graphical front end to the data cleaning/munging/exploration process that
> is generic enough that would be a really interesting "acquisition".
> - We can also package visualization libraries and toolkits for building
> dashboards. Like with ML algorithms, a complete integration is probably out
> of scope because every instance would be use case and user specific.
>
>
>
> On Mon, Dec 8, 2014 at 12:23 PM, Konstantin Boudnik <co...@apache.org>
> wrote:
>
>> First I want to address the RJ's question:
>>
>> The most prominent downstream Bigtop Dependency would be any commercial
>> Hadoop distribution like HDP and CDH. The former is trying to
>> disguise their affiliation by pushing Ambari forward, and Cloudera's
>> seemingly
>> shifting her focus to compressed tarballs media (aka parcels) which
>> requires
>> a closed-source solutions like Cloudera Manager to deploy and control your
>> cluster, effectively rendering it useless if you ever decide to uninstall
>> the
>> control software. In the interest of full disclosure, I don't think
>> parcels
>> have any chance to landslide the consensus in the industry from Linux
>> packaging towards something so obscure and proprietary as parcels are.
>>
>>
>> And now to my actual points....:
>>
>> I do strongly believe the Bigtop was and is the only completely
>> transparent,
>> vendors' friendly, and 100% sticking to official ASF product releases way
>> of
>> building your stack from ground up, deploying and controlling it anyway
>> you
>> want to. I agree with Roman's presentation on how this project can move
>> forward. However, I somewhat disagree with his view on the perspectives.
>> It
>> might be a hard road to drive the opinion of the community.  But, it is a
>> high
>> road.
>>
>> We are definitely small and mostly unsupported by commercial groups that
>> are
>> using the framework. Being a box of LEGO won't win us anything. If
>> anything,
>> the empirical evidences are against it as commercial distros have decided
>> to
>> move towards their own means of "vendor lock-in" (yes, you hear me
>> right - that's exactly what I said: all so called open-source companies
>> have
>> invented a way to lock-in their customers either with fancy "enterprise
>> features" that aren't adding but amending underlying stack; or with
>> custom set
>> of patches oftentimes rendering the cluster to become incompatible between
>> different vendors).
>>
>> By all means, my money are on the second way, yet slightly modified (as
>> use-cases are coming from users, not developers):
>>   #2 start driving adoption of software stacks for the particular kind of
>> data workloads
>>
>> This community has enough day-to-day practitioners on board to
>> accumulate a near-complete introspection of where the technology is
>> moving.
>> And instead of wobbling in a backwash, let's see if we can be smart and
>> define
>> this landscape. After all, Bigtop has adopted Spark well before any of the
>> commercials have officially accepted it. We seemingly are moving more and
>> more into in-memory realm of data processing: Apache Ignite (Gridgain),
>> Tachyon, Spark. I don't know how much legs Hive got in it, but I am
>> doubtful,
>> that it can walk for much longer... May be it's just me.
>>
>> In this thread http://is.gd/MV2BH9 we already discussed some of the
>> aspects
>> influencing the feature of this project. And we are de-facto working on
>> the
>> implementation. In my opinion, Hadoop has been more or less commoditized
>> already. And it isn't a bad thing, but it means that the innovations are
>> elsewhere. E.g. Spark moving is moving beyond its ties with storage layer
>> via
>> Tachyon abstraction; GridGain simply doesn't care what's underlying
>> storage
>> is. However, data needs to be stored somewhere before it can be
>> processed. And
>> HCFS seems to be fitting the bill ok. But, as I said already, I see the
>> real
>> action elsewhere. If I were to define the shape of our mid- to long'ish
>> term
>> roadmap it'd be something like that:
>>
>>             ^   Dashboard/Visualization  ^
>>             |     OLTP/ML processing     |
>>             |    Caching/Acceleration    |
>>             |         Storage            |
>>
>> And around this we can add/improve on deployment (R8???),
>> virtualization/containers/clouds.  In other words - let's focus on the
>> vertical part of the stack, instead of simply supporting the status quo.
>>
>> Does Cassandra fits the Storage layer in that model? I don't know and most
>> important - I don't care. If there's an interest and manpower to have
>> Cassandra-based stack - sure, but perhaps let's do as a separate branch or
>> something, so we aren't over-complicating things. As Roman said earlier,
>> in
>> this case it'd be great to engage Cassandra/DataStax people into this
>> project.
>> But something tells me they won't be eager to jump on board.
>>
>> And finally, all this above leads to "how": how we can start reshaping the
>> stack into its next incarnation? Perhaps, Ubuntu model might be an answer
>> for
>> that, but we have discussed that elsewhere and dropped the idea as it
>> wasn't
>> feasible back in the day. Perhaps its time just came?
>>
>> Apologies for a long post.
>>   Cos
>>
>>
>> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
>> > Which other projects depend on BigTop?  How will the questions about the
>> > direction of BigTop affect those projects?
>> >
>> > On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <ro...@shaposhnik.org>
>> > wrote:
>> >
>> > > Hi!
>> > >
>> > > On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <jayunit100.apache@gmail.com
>> >
>> > > wrote:
>> > > > hi bigtop !
>> > > >
>> > > > I thought id start a thread a few vaguely related thoughts i have
>> around
>> > > > next couple iterations of bigtop.
>> > >
>> > > I think in general I see two major ways for something like
>> > > Bigtop to evolve:
>> > >    #1 remain a 'box of LEGO bricks' with very little opinion on
>> > >         how these pieces need to be integrated
>> > >    #2 start driving oppinioned use-cases for the particular kind of
>> > >         bigdata workloads
>> > >
>> > > #1 is sort of what all of the Linux distros have been doing for
>> > > the majority of time they existed. #2 is close to what CentOS
>> > > is doing with SIGs.
>> > >
>> > > Honestly, given the size of our community so far and a total
>> > > lack of corporate backing (with a small exception of Cloudera
>> > > still paying for our EC2 time) I think #1 is all we can do. I'd
>> > > love to be wrong, though.
>> > >
>> > > > 1) Hive:  How will bigtop to evolve to support it, now that it is
>> much
>> > > more
>> > > > than a mapreduce query wrapper?
>> > >
>> > > I think Hive will remain a big part of Hadoop workloads for forseeable
>> > > future. What I'd love to see more of is rationalizing things like how
>> > > HCatalog, etc. need to be deployed.
>> > >
>> > > > 2) I wonder wether we should confirm cassandra interoperability of
>> spark
>> > > in
>> > > > bigtop distros,
>> > >
>> > > Only if there's a significant interest from cassandra community and
>> even
>> > > then my biggest fear is that with cassandra we're totally changing the
>> > > requirements for the underlying storage subsystem (nothing wrong with
>> > > that, its just that in Hadoop ecosystem everything assumes very
>> HDFS'ish
>> > > requirements for the scale-out storage).
>> > >
>> > > > 4) in general, i think bigtop can move in one of 3 directions.
>> > > >
>> > > >   EXPAND ? : Expanding to include new components, with just basic
>> > > interop,
>> > > > and let folks evolve their own stacks on top of bigtop on their own.
>> > > >
>> > > >   CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
>> > > components,
>> > > > with super high quality.
>> > > >
>> > > >   STAY THE COURSE ? Staying the same ~ a packaging platform for just
>> > > > hadoop's direct ecosystem.
>> > > >
>> > > > I am intrigued by the idea of A and B both have clear benefits and
>> > > costs...
>> > > > would like to see the opinions of folks --- do we  lean in one
>> direction
>> > > or
>> > > > another? What is the criteria for adding a new feature, package,
>> stack to
>> > > > bigtop?
>> > > >
>> > > > ... Or maybe im just overthinking it and should be spending this
>> time
>> > > > testing spark for 0.9 release....
>> > >
>> > > I'd love to know what other think, but for 0.9 I'd rather stay the
>> course.
>> > >
>> > > Thanks,
>> > > Roman.
>> > >
>> > > P.S. There are also market forces at play that may fundamentally
>> change
>> > > the focus of what we're all working on in the year or so.
>> > >
>>
>
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>

Re: What will the next generation of bigtop look like?

Posted by RJ Nowling <rn...@gmail.com>.

GraphX, Streaming, MLlib, and Spark SQL are all part of Spark and would be
included in BigTop if Spark is included. They're also pretty well
integrated with each other.

I'd like to throw out a radical idea, based on Andrew's comments: focus on
the vertical rather than the horizontal with a slimmed down, Spark-oriented
stack.  (This could be a subset of the current stack.)  Strat.io's work
provides a nice example of a pure Spark stack.

Spark offers a smaller footprint, far less maintenance, functionality of
many Hadoop components in one (and better integration!), and is better
suited for diverse deployment situations (cloud, non-HDFS storage, etc.)

A few other complementary components would be needed: Kafka would be needed
for HA with Spark streaming.  Tachyon.  Maybe offer Cassandra or similar as
an alternative storage option.    Combine this with dashboards and
visualization and high quality deployment options (Puppet, Docker, etc.).
With the data generator and Spark implementation of BigPetStore, my goal is
to to expand BPS to provide high quality analytics examples, oriented more
towards data scientists.

Just a thought...

On Thu, Dec 11, 2014 at 12:39 PM, Andrew Purtell <ap...@apache.org>
wrote:

> This is a really great post and I was nodding along with most of it.
>
> My personal view is Bigtop starts as a deployable stack of Apache
> ecosystem components for Big Data. Commodification of (Linux) deployable
> packages and basic install integration is the baseline.
>
> Bigtop packaging Spark components first is an unfortunately little known
> win of this community, but its still a win. Although replicating that
> success with choice of the 'next big thing' is going to be a hit or miss
> proposition unless one of us can figure out time travel, definitely we can
> make some observations and scour and/or influence the Apache project
> landscape to pick up coverage in the space:
>
> - Storage is commoditized. Nearly everyone bases the storage stack on
> HDFS. Everyone does so with what we'd call HCFS. Best to focus elsewhere.
>
> - Packaging is commoditized. It's a shame that vendors pursue misguided
> lock-in strategies but we have no control over that. It's still true that
> someone using HDP or CDH 4 can switch to Bigtop and vice versa without
> changing package management tools or strategy. As a user of Apache stack
> technologies I want long term sustainable package management so will vote
> with my feet for the commodity option, and won't be alone. Bigtop should
> provide this, and does, and it's mostly a solved problem.
>
> - Deployment is also a "solved" problem but unfortunately everyone solves
> it differently. :-) This is an area where Bigtop can provide real value,
> and does, with the Puppet scripts, with the containerization work. One
> function Bigtop can serve is as repository and example of Hadoop-ish
> production tooling.
>
> - YARN is a reasonably generic grid resource manager. We don't have the
> resources to stand up an alternate RM and all the tooling necessary with
> Mesos, but if Mesosphere made a contribution of that I suspect we'd take
> it. From the Bigtop perspective I think computation framework options are
> well handled, in that I don't see Bigtop or anyone else developing credible
> alternatives to MR and Spark for some time. Not sure there's enough oxygen.
> And we have Giraph (and is GraphX packaged with Spark?). To the extent
> Spark-on-YARN has rough edges in the Bigtop framework that's an area where
> contributors can produce value. Related, support for Hive on Spark, Pig on
> Spark (spork).
>
> - The Apache stack includes three streaming computation frameworks -
> Storm, Spark Streaming, Samza - but Bigtop has mostly missed the boat here.
> Spark streaming is included in the spark package (I think) but how well is
> it integrated? Samza is well integrated with YARN but we don't package it.
> There's also been Storm-on-YARN work out of Yahoo, not sure about what was
> upstreamed or might be available. Anyway, integration of stream computation
> frameworks into Bigtop's packaging and deployment/management scripts can
> produce value, especially if we provide multiple options, because vendors
> are choosing favorites.
>
> - Data access. We do have players differentiating themselves here. Bigtop
> provides two SQL options (Hive, Phoenix+HBase), can add a third, I see
> someone's proposed Presto packaging. I'm not sure from the Bigtop
> perspective we need to pursue additional alternatives, but if there were
> contributions, we might well take them. "Enterprise friendly API" (SQL) is
> half of the data access picture I think, the other half is access control.
> There are competing projects in incubation, Sentry and Ranger, with no
> shared purpose, which is a real shame. To the extent that Bigtop adopts a
> cross-component full-stack access control technology, or helps bring
> another alternative into incubation and adopts that, we can move the needle
> in this space. We'd offer a vendor neutral access control option devoid of
> lock-in risk, this would be a big deal for big-E enterprises.
>
> - Data management and provenance. Now we're moving up the value chain from
> storage and data access to the next layer. This is mostly greenfield / blue
> ocean space in the Apache stack. We have interesting options in incubation:
> Falcon, Taverna, NiFi. (I think the last one might be truly comprehensive.)
> All of these are higher level data management and processing workflows
> which include aspects of management and provenance. One or more could be
> adopted and refined. There are a lot of relevant integration opportunities
> up and down the stack that could be undertaken with shared effort of the
> Bigtop, framework, and component communities.
>
> - Machine learning. Moving further up the value chain, we have data and
> computation and workflow, now how do we derive the competitive advantage
> that all of the lower layer technologies are in place for? The new hotness
> is surfacing of insights out of scaled parallel statistical inference.
> Unfortunately this space doesn't present itself well to the toolbox
> approach. Bigtop provides Mahout and MLLib as part of Spark (right?), they
> themselves are toolkits with components of varying utility and maturity
> (and relevance). I think Bigtop could provide some value by curating ML
> frameworks that tie in with other Apache stack technologies. ML toolkits
> leave would-be users in the cold. One has to know what one is doing, and
> what to do is highly use case specific, this is why "data scientists" can
> command obscene salaries and only commercial vendors have the resources to
> focus on specific verticals.
>
> - Visualization and preparation. Moving further up, now we are almost
> touching directly the use case. We have data but we need to clean it,
> normalize, regularize, filter, slice and dice. Where there are reasonably
> generic open source tools, preferably at Apache, for data preparation and
> cleaning Bigtop could provide baseline value by packaging it, and
> additional value with deeper integration with Apache stack components. Data
> preparation is a concern hand in hand with data ingest, so we have an
> interesting feedback loop from the top back down to ingest tools/building
> blocks like Kafka and Flume. Data cleaning concerns might overlap with the
> workflow frameworks too. If there's a friendly licensed open source
> graphical front end to the data cleaning/munging/exploration process that
> is generic enough that would be a really interesting "acquisition".
> - We can also package visualization libraries and toolkits for building
> dashboards. Like with ML algorithms, a complete integration is probably out
> of scope because every instance would be use case and user specific.
>
>
>
> On Mon, Dec 8, 2014 at 12:23 PM, Konstantin Boudnik <co...@apache.org>
> wrote:
>
>> First I want to address the RJ's question:
>>
>> The most prominent downstream Bigtop Dependency would be any commercial
>> Hadoop distribution like HDP and CDH. The former is trying to
>> disguise their affiliation by pushing Ambari forward, and Cloudera's
>> seemingly
>> shifting her focus to compressed tarballs media (aka parcels) which
>> requires
>> a closed-source solutions like Cloudera Manager to deploy and control your
>> cluster, effectively rendering it useless if you ever decide to uninstall
>> the
>> control software. In the interest of full disclosure, I don't think
>> parcels
>> have any chance to landslide the consensus in the industry from Linux
>> packaging towards something so obscure and proprietary as parcels are.
>>
>>
>> And now to my actual points....:
>>
>> I do strongly believe the Bigtop was and is the only completely
>> transparent,
>> vendors' friendly, and 100% sticking to official ASF product releases way
>> of
>> building your stack from ground up, deploying and controlling it anyway
>> you
>> want to. I agree with Roman's presentation on how this project can move
>> forward. However, I somewhat disagree with his view on the perspectives.
>> It
>> might be a hard road to drive the opinion of the community.  But, it is a
>> high
>> road.
>>
>> We are definitely small and mostly unsupported by commercial groups that
>> are
>> using the framework. Being a box of LEGO won't win us anything. If
>> anything,
>> the empirical evidences are against it as commercial distros have decided
>> to
>> move towards their own means of "vendor lock-in" (yes, you hear me
>> right - that's exactly what I said: all so called open-source companies
>> have
>> invented a way to lock-in their customers either with fancy "enterprise
>> features" that aren't adding but amending underlying stack; or with
>> custom set
>> of patches oftentimes rendering the cluster to become incompatible between
>> different vendors).
>>
>> By all means, my money are on the second way, yet slightly modified (as
>> use-cases are coming from users, not developers):
>>   #2 start driving adoption of software stacks for the particular kind of
>> data workloads
>>
>> This community has enough day-to-day practitioners on board to
>> accumulate a near-complete introspection of where the technology is
>> moving.
>> And instead of wobbling in a backwash, let's see if we can be smart and
>> define
>> this landscape. After all, Bigtop has adopted Spark well before any of the
>> commercials have officially accepted it. We seemingly are moving more and
>> more into in-memory realm of data processing: Apache Ignite (Gridgain),
>> Tachyon, Spark. I don't know how much legs Hive got in it, but I am
>> doubtful,
>> that it can walk for much longer... May be it's just me.
>>
>> In this thread http://is.gd/MV2BH9 we already discussed some of the
>> aspects
>> influencing the feature of this project. And we are de-facto working on
>> the
>> implementation. In my opinion, Hadoop has been more or less commoditized
>> already. And it isn't a bad thing, but it means that the innovations are
>> elsewhere. E.g. Spark moving is moving beyond its ties with storage layer
>> via
>> Tachyon abstraction; GridGain simply doesn't care what's underlying
>> storage
>> is. However, data needs to be stored somewhere before it can be
>> processed. And
>> HCFS seems to be fitting the bill ok. But, as I said already, I see the
>> real
>> action elsewhere. If I were to define the shape of our mid- to long'ish
>> term
>> roadmap it'd be something like that:
>>
>>             ^   Dashboard/Visualization  ^
>>             |     OLTP/ML processing     |
>>             |    Caching/Acceleration    |
>>             |         Storage            |
>>
>> And around this we can add/improve on deployment (R8???),
>> virtualization/containers/clouds.  In other words - let's focus on the
>> vertical part of the stack, instead of simply supporting the status quo.
>>
>> Does Cassandra fits the Storage layer in that model? I don't know and most
>> important - I don't care. If there's an interest and manpower to have
>> Cassandra-based stack - sure, but perhaps let's do as a separate branch or
>> something, so we aren't over-complicating things. As Roman said earlier,
>> in
>> this case it'd be great to engage Cassandra/DataStax people into this
>> project.
>> But something tells me they won't be eager to jump on board.
>>
>> And finally, all this above leads to "how": how we can start reshaping the
>> stack into its next incarnation? Perhaps, Ubuntu model might be an answer
>> for
>> that, but we have discussed that elsewhere and dropped the idea as it
>> wasn't
>> feasible back in the day. Perhaps its time just came?
>>
>> Apologies for a long post.
>>   Cos
>>
>>
>> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
>> > Which other projects depend on BigTop?  How will the questions about the
>> > direction of BigTop affect those projects?
>> >
>> > On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <ro...@shaposhnik.org>
>> > wrote:
>> >
>> > > Hi!
>> > >
>> > > On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <jayunit100.apache@gmail.com
>> >
>> > > wrote:
>> > > > hi bigtop !
>> > > >
>> > > > I thought id start a thread a few vaguely related thoughts i have
>> around
>> > > > next couple iterations of bigtop.
>> > >
>> > > I think in general I see two major ways for something like
>> > > Bigtop to evolve:
>> > >    #1 remain a 'box of LEGO bricks' with very little opinion on
>> > >         how these pieces need to be integrated
>> > >    #2 start driving oppinioned use-cases for the particular kind of
>> > >         bigdata workloads
>> > >
>> > > #1 is sort of what all of the Linux distros have been doing for
>> > > the majority of time they existed. #2 is close to what CentOS
>> > > is doing with SIGs.
>> > >
>> > > Honestly, given the size of our community so far and a total
>> > > lack of corporate backing (with a small exception of Cloudera
>> > > still paying for our EC2 time) I think #1 is all we can do. I'd
>> > > love to be wrong, though.
>> > >
>> > > > 1) Hive:  How will bigtop to evolve to support it, now that it is
>> much
>> > > more
>> > > > than a mapreduce query wrapper?
>> > >
>> > > I think Hive will remain a big part of Hadoop workloads for forseeable
>> > > future. What I'd love to see more of is rationalizing things like how
>> > > HCatalog, etc. need to be deployed.
>> > >
>> > > > 2) I wonder wether we should confirm cassandra interoperability of
>> spark
>> > > in
>> > > > bigtop distros,
>> > >
>> > > Only if there's a significant interest from cassandra community and
>> even
>> > > then my biggest fear is that with cassandra we're totally changing the
>> > > requirements for the underlying storage subsystem (nothing wrong with
>> > > that, its just that in Hadoop ecosystem everything assumes very
>> HDFS'ish
>> > > requirements for the scale-out storage).
>> > >
>> > > > 4) in general, i think bigtop can move in one of 3 directions.
>> > > >
>> > > >   EXPAND ? : Expanding to include new components, with just basic
>> > > interop,
>> > > > and let folks evolve their own stacks on top of bigtop on their own.
>> > > >
>> > > >   CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
>> > > components,
>> > > > with super high quality.
>> > > >
>> > > >   STAY THE COURSE ? Staying the same ~ a packaging platform for just
>> > > > hadoop's direct ecosystem.
>> > > >
>> > > > I am intrigued by the idea of A and B both have clear benefits and
>> > > costs...
>> > > > would like to see the opinions of folks --- do we  lean in one
>> direction
>> > > or
>> > > > another? What is the criteria for adding a new feature, package,
>> stack to
>> > > > bigtop?
>> > > >
>> > > > ... Or maybe im just overthinking it and should be spending this
>> time
>> > > > testing spark for 0.9 release....
>> > >
>> > > I'd love to know what other think, but for 0.9 I'd rather stay the
>> course.
>> > >
>> > > Thanks,
>> > > Roman.
>> > >
>> > > P.S. There are also market forces at play that may fundamentally
>> change
>> > > the focus of what we're all working on in the year or so.
>> > >
>>
>
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>

Re: What will the next generation of bigtop look like?

Posted by Andrew Purtell <ap...@apache.org>.

This is a really great post and I was nodding along with most of it.

My personal view is Bigtop starts as a deployable stack of Apache ecosystem
components for Big Data. Commodification of (Linux) deployable packages and
basic install integration is the baseline.

Bigtop packaging Spark components first is an unfortunately little known
win of this community, but its still a win. Although replicating that
success with choice of the 'next big thing' is going to be a hit or miss
proposition unless one of us can figure out time travel, definitely we can
make some observations and scour and/or influence the Apache project
landscape to pick up coverage in the space:

- Storage is commoditized. Nearly everyone bases the storage stack on HDFS.
Everyone does so with what we'd call HCFS. Best to focus elsewhere.

- Packaging is commoditized. It's a shame that vendors pursue misguided
lock-in strategies but we have no control over that. It's still true that
someone using HDP or CDH 4 can switch to Bigtop and vice versa without
changing package management tools or strategy. As a user of Apache stack
technologies I want long term sustainable package management so will vote
with my feet for the commodity option, and won't be alone. Bigtop should
provide this, and does, and it's mostly a solved problem.

- Deployment is also a "solved" problem but unfortunately everyone solves
it differently. :-) This is an area where Bigtop can provide real value,
and does, with the Puppet scripts, with the containerization work. One
function Bigtop can serve is as repository and example of Hadoop-ish
production tooling.

- YARN is a reasonably generic grid resource manager. We don't have the
resources to stand up an alternate RM and all the tooling necessary with
Mesos, but if Mesosphere made a contribution of that I suspect we'd take
it. From the Bigtop perspective I think computation framework options are
well handled, in that I don't see Bigtop or anyone else developing credible
alternatives to MR and Spark for some time. Not sure there's enough oxygen.
And we have Giraph (and is GraphX packaged with Spark?). To the extent
Spark-on-YARN has rough edges in the Bigtop framework that's an area where
contributors can produce value. Related, support for Hive on Spark, Pig on
Spark (spork).

- The Apache stack includes three streaming computation frameworks - Storm,
Spark Streaming, Samza - but Bigtop has mostly missed the boat here. Spark
streaming is included in the spark package (I think) but how well is it
integrated? Samza is well integrated with YARN but we don't package it.
There's also been Storm-on-YARN work out of Yahoo, not sure about what was
upstreamed or might be available. Anyway, integration of stream computation
frameworks into Bigtop's packaging and deployment/management scripts can
produce value, especially if we provide multiple options, because vendors
are choosing favorites.

- Data access. We do have players differentiating themselves here. Bigtop
provides two SQL options (Hive, Phoenix+HBase), can add a third, I see
someone's proposed Presto packaging. I'm not sure from the Bigtop
perspective we need to pursue additional alternatives, but if there were
contributions, we might well take them. "Enterprise friendly API" (SQL) is
half of the data access picture I think, the other half is access control.
There are competing projects in incubation, Sentry and Ranger, with no
shared purpose, which is a real shame. To the extent that Bigtop adopts a
cross-component full-stack access control technology, or helps bring
another alternative into incubation and adopts that, we can move the needle
in this space. We'd offer a vendor neutral access control option devoid of
lock-in risk, this would be a big deal for big-E enterprises.

- Data management and provenance. Now we're moving up the value chain from
storage and data access to the next layer. This is mostly greenfield / blue
ocean space in the Apache stack. We have interesting options in incubation:
Falcon, Taverna, NiFi. (I think the last one might be truly comprehensive.)
All of these are higher level data management and processing workflows
which include aspects of management and provenance. One or more could be
adopted and refined. There are a lot of relevant integration opportunities
up and down the stack that could be undertaken with shared effort of the
Bigtop, framework, and component communities.

- Machine learning. Moving further up the value chain, we have data and
computation and workflow, now how do we derive the competitive advantage
that all of the lower layer technologies are in place for? The new hotness
is surfacing of insights out of scaled parallel statistical inference.
Unfortunately this space doesn't present itself well to the toolbox
approach. Bigtop provides Mahout and MLLib as part of Spark (right?), they
themselves are toolkits with components of varying utility and maturity
(and relevance). I think Bigtop could provide some value by curating ML
frameworks that tie in with other Apache stack technologies. ML toolkits
leave would-be users in the cold. One has to know what one is doing, and
what to do is highly use case specific, this is why "data scientists" can
command obscene salaries and only commercial vendors have the resources to
focus on specific verticals.

- Visualization and preparation. Moving further up, now we are almost
touching directly the use case. We have data but we need to clean it,
normalize, regularize, filter, slice and dice. Where there are reasonably
generic open source tools, preferably at Apache, for data preparation and
cleaning Bigtop could provide baseline value by packaging it, and
additional value with deeper integration with Apache stack components. Data
preparation is a concern hand in hand with data ingest, so we have an
interesting feedback loop from the top back down to ingest tools/building
blocks like Kafka and Flume. Data cleaning concerns might overlap with the
workflow frameworks too. If there's a friendly licensed open source
graphical front end to the data cleaning/munging/exploration process that
is generic enough that would be a really interesting "acquisition".
- We can also package visualization libraries and toolkits for building
dashboards. Like with ML algorithms, a complete integration is probably out
of scope because every instance would be use case and user specific.



On Mon, Dec 8, 2014 at 12:23 PM, Konstantin Boudnik <co...@apache.org> wrote:

> First I want to address the RJ's question:
>
> The most prominent downstream Bigtop Dependency would be any commercial
> Hadoop distribution like HDP and CDH. The former is trying to
> disguise their affiliation by pushing Ambari forward, and Cloudera's
> seemingly
> shifting her focus to compressed tarballs media (aka parcels) which
> requires
> a closed-source solutions like Cloudera Manager to deploy and control your
> cluster, effectively rendering it useless if you ever decide to uninstall
> the
> control software. In the interest of full disclosure, I don't think parcels
> have any chance to landslide the consensus in the industry from Linux
> packaging towards something so obscure and proprietary as parcels are.
>
>
> And now to my actual points....:
>
> I do strongly believe the Bigtop was and is the only completely
> transparent,
> vendors' friendly, and 100% sticking to official ASF product releases way
> of
> building your stack from ground up, deploying and controlling it anyway you
> want to. I agree with Roman's presentation on how this project can move
> forward. However, I somewhat disagree with his view on the perspectives. It
> might be a hard road to drive the opinion of the community.  But, it is a
> high
> road.
>
> We are definitely small and mostly unsupported by commercial groups that
> are
> using the framework. Being a box of LEGO won't win us anything. If
> anything,
> the empirical evidences are against it as commercial distros have decided
> to
> move towards their own means of "vendor lock-in" (yes, you hear me
> right - that's exactly what I said: all so called open-source companies
> have
> invented a way to lock-in their customers either with fancy "enterprise
> features" that aren't adding but amending underlying stack; or with custom
> set
> of patches oftentimes rendering the cluster to become incompatible between
> different vendors).
>
> By all means, my money are on the second way, yet slightly modified (as
> use-cases are coming from users, not developers):
>   #2 start driving adoption of software stacks for the particular kind of
> data workloads
>
> This community has enough day-to-day practitioners on board to
> accumulate a near-complete introspection of where the technology is moving.
> And instead of wobbling in a backwash, let's see if we can be smart and
> define
> this landscape. After all, Bigtop has adopted Spark well before any of the
> commercials have officially accepted it. We seemingly are moving more and
> more into in-memory realm of data processing: Apache Ignite (Gridgain),
> Tachyon, Spark. I don't know how much legs Hive got in it, but I am
> doubtful,
> that it can walk for much longer... May be it's just me.
>
> In this thread http://is.gd/MV2BH9 we already discussed some of the
> aspects
> influencing the feature of this project. And we are de-facto working on the
> implementation. In my opinion, Hadoop has been more or less commoditized
> already. And it isn't a bad thing, but it means that the innovations are
> elsewhere. E.g. Spark moving is moving beyond its ties with storage layer
> via
> Tachyon abstraction; GridGain simply doesn't care what's underlying storage
> is. However, data needs to be stored somewhere before it can be processed.
> And
> HCFS seems to be fitting the bill ok. But, as I said already, I see the
> real
> action elsewhere. If I were to define the shape of our mid- to long'ish
> term
> roadmap it'd be something like that:
>
>             ^   Dashboard/Visualization  ^
>             |     OLTP/ML processing     |
>             |    Caching/Acceleration    |
>             |         Storage            |
>
> And around this we can add/improve on deployment (R8???),
> virtualization/containers/clouds.  In other words - let's focus on the
> vertical part of the stack, instead of simply supporting the status quo.
>
> Does Cassandra fits the Storage layer in that model? I don't know and most
> important - I don't care. If there's an interest and manpower to have
> Cassandra-based stack - sure, but perhaps let's do as a separate branch or
> something, so we aren't over-complicating things. As Roman said earlier, in
> this case it'd be great to engage Cassandra/DataStax people into this
> project.
> But something tells me they won't be eager to jump on board.
>
> And finally, all this above leads to "how": how we can start reshaping the
> stack into its next incarnation? Perhaps, Ubuntu model might be an answer
> for
> that, but we have discussed that elsewhere and dropped the idea as it
> wasn't
> feasible back in the day. Perhaps its time just came?
>
> Apologies for a long post.
>   Cos
>
>
> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
> > Which other projects depend on BigTop?  How will the questions about the
> > direction of BigTop affect those projects?
> >
> > On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <ro...@shaposhnik.org>
> > wrote:
> >
> > > Hi!
> > >
> > > On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <ja...@gmail.com>
> > > wrote:
> > > > hi bigtop !
> > > >
> > > > I thought id start a thread a few vaguely related thoughts i have
> around
> > > > next couple iterations of bigtop.
> > >
> > > I think in general I see two major ways for something like
> > > Bigtop to evolve:
> > >    #1 remain a 'box of LEGO bricks' with very little opinion on
> > >         how these pieces need to be integrated
> > >    #2 start driving oppinioned use-cases for the particular kind of
> > >         bigdata workloads
> > >
> > > #1 is sort of what all of the Linux distros have been doing for
> > > the majority of time they existed. #2 is close to what CentOS
> > > is doing with SIGs.
> > >
> > > Honestly, given the size of our community so far and a total
> > > lack of corporate backing (with a small exception of Cloudera
> > > still paying for our EC2 time) I think #1 is all we can do. I'd
> > > love to be wrong, though.
> > >
> > > > 1) Hive:  How will bigtop to evolve to support it, now that it is
> much
> > > more
> > > > than a mapreduce query wrapper?
> > >
> > > I think Hive will remain a big part of Hadoop workloads for forseeable
> > > future. What I'd love to see more of is rationalizing things like how
> > > HCatalog, etc. need to be deployed.
> > >
> > > > 2) I wonder wether we should confirm cassandra interoperability of
> spark
> > > in
> > > > bigtop distros,
> > >
> > > Only if there's a significant interest from cassandra community and
> even
> > > then my biggest fear is that with cassandra we're totally changing the
> > > requirements for the underlying storage subsystem (nothing wrong with
> > > that, its just that in Hadoop ecosystem everything assumes very
> HDFS'ish
> > > requirements for the scale-out storage).
> > >
> > > > 4) in general, i think bigtop can move in one of 3 directions.
> > > >
> > > >   EXPAND ? : Expanding to include new components, with just basic
> > > interop,
> > > > and let folks evolve their own stacks on top of bigtop on their own.
> > > >
> > > >   CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
> > > components,
> > > > with super high quality.
> > > >
> > > >   STAY THE COURSE ? Staying the same ~ a packaging platform for just
> > > > hadoop's direct ecosystem.
> > > >
> > > > I am intrigued by the idea of A and B both have clear benefits and
> > > costs...
> > > > would like to see the opinions of folks --- do we  lean in one
> direction
> > > or
> > > > another? What is the criteria for adding a new feature, package,
> stack to
> > > > bigtop?
> > > >
> > > > ... Or maybe im just overthinking it and should be spending this time
> > > > testing spark for 0.9 release....
> > >
> > > I'd love to know what other think, but for 0.9 I'd rather stay the
> course.
> > >
> > > Thanks,
> > > Roman.
> > >
> > > P.S. There are also market forces at play that may fundamentally change
> > > the focus of what we're all working on in the year or so.
> > >
>



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: What will the next generation of bigtop look like?

Posted by Andre Arcilla <ar...@apache.org>.

I will be juggling several projects beginning of Jan, and can only arrange
hosting (and attend a meetup) starting the last week of Jan (1/25 and on).
I also notice that Jay suggested "a meetup after January". Anyone for the
last week of Jan?


On Tue, Dec 9, 2014 at 11:39 PM, Konstantin Boudnik <co...@apache.org> wrote:

> On Mon, Dec 08, 2014 at 09:16PM, Konstantin Boudnik wrote:
> > On Mon, Dec 08, 2014 at 11:57PM, Jay Vyas wrote:
> > > "Let's see if we can be smart and define the landscape"
> > >
> > > Well put @cos...I think Romans point was that it would be hard, not
> that it
> > > would be bad. And I think you're both right : it's hard? Yes. But
> > > worthwhile... Possibly? Next step we will all have to get in a room and
> > > think about this face to face.
> > >
> > > Let's shoot for a meetup after january in California... Where we can
> plan
> > > the future direction of bigtop.  In the meanwhile hope to hear more
> opinions
> > > on this.
> >
> > +1 I can host at WANdisco or perhaps there other options?
>
> Shall we start putting some arrangements/planning for the January
> meetup? Say, 2nd week or January? Or the following one?
>
> Andre, do you guys want to host it at A9? Anyone else? I am happy to do
> this
> at my office, but it might be a bit of travel, although against the traffic
> both ways.
>
> Cos
>
> > > > On Dec 8, 2014, at 3:23 PM, Konstantin Boudnik <co...@apache.org>
> wrote:
> > > >
> > > > First I want to address the RJ's question:
> > > >
> > > > The most prominent downstream Bigtop Dependency would be any
> commercial
> > > > Hadoop distribution like HDP and CDH. The former is trying to
> > > > disguise their affiliation by pushing Ambari forward, and Cloudera's
> seemingly
> > > > shifting her focus to compressed tarballs media (aka parcels) which
> requires
> > > > a closed-source solutions like Cloudera Manager to deploy and
> control your
> > > > cluster, effectively rendering it useless if you ever decide to
> uninstall the
> > > > control software. In the interest of full disclosure, I don't think
> parcels
> > > > have any chance to landslide the consensus in the industry from Linux
> > > > packaging towards something so obscure and proprietary as parcels
> are.
> > > >
> > > >
> > > > And now to my actual points....:
> > > >
> > > > I do strongly believe the Bigtop was and is the only completely
> transparent,
> > > > vendors' friendly, and 100% sticking to official ASF product
> releases way of
> > > > building your stack from ground up, deploying and controlling it
> anyway you
> > > > want to. I agree with Roman's presentation on how this project can
> move
> > > > forward. However, I somewhat disagree with his view on the
> perspectives. It
> > > > might be a hard road to drive the opinion of the community.  But, it
> is a high
> > > > road.
> > > >
> > > > We are definitely small and mostly unsupported by commercial groups
> that are
> > > > using the framework. Being a box of LEGO won't win us anything. If
> anything,
> > > > the empirical evidences are against it as commercial distros have
> decided to
> > > > move towards their own means of "vendor lock-in" (yes, you hear me
> > > > right - that's exactly what I said: all so called open-source
> companies have
> > > > invented a way to lock-in their customers either with fancy
> "enterprise
> > > > features" that aren't adding but amending underlying stack; or with
> custom set
> > > > of patches oftentimes rendering the cluster to become incompatible
> between
> > > > different vendors).
> > > >
> > > > By all means, my money are on the second way, yet slightly modified
> (as
> > > > use-cases are coming from users, not developers):
> > > >  #2 start driving adoption of software stacks for the particular
> kind of data workloads
> > > >
> > > > This community has enough day-to-day practitioners on board to
> > > > accumulate a near-complete introspection of where the technology is
> moving.
> > > > And instead of wobbling in a backwash, let's see if we can be smart
> and define
> > > > this landscape. After all, Bigtop has adopted Spark well before any
> of the
> > > > commercials have officially accepted it. We seemingly are moving
> more and
> > > > more into in-memory realm of data processing: Apache Ignite
> (Gridgain),
> > > > Tachyon, Spark. I don't know how much legs Hive got in it, but I am
> doubtful,
> > > > that it can walk for much longer... May be it's just me.
> > > >
> > > > In this thread http://is.gd/MV2BH9 we already discussed some of the
> aspects
> > > > influencing the feature of this project. And we are de-facto working
> on the
> > > > implementation. In my opinion, Hadoop has been more or less
> commoditized
> > > > already. And it isn't a bad thing, but it means that the innovations
> are
> > > > elsewhere. E.g. Spark moving is moving beyond its ties with storage
> layer via
> > > > Tachyon abstraction; GridGain simply doesn't care what's underlying
> storage
> > > > is. However, data needs to be stored somewhere before it can be
> processed. And
> > > > HCFS seems to be fitting the bill ok. But, as I said already, I see
> the real
> > > > action elsewhere. If I were to define the shape of our mid- to
> long'ish term
> > > > roadmap it'd be something like that:
> > > >
> > > >            ^   Dashboard/Visualization  ^
> > > >            |     OLTP/ML processing     |
> > > >            |    Caching/Acceleration    |
> > > >            |         Storage            |
> > > >
> > > > And around this we can add/improve on deployment (R8???),
> > > > virtualization/containers/clouds.  In other words - let's focus on
> the
> > > > vertical part of the stack, instead of simply supporting the status
> quo.
> > > >
> > > > Does Cassandra fits the Storage layer in that model? I don't know
> and most
> > > > important - I don't care. If there's an interest and manpower to have
> > > > Cassandra-based stack - sure, but perhaps let's do as a separate
> branch or
> > > > something, so we aren't over-complicating things. As Roman said
> earlier, in
> > > > this case it'd be great to engage Cassandra/DataStax people into
> this project.
> > > > But something tells me they won't be eager to jump on board.
> > > >
> > > > And finally, all this above leads to "how": how we can start
> reshaping the
> > > > stack into its next incarnation? Perhaps, Ubuntu model might be an
> answer for
> > > > that, but we have discussed that elsewhere and dropped the idea as
> it wasn't
> > > > feasible back in the day. Perhaps its time just came?
> > > >
> > > > Apologies for a long post.
> > > >  Cos
> > > >
> > > >
> > > >> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
> > > >> Which other projects depend on BigTop?  How will the questions
> about the
> > > >> direction of BigTop affect those projects?
> > > >>
> > > >> On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <
> roman@shaposhnik.org>
> > > >> wrote:
> > > >>
> > > >>> Hi!
> > > >>>
> > > >>> On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <
> jayunit100.apache@gmail.com>
> > > >>> wrote:
> > > >>>> hi bigtop !
> > > >>>>
> > > >>>> I thought id start a thread a few vaguely related thoughts i have
> around
> > > >>>> next couple iterations of bigtop.
> > > >>>
> > > >>> I think in general I see two major ways for something like
> > > >>> Bigtop to evolve:
> > > >>>   #1 remain a 'box of LEGO bricks' with very little opinion on
> > > >>>        how these pieces need to be integrated
> > > >>>   #2 start driving oppinioned use-cases for the particular kind of
> > > >>>        bigdata workloads
> > > >>>
> > > >>> #1 is sort of what all of the Linux distros have been doing for
> > > >>> the majority of time they existed. #2 is close to what CentOS
> > > >>> is doing with SIGs.
> > > >>>
> > > >>> Honestly, given the size of our community so far and a total
> > > >>> lack of corporate backing (with a small exception of Cloudera
> > > >>> still paying for our EC2 time) I think #1 is all we can do. I'd
> > > >>> love to be wrong, though.
> > > >>>
> > > >>>> 1) Hive:  How will bigtop to evolve to support it, now that it is
> much
> > > >>> more
> > > >>>> than a mapreduce query wrapper?
> > > >>>
> > > >>> I think Hive will remain a big part of Hadoop workloads for
> forseeable
> > > >>> future. What I'd love to see more of is rationalizing things like
> how
> > > >>> HCatalog, etc. need to be deployed.
> > > >>>
> > > >>>> 2) I wonder wether we should confirm cassandra interoperability
> of spark
> > > >>> in
> > > >>>> bigtop distros,
> > > >>>
> > > >>> Only if there's a significant interest from cassandra community
> and even
> > > >>> then my biggest fear is that with cassandra we're totally changing
> the
> > > >>> requirements for the underlying storage subsystem (nothing wrong
> with
> > > >>> that, its just that in Hadoop ecosystem everything assumes very
> HDFS'ish
> > > >>> requirements for the scale-out storage).
> > > >>>
> > > >>>> 4) in general, i think bigtop can move in one of 3 directions.
> > > >>>>
> > > >>>>  EXPAND ? : Expanding to include new components, with just basic
> > > >>> interop,
> > > >>>> and let folks evolve their own stacks on top of bigtop on their
> own.
> > > >>>>
> > > >>>>  CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
> > > >>> components,
> > > >>>> with super high quality.
> > > >>>>
> > > >>>>  STAY THE COURSE ? Staying the same ~ a packaging platform for
> just
> > > >>>> hadoop's direct ecosystem.
> > > >>>>
> > > >>>> I am intrigued by the idea of A and B both have clear benefits and
> > > >>> costs...
> > > >>>> would like to see the opinions of folks --- do we  lean in one
> direction
> > > >>> or
> > > >>>> another? What is the criteria for adding a new feature, package,
> stack to
> > > >>>> bigtop?
> > > >>>>
> > > >>>> ... Or maybe im just overthinking it and should be spending this
> time
> > > >>>> testing spark for 0.9 release....
> > > >>>
> > > >>> I'd love to know what other think, but for 0.9 I'd rather stay the
> course.
> > > >>>
> > > >>> Thanks,
> > > >>> Roman.
> > > >>>
> > > >>> P.S. There are also market forces at play that may fundamentally
> change
> > > >>> the focus of what we're all working on in the year or so.
> > > >>>
>
>
>

Re: What will the next generation of bigtop look like?

Posted by Andre Arcilla <ar...@apache.org>.

I will be juggling several projects beginning of Jan, and can only arrange
hosting (and attend a meetup) starting the last week of Jan (1/25 and on).
I also notice that Jay suggested "a meetup after January". Anyone for the
last week of Jan?


On Tue, Dec 9, 2014 at 11:39 PM, Konstantin Boudnik <co...@apache.org> wrote:

> On Mon, Dec 08, 2014 at 09:16PM, Konstantin Boudnik wrote:
> > On Mon, Dec 08, 2014 at 11:57PM, Jay Vyas wrote:
> > > "Let's see if we can be smart and define the landscape"
> > >
> > > Well put @cos...I think Romans point was that it would be hard, not
> that it
> > > would be bad. And I think you're both right : it's hard? Yes. But
> > > worthwhile... Possibly? Next step we will all have to get in a room and
> > > think about this face to face.
> > >
> > > Let's shoot for a meetup after january in California... Where we can
> plan
> > > the future direction of bigtop.  In the meanwhile hope to hear more
> opinions
> > > on this.
> >
> > +1 I can host at WANdisco or perhaps there other options?
>
> Shall we start putting some arrangements/planning for the January
> meetup? Say, 2nd week or January? Or the following one?
>
> Andre, do you guys want to host it at A9? Anyone else? I am happy to do
> this
> at my office, but it might be a bit of travel, although against the traffic
> both ways.
>
> Cos
>
> > > > On Dec 8, 2014, at 3:23 PM, Konstantin Boudnik <co...@apache.org>
> wrote:
> > > >
> > > > First I want to address the RJ's question:
> > > >
> > > > The most prominent downstream Bigtop Dependency would be any
> commercial
> > > > Hadoop distribution like HDP and CDH. The former is trying to
> > > > disguise their affiliation by pushing Ambari forward, and Cloudera's
> seemingly
> > > > shifting her focus to compressed tarballs media (aka parcels) which
> requires
> > > > a closed-source solutions like Cloudera Manager to deploy and
> control your
> > > > cluster, effectively rendering it useless if you ever decide to
> uninstall the
> > > > control software. In the interest of full disclosure, I don't think
> parcels
> > > > have any chance to landslide the consensus in the industry from Linux
> > > > packaging towards something so obscure and proprietary as parcels
> are.
> > > >
> > > >
> > > > And now to my actual points....:
> > > >
> > > > I do strongly believe the Bigtop was and is the only completely
> transparent,
> > > > vendors' friendly, and 100% sticking to official ASF product
> releases way of
> > > > building your stack from ground up, deploying and controlling it
> anyway you
> > > > want to. I agree with Roman's presentation on how this project can
> move
> > > > forward. However, I somewhat disagree with his view on the
> perspectives. It
> > > > might be a hard road to drive the opinion of the community.  But, it
> is a high
> > > > road.
> > > >
> > > > We are definitely small and mostly unsupported by commercial groups
> that are
> > > > using the framework. Being a box of LEGO won't win us anything. If
> anything,
> > > > the empirical evidences are against it as commercial distros have
> decided to
> > > > move towards their own means of "vendor lock-in" (yes, you hear me
> > > > right - that's exactly what I said: all so called open-source
> companies have
> > > > invented a way to lock-in their customers either with fancy
> "enterprise
> > > > features" that aren't adding but amending underlying stack; or with
> custom set
> > > > of patches oftentimes rendering the cluster to become incompatible
> between
> > > > different vendors).
> > > >
> > > > By all means, my money are on the second way, yet slightly modified
> (as
> > > > use-cases are coming from users, not developers):
> > > >  #2 start driving adoption of software stacks for the particular
> kind of data workloads
> > > >
> > > > This community has enough day-to-day practitioners on board to
> > > > accumulate a near-complete introspection of where the technology is
> moving.
> > > > And instead of wobbling in a backwash, let's see if we can be smart
> and define
> > > > this landscape. After all, Bigtop has adopted Spark well before any
> of the
> > > > commercials have officially accepted it. We seemingly are moving
> more and
> > > > more into in-memory realm of data processing: Apache Ignite
> (Gridgain),
> > > > Tachyon, Spark. I don't know how much legs Hive got in it, but I am
> doubtful,
> > > > that it can walk for much longer... May be it's just me.
> > > >
> > > > In this thread http://is.gd/MV2BH9 we already discussed some of the
> aspects
> > > > influencing the feature of this project. And we are de-facto working
> on the
> > > > implementation. In my opinion, Hadoop has been more or less
> commoditized
> > > > already. And it isn't a bad thing, but it means that the innovations
> are
> > > > elsewhere. E.g. Spark moving is moving beyond its ties with storage
> layer via
> > > > Tachyon abstraction; GridGain simply doesn't care what's underlying
> storage
> > > > is. However, data needs to be stored somewhere before it can be
> processed. And
> > > > HCFS seems to be fitting the bill ok. But, as I said already, I see
> the real
> > > > action elsewhere. If I were to define the shape of our mid- to
> long'ish term
> > > > roadmap it'd be something like that:
> > > >
> > > >            ^   Dashboard/Visualization  ^
> > > >            |     OLTP/ML processing     |
> > > >            |    Caching/Acceleration    |
> > > >            |         Storage            |
> > > >
> > > > And around this we can add/improve on deployment (R8???),
> > > > virtualization/containers/clouds.  In other words - let's focus on
> the
> > > > vertical part of the stack, instead of simply supporting the status
> quo.
> > > >
> > > > Does Cassandra fits the Storage layer in that model? I don't know
> and most
> > > > important - I don't care. If there's an interest and manpower to have
> > > > Cassandra-based stack - sure, but perhaps let's do as a separate
> branch or
> > > > something, so we aren't over-complicating things. As Roman said
> earlier, in
> > > > this case it'd be great to engage Cassandra/DataStax people into
> this project.
> > > > But something tells me they won't be eager to jump on board.
> > > >
> > > > And finally, all this above leads to "how": how we can start
> reshaping the
> > > > stack into its next incarnation? Perhaps, Ubuntu model might be an
> answer for
> > > > that, but we have discussed that elsewhere and dropped the idea as
> it wasn't
> > > > feasible back in the day. Perhaps its time just came?
> > > >
> > > > Apologies for a long post.
> > > >  Cos
> > > >
> > > >
> > > >> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
> > > >> Which other projects depend on BigTop?  How will the questions
> about the
> > > >> direction of BigTop affect those projects?
> > > >>
> > > >> On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <
> roman@shaposhnik.org>
> > > >> wrote:
> > > >>
> > > >>> Hi!
> > > >>>
> > > >>> On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <
> jayunit100.apache@gmail.com>
> > > >>> wrote:
> > > >>>> hi bigtop !
> > > >>>>
> > > >>>> I thought id start a thread a few vaguely related thoughts i have
> around
> > > >>>> next couple iterations of bigtop.
> > > >>>
> > > >>> I think in general I see two major ways for something like
> > > >>> Bigtop to evolve:
> > > >>>   #1 remain a 'box of LEGO bricks' with very little opinion on
> > > >>>        how these pieces need to be integrated
> > > >>>   #2 start driving oppinioned use-cases for the particular kind of
> > > >>>        bigdata workloads
> > > >>>
> > > >>> #1 is sort of what all of the Linux distros have been doing for
> > > >>> the majority of time they existed. #2 is close to what CentOS
> > > >>> is doing with SIGs.
> > > >>>
> > > >>> Honestly, given the size of our community so far and a total
> > > >>> lack of corporate backing (with a small exception of Cloudera
> > > >>> still paying for our EC2 time) I think #1 is all we can do. I'd
> > > >>> love to be wrong, though.
> > > >>>
> > > >>>> 1) Hive:  How will bigtop to evolve to support it, now that it is
> much
> > > >>> more
> > > >>>> than a mapreduce query wrapper?
> > > >>>
> > > >>> I think Hive will remain a big part of Hadoop workloads for
> forseeable
> > > >>> future. What I'd love to see more of is rationalizing things like
> how
> > > >>> HCatalog, etc. need to be deployed.
> > > >>>
> > > >>>> 2) I wonder wether we should confirm cassandra interoperability
> of spark
> > > >>> in
> > > >>>> bigtop distros,
> > > >>>
> > > >>> Only if there's a significant interest from cassandra community
> and even
> > > >>> then my biggest fear is that with cassandra we're totally changing
> the
> > > >>> requirements for the underlying storage subsystem (nothing wrong
> with
> > > >>> that, its just that in Hadoop ecosystem everything assumes very
> HDFS'ish
> > > >>> requirements for the scale-out storage).
> > > >>>
> > > >>>> 4) in general, i think bigtop can move in one of 3 directions.
> > > >>>>
> > > >>>>  EXPAND ? : Expanding to include new components, with just basic
> > > >>> interop,
> > > >>>> and let folks evolve their own stacks on top of bigtop on their
> own.
> > > >>>>
> > > >>>>  CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
> > > >>> components,
> > > >>>> with super high quality.
> > > >>>>
> > > >>>>  STAY THE COURSE ? Staying the same ~ a packaging platform for
> just
> > > >>>> hadoop's direct ecosystem.
> > > >>>>
> > > >>>> I am intrigued by the idea of A and B both have clear benefits and
> > > >>> costs...
> > > >>>> would like to see the opinions of folks --- do we  lean in one
> direction
> > > >>> or
> > > >>>> another? What is the criteria for adding a new feature, package,
> stack to
> > > >>>> bigtop?
> > > >>>>
> > > >>>> ... Or maybe im just overthinking it and should be spending this
> time
> > > >>>> testing spark for 0.9 release....
> > > >>>
> > > >>> I'd love to know what other think, but for 0.9 I'd rather stay the
> course.
> > > >>>
> > > >>> Thanks,
> > > >>> Roman.
> > > >>>
> > > >>> P.S. There are also market forces at play that may fundamentally
> change
> > > >>> the focus of what we're all working on in the year or so.
> > > >>>
>
>
>

Re: What will the next generation of bigtop look like?

Posted by Konstantin Boudnik <co...@apache.org>.

On Mon, Dec 08, 2014 at 09:16PM, Konstantin Boudnik wrote:
> On Mon, Dec 08, 2014 at 11:57PM, Jay Vyas wrote:
> > "Let's see if we can be smart and define the landscape"
> > 
> > Well put @cos...I think Romans point was that it would be hard, not that it
> > would be bad. And I think you're both right : it's hard? Yes. But
> > worthwhile... Possibly? Next step we will all have to get in a room and
> > think about this face to face.
> > 
> > Let's shoot for a meetup after january in California... Where we can plan
> > the future direction of bigtop.  In the meanwhile hope to hear more opinions
> > on this.
> 
> +1 I can host at WANdisco or perhaps there other options?

Shall we start putting some arrangements/planning for the January
meetup? Say, 2nd week or January? Or the following one?

Andre, do you guys want to host it at A9? Anyone else? I am happy to do this
at my office, but it might be a bit of travel, although against the traffic
both ways.

Cos

> > > On Dec 8, 2014, at 3:23 PM, Konstantin Boudnik <co...@apache.org> wrote:
> > > 
> > > First I want to address the RJ's question:
> > > 
> > > The most prominent downstream Bigtop Dependency would be any commercial
> > > Hadoop distribution like HDP and CDH. The former is trying to
> > > disguise their affiliation by pushing Ambari forward, and Cloudera's seemingly
> > > shifting her focus to compressed tarballs media (aka parcels) which requires
> > > a closed-source solutions like Cloudera Manager to deploy and control your
> > > cluster, effectively rendering it useless if you ever decide to uninstall the
> > > control software. In the interest of full disclosure, I don't think parcels
> > > have any chance to landslide the consensus in the industry from Linux
> > > packaging towards something so obscure and proprietary as parcels are.
> > > 
> > > 
> > > And now to my actual points....:
> > > 
> > > I do strongly believe the Bigtop was and is the only completely transparent,
> > > vendors' friendly, and 100% sticking to official ASF product releases way of
> > > building your stack from ground up, deploying and controlling it anyway you
> > > want to. I agree with Roman's presentation on how this project can move
> > > forward. However, I somewhat disagree with his view on the perspectives. It
> > > might be a hard road to drive the opinion of the community.  But, it is a high
> > > road.
> > > 
> > > We are definitely small and mostly unsupported by commercial groups that are
> > > using the framework. Being a box of LEGO won't win us anything. If anything,
> > > the empirical evidences are against it as commercial distros have decided to
> > > move towards their own means of "vendor lock-in" (yes, you hear me
> > > right - that's exactly what I said: all so called open-source companies have
> > > invented a way to lock-in their customers either with fancy "enterprise
> > > features" that aren't adding but amending underlying stack; or with custom set
> > > of patches oftentimes rendering the cluster to become incompatible between
> > > different vendors).
> > > 
> > > By all means, my money are on the second way, yet slightly modified (as
> > > use-cases are coming from users, not developers):
> > >  #2 start driving adoption of software stacks for the particular kind of data workloads
> > > 
> > > This community has enough day-to-day practitioners on board to
> > > accumulate a near-complete introspection of where the technology is moving.
> > > And instead of wobbling in a backwash, let's see if we can be smart and define
> > > this landscape. After all, Bigtop has adopted Spark well before any of the
> > > commercials have officially accepted it. We seemingly are moving more and
> > > more into in-memory realm of data processing: Apache Ignite (Gridgain),
> > > Tachyon, Spark. I don't know how much legs Hive got in it, but I am doubtful,
> > > that it can walk for much longer... May be it's just me.
> > > 
> > > In this thread http://is.gd/MV2BH9 we already discussed some of the aspects
> > > influencing the feature of this project. And we are de-facto working on the
> > > implementation. In my opinion, Hadoop has been more or less commoditized
> > > already. And it isn't a bad thing, but it means that the innovations are
> > > elsewhere. E.g. Spark moving is moving beyond its ties with storage layer via
> > > Tachyon abstraction; GridGain simply doesn't care what's underlying storage
> > > is. However, data needs to be stored somewhere before it can be processed. And
> > > HCFS seems to be fitting the bill ok. But, as I said already, I see the real
> > > action elsewhere. If I were to define the shape of our mid- to long'ish term
> > > roadmap it'd be something like that:
> > > 
> > >            ^   Dashboard/Visualization  ^
> > >            |     OLTP/ML processing     |
> > >            |    Caching/Acceleration    |
> > >            |         Storage            |
> > > 
> > > And around this we can add/improve on deployment (R8???),
> > > virtualization/containers/clouds.  In other words - let's focus on the
> > > vertical part of the stack, instead of simply supporting the status quo.
> > > 
> > > Does Cassandra fits the Storage layer in that model? I don't know and most
> > > important - I don't care. If there's an interest and manpower to have
> > > Cassandra-based stack - sure, but perhaps let's do as a separate branch or
> > > something, so we aren't over-complicating things. As Roman said earlier, in
> > > this case it'd be great to engage Cassandra/DataStax people into this project.
> > > But something tells me they won't be eager to jump on board.
> > > 
> > > And finally, all this above leads to "how": how we can start reshaping the
> > > stack into its next incarnation? Perhaps, Ubuntu model might be an answer for
> > > that, but we have discussed that elsewhere and dropped the idea as it wasn't
> > > feasible back in the day. Perhaps its time just came?
> > > 
> > > Apologies for a long post.
> > >  Cos
> > > 
> > > 
> > >> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
> > >> Which other projects depend on BigTop?  How will the questions about the
> > >> direction of BigTop affect those projects?
> > >> 
> > >> On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <ro...@shaposhnik.org>
> > >> wrote:
> > >> 
> > >>> Hi!
> > >>> 
> > >>> On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <ja...@gmail.com>
> > >>> wrote:
> > >>>> hi bigtop !
> > >>>> 
> > >>>> I thought id start a thread a few vaguely related thoughts i have around
> > >>>> next couple iterations of bigtop.
> > >>> 
> > >>> I think in general I see two major ways for something like
> > >>> Bigtop to evolve:
> > >>>   #1 remain a 'box of LEGO bricks' with very little opinion on
> > >>>        how these pieces need to be integrated
> > >>>   #2 start driving oppinioned use-cases for the particular kind of
> > >>>        bigdata workloads
> > >>> 
> > >>> #1 is sort of what all of the Linux distros have been doing for
> > >>> the majority of time they existed. #2 is close to what CentOS
> > >>> is doing with SIGs.
> > >>> 
> > >>> Honestly, given the size of our community so far and a total
> > >>> lack of corporate backing (with a small exception of Cloudera
> > >>> still paying for our EC2 time) I think #1 is all we can do. I'd
> > >>> love to be wrong, though.
> > >>> 
> > >>>> 1) Hive:  How will bigtop to evolve to support it, now that it is much
> > >>> more
> > >>>> than a mapreduce query wrapper?
> > >>> 
> > >>> I think Hive will remain a big part of Hadoop workloads for forseeable
> > >>> future. What I'd love to see more of is rationalizing things like how
> > >>> HCatalog, etc. need to be deployed.
> > >>> 
> > >>>> 2) I wonder wether we should confirm cassandra interoperability of spark
> > >>> in
> > >>>> bigtop distros,
> > >>> 
> > >>> Only if there's a significant interest from cassandra community and even
> > >>> then my biggest fear is that with cassandra we're totally changing the
> > >>> requirements for the underlying storage subsystem (nothing wrong with
> > >>> that, its just that in Hadoop ecosystem everything assumes very HDFS'ish
> > >>> requirements for the scale-out storage).
> > >>> 
> > >>>> 4) in general, i think bigtop can move in one of 3 directions.
> > >>>> 
> > >>>>  EXPAND ? : Expanding to include new components, with just basic
> > >>> interop,
> > >>>> and let folks evolve their own stacks on top of bigtop on their own.
> > >>>> 
> > >>>>  CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
> > >>> components,
> > >>>> with super high quality.
> > >>>> 
> > >>>>  STAY THE COURSE ? Staying the same ~ a packaging platform for just
> > >>>> hadoop's direct ecosystem.
> > >>>> 
> > >>>> I am intrigued by the idea of A and B both have clear benefits and
> > >>> costs...
> > >>>> would like to see the opinions of folks --- do we  lean in one direction
> > >>> or
> > >>>> another? What is the criteria for adding a new feature, package, stack to
> > >>>> bigtop?
> > >>>> 
> > >>>> ... Or maybe im just overthinking it and should be spending this time
> > >>>> testing spark for 0.9 release....
> > >>> 
> > >>> I'd love to know what other think, but for 0.9 I'd rather stay the course.
> > >>> 
> > >>> Thanks,
> > >>> Roman.
> > >>> 
> > >>> P.S. There are also market forces at play that may fundamentally change
> > >>> the focus of what we're all working on in the year or so.
> > >>>

Re: What will the next generation of bigtop look like?

Posted by Konstantin Boudnik <co...@apache.org>.

On Mon, Dec 08, 2014 at 09:16PM, Konstantin Boudnik wrote:
> On Mon, Dec 08, 2014 at 11:57PM, Jay Vyas wrote:
> > "Let's see if we can be smart and define the landscape"
> > 
> > Well put @cos...I think Romans point was that it would be hard, not that it
> > would be bad. And I think you're both right : it's hard? Yes. But
> > worthwhile... Possibly? Next step we will all have to get in a room and
> > think about this face to face.
> > 
> > Let's shoot for a meetup after january in California... Where we can plan
> > the future direction of bigtop.  In the meanwhile hope to hear more opinions
> > on this.
> 
> +1 I can host at WANdisco or perhaps there other options?

Shall we start putting some arrangements/planning for the January
meetup? Say, 2nd week or January? Or the following one?

Andre, do you guys want to host it at A9? Anyone else? I am happy to do this
at my office, but it might be a bit of travel, although against the traffic
both ways.

Cos

> > > On Dec 8, 2014, at 3:23 PM, Konstantin Boudnik <co...@apache.org> wrote:
> > > 
> > > First I want to address the RJ's question:
> > > 
> > > The most prominent downstream Bigtop Dependency would be any commercial
> > > Hadoop distribution like HDP and CDH. The former is trying to
> > > disguise their affiliation by pushing Ambari forward, and Cloudera's seemingly
> > > shifting her focus to compressed tarballs media (aka parcels) which requires
> > > a closed-source solutions like Cloudera Manager to deploy and control your
> > > cluster, effectively rendering it useless if you ever decide to uninstall the
> > > control software. In the interest of full disclosure, I don't think parcels
> > > have any chance to landslide the consensus in the industry from Linux
> > > packaging towards something so obscure and proprietary as parcels are.
> > > 
> > > 
> > > And now to my actual points....:
> > > 
> > > I do strongly believe the Bigtop was and is the only completely transparent,
> > > vendors' friendly, and 100% sticking to official ASF product releases way of
> > > building your stack from ground up, deploying and controlling it anyway you
> > > want to. I agree with Roman's presentation on how this project can move
> > > forward. However, I somewhat disagree with his view on the perspectives. It
> > > might be a hard road to drive the opinion of the community.  But, it is a high
> > > road.
> > > 
> > > We are definitely small and mostly unsupported by commercial groups that are
> > > using the framework. Being a box of LEGO won't win us anything. If anything,
> > > the empirical evidences are against it as commercial distros have decided to
> > > move towards their own means of "vendor lock-in" (yes, you hear me
> > > right - that's exactly what I said: all so called open-source companies have
> > > invented a way to lock-in their customers either with fancy "enterprise
> > > features" that aren't adding but amending underlying stack; or with custom set
> > > of patches oftentimes rendering the cluster to become incompatible between
> > > different vendors).
> > > 
> > > By all means, my money are on the second way, yet slightly modified (as
> > > use-cases are coming from users, not developers):
> > >  #2 start driving adoption of software stacks for the particular kind of data workloads
> > > 
> > > This community has enough day-to-day practitioners on board to
> > > accumulate a near-complete introspection of where the technology is moving.
> > > And instead of wobbling in a backwash, let's see if we can be smart and define
> > > this landscape. After all, Bigtop has adopted Spark well before any of the
> > > commercials have officially accepted it. We seemingly are moving more and
> > > more into in-memory realm of data processing: Apache Ignite (Gridgain),
> > > Tachyon, Spark. I don't know how much legs Hive got in it, but I am doubtful,
> > > that it can walk for much longer... May be it's just me.
> > > 
> > > In this thread http://is.gd/MV2BH9 we already discussed some of the aspects
> > > influencing the feature of this project. And we are de-facto working on the
> > > implementation. In my opinion, Hadoop has been more or less commoditized
> > > already. And it isn't a bad thing, but it means that the innovations are
> > > elsewhere. E.g. Spark moving is moving beyond its ties with storage layer via
> > > Tachyon abstraction; GridGain simply doesn't care what's underlying storage
> > > is. However, data needs to be stored somewhere before it can be processed. And
> > > HCFS seems to be fitting the bill ok. But, as I said already, I see the real
> > > action elsewhere. If I were to define the shape of our mid- to long'ish term
> > > roadmap it'd be something like that:
> > > 
> > >            ^   Dashboard/Visualization  ^
> > >            |     OLTP/ML processing     |
> > >            |    Caching/Acceleration    |
> > >            |         Storage            |
> > > 
> > > And around this we can add/improve on deployment (R8???),
> > > virtualization/containers/clouds.  In other words - let's focus on the
> > > vertical part of the stack, instead of simply supporting the status quo.
> > > 
> > > Does Cassandra fits the Storage layer in that model? I don't know and most
> > > important - I don't care. If there's an interest and manpower to have
> > > Cassandra-based stack - sure, but perhaps let's do as a separate branch or
> > > something, so we aren't over-complicating things. As Roman said earlier, in
> > > this case it'd be great to engage Cassandra/DataStax people into this project.
> > > But something tells me they won't be eager to jump on board.
> > > 
> > > And finally, all this above leads to "how": how we can start reshaping the
> > > stack into its next incarnation? Perhaps, Ubuntu model might be an answer for
> > > that, but we have discussed that elsewhere and dropped the idea as it wasn't
> > > feasible back in the day. Perhaps its time just came?
> > > 
> > > Apologies for a long post.
> > >  Cos
> > > 
> > > 
> > >> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
> > >> Which other projects depend on BigTop?  How will the questions about the
> > >> direction of BigTop affect those projects?
> > >> 
> > >> On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <ro...@shaposhnik.org>
> > >> wrote:
> > >> 
> > >>> Hi!
> > >>> 
> > >>> On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <ja...@gmail.com>
> > >>> wrote:
> > >>>> hi bigtop !
> > >>>> 
> > >>>> I thought id start a thread a few vaguely related thoughts i have around
> > >>>> next couple iterations of bigtop.
> > >>> 
> > >>> I think in general I see two major ways for something like
> > >>> Bigtop to evolve:
> > >>>   #1 remain a 'box of LEGO bricks' with very little opinion on
> > >>>        how these pieces need to be integrated
> > >>>   #2 start driving oppinioned use-cases for the particular kind of
> > >>>        bigdata workloads
> > >>> 
> > >>> #1 is sort of what all of the Linux distros have been doing for
> > >>> the majority of time they existed. #2 is close to what CentOS
> > >>> is doing with SIGs.
> > >>> 
> > >>> Honestly, given the size of our community so far and a total
> > >>> lack of corporate backing (with a small exception of Cloudera
> > >>> still paying for our EC2 time) I think #1 is all we can do. I'd
> > >>> love to be wrong, though.
> > >>> 
> > >>>> 1) Hive:  How will bigtop to evolve to support it, now that it is much
> > >>> more
> > >>>> than a mapreduce query wrapper?
> > >>> 
> > >>> I think Hive will remain a big part of Hadoop workloads for forseeable
> > >>> future. What I'd love to see more of is rationalizing things like how
> > >>> HCatalog, etc. need to be deployed.
> > >>> 
> > >>>> 2) I wonder wether we should confirm cassandra interoperability of spark
> > >>> in
> > >>>> bigtop distros,
> > >>> 
> > >>> Only if there's a significant interest from cassandra community and even
> > >>> then my biggest fear is that with cassandra we're totally changing the
> > >>> requirements for the underlying storage subsystem (nothing wrong with
> > >>> that, its just that in Hadoop ecosystem everything assumes very HDFS'ish
> > >>> requirements for the scale-out storage).
> > >>> 
> > >>>> 4) in general, i think bigtop can move in one of 3 directions.
> > >>>> 
> > >>>>  EXPAND ? : Expanding to include new components, with just basic
> > >>> interop,
> > >>>> and let folks evolve their own stacks on top of bigtop on their own.
> > >>>> 
> > >>>>  CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
> > >>> components,
> > >>>> with super high quality.
> > >>>> 
> > >>>>  STAY THE COURSE ? Staying the same ~ a packaging platform for just
> > >>>> hadoop's direct ecosystem.
> > >>>> 
> > >>>> I am intrigued by the idea of A and B both have clear benefits and
> > >>> costs...
> > >>>> would like to see the opinions of folks --- do we  lean in one direction
> > >>> or
> > >>>> another? What is the criteria for adding a new feature, package, stack to
> > >>>> bigtop?
> > >>>> 
> > >>>> ... Or maybe im just overthinking it and should be spending this time
> > >>>> testing spark for 0.9 release....
> > >>> 
> > >>> I'd love to know what other think, but for 0.9 I'd rather stay the course.
> > >>> 
> > >>> Thanks,
> > >>> Roman.
> > >>> 
> > >>> P.S. There are also market forces at play that may fundamentally change
> > >>> the focus of what we're all working on in the year or so.
> > >>>

Re: What will the next generation of bigtop look like?

Posted by Konstantin Boudnik <co...@apache.org>.

On Mon, Dec 08, 2014 at 11:57PM, Jay Vyas wrote:
> "Let's see if we can be smart and define the landscape"
> 
> Well put @cos...I think Romans point was that it would be hard, not that it
> would be bad. And I think you're both right : it's hard? Yes. But
> worthwhile... Possibly? Next step we will all have to get in a room and
> think about this face to face.
> 
> Let's shoot for a meetup after january in California... Where we can plan
> the future direction of bigtop.  In the meanwhile hope to hear more opinions
> on this.

+1 I can host at WANdisco or perhaps there other options?

Cos

> > On Dec 8, 2014, at 3:23 PM, Konstantin Boudnik <co...@apache.org> wrote:
> > 
> > First I want to address the RJ's question:
> > 
> > The most prominent downstream Bigtop Dependency would be any commercial
> > Hadoop distribution like HDP and CDH. The former is trying to
> > disguise their affiliation by pushing Ambari forward, and Cloudera's seemingly
> > shifting her focus to compressed tarballs media (aka parcels) which requires
> > a closed-source solutions like Cloudera Manager to deploy and control your
> > cluster, effectively rendering it useless if you ever decide to uninstall the
> > control software. In the interest of full disclosure, I don't think parcels
> > have any chance to landslide the consensus in the industry from Linux
> > packaging towards something so obscure and proprietary as parcels are.
> > 
> > 
> > And now to my actual points....:
> > 
> > I do strongly believe the Bigtop was and is the only completely transparent,
> > vendors' friendly, and 100% sticking to official ASF product releases way of
> > building your stack from ground up, deploying and controlling it anyway you
> > want to. I agree with Roman's presentation on how this project can move
> > forward. However, I somewhat disagree with his view on the perspectives. It
> > might be a hard road to drive the opinion of the community.  But, it is a high
> > road.
> > 
> > We are definitely small and mostly unsupported by commercial groups that are
> > using the framework. Being a box of LEGO won't win us anything. If anything,
> > the empirical evidences are against it as commercial distros have decided to
> > move towards their own means of "vendor lock-in" (yes, you hear me
> > right - that's exactly what I said: all so called open-source companies have
> > invented a way to lock-in their customers either with fancy "enterprise
> > features" that aren't adding but amending underlying stack; or with custom set
> > of patches oftentimes rendering the cluster to become incompatible between
> > different vendors).
> > 
> > By all means, my money are on the second way, yet slightly modified (as
> > use-cases are coming from users, not developers):
> >  #2 start driving adoption of software stacks for the particular kind of data workloads
> > 
> > This community has enough day-to-day practitioners on board to
> > accumulate a near-complete introspection of where the technology is moving.
> > And instead of wobbling in a backwash, let's see if we can be smart and define
> > this landscape. After all, Bigtop has adopted Spark well before any of the
> > commercials have officially accepted it. We seemingly are moving more and
> > more into in-memory realm of data processing: Apache Ignite (Gridgain),
> > Tachyon, Spark. I don't know how much legs Hive got in it, but I am doubtful,
> > that it can walk for much longer... May be it's just me.
> > 
> > In this thread http://is.gd/MV2BH9 we already discussed some of the aspects
> > influencing the feature of this project. And we are de-facto working on the
> > implementation. In my opinion, Hadoop has been more or less commoditized
> > already. And it isn't a bad thing, but it means that the innovations are
> > elsewhere. E.g. Spark moving is moving beyond its ties with storage layer via
> > Tachyon abstraction; GridGain simply doesn't care what's underlying storage
> > is. However, data needs to be stored somewhere before it can be processed. And
> > HCFS seems to be fitting the bill ok. But, as I said already, I see the real
> > action elsewhere. If I were to define the shape of our mid- to long'ish term
> > roadmap it'd be something like that:
> > 
> >            ^   Dashboard/Visualization  ^
> >            |     OLTP/ML processing     |
> >            |    Caching/Acceleration    |
> >            |         Storage            |
> > 
> > And around this we can add/improve on deployment (R8???),
> > virtualization/containers/clouds.  In other words - let's focus on the
> > vertical part of the stack, instead of simply supporting the status quo.
> > 
> > Does Cassandra fits the Storage layer in that model? I don't know and most
> > important - I don't care. If there's an interest and manpower to have
> > Cassandra-based stack - sure, but perhaps let's do as a separate branch or
> > something, so we aren't over-complicating things. As Roman said earlier, in
> > this case it'd be great to engage Cassandra/DataStax people into this project.
> > But something tells me they won't be eager to jump on board.
> > 
> > And finally, all this above leads to "how": how we can start reshaping the
> > stack into its next incarnation? Perhaps, Ubuntu model might be an answer for
> > that, but we have discussed that elsewhere and dropped the idea as it wasn't
> > feasible back in the day. Perhaps its time just came?
> > 
> > Apologies for a long post.
> >  Cos
> > 
> > 
> >> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
> >> Which other projects depend on BigTop?  How will the questions about the
> >> direction of BigTop affect those projects?
> >> 
> >> On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <ro...@shaposhnik.org>
> >> wrote:
> >> 
> >>> Hi!
> >>> 
> >>> On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <ja...@gmail.com>
> >>> wrote:
> >>>> hi bigtop !
> >>>> 
> >>>> I thought id start a thread a few vaguely related thoughts i have around
> >>>> next couple iterations of bigtop.
> >>> 
> >>> I think in general I see two major ways for something like
> >>> Bigtop to evolve:
> >>>   #1 remain a 'box of LEGO bricks' with very little opinion on
> >>>        how these pieces need to be integrated
> >>>   #2 start driving oppinioned use-cases for the particular kind of
> >>>        bigdata workloads
> >>> 
> >>> #1 is sort of what all of the Linux distros have been doing for
> >>> the majority of time they existed. #2 is close to what CentOS
> >>> is doing with SIGs.
> >>> 
> >>> Honestly, given the size of our community so far and a total
> >>> lack of corporate backing (with a small exception of Cloudera
> >>> still paying for our EC2 time) I think #1 is all we can do. I'd
> >>> love to be wrong, though.
> >>> 
> >>>> 1) Hive:  How will bigtop to evolve to support it, now that it is much
> >>> more
> >>>> than a mapreduce query wrapper?
> >>> 
> >>> I think Hive will remain a big part of Hadoop workloads for forseeable
> >>> future. What I'd love to see more of is rationalizing things like how
> >>> HCatalog, etc. need to be deployed.
> >>> 
> >>>> 2) I wonder wether we should confirm cassandra interoperability of spark
> >>> in
> >>>> bigtop distros,
> >>> 
> >>> Only if there's a significant interest from cassandra community and even
> >>> then my biggest fear is that with cassandra we're totally changing the
> >>> requirements for the underlying storage subsystem (nothing wrong with
> >>> that, its just that in Hadoop ecosystem everything assumes very HDFS'ish
> >>> requirements for the scale-out storage).
> >>> 
> >>>> 4) in general, i think bigtop can move in one of 3 directions.
> >>>> 
> >>>>  EXPAND ? : Expanding to include new components, with just basic
> >>> interop,
> >>>> and let folks evolve their own stacks on top of bigtop on their own.
> >>>> 
> >>>>  CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
> >>> components,
> >>>> with super high quality.
> >>>> 
> >>>>  STAY THE COURSE ? Staying the same ~ a packaging platform for just
> >>>> hadoop's direct ecosystem.
> >>>> 
> >>>> I am intrigued by the idea of A and B both have clear benefits and
> >>> costs...
> >>>> would like to see the opinions of folks --- do we  lean in one direction
> >>> or
> >>>> another? What is the criteria for adding a new feature, package, stack to
> >>>> bigtop?
> >>>> 
> >>>> ... Or maybe im just overthinking it and should be spending this time
> >>>> testing spark for 0.9 release....
> >>> 
> >>> I'd love to know what other think, but for 0.9 I'd rather stay the course.
> >>> 
> >>> Thanks,
> >>> Roman.
> >>> 
> >>> P.S. There are also market forces at play that may fundamentally change
> >>> the focus of what we're all working on in the year or so.
> >>>

Re: What will the next generation of bigtop look like?

Posted by Konstantin Boudnik <co...@apache.org>.

On Mon, Dec 08, 2014 at 11:57PM, Jay Vyas wrote:
> "Let's see if we can be smart and define the landscape"
> 
> Well put @cos...I think Romans point was that it would be hard, not that it
> would be bad. And I think you're both right : it's hard? Yes. But
> worthwhile... Possibly? Next step we will all have to get in a room and
> think about this face to face.
> 
> Let's shoot for a meetup after january in California... Where we can plan
> the future direction of bigtop.  In the meanwhile hope to hear more opinions
> on this.

+1 I can host at WANdisco or perhaps there other options?

Cos

> > On Dec 8, 2014, at 3:23 PM, Konstantin Boudnik <co...@apache.org> wrote:
> > 
> > First I want to address the RJ's question:
> > 
> > The most prominent downstream Bigtop Dependency would be any commercial
> > Hadoop distribution like HDP and CDH. The former is trying to
> > disguise their affiliation by pushing Ambari forward, and Cloudera's seemingly
> > shifting her focus to compressed tarballs media (aka parcels) which requires
> > a closed-source solutions like Cloudera Manager to deploy and control your
> > cluster, effectively rendering it useless if you ever decide to uninstall the
> > control software. In the interest of full disclosure, I don't think parcels
> > have any chance to landslide the consensus in the industry from Linux
> > packaging towards something so obscure and proprietary as parcels are.
> > 
> > 
> > And now to my actual points....:
> > 
> > I do strongly believe the Bigtop was and is the only completely transparent,
> > vendors' friendly, and 100% sticking to official ASF product releases way of
> > building your stack from ground up, deploying and controlling it anyway you
> > want to. I agree with Roman's presentation on how this project can move
> > forward. However, I somewhat disagree with his view on the perspectives. It
> > might be a hard road to drive the opinion of the community.  But, it is a high
> > road.
> > 
> > We are definitely small and mostly unsupported by commercial groups that are
> > using the framework. Being a box of LEGO won't win us anything. If anything,
> > the empirical evidences are against it as commercial distros have decided to
> > move towards their own means of "vendor lock-in" (yes, you hear me
> > right - that's exactly what I said: all so called open-source companies have
> > invented a way to lock-in their customers either with fancy "enterprise
> > features" that aren't adding but amending underlying stack; or with custom set
> > of patches oftentimes rendering the cluster to become incompatible between
> > different vendors).
> > 
> > By all means, my money are on the second way, yet slightly modified (as
> > use-cases are coming from users, not developers):
> >  #2 start driving adoption of software stacks for the particular kind of data workloads
> > 
> > This community has enough day-to-day practitioners on board to
> > accumulate a near-complete introspection of where the technology is moving.
> > And instead of wobbling in a backwash, let's see if we can be smart and define
> > this landscape. After all, Bigtop has adopted Spark well before any of the
> > commercials have officially accepted it. We seemingly are moving more and
> > more into in-memory realm of data processing: Apache Ignite (Gridgain),
> > Tachyon, Spark. I don't know how much legs Hive got in it, but I am doubtful,
> > that it can walk for much longer... May be it's just me.
> > 
> > In this thread http://is.gd/MV2BH9 we already discussed some of the aspects
> > influencing the feature of this project. And we are de-facto working on the
> > implementation. In my opinion, Hadoop has been more or less commoditized
> > already. And it isn't a bad thing, but it means that the innovations are
> > elsewhere. E.g. Spark moving is moving beyond its ties with storage layer via
> > Tachyon abstraction; GridGain simply doesn't care what's underlying storage
> > is. However, data needs to be stored somewhere before it can be processed. And
> > HCFS seems to be fitting the bill ok. But, as I said already, I see the real
> > action elsewhere. If I were to define the shape of our mid- to long'ish term
> > roadmap it'd be something like that:
> > 
> >            ^   Dashboard/Visualization  ^
> >            |     OLTP/ML processing     |
> >            |    Caching/Acceleration    |
> >            |         Storage            |
> > 
> > And around this we can add/improve on deployment (R8???),
> > virtualization/containers/clouds.  In other words - let's focus on the
> > vertical part of the stack, instead of simply supporting the status quo.
> > 
> > Does Cassandra fits the Storage layer in that model? I don't know and most
> > important - I don't care. If there's an interest and manpower to have
> > Cassandra-based stack - sure, but perhaps let's do as a separate branch or
> > something, so we aren't over-complicating things. As Roman said earlier, in
> > this case it'd be great to engage Cassandra/DataStax people into this project.
> > But something tells me they won't be eager to jump on board.
> > 
> > And finally, all this above leads to "how": how we can start reshaping the
> > stack into its next incarnation? Perhaps, Ubuntu model might be an answer for
> > that, but we have discussed that elsewhere and dropped the idea as it wasn't
> > feasible back in the day. Perhaps its time just came?
> > 
> > Apologies for a long post.
> >  Cos
> > 
> > 
> >> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
> >> Which other projects depend on BigTop?  How will the questions about the
> >> direction of BigTop affect those projects?
> >> 
> >> On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <ro...@shaposhnik.org>
> >> wrote:
> >> 
> >>> Hi!
> >>> 
> >>> On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <ja...@gmail.com>
> >>> wrote:
> >>>> hi bigtop !
> >>>> 
> >>>> I thought id start a thread a few vaguely related thoughts i have around
> >>>> next couple iterations of bigtop.
> >>> 
> >>> I think in general I see two major ways for something like
> >>> Bigtop to evolve:
> >>>   #1 remain a 'box of LEGO bricks' with very little opinion on
> >>>        how these pieces need to be integrated
> >>>   #2 start driving oppinioned use-cases for the particular kind of
> >>>        bigdata workloads
> >>> 
> >>> #1 is sort of what all of the Linux distros have been doing for
> >>> the majority of time they existed. #2 is close to what CentOS
> >>> is doing with SIGs.
> >>> 
> >>> Honestly, given the size of our community so far and a total
> >>> lack of corporate backing (with a small exception of Cloudera
> >>> still paying for our EC2 time) I think #1 is all we can do. I'd
> >>> love to be wrong, though.
> >>> 
> >>>> 1) Hive:  How will bigtop to evolve to support it, now that it is much
> >>> more
> >>>> than a mapreduce query wrapper?
> >>> 
> >>> I think Hive will remain a big part of Hadoop workloads for forseeable
> >>> future. What I'd love to see more of is rationalizing things like how
> >>> HCatalog, etc. need to be deployed.
> >>> 
> >>>> 2) I wonder wether we should confirm cassandra interoperability of spark
> >>> in
> >>>> bigtop distros,
> >>> 
> >>> Only if there's a significant interest from cassandra community and even
> >>> then my biggest fear is that with cassandra we're totally changing the
> >>> requirements for the underlying storage subsystem (nothing wrong with
> >>> that, its just that in Hadoop ecosystem everything assumes very HDFS'ish
> >>> requirements for the scale-out storage).
> >>> 
> >>>> 4) in general, i think bigtop can move in one of 3 directions.
> >>>> 
> >>>>  EXPAND ? : Expanding to include new components, with just basic
> >>> interop,
> >>>> and let folks evolve their own stacks on top of bigtop on their own.
> >>>> 
> >>>>  CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
> >>> components,
> >>>> with super high quality.
> >>>> 
> >>>>  STAY THE COURSE ? Staying the same ~ a packaging platform for just
> >>>> hadoop's direct ecosystem.
> >>>> 
> >>>> I am intrigued by the idea of A and B both have clear benefits and
> >>> costs...
> >>>> would like to see the opinions of folks --- do we  lean in one direction
> >>> or
> >>>> another? What is the criteria for adding a new feature, package, stack to
> >>>> bigtop?
> >>>> 
> >>>> ... Or maybe im just overthinking it and should be spending this time
> >>>> testing spark for 0.9 release....
> >>> 
> >>> I'd love to know what other think, but for 0.9 I'd rather stay the course.
> >>> 
> >>> Thanks,
> >>> Roman.
> >>> 
> >>> P.S. There are also market forces at play that may fundamentally change
> >>> the focus of what we're all working on in the year or so.
> >>>

Re: What will the next generation of bigtop look like?

Posted by Jay Vyas <ja...@gmail.com>.

"Let's see if we can be smart and define the landscape"

Well put @cos...I think Romans point was that it would be hard, not that it would be bad. And I think you're both right : it's hard? Yes. But worthwhile... Possibly? Next step we will all have to get in a room and think about this face to face.

Let's shoot for a meetup after january in California... Where we can plan the future direction of bigtop.  In the meanwhile hope to hear more opinions on this.


> On Dec 8, 2014, at 3:23 PM, Konstantin Boudnik <co...@apache.org> wrote:
> 
> First I want to address the RJ's question:
> 
> The most prominent downstream Bigtop Dependency would be any commercial
> Hadoop distribution like HDP and CDH. The former is trying to
> disguise their affiliation by pushing Ambari forward, and Cloudera's seemingly
> shifting her focus to compressed tarballs media (aka parcels) which requires
> a closed-source solutions like Cloudera Manager to deploy and control your
> cluster, effectively rendering it useless if you ever decide to uninstall the
> control software. In the interest of full disclosure, I don't think parcels
> have any chance to landslide the consensus in the industry from Linux
> packaging towards something so obscure and proprietary as parcels are.
> 
> 
> And now to my actual points....:
> 
> I do strongly believe the Bigtop was and is the only completely transparent,
> vendors' friendly, and 100% sticking to official ASF product releases way of
> building your stack from ground up, deploying and controlling it anyway you
> want to. I agree with Roman's presentation on how this project can move
> forward. However, I somewhat disagree with his view on the perspectives. It
> might be a hard road to drive the opinion of the community.  But, it is a high
> road.
> 
> We are definitely small and mostly unsupported by commercial groups that are
> using the framework. Being a box of LEGO won't win us anything. If anything,
> the empirical evidences are against it as commercial distros have decided to
> move towards their own means of "vendor lock-in" (yes, you hear me
> right - that's exactly what I said: all so called open-source companies have
> invented a way to lock-in their customers either with fancy "enterprise
> features" that aren't adding but amending underlying stack; or with custom set
> of patches oftentimes rendering the cluster to become incompatible between
> different vendors).
> 
> By all means, my money are on the second way, yet slightly modified (as
> use-cases are coming from users, not developers):
>  #2 start driving adoption of software stacks for the particular kind of data workloads
> 
> This community has enough day-to-day practitioners on board to
> accumulate a near-complete introspection of where the technology is moving.
> And instead of wobbling in a backwash, let's see if we can be smart and define
> this landscape. After all, Bigtop has adopted Spark well before any of the
> commercials have officially accepted it. We seemingly are moving more and
> more into in-memory realm of data processing: Apache Ignite (Gridgain),
> Tachyon, Spark. I don't know how much legs Hive got in it, but I am doubtful,
> that it can walk for much longer... May be it's just me.
> 
> In this thread http://is.gd/MV2BH9 we already discussed some of the aspects
> influencing the feature of this project. And we are de-facto working on the
> implementation. In my opinion, Hadoop has been more or less commoditized
> already. And it isn't a bad thing, but it means that the innovations are
> elsewhere. E.g. Spark moving is moving beyond its ties with storage layer via
> Tachyon abstraction; GridGain simply doesn't care what's underlying storage
> is. However, data needs to be stored somewhere before it can be processed. And
> HCFS seems to be fitting the bill ok. But, as I said already, I see the real
> action elsewhere. If I were to define the shape of our mid- to long'ish term
> roadmap it'd be something like that:
> 
>            ^   Dashboard/Visualization  ^
>            |     OLTP/ML processing     |
>            |    Caching/Acceleration    |
>            |         Storage            |
> 
> And around this we can add/improve on deployment (R8???),
> virtualization/containers/clouds.  In other words - let's focus on the
> vertical part of the stack, instead of simply supporting the status quo.
> 
> Does Cassandra fits the Storage layer in that model? I don't know and most
> important - I don't care. If there's an interest and manpower to have
> Cassandra-based stack - sure, but perhaps let's do as a separate branch or
> something, so we aren't over-complicating things. As Roman said earlier, in
> this case it'd be great to engage Cassandra/DataStax people into this project.
> But something tells me they won't be eager to jump on board.
> 
> And finally, all this above leads to "how": how we can start reshaping the
> stack into its next incarnation? Perhaps, Ubuntu model might be an answer for
> that, but we have discussed that elsewhere and dropped the idea as it wasn't
> feasible back in the day. Perhaps its time just came?
> 
> Apologies for a long post.
>  Cos
> 
> 
>> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
>> Which other projects depend on BigTop?  How will the questions about the
>> direction of BigTop affect those projects?
>> 
>> On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <ro...@shaposhnik.org>
>> wrote:
>> 
>>> Hi!
>>> 
>>> On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <ja...@gmail.com>
>>> wrote:
>>>> hi bigtop !
>>>> 
>>>> I thought id start a thread a few vaguely related thoughts i have around
>>>> next couple iterations of bigtop.
>>> 
>>> I think in general I see two major ways for something like
>>> Bigtop to evolve:
>>>   #1 remain a 'box of LEGO bricks' with very little opinion on
>>>        how these pieces need to be integrated
>>>   #2 start driving oppinioned use-cases for the particular kind of
>>>        bigdata workloads
>>> 
>>> #1 is sort of what all of the Linux distros have been doing for
>>> the majority of time they existed. #2 is close to what CentOS
>>> is doing with SIGs.
>>> 
>>> Honestly, given the size of our community so far and a total
>>> lack of corporate backing (with a small exception of Cloudera
>>> still paying for our EC2 time) I think #1 is all we can do. I'd
>>> love to be wrong, though.
>>> 
>>>> 1) Hive:  How will bigtop to evolve to support it, now that it is much
>>> more
>>>> than a mapreduce query wrapper?
>>> 
>>> I think Hive will remain a big part of Hadoop workloads for forseeable
>>> future. What I'd love to see more of is rationalizing things like how
>>> HCatalog, etc. need to be deployed.
>>> 
>>>> 2) I wonder wether we should confirm cassandra interoperability of spark
>>> in
>>>> bigtop distros,
>>> 
>>> Only if there's a significant interest from cassandra community and even
>>> then my biggest fear is that with cassandra we're totally changing the
>>> requirements for the underlying storage subsystem (nothing wrong with
>>> that, its just that in Hadoop ecosystem everything assumes very HDFS'ish
>>> requirements for the scale-out storage).
>>> 
>>>> 4) in general, i think bigtop can move in one of 3 directions.
>>>> 
>>>>  EXPAND ? : Expanding to include new components, with just basic
>>> interop,
>>>> and let folks evolve their own stacks on top of bigtop on their own.
>>>> 
>>>>  CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
>>> components,
>>>> with super high quality.
>>>> 
>>>>  STAY THE COURSE ? Staying the same ~ a packaging platform for just
>>>> hadoop's direct ecosystem.
>>>> 
>>>> I am intrigued by the idea of A and B both have clear benefits and
>>> costs...
>>>> would like to see the opinions of folks --- do we  lean in one direction
>>> or
>>>> another? What is the criteria for adding a new feature, package, stack to
>>>> bigtop?
>>>> 
>>>> ... Or maybe im just overthinking it and should be spending this time
>>>> testing spark for 0.9 release....
>>> 
>>> I'd love to know what other think, but for 0.9 I'd rather stay the course.
>>> 
>>> Thanks,
>>> Roman.
>>> 
>>> P.S. There are also market forces at play that may fundamentally change
>>> the focus of what we're all working on in the year or so.
>>>

Re: What will the next generation of bigtop look like?

Posted by Andrew Purtell <ap...@apache.org>.

This is a really great post and I was nodding along with most of it.

My personal view is Bigtop starts as a deployable stack of Apache ecosystem
components for Big Data. Commodification of (Linux) deployable packages and
basic install integration is the baseline.

Bigtop packaging Spark components first is an unfortunately little known
win of this community, but its still a win. Although replicating that
success with choice of the 'next big thing' is going to be a hit or miss
proposition unless one of us can figure out time travel, definitely we can
make some observations and scour and/or influence the Apache project
landscape to pick up coverage in the space:

- Storage is commoditized. Nearly everyone bases the storage stack on HDFS.
Everyone does so with what we'd call HCFS. Best to focus elsewhere.

- Packaging is commoditized. It's a shame that vendors pursue misguided
lock-in strategies but we have no control over that. It's still true that
someone using HDP or CDH 4 can switch to Bigtop and vice versa without
changing package management tools or strategy. As a user of Apache stack
technologies I want long term sustainable package management so will vote
with my feet for the commodity option, and won't be alone. Bigtop should
provide this, and does, and it's mostly a solved problem.

- Deployment is also a "solved" problem but unfortunately everyone solves
it differently. :-) This is an area where Bigtop can provide real value,
and does, with the Puppet scripts, with the containerization work. One
function Bigtop can serve is as repository and example of Hadoop-ish
production tooling.

- YARN is a reasonably generic grid resource manager. We don't have the
resources to stand up an alternate RM and all the tooling necessary with
Mesos, but if Mesosphere made a contribution of that I suspect we'd take
it. From the Bigtop perspective I think computation framework options are
well handled, in that I don't see Bigtop or anyone else developing credible
alternatives to MR and Spark for some time. Not sure there's enough oxygen.
And we have Giraph (and is GraphX packaged with Spark?). To the extent
Spark-on-YARN has rough edges in the Bigtop framework that's an area where
contributors can produce value. Related, support for Hive on Spark, Pig on
Spark (spork).

- The Apache stack includes three streaming computation frameworks - Storm,
Spark Streaming, Samza - but Bigtop has mostly missed the boat here. Spark
streaming is included in the spark package (I think) but how well is it
integrated? Samza is well integrated with YARN but we don't package it.
There's also been Storm-on-YARN work out of Yahoo, not sure about what was
upstreamed or might be available. Anyway, integration of stream computation
frameworks into Bigtop's packaging and deployment/management scripts can
produce value, especially if we provide multiple options, because vendors
are choosing favorites.

- Data access. We do have players differentiating themselves here. Bigtop
provides two SQL options (Hive, Phoenix+HBase), can add a third, I see
someone's proposed Presto packaging. I'm not sure from the Bigtop
perspective we need to pursue additional alternatives, but if there were
contributions, we might well take them. "Enterprise friendly API" (SQL) is
half of the data access picture I think, the other half is access control.
There are competing projects in incubation, Sentry and Ranger, with no
shared purpose, which is a real shame. To the extent that Bigtop adopts a
cross-component full-stack access control technology, or helps bring
another alternative into incubation and adopts that, we can move the needle
in this space. We'd offer a vendor neutral access control option devoid of
lock-in risk, this would be a big deal for big-E enterprises.

- Data management and provenance. Now we're moving up the value chain from
storage and data access to the next layer. This is mostly greenfield / blue
ocean space in the Apache stack. We have interesting options in incubation:
Falcon, Taverna, NiFi. (I think the last one might be truly comprehensive.)
All of these are higher level data management and processing workflows
which include aspects of management and provenance. One or more could be
adopted and refined. There are a lot of relevant integration opportunities
up and down the stack that could be undertaken with shared effort of the
Bigtop, framework, and component communities.

- Machine learning. Moving further up the value chain, we have data and
computation and workflow, now how do we derive the competitive advantage
that all of the lower layer technologies are in place for? The new hotness
is surfacing of insights out of scaled parallel statistical inference.
Unfortunately this space doesn't present itself well to the toolbox
approach. Bigtop provides Mahout and MLLib as part of Spark (right?), they
themselves are toolkits with components of varying utility and maturity
(and relevance). I think Bigtop could provide some value by curating ML
frameworks that tie in with other Apache stack technologies. ML toolkits
leave would-be users in the cold. One has to know what one is doing, and
what to do is highly use case specific, this is why "data scientists" can
command obscene salaries and only commercial vendors have the resources to
focus on specific verticals.

- Visualization and preparation. Moving further up, now we are almost
touching directly the use case. We have data but we need to clean it,
normalize, regularize, filter, slice and dice. Where there are reasonably
generic open source tools, preferably at Apache, for data preparation and
cleaning Bigtop could provide baseline value by packaging it, and
additional value with deeper integration with Apache stack components. Data
preparation is a concern hand in hand with data ingest, so we have an
interesting feedback loop from the top back down to ingest tools/building
blocks like Kafka and Flume. Data cleaning concerns might overlap with the
workflow frameworks too. If there's a friendly licensed open source
graphical front end to the data cleaning/munging/exploration process that
is generic enough that would be a really interesting "acquisition".
- We can also package visualization libraries and toolkits for building
dashboards. Like with ML algorithms, a complete integration is probably out
of scope because every instance would be use case and user specific.



On Mon, Dec 8, 2014 at 12:23 PM, Konstantin Boudnik <co...@apache.org> wrote:

> First I want to address the RJ's question:
>
> The most prominent downstream Bigtop Dependency would be any commercial
> Hadoop distribution like HDP and CDH. The former is trying to
> disguise their affiliation by pushing Ambari forward, and Cloudera's
> seemingly
> shifting her focus to compressed tarballs media (aka parcels) which
> requires
> a closed-source solutions like Cloudera Manager to deploy and control your
> cluster, effectively rendering it useless if you ever decide to uninstall
> the
> control software. In the interest of full disclosure, I don't think parcels
> have any chance to landslide the consensus in the industry from Linux
> packaging towards something so obscure and proprietary as parcels are.
>
>
> And now to my actual points....:
>
> I do strongly believe the Bigtop was and is the only completely
> transparent,
> vendors' friendly, and 100% sticking to official ASF product releases way
> of
> building your stack from ground up, deploying and controlling it anyway you
> want to. I agree with Roman's presentation on how this project can move
> forward. However, I somewhat disagree with his view on the perspectives. It
> might be a hard road to drive the opinion of the community.  But, it is a
> high
> road.
>
> We are definitely small and mostly unsupported by commercial groups that
> are
> using the framework. Being a box of LEGO won't win us anything. If
> anything,
> the empirical evidences are against it as commercial distros have decided
> to
> move towards their own means of "vendor lock-in" (yes, you hear me
> right - that's exactly what I said: all so called open-source companies
> have
> invented a way to lock-in their customers either with fancy "enterprise
> features" that aren't adding but amending underlying stack; or with custom
> set
> of patches oftentimes rendering the cluster to become incompatible between
> different vendors).
>
> By all means, my money are on the second way, yet slightly modified (as
> use-cases are coming from users, not developers):
>   #2 start driving adoption of software stacks for the particular kind of
> data workloads
>
> This community has enough day-to-day practitioners on board to
> accumulate a near-complete introspection of where the technology is moving.
> And instead of wobbling in a backwash, let's see if we can be smart and
> define
> this landscape. After all, Bigtop has adopted Spark well before any of the
> commercials have officially accepted it. We seemingly are moving more and
> more into in-memory realm of data processing: Apache Ignite (Gridgain),
> Tachyon, Spark. I don't know how much legs Hive got in it, but I am
> doubtful,
> that it can walk for much longer... May be it's just me.
>
> In this thread http://is.gd/MV2BH9 we already discussed some of the
> aspects
> influencing the feature of this project. And we are de-facto working on the
> implementation. In my opinion, Hadoop has been more or less commoditized
> already. And it isn't a bad thing, but it means that the innovations are
> elsewhere. E.g. Spark moving is moving beyond its ties with storage layer
> via
> Tachyon abstraction; GridGain simply doesn't care what's underlying storage
> is. However, data needs to be stored somewhere before it can be processed.
> And
> HCFS seems to be fitting the bill ok. But, as I said already, I see the
> real
> action elsewhere. If I were to define the shape of our mid- to long'ish
> term
> roadmap it'd be something like that:
>
>             ^   Dashboard/Visualization  ^
>             |     OLTP/ML processing     |
>             |    Caching/Acceleration    |
>             |         Storage            |
>
> And around this we can add/improve on deployment (R8???),
> virtualization/containers/clouds.  In other words - let's focus on the
> vertical part of the stack, instead of simply supporting the status quo.
>
> Does Cassandra fits the Storage layer in that model? I don't know and most
> important - I don't care. If there's an interest and manpower to have
> Cassandra-based stack - sure, but perhaps let's do as a separate branch or
> something, so we aren't over-complicating things. As Roman said earlier, in
> this case it'd be great to engage Cassandra/DataStax people into this
> project.
> But something tells me they won't be eager to jump on board.
>
> And finally, all this above leads to "how": how we can start reshaping the
> stack into its next incarnation? Perhaps, Ubuntu model might be an answer
> for
> that, but we have discussed that elsewhere and dropped the idea as it
> wasn't
> feasible back in the day. Perhaps its time just came?
>
> Apologies for a long post.
>   Cos
>
>
> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
> > Which other projects depend on BigTop?  How will the questions about the
> > direction of BigTop affect those projects?
> >
> > On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <ro...@shaposhnik.org>
> > wrote:
> >
> > > Hi!
> > >
> > > On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <ja...@gmail.com>
> > > wrote:
> > > > hi bigtop !
> > > >
> > > > I thought id start a thread a few vaguely related thoughts i have
> around
> > > > next couple iterations of bigtop.
> > >
> > > I think in general I see two major ways for something like
> > > Bigtop to evolve:
> > >    #1 remain a 'box of LEGO bricks' with very little opinion on
> > >         how these pieces need to be integrated
> > >    #2 start driving oppinioned use-cases for the particular kind of
> > >         bigdata workloads
> > >
> > > #1 is sort of what all of the Linux distros have been doing for
> > > the majority of time they existed. #2 is close to what CentOS
> > > is doing with SIGs.
> > >
> > > Honestly, given the size of our community so far and a total
> > > lack of corporate backing (with a small exception of Cloudera
> > > still paying for our EC2 time) I think #1 is all we can do. I'd
> > > love to be wrong, though.
> > >
> > > > 1) Hive:  How will bigtop to evolve to support it, now that it is
> much
> > > more
> > > > than a mapreduce query wrapper?
> > >
> > > I think Hive will remain a big part of Hadoop workloads for forseeable
> > > future. What I'd love to see more of is rationalizing things like how
> > > HCatalog, etc. need to be deployed.
> > >
> > > > 2) I wonder wether we should confirm cassandra interoperability of
> spark
> > > in
> > > > bigtop distros,
> > >
> > > Only if there's a significant interest from cassandra community and
> even
> > > then my biggest fear is that with cassandra we're totally changing the
> > > requirements for the underlying storage subsystem (nothing wrong with
> > > that, its just that in Hadoop ecosystem everything assumes very
> HDFS'ish
> > > requirements for the scale-out storage).
> > >
> > > > 4) in general, i think bigtop can move in one of 3 directions.
> > > >
> > > >   EXPAND ? : Expanding to include new components, with just basic
> > > interop,
> > > > and let folks evolve their own stacks on top of bigtop on their own.
> > > >
> > > >   CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
> > > components,
> > > > with super high quality.
> > > >
> > > >   STAY THE COURSE ? Staying the same ~ a packaging platform for just
> > > > hadoop's direct ecosystem.
> > > >
> > > > I am intrigued by the idea of A and B both have clear benefits and
> > > costs...
> > > > would like to see the opinions of folks --- do we  lean in one
> direction
> > > or
> > > > another? What is the criteria for adding a new feature, package,
> stack to
> > > > bigtop?
> > > >
> > > > ... Or maybe im just overthinking it and should be spending this time
> > > > testing spark for 0.9 release....
> > >
> > > I'd love to know what other think, but for 0.9 I'd rather stay the
> course.
> > >
> > > Thanks,
> > > Roman.
> > >
> > > P.S. There are also market forces at play that may fundamentally change
> > > the focus of what we're all working on in the year or so.
> > >
>



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: What will the next generation of bigtop look like?

Posted by Jay Vyas <ja...@gmail.com>.

"Let's see if we can be smart and define the landscape"

Well put @cos...I think Romans point was that it would be hard, not that it would be bad. And I think you're both right : it's hard? Yes. But worthwhile... Possibly? Next step we will all have to get in a room and think about this face to face.

Let's shoot for a meetup after january in California... Where we can plan the future direction of bigtop.  In the meanwhile hope to hear more opinions on this.


> On Dec 8, 2014, at 3:23 PM, Konstantin Boudnik <co...@apache.org> wrote:
> 
> First I want to address the RJ's question:
> 
> The most prominent downstream Bigtop Dependency would be any commercial
> Hadoop distribution like HDP and CDH. The former is trying to
> disguise their affiliation by pushing Ambari forward, and Cloudera's seemingly
> shifting her focus to compressed tarballs media (aka parcels) which requires
> a closed-source solutions like Cloudera Manager to deploy and control your
> cluster, effectively rendering it useless if you ever decide to uninstall the
> control software. In the interest of full disclosure, I don't think parcels
> have any chance to landslide the consensus in the industry from Linux
> packaging towards something so obscure and proprietary as parcels are.
> 
> 
> And now to my actual points....:
> 
> I do strongly believe the Bigtop was and is the only completely transparent,
> vendors' friendly, and 100% sticking to official ASF product releases way of
> building your stack from ground up, deploying and controlling it anyway you
> want to. I agree with Roman's presentation on how this project can move
> forward. However, I somewhat disagree with his view on the perspectives. It
> might be a hard road to drive the opinion of the community.  But, it is a high
> road.
> 
> We are definitely small and mostly unsupported by commercial groups that are
> using the framework. Being a box of LEGO won't win us anything. If anything,
> the empirical evidences are against it as commercial distros have decided to
> move towards their own means of "vendor lock-in" (yes, you hear me
> right - that's exactly what I said: all so called open-source companies have
> invented a way to lock-in their customers either with fancy "enterprise
> features" that aren't adding but amending underlying stack; or with custom set
> of patches oftentimes rendering the cluster to become incompatible between
> different vendors).
> 
> By all means, my money are on the second way, yet slightly modified (as
> use-cases are coming from users, not developers):
>  #2 start driving adoption of software stacks for the particular kind of data workloads
> 
> This community has enough day-to-day practitioners on board to
> accumulate a near-complete introspection of where the technology is moving.
> And instead of wobbling in a backwash, let's see if we can be smart and define
> this landscape. After all, Bigtop has adopted Spark well before any of the
> commercials have officially accepted it. We seemingly are moving more and
> more into in-memory realm of data processing: Apache Ignite (Gridgain),
> Tachyon, Spark. I don't know how much legs Hive got in it, but I am doubtful,
> that it can walk for much longer... May be it's just me.
> 
> In this thread http://is.gd/MV2BH9 we already discussed some of the aspects
> influencing the feature of this project. And we are de-facto working on the
> implementation. In my opinion, Hadoop has been more or less commoditized
> already. And it isn't a bad thing, but it means that the innovations are
> elsewhere. E.g. Spark moving is moving beyond its ties with storage layer via
> Tachyon abstraction; GridGain simply doesn't care what's underlying storage
> is. However, data needs to be stored somewhere before it can be processed. And
> HCFS seems to be fitting the bill ok. But, as I said already, I see the real
> action elsewhere. If I were to define the shape of our mid- to long'ish term
> roadmap it'd be something like that:
> 
>            ^   Dashboard/Visualization  ^
>            |     OLTP/ML processing     |
>            |    Caching/Acceleration    |
>            |         Storage            |
> 
> And around this we can add/improve on deployment (R8???),
> virtualization/containers/clouds.  In other words - let's focus on the
> vertical part of the stack, instead of simply supporting the status quo.
> 
> Does Cassandra fits the Storage layer in that model? I don't know and most
> important - I don't care. If there's an interest and manpower to have
> Cassandra-based stack - sure, but perhaps let's do as a separate branch or
> something, so we aren't over-complicating things. As Roman said earlier, in
> this case it'd be great to engage Cassandra/DataStax people into this project.
> But something tells me they won't be eager to jump on board.
> 
> And finally, all this above leads to "how": how we can start reshaping the
> stack into its next incarnation? Perhaps, Ubuntu model might be an answer for
> that, but we have discussed that elsewhere and dropped the idea as it wasn't
> feasible back in the day. Perhaps its time just came?
> 
> Apologies for a long post.
>  Cos
> 
> 
>> On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
>> Which other projects depend on BigTop?  How will the questions about the
>> direction of BigTop affect those projects?
>> 
>> On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <ro...@shaposhnik.org>
>> wrote:
>> 
>>> Hi!
>>> 
>>> On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <ja...@gmail.com>
>>> wrote:
>>>> hi bigtop !
>>>> 
>>>> I thought id start a thread a few vaguely related thoughts i have around
>>>> next couple iterations of bigtop.
>>> 
>>> I think in general I see two major ways for something like
>>> Bigtop to evolve:
>>>   #1 remain a 'box of LEGO bricks' with very little opinion on
>>>        how these pieces need to be integrated
>>>   #2 start driving oppinioned use-cases for the particular kind of
>>>        bigdata workloads
>>> 
>>> #1 is sort of what all of the Linux distros have been doing for
>>> the majority of time they existed. #2 is close to what CentOS
>>> is doing with SIGs.
>>> 
>>> Honestly, given the size of our community so far and a total
>>> lack of corporate backing (with a small exception of Cloudera
>>> still paying for our EC2 time) I think #1 is all we can do. I'd
>>> love to be wrong, though.
>>> 
>>>> 1) Hive:  How will bigtop to evolve to support it, now that it is much
>>> more
>>>> than a mapreduce query wrapper?
>>> 
>>> I think Hive will remain a big part of Hadoop workloads for forseeable
>>> future. What I'd love to see more of is rationalizing things like how
>>> HCatalog, etc. need to be deployed.
>>> 
>>>> 2) I wonder wether we should confirm cassandra interoperability of spark
>>> in
>>>> bigtop distros,
>>> 
>>> Only if there's a significant interest from cassandra community and even
>>> then my biggest fear is that with cassandra we're totally changing the
>>> requirements for the underlying storage subsystem (nothing wrong with
>>> that, its just that in Hadoop ecosystem everything assumes very HDFS'ish
>>> requirements for the scale-out storage).
>>> 
>>>> 4) in general, i think bigtop can move in one of 3 directions.
>>>> 
>>>>  EXPAND ? : Expanding to include new components, with just basic
>>> interop,
>>>> and let folks evolve their own stacks on top of bigtop on their own.
>>>> 
>>>>  CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
>>> components,
>>>> with super high quality.
>>>> 
>>>>  STAY THE COURSE ? Staying the same ~ a packaging platform for just
>>>> hadoop's direct ecosystem.
>>>> 
>>>> I am intrigued by the idea of A and B both have clear benefits and
>>> costs...
>>>> would like to see the opinions of folks --- do we  lean in one direction
>>> or
>>>> another? What is the criteria for adding a new feature, package, stack to
>>>> bigtop?
>>>> 
>>>> ... Or maybe im just overthinking it and should be spending this time
>>>> testing spark for 0.9 release....
>>> 
>>> I'd love to know what other think, but for 0.9 I'd rather stay the course.
>>> 
>>> Thanks,
>>> Roman.
>>> 
>>> P.S. There are also market forces at play that may fundamentally change
>>> the focus of what we're all working on in the year or so.
>>>

Re: What will the next generation of bigtop look like?

Posted by Konstantin Boudnik <co...@apache.org>.

First I want to address the RJ's question:

The most prominent downstream Bigtop Dependency would be any commercial
Hadoop distribution like HDP and CDH. The former is trying to
disguise their affiliation by pushing Ambari forward, and Cloudera's seemingly
shifting her focus to compressed tarballs media (aka parcels) which requires
a closed-source solutions like Cloudera Manager to deploy and control your
cluster, effectively rendering it useless if you ever decide to uninstall the
control software. In the interest of full disclosure, I don't think parcels
have any chance to landslide the consensus in the industry from Linux
packaging towards something so obscure and proprietary as parcels are.

And now to my actual points....:

I do strongly believe the Bigtop was and is the only completely transparent,
vendors' friendly, and 100% sticking to official ASF product releases way of
building your stack from ground up, deploying and controlling it anyway you
want to. I agree with Roman's presentation on how this project can move
forward. However, I somewhat disagree with his view on the perspectives. It
might be a hard road to drive the opinion of the community.  But, it is a high
road.

We are definitely small and mostly unsupported by commercial groups that are
using the framework. Being a box of LEGO won't win us anything. If anything,
the empirical evidences are against it as commercial distros have decided to
move towards their own means of "vendor lock-in" (yes, you hear me
right - that's exactly what I said: all so called open-source companies have
invented a way to lock-in their customers either with fancy "enterprise
features" that aren't adding but amending underlying stack; or with custom set
of patches oftentimes rendering the cluster to become incompatible between
different vendors).

By all means, my money are on the second way, yet slightly modified (as
use-cases are coming from users, not developers):
  #2 start driving adoption of software stacks for the particular kind of data workloads

This community has enough day-to-day practitioners on board to
accumulate a near-complete introspection of where the technology is moving.
And instead of wobbling in a backwash, let's see if we can be smart and define
this landscape. After all, Bigtop has adopted Spark well before any of the
commercials have officially accepted it. We seemingly are moving more and
more into in-memory realm of data processing: Apache Ignite (Gridgain),
Tachyon, Spark. I don't know how much legs Hive got in it, but I am doubtful,
that it can walk for much longer... May be it's just me.

In this thread http://is.gd/MV2BH9 we already discussed some of the aspects
influencing the feature of this project. And we are de-facto working on the
implementation. In my opinion, Hadoop has been more or less commoditized
already. And it isn't a bad thing, but it means that the innovations are
elsewhere. E.g. Spark moving is moving beyond its ties with storage layer via
Tachyon abstraction; GridGain simply doesn't care what's underlying storage
is. However, data needs to be stored somewhere before it can be processed. And
HCFS seems to be fitting the bill ok. But, as I said already, I see the real
action elsewhere. If I were to define the shape of our mid- to long'ish term
roadmap it'd be something like that:

            ^   Dashboard/Visualization  ^
            |     OLTP/ML processing     |
            |    Caching/Acceleration    |
            |         Storage            |

And around this we can add/improve on deployment (R8???),
virtualization/containers/clouds.  In other words - let's focus on the
vertical part of the stack, instead of simply supporting the status quo.

Does Cassandra fits the Storage layer in that model? I don't know and most
important - I don't care. If there's an interest and manpower to have
Cassandra-based stack - sure, but perhaps let's do as a separate branch or
something, so we aren't over-complicating things. As Roman said earlier, in
this case it'd be great to engage Cassandra/DataStax people into this project.
But something tells me they won't be eager to jump on board.

And finally, all this above leads to "how": how we can start reshaping the
stack into its next incarnation? Perhaps, Ubuntu model might be an answer for
that, but we have discussed that elsewhere and dropped the idea as it wasn't
feasible back in the day. Perhaps its time just came?

Apologies for a long post.
  Cos

On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
> Which other projects depend on BigTop?  How will the questions about the
> direction of BigTop affect those projects?
> 
> On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <ro...@shaposhnik.org>
> wrote:
> 
> > Hi!
> >
> > On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <ja...@gmail.com>
> > wrote:
> > > hi bigtop !
> > >
> > > I thought id start a thread a few vaguely related thoughts i have around
> > > next couple iterations of bigtop.
> >
> > I think in general I see two major ways for something like
> > Bigtop to evolve:
> >    #1 remain a 'box of LEGO bricks' with very little opinion on
> >         how these pieces need to be integrated
> >    #2 start driving oppinioned use-cases for the particular kind of
> >         bigdata workloads
> >
> > #1 is sort of what all of the Linux distros have been doing for
> > the majority of time they existed. #2 is close to what CentOS
> > is doing with SIGs.
> >
> > Honestly, given the size of our community so far and a total
> > lack of corporate backing (with a small exception of Cloudera
> > still paying for our EC2 time) I think #1 is all we can do. I'd
> > love to be wrong, though.
> >
> > > 1) Hive:  How will bigtop to evolve to support it, now that it is much
> > more
> > > than a mapreduce query wrapper?
> >
> > I think Hive will remain a big part of Hadoop workloads for forseeable
> > future. What I'd love to see more of is rationalizing things like how
> > HCatalog, etc. need to be deployed.
> >
> > > 2) I wonder wether we should confirm cassandra interoperability of spark
> > in
> > > bigtop distros,
> >
> > Only if there's a significant interest from cassandra community and even
> > then my biggest fear is that with cassandra we're totally changing the
> > requirements for the underlying storage subsystem (nothing wrong with
> > that, its just that in Hadoop ecosystem everything assumes very HDFS'ish
> > requirements for the scale-out storage).
> >
> > > 4) in general, i think bigtop can move in one of 3 directions.
> > >
> > >   EXPAND ? : Expanding to include new components, with just basic
> > interop,
> > > and let folks evolve their own stacks on top of bigtop on their own.
> > >
> > >   CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
> > components,
> > > with super high quality.
> > >
> > >   STAY THE COURSE ? Staying the same ~ a packaging platform for just
> > > hadoop's direct ecosystem.
> > >
> > > I am intrigued by the idea of A and B both have clear benefits and
> > costs...
> > > would like to see the opinions of folks --- do we  lean in one direction
> > or
> > > another? What is the criteria for adding a new feature, package, stack to
> > > bigtop?
> > >
> > > ... Or maybe im just overthinking it and should be spending this time
> > > testing spark for 0.9 release....
> >
> > I'd love to know what other think, but for 0.9 I'd rather stay the course.
> >
> > Thanks,
> > Roman.
> >
> > P.S. There are also market forces at play that may fundamentally change
> > the focus of what we're all working on in the year or so.
> >

Re: What will the next generation of bigtop look like?

Posted by Konstantin Boudnik <co...@apache.org>.

First I want to address the RJ's question:

The most prominent downstream Bigtop Dependency would be any commercial
Hadoop distribution like HDP and CDH. The former is trying to
disguise their affiliation by pushing Ambari forward, and Cloudera's seemingly
shifting her focus to compressed tarballs media (aka parcels) which requires
a closed-source solutions like Cloudera Manager to deploy and control your
cluster, effectively rendering it useless if you ever decide to uninstall the
control software. In the interest of full disclosure, I don't think parcels
have any chance to landslide the consensus in the industry from Linux
packaging towards something so obscure and proprietary as parcels are.

And now to my actual points....:

I do strongly believe the Bigtop was and is the only completely transparent,
vendors' friendly, and 100% sticking to official ASF product releases way of
building your stack from ground up, deploying and controlling it anyway you
want to. I agree with Roman's presentation on how this project can move
forward. However, I somewhat disagree with his view on the perspectives. It
might be a hard road to drive the opinion of the community.  But, it is a high
road.

We are definitely small and mostly unsupported by commercial groups that are
using the framework. Being a box of LEGO won't win us anything. If anything,
the empirical evidences are against it as commercial distros have decided to
move towards their own means of "vendor lock-in" (yes, you hear me
right - that's exactly what I said: all so called open-source companies have
invented a way to lock-in their customers either with fancy "enterprise
features" that aren't adding but amending underlying stack; or with custom set
of patches oftentimes rendering the cluster to become incompatible between
different vendors).

By all means, my money are on the second way, yet slightly modified (as
use-cases are coming from users, not developers):
  #2 start driving adoption of software stacks for the particular kind of data workloads

This community has enough day-to-day practitioners on board to
accumulate a near-complete introspection of where the technology is moving.
And instead of wobbling in a backwash, let's see if we can be smart and define
this landscape. After all, Bigtop has adopted Spark well before any of the
commercials have officially accepted it. We seemingly are moving more and
more into in-memory realm of data processing: Apache Ignite (Gridgain),
Tachyon, Spark. I don't know how much legs Hive got in it, but I am doubtful,
that it can walk for much longer... May be it's just me.

In this thread http://is.gd/MV2BH9 we already discussed some of the aspects
influencing the feature of this project. And we are de-facto working on the
implementation. In my opinion, Hadoop has been more or less commoditized
already. And it isn't a bad thing, but it means that the innovations are
elsewhere. E.g. Spark moving is moving beyond its ties with storage layer via
Tachyon abstraction; GridGain simply doesn't care what's underlying storage
is. However, data needs to be stored somewhere before it can be processed. And
HCFS seems to be fitting the bill ok. But, as I said already, I see the real
action elsewhere. If I were to define the shape of our mid- to long'ish term
roadmap it'd be something like that:

            ^   Dashboard/Visualization  ^
            |     OLTP/ML processing     |
            |    Caching/Acceleration    |
            |         Storage            |

And around this we can add/improve on deployment (R8???),
virtualization/containers/clouds.  In other words - let's focus on the
vertical part of the stack, instead of simply supporting the status quo.

Does Cassandra fits the Storage layer in that model? I don't know and most
important - I don't care. If there's an interest and manpower to have
Cassandra-based stack - sure, but perhaps let's do as a separate branch or
something, so we aren't over-complicating things. As Roman said earlier, in
this case it'd be great to engage Cassandra/DataStax people into this project.
But something tells me they won't be eager to jump on board.

And finally, all this above leads to "how": how we can start reshaping the
stack into its next incarnation? Perhaps, Ubuntu model might be an answer for
that, but we have discussed that elsewhere and dropped the idea as it wasn't
feasible back in the day. Perhaps its time just came?

Apologies for a long post.
  Cos

On Sun, Dec 07, 2014 at 07:04PM, RJ Nowling wrote:
> Which other projects depend on BigTop?  How will the questions about the
> direction of BigTop affect those projects?
> 
> On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <ro...@shaposhnik.org>
> wrote:
> 
> > Hi!
> >
> > On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <ja...@gmail.com>
> > wrote:
> > > hi bigtop !
> > >
> > > I thought id start a thread a few vaguely related thoughts i have around
> > > next couple iterations of bigtop.
> >
> > I think in general I see two major ways for something like
> > Bigtop to evolve:
> >    #1 remain a 'box of LEGO bricks' with very little opinion on
> >         how these pieces need to be integrated
> >    #2 start driving oppinioned use-cases for the particular kind of
> >         bigdata workloads
> >
> > #1 is sort of what all of the Linux distros have been doing for
> > the majority of time they existed. #2 is close to what CentOS
> > is doing with SIGs.
> >
> > Honestly, given the size of our community so far and a total
> > lack of corporate backing (with a small exception of Cloudera
> > still paying for our EC2 time) I think #1 is all we can do. I'd
> > love to be wrong, though.
> >
> > > 1) Hive:  How will bigtop to evolve to support it, now that it is much
> > more
> > > than a mapreduce query wrapper?
> >
> > I think Hive will remain a big part of Hadoop workloads for forseeable
> > future. What I'd love to see more of is rationalizing things like how
> > HCatalog, etc. need to be deployed.
> >
> > > 2) I wonder wether we should confirm cassandra interoperability of spark
> > in
> > > bigtop distros,
> >
> > Only if there's a significant interest from cassandra community and even
> > then my biggest fear is that with cassandra we're totally changing the
> > requirements for the underlying storage subsystem (nothing wrong with
> > that, its just that in Hadoop ecosystem everything assumes very HDFS'ish
> > requirements for the scale-out storage).
> >
> > > 4) in general, i think bigtop can move in one of 3 directions.
> > >
> > >   EXPAND ? : Expanding to include new components, with just basic
> > interop,
> > > and let folks evolve their own stacks on top of bigtop on their own.
> > >
> > >   CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
> > components,
> > > with super high quality.
> > >
> > >   STAY THE COURSE ? Staying the same ~ a packaging platform for just
> > > hadoop's direct ecosystem.
> > >
> > > I am intrigued by the idea of A and B both have clear benefits and
> > costs...
> > > would like to see the opinions of folks --- do we  lean in one direction
> > or
> > > another? What is the criteria for adding a new feature, package, stack to
> > > bigtop?
> > >
> > > ... Or maybe im just overthinking it and should be spending this time
> > > testing spark for 0.9 release....
> >
> > I'd love to know what other think, but for 0.9 I'd rather stay the course.
> >
> > Thanks,
> > Roman.
> >
> > P.S. There are also market forces at play that may fundamentally change
> > the focus of what we're all working on in the year or so.
> >

Re: What will the next generation of bigtop look like?

Posted by RJ Nowling <rn...@gmail.com>.

Which other projects depend on BigTop?  How will the questions about the
direction of BigTop affect those projects?

On Sun, Dec 7, 2014 at 6:10 PM, Roman Shaposhnik <ro...@shaposhnik.org>
wrote:

> Hi!
>
> On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <ja...@gmail.com>
> wrote:
> > hi bigtop !
> >
> > I thought id start a thread a few vaguely related thoughts i have around
> > next couple iterations of bigtop.
>
> I think in general I see two major ways for something like
> Bigtop to evolve:
>    #1 remain a 'box of LEGO bricks' with very little opinion on
>         how these pieces need to be integrated
>    #2 start driving oppinioned use-cases for the particular kind of
>         bigdata workloads
>
> #1 is sort of what all of the Linux distros have been doing for
> the majority of time they existed. #2 is close to what CentOS
> is doing with SIGs.
>
> Honestly, given the size of our community so far and a total
> lack of corporate backing (with a small exception of Cloudera
> still paying for our EC2 time) I think #1 is all we can do. I'd
> love to be wrong, though.
>
> > 1) Hive:  How will bigtop to evolve to support it, now that it is much
> more
> > than a mapreduce query wrapper?
>
> I think Hive will remain a big part of Hadoop workloads for forseeable
> future. What I'd love to see more of is rationalizing things like how
> HCatalog, etc. need to be deployed.
>
> > 2) I wonder wether we should confirm cassandra interoperability of spark
> in
> > bigtop distros,
>
> Only if there's a significant interest from cassandra community and even
> then my biggest fear is that with cassandra we're totally changing the
> requirements for the underlying storage subsystem (nothing wrong with
> that, its just that in Hadoop ecosystem everything assumes very HDFS'ish
> requirements for the scale-out storage).
>
> > 4) in general, i think bigtop can move in one of 3 directions.
> >
> >   EXPAND ? : Expanding to include new components, with just basic
> interop,
> > and let folks evolve their own stacks on top of bigtop on their own.
> >
> >   CONTRACT+FOCUS ?  Contracting to focus on a lean set of core
> components,
> > with super high quality.
> >
> >   STAY THE COURSE ? Staying the same ~ a packaging platform for just
> > hadoop's direct ecosystem.
> >
> > I am intrigued by the idea of A and B both have clear benefits and
> costs...
> > would like to see the opinions of folks --- do we  lean in one direction
> or
> > another? What is the criteria for adding a new feature, package, stack to
> > bigtop?
> >
> > ... Or maybe im just overthinking it and should be spending this time
> > testing spark for 0.9 release....
>
> I'd love to know what other think, but for 0.9 I'd rather stay the course.
>
> Thanks,
> Roman.
>
> P.S. There are also market forces at play that may fundamentally change
> the focus of what we're all working on in the year or so.
>

Re: What will the next generation of bigtop look like?

Posted by Roman Shaposhnik <ro...@shaposhnik.org>.

Hi!

On Sat, Dec 6, 2014 at 3:23 PM, jay vyas <ja...@gmail.com> wrote:
> hi bigtop !
>
> I thought id start a thread a few vaguely related thoughts i have around
> next couple iterations of bigtop.

I think in general I see two major ways for something like
Bigtop to evolve:
   #1 remain a 'box of LEGO bricks' with very little opinion on
        how these pieces need to be integrated
   #2 start driving oppinioned use-cases for the particular kind of
        bigdata workloads

#1 is sort of what all of the Linux distros have been doing for
the majority of time they existed. #2 is close to what CentOS
is doing with SIGs.

Honestly, given the size of our community so far and a total
lack of corporate backing (with a small exception of Cloudera
still paying for our EC2 time) I think #1 is all we can do. I'd
love to be wrong, though.

> 1) Hive:  How will bigtop to evolve to support it, now that it is much more
> than a mapreduce query wrapper?

I think Hive will remain a big part of Hadoop workloads for forseeable
future. What I'd love to see more of is rationalizing things like how
HCatalog, etc. need to be deployed.

> 2) I wonder wether we should confirm cassandra interoperability of spark in
> bigtop distros,

Only if there's a significant interest from cassandra community and even
then my biggest fear is that with cassandra we're totally changing the
requirements for the underlying storage subsystem (nothing wrong with
that, its just that in Hadoop ecosystem everything assumes very HDFS'ish
requirements for the scale-out storage).

> 4) in general, i think bigtop can move in one of 3 directions.
>
>   EXPAND ? : Expanding to include new components, with just basic interop,
> and let folks evolve their own stacks on top of bigtop on their own.
>
>   CONTRACT+FOCUS ?  Contracting to focus on a lean set of core components,
> with super high quality.
>
>   STAY THE COURSE ? Staying the same ~ a packaging platform for just
> hadoop's direct ecosystem.
>
> I am intrigued by the idea of A and B both have clear benefits and costs...
> would like to see the opinions of folks --- do we  lean in one direction or
> another? What is the criteria for adding a new feature, package, stack to
> bigtop?
>
> ... Or maybe im just overthinking it and should be spending this time
> testing spark for 0.9 release....

I'd love to know what other think, but for 0.9 I'd rather stay the course.

Thanks,
Roman.

P.S. There are also market forces at play that may fundamentally change
the focus of what we're all working on in the year or so.