Posted to general@hadoop.apache.org by Arun C Murthy <ac...@hortonworks.com> on 2013/03/01 19:58:39 UTC

Re: [DISCUSS] stabilizing Hadoop releases wrt. downstream

I feel this is being blown out of proportion.

Integration is high on the list of *every* release. In future, if anyone or Bigtop wants to help, running integration tests on a Hadoop RC and providing feedback would be very welcome. I'm pretty sure I will stop an RC and re-spin it if it breaks Oozie or HBase or Pig or Hive. For example, see the recent efforts to do a 2.0.4-alpha.

With hadoop-2.0.3-alpha we discovered 3 *bugs* - making it sound like we intentionally disregard integration issues is very harsh.

Please also see the other thread where we discussed stabilizing APIs, protocols etc. for the next 'beta' release.

Arun

On Feb 26, 2013, at 5:43 PM, Roman Shaposhnik wrote:

> Hi!
> 
> For the past couple of releases of the Hadoop 2.X code line,
> integration between Hadoop and its downstream projects has
> become quite a thorny issue. The poster child here is Oozie, where
> every release of Hadoop 2.X seems to break compatibility
> in various unpredictable ways. At times other components (such
> as HBase, for example) also seem to be affected.
> 
> Now, to be extremely clear -- I'm NOT talking about the *latest* version
> of Oozie working with the *latest* version of Hadoop, instead
> my observations come from running previous *stable*  releases
> of Bigtop on top of Hadoop 2.X RCs.
> 
> As many of you know Apache Bigtop aims at providing a single
> platform for integration of Hadoop and Hadoop ecosystem projects.
> As such we're uniquely positioned to track compatibility between
> different Hadoop releases with regards to the downstream components
> (things like Oozie, Pig, Hive, Mahout, etc.). With every single RC
> we've been pretty diligent at trying to provide integration-level feedback
> on the quality of the upcoming release, but it seems that our efforts
> don't quite suffice to stabilize Hadoop 2.X.
> 
> Of course, one could argue that while the Hadoop 2.X code line was
> designated 'alpha', perfect integration and compatibility
> was NOT what the Hadoop community was
> focusing on. I can appreciate that view, but what I'm interested in
> is the future of Hadoop 2.X, not its past. Hence, here's my question
> to all of you as the Hadoop community at large:
> 
> Do you guys think that the project has reached a point where integration
> and compatibility issues should be prioritized really high on the list
> of things that make or break each future release?
> 
> The good news is that Bigtop's charter is in large part *exactly* about
> providing you with this kind of feedback. We can easily tell you when
> Hadoop behavior, with regard to downstream components, changes
> between a previous stable release and the new RC (or even branch/trunk).
> What we can NOT do is submit patches for all the issues. We are simply
> too small a project and we need your help with that.
> 
> I truly believe that we owe it to the downstream projects, and in the
> second half of this email I will try to convince you of that.
> 
> We all know that integration projects are impossible to pull off
> unless there's a general consensus between all of the projects involved
> that they indeed need to work with each other. You can NOT force
> that notion, but you can always try to influence. This relationship
> goes both ways.
> 
> Consider a question in front of the downstream communities
> of  whether or not to adopt Hadoop 2.X as the basis. To answer
> that question each downstream project has to be reasonably
> sure that their concerns will NOT fall on deaf ears and that
> Hadoop developers are, essentially, 'ready' for them to pick
> up Hadoop 2.X. I would argue that so far the Hadoop community
> has gone out of its way to signal that the 2.X codeline is NOT
> ready for the downstream.
> 
> I would argue that moving forward this is a really unfortunate
> situation that may end up undermining the long term success
> of Hadoop 2.X if we don't start addressing the problem. Think
> about it -- 90% of unit tests that run downstream on Apache
> infrastructure are still exercising Hadoop 1.X underneath.
> In fact, if you were to forcefully make, let's say, HBase's
> unit tests run on top of Hadoop 2.X, quite a few of them
> are going to fail. The Hadoop community is, in effect, cutting
> itself off from the biggest source of feedback -- its downstream
> users. This in turn:
> 
>   * leaves the Hadoop project in a perpetual state of broken
>     windows syndrome.
> 
>   * leaves Apache Hadoop 2.X releases in a state considerably
>     inferior to the releases *including* Apache Hadoop done by the
>     vendors. The users have no choice but to align themselves
>     with vendor offerings if they wish to utilize the latest Hadoop functionality.
>     The artifact that is known as Apache Hadoop 2.X has stopped being
>     a viable choice, thus fracturing the user community and reducing
>     the benefits of a commonly deployed codebase.
> 
>   * leaves downstream projects of Hadoop in a jaded state where
>     they legitimately get very discouraged and frustrated and eventually
>     give up, thinking that -- well, we work with one release of Hadoop
>     (the stable one, Hadoop 1.X) and we shall wait for the Hadoop
>     community to get their act together.
> 
> In my view (shared by quite a few members of the Apache Bigtop community) we
> can definitely do better than this if we all agree that the proposed
> first 'beta' release of Hadoop 2.0.4 is the right time for it to happen.
> 
> It is about time the Hadoop 2.X community won back all those end users
> and downstream projects that got left behind during the alpha
> stabilization phase.
> 
> Thanks,
> Roman.

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/



Re: [DISCUSS] stabilizing Hadoop releases wrt. downstream

Posted by Ted Yu <yu...@gmail.com>.
Thanks Bobby.

HBase trunk can build against the 2.0 SNAPSHOT so that regressions can be detected
early.

On Tue, Mar 5, 2013 at 7:18 AM, Robert Evans <ev...@yahoo-inc.com> wrote:

> That is a great point.  I have been meaning to set up the Jenkins build
> for branch-2 for a while, so I took the 10 mins and just did it.
>
> https://builds.apache.org/job/Hadoop-Common-2-Commit/
>
> Don't let the name fool you, it publishes not just common, but HDFS, YARN,
> MR, and tools too.  You should now have branch-2 SNAPSHOTS updated on each
> commit to branch-2.  Feel free to bug me if you need more integration
> points.  I am not an RE guy, but I can hack it to make things work :)
>
> --Bobby
>
> On 3/5/13 12:15 AM, "Konstantin Boudnik" <co...@apache.org> wrote:
>
> >Arun,
> >
> >First of all, I don't think anyone is trying to put the blame on someone
> >else. E.g., I had a similar experience with Oozie being broken because of
> >certain released changes upstream.
> >
> >I am sure that most people in the BigTop community - especially those who
> >share the committer-ship privilege in BigTop and other upstream
> >projects, including Hadoop - would be happy to help with the
> >stabilization of the Hadoop base. The issue that a downstream
> >integration project is likely to have is - for one - the absence of
> >regularly published development artifacts. In the light of "it didn't
> >happen if there's no picture", here are a couple of examples:
> >
> >  - 2.0.2-SNAPSHOT artifacts weren't published at all; only the release
> >    2.0.2-alpha artifacts were
> >  - 2.0.3-SNAPSHOT artifacts weren't published until Feb 29, 2013 (it
> >    happened just once)
> >
> >So, technically speaking, unless an integration project is willing to
> >build and maintain its own artifacts, it is impossible to do any
> >preventive validation.
> >
> >Which brings me to my next question: how do you guys address
> >"Integration is high on the list of *every* release"? Again, please
> >don't get me wrong - I am not looking to lay blame on or corner
> >anyone - I am really curious and would appreciate the input.
> >
> >
> >Vinod:
> >
> >> As you yourself noted later, the pain is part of the 'alpha' status
> >> of the release. We are targeting one of the immediate future
> >> releases to be a beta and so these troubles are really only
> >> short term.
> >
> >I don't really want to get into the discussion of what
> >constitutes the alpha and how it has delayed the adoption of the Hadoop2
> >line. However, I want to point out that it is especially important for
> >an "alpha" platform to work nicely with downstream consumers of the said
> >platform. For quite obvious reasons, I believe.
> >
> >> I think there is a fundamental problem with the interaction of
> >> Bigtop with the downstream projects, if nothing else, with
> >
> >BigTop is as downstream as it can get, because BigTop essentially
> >consumes all other component releases in order to produce a viable
> >stack. Technicalities aside...
> >
> >> Hadoop. We never formalized on the process, will BigTop step in
> >> after an RC is up for vote or before? As I see it, it's happening
> >
> >Bigtop essentially can give any component, including Hadoop, and
> >better yet - the set of components - certain guarantees about
> >compatibility and dependencies being included. A case in point is
> >the commons libraries missing from the 1.0.1 release, which essentially
> >prevented HBase from working properly.
> >
> >> after the vote is up, so no wonder we are in this state. Shall we
> >> have a pre-notice to Bigtop so that it can step in before?
> >
> >The above is in contradiction with the earlier statement that "Integration
> >is high on the list of *every* release". If BigTop isn't used for
> >integration testing, then how is said integration testing performed?
> >Is it some sort of test-patch process, as Luke referred to earlier?  And
> >why does it leave room for integration issues to go uncaught?
> >Again, I am genuinely interested to know.
> >
> >> these short term pains. I'd rather like us swim through these now
> >> instead of support broken APIs and features in our beta, having seen
> >> this very thing happen with 1.*.
> >
> >I think you're mixing up the point of integration with downstream and
> >being in an alpha phase of the development. The former isn't about
> >supporting "broken APIs" - it is about being consistent and avoiding
> >breaking the downstream applications without letting said applications
> >accommodate the platform changes first.
> >
> >Changes in the API, after all, can be relatively easily traced by
> >integration validation - this is the whole point of integration
> >testing. And BigTop does the job better than anything around, simply
> >because there's nothing else around to do it.
> >
> >If you stay in a shape-shifting "alpha" that doesn't integrate well for
> >a very long time, you risk losing downstream customers' interest,
> >because they might get tired of waiting until the next stable API is
> >ready for them.
> >
> >> Let's fix the way the release related communication is happening
> >> across our projects so that we can all work together and make 2.X a
> >> success.
> >
> >This is a very good point indeed! Let's start a separate discussion
> >thread on how we can improve the release model for coming Hadoop
> >releases, where we - as the community - can provide better guarantees
> >of the inter-component compatibility (sorry for an overused word).
> >
> >Cos
> >

Re: [DISCUSS] stabilizing Hadoop releases wrt. downstream

Posted by Giridharan Kesavan <gk...@hortonworks.com>.
Thanks Bobby. I've set up a unit test job
<https://builds.apache.org/view/Hadoop/job/Hadoop-Common-2-Build/> to
execute unit tests on branch-2 on a daily basis.
I'm happy to help with build setup. Let me know.

-Giri


On Tue, Mar 5, 2013 at 9:02 PM, Konstantin Boudnik <co...@apache.org> wrote:

> Great start, Bobby! I can certainly jump in and fix something quickly if
> needed as well (I'm not an RE person either, but CI is truly a dev tool!)
>
> Thanks!
>   Cos
>

Re: [DISCUSS] stabilizing Hadoop releases wrt. downstream

Posted by Giridharan Kesavan <gk...@hortonworks.com>.
Thanks Bobby. I 've setup a unit test
job<https://builds.apache.org/view/Hadoop/job/Hadoop-Common-2-Build/>to
execute unit test's on branch-2 on a daily basis.
I'm happy to help with build setup. Let me know.

-Giri


On Tue, Mar 5, 2013 at 9:02 PM, Konstantin Boudnik <co...@apache.org> wrote:

> Great start, Bobby! I certainly can jump on fix something quickly if
> needed as
> well (neither an RE person, but CI is truly a dev. tool!)
>
> Thanks!
>   Cos
>
> On Tue, Mar 05, 2013 at 07:18AM, Robert Evans wrote:
> > That is a great point.  I have been meaning to set up the Jenkins build
> > for branch-2 for a while, so I took the 10 mins and just did it.
> >
> > https://builds.apache.org/job/Hadoop-Common-2-Commit/
> >
> > Don't let the name fool you, it publishes not just common, but HDFS,
> YARN,
> > MR, and tools too.  You should now have branch-2 SNAPSHOTS updated on
> each
> > commit to branch-2.  Feel free to bug me if you need more integration
> > points.  I am not an RE guy, but I can hack it to make things work :)
> >
> > --Bobby
> >
> > On 3/5/13 12:15 AM, "Konstantin Boudnik" <co...@apache.org> wrote:
> >
> > >Arun,
> > >
> > >first of all, I don't think anyone is trying to put the blame on someone
> > >else. E.g. I had a similar experience with Oozie being broken because of
> > >certain released changes in the upstream.
> > >
> > >I am sure that most people in the BigTop community - especially those who
> > >share the committer-ship privilege in BigTop and other upstream
> > >projects, including Hadoop - would be happy to help with the
> > >stabilization of the Hadoop base. The issue that a downstream
> > >integration project is likely to have is - for one - the absence of
> > >regularly published development artifacts. In the light of "it didn't
> > >happen if there's no picture" here are a couple of examples:
> > >
> > >  - 2.0.2-SNAPSHOT artifacts weren't published at all; only the release
> > >    2.0.2-alpha artifacts were
> > >  - 2.0.3-SNAPSHOT artifacts weren't published until Feb 29, 2013 (it
> > >    happened just once)
> > >
> > >So, technically speaking, unless an integration project is willing to
> > >build and maintain its own artifacts, it is impossible to do any
> > >preventive validation.
> > >
> > >Which brings me to my next question: how do you guys address
> > >"Integration is high on the list of *every* release"? Again, please
> > >don't get me wrong - I am not looking to lay blame on or corner
> > >anyone - I am really curious and would appreciate the input.
> > >
> > >
> > >Vinod:
> > >
> > >> As you yourself noted later, the pain is part of the 'alpha' status
> > >> of the release. We are targeting +one of the immediate future
> > >> releases to be a beta and so these troubles are really only the
> > >> short +term.
> > >
> > >I don't really want to get into the discussion about what
> > >constitutes an alpha and how it has delayed the adoption of the Hadoop 2
> > >line. However, I want to point out that it is especially important for
> > >an "alpha" platform to work nicely with downstream consumers of the said
> > >platform. For quite obvious reasons, I believe.
> > >
> > >> I think there is a fundamental problem with the interaction of
> > >> Bigtop with the downstream projects, if nothing else, with
> > >
> > >BigTop is as downstream as it can get, because BigTop essentially
> > >consumes all other component releases in order to produce a viable
> > >stack. Technicalities aside...
> > >
> > >> Hadoop. We never formalized on the process, will BigTop step in
> > >> after an RC is up for vote or before? As I see it, it's happening
> > >
> > >Bigtop essentially can give any component, including Hadoop, and
> > >better yet - the set of components - certain guarantees about
> > >compatibility and dependencies being included. A case in point is
> > >the commons libraries missing from the 1.0.1 release, which essentially
> > >prevented HBase from working properly.
> > >
> > >> after the vote is up, so no wonder we are in this state. Shall we
> > >> have a pre-notice to Bigtop so that it can step in before?
> > >
> > >The above is in contradiction with the earlier statement that "Integration
> > >is high on the list of *every* release". If BigTop isn't used for
> > >integration testing, then how is said integration testing performed?
> > >Is it some sort of test-patch process as Luke referred to earlier?  And
> > >why does it leave room for integration issues to go uncaught?
> > >Again, I am genuinely interested to know.
> > >
> > >> these short term pains. I'd rather like us swim through these now
> > >> instead of support broken APIs and features in our beta, having seen
> > >> this very thing happen with 1.*.
> > >
> > >I think you're mixing up the point of integration with downstream and
> > >being in an alpha phase of the development. The former isn't about
> > >supporting "broken APIs" - it is about being consistent and avoiding
> > >breaking the downstream applications without first letting said
> > >applications accommodate the platform changes.
> > >
> > >Changes in the API, after all, can be relatively easily traced by
> > >integration validation - this is the whole point of integration
> > >testing. And BigTop does the job better than anything around, simply
> > >because there's nothing else around to do it.
> > >
> > >If you stay in a shape-shifting "alpha" that doesn't integrate well for
> > >a very long time, you risk losing downstream customers' interest,
> > >because they might get tired of waiting until the next stable API is
> > >ready for them.
> > >
> > >> Let's fix the way the release related communication is happening
> > >> across our projects so that we can all work together and make 2.X a
> > >> success.
> > >
> > >This is a very good point indeed! Let's start a separate discussion
> > >thread on how we can improve the release model for coming Hadoop
> > >releases, where we - as the community - can provide better guarantees
> > >of the inter-component compatibility (sorry for an overused word).
> > >
> > >Cos
> > >
> > >On Fri, Mar 01, 2013 at 10:58AM, Arun C Murthy wrote:
> > >> I feel this is being blown out of proportion.
> > >>
> > >> Integration is high on the list of *every* release. In future, if anyone
> > >> or bigtop wants to help, running integration tests on a hadoop RC and
> > >> providing feedback would be very welcome. I'm pretty sure I will stop an
> > >> RC if it means it breaks Oozie or HBase or Pig or Hive and re-spin it.
> > >> For e.g. see recent efforts to do a 2.0.4-alpha.
> > >>
> > >> With hadoop-2.0.3-alpha we discovered 3 *bugs* - making it sound like we
> > >> intentionally disregard integration issues is very harsh.
> > >>
> > >> Please also see the other thread where we discussed stabilizing APIs,
> > >> protocols etc. for the next 'beta' release.
> > >>
> > >> Arun
> > >>
> > >> On Feb 26, 2013, at 5:43 PM, Roman Shaposhnik wrote:
> > >>
> > >> > Hi!
> > >> >
> > >> > for the past couple of releases of Hadoop 2.X code line the issue
> > >> > of integration between Hadoop and its downstream projects has
> > >> > become quite a thorny issue. The poster child here is Oozie, where
> > >> > every release of Hadoop 2.X seems to be breaking the compatibility
> > >> > in various unpredictable ways. At times other components (such
> > >> > as HBase for example) also seem to be affected.
> > >> >
> > >> > Now, to be extremely clear -- I'm NOT talking about the *latest*
> > >> > version of Oozie working with the *latest* version of Hadoop, instead
> > >> > my observations come from running previous *stable* releases
> > >> > of Bigtop on top of Hadoop 2.X RCs.
> > >> >
> > >> > As many of you know Apache Bigtop aims at providing a single
> > >> > platform for integration of Hadoop and Hadoop ecosystem projects.
> > >> > As such we're uniquely positioned to track compatibility between
> > >> > different Hadoop releases with regards to the downstream components
> > >> > (things like Oozie, Pig, Hive, Mahout, etc.). Every single RC
> > >> > we've been pretty diligent at trying to provide integration-level
> > >> > feedback on the quality of the upcoming release, but it seems that
> > >> > our efforts don't quite suffice to stabilize Hadoop 2.X.
> > >> >
> > >> > Of course, one could argue that while Hadoop 2.X code line was
> > >> > designated 'alpha' expecting much in the way of perfect integration
> > >> > and compatibility was NOT what the Hadoop community was
> > >> > focusing on. I can appreciate that view, but what I'm interested in
> > >> > is the future of Hadoop 2.X not its past. Hence, here's my question
> > >> > to all of you as a Hadoop community at large:
> > >> >
> > >> > Do you guys think that the project has reached a point where
> > >> > integration and compatibility issues should be prioritized really high
> > >> > on the list of things that make or break each future release?
> > >> >
> > >> > The good news is that Bigtop's charter is in big part *exactly* about
> > >> > providing you with this kind of feedback. We can easily tell you when
> > >> > Hadoop behavior, with regard to downstream components, changes
> > >> > between a previous stable release and the new RC (or even
> > >> > branch/trunk). What we can NOT do is submit patches for all the
> > >> > issues. We are simply too small a project and we need your help with
> > >> > that.
> > >> >
> > >> > I truly believe that we owe it to the downstream projects, and in the
> > >> > second half of this email I will try to convince you of that.
> > >> >
> > >> > We all know that integration projects are impossible to pull off
> > >> > unless there's a general consensus between all of the projects
> > >> > involved that they indeed need to work with each other. You can NOT
> > >> > force that notion, but you can always try to influence. This
> > >> > relationship goes both ways.
> > >> >
> > >> > Consider a question in front of the downstream communities
> > >> > of  whether or not to adopt Hadoop 2.X as the basis. To answer
> > >> > that question each downstream project has to be reasonably
> > >> > sure that their concerns will NOT fall on deaf ears and that
> > >> > Hadoop developers are, essentially, 'ready' for them to pick
> > >> > up Hadoop 2.X. I would argue that so far the Hadoop community
> > >> > had gone out of its way to signal that 2.X codeline is NOT
> > >> > ready for the downstream.
> > >> >
> > >> > I would argue that moving forward this is a really unfortunate
> > >> > situation that may end up undermining the long term success
> > >> > of Hadoop 2.X if we don't start addressing the problem. Think
> > >> > about it -- 90% of unit tests that run downstream on Apache
> > >> > infrastructure are still exercising Hadoop 1.X underneath.
> > >> > In fact, if you were to forcefully make, let's say, HBase's
> > >> > unit tests run on top of Hadoop 2.X quite a few of them
> > >> > are going to fail. Hadoop community is, in effect, cutting
> > >> > itself off from the biggest source of feedback -- its downstream
> > >> > users. This in turn:
> > >> >
> > >> >   * leaves Hadoop project in a perpetual state of broken
> > >> >     windows syndrome.
> > >> >
> > >> >   * leaves Apache Hadoop 2.X releases in a state considerably
> > >> >     inferior to the releases *including* Apache Hadoop done by the
> > >> >     vendors. The users have no choice but to align themselves
> > >> >     with vendor offerings if they wish to utilize the latest Hadoop
> > >> >     functionality. The artifact that is known as Apache Hadoop 2.X
> > >> >     stopped being a viable choice, thus fracturing the user community
> > >> >     and reducing the benefits of a commonly deployed codebase.
> > >> >
> > >> >   * leaves downstream projects of Hadoop in a jaded state where
> > >> >     they legitimately get very discouraged and frustrated and
> > >> >     eventually give up, thinking that -- well, we work with one release
> > >> >     of Hadoop (the stable one Hadoop 1.X) and we shall wait for the
> > >> >     Hadoop community to get their act together.
> > >> >
> > >> > In my view (shared by quite a few members of the Apache Bigtop) we
> > >> > can definitely do better than this if we all agree that the proposed
> > >> > first 'beta' release of Hadoop 2.0.4 is the right time for it to
> > >> > happen.
> > >> >
> > >> > It is about time Hadoop 2.X community wins back all those end users
> > >> > and downstream projects that got left behind during the alpha
> > >> > stabilization phase.
> > >> >
> > >> > Thanks,
> > >> > Roman.
> > >>
> > >> --
> > >> Arun C. Murthy
> > >> Hortonworks Inc.
> > >> http://hortonworks.com/
> > >>
> > >>
> >
>

Re: [DISCUSS] stabilizing Hadoop releases wrt. downstream

Posted by Konstantin Boudnik <co...@apache.org>.
Great start, Bobby! I can certainly jump in and fix something quickly if needed
as well (I'm not an RE person either, but CI is truly a dev tool!)

Thanks!
  Cos

On Tue, Mar 05, 2013 at 07:18AM, Robert Evans wrote:
> That is a great point.  I have been meaning to set up the Jenkins build
> for branch-2 for a while, so I took the 10 mins and just did it.
> 
> https://builds.apache.org/job/Hadoop-Common-2-Commit/
> 
> Don't let the name fool you, it publishes not just common, but HDFS, YARN,
> MR, and tools too.  You should now have branch-2 SNAPSHOTS updated on each
> commit to branch-2.  Feel free to bug me if you need more integration
> points.  I am not an RE guy, but I can hack it to make things work :)
> 
> --Bobby
> 
> On 3/5/13 12:15 AM, "Konstantin Boudnik" <co...@apache.org> wrote:
> 
> >Arun,
> >
> >first of all, I don't think anyone is trying to put the blame on someone
> >else. E.g. I had a similar experience with Oozie being broken because of
> >certain released changes in the upstream.
> >
> >I am sure that most people in the BigTop community - especially those who
> >share the committer-ship privilege in BigTop and other upstream
> >projects, including Hadoop - would be happy to help with the
> >stabilization of the Hadoop base. The issue that a downstream
> >integration project is likely to have is - for one - the absence of
> >regularly published development artifacts. In the light of "it didn't
> >happen if there's no picture" here are a couple of examples:
> >
> >  - 2.0.2-SNAPSHOT artifacts weren't published at all; only the release
> >    2.0.2-alpha artifacts were
> >  - 2.0.3-SNAPSHOT artifacts weren't published until Feb 29, 2013 (it
> >    happened just once)
> >
> >So, technically speaking, unless an integration project is willing to
> >build and maintain its own artifacts, it is impossible to do any
> >preventive validation.
> >
> >Which brings me to my next question: how do you guys address
> >"Integration is high on the list of *every* release"? Again, please
> >don't get me wrong - I am not looking to lay blame on or corner
> >anyone - I am really curious and would appreciate the input.
> >
> >
> >Vinod:
> >
> >> As you yourself noted later, the pain is part of the 'alpha' status
> >> of the release. We are targeting +one of the immediate future
> >> releases to be a beta and so these troubles are really only the
> >> short +term.
> >
> >I don't really want to get into the discussion about what
> >constitutes an alpha and how it has delayed the adoption of the Hadoop 2
> >line. However, I want to point out that it is especially important for
> >an "alpha" platform to work nicely with downstream consumers of the said
> >platform. For quite obvious reasons, I believe.
> >
> >> I think there is a fundamental problem with the interaction of
> >> Bigtop with the downstream projects, if nothing else, with
> >
> >BigTop is as downstream as it can get, because BigTop essentially
> >consumes all other component releases in order to produce a viable
> >stack. Technicalities aside...
> >
> >> Hadoop. We never formalized on the process, will BigTop step in
> >> after an RC is up for vote or before? As I see it, it's happening
> >
> >Bigtop essentially can give any component, including Hadoop, and
> >better yet - the set of components - certain guarantees about
> >compatibility and dependencies being included. A case in point is
> >the commons libraries missing from the 1.0.1 release, which essentially
> >prevented HBase from working properly.
> >
> >> after the vote is up, so no wonder we are in this state. Shall we
> >> have a pre-notice to Bigtop so that it can step in before?
> >
> >The above is in contradiction with the earlier statement that "Integration
> >is high on the list of *every* release". If BigTop isn't used for
> >integration testing, then how is said integration testing performed?
> >Is it some sort of test-patch process as Luke referred to earlier?  And
> >why does it leave room for integration issues to go uncaught?
> >Again, I am genuinely interested to know.
> >
> >> these short term pains. I'd rather like us swim through these now
> >> instead of support broken APIs and features in our beta, having seen
> >> this very thing happen with 1.*.
> >
> >I think you're mixing up the point of integration with downstream and
> >being in an alpha phase of the development. The former isn't about
> >supporting "broken APIs" - it is about being consistent and avoiding
> >breaking the downstream applications without first letting said
> >applications accommodate the platform changes.
> >
> >Changes in the API, after all, can be relatively easily traced by
> >integration validation - this is the whole point of integration
> >testing. And BigTop does the job better than anything around, simply
> >because there's nothing else around to do it.
> >
> >If you stay in a shape-shifting "alpha" that doesn't integrate well for
> >a very long time, you risk losing downstream customers' interest,
> >because they might get tired of waiting until the next stable API is
> >ready for them.
> >
> >> Let's fix the way the release related communication is happening
> >> across our projects so that we can all work together and make 2.X a
> >> success.
> >
> >This is a very good point indeed! Let's start a separate discussion
> >thread on how we can improve the release model for coming Hadoop
> >releases, where we - as the community - can provide better guarantees
> >of the inter-component compatibility (sorry for an overused word).
> >
> >Cos
> >
> >On Fri, Mar 01, 2013 at 10:58AM, Arun C Murthy wrote:
> >> I feel this is being blown out of proportion.
> >> 
> >> Integration is high on the list of *every* release. In future, if anyone
> >> or bigtop wants to help, running integration tests on a hadoop RC and
> >> providing feedback would be very welcome. I'm pretty sure I will stop an
> >> RC if it means it breaks Oozie or HBase or Pig or Hive and re-spin it.
> >> For e.g. see recent efforts to do a 2.0.4-alpha.
> >>
> >> With hadoop-2.0.3-alpha we discovered 3 *bugs* - making it sound like we
> >> intentionally disregard integration issues is very harsh.
> >>
> >> Please also see the other thread where we discussed stabilizing APIs,
> >> protocols etc. for the next 'beta' release.
> >> 
> >> Arun
> >> 
> >> On Feb 26, 2013, at 5:43 PM, Roman Shaposhnik wrote:
> >> 
> >> > Hi!
> >> > 
> >> > for the past couple of releases of Hadoop 2.X code line the issue
> >> > of integration between Hadoop and its downstream projects has
> >> > become quite a thorny issue. The poster child here is Oozie, where
> >> > every release of Hadoop 2.X seems to be breaking the compatibility
> >> > in various unpredictable ways. At times other components (such
> >> > as HBase for example) also seem to be affected.
> >> > 
> >> > Now, to be extremely clear -- I'm NOT talking about the *latest*
> >> > version of Oozie working with the *latest* version of Hadoop, instead
> >> > my observations come from running previous *stable* releases
> >> > of Bigtop on top of Hadoop 2.X RCs.
> >> > 
> >> > As many of you know Apache Bigtop aims at providing a single
> >> > platform for integration of Hadoop and Hadoop ecosystem projects.
> >> > As such we're uniquely positioned to track compatibility between
> >> > different Hadoop releases with regards to the downstream components
> >> > (things like Oozie, Pig, Hive, Mahout, etc.). Every single RC
> >> > we've been pretty diligent at trying to provide integration-level
> >> > feedback on the quality of the upcoming release, but it seems that
> >> > our efforts don't quite suffice to stabilize Hadoop 2.X.
> >> > 
> >> > Of course, one could argue that while Hadoop 2.X code line was
> >> > designated 'alpha' expecting much in the way of perfect integration
> >> > and compatibility was NOT what the Hadoop community was
> >> > focusing on. I can appreciate that view, but what I'm interested in
> >> > is the future of Hadoop 2.X not its past. Hence, here's my question
> >> > to all of you as a Hadoop community at large:
> >> > 
> >> > Do you guys think that the project has reached a point where
> >> > integration and compatibility issues should be prioritized really high
> >> > on the list of things that make or break each future release?
> >> > 
> >> > The good news is that Bigtop's charter is in big part *exactly* about
> >> > providing you with this kind of feedback. We can easily tell you when
> >> > Hadoop behavior, with regard to downstream components, changes
> >> > between a previous stable release and the new RC (or even
> >> > branch/trunk). What we can NOT do is submit patches for all the
> >> > issues. We are simply too small a project and we need your help with
> >> > that.
> >> > 
> >> > I truly believe that we owe it to the downstream projects, and in the
> >> > second half of this email I will try to convince you of that.
> >> > 
> >> > We all know that integration projects are impossible to pull off
> >> > unless there's a general consensus between all of the projects
> >> > involved that they indeed need to work with each other. You can NOT
> >> > force that notion, but you can always try to influence. This
> >> > relationship goes both ways.
> >> > 
> >> > Consider a question in front of the downstream communities
> >> > of  whether or not to adopt Hadoop 2.X as the basis. To answer
> >> > that question each downstream project has to be reasonably
> >> > sure that their concerns will NOT fall on deaf ears and that
> >> > Hadoop developers are, essentially, 'ready' for them to pick
> >> > up Hadoop 2.X. I would argue that so far the Hadoop community
> >> > had gone out of its way to signal that 2.X codeline is NOT
> >> > ready for the downstream.
> >> > 
> >> > I would argue that moving forward this is a really unfortunate
> >> > situation that may end up undermining the long term success
> >> > of Hadoop 2.X if we don't start addressing the problem. Think
> >> > about it -- 90% of unit tests that run downstream on Apache
> >> > infrastructure are still exercising Hadoop 1.X underneath.
> >> > In fact, if you were to forcefully make, let's say, HBase's
> >> > unit tests run on top of Hadoop 2.X quite a few of them
> >> > are going to fail. Hadoop community is, in effect, cutting
> >> > itself off from the biggest source of feedback -- its downstream
> >> > users. This in turn:
> >> > 
> >> >   * leaves Hadoop project in a perpetual state of broken
> >> >     windows syndrome.
> >> > 
> >> >   * leaves Apache Hadoop 2.X releases in a state considerably
> >> >     inferior to the releases *including* Apache Hadoop done by the
> >> >     vendors. The users have no choice but to align themselves
> >> >     with vendor offerings if they wish to utilize the latest Hadoop
> >> >     functionality. The artifact that is known as Apache Hadoop 2.X
> >> >     stopped being a viable choice, thus fracturing the user community
> >> >     and reducing the benefits of a commonly deployed codebase.
> >> >
> >> >   * leaves downstream projects of Hadoop in a jaded state where
> >> >     they legitimately get very discouraged and frustrated and
> >> >     eventually give up, thinking that -- well, we work with one release
> >> >     of Hadoop (the stable one Hadoop 1.X) and we shall wait for the
> >> >     Hadoop community to get their act together.
> >> > 
> >> > In my view (shared by quite a few members of the Apache Bigtop) we
> >> > can definitely do better than this if we all agree that the proposed
> >> > first 'beta' release of Hadoop 2.0.4 is the right time for it to
> >> > happen.
> >> > 
> >> > It is about time Hadoop 2.X community wins back all those end users
> >> > and downstream projects that got left behind during the alpha
> >> > stabilization phase.
> >> > 
> >> > Thanks,
> >> > Roman.
> >> 
> >> --
> >> Arun C. Murthy
> >> Hortonworks Inc.
> >> http://hortonworks.com/
> >> 
> >> 
> 

Re: [DISCUSS] stabilizing Hadoop releases wrt. downstream

Posted by Konstantin Boudnik <co...@apache.org>.
Great start, Bobby! I certainly can jump on fix something quickly if needed as
well (neither an RE person, but CI is truly a dev. tool!)

Thanks!
  Cos

On Tue, Mar 05, 2013 at 07:18AM, Robert Evans wrote:
> That is a great point.  I have been meaning to set up the Jenkins build
> for branch-2 for a while, so I took the 10 mins and just did it.
> 
> https://builds.apache.org/job/Hadoop-Common-2-Commit/
> 
> Don't let the name fool you, it publishes not just common, but HDFS, YARN,
> MR, and tools too.  You should now have branch-2 SNAPSHOTS updated on each
> commit to branch-2.  Feel free to bug me if you need more integration
> points.  I am not an RE guy, but I can hack it to make things work :)
> 
> --Bobby
> 
> On 3/5/13 12:15 AM, "Konstantin Boudnik" <co...@apache.org> wrote:
> 
> >Arun,
> >
> >first of all, I don't think anyone is trying to put a blame on someone
> >else. E.g. I had similar experience with Oozie being broken because of
> >certain released changes in the upstream.
> >
> >I am sure that most people in BigTop community - especially those who
> >share the committer-ship privilege in BigTop and other upstream
> >projects, including Hadoop, - would be happy to help with the
> >stabilization of the Hadoop base. The issue that a downstream
> >integration project is likely to have is - for once - the absence of
> >regularly published development artifacts. In the light of "it didn't
> >happen if there's no picture" here's a couple of examples:
> >
> >  - 2.0.2-SNAPSHOT weren't published at all; only release 2.0.2-alpha
> >artifacts were
> >  - 2.0.3-SNAPSHOT weren't published until Feb 29, 2013 (it happened just
> >once)
> >
> >So, technically speaking, unless an integration project is willing to
> >build and maintain its own artifacts, it is impossible to do any
> >preventive validation.
> >
> >Which brings me to my next question: how do you guys address
> >"Integration is high on the list of *every* release". Again, please
> >don't get me wrong - I am not looking to lay a blame on or corner
> >anyone - I am really curious and would appreciate the input.
> >
> >
> >Vinod:
> >
> >> As you yourself noted later, the pain is part of the 'alpha' status
> >> of the release. We are targeting +one of the immediate future
> >> releases to be a beta and so these troubles are really only the
> >> short +term.
> >
> >I don't really want to get into the discussion about of what
> >constitutes the alpha and how it has delayed the adoption of Hadoop2
> >line. However, I want to point out that it is especially important for
> >"alpha" platform to work nicely with downstream consumers of the said
> >platform. For quite obvious reasons, I believe.
> >
> >> I think there is a fundamental problem with the interaction of
> >> Bigtop with the downstream projects, if nothing else, with
> >
> >BigTop is as downstream as it can get, because BigTop essentially
> >consumes all other component releases in order to produce a viable
> >stack. Technicalities aside...
> >
> >> Hadoop. We never formalized on the process, will BigTop step in
> >> after an RC is up for vote or before? As I see it, it's happening
> >
> >Bigtop essentially can give any component, including Hadoop, and
> >better yet - the set of components - certain guaratees about
> >compatibility and dependencies being included. Case in point is
> >missing commons libraries missed in 1.0.1 release that essentially
> >prevented HBase from working properly.
> >
> >> after the vote is up, so no wonder we are in this state. Shall we
> >> have a pre-notice to Bigtop so that it can step in before?
> >
> >The above is in contradiction with earlier statement of "Integration
> >is high on the list of *every* release". If BigTop isn't used for
> >integration testing, then how said integration testing is performed?
> >Is it some sort of test-patch process as Luke referred earlier?  And
> >why it leaves the room for the integration issues being uncaught?
> >Again, I am genuinely interested to know.
> >
> >> these short term pains. I'd rather like us swim through these now
> >> instead of support broken APIs and features in our beta, having seen
> >> this very thing happen with 1.*.
> >
> >I think you're mixing the point of integration with downstream and
> >being in an alpha phase of the development. The former isn't about
> >supporting "broken APIs" - it is about being consistent and avoid
> >breaking the downstream applicaitons without letting said applications
> >to accomodate the platform changes first.
> >
> >Changes in the API, after all, can be relatively easy traced by
> >integration validation - this is the whole point of integration
> >testing. And BigTop does the job better then anything around, simply
> >because there's nothing else around to do it.
> >
> >If you stay in shape-shifting "alpha" that doesn't integrate well for
> >a very long time, you risk to lose downstream customers' interest,
> >because they might get tired of waiting until a next stable API will
> >be ready for them.
> >
> >> Let's fix the way the release related communication is happening
> >> across our projects so that we can all work together and make 2.X a
> >> success.
> >
> >This is a very good point indeed! Let's start a separate discussion
> >thread on how we can improve the release model for coming Hadoop
> >releases, where we - as the community - can provide better guarantees
> >of the inter-component compatibility (sorry for an overused word).
> >
> >Cos
> >
> >On Fri, Mar 01, 2013 at 10:58AM, Arun C Murthy wrote:
> >> I feel this is being blown out of proportion.
> >> 
> >> Integration is high on the list of *every* release. In future, if
> >>anyone or
> >> bigtop wants to help, running integration tests on a hadoop RC and
> >>providing
> >> feedback would be very welcome. I'm pretty sure I will stop an RC if it
> >> means it breaks and Oozie or HBase or Pig or Hive and re-spin it. For
> >>e.g.
> >> see recent efforts to do a 2.0.4-alpha.
> >> 
> >> With hadoop-2.0.3-alpha we discovered 3 *bugs* - making it sound like we
> >> intentionally disregard integation issues is very harsh.
> >> 
> >> Please also see other thread where we discussed stabilizing APIS,
> >>protocols
> >> etc. for the next 'beta' release.
> >> 
> >> Arun
> >> 
> >> On Feb 26, 2013, at 5:43 PM, Roman Shaposhnik wrote:
> >> 
> >> > Hi!
> >> > 
> >> > for the past couple of releases of Hadoop 2.X code line the issue
> >> > of integration between Hadoop and its downstream projects has
> >> > become quite a thorny issue. The poster child here is Oozie, where
> >> > every release of Hadoop 2.X seems to be breaking the compatibility
> >> > in various unpredictable ways. At times other components (such
> >> > as HBase for example) also seem to be affected.
> >> > 
> >> > Now, to be extremely clear -- I'm NOT talking about the *latest*
> >>version
> >> > of Oozie working with the *latest* version of Hadoop, instead
> >> > my observations come from running previous *stable*  releases
> >> > of Bigtop on top of Hadoop 2.X RCs.
> >> > 
> >> > As many of you know Apache Bigtop aims at providing a single
> >> > platform for integration of Hadoop and Hadoop ecosystem projects.
> >> > As such we're uniquely positioned to track compatibility between
> >> > different Hadoop releases with regards to the downstream components
> >> > (things like Oozie, Pig, Hive, Mahout, etc.). Every single single RC
> >> > we've been pretty diligent at trying to provide integration-level
> >>feedback
> >> > on the quality of the upcoming release,  but it seems that our efforts
> >> > don't quite suffice in Hadoop 2.X stabilizing.
> >> > 
> >> > Of course, one could argue that while Hadoop 2.X code line was
> >> > designated 'alpha' expecting much in the way of perfect integration
> >> > and compatibility was NOT what the Hadoop community was
> >> > focusing on. I can appreciate that view, but what I'm interested in
> >> > is the future of Hadoop 2.X not its past. Hence, here's my question
> >> > to all of you as a Hadoop community at large:
> >> > 
> >> > Do you guys think that the project have reached a point where
> >>integration
> >> > and compatibility issues should be prioritized really high on the list
> >> > of things that make or break each future release?
> >> > 
> >> > The good news is that Bigtop's charter is in big part *exactly* about
> >> > providing you with this kind of feedback. We can easily tell you when
> >> > Hadoop behavior, with regard to downstream components, changes
> >> > between a previous stable release and the new RC (or even branch/trunk).
> >> > What we can NOT do is submit patches for all the issues. We are simply
> >> > too small a project and we need your help with that.
> >> > 
> >> > I truly believe that we owe it to the downstream projects, and in the
> >> > second half of this email I will try to convince you of that.
> >> > 
> >> > We all know that integration projects are impossible to pull off
> >> > unless there's a general consensus between all of the projects involved
> >> > that they indeed need to work with each other. You can NOT force
> >> > that notion, but you can always try to influence. This relationship
> >> > goes both ways.
> >> > 
> >> > Consider a question in front of the downstream communities
> >> > of  whether or not to adopt Hadoop 2.X as the basis. To answer
> >> > that question each downstream project has to be reasonably
> >> > sure that their concerns will NOT fall on deaf ears and that
> >> > Hadoop developers are, essentially, 'ready' for them to pick
> >> > up Hadoop 2.X. I would argue that so far the Hadoop community
> >> > has gone out of its way to signal that the 2.X codeline is NOT
> >> > ready for the downstream.
> >> > 
> >> > I would argue that moving forward this is a really unfortunate
> >> > situation that may end up undermining the long term success
> >> > of Hadoop 2.X if we don't start addressing the problem. Think
> >> > about it -- 90% of unit tests that run downstream on Apache
> >> > infrastructure are still exercising Hadoop 1.X underneath.
> >> > In fact, if you were to forcefully make, let's say, HBase's
> >> > unit tests run on top of Hadoop 2.X, quite a few of them
> >> > are going to fail. The Hadoop community is, in effect, cutting
> >> > itself off from the biggest source of feedback -- its downstream
> >> > users. This in turn:
> >> > 
> >> >   * leaves Hadoop project in a perpetual state of broken
> >> >     windows syndrome.
> >> > 
> >> >   * leaves Apache Hadoop 2.X releases in a state considerably
> >> >     inferior to the releases *including* Apache Hadoop done by the
> >> >     vendors. The users have no choice but to align themselves
> >> >     with vendor offerings if they wish to utilize latest Hadoop functionality.
> >> >     The artifact that is known as Apache Hadoop 2.X has stopped being
> >> >     a viable choice, thus fracturing the user community and reducing
> >> >     the benefits of a commonly deployed codebase.
> >> > 
> >> >    * leaves downstream projects of Hadoop  in a jaded state where
> >> >      they legitimately get very discouraged and frustrated and eventually
> >> >      give up thinking that -- well, we work with one release of Hadoop
> >> >      (the stable one Hadoop 1.X) and we shall wait for the Hadoop
> >> >      community to get their act together.
> >> > 
> >> > In my view (shared by quite a few members of the Apache Bigtop) we
> >> > can definitely do better than this if we all agree that the proposed
> >> > first 'beta' release of Hadoop 2.0.4 is the right time for it to happen.
> >> > 
> >> > It is about time Hadoop 2.X community wins back all those end users
> >> > and downstream projects that got left behind during the alpha
> >> > stabilization phase.
> >> > 
> >> > Thanks,
> >> > Roman.
> >> 
> >> --
> >> Arun C. Murthy
> >> Hortonworks Inc.
> >> http://hortonworks.com/
> >> 
> >> 
> 

Re: [DISCUSS] stabilizing Hadoop releases wrt. downstream

Posted by Konstantin Boudnik <co...@apache.org>.
On Wed, Mar 06, 2013 at 07:24AM, Arun C Murthy wrote:
> Cos,
> 
> On Mar 4, 2013, at 10:15 PM, Konstantin Boudnik wrote:
> 
> > The issue that a downstream
> > integration project is likely to have is - for once - the absence of
> > regularly published development artifacts. In the light of "it didn't
> > happen if there's no picture" here's a couple of examples:
> 
>  Maybe I wasn't clear, apologies, my request is simpler: Can Bigtop validate a Hadoop RC?

No offense taken. I think this is indeed the intention of the initial and
follow-up emails. There needs to be a bit more coordination between the
two, so that BigTop's branches can be set up in advance to churn out stack
builds and validations against the underlying Hadoop RCs, or even earlier,
and so that the Hadoop RM - and potentially downstream components - have
more time to react to the problems.

One of the issues here is a difference in the release schedule between
the two. Perhaps we can set up a validation branch in Bigtop to
help with releases. This is something that needs to be decided within
the project, apparently, but it might work.

Cos

>  Every RC has its artifacts in staging at the ASF maven repo. In the past we have had Bigtop actually verifying an RC:
>  http://s.apache.org/sjz
> 
>  If Bigtop can do that consistently, it would definitely help. Agree?
> 
> thanks,
> Arun
> 

Re: [DISCUSS] stabilizing Hadoop releases wrt. downstream

Posted by Matt Foley <mf...@hortonworks.com>.
Hi Tom,
Yes, I will also update HowToReleasePostMavenization.

Thanks,
--Matt


On Fri, Mar 8, 2013 at 3:46 PM, Thomas Graves <tg...@yahoo-inc.com> wrote:

> Matt,
>
> Could you also update HowToReleasePostMavenization or ping me when you
> made your update and I will update it to match.
>
> Thanks,
> Tom
>
> On 3/8/13 4:16 PM, "Matt Foley" <mf...@hortonworks.com> wrote:
>
> >>
> >> If future Hadoop RMs would
> >> be willing to drop us a note and indicate when the branch
> >> is in a shape where Bigtop feedback is ready to be received
> >> I think that'll be the best way to proceed.
> >
> >
> >This info is always broadcast to common-dev@hadoop.apache.org,
> >but I will try to remember to send such a pre-announcement directly
> >to Bigtop also.  In fact, I'll modify the HowToRelease instructions to
> >incorporate this as a step.
> >
> >Would you like that to be sent to dev@bigtop.apache.org?
> >
> >Thanks,
> >--Matt
> >
> >
> >
> >On Fri, Mar 8, 2013 at 9:55 AM, Roman Shaposhnik <rv...@apache.org> wrote:
> >
> >> On Wed, Mar 6, 2013 at 7:24 AM, Arun C Murthy <ac...@hortonworks.com> wrote:
> >> >> The issue that a downstream
> >> >> integration project is likely to have is - for once - the absence of
> >> >> regularly published development artifacts. In the light of "it didn't
> >> >> happen if there's no picture" here's a couple of examples:
> >> >
> >> >  Maybe I wasn't clear, apologies, my request is simpler: Can Bigtop
> >> >  validate a Hadoop RC?
> >>
> >> The short answer to this is -- yes. It would be extremely nice if
> >> we don't have to wait till the first official RC but could jump on
> >> the branch somewhat sooner. That, of course, would require
> >> a higher degree of coordination. If future Hadoop RMs would
> >> be willing to drop us a note and indicate when the branch
> >> is in a shape where Bigtop feedback is ready to be received
> >> I think that'll be the best way to proceed.
> >>
> >> Do you think this would be a reasonable agreement?
> >>
> >> On a related subject -- it would be extremely nice if Hadoop
> >> community can help us with planning the bill of materials
> >> for the future release of Bigtop. Essentially taking a bit
> >> of a more active role in yay'ing or nay'ing our version
> >> choices. There's a thread on LTS Bigtop releases that
> >> Cos started but I think we can only hope to have an LTS
> >> if the underlying Hadoop is also LTS'ish.
> >>
> >> Thanks,
> >> Roman.
> >>
>
>

Re: [DISCUSS] stabilizing Hadoop releases wrt. downstream

Posted by Thomas Graves <tg...@yahoo-inc.com>.
Matt,

Could you also update HowToReleasePostMavenization, or ping me when you
have made your update and I will update it to match?

Thanks,
Tom

On 3/8/13 4:16 PM, "Matt Foley" <mf...@hortonworks.com> wrote:

>>
>> If future Hadoop RMs would
>> be willing to drop us a note and indicate when the branch
>> is in a shape where Bigtop feedback is ready to be received
>> I think that'll be the best way to proceed.
>
>
>This info is always broadcast to common-dev@hadoop.apache.org,
>but I will try to remember to send such a pre-announcement directly
>to Bigtop also.  In fact, I'll modify the HowToRelease instructions to
>incorporate this as a step.
>
>Would you like that to be sent to dev@bigtop.apache.org?
>
>Thanks,
>--Matt
>
>
>
>On Fri, Mar 8, 2013 at 9:55 AM, Roman Shaposhnik <rv...@apache.org> wrote:
>
>> On Wed, Mar 6, 2013 at 7:24 AM, Arun C Murthy <ac...@hortonworks.com> wrote:
>> >> The issue that a downstream
>> >> integration project is likely to have is - for once - the absence of
>> >> regularly published development artifacts. In the light of "it didn't
>> >> happen if there's no picture" here's a couple of examples:
>> >
>> >  Maybe I wasn't clear, apologies, my request is simpler: Can Bigtop
>> >  validate a Hadoop RC?
>>
>> The short answer to this is -- yes. It would be extremely nice if
>> we don't have to wait till the first official RC but could jump on
>> the branch somewhat sooner. That, of course, would require
>> a higher degree of coordination. If future Hadoop RMs would
>> be willing to drop us a note and indicate when the branch
>> is in a shape where Bigtop feedback is ready to be received
>> I think that'll be the best way to proceed.
>>
>> Do you think this would be a reasonable agreement?
>>
>> On a related subject -- it would be extremely nice if Hadoop
>> community can help us with planning the bill of materials
>> for the future release of Bigtop. Essentially taking a bit
>> of a more active role in yay'ing or nay'ing our version
>> choices. There's a thread on LTS Bigtop releases that
>> Cos started but I think we can only hope to have an LTS
>> if the underlying Hadoop is also LTS'ish.
>>
>> Thanks,
>> Roman.
>>


Re: [DISCUSS] stabilizing Hadoop releases wrt. downstream

Posted by Matt Foley <mf...@hortonworks.com>.
>
> If future Hadoop RMs would
> be willing to drop us a note and indicate when the branch
> is in a shape where Bigtop feedback is ready to be received
> I think that'll be the best way to proceed.


This info is always broadcast to common-dev@hadoop.apache.org,
but I will try to remember to send such a pre-announcement directly
to Bigtop also.  In fact, I'll modify the HowToRelease instructions to
incorporate this as a step.

Would you like that to be sent to dev@bigtop.apache.org?

Thanks,
--Matt



On Fri, Mar 8, 2013 at 9:55 AM, Roman Shaposhnik <rv...@apache.org> wrote:

> On Wed, Mar 6, 2013 at 7:24 AM, Arun C Murthy <ac...@hortonworks.com> wrote:
> >> The issue that a downstream
> >> integration project is likely to have is - for once - the absence of
> >> regularly published development artifacts. In the light of "it didn't
> >> happen if there's no picture" here's a couple of examples:
> >
> >  Maybe I wasn't clear, apologies, my request is simpler: Can Bigtop
> >  validate a Hadoop RC?
>
> The short answer to this is -- yes. It would be extremely nice if
> we don't have to wait till the first official RC but could jump on
> the branch somewhat sooner. That, of course, would require
> a higher degree of coordination. If future Hadoop RMs would
> be willing to drop us a note and indicate when the branch
> is in a shape where Bigtop feedback is ready to be received
> I think that'll be the best way to proceed.
>
> Do you think this would be a reasonable agreement?
>
> On a related subject -- it would be extremely nice if Hadoop
> community can help us with planning the bill of materials
> for the future release of Bigtop. Essentially taking a bit
> of a more active role in yay'ing or nay'ing our version
> choices. There's a thread on LTS Bigtop releases that
> Cos started but I think we can only hope to have an LTS
> if the underlying Hadoop is also LTS'ish.
>
> Thanks,
> Roman.
>

Re: [DISCUSS] stabilizing Hadoop releases wrt. downstream

Posted by Roman Shaposhnik <rv...@apache.org>.
On Wed, Mar 6, 2013 at 7:24 AM, Arun C Murthy <ac...@hortonworks.com> wrote:
>> The issue that a downstream
>> integration project is likely to have is - for once - the absence of
>> regularly published development artifacts. In the light of "it didn't
>> happen if there's no picture" here's a couple of examples:
>
>  Maybe I wasn't clear, apologies, my request is simpler: Can Bigtop validate a Hadoop RC?

The short answer to this is -- yes. It would be extremely nice if
we didn't have to wait till the first official RC but could jump on
the branch somewhat sooner. That, of course, would require
a higher degree of coordination. If future Hadoop RMs would
be willing to drop us a note and indicate when the branch
is in a shape where it is ready to receive Bigtop feedback,
I think that'll be the best way to proceed.

Do you think this would be a reasonable agreement?

On a related subject -- it would be extremely nice if the Hadoop
community could help us with planning the bill of materials
for the future release of Bigtop, essentially taking a bit
of a more active role in yay'ing or nay'ing our version
choices. There's a thread on LTS Bigtop releases that
Cos started, but I think we can only hope to have an LTS
if the underlying Hadoop is also LTS'ish.

Thanks,
Roman.

Re: [DISCUSS] stabilizing Hadoop releases wrt. downstream

Posted by Arun C Murthy <ac...@hortonworks.com>.
Cos,

On Mar 4, 2013, at 10:15 PM, Konstantin Boudnik wrote:

> The issue that a downstream
> integration project is likely to have is - for once - the absence of
> regularly published development artifacts. In the light of "it didn't
> happen if there's no picture" here's a couple of examples:

 Maybe I wasn't clear, apologies. My request is simpler: can Bigtop validate a Hadoop RC?

 Every RC has its artifacts in staging at the ASF maven repo. In the past we have had Bigtop actually verifying an RC:
 http://s.apache.org/sjz

 If Bigtop can do that consistently, it would definitely help. Agree?

thanks,
Arun


Re: [DISCUSS] stabilizing Hadoop releases wrt. downstream

Posted by Robert Evans <ev...@yahoo-inc.com>.
That is a great point.  I have been meaning to set up the Jenkins build
for branch-2 for a while, so I took the 10 mins and just did it.

https://builds.apache.org/job/Hadoop-Common-2-Commit/

Don't let the name fool you, it publishes not just common, but HDFS, YARN,
MR, and tools too.  You should now have branch-2 SNAPSHOTS updated on each
commit to branch-2.  Feel free to bug me if you need more integration
points.  I am not an RE guy, but I can hack it to make things work :)

--Bobby

On 3/5/13 12:15 AM, "Konstantin Boudnik" <co...@apache.org> wrote:

>Arun,
>
>first of all, I don't think anyone is trying to put the blame on someone
>else. E.g. I had a similar experience with Oozie being broken because of
>certain released changes in the upstream.
>
>I am sure that most people in the BigTop community - especially those who
>share the committer-ship privilege in BigTop and other upstream
>projects, including Hadoop - would be happy to help with the
>stabilization of the Hadoop base. The issue that a downstream
>integration project is likely to have is - for one - the absence of
>regularly published development artifacts. In the light of "it didn't
>happen if there's no picture" here are a couple of examples:
>
>  - 2.0.2-SNAPSHOT artifacts weren't published at all; only the release
>    2.0.2-alpha artifacts were
>  - 2.0.3-SNAPSHOT artifacts weren't published until Feb 29, 2013 (it
>    happened just once)
>
>So, technically speaking, unless an integration project is willing to
>build and maintain its own artifacts, it is impossible to do any
>preventive validation.
>
>Which brings me to my next question: how do you guys address
>"Integration is high on the list of *every* release"? Again, please
>don't get me wrong - I am not looking to lay blame on or corner
>anyone - I am really curious and would appreciate the input.
>
>
>Vinod:
>
>> As you yourself noted later, the pain is part of the 'alpha' status
>> of the release. We are targeting one of the immediate future
>> releases to be a beta and so these troubles are really only the
>> short term.
>
>I don't really want to get into the discussion about what
>constitutes an alpha and how it has delayed the adoption of the Hadoop2
>line. However, I want to point out that it is especially important for
>an "alpha" platform to work nicely with downstream consumers of the said
>platform. For quite obvious reasons, I believe.
>
>> I think there is a fundamental problem with the interaction of
>> Bigtop with the downstream projects, if nothing else, with
>
>BigTop is as downstream as it can get, because BigTop essentially
>consumes all other component releases in order to produce a viable
>stack. Technicalities aside...
>
>> Hadoop. We never formalized on the process, will BigTop step in
>> after an RC is up for vote or before? As I see it, it's happening
>
>Bigtop essentially can give any component, including Hadoop, and
>better yet - the set of components - certain guarantees about
>compatibility and dependencies being included. A case in point is
>the commons libraries missing from the 1.0.1 release, which essentially
>prevented HBase from working properly.
>
>> after the vote is up, so no wonder we are in this state. Shall we
>> have a pre-notice to Bigtop so that it can step in before?
>
>The above is in contradiction with the earlier statement of "Integration
>is high on the list of *every* release". If BigTop isn't used for
>integration testing, then how is said integration testing performed?
>Is it some sort of test-patch process as Luke referred to earlier?  And
>why does it leave room for integration issues to go uncaught?
>Again, I am genuinely interested to know.
>
>> these short term pains. I'd rather like us swim through these now
>> instead of support broken APIs and features in our beta, having seen
>> this very thing happen with 1.*.
>
>I think you're mixing the point of integration with downstream and
>being in an alpha phase of the development. The former isn't about
>supporting "broken APIs" - it is about being consistent and avoiding
>breaking the downstream applications without first letting said
>applications accommodate the platform changes.
>
>Changes in the API, after all, can be relatively easily traced by
>integration validation - this is the whole point of integration
>testing. And BigTop does the job better than anything around, simply
>because there's nothing else around to do it.
>
>If you stay in a shape-shifting "alpha" that doesn't integrate well for
>a very long time, you risk losing downstream customers' interest,
>because they might get tired of waiting until the next stable API is
>ready for them.
>
>> Let's fix the way the release related communication is happening
>> across our projects so that we can all work together and make 2.X a
>> success.
>
>This is a very good point indeed! Let's start a separate discussion
>thread on how we can improve the release model for coming Hadoop
>releases, where we - as the community - can provide better guarantees
>of the inter-component compatibility (sorry for an overused word).
>
>Cos
>
>On Fri, Mar 01, 2013 at 10:58AM, Arun C Murthy wrote:
>> I feel this is being blown out of proportion.
>> 
>> Integration is high on the list of *every* release. In future, if anyone or
>> bigtop wants to help, running integration tests on a hadoop RC and providing
>> feedback would be very welcome. I'm pretty sure I will stop an RC if it
>> means it breaks Oozie or HBase or Pig or Hive and re-spin it. For e.g.
>> see recent efforts to do a 2.0.4-alpha.
>> 
>> With hadoop-2.0.3-alpha we discovered 3 *bugs* - making it sound like we
>> intentionally disregard integration issues is very harsh.
>> 
>> Please also see other thread where we discussed stabilizing APIs, protocols
>> etc. for the next 'beta' release.
>> 
>> Arun
>> 
>> On Feb 26, 2013, at 5:43 PM, Roman Shaposhnik wrote:
>> 
>> > Hi!
>> > 
>> > for the past couple of releases of Hadoop 2.X code line the issue
>> > of integration between Hadoop and its downstream projects has
>> > become quite a thorny issue. The poster child here is Oozie, where
>> > every release of Hadoop 2.X seems to be breaking the compatibility
>> > in various unpredictable ways. At times other components (such
>> > as HBase for example) also seem to be affected.
>> > 
>> > Now, to be extremely clear -- I'm NOT talking about the *latest* version
>> > of Oozie working with the *latest* version of Hadoop, instead
>> > my observations come from running previous *stable*  releases
>> > of Bigtop on top of Hadoop 2.X RCs.
>> > 
>> > As many of you know Apache Bigtop aims at providing a single
>> > platform for integration of Hadoop and Hadoop ecosystem projects.
>> > As such we're uniquely positioned to track compatibility between
>> > different Hadoop releases with regards to the downstream components
>> > (things like Oozie, Pig, Hive, Mahout, etc.). For every single RC
>> > we've been pretty diligent at trying to provide integration-level feedback
>> > on the quality of the upcoming release, but it seems that our efforts
>> > don't quite suffice to stabilize Hadoop 2.X.
>> > 
>> > Of course, one could argue that while Hadoop 2.X code line was
>> > designated 'alpha' expecting much in the way of perfect integration
>> > and compatibility was NOT what the Hadoop community was
>> > focusing on. I can appreciate that view, but what I'm interested in
>> > is the future of Hadoop 2.X not its past. Hence, here's my question
>> > to all of you as a Hadoop community at large:
>> > 
>> > Do you guys think that the project has reached a point where integration
>> > and compatibility issues should be prioritized really high on the list
>> > of things that make or break each future release?
>> > 
>> > The good news is that Bigtop's charter is in big part *exactly* about
>> > providing you with this kind of feedback. We can easily tell you when
>> > Hadoop behavior, with regard to downstream components, changes
>> > between a previous stable release and the new RC (or even branch/trunk).
>> > What we can NOT do is submit patches for all the issues. We are simply
>> > too small a project and we need your help with that.
>> > 
>> > I truly believe that we owe it to the downstream projects, and in the
>> > second half of this email I will try to convince you of that.
>> > 
>> > We all know that integration projects are impossible to pull off
>> > unless there's a general consensus between all of the projects involved
>> > that they indeed need to work with each other. You can NOT force
>> > that notion, but you can always try to influence. This relationship
>> > goes both ways.
>> > 
>> > Consider a question in front of the downstream communities
>> > of  whether or not to adopt Hadoop 2.X as the basis. To answer
>> > that question each downstream project has to be reasonably
>> > sure that their concerns will NOT fall on deaf ears and that
>> > Hadoop developers are, essentially, 'ready' for them to pick
>> > up Hadoop 2.X. I would argue that so far the Hadoop community
>> > has gone out of its way to signal that the 2.X codeline is NOT
>> > ready for the downstream.
>> > 
>> > I would argue that moving forward this is a really unfortunate
>> > situation that may end up undermining the long term success
>> > of Hadoop 2.X if we don't start addressing the problem. Think
>> > about it -- 90% of unit tests that run downstream on Apache
>> > infrastructure are still exercising Hadoop 1.X underneath.
>> > In fact, if you were to forcefully make, let's say, HBase's
>> > unit tests run on top of Hadoop 2.X, quite a few of them
>> > are going to fail. The Hadoop community is, in effect, cutting
>> > itself off from the biggest source of feedback -- its downstream
>> > users. This in turn:
>> > 
>> >   * leaves Hadoop project in a perpetual state of broken
>> >     windows syndrome.
>> > 
>> >   * leaves Apache Hadoop 2.X releases in a state considerably
>> >     inferior to the releases *including* Apache Hadoop done by the
>> >     vendors. The users have no choice but to align themselves
>> >     with vendor offerings if they wish to utilize latest Hadoop functionality.
>> >     The artifact that is known as Apache Hadoop 2.X has stopped being
>> >     a viable choice, thus fracturing the user community and reducing
>> >     the benefits of a commonly deployed codebase.
>> > 
>> >    * leaves downstream projects of Hadoop  in a jaded state where
>> >      they legitimately get very discouraged and frustrated and eventually
>> >      give up thinking that -- well, we work with one release of Hadoop
>> >      (the stable one Hadoop 1.X) and we shall wait for the Hadoop
>> >      community to get their act together.
>> > 
>> > In my view (shared by quite a few members of the Apache Bigtop) we
>> > can definitely do better than this if we all agree that the proposed
>> > first 'beta' release of Hadoop 2.0.4 is the right time for it to happen.
>> > 
>> > It is about time Hadoop 2.X community wins back all those end users
>> > and downstream projects that got left behind during the alpha
>> > stabilization phase.
>> > 
>> > Thanks,
>> > Roman.
>> 
>> --
>> Arun C. Murthy
>> Hortonworks Inc.
>> http://hortonworks.com/
>> 
>> 


Re: [DISCUSS] stabilizing Hadoop releases wrt. downstream

Posted by Robert Evans <ev...@yahoo-inc.com>.
That is a great point.  I have been meaning to set up the Jenkins build
for branch-2 for a while, so I took the 10 mins and just did it.

https://builds.apache.org/job/Hadoop-Common-2-Commit/

Don't let the name fool you, it publishes not just common, but HDFS, YARN,
MR, and tools too.  You should now have branch-2 SNAPSHOTS updated on each
commit to branch-2.  Feel free to bug me if you need more integration
points.  I am not an RE guy, but I can hack it to make things work :)

--Bobby

On 3/5/13 12:15 AM, "Konstantin Boudnik" <co...@apache.org> wrote:

>Arun,
>
>first of all, I don't think anyone is trying to put the blame on someone
>else. E.g. I had a similar experience with Oozie being broken because of
>certain released changes in the upstream.
>
>I am sure that most people in the BigTop community - especially those who
>share the committer-ship privilege in BigTop and other upstream
>projects, including Hadoop - would be happy to help with the
>stabilization of the Hadoop base. The issue that a downstream
>integration project is likely to have is - for one - the absence of
>regularly published development artifacts. In the light of "it didn't
>happen if there's no picture" here are a couple of examples:
>
>  - 2.0.2-SNAPSHOT weren't published at all; only release 2.0.2-alpha
>artifacts were
>  - 2.0.3-SNAPSHOT weren't published until Feb 29, 2013 (it happened just
>once)
>
>So, technically speaking, unless an integration project is willing to
>build and maintain its own artifacts, it is impossible to do any
>preventive validation.
>
>Which brings me to my next question: how do you guys address
>"Integration is high on the list of *every* release". Again, please
>don't get me wrong - I am not looking to lay blame on or corner
>anyone - I am really curious and would appreciate the input.
>
>
>Vinod:
>
>> As you yourself noted later, the pain is part of the 'alpha' status
>> of the release. We are targeting one of the immediate future
>> releases to be a beta and so these troubles are really only the
>> short term.
>
>I don't really want to get into the discussion about what
>constitutes an alpha and how it has delayed the adoption of the Hadoop2
>line. However, I want to point out that it is especially important for
>an "alpha" platform to work nicely with downstream consumers of the said
>platform. For quite obvious reasons, I believe.
>
>> I think there is a fundamental problem with the interaction of
>> Bigtop with the downstream projects, if nothing else, with
>
>BigTop is as downstream as it can get, because BigTop essentially
>consumes all other component releases in order to produce a viable
>stack. Technicalities aside...
>
>> Hadoop. We never formalized on the process, will BigTop step in
>> after an RC is up for vote or before? As I see it, it's happening
>
>Bigtop essentially can give any component, including Hadoop, and
>better yet - the set of components - certain guarantees about
>compatibility and dependencies being included. Case in point is
>the commons libraries missing from the 1.0.1 release, which essentially
>prevented HBase from working properly.
>
>> after the vote is up, so no wonder we are in this state. Shall we
>> have a pre-notice to Bigtop so that it can step in before?
>
>The above contradicts the earlier statement that "Integration
>is high on the list of *every* release". If BigTop isn't used for
>integration testing, then how is said integration testing performed?
>Is it some sort of test-patch process, as Luke referred to earlier?  And
>why does it leave room for integration issues to go uncaught?
>Again, I am genuinely interested to know.
>
>> these short term pains. I'd rather like us swim through these now
>> instead of support broken APIs and features in our beta, having seen
>> this very thing happen with 1.*.
>
>I think you're conflating integration with downstream and
>being in an alpha phase of development. The former isn't about
>supporting "broken APIs" - it is about being consistent and avoiding
>breaking the downstream applications without first letting said
>applications accommodate the platform changes.
>
>Changes in the API, after all, can be traced relatively easily by
>integration validation - this is the whole point of integration
>testing. And BigTop does the job better than anything around, simply
>because there's nothing else around to do it.
>
>If you stay in a shape-shifting "alpha" that doesn't integrate well for
>a very long time, you risk losing downstream customers' interest,
>because they might get tired of waiting until the next stable API is
>ready for them.
>
>> Let's fix the way the release related communication is happening
>> across our projects so that we can all work together and make 2.X a
>> success.
>
>This is a very good point indeed! Let's start a separate discussion
>thread on how we can improve the release model for coming Hadoop
>releases, where we - as the community - can provide better guarantees
>of the inter-component compatibility (sorry for an overused word).
>
>Cos
>
>On Fri, Mar 01, 2013 at 10:58AM, Arun C Murthy wrote:
>> I feel this is being blown out of proportion.
>> 
>> Integration is high on the list of *every* release. In future, if anyone or
>> bigtop wants to help, running integration tests on a hadoop RC and providing
>> feedback would be very welcome. I'm pretty sure I will stop an RC if it
>> means it breaks Oozie or HBase or Pig or Hive and re-spin it. For e.g.
>> see recent efforts to do a 2.0.4-alpha.
>> 
>> With hadoop-2.0.3-alpha we discovered 3 *bugs* - making it sound like we
>> intentionally disregard integration issues is very harsh.
>> 
>> Please also see other thread where we discussed stabilizing APIs, protocols
>> etc. for the next 'beta' release.
>> 
>> Arun
>> 
>> On Feb 26, 2013, at 5:43 PM, Roman Shaposhnik wrote:
>> 
>> > Hi!
>> > 
>> > for the past couple of releases of Hadoop 2.X code line the issue
>> > of integration between Hadoop and its downstream projects has
>> > become quite a thorny issue. The poster child here is Oozie, where
>> > every release of Hadoop 2.X seems to be breaking the compatibility
>> > in various unpredictable ways. At times other components (such
>> > as HBase for example) also seem to be affected.
>> > 
>> > Now, to be extremely clear -- I'm NOT talking about the *latest* version
>> > of Oozie working with the *latest* version of Hadoop, instead
>> > my observations come from running previous *stable*  releases
>> > of Bigtop on top of Hadoop 2.X RCs.
>> > 
>> > As many of you know Apache Bigtop aims at providing a single
>> > platform for integration of Hadoop and Hadoop ecosystem projects.
>> > As such we're uniquely positioned to track compatibility between
>> > different Hadoop releases with regards to the downstream components
>> > (things like Oozie, Pig, Hive, Mahout, etc.). Every single RC
>> > we've been pretty diligent at trying to provide integration-level feedback
>> > on the quality of the upcoming release,  but it seems that our efforts
>> > don't quite suffice in Hadoop 2.X stabilizing.
>> > 
>> > Of course, one could argue that while Hadoop 2.X code line was
>> > designated 'alpha' expecting much in the way of perfect integration
>> > and compatibility was NOT what the Hadoop community was
>> > focusing on. I can appreciate that view, but what I'm interested in
>> > is the future of Hadoop 2.X not its past. Hence, here's my question
>> > to all of you as a Hadoop community at large:
>> > 
>> > Do you guys think that the project has reached a point where integration
>> > and compatibility issues should be prioritized really high on the list
>> > of things that make or break each future release?
>> > 
>> > The good news, is that Bigtop's charter is in big part *exactly* about
>> > providing you with this kind of feedback. We can easily tell you when
>> > Hadoop behavior, with regard to downstream components, changes
>> > between a previous stable release and the new RC (or even branch/trunk).
>> > What we can NOT do is submit patches for all the issues. We are simply
>> > too small a project and we need your help with that.
>> > 
>> > I truly believe that we owe it to the downstream projects, and in the
>> > second half of this email I will try to convince you of that.
>> > 
>> > We all know that integration projects are impossible to pull off
>> > unless there's a general consensus between all of the projects involved
>> > that they indeed need to work with each other. You can NOT force
>> > that notion, but you can always try to influence. This relationship
>> > goes both ways.
>> > 
>> > Consider a question in front of the downstream communities
>> > of  whether or not to adopt Hadoop 2.X as the basis. To answer
>> > that question each downstream project has to be reasonably
>> > sure that their concerns will NOT fall on deaf ears and that
>> > Hadoop developers are, essentially, 'ready' for them to pick
>> > up Hadoop 2.X. I would argue that so far the Hadoop community
>> > had gone out of its way to signal that 2.X codeline is NOT
>> > ready for the downstream.
>> > 
>> > I would argue that moving forward this is a really unfortunate
>> > situation that may end up undermining the long term success
>> > of Hadoop 2.X if we don't start addressing the problem. Think
>> > about it -- 90% of unit tests that run downstream on Apache
>> > infrastructure are still exercising Hadoop 1.X underneath.
>> > In fact, if you were to forcefully make, let's say, HBase's
>> > unit tests run on top of Hadoop 2.X quite a few of them
>> > are going to fail. Hadoop community is, in effect, cutting
>> > itself off from the biggest source of feedback -- its downstream
>> > users. This in turn:
>> > 
>> >   * leaves Hadoop project in a perpetual state of broken
>> >     windows syndrome.
>> > 
>> >   * leaves Apache Hadoop 2.X releases in a state considerably
>> >     inferior to the releases *including* Apache Hadoop done by the
>> >     vendors. The users have no choice but to align themselves
>> >     with vendor offerings if they wish to utilize latest Hadoop functionality.
>> >     The artifact that is known as Apache Hadoop 2.X stopped being
>> >     a viable choice thus fracturing the user community and reducing
>> >     the benefits of a commonly deployed codebase.
>> > 
>> >    * leaves downstream projects of Hadoop  in a jaded state where
>> >      they legitimately get very discouraged and frustrated and eventually
>> >      give up thinking that -- well, we work with one release of Hadoop
>> >      (the stable one Hadoop 1.X) and we shall wait for the Hadoop
>> >      community to get their act together.
>> > 
>> > In my view (shared by quite a few members of the Apache Bigtop) we
>> > can definitely do better than this if we all agree that the proposed
>> > first 'beta' release of Hadoop 2.0.4 is the right time for it to happen.
>> > 
>> > It is about time Hadoop 2.X community wins back all those end users
>> > and downstream projects that got left behind during the alpha
>> > stabilization phase.
>> > 
>> > Thanks,
>> > Roman.
>> 
>> --
>> Arun C. Murthy
>> Hortonworks Inc.
>> http://hortonworks.com/
>> 
>> 


Re: [DISCUSS] stabilizing Hadoop releases wrt. downstream

Posted by Konstantin Boudnik <co...@apache.org>.
Arun,

first of all, I don't think anyone is trying to put the blame on someone
else. E.g. I had a similar experience with Oozie being broken because of
certain released changes in the upstream.

I am sure that most people in the BigTop community - especially those who
share the committer-ship privilege in BigTop and other upstream
projects, including Hadoop - would be happy to help with the
stabilization of the Hadoop base. The issue that a downstream
integration project is likely to have is - for one - the absence of
regularly published development artifacts. In the light of "it didn't
happen if there's no picture" here are a couple of examples:

  - 2.0.2-SNAPSHOT artifacts weren't published at all; only release 2.0.2-alpha artifacts were
  - 2.0.3-SNAPSHOT artifacts weren't published until Feb 29, 2013 (it happened just once)

So, technically speaking, unless an integration project is willing to
build and maintain its own artifacts, it is impossible to do any
preventive validation.
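
One way to make the missing-snapshot problem visible is to check when a
SNAPSHOT was last deployed. The sketch below is only an illustration, not
something proposed in this thread: it assumes the usual Maven repository
maven-metadata.xml layout for snapshot versions, and the sample document,
the is_stale helper and the one-day threshold are all hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta

# Hypothetical sample of a repository maven-metadata.xml for a SNAPSHOT
# version; the values are made up for illustration.
SAMPLE_METADATA = """<metadata>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>2.0.4-SNAPSHOT</version>
  <versioning>
    <snapshot>
      <timestamp>20130305.081500</timestamp>
      <buildNumber>7</buildNumber>
    </snapshot>
    <lastUpdated>20130305081500</lastUpdated>
  </versioning>
</metadata>"""


def snapshot_published_at(metadata_xml):
    """Return the deploy time of the latest snapshot, or None if none is listed."""
    root = ET.fromstring(metadata_xml)
    stamp = root.findtext("versioning/snapshot/timestamp")
    if stamp is None:
        return None
    # Maven writes snapshot timestamps as yyyyMMdd.HHmmss.
    return datetime.strptime(stamp, "%Y%m%d.%H%M%S")


def is_stale(metadata_xml, now, max_age=timedelta(days=1)):
    """True if no snapshot is listed or the latest one is older than max_age."""
    published = snapshot_published_at(metadata_xml)
    return published is None or now - published > max_age
```

A downstream build could run a check like this against the metadata it
fetches and warn before validating against an outdated upstream snapshot.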

Which brings me to my next question: how do you guys address
"Integration is high on the list of *every* release". Again, please
don't get me wrong - I am not looking to lay blame on or corner
anyone - I am really curious and would appreciate the input.


Vinod:

> As you yourself noted later, the pain is part of the 'alpha' status
> of the release. We are targeting one of the immediate future
> releases to be a beta and so these troubles are really only the
> short term.

I don't really want to get into the discussion about what
constitutes an alpha and how it has delayed the adoption of the Hadoop2
line. However, I want to point out that it is especially important for
an "alpha" platform to work nicely with downstream consumers of the said
platform. For quite obvious reasons, I believe.

> I think there is a fundamental problem with the interaction of
> Bigtop with the downstream projects, if nothing else, with

BigTop is as downstream as it can get, because BigTop essentially
consumes all other component releases in order to produce a viable
stack. Technicalities aside...

> Hadoop. We never formalized on the process, will BigTop step in
> after an RC is up for vote or before? As I see it, it's happening

Bigtop essentially can give any component, including Hadoop, and
better yet - the set of components - certain guarantees about
compatibility and dependencies being included. Case in point is
the commons libraries missing from the 1.0.1 release, which essentially
prevented HBase from working properly.

> after the vote is up, so no wonder we are in this state. Shall we
> have a pre-notice to Bigtop so that it can step in before?

The above contradicts the earlier statement that "Integration
is high on the list of *every* release". If BigTop isn't used for
integration testing, then how is said integration testing performed?
Is it some sort of test-patch process, as Luke referred to earlier?  And
why does it leave room for integration issues to go uncaught?
Again, I am genuinely interested to know.

> these short term pains. I'd rather like us swim through these now
> instead of support broken APIs and features in our beta, having seen
> this very thing happen with 1.*.

I think you're conflating integration with downstream and
being in an alpha phase of development. The former isn't about
supporting "broken APIs" - it is about being consistent and avoiding
breaking the downstream applications without first letting said
applications accommodate the platform changes.

Changes in the API, after all, can be traced relatively easily by
integration validation - this is the whole point of integration
testing. And BigTop does the job better than anything around, simply
because there's nothing else around to do it.

If you stay in a shape-shifting "alpha" that doesn't integrate well for
a very long time, you risk losing downstream customers' interest,
because they might get tired of waiting until the next stable API is
ready for them.

> Let's fix the way the release related communication is happening
> across our projects so that we can all work together and make 2.X a
> success.

This is a very good point indeed! Let's start a separate discussion
thread on how we can improve the release model for coming Hadoop
releases, where we - as the community - can provide better guarantees
of the inter-component compatibility (sorry for an overused word).

Cos

On Fri, Mar 01, 2013 at 10:58AM, Arun C Murthy wrote:
> I feel this is being blown out of proportion.
> 
> Integration is high on the list of *every* release. In future, if anyone or
> bigtop wants to help, running integration tests on a hadoop RC and providing
> feedback would be very welcome. I'm pretty sure I will stop an RC if it
> means it breaks Oozie or HBase or Pig or Hive and re-spin it. For e.g.
> see recent efforts to do a 2.0.4-alpha.
> 
> With hadoop-2.0.3-alpha we discovered 3 *bugs* - making it sound like we
> intentionally disregard integration issues is very harsh.
> 
> Please also see other thread where we discussed stabilizing APIs, protocols
> etc. for the next 'beta' release.
> 
> Arun
> 
> On Feb 26, 2013, at 5:43 PM, Roman Shaposhnik wrote:
> 
> > Hi!
> > 
> > for the past couple of releases of Hadoop 2.X code line the issue
> > of integration between Hadoop and its downstream projects has
> > become quite a thorny issue. The poster child here is Oozie, where
> > every release of Hadoop 2.X seems to be breaking the compatibility
> > in various unpredictable ways. At times other components (such
> > as HBase for example) also seem to be affected.
> > 
> > Now, to be extremely clear -- I'm NOT talking about the *latest* version
> > of Oozie working with the *latest* version of Hadoop, instead
> > my observations come from running previous *stable*  releases
> > of Bigtop on top of Hadoop 2.X RCs.
> > 
> > As many of you know Apache Bigtop aims at providing a single
> > platform for integration of Hadoop and Hadoop ecosystem projects.
> > As such we're uniquely positioned to track compatibility between
> > different Hadoop releases with regards to the downstream components
> > (things like Oozie, Pig, Hive, Mahout, etc.). Every single RC
> > we've been pretty diligent at trying to provide integration-level feedback
> > on the quality of the upcoming release,  but it seems that our efforts
> > don't quite suffice in Hadoop 2.X stabilizing.
> > 
> > Of course, one could argue that while Hadoop 2.X code line was
> > designated 'alpha' expecting much in the way of perfect integration
> > and compatibility was NOT what the Hadoop community was
> > focusing on. I can appreciate that view, but what I'm interested in
> > is the future of Hadoop 2.X not its past. Hence, here's my question
> > to all of you as a Hadoop community at large:
> > 
> > Do you guys think that the project has reached a point where integration
> > and compatibility issues should be prioritized really high on the list
> > of things that make or break each future release?
> > 
> > The good news, is that Bigtop's charter is in big part *exactly* about
> > providing you with this kind of feedback. We can easily tell you when
> > Hadoop behavior, with regard to downstream components, changes
> > between a previous stable release and the new RC (or even branch/trunk).
> > What we can NOT do is submit patches for all the issues. We are simply
> > too small a project and we need your help with that.
> > 
> > I truly believe that we owe it to the downstream projects, and in the
> > second half of this email I will try to convince you of that.
> > 
> > We all know that integration projects are impossible to pull off
> > unless there's a general consensus between all of the projects involved
> > that they indeed need to work with each other. You can NOT force
> > that notion, but you can always try to influence. This relationship
> > goes both ways.
> > 
> > Consider a question in front of the downstream communities
> > of  whether or not to adopt Hadoop 2.X as the basis. To answer
> > that question each downstream project has to be reasonably
> > sure that their concerns will NOT fall on deaf ears and that
> > Hadoop developers are, essentially, 'ready' for them to pick
> > up Hadoop 2.X. I would argue that so far the Hadoop community
> > had gone out of its way to signal that 2.X codeline is NOT
> > ready for the downstream.
> > 
> > I would argue that moving forward this is a really unfortunate
> > situation that may end up undermining the long term success
> > of Hadoop 2.X if we don't start addressing the problem. Think
> > about it -- 90% of unit tests that run downstream on Apache
> > infrastructure are still exercising Hadoop 1.X underneath.
> > In fact, if you were to forcefully make, let's say, HBase's
> > unit tests run on top of Hadoop 2.X quite a few of them
> > are going to fail. Hadoop community is, in effect, cutting
> > itself off from the biggest source of feedback -- its downstream
> > users. This in turn:
> > 
> >   * leaves Hadoop project in a perpetual state of broken
> >     windows syndrome.
> > 
> >   * leaves Apache Hadoop 2.X releases in a state considerably
> >     inferior to the releases *including* Apache Hadoop done by the
> >     vendors. The users have no choice but to align themselves
> >     with vendor offerings if they wish to utilize latest Hadoop functionality.
> >     The artifact that is known as Apache Hadoop 2.X stopped being
> >     a viable choice thus fracturing the user community and reducing
> >     the benefits of a commonly deployed codebase.
> > 
> >    * leaves downstream projects of Hadoop  in a jaded state where
> >      they legitimately get very discouraged and frustrated and eventually
> >      give up thinking that -- well, we work with one release of Hadoop
> >      (the stable one Hadoop 1.X) and we shall wait for the Hadoop
> >      community to get their act together.
> > 
> > In my view (shared by quite a few members of the Apache Bigtop) we
> > can definitely do better than this if we all agree that the proposed
> > first 'beta' release of Hadoop 2.0.4 is the right time for it to happen.
> > 
> > It is about time Hadoop 2.X community wins back all those end users
> > and downstream projects that got left behind during the alpha
> > stabilization phase.
> > 
> > Thanks,
> > Roman.
> 
> --
> Arun C. Murthy
> Hortonworks Inc.
> http://hortonworks.com/
> 
> 
