Posted to general@hadoop.apache.org by Todd Lipcon <to...@cloudera.com> on 2012/09/01 10:20:47 UTC

Re: Large feature development

Thanks for starting this thread, Steve. I think your points below are
good. I've snipped most of your comment and will reply inline to one
bit below:

On Fri, Aug 31, 2012 at 10:07 AM, Steve Loughran
<st...@gmail.com> wrote:

> Of the big changes that have worked, they are
>
>
>    1. HDFS 2's HA and ongoing improvements: collaborative dev on the list
>    with incremental changes going on in trunk, RTC with lots of tests. This
>    isn't finished, and the test problem there is that functional testing of
>    all failure modes requires software-controlled fencing devices and switches
>    -and tests to generate the expected failure space.

Actually, most of the HDFS HA code has been done on branches. The
first work that led towards HA was the redesign of the edit logging
infrastructure -- HDFS-1073. This was a feature branch with about 60
patches on it. Then HDFS-1623, the main manual-failover HA
development, had close to 150 patches on the branch. Automatic HA
(HDFS-3042) was some 15-20 patches. The current work (removing the
dependency on NAS) has around 35 patches so far and is getting close to
merge.

In these various branches, we've experimented with a few policies
which have differed from trunk. In particular:
- HDFS-1073 had a "modified review then commit" policy, which was
that, if a patch sat without a review for more than 24hrs, we
committed it with the restriction that there would be a post-commit
review before the branch was merged.
- All of the branches have done away with the requirement of running
the full QA suite, findbugs, etc prior to commit. This means that the
branches at times have broken tests checked in, but also makes it
quicker to iterate on the new feature. Again, the assumption is that
these requirements are met before merge.
- In all cases there has been a design doc and some good design
discussion up front before substantial code was written. This made it
easier to forge ahead on the branch with good confidence that the
community was on-board with the idea.

Given my experiences, I think all of the above are useful to follow.
It means development can happen quickly, but ensures that when the
merge is proposed, people feel like the quality meets our normal
standards.

>    2. YARN: Arun on his own branch, CTR, merge once mostly stable, and
>    completely replacing MRv1.

I'd actually contend that YARN was merged too early. I have yet to see
anyone running YARN in production, and it's holding up the "Stable"
moniker for Hadoop 2.0 -- HDFS-wise we are already quite stable, and
I'm seeing fewer issues among our customers running Hadoop 2 HDFS
compared to those on Hadoop 1-derived code.

>
> How then do we get (a) more dev projects working and integrated by the
> current committers, and (b) a process in which people who are not yet
> contributors/committers can develop non-trivial changes to the project in a
> way that it is done with the knowledge, support and mentorship of the rest
> of the community?

Here's one proposal, making use of git as an easy way to allow
non-committers to "commit" code while still tracking development in
the usual places:
- Upon anyone's request, we create a new "Version" tag in JIRA.
- The developers create an umbrella JIRA for the project, and file the
individual work items as subtasks (either up front, or as they are
developed if using a more iterative model)
- On the umbrella, they add a pointer to a git branch to be used as
the staging area for the branch. As they develop each subtask, they
can use the JIRA to discuss the development like they would with a
normally committed JIRA, but when they feel it is ready to go (not
requiring a +1 from any committer) they commit to their git branch
instead of the SVN repo.
- When the branch is ready to merge, they can call a merge vote, which
requires +1 from 3 committers, same as a branch being proposed by an
existing committer. A committer would then use git-svn to merge their
branch commit-by-commit, or if it is less extensive, simply generate a
single big patch to commit into SVN.

My thinking is that this would provide a low-friction way for people
to collaborate with the community and develop in the open, without
having to work closely with any committer to review every individual
subtask.

Another alternative, if people are reluctant to use git, would be to
add a "sandbox/" repository inside our SVN, and hand out commit bit to
branches inside there without any PMC vote. Anyone interested in
contributing could request a branch in the sandbox, and be granted
access as soon as they get an apache SVN account.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Large feature development

Posted by Arun Murthy <ac...@hortonworks.com>.
Rajiv,

I'm pretty sure you mean '*blame* for certain work, [...],  being
attributed' ... :)

I certainly find blame for failures much more palatable than credit
for collective successes.

IAC, thanks for chiming in; Hadoop will be better with you being more
present at the forefront.

Arun

On Sep 1, 2012, at 2:30 PM, Rajiv Chittajallu <ra...@yahoo-inc.com> wrote:

> It's unfortunate that certain work, a year after being accepted into the mainline, is being attributed to a single person. There is a significant amount of work done by people who are not in the PMC or committers, especially to get it running in production. For those who have been associated with running hadoop before it became synonymous with 'BigData', stabilizing a major release takes time. With more critical systems dependent on hadoop, transitioning to a new feature set takes longer. hadoop-0.20 took ~8 months.
>
>
> IMHO, months after a feature set is accepted into the mainline, it may not be appropriate to question its quality.
>
> In the next couple of months, we are planning to widely deploy the 0.23.3 release by Bobby. As with any major release, I know this is not going to be a smooth ride.
>
> -rajive

Re: Large feature development

Posted by Rajiv Chittajallu <ra...@yahoo-inc.com>.
It's unfortunate that certain work, a year after being accepted into the mainline, is being attributed to a single person. There is a significant amount of work done by people who are not in the PMC or committers, especially to get it running in production. For those who have been associated with running hadoop before it became synonymous with 'BigData', stabilizing a major release takes time. With more critical systems dependent on hadoop, transitioning to a new feature set takes longer. hadoop-0.20 took ~8 months.


IMHO, months after a feature set is accepted into the mainline, it may not be appropriate to question its quality.

In the next couple of months, we are planning to widely deploy the 0.23.3 release by Bobby. As with any major release, I know this is not going to be a smooth ride.

-rajive



Re: Large feature development

Posted by Arun C Murthy <ac...@hortonworks.com>.
On Sep 3, 2012, at 12:05 AM, Arun C Murthy wrote:

> Todd,
> 
> I'll unfair to tag-team me while consistently ignoring what I write. 

Ugh, late Sunday night school-boy error - should have read:

I'll point out it's unfair [...]

Arun

Re: Large feature development

Posted by Arun C Murthy <ac...@hortonworks.com>.
On Sep 3, 2012, at 12:31 AM, Todd Lipcon wrote:

> On Mon, Sep 3, 2012 at 12:05 AM, Arun C Murthy <ac...@hortonworks.com> wrote:
>>> 
>>> But, I'll stand by my point that YARN is at this point more "alpha"
>>> than HDFS2.
>> 
>> I'll unfair to tag-team me while consistently ignoring what I write.
> 
> I'm not sure I ignored what you wrote. I understand that Yahoo is
> deploying soon on one of their clusters. That's great news. My
> original point was about the state of YARN when it was merged, and the
> comment about its current state was more of an aside. Hardly worth
> debating further. Best of luck with the deployment next week - I look
> forward to reading about how it goes on the list.

Everyone +1'ed the merge; now we'd like to rewrite history?
Also, its current state is much more than what you trivialized as 'deployed to one cluster' - again, please read my email on the effort we've undertaken to get where we are. That's a lot of work by many tens of people - hardly good form to trivialize them as you did.

Arun

Re: Large feature development

Posted by Todd Lipcon <to...@cloudera.com>.
On Mon, Sep 3, 2012 at 12:05 AM, Arun C Murthy <ac...@hortonworks.com> wrote:
>>
>> But, I'll stand by my point that YARN is at this point more "alpha"
>> than HDFS2.
>
> I'll unfair to tag-team me while consistently ignoring what I write.

I'm not sure I ignored what you wrote. I understand that Yahoo is
deploying soon on one of their clusters. That's great news. My
original point was about the state of YARN when it was merged, and the
comment about its current state was more of an aside. Hardly worth
debating further. Best of luck with the deployment next week - I look
forward to reading about how it goes on the list.

>> You brought up two bugs in the HDFS2 code base as examples
>> of HDFS 2 not being high quality.
>
> Through a lot of words you just agreed with what I said - if people didn't upgrade to HDFS2 (not just HA) they wouldn't hit any of these: HDFS-3626,

You could hit this on Hadoop 1; it was just harder to hit.

> HDFS-3731 etc.

The details of this bug have to do with the upgrade/snapshot behavior
of the blocksBeingWritten directory which was added in branch-1. In
fact, the same basic bug continues to exist in branch-1. If you
perform an upgrade, it doesn't hard-link the blocks into the new
"current" directory. Hence, if the upgraded cluster exits safe mode
(causing lease recovery of those blocks), and then the user issues a
rollback, the blocks will have been deleted from the pre-upgrade
image. This broken branch-1 behavior carried over into branch-2 as
well, but it's not a new bug, as I said before.
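
To make the hard-link mechanics concrete: during an upgrade, HDFS preserves the pre-upgrade image by hard-linking block files into the new "current" directory, so a rollback can find the same on-disk data. Below is a toy sketch of that idea in plain Java -- it is not HDFS code, and the directory names and "blk_*" glob are invented for illustration:

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class HardLinkSketch {
      // Toy illustration: link every block file from the pre-upgrade snapshot
      // into the new "current" directory so that both layouts reference the
      // same bytes and a rollback can still find the data.
      static void linkBlocks(Path preUpgradeDir, Path currentDir) throws IOException {
        Files.createDirectories(currentDir);
        try (DirectoryStream<Path> blocks =
                 Files.newDirectoryStream(preUpgradeDir, "blk_*")) {
          for (Path block : blocks) {
            // A hard link is one set of bytes with two directory entries, so
            // deleting the file under one name does not remove the data.
            Files.createLink(currentDir.resolve(block.getFileName()), block);
          }
        }
      }
    }

The bug described above is the absence of this step for the blocksBeingWritten directory during upgrade.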

> There are more, for e.g. how do folks work around Secondary NN not starting up on upgrades from hadoop-1 (HDFS-3597)? They just copy multiple PBs over to a new hadoop-2 cluster, or patch SNN themselves post HDFS-1073?

No, they rm -Rf the contents of the 2NN directory, which is completely
safe and doesn't cause data loss in any way. In fact, the bug fix is exactly
that -- it just does the rm -Rf itself, automatically. It's a trivial
workaround similar to how other bugs in the Hadoop 1 branch have
required workarounds in the past. Certainly no data movement or local
patching. The SNN is transient state and can always be cleared.

If you have any questions about other bugs in the 2.x line, feel free
to ask on the relevant JIRAs. I'm still perfectly confident in the
stability of HDFS 2 vs HDFS 1. In fact my cell phone is likely the one
that would ring if any of these production HDFS 2 clusters had an
issue, and I'll offer the same publicly to anyone on this list. If you
experience a corruption or data loss issue on the tip of branch-2
HDFS, email me off-list and I'll personally diagnose the issue. I
would not make that same offer for branch-1 due to the fundamentally
less robust design which has caused a lot of subtle bugs over the past
several years.

Thanks
-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Large feature development

Posted by Arun C Murthy <ac...@hortonworks.com>.
Todd,

On Sep 2, 2012, at 6:12 PM, Todd Lipcon wrote:

> First, let me apologize if my email came off as a personal "snipe"
> against the project or anyone working on it. I know the team has been
> hard at work for multiple years now on the project, and I certainly
> don't mean to denigrate the work anyone has done. 
> 
> But, I'll stand by my point that YARN is at this point more "alpha"
> than HDFS2.

I'll unfair to tag-team me while consistently ignoring what I write. 
(We are also in danger of hitting the threefold repetition rule: http://en.wikipedia.org/wiki/Threefold_repetition. *smile*)


Anyway, I'll repeat, here are the facts on the ground - the work we've done testing/stabilizing YARN/MRv2, its stability, user certification across thousands of unique apps, deployment, etc.: http://s.apache.org/QVX

> You brought up two bugs in the HDFS2 code base as examples
> of HDFS 2 not being high quality.

Through a lot of words you just agreed with what I said - if people didn't upgrade to HDFS2 (not just HA) they wouldn't hit any of these: HDFS-3626, HDFS-3731, etc. There are more; for example, how do folks work around the Secondary NN not starting up on upgrades from hadoop-1 (HDFS-3597)? They just copy multiple PBs over to a new hadoop-2 cluster, or patch the SNN themselves post HDFS-1073?

Anyway, I agree, we should talk about this in the context of an actual release - hadoop-2.1.0 should mark YARN as *beta* IMO, particularly since it will be deployed at scale.

Arun



Re: Large feature development

Posted by Todd Lipcon <to...@cloudera.com>.
Hey Arun,

First, let me apologize if my email came off as a personal "snipe"
against the project or anyone working on it. I know the team has been
hard at work for multiple years now on the project, and I certainly
don't mean to denigrate the work anyone has done. I also agree that
the improvements made possible by YARN are tremendously important, and
I've expressed this opinion both online and in interviews with
analysts, etc.

But, I'll stand by my point that YARN is at this point more "alpha"
than HDFS2. You brought up two bugs in the HDFS2 code base as examples
of HDFS 2 not being high quality. The first, HDFS-3626, was indeed a
messy bug, but had nothing to do with HA, the edit log rewrite, or any
of the other changes being discussed in this thread. In fact, the bug
has been there since the "beginning of time", and is present
in Hadoop 1.0.x as well (which is why the JIRA is still open). You
simply need to pass a non-canonicalized path to the Path(URI)
constructor, and you'll see the same behavior in every release,
including 1.0.x, 0.20.x, or earlier. The reason it shows up more often
in Hadoop 2 is actually the FsShell rewrite -- not any changes
in HDFS itself, and certainly not related to HA like you've implied
here.
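
To make the scenario concrete, here is a hypothetical snippet (not taken from the JIRA) showing the kind of call being described; the URI, host, and file names are invented for illustration:

    import java.net.URI;
    import org.apache.hadoop.fs.Path;

    public class NonCanonicalPathExample {
      public static void main(String[] args) throws Exception {
        // A hypothetical path that is not in canonical form (note the empty
        // segment produced by the double slash). The exact inputs that
        // triggered HDFS-3626 are described on the JIRA itself.
        URI uri = new URI("hdfs://namenode:8020//user/todd/part-00000");
        // Path(URI) is the constructor referred to above as the route by
        // which a non-canonicalized path can reach code expecting canonical paths.
        Path p = new Path(uri);
        System.out.println(p);
      }
    }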

The other bug causes blocksBeingWritten to disappear upon upgrade.
This, also, had nothing to do with any of the features being discussed
in this thread, and in fact only impacts a cluster which is taken down
_uncleanly_ prior to an upgrade. Upon starting the upgraded cluster,
the user would be alerted to the missing blocks and could rollback
with no lost data. So, while it should be fixed (and has been), I
wouldn't consider it particularly frightening. Most users I am aware
of do a "clean" shutdown of services like HBase before trying to
upgrade their cluster, and, worst case, they would see the issue
immediately after the upgrade and perform a rollback with no adverse
effects.

In branch-1, however, I've seen other bugs that I'd consider much more
scary. Two in particular come to mind and together represent the vast
majority of cases in which we've seen customers experience data
corruption: HDFS-3652 and HDFS-2305. These two bugs were branch-1
only, and never present in Hadoop 2 due to the "edit log rewrite"
project (HDFS-1073).

So, at the risk of this thread just becoming a laundry list of bugs that
have existed in HDFS, or a list of bugs in YARN, I'll summarize: I
still think that YARN is "alpha" and HDFS 2 is at least as "stable" as
Hadoop 1.0. We have customers running it for production workloads, in
multi-rack clusters, with great success. But this has nothing to do
with this thread at hand, so I'll raise the question of
alpha/beta/stable labeling in the context of our next release vote,
and hope we can go back to the more fruitful discussion of how to
encourage large feature development while maintaining stability.

Thanks
-Todd

-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Large feature development (YARN vs HDFS)

Posted by Arun C Murthy <ac...@hortonworks.com>.
Agreed... it does seem like a case of 'my wife is prettier'.

Maybe I'm oversensitive, and it may even be understandable given how much of my waking time I've devoted to YARN over the last 30 months; but I do apologize for indulging in the behavior I accused others of. A good night's sleep does help in clearing mists. IAC, the point I was trying to quantify is simple - the current state of YARN is far better than the way it was being characterized here.

We should get back to discussing 'large-feature development' - thanks for starting that discussion Steve.

Arun

On Sep 3, 2012, at 2:30 PM, Eric Baldeschwieler wrote:

> 
> Referring back to Chris M.'s thread, this YARN vs HDFS discussion sounds a lot like an umbrella project issue to me.

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/



Re: Large feature development (YARN vs HDFS)

Posted by Eric Baldeschwieler <er...@hortonworks.com>.
Referring back to Chris M.'s thread, this YARN vs HDFS discussion sounds a lot like an umbrella project issue to me.



Re: Large feature development

Posted by Arun Murthy <ac...@hortonworks.com>.
Eli,

On Sep 2, 2012, at 1:01 PM, Eli Collins <el...@cloudera.com> wrote:

> On Sat, Sep 1, 2012 at 12:47 PM, Arun C Murthy <ac...@hortonworks.com> wrote:
>> Todd,
>>
>> On Sep 1, 2012, at 1:20 AM, Todd Lipcon wrote:
>>
>>> I'd actually contend that YARN was merged too early. I have yet to see
>>> anyone running YARN in production, and it's holding up the "Stable"
>>> moniker for Hadoop 2.0 -- HDFS-wise we are already quite stable and
>>> I'm seeing fewer issues in our customers running Hadoop HDFS 2
>>> compared to Hadoop 1-derived code.
>>
>> You know I respect you a ton, but I'm very saddened to see you perpetuate this FUD on our public lists. I expected better, particularly when everyone is working towards the same goals of advancing Hadoop-2. This sniping on other members doing work is, um, I'll just stop here rather than regret later.
> 2. HDFS is more mature than YARN. Not a surprise given that we all
> agree YARN is alpha, and a much newer project than HDFS that hasn't
> yet been deployed in production environments yet (to my knowledge).

Let's focus on the ground reality here.

Please read my (or Rajiv's) message again about YARN's current
stability, how much it has baked, and its deployment plans for a very
large cluster in a few *days*. Or, talk to the people developing,
testing and supporting these customers and clusters.

I'll repeat - YARN has clearly baked much more than HDFS HA given
the basic bugs (upgrade, edit log corruption, etc.) we've seen after
being declared *done*; but then we just disagree since clearly I'm
more conservative. Also, we need to be more conservative wrt HDFS -
but then what would I know...

I'll admit it's hard to discuss with someone (or a collective) who
just repeat themselves. Plus, I broke my own rule about email this
weekend - so, I'll try harder.

Arun

Re: Large feature development

Posted by Eli Collins <el...@cloudera.com>.
On Sat, Sep 1, 2012 at 12:47 PM, Arun C Murthy <ac...@hortonworks.com> wrote:
> Todd,
>
> On Sep 1, 2012, at 1:20 AM, Todd Lipcon wrote:
>
>> I'd actually contend that YARN was merged too early. I have yet to see
>> anyone running YARN in production, and it's holding up the "Stable"
>> moniker for Hadoop 2.0 -- HDFS-wise we are already quite stable and
>> I'm seeing fewer issues in our customers running Hadoop HDFS 2
>> compared to Hadoop 1-derived code.
>
> You know I respect you a ton, but I'm very saddened to see you perpetuate this FUD on our public lists. I expected better, particularly when everyone is working towards the same goals of advancing Hadoop-2. This sniping on other members doing work is, um, I'll just stop here rather than regret later.

Todd is just saying that:

1. HDFS v2 has fewer critical bugs than v1  (mostly thanks to the edit
log rewrite, which aside from HA was motivated by all the quality
issues the v1 code has had)

2. HDFS is more mature than YARN. Not a surprise given that we all
agree YARN is alpha, and a much newer project than HDFS that hasn't
yet been deployed in production environments (to my knowledge).

I don't read this as a snipe against anyone coding on Hadoop; it's
just that the two sub-projects are at different stages in their life
and development.

Thanks,
Eli

Re: Large feature development

Posted by Arun C Murthy <ac...@hortonworks.com>.
Todd,

On Sep 1, 2012, at 1:20 AM, Todd Lipcon wrote:

> I'd actually contend that YARN was merged too early. I have yet to see
> anyone running YARN in production, and it's holding up the "Stable"
> moniker for Hadoop 2.0 -- HDFS-wise we are already quite stable and
> I'm seeing fewer issues in our customers running Hadoop HDFS 2
> compared to Hadoop 1-derived code.

You know I respect you a ton, but I'm very saddened to see you perpetuate this FUD on our public lists. I expected better, particularly when everyone is working towards the same goals of advancing Hadoop-2. This sniping on other members doing work is, um, I'll just stop here rather than regret later.

I'm pretty sure you realize this (we've talked about this privately), yet, for other users who might not be aware:
# YARN has been deployed, by almost everyone's standards, on a very LARGE ~450 node cluster for 6 months now at Yahoo.
# The entire YARN & MapReduce developer community has done an enormous amount of testing, compatibility work and performance work for many months now. It's been clear that YARN/MRv2 is superior to MR1 on every dimension - performance (2x in several cases), scale, etc. - all dimensions which are critical for Hadoop's success in the past and future.
# Not just MR, this work has been done across the stack - Pig, Oozie, HCatalog etc. This has been an enormous amount of work not just by YARN/MRv2, but by all these communities.
# Many thousands of unique end-user applications at Yahoo have *certified* YARN/MRv2. That is pretty much *all* MapReduce, Pig etc. applications at Yahoo - the most advanced Hadoop deployment in the world.
# It is now *days* away from being deployed on one of the largest and most demanding Hadoop clusters in the world with several *thousand* nodes and millions of applications per month. See Bobby's note if you don't believe me.

Notice, I didn't talk about any of the other benefits of YARN, such as frameworks other than MR - you'll see more of this, such as real-time applications on Hadoop clusters, over the next many months. For example, see discussions on the Storm/S4 lists about YARN prototypes at various stages of availability.

Paying you back in the same coin: after being declared *done*, HDFS2 had several BASIC issues such as a non-working upgrade from hadoop-1 (HDFS-3731, HDFS-3579) or edit-log corruption (HDFS-3626). Maybe you or the customers you talk about don't care about it, whatever. For example, is the QJM work part of stable HDFS2? It's not even code complete yet.

IAC, it's pretty obvious we have different standards for declaring HDFS stable v/s YARN/MRv2 stable. The standards I'm used to, having been around since the dawn of this project, are what I use to measure stability, i.e. deployed and stable for weeks/months on some of the largest Hadoop clusters in the world before letting it loose on other 'customers'.

Given that upgrade-failures or data-corruption is acceptable, is YARN 'stable'? By the same standards - YES! - for many months now, much before HDFS HA was even code complete!

I don't want to engage in a debate on this further or expect you to care about YARN/MRv2, but please, for heaven's sake, do not publicly diss the work so many people have done for many, many months now, or accuse them of *holding up Hadoop* - it's very poor form.

I'm very proud to have contributed to this effort, even more to have worked with such a talented and dedicated bunch. An acknowledgement would be nice, but the least I/we *do* expect is an absence of public sniping by other members of the Hadoop community.

respectfully,
Arun


Re: Large feature development

Posted by Eli Collins <el...@cloudera.com>.
On Sun, Sep 2, 2012 at 7:58 AM, Steve Loughran <st...@gmail.com> wrote:
> On 1 September 2012 09:20, Todd Lipcon <to...@cloudera.com> wrote:
>
>> Thanks for starting this thread, Steve. I think your points below are
>> good. I've snipped most of your comment and will reply inline to one
>> bit below:
>>
>> On Fri, Aug 31, 2012 at 10:07 AM, Steve Loughran
>> <st...@gmail.com> wrote:
>>
>>
>> >
>> > How then do we get (a) more dev projects working and integrated by the
>> > current committers, and (b) a process in which people who are not yet
>> > contributors/committers can develop non-trivial changes to the project
>> in a
>> > way that it is done with the knowledge, support and mentorship of the
>> rest
>> > of the community?
>>
>>
> Both HDFS2 and MRv2 are in trunk, therefore I consider them successes.
>
>
>> Here's one proposal, making use of git as an easy way to allow
>> non-committers to "commit" code while still tracking development in
>> the usual places:
>>
>
> This is effectively what people do. I'm less worried about the code side of
> things than the integration and mentoring
>
>
>> - Upon anyone's request, we create a new "Version" tag in JIRA.
>>
>
> -1. There are enough versions. There is a "tag" field in JIRA for precisely
> this purpose
>
>
>> - The developers create an umbrella JIRA for the project, and file the
>> individual work items as subtasks (either up front, or as they are
>> developed if using a more iterative model)
>>
>
> as today
>
>
>> - On the umbrella, they add a pointer to a git branch to be used as
>> the staging area for the branch. As they develop each subtask, they
>> can use the JIRA to discuss the development like they would with a
>> normally committed JIRA, but when they feel it is ready to go (not
>> requiring a +1 from any committer) they commit to their git branch
>> instead of the SVN repo.
>>
>
> some integration w/ jenkins and pull testing would be good here
>
>
>> - When the branch is ready to merge, they can call a merge vote, which
>> requires +1 from 3 committers, same as a branch being proposed by an
>> existing committer. A committer would then use git-svn to merge their
>> branch commit-by-commit, or if it is less extensive, simply generate a
>> single big patch to commit into SVN.
>>
>> My thinking is that this would provide a low-friction way for people
>> to collaborate with the community and develop in the open, without
>> having to work closely with any committer to review every individual
>> subtask.
>>
>> Another alternative, if people are reluctant to use git, would be to
>> add a "sandbox/" repository inside our SVN, and hand out commit bit to
>> branches inside there without any PMC vote. Anyone interested in
>> contributing could request a branch in the sandbox, and be granted
>> access as soon as they get an apache SVN account.
>>
>>
> I don't see the technical issues with how the merge is done as the main
> problem.
>
> The barriers to getting your stuff in are
> 1. getting people to care enough to help develop the feature -mentorship,
> collaborative development.
> 2. getting incremental parts in to avoid the continual
> merge-regression-test hell that you go through if you are trying to keep a
> separate branch alive. It's not the technical aspects of the merge so much
> as the need to run all the hadoop tests and your own test suite, and track
> down whether a failure is a regression in -trunk or something in your code.
>
> Jun's patch is an example of this situation. We haven't seen the effort he
> and his colleagues have done with merge and test, but I'm confident it's
> been there. What they now have is a "big bang" class of patch which is so
> big that anyone reviewing it would have to spend a couple of weeks going
> through the codebase trying to understand it. Which as we all know means
> two weeks not doing all the things you are committed to doing.
>
> We know it's there, we know it's current -so how to use this as an exercise
> in something to pull in incrementally?

Jun's patches from HADOOP-8468 (which were developed on a private
github repo) are being pulled into trunk incrementally; there's no
feature branch (which I think would have been a better route, but at
least the current approach has not prevented some progress).

All the recent examples of features that I can think of that have been
developed upstream first at Apache on feature branches have gone well.

Thanks,
Eli

Re: Large feature development

Posted by Steve Loughran <st...@gmail.com>.
On 1 September 2012 09:20, Todd Lipcon <to...@cloudera.com> wrote:

> Thanks for starting this thread, Steve. I think your points below are
> good. I've snipped most of your comment and will reply inline to one
> bit below:
>
> On Fri, Aug 31, 2012 at 10:07 AM, Steve Loughran
> <st...@gmail.com> wrote:
>
>
> >
> > How then do we get (a) more dev projects working and integrated by the
> > current committers, and (b) a process in which people who are not yet
> > contributors/committers can develop non-trivial changes to the project
> in a
> > way that it is done with the knowledge, support and mentorship of the
> rest
> > of the community?
>
>
Both HDFS2 and MRv2 are in trunk; therefore I consider them successes.


> Here's one proposal, making use of git as an easy way to allow
> non-committers to "commit" code while still tracking development in
> the usual places:
>

This is effectively what people do. I'm less worried about the code side of
things than the integration and mentoring


> - Upon anyone's request, we create a new "Version" tag in JIRA.
>

-1. There are enough versions. There is a "tag" field in JIRA for precisely
this purpose


> - The developers create an umbrella JIRA for the project, and file the
> individual work items as subtasks (either up front, or as they are
> developed if using a more iterative model)
>

as today


> - On the umbrella, they add a pointer to a git branch to be used as
> the staging area for the branch. As they develop each subtask, they
> can use the JIRA to discuss the development like they would with a
> normally committed JIRA, but when they feel it is ready to go (not
> requiring a +1 from any committer) they commit to their git branch
> instead of the SVN repo.
>

some integration w/ jenkins and pull testing would be good here


> - When the branch is ready to merge, they can call a merge vote, which
> requires +1 from 3 committers, same as a branch being proposed by an
> existing committer. A committer would then use git-svn to merge their
> branch commit-by-commit, or if it is less extensive, simply generate a
> single big patch to commit into SVN.
>
> My thinking is that this would provide a low-friction way for people
> to collaborate with the community and develop in the open, without
> having to work closely with any committer to review every individual
> subtask.
>
> Another alternative, if people are reluctant to use git, would be to
> add a "sandbox/" repository inside our SVN, and hand out commit bit to
> branches inside there without any PMC vote. Anyone interested in
> contributing could request a branch in the sandbox, and be granted
> access as soon as they get an apache SVN account.
>
>
I don't see the technical issues with how the merge is done as the main
problem.

The barriers to getting your stuff in are
1. getting people to care enough to help develop the feature - mentorship,
collaborative development.
2. getting incremental parts in to avoid the continual
merge-regression-test hell that you go through if you are trying to keep a
separate branch alive. It's not the technical aspects of the merge so much
as the need to run all the hadoop tests and your own test suite, and track
down whether a failure is a regression in -trunk or something in your code.

Jun's patch is an example of this situation. We haven't seen the effort he
and his colleagues have put into merging and testing, but I'm confident it's
been there. What they now have is a "big bang" class of patch which is so
big that anyone reviewing it would have to spend a couple of weeks going
through the codebase trying to understand it -- which, as we all know, means
two weeks not doing all the things you are committed to doing.

We know it's there, we know it's current - so how to use this as an exercise
in something to pull in incrementally?

-Steve