Posted to general@hadoop.apache.org by Matt Foley <mf...@hortonworks.com> on 2011/09/07 12:23:42 UTC

0.20.205 Sustaining Release branch plan and content plan

Hi all,
Over the past week a number of people have provided input for patches they
would like to see in 205, with reasons and risk evaluations; please see the
threads "Content request for 0.20.205 Sustaining Release" and
"Add Append-HBase support in upcoming 20.205".
Thanks to all who took the effort to share this information with the list.

The various patches are grouped below, in numeric order for ease of review.
My proposed plan for creating branch-0.20.205 is at the end of this message.

Comparing the requests with the patches currently in
branch-0.20-security, we have the following:

1. THESE PATCHES ARE ALREADY IN 20-security AND ARE REQUESTED FOR
INCLUSION IN 205:

    HADOOP-6833. IPC leaks call parameters when exceptions thrown.
      (Todd Lipcon via eli)
    HADOOP-6889. Make RPC to have an option to timeout - backport to
      0.20-security. Unit tests updated to 17/Aug/2011 version.
      (John George and Ravi Prakash via mattf)
    HADOOP-7314. Add support for throwing UnknownHostException when a
      host doesn't resolve. Needed for MAPREDUCE-2489.
      (Jeffrey Naisbitt via mattf)
    HADOOP-7432. Back-port HADOOP-7110 to 0.20-security: Implement
      chmod in NativeIO library. (Sherry Chen via mattf)
    HADOOP-7472. RPC client should deal with IP address change.
      (Kihwal Lee via suresh)
    HADOOP-7539. merge hadoop archive goodness from trunk to .20
      (John George via mahadev)
    HDFS-0142. Blocks that are being written by a client are stored in
      the blocksBeingWritten directory.
      (Dhruba Borthakur, Nicolas Spiegelberg, Todd Lipcon via dhruba)
    HDFS-0200. Support append and sync for hadoop 0.20 branch. (dhruba)
    HDFS-0561. Fix write pipeline READ_TIMEOUT. (Todd Lipcon via dhruba)
    HDFS-0606. Fix ConcurrentModificationException in
      invalidateCorruptReplicas. (Todd Lipcon via dhruba)
    HDFS-0630. Client can exclude specific nodes in the write
      pipeline. (Nicolas Spiegelberg via dhruba)
    HDFS-0724. Use a bidirectional heartbeat to detect stuck
      pipeline. (hairong)
    HDFS-0826. Allow a mechanism for an application to detect that
      datanode(s) have died in the write pipeline. (dhruba)
    HDFS-0895. Allow hflush/sync to occur in parallel with new writes
      to the file. (Todd Lipcon via hairong)
    HDFS-0988. Fix bug where savenameSpace can corrupt edits log.
      (Nicolas Spiegelberg via dhruba)
    HDFS-1054. remove sleep before retry for allocating a block.
      (Todd Lipcon via dhruba)
    HDFS-1057. Concurrent readers hit ChecksumExceptions if following
      a writer to very end of file (Sam Rash via dhruba)
    HDFS-1118. Fix socketleak on DFSClient. (Zheng Shao via dhruba)
    HDFS-1141. completeFile does not check lease ownership.
      (Todd Lipcon via dhruba)
    HDFS-1164. TestHdfsProxy is failing. (Todd Lipcon)
    HDFS-1202. DataBlockScanner throws NPE when updated before
      initialized. (Todd Lipcon)
    HDFS-1204. Lease expiration should recover single files, not
      entire lease holder (Sam Rash via dhruba)
    HDFS-1210. DFSClient should log exception when block recovery
      fails. (Todd Lipcon via dhruba)
    HDFS-1211. Block receiver should not log "rewind" packets at INFO
      level. (Todd Lipcon)
    HDFS-1346. DFSClient receives out of order packet ack. (hairong)
    HDFS-1520. Lightweight NameNode operation recoverLease to trigger
      lease recovery. (Hairong Kuang via dhruba)
    HDFS-1554. New semantics for recoverLease. (hairong)
    HDFS-1555. Disallow pipeline recovery if a file is already being
      lease recovered. (hairong)
    HDFS-1836. Thousand of CLOSE_WAIT socket. Contributed by Todd
      Lipcon, ported to security branch by Bharath Mundlapudi.
      (via mattf)
    HDFS-2053. Bug in INodeDirectory#computeContentSummary warning
      (Michael Noll via eli)
    HDFS-2117. DiskChecker#mkdirsWithExistsAndPermissionCheck may
      return true even when the dir is not created. (eli)
    HDFS-2190. NN fails to start if it encounters an empty or
      malformed fstime file. (atm)
    HDFS-2202. Add a new DFSAdmin command to set balancer bandwidth of
      datanodes without restarting. (Eric Payne via szetszwo)
    MAPREDUCE-2187. Reporter sends progress during sort/merge.
      (Anupam Seth via acmurthy)
    MAPREDUCE-2324. Removed usage of broken
      ResourceEstimator.getEstimatedReduceInputSize to check against
      usable disk-space on TaskTracker. (Robert Evans via acmurthy)
    MAPREDUCE-2489. Jobsplits with random hostnames can make the
      queue unusable. (Jeffrey Naisbitt via mahadev)
    MAPREDUCE-2494. Make the distributed cache delete entries using
      LRU priority (Robert Joseph Evans via mahadev)
    MAPREDUCE-2650. back-port MAPREDUCE-2238 to 0.20-security.
      (Sherry Chen via mahadev)
    MAPREDUCE-2705. Implements launch of multiple tasks concurrently.
      (Thomas Graves via ddas)
    MAPREDUCE-2729. Ensure jobs with reduces which can't be launched
      due to slow-start do not count for user-limits.
      (Sherry Chen via acmurthy)
    MAPREDUCE-2780. Use a utility method to set service in token.
      (Daryn Sharp via jitendra)
    MAPREDUCE-2852. Jira for YDH bug 2854624. (Kihwal Lee via eli)

2. THESE PATCHES ARE ALREADY IN 20-security BUT NO ONE HAS YET SPOKEN FOR
INCLUDING THEM IN 205:

    HADOOP-7400. Fix HdfsProxyTests fails when the -Dtest.build.dir
      and -Dbuild.test is set a dir other than build dir (gkesavan).
    HADOOP-7594. Support HTTP REST in HttpServer. (szetszwo)
    HADOOP-7596. Makes packaging of 64-bit jsvc possible. Has other
      bug fixes to do with packaging. (Eric Yang via ddas)
    HDFS-1207. FSNamesystem.stallReplicationWork should be volatile.
      (Todd Lipcon via dhruba)
    HDFS-2259. DN web-UI doesn't work with paths that contain html. (eli)
    HDFS-2309. TestRenameWhileOpen fails. (jitendra)
    MAPREDUCE-7343. Make the number of warnings accepted by test-patch
      configurable to limit false positives. (Thomas Graves via cdouglas)

3. THESE PATCHES ARE REQUESTED FOR INCLUSION IN 205, BUT ARE NOT YET
IN 20-security:

Additional append issues (proponents Todd and Suresh):
HADOOP-6722   Workaround a TCP spec quirk by not allowing
NetUtils.connect to connect to itself
HDFS-0611     Heartbeats times from Datanodes increase when there are
plenty of blocks to delete
HDFS-0915     Write pipeline hangs for too long when ResponseProcessor
hits timeout
HDFS-1056     Multi-node RPC deadlocks during block recovery
HDFS-1122     Don't allow client verification to prematurely add
inprogress blocks to DataBlockScanner
HDFS-1186     0.20: DNs should interrupt writers at start of recovery
HDFS-1197     Blocks are considered "complete" prematurely after
commitBlockSynchronization or DN restart
HDFS-1218     20 append: Blocks recovered on startup should be treated
with lower priority during block synchronization
HDFS-1242     0.20 append: Add test for appendFile() race solved in HDFS-142
HDFS-1247     Improvements to HDFS-1204 test
HDFS-1248     Misc cleanup/logging improvements for branch-20-append
HDFS-1252     TestDFSConcurrentFileOperations broken in 0.20-append
HDFS-1254     Support append/sync via the default configuration.
HDFS-1260     0.20: Block lost when multiple DNs trying to recover it
to different genstamps
HDFS-1262     Failed pipeline creation during append leaves lease hanging on NN
HDFS-1264     0.20: OOME in HDFS client made an unrecoverable HDFS block
HDFS-1266     Missing license headers in branch-20-append
HDFS-1779     After NameNode restart , Clients can not read partial
files even after client invokes Sync.
HDFS-2300     TestFileAppend4 and TestMultiThreadedSync fail on 20.append

Other issues (with proponents' names):
Suresh       HADOOP-7119     add Kerberos HTTP SPNEGO authentication support
                             to Hadoop JT/NN/DN/TT web-consoles
Nathan       HADOOP-7510     Tokens should use original hostname provided
                             instead of ip
John George  HADOOP-7602     wordcount, sort etc on har files fails with NPE
Nathan       HDFS-2257       HftpFileSystem should implement
                             GetDelegationTokens
Arun/Matei   MAPREDUCE-0551  Add preemption to the fair scheduler
Arun/Matei   MAPREDUCE-0706  Support for FIFO pools in the fair scheduler
Arun/Matei   and other FairScheduler-related items
Venu         MAPREDUCE-2237  Lost heartbeat response containing MapTask
                             throws NPE when it is resent
Venu         MAPREDUCE-2264  Job status exceeds 100% in some cases
Bharath      MAPREDUCE-2413  TaskTracker should handle disk failures at
                             both startup and runtime
Bharath      MAPREDUCE-2415  Distribute TaskTracker userlogs onto multiple
                             disks
Venu         MAPREDUCE-2549  Potential resource leaks in HadoopServer.java,
                             RunOnHadoopWizard.java and Environment.java
Joep         MAPREDUCE-2610  Inconsistent API
                             JobClient.getQueueAclsForCurrentUser
Nathan       MAPREDUCE-2621  TestCapacityScheduler fails with Queue q1 does
                             not exist
Nathan       MAPREDUCE-2651  Race condition in Linux Task Controller for
                             job log directory creation
Nathan       MAPREDUCE-2764  Fix renewal of dfs delegation tokens
Joep         MAPREDUCE-2779  JobSplitWriter.java can't handle large
                             job.split file
Nathan       MAPREDUCE-2915  LinuxTaskController does not work when
                             JniBasedUnixGroupsNetgroupMapping or
                             JniBasedUnixGroupsMapping is enabled

Obviously plenty of material has accumulated while 204 was being stabilized.
I would like to proceed with 205 relatively quickly.  I plan to create the
release branch (branch-0.20.205) this weekend, 10 September.

The items in group 1 are acceptable to me and will be included.

The items in group 2 are acceptable for inclusion, but someone needs to speak
up for them in the next three days.  I'm not going to include anything that
nobody wants!

The items in group 3 will be the subject of further discussion between their
proponents and myself.  If they can be committed WITH UNIT TESTS AND
APPROPRIATE LEVELS OF TESTING, I'm still willing to see them in 205.  Most of
the items in the first sub-group (additional Append issues) have in fact been
well-tested in CDH releases, and would be valuable to include in 205; it's
just a matter of getting the existing patches committed.  About half of them
seem to need no changes, the other half may need to be re-based.

Again, I plan to create the release branch this weekend.  Thanks for
everyone's contributions.

--Matt

Re: 0.20.205 Sustaining Release branch plan and content plan

Posted by Eric Baldeschwieler <er...@hortonworks.com>.
Let me give a shout out to all the folks that are helping to make this work.  20.205 has gotten a lot of things done in a very short period of time and it's exciting to see such a diverse group of folks pushing together to drive this forward!  I'm looking forward to seeing security and HBase run together!  And the new HDFS HTTP APIs are going to open up a lot of interesting possibilities!

Thanks all!

E14

On Sep 9, 2011, at 5:11 PM, Matt Foley wrote:

> [REMINDER:  branch-0.20.205 will be created this weekend.]
> 
> Regarding "group 2" of the planning message, there were seven orphan
> patches, already in 0.20-security, but not yet spoken for in 205:
> 
>>   HADOOP-7400. Fix HdfsProxyTests fails when the -Dtest.build.dir
>>     and -Dbuild.test is set a dir other than build dir (gkesavan).
>>   HADOOP-7594. Support HTTP REST in HttpServer.  (szetszwo)
>>   HADOOP-7596. Makes packaging of 64-bit jsvc possible. Has other
>>     bug fixes to do with packaging. (Eric Yang via ddas)
>>   HDFS-1207. FSNamesystem.stallReplicationWork should be volatile.
>>     (Todd Lipcon via dhruba)
>>   HDFS-2259. DN web-UI doesn't work with paths that contain html. (eli)
>>   HDFS-2309. TestRenameWhileOpen fails. (jitendra)
>>   MAPREDUCE-7343. Make the number of warnings accepted by test-patch
>>     configurable to limit false positives. (Thomas Graves via cdouglas)
>>
> 
> Here is their disposition:
> 
> HADOOP-7400 and HADOOP-7596 are build/package infrastructure issues.  I need
> them for the release, so they will be included.
> 
> HDFS-1207 is needed for append, is requested by Suresh, and will be
> included.
> 
> HDFS-2259 is recommended by Eli, and will be included.
> 
> HDFS-2309 fixes a bug detected by a failing unit test in the 0.20 build.  It
> will be included.
> 
> HADOOP-7594 is requested by Sanjay, and will be included.
> 
> MAPREDUCE-7343 doesn't exist!  It is really a reference to HADOOP-7343.  It
> is requested by Nathan, and will be included.  (I fixed the CHANGES.txt
> reference and caused the commit to show - as much as possible - in the
> jira.)
> 
> So in summary, all the orphans have been claimed and championed.
> --Matt


Re: 0.20.205 Sustaining Release branch plan and content plan

Posted by Matt Foley <mf...@hortonworks.com>.
[REMINDER:  branch-0.20.205 will be created this weekend.]

Regarding "group 2" of the planning message, there were seven orphan
patches, already in 0.20-security, but not yet spoken for in 205:

>    HADOOP-7400. Fix HdfsProxyTests fails when the -Dtest.build.dir
>      and -Dbuild.test is set a dir other than build dir (gkesavan).
>    HADOOP-7594. Support HTTP REST in HttpServer.  (szetszwo)
>    HADOOP-7596. Makes packaging of 64-bit jsvc possible. Has other
>      bug fixes to do with packaging. (Eric Yang via ddas)
>    HDFS-1207. FSNamesystem.stallReplicationWork should be volatile.
>      (Todd Lipcon via dhruba)
>    HDFS-2259. DN web-UI doesn't work with paths that contain html. (eli)
>    HDFS-2309. TestRenameWhileOpen fails. (jitendra)
>    MAPREDUCE-7343. Make the number of warnings accepted by test-patch
>      configurable to limit false positives. (Thomas Graves via cdouglas)
>

Here is their disposition:

HADOOP-7400 and HADOOP-7596 are build/package infrastructure issues.  I need
them for the release, so they will be included.

HDFS-1207 is needed for append, is requested by Suresh, and will be
included.

HDFS-2259 is recommended by Eli, and will be included.

HDFS-2309 fixes a bug detected by a failing unit test in the 0.20 build.  It
will be included.

HADOOP-7594 is requested by Sanjay, and will be included.

MAPREDUCE-7343 doesn't exist!  It is really a reference to HADOOP-7343.  It
is requested by Nathan, and will be included.  (I fixed the CHANGES.txt
reference and caused the commit to show - as much as possible - in the
jira.)

So in summary, all the orphans have been claimed and championed.
--Matt

Re: 0.20.205 Sustaining Release branch plan and content plan

Posted by Todd Lipcon <to...@cloudera.com>.
On Fri, Sep 9, 2011 at 4:56 PM, Matt Foley <mf...@hortonworks.com> wrote:
> If I read the jira correctly, this is a workaround for RHEL6.0 that is no
> longer needed for RHEL6.1.
> Is that correct?  If so, would it be no longer needed?

Yes, it's fixed in RHEL 6.1. Also, since the uid caching is enabled in
the 20x series, it's less important, since the race is much, much rarer.
So I'm +/- 0 (doesn't seem urgent but shouldn't hurt things)

-Todd

>
> On Fri, Sep 9, 2011 at 3:45 AM, Steve Loughran <st...@apache.org> wrote:
>
>>
>> What about RHEL6.1 workarounds?
>> https://issues.apache.org/jira/browse/HADOOP-7156
>>
>>
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: 0.20.205 Sustaining Release branch plan and content plan

Posted by Matt Foley <mf...@hortonworks.com>.
If I read the jira correctly, this is a workaround for RHEL6.0 that is no
longer needed for RHEL6.1.
Is that correct?  If so, would it be no longer needed?

On Fri, Sep 9, 2011 at 3:45 AM, Steve Loughran <st...@apache.org> wrote:

>
> What about RHEL6.1 workarounds?
> https://issues.apache.org/jira/browse/HADOOP-7156
>
>

Re: 0.20.205 Sustaining Release branch plan and content plan

Posted by Steve Loughran <st...@apache.org>.
On 09/09/11 00:47, Suresh Srinivas wrote:
> Here are the patches that are not yet all committed to 205. I am working with
> Todd, Jitendra and Sanjay on this. We plan to get it done by tomorrow:
> HDFS-1207. FSNamesystem.stallReplicationWork should be volatile. (Todd
> Lipcon via dhruba)
> Risk level: Low - Simple change of a variable to volatile, for
> multi-threaded correctness


What about RHEL6.1 workarounds?
https://issues.apache.org/jira/browse/HADOOP-7156


Re: 0.20.205 Sustaining Release branch plan and content plan

Posted by Suresh Srinivas <su...@hortonworks.com>.
Here are the patches that are not yet all committed to 205. I am working with
Todd, Jitendra and Sanjay on this. We plan to get it done by tomorrow:
HDFS-1207. FSNamesystem.stallReplicationWork should be volatile. (Todd
Lipcon via dhruba)
Risk level: Low - Simple change of a variable to volatile, for
multi-threaded correctness
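
For readers unfamiliar with the idiom, here is a minimal Java sketch of why a
flag shared between threads is usually declared volatile. It is not the
FSNamesystem code: the class StallFlagSketch and its methods are hypothetical,
and only the field name is borrowed from the jira title. The worker thread
polls the flag while another thread sets it; volatile guarantees that the
write becomes visible to the polling thread.

```java
// Hypothetical sketch; the class and method names are illustrative and
// are not taken from FSNamesystem.
class StallFlagSketch {
    // volatile: a write by one thread is guaranteed to become visible
    // to later reads of the field by other threads.
    private volatile boolean stallReplicationWork = false;

    void setStall(boolean stall) { stallReplicationWork = stall; }

    boolean isStalled() { return stallReplicationWork; }

    // Starts a worker that spins until the flag is set, sets the flag from
    // the calling thread, and reports whether the worker exited in time.
    static boolean workerObservesStall(long timeoutMillis) {
        final StallFlagSketch state = new StallFlagSketch();
        Thread worker = new Thread(() -> {
            while (!state.isStalled()) {
                // busy-wait; each iteration re-reads the volatile field
            }
        });
        worker.setDaemon(true); // do not keep the JVM alive if it never exits
        worker.start();
        state.setStall(true);
        try {
            worker.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("worker observed flag: " + workerObservesStall(2000));
    }
}
```

Without volatile, the Java memory model would allow the worker to keep reading
a stale value indefinitely, which is the class of bug a change like this
guards against.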

HDFS-2309. TestRenameWhileOpen fails. (jitendra)
Risk level: Low - Simple change to introduce first block report flag to fix
a test failure.

HADOOP-6722   Workaround a TCP spec quirk by not allowing NetUtils.connect
to connect to itself
Risk level: Low - checks whether a socket has connected to itself.
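
As an illustration of the kind of check described, here is a minimal Java
sketch. It is an assumption about the general technique, not the committed
NetUtils.connect patch; the class SelfConnectCheckSketch and its methods are
hypothetical. TCP's simultaneous-open behaviour lets a client socket connect
to itself when its ephemeral local port happens to equal the target port on
the same address, so the check compares the local and remote endpoints.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical sketch; not the actual NetUtils code.
class SelfConnectCheckSketch {
    // A connection is a self-connection when both endpoints are identical.
    static boolean isSelfConnection(InetAddress localAddr, int localPort,
                                    InetAddress remoteAddr, int remotePort) {
        return localPort == remotePort && localAddr.equals(remoteAddr);
    }

    // Convenience overload that reads the endpoints from a connected socket.
    static boolean isSelfConnection(Socket socket) {
        return isSelfConnection(socket.getLocalAddress(), socket.getLocalPort(),
                                socket.getInetAddress(), socket.getPort());
    }

    public static void main(String[] args) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Connect to a real listener on an OS-assigned port and run the check.
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort())) {
            System.out.println("self-connection: " + isSelfConnection(client));
        }
    }
}
```

A caller that detects a self-connection would typically close the socket and
report a connection failure, since no daemon is actually listening on the
target port.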

HDFS-1252     TestDFSConcurrentFileOperations broken in 0.20-append TODO
suresh
Risk level: Low - fixing the test for correctness

HDFS-2300     TestFileAppend4 and TestMultiThreadedSync fail on 20.append
Risk level: Low - simple changes to fix the test failure.

HDFS-1779     After NameNode restart , Clients can not read partial files
even after client invokes Sync.
Risk level: Low - fixes related to bbw block reports. Disabled by append
supported config flag. This has been tested as part of CDH.

HDFS-1186     0.20: DNs should interrupt writers at start of recovery
Risk level: Low - ensures data integrity by preventing further writes on
lease recovery. This has been tested as part of CDH.

HDFS-1260     0.20: Block lost when multiple DNs trying to recover it to
different genstamps
Risk level: Low - code change looks straightforward. Tested as part of CDH.

HDFS-1122     Don't allow client verification to prematurely add
inprogress blocks to DataBlockScanner
Risk level: Low - code change looks straightforward. Handles client
verification interaction with DataBlockScanner and marking a block corrupt
incorrectly. Tested as part of CDH.

HDFS-1242 0.20 append: Add test for appendFile() race solved in HDFS-142
Risk level: Low - adds more tests to already committed change from HDFS-142.

HDFS-1218     20 append: Blocks recovered on startup should be treated
with lower priority during block synchronization
Risk level: Medium. This is a must fix to prevent data loss if a datanode
goes down in the pipeline. This has been tested in CDH.

HDFS-1197 - Blocks are considered "complete" prematurely after
commitBlockSynchronization or DN restart
Risk level: Low. This fixes data loss. This has been tested in CDH.
Considering a shorter version of the patch, given some of the issues were
handled by HDFS-1779, to reduce the risk.


*The patches I am not planning to add to 205 and the reason:*
HDFS-611  Heartbeats times from Datanodes increase when there are plenty of
blocks to delete
Could be HBase related.

HDFS-1056 Multi-node RPC deadlocks during block recovery
Setting up xceiver port using “dfs.datanode.port” to work around this issue.

HDFS-1982 Null pointer exception is thrown when NN restarts with a block
lesser in size than the block present in DN1 but generation stamps is
greater in NN.
Low probability of this occurring. No patch is available yet.

HDFS-1951/HDFS-1970
Null pointer exception comes when Namenode recovery happens and there is no
response from client to NN more than the hardlimit for NN recovery and the
current block is more than the prev block size in NN
Not in CDH. Suitable for a subsequent release.

HDFS-1264     0.20: OOME in HDFS client made an unrecoverable HDFS block
No patch available yet.

HDFS-1262     Failed pipeline creation during append leaves lease hanging on
NN
Not relevant to get flush as append is no longer used by HBase.

HDFS-1266     Missing license headers in branch-20-append
Missing license headers - has already been fixed. TODO check

HDFS-1248 Misc cleanup/logging improvements for branch-20-append
Log related cleanup. Not critical for 205.

HDFS-1247 Improvements to HDFS-1204 test
Risk level: Low - Test improvements.