Posted to yarn-dev@hadoop.apache.org by Gabor Bota <ga...@cloudera.com.INVALID> on 2020/06/26 13:51:39 UTC

[VOTE] Release Apache Hadoop 3.1.4 (RC2)

Hi folks,

I have put together a release candidate (RC2) for Hadoop 3.1.4.

The RC is available at: http://people.apache.org/~gabota/hadoop-3.1.4-RC2/
The RC tag in git is here:
https://github.com/apache/hadoop/releases/tag/release-3.1.4-RC2
The maven artifacts are staged at
https://repository.apache.org/content/repositories/orgapachehadoop-1269/

You can find my public key at:
https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
and http://keys.gnupg.net/pks/lookup?op=get&search=0xB86249D83539B38C
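
For reference, a typical way to check the signatures once a tarball and its
.asc file are downloaded from the RC directory above (the exact file names
are assumptions based on the usual RC layout):

gpg --import KEYS
gpg --verify hadoop-3.1.4.tar.gz.asc hadoop-3.1.4.tar.gz
gpg --verify hadoop-3.1.4-src.tar.gz.asc hadoop-3.1.4-src.tar.gz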

Please try the release and vote. The vote will run for 5 weekdays,
until July 6, 2020, 23:00 CET.

The release includes the revert of HDFS-14941, as it caused
HDFS-15421, "IBR leak causes standby NN to be stuck in safe mode".
(https://issues.apache.org/jira/browse/HDFS-15421)
The release includes HDFS-15323, as requested.
(https://issues.apache.org/jira/browse/HDFS-15323)

Thanks,
Gabor

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-dev-help@hadoop.apache.org


Re: [VOTE] Release Apache Hadoop 3.1.4 (RC2)

Posted by Steve Loughran <st...@cloudera.com.INVALID>.
+1, with the instruction "warn everyone about the guava update possibly
breaking things at run time"

The key issues are:
* code compiled with the new guava release will not link against the older
releases, even without any changes in the source files.
* this includes hadoop-common itself.

Applications which exclude the guava dependency published by the hadoop
artifacts in order to use their own must set guava.version=27.0-jre or
guava.version=27.0 to be consistent with the guava version of this release.
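
If the downstream build already declares a guava.version property (the way
the Spark build does), that is a one-line override; a minimal sketch,
assuming such a property exists in the downstream pom.xml:

<properties>
  <!-- keep the guava version in sync with the Hadoop 3.1.4 artifacts -->
  <guava.version>27.0-jre</guava.version>
</properties>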


My tests were all done using the artifacts downstream via maven; I trust
others to look at the big tarball release.


*Project 1: cloudstore*


This is my extra diagnostics and cloud utils module.
https://github.com/steveloughran/cloudstore


All compiled fine, but the tests failed on guava linkage

testNoOverwriteDest(org.apache.hadoop.tools.cloudup.TestLocalCloudup)  Time
elapsed: 0.012 sec  <<< ERROR! java.lang.NoSuchMethodError: 'void
com.google.common.base.Preconditions.checkArgument(boolean,
java.lang.String, java.lang.Object, java.lang.Object)'
at org.apache.hadoop.fs.tools.cloudup.Cloudup.run(Cloudup.java:177)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
at org.apache.hadoop.tools.store.StoreTestUtils.exec(StoreTestUtils.java:4


Note: that app is designed to run against hadoop branch-2 and other
branches, so I ended up reimplementing the checkArgument and checkState
calls so that I can have a binary which links everywhere. My code, nothing
serious.
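
Roughly along these lines; a minimal sketch, not the actual cloudstore code,
and the class name is made up:

// Stand-in for the Guava calls, so the compiled classes never reference the
// newer Preconditions overloads and link against any guava version.
// Uses String.format, so message templates must be Formatter-compatible;
// plain %s templates migrated from Guava are fine.
public final class Checks {

  private Checks() {
  }

  /** Throw IllegalArgumentException if the condition does not hold. */
  public static void checkArgument(boolean condition,
      String format, Object... args) {
    if (!condition) {
      throw new IllegalArgumentException(String.format(format, args));
    }
  }

  /** Throw IllegalStateException if the condition does not hold. */
  public static void checkState(boolean condition,
      String format, Object... args) {
    if (!condition) {
      throw new IllegalStateException(String.format(format, args));
    }
  }
}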

*Project 2: Spark*


apache spark main branch built with maven (I have not tried the SBT build).


mvn -T 1  -Phadoop-3.2 -Dhadoop.version=3.1.4 -Psnapshots-and-staging
-Phadoop-cloud,yarn,kinesis-asl -DskipTests clean package

All good. Then I ran the committer unit test suite

mvn -T 1  -Phadoop-3.2 -Dhadoop.version=3.1.4
-Phadoop-cloud,yarn,kinesis-asl  -Psnapshots-and-staging --pl hadoop-cloud
test

CommitterBindingSuite:
*** RUN ABORTED ***
  java.lang.NoSuchMethodError: 'void
com.google.common.base.Preconditions.checkArgument(boolean,
java.lang.String, java.lang.Object)'
  at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
  at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
  at
org.apache.spark.internal.io.cloud.CommitterBindingSuite.newJob(CommitterBindingSuite.scala:89)
  at
org.apache.spark.internal.io.cloud.CommitterBindingSuite.$anonfun$new$1(CommitterBindingSuite.scala:55)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
  ...

Fix: again, tell the build this is a later version of Guava:


mvn -T 1  -Phadoop-3.2 -Dhadoop.version=3.1.4
-Phadoop-cloud,yarn,kinesis-asl  -Psnapshots-and-staging --pl hadoop-cloud
-Dguava.version=27.0-jre test


The mismatch doesn't break spark internally (they shade their guava anyway);
the guava.version here is actually the one which hadoop is to be linked
with.

outcome: tests work

[INFO] --- scalatest-maven-plugin:2.0.0:test (test) @
spark-hadoop-cloud_2.12 ---
Discovery starting.
Discovery completed in 438 milliseconds.
Run starting. Expected test count is: 4
CommitterBindingSuite:
- BindingParquetOutputCommitter binds to the inner committer
- committer protocol can be serialized and deserialized
- local filesystem instantiation
- reject dynamic partitioning
Run completed in 1 second, 411 milliseconds.
Total number of tests run: 4
Suites: completed 2, aborted 0
Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0

This is a real PITA, and it's invariably those checkArgument calls, because
the later guava versions added some overloaded methods. Compile existing
source with a later guava version and the .class no longer binds to the
older guava version, even though no new guava APIs have been adopted.
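
For the curious, a minimal sketch of the trap (hypothetical code, not from
the Hadoop source):

import com.google.common.base.Preconditions;

public class LinkageDemo {

  public static void validate(String user, String path) {
    // Compiled against Guava 27, javac resolves this call to the
    // fixed-arity overload
    //   checkArgument(boolean, String, Object, Object)
    // which the newer guava releases added. Older releases (such as the
    // 11.x line pulled in by earlier Hadoop branches) only ship the
    // Object... varargs form, so running this unchanged .class against an
    // old guava jar fails with exactly the NoSuchMethodError shown above.
    Preconditions.checkArgument(path != null && !path.isEmpty(),
        "user %s supplied an empty path: %s", user, path);
  }
}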

I am really tempted to go through src/**/*.java and replace all Guava
checkArgument/checkState with our own implementation in hadoop.common, at
least for any which uses the vararg variant. But: it'd be a big change and
there may be related issues elsewhere. At least now things fail fast.

*Project 3: spark cloud integration*

https://github.com/hortonworks-spark/cloud-integration

This is where the functional tests for the s3a committer through spark live,
built with:

-Dhadoop.version=3.1.2 -Dspark.version=3.1.0-SNAPSHOT
-Psnapshots-and-staging

and a full test run

mvn test -Dcloud.test.configuration.file=../test-configs/s3a.xml --pl
cloud-examples -Dhadoop.version=3.1.2 -Dspark.version=3.1.0-SNAPSHOT
-Psnapshots-and-staging

All good. A couple of test failures, but that was because one of my test
datasets is not in any bucket I have... I will have to fix that.


To conclude: the artefacts are all there, and existing code compiles against
the new version without obvious problems. Where people will see stack
traces is from the guava update. It is frustrating, but there is nothing we
can do about it. All we can do is remind ourselves: "don't add
overloaded methods where you have already shipped an implementation with a
varargs one".

For the release notes: we need to explain what is happening and why.



On Fri, 26 Jun 2020 at 14:51, Gabor Bota <ga...@cloudera.com> wrote:

> Hi folks,
>
> I have put together a release candidate (RC2) for Hadoop 3.1.4.
>
> The RC is available at: http://people.apache.org/~gabota/hadoop-3.1.4-RC2/
> The RC tag in git is here:
> https://github.com/apache/hadoop/releases/tag/release-3.1.4-RC2
> The maven artifacts are staged at
> https://repository.apache.org/content/repositories/orgapachehadoop-1269/
>
> You can find my public key at:
> https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
> and http://keys.gnupg.net/pks/lookup?op=get&search=0xB86249D83539B38C
>
> Please try the release and vote. The vote will run for 5 weekdays,
> until July 6. 2020. 23:00 CET.
>
> The release includes the revert of HDFS-14941, as it caused
> HDFS-15421. IBR leak causes standby NN to be stuck in safe mode.
> (https://issues.apache.org/jira/browse/HDFS-15421)
> The release includes HDFS-15323, as requested.
> (https://issues.apache.org/jira/browse/HDFS-15323)
>
> Thanks,
> Gabor
>

Re: [VOTE] Release Apache Hadoop 3.1.4 (RC2)

Posted by Gabor Bota <ga...@cloudera.com.INVALID>.
Yes, sure. I'll do another RC for next week.

Thank you all for working on this!

On Thu, Jul 9, 2020 at 8:20 AM Masatake Iwasaki
<iw...@oss.nttdata.co.jp> wrote:
>
> Hi Gabor Bota,
>
> I committed the fix of YARN-10347 to branch-3.1.
> I think this should be blocker for 3.1.4.
> Could you cherry-pick it to branch-3.1.4 and cut a new RC?
>
> Thanks,
> Masatake Iwasaki
>
> On 2020/07/08 23:31, Masatake Iwasaki wrote:
> > Thanks Steve and Prabhu for the information.
> >
> > The cause turned out to be locking in CapacityScheduler#reinitialize.
> > I think the method is called after transitioning to active stat if
> > RM-HA is enabled.
> >
> > I filed YARN-10347 and created PR.
> >
> >
> > Masatake Iwasaki
> >
> >
> > On 2020/07/08 16:33, Prabhu Joseph wrote:
> >> Hi Masatake,
> >>
> >>       The thread is waiting for a ReadLock, we need to check what the
> >> other
> >> thread holding WriteLock is blocked on.
> >> Can you get three consecutive complete jstack of ResourceManager
> >> during the
> >> issue.
> >>
> >>>> I got no issue if RM-HA is disabled.
> >> Looks RM is not able to access Zookeeper State Store. Can you check if
> >> there is any connectivity issue between RM and Zookeeper.
> >>
> >> Thanks,
> >> Prabhu Joseph
> >>
> >>
> >> On Mon, Jul 6, 2020 at 2:44 AM Masatake Iwasaki
> >> <iw...@oss.nttdata.co.jp>
> >> wrote:
> >>
> >>> Thanks for putting this up, Gabor Bota.
> >>>
> >>> I'm testing the RC2 on 3 node docker cluster with NN-HA and RM-HA
> >>> enabled.
> >>> ResourceManager reproducibly blocks on submitApplication while
> >>> launching
> >>> example MR jobs.
> >>> Does anyone run into the same issue?
> >>>
> >>> The same configuration worked for 3.1.3.
> >>> I got no issue if RM-HA is disabled.
> >>>
> >>>
> >>> "IPC Server handler 1 on default port 8032" #167 daemon prio=5
> >>> os_prio=0
> >>> tid=0x00007fe91821ec50 nid=0x3b9 waiting on condition
> >>> [0x00007fe901bac000]
> >>>      java.lang.Thread.State: WAITING (parking)
> >>>           at sun.misc.Unsafe.park(Native Method)
> >>>           - parking to wait for  <0x0000000085d37a40> (a
> >>> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
> >>>           at
> >>> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> >>>           at
> >>>
> >>> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> >>>
> >>>           at
> >>>
> >>> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
> >>>
> >>>           at
> >>>
> >>> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
> >>>
> >>>           at
> >>>
> >>> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
> >>>
> >>>           at
> >>>
> >>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.checkAndGetApplicationPriority(CapacityScheduler.java:2521)
> >>>
> >>>           at
> >>>
> >>> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:417)
> >>>
> >>>           at
> >>>
> >>> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:342)
> >>>
> >>>           at
> >>>
> >>> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:678)
> >>>
> >>>           at
> >>>
> >>> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277)
> >>>
> >>>           at
> >>>
> >>> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563)
> >>>
> >>>           at
> >>>
> >>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527)
> >>>
> >>>           at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
> >>>           at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1015)
> >>>           at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
> >>>           at java.security.AccessController.doPrivileged(Native Method)
> >>>           at javax.security.auth.Subject.doAs(Subject.java:422)
> >>>           at
> >>>
> >>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
> >>>
> >>>           at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2943)
> >>>
> >>>
> >>> Masatake Iwasaki
> >>>
> >>> On 2020/06/26 22:51, Gabor Bota wrote:
> >>>> Hi folks,
> >>>>
> >>>> I have put together a release candidate (RC2) for Hadoop 3.1.4.
> >>>>
> >>>> The RC is available at:
> >>> http://people.apache.org/~gabota/hadoop-3.1.4-RC2/
> >>>> The RC tag in git is here:
> >>>> https://github.com/apache/hadoop/releases/tag/release-3.1.4-RC2
> >>>> The maven artifacts are staged at
> >>>> https://repository.apache.org/content/repositories/orgapachehadoop-1269/
> >>>>
> >>>>
> >>>> You can find my public key at:
> >>>> https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
> >>>> and http://keys.gnupg.net/pks/lookup?op=get&search=0xB86249D83539B38C
> >>>>
> >>>> Please try the release and vote. The vote will run for 5 weekdays,
> >>>> until July 6. 2020. 23:00 CET.
> >>>>
> >>>> The release includes the revert of HDFS-14941, as it caused
> >>>> HDFS-15421. IBR leak causes standby NN to be stuck in safe mode.
> >>>> (https://issues.apache.org/jira/browse/HDFS-15421)
> >>>> The release includes HDFS-15323, as requested.
> >>>> (https://issues.apache.org/jira/browse/HDFS-15323)
> >>>>
> >>>> Thanks,
> >>>> Gabor
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
> >>>> For additional commands, e-mail: common-dev-help@hadoop.apache.org
> >>>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: yarn-dev-unsubscribe@hadoop.apache.org
> >>> For additional commands, e-mail: yarn-dev-help@hadoop.apache.org
> >>>
> >>>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
> > For additional commands, e-mail: common-dev-help@hadoop.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
> For additional commands, e-mail: common-dev-help@hadoop.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-dev-help@hadoop.apache.org


Re: [VOTE] Release Apache Hadoop 3.1.4 (RC2)

Posted by Masatake Iwasaki <iw...@oss.nttdata.co.jp>.
Hi Gabor Bota,

I committed the fix of YARN-10347 to branch-3.1.
I think this should be a blocker for 3.1.4.
Could you cherry-pick it to branch-3.1.4 and cut a new RC?

Thanks,
Masatake Iwasaki

On 2020/07/08 23:31, Masatake Iwasaki wrote:
> Thanks Steve and Prabhu for the information.
>
> The cause turned out to be locking in CapacityScheduler#reinitialize.
> I think the method is called after transitioning to active stat if 
> RM-HA is enabled.
>
> I filed YARN-10347 and created PR.
>
>
> Masatake Iwasaki
>
>
> On 2020/07/08 16:33, Prabhu Joseph wrote:
>> Hi Masatake,
>>
>>       The thread is waiting for a ReadLock, we need to check what the 
>> other
>> thread holding WriteLock is blocked on.
>> Can you get three consecutive complete jstack of ResourceManager 
>> during the
>> issue.
>>
>>>> I got no issue if RM-HA is disabled.
>> Looks RM is not able to access Zookeeper State Store. Can you check if
>> there is any connectivity issue between RM and Zookeeper.
>>
>> Thanks,
>> Prabhu Joseph
>>
>>
>> On Mon, Jul 6, 2020 at 2:44 AM Masatake Iwasaki 
>> <iw...@oss.nttdata.co.jp>
>> wrote:
>>
>>> Thanks for putting this up, Gabor Bota.
>>>
>>> I'm testing the RC2 on 3 node docker cluster with NN-HA and RM-HA 
>>> enabled.
>>> ResourceManager reproducibly blocks on submitApplication while 
>>> launching
>>> example MR jobs.
>>> Does anyone run into the same issue?
>>>
>>> The same configuration worked for 3.1.3.
>>> I got no issue if RM-HA is disabled.
>>>
>>>
>>> "IPC Server handler 1 on default port 8032" #167 daemon prio=5 
>>> os_prio=0
>>> tid=0x00007fe91821ec50 nid=0x3b9 waiting on condition 
>>> [0x00007fe901bac000]
>>>      java.lang.Thread.State: WAITING (parking)
>>>           at sun.misc.Unsafe.park(Native Method)
>>>           - parking to wait for  <0x0000000085d37a40> (a
>>> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>>>           at
>>> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>>>           at
>>>
>>> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) 
>>>
>>>           at
>>>
>>> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967) 
>>>
>>>           at
>>>
>>> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283) 
>>>
>>>           at
>>>
>>> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727) 
>>>
>>>           at
>>>
>>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.checkAndGetApplicationPriority(CapacityScheduler.java:2521) 
>>>
>>>           at
>>>
>>> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:417) 
>>>
>>>           at
>>>
>>> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:342) 
>>>
>>>           at
>>>
>>> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:678) 
>>>
>>>           at
>>>
>>> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277) 
>>>
>>>           at
>>>
>>> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563) 
>>>
>>>           at
>>>
>>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527) 
>>>
>>>           at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
>>>           at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1015)
>>>           at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
>>>           at java.security.AccessController.doPrivileged(Native Method)
>>>           at javax.security.auth.Subject.doAs(Subject.java:422)
>>>           at
>>>
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) 
>>>
>>>           at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2943)
>>>
>>>
>>> Masatake Iwasaki
>>>
>>> On 2020/06/26 22:51, Gabor Bota wrote:
>>>> Hi folks,
>>>>
>>>> I have put together a release candidate (RC2) for Hadoop 3.1.4.
>>>>
>>>> The RC is available at:
>>> http://people.apache.org/~gabota/hadoop-3.1.4-RC2/
>>>> The RC tag in git is here:
>>>> https://github.com/apache/hadoop/releases/tag/release-3.1.4-RC2
>>>> The maven artifacts are staged at
>>>> https://repository.apache.org/content/repositories/orgapachehadoop-1269/ 
>>>>
>>>>
>>>> You can find my public key at:
>>>> https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
>>>> and http://keys.gnupg.net/pks/lookup?op=get&search=0xB86249D83539B38C
>>>>
>>>> Please try the release and vote. The vote will run for 5 weekdays,
>>>> until July 6. 2020. 23:00 CET.
>>>>
>>>> The release includes the revert of HDFS-14941, as it caused
>>>> HDFS-15421. IBR leak causes standby NN to be stuck in safe mode.
>>>> (https://issues.apache.org/jira/browse/HDFS-15421)
>>>> The release includes HDFS-15323, as requested.
>>>> (https://issues.apache.org/jira/browse/HDFS-15323)
>>>>
>>>> Thanks,
>>>> Gabor
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
>>>> For additional commands, e-mail: common-dev-help@hadoop.apache.org
>>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: yarn-dev-unsubscribe@hadoop.apache.org
>>> For additional commands, e-mail: yarn-dev-help@hadoop.apache.org
>>>
>>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
> For additional commands, e-mail: common-dev-help@hadoop.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: mapreduce-dev-help@hadoop.apache.org


Re: [VOTE] Release Apache Hadoop 3.1.4 (RC2)

Posted by Masatake Iwasaki <iw...@oss.nttdata.co.jp>.
Thanks Steve and Prabhu for the information.

The cause turned out to be locking in CapacityScheduler#reinitialize.
I think the method is called after transitioning to the active state if
RM-HA is enabled.

I filed YARN-10347 and created PR.
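
To illustrate the pattern, here is a minimal, self-contained sketch (not the
actual CapacityScheduler code; the class and method names are placeholders):
a thread that holds the scheduler write lock while doing slow work blocks any
thread that only needs the read lock, which is what the submitApplication
handler in the quoted jstack is waiting on.

import java.util.concurrent.locks.ReentrantReadWriteLock;

public class SchedulerLockSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  // Stand-in for reinitialize(): holds the write lock while doing slow work,
  // e.g. waiting on the state store during the transition to active.
  void reinitialize() throws InterruptedException {
    lock.writeLock().lock();
    try {
      Thread.sleep(60_000);          // simulated slow work under the write lock
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Stand-in for checkAndGetApplicationPriority(), called from
  // submitApplication: it only needs the read lock, but parks until the
  // writer releases it.
  void checkAndGetApplicationPriority() {
    lock.readLock().lock();          // handler thread parks here, as in the jstack
    try {
      // read scheduler state
    } finally {
      lock.readLock().unlock();
    }
  }

  public static void main(String[] args) throws Exception {
    SchedulerLockSketch s = new SchedulerLockSketch();
    Thread writer = new Thread(() -> {
      try {
        s.reinitialize();
      } catch (InterruptedException ignored) {
      }
    });
    writer.start();
    Thread.sleep(100);                   // let the writer take the lock first
    s.checkAndGetApplicationPriority();  // blocks for ~60s, mirroring the hang
  }
}

Running it (javac SchedulerLockSketch.java && java SchedulerLockSketch) stalls
for about a minute inside the read-lock call, which matches the handler thread
state in the quoted stack trace.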


Masatake Iwasaki


On 2020/07/08 16:33, Prabhu Joseph wrote:
> Hi Masatake,
>
>       The thread is waiting for a ReadLock, we need to check what the other
> thread holding WriteLock is blocked on.
> Can you get three consecutive complete jstack of ResourceManager during the
> issue.
>
>>> I got no issue if RM-HA is disabled.
> Looks RM is not able to access Zookeeper State Store. Can you check if
> there is any connectivity issue between RM and Zookeeper.
>
> Thanks,
> Prabhu Joseph
>
>
> On Mon, Jul 6, 2020 at 2:44 AM Masatake Iwasaki <iw...@oss.nttdata.co.jp>
> wrote:
>
>> Thanks for putting this up, Gabor Bota.
>>
>> I'm testing the RC2 on 3 node docker cluster with NN-HA and RM-HA enabled.
>> ResourceManager reproducibly blocks on submitApplication while launching
>> example MR jobs.
>> Does anyone run into the same issue?
>>
>> The same configuration worked for 3.1.3.
>> I got no issue if RM-HA is disabled.
>>
>>
>> "IPC Server handler 1 on default port 8032" #167 daemon prio=5 os_prio=0
>> tid=0x00007fe91821ec50 nid=0x3b9 waiting on condition [0x00007fe901bac000]
>>      java.lang.Thread.State: WAITING (parking)
>>           at sun.misc.Unsafe.park(Native Method)
>>           - parking to wait for  <0x0000000085d37a40> (a
>> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>>           at
>> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>>           at
>>
>> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>>           at
>>
>> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>>           at
>>
>> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>>           at
>>
>> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>>           at
>>
>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.checkAndGetApplicationPriority(CapacityScheduler.java:2521)
>>           at
>>
>> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:417)
>>           at
>>
>> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:342)
>>           at
>>
>> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:678)
>>           at
>>
>> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277)
>>           at
>>
>> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563)
>>           at
>>
>> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527)
>>           at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
>>           at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1015)
>>           at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
>>           at java.security.AccessController.doPrivileged(Native Method)
>>           at javax.security.auth.Subject.doAs(Subject.java:422)
>>           at
>>
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>>           at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2943)
>>
>>
>> Masatake Iwasaki
>>
>> On 2020/06/26 22:51, Gabor Bota wrote:
>>> Hi folks,
>>>
>>> I have put together a release candidate (RC2) for Hadoop 3.1.4.
>>>
>>> The RC is available at:
>> http://people.apache.org/~gabota/hadoop-3.1.4-RC2/
>>> The RC tag in git is here:
>>> https://github.com/apache/hadoop/releases/tag/release-3.1.4-RC2
>>> The maven artifacts are staged at
>>> https://repository.apache.org/content/repositories/orgapachehadoop-1269/
>>>
>>> You can find my public key at:
>>> https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
>>> and http://keys.gnupg.net/pks/lookup?op=get&search=0xB86249D83539B38C
>>>
>>> Please try the release and vote. The vote will run for 5 weekdays,
>>> until July 6. 2020. 23:00 CET.
>>>
>>> The release includes the revert of HDFS-14941, as it caused
>>> HDFS-15421. IBR leak causes standby NN to be stuck in safe mode.
>>> (https://issues.apache.org/jira/browse/HDFS-15421)
>>> The release includes HDFS-15323, as requested.
>>> (https://issues.apache.org/jira/browse/HDFS-15323)
>>>
>>> Thanks,
>>> Gabor
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
>>> For additional commands, e-mail: common-dev-help@hadoop.apache.org
>>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: yarn-dev-unsubscribe@hadoop.apache.org
>> For additional commands, e-mail: yarn-dev-help@hadoop.apache.org
>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-dev-help@hadoop.apache.org


Re: [VOTE] Release Apache Hadoop 3.1.4 (RC2)

Posted by Prabhu Joseph <pr...@gmail.com>.
Hi Masatake,

     The thread is waiting for a ReadLock; we need to check what the other
thread holding the WriteLock is blocked on.
Can you get three consecutive complete jstacks of the ResourceManager during
the issue?

>> I got no issue if RM-HA is disabled.

Looks like the RM is not able to access the ZooKeeper state store. Can you check
if there is any connectivity issue between the RM and ZooKeeper?
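
For example, something along these lines (a shell sketch; $RM_PID, zk-host and
the state-store path are placeholders for your environment):

# capture three full thread dumps, roughly 10 seconds apart
for i in 1 2 3; do jstack -l "$RM_PID" > rm-jstack-"$i".txt; sleep 10; done

# quick connectivity check against the ensemble in yarn.resourcemanager.zk-address
echo ruok | nc zk-host 2181                # expect "imok" (if 4lw commands are allowed)
zkCli.sh -server zk-host:2181 ls /rmstore  # /rmstore is the default state-store parent znode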

Thanks,
Prabhu Joseph


On Mon, Jul 6, 2020 at 2:44 AM Masatake Iwasaki <iw...@oss.nttdata.co.jp>
wrote:

> Thanks for putting this up, Gabor Bota.
>
> I'm testing the RC2 on 3 node docker cluster with NN-HA and RM-HA enabled.
> ResourceManager reproducibly blocks on submitApplication while launching
> example MR jobs.
> Does anyone run into the same issue?
>
> The same configuration worked for 3.1.3.
> I got no issue if RM-HA is disabled.
>
>
> "IPC Server handler 1 on default port 8032" #167 daemon prio=5 os_prio=0
> tid=0x00007fe91821ec50 nid=0x3b9 waiting on condition [0x00007fe901bac000]
>     java.lang.Thread.State: WAITING (parking)
>          at sun.misc.Unsafe.park(Native Method)
>          - parking to wait for  <0x0000000085d37a40> (a
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>          at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>          at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>          at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>          at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>          at
>
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.checkAndGetApplicationPriority(CapacityScheduler.java:2521)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:417)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:342)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:678)
>          at
>
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277)
>          at
>
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563)
>          at
>
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527)
>          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
>          at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1015)
>          at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
>          at java.security.AccessController.doPrivileged(Native Method)
>          at javax.security.auth.Subject.doAs(Subject.java:422)
>          at
>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2943)
>
>
> Masatake Iwasaki
>
> On 2020/06/26 22:51, Gabor Bota wrote:
> > Hi folks,
> >
> > I have put together a release candidate (RC2) for Hadoop 3.1.4.
> >
> > The RC is available at:
> http://people.apache.org/~gabota/hadoop-3.1.4-RC2/
> > The RC tag in git is here:
> > https://github.com/apache/hadoop/releases/tag/release-3.1.4-RC2
> > The maven artifacts are staged at
> > https://repository.apache.org/content/repositories/orgapachehadoop-1269/
> >
> > You can find my public key at:
> > https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
> > and http://keys.gnupg.net/pks/lookup?op=get&search=0xB86249D83539B38C
> >
> > Please try the release and vote. The vote will run for 5 weekdays,
> > until July 6. 2020. 23:00 CET.
> >
> > The release includes the revert of HDFS-14941, as it caused
> > HDFS-15421. IBR leak causes standby NN to be stuck in safe mode.
> > (https://issues.apache.org/jira/browse/HDFS-15421)
> > The release includes HDFS-15323, as requested.
> > (https://issues.apache.org/jira/browse/HDFS-15323)
> >
> > Thanks,
> > Gabor
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
> > For additional commands, e-mail: common-dev-help@hadoop.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: yarn-dev-unsubscribe@hadoop.apache.org
> For additional commands, e-mail: yarn-dev-help@hadoop.apache.org
>
>

Re: [VOTE] Release Apache Hadoop 3.1.4 (RC2)

Posted by Steve Loughran <st...@cloudera.com.INVALID>.
Hmm,

YARN-9341 went through all of the YARN lock code; it's in 3.3 but not in
3.1, and we do not want to attempt to backport 175KB of lock
acquire/release code, do we?

Anyone in yarn-dev got any thoughts here?

On Sun, 5 Jul 2020 at 22:14, Masatake Iwasaki <iw...@oss.nttdata.co.jp>
wrote:

> Thanks for putting this up, Gabor Bota.
>
> I'm testing the RC2 on 3 node docker cluster with NN-HA and RM-HA enabled.
> ResourceManager reproducibly blocks on submitApplication while launching
> example MR jobs.
> Does anyone run into the same issue?
>
> The same configuration worked for 3.1.3.
> I got no issue if RM-HA is disabled.
>
>
> "IPC Server handler 1 on default port 8032" #167 daemon prio=5 os_prio=0
> tid=0x00007fe91821ec50 nid=0x3b9 waiting on condition [0x00007fe901bac000]
>     java.lang.Thread.State: WAITING (parking)
>          at sun.misc.Unsafe.park(Native Method)
>          - parking to wait for  <0x0000000085d37a40> (a
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>          at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>          at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>          at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>          at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>          at
>
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.checkAndGetApplicationPriority(CapacityScheduler.java:2521)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:417)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:342)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:678)
>          at
>
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277)
>          at
>
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563)
>          at
>
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527)
>          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
>          at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1015)
>          at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
>          at java.security.AccessController.doPrivileged(Native Method)
>          at javax.security.auth.Subject.doAs(Subject.java:422)
>          at
>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2943)
>
>
> Masatake Iwasaki
>
> On 2020/06/26 22:51, Gabor Bota wrote:
> > Hi folks,
> >
> > I have put together a release candidate (RC2) for Hadoop 3.1.4.
> >
> > The RC is available at:
> http://people.apache.org/~gabota/hadoop-3.1.4-RC2/
> > The RC tag in git is here:
> > https://github.com/apache/hadoop/releases/tag/release-3.1.4-RC2
> > The maven artifacts are staged at
> > https://repository.apache.org/content/repositories/orgapachehadoop-1269/
> >
> > You can find my public key at:
> > https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
> > and http://keys.gnupg.net/pks/lookup?op=get&search=0xB86249D83539B38C
> >
> > Please try the release and vote. The vote will run for 5 weekdays,
> > until July 6. 2020. 23:00 CET.
> >
> > The release includes the revert of HDFS-14941, as it caused
> > HDFS-15421. IBR leak causes standby NN to be stuck in safe mode.
> > (https://issues.apache.org/jira/browse/HDFS-15421)
> > The release includes HDFS-15323, as requested.
> > (https://issues.apache.org/jira/browse/HDFS-15323)
> >
> > Thanks,
> > Gabor
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
> > For additional commands, e-mail: common-dev-help@hadoop.apache.org
> >
>

Re: [VOTE] Release Apache Hadoop 3.1.4 (RC2)

Posted by Steve Loughran <st...@cloudera.com.INVALID>.
hmm

YARN-9341 went through all of the yarn lock code -it's in 3.3 but not in
3.1. And we do not want to attempt to backport 175KB of lock
acquire/release code, do we?

anyone in yarn-dev got any thoughts here?

On Sun, 5 Jul 2020 at 22:14, Masatake Iwasaki <iw...@oss.nttdata.co.jp>
wrote:

> Thanks for putting this up, Gabor Bota.
>
> I'm testing the RC2 on 3 node docker cluster with NN-HA and RM-HA enabled.
> ResourceManager reproducibly blocks on submitApplication while launching
> example MR jobs.
> Does anyone run into the same issue?
>
> The same configuration worked for 3.1.3.
> I got no issue if RM-HA is disabled.
>
>
> "IPC Server handler 1 on default port 8032" #167 daemon prio=5 os_prio=0
> tid=0x00007fe91821ec50 nid=0x3b9 waiting on condition [0x00007fe901bac000]
>     java.lang.Thread.State: WAITING (parking)
>          at sun.misc.Unsafe.park(Native Method)
>          - parking to wait for  <0x0000000085d37a40> (a
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>          at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>          at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>          at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>          at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>          at
>
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.checkAndGetApplicationPriority(CapacityScheduler.java:2521)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:417)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:342)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:678)
>          at
>
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277)
>          at
>
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563)
>          at
>
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527)
>          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
>          at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1015)
>          at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
>          at java.security.AccessController.doPrivileged(Native Method)
>          at javax.security.auth.Subject.doAs(Subject.java:422)
>          at
>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2943)
>
>
> Masatake Iwasaki
>
> On 2020/06/26 22:51, Gabor Bota wrote:
> > Hi folks,
> >
> > I have put together a release candidate (RC2) for Hadoop 3.1.4.
> >
> > The RC is available at:
> http://people.apache.org/~gabota/hadoop-3.1.4-RC2/
> > The RC tag in git is here:
> > https://github.com/apache/hadoop/releases/tag/release-3.1.4-RC2
> > The maven artifacts are staged at
> > https://repository.apache.org/content/repositories/orgapachehadoop-1269/
> >
> > You can find my public key at:
> > https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
> > and http://keys.gnupg.net/pks/lookup?op=get&search=0xB86249D83539B38C
> >
> > Please try the release and vote. The vote will run for 5 weekdays,
> > until July 6. 2020. 23:00 CET.
> >
> > The release includes the revert of HDFS-14941, as it caused
> > HDFS-15421. IBR leak causes standby NN to be stuck in safe mode.
> > (https://issues.apache.org/jira/browse/HDFS-15421)
> > The release includes HDFS-15323, as requested.
> > (https://issues.apache.org/jira/browse/HDFS-15323)
> >
> > Thanks,
> > Gabor
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
> > For additional commands, e-mail: common-dev-help@hadoop.apache.org
> >
>

Re: [VOTE] Release Apache Hadoop 3.1.4 (RC2)

Posted by Prabhu Joseph <pr...@gmail.com>.
Hi Masatake,

     The thread is waiting for a ReadLock, we need to check what the other
thread holding WriteLock is blocked on.
Can you get three consecutive complete jstack of ResourceManager during the
issue.

>> I got no issue if RM-HA is disabled.

Looks RM is not able to access Zookeeper State Store. Can you check if
there is any connectivity issue between RM and Zookeeper.

Thanks,
Prabhu Joseph


On Mon, Jul 6, 2020 at 2:44 AM Masatake Iwasaki <iw...@oss.nttdata.co.jp>
wrote:

> Thanks for putting this up, Gabor Bota.
>
> I'm testing the RC2 on 3 node docker cluster with NN-HA and RM-HA enabled.
> ResourceManager reproducibly blocks on submitApplication while launching
> example MR jobs.
> Does anyone run into the same issue?
>
> The same configuration worked for 3.1.3.
> I got no issue if RM-HA is disabled.
>
>
> "IPC Server handler 1 on default port 8032" #167 daemon prio=5 os_prio=0
> tid=0x00007fe91821ec50 nid=0x3b9 waiting on condition [0x00007fe901bac000]
>     java.lang.Thread.State: WAITING (parking)
>          at sun.misc.Unsafe.park(Native Method)
>          - parking to wait for  <0x0000000085d37a40> (a
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>          at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>          at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>          at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>          at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>          at
>
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.checkAndGetApplicationPriority(CapacityScheduler.java:2521)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:417)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:342)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:678)
>          at
>
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277)
>          at
>
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563)
>          at
>
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527)
>          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
>          at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1015)
>          at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
>          at java.security.AccessController.doPrivileged(Native Method)
>          at javax.security.auth.Subject.doAs(Subject.java:422)
>          at
>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2943)
>
>
> Masatake Iwasaki
>
> On 2020/06/26 22:51, Gabor Bota wrote:
> > Hi folks,
> >
> > I have put together a release candidate (RC2) for Hadoop 3.1.4.
> >
> > The RC is available at:
> http://people.apache.org/~gabota/hadoop-3.1.4-RC2/
> > The RC tag in git is here:
> > https://github.com/apache/hadoop/releases/tag/release-3.1.4-RC2
> > The maven artifacts are staged at
> > https://repository.apache.org/content/repositories/orgapachehadoop-1269/
> >
> > You can find my public key at:
> > https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
> > and http://keys.gnupg.net/pks/lookup?op=get&search=0xB86249D83539B38C
> >
> > Please try the release and vote. The vote will run for 5 weekdays,
> > until July 6. 2020. 23:00 CET.
> >
> > The release includes the revert of HDFS-14941, as it caused
> > HDFS-15421. IBR leak causes standby NN to be stuck in safe mode.
> > (https://issues.apache.org/jira/browse/HDFS-15421)
> > The release includes HDFS-15323, as requested.
> > (https://issues.apache.org/jira/browse/HDFS-15323)
> >
> > Thanks,
> > Gabor
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
> > For additional commands, e-mail: common-dev-help@hadoop.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: yarn-dev-unsubscribe@hadoop.apache.org
> For additional commands, e-mail: yarn-dev-help@hadoop.apache.org
>
>

Re: [VOTE] Release Apache Hadoop 3.1.4 (RC2)

Posted by Steve Loughran <st...@cloudera.com.INVALID>.
hmm

YARN-9341 went through all of the yarn lock code -it's in 3.3 but not in
3.1. And we do not want to attempt to backport 175KB of lock
acquire/release code, do we?

anyone in yarn-dev got any thoughts here?

On Sun, 5 Jul 2020 at 22:14, Masatake Iwasaki <iw...@oss.nttdata.co.jp>
wrote:

> Thanks for putting this up, Gabor Bota.
>
> I'm testing the RC2 on 3 node docker cluster with NN-HA and RM-HA enabled.
> ResourceManager reproducibly blocks on submitApplication while launching
> example MR jobs.
> Does anyone run into the same issue?
>
> The same configuration worked for 3.1.3.
> I got no issue if RM-HA is disabled.
>
>
> "IPC Server handler 1 on default port 8032" #167 daemon prio=5 os_prio=0
> tid=0x00007fe91821ec50 nid=0x3b9 waiting on condition [0x00007fe901bac000]
>     java.lang.Thread.State: WAITING (parking)
>          at sun.misc.Unsafe.park(Native Method)
>          - parking to wait for  <0x0000000085d37a40> (a
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>          at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>          at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>          at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>          at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>          at
>
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.checkAndGetApplicationPriority(CapacityScheduler.java:2521)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:417)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:342)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:678)
>          at
>
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277)
>          at
>
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563)
>          at
>
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527)
>          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
>          at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1015)
>          at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
>          at java.security.AccessController.doPrivileged(Native Method)
>          at javax.security.auth.Subject.doAs(Subject.java:422)
>          at
>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2943)
>
>
> Masatake Iwasaki
>
> On 2020/06/26 22:51, Gabor Bota wrote:
> > Hi folks,
> >
> > I have put together a release candidate (RC2) for Hadoop 3.1.4.
> >
> > The RC is available at:
> http://people.apache.org/~gabota/hadoop-3.1.4-RC2/
> > The RC tag in git is here:
> > https://github.com/apache/hadoop/releases/tag/release-3.1.4-RC2
> > The maven artifacts are staged at
> > https://repository.apache.org/content/repositories/orgapachehadoop-1269/
> >
> > You can find my public key at:
> > https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
> > and http://keys.gnupg.net/pks/lookup?op=get&search=0xB86249D83539B38C
> >
> > Please try the release and vote. The vote will run for 5 weekdays,
> > until July 6. 2020. 23:00 CET.
> >
> > The release includes the revert of HDFS-14941, as it caused
> > HDFS-15421. IBR leak causes standby NN to be stuck in safe mode.
> > (https://issues.apache.org/jira/browse/HDFS-15421)
> > The release includes HDFS-15323, as requested.
> > (https://issues.apache.org/jira/browse/HDFS-15323)
> >
> > Thanks,
> > Gabor
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
> > For additional commands, e-mail: common-dev-help@hadoop.apache.org
> >
>

Re: [VOTE] Release Apache Hadoop 3.1.4 (RC2)

Posted by Prabhu Joseph <pr...@gmail.com>.
Hi Masatake,

     The thread is waiting for a ReadLock, we need to check what the other
thread holding WriteLock is blocked on.
Can you get three consecutive complete jstack of ResourceManager during the
issue.

>> I got no issue if RM-HA is disabled.

Looks RM is not able to access Zookeeper State Store. Can you check if
there is any connectivity issue between RM and Zookeeper.

Thanks,
Prabhu Joseph


On Mon, Jul 6, 2020 at 2:44 AM Masatake Iwasaki <iw...@oss.nttdata.co.jp>
wrote:

> Thanks for putting this up, Gabor Bota.
>
> I'm testing the RC2 on 3 node docker cluster with NN-HA and RM-HA enabled.
> ResourceManager reproducibly blocks on submitApplication while launching
> example MR jobs.
> Does anyone run into the same issue?
>
> The same configuration worked for 3.1.3.
> I got no issue if RM-HA is disabled.
>
>
> "IPC Server handler 1 on default port 8032" #167 daemon prio=5 os_prio=0
> tid=0x00007fe91821ec50 nid=0x3b9 waiting on condition [0x00007fe901bac000]
>     java.lang.Thread.State: WAITING (parking)
>          at sun.misc.Unsafe.park(Native Method)
>          - parking to wait for  <0x0000000085d37a40> (a
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>          at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>          at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>          at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>          at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>          at
>
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.checkAndGetApplicationPriority(CapacityScheduler.java:2521)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:417)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:342)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:678)
>          at
>
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277)
>          at
>
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563)
>          at
>
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527)
>          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
>          at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1015)
>          at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
>          at java.security.AccessController.doPrivileged(Native Method)
>          at javax.security.auth.Subject.doAs(Subject.java:422)
>          at
>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2943)
>
>
> Masatake Iwasaki
>
> On 2020/06/26 22:51, Gabor Bota wrote:
> > Hi folks,
> >
> > I have put together a release candidate (RC2) for Hadoop 3.1.4.
> >
> > The RC is available at:
> http://people.apache.org/~gabota/hadoop-3.1.4-RC2/
> > The RC tag in git is here:
> > https://github.com/apache/hadoop/releases/tag/release-3.1.4-RC2
> > The maven artifacts are staged at
> > https://repository.apache.org/content/repositories/orgapachehadoop-1269/
> >
> > You can find my public key at:
> > https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
> > and http://keys.gnupg.net/pks/lookup?op=get&search=0xB86249D83539B38C
> >
> > Please try the release and vote. The vote will run for 5 weekdays,
> > until July 6. 2020. 23:00 CET.
> >
> > The release includes the revert of HDFS-14941, as it caused
> > HDFS-15421. IBR leak causes standby NN to be stuck in safe mode.
> > (https://issues.apache.org/jira/browse/HDFS-15421)
> > The release includes HDFS-15323, as requested.
> > (https://issues.apache.org/jira/browse/HDFS-15323)
> >
> > Thanks,
> > Gabor
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
> > For additional commands, e-mail: common-dev-help@hadoop.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: yarn-dev-unsubscribe@hadoop.apache.org
> For additional commands, e-mail: yarn-dev-help@hadoop.apache.org
>
>

Re: [VOTE] Release Apache Hadoop 3.1.4 (RC2)

Posted by Prabhu Joseph <pr...@gmail.com>.
Hi Masatake,

     The thread is waiting for a ReadLock, we need to check what the other
thread holding WriteLock is blocked on.
Can you get three consecutive complete jstack of ResourceManager during the
issue.

>> I got no issue if RM-HA is disabled.

Looks RM is not able to access Zookeeper State Store. Can you check if
there is any connectivity issue between RM and Zookeeper.

Thanks,
Prabhu Joseph


On Mon, Jul 6, 2020 at 2:44 AM Masatake Iwasaki <iw...@oss.nttdata.co.jp>
wrote:

> Thanks for putting this up, Gabor Bota.
>
> I'm testing the RC2 on 3 node docker cluster with NN-HA and RM-HA enabled.
> ResourceManager reproducibly blocks on submitApplication while launching
> example MR jobs.
> Does anyone run into the same issue?
>
> The same configuration worked for 3.1.3.
> I got no issue if RM-HA is disabled.
>
>
> "IPC Server handler 1 on default port 8032" #167 daemon prio=5 os_prio=0
> tid=0x00007fe91821ec50 nid=0x3b9 waiting on condition [0x00007fe901bac000]
>     java.lang.Thread.State: WAITING (parking)
>          at sun.misc.Unsafe.park(Native Method)
>          - parking to wait for  <0x0000000085d37a40> (a
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>          at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>          at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>          at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>          at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>          at
>
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.checkAndGetApplicationPriority(CapacityScheduler.java:2521)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:417)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:342)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:678)
>          at
>
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277)
>          at
>
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563)
>          at
>
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527)
>          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
>          at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1015)
>          at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
>          at java.security.AccessController.doPrivileged(Native Method)
>          at javax.security.auth.Subject.doAs(Subject.java:422)
>          at
>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2943)
>
>
> Masatake Iwasaki
>
> On 2020/06/26 22:51, Gabor Bota wrote:
> > Hi folks,
> >
> > I have put together a release candidate (RC2) for Hadoop 3.1.4.
> >
> > The RC is available at:
> http://people.apache.org/~gabota/hadoop-3.1.4-RC2/
> > The RC tag in git is here:
> > https://github.com/apache/hadoop/releases/tag/release-3.1.4-RC2
> > The maven artifacts are staged at
> > https://repository.apache.org/content/repositories/orgapachehadoop-1269/
> >
> > You can find my public key at:
> > https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
> > and http://keys.gnupg.net/pks/lookup?op=get&search=0xB86249D83539B38C
> >
> > Please try the release and vote. The vote will run for 5 weekdays,
> > until July 6. 2020. 23:00 CET.
> >
> > The release includes the revert of HDFS-14941, as it caused
> > HDFS-15421. IBR leak causes standby NN to be stuck in safe mode.
> > (https://issues.apache.org/jira/browse/HDFS-15421)
> > The release includes HDFS-15323, as requested.
> > (https://issues.apache.org/jira/browse/HDFS-15323)
> >
> > Thanks,
> > Gabor
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
> > For additional commands, e-mail: common-dev-help@hadoop.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: yarn-dev-unsubscribe@hadoop.apache.org
> For additional commands, e-mail: yarn-dev-help@hadoop.apache.org
>
>

Re: [VOTE] Release Apache Hadoop 3.1.4 (RC2)

Posted by Steve Loughran <st...@cloudera.com.INVALID>.
hmm

YARN-9341 went through all of the YARN lock code; it's in 3.3 but not in
3.1. And we do not want to attempt to backport 175KB of lock
acquire/release code, do we?

anyone in yarn-dev got any thoughts here?

On Sun, 5 Jul 2020 at 22:14, Masatake Iwasaki <iw...@oss.nttdata.co.jp>
wrote:

> Thanks for putting this up, Gabor Bota.
>
> I'm testing the RC2 on 3 node docker cluster with NN-HA and RM-HA enabled.
> ResourceManager reproducibly blocks on submitApplication while launching
> example MR jobs.
> Does anyone run into the same issue?
>
> The same configuration worked for 3.1.3.
> I got no issue if RM-HA is disabled.
>
>
> "IPC Server handler 1 on default port 8032" #167 daemon prio=5 os_prio=0
> tid=0x00007fe91821ec50 nid=0x3b9 waiting on condition [0x00007fe901bac000]
>     java.lang.Thread.State: WAITING (parking)
>          at sun.misc.Unsafe.park(Native Method)
>          - parking to wait for  <0x0000000085d37a40> (a
> java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
>          at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>          at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>          at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
>          at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
>          at
>
> java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.checkAndGetApplicationPriority(CapacityScheduler.java:2521)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:417)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:342)
>          at
>
> org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:678)
>          at
>
> org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277)
>          at
>
> org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563)
>          at
>
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527)
>          at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
>          at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1015)
>          at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
>          at java.security.AccessController.doPrivileged(Native Method)
>          at javax.security.auth.Subject.doAs(Subject.java:422)
>          at
>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>          at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2943)
>
>
> Masatake Iwasaki
>
> On 2020/06/26 22:51, Gabor Bota wrote:
> > Hi folks,
> >
> > I have put together a release candidate (RC2) for Hadoop 3.1.4.
> >
> > The RC is available at:
> http://people.apache.org/~gabota/hadoop-3.1.4-RC2/
> > The RC tag in git is here:
> > https://github.com/apache/hadoop/releases/tag/release-3.1.4-RC2
> > The maven artifacts are staged at
> > https://repository.apache.org/content/repositories/orgapachehadoop-1269/
> >
> > You can find my public key at:
> > https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
> > and http://keys.gnupg.net/pks/lookup?op=get&search=0xB86249D83539B38C
> >
> > Please try the release and vote. The vote will run for 5 weekdays,
> > until July 6. 2020. 23:00 CET.
> >
> > The release includes the revert of HDFS-14941, as it caused
> > HDFS-15421. IBR leak causes standby NN to be stuck in safe mode.
> > (https://issues.apache.org/jira/browse/HDFS-15421)
> > The release includes HDFS-15323, as requested.
> > (https://issues.apache.org/jira/browse/HDFS-15323)
> >
> > Thanks,
> > Gabor
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
> > For additional commands, e-mail: common-dev-help@hadoop.apache.org
> >
>

Re: [VOTE] Release Apache Hadoop 3.1.4 (RC2)

Posted by Masatake Iwasaki <iw...@oss.nttdata.co.jp>.
Thanks for putting this up, Gabor Bota.

I'm testing the RC2 on 3 node docker cluster with NN-HA and RM-HA enabled.
ResourceManager reproducibly blocks on submitApplication while launching 
example MR jobs.
Does anyone run into the same issue?

The same configuration worked for 3.1.3.
I got no issue if RM-HA is disabled.


"IPC Server handler 1 on default port 8032" #167 daemon prio=5 os_prio=0 
tid=0x00007fe91821ec50 nid=0x3b9 waiting on condition [0x00007fe901bac000]
    java.lang.Thread.State: WAITING (parking)
         at sun.misc.Unsafe.park(Native Method)
         - parking to wait for  <0x0000000085d37a40> (a 
java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
         at 
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
         at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
         at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
         at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
         at 
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
         at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.checkAndGetApplicationPriority(CapacityScheduler.java:2521)
         at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:417)
         at 
org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:342)
         at 
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:678)
         at 
org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:277)
         at 
org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:563)
         at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:527)
         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1036)
         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1015)
         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:943)
         at java.security.AccessController.doPrivileged(Native Method)
         at javax.security.auth.Subject.doAs(Subject.java:422)
         at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2943)


Masatake Iwasaki

On 2020/06/26 22:51, Gabor Bota wrote:
> Hi folks,
>
> I have put together a release candidate (RC2) for Hadoop 3.1.4.
>
> The RC is available at: http://people.apache.org/~gabota/hadoop-3.1.4-RC2/
> The RC tag in git is here:
> https://github.com/apache/hadoop/releases/tag/release-3.1.4-RC2
> The maven artifacts are staged at
> https://repository.apache.org/content/repositories/orgapachehadoop-1269/
>
> You can find my public key at:
> https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
> and http://keys.gnupg.net/pks/lookup?op=get&search=0xB86249D83539B38C
>
> Please try the release and vote. The vote will run for 5 weekdays,
> until July 6. 2020. 23:00 CET.
>
> The release includes the revert of HDFS-14941, as it caused
> HDFS-15421. IBR leak causes standby NN to be stuck in safe mode.
> (https://issues.apache.org/jira/browse/HDFS-15421)
> The release includes HDFS-15323, as requested.
> (https://issues.apache.org/jira/browse/HDFS-15323)
>
> Thanks,
> Gabor
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
> For additional commands, e-mail: common-dev-help@hadoop.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-dev-help@hadoop.apache.org


Re: [VOTE] Release Apache Hadoop 3.1.4 (RC2)

Posted by Steve Loughran <st...@cloudera.com.INVALID>.
+1, with the instruction "warn everyone about the guava update possibly
breaking things at run time"

With the key issues being
* code compiled with the new guava release will not link against the older
releases, even without any changes in the source files.
* this includes hadoop-common

Applications which exclude the guava dependency published by hadoop-
artifacts to use their own, must set guava.version=27.0-jre or
guava.version=27.0 to be consistent with that of this release.


My tests were all with using the artifacts downstream via maven; I trust
others to look at the big tarball release.


*Project 1: cloudstore*


This is my extra diagnostics and cloud utils module.
https://github.com/steveloughran/cloudstore


All compiled fine, but the tests failed on guava linkage

testNoOverwriteDest(org.apache.hadoop.tools.cloudup.TestLocalCloudup)  Time
elapsed: 0.012 sec  <<< ERROR! java.lang.NoSuchMethodError: 'void
com.google.common.base.Preconditions.checkArgument(boolean,
java.lang.String, java.lang.Object, java.lang.Object)'
at org.apache.hadoop.fs.tools.cloudup.Cloudup.run(Cloudup.java:177)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
at org.apache.hadoop.tools.store.StoreTestUtils.exec(StoreTestUtils.java:4


Note: that app is designed to run against hadoop branch-2 and other
branches, so I ended up reimplementing the checkArgument and checkState
calls so that I can have a binary which links everywhere. My code, nothing
serious.

*Project 2: Spark*


apache spark main branch built with maven (not tried the SBT build).


mvn -T 1  -Phadoop-3.2 -Dhadoop.version=3.1.4 -Psnapshots-and-staging
-Phadoop-cloud,yarn,kinesis-asl -DskipTests clean package

All good. Then I ran the committer unit test suite

mvn -T 1  -Phadoop-3.2 -Dhadoop.version=3.1.4
-Phadoop-cloud,yarn,kinesis-as  -Psnapshots-and-staging --pl hadoop-cloud
test

CommitterBindingSuite:
*** RUN ABORTED ***
  java.lang.NoSuchMethodError: 'void
com.google.common.base.Preconditions.checkArgument(boolean,
java.lang.String, java.lang.Object)'
  at org.apache.hadoop.conf.Configuration.set(Configuration.java:1357)
  at org.apache.hadoop.conf.Configuration.set(Configuration.java:1338)
  at
org.apache.spark.internal.io.cloud.CommitterBindingSuite.newJob(CommitterBindingSuite.scala:89)
  at
org.apache.spark.internal.io.cloud.CommitterBindingSuite.$anonfun$new$1(CommitterBindingSuite.scala:55)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
  ...

Fix: again, tell the build this is a later version of Guava:


mvn -T 1  -Phadoop-3.2 -Dhadoop.version=3.1.4
-Phadoop-cloud,yarn,kinesis-asl  -Psnapshots-and-staging --pl hadoop-cloud
-Dguava.version=27.0-jre test


The mismatch doesn't break Spark internally (they shade their stuff
anyway); the guava.version here is actually the one which Hadoop is to be
linked with.

outcome: tests work

[INFO] --- scalatest-maven-plugin:2.0.0:test (test) @
spark-hadoop-cloud_2.12 ---
Discovery starting.
Discovery completed in 438 milliseconds.
Run starting. Expected test count is: 4
CommitterBindingSuite:
- BindingParquetOutputCommitter binds to the inner committer
- committer protocol can be serialized and deserialized
- local filesystem instantiation
- reject dynamic partitioning
Run completed in 1 second, 411 milliseconds.
Total number of tests run: 4
Suites: completed 2, aborted 0
Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0

This is a real PITA, and it's invariably those checkArgument calls, because
the later guava versions added some overloaded methods. Compile existing
source with a later guava version and the .class no longer binds to the
older guava version, even though no new guava APIs have been adopted.
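
A hedged illustration of why the binding changes; the class and call below
are made up for this example, not lines from the Hadoop source:

import com.google.common.base.Preconditions;

public class OverloadDemo {
  // Compiled against Guava 27.x, javac resolves this call to the
  // fixed-arity overload checkArgument(boolean, String, Object, Object),
  // which only exists from Guava 20 onwards. Compiled against 11.0.2, the
  // same source resolves to checkArgument(boolean, String, Object...).
  // So a .class built with 27.x throws NoSuchMethodError when run against
  // 11.0.2, even though the source never changed.
  static void setQuota(long quota, String user) {
    Preconditions.checkArgument(quota >= 0,
        "negative quota %s for user %s", quota, user);
  }

  public static void main(String[] args) {
    setQuota(42, "alice");
  }
}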

I am really tempted to go through src/**/*.java and replace all Guava
checkArgument/checkState with our own implementation in hadoop.common, at
least for any which uses the vararg variant. But: it'd be a big change and
there may be related issues elsewhere. At least now things fail fast.
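
For what it's worth, a minimal sketch of what such a local helper could
look like; this is an assumption about the approach, not the actual
hadoop.common code, and the names are invented:

public final class LocalPreconditions {

  private LocalPreconditions() {
  }

  // Single varargs overload only, so call sites compiled against this class
  // resolve the same way on every classpath; no guava involved.
  public static void checkArgument(boolean expression,
      String template, Object... args) {
    if (!expression) {
      throw new IllegalArgumentException(String.format(template, args));
    }
  }

  public static void checkState(boolean expression,
      String template, Object... args) {
    if (!expression) {
      throw new IllegalStateException(String.format(template, args));
    }
  }
}

Call sites would then import this class instead of the guava one; templates
written with %s keep working since String.format handles them.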

*Project 3: spark cloud integration  *

https://github.com/hortonworks-spark/cloud-integration

This is where the functional tests for the s3a committer through spark live

-Dhadoop.version=3.1.2 -Dspark.version=3.1.0-SNAPSHOT
-Psnapshots-and-staging

and a full test run

mvn test -Dcloud.test.configuration.file=../test-configs/s3a.xml --pl
cloud-examples -Dhadoop.version=3.1.2 -Dspark.version=3.1.0-SNAPSHOT
-Psnapshots-and-staging

All good. A couple of test failures, but that was because one of my test
datasets is not on any bucket I have...will have to fix that.


To conclude: the artefacts are all there, existing code compiles against
the new version without obvious problems. Where people will see stack
traces is from the guava update. It is frustrating, but there is nothing we
can do about it. All we can do is remember to ourselves "don't add
overloaded methods where you have already shipped an implementation with a
varargs one"

For the release notes: we need to explain what is happening and why.



On Fri, 26 Jun 2020 at 14:51, Gabor Bota <ga...@cloudera.com> wrote:

> Hi folks,
>
> I have put together a release candidate (RC2) for Hadoop 3.1.4.
>
> The RC is available at: http://people.apache.org/~gabota/hadoop-3.1.4-RC2/
> The RC tag in git is here:
> https://github.com/apache/hadoop/releases/tag/release-3.1.4-RC2
> The maven artifacts are staged at
> https://repository.apache.org/content/repositories/orgapachehadoop-1269/
>
> You can find my public key at:
> https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
> and http://keys.gnupg.net/pks/lookup?op=get&search=0xB86249D83539B38C
>
> Please try the release and vote. The vote will run for 5 weekdays,
> until July 6. 2020. 23:00 CET.
>
> The release includes the revert of HDFS-14941, as it caused
> HDFS-15421. IBR leak causes standby NN to be stuck in safe mode.
> (https://issues.apache.org/jira/browse/HDFS-15421)
> The release includes HDFS-15323, as requested.
> (https://issues.apache.org/jira/browse/HDFS-15323)
>
> Thanks,
> Gabor
>

Re: [VOTE] Release Apache Hadoop 3.1.4 (RC2)

Posted by Steve Loughran <st...@cloudera.com.INVALID>.
Mukund - thank you for running these tests. Both of them are things we've
fixed, and in both cases the problems are in the tests, not the production
code.

On Wed, 1 Jul 2020 at 14:22, Mukund Madhav Thakur <mt...@cloudera.com>
wrote:

> Compile the distribution using  mvn package -Pdist -DskipTests
> -Dmaven.javadoc.skip=true  -DskipShade and run some hadoop fs commands. All
> good there.
>
> Then I ran the hadoop-aws tests and saw following failures:
>
> [*ERROR*] *Failures: *
>
> [*ERROR*] *
> ITestS3AMiscOperations.testEmptyFileChecksums:147->Assert.assertEquals:118->Assert.failNotEquals:743->Assert.fail:88
> checksums expected:<etag: "6dd081d9f4abc2fb88fb75f94c84a85f"> but
> was:<etag: "aa7c140fc86610c1d0d188acb572036c">*
>
> [*ERROR*] *
> ITestS3AMiscOperations.testNonEmptyFileChecksumsUnencrypted:199->Assert.assertEquals:118->Assert.failNotEquals:743->Assert.fail:88
> checksums expected:<etag: "381e9886ed6117722fa9080e5234202f"> but
> was:<etag: "10908d6e1c24362a79a3cd4c5aafb1a1">*
>

You've got a bucket encrypting things, so the checksums come back
different. We've tweaked those tests so that on 3.3 we look at the bucket
and skip the test if there's any default encryption policy

https://issues.apache.org/jira/browse/HADOOP-16319
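
As a rough sketch of that kind of guard - the property name and helper are
invented for illustration, the real change is what HADOOP-16319 tracks:

import org.apache.hadoop.conf.Configuration;
import org.junit.Assume;

public class EncryptionAwareSkip {
  // If the test bucket applies a default encryption policy, the etag is no
  // longer a plain MD5, so the checksum assertion is skipped rather than
  // failed. "fs.s3a.test.bucket.encrypted" is made up for this sketch.
  static void skipIfBucketEncrypted(Configuration conf) {
    boolean encrypted = conf.getBoolean("fs.s3a.test.bucket.encrypted", false);
    Assume.assumeFalse(
        "bucket has a default encryption policy; etags will not match MD5",
        encrypted);
  }
}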



> These were the same failures which I saw in RC0 as well. I think these are
> known failures.
>
>
> Apart from that, all of my AssumedRole tests are failing AccessDenied
> exception like
>
> [*ERROR*]
> testPartialDeleteSingleDelete(org.apache.hadoop.fs.s3a.auth.ITestAssumeRole)
> Time elapsed: 3.359 s  <<< ERROR!
>
> org.apache.hadoop.fs.s3a.AWSServiceIOException: initTable on mthakur-data:
> com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: User:
> arn:aws:sts::152813717728:assumed-role/mthakur-assumed-role/valid is not
> authorized to perform: dynamodb:DescribeTable on resource:
> arn:aws:dynamodb:ap-south-1:152813717728:table/mthakur-data (Service:
> AmazonDynamoDBv2; Status Code: 400; Error Code: AccessDeniedException;
> Request ID: UJLKVGJ9I1S9TQF3AEPHVGENVJVV4KQNSO5AEMVJF66Q9ASUAAJG): User:
> arn:aws:sts::152813717728:assumed-role/mthakur-assumed-role/valid is not
> authorized to perform: dynamodb:DescribeTable on resource:
> arn:aws:dynamodb:ap-south-1:152813717728:table/mthakur-data (Service:
> AmazonDynamoDBv2; Status Code: 400; Error Code: AccessDeniedException;
> Request ID: UJLKVGJ9I1S9TQF3AEPHVGENVJVV4KQNSO5AEMVJF66Q9ASUAAJG)
>
> at
> org.apache.hadoop.fs.s3a.auth.ITestAssumeRole.executePartialDelete(ITestAssumeRole.java:759)
>
> at
> org.apache.hadoop.fs.s3a.auth.ITestAssumeRole.testPartialDeleteSingleDelete(ITestAssumeRole.java:735)
>
>
> I checked my policy and could verify that dynamodb:DescribeTable access is
> present there.
>
>
> So just to cross check, I ran the AssumedRole test with the same configs
> for apache/trunk and it succeeded. Not sure if this is a false alarm but I
> think it would be better if someone else run these AssumedRole tests as
> well and verify.
>

That's https://issues.apache.org/jira/browse/HADOOP-15583

nothing to worry about


>>

Re: [VOTE] Release Apache Hadoop 3.1.4 (RC2)

Posted by Mukund Madhav Thakur <mt...@cloudera.com.INVALID>.
Compiled the distribution using mvn package -Pdist -DskipTests
-Dmaven.javadoc.skip=true -DskipShade and ran some hadoop fs commands. All
good there.

Then I ran the hadoop-aws tests and saw the following failures:

[*ERROR*] *Failures: *

[*ERROR*] *
ITestS3AMiscOperations.testEmptyFileChecksums:147->Assert.assertEquals:118->Assert.failNotEquals:743->Assert.fail:88
checksums expected:<etag: "6dd081d9f4abc2fb88fb75f94c84a85f"> but
was:<etag: "aa7c140fc86610c1d0d188acb572036c">*

[*ERROR*] *
ITestS3AMiscOperations.testNonEmptyFileChecksumsUnencrypted:199->Assert.assertEquals:118->Assert.failNotEquals:743->Assert.fail:88
checksums expected:<etag: "381e9886ed6117722fa9080e5234202f"> but
was:<etag: "10908d6e1c24362a79a3cd4c5aafb1a1">*


These were the same failures which I saw in RC0 as well. I think these are
known failures.


Apart from that, all of my AssumedRole tests are failing with an
AccessDenied exception like:

[*ERROR*]
testPartialDeleteSingleDelete(org.apache.hadoop.fs.s3a.auth.ITestAssumeRole)
Time elapsed: 3.359 s  <<< ERROR!

org.apache.hadoop.fs.s3a.AWSServiceIOException: initTable on mthakur-data:
com.amazonaws.services.dynamodbv2.model.AmazonDynamoDBException: User:
arn:aws:sts::152813717728:assumed-role/mthakur-assumed-role/valid is not
authorized to perform: dynamodb:DescribeTable on resource:
arn:aws:dynamodb:ap-south-1:152813717728:table/mthakur-data (Service:
AmazonDynamoDBv2; Status Code: 400; Error Code: AccessDeniedException;
Request ID: UJLKVGJ9I1S9TQF3AEPHVGENVJVV4KQNSO5AEMVJF66Q9ASUAAJG): User:
arn:aws:sts::152813717728:assumed-role/mthakur-assumed-role/valid is not
authorized to perform: dynamodb:DescribeTable on resource:
arn:aws:dynamodb:ap-south-1:152813717728:table/mthakur-data (Service:
AmazonDynamoDBv2; Status Code: 400; Error Code: AccessDeniedException;
Request ID: UJLKVGJ9I1S9TQF3AEPHVGENVJVV4KQNSO5AEMVJF66Q9ASUAAJG)

at
org.apache.hadoop.fs.s3a.auth.ITestAssumeRole.executePartialDelete(ITestAssumeRole.java:759)

at
org.apache.hadoop.fs.s3a.auth.ITestAssumeRole.testPartialDeleteSingleDelete(ITestAssumeRole.java:735)


I checked my policy and could verify that dynamodb:DescribeTable access is
present there.


So just to cross-check, I ran the AssumedRole test with the same configs
for apache/trunk and it succeeded. Not sure if this is a false alarm, but I
think it would be better if someone else ran these AssumedRole tests as
well and verified.


Thanks

Mukund

On Fri, Jun 26, 2020 at 7:21 PM Gabor Bota <ga...@cloudera.com> wrote:

> Hi folks,
>
> I have put together a release candidate (RC2) for Hadoop 3.1.4.
>
> The RC is available at: http://people.apache.org/~gabota/hadoop-3.1.4-RC2/
> The RC tag in git is here:
> https://github.com/apache/hadoop/releases/tag/release-3.1.4-RC2
> The maven artifacts are staged at
> https://repository.apache.org/content/repositories/orgapachehadoop-1269/
>
> You can find my public key at:
> https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
> and http://keys.gnupg.net/pks/lookup?op=get&search=0xB86249D83539B38C
>
> Please try the release and vote. The vote will run for 5 weekdays,
> until July 6. 2020. 23:00 CET.
>
> The release includes the revert of HDFS-14941, as it caused
> HDFS-15421. IBR leak causes standby NN to be stuck in safe mode.
> (https://issues.apache.org/jira/browse/HDFS-15421)
> The release includes HDFS-15323, as requested.
> (https://issues.apache.org/jira/browse/HDFS-15323)
>
> Thanks,
> Gabor
>
