Posted to dev@spark.apache.org by Cody Koeninger <co...@mediacrossing.com> on 2014/07/14 22:42:38 UTC

Reproducible deadlock in 1.0.1, possibly related to Spark-1097

Hi all, just wanted to give a heads up that we're seeing a reproducible
deadlock with Spark 1.0.1 built against Hadoop 2.3.0-mr1-cdh5.0.2.

If jira is a better place for this, apologies in advance - figured talking
about it on the mailing list was friendlier than randomly (re)opening jira
tickets.

I know Gary had mentioned some issues with 1.0.1 on the mailing list; once
we got a thread dump I wanted to follow up.

The thread dump shows the deadlock occurs in the synchronized block of code
that was changed in HadoopRDD.scala for the SPARK-1097 issue.

Relevant portions of the thread dump are summarized below; we can provide
the whole dump if it's useful.

Found one Java-level deadlock:
=============================
"Executor task launch worker-1":
  waiting to lock monitor 0x00007f250400c520 (object 0x00000000fae7dc30, a org.apache.hadoop.conf.Configuration),
  which is held by "Executor task launch worker-0"
"Executor task launch worker-0":
  waiting to lock monitor 0x00007f2520495620 (object 0x00000000faeb4fc8, a java.lang.Class),
  which is held by "Executor task launch worker-1"


"Executor task launch worker-1":
        at org.apache.hadoop.conf.Configuration.reloadConfiguration(Configuration.java:791)
        - waiting to lock <0x00000000fae7dc30> (a org.apache.hadoop.conf.Configuration)
        at org.apache.hadoop.conf.Configuration.addDefaultResource(Configuration.java:690)
        - locked <0x00000000faca6ff8> (a java.lang.Class for org.apache.hadoop.conf.Configuration)
        at org.apache.hadoop.hdfs.HdfsConfiguration.<clinit>(HdfsConfiguration.java:34)
        at org.apache.hadoop.hdfs.DistributedFileSystem.<clinit>(DistributedFileSystem.java:110)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
        at java.lang.Class.newInstance0(Class.java:374)
        at java.lang.Class.newInstance(Class.java:327)
        at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
        at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
        at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2364)
        - locked <0x00000000faeb4fc8> (a java.lang.Class for org.apache.hadoop.fs.FileSystem)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
        at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
        at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
        at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
        at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
        at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
        at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)



...elided...


"Executor task launch worker-0" daemon prio=10 tid=0x0000000001e71800
nid=0x2d97 waiting for monitor entry [0x00007f24d2bf1000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at
org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2362)
        - waiting to lock <0x00000000faeb4fc8> (a java.lang.Class for
org.apache.hadoop.fs.FileSystem)
        at
org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
        at
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
        at
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
        at
org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
        at
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
        at
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
        at
org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
        at
org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
        at
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
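
The two stacks show a textbook lock-ordering inversion: worker-0 holds the
Configuration's monitor (presumably the broadcast conf locked inside
HadoopRDD.getJobConf) and is waiting for the FileSystem class lock inside
loadFileSystems, while worker-1 already holds the FileSystem class lock and,
via HdfsConfiguration's static initializer, calls reloadConfiguration() on
registered Configuration objects, including the one worker-0 holds. As an
illustration only, here is a minimal, self-contained Scala sketch of the same
pattern, with plain lock objects standing in for the two Hadoop monitors
(hypothetical names, not Spark's actual code):

    import java.util.concurrent.CountDownLatch

    object LockOrderDeadlockSketch {
      private val confLock = new Object             // stands in for the Configuration instance's monitor
      private val fileSystemClassLock = new Object  // stands in for FileSystem.class

      def main(args: Array[String]): Unit = {
        val bothHoldFirstLock = new CountDownLatch(2)

        // worker-0: takes the Configuration first, then needs the FileSystem class
        // lock (the getJobConf -> setInputPaths -> FileSystem.get path above).
        val worker0 = new Thread(new Runnable {
          def run(): Unit = confLock.synchronized {
            bothHoldFirstLock.countDown(); bothHoldFirstLock.await()
            fileSystemClassLock.synchronized { println("worker-0 finished") }
          }
        }, "worker-0")

        // worker-1: takes the FileSystem class lock first, then needs the
        // Configuration (the loadFileSystems -> HdfsConfiguration.<clinit> ->
        // reloadConfiguration path above).
        val worker1 = new Thread(new Runnable {
          def run(): Unit = fileSystemClassLock.synchronized {
            bothHoldFirstLock.countDown(); bothHoldFirstLock.await()
            confLock.synchronized { println("worker-1 finished") }
          }
        }, "worker-1")

        worker0.start(); worker1.start()
        worker0.join(); worker1.join()  // never returns: each thread waits on the lock the other holds
      }
    }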

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

Posted by Aaron Davidson <il...@gmail.com>.
The patch won't solve the problem where two threads try to add a
configuration option at the same time, but I think there is currently an
issue where two threads can try to initialize the Configuration at the same
time and still run into a ConcurrentModificationException. This at least
reduces (slightly) the scope of the exception, although eliminating it may
not be possible.
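
As an illustration of the distinction (a rough Scala sketch with hypothetical
names, not the code in PR #1409): serializing JobConf creation behind a single
JVM-wide lock covers two threads initializing a conf at the same time, but
threads that later add options to a conf they still share are not covered by
that lock.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapred.JobConf

    object JobConfGuardSketch {
      // Hypothetical JVM-wide lock: two tasks can no longer *initialize* a JobConf
      // from the shared broadcast Configuration at the same time.
      private val CONF_LOCK = new Object

      def newJobConf(broadcastConf: Configuration): JobConf =
        CONF_LOCK.synchronized {
          new JobConf(broadcastConf)  // the copy is made while holding the lock
        }

      // Adding options to a conf that task threads still share is the part such a
      // lock does not cover; Configuration's internal map state is not safe for
      // unsynchronized concurrent writers, so this can still fail.
      def stillRacy(shared: Configuration, key: String, value: String): Unit =
        shared.set(key, value)  // may be called from many task threads at once
    }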


On Mon, Jul 14, 2014 at 4:35 PM, Gary Malouf <ma...@gmail.com> wrote:

> We'll try to run a build tomorrow AM.
>
>
> On Mon, Jul 14, 2014 at 7:22 PM, Patrick Wendell <pw...@gmail.com>
> wrote:
>
> > Andrew and Gary,
> >
> > Would you guys be able to test
> > https://github.com/apache/spark/pull/1409/files and see if it solves
> > your problem?
> >
> > - Patrick
> >
> > On Mon, Jul 14, 2014 at 4:18 PM, Andrew Ash <an...@andrewash.com>
> wrote:
> > > I observed a deadlock here when using the AvroInputFormat as well. The
> > > short of the issue is that there's one configuration object per JVM,
> but
> > > multiple threads, one for each task. If each thread attempts to add a
> > > configuration option to the Configuration object at once you get issues
> > > because HashMap isn't thread safe.
> > >
> > > More details to come tonight. Thanks!
> > > On Jul 14, 2014 4:11 PM, "Nishkam Ravi" <nr...@cloudera.com> wrote:
> > >
> > >> HI Patrick, I'm not aware of another place where the access happens,
> but
> > >> it's possible that it does. The original fix synchronized on the
> > >> broadcastConf object and someone reported the same exception.
> > >>
> > >>
> > >> On Mon, Jul 14, 2014 at 3:57 PM, Patrick Wendell <pw...@gmail.com>
> > >> wrote:
> > >>
> > >> > Hey Nishkam,
> > >> >
> > >> > Aaron's fix should prevent two concurrent accesses to getJobConf
> (and
> > >> > the Hadoop code therein). But if there is code elsewhere that tries
> to
> > >> > mutate the configuration, then I could see how we might still have
> the
> > >> > ConcurrentModificationException.
> > >> >
> > >> > I looked at your patch for HADOOP-10456 and the only example you
> give
> > >> > is of the data being accessed inside of getJobConf. Is it accessed
> > >> > somewhere else too from Spark that you are aware of?
> > >> >
> > >> > https://issues.apache.org/jira/browse/HADOOP-10456
> > >> >
> > >> > - Patrick
> > >> >
> > >> > On Mon, Jul 14, 2014 at 3:28 PM, Nishkam Ravi <nr...@cloudera.com>
> > >> wrote:
> > >> > > Hi Aaron, I'm not sure if synchronizing on an arbitrary lock
> object
> > >> would
> > >> > > help. I suspect we will start seeing the
> > >> ConcurrentModificationException
> > >> > > again. The right fix has gone into Hadoop through 10456.
> > >> Unfortunately, I
> > >> > > don't have any bright ideas on how to synchronize this at the
> Spark
> > >> level
> > >> > > without the risk of deadlocks.
> > >> > >
> > >> > >
> > >> > > On Mon, Jul 14, 2014 at 3:07 PM, Aaron Davidson <ilikerps@gmail.com> wrote:
> > >> > >
> > >> > >> The full jstack would still be useful, but our current working
> > theory
> > >> is
> > >> > >> that this is due to the fact that Configuration#loadDefaults goes
> > >> > through
> > >> > >> every Configuration object that was ever created (via
> > >> > >> Configuration.REGISTRY) and locks it, thus introducing a
> dependency
> > >> from
> > >> > >> new Configuration to old, otherwise unrelated, Configuration
> > objects
> > >> > that
> > >> > >> our locking did not anticipate.
> > >> > >>
> > >> > >> I have created https://github.com/apache/spark/pull/1409 to
> > hopefully
> > >> > fix
> > >> > >> this bug.
> > >> > >>
> > >> > >>
> > >> > >> On Mon, Jul 14, 2014 at 2:44 PM, Patrick Wendell <pwendell@gmail.com> wrote:
> > >> > >>
> > >> > >> > Hey Cody,
> > >> > >> >
> > >> > >> > This Jstack seems truncated, would you mind giving the entire
> > stack
> > >> > >> > trace? For the second thread, for instance, we can't see where
> > the
> > >> > >> > lock is being acquired.
> > >> > >> >
> > >> > >> > - Patrick

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

Posted by Gary Malouf <ma...@gmail.com>.
We'll try to run a build tomorrow AM.


On Mon, Jul 14, 2014 at 7:22 PM, Patrick Wendell <pw...@gmail.com> wrote:

> Andrew and Gary,
>
> Would you guys be able to test
> https://github.com/apache/spark/pull/1409/files and see if it solves
> your problem?
>
> - Patrick

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

Posted by Patrick Wendell <pw...@gmail.com>.
Andrew and Gary,

Would you guys be able to test
https://github.com/apache/spark/pull/1409/files and see if it solves
your problem?

- Patrick

On Mon, Jul 14, 2014 at 4:18 PM, Andrew Ash <an...@andrewash.com> wrote:
> I observed a deadlock here when using the AvroInputFormat as well. The
> short of the issue is that there's one configuration object per JVM, but
> multiple threads, one for each task. If each thread attempts to add a
> configuration option to the Configuration object at once you get issues
> because HashMap isn't thread safe.
>
> More details to come tonight. Thanks!

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

Posted by Andrew Ash <an...@andrewash.com>.
I observed a deadlock here when using the AvroInputFormat as well. The
short of the issue is that there's one Configuration object per JVM, but
multiple threads, one for each task. If multiple threads attempt to add a
configuration option to the Configuration object at once, you get issues
because the backing HashMap isn't thread safe.

More details to come tonight. Thanks!
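
A small Scala sketch of the hazard just described, plus one common way around
it (illustration only, not necessarily what Spark 1.0.1 does): every thread
mutating the single JVM-wide Configuration races on its non-thread-safe map
state, while a per-task copy avoids the shared mutation entirely.

    import org.apache.hadoop.conf.Configuration

    object SharedConfSketch {
      // The situation described above: one Configuration shared by the whole JVM.
      val shared = new Configuration()

      // Many task threads calling set() on the shared object race on its internal,
      // non-thread-safe map state.
      def unsafeAdd(key: String, value: String): Unit =
        shared.set(key, value)

      // A common workaround: take a defensive per-task copy and mutate that instead.
      def perTaskCopy(key: String, value: String): Configuration = {
        val local = new Configuration(shared)  // copy constructor snapshots the shared conf
        local.set(key, value)
        local
      }
    }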
On Jul 14, 2014 4:11 PM, "Nishkam Ravi" <nr...@cloudera.com> wrote:

> HI Patrick, I'm not aware of another place where the access happens, but
> it's possible that it does. The original fix synchronized on the
> broadcastConf object and someone reported the same exception.
>

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

Posted by Nishkam Ravi <nr...@cloudera.com>.
Hi Patrick, I'm not aware of another place where the access happens, but
it's possible that it does. The original fix synchronized on the
broadcastConf object and someone reported the same exception.
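
A short Scala sketch of why locking in only one place is not enough
(illustrative names only): if any other code path touches the same
Configuration without taking the same lock, a reader walking the conf can
still see it change underneath it and hit the ConcurrentModificationException
mentioned above.

    import scala.collection.JavaConverters._
    import org.apache.hadoop.conf.Configuration

    object PartialLockingSketch {
      val broadcastConf = new Configuration()

      // One path reads the conf while holding its monitor, as the original fix did...
      def readUnderLock(): Int = broadcastConf.synchronized {
        broadcastConf.iterator().asScala.size  // walks the conf's entries
      }

      // ...but a path that mutates the same object without taking that lock can
      // still race with the iteration above.
      def writeWithoutLock(key: String, value: String): Unit =
        broadcastConf.set(key, value)
    }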


On Mon, Jul 14, 2014 at 3:57 PM, Patrick Wendell <pw...@gmail.com> wrote:

> Hey Nishkam,
>
> Aaron's fix should prevent two concurrent accesses to getJobConf (and
> the Hadoop code therein). But if there is code elsewhere that tries to
> mutate the configuration, then I could see how we might still have the
> ConcurrentModificationException.
>
> I looked at your patch for HADOOP-10456 and the only example you give
> is of the data being accessed inside of getJobConf. Is it accessed
> somewhere else too from Spark that you are aware of?
>
> https://issues.apache.org/jira/browse/HADOOP-10456
>
> - Patrick
>

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

Posted by Gary Malouf <ma...@gmail.com>.
We use the Hadoop configuration inside our code executing on Spark, as we
need to list out the files in the path.  Maybe that is why the issue is exposed for us.
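
As a point of reference, here is a minimal sketch of that kind of access
pattern: listing files under a path from code running on Spark executors via
the Hadoop FileSystem API. The helper name and path handling are hypothetical,
not the actual code from this thread; the point is only that FileSystem.get
consults a Configuration, so executor threads can contend on the same
Hadoop-level locks that show up in the thread dump.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Hypothetical helper: list the files under a directory using the Hadoop
    // Configuration available to the task.
    def listInputFiles(conf: Configuration, dir: String): Seq[String] = {
      val fs = FileSystem.get(conf)  // goes through FileSystem/Configuration locking
      fs.listStatus(new Path(dir)).map(_.getPath.toString).toSeq
    }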


On Mon, Jul 14, 2014 at 6:57 PM, Patrick Wendell <pw...@gmail.com> wrote:

> Hey Nishkam,
>
> Aaron's fix should prevent two concurrent accesses to getJobConf (and
> the Hadoop code therein). But if there is code elsewhere that tries to
> mutate the configuration, then I could see how we might still have the
> ConcurrentModificationException.
>
> I looked at your patch for HADOOP-10456 and the only example you give
> is of the data being accessed inside of getJobConf. Is it accessed
> somewhere else too from Spark that you are aware of?
>
> https://issues.apache.org/jira/browse/HADOOP-10456
>
> - Patrick
>
> On Mon, Jul 14, 2014 at 3:28 PM, Nishkam Ravi <nr...@cloudera.com> wrote:
> > Hi Aaron, I'm not sure if synchronizing on an arbitrary lock object would
> > help. I suspect we will start seeing the ConcurrentModificationException
> > again. The right fix has gone into Hadoop through 10456. Unfortunately, I
> > don't have any bright ideas on how to synchronize this at the Spark level
> > without the risk of deadlocks.
> >
> >
> > On Mon, Jul 14, 2014 at 3:07 PM, Aaron Davidson <il...@gmail.com>
> wrote:
> >
> >> The full jstack would still be useful, but our current working theory is
> >> that this is due to the fact that Configuration#loadDefaults goes
> through
> >> every Configuration object that was ever created (via
> >> Configuration.REGISTRY) and locks it, thus introducing a dependency from
> >> new Configuration to old, otherwise unrelated, Configuration objects
> that
> >> our locking did not anticipate.
> >>
> >> I have created https://github.com/apache/spark/pull/1409 to hopefully
> fix
> >> this bug.
> >>
> >>
> >> On Mon, Jul 14, 2014 at 2:44 PM, Patrick Wendell <pw...@gmail.com>
> >> wrote:
> >>
> >> > Hey Cody,
> >> >
> >> > This Jstack seems truncated, would you mind giving the entire stack
> >> > trace? For the second thread, for instance, we can't see where the
> >> > lock is being acquired.
> >> >
> >> > - Patrick
> >> >
> >> > On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
> >> > <co...@mediacrossing.com> wrote:
> >> > > Hi all, just wanted to give a heads up that we're seeing a
> reproducible
> >> > > deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2
> >> > >
> >> > > If jira is a better place for this, apologies in advance - figured
> >> > talking
> >> > > about it on the mailing list was friendlier than randomly
> (re)opening
> >> > jira
> >> > > tickets.
> >> > >
> >> > > I know Gary had mentioned some issues with 1.0.1 on the mailing
> list,
> >> > once
> >> > > we got a thread dump I wanted to follow up.
> >> > >
> >> > > The thread dump shows the deadlock occurs in the synchronized block
> of
> >> > code
> >> > > that was changed in HadoopRDD.scala, for the Spark-1097 issue
> >> > >
> >> > > Relevant portions of the thread dump are summarized below, we can
> >> provide
> >> > > the whole dump if it's useful.
> >> > >
> >> > > Found one Java-level deadlock:
> >> > > =============================
> >> > > "Executor task launch worker-1":
> >> > >   waiting to lock monitor 0x00007f250400c520 (object
> >> 0x00000000fae7dc30,
> >> > a
> >> > > org.apache.hadoop.co
> >> > > nf.Configuration),
> >> > >   which is held by "Executor task launch worker-0"
> >> > > "Executor task launch worker-0":
> >> > >   waiting to lock monitor 0x00007f2520495620 (object
> >> 0x00000000faeb4fc8,
> >> > a
> >> > > java.lang.Class),
> >> > >   which is held by "Executor task launch worker-1"
> >> > >
> >> > >
> >> > > "Executor task launch worker-1":
> >> > >         at
> >> > >
> >> >
> >>
> org.apache.hadoop.conf.Configuration.reloadConfiguration(Configuration.java:791)
> >> > >         - waiting to lock <0x00000000fae7dc30> (a
> >> > > org.apache.hadoop.conf.Configuration)
> >> > >         at
> >> > >
> >> >
> >>
> org.apache.hadoop.conf.Configuration.addDefaultResource(Configuration.java:690)
> >> > >         - locked <0x00000000faca6ff8> (a java.lang.Class for
> >> > > org.apache.hadoop.conf.Configurati
> >> > > on)
> >> > >         at
> >> > >
> >> >
> >>
> org.apache.hadoop.hdfs.HdfsConfiguration.<clinit>(HdfsConfiguration.java:34)
> >> > >         at
> >> > >
> >> >
> >>
> org.apache.hadoop.hdfs.DistributedFileSystem.<clinit>(DistributedFileSystem.java:110
> >> > > )
> >> > >         at
> >> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> >> > > Method)
> >> > >         at
> >> > >
> >> >
> >>
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.
> >> > > java:57)
> >> > >         at
> >> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> >> > > Method)
> >> > >         at
> >> > >
> >> >
> >>
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.
> >> > > java:57)
> >> > >         at
> >> > >
> >> >
> >>
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAcces
> >> > > sorImpl.java:45)
> >> > >         at
> >> > java.lang.reflect.Constructor.newInstance(Constructor.java:525)
> >> > >         at java.lang.Class.newInstance0(Class.java:374)
> >> > >         at java.lang.Class.newInstance(Class.java:327)
> >> > >         at
> >> > java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
> >> > >         at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
> >> > >         at
> >> > >
> org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2364)
> >> > >         - locked <0x00000000faeb4fc8> (a java.lang.Class for
> >> > > org.apache.hadoop.fs.FileSystem)
> >> > >         at
> >> > >
> >> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
> >> > >         at
> >> > >
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
> >> > >         at
> >> org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
> >> > >         at
> >> > >
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
> >> > >         at
> >> > org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
> >> > >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
> >> > >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
> >> > >         at
> >> > >
> org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
> >> > >         at
> >> > >
> >> >
> >>
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
> >> > >         at
> >> > >
> >> >
> >>
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
> >> > >         at
> >> > >
> org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
> >> > >         at
> >> > >
> org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
> >> > >         at
> >> > >
> >> >
> >>
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
> >> > >
> >> > >
> >> > >
> >> > > ...elided...
> >> > >
> >> > >
> >> > > "Executor task launch worker-0" daemon prio=10
> tid=0x0000000001e71800
> >> > > nid=0x2d97 waiting for monitor entry [0x00007f24d2bf1000]
> >> > >    java.lang.Thread.State: BLOCKED (on object monitor)
> >> > >         at
> >> > >
> org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2362)
> >> > >         - waiting to lock <0x00000000faeb4fc8> (a java.lang.Class
> for
> >> > > org.apache.hadoop.fs.FileSystem)
> >> > >         at
> >> > >
> >> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
> >> > >         at
> >> > >
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
> >> > >         at
> >> org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
> >> > >         at
> >> > >
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
> >> > >         at
> >> > org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
> >> > >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
> >> > >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
> >> > >         at
> >> > >
> org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
> >> > >         at
> >> > >
> >> >
> >>
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
> >> > >         at
> >> > >
> >> >
> >>
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
> >> > >         at
> >> > >
> org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
> >> > >         at
> >> > >
> org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
> >> > >         at
> >> > >
> >> >
> >>
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
> >> >
> >>
>

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

Posted by Patrick Wendell <pw...@gmail.com>.
Hey Nishkam,

Aaron's fix should prevent two concurrent accesses to getJobConf (and
the Hadoop code therein). But if there is code elsewhere that tries to
mutate the configuration, then I could see how we might still have the
ConcurrentModificationException.

I looked at your patch for HADOOP-10456, and the only example you give
is of the data being accessed inside getJobConf. Is it also accessed
anywhere else from Spark that you are aware of?

https://issues.apache.org/jira/browse/HADOOP-10456

- Patrick

On Mon, Jul 14, 2014 at 3:28 PM, Nishkam Ravi <nr...@cloudera.com> wrote:
> Hi Aaron, I'm not sure if synchronizing on an arbitrary lock object would
> help. I suspect we will start seeing the ConcurrentModificationException
> again. The right fix has gone into Hadoop through 10456. Unfortunately, I
> don't have any bright ideas on how to synchronize this at the Spark level
> without the risk of deadlocks.
>
>
> On Mon, Jul 14, 2014 at 3:07 PM, Aaron Davidson <il...@gmail.com> wrote:
>
>> The full jstack would still be useful, but our current working theory is
>> that this is due to the fact that Configuration#loadDefaults goes through
>> every Configuration object that was ever created (via
>> Configuration.REGISTRY) and locks it, thus introducing a dependency from
>> new Configuration to old, otherwise unrelated, Configuration objects that
>> our locking did not anticipate.
>>
>> I have created https://github.com/apache/spark/pull/1409 to hopefully fix
>> this bug.
>>
>>
>> On Mon, Jul 14, 2014 at 2:44 PM, Patrick Wendell <pw...@gmail.com>
>> wrote:
>>
>> > Hey Cody,
>> >
>> > This Jstack seems truncated, would you mind giving the entire stack
>> > trace? For the second thread, for instance, we can't see where the
>> > lock is being acquired.
>> >
>> > - Patrick
>> >
>> > On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
>> > <co...@mediacrossing.com> wrote:
>> > > Hi all, just wanted to give a heads up that we're seeing a reproducible
>> > > deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2
>> > >
>> > > If jira is a better place for this, apologies in advance - figured
>> > talking
>> > > about it on the mailing list was friendlier than randomly (re)opening
>> > jira
>> > > tickets.
>> > >
>> > > I know Gary had mentioned some issues with 1.0.1 on the mailing list,
>> > once
>> > > we got a thread dump I wanted to follow up.
>> > >
>> > > The thread dump shows the deadlock occurs in the synchronized block of
>> > code
>> > > that was changed in HadoopRDD.scala, for the Spark-1097 issue
>> > >
>> > > Relevant portions of the thread dump are summarized below, we can
>> provide
>> > > the whole dump if it's useful.
>> > >
>> > > Found one Java-level deadlock:
>> > > =============================
>> > > "Executor task launch worker-1":
>> > >   waiting to lock monitor 0x00007f250400c520 (object
>> 0x00000000fae7dc30,
>> > a
>> > > org.apache.hadoop.co
>> > > nf.Configuration),
>> > >   which is held by "Executor task launch worker-0"
>> > > "Executor task launch worker-0":
>> > >   waiting to lock monitor 0x00007f2520495620 (object
>> 0x00000000faeb4fc8,
>> > a
>> > > java.lang.Class),
>> > >   which is held by "Executor task launch worker-1"
>> > >
>> > >
>> > > "Executor task launch worker-1":
>> > >         at
>> > >
>> >
>> org.apache.hadoop.conf.Configuration.reloadConfiguration(Configuration.java:791)
>> > >         - waiting to lock <0x00000000fae7dc30> (a
>> > > org.apache.hadoop.conf.Configuration)
>> > >         at
>> > >
>> >
>> org.apache.hadoop.conf.Configuration.addDefaultResource(Configuration.java:690)
>> > >         - locked <0x00000000faca6ff8> (a java.lang.Class for
>> > > org.apache.hadoop.conf.Configurati
>> > > on)
>> > >         at
>> > >
>> >
>> org.apache.hadoop.hdfs.HdfsConfiguration.<clinit>(HdfsConfiguration.java:34)
>> > >         at
>> > >
>> >
>> org.apache.hadoop.hdfs.DistributedFileSystem.<clinit>(DistributedFileSystem.java:110
>> > > )
>> > >         at
>> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>> > > Method)
>> > >         at
>> > >
>> >
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.
>> > > java:57)
>> > >         at
>> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>> > > Method)
>> > >         at
>> > >
>> >
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.
>> > > java:57)
>> > >         at
>> > >
>> >
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAcces
>> > > sorImpl.java:45)
>> > >         at
>> > java.lang.reflect.Constructor.newInstance(Constructor.java:525)
>> > >         at java.lang.Class.newInstance0(Class.java:374)
>> > >         at java.lang.Class.newInstance(Class.java:327)
>> > >         at
>> > java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
>> > >         at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
>> > >         at
>> > > org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2364)
>> > >         - locked <0x00000000faeb4fc8> (a java.lang.Class for
>> > > org.apache.hadoop.fs.FileSystem)
>> > >         at
>> > >
>> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
>> > >         at
>> > > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
>> > >         at
>> org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
>> > >         at
>> > > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
>> > >         at
>> > org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
>> > >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
>> > >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
>> > >         at
>> > > org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
>> > >         at
>> > >
>> >
>> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
>> > >         at
>> > >
>> >
>> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
>> > >         at
>> > > org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>> > >         at
>> > > org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>> > >         at
>> > >
>> >
>> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
>> > >
>> > >
>> > >
>> > > ...elided...
>> > >
>> > >
>> > > "Executor task launch worker-0" daemon prio=10 tid=0x0000000001e71800
>> > > nid=0x2d97 waiting for monitor entry [0x00007f24d2bf1000]
>> > >    java.lang.Thread.State: BLOCKED (on object monitor)
>> > >         at
>> > > org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2362)
>> > >         - waiting to lock <0x00000000faeb4fc8> (a java.lang.Class for
>> > > org.apache.hadoop.fs.FileSystem)
>> > >         at
>> > >
>> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
>> > >         at
>> > > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
>> > >         at
>> org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
>> > >         at
>> > > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
>> > >         at
>> > org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
>> > >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
>> > >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
>> > >         at
>> > > org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
>> > >         at
>> > >
>> >
>> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
>> > >         at
>> > >
>> >
>> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
>> > >         at
>> > > org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>> > >         at
>> > > org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>> > >         at
>> > >
>> >
>> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
>> >
>>

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

Posted by Nishkam Ravi <nr...@cloudera.com>.
Hi Aaron, I'm not sure if synchronizing on an arbitrary lock object would
help. I suspect we will start seeing the ConcurrentModificationException
again. The right fix has gone into Hadoop through HADOOP-10456. Unfortunately, I
don't have any bright ideas on how to synchronize this at the Spark level
without the risk of deadlocks.
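
To illustrate the concern, a simplified sketch with hypothetical names (not
the actual Spark or Hadoop code): a lock object held only inside Spark
serializes Spark's own accesses, but any other code path that mutates the same
shared Configuration never acquires that lock, so an iteration over the
properties, such as the Configuration copy constructor, can still fail with a
ConcurrentModificationException.

    import org.apache.hadoop.conf.Configuration

    object ConfAccess {
      private val sparkLock = new Object

      // Spark-side access, serialized only against other callers of this method.
      def cloneUnderSparkLock(shared: Configuration): Configuration =
        sparkLock.synchronized {
          new Configuration(shared)  // copy constructor iterates the properties
        }

      // User code mutating the same shared Configuration does not take
      // sparkLock, so it can interleave with the iteration above.
      def mutateFromUserCode(shared: Configuration): Unit =
        shared.set("some.key", "some.value")
    }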


On Mon, Jul 14, 2014 at 3:07 PM, Aaron Davidson <il...@gmail.com> wrote:

> The full jstack would still be useful, but our current working theory is
> that this is due to the fact that Configuration#loadDefaults goes through
> every Configuration object that was ever created (via
> Configuration.REGISTRY) and locks it, thus introducing a dependency from
> new Configuration to old, otherwise unrelated, Configuration objects that
> our locking did not anticipate.
>
> I have created https://github.com/apache/spark/pull/1409 to hopefully fix
> this bug.
>
>
> On Mon, Jul 14, 2014 at 2:44 PM, Patrick Wendell <pw...@gmail.com>
> wrote:
>
> > Hey Cody,
> >
> > This Jstack seems truncated, would you mind giving the entire stack
> > trace? For the second thread, for instance, we can't see where the
> > lock is being acquired.
> >
> > - Patrick
> >
> > On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
> > <co...@mediacrossing.com> wrote:
> > > Hi all, just wanted to give a heads up that we're seeing a reproducible
> > > deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2
> > >
> > > If jira is a better place for this, apologies in advance - figured
> > talking
> > > about it on the mailing list was friendlier than randomly (re)opening
> > jira
> > > tickets.
> > >
> > > I know Gary had mentioned some issues with 1.0.1 on the mailing list,
> > once
> > > we got a thread dump I wanted to follow up.
> > >
> > > The thread dump shows the deadlock occurs in the synchronized block of
> > code
> > > that was changed in HadoopRDD.scala, for the Spark-1097 issue
> > >
> > > Relevant portions of the thread dump are summarized below, we can
> provide
> > > the whole dump if it's useful.
> > >
> > > Found one Java-level deadlock:
> > > =============================
> > > "Executor task launch worker-1":
> > >   waiting to lock monitor 0x00007f250400c520 (object
> 0x00000000fae7dc30,
> > a
> > > org.apache.hadoop.co
> > > nf.Configuration),
> > >   which is held by "Executor task launch worker-0"
> > > "Executor task launch worker-0":
> > >   waiting to lock monitor 0x00007f2520495620 (object
> 0x00000000faeb4fc8,
> > a
> > > java.lang.Class),
> > >   which is held by "Executor task launch worker-1"
> > >
> > >
> > > "Executor task launch worker-1":
> > >         at
> > >
> >
> org.apache.hadoop.conf.Configuration.reloadConfiguration(Configuration.java:791)
> > >         - waiting to lock <0x00000000fae7dc30> (a
> > > org.apache.hadoop.conf.Configuration)
> > >         at
> > >
> >
> org.apache.hadoop.conf.Configuration.addDefaultResource(Configuration.java:690)
> > >         - locked <0x00000000faca6ff8> (a java.lang.Class for
> > > org.apache.hadoop.conf.Configurati
> > > on)
> > >         at
> > >
> >
> org.apache.hadoop.hdfs.HdfsConfiguration.<clinit>(HdfsConfiguration.java:34)
> > >         at
> > >
> >
> org.apache.hadoop.hdfs.DistributedFileSystem.<clinit>(DistributedFileSystem.java:110
> > > )
> > >         at
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > > Method)
> > >         at
> > >
> >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.
> > > java:57)
> > >         at
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > > Method)
> > >         at
> > >
> >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.
> > > java:57)
> > >         at
> > >
> >
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAcces
> > > sorImpl.java:45)
> > >         at
> > java.lang.reflect.Constructor.newInstance(Constructor.java:525)
> > >         at java.lang.Class.newInstance0(Class.java:374)
> > >         at java.lang.Class.newInstance(Class.java:327)
> > >         at
> > java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
> > >         at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
> > >         at
> > > org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2364)
> > >         - locked <0x00000000faeb4fc8> (a java.lang.Class for
> > > org.apache.hadoop.fs.FileSystem)
> > >         at
> > >
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
> > >         at
> > > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
> > >         at
> org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
> > >         at
> > > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
> > >         at
> > org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
> > >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
> > >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
> > >         at
> > > org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
> > >         at
> > >
> >
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
> > >         at
> > >
> >
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
> > >         at
> > > org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
> > >         at
> > > org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
> > >         at
> > >
> >
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
> > >
> > >
> > >
> > > ...elided...
> > >
> > >
> > > "Executor task launch worker-0" daemon prio=10 tid=0x0000000001e71800
> > > nid=0x2d97 waiting for monitor entry [0x00007f24d2bf1000]
> > >    java.lang.Thread.State: BLOCKED (on object monitor)
> > >         at
> > > org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2362)
> > >         - waiting to lock <0x00000000faeb4fc8> (a java.lang.Class for
> > > org.apache.hadoop.fs.FileSystem)
> > >         at
> > >
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
> > >         at
> > > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
> > >         at
> org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
> > >         at
> > > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
> > >         at
> > org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
> > >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
> > >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
> > >         at
> > > org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
> > >         at
> > >
> >
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
> > >         at
> > >
> >
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
> > >         at
> > > org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
> > >         at
> > > org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
> > >         at
> > >
> >
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
> >
>

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

Posted by Aaron Davidson <il...@gmail.com>.
The full jstack would still be useful, but our current working theory is
that this is because Configuration#loadDefaults goes through every
Configuration object that was ever created (via Configuration.REGISTRY) and
locks each one, thus introducing a dependency from a new Configuration to
old, otherwise unrelated, Configuration objects that our locking did not
anticipate.

I have created https://github.com/apache/spark/pull/1409 to hopefully fix
this bug.
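
As a rough sketch of the two locking strategies under discussion (simplified,
not the actual HadoopRDD code in the PR): synchronizing on the Configuration
object itself participates in Hadoop's own lock ordering, since loadDefaults
locks every registered Configuration, whereas synchronizing on an object
private to Spark keeps Hadoop's internal monitors out of any cycle.

    import org.apache.hadoop.conf.Configuration

    object JobConfLocking {
      // Strategy A: lock the shared Configuration itself. Hadoop code elsewhere
      // also locks Configuration instances, so a cycle with Hadoop-internal
      // monitors becomes possible (as in the deadlock above).
      def withConfLock[T](conf: Configuration)(body: => T): T =
        conf.synchronized(body)

      // Strategy B: lock an object that only Spark knows about, so no
      // Hadoop-internal monitor is part of the ordering.
      private val localLock = new Object
      def withPrivateLock[T](body: => T): T =
        localLock.synchronized(body)
    }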


On Mon, Jul 14, 2014 at 2:44 PM, Patrick Wendell <pw...@gmail.com> wrote:

> Hey Cody,
>
> This Jstack seems truncated, would you mind giving the entire stack
> trace? For the second thread, for instance, we can't see where the
> lock is being acquired.
>
> - Patrick
>
> On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
> <co...@mediacrossing.com> wrote:
> > Hi all, just wanted to give a heads up that we're seeing a reproducible
> > deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2
> >
> > If jira is a better place for this, apologies in advance - figured
> talking
> > about it on the mailing list was friendlier than randomly (re)opening
> jira
> > tickets.
> >
> > I know Gary had mentioned some issues with 1.0.1 on the mailing list,
> once
> > we got a thread dump I wanted to follow up.
> >
> > The thread dump shows the deadlock occurs in the synchronized block of
> code
> > that was changed in HadoopRDD.scala, for the Spark-1097 issue
> >
> > Relevant portions of the thread dump are summarized below, we can provide
> > the whole dump if it's useful.
> >
> > Found one Java-level deadlock:
> > =============================
> > "Executor task launch worker-1":
> >   waiting to lock monitor 0x00007f250400c520 (object 0x00000000fae7dc30,
> a
> > org.apache.hadoop.co
> > nf.Configuration),
> >   which is held by "Executor task launch worker-0"
> > "Executor task launch worker-0":
> >   waiting to lock monitor 0x00007f2520495620 (object 0x00000000faeb4fc8,
> a
> > java.lang.Class),
> >   which is held by "Executor task launch worker-1"
> >
> >
> > "Executor task launch worker-1":
> >         at
> >
> org.apache.hadoop.conf.Configuration.reloadConfiguration(Configuration.java:791)
> >         - waiting to lock <0x00000000fae7dc30> (a
> > org.apache.hadoop.conf.Configuration)
> >         at
> >
> org.apache.hadoop.conf.Configuration.addDefaultResource(Configuration.java:690)
> >         - locked <0x00000000faca6ff8> (a java.lang.Class for
> > org.apache.hadoop.conf.Configurati
> > on)
> >         at
> >
> org.apache.hadoop.hdfs.HdfsConfiguration.<clinit>(HdfsConfiguration.java:34)
> >         at
> >
> org.apache.hadoop.hdfs.DistributedFileSystem.<clinit>(DistributedFileSystem.java:110
> > )
> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > Method)
> >         at
> >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.
> > java:57)
> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > Method)
> >         at
> >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.
> > java:57)
> >         at
> >
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAcces
> > sorImpl.java:45)
> >         at
> java.lang.reflect.Constructor.newInstance(Constructor.java:525)
> >         at java.lang.Class.newInstance0(Class.java:374)
> >         at java.lang.Class.newInstance(Class.java:327)
> >         at
> java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
> >         at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
> >         at
> > org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2364)
> >         - locked <0x00000000faeb4fc8> (a java.lang.Class for
> > org.apache.hadoop.fs.FileSystem)
> >         at
> > org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
> >         at
> > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
> >         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
> >         at
> > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
> >         at
> org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
> >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
> >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
> >         at
> > org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
> >         at
> >
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
> >         at
> >
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
> >         at
> > org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
> >         at
> > org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
> >         at
> >
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
> >
> >
> >
> > ...elided...
> >
> >
> > "Executor task launch worker-0" daemon prio=10 tid=0x0000000001e71800
> > nid=0x2d97 waiting for monitor entry [0x00007f24d2bf1000]
> >    java.lang.Thread.State: BLOCKED (on object monitor)
> >         at
> > org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2362)
> >         - waiting to lock <0x00000000faeb4fc8> (a java.lang.Class for
> > org.apache.hadoop.fs.FileSystem)
> >         at
> > org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
> >         at
> > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
> >         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
> >         at
> > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
> >         at
> org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
> >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
> >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
> >         at
> > org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
> >         at
> >
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
> >         at
> >
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
> >         at
> > org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
> >         at
> > org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
> >         at
> >
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
>

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

Posted by Patrick Wendell <pw...@gmail.com>.
Hey Andrew,

Yeah, that would be preferable. Definitely worth investigating both,
but the regression is more pressing at the moment.

- Patrick

On Mon, Jul 14, 2014 at 10:02 PM, Andrew Ash <an...@andrewash.com> wrote:
> I don't believe mine is a regression. But it is related to thread safety on
> Hadoop Configuration objects. Should I start a new thread?
> On Jul 15, 2014 12:55 AM, "Patrick Wendell" <pw...@gmail.com> wrote:
>
>> Andrew is your issue also a regression from 1.0.0 to 1.0.1? The
>> immediate priority is addressing regressions between these two
>> releases.
>>
>> On Mon, Jul 14, 2014 at 9:05 PM, Andrew Ash <an...@andrewash.com> wrote:
>> > I'm not sure either of those PRs will fix the concurrent adds to
>> > Configuration issue I observed. I've got a stack trace and writeup I'll
>> > share in an hour or two (traveling today).
>> > On Jul 14, 2014 9:50 PM, "scwf" <wa...@huawei.com> wrote:
>> >
>> >> hi,Cody
>> >>   i met this issue days before and i post a PR for this(
>> >> https://github.com/apache/spark/pull/1385)
>> >> it's very strange that if i synchronize conf it will deadlock but it is
>> ok
>> >> when synchronize initLocalJobConfFuncOpt
>> >>
>> >>
>> >>  Here's the entire jstack output.
>> >>>
>> >>>
>> >>> On Mon, Jul 14, 2014 at 4:44 PM, Patrick Wendell <pwendell@gmail.com
>> >>> <ma...@gmail.com>> wrote:
>> >>>
>> >>>     Hey Cody,
>> >>>
>> >>>     This Jstack seems truncated, would you mind giving the entire stack
>> >>>     trace? For the second thread, for instance, we can't see where the
>> >>>     lock is being acquired.
>> >>>
>> >>>     - Patrick
>> >>>
>> >>>     On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
>> >>>     <cody.koeninger@mediacrossing.com <mailto:cody.koeninger@
>> >>> mediacrossing.com>> wrote:
>> >>>      > Hi all, just wanted to give a heads up that we're seeing a
>> >>> reproducible
>> >>>      > deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2
>> >>>      >
>> >>>      > If jira is a better place for this, apologies in advance -
>> figured
>> >>> talking
>> >>>      > about it on the mailing list was friendlier than randomly
>> >>> (re)opening jira
>> >>>      > tickets.
>> >>>      >
>> >>>      > I know Gary had mentioned some issues with 1.0.1 on the mailing
>> >>> list, once
>> >>>      > we got a thread dump I wanted to follow up.
>> >>>      >
>> >>>      > The thread dump shows the deadlock occurs in the synchronized
>> >>> block of code
>> >>>      > that was changed in HadoopRDD.scala, for the Spark-1097 issue
>> >>>      >
>> >>>      > Relevant portions of the thread dump are summarized below, we
>> can
>> >>> provide
>> >>>      > the whole dump if it's useful.
>> >>>      >
>> >>>      > Found one Java-level deadlock:
>> >>>      > =============================
>> >>>      > "Executor task launch worker-1":
>> >>>      >   waiting to lock monitor 0x00007f250400c520 (object
>> >>> 0x00000000fae7dc30, a
>> >>>      > org.apache.hadoop.co <http://org.apache.hadoop.co>
>> >>>      > nf.Configuration),
>> >>>      >   which is held by "Executor task launch worker-0"
>> >>>      > "Executor task launch worker-0":
>> >>>      >   waiting to lock monitor 0x00007f2520495620 (object
>> >>> 0x00000000faeb4fc8, a
>> >>>      > java.lang.Class),
>> >>>      >   which is held by "Executor task launch worker-1"
>> >>>      >
>> >>>      >
>> >>>      > "Executor task launch worker-1":
>> >>>      >         at
>> >>>      > org.apache.hadoop.conf.Configuration.reloadConfiguration(
>> >>> Configuration.java:791)
>> >>>      >         - waiting to lock <0x00000000fae7dc30> (a
>> >>>      > org.apache.hadoop.conf.Configuration)
>> >>>      >         at
>> >>>      > org.apache.hadoop.conf.Configuration.addDefaultResource(
>> >>> Configuration.java:690)
>> >>>      >         - locked <0x00000000faca6ff8> (a java.lang.Class for
>> >>>      > org.apache.hadoop.conf.Configurati
>> >>>      > on)
>> >>>      >         at
>> >>>      > org.apache.hadoop.hdfs.HdfsConfiguration.<clinit>(
>> >>> HdfsConfiguration.java:34)
>> >>>      >         at
>> >>>      > org.apache.hadoop.hdfs.DistributedFileSystem.<clinit>
>> >>> (DistributedFileSystem.java:110
>> >>>      > )
>> >>>      >         at sun.reflect.NativeConstructorAccessorImpl.
>> >>> newInstance0(Native
>> >>>      > Method)
>> >>>      >         at
>> >>>      > sun.reflect.NativeConstructorAccessorImpl.newInstance(
>> >>> NativeConstructorAccessorImpl.
>> >>>      > java:57)
>> >>>      >         at sun.reflect.NativeConstructorAccessorImpl.
>> >>> newInstance0(Native
>> >>>      > Method)
>> >>>      >         at
>> >>>      > sun.reflect.NativeConstructorAccessorImpl.newInstance(
>> >>> NativeConstructorAccessorImpl.
>> >>>      > java:57)
>> >>>      >         at
>> >>>      > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
>> >>> DelegatingConstructorAcces
>> >>>      > sorImpl.java:45)
>> >>>      >         at java.lang.reflect.Constructor.
>> >>> newInstance(Constructor.java:525)
>> >>>      >         at java.lang.Class.newInstance0(Class.java:374)
>> >>>      >         at java.lang.Class.newInstance(Class.java:327)
>> >>>      >         at java.util.ServiceLoader$LazyIterator.next(
>> >>> ServiceLoader.java:373)
>> >>>      >         at
>> java.util.ServiceLoader$1.next(ServiceLoader.java:445)
>> >>>      >         at
>> >>>      > org.apache.hadoop.fs.FileSystem.loadFileSystems(
>> >>> FileSystem.java:2364)
>> >>>      >         - locked <0x00000000faeb4fc8> (a java.lang.Class for
>> >>>      > org.apache.hadoop.fs.FileSystem)
>> >>>      >         at
>> >>>      > org.apache.hadoop.fs.FileSystem.getFileSystemClass(
>> >>> FileSystem.java:2375)
>> >>>      >         at
>> >>>      > org.apache.hadoop.fs.FileSystem.createFileSystem(
>> >>> FileSystem.java:2392)
>> >>>      >         at org.apache.hadoop.fs.FileSystem.access$200(
>> >>> FileSystem.java:89)
>> >>>      >         at
>> >>>      > org.apache.hadoop.fs.FileSystem$Cache.getInternal(
>> >>> FileSystem.java:2431)
>> >>>      >         at org.apache.hadoop.fs.FileSystem$Cache.get(
>> >>> FileSystem.java:2413)
>> >>>      >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.
>> >>> java:368)
>> >>>      >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.
>> >>> java:167)
>> >>>      >         at
>> >>>      > org.apache.hadoop.mapred.JobConf.getWorkingDirectory(
>> >>> JobConf.java:587)
>> >>>      >         at
>> >>>      > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(
>> >>> FileInputFormat.java:315)
>> >>>      >         at
>> >>>      > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(
>> >>> FileInputFormat.java:288)
>> >>>      >         at
>> >>>      > org.apache.spark.SparkContext$$anonfun$22.apply(
>> >>> SparkContext.scala:546)
>> >>>      >         at
>> >>>      > org.apache.spark.SparkContext$$anonfun$22.apply(
>> >>> SparkContext.scala:546)
>> >>>      >         at
>> >>>      > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$
>> >>> 1.apply(HadoopRDD.scala:145)
>> >>>      >
>> >>>      >
>> >>>      >
>> >>>      > ...elided...
>> >>>      >
>> >>>      >
>> >>>      > "Executor task launch worker-0" daemon prio=10
>> >>> tid=0x0000000001e71800
>> >>>      > nid=0x2d97 waiting for monitor entry [0x00007f24d2bf1000]
>> >>>      >    java.lang.Thread.State: BLOCKED (on object monitor)
>> >>>      >         at
>> >>>      > org.apache.hadoop.fs.FileSystem.loadFileSystems(
>> >>> FileSystem.java:2362)
>> >>>      >         - waiting to lock <0x00000000faeb4fc8> (a
>> java.lang.Class
>> >>> for
>> >>>      > org.apache.hadoop.fs.FileSystem)
>> >>>      >         at
>> >>>      > org.apache.hadoop.fs.FileSystem.getFileSystemClass(
>> >>> FileSystem.java:2375)
>> >>>      >         at
>> >>>      > org.apache.hadoop.fs.FileSystem.createFileSystem(
>> >>> FileSystem.java:2392)
>> >>>      >         at org.apache.hadoop.fs.FileSystem.access$200(
>> >>> FileSystem.java:89)
>> >>>      >         at
>> >>>      > org.apache.hadoop.fs.FileSystem$Cache.getInternal(
>> >>> FileSystem.java:2431)
>> >>>      >         at org.apache.hadoop.fs.FileSystem$Cache.get(
>> >>> FileSystem.java:2413)
>> >>>      >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.
>> >>> java:368)
>> >>>      >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.
>> >>> java:167)
>> >>>      >         at
>> >>>      > org.apache.hadoop.mapred.JobConf.getWorkingDirectory(
>> >>> JobConf.java:587)
>> >>>      >         at
>> >>>      > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(
>> >>> FileInputFormat.java:315)
>> >>>      >         at
>> >>>      > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(
>> >>> FileInputFormat.java:288)
>> >>>      >         at
>> >>>      > org.apache.spark.SparkContext$$anonfun$22.apply(
>> >>> SparkContext.scala:546)
>> >>>      >         at
>> >>>      > org.apache.spark.SparkContext$$anonfun$22.apply(
>> >>> SparkContext.scala:546)
>> >>>      >         at
>> >>>      > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$
>> >>> 1.apply(HadoopRDD.scala:145)
>> >>>
>> >>>
>> >>>
>> >>
>> >> --
>> >>
>> >> Best Regards
>> >> Fei Wang
>> >>
>> >> ------------------------------------------------------------
>> >> --------------------
>> >>
>> >>
>> >>
>>

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

Posted by Andrew Ash <an...@andrewash.com>.
I don't believe mine is a regression. But it is related to thread safety on
Hadoop Configuration objects. Should I start a new thread?
On Jul 15, 2014 12:55 AM, "Patrick Wendell" <pw...@gmail.com> wrote:

> Andrew is your issue also a regression from 1.0.0 to 1.0.1? The
> immediate priority is addressing regressions between these two
> releases.
>
> On Mon, Jul 14, 2014 at 9:05 PM, Andrew Ash <an...@andrewash.com> wrote:
> > I'm not sure either of those PRs will fix the concurrent adds to
> > Configuration issue I observed. I've got a stack trace and writeup I'll
> > share in an hour or two (traveling today).
> > On Jul 14, 2014 9:50 PM, "scwf" <wa...@huawei.com> wrote:
> >
> >> hi,Cody
> >>   i met this issue days before and i post a PR for this(
> >> https://github.com/apache/spark/pull/1385)
> >> it's very strange that if i synchronize conf it will deadlock but it is
> ok
> >> when synchronize initLocalJobConfFuncOpt
> >>
> >>
> >>  Here's the entire jstack output.
> >>>
> >>>
> >>> On Mon, Jul 14, 2014 at 4:44 PM, Patrick Wendell <pwendell@gmail.com
> >>> <ma...@gmail.com>> wrote:
> >>>
> >>>     Hey Cody,
> >>>
> >>>     This Jstack seems truncated, would you mind giving the entire stack
> >>>     trace? For the second thread, for instance, we can't see where the
> >>>     lock is being acquired.
> >>>
> >>>     - Patrick
> >>>
> >>>     On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
> >>>     <cody.koeninger@mediacrossing.com <mailto:cody.koeninger@
> >>> mediacrossing.com>> wrote:
> >>>      > Hi all, just wanted to give a heads up that we're seeing a
> >>> reproducible
> >>>      > deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2
> >>>      >
> >>>      > If jira is a better place for this, apologies in advance -
> figured
> >>> talking
> >>>      > about it on the mailing list was friendlier than randomly
> >>> (re)opening jira
> >>>      > tickets.
> >>>      >
> >>>      > I know Gary had mentioned some issues with 1.0.1 on the mailing
> >>> list, once
> >>>      > we got a thread dump I wanted to follow up.
> >>>      >
> >>>      > The thread dump shows the deadlock occurs in the synchronized
> >>> block of code
> >>>      > that was changed in HadoopRDD.scala, for the Spark-1097 issue
> >>>      >
> >>>      > Relevant portions of the thread dump are summarized below, we
> can
> >>> provide
> >>>      > the whole dump if it's useful.
> >>>      >
> >>>      > Found one Java-level deadlock:
> >>>      > =============================
> >>>      > "Executor task launch worker-1":
> >>>      >   waiting to lock monitor 0x00007f250400c520 (object
> >>> 0x00000000fae7dc30, a
> >>>      > org.apache.hadoop.co <http://org.apache.hadoop.co>
> >>>      > nf.Configuration),
> >>>      >   which is held by "Executor task launch worker-0"
> >>>      > "Executor task launch worker-0":
> >>>      >   waiting to lock monitor 0x00007f2520495620 (object
> >>> 0x00000000faeb4fc8, a
> >>>      > java.lang.Class),
> >>>      >   which is held by "Executor task launch worker-1"
> >>>      >
> >>>      >
> >>>      > "Executor task launch worker-1":
> >>>      >         at
> >>>      > org.apache.hadoop.conf.Configuration.reloadConfiguration(
> >>> Configuration.java:791)
> >>>      >         - waiting to lock <0x00000000fae7dc30> (a
> >>>      > org.apache.hadoop.conf.Configuration)
> >>>      >         at
> >>>      > org.apache.hadoop.conf.Configuration.addDefaultResource(
> >>> Configuration.java:690)
> >>>      >         - locked <0x00000000faca6ff8> (a java.lang.Class for
> >>>      > org.apache.hadoop.conf.Configurati
> >>>      > on)
> >>>      >         at
> >>>      > org.apache.hadoop.hdfs.HdfsConfiguration.<clinit>(
> >>> HdfsConfiguration.java:34)
> >>>      >         at
> >>>      > org.apache.hadoop.hdfs.DistributedFileSystem.<clinit>
> >>> (DistributedFileSystem.java:110
> >>>      > )
> >>>      >         at sun.reflect.NativeConstructorAccessorImpl.
> >>> newInstance0(Native
> >>>      > Method)
> >>>      >         at
> >>>      > sun.reflect.NativeConstructorAccessorImpl.newInstance(
> >>> NativeConstructorAccessorImpl.
> >>>      > java:57)
> >>>      >         at sun.reflect.NativeConstructorAccessorImpl.
> >>> newInstance0(Native
> >>>      > Method)
> >>>      >         at
> >>>      > sun.reflect.NativeConstructorAccessorImpl.newInstance(
> >>> NativeConstructorAccessorImpl.
> >>>      > java:57)
> >>>      >         at
> >>>      > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(
> >>> DelegatingConstructorAcces
> >>>      > sorImpl.java:45)
> >>>      >         at java.lang.reflect.Constructor.
> >>> newInstance(Constructor.java:525)
> >>>      >         at java.lang.Class.newInstance0(Class.java:374)
> >>>      >         at java.lang.Class.newInstance(Class.java:327)
> >>>      >         at java.util.ServiceLoader$LazyIterator.next(
> >>> ServiceLoader.java:373)
> >>>      >         at
> java.util.ServiceLoader$1.next(ServiceLoader.java:445)
> >>>      >         at
> >>>      > org.apache.hadoop.fs.FileSystem.loadFileSystems(
> >>> FileSystem.java:2364)
> >>>      >         - locked <0x00000000faeb4fc8> (a java.lang.Class for
> >>>      > org.apache.hadoop.fs.FileSystem)
> >>>      >         at
> >>>      > org.apache.hadoop.fs.FileSystem.getFileSystemClass(
> >>> FileSystem.java:2375)
> >>>      >         at
> >>>      > org.apache.hadoop.fs.FileSystem.createFileSystem(
> >>> FileSystem.java:2392)
> >>>      >         at org.apache.hadoop.fs.FileSystem.access$200(
> >>> FileSystem.java:89)
> >>>      >         at
> >>>      > org.apache.hadoop.fs.FileSystem$Cache.getInternal(
> >>> FileSystem.java:2431)
> >>>      >         at org.apache.hadoop.fs.FileSystem$Cache.get(
> >>> FileSystem.java:2413)
> >>>      >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.
> >>> java:368)
> >>>      >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.
> >>> java:167)
> >>>      >         at
> >>>      > org.apache.hadoop.mapred.JobConf.getWorkingDirectory(
> >>> JobConf.java:587)
> >>>      >         at
> >>>      > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(
> >>> FileInputFormat.java:315)
> >>>      >         at
> >>>      > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(
> >>> FileInputFormat.java:288)
> >>>      >         at
> >>>      > org.apache.spark.SparkContext$$anonfun$22.apply(
> >>> SparkContext.scala:546)
> >>>      >         at
> >>>      > org.apache.spark.SparkContext$$anonfun$22.apply(
> >>> SparkContext.scala:546)
> >>>      >         at
> >>>      > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$
> >>> 1.apply(HadoopRDD.scala:145)
> >>>      >
> >>>      >
> >>>      >
> >>>      > ...elided...
> >>>      >
> >>>      >
> >>>      > "Executor task launch worker-0" daemon prio=10
> >>> tid=0x0000000001e71800
> >>>      > nid=0x2d97 waiting for monitor entry [0x00007f24d2bf1000]
> >>>      >    java.lang.Thread.State: BLOCKED (on object monitor)
> >>>      >         at
> >>>      > org.apache.hadoop.fs.FileSystem.loadFileSystems(
> >>> FileSystem.java:2362)
> >>>      >         - waiting to lock <0x00000000faeb4fc8> (a
> java.lang.Class
> >>> for
> >>>      > org.apache.hadoop.fs.FileSystem)
> >>>      >         at
> >>>      > org.apache.hadoop.fs.FileSystem.getFileSystemClass(
> >>> FileSystem.java:2375)
> >>>      >         at
> >>>      > org.apache.hadoop.fs.FileSystem.createFileSystem(
> >>> FileSystem.java:2392)
> >>>      >         at org.apache.hadoop.fs.FileSystem.access$200(
> >>> FileSystem.java:89)
> >>>      >         at
> >>>      > org.apache.hadoop.fs.FileSystem$Cache.getInternal(
> >>> FileSystem.java:2431)
> >>>      >         at org.apache.hadoop.fs.FileSystem$Cache.get(
> >>> FileSystem.java:2413)
> >>>      >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.
> >>> java:368)
> >>>      >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.
> >>> java:167)
> >>>      >         at
> >>>      > org.apache.hadoop.mapred.JobConf.getWorkingDirectory(
> >>> JobConf.java:587)
> >>>      >         at
> >>>      > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(
> >>> FileInputFormat.java:315)
> >>>      >         at
> >>>      > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(
> >>> FileInputFormat.java:288)
> >>>      >         at
> >>>      > org.apache.spark.SparkContext$$anonfun$22.apply(
> >>> SparkContext.scala:546)
> >>>      >         at
> >>>      > org.apache.spark.SparkContext$$anonfun$22.apply(
> >>> SparkContext.scala:546)
> >>>      >         at
> >>>      > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$
> >>> 1.apply(HadoopRDD.scala:145)
> >>>
> >>>
> >>>
> >>
> >> --
> >>
> >> Best Regards
> >> Fei Wang
> >>
> >> ------------------------------------------------------------
> >> --------------------
> >>
> >>
> >>
>

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

Posted by Patrick Wendell <pw...@gmail.com>.
Andrew, is your issue also a regression from 1.0.0 to 1.0.1? The
immediate priority is addressing regressions between these two
releases.

On Mon, Jul 14, 2014 at 9:05 PM, Andrew Ash <an...@andrewash.com> wrote:
> I'm not sure either of those PRs will fix the concurrent adds to
> Configuration issue I observed. I've got a stack trace and writeup I'll
> share in an hour or two (traveling today).
> On Jul 14, 2014 9:50 PM, "scwf" <wa...@huawei.com> wrote:
>
>> hi,Cody
>>   i met this issue days before and i post a PR for this(
>> https://github.com/apache/spark/pull/1385)
>> it's very strange that if i synchronize conf it will deadlock but it is ok
>> when synchronize initLocalJobConfFuncOpt
>>
>>
>>  Here's the entire jstack output.
>>>
>>>
>>> On Mon, Jul 14, 2014 at 4:44 PM, Patrick Wendell <pwendell@gmail.com
>>> <ma...@gmail.com>> wrote:
>>>
>>>     Hey Cody,
>>>
>>>     This Jstack seems truncated, would you mind giving the entire stack
>>>     trace? For the second thread, for instance, we can't see where the
>>>     lock is being acquired.
>>>
>>>     - Patrick
>>>
>>>     On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
>>>     <cody.koeninger@mediacrossing.com <mailto:cody.koeninger@
>>> mediacrossing.com>> wrote:
>>>      > Hi all, just wanted to give a heads up that we're seeing a
>>> reproducible
>>>      > deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2
>>>      >
>>>      > If jira is a better place for this, apologies in advance - figured
>>> talking
>>>      > about it on the mailing list was friendlier than randomly
>>> (re)opening jira
>>>      > tickets.
>>>      >
>>>      > I know Gary had mentioned some issues with 1.0.1 on the mailing
>>> list, once
>>>      > we got a thread dump I wanted to follow up.
>>>      >
>>>      > The thread dump shows the deadlock occurs in the synchronized
>>> block of code
>>>      > that was changed in HadoopRDD.scala, for the Spark-1097 issue
>>>      >
>>>      > Relevant portions of the thread dump are summarized below, we can provide
>>>      > the whole dump if it's useful.
>>>      >
>>>      > Found one Java-level deadlock:
>>>      > =============================
>>>      > "Executor task launch worker-1":
>>>      >   waiting to lock monitor 0x00007f250400c520 (object 0x00000000fae7dc30, a
>>>      > org.apache.hadoop.conf.Configuration),
>>>      >   which is held by "Executor task launch worker-0"
>>>      > "Executor task launch worker-0":
>>>      >   waiting to lock monitor 0x00007f2520495620 (object 0x00000000faeb4fc8, a
>>>      > java.lang.Class),
>>>      >   which is held by "Executor task launch worker-1"
>>>      >
>>>      >
>>>      > "Executor task launch worker-1":
>>>      >         at org.apache.hadoop.conf.Configuration.reloadConfiguration(Configuration.java:791)
>>>      >         - waiting to lock <0x00000000fae7dc30> (a org.apache.hadoop.conf.Configuration)
>>>      >         at org.apache.hadoop.conf.Configuration.addDefaultResource(Configuration.java:690)
>>>      >         - locked <0x00000000faca6ff8> (a java.lang.Class for org.apache.hadoop.conf.Configuration)
>>>      >         at org.apache.hadoop.hdfs.HdfsConfiguration.<clinit>(HdfsConfiguration.java:34)
>>>      >         at org.apache.hadoop.hdfs.DistributedFileSystem.<clinit>(DistributedFileSystem.java:110)
>>>      >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>      >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>      >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>>      >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>>      >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>>      >         at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
>>>      >         at java.lang.Class.newInstance0(Class.java:374)
>>>      >         at java.lang.Class.newInstance(Class.java:327)
>>>      >         at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
>>>      >         at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
>>>      >         at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2364)
>>>      >         - locked <0x00000000faeb4fc8> (a java.lang.Class for org.apache.hadoop.fs.FileSystem)
>>>      >         at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
>>>      >         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
>>>      >         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
>>>      >         at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
>>>      >         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
>>>      >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
>>>      >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
>>>      >         at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
>>>      >         at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
>>>      >         at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
>>>      >         at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>>>      >         at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>>>      >         at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
>>>      >
>>>      >
>>>      >
>>>      > ...elided...
>>>      >
>>>      >
>>>      > "Executor task launch worker-0" daemon prio=10 tid=0x0000000001e71800
>>>      > nid=0x2d97 waiting for monitor entry [0x00007f24d2bf1000]
>>>      >    java.lang.Thread.State: BLOCKED (on object monitor)
>>>      >         at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2362)
>>>      >         - waiting to lock <0x00000000faeb4fc8> (a java.lang.Class for org.apache.hadoop.fs.FileSystem)
>>>      >         at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
>>>      >         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
>>>      >         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
>>>      >         at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
>>>      >         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
>>>      >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
>>>      >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
>>>      >         at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
>>>      >         at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
>>>      >         at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
>>>      >         at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>>>      >         at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>>>      >         at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
>>>
>>>
>>>
>>
>> --
>>
>> Best Regards
>> Fei Wang
>>
>> --------------------------------------------------------------------------------
>>
>>
>>

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

Posted by Andrew Ash <an...@andrewash.com>.
I'm not sure either of those PRs will fix the concurrent adds to
Configuration issue I observed. I've got a stack trace and writeup I'll
share in an hour or two (traveling today).
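
To make that concrete, here is a minimal sketch (assumptions: the shared object is a
broadcast Hadoop Configuration, and the helper name is made up, not Spark's actual
code) of per-task copying, which keeps any mutation off the instance other tasks read:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapred.JobConf

    // Hypothetical helper: each task gets its own Configuration copy, so later
    // set(...) calls never write to the shared instance other tasks are reading.
    def taskLocalJobConf(shared: Configuration): JobConf = {
      val copy = new Configuration(shared) // Configuration's copy constructor
      new JobConf(copy)
    }

Note the copy constructor itself still reads the shared conf, so this only helps if
nothing else is mutating the shared instance at the same time - it's a sketch of the
idea, not a drop-in fix.
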
On Jul 14, 2014 9:50 PM, "scwf" <wa...@huawei.com> wrote:

> Hi Cody,
>   I ran into this issue a few days ago and posted a PR for it
> (https://github.com/apache/spark/pull/1385).
> It's very strange: if I synchronize on conf it deadlocks, but it is fine when
> I synchronize on initLocalJobConfFuncOpt.
>
>
>  Here's the entire jstack output.
>>
>>
>> On Mon, Jul 14, 2014 at 4:44 PM, Patrick Wendell <pwendell@gmail.com> wrote:
>>
>>     Hey Cody,
>>
>>     This Jstack seems truncated, would you mind giving the entire stack
>>     trace? For the second thread, for instance, we can't see where the
>>     lock is being acquired.
>>
>>     - Patrick
>>
>>     On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
>>     <cody.koeninger@mediacrossing.com> wrote:
>>      > Hi all, just wanted to give a heads up that we're seeing a reproducible
>>      > deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2
>>      >
>>      > If jira is a better place for this, apologies in advance - figured talking
>>      > about it on the mailing list was friendlier than randomly (re)opening jira
>>      > tickets.
>>      >
>>      > I know Gary had mentioned some issues with 1.0.1 on the mailing list, once
>>      > we got a thread dump I wanted to follow up.
>>      >
>>      > The thread dump shows the deadlock occurs in the synchronized block of code
>>      > that was changed in HadoopRDD.scala, for the Spark-1097 issue
>>      >
>>      > Relevant portions of the thread dump are summarized below, we can provide
>>      > the whole dump if it's useful.
>>      >
>>      > Found one Java-level deadlock:
>>      > =============================
>>      > "Executor task launch worker-1":
>>      >   waiting to lock monitor 0x00007f250400c520 (object 0x00000000fae7dc30, a
>>      > org.apache.hadoop.conf.Configuration),
>>      >   which is held by "Executor task launch worker-0"
>>      > "Executor task launch worker-0":
>>      >   waiting to lock monitor 0x00007f2520495620 (object 0x00000000faeb4fc8, a
>>      > java.lang.Class),
>>      >   which is held by "Executor task launch worker-1"
>>      >
>>      >
>>      > "Executor task launch worker-1":
>>      >         at org.apache.hadoop.conf.Configuration.reloadConfiguration(Configuration.java:791)
>>      >         - waiting to lock <0x00000000fae7dc30> (a org.apache.hadoop.conf.Configuration)
>>      >         at org.apache.hadoop.conf.Configuration.addDefaultResource(Configuration.java:690)
>>      >         - locked <0x00000000faca6ff8> (a java.lang.Class for org.apache.hadoop.conf.Configuration)
>>      >         at org.apache.hadoop.hdfs.HdfsConfiguration.<clinit>(HdfsConfiguration.java:34)
>>      >         at org.apache.hadoop.hdfs.DistributedFileSystem.<clinit>(DistributedFileSystem.java:110)
>>      >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>      >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>      >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>      >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>      >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>      >         at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
>>      >         at java.lang.Class.newInstance0(Class.java:374)
>>      >         at java.lang.Class.newInstance(Class.java:327)
>>      >         at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
>>      >         at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
>>      >         at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2364)
>>      >         - locked <0x00000000faeb4fc8> (a java.lang.Class for org.apache.hadoop.fs.FileSystem)
>>      >         at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
>>      >         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
>>      >         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
>>      >         at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
>>      >         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
>>      >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
>>      >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
>>      >         at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
>>      >         at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
>>      >         at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
>>      >         at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>>      >         at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>>      >         at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
>>      >
>>      >
>>      >
>>      > ...elided...
>>      >
>>      >
>>      > "Executor task launch worker-0" daemon prio=10 tid=0x0000000001e71800
>>      > nid=0x2d97 waiting for monitor entry [0x00007f24d2bf1000]
>>      >    java.lang.Thread.State: BLOCKED (on object monitor)
>>      >         at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2362)
>>      >         - waiting to lock <0x00000000faeb4fc8> (a java.lang.Class for org.apache.hadoop.fs.FileSystem)
>>      >         at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
>>      >         at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
>>      >         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
>>      >         at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
>>      >         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
>>      >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
>>      >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
>>      >         at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
>>      >         at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
>>      >         at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
>>      >         at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>>      >         at org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>>      >         at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
>>
>>
>>
>
> --
>
> Best Regards
> Fei Wang
>
> --------------------------------------------------------------------------------
>
>
>

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

Posted by scwf <wa...@huawei.com>.
Hi Cody,
   I ran into this issue a few days ago and posted a PR for it (https://github.com/apache/spark/pull/1385).
It's very strange: if I synchronize on conf it deadlocks, but it is fine when I synchronize on initLocalJobConfFuncOpt.
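
To illustrate what that difference amounts to, here is a simplified sketch (not the
actual patch; names are approximate): the jstack shows Hadoop's FileSystem and
HdfsConfiguration class initializers taking class-level locks and then locking a
registered Configuration, so synchronizing on the shared conf lets two threads take
the same two monitors in opposite orders, while a lock object private to Spark can
never be part of that cycle.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.mapred.JobConf

    object ConfLocking {
      // A lock private to Spark; Hadoop code never synchronizes on it.
      private val CONF_LOCK = new Object

      // Deadlock-prone shape: the shared conf's monitor is also taken inside
      // Hadoop's static initialization (addDefaultResource/reloadConfiguration),
      // so this thread can hold conf while needing FileSystem.class, and another
      // thread can hold FileSystem.class while needing conf.
      def jobConfLockingOnConf(conf: Configuration): JobConf =
        conf.synchronized { new JobConf(conf) }

      // Safer shape (roughly what synchronizing on initLocalJobConfFuncOpt
      // achieves): serialize JobConf construction with a monitor that never
      // appears inside Hadoop, so there is no cross-library lock ordering.
      def jobConfLockingOnPrivateLock(conf: Configuration): JobConf =
        CONF_LOCK.synchronized { new JobConf(conf) }
    }
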


> Here's the entire jstack output.
>
>
> On Mon, Jul 14, 2014 at 4:44 PM, Patrick Wendell <pwendell@gmail.com <ma...@gmail.com>> wrote:
>
>     Hey Cody,
>
>     This Jstack seems truncated, would you mind giving the entire stack
>     trace? For the second thread, for instance, we can't see where the
>     lock is being acquired.
>
>     - Patrick
>
>     On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
>     <cody.koeninger@mediacrossing.com <ma...@mediacrossing.com>> wrote:
>      > Hi all, just wanted to give a heads up that we're seeing a reproducible
>      > deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2
>      >
>      > If jira is a better place for this, apologies in advance - figured talking
>      > about it on the mailing list was friendlier than randomly (re)opening jira
>      > tickets.
>      >
>      > I know Gary had mentioned some issues with 1.0.1 on the mailing list, once
>      > we got a thread dump I wanted to follow up.
>      >
>      > The thread dump shows the deadlock occurs in the synchronized block of code
>      > that was changed in HadoopRDD.scala, for the Spark-1097 issue
>      >
>      > Relevant portions of the thread dump are summarized below, we can provide
>      > the whole dump if it's useful.
>      >
>      > Found one Java-level deadlock:
>      > =============================
>      > "Executor task launch worker-1":
>      >   waiting to lock monitor 0x00007f250400c520 (object 0x00000000fae7dc30, a
>      > org.apache.hadoop.conf.Configuration),
>      >   which is held by "Executor task launch worker-0"
>      > "Executor task launch worker-0":
>      >   waiting to lock monitor 0x00007f2520495620 (object 0x00000000faeb4fc8, a
>      > java.lang.Class),
>      >   which is held by "Executor task launch worker-1"
>      >
>      >
>      > "Executor task launch worker-1":
>      >         at
>      > org.apache.hadoop.conf.Configuration.reloadConfiguration(Configuration.java:791)
>      >         - waiting to lock <0x00000000fae7dc30> (a
>      > org.apache.hadoop.conf.Configuration)
>      >         at
>      > org.apache.hadoop.conf.Configuration.addDefaultResource(Configuration.java:690)
>      >         - locked <0x00000000faca6ff8> (a java.lang.Class for
>      > org.apache.hadoop.conf.Configurati
>      > on)
>      >         at
>      > org.apache.hadoop.hdfs.HdfsConfiguration.<clinit>(HdfsConfiguration.java:34)
>      >         at
>      > org.apache.hadoop.hdfs.DistributedFileSystem.<clinit>(DistributedFileSystem.java:110
>      > )
>      >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>      > Method)
>      >         at
>      > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.
>      > java:57)
>      >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
>      > Method)
>      >         at
>      > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.
>      > java:57)
>      >         at
>      > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAcces
>      > sorImpl.java:45)
>      >         at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
>      >         at java.lang.Class.newInstance0(Class.java:374)
>      >         at java.lang.Class.newInstance(Class.java:327)
>      >         at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
>      >         at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
>      >         at
>      > org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2364)
>      >         - locked <0x00000000faeb4fc8> (a java.lang.Class for
>      > org.apache.hadoop.fs.FileSystem)
>      >         at
>      > org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
>      >         at
>      > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
>      >         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
>      >         at
>      > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
>      >         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
>      >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
>      >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
>      >         at
>      > org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
>      >         at
>      > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
>      >         at
>      > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
>      >         at
>      > org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>      >         at
>      > org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>      >         at
>      > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
>      >
>      >
>      >
>      > ...elided...
>      >
>      >
>      > "Executor task launch worker-0" daemon prio=10 tid=0x0000000001e71800
>      > nid=0x2d97 waiting for monitor entry [0x00007f24d2bf1000]
>      >    java.lang.Thread.State: BLOCKED (on object monitor)
>      >         at
>      > org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2362)
>      >         - waiting to lock <0x00000000faeb4fc8> (a java.lang.Class for
>      > org.apache.hadoop.fs.FileSystem)
>      >         at
>      > org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
>      >         at
>      > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
>      >         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
>      >         at
>      > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
>      >         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
>      >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
>      >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
>      >         at
>      > org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
>      >         at
>      > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
>      >         at
>      > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
>      >         at
>      > org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>      >         at
>      > org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>      >         at
>      > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
>
>


-- 

Best Regards
Fei Wang

--------------------------------------------------------------------------------



Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

Posted by Cody Koeninger <co...@mediacrossing.com>.
Here's the entire jstack output.


On Mon, Jul 14, 2014 at 4:44 PM, Patrick Wendell <pw...@gmail.com> wrote:

> Hey Cody,
>
> This Jstack seems truncated, would you mind giving the entire stack
> trace? For the second thread, for instance, we can't see where the
> lock is being acquired.
>
> - Patrick
>
> On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
> <co...@mediacrossing.com> wrote:
> > Hi all, just wanted to give a heads up that we're seeing a reproducible
> > deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2
> >
> > If jira is a better place for this, apologies in advance - figured
> talking
> > about it on the mailing list was friendlier than randomly (re)opening
> jira
> > tickets.
> >
> > I know Gary had mentioned some issues with 1.0.1 on the mailing list,
> once
> > we got a thread dump I wanted to follow up.
> >
> > The thread dump shows the deadlock occurs in the synchronized block of
> code
> > that was changed in HadoopRDD.scala, for the Spark-1097 issue
> >
> > Relevant portions of the thread dump are summarized below, we can provide
> > the whole dump if it's useful.
> >
> > Found one Java-level deadlock:
> > =============================
> > "Executor task launch worker-1":
> >   waiting to lock monitor 0x00007f250400c520 (object 0x00000000fae7dc30, a
> > org.apache.hadoop.conf.Configuration),
> >   which is held by "Executor task launch worker-0"
> > "Executor task launch worker-0":
> >   waiting to lock monitor 0x00007f2520495620 (object 0x00000000faeb4fc8, a
> > java.lang.Class),
> >   which is held by "Executor task launch worker-1"
> >
> >
> > "Executor task launch worker-1":
> >         at
> >
> org.apache.hadoop.conf.Configuration.reloadConfiguration(Configuration.java:791)
> >         - waiting to lock <0x00000000fae7dc30> (a
> > org.apache.hadoop.conf.Configuration)
> >         at
> >
> org.apache.hadoop.conf.Configuration.addDefaultResource(Configuration.java:690)
> >         - locked <0x00000000faca6ff8> (a java.lang.Class for
> > org.apache.hadoop.conf.Configurati
> > on)
> >         at
> >
> org.apache.hadoop.hdfs.HdfsConfiguration.<clinit>(HdfsConfiguration.java:34)
> >         at
> >
> org.apache.hadoop.hdfs.DistributedFileSystem.<clinit>(DistributedFileSystem.java:110
> > )
> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > Method)
> >         at
> >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.
> > java:57)
> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> > Method)
> >         at
> >
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.
> > java:57)
> >         at
> >
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAcces
> > sorImpl.java:45)
> >         at
> java.lang.reflect.Constructor.newInstance(Constructor.java:525)
> >         at java.lang.Class.newInstance0(Class.java:374)
> >         at java.lang.Class.newInstance(Class.java:327)
> >         at
> java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
> >         at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
> >         at
> > org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2364)
> >         - locked <0x00000000faeb4fc8> (a java.lang.Class for
> > org.apache.hadoop.fs.FileSystem)
> >         at
> > org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
> >         at
> > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
> >         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
> >         at
> > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
> >         at
> org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
> >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
> >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
> >         at
> > org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
> >         at
> >
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
> >         at
> >
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
> >         at
> > org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
> >         at
> > org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
> >         at
> >
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
> >
> >
> >
> > ...elided...
> >
> >
> > "Executor task launch worker-0" daemon prio=10 tid=0x0000000001e71800
> > nid=0x2d97 waiting for monitor entry [0x00007f24d2bf1000]
> >    java.lang.Thread.State: BLOCKED (on object monitor)
> >         at
> > org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2362)
> >         - waiting to lock <0x00000000faeb4fc8> (a java.lang.Class for
> > org.apache.hadoop.fs.FileSystem)
> >         at
> > org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
> >         at
> > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
> >         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
> >         at
> > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
> >         at
> org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
> >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
> >         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
> >         at
> > org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
> >         at
> >
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
> >         at
> >
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
> >         at
> > org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
> >         at
> > org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
> >         at
> >
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
>

Re: Reproducible deadlock in 1.0.1, possibly related to Spark-1097

Posted by Patrick Wendell <pw...@gmail.com>.
Hey Cody,

This Jstack seems truncated, would you mind giving the entire stack
trace? For the second thread, for instance, we can't see where the
lock is being acquired.

- Patrick

On Mon, Jul 14, 2014 at 1:42 PM, Cody Koeninger
<co...@mediacrossing.com> wrote:
> Hi all, just wanted to give a heads up that we're seeing a reproducible
> deadlock with spark 1.0.1 with 2.3.0-mr1-cdh5.0.2
>
> If jira is a better place for this, apologies in advance - figured talking
> about it on the mailing list was friendlier than randomly (re)opening jira
> tickets.
>
> I know Gary had mentioned some issues with 1.0.1 on the mailing list, once
> we got a thread dump I wanted to follow up.
>
> The thread dump shows the deadlock occurs in the synchronized block of code
> that was changed in HadoopRDD.scala, for the Spark-1097 issue
>
> Relevant portions of the thread dump are summarized below, we can provide
> the whole dump if it's useful.
>
> Found one Java-level deadlock:
> =============================
> "Executor task launch worker-1":
>   waiting to lock monitor 0x00007f250400c520 (object 0x00000000fae7dc30, a
> org.apache.hadoop.co
> nf.Configuration),
>   which is held by "Executor task launch worker-0"
> "Executor task launch worker-0":
>   waiting to lock monitor 0x00007f2520495620 (object 0x00000000faeb4fc8, a
> java.lang.Class),
>   which is held by "Executor task launch worker-1"
>
>
> "Executor task launch worker-1":
>         at
> org.apache.hadoop.conf.Configuration.reloadConfiguration(Configuration.java:791)
>         - waiting to lock <0x00000000fae7dc30> (a
> org.apache.hadoop.conf.Configuration)
>         at
> org.apache.hadoop.conf.Configuration.addDefaultResource(Configuration.java:690)
>         - locked <0x00000000faca6ff8> (a java.lang.Class for
> org.apache.hadoop.conf.Configurati
> on)
>         at
> org.apache.hadoop.hdfs.HdfsConfiguration.<clinit>(HdfsConfiguration.java:34)
>         at
> org.apache.hadoop.hdfs.DistributedFileSystem.<clinit>(DistributedFileSystem.java:110
> )
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>         at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.
> java:57)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>         at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.
> java:57)
>         at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAcces
> sorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:525)
>         at java.lang.Class.newInstance0(Class.java:374)
>         at java.lang.Class.newInstance(Class.java:327)
>         at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:373)
>         at java.util.ServiceLoader$1.next(ServiceLoader.java:445)
>         at
> org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2364)
>         - locked <0x00000000faeb4fc8> (a java.lang.Class for
> org.apache.hadoop.fs.FileSystem)
>         at
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
>         at
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
>         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
>         at
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
>         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
>         at
> org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
>         at
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
>         at
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
>         at
> org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>         at
> org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>         at
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
>
>
>
> ...elided...
>
>
> "Executor task launch worker-0" daemon prio=10 tid=0x0000000001e71800
> nid=0x2d97 waiting for monitor entry [0x00007f24d2bf1000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at
> org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2362)
>         - waiting to lock <0x00000000faeb4fc8> (a java.lang.Class for
> org.apache.hadoop.fs.FileSystem)
>         at
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2375)
>         at
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
>         at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
>         at
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
>         at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
>         at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:167)
>         at
> org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
>         at
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
>         at
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
>         at
> org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>         at
> org.apache.spark.SparkContext$$anonfun$22.apply(SparkContext.scala:546)
>         at
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:145)
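
For readers skimming the dump above: the cycle reduces to two threads taking two
monitors in opposite orders. A tiny self-contained sketch of that shape (plain lock
objects stand in for the shared Configuration and for the FileSystem class lock;
this is not Spark or Hadoop code):

    object DeadlockShape {
      private val configurationLock = new Object   // plays the shared Configuration
      private val fileSystemClassLock = new Object // plays FileSystem.class

      private def worker(first: AnyRef, second: AnyRef, name: String): Thread =
        new Thread(new Runnable {
          def run(): Unit = first.synchronized {
            Thread.sleep(100) // let the other thread grab its first lock
            second.synchronized { println(name + " finished") }
          }
        }, name)

      def main(args: Array[String]): Unit = {
        // worker-0 holds the conf monitor, then needs FileSystem.class;
        // worker-1 holds FileSystem.class, then needs the conf monitor.
        val w0 = worker(configurationLock, fileSystemClassLock, "worker-0")
        val w1 = worker(fileSystemClassLock, configurationLock, "worker-1")
        w0.start(); w1.start()
        w0.join(); w1.join() // with the sleeps, this almost always hangs
      }
    }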