Posted to user@spark.apache.org by Michael Johnson <mj...@yahoo.com.INVALID> on 2016/11/06 13:28:13 UTC

Very long pause/hang at end of execution

I'm doing some processing and then clustering of a small dataset (~150 MB). Everything seems to work fine until the end: the last few lines of my program are log statements, but after printing those, nothing seems to happen for a long time, often many minutes. I'm not usually patient enough to let it go, but I think one time when I did just wait, it took over an hour (and did eventually exit on its own). Any ideas on what's happening, or how to troubleshoot?

(This happens both when running locally, using the localhost mode, and on a small cluster of four 4-processor nodes, each with 15 GB of RAM. In both cases the executors have 2 GB+ of RAM, and none of the inputs/outputs on any of the stages is more than 75 MB.)

Thanks,
Michael

Re: Very long pause/hang at end of execution

Posted by Gourav Sengupta <go...@gmail.com>.

Hi,

If your process does finish, only after a lag, please check whether you are
writing the output by converting to Pandas or by using coalesce (in which
case all of the data is directed to a single node), or writing to S3, where
writes can lag.

Regards,
Gourav


Re: Very long pause/hang at end of execution

Posted by Michael Johnson <mj...@yahoo.com.INVALID>.

On Wed, Nov 16, 2016 at 10:44 AM Aniket Bhatnagar <an...@gmail.com> wrote:
> Thanks for sharing the thread dump. I had a look at them and couldn't find anything unusual. Is there anything in the logs (driver + executor) that suggests what's going on? Also, what does the spark job do and what is the version of spark and hadoop you are using?

I haven't seen anything in the logs. When I observed it happening before, in local mode, the last output before the hang was a log statement from my code (that is, I had a log4j logger and was calling info() on that logger). That was also the last line of my main() function. After that I saw no more output, from either the driver or the executors. I have seen the pause be as short as a few minutes, or approach an hour. As far as I can tell, when it continues, the log statements look more or less normal.

Locally, I'm using Spark 2.0.1 built for Hadoop 2.7, but without installing Hadoop. Remotely, I'm running on Google Cloud Dataproc, which also uses Spark 2.0.1, along with Hadoop 2.7.3. I've had it happen both locally and remotely.

The job loads data from a text file (using SparkContext.textFile()), then splits each line and converts it into an array of integers. From there, I do some sketching (the data encodes either a tree, a graph, or text, and I create a fixed-length sketch that probabilistically produces similar results for similar nodes in the tree/graph). I then do some lightweight clustering on the sketches, and save the cluster assignments to a text file.

For what it's worth, when I look at the GC stats in the UI, they seem a bit high (as much as 1 minute of GC for a 15 minute run). However, those stats do not change during the pause period.

On Wed, Nov 16, 2016 at 2:48 AM Aniket Bhatnagar <an...@gmail.com> wrote:
> Also, how are you launching the application? Through spark-submit, or by creating a Spark context in your app?

I'm calling spark-submit, and then within my app I call SparkContext.getOrCreate() to get a context. I then call sc.textFile() to load my data into an RDD, and perform various actions on that. I tried adding a call to sc.stop() at the very end, after seeing some discussion that it might be necessary, but it didn't seem to make a difference.

The strange thing is that this behavior comes and goes. I tried opening the UI, as Pietro suggested, but that didn't seem to trigger it for me; I haven't figured out what, if anything, will make it happen every time.

On Wednesday, November 16, 2016 4:41 AM, Pietro Pugni <pi...@gmail.com> wrote:
> I have the same issue with Spark 2.0.1, Java 1.8.x and pyspark. I also use SparkSQL and JDBC. My application runs locally. It happens only if I connect to the UI during Spark execution, even if I close the browser before the execution ends. I observed this behaviour on both macOS Sierra and Red Hat 6.7.

It is interesting that you are seeing this too. I can't get it to happen by using the UI, but I am also having difficulty making it happen at all right now. (I'm only trying locally at the moment.)

Re: Very long pause/hang at end of execution

Posted by Aniket Bhatnagar <an...@gmail.com>.

Also, how are you launching the application? Through spark-submit, or by
creating a Spark context in your app?

Thanks,
Aniket


Re: Very long pause/hang at end of execution

Posted by Aniket Bhatnagar <an...@gmail.com>.

Thanks for sharing the thread dumps. I had a look at them and couldn't find
anything unusual. Is there anything in the logs (driver + executor) that
suggests what's going on? Also, what does the Spark job do, and which
versions of Spark and Hadoop are you using?

Thanks,
Aniket


Re: Very long pause/hang at end of execution

Posted by Pietro Pugni <pi...@gmail.com>.

I have the same issue with Spark 2.0.1, Java 1.8.x and pyspark. I also use
SparkSQL and JDBC. My application runs locally. It happens only if I
connect to the UI during Spark execution, even if I close the browser
before the execution ends. I observed this behaviour on both macOS Sierra
and Red Hat 6.7.


Re: Very long pause/hang at end of execution

Posted by Michael Johnson <mj...@yahoo.com.INVALID>.

The extremely long hang/pause has started happening again. I've been running on a small remote cluster, so I used the UI to grab thread dumps rather than doing it from the command line. There seems to be one executor still alive, along with the driver; I grabbed 4 thread dumps from each, a couple of seconds apart. I'd greatly appreciate any help tracking down what's going on! (I've attached them, but I can paste them somewhere if that's more convenient.)

Thanks,
Michael


Re: Very long pause/hang at end of execution

Posted by Michael Johnson <mj...@yahoo.com.INVALID>.

Hm. Something must have changed, as it was happening quite consistently and now I can't get it to reproduce. Thank you for the offer, and if it happens again I will try grabbing thread dumps and I will see if I can figure out what is going on. 


Re: Very long pause/hang at end of execution

Posted by Aniket Bhatnagar <an...@gmail.com>.

I doubt it's GC, since you mentioned that the pause lasts several minutes.
Since it's reproducible in local mode, can you run the Spark application
locally and, once your job is complete (and the application appears paused),
take 5 thread dumps (using jstack or jcmd on the local Spark JVM process)
with a 1 second delay between each dump, and attach them? I can take a look.

Thanks,
Aniket
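
The capture step requested above (several dumps, a fixed delay apart) can be scripted. The following is a minimal sketch rather than anything posted in the thread: the function name, the output file naming and the injectable `run` parameter are assumptions made for illustration, and it relies only on the JDK's `jstack <pid>` command being on the PATH.

```python
import subprocess
import time
from pathlib import Path


def capture_thread_dumps(pid, count=5, delay_s=1.0, out_dir=".", run=subprocess.run):
    """Write `count` jstack dumps of the JVM with process id `pid`.

    Dumps are taken `delay_s` seconds apart and saved as
    threaddump-<pid>-<n>.txt in `out_dir` (a hypothetical naming scheme).
    `run` defaults to subprocess.run and can be replaced for testing.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(count):
        # `jstack <pid>` prints the stack trace of every thread in that JVM.
        result = run(["jstack", str(pid)], capture_output=True, text=True, check=True)
        path = out / f"threaddump-{pid}-{i + 1}.txt"
        path.write_text(result.stdout)
        paths.append(path)
        if i < count - 1:
            # Wait between dumps so that threads which never move stand out.
            time.sleep(delay_s)
    return paths
```

Calling `capture_thread_dumps(pid)` with the driver's process id (the JDK's `jps` lists running JVMs) once the application appears paused should leave five files to compare. `jcmd <pid> Thread.print` produces an equivalent dump if jstack is not available.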


Re: Very long pause/hang at end of execution

Posted by Michael Johnson <mj...@yahoo.com.INVALID>.

Thanks; I tried looking at the thread dumps for the driver and the one executor that had that option in the UI, but I'm afraid I don't know how to interpret what I saw...  I don't think it could be my code directly, since at this point my code has all completed? Could GC be taking that long? 
(I could also try grabbing the thread dumps and pasting them here, if that would help?)


Re: Very long pause/hang at end of execution

Posted by Aniket Bhatnagar <an...@gmail.com>.

In order to know what's going on, you can study the thread dumps either
from spark UI or from any other thread dump analysis tool.

Thanks,
Aniket
