Posted to user@spark.apache.org by "Carlile, Ken" <ca...@janelia.hhmi.org> on 2016/06/16 11:40:51 UTC

Spark crashes worker nodes with multiple application starts

We run Spark on a general-purpose HPC cluster (using standalone mode and the HPC scheduler), and are currently on Spark 1.6.1. One of the primary users has been testing various storage and other parameters for Spark, which involves doing multiple shuffles and shutting down and starting many applications serially on a single cluster instance. He is using PySpark (via Jupyter notebooks); the Python version is 2.7.6. 

We have been seeing multiple HPC node hard locks in this scenario, all at the termination of a Jupyter kernel (read: Spark application). The symptom is that the load on the node keeps climbing. We have determined this is because of iowait on background processes (namely puppet and facter, cleanup scripts, etc.). What he sees is that when he starts a new kernel (application), the executors on those nodes will not start. We can no longer ssh into the nodes, and no commands can be run on them; everything goes into iowait. The only solution is to do a hard reset on the nodes. 
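A quick way to see this, if you can catch a node before it wedges completely, is to list the processes stuck in uninterruptible sleep ("D" state), which is what drives the load and iowait numbers up. A rough sketch (illustrative only, not our exact tooling):

#!/usr/bin/env python
# Illustrative sketch: list processes in uninterruptible sleep ("D" state),
# i.e. the ones blocked on disk I/O that push load and iowait up.
import os

def proc_state(pid):
    """Return (command, state) for a pid, parsed from /proc/<pid>/stat."""
    with open('/proc/%s/stat' % pid) as f:
        data = f.read()
    # Format is: pid (comm) state ...; comm may contain spaces, so split
    # around the parentheses rather than on whitespace.
    comm = data[data.index('(') + 1:data.rindex(')')]
    state = data[data.rindex(')') + 2:].split()[0]
    return comm, state

if __name__ == '__main__':
    for pid in os.listdir('/proc'):
        if not pid.isdigit():
            continue
        try:
            comm, state = proc_state(pid)
        except (IOError, ValueError):
            continue  # process exited while we were looking
        if state == 'D':
            print('%6s  %s' % (pid, comm))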

Obviously this is very disruptive, both to us sysadmins and to him. We have a limited number of HPC nodes that are permitted to run Spark clusters, so this is a big problem. 

I have attempted to limit the background processes, but it doesn’t seem to matter; it can be any process that attempts I/O on the boot drive. He has tried various things (limiting the CPU cores used by Spark, reducing the memory, etc.), but we have been unable to find a solution, or really, a cause. 
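For reference, the knobs he has been turning look roughly like the sketch below (illustrative only; the master URL, path, and exact values here are not our real configuration):

# Rough sketch of the kind of settings being varied (illustrative values).
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("spark://master-host:7077")   # standalone master; hypothetical host
        .setAppName("storage-parameter-test")
        .set("spark.executor.cores", "15")       # leave a core free for the OS and background jobs
        .set("spark.executor.memory", "25g")     # well under the 128 GB on each node
        # Shuffle and spill files land under spark.local.dir (default /tmp),
        # which is worth keeping off the boot drive.
        .set("spark.local.dir", "/scratch/spark"))  # hypothetical path

sc = SparkContext(conf=conf)

Each notebook kernel stands up (and later tears down) its own SparkContext along these lines.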

Has anyone seen anything like this? Any ideas where to look next? 

Thanks, 
Ken
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Spark crashes worker nodes with multiple application starts

Posted by Deepak Goel <de...@gmail.com>.
Well, my only guess (it is just a guess, as I don't have access to the
machines which require a hard reset) is that the system is running into some
kind of race condition while accessing the disk and is not able to resolve
it, hence the hang. (That is a pretty vague statement, but it seems it will
require some trial and error to figure out why exactly the system is
hanging.) Also, I believe you are using HDFS as data storage; HDFS relaxes
some POSIX requirements for faster data access, and I wonder if this is the
cause.

Hey

Namaskara~Nalama~Guten Tag~Bonjour


   --
Keigu

Deepak
73500 12833
www.simtree.net, deepak@simtree.net
deicool@gmail.com

LinkedIn: www.linkedin.com/in/deicool
Skype: thumsupdeicool
Google talk: deicool
Blog: http://loveandfearless.wordpress.com
Facebook: http://www.facebook.com/deicool

"Contribute to the world, environment and more : http://www.gridrepublic.org
"

On Thu, Jun 16, 2016 at 10:54 PM, Carlile, Ken <ca...@janelia.hhmi.org>
wrote:

> Hi Deepak,
>
> Yes, that’s about the size of it. The spark job isn’t filling the disk by
> any stretch of the imagination; in fact the only stuff that’s writing to
> the disk from Spark in certain of these instances is the logging.
>
> Thanks,
> —Ken
>
> On Jun 16, 2016, at 12:17 PM, Deepak Goel <de...@gmail.com> wrote:
>
> I guess what you are saying is:
>
> 1. The nodes work perfectly OK, without iowait, before the Spark job.
> 2. After you have run the Spark job and killed it, the iowait persists.
>
> So it seems the Spark job is altering the disk in such a way that other
> programs can't access it after the job is killed. (A naive thought:) I
> wonder if the Spark job fills up the disk so that no other program on your
> node can write to it, hence the iowait.
>
> Also, facter normally just reads your system, so it shouldn't block it.
> Perhaps there are some other background scripts running on your node which
> are writing to the disk.
>
> Hey
>
> Namaskara~Nalama~Guten Tag~Bonjour
>
>
>    --
> Keigu
>
> Deepak
> 73500 12833
> www.simtree.net, deepak@simtree.net
> deicool@gmail.com
>
> LinkedIn: www.linkedin.com/in/deicool
> Skype: thumsupdeicool
> Google talk: deicool
> Blog: http://loveandfearless.wordpress.com
> Facebook: http://www.facebook.com/deicool
>
> "Contribute to the world, environment and more :
> http://www.gridrepublic.org
> "
>
> On Thu, Jun 16, 2016 at 5:56 PM, Carlile, Ken <ca...@janelia.hhmi.org>
> wrote:
>
>> 1. There are 320 nodes in total, with 96 dedicated to Spark. In this
>> particular case, 21 are in the Spark cluster. In typical Spark usage, maybe
>> 1-3 nodes will crash in a day, with probably an average of 4-5 Spark
>> clusters running at a given time. In THIS case, 7-12 nodes will crash
>> simultaneously on application termination (not Spark cluster termination,
>> but termination of a Spark application/jupyter kernel)
>> 2. I’ve turned off puppet, no effect. I’ve not fully disabled facter. The
>> iowait persists after the scheduler kills the Spark job (that still works,
>> at least)
>> 3. He’s attempted to run with 15 cores out of 16 and 25GB of RAM out of
>> 128. He still lost nodes.
>> 4. He’s currently running storage benchmarking tests, which consist
>> mainly of shuffles.
>>
>> Thanks!
>> Ken
>>
>> On Jun 16, 2016, at 8:00 AM, Deepak Goel <de...@gmail.com> wrote:
>>
>> I am no expert, but some naive thoughts...
>>
>> 1. How many HPC nodes do you have? How many of them crash (what do you
>> mean by multiple)? Do all of them crash?
>>
>> 2. What things are you running under Puppet? Can't you switch it off and
>> test Spark? You can also switch off Facter. By the way, your observation
>> that there is iowait on these processes might be because they have a lower
>> priority than Spark, so they are waiting for Spark to finish. The real
>> bottleneck might be Spark and not these background processes.
>>
>> 3. Limiting CPUs and memory for Spark might have the opposite effect on
>> iowait, as more of Spark's work would have to go to disk due to the
>> reduced memory and CPU.
>>
>> 4. Of course, you might have to give more info on what kind of
>> applications you are running on Spark, as they might be the main culprit.
>>
>> Deepak
>>
>> Hey
>>
>> Namaskara~Nalama~Guten Tag~Bonjour
>>
>>
>>    --
>> Keigu
>>
>> Deepak
>> 73500 12833
>> www.simtree.net, deepak@simtree.net
>> deicool@gmail.com
>>
>> LinkedIn: www.linkedin.com/in/deicool
>> Skype: thumsupdeicool
>> Google talk: deicool
>> Blog: http://loveandfearless.wordpress.com
>> Facebook: http://www.facebook.com/deicool
>>
>> "Contribute to the world, environment and more :
>> http://www.gridrepublic.org
>> "
>>
>> On Thu, Jun 16, 2016 at 5:10 PM, Carlile, Ken <ca...@janelia.hhmi.org>
>> wrote:
>>
>>> We run Spark on a general purpose HPC cluster (using standalone mode and
>>> the HPC scheduler), and are currently on Spark 1.6.1. One of the primary
>>> users has been testing various storage and other parameters for Spark,
>>> which involves doing multiple shuffles and shutting down and starting many
>>> applications serially on a single cluster instance. He is using pyspark
>>> (via jupyter notebooks). Python version is 2.7.6.
>>>
>>> We have been seeing multiple HPC node hard locks in this scenario, all
>>> at the termination of a jupyter kernel (read Spark application). The
>>> symptom is that the load on the node keeps going higher. We have determined
>>> this is because of iowait on background processes (namely puppet and
>>> facter, clean up scripts, etc). What he sees is that when he starts a new
>>> kernel (application), the executor on those nodes will not start. We can no
>>> longer ssh into the nodes, and no commands can be run on them; everything
>>> goes into iowait. The only solution is to do a hard reset on the nodes.
>>>
>>> Obviously this is very disruptive, both to us sysadmins and to him. We
>>> have a limited number of HPC nodes that are permitted to run spark
>>> clusters, so this is a big problem.
>>>
>>> I have attempted to limit the background processes, but it doesn’t seem
>>> to matter; it can be any process that attempts io on the boot drive. He has
>>> tried various things (limiting CPU cores used by Spark, reducing the
>>> memory, etc.), but we have been unable to find a solution, or really, a
>>> cause.
>>>
>>> Has anyone seen anything like this? Any ideas where to look next?
>>>
>>> Thanks,
>>> Ken
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: user-help@spark.apache.org
>>>
>>>
>>
>>
>
>

Re: Spark crashes worker nodes with multiple application starts

Posted by Deepak Goel <de...@gmail.com>.
I guess what you are saying is:

1. The nodes work perfectly OK, without iowait, before the Spark job.
2. After you have run the Spark job and killed it, the iowait persists.

So it seems the Spark job is altering the disk in such a way that other
programs can't access it after the job is killed. (A naive thought:) I wonder
if the Spark job fills up the disk so that no other program on your node can
write to it, hence the iowait.
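One quick way to rule that in or out, while a node is still reachable, would be a check along these lines (only a sketch; the paths are guesses at where Spark writes on your nodes, not known values):

#!/usr/bin/env python
# Sketch: report free space on the filesystems Spark is likely writing to,
# to check whether shuffle/spill output is actually filling a disk.
# The paths are guesses; substitute your boot drive, spark.local.dir
# (default /tmp) and the Spark work/log directory.
import os

PATHS = ['/', '/tmp', '/var/log']

for path in PATHS:
    st = os.statvfs(path)
    free_gb = st.f_bavail * st.f_frsize / float(1024 ** 3)
    total_gb = st.f_blocks * st.f_frsize / float(1024 ** 3)
    print('%-10s %7.1f GB free of %7.1f GB' % (path, free_gb, total_gb))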

Also, facter normally just reads your system, so it shouldn't block it.
Perhaps there are some other background scripts running on your node which
are writing to the disk.

Hey

Namaskara~Nalama~Guten Tag~Bonjour


   --
Keigu

Deepak
73500 12833
www.simtree.net, deepak@simtree.net
deicool@gmail.com

LinkedIn: www.linkedin.com/in/deicool
Skype: thumsupdeicool
Google talk: deicool
Blog: http://loveandfearless.wordpress.com
Facebook: http://www.facebook.com/deicool

"Contribute to the world, environment and more : http://www.gridrepublic.org
"

On Thu, Jun 16, 2016 at 5:56 PM, Carlile, Ken <ca...@janelia.hhmi.org>
wrote:

> 1. There are 320 nodes in total, with 96 dedicated to Spark. In this
> particular case, 21 are in the Spark cluster. In typical Spark usage, maybe
> 1-3 nodes will crash in a day, with probably an average of 4-5 Spark
> clusters running at a given time. In THIS case, 7-12 nodes will crash
> simultaneously on application termination (not Spark cluster termination,
> but termination of a Spark application/jupyter kernel)
> 2. I’ve turned off puppet, no effect. I’ve not fully disabled facter. The
> iowait persists after the scheduler kills the Spark job (that still works,
> at least)
> 3. He’s attempted to run with 15 cores out of 16 and 25GB of RAM out of
> 128. He still lost nodes.
> 4. He’s currently running storage benchmarking tests, which consist mainly
> of shuffles.
>
> Thanks!
> Ken
>
> On Jun 16, 2016, at 8:00 AM, Deepak Goel <de...@gmail.com> wrote:
>
> I am no expert, but some naive thoughts...
>
> 1. How many HPC nodes do you have? How many of them crash (what do you
> mean by multiple)? Do all of them crash?
>
> 2. What things are you running under Puppet? Can't you switch it off and
> test Spark? You can also switch off Facter. By the way, your observation
> that there is iowait on these processes might be because they have a lower
> priority than Spark, so they are waiting for Spark to finish. The real
> bottleneck might be Spark and not these background processes.
>
> 3. Limiting CPUs and memory for Spark might have the opposite effect on
> iowait, as more of Spark's work would have to go to disk due to the
> reduced memory and CPU.
>
> 4. Of course, you might have to give more info on what kind of
> applications you are running on Spark, as they might be the main culprit.
>
> Deepak
>
> Hey
>
> Namaskara~Nalama~Guten Tag~Bonjour
>
>
>    --
> Keigu
>
> Deepak
> 73500 12833
> www.simtree.net, deepak@simtree.net
> deicool@gmail.com
>
> LinkedIn: www.linkedin.com/in/deicool
> Skype: thumsupdeicool
> Google talk: deicool
> Blog: http://loveandfearless.wordpress.com
> Facebook: http://www.facebook.com/deicool
>
> "Contribute to the world, environment and more :
> http://www.gridrepublic.org
> "
>
> On Thu, Jun 16, 2016 at 5:10 PM, Carlile, Ken <ca...@janelia.hhmi.org>
> wrote:
>
>> We run Spark on a general purpose HPC cluster (using standalone mode and
>> the HPC scheduler), and are currently on Spark 1.6.1. One of the primary
>> users has been testing various storage and other parameters for Spark,
>> which involves doing multiple shuffles and shutting down and starting many
>> applications serially on a single cluster instance. He is using pyspark
>> (via jupyter notebooks). Python version is 2.7.6.
>>
>> We have been seeing multiple HPC node hard locks in this scenario, all at
>> the termination of a jupyter kernel (read Spark application). The symptom
>> is that the load on the node keeps going higher. We have determined this is
>> because of iowait on background processes (namely puppet and facter, clean
>> up scripts, etc). What he sees is that when he starts a new kernel
>> (application), the executor on those nodes will not start. We can no longer
>> ssh into the nodes, and no commands can be run on them; everything goes
>> into iowait. The only solution is to do a hard reset on the nodes.
>>
>> Obviously this is very disruptive, both to us sysadmins and to him. We
>> have a limited number of HPC nodes that are permitted to run spark
>> clusters, so this is a big problem.
>>
>> I have attempted to limit the background processes, but it doesn’t seem
>> to matter; it can be any process that attempts io on the boot drive. He has
>> tried various things (limiting CPU cores used by Spark, reducing the
>> memory, etc.), but we have been unable to find a solution, or really, a
>> cause.
>>
>> Has anyone seen anything like this? Any ideas where to look next?
>>
>> Thanks,
>> Ken
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>>
>>
>
>

Re: Spark crashes worker nodes with multiple application starts

Posted by Deepak Goel <de...@gmail.com>.
I am no expert, but some naive thoughts...

1. How many HPC nodes do you have? How many of them crash (what do you mean
by multiple)? Do all of them crash?

2. What things are you running under Puppet? Can't you switch it off and test
Spark? You can also switch off Facter. By the way, your observation that
there is iowait on these processes might be because they have a lower
priority than Spark, so they are waiting for Spark to finish. The real
bottleneck might be Spark and not these background processes (see the sketch
after this list for one way to check who is actually doing the writing).

3. Limiting CPUs and memory for Spark might have the opposite effect on
iowait, as more of Spark's work would have to go to disk due to the reduced
memory and CPU.

4. Of course, you might have to give more info on what kind of applications
you are running on Spark, as they might be the main culprit.
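
For point 2, a rough way to see who is actually generating the write traffic is to read the per-process I/O counters. This is only a sketch (it needs root to read other users' counters, and it reports cumulative bytes since each process started):

#!/usr/bin/env python
# Sketch: rank processes by bytes written to storage, using /proc/<pid>/io,
# to see whether the Spark executors or the background jobs (puppet, facter,
# cleanup scripts) are generating the write traffic.
# Needs root to read /proc/<pid>/io for other users' processes.
import os

def written_bytes(pid):
    try:
        with open('/proc/%s/io' % pid) as f:
            for line in f:
                if line.startswith('write_bytes:'):
                    return int(line.split()[1])
    except IOError:
        pass  # process exited, or counters not readable without root
    return 0

def command(pid):
    try:
        with open('/proc/%s/comm' % pid) as f:
            return f.read().strip()
    except IOError:
        return '?'

if __name__ == '__main__':
    pids = [p for p in os.listdir('/proc') if p.isdigit()]
    writers = sorted(((written_bytes(p), p) for p in pids), reverse=True)[:10]
    for nbytes, pid in writers:
        print('%6s  %-20s %15d bytes written' % (pid, command(pid), nbytes))

If the big writers turn out to be the Spark executor JVMs rather than puppet or facter, that would point back at the Spark job itself rather than the background processes.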

Deepak

Hey

Namaskara~Nalama~Guten Tag~Bonjour


   --
Keigu

Deepak
73500 12833
www.simtree.net, deepak@simtree.net
deicool@gmail.com

LinkedIn: www.linkedin.com/in/deicool
Skype: thumsupdeicool
Google talk: deicool
Blog: http://loveandfearless.wordpress.com
Facebook: http://www.facebook.com/deicool

"Contribute to the world, environment and more : http://www.gridrepublic.org
"

On Thu, Jun 16, 2016 at 5:10 PM, Carlile, Ken <ca...@janelia.hhmi.org>
wrote:

> We run Spark on a general purpose HPC cluster (using standalone mode and
> the HPC scheduler), and are currently on Spark 1.6.1. One of the primary
> users has been testing various storage and other parameters for Spark,
> which involves doing multiple shuffles and shutting down and starting many
> applications serially on a single cluster instance. He is using pyspark
> (via jupyter notebooks). Python version is 2.7.6.
>
> We have been seeing multiple HPC node hard locks in this scenario, all at
> the termination of a jupyter kernel (read Spark application). The symptom
> is that the load on the node keeps going higher. We have determined this is
> because of iowait on background processes (namely puppet and facter, clean
> up scripts, etc). What he sees is that when he starts a new kernel
> (application), the executor on those nodes will not start. We can no longer
> ssh into the nodes, and no commands can be run on them; everything goes
> into iowait. The only solution is to do a hard reset on the nodes.
>
> Obviously this is very disruptive, both to us sysadmins and to him. We
> have a limited number of HPC nodes that are permitted to run spark
> clusters, so this is a big problem.
>
> I have attempted to limit the background processes, but it doesn’t seem to
> matter; it can be any process that attempts io on the boot drive. He has
> tried various things (limiting CPU cores used by Spark, reducing the
> memory, etc.), but we have been unable to find a solution, or really, a
> cause.
>
> Has anyone seen anything like this? Any ideas where to look next?
>
> Thanks,
> Ken
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>