Posted to user@spark.apache.org by Praveen R <pr...@sigmoidanalytics.com> on 2014/04/14 15:29:20 UTC

Lost an executor error - Jobs fail

Had the error below while running Shark queries on a 30-node cluster and was
not able to start the Shark server or run any jobs.

14/04/11 19:06:52 ERROR scheduler.TaskSchedulerImpl: Lost an executor 4
(already removed): Failed to create local directory (bad spark.local.dir?)
Full log: https://gist.github.com/praveenr019/10647049

After spending quite some time, I found it was due to disk read errors on one
node, and the cluster worked again after removing that node.

Wanted to know if there is any configuration (like akkaTimeout) which can
handle this, or whether Mesos helps?
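
For reference, here is a minimal sketch of where such timeout settings would go.
The property names and values are assumptions for this Spark/Shark era, and they
only control how quickly an unresponsive node is detected; they do not, by
themselves, stop a worker with a failing disk from relaunching executors.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only. spark.akka.timeout is an application-side setting; the
// standalone master's spark.worker.timeout (heartbeat window before a worker
// is marked dead) is normally set in the cluster config, not here.
val conf = new SparkConf()
  .setAppName("shark-timeout-sketch")   // hypothetical application name
  .set("spark.akka.timeout", "100")     // Akka communication timeout, in seconds
val sc = new SparkContext(conf)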

Shouldn't the worker be marked dead in such a scenario, instead of making the
cluster unusable, so the debugging can be done at leisure?

Thanks,
Praveen R

Re: Lost an executor error - Jobs fail

Posted by Aaron Davidson <il...@gmail.com>.
Hmm, interesting. I created
https://issues.apache.org/jira/browse/SPARK-1499 to track the issue of
Workers continuously spewing bad executors, but the
real issue seems to be a combination of that and some other bug in Shark or
Spark which fails to handle the situation properly.

Please let us know if you can reproduce it (especially if deterministic!),
or if you can provide any more details about exceptions thrown. A
preliminary search didn't bring up much about the error code 101...


Re: Lost an executor error - Jobs fail

Posted by Praveen R <pr...@sigmoidanalytics.com>.
Unfortunately, queries kept failing with SparkTask -101 errors and only
started working after the troublesome node was removed.

FAILED: Execution Error, return code -101 from shark.execution.SparkTask

I wish it were easy to reproduce. I shall try hard-removing write
permissions on one node to see if the same error happens.
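
A rough sketch of one way to simulate that on a single worker, assuming the
spark-ec2 directory layout mentioned in this thread; flipping the permission bit
only approximates a genuinely read-only filesystem, and an OS-level read-only
remount would be closer to the real failure.

import java.io.File

// Revoke write permission on one of the directories listed in spark.local.dir.
// Run on the worker node, as a user allowed to change the permissions.
val dir = new File("/mnt2/spark")
if (dir.exists()) {
  val changed = dir.setWritable(false, false)   // (false, false) = revoke write for everyone
  println(s"made ${dir.getPath} non-writable: $changed")
}
// Tasks scheduled on this node should then fail to create spark-local-*
// subdirectories, which is the "Failed to create local directory" error above.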



Re: Lost an executor error - Jobs fail

Posted by Aaron Davidson <il...@gmail.com>.
Cool! It's pretty rare to actually get logs from a wild hardware failure.
The problem is, as you said, that the executor keeps failing but the worker
doesn't get the hint, so it keeps creating new, bad executors.
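
As a rough illustration of the guard SPARK-1499 is asking for (none of these
names exist in Spark; this is a sketch, not its actual code): the worker would
track consecutive executor failures per application and stop relaunching once a
threshold is crossed, instead of spawning bad executors indefinitely.

import scala.collection.mutable

// Sketch of a per-application failure counter a worker could consult
// before launching another executor.
class ExecutorFailureGuard(maxConsecutiveFailures: Int = 10) {
  private val failures = mutable.Map.empty[String, Int]   // appId -> consecutive failures

  def onExecutorFailed(appId: String): Unit =
    failures(appId) = failures.getOrElse(appId, 0) + 1

  def onExecutorRunning(appId: String): Unit =
    failures(appId) = 0   // reset once an executor comes up cleanly

  def shouldLaunch(appId: String): Boolean =
    failures.getOrElse(appId, 0) < maxConsecutiveFailures
}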

However, this issue should not have caused your cluster to fail to start
up. In the linked logs, for instance, the shark shell started up just fine
(though the "shark>" was lost in some of the log messages). Queries should
have been able to execute just fine. Was this not the case?


Re: Lost an executor error - Jobs fail

Posted by Praveen R <pr...@sigmoidanalytics.com>.
The configuration comes from the spark-ec2 setup script, which sets
spark.local.dir to use /mnt/spark, /mnt2/spark.
The setup actually worked for quite some time, and then on one of the nodes
there were some disk errors like:

mv: cannot remove
`/mnt2/spark/spark-local-20140409182103-c775/09/shuffle_1_248_0': Read-only
file system
mv: cannot remove
`/mnt2/spark/spark-local-20140409182103-c775/24/shuffle_1_260_0': Read-only
file system
mv: cannot remove
`/mnt2/spark/spark-local-20140409182103-c775/24/shuffle_2_658_0': Read-only
file system

I understand the issue is at the hardware level, but thought it would be great
if Spark could handle it and avoid the cluster going down.
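
A rough health probe an operator could run on each worker to find a node in
this state: check that every directory in spark.local.dir exists and still
accepts writes. The paths mirror the spark-ec2 layout above and are only an
example.

import java.io.File

// For each configured local dir, try to create and delete a small probe file.
val localDirs = Seq("/mnt/spark", "/mnt2/spark")
localDirs.foreach { path =>
  val dir = new File(path)
  val probe = new File(dir, s"write-probe-${System.nanoTime()}")
  val writable =
    try { dir.isDirectory && probe.createNewFile() && probe.delete() }
    catch { case _: Exception => false }
  println(s"$path writable = $writable")
}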


Re: Lost an executor error - Jobs fail

Posted by giive chen <th...@gmail.com>.
Hi Praveen

What is your config for spark.local.dir?
Do all your workers have this dir, and do all workers have the right permission
on this dir?
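
For reference, a minimal sketch of where this setting lives; the comma-separated
paths mirror the spark-ec2 defaults mentioned elsewhere in the thread and are
only an example.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.local.dir", "/mnt/spark,/mnt2/spark")   // scratch space for shuffle and spill files
// Every worker needs these directories to exist and be writable by the daemon
// user, or executors fail with "Failed to create local directory".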

I think this is the reason for your error.

Wisely Chen

