Posted to user@spark.apache.org by Ognen Duzlevski <og...@nengoiksvelzud.com> on 2014/03/23 23:06:36 UTC

No space left on device exception

Hello,

I have a weird error showing up when I run a job on my Spark cluster. 
The version of spark is 0.9 and I have 3+ GB free on the disk when this 
error shows up. Any ideas what I should be looking for?

[error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task 
167.0:3 failed 4 times (most recent failure: Exception failure: 
java.io.FileNotFoundException: 
/tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127 (No space left 
on device))
org.apache.spark.SparkException: Job aborted: Task 167.0:3 failed 4 
times (most recent failure: Exception failure: 
java.io.FileNotFoundException: 
/tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127 (No space left 
on device))
     at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
     at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
     at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
     at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
     at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
     at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
     at scala.Option.foreach(Option.scala:236)
     at 
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
     at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)

Thanks!
Ognen

Re: No space left on device exception

Posted by Ognen Duzlevski <og...@plainvanillagames.com>.
Another thing I have noticed is that out of my master+15 slaves, two 
slaves always carry a higher inode load. For example, right now I am 
running an intensive job that takes about an hour to finish, and those 
two slaves have been showing steadily increasing inode consumption 
(currently about 10% above the rest of the slaves+master).

Ognen
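
A quick way to spot which machines are filling up is to poll inode usage 
everywhere. A rough sketch, assuming passwordless ssh to the hosts listed 
in conf/slaves (adjust the Spark path to your layout):

    # Print root-filesystem inode usage (IUse%) for the master and each slave
    for host in localhost $(cat /path/to/spark/conf/slaves); do
      echo -n "$host: "
      ssh "$host" "df -i / | tail -1 | awk '{print \$5}'"
    done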

On 3/24/14, 7:00 AM, Ognen Duzlevski wrote:
> Patrick, correct. I have a 16 node cluster. On 14 machines out of 16, 
> the inode usage was about 50%. On two of the slaves, one had inode 
> usage of 96% and on the other it was 100%. When I went into /tmp on 
> these two nodes - there were a bunch of /tmp/spark* subdirectories 
> which I deleted. This resulted in the inode consumption falling back 
> down to 50% and the job running successfully to completion. The slave 
> with the 100% inode usage had the spark/work/app/<number>/stdout with 
> the message that the filesystem is running out of disk space (which I 
> posted in the original email that started this thread).
>
> What is interesting is that only two out of the 16 slaves had this 
> problem :)
>
> Ognen
>
> On 3/24/14, 12:57 AM, Patrick Wendell wrote:
>> Ognen - just so I understand. The issue is that there weren't enough
>> inodes and this was causing a "No space left on device" error? Is that
>> correct? If so, that's good to know because it's definitely
>> counterintuitive.
>>
>> On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski
>> <og...@nengoiksvelzud.com> wrote:
>>> I would love to work on this (and other) stuff if I can bother 
>>> someone with
>>> questions offline or on a dev mailing list.
>>> Ognen
>>>
>>>
>>> On 3/23/14, 10:04 PM, Aaron Davidson wrote:
>>>
>>> Thanks for bringing this up, 100% inode utilization is an issue I 
>>> haven't
>>> seen raised before and this raises another issue which is not on our 
>>> current
>>> roadmap for state cleanup (cleaning up data which was not fully 
>>> cleaned up
>>> from a crashed process).
>>>
>>>
>>> On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski
>>> <og...@plainvanillagames.com> wrote:
>>>> Bleh, strike that, one of my slaves was at 100% inode utilization 
>>>> on the
>>>> file system. It was /tmp/spark* leftovers that apparently did not get
>>>> cleaned up properly after failed or interrupted jobs.
>>>> Mental note - run a cron job on all slaves and master to clean up
>>>> /tmp/spark* regularly.
>>>>
>>>> Thanks (and sorry for the noise)!
>>>> Ognen
>>>>
>>>>
>>>> On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:
>>>>
>>>> Aaron, thanks for replying. I am very much puzzled as to what is 
>>>> going on.
>>>> A job that used to run on the same cluster is failing with this 
>>>> mysterious
>>>> message about not having enough disk space when in fact I can see 
>>>> through
>>>> "watch df -h" that the free space is always hovering around 3+GB on 
>>>> the disk
>>>> and the free inodes are at 50% (this is on master). I went through 
>>>> each
>>>> slave and the spark/work/app*/stderr and stdout and spark/logs/*out 
>>>> files
>>>> and no mention of too many open files failures on any of the slaves 
>>>> nor on
>>>> the master :(
>>>>
>>>> Thanks
>>>> Ognen
>>>>
>>>> On 3/23/14, 8:38 PM, Aaron Davidson wrote:
>>>>
>>>> By default, with P partitions (for both the pre-shuffle stage and
>>>> post-shuffle), there are P^2 files created. With
>>>> spark.shuffle.consolidateFiles turned on, we would instead create 
>>>> only P
>>>> files. Disk space consumption is largely unaffected, however, by 
>>>> the number
>>>> of partitions unless each partition is particularly small.
>>>>
>>>> You might look at the actual executors' logs, as it's possible that 
>>>> this
>>>> error was caused by an earlier exception, such as "too many open 
>>>> files".
>>>>
>>>>
>>>> On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski
>>>> <og...@plainvanillagames.com> wrote:
>>>>> On 3/23/14, 5:49 PM, Matei Zaharia wrote:
>>>>>
>>>>> You can set spark.local.dir to put this data somewhere other than 
>>>>> /tmp if
>>>>> /tmp is full. Actually it's recommended to have multiple local 
>>>>> disks and set
>>>>> it to a comma-separated list of directories, one per disk.
>>>>>
>>>>> Matei, does the number of tasks/partitions in a transformation 
>>>>> influence
>>>>> something in terms of disk space consumption? Or inode consumption?
>>>>>
>>>>> Thanks,
>>>>> Ognen
>>>>
>>>>
>>>> -- 
>>>> "A distributed system is one in which the failure of a computer you 
>>>> didn't
>>>> even know existed can render your own computer unusable"
>>>> -- Leslie Lamport
>>>
>>>
>>> -- 
>>> "No matter what they ever do to us, we must always act for the love 
>>> of our
>>> people and the earth. We must not react out of hatred against those 
>>> who have
>>> no sense."
>>> -- John Trudell
>

-- 
“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable”
-- Leslie Lamport


Re: No space left on device exception

Posted by Ognen Duzlevski <og...@plainvanillagames.com>.
Patrick, correct. I have a 16 node cluster. On 14 machines out of 16, 
the inode usage was about 50%. On two of the slaves, one had inode usage 
of 96% and on the other it was 100%. When I went into /tmp on these two 
nodes - there were a bunch of /tmp/spark* subdirectories which I 
deleted. This resulted in the inode consumption falling back down to 50% 
and the job running successfully to completion. The slave with the 100% 
inode usage had the spark/work/app/<number>/stdout with the message that 
the filesystem is running out of disk space (which I posted in the 
original email that started this thread).

What is interesting is that only two out of the 16 slaves had this 
problem :)

Ognen
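
To see where the inodes are actually going before deleting anything, 
counting entries per leftover scratch directory is enough. A rough sketch:

    # Rough inode count (files + directories) per Spark scratch dir in /tmp
    for d in /tmp/spark-*; do
      printf '%s\t%s\n' "$(find "$d" | wc -l)" "$d"
    done | sort -rn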

On 3/24/14, 12:57 AM, Patrick Wendell wrote:
> Ognen - just so I understand. The issue is that there weren't enough
> inodes and this was causing a "No space left on device" error? Is that
> correct? If so, that's good to know because it's definitely
> counterintuitive.
>
> On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski
> <og...@nengoiksvelzud.com> wrote:
>> I would love to work on this (and other) stuff if I can bother someone with
>> questions offline or on a dev mailing list.
>> Ognen
>>
>>
>> On 3/23/14, 10:04 PM, Aaron Davidson wrote:
>>
>> Thanks for bringing this up, 100% inode utilization is an issue I haven't
>> seen raised before and this raises another issue which is not on our current
>> roadmap for state cleanup (cleaning up data which was not fully cleaned up
>> from a crashed process).
>>
>>
>> On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski
>> <og...@plainvanillagames.com> wrote:
>>> Bleh, strike that, one of my slaves was at 100% inode utilization on the
>>> file system. It was /tmp/spark* leftovers that apparently did not get
>>> cleaned up properly after failed or interrupted jobs.
>>> Mental note - run a cron job on all slaves and master to clean up
>>> /tmp/spark* regularly.
>>>
>>> Thanks (and sorry for the noise)!
>>> Ognen
>>>
>>>
>>> On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:
>>>
>>> Aaron, thanks for replying. I am very much puzzled as to what is going on.
>>> A job that used to run on the same cluster is failing with this mysterious
>>> message about not having enough disk space when in fact I can see through
>>> "watch df -h" that the free space is always hovering around 3+GB on the disk
>>> and the free inodes are at 50% (this is on master). I went through each
>>> slave and the spark/work/app*/stderr and stdout and spark/logs/*out files
>>> and no mention of too many open files failures on any of the slaves nor on
>>> the master :(
>>>
>>> Thanks
>>> Ognen
>>>
>>> On 3/23/14, 8:38 PM, Aaron Davidson wrote:
>>>
>>> By default, with P partitions (for both the pre-shuffle stage and
>>> post-shuffle), there are P^2 files created. With
>>> spark.shuffle.consolidateFiles turned on, we would instead create only P
>>> files. Disk space consumption is largely unaffected, however, by the number
>>> of partitions unless each partition is particularly small.
>>>
>>> You might look at the actual executors' logs, as it's possible that this
>>> error was caused by an earlier exception, such as "too many open files".
>>>
>>>
>>> On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski
>>> <og...@plainvanillagames.com> wrote:
>>>> On 3/23/14, 5:49 PM, Matei Zaharia wrote:
>>>>
>>>> You can set spark.local.dir to put this data somewhere other than /tmp if
>>>> /tmp is full. Actually it's recommended to have multiple local disks and set
>>> it to a comma-separated list of directories, one per disk.
>>>>
>>>> Matei, does the number of tasks/partitions in a transformation influence
>>>> something in terms of disk space consumption? Or inode consumption?
>>>>
>>>> Thanks,
>>>> Ognen
>>>
>>>
>>> --
>>> "A distributed system is one in which the failure of a computer you didn't
>>> even know existed can render your own computer unusable"
>>> -- Leslie Lamport
>>
>>
>> --
>> "No matter what they ever do to us, we must always act for the love of our
>> people and the earth. We must not react out of hatred against those who have
>> no sense."
>> -- John Trudell

-- 
“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable”
-- Leslie Lamport


Re: No space left on device exception

Posted by Patrick Wendell <pw...@gmail.com>.
Ognen - just so I understand. The issue is that there weren't enough
inodes and this was causing a "No space left on device" error? Is that
correct? If so, that's good to know because it's definitely
counterintuitive.

On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski
<og...@nengoiksvelzud.com> wrote:
> I would love to work on this (and other) stuff if I can bother someone with
> questions offline or on a dev mailing list.
> Ognen
>
>
> On 3/23/14, 10:04 PM, Aaron Davidson wrote:
>
> Thanks for bringing this up, 100% inode utilization is an issue I haven't
> seen raised before and this raises another issue which is not on our current
> roadmap for state cleanup (cleaning up data which was not fully cleaned up
> from a crashed process).
>
>
> On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski
> <og...@plainvanillagames.com> wrote:
>>
>> Bleh, strike that, one of my slaves was at 100% inode utilization on the
>> file system. It was /tmp/spark* leftovers that apparently did not get
>> cleaned up properly after failed or interrupted jobs.
>> Mental note - run a cron job on all slaves and master to clean up
>> /tmp/spark* regularly.
>>
>> Thanks (and sorry for the noise)!
>> Ognen
>>
>>
>> On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:
>>
>> Aaron, thanks for replying. I am very much puzzled as to what is going on.
>> A job that used to run on the same cluster is failing with this mysterious
>> message about not having enough disk space when in fact I can see through
>> "watch df -h" that the free space is always hovering around 3+GB on the disk
>> and the free inodes are at 50% (this is on master). I went through each
>> slave and the spark/work/app*/stderr and stdout and spark/logs/*out files
>> and no mention of too many open files failures on any of the slaves nor on
>> the master :(
>>
>> Thanks
>> Ognen
>>
>> On 3/23/14, 8:38 PM, Aaron Davidson wrote:
>>
>> By default, with P partitions (for both the pre-shuffle stage and
>> post-shuffle), there are P^2 files created. With
>> spark.shuffle.consolidateFiles turned on, we would instead create only P
>> files. Disk space consumption is largely unaffected, however, by the number
>> of partitions unless each partition is particularly small.
>>
>> You might look at the actual executors' logs, as it's possible that this
>> error was caused by an earlier exception, such as "too many open files".
>>
>>
>> On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski
>> <og...@plainvanillagames.com> wrote:
>>>
>>> On 3/23/14, 5:49 PM, Matei Zaharia wrote:
>>>
>>> You can set spark.local.dir to put this data somewhere other than /tmp if
>>> /tmp is full. Actually it's recommended to have multiple local disks and set
>> it to a comma-separated list of directories, one per disk.
>>>
>>> Matei, does the number of tasks/partitions in a transformation influence
>>> something in terms of disk space consumption? Or inode consumption?
>>>
>>> Thanks,
>>> Ognen
>>
>>
>>
>> --
>> "A distributed system is one in which the failure of a computer you didn't
>> even know existed can render your own computer unusable"
>> -- Leslie Lamport
>
>
>
> --
> "No matter what they ever do to us, we must always act for the love of our
> people and the earth. We must not react out of hatred against those who have
> no sense."
> -- John Trudell

Re: No space left on device exception

Posted by Ognen Duzlevski <og...@nengoiksvelzud.com>.
I would love to work on this (and other) stuff if I can bother someone 
with questions offline or on a dev mailing list.
Ognen

On 3/23/14, 10:04 PM, Aaron Davidson wrote:
> Thanks for bringing this up, 100% inode utilization is an issue I 
> haven't seen raised before and this raises another issue which is not 
> on our current roadmap for state cleanup (cleaning up data which was 
> not fully cleaned up from a crashed process).
>
>
> On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski 
> <ognen@plainvanillagames.com <ma...@plainvanillagames.com>> wrote:
>
>     Bleh, strike that, one of my slaves was at 100% inode utilization
>     on the file system. It was /tmp/spark* leftovers that apparently
>     did not get cleaned up properly after failed or interrupted jobs.
>     Mental note - run a cron job on all slaves and master to clean up
>     /tmp/spark* regularly.
>
>     Thanks (and sorry for the noise)!
>     Ognen
>
>
>     On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:
>>     Aaron, thanks for replying. I am very much puzzled as to what is
>>     going on. A job that used to run on the same cluster is failing
>>     with this mysterious message about not having enough disk space
>>     when in fact I can see through "watch df -h" that the free space
>>     is always hovering around 3+GB on the disk and the free inodes
>>     are at 50% (this is on master). I went through each slave and the
>>     spark/work/app*/stderr and stdout and spark/logs/*out files and
>>     no mention of too many open files failures on any of the slaves
>>     nor on the master :(
>>
>>     Thanks
>>     Ognen
>>
>>     On 3/23/14, 8:38 PM, Aaron Davidson wrote:
>>>     By default, with P partitions (for both the pre-shuffle stage
>>>     and post-shuffle), there are P^2 files created.
>>>     With spark.shuffle.consolidateFiles turned on, we would instead
>>>     create only P files. Disk space consumption is largely
>>>     unaffected, however, by the number of partitions unless each
>>>     partition is particularly small.
>>>
>>>     You might look at the actual executors' logs, as it's possible
>>>     that this error was caused by an earlier exception, such as "too
>>>     many open files".
>>>
>>>
>>>     On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski
>>>     <ognen@plainvanillagames.com
>>>     <ma...@plainvanillagames.com>> wrote:
>>>
>>>         On 3/23/14, 5:49 PM, Matei Zaharia wrote:
>>>>         You can set spark.local.dir to put this data somewhere
>>>>         other than /tmp if /tmp is full. Actually it's recommended
>>>>         to have multiple local disks and set it to a
>>>>         comma-separated list of directories, one per disk.
>>>         Matei, does the number of tasks/partitions in a
>>>         transformation influence something in terms of disk space
>>>         consumption? Or inode consumption?
>>>
>>>         Thanks,
>>>         Ognen
>>>
>>
>
>     -- 
>     "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable"
>     -- Leslie Lamport
>
>
>
> -- 
> "No matter what they ever do to us, we must always act for the love of our people and the earth. We must not react out of hatred against those who have no sense."
> -- John Trudell

Re: No space left on device exception

Posted by Aaron Davidson <il...@gmail.com>.
Thanks for bringing this up. 100% inode utilization is an issue I haven't
seen raised before, and it points to another issue which is not on our
current roadmap for state cleanup (cleaning up data which was not fully
cleaned up from a crashed process).


On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski <
ognen@plainvanillagames.com> wrote:

>  Bleh, strike that, one of my slaves was at 100% inode utilization on the
> file system. It was /tmp/spark* leftovers that apparently did not get
> cleaned up properly after failed or interrupted jobs.
> Mental note - run a cron job on all slaves and master to clean up
> /tmp/spark* regularly.
>
> Thanks (and sorry for the noise)!
> Ognen
>
>
> On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:
>
> Aaron, thanks for replying. I am very much puzzled as to what is going on.
> A job that used to run on the same cluster is failing with this mysterious
> message about not having enough disk space when in fact I can see through
> "watch df -h" that the free space is always hovering around 3+GB on the
> disk and the free inodes are at 50% (this is on master). I went through
> each slave and the spark/work/app*/stderr and stdout and spark/logs/*out
> files and no mention of too many open files failures on any of the slaves
> nor on the master :(
>
> Thanks
> Ognen
>
> On 3/23/14, 8:38 PM, Aaron Davidson wrote:
>
> By default, with P partitions (for both the pre-shuffle stage and
> post-shuffle), there are P^2 files created.
> With spark.shuffle.consolidateFiles turned on, we would instead create only
> P files. Disk space consumption is largely unaffected, however, by the
> number of partitions unless each partition is particularly small.
>
>  You might look at the actual executors' logs, as it's possible that this
> error was caused by an earlier exception, such as "too many open files".
>
>
> On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski <
> ognen@plainvanillagames.com> wrote:
>
>>  On 3/23/14, 5:49 PM, Matei Zaharia wrote:
>>
>> You can set spark.local.dir to put this data somewhere other than /tmp if
>> /tmp is full. Actually it's recommended to have multiple local disks and
>> set it to a comma-separated list of directories, one per disk.
>>
>>  Matei, does the number of tasks/partitions in a transformation influence
>> something in terms of disk space consumption? Or inode consumption?
>>
>> Thanks,
>> Ognen
>>
>
>
> --
> "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable"
> -- Leslie Lamport
>
>

Re: No space left on device exception

Posted by Ognen Duzlevski <og...@plainvanillagames.com>.
Bleh, strike that, one of my slaves was at 100% inode utilization on the 
file system. It was /tmp/spark* leftovers that apparently did not get 
cleaned up properly after failed or interrupted jobs.
Mental note - run a cron job on all slaves and master to clean up 
/tmp/spark* regularly.

Thanks (and sorry for the noise)!
Ognen
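
For anyone making the same mental note, a sketch of such a crontab entry 
(the age threshold is an assumption; only remove directories old enough 
that no running job could still be writing to them):

    # Hourly: delete Spark scratch dirs in /tmp untouched for 2+ days
    0 * * * *  find /tmp -maxdepth 1 -name 'spark-*' -mtime +2 -exec rm -rf {} +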

On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:
> Aaron, thanks for replying. I am very much puzzled as to what is going 
> on. A job that used to run on the same cluster is failing with this 
> mysterious message about not having enough disk space when in fact I 
> can see through "watch df -h" that the free space is always hovering 
> around 3+GB on the disk and the free inodes are at 50% (this is on 
> master). I went through each slave and the spark/work/app*/stderr and 
> stdout and spark/logs/*out files and no mention of too many open files 
> failures on any of the slaves nor on the master :(
>
> Thanks
> Ognen
>
> On 3/23/14, 8:38 PM, Aaron Davidson wrote:
>> By default, with P partitions (for both the pre-shuffle stage and 
>> post-shuffle), there are P^2 files created. 
>> With spark.shuffle.consolidateFiles turned on, we would instead 
>> create only P files. Disk space consumption is largely unaffected, 
>> however, by the number of partitions unless each partition is 
>> particularly small.
>>
>> You might look at the actual executors' logs, as it's possible that 
>> this error was caused by an earlier exception, such as "too many open 
>> files".
>>
>>
>> On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski 
>> <ognen@plainvanillagames.com <ma...@plainvanillagames.com>> wrote:
>>
>>     On 3/23/14, 5:49 PM, Matei Zaharia wrote:
>>>     You can set spark.local.dir to put this data somewhere other
>>>     than /tmp if /tmp is full. Actually it's recommended to have
>>>     multiple local disks and set it to a comma-separated list of
>>>     directories, one per disk.
>>     Matei, does the number of tasks/partitions in a transformation
>>     influence something in terms of disk space consumption? Or inode
>>     consumption?
>>
>>     Thanks,
>>     Ognen
>>
>

-- 
"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable"
-- Leslie Lamport


Re: No space left on device exception

Posted by Ognen Duzlevski <og...@plainvanillagames.com>.
Aaron, thanks for replying. I am very much puzzled as to what is going 
on. A job that used to run on the same cluster is failing with this 
mysterious message about not having enough disk space when in fact I can 
see through "watch df -h" that the free space is always hovering around 
3+GB on the disk and the free inodes are at 50% (this is on master). I 
went through each slave and the spark/work/app*/stderr and stdout and 
spark/logs/*out files and no mention of too many open files failures on 
any of the slaves nor on the master :(

Thanks
Ognen
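
A shortcut for that kind of log sweep, run on each node (paths assumed to 
match the standalone layout described above):

    # Look for file-handle exhaustion in executor and daemon logs
    grep -Ri "too many open files" spark/work/app*/ spark/logs/
    ulimit -n    # current per-process open-file limit, for comparison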

On 3/23/14, 8:38 PM, Aaron Davidson wrote:
> By default, with P partitions (for both the pre-shuffle stage and 
> post-shuffle), there are P^2 files created. 
> With spark.shuffle.consolidateFiles turned on, we would instead create 
> only P files. Disk space consumption is largely unaffected, however, 
> by the number of partitions unless each partition is particularly small.
>
> You might look at the actual executors' logs, as it's possible that 
> this error was caused by an earlier exception, such as "too many open 
> files".
>
>
> On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski 
> <ognen@plainvanillagames.com <ma...@plainvanillagames.com>> wrote:
>
>     On 3/23/14, 5:49 PM, Matei Zaharia wrote:
>>     You can set spark.local.dir to put this data somewhere other than
>>     /tmp if /tmp is full. Actually it's recommended to have multiple
>>     local disks and set to to a comma-separated list of directories,
>>     one per disk.
>     Matei, does the number of tasks/partitions in a transformation
>     influence something in terms of disk space consumption? Or inode
>     consumption?
>
>     Thanks,
>     Ognen
>


Re: No space left on device exception

Posted by Aaron Davidson <il...@gmail.com>.
By default, with P partitions (for both the pre-shuffle stage and
post-shuffle), there are P^2 files created.
With spark.shuffle.consolidateFiles turned on, we would instead create only
P files. Disk space consumption is largely unaffected, however, by the
number of partitions unless each partition is particularly small.

You might look at the actual executors' logs, as it's possible that this
error was caused by an earlier exception, such as "too many open files".
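
As a sketch of turning that on in a 0.9 standalone deployment (assuming 
system properties are passed through SPARK_JAVA_OPTS in conf/spark-env.sh; 
setting the same property on the driver's SparkConf should work as well):

    # conf/spark-env.sh on every node
    export SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.shuffle.consolidateFiles=true"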


On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski <
ognen@plainvanillagames.com> wrote:

>  On 3/23/14, 5:49 PM, Matei Zaharia wrote:
>
> You can set spark.local.dir to put this data somewhere other than /tmp if
> /tmp is full. Actually it's recommended to have multiple local disks and
> set it to a comma-separated list of directories, one per disk.
>
> Matei, does the number of tasks/partitions in a transformation influence
> something in terms of disk space consumption? Or inode consumption?
>
> Thanks,
> Ognen
>

Re: No space left on device exception

Posted by Ognen Duzlevski <og...@plainvanillagames.com>.
On 3/23/14, 5:49 PM, Matei Zaharia wrote:
> You can set spark.local.dir to put this data somewhere other than /tmp 
> if /tmp is full. Actually it’s recommended to have multiple local 
> disks and set it to a comma-separated list of directories, one per disk.
Matei, does the number of tasks/partitions in a transformation influence 
something in terms of disk space consumption? Or inode consumption?

Thanks,
Ognen

Re: No space left on device exception

Posted by Matei Zaharia <ma...@gmail.com>.
You can set spark.local.dir to put this data somewhere other than /tmp if /tmp is full. Actually it’s recommended to have multiple local disks and set it to a comma-separated list of directories, one per disk.

Matei
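
A sketch of that setting, with made-up mount points and using the same 
SPARK_JAVA_OPTS mechanism as any other property in a 0.9 standalone 
deployment (one directory per physical disk; the directories must already 
exist on every node):

    # conf/spark-env.sh on every node: spread shuffle/spill files across two disks
    export SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.local.dir=/mnt/spark1,/mnt/spark2"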

On Mar 23, 2014, at 3:35 PM, Aaron Davidson <il...@gmail.com> wrote:

> On some systems, /tmp/ is an in-memory tmpfs file system, with its own size limit. It's possible that this limit has been exceeded. You might try running the "df" command to check the free space of "/tmp" or root if tmp isn't listed.
> 
> 3 GB also seems pretty low for the remaining free space of a disk. If your disk size is in the TB range, it's possible that the last couple GB have issues when being allocated due to fragmentation or reclamation policies.
> 
> 
> On Sun, Mar 23, 2014 at 3:06 PM, Ognen Duzlevski <og...@nengoiksvelzud.com> wrote:
> Hello,
> 
> I have a weird error showing up when I run a job on my Spark cluster. The version of spark is 0.9 and I have 3+ GB free on the disk when this error shows up. Any ideas what I should be looking for?
> 
> [error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task 167.0:3 failed 4 times (most recent failure: Exception failure: java.io.FileNotFoundException: /tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127 (No space left on device))
> org.apache.spark.SparkException: Job aborted: Task 167.0:3 failed 4 times (most recent failure: Exception failure: java.io.FileNotFoundException: /tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127 (No space left on device))
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
>     at scala.Option.foreach(Option.scala:236)
>     at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
> 
> Thanks!
> Ognen
> 


Re: No space left on device exception

Posted by Ognen Duzlevski <og...@plainvanillagames.com>.
On 3/23/14, 5:35 PM, Aaron Davidson wrote:
> On some systems, /tmp/ is an in-memory tmpfs file system, with its own 
> size limit. It's possible that this limit has been exceeded. You might 
> try running the "df" command to check the free space of "/tmp" or root 
> if tmp isn't listed.
>
> 3 GB also seems pretty low for the remaining free space of a disk. If 
> your disk size is in the TB range, it's possible that the last couple 
> GB have issues when being allocated due to fragmentation or 
> reclamation policies.
>

Aaron, thanks for the reply. These are Amazon Ubuntu instances, each with 
an 8GB root filesystem (with everything OS+Spark taking up about 4.5GB). 
/tmp appears to be just a regular directory on / and hence shares the 
same ~3.5GB of free space. I was "watch df"ing the space while my job 
was running.
Ognen
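
Watching inode usage alongside block usage surfaces this kind of 
exhaustion, which plain "df -h" misses. For example, with GNU watch and df:

    # Refresh block and inode usage of the root filesystem every 5 seconds
    watch -n 5 'df -h /; df -i /'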

Re: No space left on device exception

Posted by Aaron Davidson <il...@gmail.com>.
On some systems, /tmp/ is an in-memory tmpfs file system, with its own size
limit. It's possible that this limit has been exceeded. You might try
running the "df" command to check to free space of "/tmp" or root if tmp
isn't listed.

3 GB also seems pretty low for the remaining free space of a disk. If your
disk size is in the TB range, it's possible that the last couple GB have
issues when being allocated due to fragmentation or reclamation policies.
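
Concretely, something like the following shows whether /tmp is its own 
(possibly tmpfs) mount and how full it is, in both bytes and inodes:

    # Filesystem type and block usage for whatever backs /tmp
    df -hT /tmp
    # Inode usage; 100% IUse% also produces "No space left on device"
    df -i /tmp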


On Sun, Mar 23, 2014 at 3:06 PM, Ognen Duzlevski
<og...@nengoiksvelzud.com>wrote:

> Hello,
>
> I have a weird error showing up when I run a job on my Spark cluster. The
> version of spark is 0.9 and I have 3+ GB free on the disk when this error
> shows up. Any ideas what I should be looking for?
>
> [error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task
> 167.0:3 failed 4 times (most recent failure: Exception failure:
> java.io.FileNotFoundException: /tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127
> (No space left on device))
> org.apache.spark.SparkException: Job aborted: Task 167.0:3 failed 4 times
> (most recent failure: Exception failure: java.io.FileNotFoundException:
> /tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127 (No space left
> on device))
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$
> apache$spark$scheduler$DAGScheduler$$abortStage$1.
> apply(DAGScheduler.scala:1028)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$
> apache$spark$scheduler$DAGScheduler$$abortStage$1.
> apply(DAGScheduler.scala:1026)
>     at scala.collection.mutable.ResizableArray$class.foreach(
> ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$
> scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$
> processEvent$10.apply(DAGScheduler.scala:619)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$
> processEvent$10.apply(DAGScheduler.scala:619)
>     at scala.Option.foreach(Option.scala:236)
>     at org.apache.spark.scheduler.DAGScheduler.processEvent(
> DAGScheduler.scala:619)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$
> $anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
>
> Thanks!
> Ognen
>