Posted to user@spark.apache.org by Ognen Duzlevski <og...@nengoiksvelzud.com> on 2014/03/23 23:06:36 UTC
No space left on device exception
Hello,
I have a weird error showing up when I run a job on my Spark cluster.
The version of Spark is 0.9 and I have 3+ GB free on the disk when this
error shows up. Any ideas what I should be looking for?
[error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task 167.0:3 failed 4 times (most recent failure: Exception failure: java.io.FileNotFoundException: /tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127 (No space left on device))
org.apache.spark.SparkException: Job aborted: Task 167.0:3 failed 4 times (most recent failure: Exception failure: java.io.FileNotFoundException: /tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127 (No space left on device))
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
Thanks!
Ognen
Re: No space left on device exception
Posted by Ognen Duzlevski <og...@plainvanillagames.com>.
Another thing I have noticed is that out of my master + 15 slaves, two
slaves always carry a higher inode load. For example, right now I am
running an intensive job that takes about an hour to finish, and the
same two slaves show steadily climbing inode consumption (currently
about 10% above the rest of the slaves and the master).
Ognen
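[A quick way to put a number on that inode load is to pull the IUse% figure out of df; a minimal sketch for one node, which can then be run over ssh against each slave:]

```shell
# Print the inode-usage percentage (IUse%) of the filesystem backing /tmp.
# -P forces POSIX one-line-per-filesystem output so awk's field numbers
# are stable even for long device names.
df -iP /tmp | awk 'NR==2 {print $5}'
```

[To survey a cluster, wrap it in a loop such as `for h in slave-01 slave-02; do ssh "$h" "df -iP /tmp | awk 'NR==2 {print \$5}'"; done` - the hostnames are illustrative.]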
On 3/24/14, 7:00 AM, Ognen Duzlevski wrote:
> Patrick, correct. I have a 16 node cluster. On 14 machines out of 16,
> the inode usage was about 50%. On two of the slaves, one had inode
> usage of 96% and on the other it was 100%. When I went into /tmp on
> these two nodes - there were a bunch of /tmp/spark* subdirectories
> which I deleted. This resulted in the inode consumption falling back
> down to 50% and the job running successfully to completion. The slave
> with the 100% inode usage had the spark/work/app/<number>/stdout with
> the message that the filesystem is running out of disk space (which I
> posted in the original email that started this thread).
>
> What is interesting is that only two out of the 16 slaves had this
> problem :)
>
> Ognen
>
> On 3/24/14, 12:57 AM, Patrick Wendell wrote:
>> Ognen - just so I understand. The issue is that there weren't enough
>> inodes and this was causing a "No space left on device" error? Is that
>> correct? If so, that's good to know because it's definitely counter
>> intuitive.
>>
>> On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski
>> <og...@nengoiksvelzud.com> wrote:
>>> I would love to work on this (and other) stuff if I can bother
>>> someone with
>>> questions offline or on a dev mailing list.
>>> Ognen
>>>
>>>
>>> On 3/23/14, 10:04 PM, Aaron Davidson wrote:
>>>
>>> Thanks for bringing this up, 100% inode utilization is an issue I
>>> haven't
>>> seen raised before and this raises another issue which is not on our
>>> current
>>> roadmap for state cleanup (cleaning up data which was not fully
>>> cleaned up
>>> from a crashed process).
>>>
>>>
>>> On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski
>>> <og...@plainvanillagames.com> wrote:
>>>> Bleh, strike that, one of my slaves was at 100% inode utilization
>>>> on the
>>>> file system. It was /tmp/spark* leftovers that apparently did not get
>>>> cleaned up properly after failed or interrupted jobs.
>>>> Mental note - run a cron job on all slaves and master to clean up
>>>> /tmp/spark* regularly.
>>>>
>>>> Thanks (and sorry for the noise)!
>>>> Ognen
>>>>
>>>>
>>>> On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:
>>>>
>>>> Aaron, thanks for replying. I am very much puzzled as to what is
>>>> going on.
>>>> A job that used to run on the same cluster is failing with this
>>>> mysterious
>>>> message about not having enough disk space when in fact I can see
>>>> through
>>>> "watch df -h" that the free space is always hovering around 3+GB on
>>>> the disk
>>>> and the free inodes are at 50% (this is on master). I went through
>>>> each
>>>> slave and the spark/work/app*/stderr and stdout and spark/logs/*out
>>>> files
>>>> and no mention of too many open files failures on any of the slaves
>>>> nor on
>>>> the master :(
>>>>
>>>> Thanks
>>>> Ognen
>>>>
>>>> On 3/23/14, 8:38 PM, Aaron Davidson wrote:
>>>>
>>>> By default, with P partitions (for both the pre-shuffle stage and
>>>> post-shuffle), there are P^2 files created. With
>>>> spark.shuffle.consolidateFiles turned on, we would instead create
>>>> only P
>>>> files. Disk space consumption is largely unaffected, however, by
>>>> the number
>>>> of partitions unless each partition is particularly small.
>>>>
>>>> You might look at the actual executors' logs, as it's possible that
>>>> this
>>>> error was caused by an earlier exception, such as "too many open
>>>> files".
>>>>
>>>>
>>>> On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski
>>>> <og...@plainvanillagames.com> wrote:
>>>>> On 3/23/14, 5:49 PM, Matei Zaharia wrote:
>>>>>
>>>>> You can set spark.local.dir to put this data somewhere other than
>>>>> /tmp if
>>>>> /tmp is full. Actually it's recommended to have multiple local
>>>>> disks and set
>>>>> it to a comma-separated list of directories, one per disk.
>>>>>
>>>>> Matei, does the number of tasks/partitions in a transformation
>>>>> influence
>>>>> something in terms of disk space consumption? Or inode consumption?
>>>>>
>>>>> Thanks,
>>>>> Ognen
>>>>
>>>>
>>>> --
>>>> "A distributed system is one in which the failure of a computer you
>>>> didn't
>>>> even know existed can render your own computer unusable"
>>>> -- Leslie Lamport
>>>
>>>
>>> --
>>> "No matter what they ever do to us, we must always act for the love
>>> of our
>>> people and the earth. We must not react out of hatred against those
>>> who have
>>> no sense."
>>> -- John Trudell
>
--
“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable”
-- Leslie Lamport
Re: No space left on device exception
Posted by Ognen Duzlevski <og...@plainvanillagames.com>.
Patrick, correct. I have a 16 node cluster. On 14 machines out of 16,
the inode usage was about 50%. On two of the slaves, one had inode usage
of 96% and on the other it was 100%. When I went into /tmp on these two
nodes - there were a bunch of /tmp/spark* subdirectories which I
deleted. This resulted in the inode consumption falling back down to 50%
and the job running successfully to completion. The slave with the 100%
inode usage had the spark/work/app/<number>/stdout with the message that
the filesystem is running out of disk space (which I posted in the
original email that started this thread).
What is interesting is that only two out of the 16 slaves had this
problem :)
Ognen
On 3/24/14, 12:57 AM, Patrick Wendell wrote:
> Ognen - just so I understand. The issue is that there weren't enough
> inodes and this was causing a "No space left on device" error? Is that
> correct? If so, that's good to know because it's definitely counter
> intuitive.
>
> On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski
> <og...@nengoiksvelzud.com> wrote:
>> I would love to work on this (and other) stuff if I can bother someone with
>> questions offline or on a dev mailing list.
>> Ognen
>>
>>
>> On 3/23/14, 10:04 PM, Aaron Davidson wrote:
>>
>> Thanks for bringing this up, 100% inode utilization is an issue I haven't
>> seen raised before and this raises another issue which is not on our current
>> roadmap for state cleanup (cleaning up data which was not fully cleaned up
>> from a crashed process).
>>
>>
>> On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski
>> <og...@plainvanillagames.com> wrote:
>>> Bleh, strike that, one of my slaves was at 100% inode utilization on the
>>> file system. It was /tmp/spark* leftovers that apparently did not get
>>> cleaned up properly after failed or interrupted jobs.
>>> Mental note - run a cron job on all slaves and master to clean up
>>> /tmp/spark* regularly.
>>>
>>> Thanks (and sorry for the noise)!
>>> Ognen
>>>
>>>
>>> On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:
>>>
>>> Aaron, thanks for replying. I am very much puzzled as to what is going on.
>>> A job that used to run on the same cluster is failing with this mysterious
>>> message about not having enough disk space when in fact I can see through
>>> "watch df -h" that the free space is always hovering around 3+GB on the disk
>>> and the free inodes are at 50% (this is on master). I went through each
>>> slave and the spark/work/app*/stderr and stdout and spark/logs/*out files
>>> and no mention of too many open files failures on any of the slaves nor on
>>> the master :(
>>>
>>> Thanks
>>> Ognen
>>>
>>> On 3/23/14, 8:38 PM, Aaron Davidson wrote:
>>>
>>> By default, with P partitions (for both the pre-shuffle stage and
>>> post-shuffle), there are P^2 files created. With
>>> spark.shuffle.consolidateFiles turned on, we would instead create only P
>>> files. Disk space consumption is largely unaffected, however, by the number
>>> of partitions unless each partition is particularly small.
>>>
>>> You might look at the actual executors' logs, as it's possible that this
>>> error was caused by an earlier exception, such as "too many open files".
>>>
>>>
>>> On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski
>>> <og...@plainvanillagames.com> wrote:
>>>> On 3/23/14, 5:49 PM, Matei Zaharia wrote:
>>>>
>>>> You can set spark.local.dir to put this data somewhere other than /tmp if
>>>> /tmp is full. Actually it's recommended to have multiple local disks and set
>>>> it to a comma-separated list of directories, one per disk.
>>>>
>>>> Matei, does the number of tasks/partitions in a transformation influence
>>>> something in terms of disk space consumption? Or inode consumption?
>>>>
>>>> Thanks,
>>>> Ognen
>>>
>>>
>>> --
>>> "A distributed system is one in which the failure of a computer you didn't
>>> even know existed can render your own computer unusable"
>>> -- Leslie Lamport
>>
>>
>> --
>> "No matter what they ever do to us, we must always act for the love of our
>> people and the earth. We must not react out of hatred against those who have
>> no sense."
>> -- John Trudell
--
“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable”
-- Leslie Lamport
Re: No space left on device exception
Posted by Patrick Wendell <pw...@gmail.com>.
Ognen - just so I understand. The issue is that there weren't enough
inodes and this was causing a "No space left on device" error? Is that
correct? If so, that's good to know because it's definitely
counterintuitive.
On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski
<og...@nengoiksvelzud.com> wrote:
> I would love to work on this (and other) stuff if I can bother someone with
> questions offline or on a dev mailing list.
> Ognen
>
>
> On 3/23/14, 10:04 PM, Aaron Davidson wrote:
>
> Thanks for bringing this up, 100% inode utilization is an issue I haven't
> seen raised before and this raises another issue which is not on our current
> roadmap for state cleanup (cleaning up data which was not fully cleaned up
> from a crashed process).
>
>
> On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski
> <og...@plainvanillagames.com> wrote:
>>
>> Bleh, strike that, one of my slaves was at 100% inode utilization on the
>> file system. It was /tmp/spark* leftovers that apparently did not get
>> cleaned up properly after failed or interrupted jobs.
>> Mental note - run a cron job on all slaves and master to clean up
>> /tmp/spark* regularly.
>>
>> Thanks (and sorry for the noise)!
>> Ognen
>>
>>
>> On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:
>>
>> Aaron, thanks for replying. I am very much puzzled as to what is going on.
>> A job that used to run on the same cluster is failing with this mysterious
>> message about not having enough disk space when in fact I can see through
>> "watch df -h" that the free space is always hovering around 3+GB on the disk
>> and the free inodes are at 50% (this is on master). I went through each
>> slave and the spark/work/app*/stderr and stdout and spark/logs/*out files
>> and no mention of too many open files failures on any of the slaves nor on
>> the master :(
>>
>> Thanks
>> Ognen
>>
>> On 3/23/14, 8:38 PM, Aaron Davidson wrote:
>>
>> By default, with P partitions (for both the pre-shuffle stage and
>> post-shuffle), there are P^2 files created. With
>> spark.shuffle.consolidateFiles turned on, we would instead create only P
>> files. Disk space consumption is largely unaffected, however, by the number
>> of partitions unless each partition is particularly small.
>>
>> You might look at the actual executors' logs, as it's possible that this
>> error was caused by an earlier exception, such as "too many open files".
>>
>>
>> On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski
>> <og...@plainvanillagames.com> wrote:
>>>
>>> On 3/23/14, 5:49 PM, Matei Zaharia wrote:
>>>
>>> You can set spark.local.dir to put this data somewhere other than /tmp if
>>> /tmp is full. Actually it's recommended to have multiple local disks and set
>>> it to a comma-separated list of directories, one per disk.
>>>
>>> Matei, does the number of tasks/partitions in a transformation influence
>>> something in terms of disk space consumption? Or inode consumption?
>>>
>>> Thanks,
>>> Ognen
>>
>>
>>
>> --
>> "A distributed system is one in which the failure of a computer you didn't
>> even know existed can render your own computer unusable"
>> -- Leslie Lamport
>
>
>
> --
> "No matter what they ever do to us, we must always act for the love of our
> people and the earth. We must not react out of hatred against those who have
> no sense."
> -- John Trudell
Re: No space left on device exception
Posted by Ognen Duzlevski <og...@nengoiksvelzud.com>.
I would love to work on this (and other) stuff if I can bother someone
with questions offline or on a dev mailing list.
Ognen
On 3/23/14, 10:04 PM, Aaron Davidson wrote:
> Thanks for bringing this up, 100% inode utilization is an issue I
> haven't seen raised before and this raises another issue which is not
> on our current roadmap for state cleanup (cleaning up data which was
> not fully cleaned up from a crashed process).
>
>
> On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski
> <ognen@plainvanillagames.com> wrote:
>
> Bleh, strike that, one of my slaves was at 100% inode utilization
> on the file system. It was /tmp/spark* leftovers that apparently
> did not get cleaned up properly after failed or interrupted jobs.
> Mental note - run a cron job on all slaves and master to clean up
> /tmp/spark* regularly.
>
> Thanks (and sorry for the noise)!
> Ognen
>
>
> On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:
>> Aaron, thanks for replying. I am very much puzzled as to what is
>> going on. A job that used to run on the same cluster is failing
>> with this mysterious message about not having enough disk space
>> when in fact I can see through "watch df -h" that the free space
>> is always hovering around 3+GB on the disk and the free inodes
>> are at 50% (this is on master). I went through each slave and the
>> spark/work/app*/stderr and stdout and spark/logs/*out files and
>> no mention of too many open files failures on any of the slaves
>> nor on the master :(
>>
>> Thanks
>> Ognen
>>
>> On 3/23/14, 8:38 PM, Aaron Davidson wrote:
>>> By default, with P partitions (for both the pre-shuffle stage
>>> and post-shuffle), there are P^2 files created.
>>> With spark.shuffle.consolidateFiles turned on, we would instead
>>> create only P files. Disk space consumption is largely
>>> unaffected, however, by the number of partitions unless each
>>> partition is particularly small.
>>>
>>> You might look at the actual executors' logs, as it's possible
>>> that this error was caused by an earlier exception, such as "too
>>> many open files".
>>>
>>>
>>> On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski
>>> <ognen@plainvanillagames.com> wrote:
>>>
>>> On 3/23/14, 5:49 PM, Matei Zaharia wrote:
>>>> You can set spark.local.dir to put this data somewhere
>>>> other than /tmp if /tmp is full. Actually it's recommended
>>>> to have multiple local disks and set it to a
>>>> comma-separated list of directories, one per disk.
>>> Matei, does the number of tasks/partitions in a
>>> transformation influence something in terms of disk space
>>> consumption? Or inode consumption?
>>>
>>> Thanks,
>>> Ognen
>>>
>>
>
> --
> "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable"
> -- Leslie Lamport
>
>
>
> --
> "No matter what they ever do to us, we must always act for the love of our people and the earth. We must not react out of hatred against those who have no sense."
> -- John Trudell
Re: No space left on device exception
Posted by Aaron Davidson <il...@gmail.com>.
Thanks for bringing this up. 100% inode utilization is an issue I
haven't seen raised before, and it raises another issue that is not yet
on our current roadmap for state cleanup (cleaning up data which was not
fully cleaned up from a crashed process).
On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevski <
ognen@plainvanillagames.com> wrote:
> Bleh, strike that, one of my slaves was at 100% inode utilization on the
> file system. It was /tmp/spark* leftovers that apparently did not get
> cleaned up properly after failed or interrupted jobs.
> Mental note - run a cron job on all slaves and master to clean up
> /tmp/spark* regularly.
>
> Thanks (and sorry for the noise)!
> Ognen
>
>
> On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:
>
> Aaron, thanks for replying. I am very much puzzled as to what is going on.
> A job that used to run on the same cluster is failing with this mysterious
> message about not having enough disk space when in fact I can see through
> "watch df -h" that the free space is always hovering around 3+GB on the
> disk and the free inodes are at 50% (this is on master). I went through
> each slave and the spark/work/app*/stderr and stdout and spark/logs/*out
> files and no mention of too many open files failures on any of the slaves
> nor on the master :(
>
> Thanks
> Ognen
>
> On 3/23/14, 8:38 PM, Aaron Davidson wrote:
>
> By default, with P partitions (for both the pre-shuffle stage and
> post-shuffle), there are P^2 files created.
> With spark.shuffle.consolidateFiles turned on, we would instead create only
> P files. Disk space consumption is largely unaffected, however, by the
> number of partitions unless each partition is particularly small.
>
> You might look at the actual executors' logs, as it's possible that this
> error was caused by an earlier exception, such as "too many open files".
>
>
> On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski <
> ognen@plainvanillagames.com> wrote:
>
>> On 3/23/14, 5:49 PM, Matei Zaharia wrote:
>>
>> You can set spark.local.dir to put this data somewhere other than /tmp if
>> /tmp is full. Actually it's recommended to have multiple local disks and
>> set it to a comma-separated list of directories, one per disk.
>>
>> Matei, does the number of tasks/partitions in a transformation influence
>> something in terms of disk space consumption? Or inode consumption?
>>
>> Thanks,
>> Ognen
>>
>
>
> --
> "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable"
> -- Leslie Lamport
>
>
Re: No space left on device exception
Posted by Ognen Duzlevski <og...@plainvanillagames.com>.
Bleh, strike that, one of my slaves was at 100% inode utilization on the
file system. It was /tmp/spark* leftovers that apparently did not get
cleaned up properly after failed or interrupted jobs.
Mental note - run a cron job on all slaves and master to clean up
/tmp/spark* regularly.
Thanks (and sorry for the noise)!
Ognen
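[The cleanup job noted above could look something like this - a sketch only; the 04:00 schedule and the one-day age threshold are assumptions, chosen so that the scratch directories of a still-running job are left alone:]

```shell
# Illustrative crontab entry (install with `crontab -e` on every node):
# at 04:00 daily, remove Spark scratch dirs directly under /tmp that
# have not been modified for more than a day.
0 4 * * * find /tmp -maxdepth 1 -name 'spark*' -mtime +1 -exec rm -rf {} +
```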
On 3/23/14, 9:52 PM, Ognen Duzlevski wrote:
> Aaron, thanks for replying. I am very much puzzled as to what is going
> on. A job that used to run on the same cluster is failing with this
> mysterious message about not having enough disk space when in fact I
> can see through "watch df -h" that the free space is always hovering
> around 3+GB on the disk and the free inodes are at 50% (this is on
> master). I went through each slave and the spark/work/app*/stderr and
> stdout and spark/logs/*out files and no mention of too many open files
> failures on any of the slaves nor on the master :(
>
> Thanks
> Ognen
>
> On 3/23/14, 8:38 PM, Aaron Davidson wrote:
>> By default, with P partitions (for both the pre-shuffle stage and
>> post-shuffle), there are P^2 files created.
>> With spark.shuffle.consolidateFiles turned on, we would instead
>> create only P files. Disk space consumption is largely unaffected,
>> however, by the number of partitions unless each partition is
>> particularly small.
>>
>> You might look at the actual executors' logs, as it's possible that
>> this error was caused by an earlier exception, such as "too many open
>> files".
>>
>>
>> On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski
>> <ognen@plainvanillagames.com> wrote:
>>
>> On 3/23/14, 5:49 PM, Matei Zaharia wrote:
>>> You can set spark.local.dir to put this data somewhere other
>>> than /tmp if /tmp is full. Actually it's recommended to have
>>> multiple local disks and set it to a comma-separated list of
>>> directories, one per disk.
>> Matei, does the number of tasks/partitions in a transformation
>> influence something in terms of disk space consumption? Or inode
>> consumption?
>>
>> Thanks,
>> Ognen
>>
>
--
"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable"
-- Leslie Lamport
Re: No space left on device exception
Posted by Ognen Duzlevski <og...@plainvanillagames.com>.
Aaron, thanks for replying. I am very much puzzled as to what is going
on. A job that used to run on the same cluster is failing with this
mysterious message about not having enough disk space when in fact I can
see through "watch df -h" that the free space is always hovering around
3+GB on the disk and the free inodes are at 50% (this is on master). I
went through each slave and the spark/work/app*/stderr and stdout and
spark/logs/*out files and no mention of too many open files failures on
any of the slaves nor on the master :(
Thanks
Ognen
On 3/23/14, 8:38 PM, Aaron Davidson wrote:
> By default, with P partitions (for both the pre-shuffle stage and
> post-shuffle), there are P^2 files created.
> With spark.shuffle.consolidateFiles turned on, we would instead create
> only P files. Disk space consumption is largely unaffected, however,
> by the number of partitions unless each partition is particularly small.
>
> You might look at the actual executors' logs, as it's possible that
> this error was caused by an earlier exception, such as "too many open
> files".
>
>
> On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski
> <ognen@plainvanillagames.com> wrote:
>
> On 3/23/14, 5:49 PM, Matei Zaharia wrote:
>> You can set spark.local.dir to put this data somewhere other than
>> /tmp if /tmp is full. Actually it's recommended to have multiple
>> local disks and set it to a comma-separated list of directories,
>> one per disk.
> Matei, does the number of tasks/partitions in a transformation
> influence something in terms of disk space consumption? Or inode
> consumption?
>
> Thanks,
> Ognen
>
Re: No space left on device exception
Posted by Aaron Davidson <il...@gmail.com>.
By default, with P partitions (for both the pre-shuffle stage and
post-shuffle), there are P^2 files created.
With spark.shuffle.consolidateFiles turned on, we would instead create only
P files. Disk space consumption is largely unaffected, however, by the
number of partitions unless each partition is particularly small.
You might look at the actual executors' logs, as it's possible that this
error was caused by an earlier exception, such as "too many open files".
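[For reference, in the Spark 0.9 timeframe this flag was typically passed as a Java system property; a sketch assuming a standalone deployment configured through conf/spark-env.sh:]

```shell
# conf/spark-env.sh (Spark 0.9 era): enable shuffle file consolidation
# by passing the setting as a Java system property to the executors.
SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.shuffle.consolidateFiles=true"
```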
On Sun, Mar 23, 2014 at 4:46 PM, Ognen Duzlevski <
ognen@plainvanillagames.com> wrote:
> On 3/23/14, 5:49 PM, Matei Zaharia wrote:
>
> You can set spark.local.dir to put this data somewhere other than /tmp if
> /tmp is full. Actually it's recommended to have multiple local disks and
> set it to a comma-separated list of directories, one per disk.
>
> Matei, does the number of tasks/partitions in a transformation influence
> something in terms of disk space consumption? Or inode consumption?
>
> Thanks,
> Ognen
>
Re: No space left on device exception
Posted by Ognen Duzlevski <og...@plainvanillagames.com>.
On 3/23/14, 5:49 PM, Matei Zaharia wrote:
> You can set spark.local.dir to put this data somewhere other than /tmp
> if /tmp is full. Actually it’s recommended to have multiple local
> disks and set it to a comma-separated list of directories, one per disk.
Matei, does the number of tasks/partitions in a transformation influence
something in terms of disk space consumption? Or inode consumption?
Thanks,
Ognen
Re: No space left on device exception
Posted by Matei Zaharia <ma...@gmail.com>.
You can set spark.local.dir to put this data somewhere other than /tmp if /tmp is full. Actually it’s recommended to have multiple local disks and set it to a comma-separated list of directories, one per disk.
Matei
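[A sketch of that setting for a 0.9-style standalone cluster; the mount points /mnt/spark and /mnt2/spark are assumptions - substitute whatever local disks your nodes actually have:]

```shell
# conf/spark-env.sh: point Spark's shuffle/spill scratch space at real
# local disks (one directory per physical disk) instead of /tmp on the
# root filesystem.
SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.local.dir=/mnt/spark,/mnt2/spark"
```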
On Mar 23, 2014, at 3:35 PM, Aaron Davidson <il...@gmail.com> wrote:
> On some systems, /tmp/ is an in-memory tmpfs file system, with its own size limit. It's possible that this limit has been exceeded. You might try running the "df" command to check the free space of "/tmp" or root if tmp isn't listed.
>
> 3 GB also seems pretty low for the remaining free space of a disk. If your disk size is in the TB range, it's possible that the last couple GB have issues when being allocated due to fragmentation or reclamation policies.
>
>
> On Sun, Mar 23, 2014 at 3:06 PM, Ognen Duzlevski <og...@nengoiksvelzud.com> wrote:
> Hello,
>
> I have a weird error showing up when I run a job on my Spark cluster. The version of spark is 0.9 and I have 3+ GB free on the disk when this error shows up. Any ideas what I should be looking for?
>
> [error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task 167.0:3 failed 4 times (most recent failure: Exception failure: java.io.FileNotFoundException: /tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127 (No space left on device))
> org.apache.spark.SparkException: Job aborted: Task 167.0:3 failed 4 times (most recent failure: Exception failure: java.io.FileNotFoundException: /tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127 (No space left on device))
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
>
> Thanks!
> Ognen
>
Re: No space left on device exception
Posted by Ognen Duzlevski <og...@plainvanillagames.com>.
On 3/23/14, 5:35 PM, Aaron Davidson wrote:
> On some systems, /tmp/ is an in-memory tmpfs file system, with its own
> size limit. It's possible that this limit has been exceeded. You might
> try running the "df" command to check the free space of "/tmp" or root
> if tmp isn't listed.
>
> 3 GB also seems pretty low for the remaining free space of a disk. If
> your disk size is in the TB range, it's possible that the last couple
> GB have issues when being allocated due to fragmentation or
> reclamation policies.
>
Aaron, thanks for the reply. These are Amazon Ubuntu instances with an
8GB root filesystem (everything, OS + Spark, takes up about 4.5GB).
/tmp appears to be just a regular directory in / - hence it shares the
same 3.5GB of free space. I was "watch df"ing the space while my job
was running.
Ognen
Re: No space left on device exception
Posted by Aaron Davidson <il...@gmail.com>.
On some systems, /tmp/ is an in-memory tmpfs file system, with its own size
limit. It's possible that this limit has been exceeded. You might try
running the "df" command to check the free space of "/tmp" or root if tmp
isn't listed.
3 GB also seems pretty low for the remaining free space of a disk. If your
disk size is in the TB range, it's possible that the last couple GB have
issues when being allocated due to fragmentation or reclamation policies.
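[Concretely, both checks are worth running, since ENOSPC is raised for exhausted blocks or exhausted inodes:]

```shell
# Byte-level free space of the filesystem holding /tmp (also shows
# whether /tmp is its own tmpfs mount or lives on the root filesystem).
df -h /tmp

# Inode usage of the same filesystem -- 100% IUse% produces
# "No space left on device" even with gigabytes of free blocks.
df -i /tmp
```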
On Sun, Mar 23, 2014 at 3:06 PM, Ognen Duzlevski
<og...@nengoiksvelzud.com>wrote:
> Hello,
>
> I have a weird error showing up when I run a job on my Spark cluster. The
> version of spark is 0.9 and I have 3+ GB free on the disk when this error
> shows up. Any ideas what I should be looking for?
>
> [error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task
> 167.0:3 failed 4 times (most recent failure: Exception failure:
> java.io.FileNotFoundException: /tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127
> (No space left on device))
> org.apache.spark.SparkException: Job aborted: Task 167.0:3 failed 4 times
> (most recent failure: Exception failure: java.io.FileNotFoundException:
> /tmp/spark-local-20140323214638-72df/31/shuffle_31_3_127 (No space left
> on device))
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:619)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:619)
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:207)
>
> Thanks!
> Ognen
>