Posted to user@spark.apache.org by Chris DuBois <ch...@gmail.com> on 2014/07/16 08:35:54 UTC

Error: No space left on device

Hi all,

I am encountering the following error:

INFO scheduler.TaskSetManager: Loss was due to java.io.IOException: No
space left on device [duplicate 4]

For each slave, df -h looks roughly like this, which makes the above error
surprising.

Filesystem            Size  Used Avail Use% Mounted on
/dev/xvda1            7.9G  4.4G  3.5G  57% /
tmpfs                 7.4G  4.0K  7.4G   1% /dev/shm
/dev/xvdb              37G  3.3G   32G  10% /mnt
/dev/xvdf              37G  2.0G   34G   6% /mnt2
/dev/xvdv             500G   33M  500G   1% /vol

I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched using the
spark-ec2 scripts and a clone of spark from today. The job I am running
closely resembles the collaborative filtering example
<https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html>.
This issue happens with the 1M version as well as the 10 million rating
version of the MovieLens dataset.

I have seen previous
<http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3C532F5AEC.8060805@nengoiksvelzud.com%3E>
 questions
<https://groups.google.com/forum/#!msg/spark-users/Axx4optAj-E/q5lWMv-ZqnwJ>,
but they haven't helped yet. For example, I tried setting the Spark tmp
directory to the EBS volume at /vol/, both by editing the spark conf file
(and copy-dir'ing it to the slaves) as well as through the SparkConf. Yet I
still get the above error. Here is my current Spark config below. Note that
I'm launching via ~/spark/bin/spark-submit.

conf = SparkConf()
conf.setAppName("RecommendALS") \
    .set("spark.local.dir", "/vol/") \
    .set("spark.executor.memory", "7g") \
    .set("spark.akka.frameSize", "100") \
    .setExecutorEnv("SPARK_JAVA_OPTS", " -Dspark.akka.frameSize=100")
sc = SparkContext(conf=conf)

Thanks for any advice,
Chris

Re: Error: No space left on device

Posted by Chris DuBois <ch...@gmail.com>.
Hi Xiangrui,

Thanks. I have taken your advice and set all 5 of my slaves to be
c3.4xlarge. In this case /mnt and /mnt2 have plenty of space by default. I
now do sc.textFile(blah).repartition(N).map(...).cache() with N=80,
spark.executor.memory set to 20g, and --driver-memory 20g. So far things
seem more stable.
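
For reference, a minimal PySpark sketch of that pattern (the path, the
parsing function, and N=80 are placeholders, not the exact code from this
job):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("RecommendALS").set("spark.executor.memory", "20g")
sc = SparkContext(conf=conf)

N = 80  # roughly the total number of cores across the slaves

# Read the raw ratings, spread them over N partitions so every core gets work,
# parse each line, and cache the result so ALS iterations reuse it from memory.
ratings = (sc.textFile("/path/to/ratings.dat")          # placeholder path
             .repartition(N)
             .map(lambda line: line.split("::"))        # placeholder parser
             .cache())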

Thanks for the help,
Chris


On Thu, Jul 17, 2014 at 12:56 AM, Xiangrui Meng <me...@gmail.com> wrote:

> Set N be the total number of cores on the cluster or less. sc.textFile
> doesn't always give you that number, depends on the block size. For
> MovieLens, I think the default behavior should be 2~3 partitions. You
> need to call repartition to ensure the right number of partitions.
>
> Which EC2 instance type did you use? I usually use m3.2xlarge or c?
> instances that come with SSD and 1G or 10G network. For those
> instances, you should see local drives mounted at /mnt, /mnt2, /mnt3,
> ... Make sure there was no error when you used the ec2 script to
> launch the cluster.
>
> It is a little strange to see 94% of / was used on a slave. Maybe
> shuffle data went to /. I'm not sure which settings went wrong. I
> recommend trying re-launching a cluster with m3.2xlarge instances and
> using the default settings (don't set anything in SparkConf). Submit
> the application with --driver-memory 20g.
>
> The running times are slower than what I remember, but it depends on
> the instance type.
>
> Best,
> Xiangrui
>
>
>
> On Wed, Jul 16, 2014 at 10:18 PM, Chris DuBois <ch...@gmail.com>
> wrote:
> > Hi Xiangrui,
> >
> > I will try this shortly. When using N partitions, do you recommend N be
> the
> > number of cores on each slave or the number of cores on the master?
> Forgive
> > my ignorance, but is this best achieved as an argument to sc.textFile?
> >
> > The slaves on the EC2 clusters start with only 8gb of storage, and it
> > doesn't seem that /mnt/spark and /mnt2/spark are mounted anywhere else by
> > default. Looking at spark-ec2/setup-slaves.sh, it appears that these are
> > only mounted if the instance type begins with r3. (Or am I not reading
> that
> > right?) My slaves are a different instance type, and currently look like
> > this:
> > Filesystem            Size  Used Avail Use% Mounted on
> > /dev/xvda1            7.9G  7.3G  515M  94% /
> > tmpfs                 7.4G  4.0K  7.4G   1% /dev/shm
> > /dev/xvdv             500G  2.5G  498G   1% /vol
> >
> > I have been able to finish ALS on MovieLens 10M only twice, taking 221s
> and
> > 315s for 20 iterations at K=20 and lambda=0.01. Does that timing sound
> about
> > right, or does it point to a poor configuration? The same script with
> > MovieLens 1M runs fine in about 30-40s with the same settings. (In both
> > cases I'm training on 70% of the data.)
> >
> > Thanks for your help!
> > Chris
> >
> >
> > On Wed, Jul 16, 2014 at 4:29 PM, Xiangrui Meng <me...@gmail.com> wrote:
> >>
> >> For ALS, I would recommend repartitioning the ratings to match the
> >> number of CPU cores or even less. ALS is not computation heavy for
> >> small k but communication heavy. Having small number of partitions may
> >> help. For EC2 clusters, we use /mnt/spark and /mnt2/spark as the
> >> default local directory because they are local hard drives. Did your
> >> last run of ALS on MovieLens 10M-100K with the default settings
> >> succeed? -Xiangrui
> >>
> >> On Wed, Jul 16, 2014 at 8:00 AM, Chris DuBois <ch...@gmail.com>
> >> wrote:
> >> > Hi Xiangrui,
> >> >
> >> > I accidentally did not send df -i for the master node. Here it is at
> the
> >> > moment of failure:
> >> >
> >> > Filesystem            Inodes   IUsed   IFree IUse% Mounted on
> >> > /dev/xvda1            524288  280938  243350   54% /
> >> > tmpfs                3845409       1 3845408    1% /dev/shm
> >> > /dev/xvdb            10002432    1027 10001405    1% /mnt
> >> > /dev/xvdf            10002432      16 10002416    1% /mnt2
> >> > /dev/xvdv            524288000      13 524287987    1% /vol
> >> >
> >> > I am using default settings now, but is there a way to make sure that
> >> > the
> >> > proper directories are being used? How many blocks/partitions do you
> >> > recommend?
> >> >
> >> > Chris
> >> >
> >> >
> >> > On Wed, Jul 16, 2014 at 1:09 AM, Chris DuBois <chris.dubois@gmail.com
> >
> >> > wrote:
> >> >>
> >> >> Hi Xiangrui,
> >> >>
> >> >> Here is the result on the master node:
> >> >> $ df -i
> >> >> Filesystem            Inodes   IUsed   IFree IUse% Mounted on
> >> >> /dev/xvda1            524288  273997  250291   53% /
> >> >> tmpfs                1917974       1 1917973    1% /dev/shm
> >> >> /dev/xvdv            524288000      30 524287970    1% /vol
> >> >>
> >> >> I have reproduced the error while using the MovieLens 10M data set
> on a
> >> >> newly created cluster.
> >> >>
> >> >> Thanks for the help.
> >> >> Chris
> >> >>
> >> >>
> >> >> On Wed, Jul 16, 2014 at 12:22 AM, Xiangrui Meng <me...@gmail.com>
> >> >> wrote:
> >> >>>
> >> >>> Hi Chris,
> >> >>>
> >> >>> Could you also try `df -i` on the master node? How many
> >> >>> blocks/partitions did you set?
> >> >>>
> >> >>> In the current implementation, ALS doesn't clean the shuffle data
> >> >>> because the operations are chained together. But it shouldn't run
> out
> >> >>> of disk space on the MovieLens dataset, which is small. spark-ec2
> >> >>> script sets /mnt/spark and /mnt/spark2 as the local.dir by default,
> I
> >> >>> would recommend leaving this setting as the default value.
> >> >>>
> >> >>> Best,
> >> >>> Xiangrui
> >> >>>
> >> >>> On Wed, Jul 16, 2014 at 12:02 AM, Chris DuBois
> >> >>> <ch...@gmail.com>
> >> >>> wrote:
> >> >>> > Thanks for the quick responses!
> >> >>> >
> >> >>> > I used your final -Dspark.local.dir suggestion, but I see this
> >> >>> > during
> >> >>> > the
> >> >>> > initialization of the application:
> >> >>> >
> >> >>> > 14/07/16 06:56:08 INFO storage.DiskBlockManager: Created local
> >> >>> > directory at
> >> >>> > /vol/spark-local-20140716065608-7b2a
> >> >>> >
> >> >>> > I would have expected something in /mnt/spark/.
> >> >>> >
> >> >>> > Thanks,
> >> >>> > Chris
> >> >>> >
> >> >>> >
> >> >>> >
> >> >>> > On Tue, Jul 15, 2014 at 11:44 PM, Chris Gore <cd...@cdgore.com>
> >> >>> > wrote:
> >> >>> >>
> >> >>> >> Hi Chris,
> >> >>> >>
> >> >>> >> I've encountered this error when running Spark’s ALS methods too.
> >> >>> >> In
> >> >>> >> my
> >> >>> >> case, it was because I set spark.local.dir improperly, and every
> >> >>> >> time
> >> >>> >> there
> >> >>> >> was a shuffle, it would spill many GB of data onto the local
> drive.
> >> >>> >> What
> >> >>> >> fixed it was setting it to use the /mnt directory, where a
> network
> >> >>> >> drive is
> >> >>> >> mounted.  For example, setting an environmental variable:
> >> >>> >>
> >> >>> >> export SPACE=$(mount | grep mnt | awk '{print $3"/spark/"}' |
> xargs
> >> >>> >> |
> >> >>> >> sed
> >> >>> >> 's/ /,/g’)
> >> >>> >>
> >> >>> >> Then adding -Dspark.local.dir=$SPACE or simply
> >> >>> >> -Dspark.local.dir=/mnt/spark/,/mnt2/spark/ when you run your
> driver
> >> >>> >> application
> >> >>> >>
> >> >>> >> Chris
> >> >>> >>
> >> >>> >> On Jul 15, 2014, at 11:39 PM, Xiangrui Meng <me...@gmail.com>
> >> >>> >> wrote:
> >> >>> >>
> >> >>> >> > Check the number of inodes (df -i). The assembly build may
> create
> >> >>> >> > many
> >> >>> >> > small files. -Xiangrui
> >> >>> >> >
> >> >>> >> > On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois
> >> >>> >> > <ch...@gmail.com>
> >> >>> >> > wrote:
> >> >>> >> >> Hi all,
> >> >>> >> >>
> >> >>> >> >> I am encountering the following error:
> >> >>> >> >>
> >> >>> >> >> INFO scheduler.TaskSetManager: Loss was due to
> >> >>> >> >> java.io.IOException:
> >> >>> >> >> No
> >> >>> >> >> space
> >> >>> >> >> left on device [duplicate 4]
> >> >>> >> >>
> >> >>> >> >> For each slave, df -h looks roughtly like this, which makes
> the
> >> >>> >> >> above
> >> >>> >> >> error
> >> >>> >> >> surprising.
> >> >>> >> >>
> >> >>> >> >> Filesystem            Size  Used Avail Use% Mounted on
> >> >>> >> >> /dev/xvda1            7.9G  4.4G  3.5G  57% /
> >> >>> >> >> tmpfs                 7.4G  4.0K  7.4G   1% /dev/shm
> >> >>> >> >> /dev/xvdb              37G  3.3G   32G  10% /mnt
> >> >>> >> >> /dev/xvdf              37G  2.0G   34G   6% /mnt2
> >> >>> >> >> /dev/xvdv             500G   33M  500G   1% /vol
> >> >>> >> >>
> >> >>> >> >> I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched
> using
> >> >>> >> >> the
> >> >>> >> >> spark-ec2 scripts and a clone of spark from today. The job I
> am
> >> >>> >> >> running
> >> >>> >> >> closely resembles the collaborative filtering example. This
> >> >>> >> >> issue
> >> >>> >> >> happens
> >> >>> >> >> with the 1M version as well as the 10 million rating version
> of
> >> >>> >> >> the
> >> >>> >> >> MovieLens dataset.
> >> >>> >> >>
> >> >>> >> >> I have seen previous questions, but they haven't helped yet.
> For
> >> >>> >> >> example, I
> >> >>> >> >> tried setting the Spark tmp directory to the EBS volume at
> >> >>> >> >> /vol/,
> >> >>> >> >> both
> >> >>> >> >> by
> >> >>> >> >> editing the spark conf file (and copy-dir'ing it to the
> slaves)
> >> >>> >> >> as
> >> >>> >> >> well
> >> >>> >> >> as
> >> >>> >> >> through the SparkConf. Yet I still get the above error. Here
> is
> >> >>> >> >> my
> >> >>> >> >> current
> >> >>> >> >> Spark config below. Note that I'm launching via
> >> >>> >> >> ~/spark/bin/spark-submit.
> >> >>> >> >>
> >> >>> >> >> conf = SparkConf()
> >> >>> >> >> conf.setAppName("RecommendALS").set("spark.local.dir",
> >> >>> >> >> "/vol/").set("spark.executor.memory",
> >> >>> >> >> "7g").set("spark.akka.frameSize",
> >> >>> >> >> "100").setExecutorEnv("SPARK_JAVA_OPTS", "
> >> >>> >> >> -Dspark.akka.frameSize=100")
> >> >>> >> >> sc = SparkContext(conf=conf)
> >> >>> >> >>
> >> >>> >> >> Thanks for any advice,
> >> >>> >> >> Chris
> >> >>> >> >>
> >> >>> >>
> >> >>> >
> >> >>
> >> >>
> >> >
> >
> >
>

Re: Error: No space left on device

Posted by Chris DuBois <ch...@gmail.com>.
Things were going smoothly... until I hit the following:

py4j.protocol.Py4JJavaError: An error occurred while calling o1282.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Master
removed our application: FAILED

Any ideas why this might occur? This is while running ALS on a subset of
Netflix.

It seems like Spark is writing a lot of data to disk on each slave. Each
slave looks roughly like this:

Filesystem            Size  Used Avail Use% Mounted on
/dev/xvda1            7.9G  4.4G  3.4G  57% /
tmpfs                  15G   12K   15G   1% /dev/shm
/dev/xvdb             151G   25G  118G  18% /mnt
/dev/xvdf             151G   24G  120G  17% /mnt2
/dev/xvdv             500G  1.8G  498G   1% /vol
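
Since ALS does not clean up shuffle files between iterations (per Xiangrui's
note earlier in the thread), a larger run like Netflix can accumulate a lot on
disk. A minimal sketch, assuming one wanted to spread the local directories
across all of the large volumes shown above (this is my own assumption, not
advice from the thread):

from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.setAppName("RecommendALS")
# spark.local.dir accepts a comma-separated list; shuffle and spill files are
# then spread across these drives instead of the small root volume.
conf.set("spark.local.dir", "/mnt/spark,/mnt2/spark,/vol/spark")
sc = SparkContext(conf=conf)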


On Thu, Jul 17, 2014 at 1:27 PM, Bill Jay <bi...@gmail.com>
wrote:

> Hi,
>
> I also have some issues with repartition. In my program, I consume data
> from Kafka. After I consume data, I use repartition(N). However, although I
> set N to be 120, there are around 18 executors allocated for my reduce
> stage. I am not sure how the repartition command works ton ensure the
> parallelism.
>
> Bill
>
>
> On Thu, Jul 17, 2014 at 12:56 AM, Xiangrui Meng <me...@gmail.com> wrote:
>
>> Set N be the total number of cores on the cluster or less. sc.textFile
>> doesn't always give you that number, depends on the block size. For
>> MovieLens, I think the default behavior should be 2~3 partitions. You
>> need to call repartition to ensure the right number of partitions.
>>
>> Which EC2 instance type did you use? I usually use m3.2xlarge or c?
>> instances that come with SSD and 1G or 10G network. For those
>> instances, you should see local drives mounted at /mnt, /mnt2, /mnt3,
>> ... Make sure there was no error when you used the ec2 script to
>> launch the cluster.
>>
>> It is a little strange to see 94% of / was used on a slave. Maybe
>> shuffle data went to /. I'm not sure which settings went wrong. I
>> recommend trying re-launching a cluster with m3.2xlarge instances and
>> using the default settings (don't set anything in SparkConf). Submit
>> the application with --driver-memory 20g.
>>
>> The running times are slower than what I remember, but it depends on
>> the instance type.
>>
>> Best,
>> Xiangrui
>>
>>
>>
>> On Wed, Jul 16, 2014 at 10:18 PM, Chris DuBois <ch...@gmail.com>
>> wrote:
>> > Hi Xiangrui,
>> >
>> > I will try this shortly. When using N partitions, do you recommend N be
>> the
>> > number of cores on each slave or the number of cores on the master?
>> Forgive
>> > my ignorance, but is this best achieved as an argument to sc.textFile?
>> >
>> > The slaves on the EC2 clusters start with only 8gb of storage, and it
>> > doesn't seem that /mnt/spark and /mnt2/spark are mounted anywhere else
>> by
>> > default. Looking at spark-ec2/setup-slaves.sh, it appears that these are
>> > only mounted if the instance type begins with r3. (Or am I not reading
>> that
>> > right?) My slaves are a different instance type, and currently look like
>> > this:
>> > Filesystem            Size  Used Avail Use% Mounted on
>> > /dev/xvda1            7.9G  7.3G  515M  94% /
>> > tmpfs                 7.4G  4.0K  7.4G   1% /dev/shm
>> > /dev/xvdv             500G  2.5G  498G   1% /vol
>> >
>> > I have been able to finish ALS on MovieLens 10M only twice, taking 221s
>> and
>> > 315s for 20 iterations at K=20 and lambda=0.01. Does that timing sound
>> about
>> > right, or does it point to a poor configuration? The same script with
>> > MovieLens 1M runs fine in about 30-40s with the same settings. (In both
>> > cases I'm training on 70% of the data.)
>> >
>> > Thanks for your help!
>> > Chris
>> >
>> >
>> > On Wed, Jul 16, 2014 at 4:29 PM, Xiangrui Meng <me...@gmail.com>
>> wrote:
>> >>
>> >> For ALS, I would recommend repartitioning the ratings to match the
>> >> number of CPU cores or even less. ALS is not computation heavy for
>> >> small k but communication heavy. Having small number of partitions may
>> >> help. For EC2 clusters, we use /mnt/spark and /mnt2/spark as the
>> >> default local directory because they are local hard drives. Did your
>> >> last run of ALS on MovieLens 10M-100K with the default settings
>> >> succeed? -Xiangrui
>> >>
>> >> On Wed, Jul 16, 2014 at 8:00 AM, Chris DuBois <ch...@gmail.com>
>> >> wrote:
>> >> > Hi Xiangrui,
>> >> >
>> >> > I accidentally did not send df -i for the master node. Here it is at
>> the
>> >> > moment of failure:
>> >> >
>> >> > Filesystem            Inodes   IUsed   IFree IUse% Mounted on
>> >> > /dev/xvda1            524288  280938  243350   54% /
>> >> > tmpfs                3845409       1 3845408    1% /dev/shm
>> >> > /dev/xvdb            10002432    1027 10001405    1% /mnt
>> >> > /dev/xvdf            10002432      16 10002416    1% /mnt2
>> >> > /dev/xvdv            524288000      13 524287987    1% /vol
>> >> >
>> >> > I am using default settings now, but is there a way to make sure that
>> >> > the
>> >> > proper directories are being used? How many blocks/partitions do you
>> >> > recommend?
>> >> >
>> >> > Chris
>> >> >
>> >> >
>> >> > On Wed, Jul 16, 2014 at 1:09 AM, Chris DuBois <
>> chris.dubois@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Hi Xiangrui,
>> >> >>
>> >> >> Here is the result on the master node:
>> >> >> $ df -i
>> >> >> Filesystem            Inodes   IUsed   IFree IUse% Mounted on
>> >> >> /dev/xvda1            524288  273997  250291   53% /
>> >> >> tmpfs                1917974       1 1917973    1% /dev/shm
>> >> >> /dev/xvdv            524288000      30 524287970    1% /vol
>> >> >>
>> >> >> I have reproduced the error while using the MovieLens 10M data set
>> on a
>> >> >> newly created cluster.
>> >> >>
>> >> >> Thanks for the help.
>> >> >> Chris
>> >> >>
>> >> >>
>> >> >> On Wed, Jul 16, 2014 at 12:22 AM, Xiangrui Meng <me...@gmail.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> Hi Chris,
>> >> >>>
>> >> >>> Could you also try `df -i` on the master node? How many
>> >> >>> blocks/partitions did you set?
>> >> >>>
>> >> >>> In the current implementation, ALS doesn't clean the shuffle data
>> >> >>> because the operations are chained together. But it shouldn't run
>> out
>> >> >>> of disk space on the MovieLens dataset, which is small. spark-ec2
>> >> >>> script sets /mnt/spark and /mnt/spark2 as the local.dir by
>> default, I
>> >> >>> would recommend leaving this setting as the default value.
>> >> >>>
>> >> >>> Best,
>> >> >>> Xiangrui
>> >> >>>
>> >> >>> On Wed, Jul 16, 2014 at 12:02 AM, Chris DuBois
>> >> >>> <ch...@gmail.com>
>> >> >>> wrote:
>> >> >>> > Thanks for the quick responses!
>> >> >>> >
>> >> >>> > I used your final -Dspark.local.dir suggestion, but I see this
>> >> >>> > during
>> >> >>> > the
>> >> >>> > initialization of the application:
>> >> >>> >
>> >> >>> > 14/07/16 06:56:08 INFO storage.DiskBlockManager: Created local
>> >> >>> > directory at
>> >> >>> > /vol/spark-local-20140716065608-7b2a
>> >> >>> >
>> >> >>> > I would have expected something in /mnt/spark/.
>> >> >>> >
>> >> >>> > Thanks,
>> >> >>> > Chris
>> >> >>> >
>> >> >>> >
>> >> >>> >
>> >> >>> > On Tue, Jul 15, 2014 at 11:44 PM, Chris Gore <cd...@cdgore.com>
>> >> >>> > wrote:
>> >> >>> >>
>> >> >>> >> Hi Chris,
>> >> >>> >>
>> >> >>> >> I've encountered this error when running Spark’s ALS methods
>> too.
>> >> >>> >> In
>> >> >>> >> my
>> >> >>> >> case, it was because I set spark.local.dir improperly, and every
>> >> >>> >> time
>> >> >>> >> there
>> >> >>> >> was a shuffle, it would spill many GB of data onto the local
>> drive.
>> >> >>> >> What
>> >> >>> >> fixed it was setting it to use the /mnt directory, where a
>> network
>> >> >>> >> drive is
>> >> >>> >> mounted.  For example, setting an environmental variable:
>> >> >>> >>
>> >> >>> >> export SPACE=$(mount | grep mnt | awk '{print $3"/spark/"}' |
>> xargs
>> >> >>> >> |
>> >> >>> >> sed
>> >> >>> >> 's/ /,/g’)
>> >> >>> >>
>> >> >>> >> Then adding -Dspark.local.dir=$SPACE or simply
>> >> >>> >> -Dspark.local.dir=/mnt/spark/,/mnt2/spark/ when you run your
>> driver
>> >> >>> >> application
>> >> >>> >>
>> >> >>> >> Chris
>> >> >>> >>
>> >> >>> >> On Jul 15, 2014, at 11:39 PM, Xiangrui Meng <me...@gmail.com>
>> >> >>> >> wrote:
>> >> >>> >>
>> >> >>> >> > Check the number of inodes (df -i). The assembly build may
>> create
>> >> >>> >> > many
>> >> >>> >> > small files. -Xiangrui
>> >> >>> >> >
>> >> >>> >> > On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois
>> >> >>> >> > <ch...@gmail.com>
>> >> >>> >> > wrote:
>> >> >>> >> >> Hi all,
>> >> >>> >> >>
>> >> >>> >> >> I am encountering the following error:
>> >> >>> >> >>
>> >> >>> >> >> INFO scheduler.TaskSetManager: Loss was due to
>> >> >>> >> >> java.io.IOException:
>> >> >>> >> >> No
>> >> >>> >> >> space
>> >> >>> >> >> left on device [duplicate 4]
>> >> >>> >> >>
>> >> >>> >> >> For each slave, df -h looks roughtly like this, which makes
>> the
>> >> >>> >> >> above
>> >> >>> >> >> error
>> >> >>> >> >> surprising.
>> >> >>> >> >>
>> >> >>> >> >> Filesystem            Size  Used Avail Use% Mounted on
>> >> >>> >> >> /dev/xvda1            7.9G  4.4G  3.5G  57% /
>> >> >>> >> >> tmpfs                 7.4G  4.0K  7.4G   1% /dev/shm
>> >> >>> >> >> /dev/xvdb              37G  3.3G   32G  10% /mnt
>> >> >>> >> >> /dev/xvdf              37G  2.0G   34G   6% /mnt2
>> >> >>> >> >> /dev/xvdv             500G   33M  500G   1% /vol
>> >> >>> >> >>
>> >> >>> >> >> I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched
>> using
>> >> >>> >> >> the
>> >> >>> >> >> spark-ec2 scripts and a clone of spark from today. The job I
>> am
>> >> >>> >> >> running
>> >> >>> >> >> closely resembles the collaborative filtering example. This
>> >> >>> >> >> issue
>> >> >>> >> >> happens
>> >> >>> >> >> with the 1M version as well as the 10 million rating version
>> of
>> >> >>> >> >> the
>> >> >>> >> >> MovieLens dataset.
>> >> >>> >> >>
>> >> >>> >> >> I have seen previous questions, but they haven't helped yet.
>> For
>> >> >>> >> >> example, I
>> >> >>> >> >> tried setting the Spark tmp directory to the EBS volume at
>> >> >>> >> >> /vol/,
>> >> >>> >> >> both
>> >> >>> >> >> by
>> >> >>> >> >> editing the spark conf file (and copy-dir'ing it to the
>> slaves)
>> >> >>> >> >> as
>> >> >>> >> >> well
>> >> >>> >> >> as
>> >> >>> >> >> through the SparkConf. Yet I still get the above error. Here
>> is
>> >> >>> >> >> my
>> >> >>> >> >> current
>> >> >>> >> >> Spark config below. Note that I'm launching via
>> >> >>> >> >> ~/spark/bin/spark-submit.
>> >> >>> >> >>
>> >> >>> >> >> conf = SparkConf()
>> >> >>> >> >> conf.setAppName("RecommendALS").set("spark.local.dir",
>> >> >>> >> >> "/vol/").set("spark.executor.memory",
>> >> >>> >> >> "7g").set("spark.akka.frameSize",
>> >> >>> >> >> "100").setExecutorEnv("SPARK_JAVA_OPTS", "
>> >> >>> >> >> -Dspark.akka.frameSize=100")
>> >> >>> >> >> sc = SparkContext(conf=conf)
>> >> >>> >> >>
>> >> >>> >> >> Thanks for any advice,
>> >> >>> >> >> Chris
>> >> >>> >> >>
>> >> >>> >>
>> >> >>> >
>> >> >>
>> >> >>
>> >> >
>> >
>> >
>>
>
>

Re: Error: No space left on device

Posted by Bill Jay <bi...@gmail.com>.
Hi,

I also have some issues with repartition. In my program, I consume data
from Kafka. After I consume data, I use repartition(N). However, although I
set N to be 120, there are around 18 executors allocated for my reduce
stage. I am not sure how the repartition command works to ensure the
parallelism.
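
As a side note, the number of partitions and the number of executors are
different things: repartition(120) controls how many tasks the following
stage is split into, while the cluster manager decides how many executors run
those tasks. A quick way to confirm the partition count (the variable names
here are placeholders, not Bill's actual code):

repartitioned = kafka_rdd.repartition(120)   # hypothetical RDD from the Kafka stage
# Should print 120 even if those 120 tasks end up running on ~18 executors.
print(repartitioned.getNumPartitions())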

Bill


On Thu, Jul 17, 2014 at 12:56 AM, Xiangrui Meng <me...@gmail.com> wrote:

> Set N be the total number of cores on the cluster or less. sc.textFile
> doesn't always give you that number, depends on the block size. For
> MovieLens, I think the default behavior should be 2~3 partitions. You
> need to call repartition to ensure the right number of partitions.
>
> Which EC2 instance type did you use? I usually use m3.2xlarge or c?
> instances that come with SSD and 1G or 10G network. For those
> instances, you should see local drives mounted at /mnt, /mnt2, /mnt3,
> ... Make sure there was no error when you used the ec2 script to
> launch the cluster.
>
> It is a little strange to see 94% of / was used on a slave. Maybe
> shuffle data went to /. I'm not sure which settings went wrong. I
> recommend trying re-launching a cluster with m3.2xlarge instances and
> using the default settings (don't set anything in SparkConf). Submit
> the application with --driver-memory 20g.
>
> The running times are slower than what I remember, but it depends on
> the instance type.
>
> Best,
> Xiangrui
>
>
>
> On Wed, Jul 16, 2014 at 10:18 PM, Chris DuBois <ch...@gmail.com>
> wrote:
> > Hi Xiangrui,
> >
> > I will try this shortly. When using N partitions, do you recommend N be
> the
> > number of cores on each slave or the number of cores on the master?
> Forgive
> > my ignorance, but is this best achieved as an argument to sc.textFile?
> >
> > The slaves on the EC2 clusters start with only 8gb of storage, and it
> > doesn't seem that /mnt/spark and /mnt2/spark are mounted anywhere else by
> > default. Looking at spark-ec2/setup-slaves.sh, it appears that these are
> > only mounted if the instance type begins with r3. (Or am I not reading
> that
> > right?) My slaves are a different instance type, and currently look like
> > this:
> > Filesystem            Size  Used Avail Use% Mounted on
> > /dev/xvda1            7.9G  7.3G  515M  94% /
> > tmpfs                 7.4G  4.0K  7.4G   1% /dev/shm
> > /dev/xvdv             500G  2.5G  498G   1% /vol
> >
> > I have been able to finish ALS on MovieLens 10M only twice, taking 221s
> and
> > 315s for 20 iterations at K=20 and lambda=0.01. Does that timing sound
> about
> > right, or does it point to a poor configuration? The same script with
> > MovieLens 1M runs fine in about 30-40s with the same settings. (In both
> > cases I'm training on 70% of the data.)
> >
> > Thanks for your help!
> > Chris
> >
> >
> > On Wed, Jul 16, 2014 at 4:29 PM, Xiangrui Meng <me...@gmail.com> wrote:
> >>
> >> For ALS, I would recommend repartitioning the ratings to match the
> >> number of CPU cores or even less. ALS is not computation heavy for
> >> small k but communication heavy. Having small number of partitions may
> >> help. For EC2 clusters, we use /mnt/spark and /mnt2/spark as the
> >> default local directory because they are local hard drives. Did your
> >> last run of ALS on MovieLens 10M-100K with the default settings
> >> succeed? -Xiangrui
> >>
> >> On Wed, Jul 16, 2014 at 8:00 AM, Chris DuBois <ch...@gmail.com>
> >> wrote:
> >> > Hi Xiangrui,
> >> >
> >> > I accidentally did not send df -i for the master node. Here it is at
> the
> >> > moment of failure:
> >> >
> >> > Filesystem            Inodes   IUsed   IFree IUse% Mounted on
> >> > /dev/xvda1            524288  280938  243350   54% /
> >> > tmpfs                3845409       1 3845408    1% /dev/shm
> >> > /dev/xvdb            10002432    1027 10001405    1% /mnt
> >> > /dev/xvdf            10002432      16 10002416    1% /mnt2
> >> > /dev/xvdv            524288000      13 524287987    1% /vol
> >> >
> >> > I am using default settings now, but is there a way to make sure that
> >> > the
> >> > proper directories are being used? How many blocks/partitions do you
> >> > recommend?
> >> >
> >> > Chris
> >> >
> >> >
> >> > On Wed, Jul 16, 2014 at 1:09 AM, Chris DuBois <chris.dubois@gmail.com
> >
> >> > wrote:
> >> >>
> >> >> Hi Xiangrui,
> >> >>
> >> >> Here is the result on the master node:
> >> >> $ df -i
> >> >> Filesystem            Inodes   IUsed   IFree IUse% Mounted on
> >> >> /dev/xvda1            524288  273997  250291   53% /
> >> >> tmpfs                1917974       1 1917973    1% /dev/shm
> >> >> /dev/xvdv            524288000      30 524287970    1% /vol
> >> >>
> >> >> I have reproduced the error while using the MovieLens 10M data set
> on a
> >> >> newly created cluster.
> >> >>
> >> >> Thanks for the help.
> >> >> Chris
> >> >>
> >> >>
> >> >> On Wed, Jul 16, 2014 at 12:22 AM, Xiangrui Meng <me...@gmail.com>
> >> >> wrote:
> >> >>>
> >> >>> Hi Chris,
> >> >>>
> >> >>> Could you also try `df -i` on the master node? How many
> >> >>> blocks/partitions did you set?
> >> >>>
> >> >>> In the current implementation, ALS doesn't clean the shuffle data
> >> >>> because the operations are chained together. But it shouldn't run
> out
> >> >>> of disk space on the MovieLens dataset, which is small. spark-ec2
> >> >>> script sets /mnt/spark and /mnt/spark2 as the local.dir by default,
> I
> >> >>> would recommend leaving this setting as the default value.
> >> >>>
> >> >>> Best,
> >> >>> Xiangrui
> >> >>>
> >> >>> On Wed, Jul 16, 2014 at 12:02 AM, Chris DuBois
> >> >>> <ch...@gmail.com>
> >> >>> wrote:
> >> >>> > Thanks for the quick responses!
> >> >>> >
> >> >>> > I used your final -Dspark.local.dir suggestion, but I see this
> >> >>> > during
> >> >>> > the
> >> >>> > initialization of the application:
> >> >>> >
> >> >>> > 14/07/16 06:56:08 INFO storage.DiskBlockManager: Created local
> >> >>> > directory at
> >> >>> > /vol/spark-local-20140716065608-7b2a
> >> >>> >
> >> >>> > I would have expected something in /mnt/spark/.
> >> >>> >
> >> >>> > Thanks,
> >> >>> > Chris
> >> >>> >
> >> >>> >
> >> >>> >
> >> >>> > On Tue, Jul 15, 2014 at 11:44 PM, Chris Gore <cd...@cdgore.com>
> >> >>> > wrote:
> >> >>> >>
> >> >>> >> Hi Chris,
> >> >>> >>
> >> >>> >> I've encountered this error when running Spark’s ALS methods too.
> >> >>> >> In
> >> >>> >> my
> >> >>> >> case, it was because I set spark.local.dir improperly, and every
> >> >>> >> time
> >> >>> >> there
> >> >>> >> was a shuffle, it would spill many GB of data onto the local
> drive.
> >> >>> >> What
> >> >>> >> fixed it was setting it to use the /mnt directory, where a
> network
> >> >>> >> drive is
> >> >>> >> mounted.  For example, setting an environmental variable:
> >> >>> >>
> >> >>> >> export SPACE=$(mount | grep mnt | awk '{print $3"/spark/"}' |
> xargs
> >> >>> >> |
> >> >>> >> sed
> >> >>> >> 's/ /,/g’)
> >> >>> >>
> >> >>> >> Then adding -Dspark.local.dir=$SPACE or simply
> >> >>> >> -Dspark.local.dir=/mnt/spark/,/mnt2/spark/ when you run your
> driver
> >> >>> >> application
> >> >>> >>
> >> >>> >> Chris
> >> >>> >>
> >> >>> >> On Jul 15, 2014, at 11:39 PM, Xiangrui Meng <me...@gmail.com>
> >> >>> >> wrote:
> >> >>> >>
> >> >>> >> > Check the number of inodes (df -i). The assembly build may
> create
> >> >>> >> > many
> >> >>> >> > small files. -Xiangrui
> >> >>> >> >
> >> >>> >> > On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois
> >> >>> >> > <ch...@gmail.com>
> >> >>> >> > wrote:
> >> >>> >> >> Hi all,
> >> >>> >> >>
> >> >>> >> >> I am encountering the following error:
> >> >>> >> >>
> >> >>> >> >> INFO scheduler.TaskSetManager: Loss was due to
> >> >>> >> >> java.io.IOException:
> >> >>> >> >> No
> >> >>> >> >> space
> >> >>> >> >> left on device [duplicate 4]
> >> >>> >> >>
> >> >>> >> >> For each slave, df -h looks roughtly like this, which makes
> the
> >> >>> >> >> above
> >> >>> >> >> error
> >> >>> >> >> surprising.
> >> >>> >> >>
> >> >>> >> >> Filesystem            Size  Used Avail Use% Mounted on
> >> >>> >> >> /dev/xvda1            7.9G  4.4G  3.5G  57% /
> >> >>> >> >> tmpfs                 7.4G  4.0K  7.4G   1% /dev/shm
> >> >>> >> >> /dev/xvdb              37G  3.3G   32G  10% /mnt
> >> >>> >> >> /dev/xvdf              37G  2.0G   34G   6% /mnt2
> >> >>> >> >> /dev/xvdv             500G   33M  500G   1% /vol
> >> >>> >> >>
> >> >>> >> >> I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched
> using
> >> >>> >> >> the
> >> >>> >> >> spark-ec2 scripts and a clone of spark from today. The job I
> am
> >> >>> >> >> running
> >> >>> >> >> closely resembles the collaborative filtering example. This
> >> >>> >> >> issue
> >> >>> >> >> happens
> >> >>> >> >> with the 1M version as well as the 10 million rating version
> of
> >> >>> >> >> the
> >> >>> >> >> MovieLens dataset.
> >> >>> >> >>
> >> >>> >> >> I have seen previous questions, but they haven't helped yet.
> For
> >> >>> >> >> example, I
> >> >>> >> >> tried setting the Spark tmp directory to the EBS volume at
> >> >>> >> >> /vol/,
> >> >>> >> >> both
> >> >>> >> >> by
> >> >>> >> >> editing the spark conf file (and copy-dir'ing it to the
> slaves)
> >> >>> >> >> as
> >> >>> >> >> well
> >> >>> >> >> as
> >> >>> >> >> through the SparkConf. Yet I still get the above error. Here
> is
> >> >>> >> >> my
> >> >>> >> >> current
> >> >>> >> >> Spark config below. Note that I'm launching via
> >> >>> >> >> ~/spark/bin/spark-submit.
> >> >>> >> >>
> >> >>> >> >> conf = SparkConf()
> >> >>> >> >> conf.setAppName("RecommendALS").set("spark.local.dir",
> >> >>> >> >> "/vol/").set("spark.executor.memory",
> >> >>> >> >> "7g").set("spark.akka.frameSize",
> >> >>> >> >> "100").setExecutorEnv("SPARK_JAVA_OPTS", "
> >> >>> >> >> -Dspark.akka.frameSize=100")
> >> >>> >> >> sc = SparkContext(conf=conf)
> >> >>> >> >>
> >> >>> >> >> Thanks for any advice,
> >> >>> >> >> Chris
> >> >>> >> >>
> >> >>> >>
> >> >>> >
> >> >>
> >> >>
> >> >
> >
> >
>

Re: Error: No space left on device

Posted by Xiangrui Meng <me...@gmail.com>.
Set N to be the total number of cores on the cluster or less. sc.textFile
doesn't always give you that number; it depends on the block size. For
MovieLens, I think the default behavior should be 2~3 partitions. You
need to call repartition to ensure the right number of partitions.
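
A small sketch of the difference (the path and N are placeholders, and an
existing SparkContext sc is assumed): the second argument to sc.textFile is
only a minimum hint tied to the input's split size, while repartition forces
an exact count.

N = 80  # total cores on the cluster, or less

# Minimum-partition hint: the actual count depends on how the input is split,
# so Spark may give you more than N partitions (without the hint, a small
# MovieLens file only produces a few).
ratings = sc.textFile("/path/to/ratings.dat", N)

# Force exactly N partitions regardless of how the file was split.
ratings = sc.textFile("/path/to/ratings.dat").repartition(N)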

Which EC2 instance type did you use? I usually use m3.2xlarge or c?
instances that come with SSD and 1G or 10G network. For those
instances, you should see local drives mounted at /mnt, /mnt2, /mnt3,
... Make sure there was no error when you used the ec2 script to
launch the cluster.

It is a little strange that 94% of / was used on a slave. Maybe
shuffle data went to /. I'm not sure which settings went wrong. I
recommend re-launching a cluster with m3.2xlarge instances and
using the default settings (don't set anything in SparkConf). Submit
the application with --driver-memory 20g.
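
To make that concrete, the driver side of the "default settings" suggestion
is roughly just this, launched with something like
~/spark/bin/spark-submit --driver-memory 20g your_script.py (the script name
is a placeholder):

from pyspark import SparkConf, SparkContext

# Set only the application name; leave spark.local.dir, executor memory, etc.
# at whatever the spark-ec2 launch scripts configured on the cluster.
conf = SparkConf().setAppName("RecommendALS")
sc = SparkContext(conf=conf)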

The running times are slower than what I remember, but it depends on
the instance type.

Best,
Xiangrui



On Wed, Jul 16, 2014 at 10:18 PM, Chris DuBois <ch...@gmail.com> wrote:
> Hi Xiangrui,
>
> I will try this shortly. When using N partitions, do you recommend N be the
> number of cores on each slave or the number of cores on the master? Forgive
> my ignorance, but is this best achieved as an argument to sc.textFile?
>
> The slaves on the EC2 clusters start with only 8gb of storage, and it
> doesn't seem that /mnt/spark and /mnt2/spark are mounted anywhere else by
> default. Looking at spark-ec2/setup-slaves.sh, it appears that these are
> only mounted if the instance type begins with r3. (Or am I not reading that
> right?) My slaves are a different instance type, and currently look like
> this:
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/xvda1            7.9G  7.3G  515M  94% /
> tmpfs                 7.4G  4.0K  7.4G   1% /dev/shm
> /dev/xvdv             500G  2.5G  498G   1% /vol
>
> I have been able to finish ALS on MovieLens 10M only twice, taking 221s and
> 315s for 20 iterations at K=20 and lambda=0.01. Does that timing sound about
> right, or does it point to a poor configuration? The same script with
> MovieLens 1M runs fine in about 30-40s with the same settings. (In both
> cases I'm training on 70% of the data.)
>
> Thanks for your help!
> Chris
>
>
> On Wed, Jul 16, 2014 at 4:29 PM, Xiangrui Meng <me...@gmail.com> wrote:
>>
>> For ALS, I would recommend repartitioning the ratings to match the
>> number of CPU cores or even less. ALS is not computation heavy for
>> small k but communication heavy. Having small number of partitions may
>> help. For EC2 clusters, we use /mnt/spark and /mnt2/spark as the
>> default local directory because they are local hard drives. Did your
>> last run of ALS on MovieLens 10M-100K with the default settings
>> succeed? -Xiangrui
>>
>> On Wed, Jul 16, 2014 at 8:00 AM, Chris DuBois <ch...@gmail.com>
>> wrote:
>> > Hi Xiangrui,
>> >
>> > I accidentally did not send df -i for the master node. Here it is at the
>> > moment of failure:
>> >
>> > Filesystem            Inodes   IUsed   IFree IUse% Mounted on
>> > /dev/xvda1            524288  280938  243350   54% /
>> > tmpfs                3845409       1 3845408    1% /dev/shm
>> > /dev/xvdb            10002432    1027 10001405    1% /mnt
>> > /dev/xvdf            10002432      16 10002416    1% /mnt2
>> > /dev/xvdv            524288000      13 524287987    1% /vol
>> >
>> > I am using default settings now, but is there a way to make sure that
>> > the
>> > proper directories are being used? How many blocks/partitions do you
>> > recommend?
>> >
>> > Chris
>> >
>> >
>> > On Wed, Jul 16, 2014 at 1:09 AM, Chris DuBois <ch...@gmail.com>
>> > wrote:
>> >>
>> >> Hi Xiangrui,
>> >>
>> >> Here is the result on the master node:
>> >> $ df -i
>> >> Filesystem            Inodes   IUsed   IFree IUse% Mounted on
>> >> /dev/xvda1            524288  273997  250291   53% /
>> >> tmpfs                1917974       1 1917973    1% /dev/shm
>> >> /dev/xvdv            524288000      30 524287970    1% /vol
>> >>
>> >> I have reproduced the error while using the MovieLens 10M data set on a
>> >> newly created cluster.
>> >>
>> >> Thanks for the help.
>> >> Chris
>> >>
>> >>
>> >> On Wed, Jul 16, 2014 at 12:22 AM, Xiangrui Meng <me...@gmail.com>
>> >> wrote:
>> >>>
>> >>> Hi Chris,
>> >>>
>> >>> Could you also try `df -i` on the master node? How many
>> >>> blocks/partitions did you set?
>> >>>
>> >>> In the current implementation, ALS doesn't clean the shuffle data
>> >>> because the operations are chained together. But it shouldn't run out
>> >>> of disk space on the MovieLens dataset, which is small. spark-ec2
>> >>> script sets /mnt/spark and /mnt/spark2 as the local.dir by default, I
>> >>> would recommend leaving this setting as the default value.
>> >>>
>> >>> Best,
>> >>> Xiangrui
>> >>>
>> >>> On Wed, Jul 16, 2014 at 12:02 AM, Chris DuBois
>> >>> <ch...@gmail.com>
>> >>> wrote:
>> >>> > Thanks for the quick responses!
>> >>> >
>> >>> > I used your final -Dspark.local.dir suggestion, but I see this
>> >>> > during
>> >>> > the
>> >>> > initialization of the application:
>> >>> >
>> >>> > 14/07/16 06:56:08 INFO storage.DiskBlockManager: Created local
>> >>> > directory at
>> >>> > /vol/spark-local-20140716065608-7b2a
>> >>> >
>> >>> > I would have expected something in /mnt/spark/.
>> >>> >
>> >>> > Thanks,
>> >>> > Chris
>> >>> >
>> >>> >
>> >>> >
>> >>> > On Tue, Jul 15, 2014 at 11:44 PM, Chris Gore <cd...@cdgore.com>
>> >>> > wrote:
>> >>> >>
>> >>> >> Hi Chris,
>> >>> >>
>> >>> >> I've encountered this error when running Spark’s ALS methods too.
>> >>> >> In
>> >>> >> my
>> >>> >> case, it was because I set spark.local.dir improperly, and every
>> >>> >> time
>> >>> >> there
>> >>> >> was a shuffle, it would spill many GB of data onto the local drive.
>> >>> >> What
>> >>> >> fixed it was setting it to use the /mnt directory, where a network
>> >>> >> drive is
>> >>> >> mounted.  For example, setting an environmental variable:
>> >>> >>
>> >>> >> export SPACE=$(mount | grep mnt | awk '{print $3"/spark/"}' | xargs
>> >>> >> |
>> >>> >> sed
>> >>> >> 's/ /,/g’)
>> >>> >>
>> >>> >> Then adding -Dspark.local.dir=$SPACE or simply
>> >>> >> -Dspark.local.dir=/mnt/spark/,/mnt2/spark/ when you run your driver
>> >>> >> application
>> >>> >>
>> >>> >> Chris
>> >>> >>
>> >>> >> On Jul 15, 2014, at 11:39 PM, Xiangrui Meng <me...@gmail.com>
>> >>> >> wrote:
>> >>> >>
>> >>> >> > Check the number of inodes (df -i). The assembly build may create
>> >>> >> > many
>> >>> >> > small files. -Xiangrui
>> >>> >> >
>> >>> >> > On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois
>> >>> >> > <ch...@gmail.com>
>> >>> >> > wrote:
>> >>> >> >> Hi all,
>> >>> >> >>
>> >>> >> >> I am encountering the following error:
>> >>> >> >>
>> >>> >> >> INFO scheduler.TaskSetManager: Loss was due to
>> >>> >> >> java.io.IOException:
>> >>> >> >> No
>> >>> >> >> space
>> >>> >> >> left on device [duplicate 4]
>> >>> >> >>
>> >>> >> >> For each slave, df -h looks roughtly like this, which makes the
>> >>> >> >> above
>> >>> >> >> error
>> >>> >> >> surprising.
>> >>> >> >>
>> >>> >> >> Filesystem            Size  Used Avail Use% Mounted on
>> >>> >> >> /dev/xvda1            7.9G  4.4G  3.5G  57% /
>> >>> >> >> tmpfs                 7.4G  4.0K  7.4G   1% /dev/shm
>> >>> >> >> /dev/xvdb              37G  3.3G   32G  10% /mnt
>> >>> >> >> /dev/xvdf              37G  2.0G   34G   6% /mnt2
>> >>> >> >> /dev/xvdv             500G   33M  500G   1% /vol
>> >>> >> >>
>> >>> >> >> I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched using
>> >>> >> >> the
>> >>> >> >> spark-ec2 scripts and a clone of spark from today. The job I am
>> >>> >> >> running
>> >>> >> >> closely resembles the collaborative filtering example. This
>> >>> >> >> issue
>> >>> >> >> happens
>> >>> >> >> with the 1M version as well as the 10 million rating version of
>> >>> >> >> the
>> >>> >> >> MovieLens dataset.
>> >>> >> >>
>> >>> >> >> I have seen previous questions, but they haven't helped yet. For
>> >>> >> >> example, I
>> >>> >> >> tried setting the Spark tmp directory to the EBS volume at
>> >>> >> >> /vol/,
>> >>> >> >> both
>> >>> >> >> by
>> >>> >> >> editing the spark conf file (and copy-dir'ing it to the slaves)
>> >>> >> >> as
>> >>> >> >> well
>> >>> >> >> as
>> >>> >> >> through the SparkConf. Yet I still get the above error. Here is
>> >>> >> >> my
>> >>> >> >> current
>> >>> >> >> Spark config below. Note that I'm launching via
>> >>> >> >> ~/spark/bin/spark-submit.
>> >>> >> >>
>> >>> >> >> conf = SparkConf()
>> >>> >> >> conf.setAppName("RecommendALS").set("spark.local.dir",
>> >>> >> >> "/vol/").set("spark.executor.memory",
>> >>> >> >> "7g").set("spark.akka.frameSize",
>> >>> >> >> "100").setExecutorEnv("SPARK_JAVA_OPTS", "
>> >>> >> >> -Dspark.akka.frameSize=100")
>> >>> >> >> sc = SparkContext(conf=conf)
>> >>> >> >>
>> >>> >> >> Thanks for any advice,
>> >>> >> >> Chris
>> >>> >> >>
>> >>> >>
>> >>> >
>> >>
>> >>
>> >
>
>

Re: Error: No space left on device

Posted by Chris DuBois <ch...@gmail.com>.
Hi Xiangrui,

I will try this shortly. When using N partitions, do you recommend N be the
number of cores on each slave or the number of cores on the master? Forgive
my ignorance, but is this best achieved as an argument to sc.textFile?

The slaves on the EC2 clusters start with only 8 GB of storage, and it
doesn't seem that /mnt/spark and /mnt2/spark are mounted anywhere else by
default. Looking at spark-ec2/setup-slaves.sh, it appears that these are
only mounted if the instance type begins with r3. (Or am I not reading that
right?) My slaves are a different instance type, and currently look like
this:
Filesystem            Size  Used Avail Use% Mounted on
/dev/xvda1            7.9G  7.3G  515M  94% /
tmpfs                 7.4G  4.0K  7.4G   1% /dev/shm
/dev/xvdv             500G  2.5G  498G   1% /vol

I have been able to finish ALS on MovieLens 10M only twice, taking 221s and
315s for 20 iterations at K=20 and lambda=0.01. Does that timing sound
about right, or does it point to a poor configuration? The same script with
MovieLens 1M runs fine in about 30-40s with the same settings. (In both
cases I'm training on 70% of the data.)

Thanks for your help!
Chris


On Wed, Jul 16, 2014 at 4:29 PM, Xiangrui Meng <me...@gmail.com> wrote:

> For ALS, I would recommend repartitioning the ratings to match the
> number of CPU cores or even less. ALS is not computation heavy for
> small k but communication heavy. Having small number of partitions may
> help. For EC2 clusters, we use /mnt/spark and /mnt2/spark as the
> default local directory because they are local hard drives. Did your
> last run of ALS on MovieLens 10M-100K with the default settings
> succeed? -Xiangrui
>
> On Wed, Jul 16, 2014 at 8:00 AM, Chris DuBois <ch...@gmail.com>
> wrote:
> > Hi Xiangrui,
> >
> > I accidentally did not send df -i for the master node. Here it is at the
> > moment of failure:
> >
> > Filesystem            Inodes   IUsed   IFree IUse% Mounted on
> > /dev/xvda1            524288  280938  243350   54% /
> > tmpfs                3845409       1 3845408    1% /dev/shm
> > /dev/xvdb            10002432    1027 10001405    1% /mnt
> > /dev/xvdf            10002432      16 10002416    1% /mnt2
> > /dev/xvdv            524288000      13 524287987    1% /vol
> >
> > I am using default settings now, but is there a way to make sure that the
> > proper directories are being used? How many blocks/partitions do you
> > recommend?
> >
> > Chris
> >
> >
> > On Wed, Jul 16, 2014 at 1:09 AM, Chris DuBois <ch...@gmail.com>
> > wrote:
> >>
> >> Hi Xiangrui,
> >>
> >> Here is the result on the master node:
> >> $ df -i
> >> Filesystem            Inodes   IUsed   IFree IUse% Mounted on
> >> /dev/xvda1            524288  273997  250291   53% /
> >> tmpfs                1917974       1 1917973    1% /dev/shm
> >> /dev/xvdv            524288000      30 524287970    1% /vol
> >>
> >> I have reproduced the error while using the MovieLens 10M data set on a
> >> newly created cluster.
> >>
> >> Thanks for the help.
> >> Chris
> >>
> >>
> >> On Wed, Jul 16, 2014 at 12:22 AM, Xiangrui Meng <me...@gmail.com>
> wrote:
> >>>
> >>> Hi Chris,
> >>>
> >>> Could you also try `df -i` on the master node? How many
> >>> blocks/partitions did you set?
> >>>
> >>> In the current implementation, ALS doesn't clean the shuffle data
> >>> because the operations are chained together. But it shouldn't run out
> >>> of disk space on the MovieLens dataset, which is small. spark-ec2
> >>> script sets /mnt/spark and /mnt/spark2 as the local.dir by default, I
> >>> would recommend leaving this setting as the default value.
> >>>
> >>> Best,
> >>> Xiangrui
> >>>
> >>> On Wed, Jul 16, 2014 at 12:02 AM, Chris DuBois <chris.dubois@gmail.com
> >
> >>> wrote:
> >>> > Thanks for the quick responses!
> >>> >
> >>> > I used your final -Dspark.local.dir suggestion, but I see this during
> >>> > the
> >>> > initialization of the application:
> >>> >
> >>> > 14/07/16 06:56:08 INFO storage.DiskBlockManager: Created local
> >>> > directory at
> >>> > /vol/spark-local-20140716065608-7b2a
> >>> >
> >>> > I would have expected something in /mnt/spark/.
> >>> >
> >>> > Thanks,
> >>> > Chris
> >>> >
> >>> >
> >>> >
> >>> > On Tue, Jul 15, 2014 at 11:44 PM, Chris Gore <cd...@cdgore.com>
> wrote:
> >>> >>
> >>> >> Hi Chris,
> >>> >>
> >>> >> I've encountered this error when running Spark’s ALS methods too.
>  In
> >>> >> my
> >>> >> case, it was because I set spark.local.dir improperly, and every
> time
> >>> >> there
> >>> >> was a shuffle, it would spill many GB of data onto the local drive.
> >>> >> What
> >>> >> fixed it was setting it to use the /mnt directory, where a network
> >>> >> drive is
> >>> >> mounted.  For example, setting an environmental variable:
> >>> >>
> >>> >> export SPACE=$(mount | grep mnt | awk '{print $3"/spark/"}' | xargs
> |
> >>> >> sed
> >>> >> 's/ /,/g’)
> >>> >>
> >>> >> Then adding -Dspark.local.dir=$SPACE or simply
> >>> >> -Dspark.local.dir=/mnt/spark/,/mnt2/spark/ when you run your driver
> >>> >> application
> >>> >>
> >>> >> Chris
> >>> >>
> >>> >> On Jul 15, 2014, at 11:39 PM, Xiangrui Meng <me...@gmail.com>
> wrote:
> >>> >>
> >>> >> > Check the number of inodes (df -i). The assembly build may create
> >>> >> > many
> >>> >> > small files. -Xiangrui
> >>> >> >
> >>> >> > On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois
> >>> >> > <ch...@gmail.com>
> >>> >> > wrote:
> >>> >> >> Hi all,
> >>> >> >>
> >>> >> >> I am encountering the following error:
> >>> >> >>
> >>> >> >> INFO scheduler.TaskSetManager: Loss was due to
> java.io.IOException:
> >>> >> >> No
> >>> >> >> space
> >>> >> >> left on device [duplicate 4]
> >>> >> >>
> >>> >> >> For each slave, df -h looks roughtly like this, which makes the
> >>> >> >> above
> >>> >> >> error
> >>> >> >> surprising.
> >>> >> >>
> >>> >> >> Filesystem            Size  Used Avail Use% Mounted on
> >>> >> >> /dev/xvda1            7.9G  4.4G  3.5G  57% /
> >>> >> >> tmpfs                 7.4G  4.0K  7.4G   1% /dev/shm
> >>> >> >> /dev/xvdb              37G  3.3G   32G  10% /mnt
> >>> >> >> /dev/xvdf              37G  2.0G   34G   6% /mnt2
> >>> >> >> /dev/xvdv             500G   33M  500G   1% /vol
> >>> >> >>
> >>> >> >> I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched using
> >>> >> >> the
> >>> >> >> spark-ec2 scripts and a clone of spark from today. The job I am
> >>> >> >> running
> >>> >> >> closely resembles the collaborative filtering example. This issue
> >>> >> >> happens
> >>> >> >> with the 1M version as well as the 10 million rating version of
> the
> >>> >> >> MovieLens dataset.
> >>> >> >>
> >>> >> >> I have seen previous questions, but they haven't helped yet. For
> >>> >> >> example, I
> >>> >> >> tried setting the Spark tmp directory to the EBS volume at /vol/,
> >>> >> >> both
> >>> >> >> by
> >>> >> >> editing the spark conf file (and copy-dir'ing it to the slaves)
> as
> >>> >> >> well
> >>> >> >> as
> >>> >> >> through the SparkConf. Yet I still get the above error. Here is
> my
> >>> >> >> current
> >>> >> >> Spark config below. Note that I'm launching via
> >>> >> >> ~/spark/bin/spark-submit.
> >>> >> >>
> >>> >> >> conf = SparkConf()
> >>> >> >> conf.setAppName("RecommendALS").set("spark.local.dir",
> >>> >> >> "/vol/").set("spark.executor.memory",
> >>> >> >> "7g").set("spark.akka.frameSize",
> >>> >> >> "100").setExecutorEnv("SPARK_JAVA_OPTS", "
> >>> >> >> -Dspark.akka.frameSize=100")
> >>> >> >> sc = SparkContext(conf=conf)
> >>> >> >>
> >>> >> >> Thanks for any advice,
> >>> >> >> Chris
> >>> >> >>
> >>> >>
> >>> >
> >>
> >>
> >
>

Re: Error: No space left on device

Posted by Xiangrui Meng <me...@gmail.com>.
For ALS, I would recommend repartitioning the ratings to match the
number of CPU cores or even less. ALS is not computation heavy for
small k but communication heavy. Having a small number of partitions may
help. For EC2 clusters, we use /mnt/spark and /mnt2/spark as the
default local directory because they are local hard drives. Did your
last run of ALS on MovieLens 10M-100K with the default settings
succeed? -Xiangrui
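
A minimal PySpark sketch of that recommendation, using the rank, iteration
count, and lambda mentioned elsewhere in this thread (the path, the parser,
and the core count are placeholders, and an existing SparkContext sc is
assumed):

from pyspark.mllib.recommendation import ALS

num_cores = 40  # total CPU cores on the cluster, or even less

# Parse MovieLens-style lines into (user, product, rating) tuples and
# repartition to roughly one partition per core before training.
ratings = (sc.textFile("/path/to/ratings.dat")
             .map(lambda line: line.split("::"))
             .map(lambda f: (int(f[0]), int(f[1]), float(f[2])))
             .repartition(num_cores)
             .cache())

model = ALS.train(ratings, rank=20, iterations=20, lambda_=0.01)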

On Wed, Jul 16, 2014 at 8:00 AM, Chris DuBois <ch...@gmail.com> wrote:
> Hi Xiangrui,
>
> I accidentally did not send df -i for the master node. Here it is at the
> moment of failure:
>
> Filesystem            Inodes   IUsed   IFree IUse% Mounted on
> /dev/xvda1            524288  280938  243350   54% /
> tmpfs                3845409       1 3845408    1% /dev/shm
> /dev/xvdb            10002432    1027 10001405    1% /mnt
> /dev/xvdf            10002432      16 10002416    1% /mnt2
> /dev/xvdv            524288000      13 524287987    1% /vol
>
> I am using default settings now, but is there a way to make sure that the
> proper directories are being used? How many blocks/partitions do you
> recommend?
>
> Chris
>
>
> On Wed, Jul 16, 2014 at 1:09 AM, Chris DuBois <ch...@gmail.com>
> wrote:
>>
>> Hi Xiangrui,
>>
>> Here is the result on the master node:
>> $ df -i
>> Filesystem            Inodes   IUsed   IFree IUse% Mounted on
>> /dev/xvda1            524288  273997  250291   53% /
>> tmpfs                1917974       1 1917973    1% /dev/shm
>> /dev/xvdv            524288000      30 524287970    1% /vol
>>
>> I have reproduced the error while using the MovieLens 10M data set on a
>> newly created cluster.
>>
>> Thanks for the help.
>> Chris
>>
>>
>> On Wed, Jul 16, 2014 at 12:22 AM, Xiangrui Meng <me...@gmail.com> wrote:
>>>
>>> Hi Chris,
>>>
>>> Could you also try `df -i` on the master node? How many
>>> blocks/partitions did you set?
>>>
>>> In the current implementation, ALS doesn't clean the shuffle data
>>> because the operations are chained together. But it shouldn't run out
>>> of disk space on the MovieLens dataset, which is small. spark-ec2
>>> script sets /mnt/spark and /mnt/spark2 as the local.dir by default, I
>>> would recommend leaving this setting as the default value.
>>>
>>> Best,
>>> Xiangrui
>>>
>>> On Wed, Jul 16, 2014 at 12:02 AM, Chris DuBois <ch...@gmail.com>
>>> wrote:
>>> > Thanks for the quick responses!
>>> >
>>> > I used your final -Dspark.local.dir suggestion, but I see this during
>>> > the
>>> > initialization of the application:
>>> >
>>> > 14/07/16 06:56:08 INFO storage.DiskBlockManager: Created local
>>> > directory at
>>> > /vol/spark-local-20140716065608-7b2a
>>> >
>>> > I would have expected something in /mnt/spark/.
>>> >
>>> > Thanks,
>>> > Chris
>>> >
>>> >
>>> >
>>> > On Tue, Jul 15, 2014 at 11:44 PM, Chris Gore <cd...@cdgore.com> wrote:
>>> >>
>>> >> Hi Chris,
>>> >>
>>> >> I've encountered this error when running Spark’s ALS methods too.  In
>>> >> my
>>> >> case, it was because I set spark.local.dir improperly, and every time
>>> >> there
>>> >> was a shuffle, it would spill many GB of data onto the local drive.
>>> >> What
>>> >> fixed it was setting it to use the /mnt directory, where a network
>>> >> drive is
>>> >> mounted.  For example, setting an environmental variable:
>>> >>
>>> >> export SPACE=$(mount | grep mnt | awk '{print $3"/spark/"}' | xargs |
>>> >> sed
>>> >> 's/ /,/g’)
>>> >>
>>> >> Then adding -Dspark.local.dir=$SPACE or simply
>>> >> -Dspark.local.dir=/mnt/spark/,/mnt2/spark/ when you run your driver
>>> >> application
>>> >>
>>> >> Chris
>>> >>
>>> >> On Jul 15, 2014, at 11:39 PM, Xiangrui Meng <me...@gmail.com> wrote:
>>> >>
>>> >> > Check the number of inodes (df -i). The assembly build may create
>>> >> > many
>>> >> > small files. -Xiangrui
>>> >> >
>>> >> > On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois
>>> >> > <ch...@gmail.com>
>>> >> > wrote:
>>> >> >> Hi all,
>>> >> >>
>>> >> >> I am encountering the following error:
>>> >> >>
>>> >> >> INFO scheduler.TaskSetManager: Loss was due to java.io.IOException:
>>> >> >> No
>>> >> >> space
>>> >> >> left on device [duplicate 4]
>>> >> >>
>>> >> >> For each slave, df -h looks roughtly like this, which makes the
>>> >> >> above
>>> >> >> error
>>> >> >> surprising.
>>> >> >>
>>> >> >> Filesystem            Size  Used Avail Use% Mounted on
>>> >> >> /dev/xvda1            7.9G  4.4G  3.5G  57% /
>>> >> >> tmpfs                 7.4G  4.0K  7.4G   1% /dev/shm
>>> >> >> /dev/xvdb              37G  3.3G   32G  10% /mnt
>>> >> >> /dev/xvdf              37G  2.0G   34G   6% /mnt2
>>> >> >> /dev/xvdv             500G   33M  500G   1% /vol
>>> >> >>
>>> >> >> I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched using
>>> >> >> the
>>> >> >> spark-ec2 scripts and a clone of spark from today. The job I am
>>> >> >> running
>>> >> >> closely resembles the collaborative filtering example. This issue
>>> >> >> happens
>>> >> >> with the 1M version as well as the 10 million rating version of the
>>> >> >> MovieLens dataset.
>>> >> >>
>>> >> >> I have seen previous questions, but they haven't helped yet. For
>>> >> >> example, I
>>> >> >> tried setting the Spark tmp directory to the EBS volume at /vol/,
>>> >> >> both
>>> >> >> by
>>> >> >> editing the spark conf file (and copy-dir'ing it to the slaves) as
>>> >> >> well
>>> >> >> as
>>> >> >> through the SparkConf. Yet I still get the above error. Here is my
>>> >> >> current
>>> >> >> Spark config below. Note that I'm launching via
>>> >> >> ~/spark/bin/spark-submit.
>>> >> >>
>>> >> >> conf = SparkConf()
>>> >> >> conf.setAppName("RecommendALS").set("spark.local.dir",
>>> >> >> "/vol/").set("spark.executor.memory",
>>> >> >> "7g").set("spark.akka.frameSize",
>>> >> >> "100").setExecutorEnv("SPARK_JAVA_OPTS", "
>>> >> >> -Dspark.akka.frameSize=100")
>>> >> >> sc = SparkContext(conf=conf)
>>> >> >>
>>> >> >> Thanks for any advice,
>>> >> >> Chris
>>> >> >>
>>> >>
>>> >
>>
>>
>

Re: Error: No space left on device

Posted by Chris DuBois <ch...@gmail.com>.
Hi Xiangrui,

I accidentally did not send df -i for the master node. Here it is at the
moment of failure:

Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/xvda1            524288  280938  243350   54% /
tmpfs                3845409       1 3845408    1% /dev/shm
/dev/xvdb            10002432    1027 10001405    1% /mnt
/dev/xvdf            10002432      16 10002416    1% /mnt2
/dev/xvdv            524288000      13 524287987    1% /vol

I am using default settings now, but is there a way to make sure that the
proper directories are being used? How many blocks/partitions do you
recommend?
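
One sanity check (my own suggestion, not an official recipe) is to print what
the driver actually sees for the local directory setting, and then compare it
against the "Created local directory at ..." DiskBlockManager log line when
the application starts:

import os
from pyspark import SparkConf

conf = SparkConf()
# With the spark-ec2 defaults this should resolve to /mnt/spark,/mnt2/spark
# via either the Spark config or the SPARK_LOCAL_DIRS environment variable.
print(conf.get("spark.local.dir", "spark.local.dir not set"))
print(os.environ.get("SPARK_LOCAL_DIRS", "SPARK_LOCAL_DIRS not set"))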

Chris


On Wed, Jul 16, 2014 at 1:09 AM, Chris DuBois <ch...@gmail.com>
wrote:

> Hi Xiangrui,
>
> Here is the result on the master node:
> $ df -i
> Filesystem            Inodes   IUsed   IFree IUse% Mounted on
> /dev/xvda1            524288  273997  250291   53% /
> tmpfs                1917974       1 1917973    1% /dev/shm
> /dev/xvdv            524288000      30 524287970    1% /vol
>
> I have reproduced the error while using the MovieLens 10M data set on a
> newly created cluster.
>
> Thanks for the help.
> Chris
>
>
> On Wed, Jul 16, 2014 at 12:22 AM, Xiangrui Meng <me...@gmail.com> wrote:
>
>> Hi Chris,
>>
>> Could you also try `df -i` on the master node? How many
>> blocks/partitions did you set?
>>
>> In the current implementation, ALS doesn't clean the shuffle data
>> because the operations are chained together. But it shouldn't run out
>> of disk space on the MovieLens dataset, which is small. spark-ec2
>> script sets /mnt/spark and /mnt/spark2 as the local.dir by default; I
>> would recommend leaving this setting as the default value.
>>
>> Best,
>> Xiangrui
>>
>> On Wed, Jul 16, 2014 at 12:02 AM, Chris DuBois <ch...@gmail.com>
>> wrote:
>> > Thanks for the quick responses!
>> >
>> > I used your final -Dspark.local.dir suggestion, but I see this during
>> the
>> > initialization of the application:
>> >
>> > 14/07/16 06:56:08 INFO storage.DiskBlockManager: Created local
>> directory at
>> > /vol/spark-local-20140716065608-7b2a
>> >
>> > I would have expected something in /mnt/spark/.
>> >
>> > Thanks,
>> > Chris
>> >
>> >
>> >
>> > On Tue, Jul 15, 2014 at 11:44 PM, Chris Gore <cd...@cdgore.com> wrote:
>> >>
>> >> Hi Chris,
>> >>
>> >> I've encountered this error when running Spark’s ALS methods too.  In
>> my
>> >> case, it was because I set spark.local.dir improperly, and every time
>> there
>> >> was a shuffle, it would spill many GB of data onto the local drive.
>>  What
>> >> fixed it was setting it to use the /mnt directory, where a network
>> drive is
>> >> mounted.  For example, setting an environment variable:
>> >>
>> >> export SPACE=$(mount | grep mnt | awk '{print $3"/spark/"}' | xargs |
>> sed
>> 's/ /,/g')
>> >>
>> >> Then adding -Dspark.local.dir=$SPACE or simply
>> >> -Dspark.local.dir=/mnt/spark/,/mnt2/spark/ when you run your driver
>> >> application
>> >>
>> >> Chris
>> >>
>> >> On Jul 15, 2014, at 11:39 PM, Xiangrui Meng <me...@gmail.com> wrote:
>> >>
>> >> > Check the number of inodes (df -i). The assembly build may create
>> many
>> >> > small files. -Xiangrui
>> >> >
>> >> > On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois <
>> chris.dubois@gmail.com>
>> >> > wrote:
>> >> >> Hi all,
>> >> >>
>> >> >> I am encountering the following error:
>> >> >>
>> >> >> INFO scheduler.TaskSetManager: Loss was due to java.io.IOException:
>> No
>> >> >> space
>> >> >> left on device [duplicate 4]
>> >> >>
>> >> >> For each slave, df -h looks roughly like this, which makes the
>> above
>> >> >> error
>> >> >> surprising.
>> >> >>
>> >> >> Filesystem            Size  Used Avail Use% Mounted on
>> >> >> /dev/xvda1            7.9G  4.4G  3.5G  57% /
>> >> >> tmpfs                 7.4G  4.0K  7.4G   1% /dev/shm
>> >> >> /dev/xvdb              37G  3.3G   32G  10% /mnt
>> >> >> /dev/xvdf              37G  2.0G   34G   6% /mnt2
>> >> >> /dev/xvdv             500G   33M  500G   1% /vol
>> >> >>
>> >> >> I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched using the
>> >> >> spark-ec2 scripts and a clone of spark from today. The job I am
>> running
>> >> >> closely resembles the collaborative filtering example. This issue
>> >> >> happens
>> >> >> with the 1M version as well as the 10 million rating version of the
>> >> >> MovieLens dataset.
>> >> >>
>> >> >> I have seen previous questions, but they haven't helped yet. For
>> >> >> example, I
>> >> >> tried setting the Spark tmp directory to the EBS volume at /vol/,
>> both
>> >> >> by
>> >> >> editing the spark conf file (and copy-dir'ing it to the slaves) as
>> well
>> >> >> as
>> >> >> through the SparkConf. Yet I still get the above error. Here is my
>> >> >> current
>> >> >> Spark config below. Note that I'm launching via
>> >> >> ~/spark/bin/spark-submit.
>> >> >>
>> >> >> conf = SparkConf()
>> >> >> conf.setAppName("RecommendALS").set("spark.local.dir",
>> >> >> "/vol/").set("spark.executor.memory",
>> "7g").set("spark.akka.frameSize",
>> >> >> "100").setExecutorEnv("SPARK_JAVA_OPTS", "
>> -Dspark.akka.frameSize=100")
>> >> >> sc = SparkContext(conf=conf)
>> >> >>
>> >> >> Thanks for any advice,
>> >> >> Chris
>> >> >>
>> >>
>> >
>>
>
>

Re: Error: No space left on device

Posted by Chris DuBois <ch...@gmail.com>.
Hi Xiangrui,

Here is the result on the master node:
$ df -i
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/xvda1            524288  273997  250291   53% /
tmpfs                1917974       1 1917973    1% /dev/shm
/dev/xvdv            524288000      30 524287970    1% /vol

I have reproduced the error while using the MovieLens 10M data set on a
newly created cluster.

Thanks for the help.
Chris


On Wed, Jul 16, 2014 at 12:22 AM, Xiangrui Meng <me...@gmail.com> wrote:

> Hi Chris,
>
> Could you also try `df -i` on the master node? How many
> blocks/partitions did you set?
>
> In the current implementation, ALS doesn't clean the shuffle data
> because the operations are chained together. But it shouldn't run out
> of disk space on the MovieLens dataset, which is small. spark-ec2
> script sets /mnt/spark and /mnt/spark2 as the local.dir by default; I
> would recommend leaving this setting as the default value.
>
> Best,
> Xiangrui
>
> On Wed, Jul 16, 2014 at 12:02 AM, Chris DuBois <ch...@gmail.com>
> wrote:
> > Thanks for the quick responses!
> >
> > I used your final -Dspark.local.dir suggestion, but I see this during the
> > initialization of the application:
> >
> > 14/07/16 06:56:08 INFO storage.DiskBlockManager: Created local directory
> at
> > /vol/spark-local-20140716065608-7b2a
> >
> > I would have expected something in /mnt/spark/.
> >
> > Thanks,
> > Chris
> >
> >
> >
> > On Tue, Jul 15, 2014 at 11:44 PM, Chris Gore <cd...@cdgore.com> wrote:
> >>
> >> Hi Chris,
> >>
> >> I've encountered this error when running Spark’s ALS methods too.  In my
> >> case, it was because I set spark.local.dir improperly, and every time
> there
> >> was a shuffle, it would spill many GB of data onto the local drive.
>  What
> >> fixed it was setting it to use the /mnt directory, where a network
> drive is
> >> mounted.  For example, setting an environment variable:
> >>
> >> export SPACE=$(mount | grep mnt | awk '{print $3"/spark/"}' | xargs |
> sed
> >> 's/ /,/g')
> >>
> >> Then adding -Dspark.local.dir=$SPACE or simply
> >> -Dspark.local.dir=/mnt/spark/,/mnt2/spark/ when you run your driver
> >> application
> >>
> >> Chris
> >>
> >> On Jul 15, 2014, at 11:39 PM, Xiangrui Meng <me...@gmail.com> wrote:
> >>
> >> > Check the number of inodes (df -i). The assembly build may create many
> >> > small files. -Xiangrui
> >> >
> >> > On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois <
> chris.dubois@gmail.com>
> >> > wrote:
> >> >> Hi all,
> >> >>
> >> >> I am encountering the following error:
> >> >>
> >> >> INFO scheduler.TaskSetManager: Loss was due to java.io.IOException:
> No
> >> >> space
> >> >> left on device [duplicate 4]
> >> >>
> >> >> For each slave, df -h looks roughly like this, which makes the above
> >> >> error
> >> >> surprising.
> >> >>
> >> >> Filesystem            Size  Used Avail Use% Mounted on
> >> >> /dev/xvda1            7.9G  4.4G  3.5G  57% /
> >> >> tmpfs                 7.4G  4.0K  7.4G   1% /dev/shm
> >> >> /dev/xvdb              37G  3.3G   32G  10% /mnt
> >> >> /dev/xvdf              37G  2.0G   34G   6% /mnt2
> >> >> /dev/xvdv             500G   33M  500G   1% /vol
> >> >>
> >> >> I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched using the
> >> >> spark-ec2 scripts and a clone of spark from today. The job I am
> running
> >> >> closely resembles the collaborative filtering example. This issue
> >> >> happens
> >> >> with the 1M version as well as the 10 million rating version of the
> >> >> MovieLens dataset.
> >> >>
> >> >> I have seen previous questions, but they haven't helped yet. For
> >> >> example, I
> >> >> tried setting the Spark tmp directory to the EBS volume at /vol/,
> both
> >> >> by
> >> >> editing the spark conf file (and copy-dir'ing it to the slaves) as
> well
> >> >> as
> >> >> through the SparkConf. Yet I still get the above error. Here is my
> >> >> current
> >> >> Spark config below. Note that I'm launching via
> >> >> ~/spark/bin/spark-submit.
> >> >>
> >> >> conf = SparkConf()
> >> >> conf.setAppName("RecommendALS").set("spark.local.dir",
> >> >> "/vol/").set("spark.executor.memory",
> "7g").set("spark.akka.frameSize",
> >> >> "100").setExecutorEnv("SPARK_JAVA_OPTS", "
> -Dspark.akka.frameSize=100")
> >> >> sc = SparkContext(conf=conf)
> >> >>
> >> >> Thanks for any advice,
> >> >> Chris
> >> >>
> >>
> >
>

Re: Error: No space left on device

Posted by Xiangrui Meng <me...@gmail.com>.
Hi Chris,

Could you also try `df -i` on the master node? How many
blocks/partitions did you set?

In the current implementation, ALS doesn't clean the shuffle data
because the operations are chained together. But it shouldn't run out
of disk space on the MovieLens dataset, which is small. spark-ec2
script sets /mnt/spark and /mnt/spark2 as the local.dir by default; I
would recommend leaving this setting as the default value.
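
For example, something like your current config with the spark.local.dir
override removed (just a sketch based on your snippet; I kept your other
settings and dropped the SPARK_JAVA_OPTS executor env):

from pyspark import SparkConf, SparkContext

conf = SparkConf() \
    .setAppName("RecommendALS") \
    .set("spark.executor.memory", "7g") \
    .set("spark.akka.frameSize", "100")
sc = SparkContext(conf=conf)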

Best,
Xiangrui

On Wed, Jul 16, 2014 at 12:02 AM, Chris DuBois <ch...@gmail.com> wrote:
> Thanks for the quick responses!
>
> I used your final -Dspark.local.dir suggestion, but I see this during the
> initialization of the application:
>
> 14/07/16 06:56:08 INFO storage.DiskBlockManager: Created local directory at
> /vol/spark-local-20140716065608-7b2a
>
> I would have expected something in /mnt/spark/.
>
> Thanks,
> Chris
>
>
>
> On Tue, Jul 15, 2014 at 11:44 PM, Chris Gore <cd...@cdgore.com> wrote:
>>
>> Hi Chris,
>>
>> I've encountered this error when running Spark’s ALS methods too.  In my
>> case, it was because I set spark.local.dir improperly, and every time there
>> was a shuffle, it would spill many GB of data onto the local drive.  What
>> fixed it was setting it to use the /mnt directory, where a network drive is
>> mounted.  For example, setting an environment variable:
>>
>> export SPACE=$(mount | grep mnt | awk '{print $3"/spark/"}' | xargs | sed
>> 's/ /,/g')
>>
>> Then adding -Dspark.local.dir=$SPACE or simply
>> -Dspark.local.dir=/mnt/spark/,/mnt2/spark/ when you run your driver
>> application
>>
>> Chris
>>
>> On Jul 15, 2014, at 11:39 PM, Xiangrui Meng <me...@gmail.com> wrote:
>>
>> > Check the number of inodes (df -i). The assembly build may create many
>> > small files. -Xiangrui
>> >
>> > On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois <ch...@gmail.com>
>> > wrote:
>> >> Hi all,
>> >>
>> >> I am encountering the following error:
>> >>
>> >> INFO scheduler.TaskSetManager: Loss was due to java.io.IOException: No
>> >> space
>> >> left on device [duplicate 4]
>> >>
>> >> For each slave, df -h looks roughly like this, which makes the above
>> >> error
>> >> surprising.
>> >>
>> >> Filesystem            Size  Used Avail Use% Mounted on
>> >> /dev/xvda1            7.9G  4.4G  3.5G  57% /
>> >> tmpfs                 7.4G  4.0K  7.4G   1% /dev/shm
>> >> /dev/xvdb              37G  3.3G   32G  10% /mnt
>> >> /dev/xvdf              37G  2.0G   34G   6% /mnt2
>> >> /dev/xvdv             500G   33M  500G   1% /vol
>> >>
>> >> I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched using the
>> >> spark-ec2 scripts and a clone of spark from today. The job I am running
>> >> closely resembles the collaborative filtering example. This issue
>> >> happens
>> >> with the 1M version as well as the 10 million rating version of the
>> >> MovieLens dataset.
>> >>
>> >> I have seen previous questions, but they haven't helped yet. For
>> >> example, I
>> >> tried setting the Spark tmp directory to the EBS volume at /vol/, both
>> >> by
>> >> editing the spark conf file (and copy-dir'ing it to the slaves) as well
>> >> as
>> >> through the SparkConf. Yet I still get the above error. Here is my
>> >> current
>> >> Spark config below. Note that I'm launching via
>> >> ~/spark/bin/spark-submit.
>> >>
>> >> conf = SparkConf()
>> >> conf.setAppName("RecommendALS").set("spark.local.dir",
>> >> "/vol/").set("spark.executor.memory", "7g").set("spark.akka.frameSize",
>> >> "100").setExecutorEnv("SPARK_JAVA_OPTS", " -Dspark.akka.frameSize=100")
>> >> sc = SparkContext(conf=conf)
>> >>
>> >> Thanks for any advice,
>> >> Chris
>> >>
>>
>

Re: Error: No space left on device

Posted by Chris DuBois <ch...@gmail.com>.
Thanks for the quick responses!

I used your final -Dspark.local.dir suggestion, but I see this during the
initialization of the application:

14/07/16 06:56:08 INFO storage.DiskBlockManager: Created local directory at
/vol/spark-local-20140716065608-7b2a

I would have expected something in /mnt/spark/.
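
If it's useful, I can run a quick check like this on the executors to see
which candidate mount points actually contain spark-local-* directories
(rough sketch; the candidate paths are just guesses from my df output):

import glob
import os
from pyspark import SparkContext

sc = SparkContext(appName="CheckLocalDirs")
candidates = ["/mnt/spark", "/mnt2/spark", "/vol"]

def spark_dirs(_):
    # Report Spark's per-application scratch directories found on this executor.
    for base in candidates:
        for d in glob.glob(os.path.join(base, "spark-local-*")):
            yield d

print(sorted(set(sc.parallelize(range(16), 16).mapPartitions(spark_dirs).collect())))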

Thanks,
Chris



On Tue, Jul 15, 2014 at 11:44 PM, Chris Gore <cd...@cdgore.com> wrote:

> Hi Chris,
>
> I've encountered this error when running Spark’s ALS methods too.  In my
> case, it was because I set spark.local.dir improperly, and every time there
> was a shuffle, it would spill many GB of data onto the local drive.  What
> fixed it was setting it to use the /mnt directory, where a network drive is
> mounted.  For example, setting an environment variable:
>
> export SPACE=$(mount | grep mnt | awk '{print $3"/spark/"}' | xargs | sed
> 's/ /,/g')
>
> Then adding -Dspark.local.dir=$SPACE or simply
> -Dspark.local.dir=/mnt/spark/,/mnt2/spark/ when you run your driver
> application
>
> Chris
>
> On Jul 15, 2014, at 11:39 PM, Xiangrui Meng <me...@gmail.com> wrote:
>
> > Check the number of inodes (df -i). The assembly build may create many
> > small files. -Xiangrui
> >
> > On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois <ch...@gmail.com>
> wrote:
> >> Hi all,
> >>
> >> I am encountering the following error:
> >>
> >> INFO scheduler.TaskSetManager: Loss was due to java.io.IOException: No
> space
> >> left on device [duplicate 4]
> >>
> >> For each slave, df -h looks roughtly like this, which makes the above
> error
> >> surprising.
> >>
> >> Filesystem            Size  Used Avail Use% Mounted on
> >> /dev/xvda1            7.9G  4.4G  3.5G  57% /
> >> tmpfs                 7.4G  4.0K  7.4G   1% /dev/shm
> >> /dev/xvdb              37G  3.3G   32G  10% /mnt
> >> /dev/xvdf              37G  2.0G   34G   6% /mnt2
> >> /dev/xvdv             500G   33M  500G   1% /vol
> >>
> >> I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched using the
> >> spark-ec2 scripts and a clone of spark from today. The job I am running
> >> closely resembles the collaborative filtering example. This issue
> happens
> >> with the 1M version as well as the 10 million rating version of the
> >> MovieLens dataset.
> >>
> >> I have seen previous questions, but they haven't helped yet. For
> example, I
> >> tried setting the Spark tmp directory to the EBS volume at /vol/, both
> by
> >> editing the spark conf file (and copy-dir'ing it to the slaves) as well
> as
> >> through the SparkConf. Yet I still get the above error. Here is my
> current
> >> Spark config below. Note that I'm launching via
> ~/spark/bin/spark-submit.
> >>
> >> conf = SparkConf()
> >> conf.setAppName("RecommendALS").set("spark.local.dir",
> >> "/vol/").set("spark.executor.memory", "7g").set("spark.akka.frameSize",
> >> "100").setExecutorEnv("SPARK_JAVA_OPTS", " -Dspark.akka.frameSize=100")
> >> sc = SparkContext(conf=conf)
> >>
> >> Thanks for any advice,
> >> Chris
> >>
>
>

Re: Error: No space left on device

Posted by Chris Gore <cd...@cdgore.com>.
Hi Chris,

I've encountered this error when running Spark’s ALS methods too.  In my case, it was because I set spark.local.dir improperly, and every time there was a shuffle, it would spill many GB of data onto the local drive.  What fixed it was setting it to use the /mnt directory, where a network drive is mounted.  For example, setting an environment variable:

export SPACE=$(mount | grep mnt | awk '{print $3"/spark/"}' | xargs | sed 's/ /,/g')

Then adding -Dspark.local.dir=$SPACE or simply -Dspark.local.dir=/mnt/spark/,/mnt2/spark/ when you run your driver application
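
Since you're already building a SparkConf in Python, the equivalent there
would be roughly this (a sketch; the paths assume the standard spark-ec2
mounts at /mnt and /mnt2):

from pyspark import SparkConf, SparkContext

conf = SparkConf() \
    .setAppName("RecommendALS") \
    .set("spark.local.dir", "/mnt/spark/,/mnt2/spark/")
sc = SparkContext(conf=conf)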

Chris

On Jul 15, 2014, at 11:39 PM, Xiangrui Meng <me...@gmail.com> wrote:

> Check the number of inodes (df -i). The assembly build may create many
> small files. -Xiangrui
> 
> On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois <ch...@gmail.com> wrote:
>> Hi all,
>> 
>> I am encountering the following error:
>> 
>> INFO scheduler.TaskSetManager: Loss was due to java.io.IOException: No space
>> left on device [duplicate 4]
>> 
>> For each slave, df -h looks roughly like this, which makes the above error
>> surprising.
>> 
>> Filesystem            Size  Used Avail Use% Mounted on
>> /dev/xvda1            7.9G  4.4G  3.5G  57% /
>> tmpfs                 7.4G  4.0K  7.4G   1% /dev/shm
>> /dev/xvdb              37G  3.3G   32G  10% /mnt
>> /dev/xvdf              37G  2.0G   34G   6% /mnt2
>> /dev/xvdv             500G   33M  500G   1% /vol
>> 
>> I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched using the
>> spark-ec2 scripts and a clone of spark from today. The job I am running
>> closely resembles the collaborative filtering example. This issue happens
>> with the 1M version as well as the 10 million rating version of the
>> MovieLens dataset.
>> 
>> I have seen previous questions, but they haven't helped yet. For example, I
>> tried setting the Spark tmp directory to the EBS volume at /vol/, both by
>> editing the spark conf file (and copy-dir'ing it to the slaves) as well as
>> through the SparkConf. Yet I still get the above error. Here is my current
>> Spark config below. Note that I'm launching via ~/spark/bin/spark-submit.
>> 
>> conf = SparkConf()
>> conf.setAppName("RecommendALS").set("spark.local.dir",
>> "/vol/").set("spark.executor.memory", "7g").set("spark.akka.frameSize",
>> "100").setExecutorEnv("SPARK_JAVA_OPTS", " -Dspark.akka.frameSize=100")
>> sc = SparkContext(conf=conf)
>> 
>> Thanks for any advice,
>> Chris
>> 


Re: Error: No space left on device

Posted by Chris DuBois <ch...@gmail.com>.
df -i  # on a slave

Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/xvda1            524288  277701  246587   53% /
tmpfs                1917974       1 1917973    1% /dev/shm


On Tue, Jul 15, 2014 at 11:39 PM, Xiangrui Meng <me...@gmail.com> wrote:

> Check the number of inodes (df -i). The assembly build may create many
> small files. -Xiangrui
>
> On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois <ch...@gmail.com>
> wrote:
> > Hi all,
> >
> > I am encountering the following error:
> >
> > INFO scheduler.TaskSetManager: Loss was due to java.io.IOException: No
> space
> > left on device [duplicate 4]
> >
> > For each slave, df -h looks roughly like this, which makes the above
> error
> > surprising.
> >
> > Filesystem            Size  Used Avail Use% Mounted on
> > /dev/xvda1            7.9G  4.4G  3.5G  57% /
> > tmpfs                 7.4G  4.0K  7.4G   1% /dev/shm
> > /dev/xvdb              37G  3.3G   32G  10% /mnt
> > /dev/xvdf              37G  2.0G   34G   6% /mnt2
> > /dev/xvdv             500G   33M  500G   1% /vol
> >
> > I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched using the
> > spark-ec2 scripts and a clone of spark from today. The job I am running
> > closely resembles the collaborative filtering example. This issue happens
> > with the 1M version as well as the 10 million rating version of the
> > MovieLens dataset.
> >
> > I have seen previous questions, but they haven't helped yet. For
> example, I
> > tried setting the Spark tmp directory to the EBS volume at /vol/, both by
> > editing the spark conf file (and copy-dir'ing it to the slaves) as well
> as
> > through the SparkConf. Yet I still get the above error. Here is my
> current
> > Spark config below. Note that I'm launching via ~/spark/bin/spark-submit.
> >
> > conf = SparkConf()
> > conf.setAppName("RecommendALS").set("spark.local.dir",
> > "/vol/").set("spark.executor.memory", "7g").set("spark.akka.frameSize",
> > "100").setExecutorEnv("SPARK_JAVA_OPTS", " -Dspark.akka.frameSize=100")
> > sc = SparkContext(conf=conf)
> >
> > Thanks for any advice,
> > Chris
> >
>

Re: Error: No space left on device

Posted by Xiangrui Meng <me...@gmail.com>.
Check the number of inodes (df -i). The assembly build may create many
small files. -Xiangrui
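
If it's easier to script than running df on each node, something like this
(rough sketch, standard library only; the mount list is illustrative) reports
inode usage per mount point:

import os

def inode_usage(path):
    # f_files is the total inode count, f_ffree the number still free.
    st = os.statvfs(path)
    return st.f_files - st.f_ffree, st.f_files

for mount in ["/", "/mnt", "/mnt2", "/vol"]:
    try:
        used, total = inode_usage(mount)
    except OSError:
        continue  # mount point not present on this machine
    print("%s: %d / %d inodes used" % (mount, used, total))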

On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois <ch...@gmail.com> wrote:
> Hi all,
>
> I am encountering the following error:
>
> INFO scheduler.TaskSetManager: Loss was due to java.io.IOException: No space
> left on device [duplicate 4]
>
> For each slave, df -h looks roughly like this, which makes the above error
> surprising.
>
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/xvda1            7.9G  4.4G  3.5G  57% /
> tmpfs                 7.4G  4.0K  7.4G   1% /dev/shm
> /dev/xvdb              37G  3.3G   32G  10% /mnt
> /dev/xvdf              37G  2.0G   34G   6% /mnt2
> /dev/xvdv             500G   33M  500G   1% /vol
>
> I'm on an EC2 cluster (c3.xlarge + 5 x m3) that I launched using the
> spark-ec2 scripts and a clone of spark from today. The job I am running
> closely resembles the collaborative filtering example. This issue happens
> with the 1M version as well as the 10 million rating version of the
> MovieLens dataset.
>
> I have seen previous questions, but they haven't helped yet. For example, I
> tried setting the Spark tmp directory to the EBS volume at /vol/, both by
> editing the spark conf file (and copy-dir'ing it to the slaves) as well as
> through the SparkConf. Yet I still get the above error. Here is my current
> Spark config below. Note that I'm launching via ~/spark/bin/spark-submit.
>
> conf = SparkConf()
> conf.setAppName("RecommendALS").set("spark.local.dir",
> "/vol/").set("spark.executor.memory", "7g").set("spark.akka.frameSize",
> "100").setExecutorEnv("SPARK_JAVA_OPTS", " -Dspark.akka.frameSize=100")
> sc = SparkContext(conf=conf)
>
> Thanks for any advice,
> Chris
>