Posted to user@spark.apache.org by Matthew Cheah <ma...@gmail.com> on 2014/03/10 18:41:02 UTC

"Too many open files" exception on reduceByKey

Hi everyone,

My team (cc'ed in this e-mail) and I are running a Spark reduceByKey
operation on a cluster of 10 slaves where I don't have the privileges to
set "ulimit -n" to a higher number. I'm running on a cluster where "ulimit
-n" returns 1024 on each machine.

When I attempt to run this job with the data originating from a text file,
stored in an HDFS cluster running on the same nodes as the Spark cluster,
the job crashes with the message, "Too many open files".

My question is, why are so many files being created, and is there a way to
configure the Spark context to avoid spawning that many files? I am already
setting spark.shuffle.consolidateFiles to true.
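
(For reference, a minimal sketch of the setup in question, assuming the flag is
set on a SparkConf before the SparkContext is created; the app name is just a
placeholder.)

import org.apache.spark.{SparkConf, SparkContext}

// Enable shuffle file consolidation before creating the context.
// spark.shuffle.consolidateFiles is the property mentioned above.
val conf = new SparkConf()
  .setAppName("reduceByKey-job")
  .set("spark.shuffle.consolidateFiles", "true")
val sc = new SparkContext(conf)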

I want to repeat - I can't change the maximum number of open file
descriptors on the machines. This cluster is not owned by me and the system
administrator is responding quite slowly.

Thanks,

-Matt Cheah

Re: "Too many open files" exception on reduceByKey

Posted by Matthew Cheah <ma...@gmail.com>.
Sorry, I also have some follow-up questions.

"In general if a node in your cluster has C assigned cores and you run a
job with X reducers then Spark will open C*X files in parallel and start
writing."

Some questions came to mind just now:
1) Could you give a brief overview of what these files are being used for?
2) Are these C*X files opened on each machine? Also, is C the total number of
cores among all machines in the cluster?

Thanks,

-Matt Cheah


On Tue, Mar 11, 2014 at 4:35 PM, Matthew Cheah <ma...@gmail.com> wrote:

> Thanks. Just curious, is there a default number of reducers that are used?
>
> -Matt Cheah
>
>
> On Mon, Mar 10, 2014 at 7:22 PM, Patrick Wendell <pw...@gmail.com> wrote:
>
>> Hey Matt,
>>
>> The best way is definitely just to increase the ulimit if possible;
>> this is sort of an assumption we make in Spark, that clusters will be
>> able to raise it.
>>
>> You might be able to hack around this by decreasing the number of
>> reducers but this could have some performance implications for your
>> job.
>>
>> In general if a node in your cluster has C assigned cores and you run
>> a job with X reducers then Spark will open C*X files in parallel and
>> start writing. Shuffle consolidation will help decrease the total
>> number of files created but the number of file handles open at any
>> time doesn't change so it won't help the ulimit problem.
>>
>> This means you'll have to use fewer reducers (e.g. pass reduceByKey a
>> number of reducers) or use fewer cores on each machine.
>>
>> - Patrick
>>
>> On Mon, Mar 10, 2014 at 10:41 AM, Matthew Cheah
>> <ma...@gmail.com> wrote:
>> > Hi everyone,
>> >
>> > My team (cc'ed in this e-mail) and I are running a Spark reduceByKey
>> > operation on a cluster of 10 slaves where I don't have the privileges
>> to set
>> > "ulimit -n" to a higher number. I'm running on a cluster where "ulimit
>> -n"
>> > returns 1024 on each machine.
>> >
>> > When I attempt to run this job with the data originating from a text
>> file,
>> > stored in an HDFS cluster running on the same nodes as the Spark
>> cluster,
>> > the job crashes with the message, "Too many open files".
>> >
>> > My question is, why are so many files being created, and is there a way
>> to
>> > configure the Spark context to avoid spawning that many files? I am
>> already
>> > setting spark.shuffle.consolidateFiles to true.
>> >
>> > I want to repeat - I can't change the maximum number of open file
>> > descriptors on the machines. This cluster is not owned by me and the
>> system
>> > administrator is responding quite slowly.
>> >
>> > Thanks,
>> >
>> > -Matt Cheah
>>
>
>

Re: "Too many open files" exception on reduceByKey

Posted by Matthew Cheah <ma...@gmail.com>.
Thanks. Just curious, is there a default number of reducers that are used?

-Matt Cheah


On Mon, Mar 10, 2014 at 7:22 PM, Patrick Wendell <pw...@gmail.com> wrote:

> Hey Matt,
>
> The best way is definitely just to increase the ulimit if possible;
> this is sort of an assumption we make in Spark, that clusters will be
> able to raise it.
>
> You might be able to hack around this by decreasing the number of
> reducers but this could have some performance implications for your
> job.
>
> In general if a node in your cluster has C assigned cores and you run
> a job with X reducers then Spark will open C*X files in parallel and
> start writing. Shuffle consolidation will help decrease the total
> number of files created but the number of file handles open at any
> time doesn't change so it won't help the ulimit problem.
>
> This means you'll have to use fewer reducers (e.g. pass reduceByKey a
> number of reducers) or use fewer cores on each machine.
>
> - Patrick
>
> On Mon, Mar 10, 2014 at 10:41 AM, Matthew Cheah
> <ma...@gmail.com> wrote:
> > Hi everyone,
> >
> > My team (cc'ed in this e-mail) and I are running a Spark reduceByKey
> > operation on a cluster of 10 slaves where I don't have the privileges to
> set
> > "ulimit -n" to a higher number. I'm running on a cluster where "ulimit
> -n"
> > returns 1024 on each machine.
> >
> > When I attempt to run this job with the data originating from a text
> file,
> > stored in an HDFS cluster running on the same nodes as the Spark cluster,
> > the job crashes with the message, "Too many open files".
> >
> > My question is, why are so many files being created, and is there a way
> to
> > configure the Spark context to avoid spawning that many files? I am
> already
> > setting spark.shuffle.consolidateFiles to true.
> >
> > I want to repeat - I can't change the maximum number of open file
> > descriptors on the machines. This cluster is not owned by me and the
> system
> > administrator is responding quite slowly.
> >
> > Thanks,
> >
> > -Matt Cheah
>

Re: "Too many open files" exception on reduceByKey

Posted by tian zhang <tz...@yahoo.com.INVALID>.
You are right, I did find that mesos overwrites this to a smaller number. So we will modify that and try to run again. Thanks!
Tian


On Thursday, October 8, 2015 4:18 PM, DB Tsai <db...@dbtsai.com> wrote:

Try running this to see the actual ulimit. We found that mesos overrides the
ulimit, which causes the issue.

import sys.process._
val p = 1 to 100
val rdd = sc.parallelize(p, 100)
val a = rdd.map(x => Seq("sh", "-c", "ulimit -n").!!.toDouble.toLong).collect


Sincerely,

DB Tsai
----------------------------------------------------------
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D

On Thu, Oct 8, 2015 at 3:22 PM, Tian Zhang <tz...@yahoo.com> wrote:

I hit this issue with a Spark 1.3.0 stateful application (using
updateStateByKey) on mesos. It fails after running fine for about 24 hours.
The error stack trace is below; I checked ulimit -n and we have very large
numbers set on the machines.
What else could be wrong?
15/09/27 18:45:11 WARN scheduler.TaskSetManager: Lost task 2.0 in stage
113727.0 (TID 833758, ip-10-112-10-221.ec2.internal):
java.io.FileNotFoundException:
/media/ephemeral0/oncue/mesos-slave/slaves/20150512-215537-2165010442-5050-1730-S5/frameworks/20150825-175705-2165010442-5050-13705-0338/executors/0/runs/19342849-d076-483c-88da-747896e19b93/./spark-6efa2dcd-aea7-478e-9fa9-6e0973578eb4/blockmgr-33b1e093-6dd6-4462-938c-2597516272a9/27/shuffle_535_2_0.index
(Too many open files)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
        at
org.apache.spark.shuffle.IndexShuffleBlockManager.writeIndexFile(IndexShuffleBlockManager.scala:85)
        at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:69)
        at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:64)
        at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)





--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Too-many-open-files-exception-on-reduceByKey-tp2462p24985.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org





  

Re: "Too many open files" exception on reduceByKey

Posted by DB Tsai <db...@dbtsai.com>.
Try running this to see the actual ulimit. We found that mesos overrides the
ulimit, which causes the issue.

import sys.process._

// Run "ulimit -n" in a shell from inside each of the 100 tasks and collect
// the results on the driver, so you see the limit the executors actually get.
val p = 1 to 100
val rdd = sc.parallelize(p, 100)
val a = rdd.map(x => Seq("sh", "-c", "ulimit -n").!!.toDouble.toLong).collect
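
(When this is run in a spark-shell against the cluster, a ends up with one value
per task; if mesos is clamping the limit for the executor processes, those values
will be lower than what an interactive "ulimit -n" on the same machines reports.)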




Sincerely,

DB Tsai
----------------------------------------------------------
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D
<https://pgp.mit.edu/pks/lookup?search=0x59DF55B8AF08DF8D>

On Thu, Oct 8, 2015 at 3:22 PM, Tian Zhang <tz...@yahoo.com> wrote:

> I hit this issue with a Spark 1.3.0 stateful application (using
> updateStateByKey) on mesos. It fails after running fine for about 24 hours.
> The error stack trace is below; I checked ulimit -n and we have very large
> numbers set on the machines.
> What else could be wrong?
> 15/09/27 18:45:11 WARN scheduler.TaskSetManager: Lost task 2.0 in stage
> 113727.0 (TID 833758, ip-10-112-10-221.ec2.internal):
> java.io.FileNotFoundException:
>
> /media/ephemeral0/oncue/mesos-slave/slaves/20150512-215537-2165010442-5050-1730-S5/frameworks/20150825-175705-2165010442-5050-13705-0338/executors/0/runs/19342849-d076-483c-88da-747896e19b93/./spark-6efa2dcd-aea7-478e-9fa9-6e0973578eb4/blockmgr-33b1e093-6dd6-4462-938c-2597516272a9/27/shuffle_535_2_0.index
> (Too many open files)
>         at java.io.FileOutputStream.open(Native Method)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
>         at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
>         at
>
> org.apache.spark.shuffle.IndexShuffleBlockManager.writeIndexFile(IndexShuffleBlockManager.scala:85)
>         at
>
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:69)
>         at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>         at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>         at org.apache.spark.scheduler.Task.run(Task.scala:64)
>         at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>         at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Too-many-open-files-exception-on-reduceByKey-tp2462p24985.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>

Re: "Too many open files" exception on reduceByKey

Posted by Tian Zhang <tz...@yahoo.com>.
It turns out that mesos can override the OS "ulimit -n" setting, so we have
increased the "ulimit -n" setting for the mesos slaves.
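
(A quick way to confirm the new limit is actually visible to the executors is a
variation of the probe posted earlier in this thread, run from a spark-shell
attached to the cluster; sc is the shell's SparkContext.)

import sys.process._

// Collect the distinct "ulimit -n" values the tasks themselves see; after the
// mesos slave change these should all show the raised limit.
val limits = sc.parallelize(1 to 100, 100)
  .map(_ => Seq("sh", "-c", "ulimit -n").!!.trim.toLong)
  .collect()
  .distinct
println("ulimit -n seen by tasks: " + limits.mkString(", "))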



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Too-many-open-files-exception-on-reduceByKey-tp2462p25019.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: "Too many open files" exception on reduceByKey

Posted by Tian Zhang <tz...@yahoo.com>.
I hit this issue with a Spark 1.3.0 stateful application (using
updateStateByKey) on mesos. It fails after running fine for about 24 hours.
The error stack trace is below; I checked ulimit -n and we have very large
numbers set on the machines.
What else could be wrong?
15/09/27 18:45:11 WARN scheduler.TaskSetManager: Lost task 2.0 in stage
113727.0 (TID 833758, ip-10-112-10-221.ec2.internal):
java.io.FileNotFoundException:
/media/ephemeral0/oncue/mesos-slave/slaves/20150512-215537-2165010442-5050-1730-S5/frameworks/20150825-175705-2165010442-5050-13705-0338/executors/0/runs/19342849-d076-483c-88da-747896e19b93/./spark-6efa2dcd-aea7-478e-9fa9-6e0973578eb4/blockmgr-33b1e093-6dd6-4462-938c-2597516272a9/27/shuffle_535_2_0.index
(Too many open files)
        at java.io.FileOutputStream.open(Native Method)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
        at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
        at
org.apache.spark.shuffle.IndexShuffleBlockManager.writeIndexFile(IndexShuffleBlockManager.scala:85)
        at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:69)
        at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
        at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:64)
        at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)





--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Too-many-open-files-exception-on-reduceByKey-tp2462p24985.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: "Too many open files" exception on reduceByKey

Posted by Patrick Wendell <pw...@gmail.com>.
Hey Matt,

The best way is definitely just to increase the ulimit if possible;
this is sort of an assumption we make in Spark, that clusters will be
able to raise it.

You might be able to hack around this by decreasing the number of
reducers but this could have some performance implications for your
job.

In general if a node in your cluster has C assigned cores and you run
a job with X reducers then Spark will open C*X files in parallel and
start writing. Shuffle consolidation will help decrease the total
number of files created but the number of file handles open at any
time doesn't change so it won't help the ulimit problem.

This means you'll have to use fewer reducers (e.g. pass reduceByKey a
number of reducers) or use fewer cores on each machine.

- Patrick
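
(To make the arithmetic concrete with illustrative numbers that are not from the
thread: with C = 8 cores on a node and X = 200 reducers, up to 8 * 200 = 1600
shuffle files can be open on that node at once, well past a 1024 ulimit; with
X = 100 the worst case drops to 800. A minimal sketch of passing the reducer
count explicitly, assuming a spark-shell sc and a hypothetical input path:)

// Illustrative only: cap the reducers at 100 so that cores-per-node times
// reducers stays under the 1024 file-descriptor limit (8 * 100 = 800).
val counts = sc.textFile("hdfs:///path/to/input")
  .map(line => (line, 1))
  .reduceByKey(_ + _, 100)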

On Mon, Mar 10, 2014 at 10:41 AM, Matthew Cheah
<ma...@gmail.com> wrote:
> Hi everyone,
>
> My team (cc'ed in this e-mail) and I are running a Spark reduceByKey
> operation on a cluster of 10 slaves where I don't have the privileges to set
> "ulimit -n" to a higher number. I'm running on a cluster where "ulimit -n"
> returns 1024 on each machine.
>
> When I attempt to run this job with the data originating from a text file,
> stored in an HDFS cluster running on the same nodes as the Spark cluster,
> the job crashes with the message, "Too many open files".
>
> My question is, why are so many files being created, and is there a way to
> configure the Spark context to avoid spawning that many files? I am already
> setting spark.shuffle.consolidateFiles to true.
>
> I want to repeat - I can't change the maximum number of open file
> descriptors on the machines. This cluster is not owned by me and the system
> administrator is responding quite slowly.
>
> Thanks,
>
> -Matt Cheah