Posted to user@spark.apache.org by Shangyu Luo <ls...@gmail.com> on 2013/10/08 06:50:26 UTC

The functionality of daemon.py?

Hello!
I am using the Python version of Spark 0.7.3. Recently, when I ran some Spark
programs on a cluster, I found that some processes invoked
by spark-0.7.3/python/pyspark/daemon.py would occupy the CPU for a long time
and consume a lot of memory (e.g., 5 GB per process). It seemed that the
Java process, which was invoked by
java -cp
:/usr/lib/spark-0.7.3/conf:/usr/lib/spark-0.7.3/core/target/scala-2.9.3/classes
... , was 'competing' with daemon.py for CPU resources. My understanding
was that the Java process should be responsible for the 'real'
computation in Spark.
So I am wondering what work daemon.py actually does. Is it normal for it
to consume this much CPU and memory?
Thanks!


Best,
Shangyu Luo
-- 
--

Shangyu, Luo
Department of Computer Science
Rice University

Re: The functionality of daemon.py?

Posted by Shangyu Luo <ls...@gmail.com>.
Also, I found that 'daemon.py' keeps running on a worker node even after I
terminate the Spark job at the master node, which seems a little strange
to me.


2013/10/8 Shangyu Luo <ls...@gmail.com>

> Hello Jey,
> Thank you for answering. I have found that there are about 6 or 7
> 'daemon.py' processes on one worker node. Does each core get its own
> 'daemon.py' process? How is the number of 'daemon.py' processes on a
> worker node decided? I have also found that there are many Spark-related
> Java processes on a worker node, so if the Java process on a worker node
> is just responsible for communication, why does Spark need so many Java
> processes?
> Overall, I think the main problem with my program is memory allocation.
> More specifically, in spark-env.sh there are two options,
> SPARK_DAEMON_MEMORY and SPARK_DAEMON_JAVA_OPTS. I can also set
> spark.executor.memory in SPARK_JAVA_OPTS. So if I have 68 GB of memory on
> a worker node, how should I distribute memory among these options? At
> present, I use the default values for SPARK_DAEMON_MEMORY and
> SPARK_DAEMON_JAVA_OPTS and set spark.executor.memory to 20g. It seems that
> Spark stores RDDs within the memory given by spark.executor.memory, and I
> find that each 'daemon.py' also consumes about 7 GB of memory. After
> running my program for a while, it uses up all the memory on a worker node
> and the master node reports connection errors. (I have 5 worker nodes,
> each with 8 cores.) So I am a little confused about what the three options
> are responsible for and how to distribute memory among them.
> Any suggestions would be appreciated.
> Thanks!
>
> Best,
> Shangyu
>
>
> 2013/10/8 Jey Kottalam <je...@cs.berkeley.edu>
>
>> Hi Shangyu,
>>
>> The daemon.py python process is the actual PySpark worker process, and
>> is launched by the Spark worker when running Python jobs. So, when
>> using PySpark, the "real computation" is handled by a python process
>> (via daemon.py), not a java process.
>>
>> Hope that helps,
>> -Jey
>>
>> On Mon, Oct 7, 2013 at 9:50 PM, Shangyu Luo <ls...@gmail.com> wrote:
>> > Hello!
>> > I am using the Python version of Spark 0.7.3. Recently, when I ran some
>> > Spark programs on a cluster, I found that some processes invoked by
>> > spark-0.7.3/python/pyspark/daemon.py would occupy the CPU for a long
>> > time and consume a lot of memory (e.g., 5 GB per process). It seemed
>> > that the Java process, which was invoked by
>> > java -cp
>> > :/usr/lib/spark-0.7.3/conf:/usr/lib/spark-0.7.3/core/target/scala-2.9.3/classes
>> > ... , was 'competing' with daemon.py for CPU resources. My understanding
>> > was that the Java process should be responsible for the 'real'
>> > computation in Spark.
>> > So I am wondering what work daemon.py actually does. Is it normal for
>> > it to consume this much CPU and memory?
>> > Thanks!
>> >
>> >
>> > Best,
>> > Shangyu Luo
>> > --
>> > --
>> >
>> > Shangyu, Luo
>> > Department of Computer Science
>> > Rice University
>> >
>>
>
>
>
> --
> --
>
> Shangyu, Luo
> Department of Computer Science
> Rice University
>
> --
> Not Just Think About It, But Do It!
> --
> Success is never final.
> --
> Losers always whine about their best
>



-- 
--

Shangyu, Luo
Department of Computer Science
Rice University

--
Not Just Think About It, But Do It!
--
Success is never final.
--
Losers always whine about their best
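
Stray workers of this kind are sometimes just a symptom of the driver never
shutting its SparkContext down; stopping the context explicitly at least
gives Spark a chance to tear down the executors and the daemon.py processes
they launched. A minimal PySpark sketch of that pattern (the master URL is a
placeholder, not something from this thread):

    from pyspark import SparkContext

    # Placeholder master URL; substitute the real standalone master.
    sc = SparkContext("spark://master:7077", "cleanup-demo")
    try:
        # Any PySpark job: the Python side of each task runs in a worker
        # process forked by daemon.py on the executor nodes.
        print(sc.parallelize(range(1000), 8).map(lambda x: x * x).count())
    finally:
        # Stopping the context explicitly asks Spark to shut down the
        # executors, which should take daemon.py and its workers with them.
        sc.stop()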

Re: The functionality of daemon.py?

Posted by Shangyu Luo <ls...@gmail.com>.
Hello Jey,
Thank you for answering. I have found that there are about 6 or 7
'daemon.py' processes on one worker node. Does each core get its own
'daemon.py' process? How is the number of 'daemon.py' processes on a worker
node decided? I have also found that there are many Spark-related Java
processes on a worker node, so if the Java process on a worker node is just
responsible for communication, why does Spark need so many Java processes?
Overall, I think the main problem with my program is memory allocation.
More specifically, in spark-env.sh there are two options,
SPARK_DAEMON_MEMORY and SPARK_DAEMON_JAVA_OPTS. I can also set
spark.executor.memory in SPARK_JAVA_OPTS. So if I have 68 GB of memory on a
worker node, how should I distribute memory among these options? At
present, I use the default values for SPARK_DAEMON_MEMORY and
SPARK_DAEMON_JAVA_OPTS and set spark.executor.memory to 20g. It seems that
Spark stores RDDs within the memory given by spark.executor.memory, and I
find that each 'daemon.py' also consumes about 7 GB of memory. After
running my program for a while, it uses up all the memory on a worker node
and the master node reports connection errors. (I have 5 worker nodes, each
with 8 cores.) So I am a little confused about what the three options are
responsible for and how to distribute memory among them.
Any suggestions would be appreciated.
Thanks!

Best,
Shangyu


2013/10/8 Jey Kottalam <je...@cs.berkeley.edu>

> Hi Shangyu,
>
> The daemon.py python process is the actual PySpark worker process, and
> is launched by the Spark worker when running Python jobs. So, when
> using PySpark, the "real computation" is handled by a python process
> (via daemon.py), not a java process.
>
> Hope that helps,
> -Jey
>
> On Mon, Oct 7, 2013 at 9:50 PM, Shangyu Luo <ls...@gmail.com> wrote:
> > Hello!
> > I am using the Python version of Spark 0.7.3. Recently, when I ran some
> > Spark programs on a cluster, I found that some processes invoked by
> > spark-0.7.3/python/pyspark/daemon.py would occupy the CPU for a long
> > time and consume a lot of memory (e.g., 5 GB per process). It seemed
> > that the Java process, which was invoked by
> > java -cp
> > :/usr/lib/spark-0.7.3/conf:/usr/lib/spark-0.7.3/core/target/scala-2.9.3/classes
> > ... , was 'competing' with daemon.py for CPU resources. My understanding
> > was that the Java process should be responsible for the 'real'
> > computation in Spark.
> > So I am wondering what work daemon.py actually does. Is it normal for
> > it to consume this much CPU and memory?
> > Thanks!
> >
> >
> > Best,
> > Shangyu Luo
> > --
> > --
> >
> > Shangyu, Luo
> > Department of Computer Science
> > Rice University
> >
>



-- 
--

Shangyu, Luo
Department of Computer Science
Rice University

--
Not Just Think About It, But Do It!
--
Success is never final.
--
Losers always whine about their best
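
For what it's worth, the numbers in the message above already hint at why a
node runs out of memory: PySpark tends to end up with roughly one daemon.py
worker per concurrently running task (so up to one per core, which matches
the 6 or 7 processes observed on an 8-core node), and those workers share
the 68 GB with the executor's JVM heap. A rough back-of-the-envelope check
in Python, using only the figures quoted above plus that one-worker-per-core
assumption:

    from __future__ import print_function

    # Per-node memory budget built from the numbers reported in this thread.
    cores_per_node = 8        # "I have 5 worker nodes, each with 8 cores"
    python_worker_gb = 7      # observed footprint of each daemon.py worker
    executor_heap_gb = 20     # spark.executor.memory = 20g
    total_node_gb = 68        # memory available on a worker node

    needed_gb = cores_per_node * python_worker_gb + executor_heap_gb
    print(needed_gb, "GB needed vs.", total_node_gb, "GB available")
    # -> 76 GB needed vs. 68 GB available: the node is oversubscribed even
    #    before the OS, the standalone worker daemon, and buffers are counted.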

Re: The functionality of daemon.py?

Posted by Jey Kottalam <je...@cs.berkeley.edu>.
Hi Shangyu,

The daemon.py python process is the actual PySpark worker process, and
is launched by the Spark worker when running Python jobs. So, when
using PySpark, the "real computation" is handled by a python process
(via daemon.py), not a java process.

Hope that helps,
-Jey

On Mon, Oct 7, 2013 at 9:50 PM, Shangyu Luo <ls...@gmail.com> wrote:
> Hello!
> I am using the Python version of Spark 0.7.3. Recently, when I ran some
> Spark programs on a cluster, I found that some processes invoked by
> spark-0.7.3/python/pyspark/daemon.py would occupy the CPU for a long time
> and consume a lot of memory (e.g., 5 GB per process). It seemed that the
> Java process, which was invoked by
> java -cp
> :/usr/lib/spark-0.7.3/conf:/usr/lib/spark-0.7.3/core/target/scala-2.9.3/classes
> ... , was 'competing' with daemon.py for CPU resources. My understanding
> was that the Java process should be responsible for the 'real'
> computation in Spark.
> So I am wondering what work daemon.py actually does. Is it normal for it
> to consume this much CPU and memory?
> Thanks!
>
>
> Best,
> Shangyu Luo
> --
> --
>
> Shangyu, Luo
> Department of Computer Science
> Rice University
>
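
To see Jey's point directly, one can ask PySpark which processes actually
execute a mapped function: the PIDs that come back belong to Python worker
processes (spawned via daemon.py), not to the executor's java process. A
minimal sketch, assuming a working PySpark setup; the "local[4]" master and
the numbers are just placeholders:

    from __future__ import print_function
    import os
    from pyspark import SparkContext

    # "local[4]" is a placeholder; the same idea applies on a cluster.
    sc = SparkContext("local[4]", "who-runs-my-code")

    # The lambda is pickled, shipped to the workers, and executed by Python
    # worker processes spawned through daemon.py, not by the JVM.
    pids = sc.parallelize(range(16), 8).map(lambda _: os.getpid()).collect()

    print("driver pid:", os.getpid())
    print("python worker pids:", sorted(set(pids)))  # not the JVM's pid
    sc.stop()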