Posted to user@spark.apache.org by Gautham <ga...@gmail.com> on 2014/12/09 19:59:41 UTC

pyspark sc.textFile uses only 4 out of 32 threads per node

I am having an issue with PySpark launched on EC2 (using spark-ec2) with 5
r3.4xlarge machines, each with 32 threads and 240 GB of RAM. When I use
sc.textFile to load data from a number of gz files, it does not progress as
fast as expected. When I log in to a worker node and run top, I see only 4
threads at 100% CPU; the remaining 28 cores are idle. This is not an issue
when processing the strings after loading, when all the cores are used to
process the data.

Could you please help me with this? What setting can I change to get the CPU
usage back up to full?
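
Roughly what I am running, in case it helps (just a sketch; the bucket path
and app name are placeholders, not my real setup):

from pyspark import SparkContext

sc = SparkContext(appName="gz-load-test")          # in the pyspark shell, sc already exists
lines = sc.textFile("s3n://my-bucket/logs/*.gz")   # ~100 gzipped text files; placeholder path
print(lines.count())                               # force the load: only 4 threads per node stay busy
words = lines.flatMap(lambda line: line.split())
print(words.count())                               # later string processing does use all the cores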



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-sc-textFile-uses-only-4-out-of-32-threads-per-node-tp20595.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: pyspark sc.textFile uses only 4 out of 32 threads per node

Posted by Nicholas Chammas <ni...@gmail.com>.
Are the gz files roughly equal in size? Do you know that your partitions
are roughly balanced? Perhaps some cores get assigned tasks that end very
quickly, while others get most of the work.
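
For example, something rough like this (assuming the RDD you loaded is
called rdd) would show whether a few partitions hold most of the records:

# Number of records per partition; heavy skew would mean a few tasks
# (and therefore a few cores) end up doing most of the work.
sizes = rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
print(sorted(sizes))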


Re: pyspark sc.textFile uses only 4 out of 32 threads per node

Posted by Gautham Anil <ga...@gmail.com>.
Hi,

Thanks for getting back to me. Sorry for the delay. I am still having
this issue.

@sun: To clarify, the machine actually has 16 usable threads and the
job has more than 100 gzip files, so there are enough partitions to
use all threads.

@nicholas: The number of partitions matches the number of files: > 100.

@Sebastian: I understand the lazy loading behavior. For this reason, I
usually use a .count() to force the transformation (.first() is not
enough). Still, during the transformation, only 4 cores are used to
process the input files.

I don't know whether other people have noticed this issue. Can anyone
reproduce it with v1.1?
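
For reference, a stripped-down version of what I run (the path is a
placeholder):

rdd = sc.textFile("s3n://my-bucket/data/*.gz")   # more than 100 gzip files; placeholder path
print(rdd.getNumPartitions())                    # prints > 100, matching the file count
rdd.count()                                      # forces the load; only 4 cores are busy during this step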


-- 
Gautham Anil

"The first principle is that you must not fool yourself. And you are
the easiest person to fool" - Richard P. Feynman



Re: pyspark sc.textFile uses only 4 out of 32 threads per node

Posted by Nicholas Chammas <ni...@gmail.com>.
Rui is correct.

Check how many partitions your RDD has after loading the gzipped files,
e.g. with rdd.getNumPartitions().

If that number is much lower than the number of cores in your cluster (in
your case I suspect it is 4), then explicitly repartition the RDD to match
the number of cores in your cluster, or some multiple thereof.

For example:

new_rdd = rdd.repartition(sc.defaultParallelism * 3)

Operations on new_rdd should utilize all the cores in your cluster.
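
Putting it together, a rough sketch (the path and the multiplier of 3 are
just illustrative):

rdd = sc.textFile("s3n://my-bucket/data/*.gz")         # illustrative path
print(rdd.getNumPartitions())                          # partitions produced by the load

new_rdd = rdd.repartition(sc.defaultParallelism * 3)   # spread the data across all cores
print(new_rdd.count())                                 # work on new_rdd now uses the whole cluster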

Nick


RE: pyspark sc.textFile uses only 4 out of 32 threads per node

Posted by "Sun, Rui" <ru...@intel.com>.
Gautham,

How many gz files do you have? Maybe the reason is that a gz file is compressed in a way that can't be split for processing by MapReduce. A single gz file can only be processed by a single mapper, so the CPU threads can't be fully utilized.
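
You can see this from the partition count, which with gzip input typically equals the number of files (a rough sketch; the path is a placeholder):

rdd = sc.textFile("s3n://my-bucket/data/*.gz")   # placeholder path
print(rdd.getNumPartitions())                    # usually == number of .gz files, since gzip is not splittable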



Re: pyspark sc.textFile uses only 4 out of 32 threads per node

Posted by Sebastián Ramírez <se...@senseta.com>.
Are you reading the file from your driver (main / master) program?

Is your file in a distributed file system like HDFS, available to all your nodes?

It might be due to the laziness of transformations:
http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations

"Transformations" are lazy and aren't applied until they are needed by an
"action" (and, in my experience, that used to apply to reads as well).
You can try calling .first() on your RDD once in a while to force it to
load the RDD into your cluster (though it might not be the cleanest way to
do it).
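
For example (just a sketch; the path is a placeholder), any action forces
the read to actually happen across the cluster:

rdd = sc.textFile("hdfs:///data/input/*.gz")   # placeholder path
rdd.count()                                    # an action: forces Spark to read and process the files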


*Sebastián Ramírez*
Diseñador de Algoritmos

 <http://www.senseta.com>
________________
 Tel: (+571) 795 7950 ext: 1012
 Cel: (+57) 300 370 77 10
 Calle 73 No 7 - 06  Piso 4
 Linkedin: co.linkedin.com/in/tiangolo/
 Twitter: @tiangolo <https://twitter.com/tiangolo>
 Email: sebastian.ramirez@senseta.com
 www.senseta.com


-- 
*----------------------------------------------------*
*This e-mail transmission, including any attachments, is intended only for 
the named recipient(s) and may contain information that is privileged, 
confidential and/or exempt from disclosure under applicable law. If you 
have received this transmission in error, or are not the named 
recipient(s), please notify Senseta immediately by return e-mail and 
permanently delete this transmission, including any attachments.*