
Posted to user@spark.apache.org by mingweili0x <ml...@spokeo.com> on 2015/03/09 18:31:59 UTC

saveAsTextFile extremely slow near finish

I'm running a sort job in Spark. The program reads from
HDFS, sorts on composite keys, and then saves the partitioned result back to
HDFS.
The pseudocode looks like this:

input = sc.textFile
pairs = input.mapToPair
sorted = pairs.sortByKey
values = sorted.values
values.saveAsTextFile

Input size is ~160G, and I specified 1000 partitions in
JavaSparkContext.textFile and JavaPairRDD.sortByKey. In the web UI, the job is
split into two stages: saveAsTextFile and mapToPair. mapToPair finished
in 8 minutes, while saveAsTextFile took ~15 minutes to reach (2366/2373) progress,
and the last few tasks ran forever and never finished.

Cluster setup:
8 nodes
on each node: 15gb memory, 8 cores

running parameters:
--executor-memory 12G
--conf "spark.cores.max=60"

Thank you for any help.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-extremely-slow-near-finish-tp21978.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: saveAsTextFile extremely slow near finish

Posted by Sean Owen <so...@cloudera.com>.

This is more of an aside, but why repartition this data instead of letting
it define partitions naturally? You will end up with a similar number.
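
For a rough sense of what "naturally" works out to: textFile generally creates about one partition per input split, and for HDFS the split size usually tracks the block size. Below is a minimal sketch of that arithmetic in plain Java. It assumes a 128 MB block size and reads "~160G" as GiB; neither is stated in the thread, and the class and method names are made up for illustration. Real split computation is done per file and has more rules than this.

```java
// Hypothetical helper, not part of Spark or Hadoop.
final class SplitEstimate {

    // Ceiling division: how many blocks of blockBytes are needed to hold inputBytes.
    static long splitCount(long inputBytes, long blockBytes) {
        if (inputBytes < 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("input must be >= 0 and block size > 0");
        }
        return (inputBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024;
        long gib = 1024L * mib;
        long input = 160 * gib;  // "~160G" from the original post, read as GiB (assumption)
        long block = 128 * mib;  // assumed block size; check dfs.blocksize on the cluster
        System.out.println(splitCount(input, block)); // prints 1280
    }
}
```

Under those assumptions the default comes out in the same ballpark as the 1000 partitions requested explicitly.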

Re: saveAsTextFile extremely slow near finish

Posted by Imran Rashid <ir...@cloudera.com>.

Is your data skewed? Could it be that a few keys have a huge
number of records? You might consider outputting
(recordA, count)
(recordB, count)

instead of

recordA
recordA
recordA
...


You could do this with:

input = sc.textFile
pairsCounts = input.map{x => (x,1)}.reduceByKey{_ + _}
sorted = pairsCounts.sortByKey
sorted.saveAsTextFile
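
The skew question can be checked on a small sample before rerunning the job. sortByKey range-partitions the data, so all records with the same key land in the same output partition, and one very hot key means one very long task. Here is a plain-Java sketch (no Spark; the class name, method names, and sample data are hypothetical) that counts records per key and reports the share held by the heaviest key. The resulting map has the same (record, count) shape as the output suggested above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for inspecting key skew on a local sample.
final class SkewCheck {

    // Counts occurrences of each key; TreeMap keeps the keys in sorted order.
    static Map<String, Long> countByKey(List<String> keys) {
        Map<String, Long> counts = new TreeMap<>();
        for (String k : keys) {
            counts.merge(k, 1L, Long::sum);
        }
        return counts;
    }

    // Fraction of all records held by the most frequent key (0.0 for empty input).
    static double heaviestKeyShare(Map<String, Long> counts) {
        long total = 0;
        long max = 0;
        for (long c : counts.values()) {
            total += c;
            max = Math.max(max, c);
        }
        return total == 0 ? 0.0 : (double) max / total;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("a", "b", "a", "a", "c", "a");
        Map<String, Long> counts = countByKey(sample);
        System.out.println(counts);                   // prints {a=4, b=1, c=1}
        System.out.println(heaviestKeyShare(counts)); // about 0.67 for this sample
    }
}
```

A share close to 1.0 on a representative sample would suggest that a handful of keys dominate, which fits the pattern of a few tasks that never finish.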



Re: saveAsTextFile extremely slow near finish

Posted by Akhil Das <ak...@sigmoidanalytics.com>.

Don't you think 1000 partitions is too few for 160GB of data? You could also
try using KryoSerializer and enabling RDD compression.

Thanks
Best Regards
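
To put numbers on the partition-count question, here is a small back-of-the-envelope sketch in plain Java. It assumes the ~160G input (read as GiB) is spread evenly over the partitions, which a skewed dataset will not be, and it uses the 60 cores from spark.cores.max in the original post. The class and method names are made up for illustration.

```java
// Hypothetical helper, not part of Spark.
final class PartitionMath {

    // Average bytes per partition, assuming a perfectly even spread.
    static long bytesPerPartition(long totalBytes, int partitions) {
        if (partitions <= 0) {
            throw new IllegalArgumentException("partitions must be > 0");
        }
        return totalBytes / partitions;
    }

    // Number of scheduling "waves" needed to run all tasks on the given cores.
    static int waves(int tasks, int cores) {
        if (cores <= 0) {
            throw new IllegalArgumentException("cores must be > 0");
        }
        return (tasks + cores - 1) / cores;
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024;
        long total = 160L * 1024 * mib; // "~160G" read as GiB (assumption)
        int cores = 60;                 // spark.cores.max from the original post
        for (int partitions : new int[] {1000, 2000, 4000}) {
            System.out.println(partitions + " partitions: ~"
                    + bytesPerPartition(total, partitions) / mib + " MiB each, "
                    + waves(partitions, cores) + " waves");
        }
    }
}
```

At 1000 partitions this works out to roughly 164 MiB per partition and 17 waves of tasks. Raising the count makes each task smaller, but it would not help if a single key accounts for most of the records, since those records still end up in one partition.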
