Posted to issues@spark.apache.org by "Murtaza Kanchwala (JIRA)" <ji...@apache.org> on 2015/07/15 20:12:04 UTC
[jira] [Created] (SPARK-9072) Parquet: Writing data to S3 very slowly
Murtaza Kanchwala created SPARK-9072:
----------------------------------------
Summary: Parquet: Writing data to S3 very slowly
Key: SPARK-9072
URL: https://issues.apache.org/jira/browse/SPARK-9072
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Murtaza Kanchwala
Priority: Critical
Fix For: 1.5.0
I've created Spark programs that convert plain text files to Parquet and to CSV on S3.
There is around 8 TB of data, and I need to compress it down for further processing on Amazon EMR.
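For reference, the conversion job looks roughly like the sketch below (Scala, Spark 1.4-style SQLContext; the tab delimiter and the three-column schema are just placeholders, not my real layout):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object TextToParquet {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TextToParquet"))
    val sqlContext = new SQLContext(sc)

    // Read the raw text input from S3 and split each line into columns
    // (tab-delimited here only as a placeholder).
    val rows = sc.textFile("s3n://<SameBucket>/input")
      .map(_.split("\t"))
      .map(fields => Row(fields: _*))

    // Placeholder schema: three string columns; the real job defines its own.
    val schema = StructType(Seq("c1", "c2", "c3").map(StructField(_, StringType)))
    val df = sqlContext.createDataFrame(rows, schema)

    // Writing this DataFrame back to S3 as Parquet is the step that slows down.
    df.write.parquet("s3n://<SameBucket>/output/parquet")
  }
}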
Results :
1) Text -> CSV: took 1.2 hrs to transform the 8 TB of data and wrote it to S3 successfully without any problems.
2) Text -> Parquet: the job completed in the same time (i.e. 1.2 hrs), but even after the job completes it keeps spilling/writing the data to S3 separately, which makes it much slower and starves the cluster.
Input : s3n://<SameBucket>/input
Output : s3n://<SameBucket>/output/parquet
For example, if I have around 10K files, they are written back to S3 at a rate of roughly 1000 files per 20 minutes.
Note :
Also, I found that the program creates a temp folder in the S3 output location, and in the logs I've seen S3 read delays.
Can anyone tell me what I am doing wrong? Or is there anything I need to add so that the Spark app does not create the temp folder on S3 and writes from EMR to S3 as fast as saveAsTextFile does? Thanks
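One thing I am planning to try (just a guess on my side, not a confirmed fix) is the direct Parquet output committer shipped with Spark 1.4, which as I understand it writes the Parquet files straight to the final S3 location instead of a _temporary folder followed by a rename; the class/package name below is from Spark 1.4 and may differ in later releases, and it is reportedly not safe with speculative execution or append:

import org.apache.spark.SparkConf

// Possible workaround (assumption, not a confirmed fix): have Parquet write
// directly to the final S3 location and skip the _temporary folder + rename.
// Class name as of Spark 1.4; not safe with speculative execution or append.
val conf = new SparkConf()
  .setAppName("TextToParquet")
  .set("spark.sql.parquet.output.committer.class",
       "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")
  .set("spark.speculation", "false")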
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org