Posted to issues@spark.apache.org by "Piero Cinquegrana (JIRA)" <ji...@apache.org> on 2019/04/27 15:37:00 UTC

[jira] [Commented] (SPARK-20049) Writing data to Parquet with partitions takes very long after the job finishes

    [ https://issues.apache.org/jira/browse/SPARK-20049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16827645#comment-16827645 ] 

Piero Cinquegrana commented on SPARK-20049:
-------------------------------------------

[~yumwang] and [~jdramosf], we encountered a similar issue when writing partitioned Parquet files to S3 from many gzipped input files. Our workaround was to write unpartitioned Parquet files as an intermediate job.

> Writing data to Parquet with partitions takes very long after the job finishes
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-20049
>                 URL: https://issues.apache.org/jira/browse/SPARK-20049
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output, PySpark, SQL
>    Affects Versions: 2.1.0
>         Environment: Spark 2.1.0, CDH 5.8, Python 3.4, Java 8, Debian GNU/Linux 8.7 (jessie)
>            Reporter: Jakub Nowacki
>            Priority: Minor
>
> I was testing writing a DataFrame to partitioned Parquet files. The command is quite straightforward, and the data set is a sample from a larger data set in Parquet; the job is run in PySpark on YARN and written to HDFS:
> {code}
> # there is column 'date' in df
> df.write.partitionBy("date").parquet("dest_dir")
> {code}
> The reading part took as long as usual, but after the job had been marked as finished in PySpark and the UI, the Python interpreter still showed it as busy. Indeed, when I checked the HDFS folder, I noticed that the files were still being transferred from {{dest_dir/_temporary}} to all the {{dest_dir/date=*}} folders.
> First of all, it takes much longer than saving the same data set without partitioning. Second, it is done in the background, without visible progress of any kind.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
