Posted to issues@spark.apache.org by "Anbu Cheeralan (JIRA)" <ji...@apache.org> on 2016/12/15 20:54:59 UTC

[jira] [Comment Edited] (SPARK-17493) Spark Job hangs while DataFrame writing to HDFS path with parquet mode

    [ https://issues.apache.org/jira/browse/SPARK-17493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15752443#comment-15752443 ] 

Anbu Cheeralan edited comment on SPARK-17493 at 12/15/16 8:54 PM:
------------------------------------------------------------------

[~sowen] I hit a similar problem while writing to Google Cloud Storage. The issue appears to be specific to writes against object stores, and it happens in append mode.

In org.apache.spark.sql.execution.datasources.DataSource.write(), the following code causes a huge number of RPC calls when the file system is an object store (S3, GS):
{quote}
          if (mode == SaveMode.Append) {
            val existingPartitionColumns = Try {
              resolveRelation()
                .asInstanceOf[HadoopFsRelation]
                .location
                .partitionSpec()
                .partitionColumns
                .fieldNames
                .toSeq
            }.getOrElse(Seq.empty[String])
{quote}
There should be a flag to skip the partition match check in append mode. I can work on a patch.
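
A minimal sketch of how such a flag could gate the check. It does not use Spark: `skipPartitionCheck` and `existingPartitionColumns` are hypothetical names chosen for illustration, and the `lookup` function stands in for the `resolveRelation()...fieldNames` chain quoted above, which is where the object-store calls would be made.

```scala
import scala.util.Try

object AppendPartitionCheckSketch {
  // Stand-in for the expensive lookup in the quoted snippet
  // (resolveRelation() ... partitionColumns.fieldNames).
  type PartitionLookup = () => Seq[String]

  /** Returns the existing partition columns, or None when the check is skipped.
    * `skipPartitionCheck` is a hypothetical flag, not an existing Spark option.
    */
  def existingPartitionColumns(
      appendMode: Boolean,
      skipPartitionCheck: Boolean,
      lookup: PartitionLookup): Option[Seq[String]] =
    if (appendMode && !skipPartitionCheck) {
      // Same shape as the quoted code: a failed lookup counts as "no partition columns".
      Some(Try(lookup()).getOrElse(Seq.empty[String]))
    } else {
      // The lookup is never invoked, so no listing calls are issued.
      None
    }

  def main(args: Array[String]): Unit = {
    var lookups = 0
    val lookup: PartitionLookup = () => { lookups += 1; Seq("event_date") }

    val checked = existingPartitionColumns(appendMode = true, skipPartitionCheck = false, lookup)
    val skipped = existingPartitionColumns(appendMode = true, skipPartitionCheck = true, lookup)

    println(s"checked=$checked skipped=$skipped lookups=$lookups")
  }
}
```

With the flag set, the lookup is not called at all, which is the point of the proposal. The trade-off is that a mismatch between the new and existing partition columns would no longer be detected before the write.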



> Spark Job hangs while DataFrame writing to HDFS path with parquet mode
> ----------------------------------------------------------------------
>
>                 Key: SPARK-17493
>                 URL: https://issues.apache.org/jira/browse/SPARK-17493
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.0.0
>         Environment: AWS Cluster
>            Reporter: Gautam Solanki
>
> While saving an RDD to an HDFS path in Parquet format with rddout.write.partitionBy("event_date").mode(org.apache.spark.sql.SaveMode.Append).parquet("hdfs:////tmp//rddout_parquet_full_hdfs1//"), the Spark job hung because two write tasks with a Shuffle Read size of 0 could not complete, even though the executors had notified the driver that these two tasks were complete.
>  


