Posted to issues@spark.apache.org by "Jeffrey Quinn (JIRA)" <ji...@apache.org> on 2017/05/30 19:18:04 UTC

[jira] [Updated] (SPARK-20925) Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy

     [ https://issues.apache.org/jira/browse/SPARK-20925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeffrey Quinn updated SPARK-20925:
----------------------------------
    Description: 
Observed under the following conditions:

Spark Version: Spark 2.1.0
Hadoop Version: Amazon 2.7.3 (emr-5.5.0)
spark.submit.deployMode = client
spark.master = yarn
spark.driver.memory = 10g
spark.shuffle.service.enabled = true
spark.dynamicAllocation.enabled = true
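
For context, a minimal sketch of how a session with these settings might be constructed (illustrative only; note that in client deploy mode spark.driver.memory only takes effect when passed to spark-submit, e.g. --driver-memory 10g, not when set programmatically, because the driver JVM is already running by then):

```
import org.apache.spark.sql.SparkSession

// Sketch of the session setup described above, not taken from the original job code.
// spark.driver.memory is omitted here because in client mode it must be supplied
// to spark-submit (or spark-defaults.conf), not set on the builder.
val sparkSession = SparkSession.builder()
  .master("yarn")
  .config("spark.shuffle.service.enabled", "true")
  .config("spark.dynamicAllocation.enabled", "true")
  .getOrCreate()
```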

The job we are running is very simple: our workflow reads JSON-formatted data from S3 and writes out partitioned Parquet files to HDFS.

As a one-liner, the whole workflow looks like this:

```
sparkSession.sqlContext
        .read
        .schema(inputSchema)                  // explicit schema for the JSON input
        .json(expandedInputPath)              // read JSON from S3
        .select(columnMap: _*)                // project the desired columns
        .write
        .partitionBy("partition_by_column")   // partition output by this column
        .parquet(outputPath)                  // write Parquet to HDFS
```

Unfortunately, for larger inputs, this job consistently fails with containers running out of memory. We observed containers as large as 20 GB OOMing, which is surprising because the input data itself is only 15 GB compressed and roughly 100 GB uncompressed.

We were able to isolate `partitionBy` as the problem by progressively removing/commenting out parts of our workflow. Once the workflow was reduced to the one-liner above, removing `partitionBy` made the job succeed with no OOM.
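
For comparison, the reduced variant that completes without OOM is the same pipeline with only the `partitionBy` call removed:

```
// Identical pipeline, minus partitionBy: this version succeeds on the same input.
sparkSession.sqlContext
        .read
        .schema(inputSchema)
        .json(expandedInputPath)
        .select(columnMap: _*)
        .write
        .parquet(outputPath)
```

A commonly suggested mitigation for this kind of workload, not verified here, is to repartition the DataFrame by the partition column before writing, so that each task holds open Parquet writers for fewer partition directories at a time.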


> Out of Memory Issues With org.apache.spark.sql.DataFrameWriter#partitionBy
> --------------------------------------------------------------------------
>
>                 Key: SPARK-20925
>                 URL: https://issues.apache.org/jira/browse/SPARK-20925
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Jeffrey Quinn
>



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
