Posted to issues@spark.apache.org by "Thomas Graves (JIRA)" <ji...@apache.org> on 2016/06/01 14:42:59 UTC

[jira] [Created] (SPARK-15700) Spark 2.0 dataframes using more driver memory (reading/writing parquet)

Thomas Graves created SPARK-15700:
-------------------------------------

             Summary: Spark 2.0 dataframes using more driver memory (reading/writing parquet)
                 Key: SPARK-15700
                 URL: https://issues.apache.org/jira/browse/SPARK-15700
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.0.0
            Reporter: Thomas Graves


I was running a large 15TB join job with 100000 map tasks and 20000 reducers that I have frequently run successfully on Spark 1.6 (with very little GC), and on Spark 2.0 it failed with a Java heap space OutOfMemoryError on the driver. The driver had a 10GB heap with 3GB of overhead.
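
For reference, a minimal sketch of the shape of the job described above (the paths, join key, and app name are placeholders, not taken from the actual job; driver sizing would be set at submit time, i.e. a 10GB heap plus 3GB of YARN memory overhead):

    import org.apache.spark.sql.SparkSession

    // Sketch of the job shape (Spark 2.0 Scala API); all paths and the
    // join key here are hypothetical, not from the actual job.
    val spark = SparkSession.builder().appName("large-join").getOrCreate()
    spark.conf.set("spark.sql.shuffle.partitions", "20000")   // 20000 reducers
    val left  = spark.read.parquet("hdfs:///path/to/left")    // ~15TB input, ~100000 map tasks
    val right = spark.read.parquet("hdfs:///path/to/right")
    left.join(right, Seq("key"))
        .write.parquet("hdfs:///path/to/output")              // OOM occurs during job commit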

16/05/31 22:47:44 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOfRange(Arrays.java:3520)
        at org.apache.parquet.io.api.Binary$ByteArrayBackedBinary.getBytes(Binary.java:262)
        at org.apache.parquet.column.statistics.BinaryStatistics.getMinBytes(BinaryStatistics.java:67)
        at org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetStatistics(ParquetMetadataConverter.java:242)
        at org.apache.parquet.format.converter.ParquetMetadataConverter.addRowGroup(ParquetMetadataConverter.java:184)
        at org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetMetadata(ParquetMetadataConverter.java:95)
        at org.apache.parquet.hadoop.ParquetFileWriter.serializeFooter(ParquetFileWriter.java:472)
        at org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:500)
        at org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:490)
        at org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:63)
        at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
        at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:221)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
        at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:479)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:252)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:234)
        at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:626)


I haven't had a chance to look into this further yet; just reporting it for now.
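
From the stack trace, the allocation that fails is in ParquetOutputCommitter.commitJob, where the driver gathers the footer of every output file and serializes the per-column min/max statistics into the summary _metadata file, so the whole aggregation sits in driver memory at once. If that is indeed the cause, one possible mitigation (an untested assumption on my part, using parquet-mr's standard configuration key) would be to skip the summary file entirely:

    // Untested workaround sketch: disable the driver-side summary file.
    // "parquet.enable.summary-metadata" is parquet-mr's ENABLE_JOB_SUMMARY key.
    spark.sparkContext.hadoopConfiguration
      .set("parquet.enable.summary-metadata", "false")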



