Posted to issues@spark.apache.org by "Thomas Graves (JIRA)" <ji...@apache.org> on 2016/06/01 14:42:59 UTC
[jira] [Created] (SPARK-15700) Spark 2.0 dataframes using more driver memory (reading/writing parquet)
Thomas Graves created SPARK-15700:
-------------------------------------
Summary: Spark 2.0 dataframes using more driver memory (reading/writing parquet)
Key: SPARK-15700
URL: https://issues.apache.org/jira/browse/SPARK-15700
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.0.0
Reporter: Thomas Graves
I was running a large 15TB join job with 100000 map tasks and 20000 reducers that I have frequently run successfully on Spark 1.6 (with very little GC), and on Spark 2.0 it failed with an out-of-heap-memory error on the driver. The driver had a 10GB heap with 3GB overhead.
16/05/31 22:47:44 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3520)
at org.apache.parquet.io.api.Binary$ByteArrayBackedBinary.getBytes(Binary.java:262)
at org.apache.parquet.column.statistics.BinaryStatistics.getMinBytes(BinaryStatistics.java:67)
at org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetStatistics(ParquetMetadataConverter.java:242)
at org.apache.parquet.format.converter.ParquetMetadataConverter.addRowGroup(ParquetMetadataConverter.java:184)
at org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetMetadata(ParquetMetadataConverter.java:95)
at org.apache.parquet.hadoop.ParquetFileWriter.serializeFooter(ParquetFileWriter.java:472)
at org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:500)
at org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:490)
at org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:63)
at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:221)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:479)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:252)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:234)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:626)
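The trace shows the driver aggregating per-row-group binary min/max statistics from every task's footer while ParquetOutputCommitter.commitJob writes the summary _metadata file, which scales with the number of output files. One possible mitigation (a sketch, not verified against this job) is to disable the summary file via the Parquet-Hadoop property parquet.enable.summary-metadata, passed through Spark's spark.hadoop.* configuration prefix:

    # spark-submit configuration sketch: skip writing the Parquet summary
    # _metadata file so the driver need not hold every footer's statistics.
    --conf spark.hadoop.parquet.enable.summary-metadata=false

Whether this avoids the OOM for this workload is an assumption; it only removes the summary-file aggregation step shown in the stack trace, not any other driver-side memory growth.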
I haven't had a chance to look into this further yet; just reporting it for now.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)