Posted to dev@parquet.apache.org by "madhukara phatak (JIRA)" <ji...@apache.org> on 2015/06/05 19:19:00 UTC
[jira] [Commented] (PARQUET-222) parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL
[ https://issues.apache.org/jira/browse/PARQUET-222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574867#comment-14574867 ]
madhukara phatak commented on PARQUET-222:
------------------------------------------
I am getting the same issue for a file with 26k columns but only a single row. Even after setting a 4 GB heap, I am still getting an OOM.
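For scale, here is a back-of-envelope estimate of why a 26k-column schema can exhaust a 4 GB heap before any data is written. It assumes (this is our assumption, not confirmed in this thread) that parquet-mr 1.6.0's `IntList.initSlab` eagerly allocates one slab of 64 * 1024 ints per dictionary-encoded column, and it ignores all other per-column overhead:

```python
# Rough memory estimate for dictionary writers over a very wide schema.
# Assumed (not confirmed here): one eager slab of 64K 4-byte ints per column.

SLAB_INTS = 64 * 1024      # assumed initial IntList slab size, in entries
BYTES_PER_INT = 4
COLUMNS = 26_000           # column count reported in the comment above

slab_bytes = SLAB_INTS * BYTES_PER_INT       # 256 KiB per column
total_bytes = slab_bytes * COLUMNS
total_gib = total_bytes / 2**30

print(f"per-column slab: {slab_bytes // 1024} KiB")
print(f"total for {COLUMNS} columns: {total_gib:.1f} GiB")
```

Under these assumptions the dictionary writers alone would want roughly 6.3 GiB, comfortably above a 4 GB heap. A commonly suggested mitigation (untested here) is to disable dictionary encoding for such wide schemas, e.g. via the `parquet.enable.dictionary` output property.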
> parquet writer runs into OOM during writing when calling DataFrame.saveAsParquetFile in Spark SQL
> -------------------------------------------------------------------------------------------------
>
> Key: PARQUET-222
> URL: https://issues.apache.org/jira/browse/PARQUET-222
> Project: Parquet
> Issue Type: Bug
> Components: parquet-mr
> Affects Versions: 1.6.0
> Reporter: Chaozhong Yang
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> In Spark SQL, there is a function `saveAsParquetFile` on DataFrame (formerly SchemaRDD). That function calls into parquet-mr, and it sometimes fails due to an OOM error thrown by parquet-mr. The exception stack trace is as follows:
> [WARN] [task-result-getter-3] 03-19 11:17:58,274 [TaskSetManager] - Lost task 0.2 in stage 137.0 (TID 309, hb1.avoscloud.com): java.lang.OutOfMemoryError: Java heap space
> at parquet.column.values.dictionary.IntList.initSlab(IntList.java:87)
> at parquet.column.values.dictionary.IntList.<init>(IntList.java:83)
> at parquet.column.values.dictionary.DictionaryValuesWriter.<init>(DictionaryValuesWriter.java:85)
> at parquet.column.values.dictionary.DictionaryValuesWriter$PlainIntegerDictionaryValuesWriter.<init>(DictionaryValuesWriter.java:549)
> at parquet.column.ParquetProperties.getValuesWriter(ParquetProperties.java:88)
> at parquet.column.impl.ColumnWriterImpl.<init>(ColumnWriterImpl.java:74)
> at parquet.column.impl.ColumnWriteStoreImpl.newMemColumn(ColumnWriteStoreImpl.java:68)
> at parquet.column.impl.ColumnWriteStoreImpl.getColumnWriter(ColumnWriteStoreImpl.java:56)
> at parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.<init>(MessageColumnIO.java:178)
> at parquet.io.MessageColumnIO.getRecordWriter(MessageColumnIO.java:369)
> at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:108)
> at parquet.hadoop.InternalParquetRecordWriter.<init>(InternalParquetRecordWriter.java:94)
> at parquet.hadoop.ParquetRecordWriter.<init>(ParquetRecordWriter.java:64)
> at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282)
> at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
> at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:304)
> at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
> at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:662)
> By the way, there is another similar issue: https://issues.apache.org/jira/browse/PARQUET-99. However, the reporter closed it and marked it as resolved.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)