You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@parquet.apache.org by "Gajulapalli Praveen Kumar (JIRA)" <ji...@apache.org> on 2018/08/14 06:16:00 UTC
[jira] [Commented] (PARQUET-632) Parquet file in invalid state while writing to S3 from EMR

    [ https://issues.apache.org/jira/browse/PARQUET-632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579311#comment-16579311 ] 

Gajulapalli Praveen Kumar commented on PARQUET-632:
---------------------------------------------------

i'm also facing this issue. I'm using spark 2.2.0

> Parquet file in invalid state while writing to S3 from EMR
> ----------------------------------------------------------
>
>                 Key: PARQUET-632
>                 URL: https://issues.apache.org/jira/browse/PARQUET-632
>             Project: Parquet
>          Issue Type: Bug
>    Affects Versions: 1.7.0
>            Reporter: Peter Halliday
>            Priority: Blocker
>
> I'm writing parquet to S3 from Spark 1.6.1 on EMR.  And when it got to the last few files to write to S3, I received this stacktrace in the log with no other errors before or after it.  It's very consistent.  This particular batch keeps erroring the same way.
> {noformat}
> 2016-06-10 01:46:05,282] WARN org.apache.spark.scheduler.TaskSetManager [task-result-getter-2hread] - Lost task 3737.0 in stage 2.0 (TID 10585, ip-172-16-96-32.ec2.internal): org.apache.spark.SparkException: Task failed while writing rows.
> 	at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:414)
> 	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> 	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:89)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: The file being written is in an invalid state. Probably caused by an error thrown previously. Current state: COLUMN
> 	at org.apache.parquet.hadoop.ParquetFileWriter$STATE.error(ParquetFileWriter.java:146)
> 	at org.apache.parquet.hadoop.ParquetFileWriter$STATE.startBlock(ParquetFileWriter.java:138)
> 	at org.apache.parquet.hadoop.ParquetFileWriter.startBlock(ParquetFileWriter.java:195)
> 	at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:153)
> 	at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:113)
> 	at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
> 	at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetRelation.scala:101)
> 	at org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:405)
> 	... 8 more
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)