Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2019/09/30 20:32:28 UTC

[GitHub] [incubator-iceberg] mykolasmith opened a new issue #508: java.lang.IllegalStateException: Already closed file for partition

URL: https://github.com/apache/incubator-iceberg/issues/508
 
 
   Hi, I am trying to write a Spark DataFrame to an Iceberg table partitioned by hour, where the rows span more than one hourly partition (i.e. the DataFrame contains rows from more than one hour). The expected result would be a separate data file committed for each partition. However, I am receiving this error:
   
    ```
    19/09/30 16:30:51 INFO CodecPool: Got brand-new compressor [.gz]
   19/09/30 16:30:52 INFO CodecPool: Got brand-new compressor [.gz]
   19/09/30 16:30:52 WARN Writer: Duplicate key: [436073] == [436073]
   19/09/30 16:30:52 ERROR Utils: Aborting task
   java.lang.IllegalStateException: Already closed file for partition: ec_event_time_hour=2019-09-30-17
   	at org.apache.iceberg.spark.source.Writer$PartitionedWriter.write(Writer.java:389)
   	at org.apache.iceberg.spark.source.Writer$PartitionedWriter.write(Writer.java:350)
   	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$2(WriteToDataSourceV2Exec.scala:118)
   	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
   	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:116)
   	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.$anonfun$doExecute$2(WriteToDataSourceV2Exec.scala:67)
   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
   	at org.apache.spark.scheduler.Task.run(Task.scala:121)
   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:748)
   ```
   
   Any idea what could be happening here? Do I need to group the DataFrame by hour in order to get separate DataFrames that each contain rows for only a single partition?
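   
   In other words, would a local sort by the partition value before the write be enough? A rough sketch of what I have in mind, assuming the table is partitioned by hours(ec_event_time) (inferred from the partition path in the error) and with a hypothetical table location passed to save():
   
    ```scala
    // Sketch of a sort-before-write workaround (column and location names
    // are assumptions inferred from the error message, not confirmed API).
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, date_trunc}
    
    def appendSortedByHour(df: DataFrame, tableLocation: String): Unit = {
      // Local sort: each write task then sees its rows grouped by hour, so
      // the writer never revisits a partition whose file it already closed.
      val clustered = df.sortWithinPartitions(date_trunc("hour", col("ec_event_time")))
    
      clustered.write
        .format("iceberg")
        .mode("append")
        .save(tableLocation) // hypothetical table location
    }
    ```
   
   sortWithinPartitions only sorts locally (no shuffle), which I'd expect to be sufficient if the writer only needs each task's rows clustered by partition value; a global orderBy on the same expression should also work but would shuffle the data.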
