Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2019/12/24 03:59:01 UTC

[GitHub] [incubator-iceberg] living42 opened a new issue #717: Write stream of unordered rows into partitioned table causes "Already closed files for partition"

living42 opened a new issue #717: Write stream of unordered rows into partitioned table causes "Already closed files for partition"
URL: https://github.com/apache/incubator-iceberg/issues/717
 
 
   I'm trying to use Structured Streaming to move data from Kafka to an Iceberg table. Sometimes the data is out of order with respect to the partition spec, and then I get an exception like the one below:
   
   ```
   java.lang.IllegalStateException: Already closed files for partition: time_day=2019-12-17
   	at org.apache.iceberg.spark.source.Writer$PartitionedWriter.write(Writer.java:505)
   	at org.apache.iceberg.spark.source.Writer$PartitionedWriter.write(Writer.java:476)
   	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:118)
   	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$$anonfun$run$3.apply(WriteToDataSourceV2Exec.scala:116)
   	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
   	at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:146)
   	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:67)
   	at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec$$anonfun$doExecute$2.apply(WriteToDataSourceV2Exec.scala:66)
   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
   	at org.apache.spark.scheduler.Task.run(Task.scala:123)
   	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:748)
   ```
   
   I wish I could use the `sortWithinPartitions` operation on a streaming query, but Spark won't let me do that.
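
For readers unfamiliar with the error, here is a minimal sketch of how it can arise. It assumes the partitioned writer keeps only one file open at a time and refuses to reopen a partition it has already closed. `ClusteredWriter` and `write_all` are hypothetical names used for illustration; this is a toy model, not Iceberg's implementation.

```python
class ClusteredWriter:
    """Toy model: one open file at a time, closed partitions cannot be reopened."""

    def __init__(self):
        self.current = None   # partition whose file is currently open
        self.closed = set()   # partitions whose file has already been closed
        self.files = {}       # partition -> rows written to its file

    def write(self, partition, row):
        if partition != self.current:
            if partition in self.closed:
                # Rows for this partition arrived again after its file was closed.
                raise RuntimeError(f"Already closed files for partition: {partition}")
            if self.current is not None:
                self.closed.add(self.current)
            self.current = partition
            self.files[partition] = []
        self.files[partition].append(row)


def write_all(rows):
    """Write (partition, row) pairs in arrival order and return the files."""
    writer = ClusteredWriter()
    for partition, row in rows:
        writer.write(partition, row)
    return writer.files
```

In this model, rows that arrive grouped by partition are written without trouble, while a sequence such as 2019-12-17, 2019-12-18, 2019-12-17 fails on the third row.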

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [incubator-iceberg] rdblue commented on issue #717: Write stream of unordered rows into partitioned table causes "Already closed files for partition"

Posted by GitBox <gi...@apache.org>.

rdblue commented on issue #717: Write stream of unordered rows into partitioned table causes "Already closed files for partition"
URL: https://github.com/apache/incubator-iceberg/issues/717#issuecomment-569362746
 
 
   Yes, `sortWithinPartitions` won't work, but you should be able to `repartition` by your partition columns, right?
   
   @aokolnychyi and I have also talked about relaxing this constraint for Spark streaming and keeping multiple files open.
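
To illustrate what repartitioning by the partition columns buys you, here is a rough sketch in plain Python rather than Spark. `route_by_partition` and `tasks_per_partition` are hypothetical helpers, and the CRC32-based routing is an assumption chosen only to make the example deterministic; it is not how Spark hashes rows.

```python
import zlib
from collections import defaultdict


def route_by_partition(rows, num_tasks):
    """Send every (partition, row) pair to a task chosen from the partition value."""
    tasks = defaultdict(list)
    for partition, row in rows:
        # Same partition value -> same task, regardless of arrival order.
        task_id = zlib.crc32(partition.encode("utf-8")) % num_tasks
        tasks[task_id].append((partition, row))
    return dict(tasks)


def tasks_per_partition(tasks):
    """Report which task ids received rows for each partition value."""
    seen = defaultdict(set)
    for task_id, rows in tasks.items():
        for partition, _ in rows:
            seen[partition].add(task_id)
    return dict(seen)
```

After routing, each partition value is handled by exactly one task. Note that routing alone groups rows by task; it does not order rows within a task.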


[GitHub] [iceberg] davseitsev commented on issue #717: Write stream of unordered rows into partitioned table causes "Already closed files for partition"

Posted by GitBox <gi...@apache.org>.

davseitsev commented on issue #717:
URL: https://github.com/apache/iceberg/issues/717#issuecomment-641147330


   We also ran into this restriction when writing data from Kafka to Iceberg and partitioning it by date.
   We would like a way to avoid the restriction: write the data as is, and then compact small files in the background.



[GitHub] [incubator-iceberg] living42 commented on issue #717: Write stream of unordered rows into partitioned table causes "Already closed files for partition"

Posted by GitBox <gi...@apache.org>.

living42 commented on issue #717: Write stream of unordered rows into partitioned table causes "Already closed files for partition"
URL: https://github.com/apache/incubator-iceberg/issues/717#issuecomment-569384441
 
 
   Yes, repartition works, but it feels like extra work to get around the constraint.


[GitHub] [iceberg] HeartSaVioR commented on issue #717: Write stream of unordered rows into partitioned table causes "Already closed files for partition"

Posted by GitBox <gi...@apache.org>.

HeartSaVioR commented on issue #717:
URL: https://github.com/apache/iceberg/issues/717#issuecomment-767194839


   The fanout writer will enable you to do this without repartition/sort. There's no doc yet (I guess the functionality will be available in 0.11), so please refer to https://github.com/apache/iceberg/pull/1929 to preview the functionality.
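
As a rough illustration of the idea, here is a toy fanout writer in plain Python. It assumes "fanout" means keeping one open file per partition value seen by the task, in line with the earlier comment about keeping multiple files open. `FanoutWriter` here is a hypothetical class for illustration, not the code from the pull request.

```python
class FanoutWriter:
    """Toy model: keep one open file per partition until the task finishes."""

    def __init__(self):
        self.open_files = {}  # partition -> rows buffered for its open file

    def write(self, partition, row):
        # Reuse the open file for this partition, or open a new one.
        self.open_files.setdefault(partition, []).append(row)

    def close(self):
        """Close every open file and return the written contents."""
        files, self.open_files = self.open_files, {}
        return files


def write_unordered(rows):
    """Write (partition, row) pairs in any order without failing."""
    writer = FanoutWriter()
    for partition, row in rows:
        writer.write(partition, row)
    return writer.close()
```

The trade-off in this model is that the number of simultaneously open files grows with the number of distinct partition values a task sees.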


