Posted to issues@iceberg.apache.org by "kongul (via GitHub)" <gi...@apache.org> on 2023/06/23 12:04:21 UTC

[GitHub] [iceberg] kongul opened a new issue, #7890: Data files name collision written by Spark Streaming job after it's restart

kongul opened a new issue, #7890:
URL: https://github.com/apache/iceberg/issues/7890

   ### Apache Iceberg version
   
   1.2.1
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   We have a number of Spark jobs that stream data to Iceberg tables. Recently we ran into problems reading those tables: data files had been deleted or overwritten by other data files of a different size (we checked the older object versions in the S3 bucket). After investigating, this is what we found.
   
   Here's how the file name is constructed: https://github.com/apache/iceberg/blob/apache-iceberg-1.2.1/core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java#L51-L100
   As the comment there says:
   
   ```
      * Constructor with specific operationId. The [partitionId, taskId, operationId] triplet has to be
      * unique across JVM instances otherwise the same file name could be generated by different
      * instances of the OutputFileFactory.
   ```
   
   Here we can see that `queryId` is passed as `operationId`
   
   Now let's see what is passed there from the Spark side:
   https://github.com/apache/spark/blob/branch-3.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L159
   https://github.com/apache/spark/blob/branch-3.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L134C1-L143
   https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamMetadata.scala
   
   So the `queryId` stored in the stream metadata file is persisted across Spark Streaming job restarts, and the requirement that `The [partitionId, taskId, operationId] triplet has to be unique` is therefore violated. A new run of the streaming job can generate a file name that already exists and overwrite the existing file.
   
   https://github.com/apache/iceberg/blob/apache-iceberg-1.2.1/core/src/main/java/org/apache/iceberg/io/OutputFileFactory.java#L91-L100
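   
   A minimal sketch of the collision, not the actual Iceberg code. It assumes the file name is built from the partition ID, task ID, operation ID and a per-writer file counter; the class `FileNameCollisionSketch`, the helper `fileName`, the exact format string and the `.parquet` extension are hypothetical. It also assumes that partition and task IDs can repeat after a restart, because a new Spark application starts counting from scratch.
   
   ```java
   import java.util.HashSet;
   import java.util.Set;
   import java.util.UUID;
   
   public class FileNameCollisionSketch {
     // Hypothetical stand-in for the name built by OutputFileFactory.
     // The format string and extension are assumptions for illustration.
     public static String fileName(int partitionId, long taskId, String operationId, int fileCount) {
       return String.format("%05d-%d-%s-%05d.parquet", partitionId, taskId, operationId, fileCount);
     }
   
     public static void main(String[] args) {
       // The streaming queryId is restored from the checkpoint metadata,
       // so the run before the restart and the run after it see the same value.
       String queryId = UUID.randomUUID().toString();
   
       // First run: partition 0, task 1, first file of that writer.
       Set<String> written = new HashSet<>();
       written.add(fileName(0, 1L, queryId, 1));
   
       // After the restart the file counter starts again at 1, and the
       // partition and task IDs are assumed to repeat.
       String afterRestart = fileName(0, 1L, queryId, 1);
       System.out.println("same queryId collides: " + written.contains(afterRestart));
   
       // An operationId that is fresh for every run cannot produce the same name.
       String freshRun = fileName(0, 1L, UUID.randomUUID().toString(), 1);
       System.out.println("fresh operationId collides: " + written.contains(freshRun));
     }
   }
   ```
   
   With the same `queryId` on both runs the second name is already in the set, which is the overwrite we observed in S3.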
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

Re: [I] Data files name collision written by Spark Streaming job after it's restart [iceberg]

Posted by "dotjdk (via GitHub)" <gi...@apache.org>.

dotjdk commented on issue #7890:
URL: https://github.com/apache/iceberg/issues/7890#issuecomment-1886426049

   I was about to create a bug report on this too. 
   
   We are experiencing the same issue after upgrading from 1.1.x to 1.4.x. The bug seems to have been introduced in 1.2.x by this commit:
   
   Spark: Add the query ID to file names (#6569)
   https://github.com/apache/iceberg/commit/046a81aa734dc4b61b66c3214ba6888a72d68bc9
   