Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/07/22 10:38:06 UTC

[GitHub] [iceberg] pvary opened a new issue, #5339: Adding the same file twice for the same table

pvary opened a new issue, #5339:
URL: https://github.com/apache/iceberg/issues/5339

   While reviewing #4904 I found the following with a slightly modified `TestIcebergInputFormats.testFilterExp` test:
   ```java
   [..]
       helper = new TestHelper(conf, tables, location.toString(), SCHEMA, SPEC, fileFormat, temp);
   [..]
       helper.createTable();
   
       List<Record> expectedRecords = helper.generateRandomRecords(2, 0L);
       expectedRecords.get(0).set(2, "2020-03-20");
       expectedRecords.get(1).set(2, "2020-03-20");
   
       DataFile dataFile1 = helper.writeFile(Row.of("2020-03-20", 0), expectedRecords);
       DataFile dataFile2 = helper.writeFile(Row.of("2020-03-21", 0), helper.generateRandomRecords(2, 0L));
       helper.appendToTable(dataFile1, dataFile2); // This creates a transaction and adds the data files to it using 'table.newAppend()'
   
       // Adding the same files again to the same table
       helper.appendToTable(dataFile1, dataFile2);
   ```
   
   The test adds the same data files to the Iceberg table twice.
   The result is that the table will contain duplicate rows, which is what I would expect if we do not want to prevent this situation in the first place.
   
   I have not tested it yet, but based on the specification it is not possible to deduplicate the data using any of the V2 delete formats. It is only possible with knowledge about the data and the data files of the Iceberg table.
   
   Question for the community:
   - Do we think that this is an expected behaviour?
   - Do we want to prevent this situation by checking the uniqueness of the file names when adding new data files to a table? What should we do in this case?
       - Throw an exception?
       - Log a warning message, and skip adding the file?
   
   My first instinct would be to prevent adding the same file to the same table again and throw an exception, but I would like to see how others think about this issue.
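
   The two options above (throw an exception, or warn and skip) could be sketched roughly as follows. This is a self-contained illustration, not Iceberg code: `DuplicateFileGuard`, `Policy`, and `append` are hypothetical names, and the set of known paths stands in for whatever the table metadata would provide.

   ```java
   import java.util.ArrayList;
   import java.util.HashSet;
   import java.util.List;
   import java.util.Set;

   // Hypothetical sketch (not Iceberg API): remembers the data file paths
   // that were already accepted and applies a policy to repeated paths.
   final class DuplicateFileGuard {
     enum Policy { THROW, SKIP }

     private final Set<String> knownPaths = new HashSet<>();
     private final Policy policy;

     DuplicateFileGuard(Policy policy) {
       this.policy = policy;
     }

     // Returns the paths that were accepted by this append.
     List<String> append(List<String> paths) {
       // Work on a copy so that a rejected append leaves the guard unchanged.
       Set<String> seen = new HashSet<>(knownPaths);
       List<String> accepted = new ArrayList<>();
       for (String path : paths) {
         if (!seen.add(path)) {
           if (policy == Policy.THROW) {
             throw new IllegalArgumentException("Data file already added: " + path);
           }
           System.err.println("WARN: skipping duplicate data file " + path);
           continue;
         }
         accepted.add(path);
       }
       knownPaths.addAll(accepted);
       return accepted;
     }
   }
   ```

   With `Policy.SKIP`, appending the same path a second time returns an empty list; with `Policy.THROW`, the second append fails and nothing from it is recorded.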
   
   Thanks,
   Peter


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on issue #5339: Adding the same file twice for the same table

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on issue #5339:
URL: https://github.com/apache/iceberg/issues/5339#issuecomment-1192982933

   Yeah, I think there was a similar older discussion here: https://github.com/apache/iceberg/issues/3064. I think we can do a pre-check of all files added in the same transaction, but anything beyond that would involve an expensive Spark call to check for duplicates in the table itself.
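
   A check limited to the files of one pending append might look something like the sketch below. It is a minimal illustration under the assumption that the pending data files can be listed by path; `PendingAppendCheck` and `duplicatePaths` are made-up names, not part of Iceberg.

   ```java
   import java.util.HashSet;
   import java.util.LinkedHashSet;
   import java.util.List;
   import java.util.Set;

   // Hypothetical sketch: finds paths that occur more than once among the
   // files queued for a single append. It never reads existing table metadata.
   final class PendingAppendCheck {
     static Set<String> duplicatePaths(List<String> pendingPaths) {
       Set<String> seen = new HashSet<>();
       // LinkedHashSet keeps duplicates in the order they were first repeated.
       Set<String> duplicates = new LinkedHashSet<>();
       for (String path : pendingPaths) {
         if (!seen.add(path)) {
           duplicates.add(path);
         }
       }
       return duplicates;
     }
   }
   ```

   A check of this kind only sees one pending append, so it cannot catch a file that an earlier commit already added.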




[GitHub] [iceberg] szehon-ho commented on issue #5339: Adding the same file twice for the same table

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on issue #5339:
URL: https://github.com/apache/iceberg/issues/5339#issuecomment-1193071314

   As far as I know, when you have a Table object you usually have all the ManifestFiles in memory, but that's as far as it goes. What happens during commit is a bit trickier (you can lose a lot of time in that code :). There are a few cases where old manifests are read and their DataFiles checked, such as 1) a delete operation checking whether the delete expression allows dropping a DataFile entirely, or 2) merging a new manifest with older ones, but I'd be surprised if we loaded DataFile entries that we don't need.




[GitHub] [iceberg] hililiwei commented on issue #5339: Adding the same file twice for the same table

Posted by GitBox <gi...@apache.org>.
hililiwei commented on issue #5339:
URL: https://github.com/apache/iceberg/issues/5339#issuecomment-1193003590

   Similarly, in Flink, when we write data, we need to find a way to avoid double commits.
   We might add a default behavior that does not allow the same file to be submitted twice. In addition to checking the file path, we should also check the file content, such as verifying the MD5, to ensure that the contents of the two files are also consistent.
   If we really want to add duplicate files, we can allow it through an option like 'force', as @Spince suggested.
   Or reverse the logic and allow it by default. This has the advantage of preserving compatibility, consistent with our current behavior.
   In conclusion, I believe that it is useful to provide such a mechanism.
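
   The content comparison mentioned above could be sketched as below, assuming for simplicity that both files are readable through `java.nio.file` on a local file system (files in an object store would need that store's own client). `FileContentCheck`, `md5`, and `sameContent` are illustrative names only.

   ```java
   import java.io.IOException;
   import java.io.InputStream;
   import java.nio.file.Files;
   import java.nio.file.Path;
   import java.security.DigestInputStream;
   import java.security.MessageDigest;
   import java.security.NoSuchAlgorithmException;

   // Hypothetical sketch: compares two files by their MD5 digests.
   final class FileContentCheck {
     static byte[] md5(Path file) throws IOException, NoSuchAlgorithmException {
       MessageDigest digest = MessageDigest.getInstance("MD5");
       // Reading through DigestInputStream updates the digest as a side effect.
       try (InputStream in = new DigestInputStream(Files.newInputStream(file), digest)) {
         byte[] buffer = new byte[8192];
         while (in.read(buffer) != -1) {
           // Nothing to do here; the stream feeds the digest.
         }
       }
       return digest.digest();
     }

     static boolean sameContent(Path a, Path b) throws IOException, NoSuchAlgorithmException {
       return MessageDigest.isEqual(md5(a), md5(b));
     }
   }
   ```

   Computing a digest means reading each file in full, so a check like this would add I/O roughly proportional to the size of the files being compared.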
   
   Thanks.




[GitHub] [iceberg] pvary commented on issue #5339: Adding the same file twice for the same table

Posted by GitBox <gi...@apache.org>.
pvary commented on issue #5339:
URL: https://github.com/apache/iceberg/issues/5339#issuecomment-1193063396

   > Yeah, I think there was a similar older discussion here: #3064. I think we can do a pre-check of all files added in the same transaction, but anything beyond that would involve an expensive Spark call to check for duplicates in the table itself.
   
   Thanks @szehon-ho, I was not aware of the old thread. It seems like a reasonable compromise to accept duplicated files if we do not parse the whole table metadata anyway.
   How much of the metadata is parsed when we have a `Table` object at hand? Which metadata files do we read when we commit something? Does anyone have a quick answer for this, or shall I check?
   
   Thanks everyone for the answers!
    




[GitHub] [iceberg] github-actions[bot] closed issue #5339: Adding the same file twice for the same table

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed issue #5339: Adding the same file twice for the same table
URL: https://github.com/apache/iceberg/issues/5339




[GitHub] [iceberg] github-actions[bot] commented on issue #5339: Adding the same file twice for the same table

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #5339:
URL: https://github.com/apache/iceberg/issues/5339#issuecomment-1416548418

   This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'




[GitHub] [iceberg] Spince commented on issue #5339: Adding the same file twice for the same table

Posted by GitBox <gi...@apache.org>.
Spince commented on issue #5339:
URL: https://github.com/apache/iceberg/issues/5339#issuecomment-1192537949

   Curious what this conversation leads to. I'd expect an error/skipping the file with an option to "force" if desired. 




[GitHub] [iceberg] github-actions[bot] commented on issue #5339: Adding the same file twice for the same table

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on issue #5339:
URL: https://github.com/apache/iceberg/issues/5339#issuecomment-1397768506

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

