You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/02/18 09:43:38 UTC

[GitHub] [iceberg] wuwenchi opened a new pull request #4162: Core: bugfix add initial offset when using parquet format

wuwenchi opened a new pull request #4162:
URL: https://github.com/apache/iceberg/pull/4162


   when using data files in parquet format, there's going to be a problem here because the initial state of the parquet file itself has a 4-byte offset.
   Closes #4070 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wuwenchi commented on a change in pull request #4162: Core: bugfix add initial offset when using parquet format

Posted by GitBox <gi...@apache.org>.

wuwenchi commented on a change in pull request #4162:
URL: https://github.com/apache/iceberg/pull/4162#discussion_r811513685



##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseRewriteDataFilesAction.java
##########
@@ -58,6 +59,7 @@
     extends BaseSnapshotUpdateAction<ThisT, RewriteDataFilesActionResult> {
 
   private static final Logger LOG = LoggerFactory.getLogger(BaseRewriteDataFilesAction.class);
+  private static final int PARQUET_MAGIC_BYTES = 4;

Review comment:
       Oh, yes, the name is incorrect... How about 'PARQUET_MAGIC_LEN' ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #4162: Core: bugfix add initial offset when using parquet format

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #4162:
URL: https://github.com/apache/iceberg/pull/4162#discussion_r811366094



##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseRewriteDataFilesAction.java
##########
@@ -58,6 +59,7 @@
     extends BaseSnapshotUpdateAction<ThisT, RewriteDataFilesActionResult> {
 
   private static final Logger LOG = LoggerFactory.getLogger(BaseRewriteDataFilesAction.class);
+  private static final int PARQUET_MAGIC_BYTES = 4;

Review comment:
       This variable name doesn't correctly describe its contents because these are not the magic bytes, this is the magic size or length.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #4162: Core: bugfix add initial offset when using parquet format

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #4162:
URL: https://github.com/apache/iceberg/pull/4162#discussion_r812201336



##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseRewriteDataFilesAction.java
##########
@@ -300,6 +302,10 @@ void doReplace(Iterable<DataFile> deletedDataFiles, Iterable<DataFile> addedData
   private boolean isPartialFileScan(CombinedScanTask task) {
     if (task.files().size() == 1) {
       FileScanTask fileScanTask = task.files().iterator().next();
+      if (fileScanTask.file().format() == FileFormat.PARQUET) {
+        // in parquet format, there is an initial offset of 4 bytes
+        return fileScanTask.file().fileSizeInBytes() != (fileScanTask.length() + PARQUET_MAGIC_BYTES);
+      }

Review comment:
       That wouldn't be a problem with Parquet. The format will first read the file footer and then find all row groups that are assigned to it via the split range, and then seek directly to the start of those row groups. This wouldn't be a problem at all.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick commented on a change in pull request #4162: Core: bugfix add initial offset when using parquet format

Posted by GitBox <gi...@apache.org>.

kbendick commented on a change in pull request #4162:
URL: https://github.com/apache/iceberg/pull/4162#discussion_r810217951



##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseRewriteDataFilesAction.java
##########
@@ -300,6 +301,11 @@ void doReplace(Iterable<DataFile> deletedDataFiles, Iterable<DataFile> addedData
   private boolean isPartialFileScan(CombinedScanTask task) {
     if (task.files().size() == 1) {
       FileScanTask fileScanTask = task.files().iterator().next();
+      if (fileScanTask.file().format() == FileFormat.PARQUET) {
+        // in parquet format, there is an initial offset of 4 bytes
+        return fileScanTask.file().fileSizeInBytes() !=
+                (fileScanTask.length() + fileScanTask.file().splitOffsets().get(0));

Review comment:
       Nit: If we know that it's just 4 bytes, can we perhaps instead add a constant field for `PARQUET_MAGIC_BYTES` (I believe this is usually referred to as Parquet magic bytes?).
   
   Using `splitOffsets.get(0)` is a bit more confusing for me, particularly if we expect it to be an exact 4 bytes every time.

##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseRewriteDataFilesAction.java
##########
@@ -300,6 +301,11 @@ void doReplace(Iterable<DataFile> deletedDataFiles, Iterable<DataFile> addedData
   private boolean isPartialFileScan(CombinedScanTask task) {
     if (task.files().size() == 1) {
       FileScanTask fileScanTask = task.files().iterator().next();
+      if (fileScanTask.file().format() == FileFormat.PARQUET) {
+        // in parquet format, there is an initial offset of 4 bytes
+        return fileScanTask.file().fileSizeInBytes() !=
+                (fileScanTask.length() + fileScanTask.file().splitOffsets().get(0));
+      }
       return fileScanTask.file().fileSizeInBytes() != fileScanTask.length();

Review comment:
       Does `fileScanTask.length()` not include the parquet initial offset of 4 bytes if the task is a complete (not-partial) file scan?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on pull request #4162: Core: bugfix add initial offset when using parquet format

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #4162:
URL: https://github.com/apache/iceberg/pull/4162#issuecomment-1046277195


   @wuwenchi, was there a problem that this caused? Can you update the description with what this is fixing besides trying to be slightly more permissive?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wuwenchi commented on a change in pull request #4162: Core: bugfix add initial offset when using parquet format

Posted by GitBox <gi...@apache.org>.

wuwenchi commented on a change in pull request #4162:
URL: https://github.com/apache/iceberg/pull/4162#discussion_r812515068



##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseRewriteDataFilesAction.java
##########
@@ -300,6 +302,10 @@ void doReplace(Iterable<DataFile> deletedDataFiles, Iterable<DataFile> addedData
   private boolean isPartialFileScan(CombinedScanTask task) {
     if (task.files().size() == 1) {
       FileScanTask fileScanTask = task.files().iterator().next();
+      if (fileScanTask.file().format() == FileFormat.PARQUET) {
+        // in parquet format, there is an initial offset of 4 bytes
+        return fileScanTask.file().fileSizeInBytes() != (fileScanTask.length() + PARQUET_MAGIC_BYTES);
+      }

Review comment:
       If you include offset 0 in the **first Parquet** task, we may have the following doubts: Why deal with the first one? And  why only do parquet processing? It is difficult for us to think that the operation here is for the judgment of whether it is a complete file...
   
   We know that there will be a magic bytes in the parquet file. If we want to get the size of the data (not directly read the data content), we will skip these 4 bytes naturally.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wuwenchi commented on pull request #4162: Core: bugfix add initial offset when using parquet format

Posted by GitBox <gi...@apache.org>.

wuwenchi commented on pull request #4162:
URL: https://github.com/apache/iceberg/pull/4162#issuecomment-1046759080


   > Thanks for posting this @wuwenchi!This seems rather important. Can you please get the Spark2 tests working? It seems related.
   > 
   > It would also be very helpful if you could add a test to demonstrate that this fixes the issue (something that breaks before that doesn't break after).
   
   The test case already exists, called testRewriteAvoidRepeatCompress.
   However, there are some problems in the writing of this test case, which leads to the fact that the offset list of parquet is not updated, and the wrong file length is obtained.
   The wrong file length, coupled with the wrong judgment process, has obtained the expected correct result.
   So I directly modify this test case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wuwenchi commented on pull request #4162: Core: bugfix add initial offset when using parquet format

Posted by GitBox <gi...@apache.org>.

wuwenchi commented on pull request #4162:
URL: https://github.com/apache/iceberg/pull/4162#issuecomment-1058757307


   @kbendick  Is there anything else that needs to be modified? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wuwenchi commented on a change in pull request #4162: Core: bugfix add initial offset when using parquet format

Posted by GitBox <gi...@apache.org>.

wuwenchi commented on a change in pull request #4162:
URL: https://github.com/apache/iceberg/pull/4162#discussion_r811514993



##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseRewriteDataFilesAction.java
##########
@@ -300,6 +302,10 @@ void doReplace(Iterable<DataFile> deletedDataFiles, Iterable<DataFile> addedData
   private boolean isPartialFileScan(CombinedScanTask task) {
     if (task.files().size() == 1) {
       FileScanTask fileScanTask = task.files().iterator().next();
+      if (fileScanTask.file().format() == FileFormat.PARQUET) {
+        // in parquet format, there is an initial offset of 4 bytes
+        return fileScanTask.file().fileSizeInBytes() != (fileScanTask.length() + PARQUET_MAGIC_BYTES);
+      }

Review comment:
       I think that's probably not good, because that's how parquet files are defined. Otherwise, every time you want to read a parquet file, you will read the first magic 4 bytes, and then skip these 4 bytes to find the real data...




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wuwenchi commented on pull request #4162: Core: bugfix add initial offset when using parquet format

Posted by GitBox <gi...@apache.org>.

wuwenchi commented on pull request #4162:
URL: https://github.com/apache/iceberg/pull/4162#issuecomment-1046757683

> @wuwenchi, was there a problem that this caused? Can you update the description with what this is fixing besides trying to be slightly more permissive?

@rdblue
When doing rewrite, it will cause a parquet file to be completely rewritten, generating a new file that is exactly the same as the source file. I think this is unnecessary.

Just like the testRewriteDataFilesForLargeFile use case in the previous spark2.4. There are 3 files in the table:
big_a.parquet : 50k
small_b.parquet : 2k
small_c.parquet : 2k

Set targetSize = 40k for rewrite,
So two tasks will be generated here:
task1 : big_a.parquet
task2 : small_b.parquet + small_c.parquet

What we expect is that task1 is filtered out because it has only one file.
But in fact this file may also be split, so it is necessary to judge whether it is a file split or not.
The second filter condition will determine whether the file needs to be split. Because of the error in the judgment of the parquet format file, this task is retained, so finally big_a.parquet is copied again and a new file is generated.

So I increased the judgment of the parquet file, so that the task1 can be deleted.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wuwenchi commented on a change in pull request #4162: Core: bugfix add initial offset when using parquet format

Posted by GitBox <gi...@apache.org>.

wuwenchi commented on a change in pull request #4162:
URL: https://github.com/apache/iceberg/pull/4162#discussion_r810462879



##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseRewriteDataFilesAction.java
##########
@@ -300,6 +301,11 @@ void doReplace(Iterable<DataFile> deletedDataFiles, Iterable<DataFile> addedData
   private boolean isPartialFileScan(CombinedScanTask task) {
     if (task.files().size() == 1) {
       FileScanTask fileScanTask = task.files().iterator().next();
+      if (fileScanTask.file().format() == FileFormat.PARQUET) {
+        // in parquet format, there is an initial offset of 4 bytes
+        return fileScanTask.file().fileSizeInBytes() !=
+                (fileScanTask.length() + fileScanTask.file().splitOffsets().get(0));

Review comment:
       That sounds good. 
   I will add `public static final int PARQUET_MAGIC_BYTES = 4;` in `Parquet` class, 
   and use `Parquet.PARQUET_MAGIC_BYTES` instead of `fileScanTask.file().splitOffsets().get(0)`.
   Is it ok?

##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseRewriteDataFilesAction.java
##########
@@ -300,6 +301,11 @@ void doReplace(Iterable<DataFile> deletedDataFiles, Iterable<DataFile> addedData
   private boolean isPartialFileScan(CombinedScanTask task) {
     if (task.files().size() == 1) {
       FileScanTask fileScanTask = task.files().iterator().next();
+      if (fileScanTask.file().format() == FileFormat.PARQUET) {
+        // in parquet format, there is an initial offset of 4 bytes
+        return fileScanTask.file().fileSizeInBytes() !=
+                (fileScanTask.length() + fileScanTask.file().splitOffsets().get(0));
+      }
       return fileScanTask.file().fileSizeInBytes() != fileScanTask.length();

Review comment:
       yes, if it is a complete file scan, fileScanTask.length() will also not contain offset of 4 bytes.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] wuwenchi commented on pull request #4162: Core: bugfix add initial offset when using parquet format

Posted by GitBox <gi...@apache.org>.

wuwenchi commented on pull request #4162:
URL: https://github.com/apache/iceberg/pull/4162#issuecomment-1045964525


   > 
   
   Sorry, forgot to check all test cases...
   I will check all the relevant tests, and then add the corresponding test cases


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #4162: Core: bugfix add initial offset when using parquet format

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #4162:
URL: https://github.com/apache/iceberg/pull/4162#discussion_r811367527



##########
File path: core/src/main/java/org/apache/iceberg/actions/BaseRewriteDataFilesAction.java
##########
@@ -300,6 +302,10 @@ void doReplace(Iterable<DataFile> deletedDataFiles, Iterable<DataFile> addedData
   private boolean isPartialFileScan(CombinedScanTask task) {
     if (task.files().size() == 1) {
       FileScanTask fileScanTask = task.files().iterator().next();
+      if (fileScanTask.file().format() == FileFormat.PARQUET) {
+        // in parquet format, there is an initial offset of 4 bytes
+        return fileScanTask.file().fileSizeInBytes() != (fileScanTask.length() + PARQUET_MAGIC_BYTES);
+      }

Review comment:
       I wonder if it would be better to always include offset 0 in the first Parquet task to avoid format-specific logic here. Could you try that instead? If the offset is 4 for a Parquet file, just modify it to start from 0.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org