You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/10/25 20:00:05 UTC

[GitHub] [iceberg] flyrain opened a new pull request, #6056: Parquet: Remove the row position since parquet row group has it natively

flyrain opened a new pull request, #6056:
URL: https://github.com/apache/iceberg/pull/6056

   After https://github.com/apache/parquet-mr/commit/c7bff519094920a8609df6cbd98821a43ed779e3, which is in parquet 1.12.3, the parquet row group provides the row index offset natively. We don't need to calculate in Iceberg. 
   
   cc @chenjunjiedada @rdblue @wypoon @aokolnychyi 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] wypoon commented on pull request #6056: Parquet: Remove the row position since parquet row group has it natively

Posted by GitBox <gi...@apache.org>.
wypoon commented on PR #6056:
URL: https://github.com/apache/iceberg/pull/6056#issuecomment-1292767534

   > > Why doesn't it affect the test I added in `TestSparkReaderDeletes`?
   > 
   > I assume there is only one reader initialized in the test you added, which can set the right row offset since it reads from the beginning of the parquet file, instead of from the middle.
   
   Ah, in a real-world situation, we could have a parquet file with multiple row groups and it is split among multiple scan tasks, and it has deletes to be applied, in which case the read will be incorrect then (since some split(s) will be reading the file from the middle).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on pull request #6056: Parquet: Remove the row position since parquet row group has it natively

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on PR #6056:
URL: https://github.com/apache/iceberg/pull/6056#issuecomment-1291952716

   >Looks like the RowIndexOffset won't be set correct if we read from the middle of a parquet file. It will start from 0 no matter where we start to read.
   
   @flyrain Do you mean parquet doesn't give the right offset? I remember there is a fix(https://github.com/apache/parquet-mr/pull/978) for this, not sure whether this fix is already in 1.12.3.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


Re: [PR] Parquet: Remove the row position since parquet row group has it natively [iceberg]

Posted by "ricardopereira33 (via GitHub)" <gi...@apache.org>.
ricardopereira33 commented on PR #6056:
URL: https://github.com/apache/iceberg/pull/6056#issuecomment-1804297603

   Hi @flyrain @wypoon !
   
   Are there any updates regarding this issue? We have a case when we write with Trino, and then we do data files compaction (or even just read a file), Spark can not read the file because the offsets are not within the range... example:
   ![Screenshot 2023-11-06 at 09 45 59](https://github.com/apache/iceberg/assets/16037370/a7d8e84b-b9ce-4000-a6b7-aa0d8855badc)
   
   Code section: https://github.com/apache/parquet-mr/blob/0a066d8a5c71386e56dee7bd7a21170b27e4283a/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1287


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on pull request #6056: Parquet: Remove the row position since parquet row group has it natively

Posted by GitBox <gi...@apache.org>.
flyrain commented on PR #6056:
URL: https://github.com/apache/iceberg/pull/6056#issuecomment-1291282360

   Looks like the `RowIndexOffset` won't be set correct if we read from the middle of a parquet file. It will start from 0 no matter where we start to read.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6056: Parquet: Remove the row position since parquet row group has it natively

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #6056:
URL: https://github.com/apache/iceberg/pull/6056#issuecomment-1304896473

   Do we trust this value from Parquet?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] chenjunjiedada commented on pull request #6056: Parquet: Remove the row position since parquet row group has it natively

Posted by GitBox <gi...@apache.org>.
chenjunjiedada commented on PR #6056:
URL: https://github.com/apache/iceberg/pull/6056#issuecomment-1291948335

   >Looks like the RowIndexOffset won't be set correct if we read from the middle of a parquet file. It will start from 0 no matter where we start to read.
   
   @flyrain Do you mean parquet doesn't give the right offset? I remember there is a fix(https://github.com/apache/parquet-mr/pull/978) for this, not sure whether this fix is already in 1.12.3.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


Re: [PR] Parquet: Remove the row position since parquet row group has it natively [iceberg]

Posted by "Fokko (via GitHub)" <gi...@apache.org>.
Fokko commented on PR #6056:
URL: https://github.com/apache/iceberg/pull/6056#issuecomment-1806267078

   @flyrain @chenjunjiedada do we want to revisit this since we're on Parquet 1.13.1?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on pull request #6056: Parquet: Remove the row position since parquet row group has it natively

Posted by GitBox <gi...@apache.org>.
flyrain commented on PR #6056:
URL: https://github.com/apache/iceberg/pull/6056#issuecomment-1306047299

   > Do we trust this value from Parquet?
   
   The approach parquet used is similar to what @chenjunjiedada implemented in Iceberg repo. As long as it is reliable(no bug), I don't see a reason to not trust it. By using it, we don't have to calculate it again since parquet lib did it already, and it also makes the Iceberg code base a bit cleaner. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] wypoon commented on pull request #6056: Parquet: Remove the row position since parquet row group has it natively

Posted by GitBox <gi...@apache.org>.
wypoon commented on PR #6056:
URL: https://github.com/apache/iceberg/pull/6056#issuecomment-1292619990

   So this change needs to wait until there is a parquet-mr release with a fix for [PARQUET-2161](https://issues.apache.org/jira/browse/PARQUET-2161), right?
   Just so I understand, for our usage, the bug manifests when `Parquet.ReadBuilder#split` is called (and thus when `Parquet.ReadBuilder#build` is called, the `VectorizedParquetReader` or `ParquetReader` is constructed with a `ParquetReadOptions` that has a `ParquetMetadataConverter.RangeMetadataFilter`), right? Why doesn't it affect the test I added in `TestSparkReaderDeletes`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on pull request #6056: Parquet: Remove the row position since parquet row group has it natively

Posted by GitBox <gi...@apache.org>.
flyrain commented on PR #6056:
URL: https://github.com/apache/iceberg/pull/6056#issuecomment-1292341403

   Yes, that's exactly why our test `testReadRowNumbersWithSplits` failed. https://github.com/apache/parquet-mr/pull/978 is not in 1.12.3(the latest one). We may have to wait until it is released.
   
   > For example, if a file has two row groups with 10 rows each, and we attempt to only read the 2nd row group, we are going to produce row indexes 0, 1, 2, ..., 9 instead of expected 10, 11, ..., 19.
   
   https://issues.apache.org/jira/browse/PARQUET-2161


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on pull request #6056: Parquet: Remove the row position since parquet row group has it natively

Posted by GitBox <gi...@apache.org>.
flyrain commented on PR #6056:
URL: https://github.com/apache/iceberg/pull/6056#issuecomment-1292635632

   > So this change needs to wait until there is a parquet-mr release with a fix for [PARQUET-2161]
   
   Yes.
   >Why doesn't it affect the test I added in `TestSparkReaderDeletes`? 
   
   I assume there is only one reader initialized in the test you added, which can set the right row offset since it reads the whole parquet file, instead of reading from the middle.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


Re: [PR] Parquet: Remove the row position since parquet row group has it natively [iceberg]

Posted by "flyrain (via GitHub)" <gi...@apache.org>.
flyrain commented on PR #6056:
URL: https://github.com/apache/iceberg/pull/6056#issuecomment-1809224390

   Thanks for sharing @ricardopereira33. Looks like the issue is due to a bug in Parquet 1.12.3, not directly related to this PR. It'd be nice to have this PR in though.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


Re: [PR] Parquet: Remove the row position since parquet row group has it natively [iceberg]

Posted by "flyrain (via GitHub)" <gi...@apache.org>.
flyrain commented on PR #6056:
URL: https://github.com/apache/iceberg/pull/6056#issuecomment-1806287219

   @Fokko, yes, we can revisit this with the new parquet release. It has the change Iceberg required. It should be safe and clean. Although, i'm not sure the issue mentioned by @ricardopereira33 is related. Hi @ricardopereira33, can you elaborate a bit?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] flyrain commented on pull request #6056: Parquet: Remove the row position since parquet row group has it natively

Posted by "flyrain (via GitHub)" <gi...@apache.org>.
flyrain commented on PR #6056:
URL: https://github.com/apache/iceberg/pull/6056#issuecomment-1551733952

   We can reconsider this after this PR https://github.com/apache/iceberg/pull/7301 is merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


Re: [PR] Parquet: Remove the row position since parquet row group has it natively [iceberg]

Posted by "ricardopereira33 (via GitHub)" <gi...@apache.org>.
ricardopereira33 commented on PR #6056:
URL: https://github.com/apache/iceberg/pull/6056#issuecomment-1806829411

   Hi @flyrain ! I added all the details in this [Slack thread](https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1699608872098359). Can you see it?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org