
Posted to dev@parquet.apache.org by GitBox <gi...@apache.org> on 2021/04/22 16:04:38 UTC

[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #896: PARQUET-2027: Fix calculating directory offset for merge

gszadovszky commented on a change in pull request #896:
URL: https://github.com/apache/parquet-mr/pull/896#discussion_r618536203



##########
File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/Offsets.java
##########
@@ -68,12 +68,14 @@ public static Offsets getOffsets(SeekableInputStream input, ColumnChunkMetaData
     return new Offsets(firstDataPageOffset, dictionaryPageOffset);
   }
 
-  private static long readDictionaryPageSize(SeekableInputStream in, long pos) throws IOException {
+  private static long readDictionaryPageSize(SeekableInputStream in, ColumnChunkMetaData chunk) throws IOException {
     long origPos = -1;
     try {
       origPos = in.getPos();
+      in.seek(chunk.getStartingPos());

Review comment:
       It is not obvious that one has to look for this statement in the [Encoding docs](https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8), but it is there:
   > The dictionary page is written first, before the data pages of the column chunk.
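
   To illustrate why seeking to the chunk start is enough, here is a minimal, self-contained sketch. It does not use the parquet-mr classes: `ByteArraySeekable`, `buildChunk`, and the fixed 8-byte `[type][bodySize]` page header are hypothetical stand-ins for `SeekableInputStream` and the Thrift page header, assumed only for illustration. Because the dictionary page comes first, the first data page offset can be derived as chunk start plus dictionary page size.

```java
import java.nio.ByteBuffer;

public class DictionaryFirstSketch {
    // Hypothetical stand-in for a seekable stream; NOT the parquet-mr SeekableInputStream.
    static final class ByteArraySeekable {
        private final ByteBuffer buf;
        ByteArraySeekable(byte[] data) { this.buf = ByteBuffer.wrap(data); }
        long getPos() { return buf.position(); }
        void seek(long pos) { buf.position((int) pos); }
        int readInt() { return buf.getInt(); }
    }

    // Toy page layout (an assumption, not the real Thrift header): [int type][int bodySize][body]
    static final int DATA_PAGE = 0;
    static final int DICTIONARY_PAGE = 2;
    static final int HEADER_SIZE = 8;

    // Seeks to the start of the column chunk, reads the first page header,
    // and always restores the caller's stream position afterwards.
    static long readDictionaryPageSize(ByteArraySeekable in, long chunkStart) {
        long origPos = in.getPos();
        try {
            in.seek(chunkStart);
            int type = in.readInt();
            int bodySize = in.readInt();
            // If the first page is not a dictionary page, the chunk has no dictionary.
            return type == DICTIONARY_PAGE ? HEADER_SIZE + bodySize : 0;
        } finally {
            in.seek(origPos);
        }
    }

    // Builds a toy file: `padding` unrelated bytes, then a dictionary page, then one data page.
    static byte[] buildChunk(int padding, int dictBody, int dataBody) {
        ByteBuffer b = ByteBuffer.allocate(padding + HEADER_SIZE + dictBody + HEADER_SIZE + dataBody);
        b.position(padding);
        b.putInt(DICTIONARY_PAGE).putInt(dictBody);
        b.position(b.position() + dictBody);
        b.putInt(DATA_PAGE).putInt(dataBody);
        return b.array();
    }

    public static void main(String[] args) {
        long chunkStart = 16;
        ByteArraySeekable in = new ByteArraySeekable(buildChunk(16, 10, 20));
        in.seek(3); // the caller is somewhere else in the file

        long dictSize = readDictionaryPageSize(in, chunkStart);
        long firstDataPageOffset = chunkStart + dictSize;

        System.out.println("dictionary page size = " + dictSize);
        System.out.println("first data page offset = " + firstDataPageOffset);
        System.out.println("stream position restored to " + in.getPos());
    }
}
```

   The `try`/`finally` mirrors the shape of the method in the diff: the position is saved before the seek and restored even if reading the header fails.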



