You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by "singhpk234 (via GitHub)" <gi...@apache.org> on 2023/04/06 17:51:12 UTC

[GitHub] [iceberg] singhpk234 commented on a diff in pull request #7279: [Parquet] Eagerly fetch row groups when reading parquet

singhpk234 commented on code in PR #7279:
URL: https://github.com/apache/iceberg/pull/7279#discussion_r1160094891


##########
parquet/src/main/java/org/apache/iceberg/parquet/VectorizedParquetReader.java:
##########
@@ -154,21 +177,47 @@ public T next() {
     }
 
     private void advance() {
-      while (shouldSkip[nextRowGroup]) {
-        nextRowGroup += 1;
-        reader.skipNextRowGroup();
-      }
-      PageReadStore pages;
       try {
-        pages = reader.readNextRowGroup();
-      } catch (IOException e) {
-        throw new RuntimeIOException(e);
+        Preconditions.checkNotNull(prefetchRowGroupFuture, "future should not be null");
+        PageReadStore pages = prefetchRowGroupFuture.get();
+
+        if (prefetchedRowGroup >= totalRowGroups) {
+          return;
+        }
+        Preconditions.checkState(
+            pages != null,
+            "advance() should have been only when there was at least one row group to read");
+        long rowPosition = rowGroupsStartRowPos[prefetchedRowGroup];
+        model.setRowGroupInfo(pages, columnChunkMetadata.get(prefetchedRowGroup), rowPosition);
+        nextRowGroupStart += pages.getRowCount();
+        prefetchedRowGroup += 1;
+        prefetchNextRowGroup(); // eagerly fetch the next row group

Review Comment:
   Agree with you, we can certainly do that, at the moment just kept it simple and closer to the existing poc code. Let me add this in later revisions if folks are onboard.
   
   Though we might not wanna load all the row-groups at one shot, as it might cause some memory pressure, may be having a conf for controlling the same would be helpful.
   
   P.S. Will also add pre-fetch of parquet data pages as well shortly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org