You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@parquet.apache.org by GitBox <gi...@apache.org> on 2023/01/08 09:51:18 UTC

[GitHub] [parquet-mr] wgtmac opened a new pull request, #1018: PARQUET-2219: ParquetFileReader skips empty row group

wgtmac opened a new pull request, #1018:
URL: https://github.com/apache/parquet-mr/pull/1018

   ### Jira
   
   My PR addresses the [PARQUET-2219](https://issues.apache.org/jira/browse/PARQUET/PARQUET-2219).
   
   ### Tests
   
   My PR adds the following unit test to read parquet file with empty row group:
   - parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetReaderEmptyBlock.java
   
   ### Commits
   
   The parquet specs does not forbid empty row group and some implementations are able to generate files with empty row group. The commit aims to make ParquetFileReader robust by skipping empty row group while reading.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #1018: PARQUET-2219: ParquetFileReader skips empty row group

Posted by GitBox <gi...@apache.org>.

shangxinli commented on code in PR #1018:
URL: https://github.com/apache/parquet-mr/pull/1018#discussion_r1066042941


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##########
@@ -1038,7 +1044,10 @@ public PageReadStore readNextFilteredRowGroup() throws IOException {
     }
     BlockMetaData block = blocks.get(currentBlock);
     if (block.getRowCount() == 0L) {
-      throw new RuntimeException("Illegal row group of 0 rows");
+      LOG.warn("Read empty block at index {}", currentBlock);

Review Comment:
   Same comments as above



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] wgtmac commented on pull request #1018: PARQUET-2219: ParquetFileReader skips empty row group

Posted by GitBox <gi...@apache.org>.

wgtmac commented on PR #1018:
URL: https://github.com/apache/parquet-mr/pull/1018#issuecomment-1377485663

   > Thanks you for fixing this. I've added some comments. Also, could you add a similar test for the filtered row groups?
   
   Thanks for your review @gszadovszky !
   
   I have addressed all of your comments. Please take a look again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #1018: PARQUET-2219: ParquetFileReader skips empty row group

Posted by GitBox <gi...@apache.org>.

shangxinli commented on code in PR #1018:
URL: https://github.com/apache/parquet-mr/pull/1018#discussion_r1066038932


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##########
@@ -927,7 +925,15 @@ public PageReadStore readRowGroup(int blockIndex) throws IOException {
    * @return the PageReadStore which can provide PageReaders for each column.
    */
   public PageReadStore readNextRowGroup() throws IOException {
-    ColumnChunkPageReadStore rowGroup = internalReadRowGroup(currentBlock);
+    ColumnChunkPageReadStore rowGroup = null;
+    try {
+      rowGroup = internalReadRowGroup(currentBlock);
+    } catch (ParquetEmptyBlockException e) {
+      LOG.warn("Read empty block at index {}", currentBlock);

Review Comment:
   Any way to add file path?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] shangxinli commented on pull request #1018: PARQUET-2219: ParquetFileReader skips empty row group

Posted by GitBox <gi...@apache.org>.

shangxinli commented on PR #1018:
URL: https://github.com/apache/parquet-mr/pull/1018#issuecomment-1376751333

   @gszadovszky Nice to see you are back!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] gszadovszky commented on a diff in pull request #1018: PARQUET-2219: ParquetFileReader skips empty row group

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on code in PR #1018:
URL: https://github.com/apache/parquet-mr/pull/1018#discussion_r1064374553


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##########
@@ -1038,7 +1044,9 @@ public PageReadStore readNextFilteredRowGroup() throws IOException {
     }
     BlockMetaData block = blocks.get(currentBlock);
     if (block.getRowCount() == 0L) {
-      throw new RuntimeException("Illegal row group of 0 rows");
+      // Skip the empty block
+      advanceToNextBlock();
+      return readNextFilteredRowGroup();

Review Comment:
   There is a warning log in case of `readNextRowGroup` but here we don't log anything.



##########
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestParquetReaderEmptyBlock.java:
##########
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.column.page.PageReadStore;
+import org.apache.parquet.hadoop.metadata.ParquetMetadata;
+import org.junit.Assert;
+import org.junit.Test;
+
+import java.io.IOException;
+import java.net.URISyntaxException;
+
+public class TestParquetReaderEmptyBlock {
+
+  private static final Path EMPTY_BLOCK_FILE_1 = createPathFromCP("/test-empty-row-group_1.parquet");
+
+  private static final Path EMPTY_BLOCK_FILE_2 = createPathFromCP("/test-empty-row-group_2.parquet");
+
+  private static Path createPathFromCP(String path) {
+    try {
+      return new Path(TestParquetReaderEmptyBlock.class.getResource(path).toURI());
+    } catch (URISyntaxException e) {
+      throw new RuntimeException(e);
+    }
+  }
+
+  @Test
+  public void testReadOnlyEmptyBlock() throws IOException {
+    Configuration conf = new Configuration();
+    ParquetMetadata readFooter = ParquetFileReader.readFooter(conf, EMPTY_BLOCK_FILE_1);
+
+    // The parquet file contains only one empty row group
+    Assert.assertEquals(1, readFooter.getBlocks().size());
+
+    // The empty block is skipped
+    try (ParquetFileReader r = new ParquetFileReader(conf, EMPTY_BLOCK_FILE_1, readFooter)) {
+      Assert.assertNull(r.readNextRowGroup());
+    }
+  }
+
+  @Test
+  public void testSkipEmptyBlock() throws IOException {
+    Configuration conf = new Configuration();
+    ParquetMetadata readFooter = ParquetFileReader.readFooter(conf, EMPTY_BLOCK_FILE_2);
+
+    // The parquet file contains three row groups, the second one is empty

Review Comment:
   I think, it would be nice to test the case of multiple empty row groups next to each other.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] wgtmac commented on pull request #1018: PARQUET-2219: ParquetFileReader skips empty row group

Posted by GitBox <gi...@apache.org>.

wgtmac commented on PR #1018:
URL: https://github.com/apache/parquet-mr/pull/1018#issuecomment-1374806195

   @gszadovszky @ggershinsky @shangxinli @sunchao Could you please take a look when you have time?
   
   cc @emkornfield 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] shangxinli merged pull request #1018: PARQUET-2219: ParquetFileReader skips empty row group

Posted by GitBox <gi...@apache.org>.

shangxinli merged PR #1018:
URL: https://github.com/apache/parquet-mr/pull/1018


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] gszadovszky commented on pull request #1018: PARQUET-2219: ParquetFileReader skips empty row group

Posted by GitBox <gi...@apache.org>.

gszadovszky commented on PR #1018:
URL: https://github.com/apache/parquet-mr/pull/1018#issuecomment-1377700950

   > @gszadovszky Nice to see you are back!
   
   @shangxinli, I wouldn't say I'm back, unfortunately. I'm a bit closer to Parquet at Dremio but actually not working on it. We'll see if I will have some spare time for it. :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1018: PARQUET-2219: ParquetFileReader skips empty row group

Posted by GitBox <gi...@apache.org>.

wgtmac commented on code in PR #1018:
URL: https://github.com/apache/parquet-mr/pull/1018#discussion_r1066592694


##########
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java:
##########
@@ -927,7 +925,15 @@ public PageReadStore readRowGroup(int blockIndex) throws IOException {
    * @return the PageReadStore which can provide PageReaders for each column.
    */
   public PageReadStore readNextRowGroup() throws IOException {
-    ColumnChunkPageReadStore rowGroup = internalReadRowGroup(currentBlock);
+    ColumnChunkPageReadStore rowGroup = null;
+    try {
+      rowGroup = internalReadRowGroup(currentBlock);
+    } catch (ParquetEmptyBlockException e) {
+      LOG.warn("Read empty block at index {}", currentBlock);

Review Comment:
   Added the file path to the log. Please take a look again. Thanks!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscribe@parquet.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org