Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/12/15 12:48:10 UTC

[GitHub] [iceberg] szlta commented on a change in pull request #3748: Hive: ORC vectorization fails when split offsets are considered during split generation

szlta commented on a change in pull request #3748:
URL: https://github.com/apache/iceberg/pull/3748#discussion_r769593880



##########
File path: mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##########
@@ -777,6 +779,42 @@ public void testStatsPopulation() throws Exception {
     Assert.assertTrue(stats.startsWith("{\"BASIC_STATS\":\"true\"")); // it's followed by column stats in Hive3
   }
 
+  /**
+   * Tests that vectorized ORC reading code path correctly handles when the same ORC file is split into multiple parts.
+   * Although the split offsets and length will not always include the file tail that contains the metadata, the
+   * vectorized reader needs to make sure to handle the tail reading regardless of the offsets. If this is not done
+   * correctly, the last SELECT query will fail.
+   * @throws Exception - any test error
+   */
+  @Test
+  public void testVectorizedOrcMultipleSplits() throws Exception {
+    assumeTrue(isVectorized && FileFormat.ORC.equals(fileFormat));
+
+    try {
+      // This data will be held by a ~870kB ORC file
+      List<Record> records = TestHelper.generateRandomRecords(HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA,
+          20000, 0L);
+
+      // To support splitting the ORC file, we need to specify the stripe size to a small value. It looks like the min
+      // value is about 220kB, no smaller stripes are written by ORC. Anyway, this setting will produce 4 stripes.
+      shell.getHiveConf().set("orc.stripe.size", "200000");
+
+      testTables.createTable(shell, "targettab", HiveIcebergStorageHandlerTestUtils.CUSTOMER_SCHEMA,
+          fileFormat, records);
+
+      // Will request 4 splits, separated on the exact stripe boundaries within the ORC file.

Review comment:
       Unfortunately, it's very deeply internal.
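
For readers following the Javadoc in the diff above, here is a minimal sketch of why stripe-aligned splits may not include the ORC file tail. It is not code from this pull request: the class `StripeSplitSketch`, the type `Split`, both helper methods, and all byte offsets are invented for illustration, and the sketch assumes each split covers exactly one stripe.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: none of these names come from Iceberg, Hive, or ORC.
class StripeSplitSketch {

  // Hypothetical value type: the byte range [offset, offset + length) of one split.
  static final class Split {
    final long offset;
    final long length;

    Split(long offset, long length) {
      this.offset = offset;
      this.length = length;
    }

    long end() {
      return offset + length;
    }
  }

  // Builds one split per stripe, so every split starts and ends on a stripe boundary.
  static List<Split> splitsOnStripeBoundaries(long[] stripeOffsets, long[] stripeLengths) {
    if (stripeOffsets.length != stripeLengths.length) {
      throw new IllegalArgumentException("offsets and lengths must have the same size");
    }
    List<Split> splits = new ArrayList<>();
    for (int i = 0; i < stripeOffsets.length; i++) {
      splits.add(new Split(stripeOffsets[i], stripeLengths[i]));
    }
    return splits;
  }

  // Counts the splits whose byte range reaches the end of the file, where the tail is stored.
  static int countSplitsReachingFileEnd(List<Split> splits, long fileLength) {
    int count = 0;
    for (Split split : splits) {
      if (split.end() >= fileLength) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    // Invented numbers: four 220,000-byte stripes followed by a tail in an 885,000-byte file.
    long[] offsets = {3L, 220_003L, 440_003L, 660_003L};
    long[] lengths = {220_000L, 220_000L, 220_000L, 220_000L};
    long fileLength = 885_000L;

    List<Split> splits = splitsOnStripeBoundaries(offsets, lengths);
    System.out.println("splits: " + splits.size());
    System.out.println("splits reaching file end: " + countSplitsReachingFileEnd(splits, fileLength));
  }
}
```

Under these assumptions no split's byte range reaches the end of the file, so a reader handed only a split's offset and length would still have to locate and read the tail separately to obtain the file metadata.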




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
For additional commands, e-mail: issues-help@iceberg.apache.org