You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by "gustavoatt (via GitHub)" <gi...@apache.org> on 2023/04/14 22:28:22 UTC

[GitHub] [iceberg] gustavoatt opened a new pull request, #7352: Fix IcebergGenerics::read to read metadata tables

gustavoatt opened a new pull request, #7352:
URL: https://github.com/apache/iceberg/pull/7352

   Add support on `IcebergGenerics` to read metadata files (e.g. `#partitions`, `#snapshots`, etc) without having to rely on a compute engine (or lower level APIs) to read the metadata of Iceberg tables.
   
   This fixes https://github.com/apache/iceberg/issues/7351 which was throwing an `UnsupportedOperationException` when reading metadata tables.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] szehon-ho commented on a diff in pull request #7352: Fix IcebergGenerics::read to read metadata tables

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.

szehon-ho commented on code in PR #7352:
URL: https://github.com/apache/iceberg/pull/7352#discussion_r1231425205


##########
core/src/main/java/org/apache/iceberg/StaticDataTask.java:
##########
@@ -71,8 +71,10 @@ public List<DeleteFile> deletes() {
 
   @Override
   public CloseableIterable<StructLike> rows() {
-    StructProjection projection = StructProjection.create(tableSchema, projectedSchema);
-    Iterable<StructLike> projectedRows = Iterables.transform(Arrays.asList(rows), projection::wrap);
+    Iterable<StructLike> projectedRows =

Review Comment:
   This isnt great as it will have a perf hit on the non Generic use case.  Is there a reason it works now in Spark engine?
   
   I havent checked it, but it may be more an issue about ManifestReader using Avro.read().reuseContainer() (ie, we need to make a defensive copy like in many places in the code)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] gustavoatt commented on a diff in pull request #7352: Fix IcebergGenerics::read to read metadata tables

Posted by "gustavoatt (via GitHub)" <gi...@apache.org>.

gustavoatt commented on code in PR #7352:
URL: https://github.com/apache/iceberg/pull/7352#discussion_r1256437201


##########
core/src/main/java/org/apache/iceberg/StaticDataTask.java:
##########
@@ -71,8 +71,10 @@ public List<DeleteFile> deletes() {
 
   @Override
   public CloseableIterable<StructLike> rows() {
-    StructProjection projection = StructProjection.create(tableSchema, projectedSchema);
-    Iterable<StructLike> projectedRows = Iterables.transform(Arrays.asList(rows), projection::wrap);
+    Iterable<StructLike> projectedRows =

Review Comment:
   @szehon-ho I believe it work in Spark, because the Spark API [iterates one row at a time](https://github.com/apache/iceberg/blob/master/spark/v3.1/spark/src/main/java/org/apache/iceberg/spark/source/BaseDataReader.java#L103), using [this iterator](https://github.com/apache/iceberg/blob/master/spark/v3.1/spark/src/main/java/org/apache/iceberg/spark/source/RowDataReader.java#L87).
   
   I tested changing `reuseContainer` in `ManifestReader`, but it does not work. I believe the problem is with always using the same `StructProjection`. One alternative I see is modifying the `DataTask` interface and add a `reuseContainer` method there that that Generic can then set to false but it will be set to true everywhere else. What do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] gustavoatt commented on a diff in pull request #7352: Fix IcebergGenerics::read to read metadata tables

Posted by "gustavoatt (via GitHub)" <gi...@apache.org>.

gustavoatt commented on code in PR #7352:
URL: https://github.com/apache/iceberg/pull/7352#discussion_r1167314470


##########
core/src/main/java/org/apache/iceberg/StaticDataTask.java:
##########
@@ -71,8 +71,10 @@ public List<DeleteFile> deletes() {
 
   @Override
   public CloseableIterable<StructLike> rows() {
-    StructProjection projection = StructProjection.create(tableSchema, projectedSchema);
-    Iterable<StructLike> projectedRows = Iterables.transform(Arrays.asList(rows), projection::wrap);
+    Iterable<StructLike> projectedRows =

Review Comment:
   Note that I changed this to always create a new `StructProjection`. Without this all the rows when reading metadata tables were identical since they kept using the same projection.
   
   I don't believe this should be problematic since `StaticDataTask` is only used for metadata tables, but happy to hear any alternatives.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] huyuanfeng2018 commented on pull request #7352: Fix IcebergGenerics::read to read metadata tables

Posted by "huyuanfeng2018 (via GitHub)" <gi...@apache.org>.

huyuanfeng2018 commented on PR #7352:
URL: https://github.com/apache/iceberg/pull/7352#issuecomment-1517317286

   good !! this works for me
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] dramaticlly commented on pull request #7352: Fix IcebergGenerics::read to read metadata tables

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.

dramaticlly commented on PR #7352:
URL: https://github.com/apache/iceberg/pull/7352#issuecomment-1622424103

   hey @gustavoatt, I am running into similar problem and glad to see you already have a patch. Looks like @szehon-ho left some comment on https://github.com/apache/iceberg/pull/7352#discussion_r1231425205, do you want to take another look? I can also help here if needed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] edgarRd commented on a diff in pull request #7352: Fix IcebergGenerics::read to read metadata tables

Posted by "edgarRd (via GitHub)" <gi...@apache.org>.

edgarRd commented on code in PR #7352:
URL: https://github.com/apache/iceberg/pull/7352#discussion_r1168875378


##########
data/src/test/java/org/apache/iceberg/data/TestLocalScan.java:
##########
@@ -515,13 +518,117 @@ public void testAsOfTimeOlderThanFirstSnapshot() {
             "Cannot find a snapshot older than " + DateTimeUtil.formatTimestampMillis(timestamp));
   }
 
+  @Test
+  public void testLoadPartitionsTable() throws IOException {
+    Schema schema =
+        new Schema(
+            optional(1, "id", Types.IntegerType.get()), optional(2, "ds", Types.StringType.get()));
+    PartitionSpec spec = PartitionSpec.builderFor(schema).identity("ds").build();
+
+    File location = temp.newFolder("partitions_table_test_" + format.name());
+    Table table =
+        TABLES.create(
+            schema,
+            spec,
+            ImmutableMap.of(TableProperties.DEFAULT_FILE_FORMAT, format.name()),
+            location.getAbsolutePath());
+
+    GenericRecord record = GenericRecord.create(schema);
+
+    // Create two files with different partitions ds=2021-01-01 and ds=2021-01-02.
+    List<Record> firstFileRecords =
+        Lists.newArrayList(
+            record.copy(ImmutableMap.of("id", 1, "ds", "2021-01-01")),
+            record.copy(ImmutableMap.of("id", 2, "ds", "2021-01-01")));
+    List<Record> secondFileRecords =
+        Lists.newArrayList(
+            record.copy(ImmutableMap.of("id", 3, "ds", "2021-01-02")),
+            record.copy(ImmutableMap.of("id", 4, "ds", "2021-01-02")));
+
+    PartitionKey firstPartitionKey = new PartitionKey(spec, schema);
+    firstPartitionKey.partition(firstFileRecords.get(0));
+    PartitionKey secondPartitionKey = new PartitionKey(spec, schema);
+    secondPartitionKey.partition(secondFileRecords.get(0));
+
+    DataFile df1 =
+        writeFile(
+            location.getAbsolutePath(),
+            format.addExtension("first"),
+            schema,
+            spec,
+            firstPartitionKey,
+            firstFileRecords);
+    DataFile df2 =
+        writeFile(
+            location.getAbsolutePath(),
+            format.addExtension("second"),
+            schema,
+            spec,
+            secondPartitionKey,
+            secondFileRecords);
+    table.newAppend().appendFile(df1).appendFile(df2).commit();
+
+    Table partitionsTable = TABLES.load(table.name() + "#partitions");
+
+    // Verify that we can read the partitions table correctly, and we can get both partitions.
+    Set<Record> actualRecords =
+        Sets.newHashSet(IcebergGenerics.read(partitionsTable).select("partition").build());

Review Comment:
   Perhaps you can simplify this and the following check by just querying `partition.ds`, collecting to a set and asserting the set is the same you expect.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] gustavoatt commented on a diff in pull request #7352: Fix IcebergGenerics::read to read metadata tables

Posted by "gustavoatt (via GitHub)" <gi...@apache.org>.

gustavoatt commented on code in PR #7352:
URL: https://github.com/apache/iceberg/pull/7352#discussion_r1168993116


##########
data/src/test/java/org/apache/iceberg/data/TestLocalScan.java:
##########
@@ -515,13 +518,117 @@ public void testAsOfTimeOlderThanFirstSnapshot() {
             "Cannot find a snapshot older than " + DateTimeUtil.formatTimestampMillis(timestamp));
   }
 
+  @Test
+  public void testLoadPartitionsTable() throws IOException {
+    Schema schema =
+        new Schema(
+            optional(1, "id", Types.IntegerType.get()), optional(2, "ds", Types.StringType.get()));
+    PartitionSpec spec = PartitionSpec.builderFor(schema).identity("ds").build();
+
+    File location = temp.newFolder("partitions_table_test_" + format.name());
+    Table table =
+        TABLES.create(
+            schema,
+            spec,
+            ImmutableMap.of(TableProperties.DEFAULT_FILE_FORMAT, format.name()),
+            location.getAbsolutePath());
+
+    GenericRecord record = GenericRecord.create(schema);
+
+    // Create two files with different partitions ds=2021-01-01 and ds=2021-01-02.
+    List<Record> firstFileRecords =
+        Lists.newArrayList(
+            record.copy(ImmutableMap.of("id", 1, "ds", "2021-01-01")),
+            record.copy(ImmutableMap.of("id", 2, "ds", "2021-01-01")));
+    List<Record> secondFileRecords =
+        Lists.newArrayList(
+            record.copy(ImmutableMap.of("id", 3, "ds", "2021-01-02")),
+            record.copy(ImmutableMap.of("id", 4, "ds", "2021-01-02")));
+
+    PartitionKey firstPartitionKey = new PartitionKey(spec, schema);
+    firstPartitionKey.partition(firstFileRecords.get(0));
+    PartitionKey secondPartitionKey = new PartitionKey(spec, schema);
+    secondPartitionKey.partition(secondFileRecords.get(0));
+
+    DataFile df1 =
+        writeFile(
+            location.getAbsolutePath(),
+            format.addExtension("first"),
+            schema,
+            spec,
+            firstPartitionKey,
+            firstFileRecords);
+    DataFile df2 =
+        writeFile(
+            location.getAbsolutePath(),
+            format.addExtension("second"),
+            schema,
+            spec,
+            secondPartitionKey,
+            secondFileRecords);
+    table.newAppend().appendFile(df1).appendFile(df2).commit();
+
+    Table partitionsTable = TABLES.load(table.name() + "#partitions");
+
+    // Verify that we can read the partitions table correctly, and we can get both partitions.
+    Set<Record> actualRecords =
+        Sets.newHashSet(IcebergGenerics.read(partitionsTable).select("partition").build());

Review Comment:
   @edgarRd I tried doing that, but when I do that I still get a record with schema `struct partition<ds: string>`. I believe the column projection projects what is needed but it keeps it all in the full schema path (i.e. it discards unnecessary columns but we still need to fully address the project columns `partition.ds`).
   
   So it unfortunately does not make the test simpler.



##########
data/src/test/java/org/apache/iceberg/data/TestLocalScan.java:
##########
@@ -515,13 +518,117 @@ public void testAsOfTimeOlderThanFirstSnapshot() {
             "Cannot find a snapshot older than " + DateTimeUtil.formatTimestampMillis(timestamp));
   }
 
+  @Test
+  public void testLoadPartitionsTable() throws IOException {
+    Schema schema =
+        new Schema(
+            optional(1, "id", Types.IntegerType.get()), optional(2, "ds", Types.StringType.get()));
+    PartitionSpec spec = PartitionSpec.builderFor(schema).identity("ds").build();
+
+    File location = temp.newFolder("partitions_table_test_" + format.name());
+    Table table =
+        TABLES.create(
+            schema,
+            spec,
+            ImmutableMap.of(TableProperties.DEFAULT_FILE_FORMAT, format.name()),
+            location.getAbsolutePath());
+
+    GenericRecord record = GenericRecord.create(schema);
+
+    // Create two files with different partitions ds=2021-01-01 and ds=2021-01-02.
+    List<Record> firstFileRecords =
+        Lists.newArrayList(
+            record.copy(ImmutableMap.of("id", 1, "ds", "2021-01-01")),
+            record.copy(ImmutableMap.of("id", 2, "ds", "2021-01-01")));
+    List<Record> secondFileRecords =
+        Lists.newArrayList(
+            record.copy(ImmutableMap.of("id", 3, "ds", "2021-01-02")),
+            record.copy(ImmutableMap.of("id", 4, "ds", "2021-01-02")));
+
+    PartitionKey firstPartitionKey = new PartitionKey(spec, schema);
+    firstPartitionKey.partition(firstFileRecords.get(0));
+    PartitionKey secondPartitionKey = new PartitionKey(spec, schema);
+    secondPartitionKey.partition(secondFileRecords.get(0));
+
+    DataFile df1 =
+        writeFile(
+            location.getAbsolutePath(),
+            format.addExtension("first"),
+            schema,
+            spec,
+            firstPartitionKey,
+            firstFileRecords);
+    DataFile df2 =
+        writeFile(
+            location.getAbsolutePath(),
+            format.addExtension("second"),
+            schema,
+            spec,
+            secondPartitionKey,
+            secondFileRecords);
+    table.newAppend().appendFile(df1).appendFile(df2).commit();
+
+    Table partitionsTable = TABLES.load(table.name() + "#partitions");
+
+    // Verify that we can read the partitions table correctly, and we can get both partitions.
+    Set<Record> actualRecords =
+        Sets.newHashSet(IcebergGenerics.read(partitionsTable).select("partition").build());

Review Comment:
   @edgarRd I tried doing that, but when I do that I still get a record with schema `struct partition<ds: string>`. I believe the column projection projects what is needed but it keeps it all in the full schema path (i.e. it discards unnecessary columns but we still need to fully address the project columns `partition.ds`) https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/BaseScan.java#L239.
   
   So it unfortunately does not make the test simpler.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] huyuanfeng2018 commented on pull request #7352: Fix IcebergGenerics::read to read metadata tables

Posted by "huyuanfeng2018 (via GitHub)" <gi...@apache.org>.

huyuanfeng2018 commented on PR #7352:
URL: https://github.com/apache/iceberg/pull/7352#issuecomment-1517360759

   oh no , if table partitions is a large amount，Memory can't hold up
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org