You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "nastra (via GitHub)" <gi...@apache.org> on 2023/04/03 09:12:51 UTC

[GitHub] [iceberg] nastra commented on a diff in pull request #6934: core: add JSON parser for ContentFile and FileScanTask

nastra commented on code in PR #6934:
URL: https://github.com/apache/iceberg/pull/6934#discussion_r1155658780


##########
core/src/main/java/org/apache/iceberg/ContentFileParser.java:
##########
@@ -0,0 +1,262 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.databind.JsonNode;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.util.JsonUtil;
+
+class ContentFileParser {
+  private static final String SPEC_ID = "spec-id";
+  private static final String CONTENT = "content";
+  private static final String FILE_PATH = "file-path";
+  private static final String FILE_FORMAT = "file-format";
+  private static final String PARTITION = "partition";
+  private static final String RECORD_COUNT = "record-count";
+  private static final String FILE_SIZE = "file-size-in-bytes";
+  private static final String COLUMN_SIZES = "column-sizes";
+  private static final String VALUE_COUNTS = "value-counts";
+  private static final String NULL_VALUE_COUNTS = "null-value-counts";
+  private static final String NAN_VALUE_COUNTS = "nan-value-counts";
+  private static final String LOWER_BOUNDS = "lower-bounds";
+  private static final String UPPER_BOUNDS = "upper-bounds";
+  private static final String KEY_METADATA = "key-metadata";
+  private static final String SPLIT_OFFSETS = "split-offsets";
+  private static final String EQUALITY_IDS = "equality-ids";
+  private static final String SORT_ORDER_ID = "sort-order-id";
+
+  private ContentFileParser() {}
+
+  private static boolean hasPartitionData(StructLike partitionData) {
+    return partitionData != null && partitionData.size() > 0;
+  }
+
+  static void toJson(ContentFile<?> contentFile, PartitionSpec spec, JsonGenerator generator)
+      throws IOException {
+    Preconditions.checkArgument(contentFile != null, "Content file cannot be null");

Review Comment:
   nit: I think the preferred pattern for error messages is `Invalid content file: null`.
   There are a few places in this PR that might need to be updated to that



##########
core/src/main/java/org/apache/iceberg/FileScanTaskParser.java:
##########
@@ -0,0 +1,137 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.databind.JsonNode;
+import java.io.IOException;
+import java.io.StringWriter;
+import java.io.UncheckedIOException;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.ExpressionParser;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.expressions.ResidualEvaluator;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.util.JsonUtil;
+
+public class FileScanTaskParser {
+  private static final String SCHEMA = "schema";
+  private static final String SPEC = "spec";
+  private static final String DATA_FILE = "data-file";
+  private static final String DELETE_FILES = "delete-files";
+  private static final String RESIDUAL = "residual-filter";
+
+  private FileScanTaskParser() {}
+
+  public static String toJson(FileScanTask fileScanTask) {
+    Preconditions.checkArgument(fileScanTask != null, "File scan task cannot be null");
+    try (StringWriter writer = new StringWriter()) {

Review Comment:
   the entire method can be simplified to `return JsonUtil.generate(toJson -> FileScanTaskParser.toJson(fileScanTask, toJson), false);`
   
   and the`Preconditions.checkArgument(fileScanTask != null...` check can move into the below `toJson(FileScanTask, JsonGenerator)` method



##########
core/src/test/java/org/apache/iceberg/TestFileScanTaskParser.java:
##########
@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.util.stream.Stream;
+import org.apache.iceberg.expressions.ExpressionUtil;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.expressions.ResidualEvaluator;
+import org.assertj.core.api.Assertions;
+import org.junit.Assert;
+import org.junit.Test;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+
+public class TestFileScanTaskParser {
+  @Test
+  public void testNullArguments() {
+    Assertions.assertThatThrownBy(() -> FileScanTaskParser.toJson(null))
+        .isInstanceOf(IllegalArgumentException.class)
+        .hasMessage("File scan task cannot be null");
+
+    Assertions.assertThatThrownBy(() -> FileScanTaskParser.fromJson(null, true))
+        .isInstanceOf(IllegalArgumentException.class)
+        .hasMessage("Cannot parse file scan task from null JSON string");
+  }
+
+  private static Stream<Arguments> provideArgs() {
+    return Stream.of(Arguments.of(true), Arguments.of(false));
+  }
+
+  @ParameterizedTest
+  @MethodSource("provideArgs")
+  public void testParser(boolean caseSensitive) {
+    PartitionSpec spec = TableTestBase.SPEC;
+    FileScanTask fileScanTask = createScanTask(spec, caseSensitive);
+    String jsonStr = FileScanTaskParser.toJson(fileScanTask);
+    FileScanTask deserializedTask = FileScanTaskParser.fromJson(jsonStr, caseSensitive);
+    assertFileScanTaskEquals(fileScanTask, deserializedTask, spec, caseSensitive);
+  }
+
+  private FileScanTask createScanTask(PartitionSpec spec, boolean caseSensitive) {
+    ResidualEvaluator residualEvaluator;
+    if (spec.isUnpartitioned()) {
+      residualEvaluator = ResidualEvaluator.unpartitioned(Expressions.alwaysTrue());
+    } else {
+      residualEvaluator = ResidualEvaluator.of(spec, Expressions.equal("id", 1), caseSensitive);
+    }
+
+    return new BaseFileScanTask(
+        TableTestBase.FILE_A,
+        new DeleteFile[] {TableTestBase.FILE_A_DELETES, TableTestBase.FILE_A2_DELETES},
+        SchemaParser.toJson(TableTestBase.SCHEMA),
+        PartitionSpecParser.toJson(spec),
+        residualEvaluator);
+  }
+
+  private static void assertFileScanTaskEquals(
+      FileScanTask expected, FileScanTask actual, PartitionSpec spec, boolean caseSensitive) {
+    TestContentFileParser.assertContentFileEquals(expected.file(), actual.file(), spec);
+    Assert.assertEquals(expected.deletes().size(), actual.deletes().size());

Review Comment:
   just a suggestion but `Assertions.assertThat(actual.deletes()).hasSameSizeAs(expected.deletes());` would be a more preferrable check, because it would show you the content of both collections if the check ever fails



##########
core/src/test/java/org/apache/iceberg/TestContentFileParser.java:
##########
@@ -0,0 +1,245 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.databind.JsonNode;
+import java.io.StringWriter;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.stream.Stream;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.types.Comparators;
+import org.apache.iceberg.types.Conversions;
+import org.apache.iceberg.types.Types;
+import org.apache.iceberg.util.JsonUtil;
+import org.assertj.core.api.Assertions;
+import org.junit.Assert;
+import org.junit.Test;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+
+public class TestContentFileParser {
+  @Test
+  public void testNullArguments() throws Exception {
+    StringWriter stringWriter = new StringWriter();
+    JsonGenerator jsonStringGenerator = JsonUtil.factory().createGenerator(stringWriter);

Review Comment:
   In order to avoid creating a json generator in tests you could add 
   ```
   static void toJson(ContentFile<?> contentFile, PartitionSpec spec) {
       JsonUtil.generate(gen -> ContentFileParser.toJson(contentFile, spec, gen), false);
     }
   ```
   to `ContentFileParser`



##########
core/src/test/java/org/apache/iceberg/TestContentFileParser.java:
##########
@@ -0,0 +1,245 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.databind.JsonNode;
+import java.io.StringWriter;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.stream.Stream;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.types.Comparators;
+import org.apache.iceberg.types.Conversions;
+import org.apache.iceberg.types.Types;
+import org.apache.iceberg.util.JsonUtil;
+import org.assertj.core.api.Assertions;
+import org.junit.Assert;
+import org.junit.Test;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+
+public class TestContentFileParser {
+  @Test
+  public void testNullArguments() throws Exception {
+    StringWriter stringWriter = new StringWriter();
+    JsonGenerator jsonStringGenerator = JsonUtil.factory().createGenerator(stringWriter);
+
+    Assertions.assertThatThrownBy(
+            () -> ContentFileParser.toJson(null, TableTestBase.SPEC, jsonStringGenerator))
+        .isInstanceOf(IllegalArgumentException.class)
+        .hasMessage("Content file cannot be null");
+
+    Assertions.assertThatThrownBy(() -> ContentFileParser.fromJson(null, TableTestBase.SPEC))
+        .isInstanceOf(IllegalArgumentException.class)
+        .hasMessage("Cannot parse content file from null JSON node");
+  }
+
+  private static Stream<Arguments> provideSpecAndDataFile() {
+    return Stream.of(
+        Arguments.of(
+            PartitionSpec.unpartitioned(), dataFileWithRequiredOnly(PartitionSpec.unpartitioned())),
+        Arguments.of(
+            PartitionSpec.unpartitioned(), dataFileWithAllOptional(PartitionSpec.unpartitioned())),
+        Arguments.of(TableTestBase.SPEC, dataFileWithRequiredOnly(TableTestBase.SPEC)),
+        Arguments.of(TableTestBase.SPEC, dataFileWithAllOptional(TableTestBase.SPEC)));
+  }
+
+  private static DataFile dataFileWithRequiredOnly(PartitionSpec spec) {
+    DataFiles.Builder builder =
+        DataFiles.builder(spec)
+            .withPath("/path/to/data-a.parquet")
+            .withFileSizeInBytes(10)
+            .withRecordCount(1);
+
+    if (spec.isPartitioned()) {
+      // easy way to set partition data for now
+      builder.withPartitionPath("data_bucket=1");
+    }
+
+    return builder.build();
+  }
+
+  private static DataFile dataFileWithAllOptional(PartitionSpec spec) {
+    DataFiles.Builder builder =
+        DataFiles.builder(spec)
+            .withPath("/path/to/data-with-stats.parquet")
+            .withMetrics(
+                new Metrics(
+                    10L, // record count
+                    ImmutableMap.of(3, 100L, 4, 200L), // column sizes
+                    ImmutableMap.of(3, 90L, 4, 180L), // value counts
+                    ImmutableMap.of(3, 10L, 4, 20L), // null value counts
+                    ImmutableMap.of(3, 0L, 4, 0L), // nan value counts
+                    ImmutableMap.of(
+                        3,
+                        Conversions.toByteBuffer(Types.IntegerType.get(), 1),
+                        4,
+                        Conversions.toByteBuffer(Types.IntegerType.get(), 2)), // lower bounds
+                    ImmutableMap.of(
+                        3,
+                        Conversions.toByteBuffer(Types.IntegerType.get(), 5),
+                        4,
+                        Conversions.toByteBuffer(Types.IntegerType.get(), 10)) // upperbounds
+                    ))
+            .withFileSizeInBytes(350)
+            .withSplitOffsets(Arrays.asList(128L, 256L))
+            .withEqualityFieldIds(Collections.singletonList(1))
+            .withEncryptionKeyMetadata(ByteBuffer.wrap(new byte[16]))
+            .withSortOrder(
+                SortOrder.builderFor(TableTestBase.SCHEMA)
+                    .withOrderId(1)
+                    .sortBy("id", SortDirection.ASC, NullOrder.NULLS_FIRST)
+                    .build());
+
+    if (spec.isPartitioned()) {
+      // easy way to set partition data for now
+      builder.withPartitionPath("data_bucket=1");
+    }
+
+    return builder.build();
+  }
+
+  private static Stream<Arguments> provideSpecAndDeleteFile() {
+    return Stream.of(
+        Arguments.of(
+            PartitionSpec.unpartitioned(),
+            deleteFileWithRequiredOnly(PartitionSpec.unpartitioned())),
+        Arguments.of(
+            PartitionSpec.unpartitioned(),
+            deleteFileWithAllOptional(PartitionSpec.unpartitioned())),
+        Arguments.of(TableTestBase.SPEC, deleteFileWithRequiredOnly(TableTestBase.SPEC)),
+        Arguments.of(TableTestBase.SPEC, deleteFileWithAllOptional(TableTestBase.SPEC)));
+  }
+
+  private static DeleteFile deleteFileWithRequiredOnly(PartitionSpec spec) {
+    PartitionData partitionData = null;
+    if (spec.isPartitioned()) {
+      partitionData = new PartitionData(spec.partitionType());
+      partitionData.set(0, 9);
+    }
+
+    return new GenericDeleteFile(
+        spec.specId(),
+        FileContent.POSITION_DELETES,
+        "/path/to/delete-a.parquet",
+        FileFormat.PARQUET,
+        partitionData,
+        1234,
+        new Metrics(9L, null, null, null, null),
+        null,
+        null,
+        null);
+  }
+
+  private static DeleteFile deleteFileWithAllOptional(PartitionSpec spec) {
+    PartitionData partitionData = new PartitionData(spec.partitionType());
+    if (spec.isPartitioned()) {
+      partitionData.set(0, 9);
+    }
+
+    Metrics metrics =
+        new Metrics(
+            10L, // record count
+            ImmutableMap.of(3, 100L, 4, 200L), // column sizes
+            ImmutableMap.of(3, 90L, 4, 180L), // value counts
+            ImmutableMap.of(3, 10L, 4, 20L), // null value counts
+            ImmutableMap.of(3, 0L, 4, 0L), // nan value counts
+            ImmutableMap.of(
+                3,
+                Conversions.toByteBuffer(Types.IntegerType.get(), 1),
+                4,
+                Conversions.toByteBuffer(Types.IntegerType.get(), 2)), // lower bounds
+            ImmutableMap.of(
+                3,
+                Conversions.toByteBuffer(Types.IntegerType.get(), 5),
+                4,
+                Conversions.toByteBuffer(Types.IntegerType.get(), 10)) // upperbounds
+            );
+
+    return new GenericDeleteFile(
+        spec.specId(),
+        FileContent.EQUALITY_DELETES,
+        "/path/to/delete-with-stats.parquet",
+        FileFormat.PARQUET,
+        partitionData,
+        1234,
+        metrics,
+        new int[] {3},
+        1,
+        ByteBuffer.wrap(new byte[16]));
+  }
+
+  @ParameterizedTest
+  @MethodSource("provideSpecAndDataFile")
+  public void testDataFile(PartitionSpec spec, DataFile dataFile) throws Exception {

Review Comment:
   I think it makes sense to move the actual tests to the top of the class



##########
core/src/test/java/org/apache/iceberg/TestContentFileParser.java:
##########
@@ -0,0 +1,245 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.databind.JsonNode;
+import java.io.StringWriter;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.stream.Stream;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.types.Comparators;
+import org.apache.iceberg.types.Conversions;
+import org.apache.iceberg.types.Types;
+import org.apache.iceberg.util.JsonUtil;
+import org.assertj.core.api.Assertions;
+import org.junit.Assert;
+import org.junit.Test;

Review Comment:
   `org.junit.Test` -> `org.junit.jupiter.api.Test`



##########
core/src/test/java/org/apache/iceberg/TestContentFileParser.java:
##########
@@ -0,0 +1,245 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.databind.JsonNode;
+import java.io.StringWriter;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.stream.Stream;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.types.Comparators;
+import org.apache.iceberg.types.Conversions;
+import org.apache.iceberg.types.Types;
+import org.apache.iceberg.util.JsonUtil;
+import org.assertj.core.api.Assertions;
+import org.junit.Assert;
+import org.junit.Test;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+
+public class TestContentFileParser {
+  @Test
+  public void testNullArguments() throws Exception {
+    StringWriter stringWriter = new StringWriter();
+    JsonGenerator jsonStringGenerator = JsonUtil.factory().createGenerator(stringWriter);
+
+    Assertions.assertThatThrownBy(
+            () -> ContentFileParser.toJson(null, TableTestBase.SPEC, jsonStringGenerator))
+        .isInstanceOf(IllegalArgumentException.class)
+        .hasMessage("Content file cannot be null");
+
+    Assertions.assertThatThrownBy(() -> ContentFileParser.fromJson(null, TableTestBase.SPEC))

Review Comment:
   I think we should also add calls where the spec + the json generator is null



##########
core/src/test/java/org/apache/iceberg/TestFileScanTaskParser.java:
##########
@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.util.stream.Stream;
+import org.apache.iceberg.expressions.ExpressionUtil;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.expressions.ResidualEvaluator;
+import org.assertj.core.api.Assertions;
+import org.junit.Assert;
+import org.junit.Test;

Review Comment:
   while this PR was open we added https://iceberg.apache.org/contribute/#junit4--junit5, so I think it would make sense to switch the test to using JUnit5 (`org.junit.jupiter.api.Test` rathern than `org.junit.Test`)
   



##########
core/src/main/java/org/apache/iceberg/FileScanTaskParser.java:
##########
@@ -0,0 +1,141 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.databind.JsonNode;
+import com.github.benmanes.caffeine.cache.Caffeine;
+import com.github.benmanes.caffeine.cache.LoadingCache;
+import java.io.IOException;
+import java.io.StringWriter;
+import java.io.UncheckedIOException;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.ExpressionParser;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.expressions.ResidualEvaluator;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableList;
+import org.apache.iceberg.util.JsonUtil;
+
+public class FileScanTaskParser {
+  private static final String SCHEMA = "schema";
+  private static final String SPEC = "spec";
+  private static final String DATA_FILE = "data-file";
+  private static final String DELETE_FILES = "delete-files";
+  private static final String RESIDUAL = "residual-filter";
+
+  private final boolean caseSensitive;
+  private final LoadingCache<PartitionSpec, ContentFileParser> contentFileParsersBySpec;
+
+  public FileScanTaskParser(boolean caseSensitive) {
+    this.caseSensitive = caseSensitive;
+    this.contentFileParsersBySpec =
+        Caffeine.newBuilder().weakKeys().build(spec -> new ContentFileParser(spec));
+  }
+
+  public String toJson(FileScanTask fileScanTask) {
+    try (StringWriter writer = new StringWriter()) {
+      JsonGenerator generator = JsonUtil.factory().createGenerator(writer);
+      toJson(fileScanTask, generator);
+      generator.flush();
+      return writer.toString();
+    } catch (IOException e) {
+      throw new UncheckedIOException("Failed to write json for: " + fileScanTask, e);
+    }
+  }
+
+  void toJson(FileScanTask fileScanTask, JsonGenerator generator) throws IOException {
+    generator.writeStartObject();

Review Comment:
   as mentioned in the above comment, the null check should be done at the beginning of this method because `toJson(FileScanTask fileScanTask)` will just call this method



##########
core/src/main/java/org/apache/iceberg/ContentFileParser.java:
##########
@@ -0,0 +1,286 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.databind.JsonNode;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.List;
+import java.util.Map;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.util.ArrayUtil;
+import org.apache.iceberg.util.JsonUtil;
+
+class ContentFileParser {
+  private static final String SPEC_ID = "spec-id";
+  private static final String CONTENT = "content";
+  private static final String FILE_PATH = "file-path";
+  private static final String FILE_FORMAT = "file-format";
+  private static final String PARTITION = "partition";
+  private static final String RECORD_COUNT = "record-count";
+  private static final String FILE_SIZE = "file-size-in-bytes";
+  private static final String COLUMN_SIZES = "column-sizes";
+  private static final String VALUE_COUNTS = "value-counts";
+  private static final String NULL_VALUE_COUNTS = "null-value-counts";
+  private static final String NAN_VALUE_COUNTS = "nan-value-counts";
+  private static final String LOWER_BOUNDS = "lower-bounds";
+  private static final String UPPER_BOUNDS = "upper-bounds";
+  private static final String KEY_METADATA = "key-metadata";
+  private static final String SPLIT_OFFSETS = "split-offsets";
+  private static final String EQUALITY_IDS = "equality-ids";
+  private static final String SORT_ORDER_ID = "sort-order-id";
+
+  private final PartitionSpec spec;
+
+  ContentFileParser(PartitionSpec spec) {
+    this.spec = spec;
+  }
+
+  private boolean hasPartitionData(StructLike partitionData) {
+    return partitionData != null && partitionData.size() > 0;
+  }
+
+  void toJson(ContentFile contentFile, JsonGenerator generator) throws IOException {
+    generator.writeStartObject();
+
+    // ignore the ordinal position (ContentFile#pos) of the file in a manifest,
+    // as it isn't used and BaseFile constructor doesn't support it.
+
+    Preconditions.checkArgument(
+        contentFile.specId() == spec.specId(),
+        "Partition spec id mismatch: expected = %s, actual = %s",
+        spec.specId(),
+        contentFile.specId());
+    generator.writeNumberField(SPEC_ID, contentFile.specId());
+
+    generator.writeStringField(CONTENT, contentFile.content().name());
+    generator.writeStringField(FILE_PATH, contentFile.path().toString());
+    generator.writeStringField(FILE_FORMAT, contentFile.format().name());
+
+    Preconditions.checkArgument(
+        spec.isPartitioned() == hasPartitionData(contentFile.partition()),
+        "Invalid data file: partition data (%s) doesn't match the expected (%s)",
+        hasPartitionData(contentFile.partition()),
+        spec.isPartitioned());
+    if (contentFile.partition() != null) {
+      generator.writeFieldName(PARTITION);
+      SingleValueParser.toJson(spec.partitionType(), contentFile.partition(), generator);
+    }
+
+    generator.writeNumberField(RECORD_COUNT, contentFile.recordCount());
+    generator.writeNumberField(FILE_SIZE, contentFile.fileSizeInBytes());
+
+    if (contentFile.columnSizes() != null) {
+      generator.writeFieldName(COLUMN_SIZES);
+      SingleValueParser.toJson(DataFile.COLUMN_SIZES.type(), contentFile.columnSizes(), generator);
+    }
+
+    if (contentFile.valueCounts() != null) {
+      generator.writeFieldName(VALUE_COUNTS);
+      SingleValueParser.toJson(DataFile.VALUE_COUNTS.type(), contentFile.valueCounts(), generator);
+    }
+
+    if (contentFile.nullValueCounts() != null) {
+      generator.writeFieldName(NULL_VALUE_COUNTS);
+      SingleValueParser.toJson(
+          DataFile.NULL_VALUE_COUNTS.type(), contentFile.nullValueCounts(), generator);
+    }
+
+    if (contentFile.nullValueCounts() != null) {
+      generator.writeFieldName(NAN_VALUE_COUNTS);
+      SingleValueParser.toJson(
+          DataFile.NAN_VALUE_COUNTS.type(), contentFile.nanValueCounts(), generator);
+    }
+
+    if (contentFile.lowerBounds() != null) {
+      generator.writeFieldName(LOWER_BOUNDS);
+      SingleValueParser.toJson(DataFile.LOWER_BOUNDS.type(), contentFile.lowerBounds(), generator);
+    }
+
+    if (contentFile.upperBounds() != null) {
+      generator.writeFieldName(UPPER_BOUNDS);
+      SingleValueParser.toJson(DataFile.UPPER_BOUNDS.type(), contentFile.upperBounds(), generator);
+    }
+
+    if (contentFile.keyMetadata() != null) {
+      generator.writeFieldName(KEY_METADATA);
+      SingleValueParser.toJson(DataFile.KEY_METADATA.type(), contentFile.keyMetadata(), generator);
+    }
+
+    if (contentFile.splitOffsets() != null) {
+      generator.writeFieldName(SPLIT_OFFSETS);
+      SingleValueParser.toJson(
+          DataFile.SPLIT_OFFSETS.type(), contentFile.splitOffsets(), generator);
+    }
+
+    if (contentFile.equalityFieldIds() != null) {
+      generator.writeFieldName(EQUALITY_IDS);
+      SingleValueParser.toJson(
+          DataFile.EQUALITY_IDS.type(), contentFile.equalityFieldIds(), generator);
+    }
+
+    if (contentFile.sortOrderId() != null) {
+      generator.writeNumberField(SORT_ORDER_ID, contentFile.sortOrderId());
+    }
+
+    generator.writeEndObject();
+  }
+
+  ContentFile fromJson(JsonNode jsonNode) {
+    Preconditions.checkArgument(

Review Comment:
   should we also add a null check for spec here?



##########
core/src/test/java/org/apache/iceberg/TestFileScanTaskParser.java:
##########
@@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import java.util.stream.Stream;
+import org.apache.iceberg.expressions.ExpressionUtil;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.expressions.ResidualEvaluator;
+import org.assertj.core.api.Assertions;
+import org.junit.Assert;
+import org.junit.Test;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+
+public class TestFileScanTaskParser {
+  @Test
+  public void testNullArguments() {
+    Assertions.assertThatThrownBy(() -> FileScanTaskParser.toJson(null))
+        .isInstanceOf(IllegalArgumentException.class)
+        .hasMessage("File scan task cannot be null");
+
+    Assertions.assertThatThrownBy(() -> FileScanTaskParser.fromJson(null, true))
+        .isInstanceOf(IllegalArgumentException.class)
+        .hasMessage("Cannot parse file scan task from null JSON string");
+  }
+
+  private static Stream<Arguments> provideArgs() {
+    return Stream.of(Arguments.of(true), Arguments.of(false));
+  }
+
+  @ParameterizedTest
+  @MethodSource("provideArgs")

Review Comment:
   we should be able to simplify this to `@ValueSource(booleans = {true, false})` rather than having a method source that provides booleans



##########
core/src/test/java/org/apache/iceberg/TestContentFileParser.java:
##########
@@ -0,0 +1,245 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg;
+
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.databind.JsonNode;
+import java.io.StringWriter;
+import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.stream.Stream;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.types.Comparators;
+import org.apache.iceberg.types.Conversions;
+import org.apache.iceberg.types.Types;
+import org.apache.iceberg.util.JsonUtil;
+import org.assertj.core.api.Assertions;
+import org.junit.Assert;
+import org.junit.Test;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
+
+public class TestContentFileParser {
+  @Test
+  public void testNullArguments() throws Exception {
+    StringWriter stringWriter = new StringWriter();
+    JsonGenerator jsonStringGenerator = JsonUtil.factory().createGenerator(stringWriter);
+
+    Assertions.assertThatThrownBy(
+            () -> ContentFileParser.toJson(null, TableTestBase.SPEC, jsonStringGenerator))
+        .isInstanceOf(IllegalArgumentException.class)
+        .hasMessage("Content file cannot be null");
+
+    Assertions.assertThatThrownBy(() -> ContentFileParser.fromJson(null, TableTestBase.SPEC))
+        .isInstanceOf(IllegalArgumentException.class)
+        .hasMessage("Cannot parse content file from null JSON node");
+  }
+
+  private static Stream<Arguments> provideSpecAndDataFile() {
+    return Stream.of(
+        Arguments.of(
+            PartitionSpec.unpartitioned(), dataFileWithRequiredOnly(PartitionSpec.unpartitioned())),
+        Arguments.of(
+            PartitionSpec.unpartitioned(), dataFileWithAllOptional(PartitionSpec.unpartitioned())),
+        Arguments.of(TableTestBase.SPEC, dataFileWithRequiredOnly(TableTestBase.SPEC)),
+        Arguments.of(TableTestBase.SPEC, dataFileWithAllOptional(TableTestBase.SPEC)));
+  }
+
+  private static DataFile dataFileWithRequiredOnly(PartitionSpec spec) {
+    DataFiles.Builder builder =
+        DataFiles.builder(spec)
+            .withPath("/path/to/data-a.parquet")
+            .withFileSizeInBytes(10)
+            .withRecordCount(1);
+
+    if (spec.isPartitioned()) {
+      // easy way to set partition data for now
+      builder.withPartitionPath("data_bucket=1");
+    }
+
+    return builder.build();
+  }
+
+  private static DataFile dataFileWithAllOptional(PartitionSpec spec) {
+    DataFiles.Builder builder =
+        DataFiles.builder(spec)
+            .withPath("/path/to/data-with-stats.parquet")
+            .withMetrics(
+                new Metrics(
+                    10L, // record count
+                    ImmutableMap.of(3, 100L, 4, 200L), // column sizes
+                    ImmutableMap.of(3, 90L, 4, 180L), // value counts
+                    ImmutableMap.of(3, 10L, 4, 20L), // null value counts
+                    ImmutableMap.of(3, 0L, 4, 0L), // nan value counts
+                    ImmutableMap.of(
+                        3,
+                        Conversions.toByteBuffer(Types.IntegerType.get(), 1),
+                        4,
+                        Conversions.toByteBuffer(Types.IntegerType.get(), 2)), // lower bounds
+                    ImmutableMap.of(
+                        3,
+                        Conversions.toByteBuffer(Types.IntegerType.get(), 5),
+                        4,
+                        Conversions.toByteBuffer(Types.IntegerType.get(), 10)) // upperbounds
+                    ))
+            .withFileSizeInBytes(350)
+            .withSplitOffsets(Arrays.asList(128L, 256L))
+            .withEqualityFieldIds(Collections.singletonList(1))
+            .withEncryptionKeyMetadata(ByteBuffer.wrap(new byte[16]))
+            .withSortOrder(
+                SortOrder.builderFor(TableTestBase.SCHEMA)
+                    .withOrderId(1)
+                    .sortBy("id", SortDirection.ASC, NullOrder.NULLS_FIRST)
+                    .build());
+
+    if (spec.isPartitioned()) {
+      // easy way to set partition data for now
+      builder.withPartitionPath("data_bucket=1");
+    }
+
+    return builder.build();
+  }
+
+  private static Stream<Arguments> provideSpecAndDeleteFile() {
+    return Stream.of(
+        Arguments.of(
+            PartitionSpec.unpartitioned(),
+            deleteFileWithRequiredOnly(PartitionSpec.unpartitioned())),
+        Arguments.of(
+            PartitionSpec.unpartitioned(),
+            deleteFileWithAllOptional(PartitionSpec.unpartitioned())),
+        Arguments.of(TableTestBase.SPEC, deleteFileWithRequiredOnly(TableTestBase.SPEC)),
+        Arguments.of(TableTestBase.SPEC, deleteFileWithAllOptional(TableTestBase.SPEC)));
+  }
+
+  private static DeleteFile deleteFileWithRequiredOnly(PartitionSpec spec) {
+    PartitionData partitionData = null;
+    if (spec.isPartitioned()) {
+      partitionData = new PartitionData(spec.partitionType());
+      partitionData.set(0, 9);
+    }
+
+    return new GenericDeleteFile(
+        spec.specId(),
+        FileContent.POSITION_DELETES,
+        "/path/to/delete-a.parquet",
+        FileFormat.PARQUET,
+        partitionData,
+        1234,
+        new Metrics(9L, null, null, null, null),
+        null,
+        null,
+        null);
+  }
+
+  private static DeleteFile deleteFileWithAllOptional(PartitionSpec spec) {
+    PartitionData partitionData = new PartitionData(spec.partitionType());
+    if (spec.isPartitioned()) {
+      partitionData.set(0, 9);
+    }
+
+    Metrics metrics =
+        new Metrics(
+            10L, // record count
+            ImmutableMap.of(3, 100L, 4, 200L), // column sizes
+            ImmutableMap.of(3, 90L, 4, 180L), // value counts
+            ImmutableMap.of(3, 10L, 4, 20L), // null value counts
+            ImmutableMap.of(3, 0L, 4, 0L), // nan value counts
+            ImmutableMap.of(
+                3,
+                Conversions.toByteBuffer(Types.IntegerType.get(), 1),
+                4,
+                Conversions.toByteBuffer(Types.IntegerType.get(), 2)), // lower bounds
+            ImmutableMap.of(
+                3,
+                Conversions.toByteBuffer(Types.IntegerType.get(), 5),
+                4,
+                Conversions.toByteBuffer(Types.IntegerType.get(), 10)) // upperbounds
+            );
+
+    return new GenericDeleteFile(
+        spec.specId(),
+        FileContent.EQUALITY_DELETES,
+        "/path/to/delete-with-stats.parquet",
+        FileFormat.PARQUET,
+        partitionData,
+        1234,
+        metrics,
+        new int[] {3},
+        1,
+        ByteBuffer.wrap(new byte[16]));
+  }
+
+  @ParameterizedTest
+  @MethodSource("provideSpecAndDataFile")
+  public void testDataFile(PartitionSpec spec, DataFile dataFile) throws Exception {
+    StringWriter stringWriter = new StringWriter();
+    JsonGenerator jsonStringGenerator = JsonUtil.factory().createGenerator(stringWriter);
+
+    ContentFileParser.toJson(dataFile, spec, jsonStringGenerator);
+    jsonStringGenerator.close();
+    String jsonStr = stringWriter.toString();
+
+    JsonNode jsonNode = JsonUtil.mapper().readTree(jsonStr);
+    DataFile deserializedFile = (DataFile) ContentFileParser.fromJson(jsonNode, spec);

Review Comment:
   rather than casting I think we should check that it's actually a DataFile:
   
   ```
   ContentFile<?> contentFile = ContentFileParser.fromJson(jsonNode, spec);
   Assertions.assertThat(contentFile).isInstanceOf(DataFile.class);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org