You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/05/28 00:22:12 UTC

[GitHub] [iceberg] flyrain opened a new pull request, #4888: Core: Support _deleted metadata column in vectorized read

flyrain opened a new pull request, #4888:
URL: https://github.com/apache/iceberg/pull/4888

   This is the vectorized version of #4683.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] flyrain commented on pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

flyrain commented on PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#issuecomment-1159301418

   Thanks @RussellSpitzer for the review. Refactor class `ColumnBatchReader` a bit to remove two loop on isDeleted array. My benchmark shows a bit perf gain.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] flyrain commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

flyrain commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r899391540


##########
spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReadMetadataColumns.java:
##########
@@ -173,27 +174,28 @@ public void testReadRowNumbers() throws IOException {
 
   @Test
   public void testReadRowNumbersWithDelete() throws IOException {
-    if (vectorized) {
-      List<InternalRow> expectedRowsAfterDelete = Lists.newArrayList(EXPECTED_ROWS);
-      // remove row at position 98, 99, 100, 101, 102, this crosses two row groups [0, 100) and [100, 200)
-      for (int i = 1; i <= 5; i++) {
-        expectedRowsAfterDelete.remove(98);
-      }
+    Assume.assumeTrue(vectorized);

Review Comment:
   Exactly. A refactor to use the nice tool `Assume.assumeTrue()`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] flyrain commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

flyrain commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r899532567


##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java:
##########
@@ -104,6 +104,10 @@ public static <T> VectorHolder constantHolder(int numRows, T constantValue) {
     return new ConstantVectorHolder(numRows, constantValue);
   }
 
+  public static <T> VectorHolder isDeletedHolder(int numRows) {

Review Comment:
   We may keep it as is to keep the naming consistent since we don't change the class name.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r927231315


##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java:
##########
@@ -517,5 +517,32 @@ public void setBatchSize(int batchSize) {
     }
   }
 
+  /**
+   * A Dummy Vector Reader which doesn't actually read files, instead it returns an
+   * IsDeleted Vector Holder which indicates whether a given row is deleted.
+   */
+  public static class DeletedVectorReader extends VectorizedArrowReader {

Review Comment:
   nit: The naming seems inconsistent as we have `IsDeletedVectorHolder`.



##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java:
##########
@@ -104,6 +104,10 @@ public static <T> VectorHolder constantHolder(int numRows, T constantValue) {
     return new ConstantVectorHolder(numRows, constantValue);
   }
 
+  public static <T> VectorHolder isDeletedHolder(int numRows) {

Review Comment:
   Why do we have <T> here? Is it actually being used?



##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java:
##########
@@ -66,35 +78,50 @@ public final ColumnarBatch read(ColumnarBatch reuse, int numRowsToRead) {
       closeVectors();
     }
 
-    ColumnBatchLoader batchLoader = new ColumnBatchLoader(numRowsToRead);
+    ColumnarBatch columnarBatch = new ColumnBatchLoader(numRowsToRead).loadDataToColumnBatch();
     rowStartPosInBatch += numRowsToRead;
-    return batchLoader.columnarBatch;
+    return columnarBatch;
   }
 
   private class ColumnBatchLoader {
-    private int[] rowIdMapping; // the rowId mapping to skip deleted rows for all column vectors inside a batch
-    private int numRows;
-    private ColumnarBatch columnarBatch;
+    private final int numRowsToRead;
+    // the rowId mapping to skip deleted rows for all column vectors inside a batch, it is null when there is no deletes
+    private int[] rowIdMapping;
+    // the array to indicate if a row is deleted or not, it is null when there is no "_deleted" metadata column
+    private boolean[] isDeleted;
 
     ColumnBatchLoader(int numRowsToRead) {
-      initRowIdMapping(numRowsToRead);
-      loadDataToColumnBatch(numRowsToRead);
+      Preconditions.checkArgument(numRowsToRead > 0, "Invalid number of rows to read: %s", numRowsToRead);
+      this.numRowsToRead = numRowsToRead;
+      if (hasIsDeletedColumn) {
+        isDeleted = new boolean[numRowsToRead];
+      }
     }
 
-    ColumnarBatch loadDataToColumnBatch(int numRowsToRead) {
-      Preconditions.checkArgument(numRowsToRead > 0, "Invalid number of rows to read: %s", numRowsToRead);
-      ColumnVector[] arrowColumnVectors = readDataToColumnVectors(numRowsToRead);
+    ColumnarBatch loadDataToColumnBatch() {
+      int numRowsUndeleted = initRowIdMapping();
+
+      ColumnVector[] arrowColumnVectors = readDataToColumnVectors();
 
-      columnarBatch = new ColumnarBatch(arrowColumnVectors);
-      columnarBatch.setNumRows(numRows);
+      ColumnarBatch newColumnarBatch = new ColumnarBatch(arrowColumnVectors);
+      newColumnarBatch.setNumRows(numRowsUndeleted);
 
       if (hasEqDeletes()) {
-        applyEqDelete();
+        applyEqDelete(newColumnarBatch);
       }
-      return columnarBatch;
+
+      if (hasIsDeletedColumn && rowIdMapping != null) {
+        // reset the row id mapping array, so that it doesn't filter out the deleted rows
+        for (int i = 0; i < numRowsToRead; i++) {
+          rowIdMapping[i] = i;

Review Comment:
   Question: do we have to populate the row ID mapping initially if we know we have `_deleted` metadata column?



##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java:
##########
@@ -44,9 +44,21 @@
 public class ColumnarBatchReader extends BaseBatchReader<ColumnarBatch> {
   private DeleteFilter<InternalRow> deletes = null;
   private long rowStartPosInBatch = 0;
+  private final boolean hasIsDeletedColumn;

Review Comment:
   nit: Final vars should be defined before mutable ones



##########
spark/v3.2/spark/src/jmh/java/org/apache/iceberg/spark/source/IcebergSourceDeleteBenchmark.java:
##########
@@ -80,50 +81,79 @@ public void tearDownBenchmark() throws IOException {
 
   @Benchmark
   @Threads(1)
-  public void readIceberg() {
+  public void readIceberg(Blackhole blackhole) {
     Map<String, String> tableProperties = Maps.newHashMap();
     tableProperties.put(SPLIT_OPEN_FILE_COST, Integer.toString(128 * 1024 * 1024));
+    tableProperties.put(TableProperties.PARQUET_VECTORIZATION_ENABLED, "false");

Review Comment:
   nit: Let's add a static import like we have for `SPLIT_OPEN_FILE_COST` for consistency



##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java:
##########
@@ -44,9 +44,21 @@
 public class ColumnarBatchReader extends BaseBatchReader<ColumnarBatch> {
   private DeleteFilter<InternalRow> deletes = null;
   private long rowStartPosInBatch = 0;
+  private final boolean hasIsDeletedColumn;
 
   public ColumnarBatchReader(List<VectorizedReader<?>> readers) {
     super(readers);
+    this.hasIsDeletedColumn = hasDeletedVectorReader(readers);
+  }
+
+  private boolean hasDeletedVectorReader(List<VectorizedReader<?>> readers) {

Review Comment:
   nit: Can this be simplified like this? I'd also prefer a direct import of `DeletedVectorReader` to shorten the line.
   
   ```
   this.hasIsDeletedColumn = readers.stream().anyMatch(reader -> reader instanceof DeletedVectorReader);
   ```



##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java:
##########
@@ -105,35 +132,31 @@ ColumnVector[] readDataToColumnVectors(int numRowsToRead) {
             "Number of rows in the vector %s didn't match expected %s ", numRowsInVector,
             numRowsToRead);
 
-        arrowColumnVectors[i] = hasDeletes() ?
-            ColumnVectorWithFilter.forHolder(vectorHolders[i], rowIdMapping, numRows) :
-            IcebergArrowColumnVector.forHolder(vectorHolders[i], numRowsInVector);
+        arrowColumnVectors[i] = new ColumnVectorBuilder(vectorHolders[i], numRowsInVector)

Review Comment:
   Do we have to construct a column vector builder for every column? What about having a constructor accepting the row ID mapping and is deleted array and making `build(VectorHolder holder, int numRows)`? That way you can init the builder outside of the for loop and call build inside the loop for a particular `vectorHolder`.
   
   ```
   ColumnVectorBuilder columnVectorBuilder = new ColumnVectorBuilder(rowIdMapping, isDeleted);
   for (int i = 0; i < readers.length; i += 1) {
     ...
     arrowColumnVectors[i] = columnVectorBuilder.build(vectorHolders[i], numRowsInVector);
   }
   ```



##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnVectorBuilder.java:
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data.vectorized;
+
+import org.apache.iceberg.arrow.vectorized.VectorHolder;
+import org.apache.iceberg.arrow.vectorized.VectorHolder.ConstantVectorHolder;
+import org.apache.iceberg.arrow.vectorized.VectorHolder.IsDeletedVectorHolder;
+import org.apache.iceberg.types.Types;
+import org.apache.spark.sql.vectorized.ColumnVector;
+
+public class ColumnVectorBuilder {

Review Comment:
   Do this class and its constructors/methods have to be public?



##########
spark/v2.4/spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReadMetadataColumns.java:
##########
@@ -70,8 +70,7 @@ public class TestSparkParquetReadMetadataColumns {
   private static final Schema PROJECTION_SCHEMA = new Schema(
       required(100, "id", Types.LongType.get()),
       required(101, "data", Types.StringType.get()),
-      MetadataColumns.ROW_POSITION,

Review Comment:
   Why is this change needed?



##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnVectorBuilder.java:
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data.vectorized;
+
+import org.apache.iceberg.arrow.vectorized.VectorHolder;
+import org.apache.iceberg.arrow.vectorized.VectorHolder.ConstantVectorHolder;
+import org.apache.iceberg.arrow.vectorized.VectorHolder.IsDeletedVectorHolder;
+import org.apache.iceberg.types.Types;
+import org.apache.spark.sql.vectorized.ColumnVector;
+
+public class ColumnVectorBuilder {
+  private final VectorHolder holder;
+  private final int numRows;
+
+  private boolean[] isDeleted;
+  private int[] rowIdMapping;
+
+  public ColumnVectorBuilder(VectorHolder holder, int numRows) {
+    this.holder = holder;
+    this.numRows = numRows;
+  }
+
+  public ColumnVectorBuilder withDeletedRows(int[] rowIdMappingArray, boolean[] isDeletedArray) {
+    this.rowIdMapping = rowIdMappingArray;
+    this.isDeleted = isDeletedArray;
+    return this;
+  }
+
+  public ColumnVector build() {
+    if (holder.isDummy()) {
+      if (holder instanceof IsDeletedVectorHolder) {
+        return new DeletedMetaColumnVector(Types.BooleanType.get(), isDeleted);
+      } else if (holder instanceof ConstantVectorHolder) {
+        return new ConstantColumnVector(Types.IntegerType.get(), numRows,
+            ((ConstantVectorHolder) holder).getConstant());

Review Comment:
   nit: `ConstantVectorHolder` -> `ConstantVectorHolder<?>`.



##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/DeletedMetaColumnVector.java:
##########
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data.vectorized;
+
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.iceberg.types.Type;
+import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarArray;
+import org.apache.spark.sql.vectorized.ColumnarMap;
+import org.apache.spark.unsafe.types.UTF8String;
+
+public class DeletedMetaColumnVector extends ColumnVector {

Review Comment:
   The naming in new classes is a bit inconsistent. Can we align that?
   
   ```
   IsDeletedVectorHolder
   DeletedMetaColumnVector
   DeletedVectorReader
   ```
   
   



##########
spark/v3.2/spark/src/jmh/java/org/apache/iceberg/spark/source/IcebergSourceDeleteBenchmark.java:
##########
@@ -80,50 +81,79 @@ public void tearDownBenchmark() throws IOException {
 
   @Benchmark
   @Threads(1)
-  public void readIceberg() {
+  public void readIceberg(Blackhole blackhole) {
     Map<String, String> tableProperties = Maps.newHashMap();
     tableProperties.put(SPLIT_OPEN_FILE_COST, Integer.toString(128 * 1024 * 1024));
+    tableProperties.put(TableProperties.PARQUET_VECTORIZATION_ENABLED, "false");

Review Comment:
   Applies to all places in this class.



##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnVectorBuilder.java:
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data.vectorized;
+
+import org.apache.iceberg.arrow.vectorized.VectorHolder;
+import org.apache.iceberg.arrow.vectorized.VectorHolder.ConstantVectorHolder;
+import org.apache.iceberg.arrow.vectorized.VectorHolder.IsDeletedVectorHolder;
+import org.apache.iceberg.types.Types;
+import org.apache.spark.sql.vectorized.ColumnVector;
+
+public class ColumnVectorBuilder {
+  private final VectorHolder holder;
+  private final int numRows;
+
+  private boolean[] isDeleted;
+  private int[] rowIdMapping;
+
+  public ColumnVectorBuilder(VectorHolder holder, int numRows) {
+    this.holder = holder;
+    this.numRows = numRows;
+  }
+
+  public ColumnVectorBuilder withDeletedRows(int[] rowIdMappingArray, boolean[] isDeletedArray) {
+    this.rowIdMapping = rowIdMappingArray;
+    this.isDeleted = isDeletedArray;
+    return this;
+  }
+
+  public ColumnVector build() {
+    if (holder.isDummy()) {
+      if (holder instanceof IsDeletedVectorHolder) {
+        return new DeletedMetaColumnVector(Types.BooleanType.get(), isDeleted);
+      } else if (holder instanceof ConstantVectorHolder) {
+        return new ConstantColumnVector(Types.IntegerType.get(), numRows,
+            ((ConstantVectorHolder) holder).getConstant());

Review Comment:
   nit: I think this should fit on a single line



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] flyrain commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

flyrain commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r927941210


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/DeletedMetaColumnVector.java:
##########
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data.vectorized;
+
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.iceberg.types.Type;
+import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarArray;
+import org.apache.spark.sql.vectorized.ColumnarMap;
+import org.apache.spark.unsafe.types.UTF8String;
+
+public class DeletedMetaColumnVector extends ColumnVector {

Review Comment:
   Made the following changes
   ```
   IsDeletedVectorHolder -> DeletedVectorHolder 
   DeletedMetaColumnVector -> DeletedColumnVector
   DeletedVectorReader
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r898482218


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java:
##########
@@ -209,7 +231,32 @@ void applyEqDelete() {
         rowId++;
       }
 
-      columnarBatch.setNumRows(currentRowId);
+      newColumnarBatch.setNumRows(currentRowId);
+      return currentRowId;
+    }
+
+    /**
+     * Convert the row id mapping array to the isDeleted array.
+     *
+     * @param numRowsInRowIdMapping the num of rows in the row id mapping array
+     */
+    void rowIdMappingToIsDeleted(int numRowsInRowIdMapping) {
+      if (isDeleted == null || rowIdMapping == null) {

Review Comment:
   Is this ok to return on? Seems like we should fail if isDeleted is set but rowIdMapping is null? Or is is rowIdMapping null if there are no delete files? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r927230924


##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java:
##########
@@ -104,6 +104,10 @@ public static <T> VectorHolder constantHolder(int numRows, T constantValue) {
     return new ConstantVectorHolder(numRows, constantValue);
   }
 
+  public static <T> VectorHolder isDeletedHolder(int numRows) {

Review Comment:
   Why do we have `<T>` here? Is it actually being used?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] flyrain commented on pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

flyrain commented on PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#issuecomment-1194790590

   Thanks for the review, @aokolnychyi @RussellSpitzer. Per discussion with @aokolnychyi, I will file a followup to throw an exception when `_deleted` column is used in spark 2.4/3.0/3.1. It always return false no matter whether a row is deleted.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] flyrain commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

flyrain commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r884039841


##########
spark/v2.4/spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReadMetadataColumns.java:
##########
@@ -70,8 +70,7 @@ public class TestSparkParquetReadMetadataColumns {
   private static final Schema PROJECTION_SCHEMA = new Schema(
       required(100, "id", Types.LongType.get()),
       required(101, "data", Types.StringType.get()),
-      MetadataColumns.ROW_POSITION,
-      MetadataColumns.IS_DELETED
+      MetadataColumns.ROW_POSITION

Review Comment:
   We need this change since the class VectorizedReaderBuilder is shared by all spark versions. The change in line 94 of VectorizedReaderBuilder changes the type of the reader as the following code shows. Then, the read throws exception in the method `IcebergArrowColumnVector.forHolder()` of the old Spark version. This change should be fine due to the old Spark doesn't really support _deleted metadata column.
   ```
           reorderedFields.add(new VectorizedArrowReader.DeletedVectorReader());
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r898482975


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java:
##########
@@ -209,7 +231,32 @@ void applyEqDelete() {
         rowId++;
       }
 
-      columnarBatch.setNumRows(currentRowId);
+      newColumnarBatch.setNumRows(currentRowId);
+      return currentRowId;
+    }
+
+    /**
+     * Convert the row id mapping array to the isDeleted array.
+     *
+     * @param numRowsInRowIdMapping the num of rows in the row id mapping array
+     */
+    void rowIdMappingToIsDeleted(int numRowsInRowIdMapping) {
+      if (isDeleted == null || rowIdMapping == null) {
+        return;
+      }
+
+      for (int i = 0; i < numRowsToRead; i++) {

Review Comment:
   Arrays.fill(isDeleted, true) ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r898484504


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java:
##########
@@ -209,7 +231,32 @@ void applyEqDelete() {
         rowId++;
       }
 
-      columnarBatch.setNumRows(currentRowId);
+      newColumnarBatch.setNumRows(currentRowId);
+      return currentRowId;
+    }
+
+    /**
+     * Convert the row id mapping array to the isDeleted array.
+     *
+     * @param numRowsInRowIdMapping the num of rows in the row id mapping array
+     */
+    void rowIdMappingToIsDeleted(int numRowsInRowIdMapping) {
+      if (isDeleted == null || rowIdMapping == null) {
+        return;
+      }
+
+      for (int i = 0; i < numRowsToRead; i++) {
+        isDeleted[i] = true;
+      }
+
+      for (int i = 0; i < numRowsInRowIdMapping; i++) {
+        isDeleted[rowIdMapping[i]] = false;
+      }
+
+      // reset the row id mapping array, so that it doesn't filter out the deleted rows
+      for (int i = 0; i < numRowsToRead; i++) {

Review Comment:
   Could we just set rowIdMapping to null? Isn't that an allowed state here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r898488577


##########
spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReadMetadataColumns.java:
##########
@@ -173,27 +174,28 @@ public void testReadRowNumbers() throws IOException {
 
   @Test
   public void testReadRowNumbersWithDelete() throws IOException {
-    if (vectorized) {
-      List<InternalRow> expectedRowsAfterDelete = Lists.newArrayList(EXPECTED_ROWS);
-      // remove row at position 98, 99, 100, 101, 102, this crosses two row groups [0, 100) and [100, 200)
-      for (int i = 1; i <= 5; i++) {
-        expectedRowsAfterDelete.remove(98);
-      }
+    Assume.assumeTrue(vectorized);

Review Comment:
   This change is unrelated to the current pr correct? Looks like we are just using the correct Junit ignore functionality but for a test which doesn't use the "isdeleted" column



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] flyrain commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

flyrain commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r927848406


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnVectorBuilder.java:
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data.vectorized;
+
+import org.apache.iceberg.arrow.vectorized.VectorHolder;
+import org.apache.iceberg.arrow.vectorized.VectorHolder.ConstantVectorHolder;
+import org.apache.iceberg.arrow.vectorized.VectorHolder.IsDeletedVectorHolder;
+import org.apache.iceberg.types.Types;
+import org.apache.spark.sql.vectorized.ColumnVector;
+
+public class ColumnVectorBuilder {

Review Comment:
   It's not necessary. Let me change it to package-level.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#issuecomment-1192099534

   Sorry for the delay. Let me see.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] flyrain commented on pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

flyrain commented on PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#issuecomment-1163769266

Hi @aokolnychyi and @RussellSpitzer, vectorized read is enabled by default several months ago. But the benchmark still assumes it false by default. I have set it false explicitly, and run the benchmark again. Now we can see the big performance gain between vectorized and non-vectorized read, as the following diagram shows.
<img width="996" alt="Screen Shot 2022-06-22 at 4 22 09 PM" src="https://user-images.githubusercontent.com/1322359/175173362-e2e6d636-4e5c-4ed4-bcc5-a8888c6b1e1c.png">
I also profile the benchmarks. Here is the flame graph for 25% vectorized read. It looks normal to me. The program spent majority time to read pos delete file and the data file. The read of position delete file is still non-vectorized, which takes a big portion. Would suggest to enable the vectorized read on delete files to improve the overall perf. That's probably next step we can do.
<img width="1543" alt="Screen Shot 2022-06-22 at 4 26 57 PM" src="https://user-images.githubusercontent.com/1322359/175176096-e52048aa-5524-49b8-9645-7cdfdc9463e7.png">

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] flyrain commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

flyrain commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r929398608


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnVectorBuilder.java:
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data.vectorized;
+
+import org.apache.iceberg.arrow.vectorized.VectorHolder;
+import org.apache.iceberg.arrow.vectorized.VectorHolder.ConstantVectorHolder;
+import org.apache.iceberg.types.Types;
+import org.apache.spark.sql.vectorized.ColumnVector;
+
+class ColumnVectorBuilder {
+  private boolean[] isDeleted;
+  private int[] rowIdMapping;
+
+  public ColumnVectorBuilder withDeletedRows(int[] rowIdMappingArray, boolean[] isDeletedArray) {

Review Comment:
   I am trying to make the builder more generic so that it can also be used for creation of vectors without deletes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r929399091


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnVectorBuilder.java:
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data.vectorized;
+
+import org.apache.iceberg.arrow.vectorized.VectorHolder;
+import org.apache.iceberg.arrow.vectorized.VectorHolder.ConstantVectorHolder;
+import org.apache.iceberg.types.Types;
+import org.apache.spark.sql.vectorized.ColumnVector;
+
+class ColumnVectorBuilder {
+  private boolean[] isDeleted;
+  private int[] rowIdMapping;
+
+  public ColumnVectorBuilder withDeletedRows(int[] rowIdMappingArray, boolean[] isDeletedArray) {

Review Comment:
   Okay, I see now. Then it is fine.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r898382366


##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java:
##########
@@ -104,6 +104,10 @@ public static <T> VectorHolder constantHolder(int numRows, T constantValue) {
     return new ConstantVectorHolder(numRows, constantValue);
   }
 
+  public static <T> VectorHolder isDeletedHolder(int numRows) {

Review Comment:
   This is another kinda on the fence for me, while the return type isn't boolean, this does seem like a boolean method. Maybe it should just be deletedHolder ? Just skip the "is" since it's a bit confusing in this context?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] flyrain commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

flyrain commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r884039841


##########
spark/v2.4/spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReadMetadataColumns.java:
##########
@@ -70,8 +70,7 @@ public class TestSparkParquetReadMetadataColumns {
   private static final Schema PROJECTION_SCHEMA = new Schema(
       required(100, "id", Types.LongType.get()),
       required(101, "data", Types.StringType.get()),
-      MetadataColumns.ROW_POSITION,
-      MetadataColumns.IS_DELETED
+      MetadataColumns.ROW_POSITION

Review Comment:
   We need this change since VectorizedReaderBuilder is shared by all spark versions. The change in line 94 of VectorizedReaderBuilder changes the type of the reader as the following code shows. Then, the read throws exception in the method `IcebergArrowColumnVector.forHolder()` of the old Spark version. This change should be fine, since the old Spark doesn't really support _deleted metadata column.
   ```
           reorderedFields.add(new VectorizedArrowReader.DeletedVectorReader());
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] flyrain commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

flyrain commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r927938728


##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java:
##########
@@ -517,5 +517,32 @@ public void setBatchSize(int batchSize) {
     }
   }
 
+  /**
+   * A Dummy Vector Reader which doesn't actually read files, instead it returns an
+   * IsDeleted Vector Holder which indicates whether a given row is deleted.
+   */
+  public static class DeletedVectorReader extends VectorizedArrowReader {

Review Comment:
   I've change `IsDeletedVectorHolder` to `DeletedVectorHolder`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] flyrain commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

flyrain commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r927841717


##########
spark/v2.4/spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReadMetadataColumns.java:
##########
@@ -70,8 +70,7 @@ public class TestSparkParquetReadMetadataColumns {
   private static final Schema PROJECTION_SCHEMA = new Schema(
       required(100, "id", Types.LongType.get()),
       required(101, "data", Types.StringType.get()),
-      MetadataColumns.ROW_POSITION,

Review Comment:
   I put the reason here https://github.com/apache/iceberg/pull/4888#discussion_r884039841.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r929392782


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnVectorBuilder.java:
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data.vectorized;
+
+import org.apache.iceberg.arrow.vectorized.VectorHolder;
+import org.apache.iceberg.arrow.vectorized.VectorHolder.ConstantVectorHolder;
+import org.apache.iceberg.types.Types;
+import org.apache.spark.sql.vectorized.ColumnVector;
+
+class ColumnVectorBuilder {
+  private boolean[] isDeleted;
+  private int[] rowIdMapping;
+
+  public ColumnVectorBuilder withDeletedRows(int[] rowIdMappingArray, boolean[] isDeletedArray) {

Review Comment:
   nit: I feel we better make this a constructor and pass these arrays only once during the construction.



##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java:
##########
@@ -66,37 +69,53 @@ public final ColumnarBatch read(ColumnarBatch reuse, int numRowsToRead) {
       closeVectors();
     }
 
-    ColumnBatchLoader batchLoader = new ColumnBatchLoader(numRowsToRead);
+    ColumnarBatch columnarBatch = new ColumnBatchLoader(numRowsToRead).loadDataToColumnBatch();
     rowStartPosInBatch += numRowsToRead;
-    return batchLoader.columnarBatch;
+    return columnarBatch;
   }
 
   private class ColumnBatchLoader {
-    private int[] rowIdMapping; // the rowId mapping to skip deleted rows for all column vectors inside a batch
-    private int numRows;
-    private ColumnarBatch columnarBatch;
+    private final int numRowsToRead;
+    // the rowId mapping to skip deleted rows for all column vectors inside a batch, it is null when there is no deletes
+    private int[] rowIdMapping;
+    // the array to indicate if a row is deleted or not, it is null when there is no "_deleted" metadata column
+    private boolean[] isDeleted;
 
     ColumnBatchLoader(int numRowsToRead) {
-      initRowIdMapping(numRowsToRead);
-      loadDataToColumnBatch(numRowsToRead);
+      Preconditions.checkArgument(numRowsToRead > 0, "Invalid number of rows to read: %s", numRowsToRead);
+      this.numRowsToRead = numRowsToRead;
+      if (hasIsDeletedColumn) {
+        isDeleted = new boolean[numRowsToRead];
+      }
     }
 
-    ColumnarBatch loadDataToColumnBatch(int numRowsToRead) {
-      Preconditions.checkArgument(numRowsToRead > 0, "Invalid number of rows to read: %s", numRowsToRead);
-      ColumnVector[] arrowColumnVectors = readDataToColumnVectors(numRowsToRead);
+    ColumnarBatch loadDataToColumnBatch() {
+      int numRowsUndeleted = initRowIdMapping();
 
-      columnarBatch = new ColumnarBatch(arrowColumnVectors);
-      columnarBatch.setNumRows(numRows);
+      ColumnVector[] arrowColumnVectors = readDataToColumnVectors();
+
+      ColumnarBatch newColumnarBatch = new ColumnarBatch(arrowColumnVectors);
+      newColumnarBatch.setNumRows(numRowsUndeleted);
 
       if (hasEqDeletes()) {
-        applyEqDelete();
+        applyEqDelete(newColumnarBatch);
+      }
+
+      if (hasIsDeletedColumn && rowIdMapping != null) {
+        // reset the row id mapping array, so that it doesn't filter out the deleted rows
+        for (int i = 0; i < numRowsToRead; i++) {
+          rowIdMapping[i] = i;
+        }
+        newColumnarBatch.setNumRows(numRowsToRead);
       }
-      return columnarBatch;
+
+      return newColumnarBatch;
     }
 
-    ColumnVector[] readDataToColumnVectors(int numRowsToRead) {
+    ColumnVector[] readDataToColumnVectors() {
       ColumnVector[] arrowColumnVectors = new ColumnVector[readers.length];
 
+      ColumnVectorBuilder columnVectorBuilder = new ColumnVectorBuilder();

Review Comment:
   nit: This is probably where you can pass `rowIdMapping` and `isDeleted` as those two don't change.



##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java:
##########
@@ -66,35 +78,50 @@ public final ColumnarBatch read(ColumnarBatch reuse, int numRowsToRead) {
       closeVectors();
     }
 
-    ColumnBatchLoader batchLoader = new ColumnBatchLoader(numRowsToRead);
+    ColumnarBatch columnarBatch = new ColumnBatchLoader(numRowsToRead).loadDataToColumnBatch();
     rowStartPosInBatch += numRowsToRead;
-    return batchLoader.columnarBatch;
+    return columnarBatch;
   }
 
   private class ColumnBatchLoader {
-    private int[] rowIdMapping; // the rowId mapping to skip deleted rows for all column vectors inside a batch
-    private int numRows;
-    private ColumnarBatch columnarBatch;
+    private final int numRowsToRead;
+    // the rowId mapping to skip deleted rows for all column vectors inside a batch, it is null when there is no deletes
+    private int[] rowIdMapping;
+    // the array to indicate if a row is deleted or not, it is null when there is no "_deleted" metadata column
+    private boolean[] isDeleted;
 
     ColumnBatchLoader(int numRowsToRead) {
-      initRowIdMapping(numRowsToRead);
-      loadDataToColumnBatch(numRowsToRead);
+      Preconditions.checkArgument(numRowsToRead > 0, "Invalid number of rows to read: %s", numRowsToRead);
+      this.numRowsToRead = numRowsToRead;
+      if (hasIsDeletedColumn) {
+        isDeleted = new boolean[numRowsToRead];
+      }
     }
 
-    ColumnarBatch loadDataToColumnBatch(int numRowsToRead) {
-      Preconditions.checkArgument(numRowsToRead > 0, "Invalid number of rows to read: %s", numRowsToRead);
-      ColumnVector[] arrowColumnVectors = readDataToColumnVectors(numRowsToRead);
+    ColumnarBatch loadDataToColumnBatch() {
+      int numRowsUndeleted = initRowIdMapping();
+
+      ColumnVector[] arrowColumnVectors = readDataToColumnVectors();
 
-      columnarBatch = new ColumnarBatch(arrowColumnVectors);
-      columnarBatch.setNumRows(numRows);
+      ColumnarBatch newColumnarBatch = new ColumnarBatch(arrowColumnVectors);
+      newColumnarBatch.setNumRows(numRowsUndeleted);
 
       if (hasEqDeletes()) {
-        applyEqDelete();
+        applyEqDelete(newColumnarBatch);
       }
-      return columnarBatch;
+
+      if (hasIsDeletedColumn && rowIdMapping != null) {
+        // reset the row id mapping array, so that it doesn't filter out the deleted rows
+        for (int i = 0; i < numRowsToRead; i++) {
+          rowIdMapping[i] = i;

Review Comment:
   Sounds good to me.



##########
spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnVectorBuilder.java:
##########
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data.vectorized;
+
+import org.apache.iceberg.arrow.vectorized.VectorHolder;
+import org.apache.iceberg.arrow.vectorized.VectorHolder.ConstantVectorHolder;
+import org.apache.iceberg.types.Types;
+import org.apache.spark.sql.vectorized.ColumnVector;
+
+class ColumnVectorBuilder {
+  private boolean[] isDeleted;
+  private int[] rowIdMapping;
+
+  public ColumnVectorBuilder withDeletedRows(int[] rowIdMappingArray, boolean[] isDeletedArray) {

Review Comment:
   nit: Same for 3.3



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#issuecomment-1194784418

   Thanks, @flyrain! Great to have this done. Thanks for reviewing, @RussellSpitzer!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi commented on pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

aokolnychyi commented on PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#issuecomment-1192723965

   This seems correct to me. I had only a few questions/comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] flyrain commented on pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

flyrain commented on PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#issuecomment-1193000105

   Hi @aokolnychyi, this is ready for review. I have to apply the same changes to Spark 3.3, otherwise unit test won't pass.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] flyrain commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

flyrain commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r927852774


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnVectorBuilder.java:
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data.vectorized;
+
+import org.apache.iceberg.arrow.vectorized.VectorHolder;
+import org.apache.iceberg.arrow.vectorized.VectorHolder.ConstantVectorHolder;
+import org.apache.iceberg.arrow.vectorized.VectorHolder.IsDeletedVectorHolder;
+import org.apache.iceberg.types.Types;
+import org.apache.spark.sql.vectorized.ColumnVector;
+
+public class ColumnVectorBuilder {
+  private final VectorHolder holder;
+  private final int numRows;
+
+  private boolean[] isDeleted;
+  private int[] rowIdMapping;
+
+  public ColumnVectorBuilder(VectorHolder holder, int numRows) {
+    this.holder = holder;
+    this.numRows = numRows;
+  }
+
+  public ColumnVectorBuilder withDeletedRows(int[] rowIdMappingArray, boolean[] isDeletedArray) {
+    this.rowIdMapping = rowIdMappingArray;
+    this.isDeleted = isDeletedArray;
+    return this;
+  }
+
+  public ColumnVector build() {
+    if (holder.isDummy()) {
+      if (holder instanceof IsDeletedVectorHolder) {
+        return new DeletedMetaColumnVector(Types.BooleanType.get(), isDeleted);
+      } else if (holder instanceof ConstantVectorHolder) {
+        return new ConstantColumnVector(Types.IntegerType.get(), numRows,
+            ((ConstantVectorHolder) holder).getConstant());

Review Comment:
   it cannot with the `<?>`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] aokolnychyi merged pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

aokolnychyi merged PR #4888:
URL: https://github.com/apache/iceberg/pull/4888


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r898097104


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java:
##########
@@ -72,31 +84,44 @@ public final ColumnarBatch read(ColumnarBatch reuse, int numRowsToRead) {
   }
 
   private class ColumnBatchLoader {
-    private int[] rowIdMapping; // the rowId mapping to skip deleted rows for all column vectors inside a batch
-    private int numRows;
     private ColumnarBatch columnarBatch;
+    private final int numRowsToRead;
+    private int[] rowIdMapping; // the rowId mapping to skip deleted rows for all column vectors inside a batch
+    private boolean[] isDeleted; // the array to indicate if a row is deleted or not
 
     ColumnBatchLoader(int numRowsToRead) {
-      initRowIdMapping(numRowsToRead);
-      loadDataToColumnBatch(numRowsToRead);
+      Preconditions.checkArgument(numRowsToRead > 0, "Invalid number of rows to read: %s", numRowsToRead);
+      this.numRowsToRead = numRowsToRead;
+      this.columnarBatch = loadDataToColumnBatch();
     }
 
-    ColumnarBatch loadDataToColumnBatch(int numRowsToRead) {
-      Preconditions.checkArgument(numRowsToRead > 0, "Invalid number of rows to read: %s", numRowsToRead);
-      ColumnVector[] arrowColumnVectors = readDataToColumnVectors(numRowsToRead);
+    ColumnarBatch loadDataToColumnBatch() {
+      int numRowsUndeleted = initRowIdMapping();
+
+      ColumnVector[] arrowColumnVectors = readDataToColumnVectors();
 
-      columnarBatch = new ColumnarBatch(arrowColumnVectors);
-      columnarBatch.setNumRows(numRows);
+      ColumnarBatch newColumnarBatch = new ColumnarBatch(arrowColumnVectors);
+      newColumnarBatch.setNumRows(numRowsUndeleted);
 
       if (hasEqDeletes()) {
-        applyEqDelete();
+        numRowsUndeleted = applyEqDelete(newColumnarBatch);
+      }
+
+      if (hasColumnIsDeleted) {

Review Comment:
   This is a nit but, i think this makes more sense read a `hasIsDeletedColumn`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] flyrain commented on pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

flyrain commented on PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#issuecomment-1197530335

   @aokolnychyi, checked the module Spark2.4/3.0/3.1. Metadata column `_deleted` isn't supported. For example, I've added the following tests into class `TestSelect`. 
   ```
   @Test
   public void testSelectDeletedMetaColumn() {
     List<Object[]> expected = ImmutableList.of(
         row(1L, "a", 1.0F), row(2L, "b", 2.0F), row(3L, "c", Float.NaN));
   
     assertEquals("Should return all expected rows", expected, sql("SELECT * FROM %s where _deleted=false", tableName));
   
     expected = ImmutableList.of();
     assertEquals("Should return all expected rows", expected, sql("SELECT * FROM %s where _deleted=true", tableName));
   }
   ```
   It reports the following errors. We don't need to change anything in that sense. 
   ```
   cannot resolve '`_deleted`' given input columns: [table.id, table.data, table.doubleVal]; line 1 pos 26;
   'Project [*]
   +- 'Filter ('_deleted = false)
      +- SubqueryAlias `table`
         +- RelationV2 iceberg[id#12, data#13, doubleVal#14] (Options: [path=file:/var/folders/69/j9m_r8gj69753xfnsjnlsl_m0000gn/T/junit2794346723320750539/junit3760903...)
   
   org.apache.spark.sql.AnalysisException: cannot resolve '`_deleted`' given input columns: [table.id, table.data, table.doubleVal]; line 1 pos 26;
   'Project [*]
   +- 'Filter ('_deleted = false)
      +- SubqueryAlias `table`
         +- RelationV2 iceberg[id#12, data#13, doubleVal#14] (Options: [path=file:/var/folders/69/j9m_r8gj69753xfnsjnlsl_m0000gn/T/junit2794346723320750539/junit3760903...)
   ```
   And the change I did in class `TestSparkParquetReadMetadataColumns.java`, they are unit tests for internal classes. User cannot use them directly.
   In conclusion, we are OK to leave it as is.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r898430641


##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorizedArrowReader.java:
##########
@@ -517,5 +517,32 @@ public void setBatchSize(int batchSize) {
     }
   }
 
+  /**
+   * A Dummy Vector Reader which doesn't actually read files, instead it returns a dummy

Review Comment:
   Just a little note here, I think you don't have to say "dummy" twice. 
   
   Small suggestion?
   "A Vector Reader that returns a IsDeleted Vector Holder which returns true when a given row is deleted when read"?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r898485901


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/DeletedMetaColumnVector.java:
##########
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data.vectorized;
+
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.iceberg.types.Type;
+import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarArray;
+import org.apache.spark.sql.vectorized.ColumnarMap;
+import org.apache.spark.unsafe.types.UTF8String;
+
+public class DeletedMetaColumnVector extends ColumnVector {
+  private final boolean[] isDeleted;
+
+  public DeletedMetaColumnVector(Type type, boolean[] isDeleted) {
+    super(SparkSchemaUtil.convert(type));
+    Preconditions.checkArgument(isDeleted != null, "Boolean array isDeleted cannot be null");
+    this.isDeleted = isDeleted;
+  }
+
+  @Override
+  public void close() {
+  }
+
+  @Override
+  public boolean hasNull() {
+    return false;
+  }
+
+  @Override
+  public int numNulls() {
+    return 0;
+  }
+
+  @Override
+  public boolean isNullAt(int rowId) {
+    return false;
+  }
+
+  @Override
+  public boolean getBoolean(int rowId) {
+    return isDeleted[rowId];
+  }
+
+  @Override
+  public byte getByte(int rowId) {

Review Comment:
   Not sure if we did this in the others, but IMHO all the accessors should through `UnsupportedOperationException` except for `getBoolean`



##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/DeletedMetaColumnVector.java:
##########
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data.vectorized;
+
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.iceberg.types.Type;
+import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarArray;
+import org.apache.spark.sql.vectorized.ColumnarMap;
+import org.apache.spark.unsafe.types.UTF8String;
+
+public class DeletedMetaColumnVector extends ColumnVector {
+  private final boolean[] isDeleted;
+
+  public DeletedMetaColumnVector(Type type, boolean[] isDeleted) {
+    super(SparkSchemaUtil.convert(type));
+    Preconditions.checkArgument(isDeleted != null, "Boolean array isDeleted cannot be null");
+    this.isDeleted = isDeleted;
+  }
+
+  @Override
+  public void close() {
+  }
+
+  @Override
+  public boolean hasNull() {
+    return false;
+  }
+
+  @Override
+  public int numNulls() {
+    return 0;
+  }
+
+  @Override
+  public boolean isNullAt(int rowId) {
+    return false;
+  }
+
+  @Override
+  public boolean getBoolean(int rowId) {
+    return isDeleted[rowId];
+  }
+
+  @Override
+  public byte getByte(int rowId) {

Review Comment:
   Not sure if we did this in the others, but IMHO all the accessors should throw `UnsupportedOperationException` except for `getBoolean`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r898487203


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java:
##########
@@ -72,31 +84,44 @@ public final ColumnarBatch read(ColumnarBatch reuse, int numRowsToRead) {
   }
 
   private class ColumnBatchLoader {
-    private int[] rowIdMapping; // the rowId mapping to skip deleted rows for all column vectors inside a batch
-    private int numRows;
     private ColumnarBatch columnarBatch;
+    private final int numRowsToRead;
+    private int[] rowIdMapping; // the rowId mapping to skip deleted rows for all column vectors inside a batch
+    private boolean[] isDeleted; // the array to indicate if a row is deleted or not

Review Comment:
   For my confusion below can we indicate here for these two arrays to describe when these two can be null?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r898488772


##########
spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/source/TestSparkReaderDeletes.java:
##########
@@ -277,8 +276,6 @@ public void testPosDeletesAllRowsInBatch() throws IOException {
 
   @Test
   public void testPosDeletesWithDeletedColumn() throws IOException {
-    Assume.assumeFalse(vectorized);

Review Comment:
   +1



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] flyrain commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

flyrain commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r899634574


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/DeletedMetaColumnVector.java:
##########
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.spark.data.vectorized;
+
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.spark.SparkSchemaUtil;
+import org.apache.iceberg.types.Type;
+import org.apache.spark.sql.types.Decimal;
+import org.apache.spark.sql.vectorized.ColumnVector;
+import org.apache.spark.sql.vectorized.ColumnarArray;
+import org.apache.spark.sql.vectorized.ColumnarMap;
+import org.apache.spark.unsafe.types.UTF8String;
+
+public class DeletedMetaColumnVector extends ColumnVector {
+  private final boolean[] isDeleted;
+
+  public DeletedMetaColumnVector(Type type, boolean[] isDeleted) {
+    super(SparkSchemaUtil.convert(type));
+    Preconditions.checkArgument(isDeleted != null, "Boolean array isDeleted cannot be null");
+    this.isDeleted = isDeleted;
+  }
+
+  @Override
+  public void close() {
+  }
+
+  @Override
+  public boolean hasNull() {
+    return false;
+  }
+
+  @Override
+  public int numNulls() {
+    return 0;
+  }
+
+  @Override
+  public boolean isNullAt(int rowId) {
+    return false;
+  }
+
+  @Override
+  public boolean getBoolean(int rowId) {
+    return isDeleted[rowId];
+  }
+
+  @Override
+  public byte getByte(int rowId) {

Review Comment:
   Make sense. We did in class `RowPositionColumnVector`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] flyrain commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

flyrain commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r884039841


##########
spark/v2.4/spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReadMetadataColumns.java:
##########
@@ -70,8 +70,7 @@ public class TestSparkParquetReadMetadataColumns {
   private static final Schema PROJECTION_SCHEMA = new Schema(
       required(100, "id", Types.LongType.get()),
       required(101, "data", Types.StringType.get()),
-      MetadataColumns.ROW_POSITION,
-      MetadataColumns.IS_DELETED
+      MetadataColumns.ROW_POSITION

Review Comment:
   We need this change since the class VectorizedReaderBuilder is shared by all spark versions. The change in line 94 of VectorizedReaderBuilder changes the type of the reader as the following code shows. Then, the read throws exception in the method `IcebergArrowColumnVector.forHolder()` of the old Spark version. This change should be fine, since the old Spark doesn't really support _deleted metadata column.
   ```
           reorderedFields.add(new VectorizedArrowReader.DeletedVectorReader());
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] flyrain commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

flyrain commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r927887360


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java:
##########
@@ -66,35 +78,50 @@ public final ColumnarBatch read(ColumnarBatch reuse, int numRowsToRead) {
       closeVectors();
     }
 
-    ColumnBatchLoader batchLoader = new ColumnBatchLoader(numRowsToRead);
+    ColumnarBatch columnarBatch = new ColumnBatchLoader(numRowsToRead).loadDataToColumnBatch();
     rowStartPosInBatch += numRowsToRead;
-    return batchLoader.columnarBatch;
+    return columnarBatch;
   }
 
   private class ColumnBatchLoader {
-    private int[] rowIdMapping; // the rowId mapping to skip deleted rows for all column vectors inside a batch
-    private int numRows;
-    private ColumnarBatch columnarBatch;
+    private final int numRowsToRead;
+    // the rowId mapping to skip deleted rows for all column vectors inside a batch, it is null when there is no deletes
+    private int[] rowIdMapping;
+    // the array to indicate if a row is deleted or not, it is null when there is no "_deleted" metadata column
+    private boolean[] isDeleted;
 
     ColumnBatchLoader(int numRowsToRead) {
-      initRowIdMapping(numRowsToRead);
-      loadDataToColumnBatch(numRowsToRead);
+      Preconditions.checkArgument(numRowsToRead > 0, "Invalid number of rows to read: %s", numRowsToRead);
+      this.numRowsToRead = numRowsToRead;
+      if (hasIsDeletedColumn) {
+        isDeleted = new boolean[numRowsToRead];
+      }
     }
 
-    ColumnarBatch loadDataToColumnBatch(int numRowsToRead) {
-      Preconditions.checkArgument(numRowsToRead > 0, "Invalid number of rows to read: %s", numRowsToRead);
-      ColumnVector[] arrowColumnVectors = readDataToColumnVectors(numRowsToRead);
+    ColumnarBatch loadDataToColumnBatch() {
+      int numRowsUndeleted = initRowIdMapping();
+
+      ColumnVector[] arrowColumnVectors = readDataToColumnVectors();
 
-      columnarBatch = new ColumnarBatch(arrowColumnVectors);
-      columnarBatch.setNumRows(numRows);
+      ColumnarBatch newColumnarBatch = new ColumnarBatch(arrowColumnVectors);
+      newColumnarBatch.setNumRows(numRowsUndeleted);
 
       if (hasEqDeletes()) {
-        applyEqDelete();
+        applyEqDelete(newColumnarBatch);
       }
-      return columnarBatch;
+
+      if (hasIsDeletedColumn && rowIdMapping != null) {
+        // reset the row id mapping array, so that it doesn't filter out the deleted rows
+        for (int i = 0; i < numRowsToRead; i++) {
+          rowIdMapping[i] = i;

Review Comment:
   That's a good question. In short, I'm using row ID mapping to improve eq deletes perf when we have both pos deletes and eq deletes. I think it is worth to do that since applying eq deletes is expensive, it has to go row by row.  Here is an example, after the pos deletes, we will only need to iterate 6 rows instead of 8 rows for applying eq delete.
   ```
        * Filter out the equality deleted rows. Here is an example,
        * [0,1,2,3,4,5,6,7] -- Original status of the row id mapping array
        * [F,F,F,F,F,F,F,F] -- Original status of the isDeleted array
        * Position delete 2, 6
        * [0,1,3,4,5,7,-,-] -- After applying position deletes [Set Num records to 6]
        * [F,F,T,F,F,F,T,F] -- After applying position deletes
        * Equality delete 1 <= x <= 3
        * [0,4,5,7,-,-,-,-] -- After applying equality deletes [Set Num records to 4]
        * [F,T,T,T,F,F,T,F] -- After applying equality deletes
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] flyrain commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

flyrain commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r927906389


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java:
##########
@@ -105,35 +132,31 @@ ColumnVector[] readDataToColumnVectors(int numRowsToRead) {
             "Number of rows in the vector %s didn't match expected %s ", numRowsInVector,
             numRowsToRead);
 
-        arrowColumnVectors[i] = hasDeletes() ?
-            ColumnVectorWithFilter.forHolder(vectorHolders[i], rowIdMapping, numRows) :
-            IcebergArrowColumnVector.forHolder(vectorHolders[i], numRowsInVector);
+        arrowColumnVectors[i] = new ColumnVectorBuilder(vectorHolders[i], numRowsInVector)

Review Comment:
   Nice suggestion. Made the change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] flyrain commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

flyrain commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r900501091


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java:
##########
@@ -209,7 +231,32 @@ void applyEqDelete() {
         rowId++;
       }
 
-      columnarBatch.setNumRows(currentRowId);
+      newColumnarBatch.setNumRows(currentRowId);
+      return currentRowId;
+    }
+
+    /**
+     * Convert the row id mapping array to the isDeleted array.
+     *
+     * @param numRowsInRowIdMapping the num of rows in the row id mapping array
+     */
+    void rowIdMappingToIsDeleted(int numRowsInRowIdMapping) {
+      if (isDeleted == null || rowIdMapping == null) {
+        return;
+      }
+
+      for (int i = 0; i < numRowsToRead; i++) {
+        isDeleted[i] = true;
+      }
+
+      for (int i = 0; i < numRowsInRowIdMapping; i++) {
+        isDeleted[rowIdMapping[i]] = false;
+      }
+
+      // reset the row id mapping array, so that it doesn't filter out the deleted rows
+      for (int i = 0; i < numRowsToRead; i++) {

Review Comment:
   We've set the ColumnVectorWithFilter with rowIdMapping at line 133. The instance of ColumnVectorWithFilter will throw NPE if we set rowIdMapping to null.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] flyrain commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

flyrain commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r900501319


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java:
##########
@@ -209,7 +231,32 @@ void applyEqDelete() {
         rowId++;
       }
 
-      columnarBatch.setNumRows(currentRowId);
+      newColumnarBatch.setNumRows(currentRowId);
+      return currentRowId;
+    }
+
+    /**
+     * Convert the row id mapping array to the isDeleted array.
+     *
+     * @param numRowsInRowIdMapping the num of rows in the row id mapping array
+     */
+    void rowIdMappingToIsDeleted(int numRowsInRowIdMapping) {
+      if (isDeleted == null || rowIdMapping == null) {

Review Comment:
   yeah, let me add doc.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r898485495


##########
spark/v3.2/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReader.java:
##########
@@ -209,7 +231,32 @@ void applyEqDelete() {
         rowId++;
       }
 
-      columnarBatch.setNumRows(currentRowId);
+      newColumnarBatch.setNumRows(currentRowId);
+      return currentRowId;
+    }
+
+    /**
+     * Convert the row id mapping array to the isDeleted array.
+     *
+     * @param numRowsInRowIdMapping the num of rows in the row id mapping array
+     */
+    void rowIdMappingToIsDeleted(int numRowsInRowIdMapping) {
+      if (isDeleted == null || rowIdMapping == null) {

Review Comment:
   Oh wait I think I see, so if rowIdMapping is null it means there are no deletes. Does that mean we treat isDeleted as always false if null? I think that may need to be a javadoc comment?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] flyrain commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

flyrain commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r899391540


##########
spark/v3.2/spark/src/test/java/org/apache/iceberg/spark/data/TestSparkParquetReadMetadataColumns.java:
##########
@@ -173,27 +174,28 @@ public void testReadRowNumbers() throws IOException {
 
   @Test
   public void testReadRowNumbersWithDelete() throws IOException {
-    if (vectorized) {
-      List<InternalRow> expectedRowsAfterDelete = Lists.newArrayList(EXPECTED_ROWS);
-      // remove row at position 98, 99, 100, 101, 102, this crosses two row groups [0, 100) and [100, 200)
-      for (int i = 1; i <= 5; i++) {
-        expectedRowsAfterDelete.remove(98);
-      }
+    Assume.assumeTrue(vectorized);

Review Comment:
   Exactly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #4888: Core: Support _deleted metadata column in vectorized read

Posted by GitBox <gi...@apache.org>.

RussellSpitzer commented on code in PR #4888:
URL: https://github.com/apache/iceberg/pull/4888#discussion_r898428103


##########
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/VectorHolder.java:
##########
@@ -104,6 +104,10 @@ public static <T> VectorHolder constantHolder(int numRows, T constantValue) {
     return new ConstantVectorHolder(numRows, constantValue);
   }
 
+  public static <T> VectorHolder isDeletedHolder(int numRows) {

Review Comment:
   I think the class name is fine, just this method seems a little confusing to me, but maybe it's just me :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org