You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/10/09 21:16:10 UTC

[GitHub] [iceberg] holdenk opened a new pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

holdenk opened a new pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick commented on a change in pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

kbendick commented on a change in pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572#discussion_r502841925



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkParquetReaders.java
##########
@@ -655,6 +653,11 @@ public int numElements() {
       return numElements;
     }
 
+    @Override
+    public int getNumElements() {

Review comment:
       It looks like the function `numElements` is still there, just above this one

##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkParquetReaders.java
##########
@@ -655,6 +653,11 @@ public int numElements() {
       return numElements;
     }
 
+    @Override
+    public int getNumElements() {

Review comment:
       It looks like the function `numElements` is still there (it's part of the catalyst abstract class), in the lines just above these ones.
   
   But unless there's something going on with Spark, I agree with @rdblue.
   
   I looked to see if `getNumElements` were needed on the Flink side, as Flink highly favors POJOs for performance reasons and then for style reasons they also tend to use POJO-y like names, but the Flink `ArrayData` interface uses `size` instead, so keeping the iceberg ReusableArrayData interface to use `numElements` seems like it would kill two birds with one stone by also fulfilling the catalyst contract.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick commented on a change in pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

kbendick commented on a change in pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572#discussion_r503507114



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ReusableMapData.java
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+import java.util.Map;
+
+public interface ReusableMapData {
+  ReusableArrayData keys();
+  ReusableArrayData values();
+
+  default void grow() {
+    keys().grow();
+    values().grow();
+  }
+
+  default int capacity() {
+    return keys().capacity();
+  }
+
+  default void setNumElements(int numElements) {
+    keys().setNumElements(numElements);
+    values().setNumElements(numElements);
+  }
+
+  int size();
+
+  @SuppressWarnings("unchecked")
+  default <K, V> Map.Entry<K, V> getRaw(
+      int pos,
+      ParquetValueReaders.ReusableEntry<K, V> reuse,
+      ParquetValueReaders.ReusableEntry<K, V> def) {
+    if (pos < capacity()) {
+      reuse.set((K) keys().getObj(pos), (V) values().getObj(pos));
+      return reuse;
+    }
+    return def;
+  }
+
+  default void addPair(Object key, Object value) {
+    if (size() >= capacity()) {
+      grow();
+    }
+    keys().update(size(), key);
+    values().update(size(), value);
+    setNumElements(size() + 1);

Review comment:
       I might be misinterpreting your question / concern, but it seems like at least in the `SparkReusableMapData` this is the case. 
   
   If I'm reading this correctly, addPair is overwritten and then `setNumElements` is only called on `buildMap`. Again, I might be misinterpreting the question / concern but for a default implementation given that it seems to be overridden on the Spark side at least.
   
   ```java
   private static class MapReader<K, V> extends RepeatedKeyValueReader<MapData, SparkReusableMapData, K, V> {
       // Removed to the bare minimum ....
       @Override
       protected void addPair(SparkReusableMapData map, K key, V value) {
         if (writePos >= map.capacity()) {
           map.grow();
         }
   
         map.keys.values[writePos] = key;
         map.values.values[writePos] = value;
   
         writePos += 1;
       }
   
       @Override
       protected MapData buildMap(SparkReusableMapData map) {
         map.setNumElements(writePos);
         return map;
       }
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572#discussion_r502724623



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ReusableMapData.java
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+import java.util.Map;
+
+public interface ReusableMapData {
+  ReusableArrayData keys();
+  ReusableArrayData values();
+
+  default void grow() {
+    keys().grow();
+    values().grow();
+  }
+
+  default int capacity() {
+    return keys().capacity();
+  }
+
+  default void setNumElements(int numElements) {
+    keys().setNumElements(numElements);
+    values().setNumElements(numElements);
+  }
+
+  int size();
+
+  @SuppressWarnings("unchecked")
+  default <K, V> Map.Entry<K, V> getRaw(
+      int pos,
+      ParquetValueReaders.ReusableEntry<K, V> reuse,
+      ParquetValueReaders.ReusableEntry<K, V> def) {
+    if (pos < capacity()) {
+      reuse.set((K) keys().getObj(pos), (V) values().getObj(pos));
+      return reuse;
+    }
+    return def;
+  }
+
+  default void addPair(Object key, Object value) {
+    if (size() >= capacity()) {
+      grow();
+    }
+    keys().update(size(), key);
+    values().update(size(), value);
+    setNumElements(size() + 1);

Review comment:
       Does this need to set this every time a new pair is added? I think it would be better to set the valid size just before returning, to avoid extra work.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] openinx commented on a change in pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

openinx commented on a change in pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572#discussion_r507395091



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ReusableMapData.java
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+import java.util.Map;
+
+public interface ReusableMapData {
+  ReusableArrayData keys();
+  ReusableArrayData values();
+
+  default void grow() {
+    keys().grow();
+    values().grow();
+  }
+
+  default int capacity() {
+    return keys().capacity();
+  }
+
+  default void setNumElements(int numElements) {
+    keys().setNumElements(numElements);
+    values().setNumElements(numElements);
+  }
+
+  int size();

Review comment:
       nit:  how about making this method to return the `keys().getNumElements()` by default ?  then we may don't have to implement it in subclasses now. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] holdenk commented on a change in pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

holdenk commented on a change in pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572#discussion_r506672955



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ReusableMapData.java
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+import java.util.Map;
+
+public interface ReusableMapData {
+  ReusableArrayData keys();
+  ReusableArrayData values();
+
+  default void grow() {
+    keys().grow();
+    values().grow();
+  }
+
+  default int capacity() {
+    return keys().capacity();
+  }
+
+  default void setNumElements(int numElements) {
+    keys().setNumElements(numElements);
+    values().setNumElements(numElements);
+  }
+
+  int size();
+
+  @SuppressWarnings("unchecked")
+  default <K, V> Map.Entry<K, V> getRaw(
+      int pos,
+      ParquetValueReaders.ReusableEntry<K, V> reuse,
+      ParquetValueReaders.ReusableEntry<K, V> def) {
+    if (pos < capacity()) {
+      reuse.set((K) keys().getObj(pos), (V) values().getObj(pos));
+      return reuse;
+    }
+    return def;
+  }
+
+  default void addPair(Object key, Object value) {
+    if (size() >= capacity()) {
+      grow();
+    }
+    keys().update(size(), key);
+    values().update(size(), value);
+    setNumElements(size() + 1);

Review comment:
       So yes we could do that, but we'd be keeping track of the write position in the `MapReader` so this just goves the grow code into the base class. I can swap it back around though.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick commented on a change in pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

kbendick commented on a change in pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572#discussion_r503507114



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ReusableMapData.java
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+import java.util.Map;
+
+public interface ReusableMapData {
+  ReusableArrayData keys();
+  ReusableArrayData values();
+
+  default void grow() {
+    keys().grow();
+    values().grow();
+  }
+
+  default int capacity() {
+    return keys().capacity();
+  }
+
+  default void setNumElements(int numElements) {
+    keys().setNumElements(numElements);
+    values().setNumElements(numElements);
+  }
+
+  int size();
+
+  @SuppressWarnings("unchecked")
+  default <K, V> Map.Entry<K, V> getRaw(
+      int pos,
+      ParquetValueReaders.ReusableEntry<K, V> reuse,
+      ParquetValueReaders.ReusableEntry<K, V> def) {
+    if (pos < capacity()) {
+      reuse.set((K) keys().getObj(pos), (V) values().getObj(pos));
+      return reuse;
+    }
+    return def;
+  }
+
+  default void addPair(Object key, Object value) {
+    if (size() >= capacity()) {
+      grow();
+    }
+    keys().update(size(), key);
+    values().update(size(), value);
+    setNumElements(size() + 1);

Review comment:
       I might be misinterpreting your question / concern, but it seems like at least in the `SparkReusableMapData` this is the case. 
   
   If I'm reading this correctly, addPair is overwritten and then `setNumElements` is only called on `buildMap`. Again, I might be misinterpreting the question / concern but for a default implementation this seems fine given that it seems to be overridden on the Spark side and on the Flink side too (where theres a `buildMap` function).
   
   ```java
   private static class MapReader<K, V> extends RepeatedKeyValueReader<MapData, SparkReusableMapData, K, V> {
       // Removed to the bare minimum ....
       @Override
       protected void addPair(SparkReusableMapData map, K key, V value) {
         if (writePos >= map.capacity()) {
           map.grow();
         }
   
         map.keys.values[writePos] = key;
         map.values.values[writePos] = value;
   
         writePos += 1;
       }
   
       @Override
       protected MapData buildMap(SparkReusableMapData map) {
         map.setNumElements(writePos);
         return map;
       }
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572#discussion_r502723966



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ReusableArrayData.java
##########
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+public interface ReusableArrayData {

Review comment:
       Was it necessary to make this an `interface`? Because this does allocation, it seems much harder to make it an interface than to make it an abstract class that handles allocation and the values array internally.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572#discussion_r506668328



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ReusableArrayData.java
##########
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+public interface ReusableArrayData {

Review comment:
       Yes, I think so as well. That just means I'll need to be a bit more careful with the review since it is more complicated than I thought. I should have time today, hopefully.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick commented on a change in pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

kbendick commented on a change in pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572#discussion_r502843848



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkParquetReaders.java
##########
@@ -655,6 +653,11 @@ public int numElements() {
       return numElements;
     }
 
+    @Override
+    public int getNumElements() {

Review comment:
       It looks like the function `numElements` is still there (it's part of the catalyst abstract class), in the lines just above these ones.
   
   But unless there's something going on with Spark, I agree with @rdblue.
   
   I looked to see if `getNumElements` were needed on the Flink side, as Flink highly favors POJOs for performance reasons within their own serialization that they can infer and then for style reasons they also tend to use POJO-y like names even when its not always needed, but the Flink `ArrayData` interface uses `size` instead, so keeping the iceberg ReusableArrayData interface to use `numElements` seems like it would kill two birds with one stone by also fulfilling the catalyst contract.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] holdenk commented on a change in pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

holdenk commented on a change in pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572#discussion_r506673783



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkParquetReaders.java
##########
@@ -655,6 +653,11 @@ public int numElements() {
       return numElements;
     }
 
+    @Override
+    public int getNumElements() {

Review comment:
       Yeah this is a good point, I'll simplify the interface so we don't have numElements and getNumElements.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572#discussion_r506674021



##########
File path: flink/src/main/java/org/apache/iceberg/flink/data/FlinkParquetReaders.java
##########
@@ -480,178 +481,95 @@ protected ArrayData buildList(ReusableArrayData list) {
     }
 
     @Override
-    protected ReusableMapData newMapData(MapData reuse) {
+    protected FlinkReusableMapData newMapData(MapData reuse) {
       this.readPos = 0;
-      this.writePos = 0;
 
-      if (reuse instanceof ReusableMapData) {
-        return (ReusableMapData) reuse;
+      if (reuse instanceof FlinkReusableMapData) {
+        return (FlinkReusableMapData) reuse;
       } else {
-        return new ReusableMapData();
+        return new FlinkReusableMapData();
       }
     }
 
     @Override
     @SuppressWarnings("unchecked")
-    protected Map.Entry<K, V> getPair(ReusableMapData map) {
-      Map.Entry<K, V> kv = nullEntry;
-      if (readPos < map.capacity()) {
-        entry.set((K) map.keys.values[readPos], (V) map.values.values[readPos]);
-        kv = entry;
-      }
-
+    protected Map.Entry<K, V> getPair(FlinkReusableMapData map) {
+      Map.Entry<K, V> kv = map.getRaw(readPos, entry, nullEntry);
       readPos += 1;
 
       return kv;
     }
 
     @Override
-    protected void addPair(ReusableMapData map, K key, V value) {
-      if (writePos >= map.capacity()) {
-        map.grow();
-      }
-
-      map.keys.values[writePos] = key;
-      map.values.values[writePos] = value;
-
-      writePos += 1;
+    protected void addPair(FlinkReusableMapData map, K key, V value) {
+      map.addPair(key, value);
     }
 
     @Override
-    protected MapData buildMap(ReusableMapData map) {
-      map.setNumElements(writePos);
+    protected MapData buildMap(FlinkReusableMapData map) {
       return map;
     }
   }
 
-  private static class RowDataReader extends ParquetValueReaders.StructReader<RowData, GenericRowData> {
-    private final int numFields;
-
-    RowDataReader(List<Type> types, List<ParquetValueReader<?>> readers) {
-      super(types, readers);
-      this.numFields = readers.size();
-    }
-
-    @Override
-    protected GenericRowData newStructData(RowData reuse) {
-      if (reuse instanceof GenericRowData) {
-        return (GenericRowData) reuse;
-      } else {
-        return new GenericRowData(numFields);
-      }
-    }
-
-    @Override
-    protected Object getField(GenericRowData intermediate, int pos) {
-      return intermediate.getField(pos);
-    }
-
-    @Override
-    protected RowData buildStruct(GenericRowData struct) {
-      return struct;
-    }
+  private static class FlinkReusableMapData implements ReusableMapData, MapData {

Review comment:
       Was the order of classes changed? It looks like that may be why there are more changes in this diff.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] holdenk commented on a change in pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

holdenk commented on a change in pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572#discussion_r506696399



##########
File path: flink/src/main/java/org/apache/iceberg/flink/data/FlinkParquetReaders.java
##########
@@ -480,178 +481,95 @@ protected ArrayData buildList(ReusableArrayData list) {
     }
 
     @Override
-    protected ReusableMapData newMapData(MapData reuse) {
+    protected FlinkReusableMapData newMapData(MapData reuse) {
       this.readPos = 0;
-      this.writePos = 0;
 
-      if (reuse instanceof ReusableMapData) {
-        return (ReusableMapData) reuse;
+      if (reuse instanceof FlinkReusableMapData) {
+        return (FlinkReusableMapData) reuse;
       } else {
-        return new ReusableMapData();
+        return new FlinkReusableMapData();
       }
     }
 
     @Override
     @SuppressWarnings("unchecked")
-    protected Map.Entry<K, V> getPair(ReusableMapData map) {
-      Map.Entry<K, V> kv = nullEntry;
-      if (readPos < map.capacity()) {
-        entry.set((K) map.keys.values[readPos], (V) map.values.values[readPos]);
-        kv = entry;
-      }
-
+    protected Map.Entry<K, V> getPair(FlinkReusableMapData map) {
+      Map.Entry<K, V> kv = map.getRaw(readPos, entry, nullEntry);
       readPos += 1;
 
       return kv;
     }
 
     @Override
-    protected void addPair(ReusableMapData map, K key, V value) {
-      if (writePos >= map.capacity()) {
-        map.grow();
-      }
-
-      map.keys.values[writePos] = key;
-      map.values.values[writePos] = value;
-
-      writePos += 1;
+    protected void addPair(FlinkReusableMapData map, K key, V value) {
+      map.addPair(key, value);
     }
 
     @Override
-    protected MapData buildMap(ReusableMapData map) {
-      map.setNumElements(writePos);
+    protected MapData buildMap(FlinkReusableMapData map) {
       return map;
     }
   }
 
-  private static class RowDataReader extends ParquetValueReaders.StructReader<RowData, GenericRowData> {
-    private final int numFields;
-
-    RowDataReader(List<Type> types, List<ParquetValueReader<?>> readers) {
-      super(types, readers);
-      this.numFields = readers.size();
-    }
-
-    @Override
-    protected GenericRowData newStructData(RowData reuse) {
-      if (reuse instanceof GenericRowData) {
-        return (GenericRowData) reuse;
-      } else {
-        return new GenericRowData(numFields);
-      }
-    }
-
-    @Override
-    protected Object getField(GenericRowData intermediate, int pos) {
-      return intermediate.getField(pos);
-    }
-
-    @Override
-    protected RowData buildStruct(GenericRowData struct) {
-      return struct;
-    }
+  private static class FlinkReusableMapData implements ReusableMapData, MapData {

Review comment:
       Let me see if I can minimize the changes in the diff this weekend.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue closed pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

rdblue closed pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] holdenk commented on pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

holdenk commented on pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572#issuecomment-708715504


   Thanks y'all for the reviews. Sorry I'm a little slow responding to this, I'm currently dealing with a race condition inside of some new Spark code but I'm hoping to circle back to this PR before the end of the week.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] holdenk commented on a change in pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

holdenk commented on a change in pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572#discussion_r544554782



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ReusableMapData.java
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+import java.util.Map;
+
+public interface ReusableMapData {
+  ReusableArrayData keys();
+  ReusableArrayData values();
+
+  default void grow() {
+    keys().grow();
+    values().grow();
+  }
+
+  default int capacity() {
+    return keys().capacity();
+  }
+
+  default void setNumElements(int numElements) {
+    keys().setNumElements(numElements);
+    values().setNumElements(numElements);
+  }
+
+  int size();

Review comment:
       I tried that, doesn't work since size is required by the Spark/Flink class/interface.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572#discussion_r502724748



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkParquetReaders.java
##########
@@ -655,6 +653,11 @@ public int numElements() {
       return numElements;
     }
 
+    @Override
+    public int getNumElements() {

Review comment:
       Before, this was called `numElements`. Could you revert the name change?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] holdenk commented on a change in pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

holdenk commented on a change in pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572#discussion_r506666245



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ReusableArrayData.java
##########
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+public interface ReusableArrayData {

Review comment:
       Circled back, and yeah I think this does have to be an interface :/




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] holdenk commented on a change in pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

holdenk commented on a change in pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572#discussion_r505069914



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ReusableArrayData.java
##########
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+public interface ReusableArrayData {

Review comment:
       I was thinking about this today. Originally I had assumed that I could not have the flink one inherit from Spark because flink & spark wouldn't want to take dependencies on each other. But since the flink one is an interface, I'll give it a shot and see if I can make this an abstract class.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] openinx commented on a change in pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

openinx commented on a change in pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572#discussion_r507404135



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ReusableArrayData.java
##########
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+public interface ReusableArrayData {

Review comment:
       Is it possible that design the ReusableArrayData as a generic class `ReusableArrayData<T>` ? 
   
   and the`FlinkReusableArrayData` and `SparkReusableArrayData` would have a private member which is `ReusableArrayData<Object>` , then both of them could delegate  methods such as capacity(), setNumElements(), getNumElements(). 
   
   Making the ReusableArrayData an interface just confused me a bit, for me it should have a complete implementation , although its type is generic. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick commented on a change in pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

kbendick commented on a change in pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572#discussion_r502843848



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkParquetReaders.java
##########
@@ -655,6 +653,11 @@ public int numElements() {
       return numElements;
     }
 
+    @Override
+    public int getNumElements() {

Review comment:
       It looks like the function `numElements` is still there (it's part of the catalyst abstract class), in the lines just above these ones.
   
   But unless there's something going on with Spark, I agree with @rdblue.
   
   I looked to see if `getNumElements` were needed on the Flink side, as Flink highly favors POJOs for performance reasons within their own serialization that they can infer and then for style reasons they also tend to use POJO-y like names at times even when its not always necessarily needed in the current moment, but the Flink `ArrayData` interface uses `size` instead, so keeping the iceberg ReusableArrayData interface to use `numElements` seems like it would kill two birds with one stone by also fulfilling the catalyst contract.
   
   But maybe there's something we're not seeing.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] holdenk commented on a change in pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

holdenk commented on a change in pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572#discussion_r502725663



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ReusableArrayData.java
##########
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+public interface ReusableArrayData {

Review comment:
       Yeah, the problem is Spark's base class is an abstract class and Flink's is an interface :/




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] holdenk commented on a change in pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

holdenk commented on a change in pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572#discussion_r544534945



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ReusableArrayData.java
##########
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+public interface ReusableArrayData {

Review comment:
       I don't think so since we can't do multiple inheritence with the class.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick commented on pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

kbendick commented on pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572#issuecomment-706647804


   This is an important task so thank you for taking it on @holdenk 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572#discussion_r506668827



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ReusableMapData.java
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+import java.util.Map;
+
+public interface ReusableMapData {
+  ReusableArrayData keys();
+  ReusableArrayData values();
+
+  default void grow() {
+    keys().grow();
+    values().grow();
+  }
+
+  default int capacity() {
+    return keys().capacity();
+  }
+
+  default void setNumElements(int numElements) {
+    keys().setNumElements(numElements);
+    values().setNumElements(numElements);
+  }
+
+  int size();
+
+  @SuppressWarnings("unchecked")
+  default <K, V> Map.Entry<K, V> getRaw(
+      int pos,
+      ParquetValueReaders.ReusableEntry<K, V> reuse,
+      ParquetValueReaders.ReusableEntry<K, V> def) {
+    if (pos < capacity()) {
+      reuse.set((K) keys().getObj(pos), (V) values().getObj(pos));
+      return reuse;
+    }
+    return def;
+  }
+
+  default void addPair(Object key, Object value) {
+    if (size() >= capacity()) {
+      grow();
+    }
+    keys().update(size(), key);
+    values().update(size(), value);
+    setNumElements(size() + 1);

Review comment:
       Yes, my point is that we don't need to set the number of valid elements until the end, just before MapData is returned.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] holdenk commented on a change in pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

holdenk commented on a change in pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572#discussion_r544539618



##########
File path: parquet/src/main/java/org/apache/iceberg/parquet/ReusableMapData.java
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.parquet;
+
+import java.util.Map;
+
+public interface ReusableMapData {
+  ReusableArrayData keys();
+  ReusableArrayData values();
+
+  default void grow() {
+    keys().grow();
+    values().grow();
+  }
+
+  default int capacity() {
+    return keys().capacity();
+  }
+
+  default void setNumElements(int numElements) {
+    keys().setNumElements(numElements);
+    values().setNumElements(numElements);
+  }
+
+  int size();
+
+  @SuppressWarnings("unchecked")
+  default <K, V> Map.Entry<K, V> getRaw(
+      int pos,
+      ParquetValueReaders.ReusableEntry<K, V> reuse,
+      ParquetValueReaders.ReusableEntry<K, V> def) {
+    if (pos < capacity()) {
+      reuse.set((K) keys().getObj(pos), (V) values().getObj(pos));
+      return reuse;
+    }
+    return def;
+  }
+
+  default void addPair(Object key, Object value) {
+    if (size() >= capacity()) {
+      grow();
+    }
+    keys().update(size(), key);
+    values().update(size(), value);
+    setNumElements(size() + 1);

Review comment:
       Oh right because this is an interface we need to call a method anyways since we can't have a concrete type in the interface.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

rdblue commented on pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572#issuecomment-747137094


   I just talked with @holdenk directly about this and we agreed that it probably isn't the right direction to go because the shared code needs to be carried by an interface instead of an abstract class. The interface can't handle its own state, so it ends up having way more method calls to get state from the child classes. That splits state that should be managed by a single class across multiple places (e.g. buffer growth) and could introduce unnecessary dispatch costs due to the calls.
   
   We agreed that it is cleaner to have some code duplication instead, so I'll close this issue. Thanks for investigating it and working on the prototype, @holdenk!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick commented on a change in pull request #1572: Refactor reusable map data in Spark and Flink parquet readers #1331

Posted by GitBox <gi...@apache.org>.

kbendick commented on a change in pull request #1572:
URL: https://github.com/apache/iceberg/pull/1572#discussion_r502843848



##########
File path: spark/src/main/java/org/apache/iceberg/spark/data/SparkParquetReaders.java
##########
@@ -655,6 +653,11 @@ public int numElements() {
       return numElements;
     }
 
+    @Override
+    public int getNumElements() {

Review comment:
       It looks like the function `numElements` is still there (it's part of the catalyst abstract class), in the lines just above these ones.
   
   But unless there's something going on with Spark, I agree with @rdblue.
   
   I looked to see if `getNumElements` were needed on the Flink side, as Flink highly favors POJOs for performance reasons within their own serialization that they can infer and then for style reasons they also tend to use POJO-y like names at times even when its not always necessarily needed in the current moment, but the Flink `ArrayData` interface uses `size` instead, so keeping the iceberg ReusableArrayData interface to use `numElements` seems like it would kill two birds with one stone by also fulfilling the catalyst contract.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org