You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/09/18 15:16:24 UTC

[GitHub] [iceberg] mayursrivastava opened a new pull request #3148: [ARROW] Vectorized Parquet Reads - Make ArrowReader's iterator idempotent

mayursrivastava opened a new pull request #3148:
URL: https://github.com/apache/iceberg/pull/3148


   Follow-up on https://github.com/apache/iceberg/pull/2933.
   + @nastra 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #3148: [ARROW] Vectorized Parquet Reads - Make ArrowReader's iterator idempotent

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #3148:
URL: https://github.com/apache/iceberg/pull/3148#discussion_r712756668



##########
File path: arrow/src/main/java/org/apache/iceberg/arrow/vectorized/ArrowReader.java
##########
@@ -308,7 +307,11 @@ public boolean hasNext() {
 
     @Override
     public ColumnarBatch next() {
-      return current;
+      if (hasNext()) {
+        return currentIterator.next();

Review comment:
       Is it possible to initialize `currentIterator` elsewhere so that `next` doesn't need to call `hasNext`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] mayursrivastava commented on a change in pull request #3148: [ARROW] Vectorized Parquet Reads - Make ArrowReader's iterator idempotent

Posted by GitBox <gi...@apache.org>.
mayursrivastava commented on a change in pull request #3148:
URL: https://github.com/apache/iceberg/pull/3148#discussion_r712143514



##########
File path: arrow/src/main/java/org/apache/iceberg/arrow/vectorized/ArrowReader.java
##########
@@ -308,7 +307,11 @@ public boolean hasNext() {
 
     @Override
     public ColumnarBatch next() {
-      return current;
+      if (hasNext()) {
+        return currentIterator.next();

Review comment:
       Thanks @nastra. 
   
   Yes, that was my first thought as well. But, I wanted this iterator to return all rows even if only next() is called repeatedly without checking hasNext(), and if we don't call hasNext() inside next(), currentIterator will not be initialized and possibly a NullPointerException will be thrown.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] mayur-srivastava commented on pull request #3148: [ARROW] Vectorized Parquet Reads - Make ArrowReader's iterator idempotent

Posted by GitBox <gi...@apache.org>.
mayur-srivastava commented on pull request #3148:
URL: https://github.com/apache/iceberg/pull/3148#issuecomment-927187371


   @rdblue @nastra @kbendick please approve and merge if this looks good.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] mayursrivastava commented on a change in pull request #3148: [ARROW] Vectorized Parquet Reads - Make ArrowReader's iterator idempotent

Posted by GitBox <gi...@apache.org>.
mayursrivastava commented on a change in pull request #3148:
URL: https://github.com/apache/iceberg/pull/3148#discussion_r712141284



##########
File path: arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowReaderTest.java
##########
@@ -353,6 +371,29 @@ private void readAndCheckArrowResult(
     assertEquals(expectedTotalRows, totalRows);
   }
 
+  private void readAndCheckHasNextIsIdempotent(
+      TableScan scan,
+      int numRowsPerRoot,
+      int expectedTotalRows,
+      int numExtraCallsToHasNext) throws IOException {
+    int totalRows = 0;
+    try (VectorizedTableScanIterable itr = new VectorizedTableScanIterable(scan, numRowsPerRoot, false)) {
+      CloseableIterator<ColumnarBatch> iterator = itr.iterator();
+      while (iterator.hasNext()) {
+        // Call hasNext() a few extra times.
+        // This should not affect the total number of rows read.
+        for (int i = 0; i < numExtraCallsToHasNext; i++) {
+          assertTrue(iterator.hasNext());
+        }
+
+        ColumnarBatch batch = iterator.next();
+        VectorSchemaRoot root = batch.createVectorSchemaRootFromVectors();
+        totalRows += root.getRowCount();

Review comment:
       Sure. Added the test.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] mayursrivastava commented on a change in pull request #3148: [ARROW] Vectorized Parquet Reads - Make ArrowReader's iterator idempotent

Posted by GitBox <gi...@apache.org>.
mayursrivastava commented on a change in pull request #3148:
URL: https://github.com/apache/iceberg/pull/3148#discussion_r712978161



##########
File path: arrow/src/main/java/org/apache/iceberg/arrow/vectorized/ArrowReader.java
##########
@@ -308,7 +307,11 @@ public boolean hasNext() {
 
     @Override
     public ColumnarBatch next() {
-      return current;
+      if (hasNext()) {
+        return currentIterator.next();

Review comment:
       I'm following the pattern in https://github.com/apache/iceberg/blob/90225d6c9413016d611e2ce5eff37db1bc1b4fc5/api/src/main/java/org/apache/iceberg/io/CloseableIterable.java#L204. 
   
   Initializing currentIterator is not the only thing that needs to be done, we need to keep creating new currentIterator on every new data file. We can move the common logic to another method but it will be same logic as hasNext() and I don't see any real benefit in moving the code. 
   
   Do you have something in mind that will improve the logic?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3148: [ARROW] Vectorized Parquet Reads - Make ArrowReader's iterator idempotent

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3148:
URL: https://github.com/apache/iceberg/pull/3148#discussion_r711821316



##########
File path: arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowReaderTest.java
##########
@@ -353,6 +371,29 @@ private void readAndCheckArrowResult(
     assertEquals(expectedTotalRows, totalRows);
   }
 
+  private void readAndCheckHasNextIsIdempotent(
+      TableScan scan,
+      int numRowsPerRoot,
+      int expectedTotalRows,
+      int numExtraCallsToHasNext) throws IOException {
+    int totalRows = 0;
+    try (VectorizedTableScanIterable itr = new VectorizedTableScanIterable(scan, numRowsPerRoot, false)) {
+      CloseableIterator<ColumnarBatch> iterator = itr.iterator();
+      while (iterator.hasNext()) {
+        // Call hasNext() a few extra times.
+        // This should not affect the total number of rows read.
+        for (int i = 0; i < numExtraCallsToHasNext; i++) {
+          assertTrue(iterator.hasNext());
+        }
+
+        ColumnarBatch batch = iterator.next();
+        VectorSchemaRoot root = batch.createVectorSchemaRootFromVectors();
+        totalRows += root.getRowCount();

Review comment:
       How does this test that the iterator is idempotent? It looks like this just tests that the batch size is correct. I think that this should also call `checkAllVectorValues` to ensure that the expected rows are the ones produced rather than relying on the total number of rows.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #3148: [ARROW] Vectorized Parquet Reads - Make ArrowReader's iterator idempotent

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #3148:
URL: https://github.com/apache/iceberg/pull/3148#discussion_r720860330



##########
File path: arrow/src/main/java/org/apache/iceberg/arrow/vectorized/ArrowReader.java
##########
@@ -308,7 +307,11 @@ public boolean hasNext() {
 
     @Override
     public ColumnarBatch next() {
-      return current;
+      if (hasNext()) {
+        return currentIterator.next();

Review comment:
       I think this is good, and I agree with the call to `hasNext()`. We shouldn't rely on the underlying iterator for correct behavior (throwing `NoSuchElementException`) because iterators are actually hard to get right. Plus, this ensures that `hasNext()` doesn't advance the iterator.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #3148: [ARROW] Vectorized Parquet Reads - Make ArrowReader's iterator idempotent

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #3148:
URL: https://github.com/apache/iceberg/pull/3148


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] nastra commented on a change in pull request #3148: [ARROW] Vectorized Parquet Reads - Make ArrowReader's iterator idempotent

Posted by GitBox <gi...@apache.org>.
nastra commented on a change in pull request #3148:
URL: https://github.com/apache/iceberg/pull/3148#discussion_r711897194



##########
File path: arrow/src/main/java/org/apache/iceberg/arrow/vectorized/ArrowReader.java
##########
@@ -308,7 +307,11 @@ public boolean hasNext() {
 
     @Override
     public ColumnarBatch next() {
-      return current;
+      if (hasNext()) {
+        return currentIterator.next();

Review comment:
       I think it would be sufficient to just call `currentIterator.next` here because the `NoSuchElementException` will be thrown implicitly when there's no other element




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] mayursrivastava commented on a change in pull request #3148: [ARROW] Vectorized Parquet Reads - Make ArrowReader's iterator idempotent

Posted by GitBox <gi...@apache.org>.
mayursrivastava commented on a change in pull request #3148:
URL: https://github.com/apache/iceberg/pull/3148#discussion_r712143747



##########
File path: arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowReaderTest.java
##########
@@ -353,6 +371,29 @@ private void readAndCheckArrowResult(
     assertEquals(expectedTotalRows, totalRows);
   }
 
+  private void readAndCheckHasNextIsIdempotent(
+      TableScan scan,
+      int numRowsPerRoot,
+      int expectedTotalRows,
+      int numExtraCallsToHasNext) throws IOException {
+    int totalRows = 0;
+    try (VectorizedTableScanIterable itr = new VectorizedTableScanIterable(scan, numRowsPerRoot, false)) {
+      CloseableIterator<ColumnarBatch> iterator = itr.iterator();
+      while (iterator.hasNext()) {
+        // Call hasNext() a few extra times.
+        // This should not affect the total number of rows read.
+        for (int i = 0; i < numExtraCallsToHasNext; i++) {
+          assertTrue(iterator.hasNext());
+        }
+
+        ColumnarBatch batch = iterator.next();
+        VectorSchemaRoot root = batch.createVectorSchemaRootFromVectors();
+        totalRows += root.getRowCount();

Review comment:
       Thanks for reviewing this @rdblue 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #3148: [ARROW] Vectorized Parquet Reads - Make ArrowReader's iterator idempotent

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #3148:
URL: https://github.com/apache/iceberg/pull/3148#discussion_r715236373



##########
File path: arrow/src/main/java/org/apache/iceberg/arrow/vectorized/ArrowReader.java
##########
@@ -308,7 +307,11 @@ public boolean hasNext() {
 
     @Override
     public ColumnarBatch next() {
-      return current;
+      if (hasNext()) {
+        return currentIterator.next();

Review comment:
       I can't think of anything that would improve the logic.
   
   I'm a bit hesitant about copying some of the logic from `CloseableIterator` as there are a few situations we've seen where that hangs. But in the absence of a complete understanding of that issue, I think it's fine to borrow form there as it's the same situation.
   
   Also, thanks for letting me know that this came from `CloseableIterator`. I have plans to begin researching that soon and that helps. 👍 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #3148: [ARROW] Vectorized Parquet Reads - Make ArrowReader's iterator idempotent

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #3148:
URL: https://github.com/apache/iceberg/pull/3148#issuecomment-932992708


   Thanks for fixing the tests, @mayursrivastava!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org