You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/08/19 13:50:08 UTC

[GitHub] [arrow] alamb opened a new pull request #8007: ARROW-9790: [Rust][Parquet] Fix PrimitiveArrayReader boundary conditions

alamb opened a new pull request #8007:
URL: https://github.com/apache/arrow/pull/8007


   When I was reading a parquet file into `RecordBatches` using `ParquetFileArrowReader` that had row groups that were 100,000 rows in length with a batch size of 60,000, after reading 300,000 rows successfully, I started seeing this error
   
   ```
    ParquetError("Parquet error: Not all children array length are the same!")
   ```
   
   Upon investigation, I found that when reading with `ParquetFileArrowReader`, if the parquet input file has multiple row groups, and if a batch happens to end at the end of a row group for Int or Float, no subsequent row groups are read
   
   Visually:
   
   ```
   +-----+
   | RG1 |
   |     |
   +-----+  <-- If a batch ends exactly at the end of this row group (page), RG2 is never read
   +-----+
   | RG2 |
   |     |
   +-----+
   ```
   
   I traced the issue down to a bug in `PrimitiveArrayReader` where it mistakenly interprets reading `0` rows from a page reader as being at the end of the column.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sunchao closed pull request #8007: ARROW-9790: [Rust][Parquet] Fix PrimitiveArrayReader boundary conditions

Posted by GitBox <gi...@apache.org>.
sunchao closed pull request #8007:
URL: https://github.com/apache/arrow/pull/8007


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8007: ARROW-9790: [Rust][Parquet] Fix PrimitiveArrayReader boundary conditions

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8007:
URL: https://github.com/apache/arrow/pull/8007#issuecomment-676450197


   As the code that I removed in this PR  added to support empty pages in 8a61570a8bd64ae6f6e47ce8efd2287c3c2feded (but it didn't seem to have any tests), I was worried that this deletion would cause some sort of reversion in behavior. To ensure it didn't, I added a test for empty pages in commit cfdea8349a8f41229a4c8fb2cfddec3bb11df112
   
   To try and ensure that the test covers the empty pages case, I and ran the new test on commit ed1f771dccdde623ce85e212eccb2b573185c461 (the one right before 8a61570a8bd64ae6f6e47ce8efd2287c3c2feded) and the test fails, thus leading me to conclude that the code removed in this PR is unnecessary and zero page parquet files are still supported.
   
   Log of what I did:
   
   Get commit right before support for empty pages:
   ```
   alamb@MacBook-Pro rust % git checkout ed1f771dccdde623ce85e212eccb2b573185c461
   git checkout ed1f771dccdde623ce85e212eccb2b573185c461
   M	testing
   Note: switching to 'ed1f771dccdde623ce85e212eccb2b573185c461'.
   ...
   HEAD is now at ed1f771dc ARROW-8717: [CI][Packaging] Add build dependency on boost to homebrew
   ```
   
   Apply new test for no pages:
   
   ```
   alamb@MacBook-Pro rust % git cherry-pick cfdea8349a8f41229a4c8fb2cfddec3bb11df112
   git cherry-pick cfdea8349a8f41229a4c8fb2cfddec3bb11df112
   Auto-merging rust/parquet/src/arrow/array_reader.rs
   [detached HEAD ab3bcdd8f] Add a test for reading empty pages
    Date: Wed Aug 19 10:16:31 2020 -0400
    1 file changed, 57 insertions(+), 1 deletion(-)
   ```
   
   Then I had to edit the test a little to use `.len` instead of `.is_empty` (which did not exist in the old commit):
   
   ```
   diff --git a/rust/parquet/src/arrow/array_reader.rs b/rust/parquet/src/arrow/array_reader.rs
   index d70619518..54c929c05 100644
   --- a/rust/parquet/src/arrow/array_reader.rs
   +++ b/rust/parquet/src/arrow/array_reader.rs
   @@ -967,7 +967,7 @@ mod tests {
   
            // expect no values to be read
            let array = array_reader.next_batch(50).unwrap();
   -        assert!(array.is_empty());
   +        assert!(array.len() == 0);
        }
   
        #[test]
   ```
   
   And then I ran the test, and it fails:
   
   ```
   cd arrow/rust/parquet && PARQUET_TEST_DATA=`pwd`/../../cpp/submodules/parquet-testing/data ARROW_TEST_DATA=`pwd`/../../testing/data cargo test
   ...
   ---- arrow::array_reader::tests::test_primitive_array_reader_empty_pages stdout ----
   thread 'arrow::array_reader::tests::test_primitive_array_reader_empty_pages' panicked at 'called `Result::unwrap()` on an `Err` value: General("Can\'t build array without pages!")', parquet/src/arrow/array_reader.rs:962:32
   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   ```
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #8007: ARROW-9790: [Rust][Parquet] Fix PrimitiveArrayReader boundary conditions

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8007:
URL: https://github.com/apache/arrow/pull/8007#issuecomment-676450729


   @sunchao Could you review as well ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8007: ARROW-9790: [Rust][Parquet] Fix PrimitiveArrayReader boundary conditions

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8007:
URL: https://github.com/apache/arrow/pull/8007#discussion_r473075068



##########
File path: rust/parquet/src/arrow/array_reader.rs
##########
@@ -1000,6 +998,33 @@ mod tests {
         }
     }
 
+    #[test]
+    fn test_primitive_array_reader_empty_pages() {

Review comment:
       This is a test for the functionality added in https://github.com/apache/arrow/commit/8a61570a8bd64ae6f6e47ce8efd2287c3c2feded / https://github.com/apache/arrow/pull/7140




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sunchao commented on pull request #8007: ARROW-9790: [Rust][Parquet] Fix PrimitiveArrayReader boundary conditions

Posted by GitBox <gi...@apache.org>.
sunchao commented on pull request #8007:
URL: https://github.com/apache/arrow/pull/8007#issuecomment-676475353


   > @sunchao Could you review as well ?
   
   Thanks for pinging! will do today.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8007: ARROW-9790: [Rust][Parquet] Fix PrimitiveArrayReader boundary conditions

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8007:
URL: https://github.com/apache/arrow/pull/8007#discussion_r473116005



##########
File path: rust/parquet/src/arrow/arrow_reader.rs
##########
@@ -304,6 +304,45 @@ mod tests {
         >(2, 100, 2, message_type, 15, 50, converter);
     }
 
+    #[test]

Review comment:
       https://github.com/apache/arrow/pull/8009




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8007: ARROW-9790: [Rust][Parquet] Fix PrimitiveArrayReader boundary conditions

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8007:
URL: https://github.com/apache/arrow/pull/8007#discussion_r473045357



##########
File path: rust/parquet/src/arrow/array_reader.rs
##########
@@ -136,10 +136,8 @@ impl<T: DataType> ArrayReader for PrimitiveArrayReader<T> {
         while records_read < batch_size {
             let records_to_read = batch_size - records_read;
 
+            // NB can be 0 if at end of page
             let records_read_once = self.record_reader.read_records(records_to_read)?;
-            if records_read_once == 0 {

Review comment:
       The case of `0` rows being read is handled in the `if records_read_once < records_to-read` clause below -- namely in this case the code needs to try and get the next page of data from the page reader. 

##########
File path: rust/parquet/src/arrow/arrow_reader.rs
##########
@@ -304,6 +304,45 @@ mod tests {
         >(2, 100, 2, message_type, 15, 50, converter);
     }
 
+    #[test]

Review comment:
       Both these tests fail without the changes in this PR.
   
   I don't like the copy/paste nature of these tests and I plan a minor PR building on this one proposing how to remove the duplication and make the tests easier to read. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8007: ARROW-9790: [Rust][Parquet] Fix PrimitiveArrayReader boundary conditions

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8007:
URL: https://github.com/apache/arrow/pull/8007#issuecomment-676414386


   https://issues.apache.org/jira/browse/ARROW-9790


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] sunchao commented on a change in pull request #8007: ARROW-9790: [Rust][Parquet] Fix PrimitiveArrayReader boundary conditions

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #8007:
URL: https://github.com/apache/arrow/pull/8007#discussion_r473493106



##########
File path: rust/parquet/src/arrow/array_reader.rs
##########
@@ -136,10 +136,8 @@ impl<T: DataType> ArrayReader for PrimitiveArrayReader<T> {
         while records_read < batch_size {
             let records_to_read = batch_size - records_read;
 
+            // NB can be 0 if at end of page
             let records_read_once = self.record_reader.read_records(records_to_read)?;
-            if records_read_once == 0 {

Review comment:
       +1




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb edited a comment on pull request #8007: ARROW-9790: [Rust][Parquet] Fix PrimitiveArrayReader boundary conditions

Posted by GitBox <gi...@apache.org>.
alamb edited a comment on pull request #8007:
URL: https://github.com/apache/arrow/pull/8007#issuecomment-676450197


   As the code that I removed in this PR  added to support empty pages in 8a61570a8bd64ae6f6e47ce8efd2287c3c2feded (but it didn't seem to have any tests), I was worried that this deletion would cause some sort of reversion in behavior. To ensure it didn't, I added a test for empty pages in commit cfdea8349a8f41229a4c8fb2cfddec3bb11df112
   
   To try and ensure that the test covers the empty pages case, I and ran the new test on commit ed1f771dccdde623ce85e212eccb2b573185c461 (the one right before 8a61570a8bd64ae6f6e47ce8efd2287c3c2feded) and the test fails, thus leading me to conclude that the code removed in this PR is unnecessary and zero page parquet files are still supported.
   
   Log of what I did:
   
   Get commit right before support for empty pages:
   ```
   alamb@MacBook-Pro rust % git checkout ed1f771dccdde623ce85e212eccb2b573185c461
   git checkout ed1f771dccdde623ce85e212eccb2b573185c461
   M	testing
   Note: switching to 'ed1f771dccdde623ce85e212eccb2b573185c461'.
   ...
   HEAD is now at ed1f771dc ARROW-8717: [CI][Packaging] Add build dependency on boost to homebrew
   ```
   
   Apply new test for no pages:
   
   ```
   alamb@MacBook-Pro rust % git cherry-pick cfdea8349a8f41229a4c8fb2cfddec3bb11df112
   git cherry-pick cfdea8349a8f41229a4c8fb2cfddec3bb11df112
   Auto-merging rust/parquet/src/arrow/array_reader.rs
   [detached HEAD ab3bcdd8f] Add a test for reading empty pages
    Date: Wed Aug 19 10:16:31 2020 -0400
    1 file changed, 57 insertions(+), 1 deletion(-)
   ```
   
   Then I had to edit the test a little to use `.len` instead of `.is_empty` (which did not exist in the old commit):
   
   ```
   diff --git a/rust/parquet/src/arrow/array_reader.rs b/rust/parquet/src/arrow/array_reader.rs
   index d70619518..54c929c05 100644
   --- a/rust/parquet/src/arrow/array_reader.rs
   +++ b/rust/parquet/src/arrow/array_reader.rs
   @@ -967,7 +967,7 @@ mod tests {
   
            // expect no values to be read
            let array = array_reader.next_batch(50).unwrap();
   -        assert!(array.is_empty());
   +        assert!(array.len() == 0);
        }
   
        #[test]
   ```
   
   And then I ran the test, and it fails:
   
   ```
   cd arrow/rust/parquet && PARQUET_TEST_DATA=`pwd`/../../cpp/submodules/parquet-testing/data ARROW_TEST_DATA=`pwd`/../../testing/data cargo test
   ...
   ---- arrow::array_reader::tests::test_primitive_array_reader_empty_pages stdout ----
   thread 'arrow::array_reader::tests::test_primitive_array_reader_empty_pages' panicked at 'called `Result::unwrap()` on an `Err` value: General("Can\'t build array without pages!")', parquet/src/arrow/array_reader.rs:962:32
   note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
   ```
   FYI @houqp  and @paddyhoran 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org