Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/05/25 15:46:02 UTC

[GitHub] [arrow-rs] garyanaplan opened a new issue #349: parquet reading hangs when row_group contains more than 2048 rows of data

garyanaplan opened a new issue #349:
URL: https://github.com/apache/arrow-rs/issues/349


   **Describe the bug**
   Reading an apparently valid parquet file (one that Java tools such as parquet-tools can read) from any Rust program hangs, with CPU load at 100%. Reproduced on both 4.0.0 and 4.1.0 with rustc 1.51.0.
   
   **To Reproduce**
   Create a parquet file with 2 row groups, each with more than 2048 rows (e.g. 2049). Run a Rust program to read the file: it will hang when it reaches the 2048th row. A Java program (parquet-tools) reads the same file without issue.
   
   This test program produces a file that can then be read with parquet-read to reproduce the hang:
   
    ```
        // Imports assumed for the parquet 4.x API used in this snippet.
        use std::convert::TryInto;
        use std::fs;
        use std::path::Path;
        use std::sync::Arc;

        use parquet::column::writer::ColumnWriter;
        use parquet::data_type::ByteArray;
        use parquet::file::properties::WriterProperties;
        use parquet::file::writer::{FileWriter, RowGroupWriter, SerializedFileWriter};
        use parquet::schema::parser::parse_message_type;

        #[test]
       fn it_writes_data() {
           let path = Path::new("sample.parquet");
   
           let message_type = "
     message ItHangs {
       REQUIRED INT64 DIM0;
       REQUIRED DOUBLE DIM1;
       REQUIRED BYTE_ARRAY DIM2;
       REQUIRED BOOLEAN DIM3;
     }
   ";
           let schema = Arc::new(parse_message_type(message_type).unwrap());
           let props = Arc::new(WriterProperties::builder().build());
           let file = fs::File::create(&path).unwrap();
           let mut writer = SerializedFileWriter::new(file, schema, props).unwrap();
           for _group in 0..1 {
               let mut row_group_writer = writer.next_row_group().unwrap();
               let values: Vec<i64> = vec![0; 2049];
               let my_values: Vec<i64> = values
                   .iter()
                   .enumerate()
                   .map(|(count, _x)| count.try_into().unwrap())
                   .collect();
               let my_double_values: Vec<f64> = values
                   .iter()
                   .enumerate()
                   .map(|(count, _x)| count as f64)
                   .collect();
               let my_bool_values: Vec<bool> = values
                   .iter()
                   .enumerate()
                   .map(|(count, _x)| count % 2 == 0)
                   .collect();
               let my_ba_values: Vec<ByteArray> = values
                   .iter()
                   .enumerate()
                   .map(|(count, _x)| {
                       let s = format!("{}", count);
                       ByteArray::from(s.as_ref())
                   })
                   .collect();
               while let Some(mut col_writer) = row_group_writer.next_column().expect("next column") {
                   match col_writer {
                       // ... write values to a column writer
                       // You can also use `get_typed_column_writer` method to extract typed writer.
                       ColumnWriter::Int64ColumnWriter(ref mut typed_writer) => {
                           typed_writer
                               .write_batch(&my_values, None, None)
                               .expect("writing int column");
                       }
                       ColumnWriter::DoubleColumnWriter(ref mut typed_writer) => {
                           typed_writer
                               .write_batch(&my_double_values, None, None)
                               .expect("writing double column");
                       }
                       ColumnWriter::BoolColumnWriter(ref mut typed_writer) => {
                           typed_writer
                               .write_batch(&my_bool_values, None, None)
                               .expect("writing bool column");
                       }
                       ColumnWriter::ByteArrayColumnWriter(ref mut typed_writer) => {
                           typed_writer
                               .write_batch(&my_ba_values, None, None)
                               .expect("writing bytes column");
                       }
                       _ => {
                           println!("huh:!");
                       }
                   }
                   row_group_writer
                       .close_column(col_writer)
                       .expect("close column");
               }
               let rg_md = row_group_writer.close().expect("close row group");
               println!("total rows written: {}", rg_md.num_rows());
               writer
                   .close_row_group(row_group_writer)
                   .expect("close row groups");
           }
           writer.close().expect("close writer");
   
           let bytes = fs::read(&path).unwrap();
           assert_eq!(&bytes[0..4], &[b'P', b'A', b'R', b'1']);
       }
   ```
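
    For reference, a minimal reader that exercises the hang could look like the sketch below (assuming the parquet 4.x reader API; `sample.parquet` is the file written by the test above):

    ```
    use std::fs::File;
    use std::path::Path;

    use parquet::file::reader::{FileReader, SerializedFileReader};

    fn main() {
        // Open the file produced by the writer test above.
        let file = File::open(Path::new("sample.parquet")).expect("open sample.parquet");
        let reader = SerializedFileReader::new(file).expect("create reader");

        // Iterating the rows is enough to reproduce the issue: the loop
        // spins at 100% CPU around the 2048th row instead of finishing.
        for (i, row) in reader.get_row_iter(None).expect("row iter").enumerate() {
            println!("row {}: {}", i, row);
        }
    }
    ```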
   
   
   **Expected behavior**
   The read will complete without hanging.
   
   **Additional context**
   My development system is Mac OS X, so I have only tested on OS X.
   
   rustup reports:
   ```
   active toolchain
   ----------------

   1.51.0-x86_64-apple-darwin (default)
   rustc 1.51.0 (2fd73fabe 2021-03-23)
   ```
   
   





[GitHub] [arrow-rs] alamb commented on issue #349: parquet reading hangs when row_group contains more than 2048 rows of data

alamb commented on issue #349:
URL: https://github.com/apache/arrow-rs/issues/349#issuecomment-848301844


   Thanks for the report, @garyanaplan!





[GitHub] [arrow-rs] garyanaplan commented on issue #349: parquet reading hangs when row_group contains more than 2048 rows of data

garyanaplan commented on issue #349:
URL: https://github.com/apache/arrow-rs/issues/349#issuecomment-858513251


   More poking reveals that PlainEncoder has a bit_writer with a hard-coded size of 256 bytes (big enough to hold 2048 bits):
   ```
   // src/encodings/encoding.rs
   bit_writer: BitWriter::new(256),
   ```
   If you adjust that value up or down, you trip the error at different points, so it looks like a contributing factor. I'm now trying to understand the logic around buffer flushing and re-use. Feels like I'm getting close to the root cause.
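
   For context: 256 bytes is exactly 2048 bits, and PLAIN-encoded booleans take one bit per value, which is why the symptom appears at the 2048-row boundary. Below is a toy sketch of that failure mode; `FixedBitBuffer` is a hypothetical stand-in for `BitWriter`, not the parquet-rs internals:

   ```
   // Toy stand-in for a fixed-capacity bit writer: once the buffer is
   // full, writes are refused by returning false instead of growing.
   struct FixedBitBuffer {
       buffer: Vec<u8>,
       bit_offset: usize,
       max_bits: usize,
   }

   impl FixedBitBuffer {
       fn new(max_bytes: usize) -> Self {
           FixedBitBuffer {
               buffer: vec![0; max_bytes],
               bit_offset: 0,
               max_bits: max_bytes * 8,
           }
       }

       fn put_bit(&mut self, bit: bool) -> bool {
           if self.bit_offset >= self.max_bits {
               return false; // a caller that ignores this silently drops the value
           }
           if bit {
               self.buffer[self.bit_offset / 8] |= 1 << (self.bit_offset % 8);
           }
           self.bit_offset += 1;
           true
       }
   }

   fn main() {
       let mut w = FixedBitBuffer::new(256); // 256 bytes = 2048 bits
       let accepted = (0..2049).filter(|_| w.put_bit(true)).count();
       assert_eq!(accepted, 2048); // the 2049th value is lost
       println!("accepted {} of 2049 writes", accepted);
   }
   ```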





[GitHub] [arrow-rs] Pand9 edited a comment on issue #349: parquet reading hangs when row_group contains more than 2048 rows of data

Pand9 edited a comment on issue #349:
URL: https://github.com/apache/arrow-rs/issues/349#issuecomment-857903895


   I've also just encountered this. The common element with this reproduction is the BOOLEAN field; it worked once the BOOLEAN field was removed.
   
   After a quick investigation of the looping code, I've found something suspect, though it's only about naming, so I'm not sure it's an actual bug.
   
   This function returns a value that is initialized to the input's length and is called `values_to_read`: https://github.com/apache/arrow-rs/blob/0f55b828883b3b3afda43ae404b130d374e6f1a1/parquet/src/util/bit_util.rs#L588
   
   Meanwhile, the calling site assigns the return value to `values_read`.
   
   Btw, it loops because after reading 2048 values, this returned value is 0.
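
   To make the hang concrete, the read loop has roughly the shape sketched below (illustrative only, not the actual parquet-rs code): once the inner read starts returning 0, `values_read` stops advancing and the loop spins forever.

   ```
   fn main() {
       let total = 2049; // rows the file metadata promises
       let mut available = 2048; // boolean values actually present in the data
       let mut values_read = 0;

       while values_read < total {
           // Stand-in for the real read call: returns how many values it read.
           let n = std::cmp::min(available, total - values_read);
           available -= n;
           values_read += n;

           if n == 0 {
               // The real loop has no such guard: once the read returns 0,
               // it would spin here forever at 100% CPU.
               println!("read 0 with {} rows still expected: the hang", total - values_read);
               break;
           }
       }
   }
   ```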





[GitHub] [arrow-rs] garyanaplan commented on issue #349: parquet reading hangs when row_group contains more than 2048 rows of data

garyanaplan commented on issue #349:
URL: https://github.com/apache/arrow-rs/issues/349#issuecomment-858568206


   Looks like that hard-coded value (256) in the bit-writer is the root cause. When writing, if we try to put more than 2048 boolean values, the writer just "ignores" the extra writes: the boolean encoder silently ignores calls to `put_value` that return false.
   
   I have a fix for this which works by extending the size of the BitWriter in 256-byte increments, and which also checks the return value of `put_value` in `BoolType::encode()`, raising an error if the call fails.
   
    ```
   diff --git a/parquet/src/data_type.rs b/parquet/src/data_type.rs
   index aa1def3d..f7514d8b 100644
   --- a/parquet/src/data_type.rs
   +++ b/parquet/src/data_type.rs
   @@ -662,7 +662,9 @@ pub(crate) mod private {
                bit_writer: &mut BitWriter,
            ) -> Result<()> {
                for value in values {
   -                bit_writer.put_value(*value as u64, 1);
   +                if !bit_writer.put_value(*value as u64, 1) {
   +                    return Err(ParquetError::EOF("unable to put boolean value".to_string()));
   +                }
                }
                Ok(())
            }
   diff --git a/parquet/src/encodings/encoding.rs b/parquet/src/encodings/encoding.rs
   index d0427381..eb5ebb13 100644
   --- a/parquet/src/encodings/encoding.rs
   +++ b/parquet/src/encodings/encoding.rs
   @@ -112,6 +112,7 @@ pub struct PlainEncoder<T: DataType> {
        buffer: ByteBuffer,
        bit_writer: BitWriter,
        desc: ColumnDescPtr,
   +    bw_bytes_written: usize,
        _phantom: PhantomData<T>,
    }
    
   @@ -124,6 +125,7 @@ impl<T: DataType> PlainEncoder<T> {
                buffer: byte_buffer,
                bit_writer: BitWriter::new(256),
                desc,
   +            bw_bytes_written: 0,
                _phantom: PhantomData,
            }
        }
   @@ -153,7 +155,11 @@ impl<T: DataType> Encoder<T> for PlainEncoder<T> {
    
        #[inline]
        fn put(&mut self, values: &[T::T]) -> Result<()> {
   +        if self.bw_bytes_written + values.len() >= self.bit_writer.capacity() {
   +            self.bit_writer.extend(256);
   +        }
            T::T::encode(values, &mut self.buffer, &mut self.bit_writer)?;
   +        self.bw_bytes_written += values.len();
            Ok(())
        }
    }
   diff --git a/parquet/src/util/bit_util.rs b/parquet/src/util/bit_util.rs
   index 8dfb6312..45cfe2b6 100644
   --- a/parquet/src/util/bit_util.rs
   +++ b/parquet/src/util/bit_util.rs
   @@ -223,6 +223,20 @@ impl BitWriter {
            }
        }
    
   +    /// Extend buffer size
   +    #[inline]
   +    pub fn extend(&mut self, increment: usize) {
   +        self.max_bytes += increment;
   +        let extra = vec![0; increment];
   +        self.buffer.extend(extra);
   +    }
   +
   +    /// Report buffer size
   +    #[inline]
   +    pub fn capacity(&mut self) -> usize {
   +        self.max_bytes
   +    }
   +
        /// Consumes and returns the current buffer.
        #[inline]
        pub fn consume(mut self) -> Vec<u8> {
    ```
   
   Can anyone comment on this approach?
   





[GitHub] [arrow-rs] garyanaplan commented on issue #349: parquet reading hangs when row_group contains more than 2048 rows of data

garyanaplan commented on issue #349:
URL: https://github.com/apache/arrow-rs/issues/349#issuecomment-858592224


   Yeah. I'm not really happy with it, because I don't love the special handling for Booleans via the BitWriter. Just growing the buffer indefinitely seems "wrong", but I think any other kind of fix would be much more extensive/intrusive.
   
   I'll file the PR and see what feedback I get.





[GitHub] [arrow-rs] MichaelBitard commented on issue #349: parquet reading hangs when row_group contains more than 2048 rows of data

MichaelBitard commented on issue #349:
URL: https://github.com/apache/arrow-rs/issues/349#issuecomment-879223861


   This still happens with parquet 4.4.0. It may be related to another type; I'll try to reproduce it with a minimal example, but right now it always hangs after reading 2046 rows.
   
   EDIT: I just saw that the fix is on master but not released yet. Is there a way to get it into a 4.4.1, or does it rely too heavily on the 5.0.0 SNAPSHOT? I can try to make a PR based on 4.4.0 if you think you could release it.





[GitHub] [arrow-rs] MichaelBitard edited a comment on issue #349: parquet reading hangs when row_group contains more than 2048 rows of data

MichaelBitard edited a comment on issue #349:
URL: https://github.com/apache/arrow-rs/issues/349#issuecomment-879243332


   I just updated pqrs to use the latest version of parquet-rs and arrow (commit 6698eed), and the issue still happens with the example you provided @garyanaplan. It is stuck at 2046 rows read.
   
   To reproduce:
   * clone https://github.com/MichaelBitard/pqrs (I pushed the sample.parquet to the repository)
   * launch `cargo run rowcount sample.parquet` --> you'll see 2049 lines
   * launch `cargo run cat sample.parquet`
     * It'll hang at the 2046th line: 
   ```
   CurrentRow 2046 0
   {DIM0: 2046, DIM1: 2046.0, DIM2: [50, 48, 52, 54], DIM3: true}
   ```
   
   It is stuck in the print_rows function: https://github.com/MichaelBitard/pqrs/blob/master/src/utils.rs#L55





[GitHub] [arrow-rs] MichaelBitard commented on issue #349: parquet reading hangs when row_group contains more than 2048 rows of data

MichaelBitard commented on issue #349:
URL: https://github.com/apache/arrow-rs/issues/349#issuecomment-879999541


   Oops, you are right, sorry. 
   
   If I generate the sample.parquet with the latest version, it no longer hangs during reading.
   
   Thanks for noticing and sorry again!





[GitHub] [arrow-rs] garyanaplan commented on issue #349: parquet reading hangs when row_group contains more than 2048 rows of data

garyanaplan commented on issue #349:
URL: https://github.com/apache/arrow-rs/issues/349#issuecomment-858440550


   Yep. If I update my test to remove BOOLEAN from the schema, the problem goes away. I've done some digging around today, and it looks like the problem might lie in the generation of the file.
   
   I previously reported that `parquet-tools dump <file>` would happily process the file. However, when I trimmed the example down to just the BOOLEAN field in the schema and increased the number of rows in the group, I noticed the following when dumping:
   
   ```
   value 2039:   R:0 D:0 V:true
   value 2040:   R:0 D:0 V:false
   value 2041:   R:0 D:0 V:true
   value 2042:   R:0 D:0 V:false
   value 2043:   R:0 D:0 V:true
   value 2044:   R:0 D:0 V:false
   value 2045:   R:0 D:0 V:true
   value 2046:   R:0 D:0 V:false
   value 2047:   R:0 D:0 V:true
   value 2048:   R:0 D:0 V:false
   value 2049:   R:0 D:0 V:false
   value 2050:   R:0 D:0 V:false
   value 2051:   R:0 D:0 V:false
   value 2052:   R:0 D:0 V:false
   value 2053:   R:0 D:0 V:false
   value 2054:   R:0 D:0 V:false
   value 2055:   R:0 D:0 V:false
   ```
   All the values after 2048 are false and they continue to be false until the end of the file.
   It looks like the generated input file is invalid, so I'm going to poke around there a little next.





[GitHub] [arrow-rs] garyanaplan commented on issue #349: parquet reading hangs when row_group contains more than 2048 rows of data

Posted by GitBox <gi...@apache.org>.
garyanaplan commented on issue #349:
URL: https://github.com/apache/arrow-rs/issues/349#issuecomment-848585616


   You're welcome.
   
   Extra info: it happens with both debug and release builds, and I reproduced it with 1.51.0 on a Linux system.





[GitHub] [arrow-rs] alamb commented on issue #349: parquet reading hangs when row_group contains more than 2048 rows of data

alamb commented on issue #349:
URL: https://github.com/apache/arrow-rs/issues/349#issuecomment-880147276


   Thank you @MichaelBitard for taking the time to report it!





[GitHub] [arrow-rs] alamb commented on issue #349: parquet reading hangs when row_group contains more than 2048 rows of data

alamb commented on issue #349:
URL: https://github.com/apache/arrow-rs/issues/349#issuecomment-858589145


   @garyanaplan -- I think the best way to get feedback on the approach would be to open a pull request.





[GitHub] [arrow-rs] garyanaplan commented on issue #349: parquet reading hangs when row_group contains more than 2048 rows of data

garyanaplan commented on issue #349:
URL: https://github.com/apache/arrow-rs/issues/349#issuecomment-879859004


   Hi @MichaelBitard,
   
   Unfortunately, the problem was caused when writing the parquet file. I imagine you created your sample.parquet with the unfixed version, which would mean you still hit the problem when reading.
   
   Can you confirm that sample.parquet was created with the fixed code and then verify that it will read ok?
   
   Gary
   

