You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/10 13:13:24 UTC

[GitHub] [arrow-rs] garyanaplan opened a new pull request #443: improve BOOLEAN writing logic and report error on encoding fail

garyanaplan opened a new pull request #443:
URL: https://github.com/apache/arrow-rs/pull/443


   # Which issue does this PR close?
   
   Closes #349 .
   
   # Rationale for this change
    
   When writing BOOLEAN data, writing more than 2048 rows of data will
   overflow the hard-coded 256 buffer set for the bit-writer in the
   PlainEncoder. Once this occurs, further attempts to write to the encoder
   fail, because capacity is exceeded but the errors are silently ignored.
   
   This fix improves the error detection and reporting at the point of
   encoding and modifies the logic for bit_writing (BOOLEANS). The
   bit_writer is initially allocated 256 bytes (as at present), then each
   time the capacity is exceeded the capacity is incremented by another
   256 bytes.
   
   This certainly resolves the current problem, but it's not exactly a
   great fix because the capacity of the bit_writer could now grow
   substantially.
   
   Other data types seem to have a more sophisticated mechanism for writing
   data which doesn't involve growing or having a fixed size buffer. It
   would be desirable to make the BOOLEAN type use this same mechanism if
   possible, but that level of change is more intrusive and probably
   requires greater knowledge of the implementation than I possess.
   
   # What changes are included in this PR?
   
   (see above)
   
   # Are there any user-facing changes?
   
   No, although they may encounter the encoding error now which was silently ignored previously.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on pull request #443: parquet: improve BOOLEAN writing logic and report error on encoding fail

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #443:
URL: https://github.com/apache/arrow-rs/pull/443#issuecomment-860697522


   @garyanaplan  what would you say to using the reproducer from https://github.com/apache/arrow-rs/issues/349 to test this issue? I realize like it probably seems unnecessary for such a small code change, but the amount of effort that went into the reproducer was significant and I would hate to have some future optimization reintroduce the bug again.
   
   If you don't have time, I can try to make time to create the test


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] garyanaplan commented on a change in pull request #443: parquet: improve BOOLEAN writing logic and report error on encoding fail

Posted by GitBox <gi...@apache.org>.
garyanaplan commented on a change in pull request #443:
URL: https://github.com/apache/arrow-rs/pull/443#discussion_r650746454



##########
File path: parquet/src/data_type.rs
##########
@@ -661,8 +661,15 @@ pub(crate) mod private {
             _: &mut W,
             bit_writer: &mut BitWriter,
         ) -> Result<()> {
+            if bit_writer.bytes_written() + values.len() >= bit_writer.capacity() {

Review comment:
       I think this calculation is entirely in terms of bytes, so units should all be correct as is.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] garyanaplan commented on pull request #443: parquet: improve BOOLEAN writing logic and report error on encoding fail

Posted by GitBox <gi...@apache.org>.
garyanaplan commented on pull request #443:
URL: https://github.com/apache/arrow-rs/pull/443#issuecomment-860768278


   I'd already written the test, just been in meetings. If we'd rather rely on the test framework to terminate hanging tests, just remove the thread/mpsc/channel stuff and do a straight read after verifying the write looks ok.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on a change in pull request #443: parquet: improve BOOLEAN writing logic and report error on encoding fail

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #443:
URL: https://github.com/apache/arrow-rs/pull/443#discussion_r650505824



##########
File path: parquet/src/data_type.rs
##########
@@ -661,8 +661,15 @@ pub(crate) mod private {
             _: &mut W,
             bit_writer: &mut BitWriter,
         ) -> Result<()> {
+            if bit_writer.bytes_written() + values.len() >= bit_writer.capacity() {
+                bit_writer.extend(256);
+            }
             for value in values {
-                bit_writer.put_value(*value as u64, 1);
+                if !bit_writer.put_value(*value as u64, 1) {

Review comment:
       Since `put_value` returns false if there isn't enough space, you might be able to avoid errors with something like:
   
   ```rust
   for value in values {
     if !bit_writer.put_value(*value as u64, 1) {
       bit_writer.extend(256)
       bit_writer.put_value(*value as u64, 1)
     }
   }
   ```
   
   Rather than returning an error




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] sunchao commented on a change in pull request #443: parquet: improve BOOLEAN writing logic and report error on encoding fail

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #443:
URL: https://github.com/apache/arrow-rs/pull/443#discussion_r650588350



##########
File path: parquet/src/data_type.rs
##########
@@ -661,8 +661,15 @@ pub(crate) mod private {
             _: &mut W,
             bit_writer: &mut BitWriter,
         ) -> Result<()> {
+            if bit_writer.bytes_written() + values.len() >= bit_writer.capacity() {

Review comment:
       Seems here `values.len` is the number of bits to be written? should we use `values.len() / 8`?

##########
File path: parquet/src/data_type.rs
##########
@@ -661,8 +661,15 @@ pub(crate) mod private {
             _: &mut W,
             bit_writer: &mut BitWriter,
         ) -> Result<()> {
+            if bit_writer.bytes_written() + values.len() >= bit_writer.capacity() {
+                bit_writer.extend(256);
+            }
             for value in values {
-                bit_writer.put_value(*value as u64, 1);
+                if !bit_writer.put_value(*value as u64, 1) {

Review comment:
       Yea, we can either do this or make sure up front that there's enough capacity to write. One minor concern is putting the if branch inside the for loop might hurt the performance.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] sunchao commented on a change in pull request #443: parquet: improve BOOLEAN writing logic and report error on encoding fail

Posted by GitBox <gi...@apache.org>.
sunchao commented on a change in pull request #443:
URL: https://github.com/apache/arrow-rs/pull/443#discussion_r651247944



##########
File path: parquet/src/data_type.rs
##########
@@ -661,8 +661,15 @@ pub(crate) mod private {
             _: &mut W,
             bit_writer: &mut BitWriter,
         ) -> Result<()> {
+            if bit_writer.bytes_written() + values.len() >= bit_writer.capacity() {

Review comment:
       Hmm I'm sorry but can you elaborate why the unit of `values` is also byte?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] garyanaplan commented on a change in pull request #443: improve BOOLEAN writing logic and report error on encoding fail

Posted by GitBox <gi...@apache.org>.
garyanaplan commented on a change in pull request #443:
URL: https://github.com/apache/arrow-rs/pull/443#discussion_r649180581



##########
File path: parquet/src/encodings/encoding.rs
##########
@@ -153,7 +155,11 @@ impl<T: DataType> Encoder<T> for PlainEncoder<T> {
 
     #[inline]
     fn put(&mut self, values: &[T::T]) -> Result<()> {
+        if self.bw_bytes_written + values.len() >= self.bit_writer.capacity() {
+            self.bit_writer.extend(256);
+        }
         T::T::encode(values, &mut self.buffer, &mut self.bit_writer)?;
+        self.bw_bytes_written += values.len();

Review comment:
       I'm going to add a comment myself! :-)
   
   I just realised that I only want to do this checking if the encoding is for a Boolean, otherwise it's wasted work/memory. I'll think of the best way to achieve that.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on pull request #443: parquet: improve BOOLEAN writing logic and report error on encoding fail

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #443:
URL: https://github.com/apache/arrow-rs/pull/443#issuecomment-860783118


   > I'd already written the test, just been in meetings. If we'd rather rely on the test framework to terminate hanging tests, just remove the thread/mpsc/channel stuff and do a straight read after verifying the write looks ok.
   
   Either way is fine with me. Thank you @garyanaplan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter edited a comment on pull request #443: parquet: improve BOOLEAN writing logic and report error on encoding fail

Posted by GitBox <gi...@apache.org>.
codecov-commenter edited a comment on pull request #443:
URL: https://github.com/apache/arrow-rs/pull/443#issuecomment-860050867


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/443?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#443](https://codecov.io/gh/apache/arrow-rs/pull/443?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (da8c665) into [master](https://codecov.io/gh/apache/arrow-rs/commit/0c0077697e55eb154dbfcf3127a3f39e63be2df8?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (0c00776) will **decrease** coverage by `0.05%`.
   > The diff coverage is `93.87%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/443/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/443?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master     #443      +/-   ##
   ==========================================
   - Coverage   82.71%   82.65%   -0.06%     
   ==========================================
     Files         163      165       +2     
     Lines       44795    45556     +761     
   ==========================================
   + Hits        37051    37655     +604     
   - Misses       7744     7901     +157     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/443?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [parquet/src/data\_type.rs](https://codecov.io/gh/apache/arrow-rs/pull/443/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZGF0YV90eXBlLnJz) | `77.64% <60.00%> (-0.27%)` | :arrow_down: |
   | [parquet/tests/boolean\_writer.rs](https://codecov.io/gh/apache/arrow-rs/pull/443/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC90ZXN0cy9ib29sZWFuX3dyaXRlci5ycw==) | `97.36% <97.36%> (ø)` | |
   | [parquet/src/util/bit\_util.rs](https://codecov.io/gh/apache/arrow-rs/pull/443/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvdXRpbC9iaXRfdXRpbC5ycw==) | `93.14% <100.00%> (+0.07%)` | :arrow_up: |
   | [arrow/src/compute/kernels/partition.rs](https://codecov.io/gh/apache/arrow-rs/pull/443/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9wYXJ0aXRpb24ucnM=) | `97.50% <0.00%> (-1.70%)` | :arrow_down: |
   | [arrow/src/compute/kernels/sort.rs](https://codecov.io/gh/apache/arrow-rs/pull/443/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2NvbXB1dGUva2VybmVscy9zb3J0LnJz) | `94.11% <0.00%> (-0.86%)` | :arrow_down: |
   | [parquet/src/util/test\_common/page\_util.rs](https://codecov.io/gh/apache/arrow-rs/pull/443/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvdXRpbC90ZXN0X2NvbW1vbi9wYWdlX3V0aWwucnM=) | `91.00% <0.00%> (-0.67%)` | :arrow_down: |
   | [parquet/src/arrow/record\_reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/443/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvYXJyb3cvcmVjb3JkX3JlYWRlci5ycw==) | `93.44% <0.00%> (-0.54%)` | :arrow_down: |
   | [parquet/src/column/reader.rs](https://codecov.io/gh/apache/arrow-rs/pull/443/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvY29sdW1uL3JlYWRlci5ycw==) | `74.36% <0.00%> (-0.38%)` | :arrow_down: |
   | [arrow/src/array/array\_dictionary.rs](https://codecov.io/gh/apache/arrow-rs/pull/443/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L2FycmF5X2RpY3Rpb25hcnkucnM=) | `84.56% <0.00%> (-0.32%)` | :arrow_down: |
   | [arrow/src/array/transform/mod.rs](https://codecov.io/gh/apache/arrow-rs/pull/443/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9tb2QucnM=) | `86.06% <0.00%> (-0.09%)` | :arrow_down: |
   | ... and [17 more](https://codecov.io/gh/apache/arrow-rs/pull/443/diff?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/443?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/443?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [0c00776...da8c665](https://codecov.io/gh/apache/arrow-rs/pull/443?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on a change in pull request #443: parquet: improve BOOLEAN writing logic and report error on encoding fail

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #443:
URL: https://github.com/apache/arrow-rs/pull/443#discussion_r650959936



##########
File path: parquet/src/data_type.rs
##########
@@ -661,8 +661,15 @@ pub(crate) mod private {
             _: &mut W,
             bit_writer: &mut BitWriter,
         ) -> Result<()> {
+            if bit_writer.bytes_written() + values.len() >= bit_writer.capacity() {
+                bit_writer.extend(256);
+            }
             for value in values {
-                bit_writer.put_value(*value as u64, 1);
+                if !bit_writer.put_value(*value as u64, 1) {

Review comment:
       leaving the code as is seems fine to me




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] garyanaplan commented on pull request #443: parquet: improve BOOLEAN writing logic and report error on encoding fail

Posted by GitBox <gi...@apache.org>.
garyanaplan commented on pull request #443:
URL: https://github.com/apache/arrow-rs/pull/443#issuecomment-860712452


   The problem with writing an effective test is that the error was only detected on file read and the read behaviour was to hang indefinitely. Taken together, those characteristics of the problem make crafting an effective test difficult.
   
   To be effective a test would need to write > 2048 boolean values to a test file, then read that file and not hang. I can think of ways to do that with a timeout and assume that if the read doesn't finish within timeout, then it must have failed. Such a test would rely on multi-threaded or async testing for co-ordination. I don't think there's any async stuff in parquet yet, so multi-threaded test would be required.
   
   I'll knock something up and push it to this branch.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on pull request #443: parquet: improve BOOLEAN writing logic and report error on encoding fail

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #443:
URL: https://github.com/apache/arrow-rs/pull/443#issuecomment-860048965


   Thanks for the contribution @garyanaplan ! I will try and review this carefully tomorrow. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] garyanaplan commented on a change in pull request #443: parquet: improve BOOLEAN writing logic and report error on encoding fail

Posted by GitBox <gi...@apache.org>.
garyanaplan commented on a change in pull request #443:
URL: https://github.com/apache/arrow-rs/pull/443#discussion_r650750468



##########
File path: parquet/src/data_type.rs
##########
@@ -661,8 +661,15 @@ pub(crate) mod private {
             _: &mut W,
             bit_writer: &mut BitWriter,
         ) -> Result<()> {
+            if bit_writer.bytes_written() + values.len() >= bit_writer.capacity() {
+                bit_writer.extend(256);
+            }
             for value in values {
-                bit_writer.put_value(*value as u64, 1);
+                if !bit_writer.put_value(*value as u64, 1) {

Review comment:
       I found it hard to think of a good way to test this with the fix in place.
   
   I preferred the "don't auto expand memory at the point of failure" approach because I'm fairly conservative and didn't want to make a change that was too wide in impact without a better understanding of the code. i.e.: my fix specifically targeted the error I reported and made it possible to detect in other locations.
   
   I think a better fix would be to (somehow) pre-size the vector or avoid having to size a vector for all the bytes that could be written, but that would be a much bigger scope to the fix.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] garyanaplan commented on pull request #443: improve BOOLEAN writing logic and report error on encoding fail

Posted by GitBox <gi...@apache.org>.
garyanaplan commented on pull request #443:
URL: https://github.com/apache/arrow-rs/pull/443#issuecomment-858802185


   Ok. I'm finished poking this now. I've isolated the changes required to 2 files and eliminated the original runtime impact from the PlainEncoder.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on pull request #443: parquet: improve BOOLEAN writing logic and report error on encoding fail

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #443:
URL: https://github.com/apache/arrow-rs/pull/443#issuecomment-860740823


   >  I can think of ways to do that with a timeout and assume that if the read doesn't finish within timeout, then it must have failed. 
   
   @garyanaplan  I don't think we need to do anything special for timeouts -- between the default rust test runner and github ci action timeouts any test that hangs indefinitely will cause a failure (not run successfully)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] alamb commented on a change in pull request #443: parquet: improve BOOLEAN writing logic and report error on encoding fail

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #443:
URL: https://github.com/apache/arrow-rs/pull/443#discussion_r651058305



##########
File path: parquet/tests/boolean_writer.rs
##########
@@ -0,0 +1,100 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use parquet::column::writer::ColumnWriter;
+use parquet::file::properties::WriterProperties;
+use parquet::file::reader::FileReader;
+use parquet::file::serialized_reader::SerializedFileReader;
+use parquet::file::writer::FileWriter;
+use parquet::file::writer::SerializedFileWriter;
+use parquet::schema::parser::parse_message_type;
+use std::fs;
+use std::path::Path;
+use std::sync::{mpsc, Arc};
+use std::thread;
+use std::time::Duration;
+
+#[test]
+fn it_writes_data_without_hanging() {
+    let path = Path::new("it_writes_data_without_hanging.parquet");
+
+    let message_type = "
+  message BooleanType {
+    REQUIRED BOOLEAN DIM0;
+  }
+";
+    let schema = Arc::new(parse_message_type(message_type).expect("parse schema"));
+    let props = Arc::new(WriterProperties::builder().build());
+    let file = fs::File::create(&path).expect("create file");
+    let mut writer =
+        SerializedFileWriter::new(file, schema, props).expect("create parquet writer");
+    for _group in 0..1 {
+        let mut row_group_writer = writer.next_row_group().expect("get row group writer");
+        let values: Vec<i64> = vec![0; 2049];
+        let my_bool_values: Vec<bool> = values
+            .iter()
+            .enumerate()
+            .map(|(count, _x)| count % 2 == 0)
+            .collect();
+        while let Some(mut col_writer) =
+            row_group_writer.next_column().expect("next column")
+        {
+            match col_writer {
+                ColumnWriter::BoolColumnWriter(ref mut typed_writer) => {
+                    typed_writer
+                        .write_batch(&my_bool_values, None, None)
+                        .expect("writing bool column");
+                }
+                _ => {
+                    panic!("only test boolean values");
+                }
+            }
+            row_group_writer
+                .close_column(col_writer)
+                .expect("close column");
+        }
+        let rg_md = row_group_writer.close().expect("close row group");
+        println!("total rows written: {}", rg_md.num_rows());
+        writer
+            .close_row_group(row_group_writer)
+            .expect("close row groups");
+    }
+    writer.close().expect("close writer");
+
+    let bytes = fs::read(&path).expect("read file");
+    assert_eq!(&bytes[0..4], &[b'P', b'A', b'R', b'1']);
+
+    // Now that we have written our data and are happy with it, make
+    // sure we can read it back in < 5 seconds...
+    let (sender, receiver) = mpsc::channel();
+    let _t = thread::spawn(move || {
+        let file = fs::File::open(&Path::new("it_writes_data_without_hanging.parquet"))
+            .expect("open file");
+        let reader = SerializedFileReader::new(file).expect("get serialized reader");
+        let iter = reader.get_row_iter(None).expect("get iterator");
+        for record in iter {
+            println!("reading: {}", record);
+        }
+        println!("finished reading");
+        if let Ok(()) = sender.send(true) {}
+    });
+    assert_ne!(

Review comment:
       You could also check `assert_eq!(Ok(true), receiver.recv_timeout(Duration::from_millis(5000))` as well
   
   However, I think that is equivalent to what you have here. 👍 thank you




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] codecov-commenter commented on pull request #443: parquet: improve BOOLEAN writing logic and report error on encoding fail

Posted by GitBox <gi...@apache.org>.
codecov-commenter commented on pull request #443:
URL: https://github.com/apache/arrow-rs/pull/443#issuecomment-860050867


   # [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/443?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#443](https://codecov.io/gh/apache/arrow-rs/pull/443?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (eb11b2b) into [master](https://codecov.io/gh/apache/arrow-rs/commit/0c0077697e55eb154dbfcf3127a3f39e63be2df8?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (0c00776) will **increase** coverage by `0.00%`.
   > The diff coverage is `81.81%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow-rs/pull/443/graphs/tree.svg?width=650&height=150&src=pr&token=pq9V9qWZ1N&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/arrow-rs/pull/443?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@           Coverage Diff           @@
   ##           master     #443   +/-   ##
   =======================================
     Coverage   82.71%   82.71%           
   =======================================
     Files         163      163           
     Lines       44795    44805   +10     
   =======================================
   + Hits        37051    37061   +10     
     Misses       7744     7744           
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-rs/pull/443?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [parquet/src/data\_type.rs](https://codecov.io/gh/apache/arrow-rs/pull/443/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZGF0YV90eXBlLnJz) | `77.64% <60.00%> (-0.27%)` | :arrow_down: |
   | [parquet/src/util/bit\_util.rs](https://codecov.io/gh/apache/arrow-rs/pull/443/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvdXRpbC9iaXRfdXRpbC5ycw==) | `93.14% <100.00%> (+0.07%)` | :arrow_up: |
   | [parquet/src/encodings/encoding.rs](https://codecov.io/gh/apache/arrow-rs/pull/443/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-cGFycXVldC9zcmMvZW5jb2RpbmdzL2VuY29kaW5nLnJz) | `95.04% <0.00%> (+0.19%)` | :arrow_up: |
   | [arrow/src/array/transform/boolean.rs](https://codecov.io/gh/apache/arrow-rs/pull/443/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-YXJyb3cvc3JjL2FycmF5L3RyYW5zZm9ybS9ib29sZWFuLnJz) | `84.61% <0.00%> (+7.69%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-rs/pull/443?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-rs/pull/443?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [0c00776...eb11b2b](https://codecov.io/gh/apache/arrow-rs/pull/443?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org