Posted to github@arrow.apache.org by "Jefffrey (via GitHub)" <gi...@apache.org> on 2023/11/29 12:28:16 UTC

[PR] Parquet: omit min/max for interval columns when writing stats [arrow-rs]

Jefffrey opened a new pull request, #5147:
URL: https://github.com/apache/arrow-rs/pull/5147

   # Which issue does this PR close?
   
   
   Closes #5145
   
   # Rationale for this change
    
   
   # What changes are included in this PR?
   
   
   Add extra checks before calculating min/max statistics for column chunks and pages, so that Interval columns are skipped.
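   
   As a rough illustration of the kind of check described above, here is a minimal, std-only sketch. It is not the code from this PR: `ConvertedType`, `MinMax` and `compute_min_max` are hypothetical stand-ins invented for this example, and the sketch assumes that min/max is simply omitted (returned as `None`) for Interval columns because their sort order is undefined.
   
   ```rust
   /// Hypothetical stand-in for a column's Parquet converted type.
   #[derive(Debug, Clone, Copy, PartialEq, Eq)]
   enum ConvertedType {
       None,
       Interval,
   }
   
   /// Hypothetical min/max pair for a page or column chunk.
   #[derive(Debug, PartialEq, Eq)]
   struct MinMax {
       min: Vec<u8>,
       max: Vec<u8>,
   }
   
   /// Returns `None` for Interval columns (assumption: no meaningful ordering),
   /// and for empty input. Otherwise returns the byte-wise lexicographic min/max.
   fn compute_min_max(converted_type: ConvertedType, values: &[Vec<u8>]) -> Option<MinMax> {
       // Extra check before computing statistics: skip Interval columns.
       if converted_type == ConvertedType::Interval {
           return None;
       }
       let min = values.iter().min()?.clone();
       let max = values.iter().max()?.clone();
       Some(MinMax { min, max })
   }
   ```
   
   With this shape, a caller would write min/max into the statistics only when `compute_min_max` returns `Some`, so Interval columns end up with no min/max at all rather than misleading values.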
   
   # Are there any user-facing changes?
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] Parquet: omit min/max for interval columns when writing stats [arrow-rs]

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on PR #5147:
URL: https://github.com/apache/arrow-rs/pull/5147#issuecomment-1833545825

   I noticed this:
   
   https://github.com/apache/arrow-rs/blob/6d4b8bbad95c7e4fec0c4f1fb755ad7a1c542983/parquet/src/file/writer.rs#L333
   
   - I'm unsure whether there are other places to consider.
   
   It looks like implementing the writing of ColumnOrder might be a separate issue.




Re: [PR] Parquet: omit min/max for interval columns when writing stats [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold merged PR #5147:
URL: https://github.com/apache/arrow-rs/pull/5147




Re: [PR] Parquet: omit min/max for interval columns when writing stats [arrow-rs]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on PR #5147:
URL: https://github.com/apache/arrow-rs/pull/5147#issuecomment-1831956539

   What ColumnOrder are we currently writing for these columns?




Re: [PR] Parquet: omit min/max for interval columns when writing stats [arrow-rs]

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on PR #5147:
URL: https://github.com/apache/arrow-rs/pull/5147#issuecomment-1832608022

   > What ColumnOrder are we currently writing for these columns?
   
   I'm not sure, actually. I tried running this test in arrow_writer/mod.rs on the master branch:
   
   ```rust
       #[test]
       fn test_123() {
           let a = Int32Array::from(vec![1, 2, 3, 4, 5]);
           let b = IntervalDayTimeArray::from(vec![0; 5]);
           let batch = RecordBatch::try_from_iter(vec![
               ("a", Arc::new(a) as ArrayRef),
               ("b", Arc::new(b) as ArrayRef),
           ])
           .unwrap();
   
           let mut buf = Vec::with_capacity(1024);
           let mut writer = ArrowWriter::try_new(&mut buf, batch.schema(), None).unwrap();
           writer.write(&batch).unwrap();
           writer.close().unwrap();
   
           let bytes = Bytes::from(buf);
           let options = ReadOptionsBuilder::new().with_page_index().build();
           let reader = SerializedFileReader::new_with_options(bytes, options).unwrap();
           dbg!(reader.metadata().file_metadata().column_orders());
       }
   ```
   
   Running:
   
   ```shell
   arrow-rs$ cargo test -p parquet --lib arrow::arrow_writer::tests::test_123 -- --nocapture --exact
       Blocking waiting for file lock on build directory
      Compiling parquet v49.0.0 (/home/jeffrey/Code/arrow-rs/parquet)
       Finished test [unoptimized + debuginfo] target(s) in 11.49s
        Running unittests src/lib.rs (/media/jeffrey/1tb_860evo_ssd/.cargo_target_cache/debug/deps/parquet-a4f7a499e85a325c)
   
   running 1 test
   [parquet/src/arrow/arrow_writer/mod.rs:2760] reader.metadata().file_metadata().column_orders() = None
   test arrow::arrow_writer::tests::test_123 ... ok
   
   test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 667 filtered out; finished in 0.00s
   ```
   
   Even when I change it to write only the Int32Array, the result is still `None`.
   
   I'm not sure whether I'm doing something wrong here.




Re: [PR] Parquet: omit min/max for interval columns when writing stats [arrow-rs]

Posted by "Jefffrey (via GitHub)" <gi...@apache.org>.
Jefffrey commented on PR #5147:
URL: https://github.com/apache/arrow-rs/pull/5147#issuecomment-1834522922

   Raised https://github.com/apache/arrow-rs/issues/5152 for the column order issue

