Posted to github@arrow.apache.org by "tmcw (via GitHub)" <gi...@apache.org> on 2023/09/09 18:01:23 UTC

[GitHub] [arrow-rs] tmcw opened a new issue, #4804: Different encoding options appear to have no effect on output (in Rust port)

tmcw opened a new issue, #4804:
URL: https://github.com/apache/arrow-rs/issues/4804

   **Which part is this question about**
   
   The arrow-rs implementation
   
   **Describe your question**
   
   I've been trying to use arrow-rs to encode a large-ish dataset - about 2GB of gzipped JSON that becomes a roughly 200MB Parquet file. The data is very amenable to delta encoding - it's a time-series capacity dataset. But the encoding options provided don't seem to make any difference to the output size. Maybe I'm connecting the pieces incorrectly?
   
   Here's the most minimal example I've been able to cook up:
   
   Cargo.toml:
   
   ```toml
   [package]
   name = "parquet-demo"
   version = "0.1.0"
   edition = "2021"
   
   # See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
   
   [dependencies]
   arrow = "46.0.0"
   arrow-array = "46.0.0"
   parquet = "46.0.0"
   ```
   
   src/main.rs:
   
   ```rs
   use arrow::datatypes::{DataType, Field, Schema};
   use arrow_array::{builder::PrimitiveBuilder, types::Int32Type, ArrayRef, RecordBatch};
   use parquet::{arrow::ArrowWriter, file::properties::WriterProperties};
   use std::{fs, path::Path, sync::Arc};
   
   fn main() {
       let path = Path::new("sample.parquet");
   
       let numbers = Field::new("numbers", DataType::Int32, false);
       let schema = Schema::new(vec![numbers]);
       let file = fs::File::create(&path).unwrap();
   
       let props = WriterProperties::builder()
           .set_encoding(parquet::basic::Encoding::DELTA_BINARY_PACKED)
           .set_compression(parquet::basic::Compression::UNCOMPRESSED)
           .set_writer_version(parquet::file::properties::WriterVersion::PARQUET_2_0);
   
       let mut writer = ArrowWriter::try_new(file, schema.into(), Some(props.build())).unwrap();
   
       let mut numbers = PrimitiveBuilder::<Int32Type>::new();
   
       for j in 0..10000 {
           for _i in 0..10000 {
               numbers.append_value(j);
           }
       }
   
       let batch =
           RecordBatch::try_from_iter(vec![("numbers", Arc::new(numbers.finish()) as ArrayRef)])
               .unwrap();
   
       writer.write(&batch).expect("Writing batch");
       writer.close().unwrap();
   }
   ```
   
   Running this produces:
   
   ```
   ➜  parquet-demo git:(main) ✗ cargo run && du -sh sample.parquet
      Compiling parquet-demo v0.1.0 (/Users/tmcw/s/parquet-demo)
       Finished dev [unoptimized + debuginfo] target(s) in 0.41s
        Running `target/debug/parquet-demo`
   136K    sample.parquet
   ```
   
   So, that's with DELTA_BINARY_PACKED encoding, and the dataset is lots of consecutive identical values - 0000000111111122222, that kind of thing - which should be amenable to delta encoding or RLE. Trying PLAIN encoding instead:
   
   The program is identical apart from the encoding line:
   
   ```rs
       let props = WriterProperties::builder()
           .set_encoding(parquet::basic::Encoding::PLAIN)
           .set_compression(parquet::basic::Compression::UNCOMPRESSED)
           .set_writer_version(parquet::file::properties::WriterVersion::PARQUET_2_0);
   ```
   
   ```
   ➜  parquet-demo git:(main) ✗ cargo run && du -sh sample.parquet
      Compiling parquet-demo v0.1.0 (/Users/tmcw/s/parquet-demo)
       Finished dev [unoptimized + debuginfo] target(s) in 0.39s
        Running `target/debug/parquet-demo`
   136K    sample.parquet
   ```
   
   Exactly the same size. The same happens if I swap PLAIN out for RLE or any other encoding value: the output is always the same size. My much larger real dataset behaves the same way.
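   
   Here's a quick sketch of how one can inspect which encodings actually ended up in the file, using the parquet crate's metadata API (assuming I'm reading the API right - this isn't output from the steps above):
   
   ```rs
   use parquet::file::reader::{FileReader, SerializedFileReader};
   use std::fs::File;
   
   fn main() {
       // Open the file we just wrote and print the encodings recorded in the
       // column chunk metadata for every row group.
       let file = File::open("sample.parquet").unwrap();
       let reader = SerializedFileReader::new(file).unwrap();
       for rg in reader.metadata().row_groups() {
           for col in rg.columns() {
               println!("{}: {:?}", col.column_path(), col.encodings());
           }
       }
   }
   ```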
   
   I'm totally new to this domain, so this could easily be me using it wrong! I've also tried `.set_column_encoding` with the same result (roughly as in the sketch below). I don't know what's going wrong - any ideas? Thanks!
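   
   For reference, the `.set_column_encoding` attempt looked roughly like this (a sketch; the "numbers" column path matches the schema above):
   
   ```rs
   use parquet::{basic::Encoding, file::properties::WriterProperties, schema::types::ColumnPath};
   
   fn main() {
       // Per-column variant: target just the "numbers" column instead of
       // setting a writer-wide default encoding.
       let _props = WriterProperties::builder()
           .set_column_encoding(ColumnPath::from("numbers"), Encoding::DELTA_BINARY_PACKED)
           .build();
   }
   ```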




[GitHub] [arrow-rs] tmcw commented on issue #4804: Different encoding options appear to have no effect on output size

Posted by "tmcw (via GitHub)" <gi...@apache.org>.
tmcw commented on issue #4804:
URL: https://github.com/apache/arrow-rs/issues/4804#issuecomment-1712588472

   Got it, thanks! Disabling dictionary encoding with
   
   ```rs
       let props = WriterProperties::builder()
           .set_dictionary_enabled(false)
           // ...rest of the builder chain unchanged...
   ```
   
   made changing the encoding with `set_encoding` actually have an effect. It turns out that dictionary encoding seems to be the only beneficial setting for this data: plain is larger, delta binary packed is a little larger too, and the rest either fail because they're incompatible with the data type or they're the dictionary encoding.
   
   The relationship between `set_encoding` and `set_dictionary_enabled` is a little confusing for newcomers, I think - I guess `set_encoding` is a no-op if dictionary encoding is enabled? Happy to contribute a docs note if that's the case.




[GitHub] [arrow-rs] tustvold closed issue #4804: Different encoding options appear to have no effect on output size

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold closed issue #4804: Different encoding options appear to have no effect on output size
URL: https://github.com/apache/arrow-rs/issues/4804




[GitHub] [arrow-rs] tustvold commented on issue #4804: Different encoding options appear to have no effect on output size

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #4804:
URL: https://github.com/apache/arrow-rs/issues/4804#issuecomment-1712584063

   The encoder will always use dictionary encoding if it is enabled (and it is by default), falling back to the specified encoding only when it can't. You likely want to disable dictionary encoding for these tests.
   
   As an aside, I would probably discourage the delta encoding - the ecosystem support isn't great, and the specification literally links to a paper that says why the particular encoding they chose is a bad idea 😅
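   
   Something along these lines (a sketch using the same builder API as your example, not a drop-in recommendation):
   
   ```rs
   use parquet::{basic::Encoding, file::properties::WriterProperties};
   
   fn main() {
       // With dictionary encoding disabled, the encoding chosen via set_encoding
       // is what actually gets written for the column data.
       // (set_column_dictionary_enabled does the same for a single column.)
       let _props = WriterProperties::builder()
           .set_dictionary_enabled(false)
           .set_encoding(Encoding::DELTA_BINARY_PACKED)
           .build();
   }
   ```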




[GitHub] [arrow-rs] tustvold commented on issue #4804: Different encoding options appear to have no effect on output size

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #4804:
URL: https://github.com/apache/arrow-rs/issues/4804#issuecomment-1733449981

   Closing as this question appears to have been answered; feel free to reopen if I am mistaken.




[GitHub] [arrow-rs] tustvold commented on issue #4804: Different encoding options appear to have no effect on output size

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #4804:
URL: https://github.com/apache/arrow-rs/issues/4804#issuecomment-1712590204

   >  I guess set_encoding is a no-op if dictionary is enabled
   
   There is a maximum dictionary size; if this is exceeded, the writer falls back to whatever encoding is specified. Would be happy to review a docs PR to clarify this.
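   
   In other words, something like this (a sketch; the size limit value is purely illustrative, and I'm taking `set_dictionary_page_size_limit` to be the relevant knob):
   
   ```rs
   use parquet::{basic::Encoding, file::properties::WriterProperties};
   
   fn main() {
       // Dictionary encoding is attempted first; once the dictionary grows past
       // this page size limit (in bytes), the writer falls back to the encoding
       // from set_encoding (here DELTA_BINARY_PACKED).
       let _props = WriterProperties::builder()
           .set_dictionary_page_size_limit(64 * 1024) // illustrative value
           .set_encoding(Encoding::DELTA_BINARY_PACKED)
           .build();
   }
   ```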

