Posted to jira@arrow.apache.org by "Neville Dipale (Jira)" <ji...@apache.org> on 2020/09/14 11:51:00 UTC

[jira] [Commented] (ARROW-5123) [Rust] derive RecordWriter from struct definitions

    [ https://issues.apache.org/jira/browse/ARROW-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17195410#comment-17195410 ] 

Neville Dipale commented on ARROW-5123:
---------------------------------------

I'm unable to assign to Xavier.

> [Rust] derive RecordWriter from struct definitions
> --------------------------------------------------
>
>                 Key: ARROW-5123
>                 URL: https://issues.apache.org/jira/browse/ARROW-5123
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Rust
>            Reporter: Xavier Lange
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.0.0
>
>          Time Spent: 14h 20m
>  Remaining Estimate: 0h
>
> Migrated from previous github issue (which saw a lot of comments but at a rough transition time in the project): https://github.com/sunchao/parquet-rs/pull/197
>  
> Goal
> ===
> Writing many columns to a file is a chore. If you can put your values into a struct that mirrors the schema of your file, this `derive(ParquetRecordWriter)` will write out all the fields, in the order in which they are defined, to a row group.
> How to Use
> ===
> ```
> extern crate parquet;
> #[macro_use] extern crate parquet_derive;
> #[derive(ParquetRecordWriter)]
> struct ACompleteRecord<'a> {
>   pub a_bool: bool,
>   pub a_str: &'a str,
> }
> ```
> RecordWriter trait
> ===
> This is the new trait which `parquet_derive` will implement for your structs.
> ```
> use super::RowGroupWriter;
> pub trait RecordWriter<T> {
>   fn write_to_row_group(&self, row_group_writer: &mut Box<RowGroupWriter>);
> }
> ```
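> To make the contract concrete, here is a hand-written impl of the same shape against a toy writer (`MockRowGroupWriter` is an illustrative stand-in for parquet's `RowGroupWriter`, not part of the crate). This is roughly the code the derive macro writes for you:
> ```
> // Illustrative stand-in for parquet's RowGroupWriter; it just records
> // each column's values as strings.
> #[derive(Default)]
> struct MockRowGroupWriter {
>     columns: Vec<Vec<String>>,
> }
>
> trait RecordWriter<T> {
>     fn write_to_row_group(&self, row_group_writer: &mut MockRowGroupWriter);
> }
>
> struct ACompleteRecord<'a> {
>     pub a_bool: bool,
>     pub a_str: &'a str,
> }
>
> // Hand-written version of what #[derive(ParquetRecordWriter)] emits:
> // one block per field, writing that field across all records as a column.
> impl<'a> RecordWriter<ACompleteRecord<'a>> for &[ACompleteRecord<'a>] {
>     fn write_to_row_group(&self, row_group_writer: &mut MockRowGroupWriter) {
>         {
>             let vals: Vec<String> = self.iter().map(|x| x.a_bool.to_string()).collect();
>             row_group_writer.columns.push(vals);
>         }
>         {
>             let vals: Vec<String> = self.iter().map(|x| x.a_str.to_string()).collect();
>             row_group_writer.columns.push(vals);
>         }
>     }
> }
> ```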
> How does it work?
> ===
> The `parquet_derive` crate adds code-generation functionality to the Rust compiler. The code generation takes Rust syntax and emits additional syntax. This macro expansion works on Rust 1.15+ stable. It is a dynamic plugin, loaded by the machinery in cargo: users don't need any special `build.rs` steps; it happens automatically when they include `parquet_derive` in their project. The `parquet_derive/Cargo.toml` has a section saying as much:
> ```
> [lib]
> proc-macro = true
> ```
> The Rust struct tagged with `#[derive(ParquetRecordWriter)]` is provided to the `parquet_record_writer` function in `parquet_derive/src/lib.rs`. The `syn` crate parses the struct from its string representation into an AST (a recursive enum value). The AST contains all the values I care about when generating a `RecordWriter` impl:
>  - the name of the struct
>  - the lifetime variables of the struct
>  - the fields of the struct
> The fields of the struct are translated from AST to a flat `FieldInfo` struct. It has the bits I care about for writing a column: `field_name`, `field_lifetime`, `field_type`, `is_option`, `column_writer_variant`.
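> As a sketch, such a `FieldInfo` might look like the following (the field names are taken from the description above; the actual struct in `parquet_derive` may differ):
> ```
> // Flattened per-field information pulled out of the syn AST.
> #[derive(Debug, Clone, PartialEq)]
> struct FieldInfo {
>     field_name: String,
>     field_lifetime: Option<String>, // e.g. Some("'a") for &'a str
>     field_type: String,             // e.g. "bool" or "&'a str"
>     is_option: bool,                // Option<T> fields need definition levels
>     column_writer_variant: String,  // e.g. "BoolColumnWriter"
> }
> ```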
> The code then does the equivalent of templating to build the `RecordWriter` implementation. The templating functionality is provided by the `quote` crate. At a high-level the template for `RecordWriter` looks like:
> ```
> impl RecordWriter for $struct_name {
>   fn write_to_row_group(..) {
>     $({
>       $column_writer_snippet
>     })
>   } 
> }
> ```
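> The real code builds a syntax tree with `quote!`, but the same templating idea can be sketched with plain string formatting (`render_record_writer` is a hypothetical helper for illustration, not part of the crate):
> ```
> // String-based sketch of the quote!-style templating: substitute the
> // struct name and splice in one writer snippet per column.
> fn render_record_writer(struct_name: &str, column_writer_snippets: &[&str]) -> String {
>     let body: String = column_writer_snippets
>         .iter()
>         .map(|snippet| format!("    {{\n        {}\n    }};\n", snippet))
>         .collect();
>     format!(
>         "impl RecordWriter for {} {{\n  fn write_to_row_group(..) {{\n{}  }}\n}}\n",
>         struct_name, body
>     )
> }
> ```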
> This template is then spliced in under the struct definition, ending up as something like:
> ```
> struct MyStruct {
> }
> impl RecordWriter for MyStruct {
>   fn write_to_row_group(..) {
>     {
>        write_col_1();
>     };
>    {
>        write_col_2();
>    }
>   }
> }
> ```
> and finally _THIS_ is the code passed to rustc. It's just code now, fully expanded and standalone. If a user ever changes their `struct MyValue` definition, the `ParquetRecordWriter` impl will be regenerated. There are no intermediate values to version control or worry about.
> Viewing the Derived Code
> ===
> To see the generated code before it's compiled, a very useful step is to install `cargo expand` ([more info on GitHub](https://github.com/dtolnay/cargo-expand)); then you can do:
> ```
> cd $WORK_DIR/parquet-rs/parquet_derive_test
> cargo expand --lib > ../temp.rs
> ```
> then you can view the contents of `temp.rs`:
> ```
> struct DumbRecord {
>     pub a_bool: bool,
>     pub a2_bool: bool,
> }
> impl RecordWriter<DumbRecord> for &[DumbRecord] {
>     fn write_to_row_group(
>         &self,
>         row_group_writer: &mut Box<parquet::file::writer::RowGroupWriter>,
>     ) {
>         let mut row_group_writer = row_group_writer;
>         {
>             let vals: Vec<bool> = self.iter().map(|x| x.a_bool).collect();
>             let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
>             if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
>                 column_writer
>             {
>                 typed.write_batch(&vals[..], None, None).unwrap();
>             }
>             row_group_writer.close_column(column_writer).unwrap();
>         };
>         {
>             let vals: Vec<bool> = self.iter().map(|x| x.a2_bool).collect();
>             let mut column_writer = row_group_writer.next_column().unwrap().unwrap();
>             if let parquet::column::writer::ColumnWriter::BoolColumnWriter(ref mut typed) =
>                 column_writer
>             {
>                 typed.write_batch(&vals[..], None, None).unwrap();
>             }
>             row_group_writer.close_column(column_writer).unwrap();
>         }
>     }
> }
> ```
> Now I need to cover all the combinations of types we support and make sure each one writes out data correctly.
> Procedural Macros
> ===
> The `parquet_derive` crate can ONLY export the derivation functionality: no traits, nothing else. The derive crate cannot host test cases. It's effectively a "dummy" crate that is only consumed by the compiler, never by running code.
> The parent crate cannot use the derivation functionality either, which matters because it means test code cannot live in the parent crate. This forces us to have a third crate, `parquet_derive_test`.
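> A workspace layout consistent with that description (the `parquet-rs` root and the crate names come from this write-up; the exact tree may differ):
> ```
> parquet-rs/
>   parquet/              -- the parent crate: RecordWriter trait, writer machinery
>   parquet_derive/       -- proc-macro crate: exports ONLY the derive
>     Cargo.toml          -- contains [lib] proc-macro = true
>   parquet_derive_test/  -- third crate: depends on both, hosts the tests
> ```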
> I'm open to being wrong on any one of these finer points. I had to bang on this for a while to get it to compile!
> Potentials For Better Design
> ===
>  - [x] Recursion could be limited by generating the code as "snippets" instead of one big `quote!` AST generator. Or so I think. It might be nicer to push generating each columns writing code to another loop.
>  - [X] ~~It would be nicer if I didn't have to be so picky about data going in to the `write_batch` function. Is it possible we could make a version of the function which accept `Into<DataType>` or similar? This would greatly simplify this derivation code as it would not need to enumerate all the supported types. Something like `write_generic_batch(&[impl Into<DataType>])` would be neat.~~ (not tackling in this generation of the plugin)
>  - [X] ~~Another idea to improving writing columns, could we have a write function for `Iterator`s? I already have a `Vec<DumbRecord>`, if I could just write a mapping for accessing the one value, we could skip the whole intermediate vec for `write_batch`. Should have some significant memory advantages.~~ (not tackling in this generation of the plugin, it's a bigger parquet-rs enhancement)
>  - [X] ~~It might be worthwhile to derive a parquet schema directly from a struct definition. That should stamp out opportunities for type errors.~~ (moved to #203)
> Status
> ===
> I have successfully integrated this work with my own data exporter (takes postgres/couchdb and outputs a single parquet file).
> I think this code is worth including in the project, with the caveat that it only generates simplistic `RecordWriter`s. As people start to use it, we can add code generation for more complex, nested structs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)