Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/30 09:32:08 UTC

[GitHub] [arrow-rs] Uinelj commented on issue #1745: Lack of examples on parquet file write

Uinelj commented on issue #1745:
URL: https://github.com/apache/arrow-rs/issues/1745#issuecomment-1140928286

   @tustvold : I have not tried the Arrow-facing part, as my use case is centered on writing (huge) Parquet files. Do you think it would make sense to go through the Arrow API even if I'm only looking to write Parquet files?
   The main gripe I have/had is with the whole Dremel logic (definition and repetition levels), which is hard to grasp. Even if a thorough tutorial/explanation might be out of scope, some pointers and an example could help people.
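   
   To make the simplest case concrete, here is a minimal sketch of how an optional, non-nested column could be shredded. It assumes a schema like `optional double one;`, where the maximum definition level is 1 and the maximum repetition level is 0, so no repetition levels are needed. The helper name `shred_optional` is made up for illustration and is not part of the `parquet` crate.
   
   ```rust
   /// Hypothetical helper: splits an optional column into the non-null values
   /// and one definition level per row (1 = value present, 0 = null).
   fn shred_optional<T: Copy>(column: &[Option<T>]) -> (Vec<T>, Vec<i16>) {
       let mut values = Vec::with_capacity(column.len());
       let def = column
           .iter()
           .map(|x| match x {
               Some(v) => {
                   values.push(*v);
                   1
               }
               None => 0,
           })
           .collect();
       (values, def)
   }
   
   fn main() {
       let (values, def) = shred_optional(&[Some(-1.0), None, Some(2.5)]);
       // Only the non-null values are kept; the null is recorded as a 0 level.
       println!("values = {:?}, def = {:?}", values, def);
   }
   ```
   
   The values and definition levels are what would then be handed to the column writer; the null row takes no slot in the values buffer.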
   
   I have changed my code to comply with the `15.0.0` developments, and I feel that the writing, opening and closing parts are smoother now, thanks a lot for that 👍 
   
   @alamb Well, I originally had a very nested schema involving maps, nullable lists, required lists with nullable elements, etc. I haven't settled on a format yet since I want to measure performance for a set of use cases, so I'll experiment with the format.
   
   After failing with that, I tried to create simpler layouts, namely the one from the Python tutorial (https://arrow.apache.org/docs/python/parquet.html#reading-and-writing-single-files), as well as another one using a required list with nullable items.
   
   I think that having two or three examples of increasing complexity, involving optionality and some amount of nesting, would be good (to show how to define lists/maps, compute definition/repetition levels, and handle `Option<T>`).
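   
   As a sketch of what such a nested example might compute, the code below shreds a required list with nullable elements. It assumes the standard three-level list layout (`required group ... (LIST) { repeated group list { optional ... element; } }`), which gives a maximum definition level of 2 and a maximum repetition level of 1. The `Shredded` struct and `shred_list` function are hypothetical names, not `parquet` crate APIs.
   
   ```rust
   /// Hypothetical container: non-null values plus one (def, rep) pair per slot.
   #[derive(Debug, PartialEq)]
   struct Shredded<T> {
       values: Vec<T>,
       def: Vec<i16>,
       rep: Vec<i16>,
   }
   
   /// Shreds rows of a required list whose elements are nullable.
   /// def: 0 = empty list, 1 = null element, 2 = element present.
   /// rep: 0 = first slot of a new row, 1 = continuation of the same list.
   fn shred_list<T: Clone>(rows: &[Vec<Option<T>>]) -> Shredded<T> {
       let mut out = Shredded { values: Vec::new(), def: Vec::new(), rep: Vec::new() };
       for row in rows {
           if row.is_empty() {
               // An empty list still takes one slot so the row is not lost.
               out.def.push(0);
               out.rep.push(0);
               continue;
           }
           for (i, item) in row.iter().enumerate() {
               out.rep.push(if i == 0 { 0 } else { 1 });
               match item {
                   Some(v) => {
                       out.values.push(v.clone());
                       out.def.push(2);
                   }
                   None => out.def.push(1),
               }
           }
       }
       out
   }
   
   fn main() {
       let rows = vec![vec![Some(1), None, Some(3)], vec![], vec![None]];
       let s = shred_list(&rows);
       println!("values = {:?}", s.values);
       println!("def    = {:?}", s.def);
       println!("rep    = {:?}", s.rep);
   }
   ```
   
   For these three rows the levels come out as `def = [2, 1, 2, 0, 1]` and `rep = [0, 1, 1, 0, 0]`, with only `[1, 3]` stored as values. A nullable list would add one more definition level to distinguish a null list from an empty one.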
   
   Here is the code that replicates the format from the Parquet Python tutorial. It may have too many data structures and no comments, but if you feel like it, it could become one of the examples after being fixed!
   
   ```rs
   use std::{fs::File, sync::Arc};
   
   use parquet::{
       column::writer::ColumnWriter,
       data_type::ByteArray,
       errors::ParquetError,
       file::{properties::WriterProperties, writer::SerializedFileWriter},
       schema::{parser::parse_message_type, types::Type},
   };
   
   /// Simple example struct
   struct Simple {
       one: Option<f64>,
       two: String,
       three: bool,
   }
   
   /// Column-oriented view of a set of `Simple` rows.
   struct SimpleRows {
       ones: Vec<Option<f64>>,
       twos: Vec<ByteArray>,
       threes: Vec<bool>,
   }
   
   /// Represents a struct field along with rep and def levels
   struct WriteData<T> {
       data: Vec<T>,
       def: Vec<i16>,
       rep: Vec<i16>,
   }
   
   impl SimpleRows {
       fn ones_all(&self) -> WriteData<f64> {
           let mut data = Vec::with_capacity(self.ones.len());
           // The schema is flat (no repeated fields), so the maximum
           // repetition level is 0 and every value starts a new record.
           let rep = vec![0; self.ones.len()];
   
           let def = self
               .ones
               .iter()
               .map(|x| match x {
                   Some(d) => {
                       data.push(*d);
                       1
                   }
                   None => 0,
               })
               .collect();
   
           WriteData { data, def, rep }
       }
   
       fn twos_all(&self) -> WriteData<ByteArray> {
           let def = vec![1; self.twos.len()];
           // Flat schema: the maximum repetition level is 0.
           let rep = vec![0; self.twos.len()];
           let data = self.twos.to_vec();
   
           WriteData { data, def, rep }
       }
   
       fn threes_all(&self) -> WriteData<bool> {
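           // Size the levels from `threes` itself rather than from `twos`.
           let def = vec![1; self.threes.len()];
           // Flat schema: the maximum repetition level is 0.
           let rep = vec![0; self.threes.len()];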
           let data = self.threes.to_vec();
   
           WriteData { data, def, rep }
       }
   }
   
   fn to_simplerows(s: &[Simple]) -> SimpleRows {
       let mut ones = Vec::with_capacity(s.len());
       let mut twos = Vec::with_capacity(s.len());
       let mut threes = Vec::with_capacity(s.len());
   
       for row in s {
           ones.push(row.one);
           twos.push(row.two.as_str().into());
           threes.push(row.three);
       }
   
       SimpleRows { ones, twos, threes }
   }
   
   fn write(schema: Type, rows: SimpleRows) -> Result<(), ParquetError> {
       let buf = File::create("./test.parquet").unwrap();
       let props = WriterProperties::builder().build();
       let mut w = SerializedFileWriter::new(buf, Arc::new(schema.clone()), Arc::new(props)).unwrap();
   
       let mut rg = w.next_row_group().unwrap();
       let mut nb_col = 0;
       while let Some(mut col_writer) = rg.next_column().unwrap() {
           match nb_col {
               0 => {
                   if let ColumnWriter::DoubleColumnWriter(ref mut col_writer) = col_writer.untyped() {
                       let r = rows.ones_all();
                       col_writer
                           .write_batch(&r.data, Some(&r.def[..]), Some(&r.rep[..]))
                           .unwrap();
                   } else {
                       panic!("wrong col type for nb col 0")
                   }
               }
               1 => {
                   if let ColumnWriter::ByteArrayColumnWriter(ref mut col_writer) =
                       col_writer.untyped()
                   {
                       let r = rows.twos_all();
                       col_writer
                           .write_batch(&r.data, Some(&r.def[..]), Some(&r.rep[..]))
                           .unwrap();
                   } else {
                       panic!("wrong col type for nb col 1")
                   }
               }
               2 => {
                   if let ColumnWriter::BoolColumnWriter(ref mut col_writer) = col_writer.untyped() {
                       let r = rows.threes_all();
                       col_writer
                           .write_batch(&r.data, Some(&r.def[..]), Some(&r.rep[..]))
                           .unwrap();
                   } else {
                       panic!("wrong col type for nb col 2")
                   }
               }
               _ => panic!("wrong col nb"),
           }
           nb_col += 1;
           col_writer.close()?;
       }
       rg.close()?;
       w.close()?;
       Ok(())
   }
   
   fn get_examples() -> Vec<Simple> {
       let a = Simple {
           one: Some(-1.0),
           two: "foo".to_string(),
           three: true,
       };
       let b = Simple {
           one: None,
           two: "bar".to_string(),
           three: false,
       };
       let c = Simple {
           one: Some(2.5),
           two: "baz".to_string(),
           three: true,
       };
   
       vec![a, b, c]
   }
   fn main() {
       // Flat schema with three optional top-level columns.
       let schema = r#"
           message documents {
               optional double one;
               optional binary two (string);
               optional boolean three;
           }
       "#;
   
       let schema = parse_message_type(schema).expect("invalid schema");
       let simples = to_simplerows(&get_examples());
       write(schema, simples).unwrap();
   }
   ```

