Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/30 11:10:43 UTC

[GitHub] [arrow-rs] tustvold commented on issue #1760: Best way to convert arrow to Rust native type

tustvold commented on issue #1760:
URL: https://github.com/apache/arrow-rs/issues/1760#issuecomment-1141024114

   I'll answer your question in two parts:
   
   * How to access the values in the arrays
   * How the types differ from `Vec<String>`, etc...
   
   ## Downcasting ArrayRef
   
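   Let's say you constructed your `RecordBatch` like this:
   
   ```
   use std::sync::Arc;
   use arrow::array::{Array, Int32Array, StringArray};
   use arrow::record_batch::RecordBatch;
   
   // Every column in a RecordBatch must have the same length,
   // otherwise try_from_iter returns an error and the unwrap panics.
   let strings = StringArray::from_iter_values(["foo", "bar", "baz", "qux"]);
   let integers = Int32Array::from_iter_values([1, 2, 3, 4]);
   
   let batch = RecordBatch::try_from_iter([
       ("strings", Arc::new(strings) as _),
       ("integers", Arc::new(integers) as _),
   ])
   .unwrap();
   ```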
   
   If you want to access the integers in column 1, downcast the column to its concrete array type:
   
   ```
   let integers = batch.column(1).as_any().downcast_ref::<Int32Array>().unwrap();
   ```
   
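   `downcast_ref` returns `None` if the column is not an `Int32Array`, so the `unwrap` panics on a type mismatch. Otherwise you now have a `&Int32Array` that you can iterate through; each item is an `Option<i32>`, with `None` for nulls: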
   
   ```
   for (idx, i) in integers.iter().enumerate() {
       match i {
           None => println!("{}: NULL", idx),
           Some(i) => println!("{}: {}", idx, i)
       }
   }
   ```
   
   You can also work with the underlying values buffer directly, although you then need to check validity yourself, because the slot of a null entry holds an arbitrary value:
   
   ```
   let values: &[i32] = integers.values();
   for (idx, i) in values.iter().enumerate() {
       match integers.is_valid(idx) {
           false => println!("{}: NULL", idx),
           true => println!("{}: {}", idx, i)
       }
   }
   ```
   
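   The same applies to `StringArray`, whose iterator yields `Option<&str>`:
   
   ```
   let strings = batch.column(0).as_any().downcast_ref::<StringArray>().unwrap();
   for (idx, s) in strings.iter().enumerate() {
       match s {
           None => println!("{}: NULL", idx),
           Some(s) => println!("{}: {}", idx, s)
       }
   }
   ```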
   
   You could use this to convert to a `Vec<(i32, String)>` if you wished, but depending on your workload the conversion may not be cheap: every string has to be copied into its own allocation, and you will likely lose performance compared with using the arrow compute kernels or working with the arrays as-is. If you can describe your workload, I might be able to help with this.
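   
   As a rough sketch of such a conversion, the helper below zips two nullable columns into owned rows. `to_rows` is a hypothetical name, not part of arrow. It is written against plain iterators so that it compiles without the arrow crate; as far as I know, `integers.iter()` and `strings.iter()` yield `Option<i32>` and `Option<&str>` items, so they could be passed in directly. It keeps the `Option`s rather than producing `Vec<(i32, String)>`, so nulls are not silently lost.
   
   ```rust
   /// Zip two columns of nullable values into owned rows.
   ///
   /// Hypothetical helper, not part of arrow. It accepts any iterators
   /// with these item types, which is an assumption about the caller.
   pub fn to_rows<'a>(
       integers: impl IntoIterator<Item = Option<i32>>,
       strings: impl IntoIterator<Item = Option<&'a str>>,
   ) -> Vec<(Option<i32>, Option<String>)> {
       integers
           .into_iter()
           .zip(strings)
           // Each `&str` is copied into a fresh heap allocation here,
           // which is where most of the conversion cost comes from.
           .map(|(i, s)| (i, s.map(str::to_owned)))
           .collect()
   }
   ```
   
   With the arrays from above this would be called as `to_rows(integers.iter(), strings.iter())`. Note that `zip` stops at the shorter input, which is fine for columns of one batch because they always have equal length.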
   
   ## Data Representation
   
   Under the hood, arrays are not stored as `Vec<Option<T>>`:
   
   * Primitive arrays are roughly a null mask and an `Arc<[TYPE]>`
   * String arrays are roughly a null mask, a single contiguous values buffer (conceptually one long `String`), and a list of offsets `Arc<[i32]>` marking where each value starts and ends
   * etc...
   
   The full details are described [here](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout). As well as supporting more types, this allows for more efficient kernels.
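   
   To make the string layout concrete, here is a toy model using only the standard library. `ToyStringArray` and its methods are made up for illustration and are not arrow's real types: arrow packs validity into a bitmap and reference-counts its buffers, whereas this sketch uses a `Vec<bool>` and owned `Vec`s for simplicity.
   
   ```rust
   /// Toy model of a variable-length string column (NOT arrow's real type).
   pub struct ToyStringArray {
       /// All string bytes stored back to back, e.g. "foobar".
       pub values: String,
       /// `offsets[i]..offsets[i + 1]` is the byte range of entry `i`.
       pub offsets: Vec<i32>,
       /// One flag per entry; arrow packs these into a bitmap instead.
       pub validity: Vec<bool>,
   }
   
   impl ToyStringArray {
       pub fn from_options(items: &[Option<&str>]) -> Self {
           let mut values = String::new();
           let mut offsets = vec![0i32];
           let mut validity = Vec::with_capacity(items.len());
           for item in items {
               if let Some(s) = item {
                   values.push_str(s);
               }
               // A null entry adds no bytes, so its start and end offsets coincide.
               offsets.push(values.len() as i32);
               validity.push(item.is_some());
           }
           Self { values, offsets, validity }
       }
   
       pub fn len(&self) -> usize {
           self.validity.len()
       }
   
       /// Returns entry `i`, or `None` if it is null. Panics if `i` is out of range.
       pub fn get(&self, i: usize) -> Option<&str> {
           if !self.validity[i] {
               return None;
           }
           let start = self.offsets[i] as usize;
           let end = self.offsets[i + 1] as usize;
           Some(&self.values[start..end])
       }
   }
   ```
   
   There is one allocation for all the string data rather than one per entry, and `n` entries need `n + 1` offsets.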
   
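   Arrow also supports zero-copy slicing of arrays: a slice shares the underlying buffers with the original array rather than copying them. An owned `Vec` cannot do this without copying the elements.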
   
   For example,
   
   ```
   let strings = StringArray::from_iter_values(["foo", "bar", "bax"]);
   
   let sliced = strings.slice(1, 1);
   assert_eq!(sliced.len(), 1);
   let sliced = sliced.as_any().downcast_ref::<StringArray>().unwrap();
   assert_eq!(sliced.value(0), "bar");
   ```
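   
   The idea behind zero-copy slicing can be sketched with the standard library alone. `ToyInt32Array` is a made-up type for illustration, not how arrow actually implements its arrays, and it ignores the null mask entirely. A slice clones the `Arc` (a reference-count bump) and records a new offset and length.
   
   ```rust
   use std::sync::Arc;
   
   /// Toy model of a primitive column view (NOT arrow's real type).
   #[derive(Clone)]
   pub struct ToyInt32Array {
       /// Shared, immutable backing buffer.
       buffer: Arc<[i32]>,
       /// Start of this view within `buffer`.
       offset: usize,
       /// Number of elements in this view.
       len: usize,
   }
   
   impl ToyInt32Array {
       /// Builds the shared buffer once; this initial conversion copies the data.
       pub fn new(data: Vec<i32>) -> Self {
           let len = data.len();
           Self { buffer: Arc::from(data), offset: 0, len }
       }
   
       /// Returns a view of `length` elements starting at `offset`.
       /// Only the reference count is bumped; no element is copied.
       pub fn slice(&self, offset: usize, length: usize) -> Self {
           assert!(offset + length <= self.len, "slice out of bounds");
           Self {
               buffer: Arc::clone(&self.buffer),
               offset: self.offset + offset,
               len: length,
           }
       }
   
       pub fn values(&self) -> &[i32] {
           &self.buffer[self.offset..self.offset + self.len]
       }
   
       /// True if both views point at the same allocation.
       pub fn shares_buffer_with(&self, other: &Self) -> bool {
           Arc::ptr_eq(&self.buffer, &other.buffer)
       }
   }
   ```
   
   Slicing `ToyInt32Array::new(vec![1, 2, 3, 4])` with `slice(1, 2)` gives a view over `[2, 3]` for which `shares_buffer_with` the original returns `true`.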
   
   

