Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/28 10:44:51 UTC

[GitHub] [arrow-rs] Tudyx opened a new issue, #1760: Best way to convert arrow to Rust native type

Tudyx opened a new issue, #1760:
URL: https://github.com/apache/arrow-rs/issues/1760

   I have a machine learning dataset for text classification in `arrow` format. A record contains 2 elements: a label (integer) and a text (string).
   I want to convert it into a Rust native type so I can use it in my code. A `Vec<(i32, String)>` would be ideal for fast indexing.
   I've read the source code to find a way to do that, and I have seen that a `DataType` can be created from a Rust native type, but not the other way around.
   
   I found a way to accomplish what I want, but it feels a little bit hacky to me. I convert each value of an `arrow` record into a string and then I cast it into a Rust native type. Here is my code for doing that:
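   ```rust
   use std::fs::File;
   
   use arrow::record_batch::RecordBatch;
   use arrow::util::display::array_value_to_string;
   
   pub fn read_arrow_file_into_vec(arrow_file: &str) -> Vec<(String, String)> {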
       let dataset = File::open(arrow_file).unwrap();
       let stream_reader = arrow::ipc::reader::StreamReader::try_new(dataset, None).unwrap();
       let batches: Result<Vec<RecordBatch>, arrow::error::ArrowError> = stream_reader.collect();
       let batches = batches.unwrap();
       let mut res: Vec<(String, String)> = Vec::new();
   
       for batch in &batches {
           for row in 0..batch.num_rows() {
               let mut sample = Vec::new();
               for col in 0..batch.num_columns() {
                   let column = batch.column(col);
                   sample.push(array_value_to_string(column, row).unwrap());
               }
               res.push((sample[0].clone(), sample[1].clone()));
           }
       }
       res
   }
   ```
   Then I cast the `Vec<(String, String)>` into a `Vec<(i32, String)>` or other native types depending on the dataset schema.
   If I want to generalize this, maybe I could write a pattern match on the `DataType`, convert each value to a `String` (like in `arrow::csv::Writer::convert`) and then try to parse it into a Rust native type.
   
   Could you please tell me if there is a better way of doing this?
   
   Maybe the idea of working on the arrow data directly is the bad part: I could use `arrow::csv::Writer` to convert it to `csv`, and then deserializing into a `Vec<(i32, String)>` would be trivial.
   I want to keep the `arrow` format because when I'm working with huge datasets (several gigabytes) that don't fit in my RAM, I want to use the memory-mapping capabilities of the `arrow` format and read only a small chunk at a time.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] tustvold commented on issue #1760: Best way to convert arrow to Rust native type

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1760:
URL: https://github.com/apache/arrow-rs/issues/1760#issuecomment-1141932010

   > isn't taking a slice of a vector also zero-copy slicing?
   
   Yes, sorry I wasn't clear. The distinction is that arrow slices are owning, i.e. they don't borrow from the parent, which can make them significantly easier to work with.
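   
   As a rough illustration of what "owning" means here, the sketch below models a slice as a reference-counted buffer plus an offset and a length. It uses only the standard library; `OwnedSlice` and `tail_of_temporary` are made-up names for this example and are not arrow-rs types, and the real arrow buffers are more involved.
   
   ```rust
   use std::sync::Arc;
   
   /// Hypothetical model of an owning, zero-copy slice (not an arrow-rs type).
   struct OwnedSlice {
       data: Arc<[i32]>, // shared buffer, kept alive by every slice
       offset: usize,    // start of this view inside `data`
       len: usize,       // number of elements in this view
   }
   
   impl OwnedSlice {
       fn new(values: Vec<i32>) -> Self {
           let len = values.len();
           OwnedSlice { data: values.into(), offset: 0, len }
       }
   
       /// Returns a new view; only the reference count changes, no data is copied.
       fn slice(&self, offset: usize, len: usize) -> Self {
           assert!(offset + len <= self.len, "slice out of bounds");
           OwnedSlice {
               data: Arc::clone(&self.data),
               offset: self.offset + offset,
               len,
           }
       }
   
       fn as_slice(&self) -> &[i32] {
           &self.data[self.offset..self.offset + self.len]
       }
   }
   
   /// The parent is dropped at the end of this function,
   /// but the returned slice keeps the buffer alive.
   fn tail_of_temporary() -> OwnedSlice {
       let parent = OwnedSlice::new(vec![10, 20, 30, 40]);
       parent.slice(1, 2)
   }
   ```
   
   A borrowed `&vector[1..3]` could not be returned from a function like `tail_of_temporary`, because it would outlive the `Vec` it borrows from. The owning version has no lifetime tied to the parent, which is the practical difference described above.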




[GitHub] [arrow-rs] tustvold commented on issue #1760: Best way to convert arrow to Rust native type

Posted by GitBox <gi...@apache.org>.
tustvold commented on issue #1760:
URL: https://github.com/apache/arrow-rs/issues/1760#issuecomment-1141024114

   I'll answer your question in two parts:
   
   * How to access the values in the arrays
   * How the types differ from `Vec<String>`, etc...
   
   ## Downcasting ArrayRef
   
   Let's say you constructed your RecordBatch like this
   
   ```
   // All columns of a RecordBatch must have the same length
   let strings = StringArray::from_iter_values(["foo", "bar"]);
   let integers = Int32Array::from_iter_values([1, 2]);
   
   let batch = RecordBatch::try_from_iter([
       ("strings", Arc::new(strings) as _),
       ("integers", Arc::new(integers) as _),
   ])
   .unwrap();
   ```
   
   If you want to access the integers in column 1 you might do
   
   ```
   let integers = batch.column(1).as_any().downcast_ref::<Int32Array>().unwrap();
   ```
   
   You now have an `&Int32Array` you can iterate through
   
   ```
   for (idx, i) in integers.iter().enumerate() {
       match i {
           None => println!("{}: NULL", idx),
           Some(i) => println!("{}: {}", idx, i)
       }
   }
   ```
   
   You can also interact with the values data directly, although you will need to handle nulls yourself
   
   ```
   let values: &[i32] = integers.values();
   for (idx, i) in values.iter().enumerate() {
       match integers.is_valid(idx) {
           false => println!("{}: NULL", idx),
           true => println!("{}: {}", idx, i)
       }
   }
   ```
   
   The same is true of `StringArray`
   
   ```
   let strings = batch.column(0).as_any().downcast_ref::<StringArray>().unwrap();
   for (idx, s) in strings.iter().enumerate() {
       match s {
           None => println!("{}: NULL", idx),
           Some(s) => println!("{}: {}", idx, s)
       }
   }
   ```
   
   You could use this to convert to `Vec<(i32, String)>` if you wished to, but depending on your workload this conversion may not be cheap, and will likely sacrifice performance over using the arrow compute kernels, or interacting with the arrays as-is. If you are able to describe your workload I might be able to help with this.
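   
   A minimal sketch of that conversion, written against plain iterators so it only needs the standard library. The function name `rows_from_columns` and the choice to fail on the first null row are assumptions made for this example, not part of arrow-rs.
   
   ```rust
   /// Zips a column of optional labels with a column of optional texts
   /// into owned rows. Returns `Err(row_index)` for the first row that
   /// contains a null in either column.
   fn rows_from_columns<'a>(
       labels: impl Iterator<Item = Option<i32>>,
       texts: impl Iterator<Item = Option<&'a str>>,
   ) -> Result<Vec<(i32, String)>, usize> {
       labels
           .zip(texts)
           .enumerate()
           .map(|(row, (label, text))| match (label, text) {
               // Each string is copied here, which is the cost mentioned above
               (Some(label), Some(text)) => Ok((label, text.to_string())),
               _ => Err(row),
           })
           .collect()
   }
   ```
   
   The iterators returned by `integers.iter()` and `strings.iter()` on the downcast arrays yield `Option<i32>` and `Option<&str>` respectively, so they should be usable as the two arguments, e.g. `rows_from_columns(integers.iter(), strings.iter())`.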
   
   ## Data Representation
   
   Under the hood, types are not stored as `Vec<Option<T>>`:
   
   * Primitives are roughly represented as a null mask and `Arc<[TYPE]>`
   * String arrays are a null mask, a values data `String`, and a list of offsets `Arc<[i32]>`
   * etc...
   
   The full details are described [here](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-layout). As well as supporting more types, this allows for more efficient kernels.
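   
   To make the string layout more concrete, here is a simplified standard-library model of it. `StringColumn` is a hypothetical type for illustration only; the real format packs the null mask into bits and stores the buffers differently.
   
   ```rust
   /// Simplified model of a string column (not an arrow-rs type).
   struct StringColumn {
       validity: Vec<bool>, // one flag per row; Arrow packs these as bits
       offsets: Vec<i32>,   // rows + 1 entries; row i spans offsets[i]..offsets[i + 1]
       values: String,      // all non-null strings concatenated
   }
   
   impl StringColumn {
       fn from_options(items: &[Option<&str>]) -> Self {
           let mut validity = Vec::with_capacity(items.len());
           let mut offsets = vec![0i32];
           let mut values = String::new();
           for item in items {
               if let Some(s) = item {
                   values.push_str(s);
               }
               validity.push(item.is_some());
               // A null row adds no bytes, so its start and end offsets are equal
               offsets.push(values.len() as i32);
           }
           StringColumn { validity, offsets, values }
       }
   
       /// Looks up one row without allocating: the result borrows from `values`.
       fn get(&self, row: usize) -> Option<&str> {
           if !self.validity[row] {
               return None;
           }
           let start = self.offsets[row] as usize;
           let end = self.offsets[row + 1] as usize;
           Some(&self.values[start..end])
       }
   }
   ```
   
   For `[Some("foo"), None, Some("bar")]` this stores the values `"foobar"` and the offsets `[0, 3, 3, 6]`, so every string lives in one contiguous buffer instead of one heap allocation per row.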
   
   Arrow also supports zero-copy slicing of arrays, something which cannot be performed with `Vec`.
   
   For example,
   
   ```
   let strings = StringArray::from_iter_values(["foo", "bar", "bax"]);
   
   let sliced = strings.slice(1, 1);
   assert_eq!(sliced.len(), 1);
   let sliced = sliced.as_any().downcast_ref::<StringArray>().unwrap();
   assert_eq!(sliced.value(0), "bar");
   ```
   
   




[GitHub] [arrow-rs] Tudyx commented on issue #1760: Best way to convert arrow to Rust native type

Posted by GitBox <gi...@apache.org>.
Tudyx commented on issue #1760:
URL: https://github.com/apache/arrow-rs/issues/1760#issuecomment-1141513104

   Thanks a lot for your response, it helped me a lot to better understand how to deal with the `arrow` format.
   
   - Concerning the workload, it can be quite big. I'm currently working in my spare time on a Rust port of the `PyTorch` `dataloader`. I've implemented all the base functionality. I want to play with datasets from [huggingFace](https://huggingface.co/datasets), which hosts a ton of `arrow` datasets, to do more advanced tests with my library. The typical workflow is to process a few contiguous rows at a time, so I think slicing is an important operation.
   The idea is to offer an option to either load the dataset into RAM or use `arrow` memory mapping, depending on the size of the dataset.
   
   - About the data representation, I have a little question that may sound stupid. When you say that `Arrow` also supports zero-copy slicing of arrays, something which cannot be performed with `Vec`, isn't taking a slice of a vector also zero-copy slicing? Like in this example:
   ```rust
   let vector = vec!["foo", "bar", "bax"];
   let slice = &vector[1..2];
   assert_eq!(slice[0], "bar");
   ```
   
   
   




[GitHub] [arrow-rs] Tudyx closed issue #1760: Best way to convert arrow to Rust native type

Posted by GitBox <gi...@apache.org>.
Tudyx closed issue #1760: Best way to convert arrow to Rust native type
URL: https://github.com/apache/arrow-rs/issues/1760




[GitHub] [arrow-rs] Tudyx commented on issue #1760: Best way to convert arrow to Rust native type

Posted by GitBox <gi...@apache.org>.
Tudyx commented on issue #1760:
URL: https://github.com/apache/arrow-rs/issues/1760#issuecomment-1279788618

   For anyone who reads this: I found a crate that is made to convert arrow to Rust native types here: https://github.com/DataEngineeringLabs/arrow2-convert




[GitHub] [arrow-rs] Tudyx commented on issue #1760: Best way to convert arrow to Rust native type

Posted by GitBox <gi...@apache.org>.
Tudyx commented on issue #1760:
URL: https://github.com/apache/arrow-rs/issues/1760#issuecomment-1142617042

   I guess I will stream batches of rows from an arrow dataset (memory mapped or not) and convert them into Rust native types on the fly (or not) with the technique you showed me. Thanks for your help!

