Posted to jira@arrow.apache.org by "Jorge (Jira)" <ji...@apache.org> on 2020/09/17 07:08:00 UTC

[jira] [Updated] (ARROW-10030) [Rust] Support fromIter and toIter

     [ https://issues.apache.org/jira/browse/ARROW-10030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge updated ARROW-10030:
--------------------------
    Component/s: Rust
    Description: 
Proposal for comments: [https://docs.google.com/document/d/1d6rV1WmvIH6uW-bcHKrYBSyPddrpXH8Q4CtVfFHtI04/edit?usp=sharing]

(dump of the document above)

Rust Arrow supports two main computational models:
 # Batch operations, which leverage some form of vectorization
 # Element-by-element operations, which arise in more complex operations

This document concerns element-by-element operations, which are the most common operations outside of the library.
h2. Element-by-element operations

These operations are programmatically written as:
 # Downcast the array to its specific type
 # Initialize buffers
 # Iterate over indices and perform the operation, appending to the buffers accordingly
 # Create ArrayData with the required null bitmap, buffers, children, etc.
 # Return ArrayRef from ArrayData

 

We can split this process into three parts:
 # Initialization (1 and 2)
 # Iteration (3)
 # Finalization (4 and 5)

Currently, the API that we offer to our users is:
 # as_any() to downcast the array based on its DataType
 # Builders for all types, which users can initialize to match the downcasted array
 # Iterate:
 ## Use for i in (0..array.len())
 ## Use Array::value(i) and Array::is_valid(i)/is_null(i)
 ## Use builder.append_value(new_value) or builder.append_null()
 # Finish the builder and wrap the result in an Arc
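The workflow listed above can be sketched with standard-library types only. This is a toy model, not the arrow crate's code: ToyInt32Array, ToyInt32Builder, the simplified Array trait, and add_one are hypothetical stand-ins that mirror only the method names mentioned in the list.

```rust
use std::any::Any;
use std::sync::Arc;

// Hypothetical, simplified stand-in for a type-erased array trait.
trait Array {
    fn as_any(&self) -> &dyn Any;
    fn len(&self) -> usize;
}

// Hypothetical toy array: one value slot and one validity flag per element.
#[derive(Debug, PartialEq)]
struct ToyInt32Array {
    values: Vec<i32>,
    validity: Vec<bool>,
}

impl ToyInt32Array {
    fn value(&self, i: usize) -> i32 {
        self.values[i]
    }
    fn is_valid(&self, i: usize) -> bool {
        self.validity[i]
    }
}

impl Array for ToyInt32Array {
    fn as_any(&self) -> &dyn Any {
        self
    }
    fn len(&self) -> usize {
        self.values.len()
    }
}

// Hypothetical toy builder mirroring append_value / append_null.
#[derive(Default)]
struct ToyInt32Builder {
    values: Vec<i32>,
    validity: Vec<bool>,
}

impl ToyInt32Builder {
    fn append_value(&mut self, v: i32) {
        self.values.push(v);
        self.validity.push(true);
    }
    fn append_null(&mut self) {
        // A null slot still occupies a value slot; 0 is used as a placeholder.
        self.values.push(0);
        self.validity.push(false);
    }
    fn finish(self) -> ToyInt32Array {
        ToyInt32Array { values: self.values, validity: self.validity }
    }
}

type ArrayRef = Arc<dyn Array>;

// Adds 1 to every non-null element; returns None if the downcast fails.
fn add_one(input: &ArrayRef) -> Option<ArrayRef> {
    // Downcast the type-erased array to its concrete type.
    let array = input.as_any().downcast_ref::<ToyInt32Array>()?;
    // Initialize a builder matching the downcasted array.
    let mut builder = ToyInt32Builder::default();
    // Iterate over indices, checking validity before reading each value.
    for i in 0..array.len() {
        if array.is_valid(i) {
            builder.append_value(array.value(i) + 1);
        } else {
            builder.append_null();
        }
    }
    // Finish the builder and wrap the result in an Arc.
    let out: ArrayRef = Arc::new(builder.finish());
    Some(out)
}
```

Even in this reduced form, the caller has to handle the downcast, the builder, the index loop, and the final Arc for a single "+1" operation.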

This API has some issues:
 # value(i) +is unsafe+, even though it is not marked as such
 # Builders are usually slow due to the checks that they need to perform
 # The API is not intuitive
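To illustrate the first issue, here is a minimal sketch of the general problem of an index accessor that skips bounds checks. ToyArray, value_unchecked, and get are hypothetical names for this illustration and do not reproduce the arrow crate's implementation; the sketch assumes the concern is an unchecked read exposed through a safe signature.

```rust
// Hypothetical toy array used only to contrast checked and unchecked access.
struct ToyArray {
    values: Vec<i32>,
}

impl ToyArray {
    /// Reads slot `i` without a bounds check.
    ///
    /// # Safety
    /// The caller must guarantee `i < self.values.len()`; otherwise the read
    /// is undefined behaviour. Marking the function `unsafe` makes that
    /// contract visible at every call site.
    unsafe fn value_unchecked(&self, i: usize) -> i32 {
        unsafe { *self.values.as_ptr().add(i) }
    }

    /// Checked alternative: returns None when `i` is out of bounds.
    fn get(&self, i: usize) -> Option<i32> {
        self.values.get(i).copied()
    }
}
```

If an accessor like value_unchecked were declared as a plain fn, safe code could trigger an out-of-bounds read without the compiler ever requiring an unsafe block.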

h2. Proposal

This proposal aims to improve this API in two specific ways:
 * Implement IntoIterator for arrays, yielding Iterator<Item=T> and Iterator<Item=Option<T>>
 * Implement FromIterator<T> and FromIterator<Option<T>> for arrays

so that users can write:

 
{code:java}
use arrow::array::Int32Array;

let array = Int32Array::from(vec![Some(0), None, Some(2), None, Some(4)]);
// to and from iter, with a +1 applied to every non-null element
let result: Int32Array = array
    .iter()
    .map(|e| e.map(|r| r + 1))
    .collect();
let expected = Int32Array::from(vec![Some(1), None, Some(3), None, Some(5)]);
assert_eq!(result, expected);
{code}

This results in an API that is:
 # Efficient, as it becomes our responsibility to provide FromIterator implementations that populate the buffers, children, etc. efficiently from an iterator
 # Safe, as it does not allow segfaults
 # Simple, as users do not need to worry about builders, buffers, etc., only native Rust.
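As a rough sketch of what such implementations could look like, the following uses a toy array built on standard-library types. ToyInt32Array and add_one are hypothetical, and the real implementation would populate Arrow buffers rather than Vecs. For brevity the sketch exposes an iter() method instead of a full IntoIterator implementation.

```rust
use std::iter::FromIterator;

// Hypothetical toy array: one value slot and one validity flag per element.
#[derive(Debug, PartialEq)]
struct ToyInt32Array {
    values: Vec<i32>,
    validity: Vec<bool>,
}

impl ToyInt32Array {
    /// Yields Some(value) for valid slots and None for null slots.
    fn iter(&self) -> impl Iterator<Item = Option<i32>> + '_ {
        self.values
            .iter()
            .zip(self.validity.iter())
            .map(|(v, valid)| if *valid { Some(*v) } else { None })
    }
}

// Nullable input: each item is Some(value) or None.
impl FromIterator<Option<i32>> for ToyInt32Array {
    fn from_iter<I: IntoIterator<Item = Option<i32>>>(iter: I) -> Self {
        let iter = iter.into_iter();
        // Reserve capacity once, based on the iterator's lower size bound.
        let (lower, _) = iter.size_hint();
        let mut values = Vec::with_capacity(lower);
        let mut validity = Vec::with_capacity(lower);
        for item in iter {
            validity.push(item.is_some());
            // Null slots store a default placeholder value (0).
            values.push(item.unwrap_or_default());
        }
        ToyInt32Array { values, validity }
    }
}

// Non-nullable input: every slot is valid.
impl FromIterator<i32> for ToyInt32Array {
    fn from_iter<I: IntoIterator<Item = i32>>(iter: I) -> Self {
        let values: Vec<i32> = iter.into_iter().collect();
        let validity = vec![true; values.len()];
        ToyInt32Array { values, validity }
    }
}

// The element-by-element "+1" operation, written with iterators only.
fn add_one(array: &ToyInt32Array) -> ToyInt32Array {
    array.iter().map(|e| e.map(|v| v + 1)).collect()
}
```

With these two pieces in place, the user-facing code reduces to an iterator chain, and any preallocation or validity bookkeeping stays inside from_iter.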



> [Rust] Support fromIter and toIter
> ----------------------------------
>
>                 Key: ARROW-10030
>                 URL: https://issues.apache.org/jira/browse/ARROW-10030
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust
>            Reporter: Jorge
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)