You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/05 18:43:37 UTC

[GitHub] [arrow] alamb opened a new pull request #8346: ARROW-10164: [Rust] Add support for DictionaryArray to cast kernel

alamb opened a new pull request #8346:
URL: https://github.com/apache/arrow/pull/8346


   NOTE: this builds on #8333 and https://github.com/apache/arrow/pull/8340
   
   This PR adds support to the rust compute kernal casting `DictionaryArray` to/from  `PrimitiveArray`/`StringArray`
   
   It does not include other types such as LargeString or Binary (though the code could be extended fairly easily following the same pattern) -- but my usecase doesn't need LargeString or Binary. 
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8346: ARROW-10164: [Rust] Add support for DictionaryArray to cast kernel

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8346:
URL: https://github.com/apache/arrow/pull/8346#issuecomment-703824202


   https://issues.apache.org/jira/browse/ARROW-10164


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #8346: ARROW-10164: [Rust] Add support for DictionaryArray to cast kernel

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #8346:
URL: https://github.com/apache/arrow/pull/8346#discussion_r500409499



##########
File path: rust/arrow/src/compute/kernels/cast.rs
##########
@@ -755,10 +784,253 @@ where
     Ok(b.finish())
 }
 
+/// Attempts to cast an `ArrayDictionary` with index type K into
+/// `to_type` for supported type.
+///
+/// K is the key type
+fn dictionary_cast<K: ArrowDictionaryKeyType>(
+    array: &ArrayRef,
+    to_type: &DataType,
+) -> Result<ArrayRef> {
+    use DataType::*;
+
+    let dict_array = array
+        .as_any()
+        .downcast_ref::<DictionaryArray<K>>()
+        .ok_or_else(|| {
+            ArrowError::ComputeError(
+                "Internal Error: Cannot cast dictionary to DictionaryArray of expected type".to_string(),
+            )
+        })?;
+
+    match to_type {
+        Dictionary(to_index_type, to_value_type) => {
+            let keys_array: ArrayRef = Arc::new(dict_array.keys_array());
+            let values_array: ArrayRef = dict_array.values();
+            let cast_keys = cast(&keys_array, to_index_type)?;
+            let cast_values = cast(&values_array, to_value_type)?;
+
+            // Failure to cast keys (because they don't fit in the
+            // target type) results in NULL values;
+            if cast_keys.null_count() > keys_array.null_count() {
+                return Err(ArrowError::ComputeError(format!(
+                    "Could not convert {} dictionary indexes from {:?} to {:?}",
+                    cast_keys.null_count() - keys_array.null_count(),
+                    keys_array.data_type(),
+                    to_index_type
+                )));
+            }
+
+            // keys are data, child_data is values (dictionary)
+            let data = Arc::new(ArrayData::new(
+                to_type.clone(),
+                cast_keys.len(),
+                Some(cast_keys.null_count()),
+                cast_keys
+                    .data()
+                    .null_bitmap()
+                    .clone()
+                    .map(|bitmap| bitmap.bits),
+                cast_keys.data().offset(),
+                cast_keys.data().buffers().to_vec(),
+                vec![cast_values.data()],
+            ));
+
+            // create the appropriate array type
+            let new_array: ArrayRef = match **to_index_type {
+                Int8 => Arc::new(DictionaryArray::<Int8Type>::from(data)),
+                Int16 => Arc::new(DictionaryArray::<Int16Type>::from(data)),
+                Int32 => Arc::new(DictionaryArray::<Int32Type>::from(data)),
+                Int64 => Arc::new(DictionaryArray::<Int64Type>::from(data)),
+                UInt8 => Arc::new(DictionaryArray::<UInt8Type>::from(data)),
+                UInt16 => Arc::new(DictionaryArray::<UInt16Type>::from(data)),
+                UInt32 => Arc::new(DictionaryArray::<UInt32Type>::from(data)),
+                UInt64 => Arc::new(DictionaryArray::<UInt64Type>::from(data)),
+                _ => {
+                    return Err(ArrowError::ComputeError(format!(
+                        "Unsupported type {:?} for dictionary index",
+                        to_index_type
+                    )))
+                }
+            };
+
+            Ok(new_array)
+        }
+        // numeric types
+        Int8 => unpack_dictionary_to_numeric::<K, Int8Type>(dict_array, to_type),
+        Int16 => unpack_dictionary_to_numeric::<K, Int16Type>(dict_array, to_type),
+        Int32 => unpack_dictionary_to_numeric::<K, Int32Type>(dict_array, to_type),
+        Int64 => unpack_dictionary_to_numeric::<K, Int64Type>(dict_array, to_type),
+        UInt8 => unpack_dictionary_to_numeric::<K, UInt8Type>(dict_array, to_type),
+        UInt16 => unpack_dictionary_to_numeric::<K, UInt16Type>(dict_array, to_type),
+        UInt32 => unpack_dictionary_to_numeric::<K, UInt32Type>(dict_array, to_type),
+        UInt64 => unpack_dictionary_to_numeric::<K, UInt64Type>(dict_array, to_type),
+        Utf8 => unpack_dictionary_to_string::<K>(dict_array),
+        _ => Err(ArrowError::ComputeError(format!(
+            "Unsupported output type for dictionary conversion: {:?}",
+            to_type
+        ))),
+    }
+}
+
+// Unpack the dictionary where the keys are of type <K> and the values
+// are of type <V> into a primative array of type to_type
+fn unpack_dictionary_to_numeric<K, V>(
+    dict_array: &DictionaryArray<K>,
+    to_type: &DataType,
+) -> Result<ArrayRef>
+where
+    K: ArrowDictionaryKeyType,
+    V: ArrowNumericType,
+{
+    // attempt to cast the dict values to the target type
+    let cast_dict_values = cast(&dict_array.values(), to_type)?;
+    let dict_values = cast_dict_values
+        .as_any()
+        .downcast_ref::<PrimitiveArray<V>>()
+        .unwrap();
+
+    let mut b = PrimitiveBuilder::<V>::new(dict_array.len());
+
+    // copy each element one at a time
+    for key in dict_array.keys() {

Review comment:
       Isn't it possible to use `arrow::compute::take(indices, values)`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #8346: ARROW-10164: [Rust] Add support for DictionaryArray to cast kernel

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #8346:
URL: https://github.com/apache/arrow/pull/8346#discussion_r500409499



##########
File path: rust/arrow/src/compute/kernels/cast.rs
##########
@@ -755,10 +784,253 @@ where
     Ok(b.finish())
 }
 
+/// Attempts to cast an `ArrayDictionary` with index type K into
+/// `to_type` for supported type.
+///
+/// K is the key type
+fn dictionary_cast<K: ArrowDictionaryKeyType>(
+    array: &ArrayRef,
+    to_type: &DataType,
+) -> Result<ArrayRef> {
+    use DataType::*;
+
+    let dict_array = array
+        .as_any()
+        .downcast_ref::<DictionaryArray<K>>()
+        .ok_or_else(|| {
+            ArrowError::ComputeError(
+                "Internal Error: Cannot cast dictionary to DictionaryArray of expected type".to_string(),
+            )
+        })?;
+
+    match to_type {
+        Dictionary(to_index_type, to_value_type) => {
+            let keys_array: ArrayRef = Arc::new(dict_array.keys_array());
+            let values_array: ArrayRef = dict_array.values();
+            let cast_keys = cast(&keys_array, to_index_type)?;
+            let cast_values = cast(&values_array, to_value_type)?;
+
+            // Failure to cast keys (because they don't fit in the
+            // target type) results in NULL values;
+            if cast_keys.null_count() > keys_array.null_count() {
+                return Err(ArrowError::ComputeError(format!(
+                    "Could not convert {} dictionary indexes from {:?} to {:?}",
+                    cast_keys.null_count() - keys_array.null_count(),
+                    keys_array.data_type(),
+                    to_index_type
+                )));
+            }
+
+            // keys are data, child_data is values (dictionary)
+            let data = Arc::new(ArrayData::new(
+                to_type.clone(),
+                cast_keys.len(),
+                Some(cast_keys.null_count()),
+                cast_keys
+                    .data()
+                    .null_bitmap()
+                    .clone()
+                    .map(|bitmap| bitmap.bits),
+                cast_keys.data().offset(),
+                cast_keys.data().buffers().to_vec(),
+                vec![cast_values.data()],
+            ));
+
+            // create the appropriate array type
+            let new_array: ArrayRef = match **to_index_type {
+                Int8 => Arc::new(DictionaryArray::<Int8Type>::from(data)),
+                Int16 => Arc::new(DictionaryArray::<Int16Type>::from(data)),
+                Int32 => Arc::new(DictionaryArray::<Int32Type>::from(data)),
+                Int64 => Arc::new(DictionaryArray::<Int64Type>::from(data)),
+                UInt8 => Arc::new(DictionaryArray::<UInt8Type>::from(data)),
+                UInt16 => Arc::new(DictionaryArray::<UInt16Type>::from(data)),
+                UInt32 => Arc::new(DictionaryArray::<UInt32Type>::from(data)),
+                UInt64 => Arc::new(DictionaryArray::<UInt64Type>::from(data)),
+                _ => {
+                    return Err(ArrowError::ComputeError(format!(
+                        "Unsupported type {:?} for dictionary index",
+                        to_index_type
+                    )))
+                }
+            };
+
+            Ok(new_array)
+        }
+        // numeric types
+        Int8 => unpack_dictionary_to_numeric::<K, Int8Type>(dict_array, to_type),
+        Int16 => unpack_dictionary_to_numeric::<K, Int16Type>(dict_array, to_type),
+        Int32 => unpack_dictionary_to_numeric::<K, Int32Type>(dict_array, to_type),
+        Int64 => unpack_dictionary_to_numeric::<K, Int64Type>(dict_array, to_type),
+        UInt8 => unpack_dictionary_to_numeric::<K, UInt8Type>(dict_array, to_type),
+        UInt16 => unpack_dictionary_to_numeric::<K, UInt16Type>(dict_array, to_type),
+        UInt32 => unpack_dictionary_to_numeric::<K, UInt32Type>(dict_array, to_type),
+        UInt64 => unpack_dictionary_to_numeric::<K, UInt64Type>(dict_array, to_type),
+        Utf8 => unpack_dictionary_to_string::<K>(dict_array),
+        _ => Err(ArrowError::ComputeError(format!(
+            "Unsupported output type for dictionary conversion: {:?}",
+            to_type
+        ))),
+    }
+}
+
+// Unpack the dictionary where the keys are of type <K> and the values
+// are of type <V> into a primative array of type to_type
+fn unpack_dictionary_to_numeric<K, V>(
+    dict_array: &DictionaryArray<K>,
+    to_type: &DataType,
+) -> Result<ArrayRef>
+where
+    K: ArrowDictionaryKeyType,
+    V: ArrowNumericType,
+{
+    // attempt to cast the dict values to the target type
+    let cast_dict_values = cast(&dict_array.values(), to_type)?;
+    let dict_values = cast_dict_values
+        .as_any()
+        .downcast_ref::<PrimitiveArray<V>>()
+        .unwrap();
+
+    let mut b = PrimitiveBuilder::<V>::new(dict_array.len());
+
+    // copy each element one at a time
+    for key in dict_array.keys() {

Review comment:
       Isn't it possible to use `arrow::compute::take(indices, values)`? I think if you cast the indices to u32, you'd be able to. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove closed pull request #8346: ARROW-10164: [Rust] Add support for DictionaryArray to cast kernel

Posted by GitBox <gi...@apache.org>.
andygrove closed pull request #8346:
URL: https://github.com/apache/arrow/pull/8346


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8346: ARROW-10164: [Rust] Add support for DictionaryArray to cast kernel

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8346:
URL: https://github.com/apache/arrow/pull/8346#issuecomment-705089990


   @nevi-me I think this one is now ready for reveiw


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8346: ARROW-10164: [Rust] Add support for DictionaryArray to cast kernel

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8346:
URL: https://github.com/apache/arrow/pull/8346#discussion_r501187806



##########
File path: rust/arrow/src/compute/kernels/cast.rs
##########
@@ -740,10 +769,203 @@ where
         .collect()
 }
 
+/// Attempts to cast an `ArrayDictionary` with index type K into
+/// `to_type` for supported types.
+///
+/// K is the key type
+fn dictionary_cast<K: ArrowDictionaryKeyType>(
+    array: &ArrayRef,
+    to_type: &DataType,
+) -> Result<ArrayRef> {
+    use DataType::*;
+
+    match to_type {
+        Dictionary(to_index_type, to_value_type) => {
+            let dict_array = array
+                .as_any()
+                .downcast_ref::<DictionaryArray<K>>()
+                .ok_or_else(|| {
+                    ArrowError::ComputeError(
+                        "Internal Error: Cannot cast dictionary to DictionaryArray of expected type".to_string(),
+                    )
+                })?;
+
+            let keys_array: ArrayRef = Arc::new(dict_array.keys_array());
+            let values_array: ArrayRef = dict_array.values();
+            let cast_keys = cast(&keys_array, to_index_type)?;
+            let cast_values = cast(&values_array, to_value_type)?;
+
+            // Failure to cast keys (because they don't fit in the
+            // target type) results in NULL values;
+            if cast_keys.null_count() > keys_array.null_count() {
+                return Err(ArrowError::ComputeError(format!(
+                    "Could not convert {} dictionary indexes from {:?} to {:?}",
+                    cast_keys.null_count() - keys_array.null_count(),
+                    keys_array.data_type(),
+                    to_index_type
+                )));
+            }
+
+            // keys are data, child_data is values (dictionary)
+            let data = Arc::new(ArrayData::new(
+                to_type.clone(),
+                cast_keys.len(),
+                Some(cast_keys.null_count()),
+                cast_keys
+                    .data()
+                    .null_bitmap()
+                    .clone()
+                    .map(|bitmap| bitmap.bits),
+                cast_keys.data().offset(),
+                cast_keys.data().buffers().to_vec(),
+                vec![cast_values.data()],
+            ));
+
+            // create the appropriate array type
+            let new_array: ArrayRef = match **to_index_type {
+                Int8 => Arc::new(DictionaryArray::<Int8Type>::from(data)),
+                Int16 => Arc::new(DictionaryArray::<Int16Type>::from(data)),
+                Int32 => Arc::new(DictionaryArray::<Int32Type>::from(data)),
+                Int64 => Arc::new(DictionaryArray::<Int64Type>::from(data)),
+                UInt8 => Arc::new(DictionaryArray::<UInt8Type>::from(data)),
+                UInt16 => Arc::new(DictionaryArray::<UInt16Type>::from(data)),
+                UInt32 => Arc::new(DictionaryArray::<UInt32Type>::from(data)),
+                UInt64 => Arc::new(DictionaryArray::<UInt64Type>::from(data)),
+                _ => {
+                    return Err(ArrowError::ComputeError(format!(
+                        "Unsupported type {:?} for dictionary index",
+                        to_index_type
+                    )))
+                }
+            };
+
+            Ok(new_array)
+        }
+        _ => unpack_dictionary::<K>(array, to_type),
+    }
+}
+
+// Unpack a dictionary where the keys are of type <K> into a flattened array of type to_type
+fn unpack_dictionary<K>(array: &ArrayRef, to_type: &DataType) -> Result<ArrayRef>

Review comment:
       This code uses the `take` kernel, as suggested by @nevi-me. That means both less code and support for unpacking a larger range of datatype (any type that the dictionary can be unpacked to). 👍 👍 

##########
File path: rust/arrow/src/compute/kernels/cast.rs
##########
@@ -740,10 +769,203 @@ where
         .collect()
 }
 
+/// Attempts to cast an `ArrayDictionary` with index type K into
+/// `to_type` for supported types.
+///
+/// K is the key type
+fn dictionary_cast<K: ArrowDictionaryKeyType>(
+    array: &ArrayRef,
+    to_type: &DataType,
+) -> Result<ArrayRef> {
+    use DataType::*;
+
+    match to_type {
+        Dictionary(to_index_type, to_value_type) => {
+            let dict_array = array
+                .as_any()
+                .downcast_ref::<DictionaryArray<K>>()
+                .ok_or_else(|| {
+                    ArrowError::ComputeError(
+                        "Internal Error: Cannot cast dictionary to DictionaryArray of expected type".to_string(),
+                    )
+                })?;
+
+            let keys_array: ArrayRef = Arc::new(dict_array.keys_array());
+            let values_array: ArrayRef = dict_array.values();
+            let cast_keys = cast(&keys_array, to_index_type)?;
+            let cast_values = cast(&values_array, to_value_type)?;
+
+            // Failure to cast keys (because they don't fit in the
+            // target type) results in NULL values;
+            if cast_keys.null_count() > keys_array.null_count() {
+                return Err(ArrowError::ComputeError(format!(
+                    "Could not convert {} dictionary indexes from {:?} to {:?}",
+                    cast_keys.null_count() - keys_array.null_count(),
+                    keys_array.data_type(),
+                    to_index_type
+                )));
+            }
+
+            // keys are data, child_data is values (dictionary)
+            let data = Arc::new(ArrayData::new(
+                to_type.clone(),
+                cast_keys.len(),
+                Some(cast_keys.null_count()),
+                cast_keys
+                    .data()
+                    .null_bitmap()
+                    .clone()
+                    .map(|bitmap| bitmap.bits),
+                cast_keys.data().offset(),
+                cast_keys.data().buffers().to_vec(),
+                vec![cast_values.data()],
+            ));
+
+            // create the appropriate array type
+            let new_array: ArrayRef = match **to_index_type {
+                Int8 => Arc::new(DictionaryArray::<Int8Type>::from(data)),
+                Int16 => Arc::new(DictionaryArray::<Int16Type>::from(data)),
+                Int32 => Arc::new(DictionaryArray::<Int32Type>::from(data)),
+                Int64 => Arc::new(DictionaryArray::<Int64Type>::from(data)),
+                UInt8 => Arc::new(DictionaryArray::<UInt8Type>::from(data)),
+                UInt16 => Arc::new(DictionaryArray::<UInt16Type>::from(data)),
+                UInt32 => Arc::new(DictionaryArray::<UInt32Type>::from(data)),
+                UInt64 => Arc::new(DictionaryArray::<UInt64Type>::from(data)),
+                _ => {
+                    return Err(ArrowError::ComputeError(format!(
+                        "Unsupported type {:?} for dictionary index",
+                        to_index_type
+                    )))
+                }
+            };
+
+            Ok(new_array)
+        }
+        _ => unpack_dictionary::<K>(array, to_type),
+    }
+}
+
+// Unpack a dictionary where the keys are of type <K> into a flattened array of type to_type
+fn unpack_dictionary<K>(array: &ArrayRef, to_type: &DataType) -> Result<ArrayRef>
+where
+    K: ArrowDictionaryKeyType,
+{
+    let dict_array = array
+        .as_any()
+        .downcast_ref::<DictionaryArray<K>>()
+        .ok_or_else(|| {
+            ArrowError::ComputeError(
+                "Internal Error: Cannot cast dictionary to DictionaryArray of expected type".to_string(),
+            )
+        })?;
+
+    // attempt to cast the dict values to the target type
+    // use the take kernel to expand out the dictionary
+    let cast_dict_values = cast(&dict_array.values(), to_type)?;
+
+    // Note take requires first casting the indicies to u32
+    let keys_array: ArrayRef = Arc::new(dict_array.keys_array());
+    let indicies = cast(&keys_array, &DataType::UInt32)?;
+    let u32_indicies =
+        indicies
+            .as_any()
+            .downcast_ref::<UInt32Array>()
+            .ok_or_else(|| {
+                ArrowError::ComputeError(
+                    "Internal Error: Cannot cast dict indicies to UInt32".to_string(),
+                )
+            })?;
+
+    take(&cast_dict_values, u32_indicies, None)
+}
+
+/// Attempts to encode an array into an `ArrayDictionary` with index
+/// type K and value (dictionary) type value_type
+///
+/// K is the key type
+fn cast_to_dictionary<K: ArrowDictionaryKeyType>(
+    array: &ArrayRef,
+    dict_value_type: &DataType,
+) -> Result<ArrayRef> {
+    use DataType::*;
+
+    match *dict_value_type {
+        Int8 => pack_numeric_to_dictionary::<K, Int8Type>(array, dict_value_type),
+        Int16 => pack_numeric_to_dictionary::<K, Int16Type>(array, dict_value_type),
+        Int32 => pack_numeric_to_dictionary::<K, Int32Type>(array, dict_value_type),
+        Int64 => pack_numeric_to_dictionary::<K, Int64Type>(array, dict_value_type),
+        UInt8 => pack_numeric_to_dictionary::<K, UInt8Type>(array, dict_value_type),
+        UInt16 => pack_numeric_to_dictionary::<K, UInt16Type>(array, dict_value_type),
+        UInt32 => pack_numeric_to_dictionary::<K, UInt32Type>(array, dict_value_type),
+        UInt64 => pack_numeric_to_dictionary::<K, UInt64Type>(array, dict_value_type),
+        Utf8 => pack_string_to_dictionary::<K>(array),
+        _ => Err(ArrowError::ComputeError(format!(
+            "Internal Error: Unsupported output type for dictionary packing: {:?}",
+            dict_value_type
+        ))),
+    }
+}
+
+// Packs the data from the primitive array of type <V> to a
+// DictionaryArray with keys of type K and values of value_type V
+fn pack_numeric_to_dictionary<K, V>(
+    array: &ArrayRef,
+    dict_value_type: &DataType,
+) -> Result<ArrayRef>
+where
+    K: ArrowDictionaryKeyType,
+    V: ArrowNumericType,
+{
+    // attempt to cast the source array values to the target value type (the dictionary values type)
+    let cast_values = cast(array, &dict_value_type)?;
+    let values = cast_values
+        .as_any()
+        .downcast_ref::<PrimitiveArray<V>>()
+        .unwrap();
+
+    let keys_builder = PrimitiveBuilder::<K>::new(values.len());
+    let values_builder = PrimitiveBuilder::<V>::new(values.len());
+    let mut b = PrimitiveDictionaryBuilder::new(keys_builder, values_builder);
+
+    // copy each element one at a time
+    for i in 0..values.len() {
+        if values.is_null(i) {
+            b.append_null()?;
+        } else {
+            b.append(values.value(i))?;
+        }
+    }
+    Ok(Arc::new(b.finish()))
+}
+
+// Packs the data as a StringDictionaryArray, if possible, with the
+// key types of K
+fn pack_string_to_dictionary<K>(array: &ArrayRef) -> Result<ArrayRef>

Review comment:
       Maybe there is a fancier way to implement packing for arrays other than `PrimitiveType` but I didn't see anything obvious (maybe a `macro!`?)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8346: ARROW-10164: [Rust] Add support for DictionaryArray to cast kernel

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8346:
URL: https://github.com/apache/arrow/pull/8346#issuecomment-705163232


   Rebased as I had some CI failures -- hoping to get a green run


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8346: ARROW-10164: [Rust] Add support for DictionaryArray to cast kernel

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8346:
URL: https://github.com/apache/arrow/pull/8346#discussion_r500443425



##########
File path: rust/arrow/src/compute/kernels/cast.rs
##########
@@ -755,10 +784,253 @@ where
     Ok(b.finish())
 }
 
+/// Attempts to cast an `ArrayDictionary` with index type K into
+/// `to_type` for supported type.
+///
+/// K is the key type
+fn dictionary_cast<K: ArrowDictionaryKeyType>(
+    array: &ArrayRef,
+    to_type: &DataType,
+) -> Result<ArrayRef> {
+    use DataType::*;
+
+    let dict_array = array
+        .as_any()
+        .downcast_ref::<DictionaryArray<K>>()
+        .ok_or_else(|| {
+            ArrowError::ComputeError(
+                "Internal Error: Cannot cast dictionary to DictionaryArray of expected type".to_string(),
+            )
+        })?;
+
+    match to_type {
+        Dictionary(to_index_type, to_value_type) => {
+            let keys_array: ArrayRef = Arc::new(dict_array.keys_array());
+            let values_array: ArrayRef = dict_array.values();
+            let cast_keys = cast(&keys_array, to_index_type)?;
+            let cast_values = cast(&values_array, to_value_type)?;
+
+            // Failure to cast keys (because they don't fit in the
+            // target type) results in NULL values;
+            if cast_keys.null_count() > keys_array.null_count() {
+                return Err(ArrowError::ComputeError(format!(
+                    "Could not convert {} dictionary indexes from {:?} to {:?}",
+                    cast_keys.null_count() - keys_array.null_count(),
+                    keys_array.data_type(),
+                    to_index_type
+                )));
+            }
+
+            // keys are data, child_data is values (dictionary)
+            let data = Arc::new(ArrayData::new(
+                to_type.clone(),
+                cast_keys.len(),
+                Some(cast_keys.null_count()),
+                cast_keys
+                    .data()
+                    .null_bitmap()
+                    .clone()
+                    .map(|bitmap| bitmap.bits),
+                cast_keys.data().offset(),
+                cast_keys.data().buffers().to_vec(),
+                vec![cast_values.data()],
+            ));
+
+            // create the appropriate array type
+            let new_array: ArrayRef = match **to_index_type {
+                Int8 => Arc::new(DictionaryArray::<Int8Type>::from(data)),
+                Int16 => Arc::new(DictionaryArray::<Int16Type>::from(data)),
+                Int32 => Arc::new(DictionaryArray::<Int32Type>::from(data)),
+                Int64 => Arc::new(DictionaryArray::<Int64Type>::from(data)),
+                UInt8 => Arc::new(DictionaryArray::<UInt8Type>::from(data)),
+                UInt16 => Arc::new(DictionaryArray::<UInt16Type>::from(data)),
+                UInt32 => Arc::new(DictionaryArray::<UInt32Type>::from(data)),
+                UInt64 => Arc::new(DictionaryArray::<UInt64Type>::from(data)),
+                _ => {
+                    return Err(ArrowError::ComputeError(format!(
+                        "Unsupported type {:?} for dictionary index",
+                        to_index_type
+                    )))
+                }
+            };
+
+            Ok(new_array)
+        }
+        // numeric types
+        Int8 => unpack_dictionary_to_numeric::<K, Int8Type>(dict_array, to_type),
+        Int16 => unpack_dictionary_to_numeric::<K, Int16Type>(dict_array, to_type),
+        Int32 => unpack_dictionary_to_numeric::<K, Int32Type>(dict_array, to_type),
+        Int64 => unpack_dictionary_to_numeric::<K, Int64Type>(dict_array, to_type),
+        UInt8 => unpack_dictionary_to_numeric::<K, UInt8Type>(dict_array, to_type),
+        UInt16 => unpack_dictionary_to_numeric::<K, UInt16Type>(dict_array, to_type),
+        UInt32 => unpack_dictionary_to_numeric::<K, UInt32Type>(dict_array, to_type),
+        UInt64 => unpack_dictionary_to_numeric::<K, UInt64Type>(dict_array, to_type),
+        Utf8 => unpack_dictionary_to_string::<K>(dict_array),
+        _ => Err(ArrowError::ComputeError(format!(
+            "Unsupported output type for dictionary conversion: {:?}",
+            to_type
+        ))),
+    }
+}
+
+// Unpack the dictionary where the keys are of type <K> and the values
+// are of type <V> into a primative array of type to_type
+fn unpack_dictionary_to_numeric<K, V>(
+    dict_array: &DictionaryArray<K>,
+    to_type: &DataType,
+) -> Result<ArrayRef>
+where
+    K: ArrowDictionaryKeyType,
+    V: ArrowNumericType,
+{
+    // attempt to cast the dict values to the target type
+    let cast_dict_values = cast(&dict_array.values(), to_type)?;
+    let dict_values = cast_dict_values
+        .as_any()
+        .downcast_ref::<PrimitiveArray<V>>()
+        .unwrap();
+
+    let mut b = PrimitiveBuilder::<V>::new(dict_array.len());
+
+    // copy each element one at a time
+    for key in dict_array.keys() {

Review comment:
       that is an excellent idea -- I will look into doing so




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb edited a comment on pull request #8346: ARROW-10164: [Rust] Add support for DictionaryArray to cast kernel

Posted by GitBox <gi...@apache.org>.
alamb edited a comment on pull request #8346:
URL: https://github.com/apache/arrow/pull/8346#issuecomment-705089990


   @nevi-me I think this one is now ready for review, if you have the time


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8346: ARROW-10164: [Rust] Add support for DictionaryArray to cast kernel

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8346:
URL: https://github.com/apache/arrow/pull/8346#issuecomment-705532225


   > 🤔 so the 'pretty-print' feature is not enabled by default for arrow and thus I can't use it in the tests. I put in a hack (copy/paste of the pretty printing) in [9f8b9ba](https://github.com/apache/arrow/commit/9f8b9ba11e9523a19ffc6449ac7a3b915b933ef7) but I am not super happy with that.
   > 
   > @nevi-me do you by any chance have any other suggestions?
   
   Apologies for the delay @alamb. I remember when I started working on Arrow, stringy comparisons were discouraged at the time because you could have arrays that end up having the same string representation, but have data differences which might not get caught.
   This is also important for integration tests.
   
   I'm on the fence, you could enable that assertion only on the `pretty` feature?
   
   Perhaps other commiters will have a differing opinion (@sunchao @jorgecarleitao @paddyhoran @andygrove).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8346: ARROW-10164: [Rust] Add support for DictionaryArray to cast kernel

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8346:
URL: https://github.com/apache/arrow/pull/8346#issuecomment-705206173


   🤔 so the 'pretty-print' feature is not enabled by default for arrow and thus I can't use it in the tests. I put in a hack (copy/paste of the pretty printing) in https://github.com/apache/arrow/pull/8346/commits/9f8b9ba11e9523a19ffc6449ac7a3b915b933ef7 but I am not super happy with that.
   
   @nevi-me  do you by any chance have any other suggestions?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8346: ARROW-10164: [Rust] Add support for DictionaryArray to cast kernel

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8346:
URL: https://github.com/apache/arrow/pull/8346#issuecomment-705532225


   > 🤔 so the 'pretty-print' feature is not enabled by default for arrow and thus I can't use it in the tests. I put in a hack (copy/paste of the pretty printing) in [9f8b9ba](https://github.com/apache/arrow/commit/9f8b9ba11e9523a19ffc6449ac7a3b915b933ef7) but I am not super happy with that.
   > 
   > @nevi-me do you by any chance have any other suggestions?
   
   Apologies for the delay @alamb. I remember when I started working on Arrow, stringy comparisons were discouraged at the time because you could have arrays that end up having the same string representation, but have data differences which might not get caught.
   This is also important for integration tests.
   
   I'm on the fence, you could enable that assertion only on the `pretty` feature?
   
   Perhaps other commiters will have a differing opinion (@sunchao @jorgecarleitao @paddyhoran @andygrove).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on a change in pull request #8346: ARROW-10164: [Rust] Add support for DictionaryArray to cast kernel

Posted by GitBox <gi...@apache.org>.
nevi-me commented on a change in pull request #8346:
URL: https://github.com/apache/arrow/pull/8346#discussion_r501224081



##########
File path: rust/arrow/src/compute/kernels/cast.rs
##########
@@ -740,10 +769,203 @@ where
         .collect()
 }
 
+/// Attempts to cast an `ArrayDictionary` with index type K into
+/// `to_type` for supported types.
+///
+/// K is the key type
+fn dictionary_cast<K: ArrowDictionaryKeyType>(
+    array: &ArrayRef,
+    to_type: &DataType,
+) -> Result<ArrayRef> {
+    use DataType::*;
+
+    match to_type {
+        Dictionary(to_index_type, to_value_type) => {
+            let dict_array = array
+                .as_any()
+                .downcast_ref::<DictionaryArray<K>>()
+                .ok_or_else(|| {
+                    ArrowError::ComputeError(
+                        "Internal Error: Cannot cast dictionary to DictionaryArray of expected type".to_string(),
+                    )
+                })?;
+
+            let keys_array: ArrayRef = Arc::new(dict_array.keys_array());
+            let values_array: ArrayRef = dict_array.values();
+            let cast_keys = cast(&keys_array, to_index_type)?;
+            let cast_values = cast(&values_array, to_value_type)?;
+
+            // Failure to cast keys (because they don't fit in the
+            // target type) results in NULL values;
+            if cast_keys.null_count() > keys_array.null_count() {
+                return Err(ArrowError::ComputeError(format!(
+                    "Could not convert {} dictionary indexes from {:?} to {:?}",
+                    cast_keys.null_count() - keys_array.null_count(),
+                    keys_array.data_type(),
+                    to_index_type
+                )));
+            }
+
+            // keys are data, child_data is values (dictionary)
+            let data = Arc::new(ArrayData::new(
+                to_type.clone(),
+                cast_keys.len(),
+                Some(cast_keys.null_count()),
+                cast_keys
+                    .data()
+                    .null_bitmap()
+                    .clone()
+                    .map(|bitmap| bitmap.bits),
+                cast_keys.data().offset(),
+                cast_keys.data().buffers().to_vec(),
+                vec![cast_values.data()],
+            ));
+
+            // create the appropriate array type
+            let new_array: ArrayRef = match **to_index_type {
+                Int8 => Arc::new(DictionaryArray::<Int8Type>::from(data)),
+                Int16 => Arc::new(DictionaryArray::<Int16Type>::from(data)),
+                Int32 => Arc::new(DictionaryArray::<Int32Type>::from(data)),
+                Int64 => Arc::new(DictionaryArray::<Int64Type>::from(data)),
+                UInt8 => Arc::new(DictionaryArray::<UInt8Type>::from(data)),
+                UInt16 => Arc::new(DictionaryArray::<UInt16Type>::from(data)),
+                UInt32 => Arc::new(DictionaryArray::<UInt32Type>::from(data)),
+                UInt64 => Arc::new(DictionaryArray::<UInt64Type>::from(data)),
+                _ => {
+                    return Err(ArrowError::ComputeError(format!(
+                        "Unsupported type {:?} for dictionary index",
+                        to_index_type
+                    )))
+                }
+            };
+
+            Ok(new_array)
+        }
+        _ => unpack_dictionary::<K>(array, to_type),
+    }
+}
+
+// Unpack a dictionary where the keys are of type <K> into a flattened array of type to_type
+fn unpack_dictionary<K>(array: &ArrayRef, to_type: &DataType) -> Result<ArrayRef>
+where
+    K: ArrowDictionaryKeyType,
+{
+    let dict_array = array
+        .as_any()
+        .downcast_ref::<DictionaryArray<K>>()
+        .ok_or_else(|| {
+            ArrowError::ComputeError(
+                "Internal Error: Cannot cast dictionary to DictionaryArray of expected type".to_string(),
+            )
+        })?;
+
+    // attempt to cast the dict values to the target type
+    // use the take kernel to expand out the dictionary
+    let cast_dict_values = cast(&dict_array.values(), to_type)?;
+
+    // Note take requires first casting the indicies to u32
+    let keys_array: ArrayRef = Arc::new(dict_array.keys_array());
+    let indicies = cast(&keys_array, &DataType::UInt32)?;
+    let u32_indicies =
+        indicies
+            .as_any()
+            .downcast_ref::<UInt32Array>()
+            .ok_or_else(|| {
+                ArrowError::ComputeError(
+                    "Internal Error: Cannot cast dict indicies to UInt32".to_string(),
+                )
+            })?;
+
+    take(&cast_dict_values, u32_indicies, None)
+}
+
+/// Attempts to encode an array into an `ArrayDictionary` with index
+/// type K and value (dictionary) type value_type
+///
+/// K is the key type
+fn cast_to_dictionary<K: ArrowDictionaryKeyType>(
+    array: &ArrayRef,
+    dict_value_type: &DataType,
+) -> Result<ArrayRef> {
+    use DataType::*;
+
+    match *dict_value_type {
+        Int8 => pack_numeric_to_dictionary::<K, Int8Type>(array, dict_value_type),
+        Int16 => pack_numeric_to_dictionary::<K, Int16Type>(array, dict_value_type),
+        Int32 => pack_numeric_to_dictionary::<K, Int32Type>(array, dict_value_type),
+        Int64 => pack_numeric_to_dictionary::<K, Int64Type>(array, dict_value_type),
+        UInt8 => pack_numeric_to_dictionary::<K, UInt8Type>(array, dict_value_type),
+        UInt16 => pack_numeric_to_dictionary::<K, UInt16Type>(array, dict_value_type),
+        UInt32 => pack_numeric_to_dictionary::<K, UInt32Type>(array, dict_value_type),
+        UInt64 => pack_numeric_to_dictionary::<K, UInt64Type>(array, dict_value_type),
+        Utf8 => pack_string_to_dictionary::<K>(array),
+        _ => Err(ArrowError::ComputeError(format!(
+            "Internal Error: Unsupported output type for dictionary packing: {:?}",
+            dict_value_type
+        ))),
+    }
+}
+
+// Packs the data from the primitive array of type <V> to a
+// DictionaryArray with keys of type K and values of value_type V
+fn pack_numeric_to_dictionary<K, V>(
+    array: &ArrayRef,
+    dict_value_type: &DataType,
+) -> Result<ArrayRef>
+where
+    K: ArrowDictionaryKeyType,
+    V: ArrowNumericType,
+{
+    // attempt to cast the source array values to the target value type (the dictionary values type)
+    let cast_values = cast(array, &dict_value_type)?;
+    let values = cast_values
+        .as_any()
+        .downcast_ref::<PrimitiveArray<V>>()
+        .unwrap();
+
+    let keys_builder = PrimitiveBuilder::<K>::new(values.len());
+    let values_builder = PrimitiveBuilder::<V>::new(values.len());
+    let mut b = PrimitiveDictionaryBuilder::new(keys_builder, values_builder);
+
+    // copy each element one at a time
+    for i in 0..values.len() {
+        if values.is_null(i) {
+            b.append_null()?;
+        } else {
+            b.append(values.value(i))?;
+        }
+    }
+    Ok(Arc::new(b.finish()))
+}
+
+// Packs the data as a StringDictionaryArray, if possible, with the
+// key types of K
+fn pack_string_to_dictionary<K>(array: &ArrayRef) -> Result<ArrayRef>

Review comment:
       We can look at it in follow-ups




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8346: ARROW-10164: [Rust] Add support for DictionaryArray to cast kernel

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8346:
URL: https://github.com/apache/arrow/pull/8346#issuecomment-705497069


   > pretty-print' feature is not enabled by default for arrow and thus I can't use it in the tests. 
   
   I thought about this last night and came up with a better proposal: https://github.com/apache/arrow/pull/8397
   
   If that PR is merged i I can remove https://github.com/apache/arrow/commit/9f8b9ba11e9523a19ffc6449ac7a3b915b933ef7 from this PR. Or alternately we can merge this PR and I'll remove the extra copy in #8397 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8346: ARROW-10164: [Rust] Add support for DictionaryArray to cast kernel

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8346:
URL: https://github.com/apache/arrow/pull/8346#issuecomment-705636603


   I am rebasing to pick up https://github.com/apache/arrow/commit/4bbb74713c6883e8523eeeb5ac80a1e1f8521674 -- note however, that the tests in this PR use string comparisons (not arrow struct comparisons). 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove closed pull request #8346: ARROW-10164: [Rust] Add support for DictionaryArray to cast kernel

Posted by GitBox <gi...@apache.org>.
andygrove closed pull request #8346:
URL: https://github.com/apache/arrow/pull/8346


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8346: ARROW-10164: [Rust] Add support for DictionaryArray to cast kernel

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8346:
URL: https://github.com/apache/arrow/pull/8346#issuecomment-705497069






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org