Posted to jira@arrow.apache.org by "Rok Mihevc (Jira)" <ji...@apache.org> on 2022/09/03 10:16:00 UTC

[jira] [Resolved] (ARROW-16695) [R][Python][C++] Extension types are not supported in joins

     [ https://issues.apache.org/jira/browse/ARROW-16695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rok Mihevc resolved ARROW-16695.
--------------------------------
    Fix Version/s: 10.0.0
       Resolution: Fixed

Issue resolved by pull request 13501
[https://github.com/apache/arrow/pull/13501]

> [R][Python][C++] Extension types are not supported in joins
> -----------------------------------------------------------
>
>                 Key: ARROW-16695
>                 URL: https://issues.apache.org/jira/browse/ARROW-16695
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python, R
>            Reporter: Dewey Dunnington
>            Assignee: Rok Mihevc
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 10.0.0
>
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> It looks like extension types are not supported in joins, even if the underlying storage type is supported. Reported by [~jonkeane] while making a demo for Arrow + query engine + geoarrow (an R package), which uses extension types liberally:
> {code:R}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> rb_non_ext <- record_batch(
>   a = 1:5, 
>   b = letters[1:5]
> )
> rb_ext_storage <- record_batch(
>   b = letters[1:5],
>   c = Array$create(list(as.raw(1:5)), type = binary())
> )
> rb_ext <- record_batch(
>   b = letters[1:5],
>   c = vctrs_extension_array(rb_ext_storage$c$as_vector())
> )
> rb_non_ext %>% 
>   left_join(rb_ext_storage) %>% 
>   collect()
> #> # A tibble: 5 × 3
> #>       a b                      c
> #>   <int> <chr>         <arrw_bnr>
> #> 1     1 a     01, 02, 03, 04, 05
> #> 2     2 b     01, 02, 03, 04, 05
> #> 3     3 c     01, 02, 03, 04, 05
> #> 4     4 d     01, 02, 03, 04, 05
> #> 5     5 e     01, 02, 03, 04, 05
> rb_non_ext %>% 
>   left_join(rb_ext) %>% 
>   collect()
> #> Error in `collect()`:
> #> ! Invalid: Data type <arrow_binary[0]> is not supported in join non-key field
> #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/hash_join_node.cc:121  ValidateSchemas(join_type, left_schema, left_keys, left_output, right_schema, right_keys, right_output, left_field_name_suffix, right_field_name_suffix)
> #> /Users/deweydunnington/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/hash_join_node.cc:499  schema_mgr->Init( join_options.join_type, left_schema, join_options.left_keys, join_options.left_output, right_schema, join_options.right_keys, join_options.right_output, join_options.filter, join_options.output_suffix_for_left, join_options.output_suffix_for_right)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)