Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2022/04/12 20:09:00 UTC

[jira] [Commented] (ARROW-16172) [C++] cast when reasonable for join keys

    [ https://issues.apache.org/jira/browse/ARROW-16172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521318#comment-17521318 ] 

Weston Pace commented on ARROW-16172:
-------------------------------------

So all the auto-casting we do today happens inside of expression evaluation, which I suppose is why it isn't handled in the join.  In expression evaluation we are looking for a combination of kernel and casts.  Here we would just be looking for a "least common ancestor" of sorts.  It is an easier problem, but it is a slightly different one, and so we don't have an exact function for it.  I'll attempt to transcribe a rough approximation of our current rules so we can see if they make sense:

* The current casting mechanism dictionary decodes everything so the common ancestor of Dictionary<int8, string> and Dictionary<int8, large_string> is large_string and not Dictionary<int8, large_string>.
** I don't know whether the hash join node decodes dictionaries or not, but that decision should be up to the hash join node itself, so I think the "least common ancestor" logic should not automatically decode.
* We never automatically cast from string to integer or boolean to integer.  I could see some people arguing these are valid casts but I think the majority would agree with the decisions here.
* null + x => x
* decimal + float => float
* decimal + int => decimal
* decimal128 + decimal256 => decimal256
* decimal<A, B> + decimal<C,D> => decimal<E,F> based on rules appropriate to the function
** In this case I think the rule would be to grow as much as possible and sacrifice scale if needed.
*** decimal<20,6> + decimal<10,8> => decimal<22,8>
*** decimal<20,10> + decimal<36,6> => decimal<38,8>
* float + int => float
* int + uint => unsigned next power of 2 (e.g. int32 + uint32 => uint64)
** Caps at uint64 so int64 + uint64 => uint64
* int + int / uint + uint => widest type (e.g. int32 + int16 => int32)
* temporal + temporal => finest resolution (e.g. timestamp<s> + timestamp<ms> => timestamp<ms>)
** Crossing of temporal types is not allowed (e.g. cannot go from date to duration or from duration to timestamp)
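To make the rules above concrete, here is a minimal Python sketch of the "least common ancestor" resolution for a few of the cases.  The helper names (`common_decimal`, `common_int`, `common_timestamp`) are hypothetical, not Arrow's actual API; the decimal helper follows the "grow as much as possible, sacrifice scale if needed" rule, with precision capped at 38 (decimal256's limit is simplified here):

```python
# Hypothetical sketch of "least common ancestor" join-key type resolution.
# These are NOT Arrow functions; types are modeled as plain ints/strings.

def common_decimal(p1, s1, p2, s2, max_precision=38):
    """Combine decimal<p1,s1> and decimal<p2,s2>: keep enough integer
    digits for both operands, take the larger scale, and sacrifice scale
    if the combined precision would exceed the cap."""
    int_digits = max(p1 - s1, p2 - s2)   # integer digits needed
    scale = max(s1, s2)                  # grow scale as much as possible
    if int_digits + scale > max_precision:
        scale = max_precision - int_digits   # sacrifice scale
    return (min(int_digits + scale, max_precision), scale)

def common_int(signed_bits, unsigned_bits):
    """signed + unsigned => unsigned next power-of-two width, capped at
    64 bits (so int64 + uint64 => uint64)."""
    bits = max(signed_bits, unsigned_bits)
    return f"uint{min(bits * 2, 64)}"

RESOLUTIONS = ["s", "ms", "us", "ns"]

def common_timestamp(u1, u2):
    """timestamp + timestamp => finest resolution of the two."""
    return RESOLUTIONS[max(RESOLUTIONS.index(u1), RESOLUTIONS.index(u2))]

print(common_decimal(20, 10, 36, 6))   # (38, 8), matching the example above
print(common_int(32, 32))              # uint64
print(common_timestamp("s", "ms"))     # ms
```

Note that `common_decimal(20, 10, 36, 6)` reproduces the decimal<20,10> + decimal<36,6> => decimal<38,8> example: 30 integer digits plus scale 10 would need precision 40, so the scale shrinks to 8 to fit the cap.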

I think these rules still hold up (excepting possibly dictionary encoding but we could solve that later).  I wouldn't be opposed to adding this to the join.

> it's a surprising UX that this doesn't work + I need to type coerce on my own for this.

In general, I think it is a slippery slope from "intuitive UX" to "invisibly inefficient queries".  There's also the "death by 1000 cuts" phenomenon of gradually and unintentionally writing our own plan optimizer that we then have to maintain.  However, this could all just be FUD.  On the face of it, I can't see anything immediately harmful with adding this functionality.

> [C++] cast when reasonable for join keys
> ----------------------------------------
>
>                 Key: ARROW-16172
>                 URL: https://issues.apache.org/jira/browse/ARROW-16172
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Jonathan Keane
>            Priority: Major
>
> Joining an integer column with a float column that happens to have whole numbers errors. For kernels, we would autocast in this circumstance, so it's a surprising UX that this doesn't work + I need to type coerce on my own for this.
> {code}
> library(arrow, warn.conflicts = FALSE)
> #> See arrow_info() for available features
> library(dplyr, warn.conflicts = FALSE)
> tab_int <- arrow_table(data.frame(let = letters, num = 1L:26L))
> tab_float <- arrow_table(data.frame(let = letters, num = as.double(1:26)))
> left_join(tab_int, tab_float) %>% collect()
> #> Error in `handle_csv_read_error()`:
> #> ! Invalid: Incompatible data types for corresponding join field keys: FieldRef.Name(num) of type int32 and FieldRef.Name(num) of type double
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)