You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by GitBox <gi...@apache.org> on 2023/01/10 04:20:44 UTC
[GitHub] [arrow] heronshoes opened a new issue, #15287: [Ruby] Add option to keep/merge join keys in Table#join
heronshoes opened a new issue, #15287:
URL: https://github.com/apache/arrow/issues/15287
### Describe the enhancement requested
#### Target method
Arrow::Table#join
#### Proposed feature
Add an option to select whether to keep or to merge columns in the join result.
#### Impact of this request
Current behavior is always keep every columns. But it will be convenient to set merge as a default.
For example, default in dplyr is `keep = FALSE` (Do not preserve both join keys).
#### Reference of this request
dplyr-join in R.
https://github.com/apache/arrow/blob/master/r/R/dplyr-join.R
Join keys are always merged in RedAmber (without an option).
https://github.com/heronshoes/red_amber/blob/main/lib/red_amber/data_frame_combinable.rb#L836-L842
### Component(s)
Ruby
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] heronshoes commented on issue #15287: [Ruby] Add option to keep/merge join keys in Table#join
Posted by GitBox <gi...@apache.org>.
heronshoes commented on issue #15287:
URL: https://github.com/apache/arrow/issues/15287#issuecomment-1379711803
I will add candidates for natural keys (`keys=nil`) case.
```ruby
puts left.join(right, keys=nil, type: :full_outer) # if keep them (4)
#=>
id region_id description id description
0 1 7 A (null) (null)
1 2 8 B (null) (null)
2 3 9 C (null) (null)
3 (null) (null) (null) 1 D
4 (null) (null) (null) 2 E
5 (null) (null) (null) 4 F
#=> or merge them if they are natural keys. (5)
id region_id description
0 1 7 A
1 2 8 B
2 3 9 C
3 1 (null) D
4 2 (null) E
5 4 (null) F
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] kou closed issue #15287: [Ruby] Add option to keep/merge join keys in Table#join
Posted by GitBox <gi...@apache.org>.
kou closed issue #15287: [Ruby] Add option to keep/merge join keys in Table#join
URL: https://github.com/apache/arrow/issues/15287
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] heronshoes commented on issue #15287: [Ruby] Add option to keep/merge join keys in Table#join
Posted by GitBox <gi...@apache.org>.
heronshoes commented on issue #15287:
URL: https://github.com/apache/arrow/issues/15287#issuecomment-1378781524
I think it will work as long as C++ join() is correct :-)
```ruby
left = Arrow::Table.new(KEY: ["A", "B", nil], X1: [1, 2, 3])
right = Arrow::Table.new(KEY: ["A", "C", nil], X2: [4, 5, 6])
left.join(right, :KEY, type: :left_outer)
#=>
#<Arrow::Table:0x116f5d158 ptr=0x7fa61e50bc50>
KEY X1 KEY X2
0 A 1 A 4
1 B 2 (null) (null)
2 (null) 3 (null) (null)
left.join(right, :KEY, type: :right_outer)
#=>
#<Arrow::Table:0x116f51bf0 ptr=0x7fa61e50bd40>
KEY X1 KEY X2
0 A 1 A 4
1 (null) (null) C 5
2 (null) (null) (null) 6
left.join(right, :KEY, type: :full_outer)
#=>
#<Arrow::Table:0x116f18378 ptr=0x7fa61e06f0e0>
KEY X1 KEY X2
0 A 1 A 4
1 B 2 (null) (null)
2 (null) 3 (null) (null)
3 (null) (null) C 5
4 (null) (null) (null) 6
left.join(right, :KEY, type: :inner)
#=>
#<Arrow::Table:0x11a048128 ptr=0x7fa61d5edd10>
KEY X1 KEY X2
0 A 1 A 4
left.join(right, :KEY, type: :left_semi)
#=>
#<Arrow::Table:0x11a0ea748 ptr=0x7fa61d531cb0>
KEY X1
0 A 1
left.join(right, :KEY, type: :right_semi)
#=>
#<Arrow::Table:0x11a0e2c00 ptr=0x7fa61d581190>
KEY X2
0 A 4
left.join(right, :KEY, type: :left_anti)
#=>
#<Arrow::Table:0x11a0cbb40 ptr=0x7fa61e3dde50>
KEY X1
0 B 2
1 (null) 3
left.join(right, :KEY, type: :right_anti)
#=>
#<Arrow::Table:0x11a0c56f0 ptr=0x7fa61e3ddee0>
KEY X2
0 C 5
1 (null) 6
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] heronshoes commented on issue #15287: [Ruby] Add option to keep/merge join keys in Table#join
Posted by GitBox <gi...@apache.org>.
heronshoes commented on issue #15287:
URL: https://github.com/apache/arrow/issues/15287#issuecomment-1381931541
That's correct. Sorry for the confusion.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] heronshoes commented on issue #15287: [Ruby] Add option to keep/merge join keys in Table#join
Posted by GitBox <gi...@apache.org>.
heronshoes commented on issue #15287:
URL: https://github.com/apache/arrow/issues/15287#issuecomment-1379697006
@westonpace I think non-key common columns (`description`) will be preserved. Arrow's Table allows duplicated column names.
```ruby
left = Arrow::Table.new(id: [1, 2, 3], region_id: [7, 8, 9], description: %w[A B C])
right = Arrow::Table.new(id: [1, 2, 4], description: %w[D E F])
puts left.join(right, :id, type: :full_outer) # current implementation (1)
puts left.join(right, [:id], type: :full_outer) # is same as this (2)
#=>
id region_id description id description
0 1 7 A 1 D
1 2 8 B 2 E
2 3 9 C (null) (null)
3 (null) (null) (null) 4 F
puts left.join(right, :id, type: :full_outer) # this will merge key (3)
id region_id description description
0 1 7 A D
1 2 8 B E
2 3 9 C (null)
3 4 (null) (null) F
```
@kou Is above same as your thought?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] kou commented on issue #15287: [Ruby] Add option to keep/merge join keys in Table#join
Posted by GitBox <gi...@apache.org>.
kou commented on issue #15287:
URL: https://github.com/apache/arrow/issues/15287#issuecomment-1381438649
I think that (2) is needless. I think that the current implementation already implements (2).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] westonpace commented on issue #15287: [Ruby] Add option to keep/merge join keys in Table#join
Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #15287:
URL: https://github.com/apache/arrow/issues/15287#issuecomment-1379600810
One thing to consider is that it is possible for the left and right side to have a non-key column that has the same name. For example:
```
CREATE TABLE products (
id int,
region_id int,
description string,
);
CREATE TABLE regions (
id int,
description string
);
SELECT * FROM products JOIN regions ON id;
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] heronshoes commented on issue #15287: [Ruby] Add option to keep/merge join keys in Table#join
Posted by GitBox <gi...@apache.org>.
heronshoes commented on issue #15287:
URL: https://github.com/apache/arrow/issues/15287#issuecomment-1380365483
Thank you! I will try to make PR with (2), (3) and (5).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] kou commented on issue #15287: [Ruby] Add option to keep/merge join keys in Table#join
Posted by GitBox <gi...@apache.org>.
kou commented on issue #15287:
URL: https://github.com/apache/arrow/issues/15287#issuecomment-1378348469
How about `left_suffix=nil, right_suffix=nil` like pandas?
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html
If both of `left_suffix` and `right_suffix` are `nil`, we don't return duplicated keys.
If any of `left_suffix` and `right_suffix` are not `nil`, we return both keys in left and right with suffix.
For merging, how about `merge_key_values=true`? If both `left_suffix` and `right_suffix` are `nil` and `merge_key_values` is `true`, fill NULL values in a left key with values in a right key. (Hmm. I'm not sure this is suitable behavior. If left key has multiple NULL values and outer join is used, can we distinct NULL values in left key and NULL values caused by join?)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] kou commented on issue #15287: [Ruby] Add option to keep/merge join keys in Table#join
Posted by GitBox <gi...@apache.org>.
kou commented on issue #15287:
URL: https://github.com/apache/arrow/issues/15287#issuecomment-1379511152
> Is it better to ignore `left_suffix` and `right_suffix` when `merge_key_values` is `true`?
> I think only one key is used when column is merged. It may be left_key and do not need renaming.
OK. Then, how about `table.join(right, :KEY)` merges `KEY` values instead of adding new `merge_key_values` option? (`table.join(right, [:KEY])` and `table.join(right, {left: :KEY, right: :KEY})` don't merge.)
> Are default values `left_suffix="", right_suffix="", merge_key_values=false` to ensure backward compatibility?
Yes but we can change the default `keys`. If `column_names & right.column_names` only have 1 element, we can use `(column_names & right.column_names)[0]` to merge key values by default. Because `keys=nil` isn't released yet. (We don't need to think about backward compatibility.)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] kou commented on issue #15287: [Ruby] Add option to keep/merge join keys in Table#join
Posted by GitBox <gi...@apache.org>.
kou commented on issue #15287:
URL: https://github.com/apache/arrow/issues/15287#issuecomment-1380187282
> I will add candidates for natural keys (`keys=nil`) case.
I didn't know `NATURAL JOIN` in SQL!
We can choose (5) because it's same behavior as SQL's `NATURAL JOIN`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] kou commented on issue #15287: [Ruby] Add option to keep/merge join keys in Table#join
Posted by GitBox <gi...@apache.org>.
kou commented on issue #15287:
URL: https://github.com/apache/arrow/issues/15287#issuecomment-1378349478
`left_suffix="", right_suffix=""` that generates the same name columns may be better to keep backward compatibility...
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] heronshoes commented on issue #15287: [Ruby] Add option to keep/merge join keys in Table#join
Posted by GitBox <gi...@apache.org>.
heronshoes commented on issue #15287:
URL: https://github.com/apache/arrow/issues/15287#issuecomment-1378799296
Is it better to ignore `left_suffix` and `right_suffix` when `merge_key_values` is `true`?
I think only one key is used when column is merged. It may be left_key and do not need renaming.
Are default values `left_suffix="", right_suffix="", merge_key_values=false` ?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] kou commented on issue #15287: [Ruby] Add option to keep/merge join keys in Table#join
Posted by GitBox <gi...@apache.org>.
kou commented on issue #15287:
URL: https://github.com/apache/arrow/issues/15287#issuecomment-1380176972
> @kou Is above same as your thought?
Yes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org