You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by GitBox <gi...@apache.org> on 2023/01/10 04:20:44 UTC

[GitHub] [arrow] heronshoes opened a new issue, #15287: [Ruby] Add option to keep/merge join keys in Table#join

heronshoes opened a new issue, #15287:
URL: https://github.com/apache/arrow/issues/15287

   ### Describe the enhancement requested
   
   #### Target method
   
   Arrow::Table#join
   
   #### Proposed feature
   
   Add an option to select whether to keep or to merge columns in the join result.
   
   #### Impact of this request
   
   Current behavior is always keep every columns. But it will be convenient to set merge as a default.
   
   For example, default in dplyr is `keep = FALSE` (Do not preserve both join keys).
   
   #### Reference of this request
   
   dplyr-join in R.
   
   https://github.com/apache/arrow/blob/master/r/R/dplyr-join.R
   
   Join keys are always merged in RedAmber (without an option).
   
   https://github.com/heronshoes/red_amber/blob/main/lib/red_amber/data_frame_combinable.rb#L836-L842
   
   
   ### Component(s)
   
   Ruby


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] heronshoes commented on issue #15287: [Ruby] Add option to keep/merge join keys in Table#join

Posted by GitBox <gi...@apache.org>.
heronshoes commented on issue #15287:
URL: https://github.com/apache/arrow/issues/15287#issuecomment-1379711803

   I will add candidates for natural keys (`keys=nil`) case.
   ```ruby
   puts left.join(right, keys=nil, type: :full_outer)  # if keep them      (4)
   #=>
   	    id	region_id	description	    id	description
   0	     1	        7	A          	(null)	     (null)
   1	     2	        8	B          	(null)	     (null)
   2	     3	        9	C          	(null)	     (null)
   3	(null)	   (null)	     (null)	     1	D          
   4	(null)	   (null)	     (null)	     2	E          
   5	(null)	   (null)	     (null)	     4	F
   
   #=> or merge them if they are natural keys.      (5)
   	id	region_id	description
   0	 1	        7	A          
   1	 2	        8	B          
   2	 3	        9	C          
   3	 1	   (null)	D          
   4	 2	   (null)	E          
   5	 4	   (null)	F
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou closed issue #15287: [Ruby] Add option to keep/merge join keys in Table#join

Posted by GitBox <gi...@apache.org>.
kou closed issue #15287: [Ruby] Add option to keep/merge join keys in Table#join
URL: https://github.com/apache/arrow/issues/15287


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] heronshoes commented on issue #15287: [Ruby] Add option to keep/merge join keys in Table#join

Posted by GitBox <gi...@apache.org>.
heronshoes commented on issue #15287:
URL: https://github.com/apache/arrow/issues/15287#issuecomment-1378781524

   I think it will work as long as C++ join() is correct :-)
   
   ```ruby
   left = Arrow::Table.new(KEY: ["A", "B", nil], X1: [1, 2, 3])
   right = Arrow::Table.new(KEY: ["A", "C", nil], X2: [4, 5, 6])
   
   left.join(right, :KEY, type: :left_outer)
   #=>
   #<Arrow::Table:0x116f5d158 ptr=0x7fa61e50bc50>
   	KEY	X1	KEY	    X2
   0	A  	 1	A  	     4
   1	B  	 2	(null)	(null)
   2	(null)	 3	(null)	(null)
   
   left.join(right, :KEY, type: :right_outer)
   #=>
   #<Arrow::Table:0x116f51bf0 ptr=0x7fa61e50bd40>
   	KEY	    X1	KEY	X2
   0	A  	     1	A  	 4
   1	(null)	(null)	C  	 5
   2	(null)	(null)	(null)	 6
   
   left.join(right, :KEY, type: :full_outer)
   #=>
   #<Arrow::Table:0x116f18378 ptr=0x7fa61e06f0e0>
   	KEY	    X1	KEY	    X2
   0	A  	     1	A  	     4
   1	B  	     2	(null)	(null)
   2	(null)	     3	(null)	(null)
   3	(null)	(null)	C  	     5
   4	(null)	(null)	(null)	     6
   
   left.join(right, :KEY, type: :inner)
   #=>
   #<Arrow::Table:0x11a048128 ptr=0x7fa61d5edd10>
   	KEY	X1	KEY	X2
   0	A  	 1	A  	 4
   
   left.join(right, :KEY, type: :left_semi)
   #=>
   #<Arrow::Table:0x11a0ea748 ptr=0x7fa61d531cb0>
   	KEY	X1
   0	A  	 1
   
   left.join(right, :KEY, type: :right_semi)
   #=>
   #<Arrow::Table:0x11a0e2c00 ptr=0x7fa61d581190>
   	KEY	X2
   0	A  	 4
   
   left.join(right, :KEY, type: :left_anti)
   #=>
   #<Arrow::Table:0x11a0cbb40 ptr=0x7fa61e3dde50>
   	KEY	X1
   0	B  	 2
   1	(null)	 3
   
   left.join(right, :KEY, type: :right_anti)
   #=>
   #<Arrow::Table:0x11a0c56f0 ptr=0x7fa61e3ddee0>
   	KEY	X2
   0	C  	 5
   1	(null)	 6
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] heronshoes commented on issue #15287: [Ruby] Add option to keep/merge join keys in Table#join

Posted by GitBox <gi...@apache.org>.
heronshoes commented on issue #15287:
URL: https://github.com/apache/arrow/issues/15287#issuecomment-1381931541

   That's correct. Sorry for the confusion.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] heronshoes commented on issue #15287: [Ruby] Add option to keep/merge join keys in Table#join

Posted by GitBox <gi...@apache.org>.
heronshoes commented on issue #15287:
URL: https://github.com/apache/arrow/issues/15287#issuecomment-1379697006

   @westonpace I think non-key common columns (`description`) will be preserved. Arrow's Table allows duplicated column names.
   
   ```ruby
   left  = Arrow::Table.new(id: [1, 2, 3], region_id: [7, 8, 9], description: %w[A B C])
   right = Arrow::Table.new(id: [1, 2, 4],                       description: %w[D E F])
   
   puts left.join(right, :id, type: :full_outer)   # current implementation (1)
   puts left.join(right, [:id], type: :full_outer) # is same as this (2)
   #=>
   	    id	region_id	description	    id	description
   0	     1	        7	A          	     1	D          
   1	     2	        8	B          	     2	E          
   2	     3	        9	C          	(null)	     (null)
   3	(null)	   (null)	     (null)	     4	F
   
   puts left.join(right, :id, type: :full_outer) # this will merge key (3)
   
   	id	region_id	description	description
   0	 1	        7	A          	D          
   1	 2	        8	B          	E          
   2	 3	        9	C          	     (null)
   3	 4	   (null)	     (null)	F
   ```
   
   @kou Is above same as your thought?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on issue #15287: [Ruby] Add option to keep/merge join keys in Table#join

Posted by GitBox <gi...@apache.org>.
kou commented on issue #15287:
URL: https://github.com/apache/arrow/issues/15287#issuecomment-1381438649

   I think that (2) is needless. I think that the current implementation already implements (2).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on issue #15287: [Ruby] Add option to keep/merge join keys in Table#join

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #15287:
URL: https://github.com/apache/arrow/issues/15287#issuecomment-1379600810

   One thing to consider is that it is possible for the left and right side to have a non-key column that has the same name.  For example:
   
   ```
   CREATE TABLE products (
       id int,
       region_id int,
       description string,
   ); 
   CREATE TABLE regions (
       id int,
       description string
   ); 
   SELECT * FROM products JOIN regions ON id;
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] heronshoes commented on issue #15287: [Ruby] Add option to keep/merge join keys in Table#join

Posted by GitBox <gi...@apache.org>.
heronshoes commented on issue #15287:
URL: https://github.com/apache/arrow/issues/15287#issuecomment-1380365483

   Thank you! I will try to make PR with (2), (3) and (5).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on issue #15287: [Ruby] Add option to keep/merge join keys in Table#join

Posted by GitBox <gi...@apache.org>.
kou commented on issue #15287:
URL: https://github.com/apache/arrow/issues/15287#issuecomment-1378348469

   How about `left_suffix=nil, right_suffix=nil` like pandas?
   https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html
   
   If both of `left_suffix` and `right_suffix` are `nil`, we don't return duplicated keys.
   If any of `left_suffix` and `right_suffix` are not `nil`, we return both keys in left and right with suffix.
   
   For merging, how about `merge_key_values=true`? If both `left_suffix` and `right_suffix` are `nil` and `merge_key_values` is `true`, fill NULL values in a left key with values in a right key. (Hmm. I'm not sure this is suitable behavior. If left key has multiple NULL values and outer join is used, can we distinct NULL values in left key and NULL values caused by join?)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on issue #15287: [Ruby] Add option to keep/merge join keys in Table#join

Posted by GitBox <gi...@apache.org>.
kou commented on issue #15287:
URL: https://github.com/apache/arrow/issues/15287#issuecomment-1379511152

   > Is it better to ignore `left_suffix` and `right_suffix` when `merge_key_values` is `true`?
   > I think only one key is used when column is merged. It may be left_key and do not need renaming.
   
   OK. Then, how about `table.join(right, :KEY)` merges `KEY` values instead of adding new `merge_key_values` option? (`table.join(right, [:KEY])` and `table.join(right, {left: :KEY, right: :KEY})`  don't  merge.)
   
   > Are default values `left_suffix="", right_suffix="", merge_key_values=false` to ensure backward compatibility?
   
   Yes but we can change the default `keys`. If `column_names & right.column_names` only have 1 element, we can use `(column_names & right.column_names)[0]` to merge key values by default. Because `keys=nil` isn't released yet. (We don't need to think about backward compatibility.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on issue #15287: [Ruby] Add option to keep/merge join keys in Table#join

Posted by GitBox <gi...@apache.org>.
kou commented on issue #15287:
URL: https://github.com/apache/arrow/issues/15287#issuecomment-1380187282

   > I will add candidates for natural keys (`keys=nil`) case.
   
   I didn't know `NATURAL JOIN` in SQL!
   
   We can choose (5) because it's same behavior as SQL's `NATURAL JOIN`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on issue #15287: [Ruby] Add option to keep/merge join keys in Table#join

Posted by GitBox <gi...@apache.org>.
kou commented on issue #15287:
URL: https://github.com/apache/arrow/issues/15287#issuecomment-1378349478

   `left_suffix="", right_suffix=""` that generates the same name columns may be better to keep backward compatibility...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] heronshoes commented on issue #15287: [Ruby] Add option to keep/merge join keys in Table#join

Posted by GitBox <gi...@apache.org>.
heronshoes commented on issue #15287:
URL: https://github.com/apache/arrow/issues/15287#issuecomment-1378799296

   Is it better to ignore `left_suffix` and `right_suffix` when `merge_key_values` is `true`?
   I think only one key is used when column is merged. It may be left_key and do not need renaming.
   
   Are default values `left_suffix="", right_suffix="", merge_key_values=false` ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kou commented on issue #15287: [Ruby] Add option to keep/merge join keys in Table#join

Posted by GitBox <gi...@apache.org>.
kou commented on issue #15287:
URL: https://github.com/apache/arrow/issues/15287#issuecomment-1380176972

   > @kou Is above same as your thought?
   
   Yes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org