Posted to issues@arrow.apache.org by "joocer (via GitHub)" <gi...@apache.org> on 2023/05/02 17:03:29 UTC

[GitHub] [arrow] joocer opened a new issue, #35389: Table.join no longer respecting `coalesce_keys` parameter (Python, PyArrow 12)

joocer opened a new issue, #35389:
URL: https://github.com/apache/arrow/issues/35389

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   `Table.join` has a parameter `coalesce_keys`, which the documentation describes as follows:
   
   > coalesce_keys : [bool](https://docs.python.org/3/library/stdtypes.html#bltin-boolean-values), default [True](https://docs.python.org/3/library/constants.html#True)
   > If the duplicated keys should be omitted from one of the sides in the join result.
   
   In PyArrow v11, the columns used to perform the join were retained in the resultant table when this parameter was set to `False`. However, in v12, the key column from the 'right' table (the one passed as the argument to `join`) is omitted from the result.
   
   A review of the change log for v12 doesn't suggest this change in behaviour is intentional.
   
   This change was detected by the regression test matrix for [Opteryx](https://github.com/mabel-dev/opteryx), which covers Mac, Linux and Windows, Python versions 3.8, 3.9, 3.10 and 3.11, and both Left and Inner JOINs. Every combination in the matrix shows the same behaviour.
   
   No error is observed, only a change in behaviour.
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] joocer commented on issue #35389: [Python] Table.join no longer respecting `coalesce_keys` parameter (PyArrow 12)

Posted by "joocer (via GitHub)" <gi...@apache.org>.
joocer commented on issue #35389:
URL: https://github.com/apache/arrow/issues/35389#issuecomment-1533020561

   No worries - here's a contrived snippet to demonstrate:
   
   ~~~python
   import pyarrow
   
   movie_vampires = pyarrow.Table.from_pydict(
       {
           "Movie": ["Twilight", "Interview with the Vampire", "Dracula", "Blade", "Underworld"],
           "Vampire": ["Edward Cullen", "Lestat de Lioncourt", "Count Dracula", "Blade", "Selene"],
       }
   )
   
   actors = pyarrow.Table.from_pydict(
       {
           "Character": ["Edward Cullen", "Lestat de Lioncourt", "Count Dracula", "Blade", "Selene"],
           "Actor": ["Robert Pattinson", "Tom Cruise", "Gary Oldman", "Wesley Snipes", "Kate Beckinsale"],
       }
   )
   
   movie_actors = movie_vampires.join(
       actors,
       keys=["Vampire"],
       right_keys=["Character"],
       join_type="inner",
       coalesce_keys=False,
   )
   
   print(movie_actors)
   ~~~
   
   In pyarrow 11, this is the result (note the four columns):
   
   ~~~
   Movie: string
   Vampire: string
   Character: string
   Actor: string
   ----
   Movie: [["Twilight","Interview with the Vampire","Dracula","Blade","Underworld"]]
   Vampire: [["Edward Cullen","Lestat de Lioncourt","Count Dracula","Blade","Selene"]]
   Character: [["Edward Cullen","Lestat de Lioncourt","Count Dracula","Blade","Selene"]]
   Actor: [["Robert Pattinson","Tom Cruise","Gary Oldman","Wesley Snipes","Kate Beckinsale"]]
   ~~~
   
   In pyarrow 12, this is the result (note only three columns):
   
   ~~~
   Movie: string
   Vampire: string
   Actor: string
   ----
   Movie: [["Twilight","Interview with the Vampire","Dracula","Blade","Underworld"]]
   Vampire: [["Edward Cullen","Lestat de Lioncourt","Count Dracula","Blade","Selene"]]
   Actor: [["Robert Pattinson","Tom Cruise","Gary Oldman","Wesley Snipes","Kate Beckinsale"]]
   ~~~
   
   This output is from a machine running Python 3.10.7 on Debian Buster (x86, 64-bit), but the same thing appears to happen on all the Python versions and OSes I've tried.




[GitHub] [arrow] jorisvandenbossche closed issue #35389: [Python] Table.join no longer respecting `coalesce_keys` parameter (PyArrow 12)

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche closed issue #35389: [Python] Table.join no longer respecting `coalesce_keys` parameter (PyArrow 12)
URL: https://github.com/apache/arrow/issues/35389




[GitHub] [arrow] westonpace commented on issue #35389: [Python] Table.join no longer respecting `coalesce_keys` parameter (PyArrow 12)

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35389:
URL: https://github.com/apache/arrow/issues/35389#issuecomment-1538794834

   Both pyarrow and R have been doing this on their own and so I think the motivation to add the fix to C++ directly was low.  However, now that pyarrow is using C++ more directly, it sounds like we need it.  There was a very old PR here: https://github.com/zagto/arrow/pull/1 that should provide a rough approach.  However, it is probably out of date.




[GitHub] [arrow] jorisvandenbossche commented on issue #35389: [Python] Table.join no longer respecting `coalesce_keys` parameter (PyArrow 12)

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #35389:
URL: https://github.com/apache/arrow/issues/35389#issuecomment-1532939024

   @joocer thanks for the report! That was certainly not an intentional change in behaviour, but the implementation was refactored to use a different invocation of the C++ APIs (now relying on the `pyarrow.acero` Declaration bindings). So something might have gone wrong in this refactor (and if so, unfortunately this doesn't seem to have been covered properly by our tests). 
   
   Could you provide a small reproducible code example that illustrates the change in behaviour?




[GitHub] [arrow] jorisvandenbossche commented on issue #35389: [Python] Table.join no longer respecting `coalesce_keys` parameter (PyArrow 12)

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #35389:
URL: https://github.com/apache/arrow/issues/35389#issuecomment-1539738426

   While I think it would certainly be nice to move this coalesce logic into C++, in this case it was just a small oversight in my refactor of the Python bindings that caused it (I always passed the selection of columns to the join node, instead of only when `coalesce_keys=True`).
   PR to fix this -> https://github.com/apache/arrow/pull/35505
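
The kind of oversight described here can be illustrated with a small pure-Python sketch. This is not the PyArrow implementation: `output_columns` is a hypothetical helper that only models which column names a join emits, assuming the right-hand key columns are dropped only when `coalesce_keys` is true.

```python
def output_columns(left_cols, right_cols, right_keys, coalesce_keys=True):
    """Model the column names a join emits (hypothetical helper, not PyArrow code)."""
    if coalesce_keys:
        # Drop the duplicated key columns from the right side only when asked to.
        right_cols = [c for c in right_cols if c not in right_keys]
    return list(left_cols) + list(right_cols)


left = ["Movie", "Vampire"]
right = ["Character", "Actor"]

# Coalesced: the right-hand key column is omitted.
print(output_columns(left, right, ["Character"], coalesce_keys=True))
# Not coalesced: both key columns are kept.
print(output_columns(left, right, ["Character"], coalesce_keys=False))
```

Applying the filter unconditionally, regardless of `coalesce_keys`, would reproduce the three-column result reported for pyarrow 12.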

