You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/07/14 22:17:11 UTC

[GitHub] [arrow] nealrichardson commented on pull request #7545: ARROW-9139: [Python] Switch parquet.read_table to use new datasets API by default

nealrichardson commented on pull request #7545:
URL: https://github.com/apache/arrow/pull/7545#issuecomment-658440946

Short version: I agree with Wes: do what you think best but let's merge something.

Longer thoughts:

If I understand correctly, the difference is whether or not partition fields that are integers are encoded as dictionaries in the old function and they come out as ints in the new? Semantically, they should behave the same though right?

It is possible to cast an integer column to dictionary, or at least I recall seeing a dictionary encode function. So if someone felt strongly about it, they could make their partition fields be dictionaries after the fact.

I caveat this by confessing that I've never personally used this function, but on face value I don't see why partition fields have to be dictionary type, particularly if we're only talking about integers. So if the main reason for adding something to force them to be dictionaries is to conform exactly with an old API, I'm not sure it's worth it. Maybe that's the wrong call, but since there is a workaround, why not try it without coercing everything to dictionary and see what reaction we get?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org