Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/08 17:46:52 UTC

[GitHub] [arrow] sanjibansg commented on pull request #12530: ARROW-14612: [C++] Support for filename-based partitioning

sanjibansg commented on pull request #12530:
URL: https://github.com/apache/arrow/pull/12530#issuecomment-1062040407


   > @westonpace asked me to review this as I opened the ticket originally based on a user-request. My main criteria for "does this do what the original user had in mind" is "can we **read** from a directory of files in which sections of the filenames are variables we want to analyse in our data" - and it looks like this both does that and enables us to write these files as well, which is really cool!
   > 
   > One thing I do want to check though - if I have a load of files called, e.g. `foo_bar_whatever_month_year.csv`, is there a way I can just have `month` and `year` as variables without the `foo`, `bar`, and `whatever` or would I have to read them in as variables and then just drop those columns later?
   
   Yes, we would have to read them in as variables and then drop those columns later. Currently, with this PR, the entire filename (excluding the last part, e.g. `part-0.parquet` or `chunk-0.parquet`) is expected to consist of the partitioning values separated by `_`. In the future, we may need to add support for a custom separator rather than only the underscore.
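
To make the "read everything, then drop" workflow concrete, here is a rough pure-Python sketch of the splitting rule described above. It is only an illustration, not the code in this PR: the helper names (`parse_filename_partitions`, `select_fields`), the placeholder field names `a`, `b`, `c`, and the example filename are all made up, and it assumes the values stay as strings with no type casting.

```python
from pathlib import PurePosixPath


def parse_filename_partitions(path, field_names, separator="_"):
    """Map each separator-delimited value in the filename to a field name.

    Hypothetical helper: the segment after the last separator
    (e.g. "part-0.parquet") is discarded, as described in the comment above.
    """
    name = PurePosixPath(path).name
    prefix, sep, _last_part = name.rpartition(separator)
    if not sep:
        raise ValueError(f"no {separator!r} found in filename {name!r}")
    values = prefix.split(separator)
    if len(values) != len(field_names):
        raise ValueError(
            f"expected {len(field_names)} values, found {len(values)} in {name!r}"
        )
    return dict(zip(field_names, values))


def select_fields(record, keep):
    """Keep only the wanted fields, i.e. drop the other columns afterwards."""
    return {key: record[key] for key in keep}


# Every segment has to be named, even the ones we do not care about.
record = parse_filename_partitions(
    "foo_bar_whatever_03_2022_part-0.parquet",
    ["a", "b", "c", "month", "year"],
)
wanted = select_fields(record, ["month", "year"])
print(wanted)
```

The `separator` argument is there only to show what a configurable separator could look like; per the comment above, the PR itself only handles the underscore.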


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org