
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/10 16:15:40 UTC

[GitHub] [arrow] code1704 opened a new issue, #14619: arrow dataset: how to use date.year and date.month as partitioning

code1704 opened a new issue, #14619:
URL: https://github.com/apache/arrow/issues/14619

   The data columns:
   datetime, value
   
   I want to build a dataset partitioned by something like ("date.year", "date.month"). How can I do that?
   And when filtering with something like date > "2021-09-02" and date < "2022-04-06", will it read only the partition files from 2021-09 to 2022-04?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] djouallah commented on issue #14619: arrow dataset: how to use date.year and date.month as partitioning

Posted by GitBox <gi...@apache.org>.

djouallah commented on issue #14619:
URL: https://github.com/apache/arrow/issues/14619#issuecomment-1312649527

   yes, we do have it in BigQuery and it is freaking amazing !!



Re: [I] arrow dataset: how to use date.year and date.month as partitioning [arrow]

Posted by "jmakov (via GitHub)" <gi...@apache.org>.

jmakov commented on issue #14619:
URL: https://github.com/apache/arrow/issues/14619#issuecomment-1826823302

   A bit surprised that this isn't a bigger issue. Does nobody use Arrow partitions, or do they just use other solutions?



[GitHub] [arrow] code1704 commented on issue #14619: arrow dataset: how to use date.year and date.month as partitioning

Posted by GitBox <gi...@apache.org>.

code1704 commented on issue #14619:
URL: https://github.com/apache/arrow/issues/14619#issuecomment-1327659040

   > Thank you for the clarification. You are correct that this is not supported. Just to clarify. The directories are structured something like:
   > 
   > ```
   > /year=2020/month=01/day=01/chunk-0.parquet
   > ...
   > /year=2021/month=03/day=12/chunk-0.parquet
   > ```
   > 
   > Also, it sounds like there will be cases where you do not have the day (or maybe that is the typical case), correct?
   
   Yes. And we use duckdb as the query engine: duckdb + arrow dataset. For a query like `SELECT * FROM some_dataset WHERE date BETWEEN '2015-03-05' AND '2018-04-12'`, partition pruning is expected to work.
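
To make the pruning expectation concrete, here is a minimal sketch in plain Python (not Arrow or duckdb code) of which year/month partitions a date-range filter would need to touch. The function name `partitions_for_range` is hypothetical, and the dates are the ones from the original question in this thread.

```python
from datetime import date


def partitions_for_range(start: date, end: date):
    """Return every (year, month) partition that overlaps [start, end]."""
    if start > end:
        return []
    # Count months on a single axis so year boundaries need no special case.
    first = start.year * 12 + (start.month - 1)
    last = end.year * 12 + (end.month - 1)
    return [(m // 12, m % 12 + 1) for m in range(first, last + 1)]


# Date range taken from the original question in this thread.
needed = partitions_for_range(date(2021, 9, 2), date(2022, 4, 6))
print(needed)
```

A reader that prunes correctly would open only the directories for these pairs and then apply the exact date filter to the rows inside them.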
   



[GitHub] [arrow] djouallah commented on issue #14619: arrow dataset: how to use date.year and date.month as partitioning

Posted by GitBox <gi...@apache.org>.

djouallah commented on issue #14619:
URL: https://github.com/apache/arrow/issues/14619#issuecomment-1323102531

   @westonpace I don't know how it is implemented; BigQuery doesn't expose its file format.



[GitHub] [arrow] westonpace commented on issue #14619: arrow dataset: how to use date.year and date.month as partitioning

Posted by GitBox <gi...@apache.org>.

westonpace commented on issue #14619:
URL: https://github.com/apache/arrow/issues/14619#issuecomment-1319128138

   Thank you for the clarification.  You are correct that this is not supported.  Just to clarify.  The directories are structured something like:
   
   ```
   /year=2020/month=01/day=01/chunk-0.parquet
   ...
   /year=2021/month=03/day=12/chunk-0.parquet
   ```
   
   Also, it sounds like there will be cases where you do not have the day (or maybe that is the typical case), correct?



[GitHub] [arrow] westonpace commented on issue #14619: arrow dataset: how to use date.year and date.month as partitioning

Posted by GitBox <gi...@apache.org>.

westonpace commented on issue #14619:
URL: https://github.com/apache/arrow/issues/14619#issuecomment-1316460668

   Are you using pyarrow?  Have you looked at https://arrow.apache.org/docs/python/dataset.html#writing-partitioned-data ?
   
   > And when filtering with something like date > "2021-09-02" and date < "2022-04-06", will it read only the partition files from 2021-09 to 2022-04?
   
   Yes, that should work.



[GitHub] [arrow] djouallah commented on issue #14619: arrow dataset: how to use date.year and date.month as partitioning

Posted by GitBox <gi...@apache.org>.

djouallah commented on issue #14619:
URL: https://github.com/apache/arrow/issues/14619#issuecomment-1316512016

   @westonpace I think we are talking about two different things. Ideally, Arrow should be able to partition by a date field but, instead of generating a file per day, generate a file per year. I don't think that is supported yet. A workaround is to create a new `year` field and use it as the partition column; the problem is that you then have to add it as a filter in the WHERE clause.
   
   BigQuery and, I think, Iceberg support this functionality: partition by Year(field).
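
The workaround of deriving partition columns from the date can be sketched in plain Python. This only illustrates the resulting Hive-style layout and is not Arrow's implementation; `partition_dir` and `group_rows` are hypothetical helper names, and in practice the derived `year` and `month` columns would be handed to a partitioned writer.

```python
from collections import defaultdict
from datetime import date


def partition_dir(d: date) -> str:
    """Hive-style directory for a date, partitioned by year and month."""
    return f"year={d.year}/month={d.month:02d}"


def group_rows(rows):
    """Group (date, value) rows by the partition directory they belong to."""
    groups = defaultdict(list)
    for d, value in rows:
        groups[partition_dir(d)].append((d, value))
    return dict(groups)


# Made-up sample rows, for illustration only.
rows = [
    (date(2021, 9, 2), 1.0),
    (date(2021, 9, 30), 2.0),
    (date(2022, 4, 6), 3.0),
]
groups = group_rows(rows)
for directory in sorted(groups):
    print(directory, len(groups[directory]))
```

Because the partition keys are separate columns, a filter on the date alone says nothing about `year` or `month`, which is why the workaround requires repeating the range on those columns in the query.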



[GitHub] [arrow] code1704 commented on issue #14619: arrow dataset: how to use date.year and date.month as partitioning

Posted by GitBox <gi...@apache.org>.

code1704 commented on issue #14619:
URL: https://github.com/apache/arrow/issues/14619#issuecomment-1313199806

   Thanks @djouallah. What is BigQuery? Is it a feature of duckdb?

