Posted to github@arrow.apache.org by "StuartHadfield (via GitHub)" <gi...@apache.org> on 2023/02/02 00:30:53 UTC

[GitHub] [arrow] StuartHadfield opened a new issue, #33930: [Python] Partition by Split Date Column

StuartHadfield opened a new issue, #33930:
URL: https://github.com/apache/arrow/issues/33930

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   Suppose I have a table that includes a `date` column, and I want to partition in the form:
   
   `year=2011/month=10/day=26/part-0.parquet`
   
   For writing a dataset, how do I accomplish this? Is the only option to preprocess the table prior to writing?
   
   For example:
   
   ```py
   import pyarrow as pa
   from pyarrow import dataset as ds
   from datetime import datetime
   
   schema = pa.schema([('foo', pa.string()), ('date', pa.date32())])
   my_batch = pa.RecordBatch.from_pylist(
     [
       {'foo': 'bar', 'date': datetime(2022, 1, 1)},
       {'foo': 'baz', 'date': datetime(2022, 1, 2)}
     ],
     schema=schema,
   )
   
   
   ds.write_dataset(
     my_batch,
     base_dir='./',
     format='parquet',
     partitioning=['date'], # Presumably I can split this here? Maybe preprocess the date column, or pass a schema of some sort?
     partitioning_flavor='hive',
   )
   ```
   
   Naturally right now, I'll get partitions like `date=2022-01-01/part-0.parquet`, which isn't what I want.
   
   If the answer is just to preprocess my source data - that's okay. I'm just finding the docs on partitioning a little confusing. Thanks!
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] StuartHadfield closed issue #33930: [Python] Partition by Split Date Column

Posted by "StuartHadfield (via GitHub)" <gi...@apache.org>.
StuartHadfield closed issue #33930: [Python] Partition by Split Date Column
URL: https://github.com/apache/arrow/issues/33930




[GitHub] [arrow] westonpace commented on issue #33930: [Python] Partition by Split Date Column

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #33930:
URL: https://github.com/apache/arrow/issues/33930#issuecomment-1409311984

   That is the current answer. There are no built-in utilities right now for what you want. I think this might be a duplicate of https://github.com/apache/arrow/issues/14619. Could you take a look?




[GitHub] [arrow] westonpace commented on issue #33930: [Python] Partition by Split Date Column

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #33930:
URL: https://github.com/apache/arrow/issues/33930#issuecomment-1412962084

   Most of the logic would belong in [partition.h](https://github.com/apache/arrow/blob/master/cpp/src/arrow/dataset/partition.h) / partition.cc.  However, it would require some understanding of Arrow core concepts, datasets, and compute expressions so I won't say that it would be easy :)
   
   Closing as a duplicate of #14619




[GitHub] [arrow] westonpace commented on issue #33930: [Python] Partition by Split Date Column

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #33930:
URL: https://github.com/apache/arrow/issues/33930#issuecomment-1412969025

   Duplicate of #14169




[GitHub] [arrow] StuartHadfield commented on issue #33930: [Python] Partition by Split Date Column

Posted by "StuartHadfield (via GitHub)" <gi...@apache.org>.
StuartHadfield commented on issue #33930:
URL: https://github.com/apache/arrow/issues/33930#issuecomment-1410070387

   Ah - yeah that looks about the same. If you want to mark this as a duplicate and close I'm cool with that.
   
   In the interim, I can get away with preprocessing, but more broadly I think this would be a pretty useful feature, so if there's a chance it could make it onto a roadmap, that'd be great! I'd offer to contribute myself, but after reading (or at least trying to read) some of the source code, I think implementing this feature would be a bit beyond me haha.
   
   Thanks @westonpace!




[GitHub] [arrow] westonpace closed issue #33930: [Python] Partition by Split Date Column

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace closed issue #33930: [Python] Partition by Split Date Column
URL: https://github.com/apache/arrow/issues/33930




[GitHub] [arrow] westonpace commented on issue #33930: [Python] Partition by Split Date Column

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #33930:
URL: https://github.com/apache/arrow/issues/33930#issuecomment-1412971081

   Ah...that took an embarrassingly long time to figure out :cold_sweat:  




[GitHub] [arrow] westonpace commented on issue #33930: [Python] Partition by Split Date Column

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #33930:
URL: https://github.com/apache/arrow/issues/33930#issuecomment-1412963312

   duplicate of #14169




[GitHub] [arrow] StuartHadfield commented on issue #33930: [Python] Partition by Split Date Column

Posted by "StuartHadfield (via GitHub)" <gi...@apache.org>.
StuartHadfield commented on issue #33930:
URL: https://github.com/apache/arrow/issues/33930#issuecomment-1438120614

   @westonpace - I've closed - but just so you're aware - you've marked this as a duplicate of https://github.com/apache/arrow/pull/14169 and not https://github.com/apache/arrow/issues/14169 (PR vs Issue)!

