Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/06 11:21:05 UTC

[GitHub] [arrow] jorisvandenbossche commented on pull request #10661: ARROW-8655: [C++][Python] Preserve partitioning information for a discovered Dataset

jorisvandenbossche commented on pull request #10661:
URL: https://github.com/apache/arrow/pull/10661#issuecomment-874672918


   See the JIRA for more details and motivating use cases; here is a small demo of what this exposes:
   
   ```python
   import pandas as pd
   import pyarrow as pa
   import pyarrow.parquet as pq
   import pyarrow.dataset as ds
   
   df = pd.DataFrame({"year": [2020, 2020, 2021, 2021], "month":[1, 2, 1, 2], "values": [1, 2, 3, 4]})
   df.to_parquet("test_partitioned", partition_cols=["year", "month"], engine="pyarrow")
   ```
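
   For reference: pandas with the pyarrow engine writes Hive-style `key=value` directories, which is what the `"hive"` partitioning flavour used below expects. Assuming the snippet above has been run, something like this shows the layout (the parquet file names themselves are generated automatically):
   
   ```python
   import os
   
   # Walk the freshly written dataset to show the Hive-style layout, e.g.
   # test_partitioned/year=2020/month=1/<generated>.parquet
   for root, _, files in os.walk("test_partitioned"):
       for name in files:
           print(os.path.join(root, name))
   ```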
   
   A discovered dataset now stores the Partitioning object that was created or discovered by the FileSystemDatasetFactory, so you can still access this information on the dataset:
   
   ```python
   >>> dataset = ds.dataset("test_partitioned/", partitioning="hive")
   >>> dataset
   <pyarrow._dataset.FileSystemDataset at 0x7fe481b6e270>
   >>> dataset.partitioning
   <pyarrow._dataset.HivePartitioning at 0x7fe48162c7f0>
   
   # the schema (field names and types) of the partitioning
   >>> dataset.partitioning.schema 
   year: int32
   month: int32
   >>> dataset.partitioning.schema.names
   ['year', 'month']
   # and all partition field values discovered by the factory
   >>> dataset.partitioning.dictionaries
   [<pyarrow.lib.Int32Array object at 0x7fe480fd9fa0>
    [
      2020,
      2021
    ],
    <pyarrow.lib.Int32Array object at 0x7fe480fd9b80>
    [
      1,
      2
    ]]
   ```
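
   One of the motivating use cases is being able to see which partition values exist without scanning any data files, and then using them to drive a selective read. A rough sketch continuing the session above (the `dictionaries` line up with the partitioning schema fields, as in the output just shown):
   
   ```python
   # Pair each partition field with its discovered values (no data files are read)
   for field, values in zip(dataset.partitioning.schema, dataset.partitioning.dictionaries):
       print(field.name, values.to_pylist())
   # year [2020, 2021]
   # month [1, 2]
   
   # Use the discovered values to read only the fragments for the latest year;
   # the filter is matched against the partition expressions, so the other
   # year=... directories are skipped entirely.
   latest_year = max(dataset.partitioning.dictionaries[0].to_pylist())
   table = dataset.to_table(filter=ds.field("year") == latest_year)
   ```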
   
   When inferring a dictionary type for the partitioning, the schema of the partitioning object indeed has dictionary type:
   
   ```python
   >>> dataset2 = ds.dataset("test_partitioned/", partitioning=ds.HivePartitioning.discover(infer_dictionary=True))
   >>> dataset2.partitioning.schema
   year: dictionary<values=int32, indices=int32, ordered=0>
   month: dictionary<values=int32, indices=int32, ordered=0>
   -- schema metadata --
   pandas: '{"index_columns": [], "column_indexes": [{"name": null, "field_n' + 352
   ```
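
   And assuming the `dataset2` from above, the dictionary encoding of the partition fields should carry through to materialized tables as well:
   
   ```python
   # Sketch: with infer_dictionary=True the partition columns come back
   # dictionary-encoded when reading, which keeps them compact for
   # high-cardinality partition keys.
   table2 = dataset2.to_table()
   print(table2.schema.field("year").type)
   # expected: dictionary<values=int32, indices=int32, ordered=0>
   ```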
   
   When the FileSystemDataset is discovered without a partitioning, or is not created through discovery at all, the `partitioning` attribute is None:
   
   ```python
   >>> dataset3 = ds.dataset("test_partitioned/")
   # TODO this is still something to fix, right now this is a "default" partitioning (which isn't exposed in Python)
   >>> dataset3.partitioning
   ...
   ~/scipy/repos/arrow/python/pyarrow/_dataset.pyx in pyarrow._dataset.Partitioning.wrap()
   TypeError: default
   
   >>> dataset4 = ds.FileSystemDataset(list(dataset.get_fragments()), dataset.schema, dataset.format, dataset.filesystem)
   >>> dataset4.partitioning is None
   True
   ```
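
   Until the "default" partitioning case above is fixed, downstream code that wants to consume the attribute probably needs a small guard; a hypothetical helper as a sketch:
   
   ```python
   def get_partitioning(dataset):
       # dataset4 above simply returns None; dataset3 currently raises
       # TypeError for the not-yet-exposed "default" partitioning, so catch
       # both cases until that TODO is resolved.
       try:
           return dataset.partitioning
       except TypeError:
           return None
   
   assert get_partitioning(dataset4) is None
   ```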

