Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/04 09:24:28 UTC

[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #12704: ARROW-15428: [Python] Address docstrings in Parquet classes and functions

jorisvandenbossche commented on code in PR #12704:
URL: https://github.com/apache/arrow/pull/12704#discussion_r841518804


##########
python/pyarrow/parquet.py:
##########
@@ -225,6 +225,64 @@ class ParquetFile:
         in nanoseconds.
     decryption_properties : FileDecryptionProperties, default None
         File decryption properties for Parquet Modular Encryption.
+
+    Examples
+    --------
+
+    Generate an example PyArrow Table and write it to Parquet file:
+
+    >>> import pandas as pd
+    >>> import pyarrow as pa
+    >>> df = pd.DataFrame({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+    ...                    'month': [3, 5, 7, 9, 11, 12],
+    ...                    'day': [1, 5, 9, 13, 17, 23],
+    ...                    'n_legs': [2, 2, 4, 4, 5, 100],
+    ...                    'animal': ["Flamingo", "Parrot", "Dog", "Horse",
+    ...                    "Brittle stars", "Centipede"]})
+    >>> table = pa.Table.from_pandas(df)
+
+    >>> import pyarrow.parquet as pq
+    >>> pq.write_table(table, 'example.parquet')
+
+    create a ParquetFile object from the Parqet file:

Review Comment:
   ```suggestion
       Create a ``ParquetFile`` object from the Parquet file:
   ```



##########
python/pyarrow/parquet.py:
##########
@@ -225,6 +225,64 @@ class ParquetFile:
         in nanoseconds.
     decryption_properties : FileDecryptionProperties, default None
         File decryption properties for Parquet Modular Encryption.
+
+    Examples
+    --------
+
+    Generate an example PyArrow Table and write it to Parquet file:
+
+    >>> import pandas as pd
+    >>> import pyarrow as pa
+    >>> df = pd.DataFrame({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+    ...                    'month': [3, 5, 7, 9, 11, 12],
+    ...                    'day': [1, 5, 9, 13, 17, 23],
+    ...                    'n_legs': [2, 2, 4, 4, 5, 100],
+    ...                    'animal': ["Flamingo", "Parrot", "Dog", "Horse",
+    ...                    "Brittle stars", "Centipede"]})
+    >>> table = pa.Table.from_pandas(df)
+
+    >>> import pyarrow.parquet as pq
+    >>> pq.write_table(table, 'example.parquet')
+
+    create a ParquetFile object from the Parqet file:
+
+    >>> parquet_file = pq.ParquetFile('example.parquet')
+
+    read the data:
+
+    >>> parquet_file.read()
+    pyarrow.Table
+    year: int64
+    month: int64
+    day: int64
+    n_legs: int64
+    animal: string
+    ----
+    year: [[2020,2022,2021,2022,2019,2021]]
+    month: [[3,5,7,9,11,12]]
+    day: [[1,5,9,13,17,23]]
+    n_legs: [[2,2,4,4,5,100]]
+    animal: [["Flamingo","Parrot","Dog","Horse","Brittle stars","Centipede"]]
+
+    create a ParquetFile object with "animals" column as DictionaryArray:

Review Comment:
   ```suggestion
       Create a ParquetFile object with the "animal" column as a DictionaryArray:
   ```
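
   For reference, a minimal sketch of what such an example could look like, assuming it uses the existing `read_dictionary` option of `ParquetFile` (the actual code in the PR is truncated above):

   ```python
   import pyarrow.parquet as pq

   # Illustration only: decode the 'animal' column as dictionary-encoded
   # data while reading the file.
   parquet_file = pq.ParquetFile('example.parquet',
                                 read_dictionary=['animal'])
   table = parquet_file.read()
   # table.column('animal').type is now a dictionary type
   ```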



##########
python/pyarrow/parquet.py:
##########
@@ -2177,13 +3031,13 @@ def write_to_dataset(table, root_path, partition_cols=None,
     Parameters
     ----------
     table : pyarrow.Table
-    root_path : str, pathlib.Path
+    root_path : str, pathlib.Pathß

Review Comment:
   ```suggestion
       root_path : str, pathlib.Path
   ```



##########
python/pyarrow/parquet.py:
##########
@@ -2201,6 +3055,47 @@ def write_to_dataset(table, root_path, partition_cols=None,
         Using `metadata_collector` in kwargs allows one to collect the
         file metadata instances of dataset pieces. The file paths in the
         ColumnChunkMetaData will be set relative to `root_path`.
+
+    Examples
+    --------
+    Generate an example PyArrow Table:
+
+    >>> import pyarrow as pa
+    >>> import pandas as pd
+    >>> df = pd.DataFrame({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+    ...                    'month': [3, 5, 7, 9, 11, 12],
+    ...                    'day': [1, 5, 9, 13, 17, 23],
+    ...                    'n_legs': [2, 2, 4, 4, 5, 100],
+    ...                    'animals': ["Flamingo", "Parrot", "Dog", "Horse",
+    ...                    "Brittle stars", "Centipede"]})
+    >>> table = pa.Table.from_pandas(df)
+
+    and write it to a partitioned dataset:
+
+    >>> import pyarrow.parquet as pq
+    >>> pq.write_to_dataset(table, root_path='dataset_name_3',
+    ...                     partition_cols=['year', 'month', 'day'],
+    ...                     use_legacy_dataset=False
+    ...                    )
+    >>> pq.ParquetDataset('dataset_name_3', use_legacy_dataset=False).files
+    ['dataset_name_3/year=2019/month=11/day=17/part-0.parquet', ...
+
+    Use old Arrow Dataset API and override the partition filename:
+
+    >>> pq.write_to_dataset(table, root_path='dataset_name_5',
+    ...                     partition_cols=['year', 'month', 'day'],
+    ...                     partition_filename_cb=lambda x:
+    ...                     str(x[0]) + str(x[1]) + str(x[2])  + '.parquet'
+    ...                    )
+    >>> pq.ParquetDataset('dataset_name_5/', use_legacy_dataset=False).files
+    ['dataset_name_5/year=2019/month=11/day=17/20191117.parquet', ...
+
+    Write to a single Parquet file:

Review Comment:
   It still creates a directory, but with a single file
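
   A quick way to show that in the docstring (a sketch; with the new dataset writer the default file name template should yield a single `part-0.parquet`):

   ```python
   import pyarrow.parquet as pq

   # Without partition_cols this still creates a directory, not a bare
   # file: root_path ends up containing a single Parquet file.
   pq.write_to_dataset(table, root_path='dataset_single',
                       use_legacy_dataset=False)
   ```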



##########
python/pyarrow/parquet.py:
##########
@@ -225,6 +225,64 @@ class ParquetFile:
         in nanoseconds.
     decryption_properties : FileDecryptionProperties, default None
         File decryption properties for Parquet Modular Encryption.
+
+    Examples
+    --------
+
+    Generate an example PyArrow Table and write it to Parquet file:
+
+    >>> import pandas as pd
+    >>> import pyarrow as pa
+    >>> df = pd.DataFrame({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+    ...                    'month': [3, 5, 7, 9, 11, 12],
+    ...                    'day': [1, 5, 9, 13, 17, 23],
+    ...                    'n_legs': [2, 2, 4, 4, 5, 100],
+    ...                    'animal': ["Flamingo", "Parrot", "Dog", "Horse",
+    ...                    "Brittle stars", "Centipede"]})
+    >>> table = pa.Table.from_pandas(df)

Review Comment:
   Also, I would in general keep the example data a bit smaller when it is not needed for showing a specific feature (e.g. 2 or 3 columns is probably enough, instead of 5 columns); this reduces the vertical space.
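
   For instance, a trimmed-down setup (a sketch, not the PR's text) would still exercise the same APIs, and building the table with `pa.table` directly also drops the pandas detour:

   ```python
   import pyarrow as pa

   # Three columns are enough to demonstrate reading and writing.
   table = pa.table({'year': [2020, 2022, 2021],
                     'n_legs': [2, 4, 100],
                     'animal': ["Flamingo", "Horse", "Centipede"]})
   ```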



##########
python/pyarrow/parquet.py:
##########
@@ -2201,6 +3055,47 @@ def write_to_dataset(table, root_path, partition_cols=None,
         Using `metadata_collector` in kwargs allows one to collect the
         file metadata instances of dataset pieces. The file paths in the
         ColumnChunkMetaData will be set relative to `root_path`.
+
+    Examples
+    --------
+    Generate an example PyArrow Table:
+
+    >>> import pyarrow as pa
+    >>> import pandas as pd
+    >>> df = pd.DataFrame({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+    ...                    'month': [3, 5, 7, 9, 11, 12],
+    ...                    'day': [1, 5, 9, 13, 17, 23],
+    ...                    'n_legs': [2, 2, 4, 4, 5, 100],
+    ...                    'animals': ["Flamingo", "Parrot", "Dog", "Horse",
+    ...                    "Brittle stars", "Centipede"]})
+    >>> table = pa.Table.from_pandas(df)
+
+    and write it to a partitioned dataset:
+
+    >>> import pyarrow.parquet as pq
+    >>> pq.write_to_dataset(table, root_path='dataset_name_3',
+    ...                     partition_cols=['year', 'month', 'day'],
+    ...                     use_legacy_dataset=False
+    ...                    )
+    >>> pq.ParquetDataset('dataset_name_3', use_legacy_dataset=False).files
+    ['dataset_name_3/year=2019/month=11/day=17/part-0.parquet', ...
+
+    Use old Arrow Dataset API and override the partition filename:
+
+    >>> pq.write_to_dataset(table, root_path='dataset_name_5',
+    ...                     partition_cols=['year', 'month', 'day'],
+    ...                     partition_filename_cb=lambda x:
+    ...                     str(x[0]) + str(x[1]) + str(x[2])  + '.parquet'
+    ...                    )
+    >>> pq.ParquetDataset('dataset_name_5/', use_legacy_dataset=False).files
+    ['dataset_name_5/year=2019/month=11/day=17/20191117.parquet', ...

Review Comment:
   I would maybe not show this if we are not yet sure that we will keep it.
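
   If a file-naming example is wanted without the legacy API, a sketch using `pyarrow.dataset` directly could work instead (`basename_template` controls the file name inside each partition directory; note it is a template, not a per-partition callback like `partition_filename_cb`):

   ```python
   import pyarrow as pa
   import pyarrow.dataset as ds

   # Hive-style partitioning on year/month/day.
   part = ds.partitioning(pa.schema([('year', pa.int64()),
                                     ('month', pa.int64()),
                                     ('day', pa.int64())]),
                          flavor='hive')
   ds.write_dataset(table, 'dataset_name_5', format='parquet',
                    partitioning=part,
                    basename_template='part-{i}.parquet')
   ```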



##########
python/pyarrow/parquet.py:
##########
@@ -225,6 +225,64 @@ class ParquetFile:
         in nanoseconds.
     decryption_properties : FileDecryptionProperties, default None
         File decryption properties for Parquet Modular Encryption.
+
+    Examples
+    --------
+
+    Generate an example PyArrow Table and write it to Parquet file:
+
+    >>> import pandas as pd
+    >>> import pyarrow as pa
+    >>> df = pd.DataFrame({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+    ...                    'month': [3, 5, 7, 9, 11, 12],
+    ...                    'day': [1, 5, 9, 13, 17, 23],
+    ...                    'n_legs': [2, 2, 4, 4, 5, 100],
+    ...                    'animal': ["Flamingo", "Parrot", "Dog", "Horse",
+    ...                    "Brittle stars", "Centipede"]})
+    >>> table = pa.Table.from_pandas(df)
+
+    >>> import pyarrow.parquet as pq
+    >>> pq.write_table(table, 'example.parquet')
+
+    create a ParquetFile object from the Parqet file:
+
+    >>> parquet_file = pq.ParquetFile('example.parquet')
+
+    read the data:

Review Comment:
   ```suggestion
       Read the data:
   ```
   
   (this reads the full file?)
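
   If it does read the full file, it might be worth also showing partial reads (a sketch using existing `ParquetFile` methods):

   ```python
   import pyarrow.parquet as pq

   parquet_file = pq.ParquetFile('example.parquet')

   # Read a single row group instead of the whole file
   first_group = parquet_file.read_row_group(0)

   # Or stream the file as small record batches
   for batch in parquet_file.iter_batches(batch_size=2):
       print(batch.num_rows)
   ```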



##########
python/pyarrow/parquet.py:
##########
@@ -2128,6 +2938,46 @@ def write_table(table, where, row_group_size=None, version='1.0',
         raise
 
 
+_write_table_example = """\
+Generate an example PyArrow Table:
+
+>>> import pyarrow as pa
+>>> import pandas as pd
+>>> df = pd.DataFrame({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+...                    'month': [3, 5, 7, 9, 11, 12],
+...                    'day': [1, 5, 9, 13, 17, 23],
+...                    'n_legs': [2, 2, 4, 4, 5, 100],
+...                    'animals': ["Flamingo", "Parrot", "Dog", "Horse",
+...                    "Brittle stars", "Centipede"]})
+>>> table = pa.Table.from_pandas(df)
+
+and write the Table into Parquet file:
+
+>>> import pyarrow.parquet as pq
+>>> pq.write_table(table, 'example.parquet')
+
+Defining row group size for the Parquet file:
+
+>>> pq.write_table(table, 'example.parquet', row_group_size=3)
+
+Defining row group compression (default is Snappy):
+
+>>> pq.write_table(table, 'example.parquet', compression='none')
+
+Defining row group compression and encoding per-column:
+
+>>> pq.write_table(table, 'example.parquet',
+...                compression={'foo': 'snappy', 'bar': 'gzip'},
+...                use_dictionary=['foo', 'bar'])

Review Comment:
   Can you refer to one of the actual columns in the table instead of foo/bar?
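
   For example, with the columns of the table above (a sketch of what the suggestion amounts to):

   ```python
   import pyarrow.parquet as pq

   # Per-column compression and dictionary encoding, using columns that
   # actually exist in the example table.
   pq.write_table(table, 'example.parquet',
                  compression={'n_legs': 'snappy', 'animals': 'gzip'},
                  use_dictionary=['animals'])
   ```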



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org