Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/04 08:57:12 UTC

[GitHub] [arrow] amol- commented on a diff in pull request #12704: ARROW-15428: [Python] Address docstrings in Parquet classes and functions

amol- commented on code in PR #12704:
URL: https://github.com/apache/arrow/pull/12704#discussion_r838345396


##########
python/pyarrow/parquet.py:
##########
@@ -225,6 +225,64 @@ class ParquetFile:
         in nanoseconds.
     decryption_properties : FileDecryptionProperties, default None
         File decryption properties for Parquet Modular Encryption.
+
+    Examples
+    --------
+
+    Generate an example PyArrow Table and write it to Parquet file:
+
+    >>> import pandas as pd
+    >>> import pyarrow as pa
+    >>> df = pd.DataFrame({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+    ...                    'month': [3, 5, 7, 9, 11, 12],
+    ...                    'day': [1, 5, 9, 13, 17, 23],
+    ...                    'n_legs': [2, 2, 4, 4, 5, 100],
+    ...                    'animal': ["Flamingo", "Parrot", "Dog", "Horse",
+    ...                    "Brittle stars", "Centipede"]})
+    >>> table = pa.Table.from_pandas(df)

Review Comment:
   Why go through pandas? Involving an external dependency in a pyarrow example just seems to add confusion, especially in this case, where we could do
   ```suggestion
       >>> import pyarrow as pa
       >>> table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
       ...                   'month': [3, 5, 7, 9, 11, 12],
       ...                   'day': [1, 5, 9, 13, 17, 23],
       ...                   'n_legs': [2, 2, 4, 4, 5, 100],
       ...                   'animal': ["Flamingo", "Parrot", "Dog", "Horse",
       ...                              "Brittle stars", "Centipede"]})
   ```
   and end up with even simpler code.
   
   The same applies to all the subsequent usages of `DataFrame`.
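   
   To double-check that the rest of the example proceeds unchanged with `pa.table`, here is a minimal sketch (the file name is only illustrative):
   ```python
   >>> import pyarrow as pa
   >>> import pyarrow.parquet as pq
   >>> table = pa.table({'n_legs': [2, 4], 'animal': ["Flamingo", "Dog"]})
   >>> pq.write_table(table, 'example.parquet')
   >>> pq.read_table('example.parquet').equals(table)
   True
   ```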



##########
python/pyarrow/parquet.py:
##########
@@ -612,6 +857,58 @@ def _sanitize_table(table, new_schema, flavor):
     the batch size can help keep page sizes closer to the intended size.
 """
 
+_parquet_writer_example_doc = """\

Review Comment:
   I'm a bit confused: given that we use this docstring only once, why is it a variable?
   In the case of `_parquet_writer_arg_docs`, it was a variable because the docstring was shared by multiple classes.
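   
   To illustrate the distinction, a quick sketch (the names are hypothetical, not the actual parquet.py code):
   ```python
   # Shared text: a module-level variable avoids duplicating it in each class.
   _shared_arg_docs = """where : str
       Output path for the Parquet file."""

   class WriterA:
       __doc__ = "Writer A.\n\nParameters\n----------\n" + _shared_arg_docs

   class WriterB:
       __doc__ = "Writer B.\n\nParameters\n----------\n" + _shared_arg_docs

   # Used only once: the example can simply live inline in the docstring.
   class SingleUse:
       """Single-use class.

       Examples
       --------
       >>> 1 + 1
       2
       """
   ```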



##########
python/pyarrow/parquet.py:
##########
@@ -369,6 +537,40 @@ def iter_batches(self, batch_size=65536, row_groups=None, columns=None,
         -------
         iterator of pyarrow.RecordBatch
             Contents of each batch as a record batch
+
+        Examples
+        --------
+        Generate an example Parquet file:
+
+        >>> import pandas as pd
+        >>> import pyarrow as pa
+        >>> df = pd.DataFrame({'year': [2020, 2022, 2021, 2022, 2019, 2021],
+        ...                    'month': [3, 5, 7, 9, 11, 12],
+        ...                    'day': [1, 5, 9, 13, 17, 23],
+        ...                    'n_legs': [2, 2, 4, 4, 5, 100],
+        ...                    'animal': ["Flamingo", "Parrot", "Dog", "Horse",
+        ...                    "Brittle stars", "Centipede"]})
+        >>> table = pa.Table.from_pandas(df)
+        >>> import pyarrow.parquet as pq
+        >>> pq.write_table(table, 'example.parquet')
+        >>> parquet_file = pq.ParquetFile('example.parquet')
+
+        Read streaming batches:
+
+        >>> for i in parquet_file.iter_batches(batch_size=3):

Review Comment:
   I wonder if we are making the example more complex than needed to show the method's capabilities.
   The user might not yet have dug into the details of what each argument does, and thus might not immediately notice that the output involves multiple record batches due to `batch_size=3`.
   I'll leave it to your choice, but personally I would omit the `batch_size=3` argument and just show the user that the output is a record batch, without forcing the example to demonstrate that there is more than one.
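   
   For instance, something along these lines (the output is illustrative, assuming the 6-row example table above):
   ```python
   >>> for batch in parquet_file.iter_batches():
   ...     print("RecordBatch")
   ...     print(batch.num_rows)
   RecordBatch
   6
   ```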



##########
python/pyarrow/parquet.py:
##########
@@ -1310,6 +1615,49 @@ def _open_dataset_file(dataset, path, meta=None):
     you need to specify the field names or a full schema. See the
     ``pyarrow.dataset.partitioning()`` function for more details."""
 
+_parquet_dataset_example = """\

Review Comment:
   Same as before: maybe it would be better to inline this into the docstring?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org