Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/01/05 14:44:57 UTC
[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

jorisvandenbossche commented on a change in pull request #6979:
URL: https://github.com/apache/arrow/pull/6979#discussion_r551972959



##########
File path: python/pyarrow/parquet.py
##########
@@ -319,6 +319,44 @@ def read_row_groups(self, row_groups, columns=None, use_threads=True,
                                            column_indices=column_indices,
                                            use_threads=use_threads)
 
+    def iter_batches(self, batch_size=65536, row_groups=None, columns=None,
+                     use_threads=True, use_pandas_metadata=False):
+        """
+        Read streaming batches from a Parquet file
+
+        Parameters
+        ----------
+        batch_size: int, default 64K
+            Maximum number of records to yield per batch. Batches may be
+            smaller if there aren't enough rows in a rowgroup.

Review comment:
       ```suggestion
               smaller if there aren't enough rows in the file.
   ```
   
   ? (given that it currently returns rows across row groups)
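
To make the difference behind this suggestion concrete, here is a minimal pure-Python model of the two possible behaviours. It is only a sketch and not pyarrow's implementation: both helper names are hypothetical, and rows are modelled as counts rather than real record batches.

```python
from itertools import chain, islice


def batch_sizes_across_row_groups(row_group_sizes, batch_size):
    """Model batches drawn from one continuous stream of rows (hypothetical helper)."""
    # Each row is represented only by the index of the row group it came from.
    rows = chain.from_iterable([i] * n for i, n in enumerate(row_group_sizes))
    sizes = []
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        sizes.append(len(batch))
    return sizes


def batch_sizes_per_row_group(row_group_sizes, batch_size):
    """Model batches that never cross a row-group boundary (hypothetical helper)."""
    sizes = []
    for n in row_group_sizes:
        full, rest = divmod(n, batch_size)
        sizes.extend([batch_size] * full)
        if rest:
            sizes.append(rest)
    return sizes


# Three row groups of 100, 100 and 50 rows, read with batch_size=64.
print(batch_sizes_across_row_groups([100, 100, 50], 64))  # [64, 64, 64, 58]
print(batch_sizes_per_row_group([100, 100, 50], 64))      # [64, 36, 64, 36, 50]
```

In the first model only the final batch can be short, which matches the wording "if there aren't enough rows in the file". In the second, a short batch can appear at the end of every row group, which is what the original docstring wording implied.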

##########
File path: python/pyarrow/parquet.py
##########
@@ -319,6 +319,44 @@ def read_row_groups(self, row_groups, columns=None, use_threads=True,
                                            column_indices=column_indices,
                                            use_threads=use_threads)
 
+    def iter_batches(self, batch_size=65536, row_groups=None, columns=None,
+                     use_threads=True, use_pandas_metadata=False):
+        """
+        Read streaming batches from a Parquet file
+
+        Parameters
+        ----------
+        batch_size: int, default 64K
+            Maximum number of records to yield per batch. Batches may be
+            smaller if there aren't enough rows in a rowgroup.
+        row_groups: list
+            Only these row groups will be read from the file.
+        columns: list
+            If not None, only these columns will be read from the file. A
+            column name may be a prefix of a nested field, e.g. 'a' will select
+            'a.b', 'a.c', and 'a.d.e'
+        use_threads : boolean, default True
+            Perform multi-threaded column reads
+        use_pandas_metadata : boolean, default False
+            If True and file has custom pandas schema metadata, ensure that
+            index columns are also loaded

Review comment:
       ```suggestion
           columns: list
               If not None, only these columns will be read from the file. A
               column name may be a prefix of a nested field, e.g. 'a' will select
               'a.b', 'a.c', and 'a.d.e'.
           use_threads : boolean, default True
               Perform multi-threaded column reads.
           use_pandas_metadata : boolean, default False
               If True and file has custom pandas schema metadata, ensure that
               index columns are also loaded.
   ```
   
   (minor nitpick on punctuation)



