Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/17 19:12:09 UTC

[GitHub] [arrow] westonpace commented on pull request #11911: ARROW-15019: [Python] Add bindings for new dataset writing options

westonpace commented on pull request #11911:
URL: https://github.com/apache/arrow/pull/11911#issuecomment-996972468


   > I think there might be a more direct way to count the number of row groups created by inspecting the parquet files, rather than inferring based on the batches that dataset to_batches() returns
   
   For a parquet file you can do:
   ```
   import pyarrow.parquet as pq

   # Either works
   pq.ParquetFile('/tmp/foo.parquet').metadata.num_row_groups
   pq.read_metadata('/tmp/foo.parquet').num_row_groups
   ```
   For an IPC file you can do:
   ```
   import pyarrow.ipc as ipc

   with ipc.RecordBatchFileReader('/tmp/foo.arrow') as reader:
       num_record_batches = reader.num_record_batches
   ```
   
   For testing purposes, though, I would almost rather just stick with reading the data back in as a table, since that approach is universal across the formats.  The performance difference at this scale should be trivial.  Also, this test checks the # of rows in each batch in addition to the # of batches (although one could argue that the feature can be tested solely by the # of batches).
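
   For reference, a minimal sketch of that format-agnostic read-back check (the `/tmp/out` path, format, and expected row count are hypothetical, not taken from this PR):
   ```
   import pyarrow.dataset as ds

   # Reading everything back as a table works the same for parquet and IPC,
   # so the test does not need per-format branches.
   table = ds.dataset('/tmp/out', format='parquet').to_table()
   assert table.num_rows == 100  # illustrative expected total
   ```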
   
   There actually is no way to get the size of the batches in an IPC file without reading them in (this has some implications for scanning, and someday I'd like to run some experiments on whether or not a change to the IPC format might help us here).  For parquet, that `metadata` object is rich enough that you can get the size of each row group (`metadata.row_group(0).num_rows`, for example).
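
   A small sketch of pulling those row-group sizes out of the parquet metadata (the file path is illustrative):
   ```
   import pyarrow.parquet as pq

   metadata = pq.read_metadata('/tmp/foo.parquet')
   # num_rows for every row group, taken from the footer without reading any data pages
   row_group_sizes = [metadata.row_group(i).num_rows
                      for i in range(metadata.num_row_groups)]
   ```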

