Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/07 10:13:47 UTC

[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #8317: ARROW-10134: [Python][Dataset] Add ParquetFileFragment.num_row_groups

jorisvandenbossche commented on a change in pull request #8317:
URL: https://github.com/apache/arrow/pull/8317#discussion_r500896637



##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -530,17 +548,17 @@ Status ParquetFileFragment::EnsureCompleteMetadata(parquet::arrow::FileReader* r
   physical_schema_ = std::move(schema);
 
   std::shared_ptr<parquet::FileMetaData> metadata = reader->parquet_reader()->metadata();
-  int num_row_groups = metadata->num_row_groups();
+  num_row_groups_ = metadata->num_row_groups();

Review comment:
       I don't think we should set the class member here. Assigning the file's total to `num_row_groups_` would override a potential subselection of row groups.
   
   Small example:
   
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   table = pa.table({'a': [1, 2, 3, 4]})
   pq.write_table(table, "test_num_row_groups.parquet", row_group_size=2)
   
   import pyarrow.dataset as ds
   dataset = ds.dataset("test_num_row_groups.parquet")
   fragment = list(dataset.get_fragments())[0]
   
   # make fragment viewing the first row group
   In [14]: fragment0 = fragment.format.make_fragment(
       ...:     fragment.path, fragment.filesystem, row_groups=[0])
   
   In [15]: fragment0.num_row_groups
   Out[15]: 1
   
   In [16]: fragment0.row_groups
   Out[16]: [<pyarrow._dataset.RowGroupInfo at 0x7f633bbc93f8>]
   
   In [17]: fragment0.row_groups[0].statistics
   
   # complete the metadata -> still has a single row group, but num_row_groups now returns the file's total
   In [18]: fragment0.ensure_complete_metadata()
   
   In [19]: fragment0.num_row_groups
   Out[19]: 2
   
   In [20]: fragment0.row_groups
   Out[20]: [<pyarrow._dataset.RowGroupInfo at 0x7f633bbcc3a0>]
   
   In [21]: fragment0.row_groups[0].statistics
   Out[21]: {'a': {'min': 1, 'max': 2}}
   ```
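
To make the expected behaviour concrete, here is a minimal pure-Python sketch of the invariant I have in mind. It is not the Arrow implementation: `FragmentSketch`, `total_in_file` and the "unknown until loaded" `None` result are hypothetical names and assumptions used only for illustration. The point is that the count is derived from the selection, so completing the metadata cannot change it for a fragment that already has a subselection.

```python
from typing import List, Optional


class FragmentSketch:
    """Hypothetical stand-in for a Parquet fragment (illustration only)."""

    def __init__(self, total_in_file: int, row_groups: Optional[List[int]] = None):
        # Number of row groups the file metadata would report.
        self._total_in_file = total_in_file
        # Explicit subselection of row-group indices; None means "no subselection".
        self._row_groups = row_groups

    def ensure_complete_metadata(self) -> None:
        # Loading metadata only fills in the selection when none was given.
        # An existing subselection is left untouched.
        if self._row_groups is None:
            self._row_groups = list(range(self._total_in_file))

    @property
    def num_row_groups(self) -> Optional[int]:
        # Derived from the selection, never from the file total directly.
        # Assumption: the count is unknown (None) before any selection exists.
        if self._row_groups is None:
            return None
        return len(self._row_groups)


fragment0 = FragmentSketch(total_in_file=2, row_groups=[0])
before = fragment0.num_row_groups
fragment0.ensure_complete_metadata()
after = fragment0.num_row_groups
print(before, after)
```

With this shape, a fragment created with `row_groups=[0]` reports the same count before and after `ensure_complete_metadata()`, which is what the session above expects.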



