Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/09/29 11:50:03 UTC

[jira] [Created] (ARROW-10130) [C++][Dataset] ParquetFileFragment::SplitByRowGroup does not preserve "complete_metadata" status

Joris Van den Bossche created ARROW-10130:
---------------------------------------------

             Summary: [C++][Dataset] ParquetFileFragment::SplitByRowGroup does not preserve "complete_metadata" status
                 Key: ARROW-10130
                 URL: https://issues.apache.org/jira/browse/ARROW-10130
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Joris Van den Bossche
             Fix For: 2.0.0


Splitting a ParquetFileFragment into multiple fragments, one per row group ({{SplitByRowGroup}}), calls {{EnsureCompleteMetadata}} initially, but the fragments it creates do not have the {{has_complete_metadata_}} property set. As a result, calling {{EnsureCompleteMetadata}} on the split fragments reads and parses the metadata again instead of using the cached metadata that is already present.

Small example to illustrate:

{code:python}
In [1]: import pyarrow.dataset as ds

In [2]: dataset = ds.parquet_dataset("nyc-taxi-data/dask-partitioned/_metadata", partitioning="hive")

In [3]: rg_fragments = [rg for frag in dataset.get_fragments() for rg in frag.split_by_row_group()]

In [4]: len(rg_fragments)
Out[4]: 4520

# row group fragments actually have statistics
In [7]: rg_fragments[0].row_groups[0].statistics
Out[7]: 
{'vendor_id': {'min': '1', 'max': '4'},
 'pickup_at': {'min': datetime.datetime(2009, 1, 1, 0, 5, 51),
  'max': datetime.datetime(2018, 12, 26, 14, 48, 54)},
...

# but the first call to ensure_complete_metadata on each fragment still takes a long time
In [8]: %time _ = [fr.ensure_complete_metadata() for fr in rg_fragments]
CPU times: user 1.72 s, sys: 203 ms, total: 1.92 s
Wall time: 1.9 s

In [9]: %time _ = [fr.ensure_complete_metadata() for fr in rg_fragments]
CPU times: user 1.34 ms, sys: 0 ns, total: 1.34 ms
Wall time: 1.35 ms
{code}
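
To make the expected behaviour concrete, here is a minimal toy model of the caching logic described above. It is only a sketch: {{ToyFile}}, {{ToyFragment}}, the {{preserve_flag}} parameter and {{count_parses}} are hypothetical names invented for illustration and are not part of Arrow or pyarrow. It only assumes that parsing the metadata is the expensive step and that a boolean flag guards it.

```python
class ToyFile:
    """Hypothetical stand-in for a Parquet file; counts metadata parses."""

    def __init__(self, num_row_groups):
        self.num_row_groups = num_row_groups
        self.parse_count = 0

    def read_metadata(self):
        # Pretend this is the expensive read/parse of the file metadata.
        self.parse_count += 1
        return {"row_groups": list(range(self.num_row_groups))}


class ToyFragment:
    """Hypothetical fragment with a cached-metadata flag (not Arrow's API)."""

    def __init__(self, file, row_groups=None, metadata=None, complete=False):
        self.file = file
        self.row_groups = row_groups
        self.metadata = metadata
        self.has_complete_metadata = complete

    def ensure_complete_metadata(self):
        # Only the flag is checked, so metadata that is present but not
        # flagged as complete gets parsed again.
        if self.has_complete_metadata:
            return
        self.metadata = self.file.read_metadata()
        if self.row_groups is None:
            self.row_groups = self.metadata["row_groups"]
        self.has_complete_metadata = True

    def split_by_row_group(self, preserve_flag):
        self.ensure_complete_metadata()
        # preserve_flag=False mimics the reported behaviour: the children
        # receive the metadata but not the "complete" status.
        return [
            ToyFragment(self.file, [rg], self.metadata, complete=preserve_flag)
            for rg in self.row_groups
        ]


def count_parses(preserve_flag, num_row_groups=4):
    """Return how many times the metadata is parsed for one split file."""
    toy_file = ToyFile(num_row_groups)
    pieces = ToyFragment(toy_file).split_by_row_group(preserve_flag)
    for piece in pieces:
        piece.ensure_complete_metadata()
    return toy_file.parse_count
```

In this toy model, the number of parses grows with the number of row groups when the flag is dropped, and stays at one when the split fragments inherit it, which is the behaviour this issue asks for.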


