Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/01/06 16:56:00 UTC
[jira] [Commented] (ARROW-10436) [Python][Dataset] Deprecate RowGroupInfo
[ https://issues.apache.org/jira/browse/ARROW-10436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17259878#comment-17259878 ]
Joris Van den Bossche commented on ARROW-10436:
-----------------------------------------------
I wanted to open a PR to add a deprecation notice, but ran into this question: without the {{row_groups}} attribute, how can you know which row groups of a file the ParquetFileFragment is "viewing" (e.g. when it is filtered / subsetted)?
{code}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
table = pa.table({'a': range(300)})
pq.write_table(table, "test.parquet", row_group_size=100)
dataset = ds.dataset("test.parquet")
fragment = list(dataset.get_fragments())[0]
In [77]: fragment.num_row_groups
Out[77]: 3
In [78]: subfragment = fragment.subset(row_group_ids=[0,2])
In [79]: subfragment.row_groups
Out[79]: [RowGroupInfo(0), RowGroupInfo(2)]
In [80]: subfragment.metadata
Out[80]:
<pyarrow._parquet.FileMetaData object at 0x7f45716d2860>
created_by: parquet-cpp version 1.5.1-SNAPSHOT
num_columns: 1
num_rows: 300
num_row_groups: 3
format_version: 1.0
serialized_size: 596
{code}
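Dask, for example, uses the {{RowGroupInfo}} to get the "id" of the row group. The {{metadata}} above still describes the whole file (3 row groups, 300 rows), so it cannot tell you that the subfragment only views row groups 0 and 2.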
> [Python][Dataset] Deprecate RowGroupInfo
> ----------------------------------------
>
> Key: ARROW-10436
> URL: https://issues.apache.org/jira/browse/ARROW-10436
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 2.0.0
> Reporter: Ben Kietzman
> Assignee: Joris Van den Bossche
> Priority: Minor
> Labels: dataset
> Fix For: 3.0.0
>
>
> After ARROW-10131, {{RowGroupInfo}} has questionable merit, since it is now only a thin wrapper around an integer row group id and a {{FileMetaData}}.