Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/01/06 16:56:00 UTC

[jira] [Commented] (ARROW-10436) [Python][Dataset] Deprecate RowGroupInfo

    [ https://issues.apache.org/jira/browse/ARROW-10436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17259878#comment-17259878 ] 

Joris Van den Bossche commented on ARROW-10436:
-----------------------------------------------

I wanted to open a PR to add a deprecation notice, but ran into this question: without the {{row_groups}} attribute, how can you know which row groups of a file the ParquetFileFragment is "viewing" (e.g. whether it is filtered / subsetted)?

{code}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

table = pa.table({'a': range(300)})
pq.write_table(table, "test.parquet", row_group_size=100)

dataset = ds.dataset("test.parquet")
fragment = list(dataset.get_fragments())[0]

In [77]: fragment.num_row_groups
Out[77]: 3

In [78]: subfragment = fragment.subset(row_group_ids=[0,2])

In [79]: subfragment.row_groups
Out[79]: [RowGroupInfo(0), RowGroupInfo(2)]

In [80]: subfragment.metadata
Out[80]: 
<pyarrow._parquet.FileMetaData object at 0x7f45716d2860>
  created_by: parquet-cpp version 1.5.1-SNAPSHOT
  num_columns: 1
  num_rows: 300
  num_row_groups: 3
  format_version: 1.0
  serialized_size: 596
{code}


Dask, for example, uses the {{RowGroupInfo}} objects to get the "id" of each row group.

> [Python][Dataset] Deprecate RowGroupInfo
> ----------------------------------------
>
>                 Key: ARROW-10436
>                 URL: https://issues.apache.org/jira/browse/ARROW-10436
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 2.0.0
>            Reporter: Ben Kietzman
>            Assignee: Joris Van den Bossche
>            Priority: Minor
>              Labels: dataset
>             Fix For: 3.0.0
>
>
> After ARROW-10131, {{RowGroupInfo}} has questionable merit since it is now a thin wrapper around an integer row group id and a FileMetaData object.


