Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/11/26 12:27:00 UTC

[jira] [Commented] (ARROW-10736) [Python] feather/arrow row splitting and counting (Dataset API)

    [ https://issues.apache.org/jira/browse/ARROW-10736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17239228#comment-17239228 ] 

Joris Van den Bossche commented on ARROW-10736:
-----------------------------------------------

I think these are generally things we want, and they are not that difficult to implement. Up to now, we have mostly focused on functionality specific to ParquetFileFragment (as that was needed to get feature parity with ParquetDataset, e.g. for dask, and it is also the most widely used), but we should try to generalize a certain subset of "statistics" functionality (such as the number of rows).

For the row count specifically, there is also ARROW-9697.

bq.  scan without any columns is not possible it seems, nor any method to get the row count.

Not that this is the nicest workaround, but I thought this was possible:

{code}
# assumes earlier in the session: import numpy as np; import pyarrow as pa
In [1]: table = pa.table({'a': np.arange(100), 'b': np.random.randn(100)})

In [2]: import pyarrow.dataset as ds

# ugly way to split table in 2 batches
In [10]: table2 = pa.Table.from_batches([*table[0:50].to_batches(), *table[50:100].to_batches()])

In [11]: ds.write_dataset(table2, "test_ipc2", format="ipc")

In [12]: dataset = ds.dataset("test_ipc2", format="ipc")

# indeed single fragment
In [15]: list(dataset.get_fragments())
Out[15]: [<pyarrow._dataset.FileFragment at 0x7f7444252c70>]

# scanning with empty column selection
In [16]: for scan_task in dataset.scan(columns=[]):
    ...:     for record_batch in scan_task.execute():
    ...:         print(record_batch.num_rows)
50
50
{code}


> [Python] feather/arrow row splitting and counting (Dataset API)
> ---------------------------------------------------------------
>
>                 Key: ARROW-10736
>                 URL: https://issues.apache.org/jira/browse/ARROW-10736
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Maarten Breddels
>            Priority: Major
>
> For parquet files using the Dataset API, we have the option to access the row groups, and count the total number of rows within each. I don't see the option to get the number of rows from a dataset with feather/arrow ipc files. For instance, a scan without any columns is not possible it seems, nor any method to get the row count.
> Also, if a file consists of chunked arrays, it is exposed as 1 fragment, and it is not possible to read only a portion of a filefragment (row slicing), similar to how one could work with ParquetFileFragment.split_by_row_group.
> I don't know of any other way within Apache Arrow to work with feather/arrow ipc files and only read portions of it (e.g. a particular column for row i to j).
> Are these features possible any other way, or is this already planned, possibly difficult to implement?
> cheers,
> Maarten


