
Posted to issues@arrow.apache.org by "Yair Lenga (Jira)" <ji...@apache.org> on 2021/08/01 12:22:00 UTC

[jira] [Created] (ARROW-13518) Identify selected row when using filters

Yair Lenga created ARROW-13518:
----------------------------------

             Summary: Identify selected row when using filters
                 Key: ARROW-13518
                 URL: https://issues.apache.org/jira/browse/ARROW-13518
             Project: Apache Arrow
          Issue Type: New Feature
          Components: C++, Parquet, Python
            Reporter: Yair Lenga


I am proposing an enhancement to speed up reading of selected rows in Arrow: extend the functions that accept a filter, such as parquet.read_table ([https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html#pyarrow.parquet.read_table]), to support returning the actual row numbers of the matching rows (e.g., row_group and row_index).



With the proposed enhancement, repeated reads of the same selection can be faster (e.g., by caching the returned indices and reading the full data only when needed).



The proposed implementation is to add two pseudo-columns, which can be requested in the columns list, e.g., columns=['$row_group', '$row_index', 'dealid', …] or similar.
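To make the intended semantics concrete, here is a minimal sketch using a toy in-memory model rather than pyarrow. Everything in it is hypothetical: read_rows and fetch_by_position are illustrative helpers, the lists of dicts stand in for Parquet row groups, and '$row_index' is assumed to be the row's offset within its row group.

```python
# Toy in-memory model of the proposal. This is NOT pyarrow API;
# all function names here are hypothetical.

def read_rows(row_groups, columns, predicate=None):
    """Return matching rows, resolving the '$row_group' / '$row_index' pseudo-columns."""
    out = []
    for group_no, group in enumerate(row_groups):
        for row_no, row in enumerate(group):
            if predicate is not None and not predicate(row):
                continue
            # Pseudo-columns come from the row's position, not from stored data.
            # Assumption: '$row_index' counts from 0 within each row group.
            position = {"$row_group": group_no, "$row_index": row_no}
            out.append({c: position[c] if c in position else row[c] for c in columns})
    return out


def fetch_by_position(row_groups, positions):
    """Read full rows directly from cached (row_group, row_index) pairs."""
    return [row_groups[g][i] for g, i in positions]


# Two "row groups" standing in for a Parquet file.
row_groups = [
    [{"dealid": 1, "amount": 10}, {"dealid": 2, "amount": 250}],
    [{"dealid": 3, "amount": 400}, {"dealid": 4, "amount": 5}],
]

# First pass: apply the filter once, keeping only positions and a key column.
hits = read_rows(
    row_groups,
    ["$row_group", "$row_index", "dealid"],
    predicate=lambda r: r["amount"] > 100,
)
cached = [(h["$row_group"], h["$row_index"]) for h in hits]

# Later: fetch the full rows without re-evaluating the filter.
full_rows = fetch_by_position(row_groups, cached)
```

The point of the sketch is that the second read needs only the cached positions. A real implementation would presumably skip whole row groups that contain no cached positions.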

 

I am not sure whether this requires a change to the C++ interface, or only to the Python part of pyarrow.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)