Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/07 16:10:10 UTC

[GitHub] [arrow] aiqc opened a new issue #9932: Read parquet via row indexes to support chunking?

aiqc opened a new issue #9932:
URL: https://github.com/apache/arrow/issues/9932


   > We have GitHub issues available as a way for new contributors and
   passers-by who are unfamiliar with Apache Software Foundation projects
   to ask questions and interact with the project. Do not be surprised if
   the first response is to open a JIRA issue or to write an e-mail to
   one of the public mailing lists:
   
   Hi there. Is there a way to read a Parquet file by row (index) range? I don't see it in `pyarrow.parquet.read_table`, and there are related questions about it:
   
   - https://stackoverflow.com/questions/64050609/pyarrow-read-parquet-via-column-index-or-order
   - https://stackoverflow.com/questions/62252259/pandas-read-write-parquet-data-using-column-index
   
   Right now I just read the whole file in and then drop rows, which won't be feasible on larger datasets:
   ```
   import pandas as pd
   # Load the entire file, then keep only the sampled rows by position.
   df = pd.read_parquet(my_stream)
   df = df.iloc[samples_indices]
   ```
   
   I feel like I could do this with Spark, but don't want to add that dependency.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] aiqc commented on issue #9932: Read parquet via row indexes to support chunking?

Posted by GitBox <gi...@apache.org>.

aiqc commented on issue #9932:
URL: https://github.com/apache/arrow/issues/9932#issuecomment-816926797


   @emkornfield thank you. The 'github issue template' was a bit ambiguous about where to ask.



[GitHub] [arrow] emkornfield commented on issue #9932: Read parquet via row indexes to support chunking?

Posted by GitBox <gi...@apache.org>.

emkornfield commented on issue #9932:
URL: https://github.com/apache/arrow/issues/9932#issuecomment-815411183


   Hi @aiqc 
   
   As of pyarrow 3, reading a file as a generator of record batches should be [possible](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetFile.html#pyarrow.parquet.ParquetFile.iter_batches) via `ParquetFile.iter_batches`.
   
   Also, we try to use the appropriate mailing lists, [user@ and dev@](https://arrow.apache.org/community/), to answer questions instead of GitHub issues.



[GitHub] [arrow] aiqc edited a comment on issue #9932: Read parquet via row indexes to support chunking?

Posted by GitBox <gi...@apache.org>.

aiqc edited a comment on issue #9932:
URL: https://github.com/apache/arrow/issues/9932#issuecomment-816926797


   @emkornfield thank you! The 'github issue template' was a bit ambiguous about where to ask.



[GitHub] [arrow] emkornfield commented on issue #9932: Read parquet via row indexes to support chunking?

Posted by GitBox <gi...@apache.org>.

emkornfield commented on issue #9932:
URL: https://github.com/apache/arrow/issues/9932#issuecomment-815412479


   You can also use [pyarrow datasets](http://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html?highlight=parquet%20filter) to push some level of filtering down the stack.



[GitHub] [arrow] emkornfield closed issue #9932: Read parquet via row indexes to support chunking?

Posted by GitBox <gi...@apache.org>.

emkornfield closed issue #9932:
URL: https://github.com/apache/arrow/issues/9932


   

