Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/07/12 20:39:00 UTC

[jira] [Commented] (ARROW-13317) [Python] Improve documentation on what 'use_threads' does in 'read_feather'

    [ https://issues.apache.org/jira/browse/ARROW-13317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17379405#comment-17379405 ] 

Weston Pace commented on ARROW-13317:
-------------------------------------

The RecordBatchFileReader (which the feather reader uses behind the scenes) has a use_threads option which should control this.  Is read_feather simply being kept alive for backwards compatibility (in which case we should not make it more configurable and should probably mark it deprecated), or is it going to be maintained as a separate API and a simpler frontend to RecordBatchFileReader?  (I think I'll send a ML topic with this question, actually.)
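For context, a rough sketch of the two entry points as they stand today (assuming pyarrow 4.x; "data.feather" is just a placeholder path, and whether use_threads actually reaches the decompression step is exactly the open question here):

{code:python}
import pyarrow as pa
import pyarrow.feather as feather

# High-level path: read_feather exposes use_threads, but the docs don't say
# which stages (decompression vs. conversion to pandas) it actually gates.
df = feather.read_feather("data.feather", use_threads=False)

# Lower-level path: open the same file through the IPC reader that
# read_feather uses behind the scenes.  In 4.x this call takes no
# read-options argument, so there is no per-call threading knob here.
reader = pa.ipc.open_file("data.feather")
table = reader.read_all()
{code}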

Also, now that I look at it, RecordBatchFileReader in Python doesn't expose IpcReadOptions at all, so a Python change would need to be made to expose this too.
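Just to illustrate the shape of that change, a hypothetical sketch (this is not the current 4.0.1 API; the Python-level IpcReadOptions binding and the options keyword on open_file don't exist yet and are assumptions about what exposing them could look like):

{code:python}
import pyarrow as pa

# Hypothetical: an IpcReadOptions binding with a use_threads flag,
# passed through to the underlying RecordBatchFileReader.
opts = pa.ipc.IpcReadOptions(use_threads=False)     # not yet exposed in Python
reader = pa.ipc.open_file("data.feather", options=opts)  # 'options' kwarg is assumed
table = reader.read_all()
{code}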

I don't know about mentioning set_cpu_count.  It does solve the problem, but it's more of a "global" setting: it will affect how many files are read at once by dataset scans, Parquet parallelism, and even compute-level parallelism (once that has more support).  We probably don't want to reference it everywhere it has an effect.
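For completeness, the global knob looks something like this (a minimal sketch; note it caps the size of the shared CPU thread pool for the whole process, which is why I'm hesitant to point to it from the read_feather docs specifically):

{code:python}
import pyarrow as pa
import pyarrow.feather as feather

print(pa.cpu_count())  # current size of the global CPU thread pool
pa.set_cpu_count(1)    # caps threading for *all* pool users: dataset scans,
                       # Parquet reads, feather decompression, compute kernels

df = feather.read_feather("data.feather")  # placeholder path
{code}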

> [Python] Improve documentation on what 'use_threads' does in 'read_feather'
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-13317
>                 URL: https://issues.apache.org/jira/browse/ARROW-13317
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 4.0.1
>            Reporter: Arun Joseph
>            Priority: Trivial
>              Labels: documentation
>
> The current documentation for [read_feather|https://arrow.apache.org/docs/python/generated/pyarrow.feather.read_feather.html] states the following:
> *use_threads* (bool, default True) – Whether to parallelize reading using multiple threads.
> If the underlying file uses compression, then multiple threads can still be spawned. The verbiage of *use_threads* is ambiguous about whether the restriction on multiple threads applies only to the conversion from pyarrow to the pandas dataframe, or also to the reading/decompression of the file itself, which might spawn additional threads.
> [set_cpu_count|http://arrow.apache.org/docs/python/generated/pyarrow.set_cpu_count.html#pyarrow.set_cpu_count] might be good to mention as a way to actually limit the threads spawned.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)