Posted to user@arrow.apache.org by Arun Joseph <aj...@gmail.com> on 2021/07/12 20:00:06 UTC

[Python] pyarrow.read_feather use_threads option not respected?

I'm running the following:

Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> pyarrow.__version__
'4.0.1'

from pyarrow import feather

file_path = f'{valid_file_path}'
feather.write_feather(df, dest=file_path, compression='zstd',
                      compression_level=19)
feather.read_feather(file_path, use_threads=False)

It seems like the use_threads argument does not alter the number of threads
launched. I've tested with both use_threads=True and use_threads=False. Am
I misunderstanding what use_threads actually means? It seems like it
launches ~12 threads.

Could this be related to the compression strategy of the file itself?

Thank You,
Arun Joseph

Re: [Python] pyarrow.read_feather use_threads option not respected?

Posted by Arun Joseph <aj...@gmail.com>.
Thanks for the quick response Wes, set_cpu_count resolved the issue for me.
I've created a JIRA issue to improve the documentation:
https://issues.apache.org/jira/browse/ARROW-13317

On Mon, Jul 12, 2021 at 4:08 PM Wes McKinney <we...@gmail.com> wrote:

> hi Arun — the `use_threads` argument here only toggles whether
> multiple threads are used in the conversion from the Arrow/Feather
> representation to pandas. Since you elected to use compression,
> multiple threads are used when decompressing the data, and this can
> only be changed by setting the number of threads globally in the
> pyarrow library [1]
>
> This seems a bit misleading to me, so it would be good to open a Jira
> issue to clarify in the documentation what "use_threads" does
>
> [1]:
> http://arrow.apache.org/docs/python/generated/pyarrow.set_cpu_count.html#pyarrow.set_cpu_count


-- 
Arun Joseph

Re: [Python] pyarrow.read_feather use_threads option not respected?

Posted by Wes McKinney <we...@gmail.com>.
hi Burke — to remove yourself, you have to e-mail

user-unsubscribe@arrow.apache.org

On Mon, Jul 12, 2021 at 3:11 PM Burke Kaltenberger
<bu...@firsttalentsearch.com> wrote:
>
> Please take me off the mailing list
>
> --
> First Talent Search & Placement
> Burke Kaltenberger | Founder
> 408.458.0071

Re: [Python] pyarrow.read_feather use_threads option not respected?

Posted by Burke Kaltenberger <bu...@firsttalentsearch.com>.
Please take me off the mailing list



-- 
*First Talent Search & Placement*
*Burke Kaltenberger
<https://www.linkedin.com/in/burke-kaltenberger-3a41731/> | Founder*
*408.458.0071*

Re: [Python] pyarrow.read_feather use_threads option not respected?

Posted by Wes McKinney <we...@gmail.com>.
hi Arun — the `use_threads` argument here only toggles whether
multiple threads are used in the conversion from the Arrow/Feather
representation to pandas. Since you elected to use compression,
multiple threads are used when decompressing the data, and this can
only be changed by setting the number of threads globally in the
pyarrow library [1]

This seems a bit misleading to me, so it would be good to open a Jira
issue to clarify in the documentation what "use_threads" does

[1]: http://arrow.apache.org/docs/python/generated/pyarrow.set_cpu_count.html#pyarrow.set_cpu_count
