Posted to jira@arrow.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2020/06/15 16:25:00 UTC

[jira] [Assigned] (ARROW-4633) [Python] ParquetFile.read(use_threads=False) creates ThreadPool anyway

     [ https://issues.apache.org/jira/browse/ARROW-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney reassigned ARROW-4633:
-----------------------------------

    Assignee:     (was: Wes McKinney)

> [Python] ParquetFile.read(use_threads=False) creates ThreadPool anyway
> ----------------------------------------------------------------------
>
>                 Key: ARROW-4633
>                 URL: https://issues.apache.org/jira/browse/ARROW-4633
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.11.1, 0.12.0
>         Environment: Linux, Python 3.7.1, pyarrow.__version__ = 0.12.0
>            Reporter: Taylor Johnson
>            Priority: Minor
>              Labels: dataset-parquet-read, newbie, parquet, pull-request-available
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> The following code seems to suggest that ParquetFile.read(use_threads=False) still creates a ThreadPool.  This is observed in ParquetFile.read_row_group(use_threads=False) as well. 
> This does not appear to be a problem in pyarrow.Table.to_pandas(use_threads=False).
> I've tried tracing the issue.  Starting in python/pyarrow/parquet.py, both ParquetFile.read() and ParquetFile.read_row_group() pass the use_threads argument along to self.reader, which is a ParquetReader imported from _parquet.pyx.
> Following the calls into python/pyarrow/_parquet.pyx, we see that ParquetReader.read_all() and ParquetReader.read_row_group() contain the following code, which seems a bit suspicious:
> {quote}if use_threads:
>     self.set_use_threads(use_threads)
> {quote}
> Why not just always call self.set_use_threads(use_threads)?
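> To make the concern concrete, here is a small toy sketch (not pyarrow code; the "defaults to threaded" behavior is assumed purely for illustration) of why forwarding the flag only when it is True could silently ignore use_threads=False:
> {quote}# Toy illustration, not pyarrow code.
> class ToyReader:
>     def __init__(self):
>         self.use_threads = True  # assumed default, purely for illustration
>
>     def set_use_threads(self, flag):
>         self.use_threads = flag
>
> def read_conditional(reader, use_threads):
>     # current pattern: only forward the flag when it is True
>     if use_threads:
>         reader.set_use_threads(use_threads)
>     return reader.use_threads
>
> def read_unconditional(reader, use_threads):
>     # suggested pattern: always forward the caller's choice
>     reader.set_use_threads(use_threads)
>     return reader.use_threads
>
> print(read_conditional(ToyReader(), False))    # True  -> flag ignored
> print(read_unconditional(ToyReader(), False))  # False -> flag respected
> {quote}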
> ParquetReader.set_use_threads() simply calls self.reader.get().set_use_threads(use_threads).  This self.reader is declared as unique_ptr[FileReader].  I think this points to cpp/src/parquet/arrow/reader.cc, but I'm not sure about that.  The FileReader::Impl::ReadRowGroup logic looks OK, as ::arrow::internal::GetCpuThreadPool() is only called if use_threads is True.  The same is true for ReadTable.
> So when is the ThreadPool getting created?
> Example code:
> --------------------------------------------------
> {quote}import pandas as pd
> import psutil
> import pyarrow as pa
> import pyarrow.parquet as pq
> use_threads=False
> p=psutil.Process()
> print('Starting with {} threads'.format(p.num_threads()))
> df = pd.DataFrame({'x': [0]})
> table = pa.Table.from_pandas(df)
> print('After table creation, {} threads'.format(p.num_threads()))
> df = table.to_pandas(use_threads=use_threads)
> print('table.to_pandas(use_threads={}), {} threads'.format(use_threads, p.num_threads()))
> writer = pq.ParquetWriter('tmp.parquet', table.schema)
> writer.write_table(table)
> writer.close()
> print('After writing parquet file, {} threads'.format(p.num_threads()))
> pf = pq.ParquetFile('tmp.parquet')
> print('After ParquetFile, {} threads'.format(p.num_threads()))
> df = pf.read(use_threads=use_threads).to_pandas()
> print('After pf.read(use_threads={}), {} threads'.format(use_threads, p.num_threads()))
> {quote}
> -----------------------------------------------------------------------
> $ python pyarrow_test.py
> Starting with 1 threads
> After table creation, 1 threads
> table.to_pandas(use_threads=False), 1 threads
> After writing parquet file, 1 threads
> After ParquetFile, 1 threads
> After pf.read(use_threads=False), 5 threads
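> One way to narrow this down (this diagnostic is my own addition, and it assumes pa.set_cpu_count()/pa.cpu_count() are available in the installed pyarrow): cap Arrow's global CPU thread pool at one thread before reading, reusing the tmp.parquet written above, and see whether the thread count still jumps.  If it no longer does, the extra threads are presumably coming from the global CPU pool being instantiated despite use_threads=False.
> {quote}import psutil
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> # Cap the global CPU thread pool before any Parquet read happens.
> pa.set_cpu_count(1)
> p = psutil.Process()
>
> pf = pq.ParquetFile('tmp.parquet')
> print('Before read: {} threads (pool size {})'.format(p.num_threads(), pa.cpu_count()))
> df = pf.read(use_threads=False).to_pandas()
> print('After read: {} threads (pool size {})'.format(p.num_threads(), pa.cpu_count()))
> {quote}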


