Posted to issues@arrow.apache.org by "Kyle Barron (Jira)" <ji...@apache.org> on 2022/05/19 00:06:00 UTC

[jira] [Created] (ARROW-16613) [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)

Kyle Barron created ARROW-16613:
-----------------------------------

             Summary: [Python][Parquet] pyarrow.parquet.write_metadata with metadata_collector appears to be O(n^2)
                 Key: ARROW-16613
                 URL: https://issues.apache.org/jira/browse/ARROW-16613
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Parquet, Python
    Affects Versions: 8.0.0
            Reporter: Kyle Barron


Hello!

I've noticed that writing a `_metadata` file with `pyarrow.parquet.write_metadata` is very slow when the `metadata_collector` list is large, exhibiting O(n^2) behavior. Specifically, the concatenation inside `metadata.append_row_groups` is the bottleneck: the writer [iterates over every item of the list|https://github.com/apache/arrow/blob/027920be05198ee89e643b9e44e20fb477f97292/python/pyarrow/parquet/__init__.py#L3301-L3302] and then [concatenates them onto the accumulated metadata on each iteration|https://github.com/apache/arrow/blob/b0c75dee34de65834e5a83438e6581f90970fd3d/python/pyarrow/_parquet.pyx#L787-L799].

Would it be possible to provide a vectorized implementation, where `append_row_groups` accepts a list of `FileMetaData` objects and the concatenation happens only once?
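To illustrate the shape of the problem (this is plain Python, not pyarrow internals): repeatedly concatenating onto a growing accumulator does 1 + 2 + ... + n = O(n^2) work, while gathering everything and concatenating once is O(n). A minimal sketch:

```python
# Illustrative only: plain Python lists stand in for the row-group lists
# carried by FileMetaData objects. This is not pyarrow code.

def pairwise_append(chunks):
    # Mirrors the current loop: each step copies the whole accumulator,
    # so total work is 1 + 2 + ... + n, i.e. O(n^2).
    acc = []
    for chunk in chunks:
        acc = acc + chunk  # full copy of acc on every iteration
    return acc

def batched_append(chunks):
    # A "vectorized" variant: one pass, no repeated copying -- O(n).
    acc = []
    for chunk in chunks:
        acc.extend(chunk)  # amortized O(len(chunk)) per step
    return acc
```

Both produce the same result; only the amount of copying differs. A batched `append_row_groups` accepting a list could follow the second pattern internally.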

Repro (in IPython, to use the `%time` magic):

```python
from io import BytesIO

import pyarrow as pa
import pyarrow.parquet as pq


def create_example_file_meta_data():
    # Write a tiny table to an in-memory buffer; metadata_collector then
    # holds a single FileMetaData object we can reuse below.
    data = {
        "str": pa.array(["a", "b", "c", "d"], type=pa.string()),
        "uint8": pa.array([1, 2, 3, 4], type=pa.uint8()),
        "int32": pa.array([0, -2147483638, 2147483637, 1], type=pa.int32()),
        "bool": pa.array([True, True, False, False], type=pa.bool_()),
    }
    table = pa.table(data)
    metadata_collector = []
    pq.write_table(table, BytesIO(), metadata_collector=metadata_collector)
    return table.schema, metadata_collector[0]


schema, meta = create_example_file_meta_data()

metadata_collector = [meta] * 500
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 230 ms, sys: 2.96 ms, total: 233 ms
# Wall time: 234 ms

metadata_collector = [meta] * 1000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 960 ms, sys: 6.56 ms, total: 967 ms
# Wall time: 970 ms

metadata_collector = [meta] * 2000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 4.08 s, sys: 54.3 ms, total: 4.13 s
# Wall time: 4.3 s

metadata_collector = [meta] * 4000
%time pq.write_metadata(schema, BytesIO(), metadata_collector=metadata_collector)
# CPU times: user 16.6 s, sys: 593 ms, total: 17.2 s
# Wall time: 17.3 s

```

Each doubling of `metadata_collector` roughly quadruples the runtime (234 ms → 970 ms → 4.3 s → 17.3 s), consistent with O(n^2) scaling.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)