Posted to jira@arrow.apache.org by "Adrien Hoarau (Jira)" <ji...@apache.org> on 2021/11/12 17:04:00 UTC

[jira] [Created] (ARROW-14701) parquet.write_table has an undocumented and silent cap on row_group_size

Adrien Hoarau created ARROW-14701:
-------------------------------------

             Summary: parquet.write_table has an undocumented and silent cap on row_group_size
                 Key: ARROW-14701
                 URL: https://issues.apache.org/jira/browse/ARROW-14701
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 6.0.0
            Reporter: Adrien Hoarau


```

from io import BytesIO
import pandas as pd
import pyarrow
from pyarrow import parquet
from pyarrow import fs

print(pyarrow.__version__)

def check_row_groups_created(size: int):
    df = pd.DataFrame({"a": range(size)})
    t = pyarrow.Table.from_pandas(df)
    buffer = BytesIO()
    parquet.write_table(t, buffer, row_group_size=size)
    buffer.seek(0)
    print(parquet.read_metadata(buffer))
    
check_row_groups_created(50_000_000)
check_row_groups_created(100_000_000)
```
Output:
```
6.0.0
<pyarrow._parquet.FileMetaData object at 0x7f838584ab80>
created_by: parquet-cpp-arrow version 6.0.0
num_columns: 1
num_rows: 50000000
num_row_groups: 1
format_version: 1.0
serialized_size: 1493
<pyarrow._parquet.FileMetaData object at 0x7f838584ab80>
created_by: parquet-cpp-arrow version 6.0.0
num_columns: 1
num_rows: 100000000
num_row_groups: 2
format_version: 1.0
serialized_size: 1640
```
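
The output above is consistent with the requested row_group_size being silently clamped to some maximum between 50,000,000 and 100,000,000 rows. The sketch below shows the arithmetic under that reading. The cap value of 64 * 1024 * 1024 rows and the helper name expected_row_groups are assumptions made for illustration; the report itself does not state the limit.

```python
import math

# Assumption: the writer clamps each row group to this many rows.
# The report does not give the value; 64 * 1024 * 1024 is a guess
# that matches the observed output.
ASSUMED_MAX_ROW_GROUP_LENGTH = 64 * 1024 * 1024


def expected_row_groups(num_rows: int, row_group_size: int,
                        cap: int = ASSUMED_MAX_ROW_GROUP_LENGTH) -> int:
    """Row group count if row_group_size is silently clamped to cap."""
    effective = min(row_group_size, cap)
    return math.ceil(num_rows / effective)


# Same sizes as in the reproduction script above.
for size in (50_000_000, 100_000_000):
    print(size, expected_row_groups(size, size))
```

Under the assumed cap this gives 1 row group for 50,000,000 rows and 2 for 100,000,000 rows, matching the num_row_groups values reported. The actual split could be checked by reading row_group(i).num_rows for each row group from the FileMetaData object returned by parquet.read_metadata.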


