Posted to issues@arrow.apache.org by "jonashaag (via GitHub)" <gi...@apache.org> on 2023/05/31 21:17:56 UTC
[GitHub] [arrow] jonashaag opened a new issue, #35859: New row_group default of 1024 * 1024 not working
jonashaag opened a new issue, #35859:
URL: https://github.com/apache/arrow/issues/35859
### Describe the bug, including details regarding any error messages, version, and platform.
The release notes of Arrow 12 say that the default row group size has been lowered from 64 Mi rows to 1 Mi rows. But here's what's happening in practice (PyArrow from conda-forge):
```
In [1]: import pyarrow.parquet as pq
In [2]: import pyarrow as pa
In [3]: pa.__version__
Out[3]: '12.0.0'
In [12]: pq.write_table(pa.Table.from_pydict({"x": [42] * (64 * 1024 * 1024 + 1)}), "/tmp/x")
In [13]: pq.read_metadata("/tmp/x")
Out[13]:
<pyarrow._parquet.FileMetaData object at 0x7fa6b81eaa70>
created_by: parquet-cpp-arrow version 12.0.0
num_columns: 1
num_rows: 67108865
num_row_groups: 2 # <=============
format_version: 2.6
serialized_size: 485
In [14]: pq.write_table(pa.Table.from_pydict({"x": [42] * (64 * 1024 * 1024 + 0)}), "/tmp/x")
In [15]: pq.read_metadata("/tmp/x")
Out[15]:
<pyarrow._parquet.FileMetaData object at 0x7fa6b823d490>
created_by: parquet-cpp-arrow version 12.0.0
num_columns: 1
num_rows: 67108864
num_row_groups: 1 # <=============
format_version: 2.6
serialized_size: 376
```
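To make the two outputs above easier to interpret, here is a small arithmetic sketch of how many row groups one would expect under each candidate default. It assumes the writer fills every row group up to the limit before starting the next one; `expected_row_groups` is a hypothetical helper written for illustration, not a PyArrow function.

```python
MI = 1024 * 1024  # 1 Mi rows


def expected_row_groups(num_rows: int, row_group_size: int) -> int:
    """Row groups needed if each group holds at most row_group_size rows.

    Hypothetical helper (not part of PyArrow); assumes groups are filled
    to the limit before a new one is started.
    """
    # Integer ceiling division avoids floating-point rounding.
    return -(-num_rows // row_group_size)


for num_rows in (64 * MI, 64 * MI + 1):
    with_64mi = expected_row_groups(num_rows, 64 * MI)
    with_1mi = expected_row_groups(num_rows, 1 * MI)
    print(f"{num_rows} rows: {with_64mi} group(s) at 64 Mi, {with_1mi} group(s) at 1 Mi")
```

Under that assumption, the observed counts (1 group for 64 Mi rows, 2 groups for 64 Mi + 1 rows) match a 64 Mi limit; a 1 Mi limit would give 64 and 65 groups.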
### Component(s)
Python
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] westonpace closed issue #35859: [Python] New row_group_size default of 1 Mi not working
Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace closed issue #35859: [Python] New row_group_size default of 1 Mi not working
URL: https://github.com/apache/arrow/issues/35859
[GitHub] [arrow] westonpace commented on issue #35859: New row_group_size default of 1 Mi not working
Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35859:
URL: https://github.com/apache/arrow/issues/35859#issuecomment-1583614922
This is embarrassing, and shame on me for not writing better regression tests.
https://github.com/apache/arrow/pull/34281 changed the default for C++ and Python, but it was too strict: it wasn't possible (via Python) to go past the default.
https://github.com/apache/arrow/pull/34435 restored the ability to go past the default, but it looks like it changed the default for PyArrow in the process.
I'll put in a fix soon.