Posted to issues@arrow.apache.org by "jonashaag (via GitHub)" <gi...@apache.org> on 2023/05/31 21:17:56 UTC

[GitHub] [arrow] jonashaag opened a new issue, #35859: New row_group default of 1024 * 1024 not working

jonashaag opened a new issue, #35859:
URL: https://github.com/apache/arrow/issues/35859

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   The release notes of Arrow 12 say that the default row group size has been lowered from 64 Mi to 1 Mi rows. But here's what's happening in practice (PyArrow from conda-forge):
   
   ```
   In [1]: import pyarrow.parquet as pq
   
   In [2]: import pyarrow as pa
   
   In [3]: pa.__version__
   Out[3]: '12.0.0'
   
   In [12]: pq.write_table(pa.Table.from_pydict({"x": [42] * (64 * 1024 * 1024 + 1)}), "/tmp/x")
   
   In [13]: pq.read_metadata("/tmp/x")
   Out[13]:
   <pyarrow._parquet.FileMetaData object at 0x7fa6b81eaa70>
     created_by: parquet-cpp-arrow version 12.0.0
     num_columns: 1
     num_rows: 67108865
     num_row_groups: 2  # <=============
     format_version: 2.6
     serialized_size: 485
   
   In [14]: pq.write_table(pa.Table.from_pydict({"x": [42] * (64 * 1024 * 1024 + 0)}), "/tmp/x")
   
   In [15]: pq.read_metadata("/tmp/x")
   Out[15]:
   <pyarrow._parquet.FileMetaData object at 0x7fa6b823d490>
     created_by: parquet-cpp-arrow version 12.0.0
     num_columns: 1
     num_rows: 67108864
     num_row_groups: 1  # <=============
     format_version: 2.6
     serialized_size: 376
   ```
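
To make the expectation explicit, here is a minimal sketch of the arithmetic behind the two outputs above. It assumes the writer simply starts a new row group every `row_group_size` rows, which is a simplification; `expected_row_groups` and `MI` are hypothetical names used only for illustration and are not part of PyArrow.

```python
MI = 1024 * 1024  # one "Mi" = 2**20 rows


def expected_row_groups(num_rows: int, row_group_size: int) -> int:
    """Row groups needed if a new group starts every `row_group_size` rows.

    Assumption: the writer splits purely by row count (ceiling division).
    """
    if row_group_size <= 0:
        raise ValueError("row_group_size must be positive")
    return (num_rows + row_group_size - 1) // row_group_size


# Row counts used in the session above.
rows_plus_one = 64 * MI + 1
rows_exact = 64 * MI

# What a 1 Mi default would produce vs. what a 64 Mi default would produce.
with_1mi = (expected_row_groups(rows_plus_one, MI),
            expected_row_groups(rows_exact, MI))
with_64mi = (expected_row_groups(rows_plus_one, 64 * MI),
             expected_row_groups(rows_exact, 64 * MI))
```

Under this assumption a 1 Mi default would give 65 and 64 row groups, whereas the observed 2 and 1 are what a 64 Mi default gives. As a likely workaround, passing `row_group_size=1024 * 1024` explicitly to `pq.write_table` should sidestep the default altogether.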
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace closed issue #35859: [Python] New row_group_size default of 1 Mi not working

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace closed issue #35859: [Python] New row_group_size default of 1 Mi not working
URL: https://github.com/apache/arrow/issues/35859




[GitHub] [arrow] westonpace commented on issue #35859: New row_group_size default of 1 Mi not working

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35859:
URL: https://github.com/apache/arrow/issues/35859#issuecomment-1583614922

   This is embarrassing, and shame on me for not writing better regression tests.
   
   https://github.com/apache/arrow/pull/34281 changed the default for C++ and Python, but it was too strict: it wasn't possible (via Python) to go past the default.
   
   https://github.com/apache/arrow/pull/34435 restored the ability to go past the default but it looks like it changed the default for pyarrow in the process.
   
   I'll put in a fix soon.

