Posted to issues@arrow.apache.org by "jorisvandenbossche (via GitHub)" <gi...@apache.org> on 2023/03/02 08:34:32 UTC

[GitHub] [arrow] jorisvandenbossche opened a new issue, #34410: [Python][C++] No longer possible to specify higher chunksize than the default for Parquet writing

jorisvandenbossche opened a new issue, #34410:
URL: https://github.com/apache/arrow/issues/34410

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   See https://github.com/apache/arrow/issues/34374#issuecomment-1449926603 for context
   
   https://github.com/apache/arrow/pull/34281 changed the default row group size (`chunksize` in the C++ WriteTable API). However, that PR changed `DEFAULT_MAX_ROW_GROUP_LENGTH`, which doesn't set the _default_ `chunksize`, but actually caps the max chunk (row group) size regardless of the user-specified `chunksize`.
   
   It seems this constant is used both for setting this upper cap and as the default value for `chunk_size` in `WriteTable`. I assume we will have to distinguish those two meanings.
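   For illustration, here is how the `chunk_size` knob is exposed in Python (as `row_group_size`) with numbers small enough to behave as expected; the reported bug only bites when the requested size exceeds `DEFAULT_MAX_ROW_GROUP_LENGTH`, since that constant silently caps the user value. The file name and column name below are arbitrary.
   
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   # A 5000-row table; the column name "x" is arbitrary.
   table = pa.table({"x": list(range(5000))})
   
   # Ask for 1000-row row groups; pyarrow forwards this value as the
   # `chunk_size` argument of the C++ WriteTable API.
   pq.write_table(table, "example.parquet", row_group_size=1000)
   
   meta = pq.ParquetFile("example.parquet").metadata
   print(meta.num_row_groups)  # 5 row groups of 1000 rows each
   ```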
   
   ### Component(s)
   
   C++, Parquet, Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on issue #34410: [Python][C++] No longer possible to specify higher chunksize than the default for Parquet writing

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on issue #34410:
URL: https://github.com/apache/arrow/issues/34410#issuecomment-1453792845

   For Python, could we just set `max_row_group_length` to a very high value (knowing that it will always be overridden with `chunk_size`)? That seems to be the status quo right now, right? (That is, you can't from Python write more than 64 million rows per group)




[GitHub] [arrow] westonpace commented on issue #34410: [Python][C++] No longer possible to specify higher chunksize than the default for Parquet writing

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #34410:
URL: https://github.com/apache/arrow/issues/34410#issuecomment-1453797373

   > For Python, could we just set max_row_group_length to a very high value (knowing that it will always be overridden with chunk_size)? That seems to be the status quo right now, right? (That is, you can't from Python write more than 64 million rows per group)
   
   That seems like a reasonable compromise. I think python will always use `write_table` and so it should work.




[GitHub] [arrow] jorisvandenbossche commented on issue #34410: [Python][C++] No longer possible to specify higher chunksize than the default for Parquet writing

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #34410:
URL: https://github.com/apache/arrow/issues/34410#issuecomment-1454643843

   @wjones127 good idea!




[GitHub] [arrow] westonpace commented on issue #34410: [Python][C++] No longer possible to specify higher chunksize than the default for Parquet writing

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #34410:
URL: https://github.com/apache/arrow/issues/34410#issuecomment-1453884786

   I created #34435 with @wjones127's suggestion




[GitHub] [arrow] westonpace commented on issue #34410: [Python][C++] No longer possible to specify higher chunksize than the default for Parquet writing

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #34410:
URL: https://github.com/apache/arrow/issues/34410#issuecomment-1452823267

   Ah, I think the problem is that `parquet::arrow::FileWriter::WriteTable` has a `chunk_size` argument (which pyarrow sets), and it also takes a `parquet::WriterProperties`, which has `max_row_group_length`, which pyarrow does not set (and this latter property overrides the former). So either we need to change pyarrow to set `parquet::WriterProperties::max_row_group_length` or only change the default for `chunk_size` (this might be preferred).
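   A rough model of the interaction described above (illustrative Python, not the actual C++ code): the writer splits the table on `chunk_size`, but the row-group writer additionally enforces `max_row_group_length`, so the effective row group size is the minimum of the two. The numeric default below assumes the 1Mi-row value that PR #34281 introduced.
   
   ```python
   def effective_row_group_size(chunk_size: int, max_row_group_length: int) -> int:
       """Illustrative model of the behavior in this issue: the user-facing
       chunk_size is silently capped by the writer property
       max_row_group_length."""
       return min(chunk_size, max_row_group_length)
   
   # Assumed post-#34281 default (1Mi rows).
   DEFAULT_MAX_ROW_GROUP_LENGTH = 1024 * 1024
   
   # A chunk_size above the default cap is ignored -- the symptom reported here.
   print(effective_row_group_size(10_000_000, DEFAULT_MAX_ROW_GROUP_LENGTH))
   ```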




[GitHub] [arrow] wjones127 closed issue #34410: [Python][C++] No longer possible to specify higher chunksize than the default for Parquet writing

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 closed issue #34410: [Python][C++] No longer possible to specify higher chunksize than the default for Parquet writing
URL: https://github.com/apache/arrow/issues/34410




[GitHub] [arrow] jorisvandenbossche commented on issue #34410: [Python][C++] No longer possible to specify higher chunksize than the default for Parquet writing

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on issue #34410:
URL: https://github.com/apache/arrow/issues/34410#issuecomment-1453501827

   > or only change the default for `chunk_size` (this might be preferred).
   
   I don't know what the typical usage is from C++. For that, it might be more useful to actually change `max_row_group_length` (since not everyone will write through `WriteTable`).
   
   I misinterpreted `max_row_group_length`: I thought it was some global maximum we enforce regardless of what the user provides, but through `WriterProperties` it is essentially the chunk-size argument users can set via the properties interface. It's also only `WriteTable` that offers the additional `chunk_size` keyword; other write methods like `WriteRecordBatch` only use the properties' `max_row_group_length`.
   
   Naively, I would expect that specifying `chunk_size` in `WriteTable` would override `max_row_group_length`, rather than being capped by it.
   Changing that to give priority to `chunk_size` would also solve the issue, but that's a breaking change?
   
   > So either we need to change pyarrow to set `parquet::WriterProperties::max_row_group_length`
   
   That's also not that simple, since the Python side has the same structure as the C++ side: the `ParquetWriter` class is created with properties, and only afterwards does the `write_table` method take its `chunk_size` keyword, at which point the writer object already exists (and in principle you can call that method multiple times with different `chunk_size` values while writing to the same file).
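   The pattern described above (one writer, multiple `write_table` calls with different chunk sizes) looks roughly like this in Python; the file name and table contents are illustrative:
   
   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   table = pa.table({"x": list(range(3000))})
   
   # The writer is constructed first -- this is where writer properties such
   # as max_row_group_length would be fixed on the C++ side ...
   with pq.ParquetWriter("multi.parquet", table.schema) as writer:
       # ... while each write_table call can pass its own row_group_size
       # (the `chunk_size` keyword at the C++ level).
       writer.write_table(table, row_group_size=1000)   # 3 groups of 1000
       writer.write_table(table, row_group_size=1500)   # 2 groups of 1500
   
   meta = pq.ParquetFile("multi.parquet").metadata
   print(meta.num_row_groups)  # 5
   ```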
   

