Posted to github@arrow.apache.org by "Mytherin (via GitHub)" <gi...@apache.org> on 2023/03/01 15:37:40 UTC

[GitHub] [arrow] Mytherin commented on pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Mytherin commented on PR #34281:
URL: https://github.com/apache/arrow/pull/34281#issuecomment-1450355278

   Thanks for this change! This is great.
   
   > 100k is probably workable but I don't think ideal. 100k in an int32 column would mean, at most (e.g. no encodings / compression), 400KiB per column. If you are reading scattershot from a file (e.g. 12 out of 100 columns) then this starts to degrade performance on HDD (and probably SSD as well) and S3 (but would be fine on NVME or hot-in-memory).
   
   Agreed, we might change DuckDB's default to be larger (e.g. 500K or 1 million rows) as well. The problem with making it too large is that many data sets are inherently not very large, yet they still benefit immensely from the parallelism that multiple row groups offer. Parallelism already starts to matter significantly when dealing with only a few million rows.
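
   For illustration, here is a minimal pyarrow sketch (the file name and the 4M-row table are made up) that writes a file with an explicit `row_group_size` and checks how many row groups it ends up with:

   ```python
   # Write 1M-row row groups and count them; each row group can then be
   # scanned by a separate thread.
   import pyarrow as pa
   import pyarrow.parquet as pq

   table = pa.table({"x": pa.array(range(4_000_000), type=pa.int32())})
   pq.write_table(table, "example.parquet", row_group_size=1_000_000)

   print(pq.ParquetFile("example.parquet").metadata.num_row_groups)  # 4
   ```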
   
   > That being said, no single default is going to work for all cases (100k is better for single row reads for example). I personally think it would be more useful to make scanners that are more robust against large row groups (e.g. pyarrow's could be improved) by supporting reading at the page level instead of the row group level (still not 100% convinced this is possible). So at the moment I'm still leaning towards 1Mi but I could be convinced otherwise.
   
   The challenge there is that you need to align the multiple columns somehow, and there is no guarantee that pages are aligned between columns. For example, one column might be partitioned into pages of 50K rows (`[0..50K][50K..100K]...`) while another is partitioned into pages of 60K rows (`[0..60K][60K..120K]...`). Splitting the work over multiple threads at page boundaries doesn't work nicely for that reason.
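
   To see the misalignment concretely, here is a small Python sketch (using the hypothetical 50K/60K page sizes from above) that lists, for each page of the first column, which pages of the second column overlap it:

   ```python
   # Page row-boundaries for two hypothetical columns: 50K rows per page
   # vs. 60K rows per page, over the same 300K rows.
   TOTAL_ROWS = 300_000

   def page_ranges(rows_per_page):
       return [(start, min(start + rows_per_page, TOTAL_ROWS))
               for start in range(0, TOTAL_ROWS, rows_per_page)]

   col_a = page_ranges(50_000)   # [0..50K) [50K..100K) ...
   col_b = page_ranges(60_000)   # [0..60K) [60K..120K) ...

   for a_lo, a_hi in col_a:
       overlap = [(b_lo, b_hi) for b_lo, b_hi in col_b
                  if b_lo < a_hi and a_lo < b_hi]
       print(f"A[{a_lo}..{a_hi}) overlaps B pages {overlap}")
   # Most pages of A straddle two pages of B, so a per-page work unit on
   # column A cannot read column B independently of its neighbours.
   ```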
   
   The writer could guarantee that pages are aligned so that each page has the same number of rows (e.g. `120K` rows each), but from the reader's perspective you are dependent on how the pages happen to be laid out in the file.
   
   Another alternative is to parallelize over columns instead, as each column can be read independently. That does not fit very cleanly into most execution workflows, however, and the available parallelism is entirely query-dependent (e.g. reading 2 columns means you are limited to 2 cores).
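
   As a rough sketch of what column-level parallelism looks like (pyarrow plus a plain thread pool; the file and column names are placeholders), note how the worker count is bounded by the number of columns in the query:

   ```python
   # Read each requested column in its own thread, then recombine.
   from concurrent.futures import ThreadPoolExecutor
   import pyarrow as pa
   import pyarrow.parquet as pq

   path = "wide_table.parquet"
   columns = ["a", "b"]  # reading 2 columns -> at most 2 cores busy

   def read_column(name):
       return name, pq.read_table(path, columns=[name]).column(name)

   with ThreadPoolExecutor(max_workers=len(columns)) as pool:
       results = dict(pool.map(read_column, columns))

   table = pa.table(results)  # single table again, columns read in parallel
   ```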
   

