Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/02/21 17:28:11 UTC

[GitHub] [arrow] westonpace opened a new pull request, #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

westonpace opened a new pull request, #34281:
URL: https://github.com/apache/arrow/pull/34281

   BREAKING CHANGE: Changes the default row group size when writing parquet files.
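
   Since this changes the default, callers that care about row group sizing may want to pin it explicitly. A minimal sketch using pyarrow's `write_table` keyword documented in this PR (the file name is illustrative):

   ```python
   import pyarrow as pa
   import pyarrow.parquet as pq

   table = pa.table({"x": list(range(10))})

   # row_group_size counts rows, not bytes; setting it explicitly makes
   # the written layout independent of the library default this PR
   # changes (64Mi rows -> 1Mi rows).
   pq.write_table(table, "example.parquet", row_group_size=1024 * 1024)
   ```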




[GitHub] [arrow] westonpace commented on pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #34281:
URL: https://github.com/apache/arrow/pull/34281#issuecomment-1442290439

   > There are some old discussions before
   
   Thank you.  I thought I remembered some past discussion on this topic.  It appears the conclusion reached in the email thread was the same (the default should be 64MiB if possible, but 1Mi rows if not).
   
   There seems to be consensus for this change.  I will merge this tomorrow.




[GitHub] [arrow] XinyuZeng commented on pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Posted by "XinyuZeng (via GitHub)" <gi...@apache.org>.
XinyuZeng commented on PR #34281:
URL: https://github.com/apache/arrow/pull/34281#issuecomment-1441377839

   There are some old discussions before: https://lists.apache.org/thread/g868kbv4jzszwrs65yv45ym1wq5ztyvq
   
   Also, arrow-rs sets the default to 1Mi, just for reference: https://github.com/apache/arrow-rs/blob/master/parquet/src/file/properties.rs#L83




[GitHub] [arrow] westonpace commented on a diff in pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #34281:
URL: https://github.com/apache/arrow/pull/34281#discussion_r1113650642


##########
cpp/src/parquet/properties.h:
##########
@@ -264,8 +264,8 @@ class PARQUET_EXPORT WriterProperties {
       return this;
     }
 
-    /// Specify the max row group length.
-    /// Default 64M.
+    /// Specify the max number of rows to put in a single row group.
+    /// Default 64Mi rows.

Review Comment:
   Ah, good catch.  Thanks.  I've made this change.





[GitHub] [arrow] wjones127 commented on pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on PR #34281:
URL: https://github.com/apache/arrow/pull/34281#issuecomment-1439099401

   > I personally think it would be more useful to make scanners that are more robust against large row groups (e.g. pyarrow's could be improved) by supporting reading at the page level instead of the row group level (still not 100% convinced this is possible). So at the moment I'm still leaning towards 1Mi but I could be convinced otherwise.
   
   I agree with this assessment. For highly selective / single row reads, we should be optimizing the PageIndex and Page pruning paths. That work seems to be in progress:
   
   * Read: https://github.com/apache/arrow/issues/33596
   * Write: https://github.com/apache/arrow/issues/34053




[GitHub] [arrow] wjones127 commented on pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on PR #34281:
URL: https://github.com/apache/arrow/pull/34281#issuecomment-1438869939

   Also FYI I think DuckDB defaults to 100,000 (unless I read their source wrong earlier). In delta-rs we mimicked that with the PyArrow writer in https://github.com/delta-io/delta-rs/pull/818




[GitHub] [arrow] westonpace commented on pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #34281:
URL: https://github.com/apache/arrow/pull/34281#issuecomment-1450516224

   > The challenge there is that you need to align the multiple columns somehow, and there is no guarantee that pages are aligned between columns. For example, you might have a column that has rows partitioned in 50K pages ([0..50K][50K..100K]...) and a column that has rows partitioned in 60K pages ([0..60K][60K..120K]...). Splitting pages over multiple threads doesn't work nicely for that reason.
   
   In my head, the way this would be addressed is in stages:
   
    * I/O Stage: Each column reader independently reads ahead 8-16MB worth of data pages.  This could be configurable and tweaked and could even change dynamically based on how quickly a column is being read.
    * Decoding Stage: Decoding is parallel across columns but has batch semantics.  In other words, the "batch decoder" is asked to decode a batch of X rows (e.g. 1 million).  Space could be allocated at this point for the output batch.  This then delegates to the various column decoders so each column decoder is asked to decode X items.  So maybe a column reads 5 pages to fill this batch or maybe it reads 50.
   
   I have no idea how close Arrow's parquet-cpp reader is to this implementation or whether it is even feasible.
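
   As a rough illustration of the decoding stage only, a hypothetical Python sketch; `ColumnReader`, `decode_batch`, and the page lists are invented for this example, the I/O readahead stage is omitted, and nothing here reflects parquet-cpp's actual structure:

   ```python
   from dataclasses import dataclass, field

   @dataclass
   class ColumnReader:
       pages: list                                  # stand-in for decoded data pages
       _buffer: list = field(default_factory=list)  # values pulled but not yet returned
       _next_page: int = 0

       def read(self, n):
           # Pull pages until the buffer covers n rows; one column might
           # need 5 pages here while another needs 50.
           while len(self._buffer) < n and self._next_page < len(self.pages):
               self._buffer.extend(self.pages[self._next_page])
               self._next_page += 1
           out, self._buffer = self._buffer[:n], self._buffer[n:]
           return out

   def decode_batch(columns, batch_rows):
       # Batch semantics: every column yields the same number of rows, so
       # the output batch stays rectangular even though page boundaries differ.
       return {name: reader.read(batch_rows) for name, reader in columns.items()}

   # Two columns with deliberately misaligned page sizes (3 rows vs 4 rows).
   cols = {
       "a": ColumnReader(pages=[[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
       "b": ColumnReader(pages=[[10, 20, 30, 40], [50, 60, 70, 80], [90]]),
   }
   print(decode_batch(cols, 5))  # {'a': [1, 2, 3, 4, 5], 'b': [10, 20, 30, 40, 50]}
   ```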




[GitHub] [arrow] ursabot commented on pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #34281:
URL: https://github.com/apache/arrow/pull/34281#issuecomment-1446866873

   Benchmark runs are scheduled for baseline = 6a4bcb36c091fea07c03c57b2e31dd29f9846ac2 and contender = 7e248b10b428ad1275045b229323672afba6c5e4. 7e248b10b428ad1275045b229323672afba6c5e4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/e96fd7be764a4fc5ac19f1de6d6d7904...24ec9053198f45e1ac4002ca02432a3f/)
   [Failed :arrow_down:1.99% :arrow_up:0.95%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/12500204759f431bb2b7aa86ff92cfa2...4a1908e927fb45db9e0d3635ae6b32e7/)
   [Finished :arrow_down:1.53% :arrow_up:0.26%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/4f4776c497624be189af3c17a5f627e8...cc0d3e5e476148ff8ad72af2a31fe8eb/)
   [Finished :arrow_down:0.25% :arrow_up:0.03%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/6d896eb4064a42658e4f2eae7efb3b31...8918b2a2cb0d42b3b883c860076197b2/)
   Buildkite builds:
   [Finished] [`7e248b10` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2439)
   [Finished] [`7e248b10` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2469)
   [Finished] [`7e248b10` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2436)
   [Finished] [`7e248b10` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2460)
   [Finished] [`6a4bcb36` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2438)
   [Failed] [`6a4bcb36` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2468)
   [Finished] [`6a4bcb36` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2435)
   [Finished] [`6a4bcb36` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2459)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   




[GitHub] [arrow] westonpace commented on pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #34281:
URL: https://github.com/apache/arrow/pull/34281#issuecomment-1452808447

   > Maybe it's better to define default RowGroup length for different FS or storage backend.
   
   Agreed.
   
   > By the way, it's funny that, in arrow's io-merging in S3, the default range limit is 32M :)
   
   AWS [suggests](https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance-guidelines.html) 8MB or 16MB.  32MB is probably fine, and it's still possible to get 32MiB reads, for example when reading multiple columns at a time.
   
   My own testing with HDD shows that ~4MB is good enough to get sequential reads.
   
   1Mi rows, without any encodings, will yield ~4-8MiB for a column (assuming 4 or 8 bytes per value), so in many cases I think the reads will still be large enough.
   
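
   The sizing arithmetic used here and elsewhere in this thread, as a worked example (worst case: no encoding or compression):

   ```python
   MiB = 1024 * 1024

   def column_chunk_bytes(rows_per_group, bytes_per_value):
       # Worst-case column chunk size: every value stored raw.
       return rows_per_group * bytes_per_value

   # 1Mi rows of 4- or 8-byte values: ~4-8MiB per column chunk, large
   # enough for efficient sequential reads on HDD or S3.
   print(column_chunk_bytes(1024 * 1024, 4) / MiB)   # 4.0
   print(column_chunk_bytes(1024 * 1024, 8) / MiB)   # 8.0

   # 100k rows of int32: 400,000 bytes (~391KiB), the regime where
   # scattershot column reads start to degrade on HDD and S3.
   print(column_chunk_bytes(100_000, 4) / 1024)      # 390.625
   ```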




[GitHub] [arrow] westonpace commented on a diff in pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #34281:
URL: https://github.com/apache/arrow/pull/34281#discussion_r1113650802


##########
python/pyarrow/parquet/core.py:
##########
@@ -1038,9 +1038,9 @@ def write(self, table_or_batch, row_group_size=None):
         ----------
         table_or_batch : {RecordBatch, Table}
         row_group_size : int, default None
-            Maximum size of each written row group. If None, the
-            row group size will be the minimum of the input
-            table or batch length and 64 * 1024 * 1024.
+            Maximum # of rows in each written row group. If None,

Review Comment:
   I agree.  I've switched to `number`.





[GitHub] [arrow] westonpace commented on a diff in pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #34281:
URL: https://github.com/apache/arrow/pull/34281#discussion_r1123941531


##########
python/pyarrow/parquet/core.py:
##########
@@ -3153,9 +3153,9 @@ def write_table(table, where, row_group_size=None, version='2.4',
 table : pyarrow.Table
 where : string or pyarrow.NativeFile
 row_group_size : int
-    Maximum size of each written row group. If None, the
-    row group size will be the minimum of the Table size
-    and 64 * 1024 * 1024.
+    Maximum number of rows in each written row group. If None, the
+    row group size will be the minimum of the Table size and
+    1024 * 1024.

Review Comment:
   It is rows.





[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #34281:
URL: https://github.com/apache/arrow/pull/34281#discussion_r1113529571


##########
python/pyarrow/parquet/core.py:
##########
@@ -1038,9 +1038,9 @@ def write(self, table_or_batch, row_group_size=None):
         ----------
         table_or_batch : {RecordBatch, Table}
         row_group_size : int, default None
-            Maximum size of each written row group. If None, the
-            row group size will be the minimum of the input
-            table or batch length and 64 * 1024 * 1024.
+            Maximum # of rows in each written row group. If None,

Review Comment:
   ```suggestion
               Maximum number of rows in each written row group. If None,
   ```
   
   (I suppose "#" is quite universal, but just using the word "number" seems even more explicit and not much longer.)





[GitHub] [arrow] github-actions[bot] commented on pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34281:
URL: https://github.com/apache/arrow/pull/34281#issuecomment-1438851226

   * Closes: #34280




[GitHub] [arrow] westonpace commented on pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #34281:
URL: https://github.com/apache/arrow/pull/34281#issuecomment-1439075520

   > Also FYI I think [DuckDB defaults to 100,000](https://github.com/Mytherin/duckdb/commit/3b8ad037bff978b263fd06ec9d0635fcb049e92a#diff-a95d5e017c81184e18f0f04c5df3b72061fd80555d581a4ce163af5deca3dac0R394) (unless I read their source wrong earlier).
   
   100k is probably workable, but I don't think it's ideal.  100k in an int32 column would mean, at most (e.g. no encodings / compression), 400KiB per column.  If you are reading scattershot from a file (e.g. 12 out of 100 columns), this starts to degrade performance on HDD (and probably SSD as well) and S3 (but would be fine on NVMe or hot-in-memory).
   
   That being said, no single default is going to work for all cases (100k is better for single row reads for example).  I personally think it would be more useful to make scanners that are more robust against large row groups (e.g. pyarrow's could be improved) by supporting reading at the page level instead of the row group level (still not 100% convinced this is possible).  So at the moment I'm still leaning towards 1Mi but I could be convinced otherwise.




[GitHub] [arrow] westonpace commented on pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #34281:
URL: https://github.com/apache/arrow/pull/34281#issuecomment-1454236034

   > 4MB per row-group or column-chunk?
   
   Column chunk.
   
   > Or we should implement it later?
   
   If I understand correctly then yes, this is something that would be nice someday.




[GitHub] [arrow] mapleFU commented on a diff in pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34281:
URL: https://github.com/apache/arrow/pull/34281#discussion_r1122614458


##########
python/pyarrow/parquet/core.py:
##########
@@ -3153,9 +3153,9 @@ def write_table(table, where, row_group_size=None, version='2.4',
 table : pyarrow.Table
 where : string or pyarrow.NativeFile
 row_group_size : int
-    Maximum size of each written row group. If None, the
-    row group size will be the minimum of the Table size
-    and 64 * 1024 * 1024.
+    Maximum number of rows in each written row group. If None, the
+    row group size will be the minimum of the Table size and
+    1024 * 1024.

Review Comment:
   it's "bytes" or other?





[GitHub] [arrow] jorisvandenbossche commented on pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #34281:
URL: https://github.com/apache/arrow/pull/34281#issuecomment-1439050500

   cc @rjzamora, do you have an idea whether this would affect dask? I didn't find that dask exposes this keyword explicitly in the to_parquet function, but I suppose it can be passed through. I can imagine, though, that if most users rely on the default and are then reading with `split_row_groups=True`, this change might affect dask users.




[GitHub] [arrow] github-actions[bot] commented on pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34281:
URL: https://github.com/apache/arrow/pull/34281#issuecomment-1438851280

   :warning: GitHub issue #34280 **has been automatically assigned in GitHub** to PR creator.




[GitHub] [arrow] ursabot commented on pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #34281:
URL: https://github.com/apache/arrow/pull/34281#issuecomment-1446869871

   ['Python', 'R'] benchmarks have high level of regressions.
   [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/4f4776c497624be189af3c17a5f627e8...cc0d3e5e476148ff8ad72af2a31fe8eb/)
   




[GitHub] [arrow] mapleFU commented on pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #34281:
URL: https://github.com/apache/arrow/pull/34281#issuecomment-1451333486

   By the way, it's funny that, in arrow's io-merging in S3, the default range limit is 32M :)
   
   ```c++
   class ARROW_EXPORT ReadRangeCache {
    public:
     static constexpr int64_t kDefaultHoleSizeLimit = 8192;
     static constexpr int64_t kDefaultRangeSizeLimit = 32 * 1024 * 1024;
   ```
   
   /cc @pitrou 




[GitHub] [arrow] mapleFU commented on pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on PR #34281:
URL: https://github.com/apache/arrow/pull/34281#issuecomment-1452873147

   @westonpace Does the HDD ~4MB sequential read mean at least 4MB per row-group or column-chunk?
   Also, I don't think we currently support splitting a row group by rows.  How can we use 1Mi rows to split?




[GitHub] [arrow] wjones127 commented on a diff in pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on code in PR #34281:
URL: https://github.com/apache/arrow/pull/34281#discussion_r1113382826


##########
cpp/src/parquet/properties.h:
##########
@@ -264,8 +264,8 @@ class PARQUET_EXPORT WriterProperties {
       return this;
     }
 
-    /// Specify the max row group length.
-    /// Default 64M.
+    /// Specify the max number of rows to put in a single row group.
+    /// Default 64Mi rows.

Review Comment:
   ```suggestion
       /// Default 1Mi rows.
   ```





[GitHub] [arrow] westonpace merged pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace merged PR #34281:
URL: https://github.com/apache/arrow/pull/34281




[GitHub] [arrow] westonpace commented on pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #34281:
URL: https://github.com/apache/arrow/pull/34281#issuecomment-1447287875

   This change caused a significant performance regression when writing parquet files.  That may be unavoidable, but I've opened https://github.com/apache/arrow/issues/34374 to at least investigate.




[GitHub] [arrow] Mytherin commented on pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Posted by "Mytherin (via GitHub)" <gi...@apache.org>.
Mytherin commented on PR #34281:
URL: https://github.com/apache/arrow/pull/34281#issuecomment-1450355278

   Thanks for this change! This is great.
   
   > 100k is probably workable, but I don't think it's ideal. 100k in an int32 column would mean, at most (e.g. no encodings / compression), 400KiB per column. If you are reading scattershot from a file (e.g. 12 out of 100 columns), this starts to degrade performance on HDD (and probably SSD as well) and S3 (but would be fine on NVMe or hot-in-memory).
   
   Agreed, we might change DuckDB's default to be larger (e.g. 500K or 1 million) as well. The problem with making it too large is that many data sets are inherently not very large, but they still benefit immensely from the parallelism that multiple row groups offer. Parallelism already starts to matter significantly when dealing with a few million rows.
   
   > That being said, no single default is going to work for all cases (100k is better for single row reads for example). I personally think it would be more useful to make scanners that are more robust against large row groups (e.g. pyarrow's could be improved) by supporting reading at the page level instead of the row group level (still not 100% convinced this is possible). So at the moment I'm still leaning towards 1Mi but I could be convinced otherwise.
   
   The challenge there is that you need to align the multiple columns somehow, and there is no guarantee that pages are aligned between columns. For example, you might have a column that has rows partitioned in 50K pages (`[0..50K][50K..100K]...`) and a column that has rows partitioned in 60K pages (`[0..60K][60K..120K]...`). Splitting pages over multiple threads doesn't work nicely for that reason.
   
   The writer could guarantee that pages are aligned so that each page has the same number of rows (e.g. `120K` rows each). But from the readers' perspective you are dependent on how the pages are laid out in the file.
   
   Another alternative is to parallelize over the number of columns instead, as each column can be read independently. That does not fit very cleanly into most execution workflows, however, and the available parallelism is entirely query-dependent (e.g. reading 2 columns means you are limited to 2 cores).
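
   To make the misalignment above concrete, a hypothetical sketch (the helper is invented for illustration, not reader code): the same row range maps to a different page set in each column when page boundaries differ.

   ```python
   import bisect

   def pages_for_rows(page_starts, first_row, last_row):
       # page_starts[i] is the first row of page i within one column chunk;
       # return the indices of the pages overlapping [first_row, last_row).
       first = bisect.bisect_right(page_starts, first_row) - 1
       last = bisect.bisect_left(page_starts, last_row)
       return list(range(first, last))

   col_a = list(range(0, 300_000, 50_000))   # pages of 50K rows
   col_b = list(range(0, 300_000, 60_000))   # pages of 60K rows

   # Rows [100K, 200K) need pages 2-3 of column A but pages 1-3 of column B.
   print(pages_for_rows(col_a, 100_000, 200_000))  # [2, 3]
   print(pages_for_rows(col_b, 100_000, 200_000))  # [1, 2, 3]
   ```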
   




[GitHub] [arrow] wgtmac commented on pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Posted by "wgtmac (via GitHub)" <gi...@apache.org>.
wgtmac commented on PR #34281:
URL: https://github.com/apache/arrow/pull/34281#issuecomment-1451307061

   > The writer could guarantee that pages are aligned so that each page has the same number of rows (e.g. 120K rows each). But from the readers' perspective you are dependent on how the pages are laid out in the file.
   
   I agree with this. Aligning page boundaries benefits reading by avoiding unnecessary I/O and decoding. No special care is required on the reader side.
   
   > I have no idea how close Arrow's parquet-cpp reader is to this implementation or whether it is even feasible.
   
   Arrow's parquet-cpp reader currently reads the pages of a single column chunk sequentially. The column reader does not know the offsets/lengths of all pages in advance, and without knowledge of the page index it is difficult to schedule I/O and decoding efficiently. This is on my plan to contribute, but I cannot promise a time frame yet.
   
   In my experience, 1Mi rows seems small if the schema contains only a few columns. It would be better to decide the row group boundary by `num_of_rows` and `estimated_compressed_size` together, as sketched below.
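
   A hypothetical sketch of that combined heuristic (invented for illustration, not Arrow's actual writer logic): close a row group as soon as either the row count or the estimated compressed size crosses its cap.

   ```python
   def plan_row_groups(batches, max_rows=1024 * 1024, max_bytes=64 * 1024 * 1024):
       # batches: iterable of (num_rows, estimated_compressed_bytes) pairs.
       groups, rows, size = [], 0, 0
       for n, b in batches:
           rows += n
           size += b
           if rows >= max_rows or size >= max_bytes:
               groups.append((rows, size))
               rows, size = 0, 0
       if rows:
           groups.append((rows, size))
       return groups

   # Heavy rows (~128 bytes/row): 64Ki-row batches at ~8MiB each hit the
   # 64MiB byte cap long before the 1Mi-row cap.
   print(plan_row_groups([(65_536, 8 * 1024 * 1024)] * 16))
   # [(524288, 67108864), (524288, 67108864)]
   ```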




[GitHub] [arrow] rjzamora commented on pull request #34281: GH-34280: [C++][Python] Clarify meaning of row_group_size and change default to 1Mi

Posted by "rjzamora (via GitHub)" <gi...@apache.org>.
rjzamora commented on PR #34281:
URL: https://github.com/apache/arrow/pull/34281#issuecomment-1439238620

   > do you have an idea whether this would affect dask? I didn't find that dask exposes this keyword explicitly in the to_parquet function, but I suppose it can be passed through. I can imagine, though, that if most users rely on the default and are then reading with split_row_groups=True, this change might affect dask users.
   
   Thanks for the ping @jorisvandenbossche!  If I understand correctly, this change is more likely to benefit Dask than to cause problems. When using `to_parquet(..., engine="pyarrow")`, users sometimes end up with oversized row-groups that cause memory errors during ETL. These users can already pass `row_group_size` through to mitigate the issue, but it would be nice if the default made OOM errors a bit less common.  With that said, the size of a single row depends entirely on the properties of the dataset, so I don't have any strong feelings about the row-count default. If it were up to me, pyarrow would fall back to a byte-size default rather than a row-count default (like cudf).

