You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/03/03 17:48:39 UTC

[GitHub] [arrow] westonpace opened a new pull request, #34435: GH-34410: [Python] Allow chunk sizes larger than the default to be used

westonpace opened a new pull request, #34435:
URL: https://github.com/apache/arrow/pull/34435

   ### Rationale for this change
   
   We changed the default chunk size from 64Mi rows to 1Mi rows.  However, it turns out that this property was being treated not just as the default but also as the absolute max.  So it was no longer possible to specify chunk sizes larger than 1Mi rows.  This change separates those two things and restores the max to 64Mi rows.
   
   ### What changes are included in this PR?
   
   Pyarrow will now set the `ParquetWriterProperties::max_row_group_length` to 64Mi when constructing a parquet writer.
   
   ### Are these changes tested?
   
   Yes.  Unit tests are added.
   
   ### Are there any user-facing changes?
   
   No.  The previous change #34281 changed two defaults (absolute max and default).  This PR restores the absolute max back to what it was before.  So it is removing a user-facing change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #34435: GH-34410: [Python] Allow chunk sizes larger than the default to be used

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #34435:
URL: https://github.com/apache/arrow/pull/34435#discussion_r1127045690


##########
python/pyarrow/tests/parquet/test_parquet_writer.py:
##########
@@ -202,6 +203,39 @@ def test_parquet_writer_write_wrappers(tempdir, filesystem):
     tm.assert_frame_equal(result, df)
 
 
+@pytest.mark.pandas

Review Comment:
   Good idea



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 commented on a diff in pull request #34435: GH-34410: [Python] Allow chunk sizes larger than the default to be used

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on code in PR #34435:
URL: https://github.com/apache/arrow/pull/34435#discussion_r1126815311


##########
python/pyarrow/tests/parquet/test_parquet_writer.py:
##########
@@ -202,6 +203,39 @@ def test_parquet_writer_write_wrappers(tempdir, filesystem):
     tm.assert_frame_equal(result, df)
 
 
+@pytest.mark.pandas

Review Comment:
   Perhaps also:
   
   ```suggestion
   @pytest.mark.large_memory
   @pytest.mark.pandas
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wjones127 merged pull request #34435: GH-34410: [Python] Allow chunk sizes larger than the default to be used

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 merged PR #34435:
URL: https://github.com/apache/arrow/pull/34435


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34435: GH-34410: [Python] Allow chunk sizes larger than the default to be used

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34435:
URL: https://github.com/apache/arrow/pull/34435#issuecomment-1453884999

   :warning: GitHub issue #34410 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #34435: GH-34410: [Python] Allow chunk sizes larger than the default to be used

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #34435:
URL: https://github.com/apache/arrow/pull/34435#discussion_r1223840611


##########
python/pyarrow/_parquet.pyx:
##########
@@ -1597,6 +1597,15 @@ cdef shared_ptr[WriterProperties] _create_writer_properties(
         props.encryption(
             (<FileEncryptionProperties>encryption_properties).unwrap())
 
+    # For backwards compatibility reasons we cap the maximum row group size
+    # at 64Mi rows.  This could be changed in the future, though it would be
+    # a breaking change.
+    #
+    # The user can always specify a smaller row group size (and the default
+    # is smaller) when calling write_table.  If the call to write_table uses
+    # a size larger than this then it will be latched to this value.
+    props.max_row_group_length(64*1024*1024)

Review Comment:
   does this causing the bug in https://github.com/apache/arrow/issues/35859#issuecomment-1583614922 ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #34435: GH-34410: [Python] Allow chunk sizes larger than the default to be used

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #34435:
URL: https://github.com/apache/arrow/pull/34435#discussion_r1224462453


##########
python/pyarrow/_parquet.pyx:
##########
@@ -1597,6 +1597,15 @@ cdef shared_ptr[WriterProperties] _create_writer_properties(
         props.encryption(
             (<FileEncryptionProperties>encryption_properties).unwrap())
 
+    # For backwards compatibility reasons we cap the maximum row group size
+    # at 64Mi rows.  This could be changed in the future, though it would be
+    # a breaking change.
+    #
+    # The user can always specify a smaller row group size (and the default
+    # is smaller) when calling write_table.  If the call to write_table uses
+    # a size larger than this then it will be latched to this value.
+    props.max_row_group_length(64*1024*1024)

Review Comment:
   Sort of.  There are two properties which is what I think causes the confusion:
   
   `parquet::arrow::WriteTable::chunk_size` and `parquet::WriterProperties::max_row_group_length`.
   
   I think it is ok for `max_row_group_length` to always be 64Mi.  This is what the comment is justifying.  However, we now need to change the default `chunk_size` to be 1Mi (this is what the default is in C++).  Currently pyarrow uses `table.num_rows()` as a default.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34435: GH-34410: [Python] Allow chunk sizes larger than the default to be used

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34435:
URL: https://github.com/apache/arrow/pull/34435#issuecomment-1453884921

   * Closes: #34410


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #34435: GH-34410: [Python] Allow chunk sizes larger than the default to be used

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #34435:
URL: https://github.com/apache/arrow/pull/34435#issuecomment-1457426071

   Benchmark runs are scheduled for baseline = 8e9866df21aeb771d2d600191db246b646835ec6 and contender = fc950192ae7b801a76adc3b3de410dc22da83fd4. fc950192ae7b801a76adc3b3de410dc22da83fd4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/06dd06f9141241cb9bd5acd418e53e2d...83e6648f8bae4aed82c17bd6070463d6/)
   [Failed :arrow_down:0.31% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/511e0b1398cf454081ff325d3959f424...70a51126d35740449494eb37418701a9/)
   [Finished :arrow_down:0.26% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/a23fdba687494b5e90938e9ad1f2d9eb...74c0a39e0c1b459fbc1ac2ed65f32e56/)
   [Finished :arrow_down:0.73% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/ad710b288a12475e8729a96d65d4b976...8fa420ed32aa46ba982d44fa473dbca8/)
   Buildkite builds:
   [Finished] [`fc950192` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2484)
   [Failed] [`fc950192` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2514)
   [Finished] [`fc950192` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2482)
   [Finished] [`fc950192` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2505)
   [Finished] [`8e9866df` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2483)
   [Finished] [`8e9866df` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2513)
   [Finished] [`8e9866df` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2481)
   [Finished] [`8e9866df` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2504)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org