You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/03/14 23:17:44 UTC

[GitHub] [arrow] tachyonwill opened a new pull request #12630: ARROW-15934: Expose write_batch_size in python.

tachyonwill opened a new pull request #12630:
URL: https://github.com/apache/arrow/pull/12630


   Already exposed in C++. Useful when dealing columns of large strings.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche closed pull request #12630: ARROW-15934: [Python] Expose write_batch_size in python

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche closed pull request #12630:
URL: https://github.com/apache/arrow/pull/12630


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #12630: ARROW-15934: [Python] Expose write_batch_size in python

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #12630:
URL: https://github.com/apache/arrow/pull/12630#issuecomment-1074262391


   I just moved the test from `test_basic.py` (`test_parquet_writer.py` currently has tests specifically for the ParquetWriter API, while this is a test for a generic keyword, and eg the `data_page_size` keyword is also tested in `test_basic.py`)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12630: ARROW-15934: [Python] Expose write_batch_size in python

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12630:
URL: https://github.com/apache/arrow/pull/12630#issuecomment-1074820888


   Benchmark runs are scheduled for baseline = 106af3c4419f317973bc41abe7e5793911cff39b and contender = 9b63506c169cf660c518bf7a30523b46fd729be0. 9b63506c169cf660c518bf7a30523b46fd729be0 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/71fe9676c4254e0ab1ee48d3c820dc7d...0470965b331e471c98226ed76b7c68b9/)
   [Finished :arrow_down:0.34% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/7ea2127d52104a4eaf45ae42cd903677...5ee931ff12ea425899edcc8ee5b4715d/)
   [Finished :arrow_down:0.36% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/a31c576e158440688cba18ae5a2db147...acdfaf91b72b468c8ccb8ac5a18a0f88/)
   [Finished :arrow_down:0.21% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/d0b3b2ea7d19407f989ddf5e55025f77...bf7fa330a2da42ada3ca968204d91445/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #12630: ARROW-15934: [Python] Expose write_batch_size in python

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #12630:
URL: https://github.com/apache/arrow/pull/12630#discussion_r828835827



##########
File path: python/pyarrow/parquet.py
##########
@@ -605,6 +605,9 @@ def _sanitize_table(table, new_schema, flavor):
     If None, no encryption will be done.
     The encryption properties can be created using:
     ``CryptoFactory.file_encryption_properties()``.
+write_batch_size : int, default None
+    Number of values to write to a page at a time. If None, use the default of
+    1024.

Review comment:
       Ah, yes, sorry was looking at the wrong line (it seems that the other constant I pointed to is never used anywhere).
   
   Thanks for the explanation how those two arguments interact. Would it be worth adding something about that to the docstring? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] tachyonwill commented on a change in pull request #12630: ARROW-15934: [Python] Expose write_batch_size in python

Posted by GitBox <gi...@apache.org>.
tachyonwill commented on a change in pull request #12630:
URL: https://github.com/apache/arrow/pull/12630#discussion_r829260876



##########
File path: python/pyarrow/parquet.py
##########
@@ -605,6 +605,9 @@ def _sanitize_table(table, new_schema, flavor):
     If None, no encryption will be done.
     The encryption properties can be created using:
     ``CryptoFactory.file_encryption_properties()``.
+write_batch_size : int, default None
+    Number of values to write to a page at a time. If None, use the default of
+    1024.

Review comment:
       Added more commentary in the docstring, but I am intentionally trying to be somewhat vague here and avoid exposing too many implementation details. I think we might want to change the way we write parquet pages in the future. For example, I suspect, but haven't investigated, that the current logic does not work correctly for V2 data pages.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on a change in pull request #12630: ARROW-15934: [Python] Expose write_batch_size in python

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #12630:
URL: https://github.com/apache/arrow/pull/12630#discussion_r827041784



##########
File path: python/pyarrow/parquet.py
##########
@@ -605,6 +605,9 @@ def _sanitize_table(table, new_schema, flavor):
     If None, no encryption will be done.
     The encryption properties can be created using:
     ``CryptoFactory.file_encryption_properties()``.
+write_batch_size : int, default None
+    Number of values to write to a page at a time. If None, use the default of
+    1024.

Review comment:
       The default might be 1000 instead of 1024? (at least from looking at https://github.com/apache/arrow/blob/fbbe5361eff3c1601b7b5f228c52a662f4c124e4/cpp/src/parquet/column_writer.h#L111)
   
   And is this about the page size? How does it then differ from (or interact with) `data_page_size`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #12630: ARROW-15934: Expose write_batch_size in python.

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #12630:
URL: https://github.com/apache/arrow/pull/12630#issuecomment-1067394244






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12630: ARROW-15934: [Python] Expose write_batch_size in python

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12630:
URL: https://github.com/apache/arrow/pull/12630#issuecomment-1074820888


   Benchmark runs are scheduled for baseline = 106af3c4419f317973bc41abe7e5793911cff39b and contender = 9b63506c169cf660c518bf7a30523b46fd729be0. 9b63506c169cf660c518bf7a30523b46fd729be0 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/71fe9676c4254e0ab1ee48d3c820dc7d...0470965b331e471c98226ed76b7c68b9/)
   [Finished :arrow_down:0.34% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/7ea2127d52104a4eaf45ae42cd903677...5ee931ff12ea425899edcc8ee5b4715d/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/a31c576e158440688cba18ae5a2db147...acdfaf91b72b468c8ccb8ac5a18a0f88/)
   [Finished :arrow_down:0.21% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/d0b3b2ea7d19407f989ddf5e55025f77...bf7fa330a2da42ada3ca968204d91445/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche edited a comment on pull request #12630: ARROW-15934: [Python] Expose write_batch_size in python

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche edited a comment on pull request #12630:
URL: https://github.com/apache/arrow/pull/12630#issuecomment-1074237348


   > Added more commentary in the docstring, but I am intentionally trying to be somewhat vague here 
   
   Thanks for the update, sounds good!
   
   > Added a test, however I don't think there is a good way to inspect the resulting pages sizes programmatically so this is really just a test that the keywords are accepted. 
   
   
   Indeed, the python API doesn't have functionality to check the number or the size of pages, so such a smoke test is the best we can do.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] tachyonwill commented on a change in pull request #12630: ARROW-15934: [Python] Expose write_batch_size in python

Posted by GitBox <gi...@apache.org>.
tachyonwill commented on a change in pull request #12630:
URL: https://github.com/apache/arrow/pull/12630#discussion_r827248653



##########
File path: python/pyarrow/parquet.py
##########
@@ -605,6 +605,9 @@ def _sanitize_table(table, new_schema, flavor):
     If None, no encryption will be done.
     The encryption properties can be created using:
     ``CryptoFactory.file_encryption_properties()``.
+write_batch_size : int, default None
+    Number of values to write to a page at a time. If None, use the default of
+    1024.

Review comment:
       The default is 1024: https://github.com/apache/arrow/blob/3bf061783f4e1ab447d2eb0f487c0c4fce6d5b15/cpp/src/parquet/properties.h#L96
   
   The way `data_page_size` and `write_batch_size` work is `write_batch_size` values are written, then the size of the page is checked against `data_page_size` and if `data_page_size` is exceeded, we start a new page. Normally, 1024 is fine for the `write_batch_size` but if the values are really big(big strings) or  `data_page_size` really small, then a smaller `write_batch_size` is needed to keep the page sizes close to `data_page_size` .




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] tachyonwill commented on pull request #12630: ARROW-15934: [Python] Expose write_batch_size in python

Posted by GitBox <gi...@apache.org>.
tachyonwill commented on pull request #12630:
URL: https://github.com/apache/arrow/pull/12630#issuecomment-1068266636


   Added a test, however I don't think there is a good way to inspect the resulting pages sizes programmatically so this is really just a test that the keywords are accepted. I have done manual tests and inspections to confirm that write_batch_size is doing what it is supposed to.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] tachyonwill commented on a change in pull request #12630: ARROW-15934: [Python] Expose write_batch_size in python

Posted by GitBox <gi...@apache.org>.
tachyonwill commented on a change in pull request #12630:
URL: https://github.com/apache/arrow/pull/12630#discussion_r829260876



##########
File path: python/pyarrow/parquet.py
##########
@@ -605,6 +605,9 @@ def _sanitize_table(table, new_schema, flavor):
     If None, no encryption will be done.
     The encryption properties can be created using:
     ``CryptoFactory.file_encryption_properties()``.
+write_batch_size : int, default None
+    Number of values to write to a page at a time. If None, use the default of
+    1024.

Review comment:
       Added more commentary in the docstring, but I am intentionally trying to be somewhat vague here and avoid exposing too many implementation details. I think we might want to change the way we write parquet pages in the future. For example, I suspect, but haven't investigated, that the current logic does not work correctly for V2 data pages.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot commented on pull request #12630: ARROW-15934: [Python] Expose write_batch_size in python

Posted by GitBox <gi...@apache.org>.
ursabot commented on pull request #12630:
URL: https://github.com/apache/arrow/pull/12630#issuecomment-1074820888


   Benchmark runs are scheduled for baseline = 106af3c4419f317973bc41abe7e5793911cff39b and contender = 9b63506c169cf660c518bf7a30523b46fd729be0. 9b63506c169cf660c518bf7a30523b46fd729be0 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/71fe9676c4254e0ab1ee48d3c820dc7d...0470965b331e471c98226ed76b7c68b9/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/7ea2127d52104a4eaf45ae42cd903677...5ee931ff12ea425899edcc8ee5b4715d/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/a31c576e158440688cba18ae5a2db147...acdfaf91b72b468c8ccb8ac5a18a0f88/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/d0b3b2ea7d19407f989ddf5e55025f77...bf7fa330a2da42ada3ca968204d91445/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorisvandenbossche commented on pull request #12630: ARROW-15934: [Python] Expose write_batch_size in python

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on pull request #12630:
URL: https://github.com/apache/arrow/pull/12630#issuecomment-1074237348


   > Added more commentary in the docstring, but I am intentionally trying to be somewhat vague here 
   
   Thanks for the update, sounds good!
   
   > Added a test, however I don't think there is a good way to inspect the resulting pages sizes programmatically so this is really just a test that the keywords are accepted. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] ursabot edited a comment on pull request #12630: ARROW-15934: [Python] Expose write_batch_size in python

Posted by GitBox <gi...@apache.org>.
ursabot edited a comment on pull request #12630:
URL: https://github.com/apache/arrow/pull/12630#issuecomment-1074820888


   Benchmark runs are scheduled for baseline = 106af3c4419f317973bc41abe7e5793911cff39b and contender = 9b63506c169cf660c518bf7a30523b46fd729be0. 9b63506c169cf660c518bf7a30523b46fd729be0 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/71fe9676c4254e0ab1ee48d3c820dc7d...0470965b331e471c98226ed76b7c68b9/)
   [Finished :arrow_down:0.34% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/7ea2127d52104a4eaf45ae42cd903677...5ee931ff12ea425899edcc8ee5b4715d/)
   [Finished :arrow_down:0.36% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/a31c576e158440688cba18ae5a2db147...acdfaf91b72b468c8ccb8ac5a18a0f88/)
   [Finished :arrow_down:0.21% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/d0b3b2ea7d19407f989ddf5e55025f77...bf7fa330a2da42ada3ca968204d91445/)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org