You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/06/09 15:54:23 UTC

[GitHub] [arrow] westonpace opened a new pull request, #36012: GH-35859: [Python] Actually change the default row group size to 1Mi

westonpace opened a new pull request, #36012:
URL: https://github.com/apache/arrow/pull/36012

   ### Rationale for this change
   
   In #34280 the default row group size was changed to 1Mi.  However, this was accidentally reverted (for python, but not C++) in #34435 
   
   The problem is that there is both an "absolute max row group size for the writer" and a "row group size to use for this table"  The pyarrow user is unable to set the former property.
   
   The behavior in pyarrow was previously "If no value is given in the call to write_table then don't specify anything and let the absolute max apply"
   
   The first fix changed the absolute max to 1Mi.  However, this made it impossible for the user to use a larger row group size.  The second fix changed the absolute max back to 64Mi.  However, this meant the default didn't change.
   
   ### What changes are included in this PR?
   
   This change leaves the absolute max at 64Mi.  However, if the user does not specify a row group size, we no longer "just use the table size" and instead use 1Mi.
   
   ### Are these changes tested?
   
   Yes, a unit test was added.
   
   ### Are there any user-facing changes?
   
   Yes, the default row group size now truly changes to 1Mi.  This change was already announced as part of #34280


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace merged pull request #36012: GH-35859: [Python] Actually change the default row group size to 1Mi

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace merged PR #36012:
URL: https://github.com/apache/arrow/pull/36012


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #36012: GH-35859: [Python] Actually change the default row group size to 1Mi

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #36012:
URL: https://github.com/apache/arrow/pull/36012#discussion_r1224493168


##########
python/pyarrow/_parquet.pyx:
##########
@@ -1767,7 +1767,7 @@ cdef class ParquetWriter(_Weakrefable):
             int64_t c_row_group_size
 
         if row_group_size is None or row_group_size == -1:
-            c_row_group_size = ctable.num_rows()
+            c_row_group_size = min(ctable.num_rows(), 1024*1024)

Review Comment:
   should we declare a constant rather than 1024*1024 directly?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #36012: GH-35859: [Python] Actually change the default row group size to 1Mi

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #36012:
URL: https://github.com/apache/arrow/pull/36012#discussion_r1238952785


##########
python/pyarrow/_parquet.pyx:
##########
@@ -1767,7 +1767,7 @@ cdef class ParquetWriter(_Weakrefable):
             int64_t c_row_group_size
 
         if row_group_size is None or row_group_size == -1:
-            c_row_group_size = ctable.num_rows()
+            c_row_group_size = min(ctable.num_rows(), 1024*1024)

Review Comment:
   Either way is fine for me, but so it would like the following (defined at the top of the file, after the imports):
   
   https://github.com/apache/arrow/blob/2ce4a38fcf1fbd0d759b6c2cdf657f27ee6f02a9/python/pyarrow/_dataset.pyx#L42-L46
   
   (but for something that is not reused multiple times, it is less worth it, I think)
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] conbench-apache-arrow[bot] commented on pull request #36012: GH-35859: [Python] Actually change the default row group size to 1Mi

Posted by "conbench-apache-arrow[bot] (via GitHub)" <gi...@apache.org>.
conbench-apache-arrow[bot] commented on PR #36012:
URL: https://github.com/apache/arrow/pull/36012#issuecomment-1606546327

   Conbench analyzed the 6 benchmark runs on commit `0d1f7234`.
   
   There were no benchmark performance regressions. 🎉
   
   The [full Conbench report](https://github.com/apache/arrow/runs/14542113866) has more details.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #36012: GH-35859: [Python] Actually change the default row group size to 1Mi

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #36012:
URL: https://github.com/apache/arrow/pull/36012#discussion_r1229994073


##########
python/pyarrow/_parquet.pyx:
##########
@@ -1767,7 +1767,7 @@ cdef class ParquetWriter(_Weakrefable):
             int64_t c_row_group_size
 
         if row_group_size is None or row_group_size == -1:
-            c_row_group_size = ctable.num_rows()
+            c_row_group_size = min(ctable.num_rows(), 1024*1024)

Review Comment:
   I'll defer to @jorisvandenbossche only because I don't actually know what a constant should like like in this file (style wise).  I can't find any good examples.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #36012: GH-35859: [Python] Actually change the default row group size to 1Mi

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #36012:
URL: https://github.com/apache/arrow/pull/36012#discussion_r1238953989


##########
python/pyarrow/_parquet.pyx:
##########
@@ -1767,7 +1767,7 @@ cdef class ParquetWriter(_Weakrefable):
             int64_t c_row_group_size
 
         if row_group_size is None or row_group_size == -1:
-            c_row_group_size = ctable.num_rows()
+            c_row_group_size = min(ctable.num_rows(), 1024*1024)

Review Comment:
   Ok,I have to fix a lint issue anyways so I'll add a constant real quick for readability.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org