You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/03 21:10:09 UTC

[GitHub] [arrow] jonkeane commented on a change in pull request #11783: ARROW-14894: [R] Integer overflow in write_parquet chunk size calculation

jonkeane commented on a change in pull request #11783:
URL: https://github.com/apache/arrow/pull/11783#discussion_r762070759



##########
File path: r/tests/testthat/test-parquet.R
##########
@@ -311,3 +311,7 @@ test_that("ParquetFileWrite chunk_size defaults", {
       expect_error(reader$ReadRowGroup(2), "Some index in row_group_indices")
     })
 })
+
+test_that("ParquetFileWrite chunk_size calculation doesn't have integer overflow issues (ARROW-14894)", {
+  expect_equal(calculate_chunk_size(31869547, 108, 2.5e8, 200), 2451504)

Review comment:
       ```suggestion
     expect_equal(calculate_chunk_size(31869547, 108, 2.5e8, 200), 2451504)
     
     # we can set the target cells per group, and it rounds appropriately
     expect_equal(calculate_chunk_size(100, 1, 25), 25)
     expect_equal(calculate_chunk_size(101, 1, 25), 26)
     
     # but our max_chunks is respected
     expect_equal(calculate_chunk_size(101, 1, 25, 2), 51)
   ```
   
   These are kinda-sorta the same tests as above, so we don't _need_ to include them but I find these a bit easier to reason about (and certainly easier to test than constructing a whole dataset!)

##########
File path: r/R/parquet.R
##########
@@ -608,3 +593,28 @@ ParquetArrowReaderProperties <- R6Class("ParquetArrowReaderProperties",
 ParquetArrowReaderProperties$create <- function(use_threads = option_use_threads()) {
   parquet___arrow___ArrowReaderProperties__Make(isTRUE(use_threads))
 }
+
+calculate_chunk_size <- function(rows,  columns,
+  target_cells_per_group = getOption("arrow.parquet_cells_per_group", 2.5e8),
+  max_chunks = getOption("arrow.parquet_max_chunks", 200)
+  ){

Review comment:
       ```suggestion
     ) {
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org