You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@arrow.apache.org by "jllipatz (via GitHub)" <gi...@apache.org> on 2023/03/31 10:21:22 UTC

[GitHub] [arrow] jllipatz opened a new issue, #34820: Abnormal memory consumption with as_record_batch_reader

jllipatz opened a new issue, #34820:
URL: https://github.com/apache/arrow/issues/34820

   ### Describe the usage question you have. Please include as many useful details as  possible.
   
   
   Hello
   
   I am using the arrow R package version 11.0.0.3
   
   I work with a large parquet file (around 55 GO when loaded in memory). I want to build another parquet file with an additionnal column computed from a join with a small table. I have several solutions with duckdb and I am trying to build one using arrow alone. The following code leads to an abnormal memory consumption (76 GO) when it starts the block after the call `as_record_batch` as if the reader needs the whole result to be in memory.
   That is not the case with the duckdb solution for which the used memory varies during the process but doesn't go above 17 GO.
   The duration of the two versions are very similar.
    
   Additionnally the chunksize used by the arrow version is very small, is there a way to improve the making of the parquet file?
   
   
   `
   library(tictoc)
   library(arrow)
   library(dplyr)
   
   dep <- rio::import('V:/PALETTES/IGoR/data/dep2014.dbf')
   
   ds <- open_dataset('V:/PALETTES/parquet/rp68a19.parquet')
   tic()
   
   reader <- ds %>%
     left_join(dep,by=c("DR"="DEP")) %>%
     as_record_batch_reader()
     
   file <- FileOutputStream$create('V:/PALETTES/tmp/rp68a19c2.parquet')
   batch <- reader$read_next_batch()
   if (!is.null(batch)) {
     s <- batch$schema
     writer <- ParquetFileWriter$create(s,file,
            properties = ParquetWriterProperties$create(names(s)))
   
     i <- 0
     while (!is.null(batch)) {
       i <- i+1
       message(sprintf("%d, %d rows",i,nrow(batch)))
       writer$WriteTable(arrow_table(batch),chunk_size=1e6)
       batch <- reader$read_next_batch()
     }
     writer$Close()
   }
   file$close()
   toc()`
   
   The code with duckdb:
   `
   library(DBI)
   library(arrow)
   library(duckdb)
   library(tictoc)
   con <- dbConnect(duckdb::duckdb())
   
   tic()
   reader <- duckdb_fetch_record_batch(
     dbSendQuery(con," 
       SELECT a.*,b.REGION
       FROM 'V:/PALETTES/parquet/rp68a19.parquet' a
       LEFT JOIN 'V:/PALETTES/SQL/data/dep2014.parquet' b
       ON a.DR=b.DEP
     ", arrow=TRUE))
   
   file <- FileOutputStream$create('V:/PALETTES/tmp/rp68a19d.parquet')
   batch <- reader$read_next_batch()
   if (!is.null(batch)) {
     s <- batch$schema
     writer <- ParquetFileWriter$create(s,file,
            properties = ParquetWriterProperties$create(names(s)))
   
     i <- 0
     while (!is.null(batch)) {
       i <- i+1
       message(sprintf("%d, %d rows",i,nrow(batch)))
       writer$WriteTable(arrow_table(batch),chunk_size=1e6)
       batch <- reader$read_next_batch()
     }
   
     writer$Close()
   }
   file$close()
   toc() `
   
   ### Component(s)
   
   Parquet, R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] jllipatz commented on issue #34820: Abnormal memory consumption with as_record_batch_reader

Posted by "jllipatz (via GitHub)" <gi...@apache.org>.

jllipatz commented on issue #34820:
URL: https://github.com/apache/arrow/issues/34820#issuecomment-1494086169

   Thanks.
   `left_join(ds, dep,by=c("DR"="DEP")) %>%
     write_dataset('V:/PALETTES/tmp/rp68a19c.parquet')` 
   makes the end of my program unuseful.
   But it doesn't help very much. The writing into the file starts immediatly but memory use increases when writing other records. At the end of the process it uses about 10% memory less than the other arrow solutions with an elapsed time similar. Obviously it doesn't need to collect the result in memory first but what is the gain if it ends with almost the whole data in temporary memory? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] paleolimbot commented on issue #34820: Abnormal memory consumption with as_record_batch_reader

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.

paleolimbot commented on issue #34820:
URL: https://github.com/apache/arrow/issues/34820#issuecomment-1491957447

   I'm glad that a DuckDB solution is promising as well!
   
   I wonder if `reader <- ds %>% left_join(dep, by=c("DR"="DEP")) %>% write_dataset()` would help at all? I am not sure why the entire left side of the table would have to be materialized.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org