Posted to issues@arrow.apache.org by "jllipatz (via GitHub)" <gi...@apache.org> on 2023/03/31 10:21:22 UTC

[GitHub] [arrow] jllipatz opened a new issue, #34820: Abnormal memory consumption with as_record_batch_reader

jllipatz opened a new issue, #34820:
URL: https://github.com/apache/arrow/issues/34820

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   Hello,
   
   I am using the arrow R package version 11.0.0.3.
   
   I work with a large Parquet file (around 55 GB when loaded in memory). I want to build another Parquet file with an additional column computed from a join with a small table. I have several working solutions with duckdb, and I am trying to build one using arrow alone. The following code leads to abnormal memory consumption (76 GB) when it enters the block after the call to `as_record_batch_reader()`, as if the reader needed the whole result to be in memory.
   That is not the case with the duckdb solution, for which memory use varies during the process but never goes above 17 GB.
   The durations of the two versions are very similar.
   
   Additionally, the chunk size used by the arrow version is very small. Is there a way to improve the writing of the Parquet file?
   
   
   ```r
   library(tictoc)
   library(arrow)
   library(dplyr)
   
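   # Small lookup table to be joined onto the large dataset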
   dep <- rio::import('V:/PALETTES/IGoR/data/dep2014.dbf')
   
   ds <- open_dataset('V:/PALETTES/parquet/rp68a19.parquet')
   tic()
   
   reader <- ds %>%
     left_join(dep,by=c("DR"="DEP")) %>%
     as_record_batch_reader()
     
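   # Stream the join result batch by batch into a new Parquet file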
   file <- FileOutputStream$create('V:/PALETTES/tmp/rp68a19c2.parquet')
   batch <- reader$read_next_batch()
   if (!is.null(batch)) {
     s <- batch$schema
     writer <- ParquetFileWriter$create(s,file,
            properties = ParquetWriterProperties$create(names(s)))
   
     i <- 0
     while (!is.null(batch)) {
       i <- i+1
       message(sprintf("%d, %d rows",i,nrow(batch)))
       writer$WriteTable(arrow_table(batch),chunk_size=1e6)
       batch <- reader$read_next_batch()
     }
     writer$Close()
   }
   file$close()
   toc()
   ```
   
   The code with duckdb:
   ```r
   library(DBI)
   library(arrow)
   library(duckdb)
   library(tictoc)
   con <- dbConnect(duckdb::duckdb())
   
   tic()
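   # DuckDB executes the join and returns the result as an Arrow RecordBatchReader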
   reader <- duckdb_fetch_record_batch(
     dbSendQuery(con," 
       SELECT a.*,b.REGION
       FROM 'V:/PALETTES/parquet/rp68a19.parquet' a
       LEFT JOIN 'V:/PALETTES/SQL/data/dep2014.parquet' b
       ON a.DR=b.DEP
     ", arrow=TRUE))
   
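   # Same batch-by-batch writing loop as in the arrow version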
   file <- FileOutputStream$create('V:/PALETTES/tmp/rp68a19d.parquet')
   batch <- reader$read_next_batch()
   if (!is.null(batch)) {
     s <- batch$schema
     writer <- ParquetFileWriter$create(s,file,
            properties = ParquetWriterProperties$create(names(s)))
   
     i <- 0
     while (!is.null(batch)) {
       i <- i+1
       message(sprintf("%d, %d rows",i,nrow(batch)))
       writer$WriteTable(arrow_table(batch),chunk_size=1e6)
       batch <- reader$read_next_batch()
     }
   
     writer$Close()
   }
   file$close()
   toc()
   ```
   
   ### Component(s)
   
   Parquet, R




[GitHub] [arrow] jllipatz commented on issue #34820: Abnormal memory consumption with as_record_batch_reader

Posted by "jllipatz (via GitHub)" <gi...@apache.org>.
jllipatz commented on issue #34820:
URL: https://github.com/apache/arrow/issues/34820#issuecomment-1494086169

   Thanks.
   
   ```r
   left_join(ds, dep, by = c("DR" = "DEP")) %>%
     write_dataset('V:/PALETTES/tmp/rp68a19c.parquet')
   ```
   
   This makes the second half of my program unnecessary, but it doesn't help very much. The writing to the file starts immediately, yet memory use keeps growing as more records are written. At the end of the process it uses about 10% less memory than the other arrow solutions, with a similar elapsed time. Obviously it doesn't need to collect the whole result in memory first, but what is the gain if it ends up with almost the whole data in temporary memory?




[GitHub] [arrow] paleolimbot commented on issue #34820: Abnormal memory consumption with as_record_batch_reader

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on issue #34820:
URL: https://github.com/apache/arrow/issues/34820#issuecomment-1491957447

   I'm glad that a DuckDB solution is promising as well!
   
   I wonder if `ds %>% left_join(dep, by = c("DR" = "DEP")) %>% write_dataset(...)` would help at all? I am not sure why the entire left side of the join would have to be materialized.




[GitHub] [arrow] paleolimbot commented on issue #34820: Abnormal memory consumption with as_record_batch_reader

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on issue #34820:
URL: https://github.com/apache/arrow/issues/34820#issuecomment-1495865412

   That's perfect! Thanks!




[GitHub] [arrow] westonpace commented on issue #34820: Abnormal memory consumption with as_record_batch_reader

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #34820:
URL: https://github.com/apache/arrow/issues/34820#issuecomment-1501932375

   Instead of `ds %>% left_join(dep, by = c("DR" = "DEP"))`, can you do `dep %>% right_join(ds, by = c("DEP" = "DR"))`?
   
   Typically you want the small table to be the build side of the hash join (in Acero, the second input is the build side).  From the plan it appears that `ds` is the build side, which is bad.  Switching from a left to a right join should swap the sides; a sketch follows below.  If it still doesn't work, can you share the resulting `explain()` output with the right join?
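   
   A minimal sketch of that suggestion (not part of the original thread), reusing the objects and paths from the reporter's script. Wrapping `dep` in `arrow_table()` is an assumption added here, so the join is dispatched to arrow rather than to dplyr's data.frame method:
   
   ```r
   library(arrow)
   library(dplyr)
   
   dep <- rio::import('V:/PALETTES/IGoR/data/dep2014.dbf')
   ds  <- open_dataset('V:/PALETTES/parquet/rp68a19.parquet')
   
   # Reversing the join puts the small table in the position that should become
   # the hash-join build side, letting the large dataset stream through.
   arrow_table(dep) %>%   # assumption: convert the data frame so arrow handles the join
     right_join(ds, by = c("DEP" = "DR")) %>%
     write_dataset('V:/PALETTES/tmp/rp68a19c2.parquet')
   ```
   
   The result should keep the same rows as the original left join, though the column order and the join-key name in the output come from `dep` rather than `ds`.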




[GitHub] [arrow] jllipatz commented on issue #34820: Abnormal memory consumption with as_record_batch_reader

Posted by "jllipatz (via GitHub)" <gi...@apache.org>.
jllipatz commented on issue #34820:
URL: https://github.com/apache/arrow/issues/34820#issuecomment-1495603989

   I don't understand what you mean by 'Acero plan'. I can get the plan of the query; is there something for the plan of the writing step?
   
   ```
   > ds %>%
   +   left_join(dep,by=c("DR"="DEP")) %>%
   +   explain()
   ExecPlan with 5 nodes:
   4:SinkNode{}
     3:HashJoinNode{implementation=SwissJoin}
       2:TableSourceNode{}
       1:ProjectNode{projection=[RPOP, RR, DR, DCR, STABLE, ARMR, POND, AAX, BDX, CBX, HLMX, NPX, RGMX, TCHAUX, TLX, TMENX, VX, IDENTIND, IDMENX, IDFAMX, NFX, TFAMX, AEX, AEX2, ANAI, AGE, AGEMIL, ARMLTX, ARMRANX, COHAX, CSX, CPX, DCLTX, DCRANX, DLTX, DNX, DPX, DRANX, IMMIX, IRANX, IRIS, LCFX, LCMX, NAANX, NAX, NES4X, PNX, PRANX, PRX, RLTX, RNX, RRANX, S, SOX, STAT_CONJX, STX, SURFX, TAX, TRANSX, DCRX, COMP9099]}
         0:SourceNode{}
   ```




[GitHub] [arrow] westonpace commented on issue #34820: Abnormal memory consumption with as_record_batch_reader

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #34820:
URL: https://github.com/apache/arrow/issues/34820#issuecomment-1494941541

   @paleolimbot 
   
   Can we print the Acero plan that actually gets executed here?  I don't remember how exactly to do that in R.

