Posted to issues@arrow.apache.org by "jllipatz (via GitHub)" <gi...@apache.org> on 2023/04/06 10:07:31 UTC

[GitHub] [arrow] jllipatz opened a new issue, #34923: write_parquet crashes R when source comes from read_parquet!

jllipatz opened a new issue, #34923:
URL: https://github.com/apache/arrow/issues/34923

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   Hello,
   This is somewhat related to issue 34820.
   I am using version 11.0.0.3 of the arrow package for R.
   
   I have a large file (130 million lines, 60 columns, about 3.7 GB on disk in Parquet format). My problem is to find the most efficient method (in terms of duration and memory consumption) to add one column that is computed by recoding an existing one using a small correspondence table.
   Although I already have a solution using duckdb and a SQL join, I wanted to experiment with a rough method of the kind a beginner might think of.
   
   The following code doesn't work. It correctly loads the files and creates the column; at this point the data uses 58 GB of memory. But when it reaches the write step, the R session terminates immediately without any message.
   The strange thing is that if I use an RDS file with the same data, everything completes normally, even though it uses 133 GB to process the data, which takes 78 GB of memory after loading and recoding.
   
   The available memory on the machine I use is about 250 GB.
   
   My program:
   ```
   dep <- rio::import('V:/PALETTES/IGoR/data/dep2014.dbf')
   
   df <- arrow::read_parquet('V:/PALETTES/parquet/rp68a19.parquet')
   
   df$REGION <- 
     factor(df$DR,levels=dep$DEP,labels=dep$REGION) |>
     as.character()
   
   arrow::write_parquet(df,'V:/PALETTES/tmp/rp68a19a.parquet')
   ```
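   
   For comparison, the recoding step alone can be sketched in base R with `match()`, which looks up each code directly instead of going through an intermediate factor. This is only an illustration: the toy `dep` and `df` below hold made-up values standing in for the real tables, and it is not known whether this would change the memory use or the crash.
   
   ```r
   # Toy stand-ins (hypothetical values) for the lookup table and the data
   dep <- data.frame(DEP = c("01", "75", "13"),
                     REGION = c("84", "11", "93"),
                     stringsAsFactors = FALSE)
   df <- data.frame(DR = c("75", "13", "75", "99"),
                    stringsAsFactors = FALSE)
   
   # Approach from the program above: recode through a factor, then to character
   via_factor <- as.character(factor(df$DR, levels = dep$DEP, labels = dep$REGION))
   
   # Alternative: find each code's position in the table, then index the labels
   via_match <- dep$REGION[match(df$DR, dep$DEP)]
   
   # Codes absent from the lookup table ("99" here) become NA in both cases
   stopifnot(identical(via_factor, via_match))
   df$REGION <- via_match
   ```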
   
   ### Component(s)
   
   Benchmarking, Parquet, R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [R] write_parquet crashes R when source comes from read_parquet! [arrow]

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic closed issue #34923: [R] write_parquet crashes R when source comes from read_parquet!
URL: https://github.com/apache/arrow/issues/34923




[GitHub] [arrow] thisisnic commented on issue #34923: [R] write_parquet crashes R when source comes from read_parquet!

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #34923:
URL: https://github.com/apache/arrow/issues/34923#issuecomment-1503184900

   Thanks for reporting this, @jllipatz!  It's a bit tricky to pin down exactly what's happening here.  One thing to note is that when you read in the file using `read_parquet()`, the resulting object (`df`) is a data frame, so it looks like it's the writing process that is having issues.
   
   One thing to try could be calling `arrow::as_arrow_table()` on `df` both before and after you add the `REGION` column, and letting us know if that crashes R or not - this will help narrow down what's going on here.
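   
   A minimal sketch of that check, using a toy data frame with made-up values in place of the real `df` (which comes from `read_parquet()`), and assuming the arrow package is installed:
   
   ```r
   # Toy data (hypothetical values); skipped entirely if arrow is not installed
   if (requireNamespace("arrow", quietly = TRUE)) {
     df <- data.frame(DR = c("75", "13"), stringsAsFactors = FALSE)
   
     # Step 1: convert to an Arrow Table before adding the column
     tbl_before <- arrow::as_arrow_table(df)
   
     # Step 2: add the column on the data frame, then convert again
     df$REGION <- c("11", "93")
     tbl_after <- arrow::as_arrow_table(df)
   
     # If both conversions succeed, the crash is more likely in the write step
     print(ncol(tbl_before))
     print(ncol(tbl_after))
   }
   ```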
   




[GitHub] [arrow] jllipatz commented on issue #34923: [R] write_parquet crashes R when source comes from read_parquet!

Posted by "jllipatz (via GitHub)" <gi...@apache.org>.
jllipatz commented on issue #34923:
URL: https://github.com/apache/arrow/issues/34923#issuecomment-1512547595

   
   Now it works, with 58 GB used by the whole process.
   But two questions remain:
   
   - Why does the initial program work when the data comes (slowly) from an RDS file, without adding the suggested calls?
   - Why does data coming (quickly) from a Parquet file take less memory than data coming from an RDS file? It looks as if what appears to be a data.frame were hiding some trick, such as ALTREP vectors (as with 1:1000000) instead of standard ones.
   
   ```
   library(tictoc)
   dep <- rio::import('V:/PALETTES/IGoR/data/dep2014.dbf')
   
   tic()
   df <- arrow::read_parquet('V:/PALETTES/parquet/rp68a19.parquet')
   toc() # 785s
   
   
   tic()
   df <- arrow::as_arrow_table(df)
   toc() # 0.17s
   tic()
   df$REGION <- 
     factor(df$DR,levels=dep$DEP,labels=dep$REGION) |> 
     as.character()
   toc() # 17s
   
   tic()
   df <- arrow::as_arrow_table(df)
   toc() # 0.02s
   
   tic()
   arrow::write_parquet(df,'V:/PALETTES/tmp/rp68a19a.parquet')
   toc() # 388s
   ```




[GitHub] [arrow] paleolimbot commented on issue #34923: [R] write_parquet crashes R when source comes from read_parquet!

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on issue #34923:
URL: https://github.com/apache/arrow/issues/34923#issuecomment-1516355216

   I believe this is fixed by #34489! Does this problem reproduce with the latest nightly build? ( https://arrow.apache.org/docs/r/articles/install_nightly.html#install-nightly-builds )




Re: [I] [R] write_parquet crashes R when source comes from read_parquet! [arrow]

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #34923:
URL: https://github.com/apache/arrow/issues/34923#issuecomment-1746566577

   Closing this as it appears fixed but feel free to reopen if not!

