Posted to issues@arrow.apache.org by "cboettig (via GitHub)" <gi...@apache.org> on 2023/04/17 12:08:58 UTC

[GitHub] [arrow] cboettig opened a new issue, #35184: [R] arrow::write_dataset segfaults on partitioning large data.frame

cboettig opened a new issue, #35184:
URL: https://github.com/apache/arrow/issues/35184

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   The following code produces a segfault in R (arrow 11.0.3, on Ubuntu Linux):
   
   ```r
   download.file("https://minio.carlboettiger.info/shared-data/tmp_df1.parquet", "tmp_df1.parquet")
   df1 <- arrow::read_parquet("tmp_df1.parquet")
   arrow::write_dataset(df1, tempfile(), partitioning="site_id") # segfault
   ```
   
   The code does not segfault if we omit the partitioning, or if we use only the first 1e6 rows (the full data.frame has about 11 million rows and occupies nearly a gigabyte of RAM):
   
   ```r
   arrow::write_dataset(head(df1, 1e6), tempfile(), partitioning="site_id") # works fine
   ```
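   
   Because a segfault takes down the whole R session, it can help to run each attempt in a separate process while narrowing down the failing row count. The following is only a sketch: `run_isolated()` and `try_rows()` are hypothetical helper names (not part of arrow), and it assumes `Rscript` is available under `R.home("bin")` and that `tmp_df1.parquet` is in the working directory.
   
   ```r
   # Run an R expression (given as text) in a fresh Rscript process, so that a
   # crash only kills the child process. Returns TRUE if the child exits with 0.
   run_isolated <- function(expr_text) {
     rscript <- file.path(R.home("bin"), "Rscript")
     status <- system2(rscript, c("-e", shQuote(expr_text)),
                       stdout = FALSE, stderr = FALSE)
     identical(as.integer(status), 0L)
   }
   
   # Hypothetical helper: attempt the partitioned write on the first n rows.
   try_rows <- function(n, path = "tmp_df1.parquet") {
     code <- sprintf(
       'df1 <- arrow::read_parquet("%s"); arrow::write_dataset(head(df1, %d), tempfile(), partitioning = "site_id")',
       path, as.integer(n)
     )
     run_isolated(code)
   }
   
   # Example calls (commented out because they need the downloaded file):
   # try_rows(1e6)
   # try_rows(5e6)
   ```
   
   A `FALSE` result only means the child process exited with a non-zero status; it does not by itself distinguish a segfault from an ordinary R error.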
   
   
   
   ### Component(s)
   
   Parquet, R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] cboettig commented on issue #35184: [R] arrow::write_dataset segfaults on partitioning large data.frame

Posted by "cboettig (via GitHub)" <gi...@apache.org>.
cboettig commented on issue #35184:
URL: https://github.com/apache/arrow/issues/35184#issuecomment-1512177024

   Note that attempting to stream from the parquet file to a partitioned dataset also leads to a fatal error (segfault):
   
   ```r
   arrow::open_dataset("tmp_df1.parquet") |> arrow::write_dataset(tempfile(), partitioning="site_id") # segfault
   ```
   
   




[GitHub] [arrow] cboettig closed issue #35184: [R] arrow::write_dataset segfaults on partitioning large data.frame

Posted by "cboettig (via GitHub)" <gi...@apache.org>.
cboettig closed issue #35184: [R] arrow::write_dataset segfaults on partitioning large data.frame
URL: https://github.com/apache/arrow/issues/35184




[GitHub] [arrow] cboettig commented on issue #35184: [R] arrow::write_dataset segfaults on partitioning large data.frame

Posted by "cboettig (via GitHub)" <gi...@apache.org>.
cboettig commented on issue #35184:
URL: https://github.com/apache/arrow/issues/35184#issuecomment-1515609817

   works fine in the nightlies!  :tada: :tada: thanks @PMassicotte




[GitHub] [arrow] cboettig commented on issue #35184: [R] arrow::write_dataset segfaults on partitioning large data.frame

Posted by "cboettig (via GitHub)" <gi...@apache.org>.
cboettig commented on issue #35184:
URL: https://github.com/apache/arrow/issues/35184#issuecomment-1512734633

   This appears to be a regression in 11.0.3; arrow 10.0.0 does not segfault.
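   
   When comparing behaviour across releases like this, it is worth confirming which arrow version the session actually loads. A minimal sketch, where `report_arrow_version()` is a hypothetical helper name and not part of arrow:
   
   ```r
   # Return the installed arrow R package version as a string,
   # or NA if the package is not installed.
   report_arrow_version <- function() {
     if (!requireNamespace("arrow", quietly = TRUE)) {
       return(NA_character_)
     }
     as.character(utils::packageVersion("arrow"))
   }
   
   print(report_arrow_version())
   ```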




[GitHub] [arrow] paleolimbot commented on issue #35184: [R] arrow::write_dataset segfaults on partitioning large data.frame

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on issue #35184:
URL: https://github.com/apache/arrow/issues/35184#issuecomment-1516309017

   I'm so glad! FWIW we did have other reports of write_dataset segfaulting that I was able to replicate on macOS. There was another report of write_dataset using a lot more memory in 11.0.0 than in 10.0.0, so keep us posted if this pops up again!

