Posted to issues@arrow.apache.org by "sometimesabird (via GitHub)" <gi...@apache.org> on 2023/04/15 18:15:31 UTC

[GitHub] [arrow] sometimesabird opened a new issue, #35156: write_dataset freezes

sometimesabird opened a new issue, #35156:
URL: https://github.com/apache/arrow/issues/35156

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   An otherwise perfectly functioning Arrow dataset never finishes `write_dataset` when I write it out with a Hive-style partitioning structure, and I have to interrupt R. Looking at the folder structure, it seems to write files perfectly well up to some point, after which no new files are written, but the job never finishes.
   
   The dataset also writes fine into a single file (`write_dataset` without partitioning or grouping). It also writes fine when I create fewer groups than I would like. I haven't seen anyone complain about this, so I suspect that I am doing something so silly that no one has attempted it before. Am I creating too many groups?
   
   Grouping that works: A, B, C, D, E, where all groups are binary.
   
   Grouping that doesn't work: A, B, C, D, X, where X has 90+ values (and not all values exist for each level of the other variables; for example, the combination A=1, B=1, C=1, D=1 might not have X=67).
   
   Grouping that *crashes*: X, A, B, C, D.
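
To put the "too many groups?" question in numbers, here is a rough sketch that computes an upper bound on the number of leaf directories a Hive-style layout could create for the groupings above. The cardinalities are assumptions read off the issue text (A to E binary, X taken as exactly 90, although the issue says "90+"), and `max_leaf_directories` is a hypothetical helper, not part of Arrow. Because not every combination exists in the data, the real count may be lower.

```python
from math import prod

# Assumed cardinalities, taken from the issue text:
# A-E are binary; X is described as "90+" values, so 90 is a lower bound.
working = {"A": 2, "B": 2, "C": 2, "D": 2, "E": 2}
failing = {"A": 2, "B": 2, "C": 2, "D": 2, "X": 90}


def max_leaf_directories(cardinalities):
    """Upper bound on leaf directories: product of per-column cardinalities."""
    return prod(cardinalities.values())


print(max_leaf_directories(working))  # 32
print(max_leaf_directories(failing))  # 1440
```

Under these assumptions the failing grouping can produce up to 45 times as many directories as the working one, which may be relevant to the hang but does not by itself explain it.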
   
   I am on Garuda Linux (Arch-based) with R version 4.2.3.
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] joshbode commented on issue #35156: write_dataset freezes

Posted by "joshbode (via GitHub)" <gi...@apache.org>.
joshbode commented on issue #35156:
URL: https://github.com/apache/arrow/issues/35156#issuecomment-1626915608

   I can confirm similar behaviour with Python using `pyarrow==12.0.1` with both `write_dataset` and the older `write_to_dataset` with a large number of partitions (over 5000 in my case). I'll post more details and try to dig in a bit deeper, but for now, this is mostly just to say "you're not alone" :)


[GitHub] [arrow] westonpace commented on issue #35156: write_dataset freezes

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35156:
URL: https://github.com/apache/arrow/issues/35156#issuecomment-1632938838

   Another thing to check is memory usage.  `write_dataset`, if it runs long enough, will fill up the OS's disk cache.  This can often lead to swapping, etc., which can cause the entire system to freeze or run very slowly.
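
One possible way to do that monitoring on Linux (the reporter's platform) is to sample `/proc/meminfo` from a second terminal while the write runs. The sketch below is only an illustration: `parse_meminfo`, `memory_snapshot`, and `watch` are hypothetical helper names, and the choice of fields is an assumption about what is useful to look at here.

```python
import time
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Sized fields are reported by the kernel in kB.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def memory_snapshot(meminfo_path="/proc/meminfo"):
    """Return the fields most relevant to page-cache growth and swapping."""
    info = parse_meminfo(Path(meminfo_path).read_text())
    fields = ("MemAvailable", "Cached", "Dirty", "SwapTotal", "SwapFree")
    return {field: info.get(field) for field in fields}


def watch(interval_seconds=5.0, samples=12):
    """Print a snapshot every `interval_seconds`, `samples` times."""
    for _ in range(samples):
        print(memory_snapshot())
        time.sleep(interval_seconds)
```

Calling `watch()` while the write is in progress prints one line per sample. A steadily growing `Cached` value together with a shrinking `SwapFree` would be consistent with the disk-cache and swapping scenario described above.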


[GitHub] [arrow] westonpace commented on issue #35156: write_dataset freezes

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35156:
URL: https://github.com/apache/arrow/issues/35156#issuecomment-1512529791

   Are you able to capture a core dump or create a small script that reproduces this?  Which version of Arrow are you using?


[GitHub] [arrow] westonpace commented on issue #35156: [C++] write_dataset freezes

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #35156:
URL: https://github.com/apache/arrow/issues/35156#issuecomment-1632940206

   Also, if you can create any kind of reproducible example we can take a look further.

