Posted to issues@arrow.apache.org by "tdhock (via GitHub)" <gi...@apache.org> on 2023/06/16 04:38:27 UTC
[GitHub] [arrow] tdhock opened a new issue, #36121: R hangs when read_csv_arrow after set_io_thread_count(1)
tdhock opened a new issue, #36121:
URL: https://github.com/apache/arrow/issues/36121
### Describe the bug, including details regarding any error messages, version, and platform.
I set the number of IO threads to 1 and expected to still be able to read a CSV file. Instead, the R interpreter hangs, perhaps in an infinite loop, and cannot be interrupted, even with Control-C. I expected to be able to cancel the command with Control-C.
If 1 IO thread is not supported, I would have at least expected an error message after running `arrow::set_io_thread_count(1)` such as "one IO thread is not allowed, please use at least two IO threads."
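A minimal sketch of the kind of check requested above, assuming the lower limit really is two IO threads (the thread does not confirm the exact minimum). `set_io_threads_checked()` and its `setter` argument are hypothetical names, not part of the arrow package; only `arrow::set_io_thread_count()` is an existing function.

```r
# Hypothetical helper, not part of the arrow package.
# `setter` defaults to arrow::set_io_thread_count(); because R evaluates
# default arguments lazily, arrow is only touched once the count is valid.
set_io_threads_checked <- function(n, setter = arrow::set_io_thread_count) {
  valid <- is.numeric(n) && length(n) == 1L && !is.na(n) && n >= 2
  if (!valid) {
    stop(
      "at least two IO threads are required; refusing to set the IO thread count to ",
      paste(format(n), collapse = ", "),
      call. = FALSE
    )
  }
  n <- as.integer(n)
  setter(n)
  invisible(n)
}
```

With this wrapper, `set_io_threads_checked(1)` stops with an error instead of leaving the session in a state where the next `read_csv_arrow()` call hangs.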
I would also have expected the man page for read_csv_arrow to mention how to control the number of threads used for CSV reading, but it does not mention threads at all. A sentence such as "use arrow::set_cpu_count(N_CPUS) to tell arrow to use N_CPUS for reading the CSV file" would be useful there.
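As an illustration of what such a man page note could show, here is a short sketch that sets the CPU thread count before reading a CSV file. It only uses `arrow::cpu_count()`, `arrow::set_cpu_count()` and `arrow::read_csv_arrow()`; the value 2 and the temporary file are arbitrary choices for the example, and the arrow calls are skipped if the package is not installed.

```r
# Write a small CSV file to read back (temporary path chosen for the example).
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

if (requireNamespace("arrow", quietly = TRUE)) {
  old_cpu_count <- arrow::cpu_count()   # remember the current setting
  arrow::set_cpu_count(2)               # CPU threads, not IO threads
  df <- arrow::read_csv_arrow(csv_path)
  arrow::set_cpu_count(old_cpu_count)   # restore the previous setting
  print(head(df))
}
```

The IO thread count is deliberately left untouched here, since lowering it to 1 is what triggers the hang described in this issue.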
Related issues
- https://github.com/apache/arrow/issues/30205#issuecomment-1378060874 explains that CPU threads (not IO threads) can be used to increase speed of CSV reading.
- https://github.com/apache/arrow/issues/27688 is about allowing cancelling long running commands.
- https://github.com/apache/arrow/issues/30242 is about documenting the threading model.
Here is a minimal reproducible example R script:
```r
write.csv(iris,"iris.csv")
arrow::io_thread_count()
sessionInfo()
head(arrow::read_csv_arrow("iris.csv"))
arrow::set_io_thread_count(2)
head(arrow::read_csv_arrow("iris.csv"))
arrow::set_io_thread_count(1)
head(arrow::read_csv_arrow("iris.csv"))
```
Output when running on a Linux laptop:
```
(base) tdhock@tdhock-MacBook:~/R$ R --vanilla < arrow-hang.R
R version 4.3.0 (2023-04-21) -- "Already Tomorrow"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R est un logiciel libre livré sans AUCUNE GARANTIE.
Vous pouvez le redistribuer sous certaines conditions.
Tapez 'license()' ou 'licence()' pour plus de détails.
R est un projet collaboratif avec de nombreux contributeurs.
Tapez 'contributors()' pour plus d'information et
'citation()' pour la façon de le citer dans les publications.
Tapez 'demo()' pour des démonstrations, 'help()' pour l'aide
en ligne ou 'help.start()' pour obtenir l'aide au format HTML.
Tapez 'q()' pour quitter R.
> write.csv(iris,"iris.csv")
> arrow::io_thread_count()
[1] 8
> sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
locale:
[1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C
[3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8
[5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8
[7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
time zone: America/New_York
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tidyselect_1.2.0 bit_4.0.5 compiler_4.3.0 magrittr_2.0.3
[5] assertthat_0.2.1 R6_2.5.1 cli_3.6.1 glue_1.6.2
[9] bit64_4.0.5 vctrs_0.6.3 lifecycle_1.0.3 arrow_12.0.0
[13] rlang_1.1.1 purrr_1.0.1
> head(arrow::read_csv_arrow("iris.csv"))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 1 5.1 3.5 1.4 0.2 setosa
2 2 4.9 3.0 1.4 0.2 setosa
3 3 4.7 3.2 1.3 0.2 setosa
4 4 4.6 3.1 1.5 0.2 setosa
5 5 5.0 3.6 1.4 0.2 setosa
6 6 5.4 3.9 1.7 0.4 setosa
> arrow::set_io_thread_count(2)
> head(arrow::read_csv_arrow("iris.csv"))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 1 5.1 3.5 1.4 0.2 setosa
2 2 4.9 3.0 1.4 0.2 setosa
3 3 4.7 3.2 1.3 0.2 setosa
4 4 4.6 3.1 1.5 0.2 setosa
5 5 5.0 3.6 1.4 0.2 setosa
6 6 5.4 3.9 1.7 0.4 setosa
> arrow::set_io_thread_count(1)
> head(arrow::read_csv_arrow("iris.csv"))
```
Output when running on a Linux server:
```
th798@cn36:~/R$ R --vanilla < arrow-hang.R
R version 4.3.0 (2023-04-21) -- "Already Tomorrow"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
During startup - Warning messages:
1: Setting LC_CTYPE failed, using "C"
2: Setting LC_COLLATE failed, using "C"
3: Setting LC_TIME failed, using "C"
4: Setting LC_MESSAGES failed, using "C"
5: Setting LC_MONETARY failed, using "C"
6: Setting LC_PAPER failed, using "C"
7: Setting LC_MEASUREMENT failed, using "C"
> write.csv(iris,"iris.csv")
> arrow::io_thread_count()
[1] 8
> sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.7 (Ootpa)
Matrix products: default
BLAS: /projects/genomic-ml/lib64/R/lib/libRblas.so
LAPACK: /projects/genomic-ml/lib64/R/lib/libRlapack.so; LAPACK version 3.11.0
locale:
[1] C
time zone: America/Phoenix
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tidyselect_1.2.0 bit_4.0.5 compiler_4.3.0 magrittr_2.0.3
[5] assertthat_0.2.1 R6_2.5.1 cli_3.6.1 glue_1.6.2
[9] bit64_4.0.5 vctrs_0.6.2 lifecycle_1.0.3 arrow_11.0.0.3
[13] rlang_1.1.0 purrr_1.0.1
> head(arrow::read_csv_arrow("iris.csv"))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 1 5.1 3.5 1.4 0.2 setosa
2 2 4.9 3.0 1.4 0.2 setosa
3 3 4.7 3.2 1.3 0.2 setosa
4 4 4.6 3.1 1.5 0.2 setosa
5 5 5.0 3.6 1.4 0.2 setosa
6 6 5.4 3.9 1.7 0.4 setosa
> arrow::set_io_thread_count(2)
> head(arrow::read_csv_arrow("iris.csv"))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 1 5.1 3.5 1.4 0.2 setosa
2 2 4.9 3.0 1.4 0.2 setosa
3 3 4.7 3.2 1.3 0.2 setosa
4 4 4.6 3.1 1.5 0.2 setosa
5 5 5.0 3.6 1.4 0.2 setosa
6 6 5.4 3.9 1.7 0.4 setosa
> arrow::set_io_thread_count(1)
> head(arrow::read_csv_arrow("iris.csv"))
```
On both computers the last command hangs (infinite loop?) and cannot be interrupted, even with Control-C.
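Until the hang is fixed, one possible way to keep an interactive session usable is to run the risky call in a separate R process with a time limit. The sketch below is an assumption-laden workaround, not something proposed in this thread: `run_with_timeout()` is a hypothetical helper built on base R's `system2()` and its `timeout` argument, and whether the timeout reliably terminates a process stuck in this particular hang has not been verified here.

```r
# Hypothetical helper: evaluate R code in a child Rscript process and give up
# after `timeout` seconds. Returns the exit status of the child process.
run_with_timeout <- function(code, timeout = 10) {
  rscript <- file.path(R.home("bin"), "Rscript")
  system2(rscript, c("--vanilla", "-e", shQuote(code)), timeout = timeout)
}

# Example use with the reproducer from this issue (not run here):
# status <- run_with_timeout(
#   'arrow::set_io_thread_count(1); head(arrow::read_csv_arrow("iris.csv"))',
#   timeout = 30
# )
# A non-zero status would indicate a failure or a timeout.
```

The design choice is isolation: the IO thread setting only affects the child process, so the parent session keeps its default thread pool.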
### Component(s)
R
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] thisisnic commented on issue #36121: R hangs when read_csv_arrow after set_io_thread_count(1)
Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #36121:
URL: https://github.com/apache/arrow/issues/36121#issuecomment-1594846918
Can confirm that this can be reproduced on arrow 12.0.1 on Ubuntu 22.04, and agreed that we should warn & document better. Thanks for reporting this!
[GitHub] [arrow] paleolimbot commented on issue #36121: [R] R hangs when read_csv_arrow after set_io_thread_count(1)
Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on issue #36121:
URL: https://github.com/apache/arrow/issues/36121#issuecomment-1595559221
This one is my fault 😬 ...we hijack the IO thread pool to make it possible to call into R (e.g., user-defined functions, R connections as input) while doing certain Arrow tasks ( https://github.com/apache/arrow/blob/main/r/src/safe-call-into-r.h#L315 ). I imagine that there is some Arrow code that makes the usually safe assumption that there is at least one available IO thread.
[GitHub] [arrow] westonpace commented on issue #36121: [R] R hangs when read_csv_arrow after set_io_thread_count(1)
Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #36121:
URL: https://github.com/apache/arrow/issues/36121#issuecomment-1601760022
> I imagine that there is some Arrow code that makes the usually safe assumption that there is at least one available IO thread.
Yes :laughing:
[GitHub] [arrow] paleolimbot closed issue #36121: [R] R hangs when read_csv_arrow after set_io_thread_count(1)
Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot closed issue #36121: [R] R hangs when read_csv_arrow after set_io_thread_count(1)
URL: https://github.com/apache/arrow/issues/36121
[GitHub] [arrow] tdhock commented on issue #36121: R hangs when read_csv_arrow after set_io_thread_count(1)
Posted by "tdhock (via GitHub)" <gi...@apache.org>.
tdhock commented on issue #36121:
URL: https://github.com/apache/arrow/issues/36121#issuecomment-1594096248
The Arrow documentation page about the threading model does not mention a minimum number of IO threads: https://arrow.apache.org/docs/cpp/threading.html
Could a link to that page be added on the R man pages for arrow::cpu_count and arrow::io_thread_count?