Posted to issues@arrow.apache.org by "tdhock (via GitHub)" <gi...@apache.org> on 2023/06/16 04:38:27 UTC

[GitHub] [arrow] tdhock opened a new issue, #36121: R hangs when read_csv_arrow after set_io_thread_count(1)

tdhock opened a new issue, #36121:
URL: https://github.com/apache/arrow/issues/36121

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I set the number of IO threads to 1 and expected to still be able to read a CSV file. Instead, the R interpreter hangs, perhaps in an infinite loop, and cannot be interrupted, even with Ctrl-C. I expected to be able to cancel the command with Ctrl-C.
   
   If one IO thread is not supported, I would have expected `arrow::set_io_thread_count(1)` to at least raise an error such as "one IO thread is not allowed, please use at least two IO threads."
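
   A minimal sketch of such a guard, written as a user-side wrapper rather than as a change to arrow itself. The name `checked_set_io_thread_count`, its `setter` argument, and the `min_threads` default are hypothetical; the minimum of two threads comes only from the behavior observed in this report, not from arrow's documentation.

   ```r
   # Hypothetical user-side wrapper (not part of the arrow package): refuse IO
   # thread counts below a minimum before they reach the thread pool.
   checked_set_io_thread_count <- function(num_threads,
                                           setter = arrow::set_io_thread_count,
                                           min_threads = 2L) {
     # Validate the input first; `||` short-circuits, so later checks are
     # only reached for a single non-missing number.
     if (!is.numeric(num_threads) || length(num_threads) != 1L ||
         is.na(num_threads) || num_threads != round(num_threads)) {
       stop("num_threads must be a single whole number")
     }
     if (num_threads < min_threads) {
       stop(sprintf("%d IO thread(s) requested, please use at least %d IO threads",
                    as.integer(num_threads), as.integer(min_threads)))
     }
     # The default setter is evaluated lazily, so arrow is only needed when
     # no replacement setter is supplied.
     setter(as.integer(num_threads))
     invisible(as.integer(num_threads))
   }
   ```

   With this wrapper, `checked_set_io_thread_count(2)` passes the value through to arrow, while `checked_set_io_thread_count(1)` stops with an error instead of leaving the session in a state where the next read may hang.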
   
   I would also have expected the man page for `read_csv_arrow` to explain how to control the number of threads used for CSV reading, but it does not mention threads at all. Something like "use arrow::set_cpu_count(N_CPUS) to tell arrow to use N_CPUS threads for reading the CSV file" would be useful.
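
   As an illustration of what such documentation could show, here is a sketch that changes the CPU thread count only for the duration of one call and then restores it. The helper name `with_cpu_count` and its `getter`/`setter` arguments are hypothetical; `arrow::cpu_count()` and `arrow::set_cpu_count()` are the existing arrow functions, and whether more CPU threads actually speed up a given CSV read is an assumption based on the related issue linked below.

   ```r
   # Hypothetical helper: evaluate `code` with a temporary CPU thread count and
   # restore the previous value afterwards, even if `code` fails.
   with_cpu_count <- function(num_threads, code,
                              getter = arrow::cpu_count,
                              setter = arrow::set_cpu_count) {
     old <- getter()
     on.exit(setter(old), add = TRUE)  # always restore the old setting
     setter(num_threads)
     force(code)  # `code` is a lazy promise, so it runs after the setter
   }

   # Example call (requires the arrow package and an existing iris.csv):
   # iris_tbl <- with_cpu_count(4, arrow::read_csv_arrow("iris.csv"))
   ```

   Restoring the old value in `on.exit()` keeps the change local, so other arrow calls in the same session are not affected.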
   
   Related issues:
   - https://github.com/apache/arrow/issues/30205#issuecomment-1378060874 explains that CPU threads (not IO threads) can be used to increase the speed of CSV reading.
   - https://github.com/apache/arrow/issues/27688 is about allowing long-running commands to be cancelled.
   - https://github.com/apache/arrow/issues/30242 is about documenting the threading model.
   
   Here is a minimal reproducible example R script:
   ```r
   write.csv(iris,"iris.csv")
   arrow::io_thread_count()
   sessionInfo()
   head(arrow::read_csv_arrow("iris.csv"))
   arrow::set_io_thread_count(2)
   head(arrow::read_csv_arrow("iris.csv"))
   arrow::set_io_thread_count(1)
   head(arrow::read_csv_arrow("iris.csv"))
   ```
   Output when running on a Linux laptop:
   ```
   (base) tdhock@tdhock-MacBook:~/R$ R --vanilla < arrow-hang.R 
   
   R version 4.3.0 (2023-04-21) -- "Already Tomorrow"
   Copyright (C) 2023 The R Foundation for Statistical Computing
   Platform: x86_64-pc-linux-gnu (64-bit)
   
   R is free software and comes with ABSOLUTELY NO WARRANTY.
   You are welcome to redistribute it under certain conditions.
   Type 'license()' or 'licence()' for distribution details.
   
   R is a collaborative project with many contributors.
   Type 'contributors()' for more information and
   'citation()' on how to cite R or R packages in publications.
   
   Type 'demo()' for some demos, 'help()' for on-line help, or
   'help.start()' for an HTML browser interface to help.
   Type 'q()' to quit R.
   
   > write.csv(iris,"iris.csv")
   > arrow::io_thread_count()
   [1] 8
   > sessionInfo()
   R version 4.3.0 (2023-04-21)
   Platform: x86_64-pc-linux-gnu (64-bit)
   Running under: Ubuntu 22.04.2 LTS
   
   Matrix products: default
   BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
   LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
   
   locale:
    [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C              
    [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8    
    [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
    [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
    [9] LC_ADDRESS=C               LC_TELEPHONE=C            
   [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
   
   time zone: America/New_York
   tzcode source: system (glibc)
   
   attached base packages:
   [1] stats     graphics  grDevices utils     datasets  methods   base     
   
   loaded via a namespace (and not attached):
    [1] tidyselect_1.2.0 bit_4.0.5        compiler_4.3.0   magrittr_2.0.3  
    [5] assertthat_0.2.1 R6_2.5.1         cli_3.6.1        glue_1.6.2      
    [9] bit64_4.0.5      vctrs_0.6.3      lifecycle_1.0.3  arrow_12.0.0    
   [13] rlang_1.1.1      purrr_1.0.1     
   > head(arrow::read_csv_arrow("iris.csv"))
       Sepal.Length Sepal.Width Petal.Length Petal.Width Species
   1 1          5.1         3.5          1.4         0.2  setosa
   2 2          4.9         3.0          1.4         0.2  setosa
   3 3          4.7         3.2          1.3         0.2  setosa
   4 4          4.6         3.1          1.5         0.2  setosa
   5 5          5.0         3.6          1.4         0.2  setosa
   6 6          5.4         3.9          1.7         0.4  setosa
   > arrow::set_io_thread_count(2)
   > head(arrow::read_csv_arrow("iris.csv"))
       Sepal.Length Sepal.Width Petal.Length Petal.Width Species
   1 1          5.1         3.5          1.4         0.2  setosa
   2 2          4.9         3.0          1.4         0.2  setosa
   3 3          4.7         3.2          1.3         0.2  setosa
   4 4          4.6         3.1          1.5         0.2  setosa
   5 5          5.0         3.6          1.4         0.2  setosa
   6 6          5.4         3.9          1.7         0.4  setosa
   > arrow::set_io_thread_count(1)
   > head(arrow::read_csv_arrow("iris.csv"))
   ```
   
   Output when running on a Linux server:
   ```
   th798@cn36:~/R$ R --vanilla < arrow-hang.R
   
   R version 4.3.0 (2023-04-21) -- "Already Tomorrow"
   Copyright (C) 2023 The R Foundation for Statistical Computing
   Platform: x86_64-pc-linux-gnu (64-bit)
   
   R is free software and comes with ABSOLUTELY NO WARRANTY.
   You are welcome to redistribute it under certain conditions.
   Type 'license()' or 'licence()' for distribution details.
   
   R is a collaborative project with many contributors.
   Type 'contributors()' for more information and
   'citation()' on how to cite R or R packages in publications.
   
   Type 'demo()' for some demos, 'help()' for on-line help, or
   'help.start()' for an HTML browser interface to help.
   Type 'q()' to quit R.
   
   During startup - Warning messages:
   1: Setting LC_CTYPE failed, using "C" 
   2: Setting LC_COLLATE failed, using "C" 
   3: Setting LC_TIME failed, using "C" 
   4: Setting LC_MESSAGES failed, using "C" 
   5: Setting LC_MONETARY failed, using "C" 
   6: Setting LC_PAPER failed, using "C" 
   7: Setting LC_MEASUREMENT failed, using "C" 
   > write.csv(iris,"iris.csv")
   > arrow::io_thread_count()
   [1] 8
   > sessionInfo()
   R version 4.3.0 (2023-04-21)
   Platform: x86_64-pc-linux-gnu (64-bit)
   Running under: Red Hat Enterprise Linux 8.7 (Ootpa)
   
   Matrix products: default
   BLAS:   /projects/genomic-ml/lib64/R/lib/libRblas.so 
   LAPACK: /projects/genomic-ml/lib64/R/lib/libRlapack.so;  LAPACK version 3.11.0
   
   locale:
   [1] C
   
   time zone: America/Phoenix
   tzcode source: system (glibc)
   
   attached base packages:
   [1] stats     graphics  grDevices utils     datasets  methods   base     
   
   loaded via a namespace (and not attached):
    [1] tidyselect_1.2.0 bit_4.0.5        compiler_4.3.0   magrittr_2.0.3  
    [5] assertthat_0.2.1 R6_2.5.1         cli_3.6.1        glue_1.6.2      
    [9] bit64_4.0.5      vctrs_0.6.2      lifecycle_1.0.3  arrow_11.0.0.3  
   [13] rlang_1.1.0      purrr_1.0.1     
   > head(arrow::read_csv_arrow("iris.csv"))
       Sepal.Length Sepal.Width Petal.Length Petal.Width Species
   1 1          5.1         3.5          1.4         0.2  setosa
   2 2          4.9         3.0          1.4         0.2  setosa
   3 3          4.7         3.2          1.3         0.2  setosa
   4 4          4.6         3.1          1.5         0.2  setosa
   5 5          5.0         3.6          1.4         0.2  setosa
   6 6          5.4         3.9          1.7         0.4  setosa
   > arrow::set_io_thread_count(2)
   > head(arrow::read_csv_arrow("iris.csv"))
       Sepal.Length Sepal.Width Petal.Length Petal.Width Species
   1 1          5.1         3.5          1.4         0.2  setosa
   2 2          4.9         3.0          1.4         0.2  setosa
   3 3          4.7         3.2          1.3         0.2  setosa
   4 4          4.6         3.1          1.5         0.2  setosa
   5 5          5.0         3.6          1.4         0.2  setosa
   6 6          5.4         3.9          1.7         0.4  setosa
   > arrow::set_io_thread_count(1)
   > head(arrow::read_csv_arrow("iris.csv"))
   ```
   On both computers, the last command hangs (infinite loop?) and cannot be interrupted, even with Ctrl-C.
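
   Because the hang cannot be interrupted from within the session, one way to probe it safely is to run the risky call in a separate R process with a time limit. The sketch below assumes a Unix-alike system; the helper name `run_r_with_timeout` is hypothetical, and it relies on the `timeout` argument of base R's `system2()`. Whether the time limit actually terminates a process stuck in this particular hang has not been verified here.

   ```r
   # Hypothetical helper: evaluate R code in a separate Rscript process and give
   # up after `timeout` seconds, so a hang cannot freeze the calling session.
   run_r_with_timeout <- function(code, timeout = 30) {
     rscript <- file.path(R.home("bin"), "Rscript")
     # system2() warns when the command times out; the warning is suppressed
     # because the exit status already carries that information.
     status <- suppressWarnings(
       system2(rscript, c("--vanilla", "-e", shQuote(code)),
               stdout = FALSE, stderr = FALSE, timeout = timeout)
     )
     # system2() documents exit status 124 for a command that timed out.
     list(timed_out = identical(as.integer(status), 124L), status = status)
   }

   # Example call (requires the arrow package and an existing iris.csv):
   # run_r_with_timeout(
   #   'arrow::set_io_thread_count(1); arrow::read_csv_arrow("iris.csv")',
   #   timeout = 10
   # )
   ```

   A result with `timed_out` equal to `TRUE` would indicate that the child process did not finish within the limit, while the calling session stays responsive.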
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] thisisnic commented on issue #36121: R hangs when read_csv_arrow after set_io_thread_count(1)

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #36121:
URL: https://github.com/apache/arrow/issues/36121#issuecomment-1594846918

   I can confirm that this reproduces on arrow 12.0.1 on Ubuntu 22.04, and I agree that we should warn and document better. Thanks for reporting this!


[GitHub] [arrow] paleolimbot commented on issue #36121: [R] R hangs when read_csv_arrow after set_io_thread_count(1)

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on issue #36121:
URL: https://github.com/apache/arrow/issues/36121#issuecomment-1595559221

   This one is my fault 😬 ...we hijack the IO thread pool to make it possible to call into R (e.g., user-defined functions, R connections as input) while doing certain Arrow tasks ( https://github.com/apache/arrow/blob/main/r/src/safe-call-into-r.h#L315 ). I imagine that there is some Arrow code that makes the usually safe assumption that there is at least one available IO thread.


[GitHub] [arrow] westonpace commented on issue #36121: [R] R hangs when read_csv_arrow after set_io_thread_count(1)

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on issue #36121:
URL: https://github.com/apache/arrow/issues/36121#issuecomment-1601760022

   > I imagine that there is some Arrow code that makes the usually safe assumption that there is at least one available IO thread.
   
   Yes :laughing: 


[GitHub] [arrow] paleolimbot closed issue #36121: [R] R hangs when read_csv_arrow after set_io_thread_count(1)

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot closed issue #36121: [R] R hangs when read_csv_arrow after set_io_thread_count(1)
URL: https://github.com/apache/arrow/issues/36121


[GitHub] [arrow] tdhock commented on issue #36121: R hangs when read_csv_arrow after set_io_thread_count(1)

Posted by "tdhock (via GitHub)" <gi...@apache.org>.
tdhock commented on issue #36121:
URL: https://github.com/apache/arrow/issues/36121#issuecomment-1594096248

   The Arrow documentation page on the threading model, https://arrow.apache.org/docs/cpp/threading.html, does not mention a minimum number of IO threads.
   Could a link to that page be added to the R man pages for arrow::cpu_count and arrow::io_thread_count?

