You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "paleolimbot (via GitHub)" <gi...@apache.org> on 2023/03/24 14:33:00 UTC

[GitHub] [arrow] paleolimbot commented on a diff in pull request #34708: GH-33287: [R] Cannot read_parquet on http URL

paleolimbot commented on code in PR #34708:
URL: https://github.com/apache/arrow/pull/34708#discussion_r1147643721


##########
r/R/io.R:
##########
@@ -239,6 +239,14 @@ make_readable_file <- function(file, mmap = TRUE) {
     path <- sub("/$", "", file$base_path)
     file <- filesystem$OpenInputFile(path)
   } else if (is.string(file)) {
+
+    # if this is a HTTP URL, we need a local copy to pass to FileSystem$from_uri
+    if (is_http_url(file)) {
+        tf <- tempfile()
+        download.file(file, tf)

Review Comment:
   ```suggestion
           download.file(file, tf, mode = 'wb')
   ```
   
   For god knows what reason, `mode = 'wb'` is needed on Windows. Also be careful with verbosity here (this might print to the console on POSIX or pop up a window on Windows).
   
   (Perhaps better to use Arrow's infrastructure to do this? We can already create an `InputStream` so you could "just" read that into a file)



##########
r/R/io.R:
##########
@@ -239,6 +239,14 @@ make_readable_file <- function(file, mmap = TRUE) {
     path <- sub("/$", "", file$base_path)
     file <- filesystem$OpenInputFile(path)
   } else if (is.string(file)) {
+
+    # if this is a HTTP URL, we need a local copy to pass to FileSystem$from_uri

Review Comment:
   This is only true for Parquet and the band formerly known as Feather: the CSV and IPC stream readers can stream their data from a URL and this worked already (as far as I know).
   
   Below there's a note `# Try to create a RandomAccessFile first because some readers need this`, and I wonder if a better solution is to pass an argument `needs_random_access = TRUE`. In that case, the "make a temp file" workaround can be invoked where needed.



##########
r/R/io.R:
##########
@@ -239,6 +239,14 @@ make_readable_file <- function(file, mmap = TRUE) {
     path <- sub("/$", "", file$base_path)
     file <- filesystem$OpenInputFile(path)
   } else if (is.string(file)) {
+
+    # if this is a HTTP URL, we need a local copy to pass to FileSystem$from_uri
+    if (is_http_url(file)) {
+        tf <- tempfile()

Review Comment:
   This tempfile should be deleted when the read is over. You could do this at the R level using `on.exit()` in `read_parquet()` but you'd need some way to communicate what exactly needs deleting. Maybe the caller should have to provide `temp_file` as an argument (which may or may not be used by `make_readable_file()`.



##########
r/tests/testthat/test-parquet.R:
##########
@@ -472,3 +472,10 @@ test_that("Can read parquet with nested lists and maps", {
   pq <- read_parquet(paste0(parquet_test_data, "/nested_maps.snappy.parquet"), as_data_frame = FALSE)
   expect_true(pq$a$type == map_of(utf8(), map_of(int32(), field("value", boolean(), nullable = FALSE))))
 })
+
+test_that("Can read Parquet files from a URL", {

Review Comment:
   Maybe also pin the commit of this file and `skip_on_cran()`...we don't want to get a CRAN ERROR because GitHub decides to go down for a day.



##########
r/tests/testthat/test-parquet.R:
##########
@@ -472,3 +472,10 @@ test_that("Can read parquet with nested lists and maps", {
   pq <- read_parquet(paste0(parquet_test_data, "/nested_maps.snappy.parquet"), as_data_frame = FALSE)
   expect_true(pq$a$type == map_of(utf8(), map_of(int32(), field("value", boolean(), nullable = FALSE))))
 })
+
+test_that("Can read Parquet files from a URL", {

Review Comment:
   Make sure that you `skip_if_offline()` here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org