Posted to github@arrow.apache.org by "cboettig (via GitHub)" <gi...@apache.org> on 2023/05/22 19:10:47 UTC

[GitHub] [arrow] cboettig commented on issue #35715: open_dataset() on long vec of URIs uses much more RAM & is much slower than on partition root.

cboettig commented on issue #35715:
URL: https://github.com/apache/arrow/issues/35715#issuecomment-1557782223

   Realized I didn't include timings above.  On my machine, I see:
   
   - `open_dataset()` at the root of the parquet partitioning above takes 1.5 minutes
   - Calling `ls(recursive=TRUE)` on this same location takes about 1.6 minutes
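   
   Roughly how I measured those two timings, for reference (just a sketch -- the bucket, endpoint, and anonymous access here are assumptions carried over from the duckdb example below):
   
   ```r
   library(arrow)
   
   # sketch only: bucket/endpoint/prefix assumed from the duckdb example further down
   s3 <- s3_bucket("neon4cast-scores",
                   endpoint_override = "data.ecoforecast.org",
                   anonymous = TRUE)
   prefix <- s3$path("parquet/aquatics")
   
   bench::bench_time({
     ds <- open_dataset(prefix)                 # ~1.5 minutes
   })
   
   bench::bench_time({
     all_paths <- prefix$ls(recursive = TRUE)   # ~1.6 minutes
   })
   ```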
   
   Starting with the full vector of 12,706 URIs from the recursive list, `duckdb` opens the dataset in _2 seconds_ (with support for hive partitions):
   
   ```r
   library(duckdb)
   library(glue)
   
   # all_paths: the 12,706 relative paths returned by the recursive listing above
   bench::bench_time({
     
     conn <- DBI::dbConnect(duckdb(), ":memory:")
     DBI::dbExecute(conn, "INSTALL 'httpfs';")
     DBI::dbExecute(conn, "LOAD 'httpfs';")
     endpoint <- "data.ecoforecast.org"
     DBI::dbExecute(conn, glue("SET s3_endpoint='{endpoint}';"))
     DBI::dbExecute(conn, glue("SET s3_url_style='path';"))
     
     # build a DuckDB list literal: ['s3://.../path1','s3://.../path2',...]
     parquet <- paste0("[",
                       paste0("'s3://neon4cast-scores/parquet/aquatics/", all_paths, "'",
                              collapse = ","),
                       "]")
     tblname <- "forecast_subset"
     view_query <- glue::glue("CREATE VIEW '{tblname}' ",
                              "AS SELECT * FROM parquet_scan({parquet}, HIVE_PARTITIONING=true);")
     DBI::dbExecute(conn, view_query)  # dbExecute: DDL statement, no result set to collect
     ds <- dplyr::tbl(conn, tblname)
     
   })
   
   DBI::dbExecute(conn, glue::glue("DROP VIEW {tblname}"))
   ```
   
   But trying to open those 12,706 URIs directly using `arrow` is _way_ slower -- still waiting for the example to finish after at least 30 minutes.
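   
   For reference, the slow call looks roughly like this (a sketch -- building full URIs from `all_paths` and passing the endpoint as an S3 URI query parameter are my assumptions about the setup):
   
   ```r
   # sketch: hand open_dataset() the full vector of 12,706 URIs directly
   # (endpoint_override supplied as a URI query parameter -- adjust to your setup)
   uris <- paste0("s3://neon4cast-scores/parquet/aquatics/", all_paths,
                  "?endpoint_override=data.ecoforecast.org")
   bench::bench_time({
     ds <- arrow::open_dataset(uris)   # still hadn't finished after 30+ minutes
   })
   ```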
   
   
   Also note that with duckdb, if we don't have the full URIs, we can use wildcard notation and let it discover the file paths itself.  As expected from the known cost of S3 listing, this is about 10x slower than the example above, but that still means only ~20 seconds -- more than 4x faster than the equivalent `open_dataset()` call in `arrow` (1.5 minutes).  I'm hoping this means that fixing https://github.com/apache/arrow/issues/34213 could get us close to that 4x improvement, but I would really love the ability to pass a vector of URIs and get something like the 2-second setup of the lazy tbl shown above.
   
   ```r
   bench::bench_time({
     
    # glob instead of an explicit file list; duckdb lists the bucket itself
    parquet <- "'s3://neon4cast-scores/parquet/aquatics/*/*/*'"
    tblname <- "forecast_subset"
    view_query <- glue::glue("CREATE VIEW '{tblname}' ",
                             "AS SELECT * FROM parquet_scan({parquet}, HIVE_PARTITIONING=true);")
    DBI::dbExecute(conn, view_query)
    ds <- dplyr::tbl(conn, tblname)
    
    })
    ```
   
   
   
   
     

