Posted to jira@arrow.apache.org by "Dewey Dunnington (Jira)" <ji...@apache.org> on 2022/04/11 15:27:00 UTC

[jira] [Commented] (ARROW-16133) [R][Python] Convert python dataset to R dataset

    [ https://issues.apache.org/jira/browse/ARROW-16133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520644#comment-17520644 ] 

Dewey Dunnington commented on ARROW-16133:
------------------------------------------

That is a tricky one and I don't think that there is a good way to do this in the current implementation. I'm guessing that the larger-scale problem that you are trying to solve is that Azure Blob storage isn't implemented in the R bindings yet?

Something that you might be able to do as a workaround is to (1) create a {{Scanner}} that does some initial filtering and (2) pass a {{RecordBatchReader}} from Python to R.

{code:R}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(reticulate)

ds_dir <- tempfile()

mtcars %>% 
  group_by(gear) %>% 
  write_dataset(ds_dir)

pa <- reticulate::import("pyarrow", convert = FALSE)
ds <- reticulate::import("pyarrow.dataset", convert = FALSE)
pc <- reticulate::import("pyarrow.compute", convert = FALSE)
fs <- pa$fs$LocalFileSystem()

py_ds <- ds$dataset(source = ds_dir, filesystem = fs, partitioning="hive")

rbr <- py_ds$scanner(filter = pc$equal(ds$field("gear"), ds$scalar(5)))$to_reader()
arrow_rbr <- py_to_r(rbr)

arrow_rbr %>% 
  filter(mpg > 30) %>% 
  dplyr::collect()
#> # A tibble: 1 × 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  carb  gear
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
#> 1  30.4     4  95.1   113  3.77  1.51  16.9     1     1     2     5
{code}


> [R][Python] Convert python dataset to R dataset
> -----------------------------------------------
>
>                 Key: ARROW-16133
>                 URL: https://issues.apache.org/jira/browse/ARROW-16133
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: Python, R
>            Reporter: Martin du Toit
>            Priority: Major
>
> Hi. 
> I can open an arrow dataset from R using reticulate, but I need to use that dataset further in R. How can I convert the Python arrow dataset to an R arrow dataset for further processing?
> {code:r}
> reticulate::py_discover_config()
> reticulate::py_available(initialize = TRUE)
> pd <- reticulate::import("pandas", convert = FALSE)
> adlfs <- reticulate::import("adlfs", convert = FALSE)
> pa <- reticulate::import("pyarrow", convert = FALSE)
> pyds <- reticulate::import("pyarrow.dataset", convert = FALSE)
> pafs <- reticulate::import("pyarrow.filesystem", convert = FALSE)
> dl_path = "investmentaccountingdata/rawdata/transactions/transactions-xxx/v1.1"
> format_name <- "transactions_transactions-xxx_v1.1"
> config <- get_config()
> datalake_secret <- config$get_datalake_secret()
> account_name <- datalake_secret$storname
> account_key <- datalake_secret$storkey
> dm_file_type <- dmfile_create_from_name(format_name = format_name)
> format_all <- dpl_arrow_format_get(dm_file_type)
> fs = adlfs$AzureBlobFileSystem(account_name=account_name, account_key=account_key)
> # Works as expected
> fs$ls("/")
> schema_file <- dpl_arrow_schema_get_dm(dm_file_type, all_char = TRUE, pyarrow = pa)
> ds <- pyds$dataset(source = dl_path, filesystem=fs, partitioning="hive", format="csv", schema = schema_file)
> # This works as expected
> files <- ds$files
> files <- py_to_r(files)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)