Posted to jira@arrow.apache.org by "Martin du Toit (Jira)" <ji...@apache.org> on 2022/04/06 11:23:00 UTC

[jira] [Created] (ARROW-16133) [R][Python] Convert python dataset to R dataset

Martin du Toit created ARROW-16133:
--------------------------------------

             Summary: [R][Python] Convert python dataset to R dataset
                 Key: ARROW-16133
                 URL: https://issues.apache.org/jira/browse/ARROW-16133
             Project: Apache Arrow
          Issue Type: Wish
          Components: Python, R
            Reporter: Martin du Toit


Hi,

I can open an Arrow dataset from R through reticulate (using pyarrow), but I need to keep working with that dataset in R. How can I convert the Python Arrow dataset to an R Arrow dataset for further processing?
{code:r}
reticulate::py_discover_config()
reticulate::py_available(initialize = TRUE)

pd <- reticulate::import("pandas", convert = FALSE)
adlfs <- reticulate::import("adlfs", convert = FALSE)
pa <- reticulate::import("pyarrow", convert = FALSE)
pyds <- reticulate::import("pyarrow.dataset", convert = FALSE)
pafs <- reticulate::import("pyarrow.filesystem", convert = FALSE)

dl_path <- "investmentaccountingdata/rawdata/transactions/transactions-xxx/v1.1"
format_name <- "transactions_transactions-xxx_v1.1"

config <- get_config()
datalake_secret <- config$get_datalake_secret()

account_name <- datalake_secret$storname
account_key <- datalake_secret$storkey

dm_file_type <- dmfile_create_from_name(format_name = format_name)
format_all <- dpl_arrow_format_get(dm_file_type)

fs <- adlfs$AzureBlobFileSystem(account_name = account_name, account_key = account_key)

# Works as expected
fs$ls("/")

schema_file <- dpl_arrow_schema_get_dm(dm_file_type, all_char = TRUE, pyarrow = pa)

ds <- pyds$dataset(source = dl_path, filesystem = fs, partitioning = "hive", format = "csv", schema = schema_file)

# This works as expected
files <- ds$files
files <- reticulate::py_to_r(files)
{code}


