Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2022/07/02 14:10:00 UTC

[jira] [Assigned] (ARROW-16133) [R][Python] Convert python dataset to R dataset

     [ https://issues.apache.org/jira/browse/ARROW-16133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neal Richardson reassigned ARROW-16133:
---------------------------------------

    Assignee: Neal Richardson

> [R][Python] Convert python dataset to R dataset
> -----------------------------------------------
>
>                 Key: ARROW-16133
>                 URL: https://issues.apache.org/jira/browse/ARROW-16133
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: Python, R
>            Reporter: Martin du Toit
>            Assignee: Neal Richardson
>            Priority: Major
>
> Hi.
> I can open an Arrow dataset from R using reticulate, but I need to keep working with that dataset in R. How can I convert the Python Arrow dataset to an R Arrow dataset for further processing?
> {code:r}
> reticulate::py_discover_config()
> reticulate::py_available(initialize = TRUE)
> pd <- reticulate::import("pandas", convert = FALSE)
> adlfs <- reticulate::import("adlfs", convert = FALSE)
> pa <- reticulate::import("pyarrow", convert = FALSE)
> pyds <- reticulate::import("pyarrow.dataset", convert = FALSE)
> pafs <- reticulate::import("pyarrow.filesystem", convert = FALSE)
> dl_path <- "investmentaccountingdata/rawdata/transactions/transactions-xxx/v1.1"
> format_name <- "transactions_transactions-xxx_v1.1"
> config <- get_config()
> datalake_secret <- config$get_datalake_secret()
> account_name <- datalake_secret$storname
> account_key <- datalake_secret$storkey
> dm_file_type <- dmfile_create_from_name(format_name = format_name)
> format_all <- dpl_arrow_format_get(dm_file_type)
> fs <- adlfs$AzureBlobFileSystem(account_name = account_name, account_key = account_key)
> # Works as expected
> fs$ls("/")
> schema_file <- dpl_arrow_schema_get_dm(dm_file_type, all_char = TRUE, pyarrow = pa)
> ds <- pyds$dataset(source = dl_path, filesystem = fs, partitioning = "hive", format = "csv", schema = schema_file)
> # This works as expected
> files <- ds$files
> files <- reticulate::py_to_r(files)
> {code}


