Posted to issues@arrow.apache.org by "Zsolt Kegyes-Brassai (Jira)" <ji...@apache.org> on 2022/06/15 06:10:00 UTC
[jira] [Created] (ARROW-16833) [R] how to enforce type conversion in open_dataset()
Zsolt Kegyes-Brassai created ARROW-16833:
--------------------------------------------
Summary: [R] how to enforce type conversion in open_dataset()
Key: ARROW-16833
URL: https://issues.apache.org/jira/browse/ARROW-16833
Project: Apache Arrow
Issue Type: Improvement
Affects Versions: 8.0.0
Reporter: Zsolt Kegyes-Brassai
Here is a small example:
{code:java}
library(arrow)
df_numbers <- tibble::tibble(number = c(1,2,3,"error", 4, 5, NA, 6))
str(df_numbers)
#> tibble [8 x 1] (S3: tbl_df/tbl/data.frame)
#> $ number: chr [1:8] "1" "2" "3" "error" ...
write_parquet(df_numbers, "numbers.parquet")
open_dataset("numbers.parquet")
#> FileSystemDataset with 1 Parquet file
#> number: string
open_dataset("numbers.parquet", schema(number = int8())) |> dplyr::collect()
#> Error in `dplyr::collect()`:
#> ! Invalid: Failed to parse string: 'error' as a scalar of type int8
{code}
The expected result is an integer column in which the non-integer values are converted to NA.
How can this type conversion be enforced using the schema definition in {{open_dataset()}}?
Rationale: I would like to include this in a code chunk that imports a CSV dataset and saves it as a Parquet dataset (open_dataset -> write_dataset), with the type conversion based on a preset schema done in the same step, and without loading all the data into memory.
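For reference, a possible workaround is sketched below. It is not a schema-based solution and not necessarily what the Arrow developers would recommend. It assumes that {{grepl()}}, {{if_else()}} and {{as.integer()}} are translated by the arrow dplyr backend in the installed version, and that it is acceptable to get int32 instead of int8. The regular expression and the output directory name "numbers_int" are illustrative choices only.

```r
library(arrow)
library(dplyr)

# Recreate the file from the example above.
df_numbers <- tibble::tibble(number = c(1, 2, 3, "error", 4, 5, NA, 6))
write_parquet(df_numbers, "numbers.parquet")

# Read the column as string (its stored type), replace every value that
# does not look like an integer with NA, and only then cast to integer.
# The query is lazy: nothing is evaluated until collect() or write_dataset().
converted <- open_dataset("numbers.parquet") |>
  mutate(
    number = as.integer(
      if_else(grepl("^-?[0-9]+$", number), number, NA_character_)
    )
  )

# Write the converted data to a new Parquet dataset without collecting it
# into an R data frame first ("numbers_int" is a hypothetical directory).
write_dataset(converted, "numbers_int")
```

The point of the sketch is the order of operations: the invalid strings are turned into NA before the cast, so the cast never sees 'error'. The same query could start from a CSV source opened with {{open_dataset(..., format = "csv")}}, provided the column is read as string.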