Posted to jira@arrow.apache.org by "Nicola Crane (Jira)" <ji...@apache.org> on 2022/06/15 13:43:00 UTC
[jira] [Comment Edited] (ARROW-16833) [R] how to enforce type conversion in open_dataset()

    [ https://issues.apache.org/jira/browse/ARROW-16833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17554541#comment-17554541 ] 

Nicola Crane edited comment on ARROW-16833 at 6/15/22 1:42 PM:
---------------------------------------------------------------

Hi [~kbzsl].  Arrow cannot cast a string column to an integer type when some values, such as "error", cannot be parsed as integers, which is why your example fails.  However, since the aim is to read from a CSV file and write Parquet, you can supply the schema and the values to treat as null when you read the CSV, like so:

 
{code:java}
library(arrow)

df_numbers <- tibble::tibble(number = c(1,2,3,"error", 4, 5, NA, 6))

write_csv_arrow(df_numbers, "numbers.csv")

open_dataset("numbers.csv", format = "csv") |> 
  dplyr::collect()
#> # A tibble: 7 × 1
#>   number
#>   <chr> 
#> 1 1     
#> 2 2     
#> 3 3     
#> 4 error 
#> 5 4     
#> 6 5     
#> 7 6

open_dataset(
  "numbers.csv",
  format = "csv",
  null_values = c(NA, "error"),
  schema = schema(number = int8()),
  skip = 1
  ) |> 
  dplyr::collect()
#> # A tibble: 7 × 1
#>   number
#>    <int>
#> 1      1
#> 2      2
#> 3      3
#> 4     NA
#> 5      4
#> 6      5
#> 7      6
{code}
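
To finish the CSV-to-Parquet step without collecting the data into memory, the same dataset could be passed straight to write_dataset(). This is a minimal sketch rather than a tested recipe: it reuses the open_dataset() call above, the output directory name "numbers_parquet" is only a placeholder, and the way these arguments are handled may differ between arrow versions.

```r
library(arrow)

# Recreate the sample CSV from the example above (one column, stored as text).
df_numbers <- data.frame(
  number = c("1", "2", "3", "error", "4", "5", NA, "6"),
  stringsAsFactors = FALSE
)
write_csv_arrow(df_numbers, "numbers.csv")

# Open the CSV lazily with the target schema; "error" is treated as a null.
# skip = 1 skips the header row, because the schema already names the column.
ds <- open_dataset(
  "numbers.csv",
  format = "csv",
  null_values = c(NA, "error"),
  schema = schema(number = int8()),
  skip = 1
)

# Write the dataset out as Parquet.
# "numbers_parquet" is a hypothetical output directory chosen for illustration.
write_dataset(ds, "numbers_parquet", format = "parquet")

# Reopen the result to inspect the stored column type.
open_dataset("numbers_parquet")
```

Because write_dataset() takes the Dataset object directly, the type conversion happens while the data is streamed from CSV to Parquet, with no intermediate collect() call.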
 


was (Author: thisisnic):
Hi [~kbzsl].  In Arrow, strings and ints are not compatible types to cast between, which is why the example you give there won't work.  However, given the aim is to read from a CSV file and output as Parquet, you can provide the schema and the "null" specifier when you read in the CSV like so:

 
{code:java}
library(arrow)

df_numbers <- tibble::tibble(number = c(1,2,3,"error", 4, 5, NA, 6))

write_csv_arrow(df_numbers, "numbers.csv")

open_dataset("numbers.csv", format = "csv") |> 
  dplyr::collect()
#> # A tibble: 7 × 1
#>   number
#>   <chr> 
#> 1 1     
#> 2 2     
#> 3 3     
#> 4 error 
#> 5 4     
#> 6 5     
#> 7 6

open_dataset(
  "numbers.csv",
  format = "csv",
  convert_options = CsvConvertOptions$create(null_values = c(NA, "error")),
  schema = schema(number = int8()),
  skip = 1
  ) |> 
  dplyr::collect()
#> # A tibble: 7 × 1
#>   number
#>    <int>
#> 1      1
#> 2      2
#> 3      3
#> 4     NA
#> 5      4
#> 6      5
#> 7      6
{code}
It's not the most user-friendly interface and we hope to implement something a bit cleaner in future.

> [R] how to enforce type conversion in open_dataset()
> ----------------------------------------------------
>
>                 Key: ARROW-16833
>                 URL: https://issues.apache.org/jira/browse/ARROW-16833
>             Project: Apache Arrow
>          Issue Type: Improvement
>    Affects Versions: 8.0.0
>            Reporter: Zsolt Kegyes-Brassai
>            Priority: Minor
>
> Here is a small example:
> {code:java}
> library(arrow)
> df_numbers <- tibble::tibble(number = c(1,2,3,"error", 4, 5, NA, 6))
> str(df_numbers)
> #> tibble [8 x 1] (S3: tbl_df/tbl/data.frame)
> #>  $ number: chr [1:8] "1" "2" "3" "error" ...
> write_parquet(df_numbers, "numbers.parquet")
> open_dataset("numbers.parquet") 
> #> FileSystemDataset with 1 Parquet file
> #> number: string
> open_dataset("numbers.parquet", schema(number = int8())) |> dplyr::collect()
> #> Error in `dplyr::collect()`:
> #> ! Invalid: Failed to parse string: 'error' as a scalar of type int8
> {code}
> The expected result is an integer column in which the non-integer values are converted to NA.
> How can this type conversion be enforced using a schema definition in {{open_dataset()}}?
> Rationale: I would like to include this in a code chunk that imports a CSV dataset and saves it as a Parquet dataset (open_dataset -> write_dataset), with the type conversion based on a preset schema done at the same time, and all of this without loading all the data into memory.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)