Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2022/07/13 17:14:00 UTC
[jira] [Commented] (ARROW-16863) [R] open_dataset() silently drops the missing values from a csv file

    [ https://issues.apache.org/jira/browse/ARROW-16863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566426#comment-17566426 ] 

Neal Richardson commented on ARROW-16863:
-----------------------------------------

I think this is only an issue because the "csv" has a single column, so no commas are involved. Your missing value is written as an empty line, and an empty line is skipped rather than read as a row. This behavior is consistent with base::read.csv() and readr::read_csv():

{code}
> read.csv("numbers.csv")
  number
1      1
2      2
3  error
4      4
5      5
6      7
7      8
> readr::read_csv("numbers.csv")
Rows: 7 Columns: 1
── Column specification ─────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (1): number

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 7 × 1
  number
  <chr> 
1 1     
2 2     
3 error 
4 4     
5 5     
6 7     
7 8  
{code}

And if you have more than one column, there is no issue:

{code}
> df_numbers$num2 <- df_numbers$number
> tf <- tempfile()
> write_csv_arrow(df_numbers, tf)
> open_dataset(tf, format = "csv") %>% collect()
# A tibble: 8 × 2
  number  num2   
  <chr>   <chr>  
1 "1"     "1"    
2 "2"     "2"    
3 "error" "error"
4 "4"     "4"    
5 "5"     "5"    
6 ""      ""     
7 "7"     "7"    
8 "8"     "8"    
{code}
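
To illustrate the single-column case, here is a minimal base R sketch. It assumes a hand-written temporary file that mirrors the lines of numbers.csv from the report (the file contents and the variable names are made up for illustration), and it uses the {{blank.lines.skip}} argument that read.csv() passes through to read.table():

{code}
# Hypothetical stand-in for numbers.csv: one column, with an empty
# line where the NA value was written.
tf <- tempfile(fileext = ".csv")
writeLines(
  c('"number"', '"1"', '"2"', '"error"', '"4"', '"5"', '', '"7"', '"8"'),
  tf
)

# Default behavior: the empty line is treated as a blank line and skipped.
skipped <- read.csv(tf)
nrow(skipped)

# Ask read.table() (via read.csv()) not to skip blank lines, so the
# empty line should be kept as a row of its own.
kept <- read.csv(tf, blank.lines.skip = FALSE)
nrow(kept)
{code}

readr::read_csv() has a similar {{skip_empty_rows}} argument. I have not checked here whether open_dataset() exposes an equivalent option.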


> [R] open_dataset() silently drops the missing values from a csv file
> --------------------------------------------------------------------
>
>                 Key: ARROW-16863
>                 URL: https://issues.apache.org/jira/browse/ARROW-16863
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>            Reporter: Zsolt Kegyes-Brassai
>            Priority: Major
>
> {{open_dataset()}} +silently+ drops empty/missing values from a csv file. The empty string was generated when writing a data frame containing an NA value with {{write_csv_arrow()}}.
>  
> {code:java}
> df_numbers <- tibble::tibble(number = c(1, 2, "error", 4, 5, NA, 7, 8))
> arrow::write_csv_arrow(df_numbers, "numbers.csv")
> readLines("numbers.csv")
> #> [1] "\"number\"" "\"1\""      "\"2\""      "\"error\""  "\"4\""     
> #> [6] "\"5\""      ""           "\"7\""      "\"8\""
> arrow::open_dataset("numbers.csv", format = "csv") |> dplyr::collect()
> #> # A tibble: 7 x 1
> #>   number
> #>   <chr> 
> #> 1 1     
> #> 2 2     
> #> 3 error 
> #> 4 4     
> #> 5 5     
> #> 6 7     
> #> 7 8
> {code}
>  


