Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2022/07/13 17:14:00 UTC
[jira] [Commented] (ARROW-16863) [R] open_dataset() silently drops the missing values from a csv file
[ https://issues.apache.org/jira/browse/ARROW-16863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566426#comment-17566426 ]
Neal Richardson commented on ARROW-16863:
-----------------------------------------
I think this is only an issue because the "csv" has just a single column (so no commas are involved). Your missing value therefore shows up as nothing more than an empty line, and empty lines are skipped by default. This behavior is consistent with base::read.csv() and readr::read_csv():
{code}
> read.csv("numbers.csv")
  number
1      1
2      2
3  error
4      4
5      5
6      7
7      8
> readr::read_csv("numbers.csv")
Rows: 7 Columns: 1
── Column specification ─────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (1): number
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 7 × 1
  number
  <chr>
1 1
2 2
3 error
4 4
5 5
6 7
7 8
{code}
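For illustration, here is a minimal base R sketch of the same effect, with the blank line kept on request. The file contents are copied from the readLines() output in the report; the temporary path and variable names are made up for this example. The arrow call at the end is guarded and assumes that read_csv_arrow() accepts a readr-style skip_empty_rows argument in the installed version.

```r
# Recreate the single-column file from the report: the NA ended up as an empty line.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c('"number"', '"1"', '"2"', '"error"', '"4"', '"5"', '', '"7"', '"8"'),
  csv_path
)

# Default behaviour: read.csv() skips blank lines, so the row disappears.
skipped <- read.csv(csv_path)

# blank.lines.skip is passed through to read.table(); FALSE keeps the blank line as a row.
kept <- read.csv(csv_path, blank.lines.skip = FALSE)

nrow(skipped)
nrow(kept)

# Assumption: the installed arrow version supports skip_empty_rows here.
if (requireNamespace("arrow", quietly = TRUE)) {
  kept_arrow <- arrow::read_csv_arrow(csv_path, skip_empty_rows = FALSE)
  print(nrow(kept_arrow))
}
```

Whether the same option can be passed through open_dataset() is not covered here and may depend on the arrow version.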
And if you have more than one column, there is no issue:
{code}
> df_numbers$num2 <- df_numbers$number
> tf <- tempfile()
> write_csv_arrow(df_numbers, tf)
> open_dataset(tf, format = "csv") %>% collect()
# A tibble: 8 × 2
  number  num2
  <chr>   <chr>
1 "1"     "1"
2 "2"     "2"
3 "error" "error"
4 "4"     "4"
5 "5"     "5"
6 ""      ""
7 "7"     "7"
8 "8"     "8"
{code}
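The two-column case can be sketched in base R as well. This assumes the row of missing values is written as a lone comma, which is what the output above suggests; the shortened file contents and variable names are hypothetical.

```r
# Two-column file: the row of missing values is a lone comma, so the line is not blank.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c('"number","num2"', '"1","1"', '"error","error"', ',', '"4","4"'),
  csv_path
)

# The comma-only line parses as two empty fields, so the row is kept.
two_cols <- read.csv(csv_path)
nrow(two_cols)
```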
> [R] open_dataset() silently drops the missing values from a csv file
> --------------------------------------------------------------------
>
> Key: ARROW-16863
> URL: https://issues.apache.org/jira/browse/ARROW-16863
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Zsolt Kegyes-Brassai
> Priority: Major
>
> The {{open_dataset()}} function +silently+ drops empty/missing values from a csv file. This empty string was generated when writing a data frame containing an NA value using {{write_csv_arrow()}}.
>
> {code:java}
> df_numbers <- tibble::tibble(number = c(1, 2, "error", 4, 5, NA, 7, 8))
> arrow::write_csv_arrow(df_numbers, "numbers.csv")
> readLines("numbers.csv")
> #> [1] "\"number\"" "\"1\"" "\"2\"" "\"error\"" "\"4\""
> #> [6] "\"5\"" "" "\"7\"" "\"8\""
> arrow::open_dataset("numbers.csv", format = "csv") |> dplyr::collect()
> #> # A tibble: 7 x 1
> #> number
> #> <chr>
> #> 1 1
> #> 2 2
> #> 3 error
> #> 4 4
> #> 5 5
> #> 6 7
> #> 7 8
> {code}
>