Posted to github@arrow.apache.org by "fontikar (via GitHub)" <gi...@apache.org> on 2023/03/09 01:31:34 UTC
[GitHub] [arrow] fontikar commented on issue #33432: [R] str_replace with NA does not match stringr behavior
URL: https://github.com/apache/arrow/issues/33432#issuecomment-1461130698
Hello maintainers of {arrow}! 👋
Big fan of the package; I use it regularly for some of my larger datasets.
**tl;dr** I noticed that `NA`s are stored as the character string `"NA"`, not as true missing values, in Parquet files written from a dataset opened with `open_dataset()`. The issue does not arise when I `read_csv()` the same file and then `write_parquet()`.
My reprex below:
``` r
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
library(here)
#> here() starts at /private/var/folders/fk/9s3srn850qj90zp4t67zc0fm0000gq/T/RtmpHI8rBb/reprex-c5b23ec2440-awake-topi
library(tidyverse)
# Setting file paths
proj_dir <- ("~/Dropbox/1 - ALA/Projects/data_cleaning_workflows/")
# Open csv as a dataset
plants <- open_dataset(here(proj_dir, "ignore/Curated_Plant_and_Invertebrate_Data_for_Bushfire_Modelling/vascularplant.data.csv"), format = "csv")
# Saving the Myrtles as a parquet
plants %>%
  filter(family == "Myrtaceae") %>%
  select(record_id:longitude_used) %>%
  rename(latitude = latitude_used,
         longitude = longitude_used) %>%
  write_parquet(sink = here(proj_dir, "data/dap/myrtles.parquet"))
# Read parquet
myrtles <- read_parquet(here(proj_dir, "data/dap/myrtles.parquet"))
# Check for NA character in genus
myrtles %>%
  filter(is.na(genus))
#> # A tibble: 0 × 13
#> # … with 13 variables: record_id <chr>, scientific_name <chr>,
#> # vernacular_name <chr>, kingdom <chr>, phylum <chr>, class <chr>,
#> # order <chr>, family <chr>, genus <chr>, species <chr>, subspecies <chr>,
#> # latitude <dbl>, longitude <dbl>
myrtles %>%
  filter(genus == NA)
#> # A tibble: 0 × 13
#> # … with 13 variables: record_id <chr>, scientific_name <chr>,
#> # vernacular_name <chr>, kingdom <chr>, phylum <chr>, class <chr>,
#> # order <chr>, family <chr>, genus <chr>, species <chr>, subspecies <chr>,
#> # latitude <dbl>, longitude <dbl>
# Filter by character
myrtles %>%
  filter(genus == "NA")
#> # A tibble: 55 × 13
#> record_id scien…¹ verna…² kingdom phylum class order family genus species
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 d6e812d2-72b… Kunzea… NA Plantae Trach… Magn… Myrt… Myrta… NA Kunzea…
#> 2 edf848f3-e84… Kunzea… NA Plantae Trach… Magn… Myrt… Myrta… NA Kunzea…
#> 3 e309b35f-523… Kunzea… NA Plantae Trach… Magn… Myrt… Myrta… NA Kunzea…
#> 4 94f9e331-d88… Kunzea… NA Plantae Trach… Magn… Myrt… Myrta… NA Kunzea…
#> 5 fb6345c6-350… Kunzea… NA Plantae Trach… Magn… Myrt… Myrta… NA Kunzea…
#> 6 84901a11-7a7… Kunzea… NA Plantae Trach… Magn… Myrt… Myrta… NA Kunzea…
#> 7 8cd35ed1-880… Kunzea… NA Plantae Trach… Magn… Myrt… Myrta… NA Kunzea…
#> 8 f9f2583b-fb7… Kunzea… NA Plantae Trach… Magn… Myrt… Myrta… NA Kunzea…
#> 9 e1761068-8ca… Kunzea… NA Plantae Trach… Magn… Myrt… Myrta… NA Kunzea…
#> 10 e9ff1c43-8ad… Kunzea… NA Plantae Trach… Magn… Myrt… Myrta… NA Kunzea…
#> # … with 45 more rows, 3 more variables: subspecies <chr>, latitude <dbl>,
#> # longitude <dbl>, and abbreviated variable names ¹​scientific_name,
#> # ²​vernacular_name
# Saving myrtles as csv
plants %>%
  filter(family == "Myrtaceae") %>%
  select(record_id:longitude_used) %>%
  rename(latitude = latitude_used,
         longitude = longitude_used) %>%
  write_csv_arrow(here(proj_dir, "data/dap/myrtles.csv"))
# Read in csv
myrtles_csv <- read_csv(here(proj_dir,"data/dap/myrtles.csv"))
#> Rows: 8376 Columns: 13
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (11): record_id, scientific_name, vernacular_name, kingdom, phylum, clas...
#> dbl (2): latitude, longitude
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Check for NA character in genus
myrtles_csv %>%
  filter(is.na(genus))
#> # A tibble: 55 × 13
#> record_id scien…¹ verna…² kingdom phylum class order family genus species
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 d6e812d2-72b… Kunzea… <NA> Plantae Trach… Magn… Myrt… Myrta… <NA> Kunzea…
#> 2 edf848f3-e84… Kunzea… <NA> Plantae Trach… Magn… Myrt… Myrta… <NA> Kunzea…
#> 3 e309b35f-523… Kunzea… <NA> Plantae Trach… Magn… Myrt… Myrta… <NA> Kunzea…
#> 4 94f9e331-d88… Kunzea… <NA> Plantae Trach… Magn… Myrt… Myrta… <NA> Kunzea…
#> 5 fb6345c6-350… Kunzea… <NA> Plantae Trach… Magn… Myrt… Myrta… <NA> Kunzea…
#> 6 84901a11-7a7… Kunzea… <NA> Plantae Trach… Magn… Myrt… Myrta… <NA> Kunzea…
#> 7 8cd35ed1-880… Kunzea… <NA> Plantae Trach… Magn… Myrt… Myrta… <NA> Kunzea…
#> 8 f9f2583b-fb7… Kunzea… <NA> Plantae Trach… Magn… Myrt… Myrta… <NA> Kunzea…
#> 9 e1761068-8ca… Kunzea… <NA> Plantae Trach… Magn… Myrt… Myrta… <NA> Kunzea…
#> 10 e9ff1c43-8ad… Kunzea… <NA> Plantae Trach… Magn… Myrt… Myrta… <NA> Kunzea…
#> # … with 45 more rows, 3 more variables: subspecies <chr>, latitude <dbl>,
#> # longitude <dbl>, and abbreviated variable names ¹​scientific_name,
#> # ²​vernacular_name
myrtles_csv %>%
  filter(genus == NA)
#> # A tibble: 0 × 13
#> # … with 13 variables: record_id <chr>, scientific_name <chr>,
#> # vernacular_name <chr>, kingdom <chr>, phylum <chr>, class <chr>,
#> # order <chr>, family <chr>, genus <chr>, species <chr>, subspecies <chr>,
#> # latitude <dbl>, longitude <dbl>
# Filter by character
myrtles_csv %>%
  filter(genus == "NA")
#> # A tibble: 0 × 13
#> # … with 13 variables: record_id <chr>, scientific_name <chr>,
#> # vernacular_name <chr>, kingdom <chr>, phylum <chr>, class <chr>,
#> # order <chr>, family <chr>, genus <chr>, species <chr>, subspecies <chr>,
#> # latitude <dbl>, longitude <dbl>
# Original plant dataset
## As parquet
plants_parquet <- read_parquet(here(proj_dir, "ignore/Curated_Plant_and_Invertebrate_Data_for_Bushfire_Modelling/vascularplant.data.parquet"))
# Check for NA character in genus
plants_parquet %>%
  filter(is.na(genus))
#> # A tibble: 1,509 × 34
#> record_id scien…¹ verna…² kingdom phylum class order family genus species
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 a1ee0b88-6a7… Diplod… <NA> Plantae Trach… Magn… Aspa… Orchi… <NA> Diplod…
#> 2 c86a3efe-79a… Diplod… <NA> Plantae Trach… Magn… Aspa… Orchi… <NA> Diplod…
#> 3 5e230845-24d… Diplod… <NA> Plantae Trach… Magn… Aspa… Orchi… <NA> Diplod…
#> 4 f031cfc7-794… Crypto… <NA> Plantae Trach… <NA> <NA> Orchi… <NA> Crypto…
#> 5 f2504b7c-bf7… Thelym… Heathl… Plantae Trach… Magn… Aspa… Orchi… <NA> Thelym…
#> 6 b939d99f-d79… Diplod… <NA> Plantae Trach… Magn… Aspa… Orchi… <NA> Diplod…
#> 7 e49a1a79-e70… Prasop… <NA> Plantae Trach… Magn… <NA> Orchi… <NA> Prasop…
#> 8 b409c0f6-b09… Crypto… <NA> Plantae Trach… <NA> <NA> Orchi… <NA> Crypto…
#> 9 f9a1d50d-dcd… Parapr… Forest… Plantae Trach… Magn… <NA> Orchi… <NA> Parapr…
#> 10 f418c4ba-c63… Diplod… <NA> Plantae Trach… Magn… Aspa… Orchi… <NA> Diplod…
#> # … with 1,499 more rows, 24 more variables: subspecies <chr>,
#> # latitude_used <dbl>, longitude_used <dbl>, catalogue_number <chr>,
#> # taxon_concept_guid <chr>, scientific_name_original <chr>,
#> # data_resource_id <chr>, data_resource_name <chr>, institution_code <chr>,
#> # licence <chr>, locality <chr>, latitude_original <dbl>,
#> # longitude_original <dbl>, coordinate_uncertainty_in_metres <dbl>,
#> # state_parsed <chr>, ibra_7_regions <chr>, collector <chr>, …
plants_parquet %>%
  filter(genus == NA)
#> # A tibble: 0 × 34
#> # … with 34 variables: record_id <chr>, scientific_name <chr>,
#> # vernacular_name <chr>, kingdom <chr>, phylum <chr>, class <chr>,
#> # order <chr>, family <chr>, genus <chr>, species <chr>, subspecies <chr>,
#> # latitude_used <dbl>, longitude_used <dbl>, catalogue_number <chr>,
#> # taxon_concept_guid <chr>, scientific_name_original <chr>,
#> # data_resource_id <chr>, data_resource_name <chr>, institution_code <chr>,
#> # licence <chr>, locality <chr>, latitude_original <dbl>, …
# Filter by character
plants_parquet %>%
  filter(genus == "NA")
#> # A tibble: 0 × 34
#> # … with 34 variables: record_id <chr>, scientific_name <chr>,
#> # vernacular_name <chr>, kingdom <chr>, phylum <chr>, class <chr>,
#> # order <chr>, family <chr>, genus <chr>, species <chr>, subspecies <chr>,
#> # latitude_used <dbl>, longitude_used <dbl>, catalogue_number <chr>,
#> # taxon_concept_guid <chr>, scientific_name_original <chr>,
#> # data_resource_id <chr>, data_resource_name <chr>, institution_code <chr>,
#> # licence <chr>, locality <chr>, latitude_original <dbl>, …
## As csv
plants <- read_csv(here(proj_dir, "ignore/Curated_Plant_and_Invertebrate_Data_for_Bushfire_Modelling/vascularplant.data.csv"))
#> Rows: 41572 Columns: 34
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (27): record_id, scientific_name, vernacular_name, kingdom, phylum, clas...
#> dbl (5): latitude_used, longitude_used, latitude_original, longitude_origin...
#> lgl (2): taxonomic_quality, location_quality
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Check for NA character in genus
plants %>%
  filter(is.na(genus))
#> # A tibble: 1,509 × 34
#> record_id scien…¹ verna…² kingdom phylum class order family genus species
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 a1ee0b88-6a7… Diplod… <NA> Plantae Trach… Magn… Aspa… Orchi… <NA> Diplod…
#> 2 c86a3efe-79a… Diplod… <NA> Plantae Trach… Magn… Aspa… Orchi… <NA> Diplod…
#> 3 5e230845-24d… Diplod… <NA> Plantae Trach… Magn… Aspa… Orchi… <NA> Diplod…
#> 4 f031cfc7-794… Crypto… <NA> Plantae Trach… <NA> <NA> Orchi… <NA> Crypto…
#> 5 f2504b7c-bf7… Thelym… Heathl… Plantae Trach… Magn… Aspa… Orchi… <NA> Thelym…
#> 6 b939d99f-d79… Diplod… <NA> Plantae Trach… Magn… Aspa… Orchi… <NA> Diplod…
#> 7 e49a1a79-e70… Prasop… <NA> Plantae Trach… Magn… <NA> Orchi… <NA> Prasop…
#> 8 b409c0f6-b09… Crypto… <NA> Plantae Trach… <NA> <NA> Orchi… <NA> Crypto…
#> 9 f9a1d50d-dcd… Parapr… Forest… Plantae Trach… Magn… <NA> Orchi… <NA> Parapr…
#> 10 f418c4ba-c63… Diplod… <NA> Plantae Trach… Magn… Aspa… Orchi… <NA> Diplod…
#> # … with 1,499 more rows, 24 more variables: subspecies <chr>,
#> # latitude_used <dbl>, longitude_used <dbl>, catalogue_number <chr>,
#> # taxon_concept_guid <chr>, scientific_name_original <chr>,
#> # data_resource_id <chr>, data_resource_name <chr>, institution_code <chr>,
#> # licence <chr>, locality <chr>, latitude_original <dbl>,
#> # longitude_original <dbl>, coordinate_uncertainty_in_metres <dbl>,
#> # state_parsed <chr>, ibra_7_regions <chr>, collector <chr>, …
plants %>%
  filter(genus == NA)
#> # A tibble: 0 × 34
#> # … with 34 variables: record_id <chr>, scientific_name <chr>,
#> # vernacular_name <chr>, kingdom <chr>, phylum <chr>, class <chr>,
#> # order <chr>, family <chr>, genus <chr>, species <chr>, subspecies <chr>,
#> # latitude_used <dbl>, longitude_used <dbl>, catalogue_number <chr>,
#> # taxon_concept_guid <chr>, scientific_name_original <chr>,
#> # data_resource_id <chr>, data_resource_name <chr>, institution_code <chr>,
#> # licence <chr>, locality <chr>, latitude_original <dbl>, …
# Filter by character
plants %>%
  filter(genus == "NA")
#> # A tibble: 0 × 34
#> # … with 34 variables: record_id <chr>, scientific_name <chr>,
#> # vernacular_name <chr>, kingdom <chr>, phylum <chr>, class <chr>,
#> # order <chr>, family <chr>, genus <chr>, species <chr>, subspecies <chr>,
#> # latitude_used <dbl>, longitude_used <dbl>, catalogue_number <chr>,
#> # taxon_concept_guid <chr>, scientific_name_original <chr>,
#> # data_resource_id <chr>, data_resource_name <chr>, institution_code <chr>,
#> # licence <chr>, locality <chr>, latitude_original <dbl>, …
sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Monterey 12.6.3
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.0
#> [5] purrr_1.0.1 readr_2.1.4 tidyr_1.3.0 tibble_3.1.8
#> [9] ggplot2_3.4.1 tidyverse_2.0.0 here_1.0.1 arrow_11.0.0.2
#>
#> loaded via a namespace (and not attached):
#> [1] styler_1.9.1 tidyselect_1.2.0 xfun_0.37 colorspace_2.1-0
#> [5] vctrs_0.5.2 generics_0.1.3 htmltools_0.5.4 yaml_2.3.7
#> [9] utf8_1.2.3 rlang_1.0.6 R.oo_1.25.0 pillar_1.8.1
#> [13] glue_1.6.2 withr_2.5.0 R.utils_2.12.2 bit64_4.0.5
#> [17] R.cache_0.16.0 lifecycle_1.0.3 munsell_0.5.0 gtable_0.3.1
#> [21] R.methodsS3_1.8.2 evaluate_0.20 knitr_1.42 tzdb_0.3.0
#> [25] fastmap_1.1.1 parallel_4.2.1 fansi_1.0.4 scales_1.2.1
#> [29] vroom_1.6.1 fs_1.6.1 bit_4.0.5 hms_1.1.2
#> [33] digest_0.6.31 stringi_1.7.12 grid_4.2.1 rprojroot_2.0.3
#> [37] cli_3.6.0 tools_4.2.1 magrittr_2.0.3 crayon_1.5.2
#> [41] pkgconfig_2.0.3 ellipsis_0.3.2 reprex_2.0.2 timechange_0.2.0
#> [45] assertthat_0.2.1 rmarkdown_2.20 rstudioapi_0.14 R6_2.5.1
#> [49] compiler_4.2.1
```
<sup>Created on 2023-03-09 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
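
For anyone skimming the output above, here is a minimal base-R sketch of the distinction the reprex is probing. The `genus` vector is a hypothetical toy example, not taken from the dataset; it only illustrates why `is.na()` and `== "NA"` pick out different rows, and why `== NA` returns zero rows in every case regardless of how the data were read.

``` r
# Hypothetical toy column: one literal "NA" string and one true missing value
genus <- c("Kunzea", "NA", NA_character_)

# is.na() flags only the true missing value
is.na(genus)
#> [1] FALSE FALSE  TRUE

# Comparing with the string "NA" matches only the literal text;
# the truly missing element compares as NA
genus == "NA"
#> [1] FALSE  TRUE    NA

# Any comparison with NA yields NA, so `== NA` can never select a row
genus == NA
#> [1] NA NA NA

# Counting each kind separately
n_missing <- sum(is.na(genus))                 # true missing values
n_literal <- sum(genus == "NA", na.rm = TRUE)  # literal "NA" strings
```

In the reprex, the 55 rows found by `filter(genus == "NA")` on the Parquet file correspond to the second case, while the 55 rows found by `filter(is.na(genus))` after `read_csv()` correspond to the first.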