You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "thisisnic (via GitHub)" <gi...@apache.org> on 2023/10/03 23:39:21 UTC

[PR] GH-29184: [R] Read CSV with comma as decimal mark [arrow]

thisisnic opened a new pull request, #38002:
URL: https://github.com/apache/arrow/pull/38002

   ### Rationale for this change
   
   Allow customisable decimal points when reading data
   
   ### What changes are included in this PR?
   
   Expose the C++ option in R
   
   ### Are these changes tested?
   
   Aye
   
   ### Are there any user-facing changes?
   
   Indeed


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-29184: [R] Read CSV with comma as decimal mark [arrow]

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on PR #38002:
URL: https://github.com/apache/arrow/pull/38002#issuecomment-1745891755

   Also closes #38001


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-29184: [R] Read CSV with comma as decimal mark [arrow]

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on code in PR #38002:
URL: https://github.com/apache/arrow/pull/38002#discussion_r1350094826


##########
r/R/csv.R:
##########
@@ -279,6 +280,31 @@ read_csv_arrow <- function(file,
   eval.parent(mc)
 }
 
+#' @rdname read_delim_arrow
+#' @export
+read_csv2_arrow <- function(file,
+                            quote = '"',
+                            escape_double = TRUE,
+                            escape_backslash = FALSE,
+                            schema = NULL,
+                            col_names = TRUE,
+                            col_types = NULL,
+                            col_select = NULL,
+                            na = c("", "NA"),
+                            quoted_na = TRUE,
+                            skip_empty_rows = TRUE,
+                            skip = 0L,
+                            parse_options = NULL,
+                            convert_options = CsvConvertOptions$create(decimal_point = ","),
+                            read_options = NULL,
+                            as_data_frame = TRUE,
+                            timestamp_parsers = NULL) {

Review Comment:
   Just checked them against `read_csv_arrow()` and `readr::read_csv2()` - they match.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-29184: [R] Read CSV with comma as decimal mark [arrow]

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on PR #38002:
URL: https://github.com/apache/arrow/pull/38002#issuecomment-1753934789

   @thisisnic @paleolimbot It seems that this introduced a new lint error:
   
   https://github.com/apache/arrow/actions/runs/6459175457/job/17535485765
   
   ```text
   Error: Error: Not lint free
   R\csv.R:307:20: style: [assignment_linter] Use <-, not =, for assignment.
     mc$decimal_point = ","
                      ^
   ```
   
   Could you take a look at this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-29184: [R] Read CSV with comma as decimal mark [arrow]

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on code in PR #38002:
URL: https://github.com/apache/arrow/pull/38002#discussion_r1350038776


##########
r/R/csv.R:
##########
@@ -279,6 +280,31 @@ read_csv_arrow <- function(file,
   eval.parent(mc)
 }
 
+#' @rdname read_delim_arrow
+#' @export
+read_csv2_arrow <- function(file,
+                            quote = '"',
+                            escape_double = TRUE,
+                            escape_backslash = FALSE,
+                            schema = NULL,
+                            col_names = TRUE,
+                            col_types = NULL,
+                            col_select = NULL,
+                            na = c("", "NA"),
+                            quoted_na = TRUE,
+                            skip_empty_rows = TRUE,
+                            skip = 0L,
+                            parse_options = NULL,
+                            convert_options = CsvConvertOptions$create(decimal_point = ","),

Review Comment:
   Sorry, still working on this, totally needs updating still



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-29184: [R] Read CSV with comma as decimal mark [arrow]

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on code in PR #38002:
URL: https://github.com/apache/arrow/pull/38002#discussion_r1349137559


##########
r/R/csv.R:
##########
@@ -279,6 +280,31 @@ read_csv_arrow <- function(file,
   eval.parent(mc)
 }
 
+#' @rdname read_delim_arrow
+#' @export
+read_csv2_arrow <- function(file,
+                            quote = '"',
+                            escape_double = TRUE,
+                            escape_backslash = FALSE,
+                            schema = NULL,
+                            col_names = TRUE,
+                            col_types = NULL,
+                            col_select = NULL,
+                            na = c("", "NA"),
+                            quoted_na = TRUE,
+                            skip_empty_rows = TRUE,
+                            skip = 0L,
+                            parse_options = NULL,
+                            convert_options = CsvConvertOptions$create(decimal_point = ","),
+                            read_options = NULL,
+                            as_data_frame = TRUE,
+                            timestamp_parsers = NULL) {

Review Comment:
   I imagine these are freshly copy/pasted, but maybe double-check that the defaults are correct.



##########
r/tests/testthat/test-csv.R:
##########
@@ -733,3 +733,10 @@ test_that("Can read CSV files from a URL", {
   expect_true(tibble::is_tibble(cu))
   expect_identical(dim(cu), c(100L, 13L))
 })
+
+test_that("read_csv2_arrow correctly parses comma decimals", {
+  tf <- tempfile()

Review Comment:
   ```suggestion
     tf <- tempfile()
     on.exit(unlink(tf))
   ```



##########
r/tests/testthat/test-dataset-csv.R:
##########
@@ -637,3 +637,18 @@ test_that("GH-34640 - CSV datasets are read in correctly when both schema and pa
       summarize(mean = mean(integer))
   )
 })
+
+test_that("open_dataset() with `decimal_point` argument", {
+  temp_dir <- make_temp_dir()

Review Comment:
   ```suggestion
     temp_dir <- make_temp_dir()
     on.exit(unlink(temp_dir, recursive = TRUE))
   ```



##########
r/R/csv.R:
##########
@@ -279,6 +280,31 @@ read_csv_arrow <- function(file,
   eval.parent(mc)
 }
 
+#' @rdname read_delim_arrow
+#' @export
+read_csv2_arrow <- function(file,
+                            quote = '"',
+                            escape_double = TRUE,
+                            escape_backslash = FALSE,
+                            schema = NULL,
+                            col_names = TRUE,
+                            col_types = NULL,
+                            col_select = NULL,
+                            na = c("", "NA"),
+                            quoted_na = TRUE,
+                            skip_empty_rows = TRUE,
+                            skip = 0L,
+                            parse_options = NULL,
+                            convert_options = CsvConvertOptions$create(decimal_point = ","),

Review Comment:
   This pattern seems like it might not generalize...if a user was using `read_csv_arrow(..., convert_options = ...)`, they can't simply change that to `read_csv2_arrow(..., convert_options = ...)` and get the behaviour they were expecting. I wonder if adding `decimal_point` as a top-level argument to `read_delim_arrow()` would lead to a more predictable relationship between `read_csv_arrow()` and `read_csv2_arrow()`?



##########
r/tests/testthat/test-dataset-csv.R:
##########
@@ -637,3 +637,18 @@ test_that("GH-34640 - CSV datasets are read in correctly when both schema and pa
       summarize(mean = mean(integer))
   )
 })
+
+test_that("open_dataset() with `decimal_point` argument", {
+  temp_dir <- make_temp_dir()
+  writeLines("x\ty\n1,2\tc", con = file.path(temp_dir, "file1.csv"))
+
+  expect_equal(
+    open_dataset(temp_dir, format = "tsv") %>% collect(),
+    tibble(x = "1,2", y = "c")
+  )
+
+  expect_equal(
+    open_dataset(temp_dir, format = "tsv", decimal_point = ",") %>% collect(),
+    tibble(x = 1.2, y = "c")
+  )

Review Comment:
   What happens if you try to pass a `decimal_point = "more than one character"` or `decimal_point = ""`? (Not sure if it needs a test, but worth checking that it doesn't crash)



##########
r/R/csv.R:
##########
@@ -279,6 +280,31 @@ read_csv_arrow <- function(file,
   eval.parent(mc)
 }
 
+#' @rdname read_delim_arrow
+#' @export
+read_csv2_arrow <- function(file,
+                            quote = '"',
+                            escape_double = TRUE,
+                            escape_backslash = FALSE,
+                            schema = NULL,
+                            col_names = TRUE,
+                            col_types = NULL,
+                            col_select = NULL,
+                            na = c("", "NA"),
+                            quoted_na = TRUE,
+                            skip_empty_rows = TRUE,
+                            skip = 0L,
+                            parse_options = NULL,
+                            convert_options = CsvConvertOptions$create(decimal_point = ","),

Review Comment:
   (Or simply documenting that you can add that to `convert_options` and leave the tidyverse-interface-copying for another time)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-29184: [R] Read CSV with comma as decimal mark [arrow]

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on PR #38002:
URL: https://github.com/apache/arrow/pull/38002#issuecomment-1751784563

   (Rebasing should fix the CI error...in theory I fixed that yesterday!)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-29184: [R] Read CSV with comma as decimal mark [arrow]

Posted by "conbench-apache-arrow[bot] (via GitHub)" <gi...@apache.org>.
conbench-apache-arrow[bot] commented on PR #38002:
URL: https://github.com/apache/arrow/pull/38002#issuecomment-1755929197

   After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit b3362f2d241248a435ce53ae85bf3cf5c65d1432.
   
   There were no benchmark performance regressions. 🎉
   
   The [full Conbench report](https://github.com/apache/arrow/runs/17574354171) has more details. It also includes information about 2 possible false positives for unstable benchmarks that are known to sometimes produce them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-29184: [R] Read CSV with comma as decimal mark [arrow]

Posted by "paleolimbot (via GitHub)" <gi...@apache.org>.
paleolimbot commented on code in PR #38002:
URL: https://github.com/apache/arrow/pull/38002#discussion_r1349814104


##########
r/R/csv.R:
##########
@@ -279,6 +280,31 @@ read_csv_arrow <- function(file,
   eval.parent(mc)
 }
 
+#' @rdname read_delim_arrow
+#' @export
+read_csv2_arrow <- function(file,
+                            quote = '"',
+                            escape_double = TRUE,
+                            escape_backslash = FALSE,
+                            schema = NULL,
+                            col_names = TRUE,
+                            col_types = NULL,
+                            col_select = NULL,
+                            na = c("", "NA"),
+                            quoted_na = TRUE,
+                            skip_empty_rows = TRUE,
+                            skip = 0L,
+                            parse_options = NULL,
+                            convert_options = CsvConvertOptions$create(decimal_point = ","),

Review Comment:
   I see you added the top level argument but I think this line still needs to change to reflect that?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-29184: [R] Read CSV with comma as decimal mark [arrow]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #38002:
URL: https://github.com/apache/arrow/pull/38002#issuecomment-1745890443

   :warning: GitHub issue #29184 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-29184: [R] Read CSV with comma as decimal mark [arrow]

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic merged PR #38002:
URL: https://github.com/apache/arrow/pull/38002


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-29184: [R] Read CSV with comma as decimal mark [arrow]

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on code in PR #38002:
URL: https://github.com/apache/arrow/pull/38002#discussion_r1349515390


##########
r/tests/testthat/test-dataset-csv.R:
##########
@@ -637,3 +637,18 @@ test_that("GH-34640 - CSV datasets are read in correctly when both schema and pa
       summarize(mean = mean(integer))
   )
 })
+
+test_that("open_dataset() with `decimal_point` argument", {
+  temp_dir <- make_temp_dir()
+  writeLines("x\ty\n1,2\tc", con = file.path(temp_dir, "file1.csv"))
+
+  expect_equal(
+    open_dataset(temp_dir, format = "tsv") %>% collect(),
+    tibble(x = "1,2", y = "c")
+  )
+
+  expect_equal(
+    open_dataset(temp_dir, format = "tsv", decimal_point = ",") %>% collect(),
+    tibble(x = 1.2, y = "c")
+  )

Review Comment:
   Good call!  We get an error `Error: Expected string vector of length 1` (from the C++?), and the values read in as character in those cases.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] GH-29184: [R] Read CSV with comma as decimal mark [arrow]

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on code in PR #38002:
URL: https://github.com/apache/arrow/pull/38002#discussion_r1350096514


##########
r/R/csv.R:
##########
@@ -279,6 +280,31 @@ read_csv_arrow <- function(file,
   eval.parent(mc)
 }
 
+#' @rdname read_delim_arrow
+#' @export
+read_csv2_arrow <- function(file,
+                            quote = '"',
+                            escape_double = TRUE,
+                            escape_backslash = FALSE,
+                            schema = NULL,
+                            col_names = TRUE,
+                            col_types = NULL,
+                            col_select = NULL,
+                            na = c("", "NA"),
+                            quoted_na = TRUE,
+                            skip_empty_rows = TRUE,
+                            skip = 0L,
+                            parse_options = NULL,
+                            convert_options = CsvConvertOptions$create(decimal_point = ","),

Review Comment:
   Thanks for your various suggestions there @paleolimbot, that's really helpful!  I've added `decimal_point` to `read_csv2_arrow()` now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org