Posted to issues@arrow.apache.org by "angela-li (via GitHub)" <gi...@apache.org> on 2023/09/27 15:17:12 UTC

[GitHub] [arrow] angela-li opened a new issue, #37908: [R] open_dataset() behavior with incorrectly quoted input data

angela-li opened a new issue, #37908:
URL: https://github.com/apache/arrow/issues/37908

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   I've started using {arrow} to read in data in R, and I noticed that it handles messy (aka human-collected!) tabular data slightly worse than {data.table}'s `fread` function. (But I want to work with arrow rather than data.table, since its partitioning support is going to be useful for me in the future!)
   
   One issue I came across was how `open_dataset` handles incorrectly quoted data, or data where the default `quote_char` of `"` is included accidentally in a column.
   
   Here's how the behavior is different between data.table and arrow. (Here's the [test_data.txt file](https://gist.github.com/angela-li/fd7c244299a8015d03cfbadf5820539b) for the below code. It's a .txt file because the original humongous data file is delivered as a .txt.)
   
   ```r
   # Does not work
   test <- open_dataset("test_data.txt",
                format = "text",
                delim = "|")
   
   # Works - warns about improper quoting, but reads it correctly
   library(data.table)
   test <- fread("test_data.txt")
   ```
   
   The {data.table} documentation describes how it handles this situation reasonably well, [NEWS file here](https://github.com/Rdatatable/data.table/blob/88039186915028ab3c93ccfd8e22c0d1c3534b1a/NEWS.md?plain=1#L1814).
   
   For now, I think I can change `parse_options` in the `open_dataset()` function to handle this, but it was quite fiddly - hard to track down in the docs how to do it. Changing this option is also not good for the rest of the data, where I _do_ want the `quote_char` to be `"`.
   
   ```r
   # Works, but is frustrating to figure out - and not ideal for the rest of the data
   test <- open_dataset("test_data.txt",
                format = "text",
                parse_options = CsvParseOptions$create(
                                      delimiter = "|",
                                      quoting = TRUE,
                                      quote_char = '', # changing this to blank, instead of '"', solves the problem
                                      double_quote = TRUE,
                                      escaping = FALSE,
                                      escape_char = "\\",
                                      newlines_in_values = FALSE,
                                      ignore_empty_lines = TRUE))
   ```
   
   I don't know if improper quoting happens elsewhere in the data, so ideally there would be some way to detect and fix this type of improper quoting systematically (as opposed to skipping rows manually, or changing the `quote_char` to blank, which could cause issues for other columns).
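   One systematic check, as a rough sketch in base R (the helper name and the odd-quote-count heuristic are my own illustration, not arrow functionality), would be to scan the raw file for lines whose quote characters don't balance:
   
   ```r
   # Sketch: flag lines containing an odd number of quote characters,
   # which usually signals an unbalanced (improper) quote. Assumes the
   # file fits in memory and that legitimate quoted fields balance
   # their quotes.
   find_improper_quotes <- function(path, quote_char = '"') {
     lines <- readLines(path, warn = FALSE)
     n_quotes <- vapply(
       strsplit(lines, "", fixed = TRUE),
       function(chars) sum(chars == quote_char),
       integer(1)
     )
     which(n_quotes %% 2 == 1)  # line numbers with unbalanced quoting
   }
   ```
   
   Flagged lines could then be repaired (or quarantined into a separate file read with different parse options) before handing the data to arrow.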
   
   Two qs:
   
   1. (immediate q) Are there more effective usage strategies for handling this type of data?
   2. (longer-term improvement) Would automatically handling this type of messy data be something that the arrow team would consider building into `open_dataset()`?
   
   Thanks for your help!
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [R] open_dataset() behavior with incorrectly quoted input data [arrow]

Posted by "amoeba (via GitHub)" <gi...@apache.org>.
amoeba commented on issue #37908:
URL: https://github.com/apache/arrow/issues/37908#issuecomment-1756253850

   Hi @angela-li, when I've run into situations like yours in the past, I've resorted to adding a cleanup step in between the raw data and the less flexible system (in this case, arrow) in order to get the raw data in a form that can be read without issues. I can imagine this might not be practical for your use case.
   
   This comment got me thinking,
   
   > Changing this option is also not good for the rest of the data, where I do want the quote_char to be ".
   
   One other thing you might try that arrow can do right now would be to make use of arrow's UnionDataset functionality. As described above, you essentially need to parse some files with one set of rules and other files with another. `open_dataset` can actually open other Datasets so you could do something like,
   
   ```r
   my_ds <- open_dataset(
     list(
       open_dataset("good_file.txt", format = "text"),
       open_dataset("bad_file.txt", format = "text", parse_options = CsvParseOptions$create(...))
     )
   ) # <- this is a UnionDataset
   ```
   
   From here you can work with `my_ds` normally.
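   For example, the usual dplyr verbs work lazily against the combined dataset (a sketch; the column name `year` is made up here for illustration):
   
   ```r
   library(dplyr)
   
   my_ds |>
     filter(year >= 2020) |>  # hypothetical column; the filter is pushed down to the scan
     collect()                # materializes the combined result as a tibble
   ```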
   
   This problem also reminds me of lubridate and its `orders` argument in [`lubridate::parse_date_time`](https://lubridate.tidyverse.org/reference/parse_date_time.html). One limitation of the above approach is that it requires you to know which files are problematic and which are not. So an idea would be to create a list of `CsvParseOptions` objects and open each file inside a `tryCatch`, falling back through the options until one succeeds. I've included a hacky example below.
   
   <details>
   <summary>flexible_open_dataset.R</summary>
   
   ```r
   library(arrow)
   
   # First create a set of CsvParseOptions to try. Order matters.
   default_parse_options <- CsvParseOptions$create(delimiter = "|")
   quirk_parse_options <- CsvParseOptions$create(delimiter = "|", quote_char = '')
   my_parse_options <- c(default_parse_options, quirk_parse_options)
   
   # Then we define two helper functions that attempt to call open_dataset until one succeeds
   flexible_open_dataset_single <- function(file, parse_options) {
     for (parse_option in parse_options) {
       ds <- tryCatch({
         open_dataset(file, format = "text", parse_options = parse_option)
       },
       error = function(e) {
         warning(
           "Failed to parse ", file,
           " with provided ParseOption. Trying any remaining options...")
         NULL
       })
   
       if (!is.null(ds)) {
          break
       }
     }
   
     ds
   }
   
   flexible_open_dataset <- function(files, parse_options) {
     open_dataset(lapply(files, function(f) { flexible_open_dataset_single(f, parse_options) }))
   }
   
   # Then, finally, we use our new helper and this should print a warning but otherwise work
   my_ds <- flexible_open_dataset(c("test_data.txt", "test_data_good.txt"), my_parse_options)
   ```
   </details>
   
   If we wanted to provide something like this in arrow, one way would be to allow `parse_options` to take multiple values and use a similar mechanism internally to try each.




[GitHub] [arrow] thisisnic commented on issue #37908: [R] open_dataset() behavior with incorrectly quoted input data

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #37908:
URL: https://github.com/apache/arrow/issues/37908#issuecomment-1740995059

   Hi @angela-li, thanks for reporting this!  Yeah, that looks pretty gnarly in terms of the paths you had to go down to find that information; we should definitely improve our docs around this.  Would you mind telling me a bit more about where you looked and how you figured it out?
   
   In terms of question 1, I'm not sure there is a better strategy here.
   
   In terms of question 2, it sounds like a really helpful feature request, though it'd be a pretty substantial change, most likely in the C++ code, and I know that a lot of those devs are focussed on other areas of the codebase.  Arrow’s CSV reader is optimized for very fast parsing of valid CSVs (whereas parsers like readr and data.table offer more flexible options for handling invalid data, occasionally at the expense of speed), so it might end up being a problem that is better solved by multiple libraries.




Re: [I] [R] open_dataset() behavior with incorrectly quoted input data [arrow]

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #37908:
URL: https://github.com/apache/arrow/issues/37908#issuecomment-1744987621

   Thanks, that's really helpful!  I'm wondering if a nice solution here might be to create wrapper functions, e.g. `csv_parse_options()` etc., which just call `CsvParseOptions$create()` but are more nicely documented.  I feel like this may only partially solve your problem though - the docs could be simpler to find, but perhaps the error message could be improved too.
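   A rough sketch of what such a wrapper might look like (the function name, arguments, and defaults here are illustrative, not an existing arrow API):
   
   ```r
   # Hypothetical wrapper: a documented, snake_case front door that
   # just forwards to CsvParseOptions$create() with the same defaults.
   csv_parse_options <- function(delimiter = ",",
                                 quote_char = '"',
                                 escape_char = "\\",
                                 newlines_in_values = FALSE) {
     arrow::CsvParseOptions$create(
       delimiter = delimiter,
       quote_char = quote_char,
       escape_char = escape_char,
       newlines_in_values = newlines_in_values
     )
   }
   
   # Usage sketch:
   # open_dataset("test_data.txt", format = "text",
   #              parse_options = csv_parse_options(delimiter = "|"))
   ```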
   
   The error message:
   ```
   Error in `open_dataset()`:
   ! Invalid: Error creating dataset. Could not read schema from '/home/nic/test_data.txt'. Is this a 'csv' file?: Could not open CSV input source '/home/nic/test_data.txt': Invalid: CSV parse error: Row #2: Expected 5 columns, got 2: 1|"RM-4.15|1958||1
   2||||1
   ```
   
   I think the problem here is that the error could have multiple causes, so it's not definitive.  We could technically search the problematic line for the quote character and, if it appears an uneven number of times, mention that, but I'm unsure whether it's too much of a fringe case.  Perhaps it isn't, given someone else had the same issue!




[GitHub] [arrow] angela-li commented on issue #37908: [R] open_dataset() behavior with incorrectly quoted input data

Posted by "angela-li (via GitHub)" <gi...@apache.org>.
angela-li commented on issue #37908:
URL: https://github.com/apache/arrow/issues/37908#issuecomment-1742200461

   > Would you mind telling me a bit more about where you looked and how you figured it out?
   
   Sure! Here's what I did to try to debug:
   
   - First I tried to zoom in on the line that was producing the error and subset it with the Linux shell (`head`, `cut`, etc.) to find the smallest subset of the data that kept producing the problem. After inspecting it and comparing it to the row above, I started to get a hunch that it was the extra " character.
   - Then, I read that tiny subset of data into `data.table::fread()` to see if it handled it better. It did, but produced a warning: _"Warning: Found and resolved improper quoting in first 100 rows. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning."_
       - I then went to the data.table source code and searched the wording of that warning in the GitHub repo, which led me to the NEWS file linked above.
   - I realized then it had something to do with incorrectly quoted data and looked on StackOverflow, which led me to this [PyArrow documentation of quote_char](https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html#pyarrow.csv.ParseOptions.quote_char) (**probably the most useful in terms of figuring out the argument I'd need to tweak - so use this as a reference for doc improvements!**)
   - From that, I thought that maybe there was an equivalent of [pyarrow.csv.ParseOptions](https://arrow.apache.org/docs/python/generated/pyarrow.csv.ParseOptions.html#pyarrow-csv-parseoptions) in the R arrow package, which led me to the [CsvParseOptions page](https://arrow.apache.org/docs/r/reference/CsvReadOptions.html) on the pkgdown site. I didn't know what default value `quote_char` took, so I went and looked at the source code for CsvParseOptions, and realized it was a `"`. And then I slowly started to realize how to fix it!
       - ...And told myself that would be a useful docs PR which is now merged as #37909!!
   - It was difficult to figure out how to use CsvParseOptions$create() because there were no examples, but I just tried things out (with my limited knowledge of S4 methods in R), paralleling the examples in the Python code, until reading my messy data with `open_dataset()` worked!
   - Along the way, I also found [someone else on StackOverflow who ran into the same issue](https://stackoverflow.com/questions/74057299/how-to-read-csv-with-within-quoted-string-with-read-csv-arrow) but did not figure it out (but I felt glad I was not the only one with the problem)
   
   Phew!
   
   > Arrow’s CSV reader is optimized for very fast parsing of valid CSVs (rather than other parsers like readr and data.table that offer more flexible options for handling invalid data, occasionally at the expense of speed), so it might end up being more of a problem that is better solved by multiple libraries.
   
   That makes sense that Arrow is optimized for fast parsing of valid CSVs - that's what I started to suspect after seeing other examples that Arrow is used for (oftentimes machine-output data, not messy human-collected data). I'll think about what to do in this case.

