You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@arrow.apache.org by "karldw (via GitHub)" <gi...@apache.org> on 2023/05/25 02:08:09 UTC

[GitHub] [arrow] karldw opened a new issue, #35756: [R] open_dataset and open_csv_dataset do not use skip argument

karldw opened a new issue, #35756:
URL: https://github.com/apache/arrow/issues/35756

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   I noticed that a CSV with header rows can be successfully read with `read_csv_arrow(..., skip=N)`, but not with `open_csv_dataset` unless a schema is provided. An example is below.
   
   I expected `open_csv_dataset` to use the skip argument the same way `read_csv_arrow` does, skipping a fixed number of rows from every CSV, but it seems to not be skipping any. I think this is likely a bug -- maybe in the schema parsing?
   
   
   ```R
   library(arrow)
   
   lines <- c(
     "This line should be skipped",
     "This one too, even though it has a comma",
     "X,Y,Z",
     "1,2,3",
     "4,5,6"
   )
   tmp_dir <- file.path(tempdir(), "arrow_test")
   dir.create(tmp_dir, recursive=TRUE)
   tmp_csv <- file.path(tmp_dir, "test.csv")
   writeLines(lines, tmp_csv)
   
   # This works as expected, skipping the first two lines and reading headers from the third
   read_csv_arrow(tmp_csv, skip=2L)
   
   open_csv_dataset(tmp_dir, skip=2L)
   #> ! Invalid: Error creating dataset. Could not read schema from 
   #> '/tmp/RtmpPRG1yT/arrow_test/test.csv'. Is this a 'csv' file?: 
   #> Could not open CSV input source '/tmp/RtmpPRG1yT/arrow_test/test.csv': 
   #> Invalid: CSV parse error: Row #2: Expected 1 columns, 
   #> got 2: This one too, even though it has a comma
   
   # These generate the same error:
   open_csv_dataset(tmp_dir, skip=3L)
   open_dataset(tmp_dir, format="csv", skip=2L)
   
   
   schem <- schema(
     field(name="X", type=int32()),
     field(name="Y", type=int32()),
     field(name="Z", type=int32())
   )
   
   # Works when schema is supplied (now also need to skip header row)
   open_csv_dataset(tmp_dir, skip=3L, schema=schem) |> dplyr::collect()
   
   ```
   
   
   
   
   
   ### Component(s)
   
   R


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on issue #35756: [R][C++] open_dataset and open_csv_dataset do not use skip argument

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #35756:
URL: https://github.com/apache/arrow/issues/35756#issuecomment-1585086512

   Forgot the patch
   [Uploading csv-patch.txt…]()
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on issue #35756: [R] open_dataset and open_csv_dataset do not use skip argument

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.

thisisnic commented on issue #35756:
URL: https://github.com/apache/arrow/issues/35756#issuecomment-1572567597

   I've ended up with a slightly different error message:
   
   ```
   Error in `open_dataset()`:
   ! Invalid: Error creating dataset. Could not read schema from '/tmp/RtmpcW85O3/file633ce7dbf51dd/5/file1.csv'. Is this a 'csv' file?: Could not open CSV input source '/tmp/RtmpcW85O3/file633ce7dbf51dd/5/file1.csv': Invalid: CSV file contained multiple columns named 10
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on issue #35756: [R] open_dataset and open_csv_dataset do not use skip argument

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.

thisisnic commented on issue #35756:
URL: https://github.com/apache/arrow/issues/35756#issuecomment-1568479940

   Can confirm that this can be reproduced on dev.  My best guess is that this was introduced in the refactor done as part of #33526.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on issue #35756: [R] open_dataset and open_csv_dataset do not use skip argument

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.

thisisnic commented on issue #35756:
URL: https://github.com/apache/arrow/issues/35756#issuecomment-1572566022

   I've been looking into this more closely, and the CSVReadOptions object is being created correctly, and now I'm wondering if something else is happening here.
   
   We haven't caught this before, I think, because we don't tend to see `skip` being used on its own without `col_names` or a schema being supplied.
   
   I noticed this comment in the definition of ReadOptions: 
   
   https://github.com/apache/arrow/blob/3299d12efc91220237266bfa6f985f9eb37492f8/cpp/src/arrow/csv/options.h#L158
   
   However, I can't find the code that implements actually falling back on `autogenerate_column_names`.  @westonpace - am I looking in the wrong place, or is that functionality missing?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] westonpace commented on issue #35756: [R][C++] open_dataset and open_csv_dataset do not use skip argument

Posted by "westonpace (via GitHub)" <gi...@apache.org>.

westonpace commented on issue #35756:
URL: https://github.com/apache/arrow/issues/35756#issuecomment-1585085230

I had an R environment setup today and so I debugged this for a bit. @thisisnic , I had, offline, pointed you at https://github.com/apache/arrow/blob/8b2ab4d8200fdd414b85d531f1cef4b58a3ce351/cpp/src/arrow/csv/reader.cc#L602 as the point where we handle autogenerated column names.

However, I had forgotten that we also (rather embarassingly :) have a duplicated copy of this logic in the datasets module here: https://github.com/apache/arrow/blob/8b2ab4d8200fdd414b85d531f1cef4b58a3ce351/cpp/src/arrow/dataset/file_csv.cc#L179

The logic in the dataset module is slightly different than the logic in the reader module. It is (omitting some stuff):

```
int32_t max_num_rows = read_options.skip_rows + 1;
csv::BlockParser parser(pool, parse_options, /*num_cols=*/-1, /*first_row=*/1,
max_num_rows);

RETURN_NOT_OK(parser.Parse(std::string_view{first_block}, &parsed_size));

if (read_options.autogenerate_column_names) {
column_names.reserve(parser.num_cols());
```

So we give the skipped rows to the parser (this is different than the reader.cc logic where we skip the rows outside the parser and then only give the first non-skipped row to the parser).

When we are not auto-generating column names then I think we kind of get away with it because we call `parser.VisitLastRow` which only really depends on the contents of the last row.

On the other hand, if the user is asking to autogenerate the column names we use `parser.num_cols()`. This calculation is based on the first row the parser sees!

I'm attaching a very clumsy sketch of a fix (which I verified works in OPs reprex) that just copy/pastes code from the reader so we can handle skipping in the exact same way. This sketch is also missing unit tests.

I think a good long-term fix would be to eliminate this duplicate path entirely. This could be done by adding an "inspect" method (or PeekMetadata or GetColumnNames or something) to the CSV reader. Then the datasets API could use that instead of re-inventing the wheel.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org