You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@arrow.apache.org by "thisisnic (via GitHub)" <gi...@apache.org> on 2023/03/20 10:15:03 UTC

[GitHub] [arrow] thisisnic opened a new issue, #34640: [C++] Can't read in partitioning column in CSV datasets when both partition and schema supplied

thisisnic opened a new issue, #34640:
URL: https://github.com/apache/arrow/issues/34640

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Surfaced via R:
   
   ``` r
   library(dplyr)
   library(arrow)
   
   # set up temporary directory
   tf <- tempfile()
   dir.create(tf)
   
   # set up dummy dataset
   df <- tibble::tibble(group = rep(1:2, each = 5), value = 1:10)
   df
   #> # A tibble: 10 × 2
   #>    group value
   #>    <int> <int>
   #>  1     1     1
   #>  2     1     2
   #>  3     1     3
   #>  4     1     4
   #>  5     1     5
   #>  6     2     6
   #>  7     2     7
   #>  8     2     8
   #>  9     2     9
   #> 10     2    10
   
   # write dataset
   write_dataset(df, tf, format = "csv", partitioning = "group", hive_style = FALSE)
   
   list.files(tf, recursive = TRUE)
   #> [1] "1/part-0.csv" "2/part-0.csv"
   
   # with just partitioning, successfully can read back in partitioning variable
   open_dataset(
     tf,
     format = "csv",
     partitioning = "group"
   ) %>% collect()
   #> # A tibble: 10 × 2
   #>    value group
   #>    <int> <int>
   #>  1     1     1
   #>  2     2     1
   #>  3     3     1
   #>  4     4     1
   #>  5     5     1
   #>  6     6     2
   #>  7     7     2
   #>  8     8     2
   #>  9     9     2
   #> 10    10     2
   
   # with partitioning and schema supplied, "group" variable is not included
   open_dataset(
     tf,
     format = "csv",
     schema = schema(value = int32()),
     skip = 1,
     partitioning = schema(group = int32())
   ) %>% collect()
   #> # A tibble: 10 × 1
   #>    value
   #>    <int>
   #>  1     1
   #>  2     2
   #>  3     3
   #>  4     4
   #>  5     5
   #>  6     6
   #>  7     7
   #>  8     8
   #>  9     9
   #> 10    10
   
   # we can't add the partitioning variable to the schema or we get an error
   open_dataset(
     tf,
     format = "csv",
     schema = schema(value = int32(), group = int32()),
     skip = 1,
     partitioning = schema(group = int32())
   ) %>% collect()
   #> Error in `compute.Dataset()` at r/R/dplyr-collect.R:33:2:
   #> ! Invalid: Could not open CSV input source '/tmp/RtmpUTGEHf/file492ad7322363a/1/part-0.csv': Invalid: CSV parse error: Row #2: Expected 2 columns, got 1: 1
   #> /home/nic2/arrow/cpp/src/arrow/csv/parser.cc:477  (ParseLine<SpecializedOptions, false>(values_writer, parsed_writer, data, data_end, is_final, &line_end, bulk_filter))
   #> /home/nic2/arrow/cpp/src/arrow/csv/parser.cc:566  ParseChunk<SpecializedOptions>( &values_writer, &parsed_writer, data, data_end, is_final, rows_in_chunk, &data, &finished_parsing, bulk_filter)
   #> /home/nic2/arrow/cpp/src/arrow/csv/reader.cc:426  parser->ParseFinal(views, &parsed_size)
   
   #> Backtrace:
   #>      ▆
   #>   1. ├─... %>% collect()
   #>   2. ├─dplyr::collect(.)
   #>   3. └─arrow:::collect.Dataset(.)
   #>   4.   ├─arrow:::collect.ArrowTabular(compute.Dataset(x), as_data_frame) at r/R/dplyr-collect.R:33:2
   #>   5.   │ └─base::as.data.frame(x, ...) at r/R/dplyr-collect.R:27:4
   #>   6.   └─arrow:::compute.Dataset(x) at r/R/dplyr-collect.R:33:2
   #>   7.     └─base::tryCatch(...) at r/R/dplyr-collect.R:40:2
   #>   8.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
   #>   9.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
   #>  10.           └─value[[3L]](cond)
   #>  11.             └─arrow:::augment_io_error_msg(e, call, schema = schema()) at r/R/dplyr-collect.R:49:6
   #>  12.               └─rlang::abort(msg, call = call) at r/R/util.R:251:2
   ```
   
   This was discussed on #32938 but the solution mentioned there works for Parquet files but not CSV.
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on issue #34640: [C++] Can't read in partitioning column in CSV datasets when both (non-hive) partition and schema supplied

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.

thisisnic commented on issue #34640:
URL: https://github.com/apache/arrow/issues/34640#issuecomment-1475986515

   This is further complicated by differences in behaviour between hive-style and non-hive-style datasets:
   
   ``` r
   library(arrow)
   library(dplyr)
   
   tf <- tempfile()
   dir.create(tf)
   
   arrow::write_dataset(mtcars, tf, partitioning = "am", format = "csv")
   arrow::write_dataset(mtcars, tf, partitioning = "am", format = "parquet")
   
   schema_with_am <- schema(mpg = float64(), cyl = int64(), disp = float64(), hp = int64(), 
       drat = float64(), wt = float64(), qsec = float64(), vs = int64(), 
       am = int64(), gear = int64(), carb = int64())
   
   schema_without_am <- schema(mpg = float64(), cyl = int64(), disp = float64(), hp = int64(), 
       drat = float64(), wt = float64(), qsec = float64(), vs = int64(), gear = int64(), carb = int64())
   
   # parquet is fine with the schema containing the partitioning value
   pq = open_dataset(
     tf,
     format = "parquet",
     factory_options = list(exclude_invalid_files = TRUE),
     schema = schema_with_am,
     partitioning = "am"
   )
   collect(pq)
   #> # A tibble: 32 × 11
   #>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   #>    <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
   #>  1  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
   #>  2  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
   #>  3  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
   #>  4  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
   #>  5  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
   #>  6  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
   #>  7  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
   #>  8  17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4
   #>  9  16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3
   #> 10  17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3
   #> # … with 22 more rows
   
   # CSV shows 0 files with the schema containing the partitioning value
   csv_nope = open_dataset(
     tf,
     format = "csv",
     factory_options = list(exclude_invalid_files = TRUE),
     schema = schema_with_am,
    # partitioning = "am",
     skip = 1
   )
   csv_nope
   #> FileSystemDataset with 0 csv files
   #> mpg: double
   #> cyl: int64
   #> disp: double
   #> hp: int64
   #> drat: double
   #> wt: double
   #> qsec: double
   #> vs: int64
   #> am: int64
   #> gear: int64
   #> carb: int64
   
   # CSV is fine with the schema without the partitioning value
   csv = open_dataset(
     tf,
     format = "csv",
     factory_options = list(exclude_invalid_files = TRUE),
     schema = schema_without_am,
     skip = 1,
     partitioning = "am"
   )
   
   csv
   #> FileSystemDataset with 2 csv files
   #> mpg: double
   #> cyl: int64
   #> disp: double
   #> hp: int64
   #> drat: double
   #> wt: double
   #> qsec: double
   #> vs: int64
   #> gear: int64
   #> carb: int64
   
   ## This is all fine, but when we try it again without hive-style partitioning... ##
   
   tf <- tempfile()
   dir.create(tf)
   
   arrow::write_dataset(mtcars, tf, partitioning = "am", format = "csv", hive_style = FALSE)
   arrow::write_dataset(mtcars, tf, partitioning = "am", format = "parquet", hive_style = FALSE)
   
   schema_with_am <- schema(mpg = float64(), cyl = int64(), disp = float64(), hp = int64(), 
       drat = float64(), wt = float64(), qsec = float64(), vs = int64(), 
       am = int64(), gear = int64(), carb = int64())
   
   schema_without_am <- schema(mpg = float64(), cyl = int64(), disp = float64(), hp = int64(), 
       drat = float64(), wt = float64(), qsec = float64(), vs = int64(), gear = int64(), carb = int64())
   
   # parquet is fine with the schema containing the partitioning value
   pq = open_dataset(
     tf,
     format = "parquet",
     factory_options = list(exclude_invalid_files = TRUE),
     schema = schema_with_am,
     partitioning = "am"
   )
   collect(pq)
   #> # A tibble: 32 × 11
   #>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   #>    <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
   #>  1  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
   #>  2  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
   #>  3  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
   #>  4  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
   #>  5  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
   #>  6  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
   #>  7  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
   #>  8  17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4
   #>  9  16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3
   #> 10  17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3
   #> # … with 22 more rows
   
   # CSV shows 0 files with the schema containing the partitioning value
   csv_nope = open_dataset(
     tf,
     format = "csv",
     factory_options = list(exclude_invalid_files = TRUE),
     schema = schema_with_am,
     partitioning = "am",
     skip = 1
   )
   csv_nope
   #> FileSystemDataset with 0 csv files
   #> mpg: double
   #> cyl: int64
   #> disp: double
   #> hp: int64
   #> drat: double
   #> wt: double
   #> qsec: double
   #> vs: int64
   #> am: int64
   #> gear: int64
   #> carb: int64
   
   # We now get an error when we provide a schema without the partitioning value
   csv = open_dataset(
     tf,
     format = "csv",
     factory_options = list(exclude_invalid_files = TRUE),
     schema = schema_without_am,
     skip = 1,
     partitioning = "am"
   )
   #> Error in `open_dataset()`:
   #> ! Invalid: No match for FieldRef.Name(am) in mpg: double
   #> cyl: int64
   #> disp: double
   #> hp: int64
   #> drat: double
   #> wt: double
   #> qsec: double
   #> vs: int64
   #> gear: int64
   #> carb: int64
   #> /home/nic2/arrow/cpp/src/arrow/type.h:1852  CheckNonEmpty(matches, root)
   #> /home/nic2/arrow/cpp/src/arrow/dataset/partition.cc:625  ref.FindOne(*schema).status()
   #> /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:279  factory->Finish(schema)
   
   #> Backtrace:
   #>     ▆
   #>  1. └─arrow::open_dataset(...)
   #>  2.   └─base::tryCatch(...) at r/R/dataset.R:220:2
   #>  3.     └─base (local) tryCatchList(expr, classes, parentenv, handlers)
   #>  4.       └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
   #>  5.         └─value[[3L]](cond)
   #>  6.           └─arrow:::augment_io_error_msg(e, call, format = format) at r/R/dataset.R:226:6
   #>  7.             └─rlang::abort(msg, call = call) at r/R/util.R:251:2
   
   # No error when the partitioning variable is a schema object, BUT partitioning
   #   column isn't included in the data 
   csv = open_dataset(
     tf,
     format = "csv",
     factory_options = list(exclude_invalid_files = TRUE),
     schema = schema_without_am,
     skip = 1,
     partitioning = schema(am = int32())
   )
   
   csv
   #> FileSystemDataset with 2 csv files
   #> mpg: double
   #> cyl: int64
   #> disp: double
   #> hp: int64
   #> drat: double
   #> wt: double
   #> qsec: double
   #> vs: int64
   #> gear: int64
   #> carb: int64
   
   # If we try the schema with the partitioning variable AND passing in a schema 
   #   in as partitioning variable, we get 0 files
   csv = open_dataset(
     tf,
     format = "csv",
     factory_options = list(exclude_invalid_files = TRUE),
     schema = schema_with_am,
     skip = 1,
     partitioning = schema(am = int32())
   )
   
   csv
   #> FileSystemDataset with 0 csv files
   #> mpg: double
   #> cyl: int64
   #> disp: double
   #> hp: int64
   #> drat: double
   #> wt: double
   #> qsec: double
   #> vs: int64
   #> am: int64
   #> gear: int64
   #> carb: int64
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic commented on issue #34640: [C++] Can't read in partitioning column in CSV datasets when both (non-hive) partition and schema supplied

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.

thisisnic commented on issue #34640:
URL: https://github.com/apache/arrow/issues/34640#issuecomment-1714146681

   This appears to be an R implementation issue rather than a C++ issue, because the following code can be run with PyArrow and gets the expected results:
   
   ```
   import numpy.random
   import pyarrow as pa
   import pyarrow.dataset as ds
   import pyarrow.dataset as ds
   
   # Create and write dataset
   data = pa.table({"day": numpy.random.randint(1, 31, size=100),
                    "month": numpy.random.randint(1, 12, size=100),
                    "year": [2000 + x // 10 for x in range(100)]})
   
   ds.write_dataset(data, "./partitioned", format="parquet",
                    partitioning=ds.partitioning(pa.schema([("year", pa.int16())])))
   
   # Read and print dataset
   dataset = ds.dataset("./partitioned", format="parquet", partitioning = ds.partitioning(pa.schema([("year", pa.int16())])),
                       schema = pa.schema([("day", pa.int16()), ("month", pa.int16()), ("year", pa.int16())])
                       )
   print(dataset.to_table())
   ```
   
   ```
   pyarrow.Table
   day: int16
   month: int16
   year: int16
   ----
   day: [[2,29,16,16,14,21,1,18,22,17],[21,16,22,24,17,8,30,10,4,22],...,[6,1,16,5,26,30,18,11,18,3],[17,15,5,30,23,21,9,11,4,22]]
   month: [[6,2,6,6,3,11,1,11,7,9],[11,8,3,5,6,3,5,7,9,9],...,[8,5,9,4,3,9,9,7,9,10],[1,5,9,1,2,8,2,6,11,4]]
   year: [[2000,2000,2000,2000,2000,2000,2000,2000,2000,2000],[2001,2001,2001,2001,2001,2001,2001,2001,2001,2001],...,[2008,2008,2008,2008,2008,2008,2008,2008,2008,2008],[2009,2009,2009,2009,2009,2009,2009,2009,2009,2009]]
   ```
   
   And the equivalent with CSVs:
   
   ```
   ds.write_dataset(data, "./partitioned_csv", format="csv",
                    partitioning=ds.partitioning(pa.schema([("year", pa.int16())])))
   
   dataset = ds.dataset("./partitioned_csv", format="csv", partitioning = ds.partitioning(pa.schema([("year", pa.int32())])),
                       schema = pa.schema([("day", pa.int16()), ("month", pa.int16()), ("year", pa.int16())])
                       )
   print(dataset.to_table())
   ```
   
   gets
   
   ```
   pyarrow.Table
   day: int16
   month: int16
   year: int16
   ----
   day: [[2,29,16,16,14,21,1,18,22,17],[21,16,22,24,17,8,30,10,4,22],...,[6,1,16,5,26,30,18,11,18,3],[17,15,5,30,23,21,9,11,4,22]]
   month: [[6,2,6,6,3,11,1,11,7,9],[11,8,3,5,6,3,5,7,9,9],...,[8,5,9,4,3,9,9,7,9,10],[1,5,9,1,2,8,2,6,11,4]]
   year: [[2000,2000,2000,2000,2000,2000,2000,2000,2000,2000],[2001,2001,2001,2001,2001,2001,2001,2001,2001,2001],...,[2008,2008,2008,2008,2008,2008,2008,2008,2008,2008],[2009,2009,2009,2009,2009,2009,2009,2009,2009,2009]]
   
   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow] thisisnic closed issue #34640: [R] Can't read in partitioning column in CSV datasets when both (non-hive) partition and schema supplied

Posted by "thisisnic (via GitHub)" <gi...@apache.org>.

thisisnic closed issue #34640: [R] Can't read in partitioning column in CSV datasets when both (non-hive) partition and schema supplied
URL: https://github.com/apache/arrow/issues/34640


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org