You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "thisisnic (via GitHub)" <gi...@apache.org> on 2023/03/20 10:15:03 UTC
[GitHub] [arrow] thisisnic opened a new issue, #34640: [C++] Can't read in partitioning column in CSV datasets when both partition and schema supplied
thisisnic opened a new issue, #34640:
URL: https://github.com/apache/arrow/issues/34640
### Describe the bug, including details regarding any error messages, version, and platform.
Surfaced via R:
``` r
library(dplyr)
library(arrow)
# set up temporary directory
tf <- tempfile()
dir.create(tf)
# set up dummy dataset
df <- tibble::tibble(group = rep(1:2, each = 5), value = 1:10)
df
#> # A tibble: 10 × 2
#> group value
#> <int> <int>
#> 1 1 1
#> 2 1 2
#> 3 1 3
#> 4 1 4
#> 5 1 5
#> 6 2 6
#> 7 2 7
#> 8 2 8
#> 9 2 9
#> 10 2 10
# write dataset
write_dataset(df, tf, format = "csv", partitioning = "group", hive_style = FALSE)
list.files(tf, recursive = TRUE)
#> [1] "1/part-0.csv" "2/part-0.csv"
# with just partitioning, successfully can read back in partitioning variable
open_dataset(
tf,
format = "csv",
partitioning = "group"
) %>% collect()
#> # A tibble: 10 × 2
#> value group
#> <int> <int>
#> 1 1 1
#> 2 2 1
#> 3 3 1
#> 4 4 1
#> 5 5 1
#> 6 6 2
#> 7 7 2
#> 8 8 2
#> 9 9 2
#> 10 10 2
# with partitioning and schema supplied, "group" variable is not included
open_dataset(
tf,
format = "csv",
schema = schema(value = int32()),
skip = 1,
partitioning = schema(group = int32())
) %>% collect()
#> # A tibble: 10 × 1
#> value
#> <int>
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
#> 6 6
#> 7 7
#> 8 8
#> 9 9
#> 10 10
# we can't add the partitioning variable to the schema or we get an error
open_dataset(
tf,
format = "csv",
schema = schema(value = int32(), group = int32()),
skip = 1,
partitioning = schema(group = int32())
) %>% collect()
#> Error in `compute.Dataset()` at r/R/dplyr-collect.R:33:2:
#> ! Invalid: Could not open CSV input source '/tmp/RtmpUTGEHf/file492ad7322363a/1/part-0.csv': Invalid: CSV parse error: Row #2: Expected 2 columns, got 1: 1
#> /home/nic2/arrow/cpp/src/arrow/csv/parser.cc:477 (ParseLine<SpecializedOptions, false>(values_writer, parsed_writer, data, data_end, is_final, &line_end, bulk_filter))
#> /home/nic2/arrow/cpp/src/arrow/csv/parser.cc:566 ParseChunk<SpecializedOptions>( &values_writer, &parsed_writer, data, data_end, is_final, rows_in_chunk, &data, &finished_parsing, bulk_filter)
#> /home/nic2/arrow/cpp/src/arrow/csv/reader.cc:426 parser->ParseFinal(views, &parsed_size)
#> Backtrace:
#> ▆
#> 1. ├─... %>% collect()
#> 2. ├─dplyr::collect(.)
#> 3. └─arrow:::collect.Dataset(.)
#> 4. ├─arrow:::collect.ArrowTabular(compute.Dataset(x), as_data_frame) at r/R/dplyr-collect.R:33:2
#> 5. │ └─base::as.data.frame(x, ...) at r/R/dplyr-collect.R:27:4
#> 6. └─arrow:::compute.Dataset(x) at r/R/dplyr-collect.R:33:2
#> 7. └─base::tryCatch(...) at r/R/dplyr-collect.R:40:2
#> 8. └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#> 9. └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#> 10. └─value[[3L]](cond)
#> 11. └─arrow:::augment_io_error_msg(e, call, schema = schema()) at r/R/dplyr-collect.R:49:6
#> 12. └─rlang::abort(msg, call = call) at r/R/util.R:251:2
```
This was discussed on #32938 but the solution mentioned there works for Parquet files but not CSV.
### Component(s)
C++
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] thisisnic commented on issue #34640: [C++] Can't read in partitioning column in CSV datasets when both (non-hive) partition and schema supplied
Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #34640:
URL: https://github.com/apache/arrow/issues/34640#issuecomment-1475986515
This is further complicated by differences in behaviour between hive-style and non-hive-style datasets:
``` r
library(arrow)
library(dplyr)
tf <- tempfile()
dir.create(tf)
arrow::write_dataset(mtcars, tf, partitioning = "am", format = "csv")
arrow::write_dataset(mtcars, tf, partitioning = "am", format = "parquet")
schema_with_am <- schema(mpg = float64(), cyl = int64(), disp = float64(), hp = int64(),
drat = float64(), wt = float64(), qsec = float64(), vs = int64(),
am = int64(), gear = int64(), carb = int64())
schema_without_am <- schema(mpg = float64(), cyl = int64(), disp = float64(), hp = int64(),
drat = float64(), wt = float64(), qsec = float64(), vs = int64(), gear = int64(), carb = int64())
# parquet is fine with the schema containing the partitioning value
pq = open_dataset(
tf,
format = "parquet",
factory_options = list(exclude_invalid_files = TRUE),
schema = schema_with_am,
partitioning = "am"
)
collect(pq)
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
#> 1 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 2 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 3 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 4 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 5 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 6 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 7 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> 8 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
#> 9 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
#> 10 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
#> # … with 22 more rows
# CSV shows 0 files with the schema containing the partitioning value
csv_nope = open_dataset(
tf,
format = "csv",
factory_options = list(exclude_invalid_files = TRUE),
schema = schema_with_am,
# partitioning = "am",
skip = 1
)
csv_nope
#> FileSystemDataset with 0 csv files
#> mpg: double
#> cyl: int64
#> disp: double
#> hp: int64
#> drat: double
#> wt: double
#> qsec: double
#> vs: int64
#> am: int64
#> gear: int64
#> carb: int64
# CSV is fine with the schema without the partitioning value
csv = open_dataset(
tf,
format = "csv",
factory_options = list(exclude_invalid_files = TRUE),
schema = schema_without_am,
skip = 1,
partitioning = "am"
)
csv
#> FileSystemDataset with 2 csv files
#> mpg: double
#> cyl: int64
#> disp: double
#> hp: int64
#> drat: double
#> wt: double
#> qsec: double
#> vs: int64
#> gear: int64
#> carb: int64
## This is all fine, but when we try it again without hive-style partitioning... ##
tf <- tempfile()
dir.create(tf)
arrow::write_dataset(mtcars, tf, partitioning = "am", format = "csv", hive_style = FALSE)
arrow::write_dataset(mtcars, tf, partitioning = "am", format = "parquet", hive_style = FALSE)
schema_with_am <- schema(mpg = float64(), cyl = int64(), disp = float64(), hp = int64(),
drat = float64(), wt = float64(), qsec = float64(), vs = int64(),
am = int64(), gear = int64(), carb = int64())
schema_without_am <- schema(mpg = float64(), cyl = int64(), disp = float64(), hp = int64(),
drat = float64(), wt = float64(), qsec = float64(), vs = int64(), gear = int64(), carb = int64())
# parquet is fine with the schema containing the partitioning value
pq = open_dataset(
tf,
format = "parquet",
factory_options = list(exclude_invalid_files = TRUE),
schema = schema_with_am,
partitioning = "am"
)
collect(pq)
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> <int>
#> 1 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 2 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 3 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 4 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 5 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 6 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 7 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> 8 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
#> 9 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
#> 10 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
#> # … with 22 more rows
# CSV shows 0 files with the schema containing the partitioning value
csv_nope = open_dataset(
tf,
format = "csv",
factory_options = list(exclude_invalid_files = TRUE),
schema = schema_with_am,
partitioning = "am",
skip = 1
)
csv_nope
#> FileSystemDataset with 0 csv files
#> mpg: double
#> cyl: int64
#> disp: double
#> hp: int64
#> drat: double
#> wt: double
#> qsec: double
#> vs: int64
#> am: int64
#> gear: int64
#> carb: int64
# We now get an error when we provide a schema without the partitioning value
csv = open_dataset(
tf,
format = "csv",
factory_options = list(exclude_invalid_files = TRUE),
schema = schema_without_am,
skip = 1,
partitioning = "am"
)
#> Error in `open_dataset()`:
#> ! Invalid: No match for FieldRef.Name(am) in mpg: double
#> cyl: int64
#> disp: double
#> hp: int64
#> drat: double
#> wt: double
#> qsec: double
#> vs: int64
#> gear: int64
#> carb: int64
#> /home/nic2/arrow/cpp/src/arrow/type.h:1852 CheckNonEmpty(matches, root)
#> /home/nic2/arrow/cpp/src/arrow/dataset/partition.cc:625 ref.FindOne(*schema).status()
#> /home/nic2/arrow/cpp/src/arrow/dataset/discovery.cc:279 factory->Finish(schema)
#> Backtrace:
#> ▆
#> 1. └─arrow::open_dataset(...)
#> 2. └─base::tryCatch(...) at r/R/dataset.R:220:2
#> 3. └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#> 4. └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#> 5. └─value[[3L]](cond)
#> 6. └─arrow:::augment_io_error_msg(e, call, format = format) at r/R/dataset.R:226:6
#> 7. └─rlang::abort(msg, call = call) at r/R/util.R:251:2
# No error when the partitioning variable is a schema object, BUT partitioning
# column isn't included in the data
csv = open_dataset(
tf,
format = "csv",
factory_options = list(exclude_invalid_files = TRUE),
schema = schema_without_am,
skip = 1,
partitioning = schema(am = int32())
)
csv
#> FileSystemDataset with 2 csv files
#> mpg: double
#> cyl: int64
#> disp: double
#> hp: int64
#> drat: double
#> wt: double
#> qsec: double
#> vs: int64
#> gear: int64
#> carb: int64
# If we try the schema with the partitioning variable AND passing in a schema
# in as partitioning variable, we get 0 files
csv = open_dataset(
tf,
format = "csv",
factory_options = list(exclude_invalid_files = TRUE),
schema = schema_with_am,
skip = 1,
partitioning = schema(am = int32())
)
csv
#> FileSystemDataset with 0 csv files
#> mpg: double
#> cyl: int64
#> disp: double
#> hp: int64
#> drat: double
#> wt: double
#> qsec: double
#> vs: int64
#> am: int64
#> gear: int64
#> carb: int64
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] thisisnic commented on issue #34640: [C++] Can't read in partitioning column in CSV datasets when both (non-hive) partition and schema supplied
Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic commented on issue #34640:
URL: https://github.com/apache/arrow/issues/34640#issuecomment-1714146681
This appears to be an R implementation issue rather than a C++ issue, because the following code can be run with PyArrow and gets the expected results:
```
import numpy.random
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.dataset as ds
# Create and write dataset
data = pa.table({"day": numpy.random.randint(1, 31, size=100),
"month": numpy.random.randint(1, 12, size=100),
"year": [2000 + x // 10 for x in range(100)]})
ds.write_dataset(data, "./partitioned", format="parquet",
partitioning=ds.partitioning(pa.schema([("year", pa.int16())])))
# Read and print dataset
dataset = ds.dataset("./partitioned", format="parquet", partitioning = ds.partitioning(pa.schema([("year", pa.int16())])),
schema = pa.schema([("day", pa.int16()), ("month", pa.int16()), ("year", pa.int16())])
)
print(dataset.to_table())
```
```
pyarrow.Table
day: int16
month: int16
year: int16
----
day: [[2,29,16,16,14,21,1,18,22,17],[21,16,22,24,17,8,30,10,4,22],...,[6,1,16,5,26,30,18,11,18,3],[17,15,5,30,23,21,9,11,4,22]]
month: [[6,2,6,6,3,11,1,11,7,9],[11,8,3,5,6,3,5,7,9,9],...,[8,5,9,4,3,9,9,7,9,10],[1,5,9,1,2,8,2,6,11,4]]
year: [[2000,2000,2000,2000,2000,2000,2000,2000,2000,2000],[2001,2001,2001,2001,2001,2001,2001,2001,2001,2001],...,[2008,2008,2008,2008,2008,2008,2008,2008,2008,2008],[2009,2009,2009,2009,2009,2009,2009,2009,2009,2009]]
```
And the equivalent with CSVs:
```
ds.write_dataset(data, "./partitioned_csv", format="csv",
partitioning=ds.partitioning(pa.schema([("year", pa.int16())])))
dataset = ds.dataset("./partitioned_csv", format="csv", partitioning = ds.partitioning(pa.schema([("year", pa.int32())])),
schema = pa.schema([("day", pa.int16()), ("month", pa.int16()), ("year", pa.int16())])
)
print(dataset.to_table())
```
gets
```
pyarrow.Table
day: int16
month: int16
year: int16
----
day: [[2,29,16,16,14,21,1,18,22,17],[21,16,22,24,17,8,30,10,4,22],...,[6,1,16,5,26,30,18,11,18,3],[17,15,5,30,23,21,9,11,4,22]]
month: [[6,2,6,6,3,11,1,11,7,9],[11,8,3,5,6,3,5,7,9,9],...,[8,5,9,4,3,9,9,7,9,10],[1,5,9,1,2,8,2,6,11,4]]
year: [[2000,2000,2000,2000,2000,2000,2000,2000,2000,2000],[2001,2001,2001,2001,2001,2001,2001,2001,2001,2001],...,[2008,2008,2008,2008,2008,2008,2008,2008,2008,2008],[2009,2009,2009,2009,2009,2009,2009,2009,2009,2009]]
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow] thisisnic closed issue #34640: [R] Can't read in partitioning column in CSV datasets when both (non-hive) partition and schema supplied
Posted by "thisisnic (via GitHub)" <gi...@apache.org>.
thisisnic closed issue #34640: [R] Can't read in partitioning column in CSV datasets when both (non-hive) partition and schema supplied
URL: https://github.com/apache/arrow/issues/34640
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org