Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/11/15 14:49:00 UTC
[jira] [Updated] (ARROW-14705) [C++] unify_schemas can't handle int64 + double, affects CSV dataset
[ https://issues.apache.org/jira/browse/ARROW-14705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weston Pace updated ARROW-14705:
--------------------------------
Labels: query-engine (was: )
> [C++] unify_schemas can't handle int64 + double, affects CSV dataset
> --------------------------------------------------------------------
>
> Key: ARROW-14705
> URL: https://issues.apache.org/jira/browse/ARROW-14705
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, R
> Reporter: Neal Richardson
> Priority: Major
> Labels: query-engine
>
> This came from a Twitter question: "how can I make arrow's csv reader not make int64 for integers". It turns out to originate from a scenario where some CSVs in a directory have only integer values in a column while others have decimals in that column, so you can't use them together in a dataset.
> {code:r}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> ds_dir <- tempfile()
> dir.create(ds_dir)
> cat("a\n1", file = file.path(ds_dir, "1.csv"))
> cat("a\n1.1", file = file.path(ds_dir, "2.csv"))
> ds <- open_dataset(ds_dir, format = "csv")
> ds
> #> FileSystemDataset with 2 csv files
> #> a: int64
> ## It just picked the schema of the first file
> collect(ds)
> #> Error: Invalid: Could not open CSV input source '/private/var/folders/yv/b6mwztyj0r11r8pnsbmpltx00000gn/T/RtmpzENOMb/filea9c3292e06dd/2.csv': Invalid: In CSV column #0: Row #2: CSV conversion error to int64: invalid value '1.1'
> #> ../src/arrow/csv/converter.cc:492 decoder_.Decode(data, size, quoted, &value)
> #> ../src/arrow/csv/parser.h:123 status
> #> ../src/arrow/csv/converter.cc:496 parser.VisitColumn(col_index, visit)
> #> ../src/arrow/csv/reader.cc:462 internal::UnwrapOrRaise(maybe_decoded_arrays)
> #> ../src/arrow/compute/exec/exec_plan.cc:398 iterator_.Next()
> #> ../src/arrow/record_batch.cc:318 ReadNext(&batch)
> #> ../src/arrow/record_batch.cc:329 ReadAll(&batches)
> ## Let's try again and tell it to unify schemas. Should result in a float64 type
> ds <- open_dataset(ds_dir, format = "csv", unify_schemas = TRUE)
> #> Error: Invalid: Unable to merge: Field a has incompatible types: int64 vs double
> #> ../src/arrow/type.cc:1621 fields_[i]->MergeWith(field)
> #> ../src/arrow/type.cc:1684 AddField(field)
> #> ../src/arrow/type.cc:1755 builder.AddSchema(schema)
> #> ../src/arrow/dataset/discovery.cc:251 Inspect(options.inspect_options)
> {code}
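
The expectation in the report (int64 + double should unify to a float64 column) can be illustrated with a minimal sketch. This is not Arrow's implementation or API: `merge_types` and `unify_schemas` below are hypothetical helpers, schemas are modeled as plain dicts of column name to type-name string, and only the integer-plus-float case is promoted.

```python
# Hypothetical sketch, NOT Arrow's implementation or API.
# Type names are plain strings that mirror the ones in the report.

INTEGER_TYPES = {"int8", "int16", "int32", "int64"}
FLOAT_TYPES = {"float", "double"}


def merge_types(left, right):
    """Return a common type for two column types, or raise ValueError."""
    if left == right:
        return left
    pair = {left, right}
    # An integer column combined with a floating-point column widens to double.
    if pair & INTEGER_TYPES and pair & FLOAT_TYPES:
        return "double"
    # Every other mismatch is treated as incompatible in this sketch.
    raise ValueError(f"Field has incompatible types: {left} vs {right}")


def unify_schemas(schemas):
    """Merge a list of {column name: type name} dicts, one per file."""
    unified = {}
    for schema in schemas:
        for name, dtype in schema.items():
            if name in unified:
                unified[name] = merge_types(unified[name], dtype)
            else:
                unified[name] = dtype
    return unified


# The two files from the report: 1.csv infers int64, 2.csv infers double.
print(unify_schemas([{"a": "int64"}, {"a": "double"}]))
```

Under these assumptions the merged schema for the two files has `a` as `double`, which is the behavior the report asks for from `unify_schemas = TRUE`.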
--
This message was sent by Atlassian Jira
(v8.20.1#820001)