Posted to issues@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2021/11/12 22:06:00 UTC

[jira] [Created] (ARROW-14705) [C++] unify_schemas can't handle int64 + double, affects CSV dataset

Neal Richardson created ARROW-14705:
---------------------------------------

             Summary: [C++] unify_schemas can't handle int64 + double, affects CSV dataset
                 Key: ARROW-14705
                 URL: https://issues.apache.org/jira/browse/ARROW-14705
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, R
            Reporter: Neal Richardson


A Twitter question, "how can I make arrow's csv reader not make int64 for integers", turns out to originate from a scenario where some CSVs in a directory have only integer values in a column while others contain decimals, so the files can't be used together in a dataset.
{code:r}
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

ds_dir <- tempfile()
dir.create(ds_dir)
cat("a\n1", file = file.path(ds_dir, "1.csv"))
cat("a\n1.1", file = file.path(ds_dir, "2.csv"))

ds <- open_dataset(ds_dir, format = "csv")
ds
#> FileSystemDataset with 2 csv files
#> a: int64

## It just picked the schema of the first file
collect(ds)
#> Error: Invalid: Could not open CSV input source '/private/var/folders/yv/b6mwztyj0r11r8pnsbmpltx00000gn/T/RtmpzENOMb/filea9c3292e06dd/2.csv': Invalid: In CSV column #0: Row #2: CSV conversion error to int64: invalid value '1.1'
#> ../src/arrow/csv/converter.cc:492  decoder_.Decode(data, size, quoted, &value)
#> ../src/arrow/csv/parser.h:123  status
#> ../src/arrow/csv/converter.cc:496  parser.VisitColumn(col_index, visit)
#> ../src/arrow/csv/reader.cc:462  internal::UnwrapOrRaise(maybe_decoded_arrays)
#> ../src/arrow/compute/exec/exec_plan.cc:398  iterator_.Next()
#> ../src/arrow/record_batch.cc:318  ReadNext(&batch)
#> ../src/arrow/record_batch.cc:329  ReadAll(&batches)

## Let's try again and tell it to unify schemas. This should result in a float64 type
ds <- open_dataset(ds_dir, format = "csv", unify_schemas = TRUE)
#> Error: Invalid: Unable to merge: Field a has incompatible types: int64 vs double
#> ../src/arrow/type.cc:1621  fields_[i]->MergeWith(field)
#> ../src/arrow/type.cc:1684  AddField(field)
#> ../src/arrow/type.cc:1755  builder.AddSchema(schema)
#> ../src/arrow/dataset/discovery.cc:251  Inspect(options.inspect_options)
{code}
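
For illustration, the merge behavior expected above (int64 + double unifying to double) can be sketched in base R. This is only a toy model under stated assumptions: {{promote_type()}} and {{unify_schemas_sketch()}} are hypothetical helpers, not part of arrow, and schemas are represented as named character vectors rather than Arrow Schema objects. Note that widening int64 to double can lose precision for integers beyond 2^53, which any real fix would presumably have to consider.
{code:r}
# Hypothetical helper (not an arrow function): pick the wider of two type names.
promote_type <- function(x, y) {
  if (identical(x, y)) return(x)
  # Assumed widening order for this sketch: int64 may be promoted to double.
  numeric_rank <- c(int64 = 1L, double = 2L)
  if (all(c(x, y) %in% names(numeric_rank))) {
    return(names(numeric_rank)[max(numeric_rank[c(x, y)])])
  }
  stop("Unable to merge: incompatible types: ", x, " vs ", y)
}

# Hypothetical helper: merge schemas given as named character vectors
# (field name -> type name), promoting types when a field repeats.
unify_schemas_sketch <- function(schemas) {
  unified <- character()
  for (s in schemas) {
    for (field in names(s)) {
      if (field %in% names(unified)) {
        unified[[field]] <- promote_type(unified[[field]], s[[field]])
      } else {
        unified[[field]] <- s[[field]]
      }
    }
  }
  unified
}

# The two files from the example above: 1.csv infers int64, 2.csv infers double.
unify_schemas_sketch(list(c(a = "int64"), c(a = "double")))
{code}
Under this sketch, merging the two inferred schemas yields a single field {{a}} of type double instead of raising the "incompatible types" error.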


