Posted to jira@arrow.apache.org by "Todd Farmer (Jira)" <ji...@apache.org> on 2022/07/12 14:05:03 UTC
[jira] [Assigned] (ARROW-14705) [C++] unify_schemas can't handle int64 + double, affects CSV dataset

     [ https://issues.apache.org/jira/browse/ARROW-14705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Farmer reassigned ARROW-14705:
-----------------------------------

    Assignee:     (was: David Li)

This issue was last updated over 90 days ago, which may be an indication it is no longer being actively worked. To better reflect the current state, the issue is being unassigned. Please feel free to re-take assignment of the issue if it is being actively worked, or if you plan to start that work soon.

> [C++] unify_schemas can't handle int64 + double, affects CSV dataset
> --------------------------------------------------------------------
>
>                 Key: ARROW-14705
>                 URL: https://issues.apache.org/jira/browse/ARROW-14705
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python, R
>            Reporter: Neal Richardson
>            Priority: Major
>              Labels: pull-request-available, query-engine
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> A Twitter question ("how can I make arrow's csv reader not make int64 for integers") turned out to originate from this scenario: some CSVs in a directory have only integer values in a column while others have decimals in that same column, and the files can't be used together in a dataset.
> {code:r}
> library(arrow, warn.conflicts = FALSE)
> library(dplyr, warn.conflicts = FALSE)
> ds_dir <- tempfile()
> dir.create(ds_dir)
> cat("a\n1", file = file.path(ds_dir, "1.csv"))
> cat("a\n1.1", file = file.path(ds_dir, "2.csv"))
> ds <- open_dataset(ds_dir, format = "csv")
> ds
> #> FileSystemDataset with 2 csv files
> #> a: int64
> ## It just picked the schema of the first file
> collect(ds)
> #> Error: Invalid: Could not open CSV input source '/private/var/folders/yv/b6mwztyj0r11r8pnsbmpltx00000gn/T/RtmpzENOMb/filea9c3292e06dd/2.csv': Invalid: In CSV column #0: Row #2: CSV conversion error to int64: invalid value '1.1'
> #> ../src/arrow/csv/converter.cc:492  decoder_.Decode(data, size, quoted, &value)
> #> ../src/arrow/csv/parser.h:123  status
> #> ../src/arrow/csv/converter.cc:496  parser.VisitColumn(col_index, visit)
> #> ../src/arrow/csv/reader.cc:462  internal::UnwrapOrRaise(maybe_decoded_arrays)
> #> ../src/arrow/compute/exec/exec_plan.cc:398  iterator_.Next()
> #> ../src/arrow/record_batch.cc:318  ReadNext(&batch)
> #> ../src/arrow/record_batch.cc:329  ReadAll(&batches)
> ## Let's try again and tell it to unify schemas. Should result in a float64 type
> ds <- open_dataset(ds_dir, format = "csv", unify_schemas = TRUE)
> #> Error: Invalid: Unable to merge: Field a has incompatible types: int64 vs double
> #> ../src/arrow/type.cc:1621  fields_[i]->MergeWith(field)
> #> ../src/arrow/type.cc:1684  AddField(field)
> #> ../src/arrow/type.cc:1755  builder.AddSchema(schema)
> #> ../src/arrow/dataset/discovery.cc:251  Inspect(options.inspect_options)
> {code}
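
The behavior the report asks for (merging int64 and double into a float64 column) can be sketched roughly as below. This is a plain-Python illustration, not Arrow's implementation: the names `merge_types`, `unify_schemas`, and `_NUMERIC_RANK` are hypothetical, schemas are modeled as simple dicts of column name to type name, and only the two types from the report are given a promotion rule.

```python
from typing import Dict, List

# Hypothetical promotion ranks for this sketch only: when two numeric types
# disagree, the one with the higher rank wins. Note that widening int64 to
# double can lose precision for very large integers.
_NUMERIC_RANK = {"int64": 0, "double": 1}


def merge_types(left: str, right: str) -> str:
    """Return a single type name that can hold values of both inputs."""
    if left == right:
        return left
    if left in _NUMERIC_RANK and right in _NUMERIC_RANK:
        # Pick whichever numeric type ranks higher (int64 + double -> double).
        return max(left, right, key=_NUMERIC_RANK.__getitem__)
    raise TypeError(f"cannot merge incompatible types: {left} vs {right}")


def unify_schemas(schemas: List[Dict[str, str]]) -> Dict[str, str]:
    """Merge per-file schemas (column name -> type name) into one schema."""
    unified: Dict[str, str] = {}
    for schema in schemas:
        for name, type_name in schema.items():
            if name in unified:
                unified[name] = merge_types(unified[name], type_name)
            else:
                unified[name] = type_name
    return unified


# The two files from the report: 1.csv infers int64, 2.csv infers double.
file_1 = {"a": "int64"}
file_2 = {"a": "double"}
unified = unify_schemas([file_1, file_2])
```

With a rule like this, the order in which files are inspected no longer matters: the unified type for `a` is the wider of the two, rather than whichever file happened to come first.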



--
This message was sent by Atlassian Jira
(v8.20.10#820010)