Posted to issues@arrow.apache.org by "Nicola Crane (Jira)" <ji...@apache.org> on 2021/11/17 22:01:00 UTC

[jira] [Created] (ARROW-14743) [C++] Error reading in dataset when partitioning variable in schema

Nicola Crane created ARROW-14743:
------------------------------------

             Summary: [C++] Error reading in dataset when partitioning variable in schema
                 Key: ARROW-14743
                 URL: https://issues.apache.org/jira/browse/ARROW-14743
             Project: Apache Arrow
          Issue Type: New Feature
          Components: C++
            Reporter: Nicola Crane


If partitioned data is read back in with a schema that contains the partitioning variable, reading fails with the error shown below. The error occurs whether or not the {{partitioning}} argument is specified.

{code:r}
library(arrow)
library(dplyr)

data(diamonds, package='ggplot2')
write_dataset(diamonds, path='diamonds', format='csv', partitioning='cut')

diamond_schema <- schema(
    carat=float64(),
    cut=string(),
    color=string(),
    clarity=string(),
    depth=float64(),
    table=float64(),
    price=float64(),
    x=float64(),
    y=float64(),
    z=float64()
)

open_dataset('diamonds', format='csv', schema=diamond_schema, partitioning = "cut") %>%
  collect()

# Error: Invalid: Could not open CSV input source '/home/nic2/arrow/r/diamonds/cut=Fair/part-0.csv': Invalid: CSV parse error: Row #1: Expected 10 columns, got 9: "carat","color","clarity","depth","table","price","x","y","z"

{code}
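The error message suggests where the mismatch comes from: the header of the written file lists 9 columns and omits {{cut}}, while the supplied schema has 10 fields. The sketch below reproduces that layout with base R only, on a small made-up data frame. The data frame {{toy}}, the directory name {{toy_dataset}}, and the loop that writes the files are illustrative assumptions and not Arrow's implementation; the point is only that the partition value lives in the directory name rather than in the file.

```r
# Toy stand-in for the diamonds data (hypothetical values).
toy <- data.frame(
  carat = c(0.23, 0.21, 0.29),
  cut   = c("Fair", "Ideal", "Fair"),
  price = c(326, 327, 334)
)

# Mimic a hive-style partitioned layout: one directory per value of `cut`,
# with the `cut` column dropped from the file contents.
root <- file.path(tempdir(), "toy_dataset")
for (level in unique(toy$cut)) {
  part_dir <- file.path(root, paste0("cut=", level))
  dir.create(part_dir, recursive = TRUE, showWarnings = FALSE)
  part <- toy[toy$cut == level, setdiff(names(toy), "cut"), drop = FALSE]
  write.csv(part, file.path(part_dir, "part-0.csv"), row.names = FALSE)
}

# Compare the field names a full schema would list with the file's header.
schema_fields <- names(toy)
header <- names(read.csv(file.path(root, "cut=Fair", "part-0.csv"), nrows = 1))
missing_in_file <- setdiff(schema_fields, header)

print(header)           # columns physically present in the file
print(missing_in_file)  # fields listed in the schema but absent from the file
```

If the CSV reader expects every schema field to appear as a column in each file, a schema that includes the partition field would ask for one more column than the file contains, which is consistent with the "Expected 10 columns, got 9" message above.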



--
This message was sent by Atlassian Jira
(v8.20.1#820001)