
Posted to jira@arrow.apache.org by "Ben Harkins (Jira)" <ji...@apache.org> on 2022/10/20 16:50:00 UTC

[jira] [Assigned] (ARROW-18106) [C++] JSON reader ignores explicit schema with default unexpected_field_behavior="infer"

     [ https://issues.apache.org/jira/browse/ARROW-18106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ben Harkins reassigned ARROW-18106:
-----------------------------------

    Assignee: Ben Harkins

> [C++] JSON reader ignores explicit schema with default unexpected_field_behavior="infer"
> ----------------------------------------------------------------------------------------
>
>                 Key: ARROW-18106
>                 URL: https://issues.apache.org/jira/browse/ARROW-18106
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Assignee: Ben Harkins
>            Priority: Major
>              Labels: json
>
> Not 100% sure this is a "bug", but at least I find it an unexpected interplay between two options.
> By default, when reading JSON, we _infer_ the data type of columns, and when an explicit schema is specified, we _also_ by default infer the type of columns that are not in that schema. The docs for {{unexpected_field_behavior}}:
> > How JSON fields outside of explicit_schema (if given) are treated
> But it seems that if you specify a schema, and the parsing of one of the columns fails according to that schema, we still fall back to this default of inferring the data type (while I would have expected an error, since we should only infer for columns _not_ in the schema).
> Example code using pyarrow:
> {code:python}
> import io
> import pyarrow as pa
> from pyarrow import json
> s_json = """{"column":"2022-09-05T08:08:46.000"}"""
> opts = json.ParseOptions(explicit_schema=pa.schema([("column", pa.timestamp("s"))]))
> json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
> {code}
> The parsing fails here because the value has milliseconds and the type is "s", but the explicit schema is ignored, and we get a table with a string column as the result:
> {code}
> pyarrow.Table
> column: string
> ----
> column: [["2022-09-05T08:08:46.000"]]
> {code}
> But when adding {{unexpected_field_behavior="ignore"}}, we actually get the expected parse error:
> {code:python}
> opts = json.ParseOptions(explicit_schema=pa.schema([("column", pa.timestamp("s"))]), unexpected_field_behavior="ignore")
> json.read_json(io.BytesIO(s_json.encode()), parse_options=opts)
> {code}
> gives
> {code}
> ArrowInvalid: Failed of conversion of JSON to timestamp[s], couldn't parse:2022-09-05T08:08:46.000
> {code}
> This might be specific to timestamps: I don't directly see a similar issue with e.g. {{"column": "A"}} and setting the schema to "column" being int64.


