Posted to dev@arrow.apache.org by Pradeep Gollakota <pg...@google.com.INVALID> on 2022/11/17 23:57:52 UTC

[DISCUSS] JSON Canonical Extension Type

Hi folks!

I put together this specification for canonicalizing the JSON type in Arrow.

## Introduction
JSON is a widely used text-based data interchange format. There are many
use cases where a user has a column whose contents are a JSON-encoded
string. BigQuery's [JSON Type][1] and Parquet’s [JSON Logical Type][2] are
two such examples.

The JSON specification is defined in [RFC-8259][3]. However, many of the
most popular parsers support non-standard extensions, such as comments,
unquoted keys, and trailing commas.
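
To make the distinction concrete, here is a minimal sketch (not part of the proposal) that uses Python's standard `json` module as a stand-in for a strict RFC 8259 parser. The helper name `is_standard_json` is made up for illustration, and `parse_constant` is overridden because Python's parser accepts `NaN` and `Infinity` by default.

```python
import json


def _reject_constant(name: str):
    # Python's json module accepts NaN/Infinity by default; RFC 8259 does not.
    raise ValueError(f"non-standard constant: {name}")


def is_standard_json(text: str) -> bool:
    """Return True if `text` parses under (approximately) strict RFC 8259 rules."""
    try:
        json.loads(text, parse_constant=_reject_constant)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


samples = [
    '{"key": [1, 2, 3]}',     # standard JSON
    '{"key": [1, 2, 3],}',    # trailing comma
    '{key: 1}',               # unquoted key
    '{"key": 1} // comment',  # comment
    '{"key": NaN}',           # non-standard constant
]
results = [is_standard_json(s) for s in samples]
```

Only the first sample is accepted; a lenient parser might accept some or all of the others.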

## Extension Specification
* The name of the extension is `arrow.json`
* The storage type of the extension is `utf8`
* The extension type has no parameters
* The metadata MUST be either empty or a valid JSON object
    - There is no canonical metadata
    - Implementations MAY include implementation-specific metadata by using
      a namespaced key. For example `{"google.bigquery": {"my": "metadata"}}`
* Implementations...
    - MUST produce valid UTF-8 encoded text
    - SHOULD produce valid standard JSON
    - MAY produce valid non-standard JSON
    - MUST support parsing standard JSON
    - MAY support parsing non-standard JSON
    - SHOULD pass through contents that they do not understand
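
As a rough illustration of two of these rules (the metadata constraint and the pass-through recommendation), here is a minimal sketch using only the Python standard library. The names `metadata_is_valid`, `JsonValue`, and `read_value` are hypothetical and not taken from any Arrow implementation.

```python
import json
from dataclasses import dataclass
from typing import Any


def metadata_is_valid(serialized: str) -> bool:
    """The metadata MUST be either empty or a valid JSON object."""
    if serialized == "":
        return True
    try:
        return isinstance(json.loads(serialized), dict)
    except ValueError:
        return False


@dataclass(frozen=True)
class JsonValue:
    raw: str            # original text, always preserved for pass-through
    understood: bool    # True if this reader could parse `raw`
    parsed: Any = None  # parsed value when `understood` is True


def read_value(text: str) -> JsonValue:
    """Parse standard JSON; keep anything else untouched instead of failing."""
    try:
        return JsonValue(raw=text, understood=True, parsed=json.loads(text))
    except ValueError:
        return JsonValue(raw=text, understood=False)
```

A reader built this way can still write `raw` back out unchanged when a value uses extensions it does not support.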

## Forward compatibility
In the future we might allow this logical type to annotate a byte storage
type with a different text encoding. Implementations consuming JSON
logical types should therefore verify the storage type rather than assume
`utf8`.
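
One possible reading of this check, sketched in Python under the assumption that a consumer sees the storage type as a plain string name (the function `decode_json_storage` and the string names are illustrative only):

```python
def decode_json_storage(storage_type: str, data: bytes) -> str:
    """Decode one stored value, refusing storage types this reader does not know."""
    if storage_type != "utf8":
        # A future revision might allow other byte storage types or encodings;
        # fail loudly instead of silently misreading the bytes.
        raise NotImplementedError(f"unsupported storage type: {storage_type}")
    # Raises UnicodeDecodeError if the bytes are not valid UTF-8.
    return data.decode("utf-8")
```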

    [1]: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#json_type
    [2]: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#json
    [3]: https://datatracker.ietf.org/doc/html/rfc8259

Re: [DISCUSS] JSON Canonical Extension Type

Posted by Will Jones <wi...@gmail.com>.
Hello,

Sorry this hasn't gotten much attention recently. I just brought this up at
the Arrow community meeting, as I'd like to revive it.

It looks like there is a draft implementation up already [1].

I'm generally supportive of this, but I have a few questions:

1. Would we be able to make this extension type work on top of any of the
string types, including Utf8, LargeUtf8, and the (under consideration [2])
StringView types?
2. Does this imply a potential canonical extension type for every
text-based data format, such as HOCON, XML, and so on? If we agree JSON is
special, I think it's fine to have its own extension type. On the other
hand, it might be worth considering making a generic extension type for
serialized data, that is parameterized by the media type
("application/json" in this case).  This doesn't preclude the possibility
of building an extension type class / struct within Arrow implementations
that is specific to JSON; I don't think there's any hard rule that there
has to be a 1-1 correspondence between extension types in the format and
the concrete data structures in libraries.
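
A rough sketch of the media-type idea from point 2, with hypothetical names throughout (`SerializedDataType`, the `media_type` metadata key, and the `json_type` helper are not from any Arrow implementation):

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SerializedDataType:
    """Hypothetical generic extension type for serialized text data."""

    media_type: str             # e.g. "application/json"
    storage_type: str = "utf8"  # could also be "large_utf8", per point 1

    def serialize_metadata(self) -> str:
        # The media type is the only parameter, stored as a JSON object.
        return json.dumps({"media_type": self.media_type})

    @classmethod
    def from_metadata(cls, serialized: str, storage_type: str = "utf8"):
        return cls(json.loads(serialized)["media_type"], storage_type)


def json_type(storage_type: str = "utf8") -> SerializedDataType:
    """JSON-specific convenience constructor layered on the generic type."""
    return SerializedDataType("application/json", storage_type)
```

Here the format would define one parameterized type, while a library could still expose a JSON-specific entry point such as `json_type()`.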

Best,

Will Jones

[1] https://github.com/apache/arrow/pull/13901
[2] https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt


On Thu, Dec 1, 2022 at 12:23 AM Antoine Pitrou <an...@python.org> wrote:

>
> HOCON is a superset of JSON, so I'm not sure making it an extension type
> based on JSON would be a good idea.

Re: [DISCUSS] JSON Canonical Extension Type

Posted by Antoine Pitrou <an...@python.org>.
HOCON is a superset of JSON, so I'm not sure making it an extension type
based on JSON would be a good idea.


On 1 Dec 2022 at 06:23, Micah Kornfield wrote:
>>
>> Can a logical extension be based on another logical extension?
> 
> Potentially, but this is mostly an implementation detail; each type should
> have its own specification, IMO.
> 
> HOCON support might be nice..
> 
> I'm not sure if this is common enough to warrant a canonical type within
> Arrow but you are welcome to propose something if you would like.
> 
> Cheers,
> Micah

Re: [DISCUSS] JSON Canonical Extension Type

Posted by Micah Kornfield <em...@gmail.com>.
>
> Can a logical extension be based on another logical extension?

Potentially, but this is mostly an implementation detail; each type should
have its own specification, IMO.

> HOCON support might be nice..

I'm not sure if this is common enough to warrant a canonical type within
Arrow but you are welcome to propose something if you would like.

Cheers,
Micah

On Mon, Nov 28, 2022 at 11:55 AM Lee, David <Da...@blackrock.com.invalid>
wrote:

> Can a logical extension be based on another logical extension?
>
> HOCON support might be nice..

RE: [DISCUSS] JSON Canonical Extension Type

Posted by "Lee, David" <Da...@blackrock.com.INVALID>.
Can a logical extension be based on another logical extension?

HOCON support might be nice..

-----Original Message-----
From: Micah Kornfield <em...@gmail.com> 
Sent: Monday, November 28, 2022 11:50 AM
To: dev@arrow.apache.org
Subject: Re: [DISCUSS] JSON Canonical Extension Type


This seems like a reasonable definition to me.  Since there hasn't been much feedback, I think maybe following through with an implementation + this description in a PR would be the next step.  If there isn't further feedback on this, once the PR is up we can try to have a vote (which might bring up some more feedback, but hopefully wouldn't cause too much implementation churn).

Thanks,
Micah


This message may contain information that is confidential or privileged. If you are not the intended recipient, please advise the sender immediately and delete this message. See http://www.blackrock.com/corporate/compliance/email-disclaimers for further information.  Please refer to http://www.blackrock.com/corporate/compliance/privacy-policy for more information about BlackRock’s Privacy Policy.


For a list of BlackRock's office addresses worldwide, see http://www.blackrock.com/corporate/about-us/contacts-locations.

© 2022 BlackRock, Inc. All rights reserved.

Re: [DISCUSS] JSON Canonical Extension Type

Posted by Micah Kornfield <em...@gmail.com>.
This seems like a reasonable definition to me.  Since there hasn't been
much feedback, I think maybe following through with an implementation + this
description in a PR would be the next step.  If there isn't further
feedback on this, once the PR is up we can try to have a vote (which might
bring up some more feedback, but hopefully wouldn't cause too much
implementation churn).

Thanks,
Micah
