Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2022/12/01 05:23:47 UTC

Re: [DISCUSS] JSON Canonical Extension Type

>
> Can a logical extension be based on another logical extension?

Potentially, but this is mostly an implementation detail; each type should
have its own specification, IMO.

> HOCON support might be nice..

I'm not sure if this is common enough to warrant a canonical type within
Arrow but you are welcome to propose something if you would like.

Cheers,
Micah

On Mon, Nov 28, 2022 at 11:55 AM Lee, David <Da...@blackrock.com.invalid>
wrote:

> Can a logical extension be based on another logical extension?
>
> HOCON support might be nice..
>
> -----Original Message-----
> From: Micah Kornfield <em...@gmail.com>
> Sent: Monday, November 28, 2022 11:50 AM
> To: dev@arrow.apache.org
> Subject: Re: [DISCUSS] JSON Canonical Extension Type
>
> This seems like a reasonable definition to me.  Since there hasn't been
> much feedback, I think the next step would be to follow through with an
> implementation plus this description in a PR.  If there isn't further
> feedback, once the PR is up we can try to hold a vote (which might
> bring up some more feedback, but hopefully wouldn't cause too much
> implementation churn).
>
> Thanks,
> Micah
>
> On Thu, Nov 17, 2022 at 3:58 PM Pradeep Gollakota
> <pg...@google.com.invalid> wrote:
>
> > Hi folks!
> >
> > I put together this specification for canonicalizing the JSON type in
> > Arrow.
> >
> > ## Introduction
> > JSON is a widely used text-based data interchange format. There are
> > many use cases where a user has a column whose contents are a
> > JSON-encoded string. BigQuery's [JSON Type][1] and Parquet’s [JSON Logical
> > Type][2] are two such examples.
> >
> > The JSON specification is defined in [RFC-8259][3]. However, many of
> > the most popular parsers support non-standard extensions. Examples of
> > non-standard extensions to JSON include comments, unquoted keys,
> > and trailing commas.
> >
> > ## Extension Specification
> > * The name of the extension is `arrow.json`
> > * The storage type of the extension is `utf8`
> > * The extension type has no parameters
> > * The metadata MUST be either empty or a valid JSON object
> >     - There is no canonical metadata
> >     - Implementations MAY include implementation-specific metadata by
> > using a namespaced key. For example `{"google.bigquery": {"my":
> > "metadata"}}`
> > * Implementations...
> >     - MUST produce valid UTF-8 encoded text
> >     - SHOULD produce valid standard JSON
> >     - MAY produce valid non-standard JSON
> >     - MUST support parsing standard JSON
> >     - MAY support parsing non-standard JSON
> >     - SHOULD pass through contents that they do not understand
> >
> > ## Forward compatibility
> > In the future we might allow this logical type to annotate a byte
> > storage type with a different text encoding.  Implementations
> > consuming JSON logical types should therefore verify the storage type
> > rather than assume `utf8`.
> >
> >     [1]: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#json_type
> >     [2]: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#json
> >     [3]: https://datatracker.ietf.org/doc/html/rfc8259
> >
>
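
The rules in the quoted proposal (metadata is empty or a JSON object; producers SHOULD emit standard JSON; consumers SHOULD pass through what they do not understand) can be sketched in a few lines. This is only an illustration using Python's standard `json` module, not the draft implementation: the helper names (`is_valid_metadata`, `is_standard_json`, `classify`) are hypothetical, and "standard" here means "accepted by `json.loads` with the NaN/Infinity constants rejected", which is an assumption about how a consumer might approximate RFC 8259.

```python
import json

EXTENSION_NAME = "arrow.json"  # extension name from the proposal


def _reject_constant(name):
    # Python's json module accepts NaN/Infinity by default; RFC 8259 does not.
    raise ValueError(f"non-standard JSON constant: {name}")


def _parse_strict(text):
    # Raises ValueError (json.JSONDecodeError is a subclass) on anything
    # this sketch does not treat as standard JSON.
    return json.loads(text, parse_constant=_reject_constant)


def is_valid_metadata(serialized):
    """Metadata MUST be either empty or a valid JSON object."""
    if serialized == "":
        return True
    try:
        return isinstance(_parse_strict(serialized), dict)
    except ValueError:
        return False


def is_standard_json(value):
    """True if the value parses as standard JSON under the assumption above."""
    try:
        _parse_strict(value)
        return True
    except ValueError:
        return False


def classify(values):
    """Pass every value through unchanged, paired with a 'standard JSON' flag."""
    return [(v, is_standard_json(v)) for v in values]
```

A consumer built this way never rewrites or drops a value it cannot parse; it only records that the value is not standard JSON, which matches the "SHOULD pass through" rule.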

Re: [DISCUSS] JSON Canonical Extension Type

Posted by Will Jones <wi...@gmail.com>.
Hello,

Sorry this hasn't gotten much attention recently. I just brought this up at
the Arrow community meeting, as I'd like to revive it.

It looks like there is a draft implementation up already [1].

I'm generally supportive of this, but I have a few questions:

1. Would we be able to make this extension type work on top of any of the
string types, including Utf8, LargeUtf8, and the (under consideration [2])
StringView types?
2. Does this imply a potential canonical extension type for every
text-based data format, such as HOCON, XML, and so on? If we agree JSON is
special, I think it's fine to have its own extension type. On the other
hand, it might be worth considering making a generic extension type for
serialized data, that is parameterized by the media type
("application/json" in this case).  This doesn't preclude the possibility
of building an extension type class / struct within Arrow implementations
that is specific to JSON; I don't think there's any hard rule that there
has to be a 1-1 correspondence between extension types in the format and
the concrete data structures in libraries.

Best,

Will Jones

[1] https://github.com/apache/arrow/pull/13901
[2] https://lists.apache.org/thread/w88tpz76ox8h3rxkjl4so6rg3f1rv7wt
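
As a rough illustration of the second question above, a generic "serialized data" extension type parameterized by media type could look like the sketch below. Everything here is hypothetical: the extension name `example.serialized`, the class `SerializedType`, the lowercase storage labels, and the choice to serialize the parameter as a small JSON object are assumptions made for the example, not anything proposed in the thread or present in Arrow.

```python
import json
from dataclasses import dataclass

# Hypothetical extension name; the thread does not fix one.
GENERIC_NAME = "example.serialized"

# String storage types mentioned in question 1, as labels for this sketch.
STRING_STORAGE = ("utf8", "large_utf8", "string_view")


@dataclass(frozen=True)
class SerializedType:
    media_type: str        # e.g. "application/json"
    storage: str = "utf8"  # which string layout holds the text

    def __post_init__(self):
        if self.storage not in STRING_STORAGE:
            raise ValueError(f"unsupported storage type: {self.storage}")

    def serialize(self):
        # The media type is the only parameter in this sketch.
        return json.dumps({"media_type": self.media_type})

    @classmethod
    def deserialize(cls, storage, serialized):
        params = json.loads(serialized)
        return cls(media_type=params["media_type"], storage=storage)


def json_type(storage="utf8"):
    """A JSON-specific convenience built on top of the generic type."""
    return SerializedType("application/json", storage)
```

The `json_type` helper reflects the point that a library can expose a JSON-specific class or constructor even if the format defines only one generic extension type.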


On Thu, Dec 1, 2022 at 12:23 AM Antoine Pitrou <an...@python.org> wrote:

>
> HOCON is a superset of JSON, so I'm not sure making it an extension type
> based on JSON would be a good idea.

Re: [DISCUSS] JSON Canonical Extension Type

Posted by Antoine Pitrou <an...@python.org>.
HOCON is a superset of JSON, so I'm not sure making it an extension type
based on JSON would be a good idea.

