Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2022/08/01 15:51:14 UTC
Re: [ARROW-17255] Logical JSON type in Arrow
>
> It would be reasonable to restrict JSON to utf8, and tell people they
> need to transcode in the rare cases where some obnoxious software
> outputs utf16-encoded JSON.
+1 I think this aligns with the latest JSON RFC [1] as well.
> Sounds good to me too. +1 on the canonical extension type option; maybe it
> should end up as a first-class type, but I'd like to see us try it without
> first and see what that tells us about the path for having an extension
> type get promoted to being a first-class type. This is something that has
> been discussed in principle before, but I don't know we've worked out what
> it would look like in practice.
From prior discussions, we agreed that it made sense to approach JSON as an
extension type [2]. As noted previously on the thread, I don't think this
precludes having APIs in C++/Python that make the type look the same as a
natively supported type, but there might be constraints we uncover as we
move forward with implementation. I don't think we reached an exact
conclusion on canonical extension types, but [3] was the last conversation.
I think the main question is whether there are maintainers for other
languages who want to add the extension type; I can probably find some time
for Java.
[1] https://datatracker.ietf.org/doc/html/rfc8259#section-8.1
[2] https://lists.apache.org/thread/3nls3222ggnxlrp0s46rxrcmgbyhgn8t (sorry
I still need to document the outcome of this discussion).
[3] https://lists.apache.org/thread/bd0ttt725jqn5ylsp8v006rpfymow3mn
On Sat, Jul 30, 2022 at 12:14 PM Antoine Pitrou <an...@python.org> wrote:
>
> On 30/07/2022 at 01:02, Wes McKinney wrote:
> > I think either path:
> >
> > * Canonical extension type
> > * First-class type in the Type union in Flatbuffers
> >
> > would be OK. The canonical extension type option is the preferable
> > path here, I think, because it allows Arrow implementations without
> > any special handling for JSON to allow the data to pass through as
> > Binary or String. Implementations like C++ could see the extension
> > type metadata and construct an instance of arrow::Type::JSON /
> > JsonArray, etc., but when it gets serialized back to Parquet or Arrow
> > IPC it looks like binary/string (since JSON can be utf-16/utf-32,
> > right?) with additional field metadata.
>
> It would be reasonable to restrict JSON to utf8, and tell people they
> need to transcode in the rare cases where some obnoxious software
> outputs utf16-encoded JSON.
>
> And I agree a canonical extension type would be massively more useful
> for JSON than for UUID (which basically doesn't make sense: a UUID is an
> opaque binary string for all practical purposes).
>
> Regards
>
> Antoine.
>
Re: [ARROW-17255] Logical JSON type in Arrow
Posted by Weston Pace <we...@gmail.com>.
I think, from a compute perspective, one would just cast before doing
anything. So you wouldn't need much beyond parse and unparse. For
example, if you have a JSON document and you want to know the largest
value of $.weather.temperature then you could do...
MAX(STRUCT_FIELD(PARSE_JSON("json_col"), "weather.temperature"))
You could maybe add support for a JSONPath aware parsing mechanism so
then you could do something like...
MAX(PARSE_JSON("json_col", "$.weather.temperature"))
Re: [ARROW-17255] Logical JSON type in Arrow
Posted by Antoine Pitrou <an...@python.org>.
Hi Pradeep,
Thanks for filing this PR!
Before merging this PR, I think we should discuss a bit what a canonical
extension type is, and how it gets standardized. I'll make a separate
discussion thread.
Regards
Antoine.
Re: [ARROW-17255] Logical JSON type in Arrow
Posted by Pradeep Gollakota <pg...@google.com.INVALID>.
Hi all,
I've created a pull request introducing a canonical extension type as
discussed in this thread. https://github.com/apache/arrow/pull/13901
Thanks for all the input!
--
Pradeep
Re: [ARROW-17255] Logical JSON type in Arrow
Posted by Antoine Pitrou <an...@python.org>.
On 03/08/2022 at 16:19, Lee, David wrote:
>
> There are probably two ways to approach this.
>
> Physically store the json as a UTF8 string
>
> Or
>
> Physically store the json as nested lists and structs.
This works if all JSON values follow a predefined schema, which is not
necessarily the case.
In any case, this proposal is about the former approach (store the JSON
as a UTF8 string). Arrow already supports the latter approach if your
use case is amenable to it.
Regards
Antoine.
Re: [ARROW-17255] Logical JSON type in Arrow
Posted by "Lee, David" <Da...@blackrock.com.INVALID>.
There are probably two ways to approach this.
Physically store the json as a UTF8 string
Or
Physically store the json as nested lists and structs. This is more complicated and ideally this method would also support including json schemas to help address missing values and round trip conversions. https://json-schema.org/
Re: [ARROW-17255] Logical JSON type in Arrow
Posted by "Lee, David" <Da...@blackrock.com.INVALID>.
While I do like having a JSON type, adding processing functionality, especially around compute capabilities, might be limiting.
Arrow already supports nested lists and structs, which can cover JSON structures while offering vectorized processing. JSON should only be a logical representation of what Arrow physically supports today.
A bad example is Snowflake's semi-structured data support. They have a Java engine for tabular data and a JavaScript engine for JSON data. The JS engine is a second-class citizen that requires a lot of compute to string-parse JSON data before JSON content can be filtered, sorted, aggregated, etc.
> On Aug 2, 2022, at 11:38 AM, Wes McKinney <we...@gmail.com> wrote:
>> On Tue, Aug 2, 2022 at 12:43 AM Micah Kornfield <em...@gmail.com> wrote:
>>
>>>
>>>> 2. What do we do about different non-utf8 encodings? There does not
>>> appear
>>>> to be a consensus yet on this point. One option is to only allow utf8
>>>> encoding and force implementers to convert non-utf8 to utf8. Second
>>> option
>>>> is to allow all encodings and capture the encoding in the metadata (I'm
>>>> leaning towards this option).
>>
>>
>>> Allowing non-utf8 encodings adds complexity for everyone. Disallowing
>>> them only adds complexity for the tiny minority of producers of non-utf8
>>> JSON.
>>
>>
>> I'd also add that if we only allow extension on utf8 today, it would be a
>> forward/backward compatible change to allow parameterizing the extension
>> for bytes type by encoding if we wanted to support it in the future.
>> Parquet also only supports UTF-8 [1] for its logical JSON type.
>>
>> [1]
>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#json
>>
>>> On Mon, Aug 1, 2022 at 11:39 PM Antoine Pitrou <an...@python.org> wrote:
>>>
>>>
>>> On 01/08/2022 at 22:53, Pradeep Gollakota wrote:
>>>> Thanks for all the great feedback.
>>>>
>>>> To proceed forward, we seem to need decisions around the following:
>>>>
>>>> 1. Whether to use arrow extensions or first class types. The consensus is
>>>> building towards using arrow extensions.
>>>
>>> +1
>>>
>>>> 2. What do we do about different non-utf8 encodings? There does not
>>> appear
>>>> to be a consensus yet on this point. One option is to only allow utf8
>>>> encoding and force implementers to convert non-utf8 to utf8. Second
>>> option
>>>> is to allow all encodings and capture the encoding in the metadata (I'm
>>>> leaning towards this option).
>>>
>>> Allowing non-utf8 encodings adds complexity for everyone. Disallowing
>>> them only adds complexity for the tiny minority of producers of non-utf8
>>> JSON.
>>>
>>>> 3. What do we do about the different formats of JSON (string, BSON,
>>> UBJSON,
>>>> etc.)?
>>>
>>> There are no "different formats of JSON". BSON etc. are unrelated formats.
>>>
>>> Regards
>>>
>>> Antoine.
>>>
This message may contain information that is confidential or privileged. If you are not the intended recipient, please advise the sender immediately and delete this message. See http://www.blackrock.com/corporate/compliance/email-disclaimers for further information. Please refer to http://www.blackrock.com/corporate/compliance/privacy-policy for more information about BlackRock’s Privacy Policy.
For a list of BlackRock's office addresses worldwide, see http://www.blackrock.com/corporate/about-us/contacts-locations.
© 2022 BlackRock, Inc. All rights reserved.
Re: [ARROW-17255] Logical JSON type in Arrow
Posted by Wes McKinney <we...@gmail.com>.
I should add that since Parquet has JSON, BSON, and UUID types (UUID
being just a simple fixed-size binary), having the extension types so
that the metadata flows through accurately to Parquet would be a net
benefit:
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L342
Implementing JSON (and BSON and UUID if we want them) as extension
types and restricting JSON to UTF-8 sounds good to me.
On Tue, Aug 2, 2022 at 12:43 AM Micah Kornfield <em...@gmail.com> wrote:
>
> >
> > > 2. What do we do about different non-utf8 encodings? There does not
> > appear
> > > to be a consensus yet on this point. One option is to only allow utf8
> > > encoding and force implementers to convert non-utf8 to utf8. Second
> > option
> > > is to allow all encodings and capture the encoding in the metadata (I'm
> > > leaning towards this option).
>
>
> Allowing non-utf8 encodings adds complexity for everyone. Disallowing
> > them only adds complexity for the tiny minority of producers of non-utf8
> > JSON.
>
>
> I'd also add that if we only allow extension on utf8 today, it would be a
> forward/backward compatible change to allow parameterizing the extension
> for bytes type by encoding if we wanted to support it in the future.
> Parquet also only supports UTF-8 [1] for its logical JSON type.
>
> [1]
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#json
>
> On Mon, Aug 1, 2022 at 11:39 PM Antoine Pitrou <an...@python.org> wrote:
>
> >
> > Le 01/08/2022 à 22:53, Pradeep Gollakota a écrit :
> > > Thanks for all the great feedback.
> > >
> > > To proceed forward, we seem to need decisions around the following:
> > >
> > > 1. Whether to use arrow extensions or first class types. The consensus is
> > > building towards using arrow extensions.
> >
> > +1
> >
> > > 2. What do we do about different non-utf8 encodings? There does not
> > appear
> > > to be a consensus yet on this point. One option is to only allow utf8
> > > encoding and force implementers to convert non-utf8 to utf8. Second
> > option
> > > is to allow all encodings and capture the encoding in the metadata (I'm
> > > leaning towards this option).
> >
> > Allowing non-utf8 encodings adds complexity for everyone. Disallowing
> > them only adds complexity for the tiny minority of producers of non-utf8
> > JSON.
> >
> > > 3. What do we do about the different formats of JSON (string, BSON,
> > UBJSON,
> > > etc.)?
> >
> > There are no "different formats of JSON". BSON etc. are unrelated formats.
> >
> > Regards
> >
> > Antoine.
> >
Re: [ARROW-17255] Logical JSON type in Arrow
Posted by Micah Kornfield <em...@gmail.com>.
>
> > 2. What do we do about different non-utf8 encodings? There does not
> appear
> > to be a consensus yet on this point. One option is to only allow utf8
> > encoding and force implementers to convert non-utf8 to utf8. Second
> option
> > is to allow all encodings and capture the encoding in the metadata (I'm
> > leaning towards this option).
Allowing non-utf8 encodings adds complexity for everyone. Disallowing
> them only adds complexity for the tiny minority of producers of non-utf8
> JSON.
I'd also add that if we only allow the extension on utf8 today, it would be a
forward/backward-compatible change to later parameterize the extension by
encoding (over a bytes storage type) if we wanted to support that in the future.
Parquet also only supports UTF-8 [1] for its logical JSON type.
[1]
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#json
On Mon, Aug 1, 2022 at 11:39 PM Antoine Pitrou <an...@python.org> wrote:
>
> Le 01/08/2022 à 22:53, Pradeep Gollakota a écrit :
> > Thanks for all the great feedback.
> >
> > To proceed forward, we seem to need decisions around the following:
> >
> > 1. Whether to use arrow extensions or first class types. The consensus is
> > building towards using arrow extensions.
>
> +1
>
> > 2. What do we do about different non-utf8 encodings? There does not
> appear
> > to be a consensus yet on this point. One option is to only allow utf8
> > encoding and force implementers to convert non-utf8 to utf8. Second
> option
> > is to allow all encodings and capture the encoding in the metadata (I'm
> > leaning towards this option).
>
> Allowing non-utf8 encodings adds complexity for everyone. Disallowing
> them only adds complexity for the tiny minority of producers of non-utf8
> JSON.
>
> > 3. What do we do about the different formats of JSON (string, BSON,
> UBJSON,
> > etc.)?
>
> There are no "different formats of JSON". BSON etc. are unrelated formats.
>
> Regards
>
> Antoine.
>
Re: [ARROW-17255] Logical JSON type in Arrow
Posted by Antoine Pitrou <an...@python.org>.
Le 01/08/2022 à 22:53, Pradeep Gollakota a écrit :
> Thanks for all the great feedback.
>
> To proceed forward, we seem to need decisions around the following:
>
> 1. Whether to use arrow extensions or first class types. The consensus is
> building towards using arrow extensions.
+1
> 2. What do we do about different non-utf8 encodings? There does not appear
> to be a consensus yet on this point. One option is to only allow utf8
> encoding and force implementers to convert non-utf8 to utf8. Second option
> is to allow all encodings and capture the encoding in the metadata (I'm
> leaning towards this option).
Allowing non-utf8 encodings adds complexity for everyone. Disallowing
them only adds complexity for the tiny minority of producers of non-utf8
JSON.
> 3. What do we do about the different formats of JSON (string, BSON, UBJSON,
> etc.)?
There are no "different formats of JSON". BSON etc. are unrelated formats.
Regards
Antoine.
Re: [ARROW-17255] Logical JSON type in Arrow
Posted by Pradeep Gollakota <pg...@google.com.INVALID>.
Thanks for all the great feedback.
To proceed forward, we seem to need decisions around the following:
1. Whether to use Arrow extension types or first-class types. The consensus is
building towards using Arrow extension types.
2. What do we do about different non-utf8 encodings? There does not appear
to be a consensus yet on this point. One option is to only allow utf8
encoding and force implementers to convert non-utf8 to utf8. Second option
is to allow all encodings and capture the encoding in the metadata (I'm
leaning towards this option).
3. What do we do about the different formats of JSON (string, BSON, UBJSON,
etc.)? We could capture this in the metadata. If implementers don't
understand the format, they could simply treat it as binary data. I'm not
sure what we can do when we receive something in string format and
need to write to Parquet in BSON. Do we re-encode the data?
4. How do we treat this on the C++ (Java, etc.) side? No special treatment
vs. constructing an instance of arrow::Type::JSON.
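On (2), a minimal sketch of the "only allow utf8" option using just the
Python standard library (the helper name is made up): transcode incoming
JSON bytes to utf-8 at the boundary. The no-BOM detection uses the
null-byte pattern from the older RFC 4627; RFC 8259 simply mandates
UTF-8 for interchange.

```python
import codecs
import json


def json_bytes_to_utf8(data: bytes) -> bytes:
    """Transcode JSON text in utf-8/16/32 (with or without BOM) to utf-8."""
    # Check UTF-32 BOMs before UTF-16, since BOM_UTF32_LE starts with
    # the same bytes as BOM_UTF16_LE.
    for bom, enc in (
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
        (codecs.BOM_UTF8, "utf-8"),
    ):
        if data.startswith(bom):
            return data[len(bom):].decode(enc).encode("utf-8")
    # No BOM: JSON's first character is ASCII, so the placement of null
    # bytes reveals utf-16/32 (RFC 4627 section 3).
    if data[:4].count(0) == 3:
        enc = "utf-32-be" if data[0] == 0 else "utf-32-le"
    elif data[:2].count(0) == 1:
        enc = "utf-16-be" if data[0] == 0 else "utf-16-le"
    else:
        enc = "utf-8"
    return data.decode(enc).encode("utf-8")


utf16 = '{"a": 1}'.encode("utf-16-le")
assert json.loads(json_bytes_to_utf8(utf16)) == {"a": 1}
```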
On Mon, Aug 1, 2022 at 11:51 AM Micah Kornfield <em...@gmail.com>
wrote:
> >
> > It would be reasonable to restrict JSON to utf8, and tell people they
> > need to transcode in the rare cases where some obnoxious software
> > outputs utf16-encoded JSON.
>
> +1 I think this aligns with the latest JSON RFC [1] as well.
>
> Sounds good to me too. +1 on the canonical extension type option; maybe it
> > should end up as a first-class type, but I'd like to see us try it
> without
> > first and see what that tells us about the path for having an extension
> > type get promoted to being a first-class type. This is something that has
> > been discussed in principle before, but I don't know we've worked out
> what
> > it would look like in practice.
>
> From prior discussions, we agreed that it made sense to approach JSON as an
> extension type [2]. As noted previously on the thread, I don't think this
> precludes having API's in C++/Python that make the type look the same as a
> natively supported type, but there might be constraints we uncover as we
> move forward with implementation. I don't think we reached an exact
> conclusion on canonical extension types but [3] was the last conversation.
> I think the main question is if there are maintainers for other languages
> that want to add the extension type, I can probably find some time for
> Java.
>
>
> [1] https://datatracker.ietf.org/doc/html/rfc8259#section-8.1
> [2] https://lists.apache.org/thread/3nls3222ggnxlrp0s46rxrcmgbyhgn8t
> (sorry
> I still need to document the outcome of this discussion).
> [3] https://lists.apache.org/thread/bd0ttt725jqn5ylsp8v006rpfymow3mn
>
> On Sat, Jul 30, 2022 at 12:14 PM Antoine Pitrou <an...@python.org>
> wrote:
>
> >
> > Le 30/07/2022 à 01:02, Wes McKinney a écrit :
> > > I think either path:
> > >
> > > * Canonical extension type
> > > * First-class type in the Type union in Flatbuffers
> > >
> > > would be OK. The canonical extension type option is the preferable
> > > path here, I think, because it allows Arrow implementations without
> > > any special handling for JSON to allow the data to pass through as
> > > Binary or String. Implementations like C++ could see the extension
> > > type metadata and construct an instance of arrow::Type::JSON /
> > > JsonArray, etc., but when it gets serialized back to Parquet or Arrow
> > > IPC it looks like binary/string (since JSON can be utf-16/utf-32,
> > > right?) with additional field metadata.
> >
> > It would be reasonable to restrict JSON to utf8, and tell people they
> > need to transcode in the rare cases where some obnoxious software
> > outputs utf16-encoded JSON.
> >
> > And I agree a canonical extension type would be massively more useful
> > for JSON than for UUID (which basically doesn't make sense: a UUID is an
> > opaque binary string for all practical purposes).
> >
> > Regards
> >
> > Antoine.
> >
>
--
Pradeep