Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2022/08/01 15:51:14 UTC
Re: [ARROW-17255] Logical JSON type in Arrow
>
> It would be reasonable to restrict JSON to utf8, and tell people they
> need to transcode in the rare cases where some obnoxious software
> outputs utf16-encoded JSON.
+1 I think this aligns with the latest JSON RFC [1] as well.
> Sounds good to me too. +1 on the canonical extension type option; maybe it
> should end up as a first-class type, but I'd like to see us try it without
> first and see what that tells us about the path for having an extension
> type get promoted to being a first-class type. This is something that has
> been discussed in principle before, but I don't know we've worked out what
> it would look like in practice.
From prior discussions, we agreed that it made sense to approach JSON as an
extension type [2]. As noted previously on the thread, I don't think this
precludes having APIs in C++/Python that make the type look the same as a
natively supported type, but there might be constraints we uncover as we
move forward with implementation. I don't think we reached an exact
conclusion on canonical extension types, but [3] was the last conversation.
I think the main question is whether there are maintainers for other
languages who want to add the extension type; I can probably find some time
for Java.
[1] https://datatracker.ietf.org/doc/html/rfc8259#section-8.1
[2] https://lists.apache.org/thread/3nls3222ggnxlrp0s46rxrcmgbyhgn8t (sorry
I still need to document the outcome of this discussion).
[3] https://lists.apache.org/thread/bd0ttt725jqn5ylsp8v006rpfymow3mn
On Sat, Jul 30, 2022 at 12:14 PM Antoine Pitrou <an...@python.org> wrote:
>
> On 30/07/2022 at 01:02, Wes McKinney wrote:
> > I think either path:
> >
> > * Canonical extension type
> > * First-class type in the Type union in Flatbuffers
> >
> > would be OK. The canonical extension type option is the preferable
> > path here, I think, because it allows Arrow implementations without
> > any special handling for JSON to allow the data to pass through as
> > Binary or String. Implementations like C++ could see the extension
> > type metadata and construct an instance of arrow::Type::JSON /
> > JsonArray, etc., but when it gets serialized back to Parquet or Arrow
> > IPC it looks like binary/string (since JSON can be utf-16/utf-32,
> > right?) with additional field metadata.
>
> It would be reasonable to restrict JSON to utf8, and tell people they
> need to transcode in the rare cases where some obnoxious software
> outputs utf16-encoded JSON.
>
> And I agree a canonical extension type would be massively more useful
> for JSON than for UUID (which basically doesn't make sense: a UUID is an
> opaque binary string for all practical purposes).
>
> Regards
>
> Antoine.
>
Re: [ARROW-17255] Logical JSON type in Arrow
Posted by Weston Pace <we...@gmail.com>.
I think, from a compute perspective, one would just cast before doing
anything. So you wouldn't need much beyond parse and unparse. For
example, if you have a JSON document and you want to know the largest
value of $.weather.temperature then you could do...
MAX(STRUCT_FIELD(PARSE_JSON("json_col"), "weather.temperature"))
You could maybe add support for a JSONPath aware parsing mechanism so
then you could do something like...
MAX(PARSE_JSON("json_col", "$.weather.temperature"))
Re: [ARROW-17255] Logical JSON type in Arrow
Posted by Antoine Pitrou <an...@python.org>.
Hi Pradeep,
Thanks for filing this PR!
Before merging this PR, I think we should discuss a bit what a canonical
extension type is, and how it gets standardized. I'll make a separate
discussion thread.
Regards
Antoine.
Re: [ARROW-17255] Logical JSON type in Arrow
Posted by Pradeep Gollakota <pg...@google.com.INVALID>.
Hi all,
I've created a pull request introducing a canonical extension type as
discussed in this thread. https://github.com/apache/arrow/pull/13901
Thanks for all the input!
--
Pradeep
Re: [ARROW-17255] Logical JSON type in Arrow
Posted by Antoine Pitrou <an...@python.org>.
On 03/08/2022 at 16:19, Lee, David wrote:
>
> There are probably two ways to approach this.
>
> Physically store the json as a UTF8 string
>
> Or
>
> Physically store the json as nested lists and structs.
This works if all JSON values follow a predefined schema, which is not
necessarily the case.
In any case, this proposal is about the former approach (store the JSON
as a UTF8 string). Arrow already supports the latter approach if your
use case is amenable to it.
Regards
Antoine.
Re: [ARROW-17255] Logical JSON type in Arrow
Posted by "Lee, David" <Da...@blackrock.com.INVALID>.
There are probably two ways to approach this.
Physically store the json as a UTF8 string
Or
Physically store the json as nested lists and structs. This is more complicated and ideally this method would also support including json schemas to help address missing values and round trip conversions. https://json-schema.org/
Re: [ARROW-17255] Logical JSON type in Arrow
Posted by "Lee, David" <Da...@blackrock.com.INVALID>.
While I do like having a JSON type, adding processing functionality, especially around compute capabilities, might be limiting.
Arrow already supports nested lists and structs, which can cover JSON structures while offering vectorized processing. JSON should only be a logical representation of what Arrow physically supports today.
A bad example is Snowflake's semi-structured data support. They have a Java engine for tabular data and a JavaScript engine for JSON data. The JS engine is a second-class citizen that requires a lot of compute to string-parse JSON data before JSON content can be filtered, sorted, aggregated, etc.
> On Aug 2, 2022, at 11:38 AM, Wes McKinney <we...@gmail.com> wrote:
>> On Tue, Aug 2, 2022 at 12:43 AM Micah Kornfield <em...@gmail.com> wrote:
>>
>>>
>>>> 2. What do we do about different non-utf8 encodings? There does not
>>> appear
>>>> to be a consensus yet on this point. One option is to only allow utf8
>>>> encoding and force implementers to convert non-utf8 to utf8. Second
>>> option
>>>> is to allow all encodings and capture the encoding in the metadata (I'm
>>>> leaning towards this option).
>>
>>
>>> Allowing non-utf8 encodings adds complexity for everyone. Disallowing
>>> them only adds complexity for the tiny minority of producers of non-utf8
>>> JSON.
>>
>>
>> I'd also add that if we only allow extension on utf8 today, it would be a
>> forward/backward compatible change to allow parameterizing the extension
>> for bytes type by encoding if we wanted to support it in the future.
>> Parquet also only supports UTF-8 [1] for its logical JSON type.
>>
>> [1]
>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#json
>>
>>> On Mon, Aug 1, 2022 at 11:39 PM Antoine Pitrou <an...@python.org> wrote:
>>>
>>>
>>> On 01/08/2022 at 22:53, Pradeep Gollakota wrote:
>>>> Thanks for all the great feedback.
>>>>
>>>> To proceed forward, we seem to need decisions around the following:
>>>>
>>>> 1. Whether to use arrow extensions or first class types. The consensus is
>>>> building towards using arrow extensions.
>>>
>>> +1
>>>
>>>> 2. What do we do about different non-utf8 encodings? There does not
>>> appear
>>>> to be a consensus yet on this point. One option is to only allow utf8
>>>> encoding and force implementers to convert non-utf8 to utf8. Second
>>> option
>>>> is to allow all encodings and capture the encoding in the metadata (I'm
>>>> leaning towards this option).
>>>
>>> Allowing non-utf8 encodings adds complexity for everyone. Disallowing
>>> them only adds complexity for the tiny minority of producers of non-utf8
>>> JSON.
>>>
>>>> 3. What do we do about the different formats of JSON (string, BSON,
>>> UBJSON,
>>>> etc.)?
>>>
>>> There are no "different formats of JSON". BSON etc. are unrelated formats.
>>>
>>> Regards
>>>
>>> Antoine.
>>>
This message may contain information that is confidential or privileged. If you are not the intended recipient, please advise the sender immediately and delete this message. See http://www.blackrock.com/corporate/compliance/email-disclaimers for further information. Please refer to http://www.blackrock.com/corporate/compliance/privacy-policy for more information about BlackRock’s Privacy Policy.
For a list of BlackRock's office addresses worldwide, see http://www.blackrock.com/corporate/about-us/contacts-locations.
© 2022 BlackRock, Inc. All rights reserved.
Re: [ARROW-17255] Logical JSON type in Arrow
Posted by Wes McKinney <we...@gmail.com>.
I should add that since Parquet has JSON, BSON, and UUID types (UUID
being just a simple fixed-size binary), having the extension types so
that the metadata flows through accurately to Parquet would be a net
benefit:
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L342
Implementing JSON (and BSON and UUID if we want them) as extension
types and restricting JSON to UTF-8 sounds good to me.
On Tue, Aug 2, 2022 at 12:43 AM Micah Kornfield <em...@gmail.com> wrote:
>
> >
> > > 2. What do we do about different non-utf8 encodings? There does not
> > appear
> > > to be a consensus yet on this point. One option is to only allow utf8
> > > encoding and force implementers to convert non-utf8 to utf8. Second
> > option
> > > is to allow all encodings and capture the encoding in the metadata (I'm
> > > leaning towards this option).
>
>
> Allowing non-utf8 encodings adds complexity for everyone. Disallowing
> > them only adds complexity for the tiny minority of producers of non-utf8
> > JSON.
>
>
> I'd also add that if we only allow extension on utf8 today, it would be a
> forward/backward compatible change to allow parameterizing the extension
> for bytes type by encoding if we wanted to support it in the future.
> Parquet also only supports UTF-8 [1] for its logical JSON type.
>
> [1]
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#json
>
> On Mon, Aug 1, 2022 at 11:39 PM Antoine Pitrou <an...@python.org> wrote:
>
> >
> > Le 01/08/2022 à 22:53, Pradeep Gollakota a écrit :
> > > Thanks for all the great feedback.
> > >
> > > To proceed forward, we seem to need decisions around the following:
> > >
> > > 1. Whether to use arrow extensions or first class types. The consensus is
> > > building towards using arrow extensions.
> >
> > +1
> >
> > > 2. What do we do about different non-utf8 encodings? There does not
> > appear
> > > to be a consensus yet on this point. One option is to only allow utf8
> > > encoding and force implementers to convert non-utf8 to utf8. Second
> > option
> > > is to allow all encodings and capture the encoding in the metadata (I'm
> > > leaning towards this option).
> >
> > Allowing non-utf8 encodings adds complexity for everyone. Disallowing
> > them only adds complexity for the tiny minority of producers of non-utf8
> > JSON.
> >
> > > 3. What do we do about the different formats of JSON (string, BSON,
> > UBJSON,
> > > etc.)?
> >
> > There are no "different formats of JSON". BSON etc. are unrelated formats.
> >
> > Regards
> >
> > Antoine.
> >
Re: [ARROW-17255] Logical JSON type in Arrow
Posted by Micah Kornfield <em...@gmail.com>.
>
> > 2. What do we do about different non-utf8 encodings? There does not
> appear
> > to be a consensus yet on this point. One option is to only allow utf8
> > encoding and force implementers to convert non-utf8 to utf8. Second
> option
> > is to allow all encodings and capture the encoding in the metadata (I'm
> > leaning towards this option).
Allowing non-utf8 encodings adds complexity for everyone. Disallowing
> them only adds complexity for the tiny minority of producers of non-utf8
> JSON.
I'd also add that if we only allow the extension on utf8 today, it would be a
forward/backward-compatible change to later parameterize the extension by
encoding (over a bytes storage type) if we wanted to support that in the future.
Parquet also only supports UTF-8 [1] for its logical JSON type.
[1]
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#json
On Mon, Aug 1, 2022 at 11:39 PM Antoine Pitrou <an...@python.org> wrote:
>
> Le 01/08/2022 à 22:53, Pradeep Gollakota a écrit :
> > Thanks for all the great feedback.
> >
> > To proceed forward, we seem to need decisions around the following:
> >
> > 1. Whether to use arrow extensions or first class types. The consensus is
> > building towards using arrow extensions.
>
> +1
>
> > 2. What do we do about different non-utf8 encodings? There does not
> appear
> > to be a consensus yet on this point. One option is to only allow utf8
> > encoding and force implementers to convert non-utf8 to utf8. Second
> option
> > is to allow all encodings and capture the encoding in the metadata (I'm
> > leaning towards this option).
>
> Allowing non-utf8 encodings adds complexity for everyone. Disallowing
> them only adds complexity for the tiny minority of producers of non-utf8
> JSON.
>
> > 3. What do we do about the different formats of JSON (string, BSON,
> UBJSON,
> > etc.)?
>
> There are no "different formats of JSON". BSON etc. are unrelated formats.
>
> Regards
>
> Antoine.
>
Re: [ARROW-17255] Logical JSON type in Arrow
Posted by Antoine Pitrou <an...@python.org>.
Le 01/08/2022 à 22:53, Pradeep Gollakota a écrit :
> Thanks for all the great feedback.
>
> To proceed forward, we seem to need decisions around the following:
>
> 1. Whether to use arrow extensions or first class types. The consensus is
> building towards using arrow extensions.
+1
> 2. What do we do about different non-utf8 encodings? There does not appear
> to be a consensus yet on this point. One option is to only allow utf8
> encoding and force implementers to convert non-utf8 to utf8. Second option
> is to allow all encodings and capture the encoding in the metadata (I'm
> leaning towards this option).
Allowing non-utf8 encodings adds complexity for everyone. Disallowing
them only adds complexity for the tiny minority of producers of non-utf8
JSON.
> 3. What do we do about the different formats of JSON (string, BSON, UBJSON,
> etc.)?
There are no "different formats of JSON". BSON etc. are unrelated formats.
Regards
Antoine.
Re: [ARROW-17255] Logical JSON type in Arrow
Posted by Pradeep Gollakota <pg...@google.com.INVALID>.
Thanks for all the great feedback.
To proceed forward, we seem to need decisions around the following:
1. Whether to use Arrow extension types or first-class types. The consensus is
building towards using Arrow extension types.
2. What do we do about different non-utf8 encodings? There does not appear
to be a consensus yet on this point. One option is to only allow utf8
encoding and force implementers to convert non-utf8 to utf8. Second option
is to allow all encodings and capture the encoding in the metadata (I'm
leaning towards this option).
3. What do we do about the different formats of JSON (string, BSON, UBJSON,
etc.)? We could capture this in the metadata. If implementers don't
understand the format, they could simply treat it as binary data. I'm not
sure what we can do when we receive something in string format and
need to write to Parquet in BSON. Do we re-encode the data?
4. How do we treat this on the C++ (Java, etc.) side? No special treatment
vs. constructing an instance of arrow::Type::JSON.
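On (2), a minimal sketch of the "only allow utf8" option using just the
Python standard library (the helper name is made up): transcode incoming
JSON bytes to utf-8 at the boundary. The no-BOM detection uses the
null-byte pattern from the older RFC 4627; RFC 8259 simply mandates
UTF-8 for interchange.

```python
import codecs
import json


def json_bytes_to_utf8(data: bytes) -> bytes:
    """Transcode JSON text in utf-8/16/32 (with or without BOM) to utf-8."""
    # Check UTF-32 BOMs before UTF-16, since BOM_UTF32_LE starts with
    # the same bytes as BOM_UTF16_LE.
    for bom, enc in (
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
        (codecs.BOM_UTF8, "utf-8"),
    ):
        if data.startswith(bom):
            return data[len(bom):].decode(enc).encode("utf-8")
    # No BOM: JSON's first character is ASCII, so the placement of null
    # bytes reveals utf-16/32 (RFC 4627 section 3).
    if data[:4].count(0) == 3:
        enc = "utf-32-be" if data[0] == 0 else "utf-32-le"
    elif data[:2].count(0) == 1:
        enc = "utf-16-be" if data[0] == 0 else "utf-16-le"
    else:
        enc = "utf-8"
    return data.decode(enc).encode("utf-8")


utf16 = '{"a": 1}'.encode("utf-16-le")
assert json.loads(json_bytes_to_utf8(utf16)) == {"a": 1}
```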
On Mon, Aug 1, 2022 at 11:51 AM Micah Kornfield <em...@gmail.com>
wrote:
> >
> > It would be reasonable to restrict JSON to utf8, and tell people they
> > need to transcode in the rare cases where some obnoxious software
> > outputs utf16-encoded JSON.
>
> +1 I think this aligns with the latest JSON RFC [1] as well.
>
> Sounds good to me too. +1 on the canonical extension type option; maybe it
> > should end up as a first-class type, but I'd like to see us try it
> without
> > first and see what that tells us about the path for having an extension
> > type get promoted to being a first-class type. This is something that has
> > been discussed in principle before, but I don't know we've worked out
> what
> > it would look like in practice.
>
> From prior discussions, we agreed that it made sense to approach JSON as an
> extension type [2]. As noted previously on the thread, I don't think this
> precludes having API's in C++/Python that make the type look the same as a
> natively supported type, but there might be constraints we uncover as we
> move forward with implementation. I don't think we reached an exact
> conclusion on canonical extension types but [3] was the last conversation.
> I think the main question is if there are maintainers for other languages
> that want to add the extension type, I can probably find some time for
> Java.
>
>
> [1] https://datatracker.ietf.org/doc/html/rfc8259#section-8.1
> [2] https://lists.apache.org/thread/3nls3222ggnxlrp0s46rxrcmgbyhgn8t
> (sorry
> I still need to document the outcome of this discussion).
> [3] https://lists.apache.org/thread/bd0ttt725jqn5ylsp8v006rpfymow3mn
>
> On Sat, Jul 30, 2022 at 12:14 PM Antoine Pitrou <an...@python.org>
> wrote:
>
> >
> > Le 30/07/2022 à 01:02, Wes McKinney a écrit :
> > > I think either path:
> > >
> > > * Canonical extension type
> > > * First-class type in the Type union in Flatbuffers
> > >
> > > would be OK. The canonical extension type option is the preferable
> > > path here, I think, because it allows Arrow implementations without
> > > any special handling for JSON to allow the data to pass through as
> > > Binary or String. Implementations like C++ could see the extension
> > > type metadata and construct an instance of arrow::Type::JSON /
> > > JsonArray, etc., but when it gets serialized back to Parquet or Arrow
> > > IPC it looks like binary/string (since JSON can be utf-16/utf-32,
> > > right?) with additional field metadata.
> >
> > It would be reasonable to restrict JSON to utf8, and tell people they
> > need to transcode in the rare cases where some obnoxious software
> > outputs utf16-encoded JSON.
> >
> > And I agree a canonical extension type would be massively more useful
> > for JSON than for UUID (which basically doesn't make sense: a UUID is an
> > opaque binary string for all practical purposes).
> >
> > Regards
> >
> > Antoine.
> >
>
--
Pradeep