You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/12/19 21:45:24 UTC

[GitHub] [arrow] kflansburg opened a new pull request #8971: ARROW-10987: [Rust] Interpret BinaryArray as JSON

kflansburg opened a new pull request #8971:
URL: https://github.com/apache/arrow/pull/8971


   Create lightweight wrapper, `JSONArray` to interpret `BinaryArray` values as serialized JSON. Leverage recent work for inferring JSON schema to support conversion to `StructArray`.
   
   Example:
   ```rust
   let json_array: JSONArray = binary_array.into();
   let struct_array: StructArray = json_array.try_into().unwrap();
   ```
   
   cc @nevi-me 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8971: ARROW-10987: [Rust] Interpret BinaryArray as JSON

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8971:
URL: https://github.com/apache/arrow/pull/8971#issuecomment-778610316


   
   What is the status of this PR?
   
   As part of trying to clean up the backlog of Rust PRs in this repo, I am going  through seemingly stale PRs and pinging the authors to see if there are any plans to continue the work or conversation.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kflansburg commented on pull request #8971: ARROW-10987: [Rust] Interpret BinaryArray as JSON

Posted by GitBox <gi...@apache.org>.
kflansburg commented on pull request #8971:
URL: https://github.com/apache/arrow/pull/8971#issuecomment-751163846


   @nevi-me, Looking at PyArrow's implementation of extension types, they appear to be creating a new type (`UuidType`) which wraps the underlying array, much like I'm doing here. Based on this, I'm thinking the following would match the spirit of the Extension specification: 
   
   1. Add a default-empty metadata map to `Field`. 
   1. Possibly define `ExtensionType: Array` trait which at the very least outputs the required metadata.
   1. Move `JSONArray` to a new module called `extension` and rename to `JSONType`. 
   1. Ensure that `Fields` for `JSONArrays` have   `ARROW:extension:name=json` and `ARROW:extension:metadata=` (empty). 
   1. Update various Arrow reading code to capture `Field` metadata and produce `JSONType` when appropriate. 
   
   The remaining concern I have here is that there appears to be no other libraries implementing this (JSON), so interoperability seems unlikely. 
   
   Finally, you mention using `StringArray` as the underlying type, however there are a number of use-cases where a `BinaryArray` will be the input, and since `serde_json` supports parsing `&[u8]`, it would be nice to be able to skip the extra utf-8 validation step. Thoughts?
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8971: ARROW-10987: [Rust] Interpret BinaryArray as JSON

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8971:
URL: https://github.com/apache/arrow/pull/8971#issuecomment-749032177


   Hi @kflansburg .
   
   Every array type in the arrow crate has an associated `DataType` that is part of the arrow specification. In particular, they all implement the trait `Array`, which has a method `data_type`, which allow interoperability with other implementations via the c data interface, flight protocol, parquet format, etc.
   
   I can't find a type "JSON" in [the arrow specification](https://github.com/apache/arrow/blob/master/format/Schema.fbs). Could you clarify how this new struct would fit?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8971: ARROW-10987: [Rust] Interpret BinaryArray as JSON

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8971:
URL: https://github.com/apache/arrow/pull/8971#issuecomment-749233717


   Maybe @nevi-me can provide a better insight here, but IMO the conversion `[Large]BinaryArray` to `StructArray` is a kernel with a signature
   
   ```rust
   to_struct<T>(array: GenericBinaryArray<T>, schema: &Schema, error_on_invalid: bool) -> Result<StructArray>
   ```
   
   or something similar.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kflansburg commented on pull request #8971: ARROW-10987: [Rust] Interpret BinaryArray as JSON

Posted by GitBox <gi...@apache.org>.
kflansburg commented on pull request #8971:
URL: https://github.com/apache/arrow/pull/8971#issuecomment-749228883


   Hey @jorgecarleitao ,
   
   Thanks for the info. It sounds, then, like this would be better suited as a method on `BinaryArray`. Would you agree?
   
   FWIW, this is a light wrapper around `BinaryArray`, and implements `Deref`. The underlying memory would be `DataType::Binary`. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #8971: ARROW-10987: [Rust] Interpret BinaryArray as JSON

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #8971:
URL: https://github.com/apache/arrow/pull/8971#issuecomment-748530908


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/8971?src=pr&el=h1) Report
   > Merging [#8971](https://codecov.io/gh/apache/arrow/pull/8971?src=pr&el=desc) (93aa609) into [master](https://codecov.io/gh/apache/arrow/commit/5eb6ce18e110fe6534ab08f158dc0b8a2744e59e?el=desc) (5eb6ce1) will **increase** coverage by `0.02%`.
   > The diff coverage is `97.43%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8971/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8971?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #8971      +/-   ##
   ==========================================
   + Coverage   83.19%   83.21%   +0.02%     
   ==========================================
     Files         199      200       +1     
     Lines       48661    48739      +78     
   ==========================================
   + Hits        40482    40558      +76     
   - Misses       8179     8181       +2     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8971?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/array/array\_json.rs](https://codecov.io/gh/apache/arrow/pull/8971/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfanNvbi5ycw==) | `97.43% <97.43%> (ø)` | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8971?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8971?src=pr&el=footer). Last update [5eb6ce1...93aa609](https://codecov.io/gh/apache/arrow/pull/8971?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io commented on pull request #8971: ARROW-10987: [Rust] Interpret BinaryArray as JSON

Posted by GitBox <gi...@apache.org>.
codecov-io commented on pull request #8971:
URL: https://github.com/apache/arrow/pull/8971#issuecomment-748530908


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/8971?src=pr&el=h1) Report
   > Merging [#8971](https://codecov.io/gh/apache/arrow/pull/8971?src=pr&el=desc) (b6d4cfb) into [master](https://codecov.io/gh/apache/arrow/commit/091df202ceb586b92882f67577ff720664e63eff?el=desc) (091df20) will **decrease** coverage by `0.02%`.
   > The diff coverage is `97.95%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8971/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8971?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #8971      +/-   ##
   ==========================================
   - Coverage   83.22%   83.20%   -0.03%     
   ==========================================
     Files         196      200       +4     
     Lines       48232    48710     +478     
   ==========================================
   + Hits        40142    40530     +388     
   - Misses       8090     8180      +90     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8971?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/array/array\_json.rs](https://codecov.io/gh/apache/arrow/pull/8971/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfanNvbi5ycw==) | `97.95% <97.95%> (ø)` | |
   | [rust/datafusion/src/optimizer/filter\_push\_down.rs](https://codecov.io/gh/apache/arrow/pull/8971/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9vcHRpbWl6ZXIvZmlsdGVyX3B1c2hfZG93bi5ycw==) | `97.65% <0.00%> (-1.74%)` | :arrow_down: |
   | [rust/datafusion/src/logical\_plan/builder.rs](https://codecov.io/gh/apache/arrow/pull/8971/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9sb2dpY2FsX3BsYW4vYnVpbGRlci5ycw==) | `90.10% <0.00%> (-1.00%)` | :arrow_down: |
   | [rust/datafusion/src/physical\_plan/planner.rs](https://codecov.io/gh/apache/arrow/pull/8971/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9waHlzaWNhbF9wbGFuL3BsYW5uZXIucnM=) | `80.45% <0.00%> (-0.10%)` | :arrow_down: |
   | [rust/datafusion/src/datasource/csv.rs](https://codecov.io/gh/apache/arrow/pull/8971/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2Nzdi5ycw==) | `75.00% <0.00%> (ø)` | |
   | [rust/datafusion/src/datasource/empty.rs](https://codecov.io/gh/apache/arrow/pull/8971/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2VtcHR5LnJz) | `52.94% <0.00%> (ø)` | |
   | [rust/datafusion/tests/custom\_sources.rs](https://codecov.io/gh/apache/arrow/pull/8971/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3Rlc3RzL2N1c3RvbV9zb3VyY2VzLnJz) | `75.00% <0.00%> (ø)` | |
   | [rust/datafusion/src/datasource/memory.rs](https://codecov.io/gh/apache/arrow/pull/8971/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL21lbW9yeS5ycw==) | `83.59% <0.00%> (ø)` | |
   | [rust/datafusion/src/datasource/parquet.rs](https://codecov.io/gh/apache/arrow/pull/8971/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL3BhcnF1ZXQucnM=) | `95.62% <0.00%> (ø)` | |
   | [rust/datafusion/src/datasource/datasource.rs](https://codecov.io/gh/apache/arrow/pull/8971/diff?src=pr&el=tree#diff-cnVzdC9kYXRhZnVzaW9uL3NyYy9kYXRhc291cmNlL2RhdGFzb3VyY2UucnM=) | `100.00% <0.00%> (ø)` | |
   | ... and [11 more](https://codecov.io/gh/apache/arrow/pull/8971/diff?src=pr&el=tree-more) | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8971?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8971?src=pr&el=footer). Last update [091df20...b6d4cfb](https://codecov.io/gh/apache/arrow/pull/8971?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kflansburg commented on pull request #8971: ARROW-10987: [Rust] Interpret BinaryArray as JSON

Posted by GitBox <gi...@apache.org>.
kflansburg commented on pull request #8971:
URL: https://github.com/apache/arrow/pull/8971#issuecomment-749236910


   I can see how this would be a kernel, however it should probably be named more specifically since there will an identical type signature for different formats (JSON, proto, etc.). 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kflansburg closed pull request #8971: ARROW-10987: [Rust] Interpret BinaryArray as JSON

Posted by GitBox <gi...@apache.org>.
kflansburg closed pull request #8971:
URL: https://github.com/apache/arrow/pull/8971


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8971: ARROW-10987: [Rust] Interpret BinaryArray as JSON

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8971:
URL: https://github.com/apache/arrow/pull/8971#issuecomment-748530337


   https://issues.apache.org/jira/browse/ARROW-10987


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] kflansburg commented on pull request #8971: ARROW-10987: [Rust] Interpret BinaryArray as JSON

Posted by GitBox <gi...@apache.org>.
kflansburg commented on pull request #8971:
URL: https://github.com/apache/arrow/pull/8971#issuecomment-779414412


   No actionable feedback.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nevi-me commented on pull request #8971: ARROW-10987: [Rust] Interpret BinaryArray as JSON

Posted by GitBox <gi...@apache.org>.
nevi-me commented on pull request #8971:
URL: https://github.com/apache/arrow/pull/8971#issuecomment-749333871


   Hi @kflansburg @jorgecarleitao , I offered the suggestion, so I should have been more explicit on how we could do this in a spec-compliant fashion. I apologise for that.
   
   The spec doesn't have a `JSONArray`, but it has an `ExtensionArray`, which would be one of the logical types that Arrow supports, but with extra metadata that defines that it's a specific type. JSON, UUID are good examples, where the datatype could be StringArray and FixedSizeBinaryArray respectively.
   
   In order to take advantage of the JSON conversion, we could:
   
   - Implement https://issues.apache.org/jira/browse/ARROW-10258, which is the PR for ExtensionArray (I still intend on getitng to this by this release, but help is welcome).
   - Add a function that converts a `StringArray` to `ExtensionArray` with the metadata indicating that it's a JSON array.
   
   We'd have to look at the other language implementations, and also plan out the API so it can cover its intended use-cases adequately.
   
   I personally see the need to be able to convert a `StringArray` that has JSON data, into a `StructArray`. One concrete example is that Postgres and other databases have a `JSON` type, which is externally represented as a `string with valid JSON`.
   Given that adding this functionality wouldn't introduce any dependencies, I think it can/should live in the `arrow` crate.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] codecov-io edited a comment on pull request #8971: ARROW-10987: [Rust] Interpret BinaryArray as JSON

Posted by GitBox <gi...@apache.org>.
codecov-io edited a comment on pull request #8971:
URL: https://github.com/apache/arrow/pull/8971#issuecomment-748530908


   # [Codecov](https://codecov.io/gh/apache/arrow/pull/8971?src=pr&el=h1) Report
   > Merging [#8971](https://codecov.io/gh/apache/arrow/pull/8971?src=pr&el=desc) (45bc767) into [master](https://codecov.io/gh/apache/arrow/commit/5eb6ce18e110fe6534ab08f158dc0b8a2744e59e?el=desc) (5eb6ce1) will **increase** coverage by `0.01%`.
   > The diff coverage is `96.29%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/arrow/pull/8971/graphs/tree.svg?width=650&height=150&src=pr&token=LpTCFbqVT1)](https://codecov.io/gh/apache/arrow/pull/8971?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #8971      +/-   ##
   ==========================================
   + Coverage   83.19%   83.20%   +0.01%     
   ==========================================
     Files         199      200       +1     
     Lines       48661    48715      +54     
   ==========================================
   + Hits        40482    40534      +52     
   - Misses       8179     8181       +2     
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow/pull/8971?src=pr&el=tree) | Coverage Δ | |
   |---|---|---|
   | [rust/arrow/src/array/array\_json.rs](https://codecov.io/gh/apache/arrow/pull/8971/diff?src=pr&el=tree#diff-cnVzdC9hcnJvdy9zcmMvYXJyYXkvYXJyYXlfanNvbi5ycw==) | `96.29% <96.29%> (ø)` | |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow/pull/8971?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow/pull/8971?src=pr&el=footer). Last update [5eb6ce1...45bc767](https://codecov.io/gh/apache/arrow/pull/8971?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org