Posted to issues@arrow.apache.org by "Joris Van den Bossche (JIRA)" <ji...@apache.org> on 2019/06/12 06:28:00 UTC

[jira] [Commented] (ARROW-5568) [Python] Allow parsing more general JSON formats

    [ https://issues.apache.org/jira/browse/ARROW-5568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16861790#comment-16861790 ] 

Joris Van den Bossche commented on ARROW-5568:
----------------------------------------------

{quote}I have JSON data where the columnar (line-delimited) part is in a `data` subkey:{quote}

Note that the {{data}} subpart is not line delimited, but a comma-delimited JSON array. So that's a first thing that would be good to support.
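
As a stopgap, the array can be rewritten as line-delimited JSON before handing it to the reader. The following is only a minimal sketch using the standard library; the helper name {{array_to_ndjson}} is hypothetical, and the final step assumes that {{pyarrow.json.read_json}} accepts a file-like object holding line-delimited records.

```python
import json


def array_to_ndjson(records):
    """Serialize a list of JSON objects as newline-delimited JSON bytes.

    Hypothetical helper for illustration; not part of pyarrow.
    """
    # json.dumps without `indent` never emits raw newlines (newlines inside
    # strings are escaped), so each record occupies exactly one line.
    lines = [json.dumps(record) for record in records]
    return ("\n".join(lines) + "\n").encode("utf-8")


# The example document from the issue description.
doc = json.loads("""
{
  "metadata": {"name": "block1"},
  "data": [
    {"a": 1, "b": 2.0, "c": "foo", "d": false},
    {"a": 4, "b": -5.5, "c": null, "d": true}
  ]
}
""")

ndjson = array_to_ndjson(doc["data"])

# Assumption: with pyarrow installed, the bytes could then be parsed with
#   import io
#   from pyarrow import json as pa_json
#   table = pa_json.read_json(io.BytesIO(ndjson))
```

This round-trips the data through Python objects, so it is a workaround rather than a replacement for native support of JSON arrays in the parser.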

Some additional resources that might be useful: pandas supports many JSON formats, called "orients"; see the overview table at http://pandas.pydata.org/pandas-docs/version/0.24/user_guide/io.html#reading-json (disclaimer: I don't know how common the different formats are, so it doesn't necessarily make sense to copy them all from pandas).

One of the formats is the JSON Table Schema (https://frictionlessdata.io/specs/table-schema/), which is a JSON file with {{'metadata'}} and {{'data'}} top-level keys, where {{'data'}} consists of comma-delimited records (so very similar in structure to what [~dhirschfeld] showed above).
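
For documents of this two-part shape, the record-holding keys can be separated from the rest of the document in plain Python. This is a rough sketch only: {{split_document}} and its {{tables}} parameter are hypothetical names chosen to mirror the request in the issue, not an existing pyarrow API.

```python
import json


def split_document(text, tables):
    """Split a JSON document into (rest, records).

    `rest` holds every top-level key not named in `tables`; `records` maps
    each named key to its list of row objects. Hypothetical helper.
    """
    doc = json.loads(text)
    records = {}
    for key in tables:
        value = doc.pop(key)  # raises KeyError if the key is missing
        if not isinstance(value, list):
            raise TypeError(f"expected a JSON array under {key!r}")
        records[key] = value
    return doc, records


text = """
{
  "metadata": {"name": "block1"},
  "data": [
    {"a": 1, "b": 2.0, "c": "foo", "d": false},
    {"a": 4, "b": -5.5, "c": null, "d": true}
  ]
}
"""

rest, records = split_document(text, tables=["data"])

# Assumption: each list in `records` could then be converted to a table,
# for example with pyarrow.Table.from_pylist in recent pyarrow versions.
```

Here {{rest}} is the remaining Python dict (the metadata) and {{records['data']}} is the list of row objects still to be converted.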

> [Python] Allow parsing more general JSON formats
> ------------------------------------------------
>
>                 Key: ARROW-5568
>                 URL: https://issues.apache.org/jira/browse/ARROW-5568
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Dave Hirschfeld
>            Priority: Minor
>
> I have JSON data where the columnar (line-delimited) part is in a `data` subkey:
> {code:java}
> {
>   "metadata": {"name": "block1"},
>   "data" : [
>     {"a": 1, "b": 2.0, "c": "foo", "d": false},
>     {"a": 4, "b": -5.5, "c": null, "d": true}
>   ]
> }
> {code}
>  
>  
> It would be good if the arrow JSON parser could allow specifying where the columnar data is stored.
> Since the `metadata` is also important to me, it would be even better if the rest of the JSON could be returned as a Python dict with only the specified keys parsed as arrow tables - e.g.
>  
> {code:java}
> >>> block1 = json.read_json(fn, tables=['data'])
> >>> block1['data']
> pyarrow.Table
> a: int64
> b: double
> c: string
> d: bool
> >>> block1['metadata']
> {'name': 'block1'}
> >>> block1
> {
>   "metadata": {"name": "block1"},
>   "data" : pyarrow.Table
> }{code}
>  
>  


