Posted to user@arrow.apache.org by Luke <vi...@gmail.com> on 2019/09/24 18:25:59 UTC

arrow encoding of nested dictionary?

This is a simplified example, but I am trying to figure out what gains can
be had by using Arrow instead of plain nested Python dictionaries for
something like the following:

{'random string 1': {'field1': {'field11': 'random string 2',
                                'field12': 100},
                     'field2': 200,
                     'field3': [300,
                                400,
                                {'random string 3': 500}]
                    },
 'random string 4': {'field5': {'field51': 600,
                                'field52 ': [700,
                                            800,
                                            {'random string53': 900,
                                             'random string54': 'random string55'}
                                            ]
                                 }
                     }
}

I didn't see anything that would convert an arbitrary nested dictionary
into some Arrow structure -- did I miss something?  If there isn't, what are
some suggestions?  I am doing pretty heavy data analysis where I am handed
nested Python dictionaries, or nested JSON that I load into a nested Python
dictionary.  The memory footprint of these is large: individual JSON files,
when loaded by json.load, become 5-6 GB Python dictionaries (which is a
little crazy when the actual JSON file is about 700 MB).

curious,
Luke
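
The gap between file size and in-memory size can be checked on a small
sample before committing to a conversion. The following is a rough sketch
using only the standard library. deep_sizeof is a hypothetical helper (not
part of Python or pyarrow), and sys.getsizeof only approximates what the
allocator really uses, so treat the numbers as an estimate.

```python
import json
import sys


def deep_sizeof(obj, seen=None):
    """Roughly estimate the bytes held by obj and everything it contains."""
    if seen is None:
        seen = set()
    if id(obj) in seen:  # count shared objects only once
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple)):
        size += sum(deep_sizeof(item, seen) for item in obj)
    return size


# A fragment of the structure from the post above.
data = {
    "random string 1": {
        "field1": {"field11": "random string 2", "field12": 100},
        "field2": 200,
        "field3": [300, 400, {"random string 3": 500}],
    }
}

json_bytes = len(json.dumps(data).encode("utf-8"))
dict_bytes = deep_sizeof(data)
print("JSON text bytes:", json_bytes)
print("estimated dict bytes:", dict_bytes)
```

Every dict, list, str, and int is a separate heap object with its own
header, which is why the in-memory total tends to be several times the size
of the JSON text.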

Re: arrow encoding of nested dictionary?

Posted by Luke <vi...@gmail.com>.

Thanks, Wes, for the explanation. I was missing the need for the union.

I was pretty amazed at how much more memory the nested Python dict used than
the size of the JSON file on disk, especially given how verbose JSON is.

-Luke


On Tue, Sep 24, 2019 at 11:59 PM Wes McKinney <we...@gmail.com> wrote:

> The Arrow version of a nested structure will use significantly less
> memory than the nested-Python-dictionary version.
>
> We don't have a 100% complete converter from JSON-like data to Arrow
> in-memory -- the main thing that's missing is creation of Unions
> automatically. For example, the array
>
> [700, 800, {'random string53': 900, 'random string54': 'random string55'}]
>
> would need to be a union of an integer and a struct.
>
> Assuming you don't have heterogeneous arrays and the types of values
> don't change from record to record, you can simply pass a list of
> records to pyarrow.array.
>
> - Wes

Re: arrow encoding of nested dictionary?

Posted by Wes McKinney <we...@gmail.com>.

The Arrow version of a nested structure will use significantly less
memory than the nested-Python-dictionary version.

We don't have a 100% complete converter from JSON-like data to Arrow
in-memory -- the main thing that's missing is creation of Unions
automatically. For example, the array

[700, 800, {'random string53': 900, 'random string54': 'random string55'}]

would need to be a union of an integer and a struct.
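
To find out whether a given payload contains such arrays, a small
standard-library check can walk the data and report every list whose
elements are of more than one kind. This is only a sketch: kind and
mixed_lists are hypothetical helpers, not pyarrow functions, and the check
is deliberately coarse (all dicts count as one "struct" kind, and None is
ignored on the assumption that it maps to a null).

```python
def kind(value):
    """Coarse type label: every dict is 'struct', every list is 'list'."""
    if isinstance(value, dict):
        return "struct"
    if isinstance(value, list):
        return "list"
    return type(value).__name__


def mixed_lists(obj, path="$"):
    """Yield (path, kinds) for each list holding more than one kind of value."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from mixed_lists(value, f"{path}[{key!r}]")
    elif isinstance(obj, list):
        kinds = sorted({kind(item) for item in obj if item is not None})
        if len(kinds) > 1:
            yield path, kinds
        for i, item in enumerate(obj):
            yield from mixed_lists(item, f"{path}[{i}]")


sample = [700, 800, {"random string53": 900, "random string54": "random string55"}]
found = list(mixed_lists(sample))
print(found)
```

An empty result suggests the data has no mixed lists of this coarse kind.
It does not check that struct fields keep the same type from record to
record.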

Assuming you don't have heterogeneous arrays and the types of values
don't change from record to record, you can simply pass a list of
records to pyarrow.array.

- Wes
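
A minimal sketch of that suggestion follows. It assumes the outer
dictionary is first reshaped into a list of records, with the outer key
stored in a field named "key"; that field name and the to_records helper
are hypothetical. The data is made up to have the same shape in every
record, unlike the sample in the original post, and the pyarrow import is
guarded so the reshaping step can be tried without pyarrow installed.

```python
def to_records(nested, key_field="key"):
    """Turn {outer_key: {...fields...}} into a list of flat-at-top records."""
    return [{key_field: k, **v} for k, v in nested.items()]


# Made-up data: every record has the same fields with the same types.
nested = {
    "random string 1": {
        "field1": {"field11": "random string 2", "field12": 100},
        "field2": 200,
        "field3": [300, 400],
    },
    "random string 4": {
        "field1": {"field11": "random string 5", "field12": 600},
        "field2": 700,
        "field3": [800],
    },
}

records = to_records(nested)

try:
    import pyarrow as pa
except ImportError:  # pyarrow is optional for this sketch
    pa = None

if pa is not None:
    arr = pa.array(records)  # type is inferred as a struct array
    print(arr.type)
    print("Arrow buffer bytes:", arr.nbytes)
```

Comparing arr.nbytes with an estimate of the Python objects' size gives a
rough idea of the savings on real data.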
