Posted to user@arrow.apache.org by Luke <vi...@gmail.com> on 2019/09/24 18:25:59 UTC
arrow encoding of nested dictionary?
This is a simplified example, but I am trying to figure out what gains can be
had using Arrow versus straight nested Python dictionaries for something like
the following:
{'random string 1': {'field1': {'field11': 'random string 2',
                                'field12': 100},
                     'field2': 200,
                     'field3': [300,
                                400,
                                {'random string 3': 500}]
                     },
 'random string 4': {'field5': {'field51': 600,
                                'field52 ': [700,
                                             800,
                                             {'random string53': 900,
                                              'random string54': 'random string55'}
                                             ]
                                }
                     }
}
I didn't see anything that would convert an arbitrary nested dictionary
into some Arrow structure -- did I miss something? If there isn't, what are
some suggestions? I am doing pretty heavy data analysis where I am handed
nested Python dictionaries, or nested JSON that I load into a nested Python
dictionary. The memory footprint of these is large: I have individual JSON
files that, when loaded by json.load, become a 5-6 GB Python dictionary
(which is a little crazy when the actual JSON file is around 700 MB).
curious,
Luke
Re: arrow encoding of nested dictionary?
Posted by Luke <vi...@gmail.com>.
Thanks, Wes, for the explanation; I was missing the need for the union.
I was pretty amazed at how much more memory the nested Python dict used than
the size of the JSON file on disk, especially given how verbose JSON is.
-Luke
On Tue, Sep 24, 2019 at 11:59 PM Wes McKinney <we...@gmail.com> wrote:
> The Arrow version of a nested structure will use significantly less
> memory than the nested-Python-dictionary version.
>
> We don't have a 100% complete converter from JSON-like data to Arrow
> in-memory -- the main thing that's missing is creation of Unions
> automatically. For example, the array
>
> [700, 800, {'random string53': 900, 'random string54': 'random string55'}]
>
> would need to be a union of an integer and a struct.
>
> Assuming you don't have heterogeneous arrays and the types of values
> don't change from record to record, you can simply pass a list of
> records to pyarrow.array
>
> - Wes
Re: arrow encoding of nested dictionary?
Posted by Wes McKinney <we...@gmail.com>.
The Arrow version of a nested structure will use significantly less
memory than the nested-Python-dictionary version.
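A rough way to see the Python-side overhead is to add up the size of every object in a nested structure and compare it with the length of the JSON text. The sketch below uses only the standard library; `deep_sizeof` and the sample `records` are hypothetical, made up for illustration, and `sys.getsizeof` reports CPython-specific, approximate numbers, so treat the result as an estimate rather than a measurement.

```python
import json
import sys


def deep_sizeof(obj, seen=None):
    """Approximate total bytes used by obj and everything it references."""
    if seen is None:
        seen = set()
    if id(obj) in seen:  # count shared objects (e.g. interned keys) only once
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        for key, value in obj.items():
            size += deep_sizeof(key, seen) + deep_sizeof(value, seen)
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            size += deep_sizeof(item, seen)
    return size


# Hypothetical sample data: many small records that share the same keys.
records = [
    {"field1": {"field11": "s%d" % i, "field12": i}, "field2": i * 2}
    for i in range(1000)
]

json_bytes = len(json.dumps(records).encode("utf-8"))
python_bytes = deep_sizeof(records)
print("JSON text bytes:     ", json_bytes)
print("Python object bytes: ", python_bytes)
```

Each record pays for its own dict objects and boxed values, which is where most of the gap comes from; the exact ratio depends on the Python version and the shape of the data.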
We don't have a 100% complete converter from JSON-like data to Arrow
in-memory -- the main thing that's missing is creation of Unions
automatically. For example, the array
[700, 800, {'random string53': 900, 'random string54': 'random string55'}]
would need to be a union of an integer and a struct.
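To find out whether a given document contains arrays like this before attempting a conversion, one option is to walk the data and report every list whose elements are of more than one kind. This is a minimal standard-library sketch, not part of pyarrow; `kind` and `find_mixed_lists` are hypothetical helper names, and the check is deliberately strict (for example, it also flags a list mixing int and float, which a converter may be able to handle).

```python
def kind(value):
    """Coarse type label for a JSON-like value."""
    if isinstance(value, dict):
        return "struct"
    if isinstance(value, list):
        return "list"
    return type(value).__name__


def find_mixed_lists(value, path="$"):
    """Yield (path, kinds) for every list whose non-null elements mix kinds."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from find_mixed_lists(child, f"{path}[{key!r}]")
    elif isinstance(value, list):
        # None is skipped because a null does not force a second type.
        kinds = sorted({kind(item) for item in value if item is not None})
        if len(kinds) > 1:
            yield path, kinds
        for index, item in enumerate(value):
            yield from find_mixed_lists(item, f"{path}[{index}]")


# The list from the example above, plus a homogeneous one for contrast.
mixed = [700, 800, {"random string53": 900, "random string54": "random string55"}]
plain = [300, 400]

print(list(find_mixed_lists(mixed)))  # [('$', ['int', 'struct'])]
print(list(find_mixed_lists(plain)))  # []
```

The sketch only covers heterogeneous arrays; it does not check whether the type of a given field changes from one record to the next.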
Assuming you don't have heterogeneous arrays and the types of values
don't change from record to record, you can simply pass a list of
records to pyarrow.array.
- Wes