Posted to dev@arrow.apache.org by Benjamin MacDonald Schmidt <bm...@gmail.com> on 2020/10/07 17:38:27 UTC

Dictionary key access in python/generally

Hello,

Exciting project, thanks for all your work. I gather it's appropriate to
ask a usage question here? Assuming so:

I have a web application that serves portions of a dataset I've broken into
a few thousand Feather V2 files structured as a quadtree. The structure
makes heavy use of text dictionary types; I'd like each dictionary integer
to map to the same string across all files so that I can ship the data
for each tile straight to the GPU without decoding the text.

If you slice a portion of a pandas categorical array and coerce to an arrow
dictionary, you keep the underlying pandas integer encoding; for example,
the last line here shows a dictionary with four keys even though the table
has just one row.

```
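import pandas as pd
import pyarrow as pa

# Five values, four distinct categories: A, B, C, F.
pandas_cat = pd.Series(["A", "B", "C", "B", "F"], dtype="category")
# A one-row slice still carries all four categories into the Arrow dictionary.
pa.Array.from_pandas(pandas_cat[2:3])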
```

For my purposes, this is good! But of course it's wasteful, too. So I'm
wondering:

1. Whether it's safe to count on the above code continuing to use the
internal pandas keys in the arrow output, or whether at some point it might
redo the pandas encoding in a more efficient way;
2. Whether there's a native pyarrow way to ensure that multiple feather
dictionaries across files use the same integer identifiers for all the keys
that they share.

I can see that the right way here might be to use the IPC streaming format
rather than feather, and send out a single schema for the dataset, with
dictionary batches identifying the keys. But I'm also attaching table
metadata to each feather, which I'd hate to lose.

-- 
Benjamin Schmidt
Director of Digital Humanities and Clinical Associate Professor of History
20 Cooper Square, Room 538
New York University

benschmidt.org

Re: Dictionary key access in python/generally

Posted by Benjamin MacDonald Schmidt <bm...@gmail.com>.

Thank you both. I hadn't read the IPC documentation closely enough to
understand that it supported metadata at the message level. It seems like
the best approach in my case is then probably to flush the dataset to
separate files as a large number of IPC message batches, and send the
schema and the complete version of the dictionary as just one message each.

On Thu, Oct 8, 2020 at 12:28 AM Micah Kornfield <em...@gmail.com>
wrote:

> I can't speak to whether the pandas conversion will ever change; someone else
> can potentially chime in. I don't recall any recent JIRAs changing this
> type of conversion. However, there are currently no hard guarantees of
> backwards compatibility for library functionality (generally we do our
> best not to break things).
>
> > I can see that the right way here might be to use the IPC streaming format
> > rather than feather, and send out a single schema for the dataset, with
> > dictionary batches identifying the keys.
>
>
> Feather V2 should be the same as the Arrow file format, which is different
> from the stream format. There is a direct writer [1] for this as well, so
> if you can construct your Arrow tables directly from the same dictionary,
> that would be the best way of ensuring that any changes to the pandas
> conversion do not impact you.
>
> [1]
>
> https://arrow.apache.org/docs/python/ipc.html#writing-and-reading-random-access-files


Re: Dictionary key access in python/generally

Posted by Micah Kornfield <em...@gmail.com>.

I can't speak to whether the pandas conversion will ever change; someone else
can potentially chime in. I don't recall any recent JIRAs changing this
type of conversion. However, there are currently no hard guarantees of
backwards compatibility for library functionality (generally we do our
best not to break things).

> I can see that the right way here might be to use the IPC streaming format
> rather than feather, and send out a single schema for the dataset, with
> dictionary batches identifying the keys.


Feather V2 should be the same as the Arrow file format, which is different
from the stream format. There is a direct writer [1] for this as well, so
if you can construct your Arrow tables directly from the same dictionary,
that would be the best way of ensuring that any changes to the pandas
conversion do not impact you.

[1]
https://arrow.apache.org/docs/python/ipc.html#writing-and-reading-random-access-files

On Wed, Oct 7, 2020 at 10:44 AM Jacob Quinn <qu...@gmail.com> wrote:

> >
> > But I'm also attaching table
> > metadata to each feather, which I'd hate to lose.
> >
>
> Note the arrow format allows attaching custom metadata at the column
> (field), schema, and message level, so it should be possible to retain any
> metadata this way.
>
> -Jacob

Re: Dictionary key access in python/generally

Posted by Jacob Quinn <qu...@gmail.com>.

>
> But I'm also attaching table
> metadata to each feather, which I'd hate to lose.
>

Note the arrow format allows attaching custom metadata at the column
(field), schema, and message level, so it should be possible to retain any
metadata this way.

-Jacob
