Posted to user@arrow.apache.org by Joris Peeters <jo...@gmail.com> on 2021/02/25 16:36:26 UTC

pyarrow: write table where columns share the same dictionary

Hello,

I have a pandas DataFrame with many string columns (>30,000), and they
share a low-cardinality set of values (e.g. size 100). I'd like to convert
this to an Arrow table of dictionary encoded columns (let's say int16 for
the index cols), but with just one shared dictionary of strings.
This is to avoid ending up with >30,000 tiny dictionaries on the wire,
which doesn't even load in e.g. Java (due to a stackoverflow error).

Despite my efforts, I haven't really been able to achieve this with the
public APIs I could find. Does anyone have an idea? I'm using pyarrow
3.0.0.

For a mickey mouse example, I'm looking at e.g.

df = pd.DataFrame({'a': ['foo', None, 'bar'], 'b': [None, 'quux', 'foo']})

and would like a Table with dictionary-encoded columns a and b, both
nullable, that both refer to the same dictionary with id=0 (or whatever id)
containing ['foo', 'bar', 'quux'].

Thanks,
-Joris.

Re: pyarrow: write table where columns share the same dictionary

Posted by Joris Peeters <jo...@gmail.com>.
Made https://issues.apache.org/jira/browse/ARROW-11838 to track. If someone
adds me as a Contributor (Joris Peeters / jmgpeeters) I'm happy to assign
it to myself.

-J

On Tue, Mar 2, 2021 at 9:34 AM Antoine Pitrou <an...@python.org> wrote:

>
> Hi Joris,
>
> On Mon, 1 Mar 2021 19:04:08 +0000
> Joris Peeters <jo...@gmail.com> wrote:
> >
> > Given the above,
> > - does it sound sensible to contribute only read for now, or should we
> > aim wider and do write as well?
>
> It does sound reasonable to contribute only read.
>
> > - should this be a new JIRA or fall under
> > https://issues.apache.org/jira/browse/ARROW-5340 (e.g. as a subtask if
> > you use that).
>
> I'd rather have a new JIRA.  ARROW-5340 is really for deduplication on
> the write side.
>
> Regards
>
> Antoine.

Re: pyarrow: write table where columns share the same dictionary

Posted by Antoine Pitrou <an...@python.org>.
Hi Joris,

On Mon, 1 Mar 2021 19:04:08 +0000
Joris Peeters <jo...@gmail.com> wrote:
> 
> Given the above,
> - does it sound sensible to contribute only read for now, or should we aim
> wider and do write as well?

It does sound reasonable to contribute only read.

> - should this be a new JIRA or fall under
> https://issues.apache.org/jira/browse/ARROW-5340 (e.g. as a subtask if you
> use that).

I'd rather have a new JIRA.  ARROW-5340 is really for deduplication on
the write side.

Regards

Antoine.



Fwd: pyarrow: write table where columns share the same dictionary

Posted by Joris Peeters <jo...@gmail.com>.
Hello,

I'd like to try and contribute a fix for being able to *read* (leaving
write for future work, but not too far behind) IPC streams in C++ (and
pyarrow) where multiple columns share the same dictionary. See the below
(originally to user@) for some context. Although the original query talks
only about writing, reading doesn't work either.

I've played around with a local patch that seems adequate - i.e. it can
read IPC streams with shared dicts that were generated in Java, and they
come out as the appropriate categoricals in pandas.

The advantage of supporting only read right now is that it should require
very few changes - and work completely transparently - whereas write is a
bit trickier, the public interfaces currently not being set up for it (I
might be mistaken about this).
For my personal objectives read is also currently sufficient (as I can just
write from Java in production).
The disadvantage is that we'd probably need an arrow/testing/data file for
now to test this, and can't use the roundtrip yet.

Given the above,
- does it sound sensible to contribute only read for now, or should we aim
wider and do write as well?
- should this be a new JIRA or fall under
https://issues.apache.org/jira/browse/ARROW-5340 (e.g. as a subtask if you
use that).

(I expect to find all useful administrative info in
https://github.com/apache/arrow/blob/master/docs/source/developers/contributing.rst
but do let me know if there are other handy resources)

-J

---------- Forwarded message ---------
From: Joris Peeters <jo...@gmail.com>
Date: Fri, Feb 26, 2021 at 10:11 AM
Subject: Re: pyarrow: write table where columns share the same dictionary
To: <us...@arrow.apache.org>


FWIW, in the Java client it's
https://github.com/apache/arrow/blob/apache-arrow-3.0.0/java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowStreamReader.java#L131
that's causing the aforementioned stackoverflow when reading lots of
dictionaries from a stream.
i.e. the recursive construct

    public boolean loadNextBatch() throws IOException {
    ..
      if (..) return true;
      else {
        ..
        return loadNextBatch();
      }
    }

Not sure if that qualifies as a bug, as I think the depth is typically
multiple thousands, but perhaps of interest.


On Thu, Feb 25, 2021 at 8:11 PM Wes McKinney <we...@gmail.com> wrote:

> I'm not sure if it's possible at the moment, but it SHOULD be made
> possible. See ARROW-5340
>
> On Thu, Feb 25, 2021 at 10:36 AM Joris Peeters
> <jo...@gmail.com> wrote:
> >
> > Hello,
> >
> > I have a pandas DataFrame with many string columns (>30,000), and they
> > share a low-cardinality set of values (e.g. size 100). I'd like to convert
> > this to an Arrow table of dictionary encoded columns (let's say int16 for
> > the index cols), but with just one shared dictionary of strings.
> > This is to avoid ending up with >30,000 tiny dictionaries on the wire,
> > which doesn't even load in e.g. Java (due to a stackoverflow error).
> >
> > Despite my efforts, I haven't really been able to achieve this with the
> > public APIs I could find. Does anyone have an idea? I'm using pyarrow
> > 3.0.0.
> >
> > For a mickey mouse example, I'm looking at e.g.
> >
> > df = pd.DataFrame({'a': ['foo', None, 'bar'], 'b': [None, 'quux', 'foo']})
> >
> > and would like a Table with dictionary-encoded columns a and b, both
> > nullable, that both refer to the same dictionary with id=0 (or whatever id)
> > containing ['foo', 'bar', 'quux'].
> >
> > Thanks,
> > -Joris.

Re: pyarrow: write table where columns share the same dictionary

Posted by Joris Peeters <jo...@gmail.com>.
FWIW, in the Java client it's
https://github.com/apache/arrow/blob/apache-arrow-3.0.0/java/vector/src/main/java/org/apache/arrow/vector/ipc/ArrowStreamReader.java#L131
that's causing the aforementioned stackoverflow when reading lots of
dictionaries from a stream.
i.e. the recursive construct

    public boolean loadNextBatch() throws IOException {
    ..
      if (..) return true;
      else {
        ..
        return loadNextBatch();
      }
    }

Not sure if that qualifies as a bug, as I think the depth is typically
multiple thousands, but perhaps of interest.
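
To make the failure mode concrete, here is a small sketch (in Python rather
than Java, and not the Arrow code itself) of the same shape: a reader that
recurses once per non-batch message, next to a loop that does the same work.
It assumes, as the message above suggests, that every dictionary message
before the first record batch costs one extra stack frame. The function names
and the list-of-strings "stream" are hypothetical.

```python
import sys


def load_next_batch_recursive(messages, pos=0):
    """Recursive shape: one extra stack frame per skipped message."""
    if pos >= len(messages):
        return False
    if messages[pos] == "record_batch":
        return True
    # A dictionary message was consumed; recurse to look at the next one.
    return load_next_batch_recursive(messages, pos + 1)


def load_next_batch_iterative(messages):
    """Same logic as a loop: stack depth stays constant."""
    for kind in messages:
        if kind == "record_batch":
            return True
    return False


# More dictionary messages than the interpreter's recursion limit allows.
n_dictionaries = sys.getrecursionlimit() + 1_000
messages = ["dictionary"] * n_dictionaries + ["record_batch"]

try:
    load_next_batch_recursive(messages)
    recursive_ok = True
except RecursionError:
    recursive_ok = False

iterative_ok = load_next_batch_iterative(messages)
```

With enough leading dictionary messages the recursive version exhausts the
stack, while the loop finds the record batch regardless of how many
dictionaries precede it.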


On Thu, Feb 25, 2021 at 8:11 PM Wes McKinney <we...@gmail.com> wrote:

> I'm not sure if it's possible at the moment, but it SHOULD be made
> possible. See ARROW-5340
>
> On Thu, Feb 25, 2021 at 10:36 AM Joris Peeters
> <jo...@gmail.com> wrote:
> >
> > Hello,
> >
> > I have a pandas DataFrame with many string columns (>30,000), and they
> > share a low-cardinality set of values (e.g. size 100). I'd like to convert
> > this to an Arrow table of dictionary encoded columns (let's say int16 for
> > the index cols), but with just one shared dictionary of strings.
> > This is to avoid ending up with >30,000 tiny dictionaries on the wire,
> > which doesn't even load in e.g. Java (due to a stackoverflow error).
> >
> > Despite my efforts, I haven't really been able to achieve this with the
> > public APIs I could find. Does anyone have an idea? I'm using pyarrow
> > 3.0.0.
> >
> > For a mickey mouse example, I'm looking at e.g.
> >
> > df = pd.DataFrame({'a': ['foo', None, 'bar'], 'b': [None, 'quux', 'foo']})
> >
> > and would like a Table with dictionary-encoded columns a and b, both
> > nullable, that both refer to the same dictionary with id=0 (or whatever id)
> > containing ['foo', 'bar', 'quux'].
> >
> > Thanks,
> > -Joris.

Re: pyarrow: write table where columns share the same dictionary

Posted by Wes McKinney <we...@gmail.com>.
I'm not sure if it's possible at the moment, but it SHOULD be made
possible. See ARROW-5340

On Thu, Feb 25, 2021 at 10:36 AM Joris Peeters
<jo...@gmail.com> wrote:
>
> Hello,
>
> I have a pandas DataFrame with many string columns (>30,000), and they share a low-cardinality set of values (e.g. size 100). I'd like to convert this to an Arrow table of dictionary encoded columns (let's say int16 for the index cols), but with just one shared dictionary of strings.
> This is to avoid ending up with >30,000 tiny dictionaries on the wire, which doesn't even load in e.g. Java (due to a stackoverflow error).
>
> Despite my efforts, I haven't really been able to achieve this with the public APIs I could find. Does anyone have an idea? I'm using pyarrow 3.0.0.
>
> For a mickey mouse example, I'm looking at e.g.
>
> df = pd.DataFrame({'a': ['foo', None, 'bar'], 'b': [None, 'quux', 'foo']})
>
> and would like a Table with dictionary-encoded columns a and b, both nullable, that both refer to the same dictionary with id=0 (or whatever id) containing ['foo', 'bar', 'quux'].
>
> Thanks,
> -Joris.