Posted to dev@arrow.apache.org by Joris Van den Bossche <jo...@gmail.com> on 2021/02/01 08:19:41 UTC

Re: Pandas Block Manager

I am also planning to actively work on this on the pandas side in the
coming month. Having early feedback on this work will be valuable.

Joris

On Thu, 28 Jan 2021 at 18:51, Wes McKinney <we...@gmail.com> wrote:

> My position on this is that we should work with the pandas community
> to work toward elimination of the BlockManager data structure as this
> will solve a multitude of problems and also make things better for
> Arrow. I am not supportive of the IPC format changes in the PR.
>
> On Wed, Jan 27, 2021 at 6:27 PM Nicholas White <n....@gmail.com>
> wrote:
> >
> > Hi all - just pinging this thread given the later discussions on the PR
> > <https://github.com/apache/arrow/pull/8644>. I am proposing a backwards
> > (but not forwards) compatible change to the spec to strike out the line
> > "When serializing Arrow data for interprocess communication, these
> > alignment and padding requirements are enforced", and I want to gauge
> > the general reaction to this. The tradeoff is roughly ecosystem
> > fragmentation vs. improved performance for a single, albeit common,
> > workflow.
> >
> > @Micah the fixed size list would lose metadata like column headers - and
> > my motivating example is saving the IPC-format Arrow files to disk. If
> > you have to add your own solution to keep track of the metadata
> > separately to the Arrow file then you don't really get anything from
> > using Arrow and might as well use, e.g., `np.save` on each dtype in the
> > blockmanager and keep the metadata in a JSON file.
> >
> > @Joris agree that a non-consolidating block manager is a good solution,
> > and am following your progress on 39146
> > <https://github.com/pandas-dev/pandas/issues/39146> etc! However, I see
> > you've discussed performance a lot on the mailing list (archive
> > <https://mail.python.org/pipermail/pandas-dev/2020-May/001244.html>) -
> > if an array manager got you faster data loading but slower analysis,
> > it'd be a poor trade-off against the block manager's slower data loading
> > and faster analysis. Your notebook
> > <https://gist.github.com/jorisvandenbossche/b8ae071ab7823f7547567b1ab9d4c20c>
> > doesn't have any group-by-and-aggregate timings - do you have any yet
> > that you think are representative?
> >
> > Thanks -
> >
> > Nick
> >
> >
> > On Fri, 13 Nov 2020 at 10:49, Joris Van den Bossche <
> > jorisvandenbossche@gmail.com> wrote:
> >
> > > As Micah and Wes pointed out on the PR, this alignment/padding is a
> > > requirement of the format specification. For reference, see here:
> > > https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding
> > > That's also the reason that I said earlier in this thread that the
> > > zero-copy conversion to pandas you want to achieve is "basically
> > > impossible", as far as I know (when not using the split_blocks option).
> > >
> > > The suggestion of Micah to use a fixed size list column for each set
> > > of columns with the same data type could work to achieve this. But
> > > then of course you need to construct the Arrow table specifically for
> > > this (and you no longer have a normal table with columns), and will
> > > have to write a custom arrow->pandas conversion to get the buffer of
> > > the list column, view it as a 2D numpy array, and put this in a pandas
> > > Block.
> > >
> > > I know it is not a satisfactory solution *right now*, but longer term,
> > > I think the best bet for getting this zero-copy conversion (which I
> > > would also like to see!) is the non-consolidating manager that stores
> > > 1D arrays under the hood in pandas, which I mentioned earlier in the
> > > thread.
> > >
> > > Joris
> > >
> > > On Fri, 13 Nov 2020 at 00:21, Micah Kornfield <em...@gmail.com>
> > > wrote:
> > > >
> > > > Hi Nicholas,
> > > > I don't think allowing for flexibility of non-8-byte-aligned types
> > > > is a good idea.  The specification explicitly calls out the
> > > > alignment requirements, and allowing writers to output non-aligned
> > > > values potentially breaks other implementations.
> > > >
> > > > I'm not sure of your exact use-case, but another approach to
> > > > consider is to store the values in a single Arrow column as either
> > > > a list or a fixed size list, and look into doing zero copy from
> > > > that to the corresponding pandas memory (this is hypothetical;
> > > > again, I don't have enough context on pandas/numpy memory layouts).
> > > >
> > > > -Micah
> > > >
> > > > > On Thu, Nov 12, 2020 at 3:01 PM Nicholas White <n....@gmail.com>
> > > > > wrote:
> > > > >
> > > > > OK, got everything to work: https://github.com/apache/arrow/pull/8644
> > > > > (part of ARROW-10573 now) is ready for review. I've updated the
> > > > > test case to show it is possible to zero-copy a pandas DataFrame!
> > > > > The next step is to dig into `arrow_to_pandas.cc` to make it work
> > > > > automagically...
> > > > >
> > > > > On Wed, 11 Nov 2020 at 22:52, Nicholas White <n....@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> Thanks all, this has been interesting. I've made a patch that
> > > > >> sort-of does what I want [1] - I hope the test case is clear! I
> > > > >> made the batch writer use the `alignment` field that was already
> > > > >> in the `IpcWriteOptions` to align the buffers, instead of fixing
> > > > >> their alignment at 8. Arrow then writes out the buffers
> > > > >> consecutively, so you can map them as a 2D memory array like I
> > > > >> wanted. There's one problem though... the test case thinks the
> > > > >> arrow data is invalid, as it can't read the metadata properly
> > > > >> (error below). Do you have any idea why? I think it's because
> > > > >> Arrow puts the metadata at the end of the file after the
> > > > >> now-unaligned buffers, yet assumes the metadata is still 8-byte
> > > > >> aligned (which it probably no longer is).
> > > > >>
> > > > >> Nick
> > > > >>
> > > > >> ````
> > > > >> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> > > > >> pyarrow/ipc.pxi:494: in pyarrow.lib.RecordBatchReader.read_all
> > > > >>     check_status(self.reader.get().ReadAll(&table))
> > > > >> _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> > > > >>
> > > > >> >   raise ArrowInvalid(message)
> > > > >> E   pyarrow.lib.ArrowInvalid: Expected to read 117703432 metadata
> > > > >> bytes, but only read 19
> > > > >> ````
> > > > >>
> > > > >> [1] https://github.com/apache/arrow/pull/8644
> > > > >>
> > > > >>
> > >
>