Posted to dev@arrow.apache.org by Eric Erhardt <Er...@microsoft.com.INVALID> on 2019/08/12 15:28:18 UTC

RE: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries

Hey Wes,

I just wanted to check in on this work. Have there been any updates to the Arrow "data frame" project worth sharing?

Thanks,
Eric

-----Original Message-----
From: Wes McKinney <we...@gmail.com> 
Sent: Tuesday, May 21, 2019 8:17 AM
To: dev@arrow.apache.org
Subject: Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries

On Tue, May 21, 2019, 8:43 AM Antoine Pitrou <an...@python.org> wrote:

>
> Le 21/05/2019 à 13:42, Wes McKinney a écrit :
> > hi Antoine,
> >
> > On Tue, May 21, 2019 at 5:48 AM Antoine Pitrou <an...@python.org> wrote:
> >>
> >>
> >> Hi Wes,
> >>
> >> How does copy-on-write play together with memory-mapped data?  It 
> >> seems that, depending on whether the memory map has several 
> >> concurrent users (a condition which may be timing-dependent), we 
> >> will either persist changes on disk or make them ephemeral in 
> >> memory.  That doesn't sound very user-friendly, IMHO.
> >
> > With memory-mapping, any Buffer is sliced from the parent MemoryMap 
> > [1] so mutating the data on disk using this interface wouldn't be 
> > possible with the way that I've framed it.
>
> Hmm... I always forget that SliceBuffer returns a read-only view.
>

The more important issue is that parent_ is non-null. The idea is that no mutation is allowed if we reason that another Buffer object has access to the address space of interest. I think this style of copy-on-write is a reasonable compromise that prevents most kinds of defensive copying.
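
To make the rule above concrete, here is a minimal sketch of that copy-on-write check. It is only an illustration: ToyBuffer and EnsureWritable are hypothetical names and not part of the Arrow API, and the sketch assumes a slice keeps its parent alive through a shared_ptr, so a non-null parent_ or any extra owner means some other object can see the same memory.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical toy type for illustration only; this is NOT arrow::Buffer.
class ToyBuffer {
 public:
  // Root buffer: owns its bytes and has no parent.
  explicit ToyBuffer(std::vector<uint8_t> bytes)
      : owned_(std::move(bytes)), data_(owned_.data()), size_(owned_.size()) {}

  // Slice: a view into the parent's memory that keeps the parent alive.
  // The caller must ensure offset + length fits inside the parent.
  ToyBuffer(std::shared_ptr<ToyBuffer> parent, std::size_t offset, std::size_t length)
      : parent_(std::move(parent)), data_(parent_->data_ + offset), size_(length) {}

  // data_ points into owned_, so copying the object would be unsafe.
  ToyBuffer(const ToyBuffer&) = delete;
  ToyBuffer& operator=(const ToyBuffer&) = delete;

  bool has_parent() const { return parent_ != nullptr; }
  const uint8_t* data() const { return data_; }
  uint8_t* mutable_data() { return data_; }
  std::size_t size() const { return size_; }

 private:
  std::vector<uint8_t> owned_;         // empty for slices
  std::shared_ptr<ToyBuffer> parent_;  // null for root buffers
  uint8_t* data_;
  std::size_t size_;
};

// Returns a buffer that is safe to mutate. The input is reused only when it
// has no parent and the caller holds the sole reference; otherwise the bytes
// are copied into a fresh root buffer. Pass with std::move to hand over
// ownership, since the by-value parameter itself counts as an owner.
// Note: use_count() is only a reliable test when no other thread is
// concurrently copying or releasing references to the same buffer.
std::shared_ptr<ToyBuffer> EnsureWritable(std::shared_ptr<ToyBuffer> buf) {
  if (buf.use_count() == 1 && !buf->has_parent()) {
    return buf;  // nobody else can observe this memory: mutate in place
  }
  return std::make_shared<ToyBuffer>(
      std::vector<uint8_t>(buf->data(), buf->data() + buf->size()));
}
```

Under these assumptions a slice of a memory map always takes the copying branch, so writes land in private memory rather than on disk, while a buffer with a single owner is mutated in place with no defensive copy.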


> Regards
>
> Antoine.
>

Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries

Posted by Wes McKinney <we...@gmail.com>.
hi Eric -- there have not been any patches yet related to it. I'm
currently in the midst of some internal restructuring of the Parquet
C++ library to address long-standing efficiency and memory use issues.
It's my intention to spend time on the data frame project as one of my
next focus areas, likely to be after Labor Day.

- Wes