Posted to dev@parquet.apache.org by Wes McKinney <we...@gmail.com> on 2016/12/23 22:17:21 UTC

IO and memory management in parquet-cpp and Arrow (C++)

hi folks,

Spurred by the discussion and bugfix for PARQUET-799, I'd like to do
something about the IO interfaces that we currently have implemented
in parquet-cpp.

For C++ at least, the Parquet project is not an ideal place to be
maintaining cross-platform IO and memory management. There are
portability and concurrent access issues we will eventually need to
deal with to make parquet-cpp work well in diverse production
environments.

In parallel, we've been developing a general, low-overhead IO
subsystem inside Apache Arrow:

https://github.com/apache/arrow/tree/master/cpp/src/arrow/io

Since Arrow is about in-memory columnar data structures and efficient
IO / RPC / IPC, this is a much more appropriate place to maintain such
code (in the absence of a sort of "Apache C++ Commons" library).
There, we currently have more mature implementations of:

- Operating system files (which also work on Windows)
- Memory mapped files
- HDFS (using either libhdfs or libhdfs3, of your choosing)
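
To illustrate why several backends can sit behind one interface, here is
a minimal sketch. All names in it (RandomAccessSource, InMemorySource,
EndsWithParquetMagic) are hypothetical and are not the actual arrow::io
classes; the in-memory backend merely stands in for an OS file, a memory
map, or an HDFS file. The only format fact it relies on is that a Parquet
file ends with the 4-byte magic "PAR1".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical interface for illustration; not the arrow::io API.
class RandomAccessSource {
 public:
  virtual ~RandomAccessSource() = default;

  // Total number of bytes available.
  virtual int64_t Size() const = 0;

  // Copies up to nbytes starting at position into out and returns the
  // number of bytes actually copied (0 if position is out of range).
  virtual int64_t ReadAt(int64_t position, int64_t nbytes,
                         uint8_t* out) const = 0;
};

// In-memory backend. A real library would add backends for OS files,
// memory maps, HDFS, and so on behind the same interface.
class InMemorySource : public RandomAccessSource {
 public:
  explicit InMemorySource(std::string contents)
      : contents_(std::move(contents)) {}

  int64_t Size() const override {
    return static_cast<int64_t>(contents_.size());
  }

  int64_t ReadAt(int64_t position, int64_t nbytes,
                 uint8_t* out) const override {
    if (position < 0 || position >= Size() || nbytes <= 0) return 0;
    const int64_t n = std::min(nbytes, Size() - position);
    std::memcpy(out, contents_.data() + position,
                static_cast<std::size_t>(n));
    return n;
  }

 private:
  std::string contents_;
};

// Format-level code depends only on the interface, never on the backend.
bool EndsWithParquetMagic(const RandomAccessSource& source) {
  if (source.Size() < 4) return false;
  uint8_t tail[4];
  if (source.ReadAt(source.Size() - 4, 4, tail) != 4) return false;
  return std::memcmp(tail, "PAR1", 4) == 0;
}
```

With this split, the encoding/decoding library only needs the abstract
interface, and the portability work lives in one place.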

Additionally, the "Buffer" abstraction (which handles memory lifetime
and provides a general-purpose way to pass around a block of memory
which may or may not be owned by the application) is implemented in
both Parquet [1] and Arrow [2].
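
As a rough sketch of the idea (not the code in [1] or [2]; the class and
function names here are made up for illustration), a buffer can be a
plain pointer-plus-size view, with a subclass that additionally owns its
storage. Consumers hold a shared_ptr to the base class and do not need to
know which case they have.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical sketch; not the actual parquet::Buffer or arrow::Buffer.
// The base class is a read-only view of a contiguous block of bytes and
// does NOT own the memory it points to.
class Buffer {
 public:
  Buffer(const uint8_t* data, int64_t size) : data_(data), size_(size) {}
  virtual ~Buffer() = default;

  const uint8_t* data() const { return data_; }
  int64_t size() const { return size_; }

 protected:
  const uint8_t* data_;
  int64_t size_;
};

// A buffer that owns its storage. The bytes are released when the last
// shared_ptr referring to this object goes away.
class OwnedBuffer : public Buffer {
 public:
  explicit OwnedBuffer(std::vector<uint8_t> bytes)
      : Buffer(nullptr, 0), storage_(std::move(bytes)) {
    data_ = storage_.data();
    size_ = static_cast<int64_t>(storage_.size());
  }

 private:
  std::vector<uint8_t> storage_;
};

// A consumer that works the same way on owned and borrowed memory.
int64_t SumBytes(const std::shared_ptr<Buffer>& buf) {
  int64_t total = 0;
  for (int64_t i = 0; i < buf->size(); ++i) total += buf->data()[i];
  return total;
}
```

For example, a caller could wrap memory it already manages in a plain
Buffer, while a reader could return an OwnedBuffer; both can be passed to
SumBytes unchanged.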

Since, fundamentally, parquet-cpp is a library for encoding and
decoding the Parquet file format rather than general purpose IO /
file-like interfaces, I propose that we excise this code from the
library and make Arrow a hard dependency in libparquet. I believe our
respective developer communities would benefit from a hardening of the
IO and memory interfaces that are being developed in Arrow, and it
will lead to better quality software and reduced fragmentation.

I wanted to bring this up as we are on the cusp of making the first
ASF release of parquet-cpp, and while this work might not make the cut
for 0.1, if we agree it's a good idea it would be good to do it sooner
rather than later.

Thanks and happy holidays / best wishes for 2017,
Wes

[1]: https://github.com/apache/parquet-cpp/blob/master/src/parquet/util/buffer.h
[2]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/buffer.h

Re: IO and memory management in parquet-cpp and Arrow (C++)

Posted by Deepak Majeti <ma...@gmail.com>.
Wes,

This makes sense. Thank you.

-- 
regards,
Deepak Majeti

Re: IO and memory management in parquet-cpp and Arrow (C++)

Posted by Wes McKinney <we...@gmail.com>.
hi Deepak,

On Wed, Dec 28, 2016 at 9:24 AM, Deepak Majeti <ma...@gmail.com> wrote:
> Wes,
>
> I too agree that having a single general purpose I/O implementation makes
> sense moving forward.
> We can have a separate build option to build only the parquet-cpp library
> for those who want to encode/decode just the parquet format.

The changes I'm describing are more invasive because the IO code is also
coupled to parquet-cpp's memory management, namely parquet::Buffer
(https://github.com/apache/parquet-cpp/blob/master/src/parquet/util/buffer.h)
and parquet::MemoryAllocator
(https://github.com/apache/parquet-cpp/blob/master/src/parquet/util/mem-allocator.h).

One of the benefits of consolidating the memory interfaces is that we
can provide zero-copy data handoffs from the decoders to Arrow
consumers (or other applications that handle the buffers).
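
A minimal sketch of what a zero-copy handoff could look like, assuming a
shared buffer type with reference-counted ownership. The names ByteBlock,
BufferView, and DecodePayload are hypothetical and are not part of
parquet-cpp or Arrow; the point is only that the consumer receives a view
that keeps the producer's memory alive instead of receiving a copy.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not the arrow::Buffer API.
struct ByteBlock {
  std::vector<uint8_t> bytes;  // storage allocated by the "decoder"
};

// A view of part of a ByteBlock. It holds a shared_ptr to the parent, so
// the underlying memory stays alive for as long as any view exists.
class BufferView {
 public:
  BufferView(std::shared_ptr<const ByteBlock> parent, int64_t offset,
             int64_t length)
      : parent_(std::move(parent)), offset_(offset), length_(length) {
    assert(offset_ >= 0 && length_ >= 0);
    assert(offset_ + length_ <=
           static_cast<int64_t>(parent_->bytes.size()));
  }

  const uint8_t* data() const { return parent_->bytes.data() + offset_; }
  int64_t size() const { return length_; }

 private:
  std::shared_ptr<const ByteBlock> parent_;
  int64_t offset_;
  int64_t length_;
};

// Stand-in for a decoder: it produces a block and hands back a view of
// the payload that follows a 2-byte header, without copying the payload.
BufferView DecodePayload() {
  auto block = std::make_shared<ByteBlock>();
  block->bytes = {0xAA, 0xBB, 10, 20, 30};
  return BufferView(block, 2, 3);
}
```

Copying a BufferView copies only the shared_ptr and two integers, so
passing it between libraries that agree on the type costs no data copy.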

The goal is to make it easy for a parquet-cpp user to statically link
only the libarrow symbols she/he needs for the application, without
necessarily having to ship the Arrow shared libraries separately.

Let me know if this makes sense / what concerns you might have.

Thanks
Wes


Re: IO and memory management in parquet-cpp and Arrow (C++)

Posted by Deepak Majeti <ma...@gmail.com>.
Wes,

I too agree that having a single general purpose I/O implementation makes
sense moving forward.
We can have a separate build option to build only the parquet-cpp library
for those who want to encode/decode just the parquet format.


-- 
regards,
Deepak Majeti

Re: IO and memory management in parquet-cpp and Arrow (C++)

Posted by "Uwe L. Korn" <uw...@xhochy.com>.
Hello Wes,

As the classes used from Arrow are very low-level and should not
prohibit the integration of parquet-cpp into other systems, I'm fully in
support of making Arrow a hard dependency. The two current
implementations of the not-shared-but-shall-be-shared components look
very similar. Given that the I/O implementations in Arrow would also
benefit other file formats such as Feather, and that the memory
management of Arrow will see great improvements, I don't see a need for
parquet-cpp to have its own abstractions.

Cheers
Uwe

-- 
  Uwe L. Korn
  uwelk@xhochy.com
