Posted to dev@arrow.apache.org by Corey J Nolet <cj...@gmail.com> on 2016/03/10 06:25:05 UTC

Understanding "shared" memory implications

If I understand correctly, Arrow is using Netty underneath, which is using Sun's Unsafe API in order to allocate direct byte buffers off heap. It is also using Netty to communicate, between "client" and "server", information about the memory addresses of the data being requested.

I've never attempted to use the Unsafe API to access off-heap memory that was allocated in one JVM from another JVM, but I'm assuming this must be the case in order to claim that the memory is being accessed directly without being copied, correct?

The implication here is huge. If the memory is being directly shared across processes, with each of them allowed to reach directly into the same direct byte buffers, that's true shared memory. Otherwise, if there are copies going on, it's less appealing.
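
As a rough sketch of what true shared memory looks like on the JVM (assuming a hypothetical file under /dev/shm; this is just one possible mechanism, not necessarily what Arrow's allocator does), two processes can map the same file and read each other's bytes in place, with no socket copy in between:

    // Hypothetical sketch: producer maps a file under /dev/shm (tmpfs) and
    // writes into it; a separate consumer process maps the same file and
    // reads the value where it sits. The path and size are made up.
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class SharedRegionSketch {
      static final String PATH = "/dev/shm/shared-region-demo"; // hypothetical
      static final int SIZE = 1 << 20;                          // 1 MiB region

      // Run in process A: write directly into the shared pages.
      static void produce() throws Exception {
        try (RandomAccessFile f = new RandomAccessFile(PATH, "rw");
             FileChannel ch = f.getChannel()) {
          MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, SIZE);
          buf.putLong(0, 42L); // other mappings of this file see these bytes
        }
      }

      // Run in process B: map the same file and read the bytes in place.
      static long consume() throws Exception {
        try (RandomAccessFile f = new RandomAccessFile(PATH, "r");
             FileChannel ch = f.getChannel()) {
          MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, SIZE);
          return buf.getLong(0); // no intermediate copy of the payload
        }
      }
    }

Anything short of that -- reading the bytes off a socket into a second buffer -- involves at least one copy.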


Thanks.

Sent from my iPad

RE: Understanding "shared" memory implications

Posted by "Zheng, Kai" <ka...@intel.com>.
Hi Wes,

Thanks for the helpful clarification, and I'm glad to know you're also moving along well on the project. Yes, I will look closely at the project and raise my specific comments on the mailing list.

Regards,
Kai

-----Original Message-----
From: Wes McKinney [mailto:wes@cloudera.com] 
Sent: Friday, March 18, 2016 6:51 AM
To: dev@arrow.apache.org
Subject: Re: Understanding "shared" memory implications

hi Kai,

This sounds like it might merit a separate thread to discuss the growth of Arrow as a modular ecosystem of libraries in different programming languages and related tools (we've discussed shared memory data access and metadata representation, but not questions around ownership and management of shared memory resources, which inevitably will come up). Composability around the shared memory layout is an extremely powerful concept, just as zero-copy compatibility with BLAS / LAPACK / Intel MKL is very important for high performance linear algebra in scientific computing use cases.

(Note that I recently have been working to revive Parquet C++ as a standalone library and have become a committer on that project, but my first commit was only in January. More specific comments there might be valuable on the Parquet mailing list.)

- Wes

On Thu, Mar 17, 2016 at 3:44 PM, Zheng, Kai <ka...@intel.com> wrote:
> Sounds good to have all these compatible, modular goals and changes, for Apache Arrow, in the early stage.
>
> On the other hand, not being afraid of change and continuing to evolve towards the core goals and well-defined initiatives is also important. Parquet is relevant and also a good example. IMO, it's simply poorly organized. Yes, it's all about history, but even as a still-young project, it looks to me like it has just stopped evolving. Will Arrow be another Parquet?
>
> Regards,
> Kai
>
> -----Original Message-----
> From: Wes McKinney [mailto:wes@cloudera.com]
> Sent: Thursday, March 17, 2016 6:03 AM
> To: dev@arrow.apache.org
> Subject: Re: Understanding "shared" memory implications
>
> On Wed, Mar 16, 2016 at 2:33 PM, Jacques Nadeau <ja...@apache.org> wrote:
>>
>> For Arrow, let's make sure that we do our best to accomplish both (1) 
>> and (2). They seem like entirely compatible goals.
>>
>>
>
> For my part on the C++ side, I plan to proceed with a hub-and-spoke 
> model. A minimal small core library with "leaf" shared libraries (for
> example: Parquet read/write adapter) that you can opt-in to building.
> This will add some extra linker configuration complexity for downstream users but to the benefit of a less monolithic library stack (which I don't think anyone wants).

Re: Understanding "shared" memory implications

Posted by Wes McKinney <we...@cloudera.com>.
hi Kai,

This sounds like it might merit a separate thread to discuss the
growth of Arrow as a modular ecosystem of libraries in different
programming languages and related tools (we've discussed shared memory
data access and metadata representation, but not questions around
ownership and management of shared memory resources, which inevitably
will come up). Composability around the shared memory layout is an
extremely powerful concept, just as zero-copy compatibility with BLAS
/ LAPACK / Intel MKL is very important for high performance linear
algebra in scientific computing use cases.

(Note that I recently have been working to revive Parquet C++ as a
standalone library and have become a committer on that project, but my
first commit was only in January. More specific comments there might
be valuable on the Parquet mailing list.)

- Wes

On Thu, Mar 17, 2016 at 3:44 PM, Zheng, Kai <ka...@intel.com> wrote:
> Sounds good to have all these compatible, modular goals and changes, for Apache Arrow, in the early stage.
>
> On the other hand, not being afraid of change and continuing to evolve towards the core goals and well-defined initiatives is also important. Parquet is relevant and also a good example. IMO, it's simply poorly organized. Yes, it's all about history, but even as a still-young project, it looks to me like it has just stopped evolving. Will Arrow be another Parquet?
>
> Regards,
> Kai
>
> -----Original Message-----
> From: Wes McKinney [mailto:wes@cloudera.com]
> Sent: Thursday, March 17, 2016 6:03 AM
> To: dev@arrow.apache.org
> Subject: Re: Understanding "shared" memory implications
>
> On Wed, Mar 16, 2016 at 2:33 PM, Jacques Nadeau <ja...@apache.org> wrote:
>>
>> For Arrow, let's make sure that we do our best to accomplish both (1)
>> and (2). They seem like entirely compatible goals.
>>
>>
>
> For my part on the C++ side, I plan to proceed with a hub-and-spoke model. A minimal small core library with "leaf" shared libraries (for
> example: Parquet read/write adapter) that you can opt-in to building.
> This will add some extra linker configuration complexity for downstream users but to the benefit of a less monolithic library stack (which I don't think anyone wants).

RE: Understanding "shared" memory implications

Posted by "Zheng, Kai" <ka...@intel.com>.
Sounds good to have all these compatible, modular goals and changes for Apache Arrow at this early stage.

On the other hand, not being afraid of change and continuing to evolve towards the core goals and well-defined initiatives is also important. Parquet is relevant and also a good example. IMO, it's simply poorly organized. Yes, it's all about history, but even as a still-young project, it looks to me like it has just stopped evolving. Will Arrow be another Parquet?

Regards,
Kai 

-----Original Message-----
From: Wes McKinney [mailto:wes@cloudera.com] 
Sent: Thursday, March 17, 2016 6:03 AM
To: dev@arrow.apache.org
Subject: Re: Understanding "shared" memory implications

On Wed, Mar 16, 2016 at 2:33 PM, Jacques Nadeau <ja...@apache.org> wrote:
>
> For Arrow, let's make sure that we do our best to accomplish both (1) 
> and (2). They seem like entirely compatible goals.
>
>

For my part on the C++ side, I plan to proceed with a hub-and-spoke model: a minimal core library with "leaf" shared libraries (for
example, a Parquet read/write adapter) that you can opt in to building.
This will add some extra linker configuration complexity for downstream users, but with the benefit of a less monolithic library stack (and I don't think anyone wants a monolithic one).

Re: Understanding "shared" memory implications

Posted by Wes McKinney <we...@cloudera.com>.
On Wed, Mar 16, 2016 at 2:33 PM, Jacques Nadeau <ja...@apache.org> wrote:
>
> For Arrow, let's make sure that we do our best to accomplish both (1) and
> (2). They seem like entirely compatible goals.
>
>

For my part on the C++ side, I plan to proceed with a hub-and-spoke
model: a minimal core library with "leaf" shared libraries (for
example, a Parquet read/write adapter) that you can opt in to building.
This will add some extra linker configuration complexity for
downstream users, but with the benefit of a less monolithic library
stack (and I don't think anyone wants a monolithic one).

Re: Understanding "shared" memory implications

Posted by Jacques Nadeau <ja...@apache.org>.
>>You’re hardly the biggest fan of the bundled default execution
implementation. At your bidding, we’ve been trying for almost 2 years to
get that stuff out of core.

Great point. As you stated, I think there are at least two lessons with
Calcite:

1. Make sure to have an easy-to-use, out-of-the-box initial integration so
people can get moving quickly.
2. Make sure to build a modular set of components so that advanced users
can consume only the pieces that they want.

The Calcite community (and Julian) have done a great job of (1) and that
has led to great adoption. (2) has been harder in that example because it
wasn't an initial design goal.

For Arrow, let's make sure that we do our best to accomplish both (1) and
(2). They seem like entirely compatible goals.



On Wed, Mar 16, 2016 at 10:54 AM, Julian Hyde <jh...@apache.org> wrote:

> Calcite is a salutary example of what happens if you *don’t* figure out
> early enough what is core and what is not. You’re hardly the biggest fan of
> the bundled default execution implementation. At your bidding, we’ve been
> trying for almost 2 years to get that stuff out of core.
>
> Arrow is, at its core, a memory format and APIs for creating/consuming
> that format. I think (hope) that the core can make reasonable assumptions
> such as that there are either multiple reader threads or a single writer
> thread. For that I suppose it will need a memory model.
>
> And yes, the core should deliver something that will solve Corey’s use
> case, e.g. being able to pass memory between processes without copying.
>
> But all of the stuff that involves complex moving parts should be kept
> very clearly out of core (in optional components or sample code) so that
> people can bring their own favorite complex moving parts to Arrow.
>
> Julian
>
>
> > On Mar 16, 2016, at 10:34 AM, Jacques Nadeau <ja...@apache.org> wrote:
> >
> > I think it is okay for a project to be different things to different
> > people.
> >
> > I think it is really important as a library that we have enough
> supporting
> > examples that people can get started quickly. In some sense I'm modeling
> > this after what Julian did with Calcite.  For example he provides a
> default
> > execution implementation to get started with but you don't need to use
> it.
> > I think this helps new users get started and have something working
> sooner.
> > It doesn't mean that a particular consumer needs to adopt that
> > implementation. In fact, many don't.
> >
> > So my goal is to provide an example implementation of sharing across IPC
> > and rpc. Once there is something to play with, we can figure out what
> > pieces are 'core' arrow and what pieces are example implementations.
> > I always thought Arrow was just an in-memory format, and it is the
> > responsibility of whoever else that want to use it to carry that
> > responsibilities out, because depending on workloads, different
> frameworks
> > might pick very different applications. Otherwise it seems to be doing
> too
> > much and having too strong of an opinion about data sharing in a format
> > that's primarily about data sharing.
> >
> > On Wed, Mar 16, 2016 at 10:03 AM, Corey Nolet <cj...@gmail.com> wrote:
> >
> >> I've been under the impression that exposing memory to be shared
> directly
> >> and not copied WAS, in fact, the responsibility of Arrow. In fact, I
> read
> >> this in [1] and this is what turned me on to Arrow in the first place.
> >>
> >>
> >> [1]
> >>
> >>
> >
> http://www.datanami.com/2016/02/17/arrow-aims-to-defrag-big-in-memory-analytics/
> >>
> >> On Wed, Mar 16, 2016 at 1:00 PM, Julian Hyde <jh...@apache.org> wrote:
> >>
> >>> This is all very interesting stuff, but just so we’re clear: it is not
> >>> Arrow’s responsibility to provide an RPC/IPC/LPC mechanism, nor
> >> facilities
> >>> for resource management. If we DID decide to make this Arrow’s
> >>> responsibility it would overlap with other components which specialize
> > in
> >>> such stuff.
> >>>
> >>>
> >>>
> >>>> On Mar 16, 2016, at 9:49 AM, Jacques Nadeau <ja...@apache.org>
> >> wrote:
> >>>>
> >>>> @Todd: agree entirely on prototyping design. My goal is throw out some
> >>>> ideas and some POC code and then we can explore from there.
> >>>>
> >>>> My main thoughts have initially been around lifecycle management. I've
> >>> done
> >>>> some work previously where a consistently sized shared buffer using
> >> mmap
> >>>> has improved performance. This is more complicated given the
> >> requirements
> >>>> for providing collaborative allocation and cross process reference
> >>> counts.
> >>>>
> >>>> With regards to whether this is more generally applicable: I think it
> >>> could
> >>>> ultimately be more general but I suggest we focus on the particular
> >>>> application of moving long-lived arrow record batches between a
> >> producer
> >>>> and a consumer initially. Constraining the problems seems like we will
> >>> get
> >>>> to something workable sooner. We can abstract to a more general
> >> solution
> >>> as
> >>>> there are other clear requirements.
> >>>>
> >>>> With regards to capnproto, I believe they are simply saying when they
> >>> talk
> >>>> about zero-copy shared memory that the structure supports that (same
> > as
> >>> any
> >>>> memory-layout based design). I don't believe they actually implemented
> >> a
> >>>> protocol and multi-language implementation for zero-copy cross process
> >>>> communication.
> >>>>
> >>>> One other note to make here is that my goal here is not just about
> >>>> performance but also about memory footprint. Being able to have a
> >> shared
> >>>> memory protocol that allows multiple tools to interact with the same
> >> hot
> >>>> dataset.
> >>>>
> >>>> RE: ACL, for the initial focus, I suggest that we consider the two
> >>> sharing
> >>>> processes are "trusted" and expect the initial Arrow API reference
> >>>> implementations to manage memory access.
> >>>>
> >>>> Regarding other questions that Todd threw out:
> >>>>
> >>>> - if you are using an mmapped file in /dev/shm/, how do you make sure
> >> it
> >>>> gets cleaned up if the process crashes?
> >>>>
> >>>>>> Agreed that it needs to get resolved. If I recall, destruction can be
> >>>> applied once associated processes are attached to memory and this allows
> >>> the
> >>>> kernel to recover once all attaching processes are destroyed. If this
> >>> isn't
> >>>> enough, then we may very well need a simple  external coordinator.
> >>>>
> >>>> - how do you allocate memory to it? there's nothing ensuring that
> >>> /dev/shm
> >>>> doesn't swap out if you try to put too much in there, and then your
> >>>> in-memory super-fast access will basically collapse under swap
> >> thrashing
> >>>>
> >>>>>> Simplest model initially is probably one where we assume a master
> >> and a
> >>>> slave. (Ideally negotiated on initial connection.) The master is
> >>>> responsible for allocating memory and giving that to the slave. The
> >>> master
> >>>> then is responsible for managing reasonable memory allocation limits
> >> just
> >>>> like any other. Slaves that need to allocate memory must ask the
> >> master
> >>>> (at whatever chunk makes sense) and will get rejected if they are too
> >>>> aggressive. (this probably means that at any point an IPC can fall
> > back
> >>> to
> >>>> RPC??)
> >>>>
> >>>> - how do you do lifecycle management across the two processes? If,
> > say,
> >>>> Kudu wants to pass a block of data to some Python program, how does it
> >>> know
> >>>> when the Python program is done reading it and it should be deleted?
> >> What
> >>>> if the python program crashed in the middle - when can Kudu release
> > it?
> >>>>
> >>>>>> My thinking, as mentioned earlier, is a shared reference count model
> >>> for
> >>>> complex situations. Possibly a "request/response" ownership model for
> >>>> simpler cases.
> >>>>
> >>>> - how do you do security? If both sides of the connection don't trust
> >>> each
> >>>> other, and use length prefixes and offsets, you have to be constantly
> >>>> validating and re-validating everything you read.
> >>>>
> >>>> I'm suggesting that we start with trusting so we don't get too wrapped
> >> up
> >>>> in all the extra complexities of security. My experience with these
> >>> things
> >>>> is that a lot of users will frequently pick performance or footprint
> >> over
> >>>> security for quite some time. For example, if I recall correctly, on
> >> the
> >>>> shared file descriptor model that was initially implemented in the
> > HDFS
> >>>> client, that people used short-circuit reads for years before security
> >>> was
> >>>> correctly implemented. (Am I remembering this right?)
> >>>>
> >>>> Lastly, as I mentioned above, I don't think there should be any
> >>> requirement
> >>>> that Arrow communication be limited to only 'IPC'. As Todd points out,
> >> in
> >>>> many cases unix domain sockets will be just fine.
> >>>>
> >>>> We need to implement both models because we all know that locality
> > will
> >>>> never be guaranteed. The IPC design/implementation needs to be good
> > for
> >>>> anything to make it into Arrow.
> >>>>
> >>>> thanks
> >>>> Jacques
> >>>>
> >>>>
> >>>>
> >>>> On Wed, Mar 16, 2016 at 8:54 AM, Zhe Zhang <zh...@apache.org> wrote:
> >>>>
> >>>>> I have similar concerns as Todd stated below. With an mmap-based
> >>> approach,
> >>>>> we are treating shared memory objects like files. This brings in all
> >>>>> filesystem related considerations like ACL and lifecycle mgmt.
> >>>>>
> >>>>> Stepping back a little, the shared-memory work isn't really specific
> >> to
> >>>>> Arrow. A few questions related to this:
> >>>>> 1) Has the topic been discussed in the context of protobuf (or other
> >> IPC
> >>>>> protocols) before? Seems Cap'n Proto (https://capnproto.org/) has
> >>>>> zero-copy
> >>>>> shared memory. I haven't read implementation detail though.
> >>>>> 2) If the shared-memory work benefits a wide range of protocols,
> >> should
> >>> it
> >>>>> be a generalized and standalone library?
> >>>>>
> >>>>> Thanks,
> >>>>> Zhe
> >>>>>
> >>>>> On Tue, Mar 15, 2016 at 8:30 PM Todd Lipcon <to...@cloudera.com>
> >> wrote:
> >>>>>
> >>>>>> Having thought about this quite a bit in the past, I think the
> >>> mechanics
> >>>>> of
> >>>>>> how to share memory are by far the easiest part. The much harder
> > part
> >>> is
> >>>>>> the resource management and ownership. Questions like:
> >>>>>>
> >>>>>> - if you are using an mmapped file in /dev/shm/, how do you make
> > sure
> >>> it
> >>>>>> gets cleaned up if the process crashes?
> >>>>>> - how do you allocate memory to it? there's nothing ensuring that
> >>>>> /dev/shm
> >>>>>> doesn't swap out if you try to put too much in there, and then your
> >>>>>> in-memory super-fast access will basically collapse under swap
> >>> thrashing
> >>>>>> - how do you do lifecycle management across the two processes? If,
> >> say,
> >>>>>> Kudu wants to pass a block of data to some Python program, how does
> >> it
> >>>>> know
> >>>>>> when the Python program is done reading it and it should be deleted?
> >>> What
> >>>>>> if the python program crashed in the middle - when can Kudu release
> >> it?
> >>>>>> - how do you do security? If both sides of the connection don't
> > trust
> >>>>> each
> >>>>>> other, and use length prefixes and offsets, you have to be
> > constantly
> >>>>>> validating and re-validating everything you read.
> >>>>>>
> >>>>>> Another big factor is that shared memory is not, in my experience,
> >>>>>> immediately faster than just copying data over a unix domain socket.
> >> In
> >>>>>> particular, the first time you read an mmapped file, you'll end up
> >>> paying
> >>>>>> minor page fault overhead on every page. This can be improved with
> >>>>>> HugePages, but huge page mmaps are not supported yet in current
> > Linux
> >>>>> (work
> >>>>>> going on currently to address this). So you're left with hugetlbfs,
> >>> which
> >>>>>> involves static allocations and much more pain.
> >>>>>>
> >>>>>> All the above is a long way to say: let's make sure we do the right
> >>>>>> prototyping and up-front design before jumping into code.
> >>>>>>
> >>>>>> -Todd
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Mar 15, 2016 at 5:54 PM, Jacques Nadeau <jacques@apache.org
> >
> >>>>>> wrote:
> >>>>>>
> >>>>>>> @Corey
> >>>>>>> The POC Steven and Wes are working on is based on MappedBuffer but
> >> I'm
> >>>>>>> looking at using netty's fork of tcnative to use shared memory
> >>>>> directly.
> >>>>>>>
> >>>>>>> @Yiannis
> >>>>>>> We need to have both RPC and shared memory mechanisms (what I'm
> >>>>>> inclined
> >>>>>>> to call IPC but is a specific kind of IPC). The idea is we
> > negotiate
> >>>>> via
> >>>>>>> RPC and then if we determine shared locality, we work over shared
> >>>>> memory
> >>>>>>> (preferably for both data and control). So the system interacting
> >> with
> >>>>>>> HBase in your example would be the one responsible for placing
> >>>>> collocated
> >>>>>>> execution to take advantage of IPC.
> >>>>>>>
> >>>>>>> How do others feel about my redefinition of IPC to mean the same
> > memory
> >>>>>> space
> >>>>>>> communication (either via shared memory or rdma) versus RPC as
> >> socket
> >>>>>> based
> >>>>>>> communication?
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Mar 15, 2016 at 5:38 PM, Corey Nolet <cj...@gmail.com>
> >>>>> wrote:
> >>>>>>>
> >>>>>>>> I was seeing Netty's unsafe classes being used here, not mapped
> >> byte
> >>>>>>>> buffer. Not sure if that statement is completely correct, but I'll
> >>>>> have
> >>>>>> to
> >>>>>>>> dig through the code again to figure that out.
> >>>>>>>>
> >>>>>>>> The more I was looking at unsafe, it makes sense why that would be
> >>>>>>>> used. Apparently it's also supposed to be included in Java 9 as a
> >>>>> first
> >>>>>>>> class API
> >>>>>>>> On Mar 15, 2016 7:03 PM, "Wes McKinney" <we...@cloudera.com> wrote:
> >>>>>>>>
> >>>>>>>>> My understanding is that you can use java.nio.MappedByteBuffer to
> >>>>>> work
> >>>>>>>>> with memory-mapped files as one way to share memory pages between
> >>>>>> Java
> >>>>>>>>> (and non-Java) processes without copying.
> >>>>>>>>>
> >>>>>>>>> I am hoping that we can reach a POC of zero-copy Arrow memory
> >>>>> sharing
> >>>>>>>>> Java-to-Java and Java-to-C++ in the near future. Indeed this will
> >>>>>> have
> >>>>>>>>> huge implications once we get it working end to end (for example,
> >>>>>>>>> receiving memory from a Java process in Python without a heavy
> >>>>> ser-de
> >>>>>>>>> step -- it's what we've always dreamed of) and with the metadata
> >>>>> and
> >>>>>>>>> shared memory control flow standardized.
> >>>>>>>>>
> >>>>>>>>> - Wes
> >>>>>>>>>
> >>>>>>>>> On Wed, Mar 9, 2016 at 9:25 PM, Corey J Nolet <cjnolet@gmail.com
> >
> >>>>>>> wrote:
> >>>>>>>>>> If I understand correctly, Arrow is using Netty underneath which
> >>>>> is
> >>>>>>>>> using Sun's Unsafe API in order to allocate direct byte buffers
> >> off
> >>>>>>> heap.
> >>>>>>>>> It is using Netty to communicate between "client" and "server",
> >>>>>>>> information
> >>>>>>>>> about memory addresses for data that is being requested.
> >>>>>>>>>>
> >>>>>>>>>> I've never attempted to use the Unsafe API to access off heap
> >>>>>> memory
> >>>>>>>>> that has been allocated in one JVM from another JVM but I'm
> >>>>> assuming
> >>>>>>> this
> >>>>>>>>> must be the case in order to claim that the memory is being
> >>>>> accessed
> >>>>>>>>> directly without being copied, correct?
> >>>>>>>>>>
> >>>>>>>>>> The implication here is huge. If the memory is being directly
> >>>>>> shared
> >>>>>>>>> across processes by them being allowed to directly reach into the
> >>>>>>> direct
> >>>>>>>>> byte buffers, that's true shared memory. Otherwise, if there's
> >>>>> copies
> >>>>>>>> going
> >>>>>>>>> on, it's less appealing.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Thanks.
> >>>>>>>>>>
> >>>>>>>>>> Sent from my iPad
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Todd Lipcon
> >>>>>> Software Engineer, Cloudera
> >>>>>>
> >>>>>
> >>>
> >>>
> >>
>
>

Re: Understanding "shared" memory implications

Posted by Wes McKinney <we...@cloudera.com>.
It has always been the expectation that no system would be required to
use a particular piece of Arrow software to "use Arrow" (hence the
importance of having a well-defined specification for memory and
metadata). However, we should also not expect all systems to create
their own implementations of everything. So having out-of-the-box
reference implementations of:

- Data structure builders / array containers
- Socket-based / shared memory data movement + metadata read/write
- Basic algorithms and helper data structures (e.g. hash tables)

...will help other projects become a part of the Arrow ecosystem much
more quickly (and they can later choose to specialize and use their
own bespoke implementations of things, too).

Having reference implementations will also facilitate integration
testing to ensure that you are "Arrow-compliant" if you do create a
custom implementation of any particular component.
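
As a purely illustrative sketch of the "data structure builders / array containers" item above (this is not the actual Arrow Java builder API, and the buffer sizing and padding rules here are simplified assumptions), an Arrow-style fixed-width array is just two off-heap buffers -- a validity bitmap and a values buffer -- that a reader can consume in place:

    // Illustrative only: a simplified int32 array container in the Arrow
    // style, with one validity bitmap buffer plus one values buffer.
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class Int32ArraySketch {
      final ByteBuffer validity; // 1 bit per slot: 1 = present, 0 = null
      final ByteBuffer values;   // 4 bytes per slot, little-endian
      final int length;

      Int32ArraySketch(int length) {
        this.length = length;
        this.validity = ByteBuffer.allocateDirect((length + 7) / 8);
        this.values = ByteBuffer.allocateDirect(length * 4)
                                .order(ByteOrder.LITTLE_ENDIAN);
      }

      void set(int index, int value) {
        int b = index / 8;
        validity.put(b, (byte) (validity.get(b) | (1 << (index % 8))));
        values.putInt(index * 4, value);
      }

      Integer get(int index) {
        boolean present = (validity.get(index / 8) & (1 << (index % 8))) != 0;
        return present ? values.getInt(index * 4) : null; // null slot -> null
      }
    }

A consumer in another language only needs the two buffers, the length, and the agreed-upon layout to read the same values without any conversion step.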

- Wes

On Wed, Mar 16, 2016 at 10:54 AM, Julian Hyde <jh...@apache.org> wrote:
> Calcite is a salutary example of what happens if you *don’t* figure out early enough what is core and what is not. You’re hardly the biggest fan of the bundled default execution implementation. At your bidding, we’ve been trying for almost 2 years to get that stuff out of core.
>
> Arrow is, at its core, a memory format and APIs for creating/consuming that format. I think (hope) that the core can make reasonable assumptions such as that there are either multiple reader threads or a single writer thread. For that I suppose it will need a memory model.
>
> And yes, the core should deliver something that will solve Corey’s use case, e.g. being able to pass memory between processes without copying.
>
> But all of the stuff that involves complex moving parts should be kept very clearly out of core (in optional components or sample code) so that people can bring their own favorite complex moving parts to Arrow.
>
> Julian
>
>
>> On Mar 16, 2016, at 10:34 AM, Jacques Nadeau <ja...@apache.org> wrote:
>>
>> I think it is okay for a project to be different things to different
>> people.
>>
>> I think it is really important as a library that we have enough supporting
>> examples that people can get started quickly. In some sense I'm modeling
>> this after what Julian did with Calcite.  For example he provides a default
>> execution implementation to get started with but you don't need to use it.
>> I think this helps new users get started and have something working sooner.
>> It doesn't mean that a particular consumer needs to adopt that
>> implementation. In fact, many don't.
>>
>> So my goal is to provide an example implementation of sharing across IPC
>> and rpc. Once there is something to play with, we can figure out what
>> pieces are 'core' arrow and what pieces are example implementations.
>> I always thought Arrow was just an in-memory format, and it is the
>> responsibility of whoever else that want to use it to carry that
>> responsibilities out, because depending on workloads, different frameworks
>> might pick very different applications. Otherwise it seems to be doing too
>> much and having too strong of an opinion about data sharing in a format
>> that's primarily about data sharing.
>>
>> On Wed, Mar 16, 2016 at 10:03 AM, Corey Nolet <cj...@gmail.com> wrote:
>>
>>> I've been under the impression that exposing memory to be shared directly
>>> and not copied WAS, in fact, the responsibility of Arrow. In fact, I read
>>> this in [1] and this is what turned me on to Arrow in the first place.
>>>
>>>
>>> [1]
>>>
>>>
>> http://www.datanami.com/2016/02/17/arrow-aims-to-defrag-big-in-memory-analytics/
>>>
>>> On Wed, Mar 16, 2016 at 1:00 PM, Julian Hyde <jh...@apache.org> wrote:
>>>
>>>> This is all very interesting stuff, but just so we’re clear: it is not
>>>> Arrow’s responsibility to provide an RPC/IPC/LPC mechanism, nor
>>> facilities
>>>> for resource management. If we DID decide to make this Arrow’s
>>>> responsibility it would overlap with other components which specialize
>> in
>>>> such stuff.
>>>>
>>>>
>>>>
>>>>> On Mar 16, 2016, at 9:49 AM, Jacques Nadeau <ja...@apache.org>
>>> wrote:
>>>>>
>>>>> @Todd: agree entirely on prototyping design. My goal is throw out some
>>>>> ideas and some POC code and then we can explore from there.
>>>>>
>>>>> My main thoughts have initially been around lifecycle management. I've
>>>> done
>>>>> some work previously where a consistently sized shared buffer using
>>> mmap
>>>>> has improved performance. This is more complicated given the
>>> requirements
>>>>> for providing collaborative allocation and cross process reference
>>>> counts.
>>>>>
>>>>> With regards to whether this is more generally applicable: I think it
>>>> could
>>>>> ultimately be more general but I suggest we focus on the particular
>>>>> application of moving long-lived arrow record batches between a
>>> producer
>>>>> and a consumer initially. Constraining the problems seems like we will
>>>> get
>>>>> to something workable sooner. We can abstract to a more general
>>> solution
>>>> as
>>>>> there are other clear requirements.
>>>>>
>>>>> With regards to capnproto, I believe they are simply saying when they
>>>> talk
>>>>> about zero-copy shared memory that the structure supports that (same
>> as
>>>> any
>>>>> memory-layout based design). I don't believe they actually implemented
>>> a
>>>>> protocol and multi-language implementation for zero-copy cross process
>>>>> communication.
>>>>>
>>>>> One other note to make here is that my goal here is not just about
>>>>> performance but also about memory footprint. Being able to have a
>>> shared
>>>>> memory protocol that allows multiple tools to interact with the same
>>> hot
>>>>> dataset.
>>>>>
>>>>> RE: ACL, for the initial focus, I suggest that we consider the two
>>>> sharing
>>>>> processes are "trusted" and expect the initial Arrow API reference
>>>>> implementations to manage memory access.
>>>>>
>>>>> Regarding other questions that Todd threw out:
>>>>>
>>>>> - if you are using an mmapped file in /dev/shm/, how do you make sure
>>> it
>>>>> gets cleaned up if the process crashes?
>>>>>
>>>>>>> Agreed that it needs to get resolved. If I recall, destruction can be
>>>>> applied once associated processes are attached to memory and this allows
>>>> the
>>>>> kernel to recover once all attaching processes are destroyed. If this
>>>> isn't
>>>>> enough, then we may very well need a simple  external coordinator.
>>>>>
>>>>> - how do you allocate memory to it? there's nothing ensuring that
>>>> /dev/shm
>>>>> doesn't swap out if you try to put too much in there, and then your
>>>>> in-memory super-fast access will basically collapse under swap
>>> thrashing
>>>>>
>>>>>>> Simplest model initially is probably one where we assume a master
>>> and a
>>>>> slave. (Ideally negotiated on initial connection.) The master is
>>>>> responsible for allocating memory and giving that to the slave. The
>>>> master
>>>>> then is responsible for managing reasonable memory allocation limits
>>> just
>>>>> like any other. Slaves that need to allocate memory must ask the
>>> master
>>>>> (at whatever chunk makes sense) and will get rejected if they are too
>>>>> aggressive. (this probably means that at any point an IPC can fall
>> back
>>>> to
>>>>> RPC??)
>>>>>
>>>>> - how do you do lifecycle management across the two processes? If,
>> say,
>>>>> Kudu wants to pass a block of data to some Python program, how does it
>>>> know
>>>>> when the Python program is done reading it and it should be deleted?
>>> What
>>>>> if the python program crashed in the middle - when can Kudu release
>> it?
>>>>>
>>>>>>> My thinking, as mentioned earlier, is a shared reference count model
>>>> for
>>>>> complex situations. Possibly a "request/response" ownership model for
>>>>> simpler cases.
>>>>>
>>>>> - how do you do security? If both sides of the connection don't trust
>>>> each
>>>>> other, and use length prefixes and offsets, you have to be constantly
>>>>> validating and re-validating everything you read.
>>>>>
>>>>> I'm suggesting that we start with trusting so we don't get too wrapped
>>> up
>>>>> in all the extra complexities of security. My experience with these
>>>> things
>>>>> is that a lot of users will frequently pick performance or footprint
>>> over
>>>>> security for quite some time. For example, if I recall correctly, on
>>> the
>>>>> shared file descriptor model that was initially implemented in the
>> HDFS
>>>>> client, that people used short-circuit reads for years before security
>>>> was
>>>>> correctly implemented. (Am I remembering this right?)
>>>>>
>>>>> Lastly, as I mentioned above, I don't think there should be any
>>>> requirement
>>>>> that Arrow communication be limited to only 'IPC'. As Todd points out,
>>> in
>>>>> many cases unix domain sockets will be just fine.
>>>>>
>>>>> We need to implement both models because we all know that locality
>> will
>>>>> never be guaranteed. The IPC design/implementation needs to be good
>> for
>>>>> anything to make it into Arrow.
>>>>>
>>>>> thanks
>>>>> Jacques
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Mar 16, 2016 at 8:54 AM, Zhe Zhang <zh...@apache.org> wrote:
>>>>>
>>>>>> I have similar concerns as Todd stated below. With an mmap-based
>>>> approach,
>>>>>> we are treating shared memory objects like files. This brings in all
>>>>>> filesystem related considerations like ACL and lifecycle mgmt.
>>>>>>
>>>>>> Stepping back a little, the shared-memory work isn't really specific
>>> to
>>>>>> Arrow. A few questions related to this:
>>>>>> 1) Has the topic been discussed in the context of protobuf (or other
>>> IPC
>>>>>> protocols) before? Seems Cap'n Proto (https://capnproto.org/) has
>>>>>> zero-copy
>>>>>> shared memory. I haven't read implementation detail though.
>>>>>> 2) If the shared-memory work benefits a wide range of protocols,
>>> should
>>>> it
>>>>>> be a generalized and standalone library?
>>>>>>
>>>>>> Thanks,
>>>>>> Zhe
>>>>>>
>>>>>> On Tue, Mar 15, 2016 at 8:30 PM Todd Lipcon <to...@cloudera.com>
>>> wrote:
>>>>>>
>>>>>>> Having thought about this quite a bit in the past, I think the
>>>> mechanics
>>>>>> of
>>>>>>> how to share memory are by far the easiest part. The much harder
>> part
>>>> is
>>>>>>> the resource management and ownership. Questions like:
>>>>>>>
>>>>>>> - if you are using an mmapped file in /dev/shm/, how do you make
>> sure
>>>> it
>>>>>>> gets cleaned up if the process crashes?
>>>>>>> - how do you allocate memory to it? there's nothing ensuring that
>>>>>> /dev/shm
>>>>>>> doesn't swap out if you try to put too much in there, and then your
>>>>>>> in-memory super-fast access will basically collapse under swap
>>>> thrashing
>>>>>>> - how do you do lifecycle management across the two processes? If,
>>> say,
>>>>>>> Kudu wants to pass a block of data to some Python program, how does
>>> it
>>>>>> know
>>>>>>> when the Python program is done reading it and it should be deleted?
>>>> What
>>>>>>> if the python program crashed in the middle - when can Kudu release
>>> it?
>>>>>>> - how do you do security? If both sides of the connection don't
>> trust
>>>>>> each
>>>>>>> other, and use length prefixes and offsets, you have to be
>> constantly
>>>>>>> validating and re-validating everything you read.
>>>>>>>
>>>>>>> Another big factor is that shared memory is not, in my experience,
>>>>>>> immediately faster than just copying data over a unix domain socket.
>>> In
>>>>>>> particular, the first time you read an mmapped file, you'll end up
>>>> paying
>>>>>>> minor page fault overhead on every page. This can be improved with
>>>>>>> HugePages, but huge page mmaps are not supported yet in current
>> Linux
>>>>>> (work
>>>>>>> going on currently to address this). So you're left with hugetlbfs,
>>>> which
>>>>>>> involves static allocations and much more pain.
>>>>>>>
>>>>>>> All the above is a long way to say: let's make sure we do the right
>>>>>>> prototyping and up-front design before jumping into code.
>>>>>>>
>>>>>>> -Todd
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 15, 2016 at 5:54 PM, Jacques Nadeau <ja...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> @Corey
>>>>>>>> The POC Steven and Wes are working on is based on MappedBuffer but
>>> I'm
>>>>>>>> looking at using netty's fork of tcnative to use shared memory
>>>>>> directly.
>>>>>>>>
>>>>>>>> @Yiannis
>>>>>>>> We need to have both RPC and shared memory mechanisms (what I'm
>>>>>>> inclined
>>>>>>>> to call IPC but is a specific kind of IPC). The idea is we
>> negotiate
>>>>>> via
>>>>>>>> RPC and then if we determine shared locality, we work over shared
>>>>>> memory
>>>>>>>> (preferably for both data and control). So the system interacting
>>> with
>>>>>>>> HBase in your example would be the one responsible for placing
>>>>>> collocated
>>>>>>>> execution to take advantage of IPC.
>>>>>>>>
>>>>>>>> How do others feel about my redefinition of IPC to mean the same
>> memory
>>>>>>> space
>>>>>>>> communication (either via shared memory or rdma) versus RPC as
>>> socket
>>>>>>> based
>>>>>>>> communication?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Mar 15, 2016 at 5:38 PM, Corey Nolet <cj...@gmail.com>
>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I was seeing Netty's unsafe classes being used here, not mapped
>>> byte
>>>>>>>>> buffer. Not sure if that statement is completely correct, but I'll
>>>>>> have
>>>>>>> to
>>>>>>>>> dig through the code again to figure that out.
>>>>>>>>>
>>>>>>>>> The more I was looking at unsafe, it makes sense why that would be
>>>>>>>>> used. Apparently it's also supposed to be included in Java 9 as a
>>>>>> first
>>>>>>>>> class API
>>>>>>>>> On Mar 15, 2016 7:03 PM, "Wes McKinney" <we...@cloudera.com> wrote:
>>>>>>>>>
>>>>>>>>>> My understanding is that you can use java.nio.MappedByteBuffer to
>>>>>>> work
>>>>>>>>>> with memory-mapped files as one way to share memory pages between
>>>>>>> Java
>>>>>>>>>> (and non-Java) processes without copying.
>>>>>>>>>>
>>>>>>>>>> I am hoping that we can reach a POC of zero-copy Arrow memory
>>>>>> sharing
>>>>>>>>>> Java-to-Java and Java-to-C++ in the near future. Indeed this will
>>>>>>> have
>>>>>>>>>> huge implications once we get it working end to end (for example,
>>>>>>>>>> receiving memory from a Java process in Python without a heavy
>>>>>> ser-de
>>>>>>>>>> step -- it's what we've always dreamed of) and with the metadata
>>>>>> and
>>>>>>>>>> shared memory control flow standardized.
>>>>>>>>>>
>>>>>>>>>> - Wes
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 9, 2016 at 9:25 PM, Corey J Nolet <cj...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>>> If I understand correctly, Arrow is using Netty underneath which
>>>>>> is
>>>>>>>>>> using Sun's Unsafe API in order to allocate direct byte buffers
>>> off
>>>>>>>> heap.
>>>>>>>>>> It is using Netty to communicate between "client" and "server",
>>>>>>>>> information
>>>>>>>>>> about memory addresses for data that is being requested.
>>>>>>>>>>>
>>>>>>>>>>> I've never attempted to use the Unsafe API to access off heap
>>>>>>> memory
>>>>>>>>>> that has been allocated in one JVM from another JVM but I'm
>>>>>> assuming
>>>>>>>> this
>>>>>>>>>> must be the case in order to claim that the memory is being
>>>>>> accessed
>>>>>>>>>> directly without being copied, correct?
>>>>>>>>>>>
>>>>>>>>>>> The implication here is huge. If the memory is being directly
>>>>>>> shared
>>>>>>>>>> across processes by them being allowed to directly reach into the
>>>>>>>> direct
>>>>>>>>>> byte buffers, that's true shared memory. Otherwise, if there's
>>>>>> copies
>>>>>>>>> going
>>>>>>>>>> on, it's less appealing.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks.
>>>>>>>>>>>
>>>>>>>>>>> Sent from my iPad
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Todd Lipcon
>>>>>>> Software Engineer, Cloudera
>>>>>>>
>>>>>>
>>>>
>>>>
>>>
>

Re: Understanding "shared" memory implications

Posted by Julian Hyde <jh...@apache.org>.
Calcite is a salutary example of what happens if you *don’t* figure out early enough what is core and what is not. You’re hardly the biggest fan of the bundled default execution implementation. At your bidding, we’ve been trying for almost 2 years to get that stuff out of core.

Arrow is, at its core, a memory format and APIs for creating/consuming that format. I think (hope) that the core can make reasonable assumptions such as that there are either multiple reader threads or a single writer thread. For that I suppose it will need a memory model.
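
A minimal sketch of that single-writer / multiple-readers assumption (the handoff queue below is just a placeholder; under the Java memory model the only real requirement is that the publish step be a safe publication):

    // One writer fills a buffer, then publishes an immutable (read-only)
    // view that any number of reader threads may consume concurrently.
    import java.nio.ByteBuffer;
    import java.util.concurrent.SynchronousQueue;

    public class SingleWriterManyReaders {
      public static void main(String[] args) throws Exception {
        SynchronousQueue<ByteBuffer> handoff = new SynchronousQueue<>();

        Thread writer = new Thread(() -> {
          ByteBuffer buf = ByteBuffer.allocateDirect(64);
          buf.putLong(0, 123L);                  // only the writer mutates
          try {
            handoff.put(buf.asReadOnlyBuffer()); // safe publication point
          } catch (InterruptedException ignored) {
          }
        });
        writer.start();

        ByteBuffer view = handoff.take();        // readers get a read-only view
        System.out.println(view.getLong(0));     // prints 123
      }
    }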

And yes, the core should deliver something that will solve Corey’s use case, e.g. being able to pass memory between processes without copying.

But all of the stuff that involves complex moving parts should be kept very clearly out of core (in optional components or sample code) so that people can bring their own favorite complex moving parts to Arrow.

Julian
 

> On Mar 16, 2016, at 10:34 AM, Jacques Nadeau <ja...@apache.org> wrote:
> 
> I think it is okay for a project to be different things to different
> people.
> 
> I think it is really important as a library that we have enough supporting
> examples that people can get started quickly. In some sense I'm modeling
> this after what Julian did with Calcite.  For example he provides a default
> execution implementation to get started with but you don't need to use it.
> I think this helps new users get started and have something working sooner.
> It doesn't mean that a particular consumer needs to adopt that
> implementation. In fact, many don't.
> 
> So my goal is to provide an example implementation of sharing across IPC
> and rpc. Once there is something to play with, we can figure out what
> pieces are 'core' arrow and what pieces are example implementations.
> I always thought Arrow was just an in-memory format, and it is the
> responsibility of whoever else that want to use it to carry that
> responsibilities out, because depending on workloads, different frameworks
> might pick very different applications. Otherwise it seems to be doing too
> much and having too strong of an opinion about data sharing in a format
> that's primarily about data sharing.
> 
> On Wed, Mar 16, 2016 at 10:03 AM, Corey Nolet <cj...@gmail.com> wrote:
> 
>> I've been under the impression that exposing memory to be shared directly
>> and not copied WAS, in fact, the responsibility of Arrow. In fact, I read
>> this in [1] and this is what turned me on to Arrow in the first place.
>> 
>> 
>> [1]
>> 
>> 
> http://www.datanami.com/2016/02/17/arrow-aims-to-defrag-big-in-memory-analytics/
>> 
>> On Wed, Mar 16, 2016 at 1:00 PM, Julian Hyde <jh...@apache.org> wrote:
>> 
>>> This is all very interesting stuff, but just so we’re clear: it is not
>>> Arrow’s responsibility to provide an RPC/IPC/LPC mechanism, nor
>> facilities
>>> for resource management. If we DID decide to make this Arrow’s
>>> responsibility it would overlap with other components which specialize
> in
>>> such stuff.
>>> 
>>> 
>>> 
>>>> On Mar 16, 2016, at 9:49 AM, Jacques Nadeau <ja...@apache.org>
>> wrote:
>>>> 
>>>> @Todd: agree entirely on prototyping design. My goal is throw out some
>>>> ideas and some POC code and then we can explore from there.
>>>> 
>>>> My main thoughts have initially been around lifecycle management. I've
>>> done
>>>> some work previously where a consistently sized shared buffer using
>> mmap
>>>> has improved performance. This is more complicated given the
>> requirements
>>>> for providing collaborative allocation and cross process reference
>>> counts.
>>>> 
>>>> With regards to whether this is more generally applicable: I think it
>>> could
>>>> ultimately be more general but I suggest we focus on the particular
>>>> application of moving long-lived arrow record batches between a
>> producer
>>>> and a consumer initially. Constraining the problems seems like we will
>>> get
>>>> to something workable sooner. We can abstract to a more general
>> solution
>>> as
>>>> there are other clear requirements.
>>>> 
>>>> With regards to capnproto, I believe they are simply saying when they
>>> talk
>>>> about zero-copy shared memory that the structure supports that (same
> as
>>> any
>>>> memory-layout based design). I don't believe they actually implemented
>> a
>>>> protocol and multi-language implementation for zero-copy cross process
>>>> communication.
>>>> 
>>>> One other note to make here is that my goal here is not just about
>>>> performance but also about memory footprint. Being able to have a
>> shared
>>>> memory protocol that allows multiple tools to interact with the same
>> hot
>>>> dataset.
>>>> 
>>>> RE: ACL, for the initial focus, I suggest that we consider the two
>>> sharing
>>>> processes are "trusted" and expect the initial Arrow API reference
>>>> implementations to manage memory access.
>>>> 
>>>> Regarding other questions that Todd threw out:
>>>> 
>>>> - if you are using an mmapped file in /dev/shm/, how do you make sure
>> it
>>>> gets cleaned up if the process crashes?
>>>> 
>>>>>> Agreed that it needs to get resolved. If I recall, destruction can be
>>>> applied once associated processes are attached to memory and this allows
>>> the
>>>> kernel to recover once all attaching processes are destroyed. If this
>>> isn't
>>>> enough, then we may very well need a simple  external coordinator.
>>>> 
>>>> - how do you allocate memory to it? there's nothing ensuring that
>>> /dev/shm
>>>> doesn't swap out if you try to put too much in there, and then your
>>>> in-memory super-fast access will basically collapse under swap
>> thrashing
>>>> 
>>>>>> Simplest model initially is probably one where we assume a master
>> and a
>>>> slave. (Ideally negotiated on initial connection.) The master is
>>>> responsible for allocating memory and giving that to the slave. The
>>> master
>>>> then is responsible for managing reasonable memory allocation limits
>> just
>>>> like any other. Slaves that need to allocate memory must ask the
>> master
>>>> (at whatever chunk makes sense) and will get rejected if they are too
>>>> aggressive. (this probably means that at any point an IPC can fall
> back
>>> to
>>>> RPC??)
>>>> 
>>>> - how do you do lifecycle management across the two processes? If,
> say,
>>>> Kudu wants to pass a block of data to some Python program, how does it
>>> know
>>>> when the Python program is done reading it and it should be deleted?
>> What
>>>> if the python program crashed in the middle - when can Kudu release
> it?
>>>> 
>>>>>> My thinking, as mentioned earlier, is a shared reference count model
>>> for
>>>> complex situations. Possibly a "request/response" ownership model for
>>>> simpler cases.
>>>> 
>>>> - how do you do security? If both sides of the connection don't trust
>>> each
>>>> other, and use length prefixes and offsets, you have to be constantly
>>>> validating and re-validating everything you read.
>>>> 
>>>> I'm suggesting that we start with trusting so we don't get too wrapped
>> up
>>>> in all the extra complexities of security. My experience with these
>>> things
>>>> is that a lot of users will frequently pick performance or footprint
>> over
>>>> security for quite some time. For example, if I recall correctly, on
>> the
>>>> shared file descriptor model that was initially implemented in the
> HDFS
>>>> client, that people used short-circuit reads for years before security
>>> was
>>>> correctly implemented. (Am I remembering this right?)
>>>> 
>>>> Lastly, as I mentioned above, I don't think there should be any
>>> requirement
>>>> that Arrow communication be limited to only 'IPC'. As Todd points out,
>> in
>>>> many cases unix domain sockets will be just fine.
>>>> 
>>>> We need to implement both models because we all know that locality
> will
>>>> never be guaranteed. The IPC design/implementation needs to be good
> for
>>>> anything to make it into Arrow.
>>>> 
>>>> thanks
>>>> Jacques
>>>> 
>>>> 
>>>> 
>>>> On Wed, Mar 16, 2016 at 8:54 AM, Zhe Zhang <zh...@apache.org> wrote:
>>>> 
>>>>> I have similar concerns as Todd stated below. With an mmap-based
>>> approach,
>>>>> we are treating shared memory objects like files. This brings in all
>>>>> filesystem related considerations like ACL and lifecycle mgmt.
>>>>> 
>>>>> Stepping back a little, the shared-memory work isn't really specific
>> to
>>>>> Arrow. A few questions related to this:
>>>>> 1) Has the topic been discussed in the context of protobuf (or other
>> IPC
>>>>> protocols) before? Seems Cap'n Proto (https://capnproto.org/) has
>>>>> zero-copy
>>>>> shared memory. I haven't read implementation detail though.
>>>>> 2) If the shared-memory work benefits a wide range of protocols,
>> should
>>> it
>>>>> be a generalized and standalone library?
>>>>> 
>>>>> Thanks,
>>>>> Zhe
>>>>> 
>>>>> On Tue, Mar 15, 2016 at 8:30 PM Todd Lipcon <to...@cloudera.com>
>> wrote:
>>>>> 
>>>>>> Having thought about this quite a bit in the past, I think the
>>> mechanics
>>>>> of
>>>>>> how to share memory are by far the easiest part. The much harder
> part
>>> is
>>>>>> the resource management and ownership. Questions like:
>>>>>> 
>>>>>> - if you are using an mmapped file in /dev/shm/, how do you make
> sure
>>> it
>>>>>> gets cleaned up if the process crashes?
>>>>>> - how do you allocate memory to it? there's nothing ensuring that
>>>>> /dev/shm
>>>>>> doesn't swap out if you try to put too much in there, and then your
>>>>>> in-memory super-fast access will basically collapse under swap
>>> thrashing
>>>>>> - how do you do lifecycle management across the two processes? If,
>> say,
>>>>>> Kudu wants to pass a block of data to some Python program, how does
>> it
>>>>> know
>>>>>> when the Python program is done reading it and it should be deleted?
>>> What
>>>>>> if the python program crashed in the middle - when can Kudu release
>> it?
>>>>>> - how do you do security? If both sides of the connection don't
> trust
>>>>> each
>>>>>> other, and use length prefixes and offsets, you have to be
> constantly
>>>>>> validating and re-validating everything you read.
>>>>>> 
>>>>>> Another big factor is that shared memory is not, in my experience,
>>>>>> immediately faster than just copying data over a unix domain socket.
>> In
>>>>>> particular, the first time you read an mmapped file, you'll end up
>>> paying
>>>>>> minor page fault overhead on every page. This can be improved with
>>>>>> HugePages, but huge page mmaps are not supported yet in current
> Linux
>>>>> (work
>>>>>> going on currently to address this). So you're left with hugetlbfs,
>>> which
>>>>>> involves static allocations and much more pain.
>>>>>> 
>>>>>> All the above is a long way to say: let's make sure we do the right
>>>>>> prototyping and up-front design before jumping into code.
>>>>>> 
>>>>>> -Todd
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Tue, Mar 15, 2016 at 5:54 PM, Jacques Nadeau <ja...@apache.org>
>>>>>> wrote:
>>>>>> 
>>>>>>> @Corey
>>>>>>> The POC Steven and Wes are working on is based on MappedBuffer but
>> I'm
>>>>>>> looking at using netty's fork of tcnative to use shared memory
>>>>> directly.
>>>>>>> 
>>>>>>> @Yiannis
>>>>>>> We need to have both RPC and shared memory mechanisms (what I'm
>>>>>> inclined
>>>>>>> to call IPC but is a specific kind of IPC). The idea is we
> negotiate
>>>>> via
>>>>>>> RPC and then if we determine shared locality, we work over shared
>>>>> memory
>>>>>>> (preferably for both data and control). So the system interacting
>> with
>>>>>>> HBase in your example would be the one responsible for placing
>>>>> collocated
>>>>>>> execution to take advantage of IPC.
>>>>>>> 
>>>>>>> How do others feel about my redefinition of IPC to mean the same
> memory
>>>>>> space
>>>>>>> communication (either via shared memory or rdma) versus RPC as
>> socket
>>>>>> based
>>>>>>> communication?
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Mar 15, 2016 at 5:38 PM, Corey Nolet <cj...@gmail.com>
>>>>> wrote:
>>>>>>> 
>>>>>>>> I was seeing Netty's unsafe classes being used here, not mapped
>> byte
>>>>>>>> buffer. Not sure if that statement is completely correct, but I'll
>>>>> have
>>>>>> to
>>>>>>>> dig through the code again to figure that out.
>>>>>>>> 
>>>>>>>> The more I was looking at unsafe, it makes sense why that would be
>>>>>>>> used. Apparently it's also supposed to be included in Java 9 as a
>>>>> first
>>>>>>>> class API
>>>>>>>> On Mar 15, 2016 7:03 PM, "Wes McKinney" <we...@cloudera.com> wrote:
>>>>>>>> 
>>>>>>>>> My understanding is that you can use java.nio.MappedByteBuffer to
>>>>>> work
>>>>>>>>> with memory-mapped files as one way to share memory pages between
>>>>>> Java
>>>>>>>>> (and non-Java) processes without copying.
>>>>>>>>> 
>>>>>>>>> I am hoping that we can reach a POC of zero-copy Arrow memory
>>>>> sharing
>>>>>>>>> Java-to-Java and Java-to-C++ in the near future. Indeed this will
>>>>>> have
>>>>>>>>> huge implications once we get it working end to end (for example,
>>>>>>>>> receiving memory from a Java process in Python without a heavy
>>>>> ser-de
>>>>>>>>> step -- it's what we've always dreamed of) and with the metadata
>>>>> and
>>>>>>>>> shared memory control flow standardized.
>>>>>>>>> 
>>>>>>>>> - Wes
>>>>>>>>> 
>>>>>>>>> On Wed, Mar 9, 2016 at 9:25 PM, Corey J Nolet <cj...@gmail.com>
>>>>>>> wrote:
>>>>>>>>>> If I understand correctly, Arrow is using Netty underneath which
>>>>> is
>>>>>>>>> using Sun's Unsafe API in order to allocate direct byte buffers
>> off
>>>>>>> heap.
>>>>>>>>> It is using Netty to communicate between "client" and "server",
>>>>>>>> information
>>>>>>>>> about memory addresses for data that is being requested.
>>>>>>>>>> 
>>>>>>>>>> I've never attempted to use the Unsafe API to access off heap
>>>>>> memory
>>>>>>>>> that has been allocated in one JVM from another JVM but I'm
>>>>> assuming
>>>>>>> this
>>>>>>>>> must be the case in order to claim that the memory is being
>>>>> accessed
>>>>>>>>> directly without being copied, correct?
>>>>>>>>>> 
>>>>>>>>>> The implication here is huge. If the memory is being directly
>>>>>> shared
>>>>>>>>> across processes by them being allowed to directly reach into the
>>>>>>> direct
>>>>>>>>> byte buffers, that's true shared memory. Otherwise, if there's
>>>>> copies
>>>>>>>> going
>>>>>>>>> on, it's less appealing.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Thanks.
>>>>>>>>>> 
>>>>>>>>>> Sent from my iPad
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Todd Lipcon
>>>>>> Software Engineer, Cloudera
>>>>>> 
>>>>> 
>>> 
>>> 
>> 


Re: Understanding "shared" memory implications

Posted by Jacques Nadeau <ja...@apache.org>.
I think it is okay for a project to be different things to different
people.

I think it is really important as a library that we have enough supporting
examples that people can get started quickly. In some sense I'm modeling
this after what Julian did with Calcite.  For example he provides a default
execution implementation to get started with but you don't need to use it.
I think this helps new users get started and have something working sooner.
It doesn't mean that a particular consumer needs to adopt that
implementation. In fact, many don't.

So my goal is to provide an example implementation of sharing across IPC
and RPC. Once there is something to play with, we can figure out what
pieces are 'core' Arrow and what pieces are example implementations.
I always thought Arrow was just an in-memory format, and that it is the
responsibility of whoever else wants to use it to carry those
responsibilities out, because depending on workloads, different frameworks
might pick very different applications. Otherwise it seems to be doing too
much and having too strong of an opinion about data sharing in a format
that's primarily about data sharing.
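
As a rough sketch of what "sharing across IPC and RPC" could mean (all names here are invented for illustration, not an Arrow API): peers talk over a socket first, and only if they turn out to be co-located do they switch to handing off a shared-memory region reference instead of streaming the bytes.

    // Hypothetical negotiation sketch: choose a transport based on locality.
    import java.net.InetAddress;

    public class TransportNegotiationSketch {

      interface BatchTransport {
        void send(byte[] recordBatch) throws Exception;
      }

      static BatchTransport negotiate(InetAddress peer, String sharedRegionPath)
          throws Exception {
        boolean sameHost = peer.isLoopbackAddress()
            || peer.equals(InetAddress.getLocalHost());
        if (sameHost) {
          // "IPC": place the batch in a shared region and send only its
          // offset/length over the control channel.
          return batch -> System.out.println(
              "would place " + batch.length + " bytes in " + sharedRegionPath);
        }
        // "RPC": fall back to copying the bytes over the socket.
        return batch -> System.out.println(
            "would stream " + batch.length + " bytes to " + peer);
      }
    }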

On Wed, Mar 16, 2016 at 10:03 AM, Corey Nolet <cj...@gmail.com> wrote:

> I've been under the impression that exposing memory to be shared directly
> and not copied WAS, in fact, the responsibility of Arrow. In fact, I read
> this in [1] and this is what turned me on to Arrow in the first place.
>
>
> [1]
>
>
http://www.datanami.com/2016/02/17/arrow-aims-to-defrag-big-in-memory-analytics/
>
> On Wed, Mar 16, 2016 at 1:00 PM, Julian Hyde <jh...@apache.org> wrote:
>
> > This is all very interesting stuff, but just so we’re clear: it is not
> > Arrow’s responsibility to provide an RPC/IPC/LPC mechanism, nor
> facilities
> > for resource management. If we DID decide to make this Arrow’s
> > responsibility it would overlap with other components which specialize
in
> > such stuff.
> >
> >
> >
> > > On Mar 16, 2016, at 9:49 AM, Jacques Nadeau <ja...@apache.org>
> wrote:
> > >
> > > @Todd: agree entirely on prototyping design. My goal is throw out some
> > > ideas and some POC code and then we can explore from there.
> > >
> > > My main thoughts have initially been around lifecycle management. I've
> > done
> > > some work previously where a consistently sized shared buffer using
> mmap
> > > has improved performance. This is more complicated given the
> requirements
> > > for providing collaborative allocation and cross process reference
> > counts.
> > >
> > > With regards to whether this is more generally applicable: I think it
> > could
> > > ultimately be more general but I suggest we focus on the particular
> > > application of moving long-lived arrow record batches between a
> producer
> > > and a consumer initially. Constraining the problems seems like we will
> > get
> > > to something workable sooner. We can abstract to a more general
> solution
> > as
> > > there are other clear requirements.
> > >
> > > With regards to capnproto, I believe they are simply saying when they
> > talk
> > > about zero-copy shared memory that the structure supports that (same
as
> > any
> > > memory-layout based design). I don't believe they actually implemented
> a
> > > protocol and multi-language implementation for zero-copy cross process
> > > communication.
> > >
> > > One other note to make here is that my goal here is not just about
> > > performance but also about memory footprint. Being able to have a
> shared
> > > memory protocol that allows multiple tools to interact with the same
> hot
> > > dataset.
> > >
> > > RE: ACL, for the initial focus, I suggest that we consider the two
> > sharing
> > > processes are "trusted" and expect the initial Arrow API reference
> > > implementations to manage memory access.
> > >
> > > Regarding other questions that Todd threw out:
> > >
> > > - if you are using an mmapped file in /dev/shm/, how do you make sure
> it
> > > gets cleaned up if the process crashes?
> > >
> > >>> Agreed that it needs to get resolve. If I recall, destruction can be
> > > applied once associated process are attached to memory and this allows
> > the
> > > kernel to recover once all attaching processes are destroyed. If this
> > isn't
> > > enough, then we may very well need a simple  external coordinator.
> > >
> > > - how do you allocate memory to it? there's nothing ensuring that
> > /dev/shm
> > > doesn't swap out if you try to put too much in there, and then your
> > > in-memory super-fast access will basically collapse under swap
> thrashing
> > >
> > >>> Simplest model initially is probably one where we assume a master
> and a
> > > slave. (Ideally negotiated on initial connection.) The master is
> > > responsible for allocating memory and giving that to the slave. The
> > master
> > > then is responsible for managing reasonable memory allocation limits
> just
> > > like any other. Slaves that need to allocated memory must ask the
> master
> > > (at whatever chunk makes sense) and will get rejected if they are too
> > > aggressive. (this probably means that at any point an IPC can fall
back
> > to
> > > RPC??)
> > >
> > > - how do you do lifecycle management across the two processes? If,
say,
> > > Kudu wants to pass a block of data to some Python program, how does it
> > know
> > > when the Python program is done reading it and it should be deleted?
> What
> > > if the python program crashed in the middle - when can Kudu release
it?
> > >
> > >>> My thinking, as mentioned earlier, is a shared reference count model
> > for
> > > complex situations. Possibly a "request/response" ownership model for
> > > simpler cases.
> > >
> > > - how do you do security? If both sides of the connection don't trust
> > each
> > > other, and use length prefixes and offsets, you have to be constantly
> > > validating and re-validating everything you read.
> > >
> > > I'm suggesting that we start with trusting so we don't get too wrapped
> up
> > > in all the extra complexities of security. My experience with these
> > things
> > > is that a lot of users will frequently pick performance or footprint
> over
> > > security for quite some time. For example, if I recall correctly, on
> the
> > > shared file descriptor model that was initially implemented in the
HDFS
> > > client, that people used short-circuit reads for years before security
> > was
> > > correctly implemented. (Am I remembering this right?)
> > >
> > > Lastly, as I mentioned above, I don't think there should be any
> > requirement
> > > that Arrow communication be limited to only 'IPC'. As Todd points out,
> in
> > > many cases unix domain sockets will be just fine.
> > >
> > > We need to implement both models because we all know that locality
will
> > > never be guaranteed. The IPC design/implementation needs to be good
for
> > > anything to make into arrow.
> > >
> > > thanks
> > > Jacques
> > >
> > >
> > >
> > > On Wed, Mar 16, 2016 at 8:54 AM, Zhe Zhang <zh...@apache.org> wrote:
> > >
> > >> I have similar concerns as Todd stated below. With an mmap-based
> > approach,
> > >> we are treating shared memory objects like files. This brings in all
> > >> filesystem related considerations like ACL and lifecycle mgmt.
> > >>
> > >> Stepping back a little, the shared-memory work isn't really specific
> to
> > >> Arrow. A few questions related to this:
> > >> 1) Has the topic been discussed in the context of protobuf (or other
> IPC
> > >> protocols) before? Seems Cap'n Proto (https://capnproto.org/) has
> > >> zero-copy
> > >> shared memory. I haven't read implementation detail though.
> > >> 2) If the shared-memory work benefits a wide range of protocols,
> should
> > it
> > >> be a generalized and standalone library?
> > >>
> > >> Thanks,
> > >> Zhe
> > >>
> > >> On Tue, Mar 15, 2016 at 8:30 PM Todd Lipcon <to...@cloudera.com>
> wrote:
> > >>
> > >>> Having thought about this quite a bit in the past, I think the
> > mechanics
> > >> of
> > >>> how to share memory are by far the easiest part. The much harder
part
> > is
> > >>> the resource management and ownership. Questions like:
> > >>>
> > >>> - if you are using an mmapped file in /dev/shm/, how do you make
sure
> > it
> > >>> gets cleaned up if the process crashes?
> > >>> - how do you allocate memory to it? there's nothing ensuring that
> > >> /dev/shm
> > >>> doesn't swap out if you try to put too much in there, and then your
> > >>> in-memory super-fast access will basically collapse under swap
> > thrashing
> > >>> - how do you do lifecycle management across the two processes? If,
> say,
> > >>> Kudu wants to pass a block of data to some Python program, how does
> it
> > >> know
> > >>> when the Python program is done reading it and it should be deleted?
> > What
> > >>> if the python program crashed in the middle - when can Kudu release
> it?
> > >>> - how do you do security? If both sides of the connection don't
trust
> > >> each
> > >>> other, and use length prefixes and offsets, you have to be
constantly
> > >>> validating and re-validating everything you read.
> > >>>
> > >>> Another big factor is that shared memory is not, in my experience,
> > >>> immediately faster than just copying data over a unix domain socket.
> In
> > >>> particular, the first time you read an mmapped file, you'll end up
> > paying
> > >>> minor page fault overhead on every page. This can be improved with
> > >>> HugePages, but huge page mmaps are not supported yet in current
Linux
> > >> (work
> > >>> going on currently to address this). So you're left with hugetlbfs,
> > which
> > >>> involves static allocations and much more pain.
> > >>>
> > >>> All the above is a long way to say: let's make sure we do the write
> > >>> prototyping and up-front design before jumping into code.
> > >>>
> > >>> -Todd
> > >>>
> > >>>
> > >>>
> > >>> On Tue, Mar 15, 2016 at 5:54 PM, Jacques Nadeau <ja...@apache.org>
> > >>> wrote:
> > >>>
> > >>>> @Corey
> > >>>> The POC Steven and Wes are working on is based on MappedBuffer but
> I'm
> > >>>> looking at using netty's fork of tcnative to use shared memory
> > >> directly.
> > >>>>
> > >>>> @Yiannis
> > >>>> We need to have both RPC and a shared memory mechanisms (what I'm
> > >>> inclined
> > >>>> to call IPC but is a specific kind of IPC). The idea is we
negotiate
> > >> via
> > >>>> RPC and then if we determine shared locality, we work over shared
> > >> memory
> > >>>> (preferably for both data and control). So the system interacting
> with
> > >>>> HBase in your example would be the one responsible for placing
> > >> collocated
> > >>>> execution to take advantage of IPC.
> > >>>>
> > >>>> How do others feel of my redefinition of IPC to mean the same
memory
> > >>> space
> > >>>> communication (either via shared memory or rdma) versus RPC as
> socket
> > >>> based
> > >>>> communication?
> > >>>>
> > >>>>
> > >>>> On Tue, Mar 15, 2016 at 5:38 PM, Corey Nolet <cj...@gmail.com>
> > >> wrote:
> > >>>>
> > >>>>> I was seeing Netty's unsafe classes being used here, not mapped
> byte
> > >>>>> buffer  not sure if that statement is completely correct but I'll
> > >> have
> > >>> to
> > >>>>> dog through the code again to figure that out.
> > >>>>>
> > >>>>> The more I was looking at unsafe, it makes sense why that would be
> > >>>>> used.apparently it's also supposed to be included on Java 9 as a
> > >> first
> > >>>>> class API
> > >>>>> On Mar 15, 2016 7:03 PM, "Wes McKinney" <we...@cloudera.com> wrote:
> > >>>>>
> > >>>>>> My understanding is that you can use java.nio.MappedByteBuffer to
> > >>> work
> > >>>>>> with memory-mapped files as one way to share memory pages between
> > >>> Java
> > >>>>>> (and non-Java) processes without copying.
> > >>>>>>
> > >>>>>> I am hoping that we can reach a POC of zero-copy Arrow memory
> > >> sharing
> > >>>>>> Java-to-Java and Java-to-C++ in the near future. Indeed this will
> > >>> have
> > >>>>>> huge implications once we get it working end to end (for example,
> > >>>>>> receiving memory from a Java process in Python without a heavy
> > >> ser-de
> > >>>>>> step -- it's what we've always dreamed of) and with the metadata
> > >> and
> > >>>>>> shared memory control flow standardized.
> > >>>>>>
> > >>>>>> - Wes
> > >>>>>>
> > >>>>>> On Wed, Mar 9, 2016 at 9:25 PM, Corey J Nolet <cj...@gmail.com>
> > >>>> wrote:
> > >>>>>>> If I understand correctly, Arrow is using Netty underneath which
> > >> is
> > >>>>>> using Sun's Unsafe API in order to allocate direct byte buffers
> off
> > >>>> heap.
> > >>>>>> It is using Netty to communicate between "client" and "server",
> > >>>>> information
> > >>>>>> about memory addresses for data that is being requested.
> > >>>>>>>
> > >>>>>>> I've never attempted to use the Unsafe API to access off heap
> > >>> memory
> > >>>>>> that has been allocated in one JVM from another JVM but I'm
> > >> assuming
> > >>>> this
> > >>>>>> must be the case in order to claim that the memory is being
> > >> accessed
> > >>>>>> directly without being copied, correct?
> > >>>>>>>
> > >>>>>>> The implication here is huge. If the memory is being directly
> > >>> shared
> > >>>>>> across processes by them being allowed to directly reach into the
> > >>>> direct
> > >>>>>> byte buffers, that's true shared memory. Otherwise, if there's
> > >> copies
> > >>>>> going
> > >>>>>> on, it's less appealing.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Thanks.
> > >>>>>>>
> > >>>>>>> Sent from my iPad
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> Todd Lipcon
> > >>> Software Engineer, Cloudera
> > >>>
> > >>
> >
> >
>

Re: Understanding "shared" memory implications

Posted by Reynold Xin <rx...@databricks.com>.
I always thought Arrow was just an in-memory format, and that it is the
responsibility of whoever wants to use it to carry those responsibilities
out, because depending on workloads, different frameworks might apply it in
very different ways. Otherwise it seems to be doing too much and having too
strong an opinion about data sharing in a format that's primarily about data
sharing.

On Wed, Mar 16, 2016 at 10:03 AM, Corey Nolet <cj...@gmail.com> wrote:

> I've been under the impression that exposing memory to be shared directly
> and not copied WAS, in fact, the responsibility of Arrow. In fact, I read
> this in [1] and this is turned me on to Arrow in the first place.
>
>
> [1]
>
> http://www.datanami.com/2016/02/17/arrow-aims-to-defrag-big-in-memory-analytics/
>
> On Wed, Mar 16, 2016 at 1:00 PM, Julian Hyde <jh...@apache.org> wrote:
>
> > This is all very interesting stuff, but just so we’re clear: it is not
> > Arrow’s responsibility to provide an RPC/IPC/LPC mechanism, nor
> facilities
> > for resource management. If we DID decide to make this Arrow’s
> > responsibility it would overlap with other components which specialize in
> > such stuff.
> >
> >
> >
> > > On Mar 16, 2016, at 9:49 AM, Jacques Nadeau <ja...@apache.org>
> wrote:
> > >
> > > @Todd: agree entirely on prototyping design. My goal is throw out some
> > > ideas and some POC code and then we can explore from there.
> > >
> > > My main thoughts have initially been around lifecycle management. I've
> > done
> > > some work previously where a consistently sized shared buffer using
> mmap
> > > has improved performance. This is more complicated given the
> requirements
> > > for providing collaborative allocation and cross process reference
> > counts.
> > >
> > > With regards to whether this is more generally applicable: I think it
> > could
> > > ultimately be more general but I suggest we focus on the particular
> > > application of moving long-lived arrow record batches between a
> producer
> > > and a consumer initially. Constraining the problems seems like we will
> > get
> > > to something workable sooner. We can abstract to a more general
> solution
> > as
> > > there are other clear requirements.
> > >
> > > With regards to capnproto, I believe they are simply saying when they
> > talk
> > > about zero-copy shared memory that the structure supports that (same as
> > any
> > > memory-layout based design). I don't believe they actually implemented
> a
> > > protocol and multi-language implementation for zero-copy cross process
> > > communication.
> > >
> > > One other note to make here is that my goal here is not just about
> > > performance but also about memory footprint. Being able to have a
> shared
> > > memory protocol that allows multiple tools to interact with the same
> hot
> > > dataset.
> > >
> > > RE: ACL, for the initial focus, I suggest that we consider the two
> > sharing
> > > processes are "trusted" and expect the initial Arrow API reference
> > > implementations to manage memory access.
> > >
> > > Regarding other questions that Todd threw out:
> > >
> > > - if you are using an mmapped file in /dev/shm/, how do you make sure
> it
> > > gets cleaned up if the process crashes?
> > >
> > >>> Agreed that it needs to get resolve. If I recall, destruction can be
> > > applied once associated process are attached to memory and this allows
> > the
> > > kernel to recover once all attaching processes are destroyed. If this
> > isn't
> > > enough, then we may very well need a simple  external coordinator.
> > >
> > > - how do you allocate memory to it? there's nothing ensuring that
> > /dev/shm
> > > doesn't swap out if you try to put too much in there, and then your
> > > in-memory super-fast access will basically collapse under swap
> thrashing
> > >
> > >>> Simplest model initially is probably one where we assume a master
> and a
> > > slave. (Ideally negotiated on initial connection.) The master is
> > > responsible for allocating memory and giving that to the slave. The
> > master
> > > then is responsible for managing reasonable memory allocation limits
> just
> > > like any other. Slaves that need to allocated memory must ask the
> master
> > > (at whatever chunk makes sense) and will get rejected if they are too
> > > aggressive. (this probably means that at any point an IPC can fall back
> > to
> > > RPC??)
> > >
> > > - how do you do lifecycle management across the two processes? If, say,
> > > Kudu wants to pass a block of data to some Python program, how does it
> > know
> > > when the Python program is done reading it and it should be deleted?
> What
> > > if the python program crashed in the middle - when can Kudu release it?
> > >
> > >>> My thinking, as mentioned earlier, is a shared reference count model
> > for
> > > complex situations. Possibly a "request/response" ownership model for
> > > simpler cases.
> > >
> > > - how do you do security? If both sides of the connection don't trust
> > each
> > > other, and use length prefixes and offsets, you have to be constantly
> > > validating and re-validating everything you read.
> > >
> > > I'm suggesting that we start with trusting so we don't get too wrapped
> up
> > > in all the extra complexities of security. My experience with these
> > things
> > > is that a lot of users will frequently pick performance or footprint
> over
> > > security for quite some time. For example, if I recall correctly, on
> the
> > > shared file descriptor model that was initially implemented in the HDFS
> > > client, that people used short-circuit reads for years before security
> > was
> > > correctly implemented. (Am I remembering this right?)
> > >
> > > Lastly, as I mentioned above, I don't think there should be any
> > requirement
> > > that Arrow communication be limited to only 'IPC'. As Todd points out,
> in
> > > many cases unix domain sockets will be just fine.
> > >
> > > We need to implement both models because we all know that locality will
> > > never be guaranteed. The IPC design/implementation needs to be good for
> > > anything to make into arrow.
> > >
> > > thanks
> > > Jacques
> > >
> > >
> > >
> > > On Wed, Mar 16, 2016 at 8:54 AM, Zhe Zhang <zh...@apache.org> wrote:
> > >
> > >> I have similar concerns as Todd stated below. With an mmap-based
> > approach,
> > >> we are treating shared memory objects like files. This brings in all
> > >> filesystem related considerations like ACL and lifecycle mgmt.
> > >>
> > >> Stepping back a little, the shared-memory work isn't really specific
> to
> > >> Arrow. A few questions related to this:
> > >> 1) Has the topic been discussed in the context of protobuf (or other
> IPC
> > >> protocols) before? Seems Cap'n Proto (https://capnproto.org/) has
> > >> zero-copy
> > >> shared memory. I haven't read implementation detail though.
> > >> 2) If the shared-memory work benefits a wide range of protocols,
> should
> > it
> > >> be a generalized and standalone library?
> > >>
> > >> Thanks,
> > >> Zhe
> > >>
> > >> On Tue, Mar 15, 2016 at 8:30 PM Todd Lipcon <to...@cloudera.com>
> wrote:
> > >>
> > >>> Having thought about this quite a bit in the past, I think the
> > mechanics
> > >> of
> > >>> how to share memory are by far the easiest part. The much harder part
> > is
> > >>> the resource management and ownership. Questions like:
> > >>>
> > >>> - if you are using an mmapped file in /dev/shm/, how do you make sure
> > it
> > >>> gets cleaned up if the process crashes?
> > >>> - how do you allocate memory to it? there's nothing ensuring that
> > >> /dev/shm
> > >>> doesn't swap out if you try to put too much in there, and then your
> > >>> in-memory super-fast access will basically collapse under swap
> > thrashing
> > >>> - how do you do lifecycle management across the two processes? If,
> say,
> > >>> Kudu wants to pass a block of data to some Python program, how does
> it
> > >> know
> > >>> when the Python program is done reading it and it should be deleted?
> > What
> > >>> if the python program crashed in the middle - when can Kudu release
> it?
> > >>> - how do you do security? If both sides of the connection don't trust
> > >> each
> > >>> other, and use length prefixes and offsets, you have to be constantly
> > >>> validating and re-validating everything you read.
> > >>>
> > >>> Another big factor is that shared memory is not, in my experience,
> > >>> immediately faster than just copying data over a unix domain socket.
> In
> > >>> particular, the first time you read an mmapped file, you'll end up
> > paying
> > >>> minor page fault overhead on every page. This can be improved with
> > >>> HugePages, but huge page mmaps are not supported yet in current Linux
> > >> (work
> > >>> going on currently to address this). So you're left with hugetlbfs,
> > which
> > >>> involves static allocations and much more pain.
> > >>>
> > >>> All the above is a long way to say: let's make sure we do the write
> > >>> prototyping and up-front design before jumping into code.
> > >>>
> > >>> -Todd
> > >>>
> > >>>
> > >>>
> > >>> On Tue, Mar 15, 2016 at 5:54 PM, Jacques Nadeau <ja...@apache.org>
> > >>> wrote:
> > >>>
> > >>>> @Corey
> > >>>> The POC Steven and Wes are working on is based on MappedBuffer but
> I'm
> > >>>> looking at using netty's fork of tcnative to use shared memory
> > >> directly.
> > >>>>
> > >>>> @Yiannis
> > >>>> We need to have both RPC and a shared memory mechanisms (what I'm
> > >>> inclined
> > >>>> to call IPC but is a specific kind of IPC). The idea is we negotiate
> > >> via
> > >>>> RPC and then if we determine shared locality, we work over shared
> > >> memory
> > >>>> (preferably for both data and control). So the system interacting
> with
> > >>>> HBase in your example would be the one responsible for placing
> > >> collocated
> > >>>> execution to take advantage of IPC.
> > >>>>
> > >>>> How do others feel of my redefinition of IPC to mean the same memory
> > >>> space
> > >>>> communication (either via shared memory or rdma) versus RPC as
> socket
> > >>> based
> > >>>> communication?
> > >>>>
> > >>>>
> > >>>> On Tue, Mar 15, 2016 at 5:38 PM, Corey Nolet <cj...@gmail.com>
> > >> wrote:
> > >>>>
> > >>>>> I was seeing Netty's unsafe classes being used here, not mapped
> byte
> > >>>>> buffer  not sure if that statement is completely correct but I'll
> > >> have
> > >>> to
> > >>>>> dog through the code again to figure that out.
> > >>>>>
> > >>>>> The more I was looking at unsafe, it makes sense why that would be
> > >>>>> used.apparently it's also supposed to be included on Java 9 as a
> > >> first
> > >>>>> class API
> > >>>>> On Mar 15, 2016 7:03 PM, "Wes McKinney" <we...@cloudera.com> wrote:
> > >>>>>
> > >>>>>> My understanding is that you can use java.nio.MappedByteBuffer to
> > >>> work
> > >>>>>> with memory-mapped files as one way to share memory pages between
> > >>> Java
> > >>>>>> (and non-Java) processes without copying.
> > >>>>>>
> > >>>>>> I am hoping that we can reach a POC of zero-copy Arrow memory
> > >> sharing
> > >>>>>> Java-to-Java and Java-to-C++ in the near future. Indeed this will
> > >>> have
> > >>>>>> huge implications once we get it working end to end (for example,
> > >>>>>> receiving memory from a Java process in Python without a heavy
> > >> ser-de
> > >>>>>> step -- it's what we've always dreamed of) and with the metadata
> > >> and
> > >>>>>> shared memory control flow standardized.
> > >>>>>>
> > >>>>>> - Wes
> > >>>>>>
> > >>>>>> On Wed, Mar 9, 2016 at 9:25 PM, Corey J Nolet <cj...@gmail.com>
> > >>>> wrote:
> > >>>>>>> If I understand correctly, Arrow is using Netty underneath which
> > >> is
> > >>>>>> using Sun's Unsafe API in order to allocate direct byte buffers
> off
> > >>>> heap.
> > >>>>>> It is using Netty to communicate between "client" and "server",
> > >>>>> information
> > >>>>>> about memory addresses for data that is being requested.
> > >>>>>>>
> > >>>>>>> I've never attempted to use the Unsafe API to access off heap
> > >>> memory
> > >>>>>> that has been allocated in one JVM from another JVM but I'm
> > >> assuming
> > >>>> this
> > >>>>>> must be the case in order to claim that the memory is being
> > >> accessed
> > >>>>>> directly without being copied, correct?
> > >>>>>>>
> > >>>>>>> The implication here is huge. If the memory is being directly
> > >>> shared
> > >>>>>> across processes by them being allowed to directly reach into the
> > >>>> direct
> > >>>>>> byte buffers, that's true shared memory. Otherwise, if there's
> > >> copies
> > >>>>> going
> > >>>>>> on, it's less appealing.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Thanks.
> > >>>>>>>
> > >>>>>>> Sent from my iPad
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> Todd Lipcon
> > >>> Software Engineer, Cloudera
> > >>>
> > >>
> >
> >
>

Re: Understanding "shared" memory implications

Posted by Corey Nolet <cj...@gmail.com>.
I've been under the impression that exposing memory to be shared directly
and not copied WAS, in fact, the responsibility of Arrow. I read this in [1],
and it is what turned me on to Arrow in the first place.


[1]
http://www.datanami.com/2016/02/17/arrow-aims-to-defrag-big-in-memory-analytics/

On Wed, Mar 16, 2016 at 1:00 PM, Julian Hyde <jh...@apache.org> wrote:

> This is all very interesting stuff, but just so we’re clear: it is not
> Arrow’s responsibility to provide an RPC/IPC/LPC mechanism, nor facilities
> for resource management. If we DID decide to make this Arrow’s
> responsibility it would overlap with other components which specialize in
> such stuff.
>
>
>
> > On Mar 16, 2016, at 9:49 AM, Jacques Nadeau <ja...@apache.org> wrote:
> >
> > @Todd: agree entirely on prototyping design. My goal is throw out some
> > ideas and some POC code and then we can explore from there.
> >
> > My main thoughts have initially been around lifecycle management. I've
> done
> > some work previously where a consistently sized shared buffer using mmap
> > has improved performance. This is more complicated given the requirements
> > for providing collaborative allocation and cross process reference
> counts.
> >
> > With regards to whether this is more generally applicable: I think it
> could
> > ultimately be more general but I suggest we focus on the particular
> > application of moving long-lived arrow record batches between a producer
> > and a consumer initially. Constraining the problems seems like we will
> get
> > to something workable sooner. We can abstract to a more general solution
> as
> > there are other clear requirements.
> >
> > With regards to capnproto, I believe they are simply saying when they
> talk
> > about zero-copy shared memory that the structure supports that (same as
> any
> > memory-layout based design). I don't believe they actually implemented a
> > protocol and multi-language implementation for zero-copy cross process
> > communication.
> >
> > One other note to make here is that my goal here is not just about
> > performance but also about memory footprint. Being able to have a shared
> > memory protocol that allows multiple tools to interact with the same hot
> > dataset.
> >
> > RE: ACL, for the initial focus, I suggest that we consider the two
> sharing
> > processes are "trusted" and expect the initial Arrow API reference
> > implementations to manage memory access.
> >
> > Regarding other questions that Todd threw out:
> >
> > - if you are using an mmapped file in /dev/shm/, how do you make sure it
> > gets cleaned up if the process crashes?
> >
> >>> Agreed that it needs to get resolve. If I recall, destruction can be
> > applied once associated process are attached to memory and this allows
> the
> > kernel to recover once all attaching processes are destroyed. If this
> isn't
> > enough, then we may very well need a simple  external coordinator.
> >
> > - how do you allocate memory to it? there's nothing ensuring that
> /dev/shm
> > doesn't swap out if you try to put too much in there, and then your
> > in-memory super-fast access will basically collapse under swap thrashing
> >
> >>> Simplest model initially is probably one where we assume a master and a
> > slave. (Ideally negotiated on initial connection.) The master is
> > responsible for allocating memory and giving that to the slave. The
> master
> > then is responsible for managing reasonable memory allocation limits just
> > like any other. Slaves that need to allocated memory must ask the master
> > (at whatever chunk makes sense) and will get rejected if they are too
> > aggressive. (this probably means that at any point an IPC can fall back
> to
> > RPC??)
> >
> > - how do you do lifecycle management across the two processes? If, say,
> > Kudu wants to pass a block of data to some Python program, how does it
> know
> > when the Python program is done reading it and it should be deleted? What
> > if the python program crashed in the middle - when can Kudu release it?
> >
> >>> My thinking, as mentioned earlier, is a shared reference count model
> for
> > complex situations. Possibly a "request/response" ownership model for
> > simpler cases.
> >
> > - how do you do security? If both sides of the connection don't trust
> each
> > other, and use length prefixes and offsets, you have to be constantly
> > validating and re-validating everything you read.
> >
> > I'm suggesting that we start with trusting so we don't get too wrapped up
> > in all the extra complexities of security. My experience with these
> things
> > is that a lot of users will frequently pick performance or footprint over
> > security for quite some time. For example, if I recall correctly, on the
> > shared file descriptor model that was initially implemented in the HDFS
> > client, that people used short-circuit reads for years before security
> was
> > correctly implemented. (Am I remembering this right?)
> >
> > Lastly, as I mentioned above, I don't think there should be any
> requirement
> > that Arrow communication be limited to only 'IPC'. As Todd points out, in
> > many cases unix domain sockets will be just fine.
> >
> > We need to implement both models because we all know that locality will
> > never be guaranteed. The IPC design/implementation needs to be good for
> > anything to make into arrow.
> >
> > thanks
> > Jacques
> >
> >
> >
> > On Wed, Mar 16, 2016 at 8:54 AM, Zhe Zhang <zh...@apache.org> wrote:
> >
> >> I have similar concerns as Todd stated below. With an mmap-based
> approach,
> >> we are treating shared memory objects like files. This brings in all
> >> filesystem related considerations like ACL and lifecycle mgmt.
> >>
> >> Stepping back a little, the shared-memory work isn't really specific to
> >> Arrow. A few questions related to this:
> >> 1) Has the topic been discussed in the context of protobuf (or other IPC
> >> protocols) before? Seems Cap'n Proto (https://capnproto.org/) has
> >> zero-copy
> >> shared memory. I haven't read implementation detail though.
> >> 2) If the shared-memory work benefits a wide range of protocols, should
> it
> >> be a generalized and standalone library?
> >>
> >> Thanks,
> >> Zhe
> >>
> >> On Tue, Mar 15, 2016 at 8:30 PM Todd Lipcon <to...@cloudera.com> wrote:
> >>
> >>> Having thought about this quite a bit in the past, I think the
> mechanics
> >> of
> >>> how to share memory are by far the easiest part. The much harder part
> is
> >>> the resource management and ownership. Questions like:
> >>>
> >>> - if you are using an mmapped file in /dev/shm/, how do you make sure
> it
> >>> gets cleaned up if the process crashes?
> >>> - how do you allocate memory to it? there's nothing ensuring that
> >> /dev/shm
> >>> doesn't swap out if you try to put too much in there, and then your
> >>> in-memory super-fast access will basically collapse under swap
> thrashing
> >>> - how do you do lifecycle management across the two processes? If, say,
> >>> Kudu wants to pass a block of data to some Python program, how does it
> >> know
> >>> when the Python program is done reading it and it should be deleted?
> What
> >>> if the python program crashed in the middle - when can Kudu release it?
> >>> - how do you do security? If both sides of the connection don't trust
> >> each
> >>> other, and use length prefixes and offsets, you have to be constantly
> >>> validating and re-validating everything you read.
> >>>
> >>> Another big factor is that shared memory is not, in my experience,
> >>> immediately faster than just copying data over a unix domain socket. In
> >>> particular, the first time you read an mmapped file, you'll end up
> paying
> >>> minor page fault overhead on every page. This can be improved with
> >>> HugePages, but huge page mmaps are not supported yet in current Linux
> >> (work
> >>> going on currently to address this). So you're left with hugetlbfs,
> which
> >>> involves static allocations and much more pain.
> >>>
> >>> All the above is a long way to say: let's make sure we do the write
> >>> prototyping and up-front design before jumping into code.
> >>>
> >>> -Todd
> >>>
> >>>
> >>>
> >>> On Tue, Mar 15, 2016 at 5:54 PM, Jacques Nadeau <ja...@apache.org>
> >>> wrote:
> >>>
> >>>> @Corey
> >>>> The POC Steven and Wes are working on is based on MappedBuffer but I'm
> >>>> looking at using netty's fork of tcnative to use shared memory
> >> directly.
> >>>>
> >>>> @Yiannis
> >>>> We need to have both RPC and a shared memory mechanisms (what I'm
> >>> inclined
> >>>> to call IPC but is a specific kind of IPC). The idea is we negotiate
> >> via
> >>>> RPC and then if we determine shared locality, we work over shared
> >> memory
> >>>> (preferably for both data and control). So the system interacting with
> >>>> HBase in your example would be the one responsible for placing
> >> collocated
> >>>> execution to take advantage of IPC.
> >>>>
> >>>> How do others feel of my redefinition of IPC to mean the same memory
> >>> space
> >>>> communication (either via shared memory or rdma) versus RPC as socket
> >>> based
> >>>> communication?
> >>>>
> >>>>
> >>>> On Tue, Mar 15, 2016 at 5:38 PM, Corey Nolet <cj...@gmail.com>
> >> wrote:
> >>>>
> >>>>> I was seeing Netty's unsafe classes being used here, not mapped byte
> >>>>> buffer  not sure if that statement is completely correct but I'll
> >> have
> >>> to
> >>>>> dog through the code again to figure that out.
> >>>>>
> >>>>> The more I was looking at unsafe, it makes sense why that would be
> >>>>> used.apparently it's also supposed to be included on Java 9 as a
> >> first
> >>>>> class API
> >>>>> On Mar 15, 2016 7:03 PM, "Wes McKinney" <we...@cloudera.com> wrote:
> >>>>>
> >>>>>> My understanding is that you can use java.nio.MappedByteBuffer to
> >>> work
> >>>>>> with memory-mapped files as one way to share memory pages between
> >>> Java
> >>>>>> (and non-Java) processes without copying.
> >>>>>>
> >>>>>> I am hoping that we can reach a POC of zero-copy Arrow memory
> >> sharing
> >>>>>> Java-to-Java and Java-to-C++ in the near future. Indeed this will
> >>> have
> >>>>>> huge implications once we get it working end to end (for example,
> >>>>>> receiving memory from a Java process in Python without a heavy
> >> ser-de
> >>>>>> step -- it's what we've always dreamed of) and with the metadata
> >> and
> >>>>>> shared memory control flow standardized.
> >>>>>>
> >>>>>> - Wes
> >>>>>>
> >>>>>> On Wed, Mar 9, 2016 at 9:25 PM, Corey J Nolet <cj...@gmail.com>
> >>>> wrote:
> >>>>>>> If I understand correctly, Arrow is using Netty underneath which
> >> is
> >>>>>> using Sun's Unsafe API in order to allocate direct byte buffers off
> >>>> heap.
> >>>>>> It is using Netty to communicate between "client" and "server",
> >>>>> information
> >>>>>> about memory addresses for data that is being requested.
> >>>>>>>
> >>>>>>> I've never attempted to use the Unsafe API to access off heap
> >>> memory
> >>>>>> that has been allocated in one JVM from another JVM but I'm
> >> assuming
> >>>> this
> >>>>>> must be the case in order to claim that the memory is being
> >> accessed
> >>>>>> directly without being copied, correct?
> >>>>>>>
> >>>>>>> The implication here is huge. If the memory is being directly
> >>> shared
> >>>>>> across processes by them being allowed to directly reach into the
> >>>> direct
> >>>>>> byte buffers, that's true shared memory. Otherwise, if there's
> >> copies
> >>>>> going
> >>>>>> on, it's less appealing.
> >>>>>>>
> >>>>>>>
> >>>>>>> Thanks.
> >>>>>>>
> >>>>>>> Sent from my iPad
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Todd Lipcon
> >>> Software Engineer, Cloudera
> >>>
> >>
>
>

Re: Understanding "shared" memory implications

Posted by Julian Hyde <jh...@apache.org>.
This is all very interesting stuff, but just so we’re clear: it is not Arrow’s responsibility to provide an RPC/IPC/LPC mechanism, nor facilities for resource management. If we DID decide to make this Arrow’s responsibility, it would overlap with other components that specialize in such things.



> On Mar 16, 2016, at 9:49 AM, Jacques Nadeau <ja...@apache.org> wrote:
> 
> @Todd: agree entirely on prototyping design. My goal is throw out some
> ideas and some POC code and then we can explore from there.
> 
> My main thoughts have initially been around lifecycle management. I've done
> some work previously where a consistently sized shared buffer using mmap
> has improved performance. This is more complicated given the requirements
> for providing collaborative allocation and cross process reference counts.
> 
> With regards to whether this is more generally applicable: I think it could
> ultimately be more general but I suggest we focus on the particular
> application of moving long-lived arrow record batches between a producer
> and a consumer initially. Constraining the problems seems like we will get
> to something workable sooner. We can abstract to a more general solution as
> there are other clear requirements.
> 
> With regards to capnproto, I believe they are simply saying when they talk
> about zero-copy shared memory that the structure supports that (same as any
> memory-layout based design). I don't believe they actually implemented a
> protocol and multi-language implementation for zero-copy cross process
> communication.
> 
> One other note to make here is that my goal here is not just about
> performance but also about memory footprint. Being able to have a shared
> memory protocol that allows multiple tools to interact with the same hot
> dataset.
> 
> RE: ACL, for the initial focus, I suggest that we consider the two sharing
> processes are "trusted" and expect the initial Arrow API reference
> implementations to manage memory access.
> 
> Regarding other questions that Todd threw out:
> 
> - if you are using an mmapped file in /dev/shm/, how do you make sure it
> gets cleaned up if the process crashes?
> 
>>> Agreed that it needs to get resolve. If I recall, destruction can be
> applied once associated process are attached to memory and this allows the
> kernel to recover once all attaching processes are destroyed. If this isn't
> enough, then we may very well need a simple  external coordinator.
> 
> - how do you allocate memory to it? there's nothing ensuring that /dev/shm
> doesn't swap out if you try to put too much in there, and then your
> in-memory super-fast access will basically collapse under swap thrashing
> 
>>> Simplest model initially is probably one where we assume a master and a
> slave. (Ideally negotiated on initial connection.) The master is
> responsible for allocating memory and giving that to the slave. The master
> then is responsible for managing reasonable memory allocation limits just
> like any other. Slaves that need to allocated memory must ask the master
> (at whatever chunk makes sense) and will get rejected if they are too
> aggressive. (this probably means that at any point an IPC can fall back to
> RPC??)
> 
> - how do you do lifecycle management across the two processes? If, say,
> Kudu wants to pass a block of data to some Python program, how does it know
> when the Python program is done reading it and it should be deleted? What
> if the python program crashed in the middle - when can Kudu release it?
> 
>>> My thinking, as mentioned earlier, is a shared reference count model for
> complex situations. Possibly a "request/response" ownership model for
> simpler cases.
> 
> - how do you do security? If both sides of the connection don't trust each
> other, and use length prefixes and offsets, you have to be constantly
> validating and re-validating everything you read.
> 
> I'm suggesting that we start with trusting so we don't get too wrapped up
> in all the extra complexities of security. My experience with these things
> is that a lot of users will frequently pick performance or footprint over
> security for quite some time. For example, if I recall correctly, on the
> shared file descriptor model that was initially implemented in the HDFS
> client, that people used short-circuit reads for years before security was
> correctly implemented. (Am I remembering this right?)
> 
> Lastly, as I mentioned above, I don't think there should be any requirement
> that Arrow communication be limited to only 'IPC'. As Todd points out, in
> many cases unix domain sockets will be just fine.
> 
> We need to implement both models because we all know that locality will
> never be guaranteed. The IPC design/implementation needs to be good for
> anything to make into arrow.
> 
> thanks
> Jacques
> 
> 
> 
> On Wed, Mar 16, 2016 at 8:54 AM, Zhe Zhang <zh...@apache.org> wrote:
> 
>> I have similar concerns as Todd stated below. With an mmap-based approach,
>> we are treating shared memory objects like files. This brings in all
>> filesystem related considerations like ACL and lifecycle mgmt.
>> 
>> Stepping back a little, the shared-memory work isn't really specific to
>> Arrow. A few questions related to this:
>> 1) Has the topic been discussed in the context of protobuf (or other IPC
>> protocols) before? Seems Cap'n Proto (https://capnproto.org/) has
>> zero-copy
>> shared memory. I haven't read implementation detail though.
>> 2) If the shared-memory work benefits a wide range of protocols, should it
>> be a generalized and standalone library?
>> 
>> Thanks,
>> Zhe
>> 
>> On Tue, Mar 15, 2016 at 8:30 PM Todd Lipcon <to...@cloudera.com> wrote:
>> 
>>> Having thought about this quite a bit in the past, I think the mechanics
>> of
>>> how to share memory are by far the easiest part. The much harder part is
>>> the resource management and ownership. Questions like:
>>> 
>>> - if you are using an mmapped file in /dev/shm/, how do you make sure it
>>> gets cleaned up if the process crashes?
>>> - how do you allocate memory to it? there's nothing ensuring that
>> /dev/shm
>>> doesn't swap out if you try to put too much in there, and then your
>>> in-memory super-fast access will basically collapse under swap thrashing
>>> - how do you do lifecycle management across the two processes? If, say,
>>> Kudu wants to pass a block of data to some Python program, how does it
>> know
>>> when the Python program is done reading it and it should be deleted? What
>>> if the python program crashed in the middle - when can Kudu release it?
>>> - how do you do security? If both sides of the connection don't trust
>> each
>>> other, and use length prefixes and offsets, you have to be constantly
>>> validating and re-validating everything you read.
>>> 
>>> Another big factor is that shared memory is not, in my experience,
>>> immediately faster than just copying data over a unix domain socket. In
>>> particular, the first time you read an mmapped file, you'll end up paying
>>> minor page fault overhead on every page. This can be improved with
>>> HugePages, but huge page mmaps are not supported yet in current Linux
>> (work
>>> going on currently to address this). So you're left with hugetlbfs, which
>>> involves static allocations and much more pain.
>>> 
>>> All the above is a long way to say: let's make sure we do the write
>>> prototyping and up-front design before jumping into code.
>>> 
>>> -Todd
>>> 
>>> 
>>> 
>>> On Tue, Mar 15, 2016 at 5:54 PM, Jacques Nadeau <ja...@apache.org>
>>> wrote:
>>> 
>>>> @Corey
>>>> The POC Steven and Wes are working on is based on MappedBuffer but I'm
>>>> looking at using netty's fork of tcnative to use shared memory
>> directly.
>>>> 
>>>> @Yiannis
>>>> We need to have both RPC and a shared memory mechanisms (what I'm
>>> inclined
>>>> to call IPC but is a specific kind of IPC). The idea is we negotiate
>> via
>>>> RPC and then if we determine shared locality, we work over shared
>> memory
>>>> (preferably for both data and control). So the system interacting with
>>>> HBase in your example would be the one responsible for placing
>> collocated
>>>> execution to take advantage of IPC.
>>>> 
>>>> How do others feel of my redefinition of IPC to mean the same memory
>>> space
>>>> communication (either via shared memory or rdma) versus RPC as socket
>>> based
>>>> communication?
>>>> 
>>>> 
>>>> On Tue, Mar 15, 2016 at 5:38 PM, Corey Nolet <cj...@gmail.com>
>> wrote:
>>>> 
>>>>> I was seeing Netty's unsafe classes being used here, not mapped byte
>>>>> buffer  not sure if that statement is completely correct but I'll
>> have
>>> to
>>>>> dog through the code again to figure that out.
>>>>> 
>>>>> The more I was looking at unsafe, it makes sense why that would be
>>>>> used.apparently it's also supposed to be included on Java 9 as a
>> first
>>>>> class API
>>>>> On Mar 15, 2016 7:03 PM, "Wes McKinney" <we...@cloudera.com> wrote:
>>>>> 
>>>>>> My understanding is that you can use java.nio.MappedByteBuffer to
>>> work
>>>>>> with memory-mapped files as one way to share memory pages between
>>> Java
>>>>>> (and non-Java) processes without copying.
>>>>>> 
>>>>>> I am hoping that we can reach a POC of zero-copy Arrow memory
>> sharing
>>>>>> Java-to-Java and Java-to-C++ in the near future. Indeed this will
>>> have
>>>>>> huge implications once we get it working end to end (for example,
>>>>>> receiving memory from a Java process in Python without a heavy
>> ser-de
>>>>>> step -- it's what we've always dreamed of) and with the metadata
>> and
>>>>>> shared memory control flow standardized.
>>>>>> 
>>>>>> - Wes
>>>>>> 
>>>>>> On Wed, Mar 9, 2016 at 9:25 PM, Corey J Nolet <cj...@gmail.com>
>>>> wrote:
>>>>>>> If I understand correctly, Arrow is using Netty underneath which
>> is
>>>>>> using Sun's Unsafe API in order to allocate direct byte buffers off
>>>> heap.
>>>>>> It is using Netty to communicate between "client" and "server",
>>>>> information
>>>>>> about memory addresses for data that is being requested.
>>>>>>> 
>>>>>>> I've never attempted to use the Unsafe API to access off heap
>>> memory
>>>>>> that has been allocated in one JVM from another JVM but I'm
>> assuming
>>>> this
>>>>>> must be the case in order to claim that the memory is being
>> accessed
>>>>>> directly without being copied, correct?
>>>>>>> 
>>>>>>> The implication here is huge. If the memory is being directly
>>> shared
>>>>>> across processes by them being allowed to directly reach into the
>>>> direct
>>>>>> byte buffers, that's true shared memory. Otherwise, if there's
>> copies
>>>>> going
>>>>>> on, it's less appealing.
>>>>>>> 
>>>>>>> 
>>>>>>> Thanks.
>>>>>>> 
>>>>>>> Sent from my iPad
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>> 
>> 


Re: Understanding "shared" memory implications

Posted by Jacques Nadeau <ja...@apache.org>.
@Todd: I agree entirely on prototyping and up-front design. My goal is to
throw out some ideas and some POC code, and then we can explore from there.

My main thoughts have initially been around lifecycle management. I've
previously done some work where a consistently sized shared buffer using mmap
improved performance. This is more complicated here given the requirements
for collaborative allocation and cross-process reference counts.

With regards to whether this is more generally applicable: I think it could
ultimately be more general, but I suggest we initially focus on the particular
application of moving long-lived Arrow record batches between a producer and
a consumer. Constraining the problem should get us to something workable
sooner. We can abstract to a more general solution as other clear
requirements emerge.

With regards to capnproto, I believe that when they talk about zero-copy
shared memory they are simply saying that the structure supports it (the same
as any memory-layout based design). I don't believe they actually implemented
a protocol and multi-language implementation for zero-copy cross-process
communication.

One other note is that my goal here is not just about performance but also
about memory footprint: being able to have a shared memory protocol that
allows multiple tools to interact with the same hot dataset.

RE: ACLs, for the initial focus I suggest that we consider the two sharing
processes to be "trusted" and expect the initial Arrow API reference
implementations to manage memory access.

Regarding other questions that Todd threw out:

- if you are using an mmapped file in /dev/shm/, how do you make sure it
gets cleaned up if the process crashes?

>> Agreed that it needs to get resolved. If I recall, the file can be
unlinked once the associated processes are attached to the memory, and this
allows the kernel to recover it once all attaching processes have exited. If
this isn't enough, then we may very well need a simple external coordinator.
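
As a rough illustration of that unlink-after-map idea, here is a minimal Java
sketch (not Arrow code; the /dev/shm path and region size are made up, and it
assumes Linux semantics where a deleted file's pages survive until the last
mapping goes away):

    import java.io.File;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class ShmUnlinkSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical path and size; a real allocator would negotiate both.
            File shmFile = new File("/dev/shm/arrow-poc-batch");
            long size = 64L * 1024 * 1024;

            try (RandomAccessFile raf = new RandomAccessFile(shmFile, "rw")) {
                raf.setLength(size);
                MappedByteBuffer region =
                    raf.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, size);

                // Once the cooperating processes have mapped the region, the path
                // can be unlinked. Existing mappings stay valid; the kernel frees
                // the pages only after the last mapping disappears, even if a
                // process crashes.
                if (!shmFile.delete()) {
                    System.err.println("unlink failed; an external coordinator would clean up");
                }

                region.putInt(0, 42);                 // write through the shared pages
                System.out.println(region.getInt(0)); // read it back
            }
        }
    }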

- how do you allocate memory to it? there's nothing ensuring that /dev/shm
doesn't swap out if you try to put too much in there, and then your
in-memory super-fast access will basically collapse under swap thrashing

>> The simplest model initially is probably one where we assume a master and
a slave (ideally negotiated on the initial connection). The master is
responsible for allocating memory and giving it to the slave. The master is
then responsible for managing reasonable memory allocation limits, just like
any other allocator. Slaves that need to allocate memory must ask the master
(at whatever chunk size makes sense) and will get rejected if they are too
aggressive. (This probably means that at any point IPC can fall back to
RPC?)
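
To make that negotiation concrete, a hypothetical interface shape (just a
sketch of the master/slave contract above, not an Arrow API; all names are
invented) might look like:

    /**
     * Hypothetical sketch only -- not an Arrow API. The slave asks the master
     * for a slice of the shared region and must be prepared to be refused, in
     * which case it falls back to copying the data over RPC.
     */
    public interface SharedMemoryMaster {

        /** A granted slice of the shared mapping. */
        final class Chunk {
            public final long offset;   // offset into the shared mapping
            public final long length;   // granted length in bytes

            public Chunk(long offset, long length) {
                this.offset = offset;
                this.length = length;
            }
        }

        /** Grant a chunk of at most maxBytes, or null if the caller is over its limit. */
        Chunk requestChunk(long maxBytes);

        /** Return a previously granted chunk so the master can reuse or unmap it. */
        void release(Chunk chunk);
    }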

- how do you do lifecycle management across the two processes? If, say,
Kudu wants to pass a block of data to some Python program, how does it know
when the Python program is done reading it and it should be deleted? What
if the python program crashed in the middle - when can Kudu release it?

>> My thinking, as mentioned earlier, is a shared reference count model for
complex situations, and possibly a "request/response" ownership model for
simpler cases.
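
One way to picture the shared reference count is a counter kept at a fixed
offset inside the mapping itself. A rough Java 8 sketch follows (not Arrow
code; the class and layout are made up, and it leans on sun.misc.Unsafe
because MappedByteBuffer has no cross-process atomics; crashed holders would
still need separate handling):

    import java.lang.reflect.Field;
    import java.nio.MappedByteBuffer;

    /** Rough sketch of a cross-process reference count stored in the first
     *  four bytes of a shared mapping. Assumes both processes map the same
     *  file and agree on this layout. */
    public class SharedRefCountSketch {
        private static final sun.misc.Unsafe UNSAFE = loadUnsafe();

        private final long countAddress;

        public SharedRefCountSketch(MappedByteBuffer region) {
            // MappedByteBuffer is a direct buffer, so it exposes its native address.
            this.countAddress = ((sun.nio.ch.DirectBuffer) region).address();
        }

        /** Increment the count; the batch must not be reclaimed while it is > 0. */
        public int retain() {
            while (true) {
                int current = UNSAFE.getIntVolatile(null, countAddress);
                if (UNSAFE.compareAndSwapInt(null, countAddress, current, current + 1)) {
                    return current + 1;
                }
            }
        }

        /** Decrement the count; a return value of 0 means the producer may reclaim. */
        public int release() {
            while (true) {
                int current = UNSAFE.getIntVolatile(null, countAddress);
                if (UNSAFE.compareAndSwapInt(null, countAddress, current, current - 1)) {
                    return current - 1;
                }
            }
        }

        private static sun.misc.Unsafe loadUnsafe() {
            try {
                Field f = sun.misc.Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                return (sun.misc.Unsafe) f.get(null);
            } catch (ReflectiveOperationException e) {
                throw new AssertionError(e);
            }
        }
    }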

- how do you do security? If both sides of the connection don't trust each
other, and use length prefixes and offsets, you have to be constantly
validating and re-validating everything you read.

I'm suggesting that we start with trusted processes so we don't get too
wrapped up in all the extra complexities of security. My experience with
these things is that a lot of users will pick performance or footprint over
security for quite some time. For example, if I recall correctly, with the
shared file descriptor model that was initially implemented in the HDFS
client, people used short-circuit reads for years before security was
correctly implemented. (Am I remembering this right?)

Lastly, as I mentioned above, I don't think there should be any requirement
that Arrow communication be limited to only 'IPC'. As Todd points out, in
many cases unix domain sockets will be just fine.

We need to implement both models because we all know that locality will
never be guaranteed. The IPC design/implementation needs to be good for
anything to make it into Arrow.

thanks
Jacques



On Wed, Mar 16, 2016 at 8:54 AM, Zhe Zhang <zh...@apache.org> wrote:

> I have similar concerns as Todd stated below. With an mmap-based approach,
> we are treating shared memory objects like files. This brings in all
> filesystem related considerations like ACL and lifecycle mgmt.
>
> Stepping back a little, the shared-memory work isn't really specific to
> Arrow. A few questions related to this:
> 1) Has the topic been discussed in the context of protobuf (or other IPC
> protocols) before? Seems Cap'n Proto (https://capnproto.org/) has
> zero-copy
> shared memory. I haven't read implementation detail though.
> 2) If the shared-memory work benefits a wide range of protocols, should it
> be a generalized and standalone library?
>
> Thanks,
> Zhe
>
> On Tue, Mar 15, 2016 at 8:30 PM Todd Lipcon <to...@cloudera.com> wrote:
>
> > Having thought about this quite a bit in the past, I think the mechanics
> of
> > how to share memory are by far the easiest part. The much harder part is
> > the resource management and ownership. Questions like:
> >
> > - if you are using an mmapped file in /dev/shm/, how do you make sure it
> > gets cleaned up if the process crashes?
> > - how do you allocate memory to it? there's nothing ensuring that
> /dev/shm
> > doesn't swap out if you try to put too much in there, and then your
> > in-memory super-fast access will basically collapse under swap thrashing
> > - how do you do lifecycle management across the two processes? If, say,
> > Kudu wants to pass a block of data to some Python program, how does it
> know
> > when the Python program is done reading it and it should be deleted? What
> > if the python program crashed in the middle - when can Kudu release it?
> > - how do you do security? If both sides of the connection don't trust
> each
> > other, and use length prefixes and offsets, you have to be constantly
> > validating and re-validating everything you read.
> >
> > Another big factor is that shared memory is not, in my experience,
> > immediately faster than just copying data over a unix domain socket. In
> > particular, the first time you read an mmapped file, you'll end up paying
> > minor page fault overhead on every page. This can be improved with
> > HugePages, but huge page mmaps are not supported yet in current Linux
> (work
> > going on currently to address this). So you're left with hugetlbfs, which
> > involves static allocations and much more pain.
> >
> > All the above is a long way to say: let's make sure we do the write
> > prototyping and up-front design before jumping into code.
> >
> > -Todd
> >
> >
> >
> > On Tue, Mar 15, 2016 at 5:54 PM, Jacques Nadeau <ja...@apache.org>
> > wrote:
> >
> > > @Corey
> > > The POC Steven and Wes are working on is based on MappedBuffer but I'm
> > > looking at using netty's fork of tcnative to use shared memory
> directly.
> > >
> > > @Yiannis
> > > We need to have both RPC and a shared memory mechanisms (what I'm
> > inclined
> > > to call IPC but is a specific kind of IPC). The idea is we negotiate
> via
> > > RPC and then if we determine shared locality, we work over shared
> memory
> > > (preferably for both data and control). So the system interacting with
> > > HBase in your example would be the one responsible for placing
> collocated
> > > execution to take advantage of IPC.
> > >
> > > How do others feel of my redefinition of IPC to mean the same memory
> > space
> > > communication (either via shared memory or rdma) versus RPC as socket
> > based
> > > communication?
> > >
> > >
> > > On Tue, Mar 15, 2016 at 5:38 PM, Corey Nolet <cj...@gmail.com>
> wrote:
> > >
> > > > I was seeing Netty's unsafe classes being used here, not mapped byte
> > > > buffer  not sure if that statement is completely correct but I'll
> have
> > to
> > > > dog through the code again to figure that out.
> > > >
> > > > The more I was looking at unsafe, it makes sense why that would be
> > > > used.apparently it's also supposed to be included on Java 9 as a
> first
> > > > class API
> > > > On Mar 15, 2016 7:03 PM, "Wes McKinney" <we...@cloudera.com> wrote:
> > > >
> > > > > My understanding is that you can use java.nio.MappedByteBuffer to
> > work
> > > > > with memory-mapped files as one way to share memory pages between
> > Java
> > > > > (and non-Java) processes without copying.
> > > > >
> > > > > I am hoping that we can reach a POC of zero-copy Arrow memory
> sharing
> > > > > Java-to-Java and Java-to-C++ in the near future. Indeed this will
> > have
> > > > > huge implications once we get it working end to end (for example,
> > > > > receiving memory from a Java process in Python without a heavy
> ser-de
> > > > > step -- it's what we've always dreamed of) and with the metadata
> and
> > > > > shared memory control flow standardized.
> > > > >
> > > > > - Wes
> > > > >
> > > > > On Wed, Mar 9, 2016 at 9:25 PM, Corey J Nolet <cj...@gmail.com>
> > > wrote:
> > > > > > If I understand correctly, Arrow is using Netty underneath which
> is
> > > > > using Sun's Unsafe API in order to allocate direct byte buffers off
> > > heap.
> > > > > It is using Netty to communicate between "client" and "server",
> > > > information
> > > > > about memory addresses for data that is being requested.
> > > > > >
> > > > > > I've never attempted to use the Unsafe API to access off heap
> > memory
> > > > > that has been allocated in one JVM from another JVM but I'm
> assuming
> > > this
> > > > > must be the case in order to claim that the memory is being
> accessed
> > > > > directly without being copied, correct?
> > > > > >
> > > > > > The implication here is huge. If the memory is being directly
> > shared
> > > > > across processes by them being allowed to directly reach into the
> > > direct
> > > > > byte buffers, that's true shared memory. Otherwise, if there's
> copies
> > > > going
> > > > > on, it's less appealing.
> > > > > >
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > Sent from my iPad
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>

Re: Understanding "shared" memory implications

Posted by Zhe Zhang <zh...@apache.org>.
I have similar concerns to those Todd stated below. With an mmap-based
approach, we are treating shared memory objects like files, which brings in
all the filesystem-related considerations like ACLs and lifecycle management.
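
(For illustration, here is a rough Java sketch of the filesystem-level handling an
mmap-based design pulls in on the producer side; the segment name, size, and class
name below are made up for this example, not anything from the Arrow code.)

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.*;
    import java.nio.file.attribute.PosixFilePermissions;
    import static java.nio.file.StandardOpenOption.*;

    // Sketch only: a producer creating a shared segment under /dev/shm has to
    // pick filesystem ACLs up front and own the unlink/cleanup story itself.
    public class ShmProducerSketch {
      public static void main(String[] args) throws IOException {
        Path seg = Paths.get("/dev/shm/arrow-demo-segment");   // hypothetical name
        Files.deleteIfExists(seg);
        Files.createFile(seg, PosixFilePermissions.asFileAttribute(
            PosixFilePermissions.fromString("rw-------")));    // who may map this?
        try (FileChannel ch = FileChannel.open(seg, READ, WRITE)) {
          MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 1 << 20);
          buf.putInt(0, 42);  // producer writes; a consumer maps the same path
        }
        // Lifecycle question: when is it safe to Files.delete(seg)? That needs
        // explicit coordination -- exactly the management burden mentioned above.
      }
    }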

Stepping back a little, the shared-memory work isn't really specific to
Arrow. A few questions related to this:
1) Has the topic been discussed in the context of protobuf (or other IPC
protocols) before? It seems Cap'n Proto (https://capnproto.org/) has zero-copy
shared memory; I haven't read the implementation details, though.
2) If the shared-memory work benefits a wide range of protocols, should it
be a generalized, standalone library?

Thanks,
Zhe

On Tue, Mar 15, 2016 at 8:30 PM Todd Lipcon <to...@cloudera.com> wrote:

> Having thought about this quite a bit in the past, I think the mechanics of
> how to share memory are by far the easiest part. The much harder part is
> the resource management and ownership. Questions like:
>
> - if you are using an mmapped file in /dev/shm/, how do you make sure it
> gets cleaned up if the process crashes?
> - how do you allocate memory to it? there's nothing ensuring that /dev/shm
> doesn't swap out if you try to put too much in there, and then your
> in-memory super-fast access will basically collapse under swap thrashing
> - how do you do lifecycle management across the two processes? If, say,
> Kudu wants to pass a block of data to some Python program, how does it know
> when the Python program is done reading it and it should be deleted? What
> if the python program crashed in the middle - when can Kudu release it?
> - how do you do security? If both sides of the connection don't trust each
> other, and use length prefixes and offsets, you have to be constantly
> validating and re-validating everything you read.
>
> Another big factor is that shared memory is not, in my experience,
> immediately faster than just copying data over a unix domain socket. In
> particular, the first time you read an mmapped file, you'll end up paying
> minor page fault overhead on every page. This can be improved with
> HugePages, but huge page mmaps are not supported yet in current Linux (work
> going on currently to address this). So you're left with hugetlbfs, which
> involves static allocations and much more pain.
>
> All the above is a long way to say: let's make sure we do the write
> prototyping and up-front design before jumping into code.
>
> -Todd
>
>
>
> On Tue, Mar 15, 2016 at 5:54 PM, Jacques Nadeau <ja...@apache.org>
> wrote:
>
> > @Corey
> > The POC Steven and Wes are working on is based on MappedBuffer but I'm
> > looking at using netty's fork of tcnative to use shared memory directly.
> >
> > @Yiannis
> > We need to have both RPC and a shared memory mechanisms (what I'm
> inclined
> > to call IPC but is a specific kind of IPC). The idea is we negotiate via
> > RPC and then if we determine shared locality, we work over shared memory
> > (preferably for both data and control). So the system interacting with
> > HBase in your example would be the one responsible for placing collocated
> > execution to take advantage of IPC.
> >
> > How do others feel of my redefinition of IPC to mean the same memory
> space
> > communication (either via shared memory or rdma) versus RPC as socket
> based
> > communication?
> >
> >
> > On Tue, Mar 15, 2016 at 5:38 PM, Corey Nolet <cj...@gmail.com> wrote:
> >
> > > I was seeing Netty's unsafe classes being used here, not mapped byte
> > > buffer  not sure if that statement is completely correct but I'll have
> to
> > > dog through the code again to figure that out.
> > >
> > > The more I was looking at unsafe, it makes sense why that would be
> > > used.apparently it's also supposed to be included on Java 9 as a first
> > > class API
> > > On Mar 15, 2016 7:03 PM, "Wes McKinney" <we...@cloudera.com> wrote:
> > >
> > > > My understanding is that you can use java.nio.MappedByteBuffer to
> work
> > > > with memory-mapped files as one way to share memory pages between
> Java
> > > > (and non-Java) processes without copying.
> > > >
> > > > I am hoping that we can reach a POC of zero-copy Arrow memory sharing
> > > > Java-to-Java and Java-to-C++ in the near future. Indeed this will
> have
> > > > huge implications once we get it working end to end (for example,
> > > > receiving memory from a Java process in Python without a heavy ser-de
> > > > step -- it's what we've always dreamed of) and with the metadata and
> > > > shared memory control flow standardized.
> > > >
> > > > - Wes
> > > >
> > > > On Wed, Mar 9, 2016 at 9:25 PM, Corey J Nolet <cj...@gmail.com>
> > wrote:
> > > > > If I understand correctly, Arrow is using Netty underneath which is
> > > > using Sun's Unsafe API in order to allocate direct byte buffers off
> > heap.
> > > > It is using Netty to communicate between "client" and "server",
> > > information
> > > > about memory addresses for data that is being requested.
> > > > >
> > > > > I've never attempted to use the Unsafe API to access off heap
> memory
> > > > that has been allocated in one JVM from another JVM but I'm assuming
> > this
> > > > must be the case in order to claim that the memory is being accessed
> > > > directly without being copied, correct?
> > > > >
> > > > > The implication here is huge. If the memory is being directly
> shared
> > > > across processes by them being allowed to directly reach into the
> > direct
> > > > byte buffers, that's true shared memory. Otherwise, if there's copies
> > > going
> > > > on, it's less appealing.
> > > > >
> > > > >
> > > > > Thanks.
> > > > >
> > > > > Sent from my iPad
> > > >
> > >
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: Understanding "shared" memory implications

Posted by Todd Lipcon <to...@cloudera.com>.
Having thought about this quite a bit in the past, I think the mechanics of
how to share memory are by far the easiest part. The much harder part is
the resource management and ownership. Questions like:

- if you are using an mmapped file in /dev/shm/, how do you make sure it
gets cleaned up if the process crashes?
- how do you allocate memory to it? There's nothing ensuring that /dev/shm
doesn't swap out if you try to put too much in there, and then your
in-memory super-fast access will basically collapse under swap thrashing
- how do you do lifecycle management across the two processes? If, say,
Kudu wants to pass a block of data to some Python program, how does it know
when the Python program is done reading it and it should be deleted? What
if the Python program crashes in the middle - when can Kudu release it?
- how do you do security? If both sides of the connection don't trust each
other, and use length prefixes and offsets, you have to be constantly
validating and re-validating everything you read.

Another big factor is that shared memory is not, in my experience,
immediately faster than just copying data over a unix domain socket. In
particular, the first time you read an mmapped file, you'll end up paying
minor page fault overhead on every page. This can be improved with
HugePages, but huge page mmaps are not supported yet in current Linux (work
going on currently to address this). So you're left with hugetlbfs, which
involves static allocations and much more pain.
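
(One partial mitigation, sketched below purely as an illustration and only ever a
hint to the OS, is to touch the mapping once up front, e.g. via
MappedByteBuffer.load(), so the minor faults are paid before the hot path rather
than during it. The path and layout here are placeholders.)

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import static java.nio.file.StandardOpenOption.READ;

    public class PrefaultSketch {
      public static void main(String[] args) throws IOException {
        // Hypothetical path; assume another process already wrote this segment.
        try (FileChannel ch = FileChannel.open(
                 Paths.get("/dev/shm/arrow-demo-segment"), READ)) {
          MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
          buf.load();           // best-effort: fault the pages in now, not on first access
          long sum = 0;
          for (int i = 0; i + 8 <= buf.limit(); i += 8) {
            sum += buf.getLong(i);   // the scan itself should now see few minor faults
          }
          System.out.println(sum);
        }
      }
    }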

All the above is a long way to say: let's make sure we do the right
prototyping and up-front design before jumping into code.

-Todd



On Tue, Mar 15, 2016 at 5:54 PM, Jacques Nadeau <ja...@apache.org> wrote:

> @Corey
> The POC Steven and Wes are working on is based on MappedBuffer but I'm
> looking at using netty's fork of tcnative to use shared memory directly.
>
> @Yiannis
> We need to have both RPC and a shared memory mechanisms (what I'm inclined
> to call IPC but is a specific kind of IPC). The idea is we negotiate via
> RPC and then if we determine shared locality, we work over shared memory
> (preferably for both data and control). So the system interacting with
> HBase in your example would be the one responsible for placing collocated
> execution to take advantage of IPC.
>
> How do others feel of my redefinition of IPC to mean the same memory space
> communication (either via shared memory or rdma) versus RPC as socket based
> communication?
>
>
> On Tue, Mar 15, 2016 at 5:38 PM, Corey Nolet <cj...@gmail.com> wrote:
>
> > I was seeing Netty's unsafe classes being used here, not mapped byte
> > buffer  not sure if that statement is completely correct but I'll have to
> > dog through the code again to figure that out.
> >
> > The more I was looking at unsafe, it makes sense why that would be
> > used.apparently it's also supposed to be included on Java 9 as a first
> > class API
> > On Mar 15, 2016 7:03 PM, "Wes McKinney" <we...@cloudera.com> wrote:
> >
> > > My understanding is that you can use java.nio.MappedByteBuffer to work
> > > with memory-mapped files as one way to share memory pages between Java
> > > (and non-Java) processes without copying.
> > >
> > > I am hoping that we can reach a POC of zero-copy Arrow memory sharing
> > > Java-to-Java and Java-to-C++ in the near future. Indeed this will have
> > > huge implications once we get it working end to end (for example,
> > > receiving memory from a Java process in Python without a heavy ser-de
> > > step -- it's what we've always dreamed of) and with the metadata and
> > > shared memory control flow standardized.
> > >
> > > - Wes
> > >
> > > On Wed, Mar 9, 2016 at 9:25 PM, Corey J Nolet <cj...@gmail.com>
> wrote:
> > > > If I understand correctly, Arrow is using Netty underneath which is
> > > using Sun's Unsafe API in order to allocate direct byte buffers off
> heap.
> > > It is using Netty to communicate between "client" and "server",
> > information
> > > about memory addresses for data that is being requested.
> > > >
> > > > I've never attempted to use the Unsafe API to access off heap memory
> > > that has been allocated in one JVM from another JVM but I'm assuming
> this
> > > must be the case in order to claim that the memory is being accessed
> > > directly without being copied, correct?
> > > >
> > > > The implication here is huge. If the memory is being directly shared
> > > across processes by them being allowed to directly reach into the
> direct
> > > byte buffers, that's true shared memory. Otherwise, if there's copies
> > going
> > > on, it's less appealing.
> > > >
> > > >
> > > > Thanks.
> > > >
> > > > Sent from my iPad
> > >
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Understanding "shared" memory implications

Posted by Raymond Tay <rt...@oa-labs.com>.
@Jacques,

I’m not sure whether past/current literature has a proper term for this specific kind of IPC;
having one would be very helpful for orienting present and future people on the codebase moving forward.
But we don’t have to be pedantic right now … we can always revisit the issue when the right time approaches…

Thnx
Raymond Tay



-----Original Message-----
From: Jacques Nadeau <ja...@apache.org>
Reply-To: "dev@arrow.apache.org" <de...@arrow.apache.org>
Date: Wednesday, 16 March 2016 at 8:54 AM
To: "dev@arrow.apache.org" <de...@arrow.apache.org>
Subject: Re: Understanding "shared" memory implications

>@Corey
>The POC Steven and Wes are working on is based on MappedBuffer but I'm
>looking at using netty's fork of tcnative to use shared memory directly.
>
>@Yiannis
>We need to have both RPC and a shared memory mechanisms (what I'm inclined
>to call IPC but is a specific kind of IPC). The idea is we negotiate via
>RPC and then if we determine shared locality, we work over shared memory
>(preferably for both data and control). So the system interacting with
>HBase in your example would be the one responsible for placing collocated
>execution to take advantage of IPC.
>
>How do others feel of my redefinition of IPC to mean the same memory space
>communication (either via shared memory or rdma) versus RPC as socket based
>communication?
>
>
>On Tue, Mar 15, 2016 at 5:38 PM, Corey Nolet <cj...@gmail.com> wrote:
>
>> I was seeing Netty's unsafe classes being used here, not mapped byte
>> buffer  not sure if that statement is completely correct but I'll have to
>> dog through the code again to figure that out.
>>
>> The more I was looking at unsafe, it makes sense why that would be
>> used.apparently it's also supposed to be included on Java 9 as a first
>> class API
>> On Mar 15, 2016 7:03 PM, "Wes McKinney" <we...@cloudera.com> wrote:
>>
>> > My understanding is that you can use java.nio.MappedByteBuffer to work
>> > with memory-mapped files as one way to share memory pages between Java
>> > (and non-Java) processes without copying.
>> >
>> > I am hoping that we can reach a POC of zero-copy Arrow memory sharing
>> > Java-to-Java and Java-to-C++ in the near future. Indeed this will have
>> > huge implications once we get it working end to end (for example,
>> > receiving memory from a Java process in Python without a heavy ser-de
>> > step -- it's what we've always dreamed of) and with the metadata and
>> > shared memory control flow standardized.
>> >
>> > - Wes
>> >
>> > On Wed, Mar 9, 2016 at 9:25 PM, Corey J Nolet <cj...@gmail.com> wrote:
>> > > If I understand correctly, Arrow is using Netty underneath which is
>> > using Sun's Unsafe API in order to allocate direct byte buffers off heap.
>> > It is using Netty to communicate between "client" and "server",
>> information
>> > about memory addresses for data that is being requested.
>> > >
>> > > I've never attempted to use the Unsafe API to access off heap memory
>> > that has been allocated in one JVM from another JVM but I'm assuming this
>> > must be the case in order to claim that the memory is being accessed
>> > directly without being copied, correct?
>> > >
>> > > The implication here is huge. If the memory is being directly shared
>> > across processes by them being allowed to directly reach into the direct
>> > byte buffers, that's true shared memory. Otherwise, if there's copies
>> going
>> > on, it's less appealing.
>> > >
>> > >
>> > > Thanks.
>> > >
>> > > Sent from my iPad
>> >
>>

Re: Understanding "shared" memory implications

Posted by Leif Walsh <le...@gmail.com>.
Seems to me IPC/LPC/RPC focuses on the wrong distinction. I think the right
one is between asynchronous message-passing (over a socket), where the receiver
decides when to handle the message, and synchronous/direct memory
manipulation (shared mmap, RDMA), where the "client" manipulates the
"server's" (rather, shared) memory directly. In the former case, the server
has more gatekeeper-like control over scheduling, and in the latter, the
server may need to poll the shared memory segment in order to know that a write
has happened.
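
(To make the second mode concrete, here is a toy sketch with a made-up segment
layout, ignoring the memory-ordering care real code would need; the "server" side
ends up doing something like this.)

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import static java.nio.file.StandardOpenOption.*;

    public class PollingConsumerSketch {
      public static void main(String[] args) throws IOException, InterruptedException {
        // Toy layout: int at offset 0 is a "ready" flag, payload follows at offset 4.
        try (FileChannel ch = FileChannel.open(
                 Paths.get("/dev/shm/arrow-demo-segment"), CREATE, READ, WRITE)) {
          MappedByteBuffer seg = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
          while (seg.getInt(0) == 0) {   // nobody tells us a write happened...
            Thread.sleep(1);             // ...so we poll (or park on a socket/eventfd)
          }
          long payload = seg.getLong(4);
          System.out.println("got " + payload);
        }
      }
    }
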
On Wed, Mar 16, 2016 at 09:47 Ted Dunning <te...@gmail.com> wrote:

> On Tue, Mar 15, 2016 at 5:54 PM, Jacques Nadeau <ja...@apache.org>
> wrote:
>
> > How do others feel of my redefinition of IPC to mean the same memory
> space
> > communication (either via shared memory or rdma) versus RPC as socket
> based
> > communication?
> >
>
>
> IPC already has a strong definition which is close to what you want so it
> isn't so strange.
>
> On the other hand, you could coin something like LPC (local process
> communication) to contrast with RPC (remote process communication).
>
-- 
-- 
Cheers,
Leif

Re: Understanding "shared" memory implications

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Mar 15, 2016 at 5:54 PM, Jacques Nadeau <ja...@apache.org> wrote:

> How do others feel of my redefinition of IPC to mean the same memory space
> communication (either via shared memory or rdma) versus RPC as socket based
> communication?
>


IPC already has a strong definition which is close to what you want so it
isn't so strange.

On the other hand, you could coin something like LPC (local process
communication) to contrast with RPC (remote process communication).

Re: Understanding "shared" memory implications

Posted by Jacques Nadeau <ja...@apache.org>.
@Corey
The POC Steven and Wes are working on is based on MappedByteBuffer, but I'm
looking at using Netty's fork of tcnative to use shared memory directly.

@Yiannis
We need to have both RPC and shared memory mechanisms (what I'm inclined
to call IPC, though it is really a specific kind of IPC). The idea is that we
negotiate via RPC and then, if we determine shared locality, we work over shared
memory (preferably for both data and control). So the system interacting with
HBase in your example would be the one responsible for placing co-located
execution to take advantage of IPC.
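
(Purely as a hypothetical sketch of what that negotiation could look like, with
every type and name below invented for illustration: the RPC hello carries enough
identity to detect shared locality, and the reply hands back either a socket
endpoint or a shared segment path.)

    // Hypothetical negotiation sketch; every name here is made up for illustration.
    public final class TransportNegotiationSketch {

      enum Transport { SOCKET_STREAM, SHARED_MEMORY }

      static final class Hello {                 // sent by the client over RPC
        final String hostId;                     // e.g. a machine/boot identifier
        final long pid;
        Hello(String hostId, long pid) { this.hostId = hostId; this.pid = pid; }
      }

      static final class Offer {                 // server's reply
        final Transport transport;
        final String endpoint;                   // socket address or /dev/shm path
        Offer(Transport transport, String endpoint) {
          this.transport = transport; this.endpoint = endpoint;
        }
      }

      // Server-side decision: same host -> offer a shared segment, else fall back to RPC.
      static Offer negotiate(Hello hello, String localHostId) {
        if (hello.hostId.equals(localHostId)) {
          return new Offer(Transport.SHARED_MEMORY, "/dev/shm/arrow-" + hello.pid);
        }
        return new Offer(Transport.SOCKET_STREAM, "tcp://0.0.0.0:31337");
      }

      public static void main(String[] args) {
        Offer o = negotiate(new Hello("host-a", 1234), "host-a");
        System.out.println(o.transport + " -> " + o.endpoint);
      }
    }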

How do others feel about my redefinition of IPC to mean same-memory-space
communication (either via shared memory or RDMA) versus RPC as socket-based
communication?


On Tue, Mar 15, 2016 at 5:38 PM, Corey Nolet <cj...@gmail.com> wrote:

> I was seeing Netty's unsafe classes being used here, not mapped byte
> buffer  not sure if that statement is completely correct but I'll have to
> dog through the code again to figure that out.
>
> The more I was looking at unsafe, it makes sense why that would be
> used.apparently it's also supposed to be included on Java 9 as a first
> class API
> On Mar 15, 2016 7:03 PM, "Wes McKinney" <we...@cloudera.com> wrote:
>
> > My understanding is that you can use java.nio.MappedByteBuffer to work
> > with memory-mapped files as one way to share memory pages between Java
> > (and non-Java) processes without copying.
> >
> > I am hoping that we can reach a POC of zero-copy Arrow memory sharing
> > Java-to-Java and Java-to-C++ in the near future. Indeed this will have
> > huge implications once we get it working end to end (for example,
> > receiving memory from a Java process in Python without a heavy ser-de
> > step -- it's what we've always dreamed of) and with the metadata and
> > shared memory control flow standardized.
> >
> > - Wes
> >
> > On Wed, Mar 9, 2016 at 9:25 PM, Corey J Nolet <cj...@gmail.com> wrote:
> > > If I understand correctly, Arrow is using Netty underneath which is
> > using Sun's Unsafe API in order to allocate direct byte buffers off heap.
> > It is using Netty to communicate between "client" and "server",
> information
> > about memory addresses for data that is being requested.
> > >
> > > I've never attempted to use the Unsafe API to access off heap memory
> > that has been allocated in one JVM from another JVM but I'm assuming this
> > must be the case in order to claim that the memory is being accessed
> > directly without being copied, correct?
> > >
> > > The implication here is huge. If the memory is being directly shared
> > across processes by them being allowed to directly reach into the direct
> > byte buffers, that's true shared memory. Otherwise, if there's copies
> going
> > on, it's less appealing.
> > >
> > >
> > > Thanks.
> > >
> > > Sent from my iPad
> >
>

Re: Understanding "shared" memory implications

Posted by Corey Nolet <cj...@gmail.com>.
I was seeing Netty's unsafe classes being used here, not mapped byte
buffers; not sure if that statement is completely correct, but I'll have to
dig through the code again to figure that out.

The more I was looking at Unsafe, the more it made sense why that would be
used. Apparently it's also supposed to be included in Java 9 as a first-class
API.
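
(For anyone who hasn't poked at it, the off-heap calls in question look roughly
like this; this is a hand-rolled sketch, not the actual allocator code in Arrow
or Netty.)

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    public class UnsafeAllocSketch {
      public static void main(String[] args) throws Exception {
        // Unsafe isn't publicly constructible; grab the singleton via reflection.
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        long addr = unsafe.allocateMemory(1024);   // raw off-heap allocation
        try {
          unsafe.putLong(addr, 42L);               // write straight to the address
          System.out.println(unsafe.getLong(addr));
        } finally {
          unsafe.freeMemory(addr);                 // nothing frees this for you
        }
      }
    }
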
On Mar 15, 2016 7:03 PM, "Wes McKinney" <we...@cloudera.com> wrote:

> My understanding is that you can use java.nio.MappedByteBuffer to work
> with memory-mapped files as one way to share memory pages between Java
> (and non-Java) processes without copying.
>
> I am hoping that we can reach a POC of zero-copy Arrow memory sharing
> Java-to-Java and Java-to-C++ in the near future. Indeed this will have
> huge implications once we get it working end to end (for example,
> receiving memory from a Java process in Python without a heavy ser-de
> step -- it's what we've always dreamed of) and with the metadata and
> shared memory control flow standardized.
>
> - Wes
>
> On Wed, Mar 9, 2016 at 9:25 PM, Corey J Nolet <cj...@gmail.com> wrote:
> > If I understand correctly, Arrow is using Netty underneath which is
> using Sun's Unsafe API in order to allocate direct byte buffers off heap.
> It is using Netty to communicate between "client" and "server", information
> about memory addresses for data that is being requested.
> >
> > I've never attempted to use the Unsafe API to access off heap memory
> that has been allocated in one JVM from another JVM but I'm assuming this
> must be the case in order to claim that the memory is being accessed
> directly without being copied, correct?
> >
> > The implication here is huge. If the memory is being directly shared
> across processes by them being allowed to directly reach into the direct
> byte buffers, that's true shared memory. Otherwise, if there's copies going
> on, it's less appealing.
> >
> >
> > Thanks.
> >
> > Sent from my iPad
>

Re: Understanding "shared" memory implications

Posted by Yiannis Gkoufas <jo...@gmail.com>.
Hi Wes,

Can you please clarify something I don't understand? Will the next versions
of Arrow include the shared memory control flow as well?
If so, is what's needed for HBase (for instance) to be integrated just an
adapter to the Arrow format?
And if yes, who will be responsible for keeping data locality in the
RegionServers? I.e., it would make sense to keep in the RAM of the
RegionServer the data that is "close" to what is stored in the data node.
Hope that makes sense.

Regards,
Yiannis



On 15 March 2016 at 23:02, Wes McKinney <we...@cloudera.com> wrote:

> My understanding is that you can use java.nio.MappedByteBuffer to work
> with memory-mapped files as one way to share memory pages between Java
> (and non-Java) processes without copying.
>
> I am hoping that we can reach a POC of zero-copy Arrow memory sharing
> Java-to-Java and Java-to-C++ in the near future. Indeed this will have
> huge implications once we get it working end to end (for example,
> receiving memory from a Java process in Python without a heavy ser-de
> step -- it's what we've always dreamed of) and with the metadata and
> shared memory control flow standardized.
>
> - Wes
>
> On Wed, Mar 9, 2016 at 9:25 PM, Corey J Nolet <cj...@gmail.com> wrote:
> > If I understand correctly, Arrow is using Netty underneath which is
> using Sun's Unsafe API in order to allocate direct byte buffers off heap.
> It is using Netty to communicate between "client" and "server", information
> about memory addresses for data that is being requested.
> >
> > I've never attempted to use the Unsafe API to access off heap memory
> that has been allocated in one JVM from another JVM but I'm assuming this
> must be the case in order to claim that the memory is being accessed
> directly without being copied, correct?
> >
> > The implication here is huge. If the memory is being directly shared
> across processes by them being allowed to directly reach into the direct
> byte buffers, that's true shared memory. Otherwise, if there's copies going
> on, it's less appealing.
> >
> >
> > Thanks.
> >
> > Sent from my iPad
>

Re: Understanding "shared" memory implications

Posted by Wes McKinney <we...@cloudera.com>.
My understanding is that you can use java.nio.MappedByteBuffer to work
with memory-mapped files as one way to share memory pages between Java
(and non-Java) processes without copying.
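
(A minimal sketch of that approach, assuming a placeholder path and layout; real
usage would need an agreed-upon metadata and readiness convention on top.)

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Paths;
    import static java.nio.file.StandardOpenOption.*;

    // Sketch: two processes map the same file; bytes written by one are visible
    // to the other through the page cache, with no copy through a socket.
    public class MappedShareSketch {
      static final String PATH = "/dev/shm/arrow-poc";   // placeholder path

      // Run in process A (e.g. the Java producer).
      static void write() throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get(PATH), CREATE, READ, WRITE)) {
          MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
          buf.putLong(0, 123456789L);
          buf.force();                        // flush to the backing store
        }
      }

      // Run in process B (another JVM, or mmap() the same path from C++/Python).
      static void read() throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get(PATH), READ)) {
          MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, 4096);
          System.out.println(buf.getLong(0)); // same bytes, no ser-de step
        }
      }

      public static void main(String[] args) throws IOException {
        if (args.length > 0 && args[0].equals("read")) read(); else write();
      }
    }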

I am hoping that we can reach a POC of zero-copy Arrow memory sharing
Java-to-Java and Java-to-C++ in the near future. Indeed this will have
huge implications once we get it working end to end (for example,
receiving memory from a Java process in Python without a heavy ser-de
step -- it's what we've always dreamed of) and with the metadata and
shared memory control flow standardized.

- Wes

On Wed, Mar 9, 2016 at 9:25 PM, Corey J Nolet <cj...@gmail.com> wrote:
> If I understand correctly, Arrow is using Netty underneath which is using Sun's Unsafe API in order to allocate direct byte buffers off heap. It is using Netty to communicate between "client" and "server", information about memory addresses for data that is being requested.
>
> I've never attempted to use the Unsafe API to access off heap memory that has been allocated in one JVM from another JVM but I'm assuming this must be the case in order to claim that the memory is being accessed directly without being copied, correct?
>
> The implication here is huge. If the memory is being directly shared across processes by them being allowed to directly reach into the direct byte buffers, that's true shared memory. Otherwise, if there's copies going on, it's less appealing.
>
>
> Thanks.
>
> Sent from my iPad