Posted to user@arrow.apache.org by yunfan <yu...@foxmail.com> on 2020/06/04 02:22:55 UTC

How to understand and use zero-copy between two processes?

As I understand it, I can write a file backed by shared memory and open that shared-memory file in another process.
But that can't be used in streaming mode. Is there any way to use zero-copy between two processes?
I see that Spark also uses a pipe to transfer Arrow bytes between its Java and Python processes.

Re: RE: [EXTERNAL] How to understand and use zero-copy between two processes?

Posted by Wes McKinney <we...@gmail.com>.
We use "zero-copy" to mean "zero serialization". You must somehow
obtain access to a virtual address space that contains the bytes. If
the data is RAM resident in one process and you want to access that
data in another process, you have two options:

* Use shared memory (that's exactly what shared memory is intended for)
* Send the payload from one process to the other. Technically speaking
this is a "copy" but no "serialization" is required on the receiver
side (since Arrow IPC just involves computing pointer offsets into the
binary payload). There's no magic that I'm aware of for portably
accessing RAM-resident bytes (that are not in shared memory) in one
process from another.


Re: RE: [EXTERNAL] How to understand and use zero-copy between two processes?

Posted by Daniel Nugent <nu...@gmail.com>.
Sorry, I don’t rightly know what that part means. You can definitely map Arrow IPC messages that are on disk into memory in a zero-copy way. It’s just the streaming part that I’m not sure about.

-Dan Nugent

Re: RE: [EXTERNAL] How to understand and use zero-copy between two processes?

Posted by yunfan <yu...@foxmail.com>.
I just wonder what "zero-copy" means in the Arrow documentation.
As I understand it, copying memory is still necessary for Arrow streaming messaging.

https://arrow.apache.org/
"It also provides computational libraries and zero-copy streaming messaging and interprocess communication"




RE: [EXTERNAL] How to understand and use zero-copy between two processes?

Posted by "Nugent, Daniel" <Da...@mlp.com>.
Hi,

I'm not 100% sure I know exactly what you want to achieve here, unfortunately. If the message buffers are being streamed to a shared-memory-backed file, then you can't use shared memory to read them continuously, because the mmap facility provides a fixed-size mapping. You could use an out-of-band signal to indicate that the reader needs to re-map the stream storage file, I guess, but that's not really a stream. You *could* read from the file, but that will necessarily copy from the file handle, same as a pipe. If you want to use the Plasma object store, that can simplify the process of moving individual RecordBatches of a Table into shared memory to be used between processes. Unfortunately, the Plasma store currently cannot "adopt" existing shared memory in any way, so one initial copy into the store is necessary.

To go back to shared memory + OOB communication: that may well be workable. The read cost for shared-memory-backed mapped files will be very low, so repeatedly concatenating the RecordBatches back into a Table may not be a serious issue as long as there aren't *too* many RecordBatches to process.

Even given all of that, I don't know that Spark has yet implemented its DataFrames as Arrow-array-backed objects. There cannot be *true* zero copy between two systems until that is the case.

I hope that helps a little.

-Dan Nugent


######################################################################
The information contained in this communication is confidential and
may contain information that is privileged or exempt from disclosure
under applicable law. If you are not a named addressee, please notify
the sender immediately and delete this email from your system.
If you have received this communication, and are not a named
recipient, you are hereby notified that any dissemination,
distribution or copying of this communication is strictly prohibited.
######################################################################