Posted to dev@spark.apache.org by Jia <ja...@gmail.com> on 2015/12/06 21:43:24 UTC

Shared memory between C++ process and Spark

Dears, for one project, I need to implement something so Spark can read data from a C++ process. 
To provide high performance, I really hope to implement this through shared memory between the C++ process and the JVM process.
It seems it may be possible to use named memory-mapped files and JNI to do this, but I wonder whether there are any existing efforts or a more efficient approach to do this?
Thank you very much!

Best Regards,
Jia
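
For reference, a minimal sketch of the JVM side of the mapped-file approach Jia describes, assuming the C++ process writes a length-prefixed byte payload to a tmpfs-backed file (the path and layout here are illustrative, not part of any existing API):

    import java.nio.ByteOrder
    import java.nio.channels.FileChannel
    import java.nio.file.{Paths, StandardOpenOption}

    object SharedMemoryReader {
      // Map a file the C++ process has written, e.g. under /dev/shm on Linux
      // (tmpfs-backed, so the mapping never touches disk). The JVM reads the
      // pages in place; bytes are only copied out by the final get().
      def readPayload(path: String): Array[Byte] = {
        val ch = FileChannel.open(Paths.get(path), StandardOpenOption.READ)
        try {
          val buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size())
          buf.order(ByteOrder.LITTLE_ENDIAN)
          val len = buf.getInt()             // assumed layout: 4-byte length prefix
          val payload = new Array[Byte](len)
          buf.get(payload)
          payload
        } finally ch.close()
      }
    }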


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Shared memory between C++ process and Spark

Posted by Jia <ja...@gmail.com>.
Hi, Robin, 
Thanks for your reply, and thanks for copying my question to the user mailing list.
Yes, we have a distributed C++ application that will store data on each node in the cluster, and we hope to leverage Spark to do more fancy analytics on that data. But we need high performance; that's why we want shared memory.
Suggestions will be highly appreciated!

Best Regards,
Jia
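
One way this could be wired into Spark, as a sketch only: assume one executor per node and a local C++ process on each node publishing its data at a well-known tmpfs path (all names illustrative, and the one-partition-per-node placement is an assumption, not something Spark guarantees). Each task then reads only node-local data, which is the collocation Robin describes below.

    import org.apache.spark.{SparkConf, SparkContext}

    object LocalSegmentJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("cxx-shared-mem"))

        // One partition per node (assumed); each task maps the segment the
        // local C++ process published under /dev/shm.
        val nodes = sc.parallelize(0 until 8, 8)
        val totalBytes = nodes
          .map(_ => SharedMemoryReader.readPayload("/dev/shm/cxx_segment").length.toLong)
          .reduce(_ + _)

        println(s"read $totalBytes bytes from local C++ segments")
        sc.stop()
      }
    }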

On Dec 7, 2015, at 10:54 AM, Robin East <ro...@xense.co.uk> wrote:

> -dev, +user (this is not a question about development of Spark itself so you’ll get more answers in the user mailing list)
> 
> First up, let me say that I don’t really know how this could be done - I’m sure it would be possible with enough tinkering, but it’s not clear what you are trying to achieve. Spark is a distributed processing system; it has multiple JVMs running on different machines that each run a small part of the overall processing. Unless you have some sort of idea to have multiple C++ processes collocated with the distributed JVMs, using named memory mapped files doesn’t make architectural sense. 
> -------------------------------------------------------------------------------
> Robin East
> Spark GraphX in Action Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action


Re: Shared memory between C++ process and Spark

Posted by Jian Feng <fr...@yahoo.com.INVALID>.
The only way I can think of is through some kind of wrapper. For Java/Scala, use JNI. For Python, use extensions. There should not be a lot of work if you know these tools. 
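
The JVM side of such a JNI wrapper can be quite small; a sketch, with the library and method names hypothetical (the C++ half, not shown, would shm_open and mmap the named region):

    object NativeBridge {
      // libdatabridge.so is the hypothetical C++ half, built separately.
      System.loadLibrary("databridge")

      @native def openRegion(name: String): Long                  // opaque handle
      @native def read(handle: Long, offset: Long, len: Int): Array[Byte]
      @native def closeRegion(handle: Long): Unit
    }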

      From: Robin East <ro...@xense.co.uk>
 To: Annabel Melongo <me...@yahoo.com> 
Cc: Jia <ja...@gmail.com>; Dewful <de...@gmail.com>; "user @spark" <us...@spark.apache.org>; "dev@spark.apache.org" <de...@spark.apache.org>
 Sent: Monday, December 7, 2015 10:57 AM
 Subject: Re: Shared memory between C++ process and Spark
   
Annabel
Spark works very well with data stored in HDFS but is certainly not tied to it. Have a look at the wide variety of connectors to things like Cassandra, HBase, etc.
Robin

Sent from my iPhone


Re: Shared memory between C++ process and Spark

Posted by Annabel Melongo <me...@yahoo.com.INVALID>.
Robin,
Maybe you didn't read my post, in which I stated that Spark works on top of HDFS. What Jia wants is to have Spark interact with a C++ process to read and write data.
I've never heard of Jia's use case in Spark. If you know of one, please share it with me.
Thanks 


    On Monday, December 7, 2015 1:57 PM, Robin East <ro...@xense.co.uk> wrote:

Annabel
Spark works very well with data stored in HDFS but is certainly not tied to it. Have a look at the wide variety of connectors to things like Cassandra, HBase, etc.
Robin

Sent from my iPhone

Re: Shared memory between C++ process and Spark

Posted by Robin East <ro...@xense.co.uk>.
Annabel

Spark works very well with data stored in HDFS but is certainly not tied to it. Have a look at the wide variety of connectors to things like Cassandra, HBase, etc.

Robin

Sent from my iPhone
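
For example, with the DataStax spark-cassandra-connector on the classpath, Spark reads Cassandra with no HDFS involved at all (keyspace and table names here are illustrative):

    import com.datastax.spark.connector._   // adds cassandraTable to SparkContext

    val rdd = sc.cassandraTable("my_keyspace", "my_table")
    println(rdd.count())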

> On 7 Dec 2015, at 18:50, Annabel Melongo <me...@yahoo.com> wrote:
> 
> Jia,
> 
> I'm so confused on this. The architecture of Spark is to run on top of HDFS. What you're requesting, reading and writing to a C++ process, is not part of that requirement.
> 
> 
> 
> 
> 
> On Monday, December 7, 2015 1:42 PM, Jia <ja...@gmail.com> wrote:
> 
> 
> Thanks, Annabel, but I may need to clarify that I have no intention to write and run a Spark UDF in C++; I'm just wondering whether Spark can read and write data to a C++ process with zero copy.
> 
> Best Regards,
> Jia
>  
> 
> 
>> On Dec 7, 2015, at 12:26 PM, Annabel Melongo <me...@yahoo.com> wrote:
>> 
>> My guess is that Jia wants to run C++ on top of Spark. If that's the case, I'm afraid this is not possible. Spark has support for Java, Python, Scala and R.
>> 
>> The best way to achieve this is to run your application in C++ and use the data created by said application to do manipulation within Spark.
>> 
>> 
>> 
>> On Monday, December 7, 2015 1:15 PM, Jia <ja...@gmail.com> wrote:
>> 
>> 
>> Thanks, Dewful!
>> 
>> My impression is that Tachyon is a very nice in-memory file system that can connect to multiple storage systems.
>> However, because our data is also held in memory, I suspect that connecting to Spark directly may be more efficient.
>> But I definitely need to look at Tachyon more carefully, in case it has a very efficient C++ binding mechanism.
>> 
>> Best Regards,
>> Jia
>> 
>>> On Dec 7, 2015, at 11:46 AM, Dewful <de...@gmail.com> wrote:
>>> 
>>> Maybe looking into something like Tachyon would help; I see some sample C++ bindings, but I'm not sure how much of the current functionality they support...
>>> Hi, Robin, 
>>> Thanks for your reply, and thanks for copying my question to the user mailing list.
>>> Yes, we have a distributed C++ application that will store data on each node in the cluster, and we hope to leverage Spark to do more fancy analytics on that data. But we need high performance; that's why we want shared memory.
>>> Suggestions will be highly appreciated!
>>> 
>>> Best Regards,
>>> Jia
>>> 
>>>> On Dec 7, 2015, at 10:54 AM, Robin East <ro...@xense.co.uk> wrote:
>>>> 
>>>> -dev, +user (this is not a question about development of Spark itself so you’ll get more answers in the user mailing list)
>>>> 
>>>> First up, let me say that I don’t really know how this could be done - I’m sure it would be possible with enough tinkering, but it’s not clear what you are trying to achieve. Spark is a distributed processing system; it has multiple JVMs running on different machines that each run a small part of the overall processing. Unless you have some sort of idea to have multiple C++ processes collocated with the distributed JVMs, using named memory mapped files doesn’t make architectural sense. 
>>>> -------------------------------------------------------------------------------
>>>> Robin East
>>>> Spark GraphX in Action Michael Malak and Robin East
>>>> Manning Publications Co.
>>>> http://www.manning.com/books/spark-graphx-in-action
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>>> On 6 Dec 2015, at 20:43, Jia <ja...@gmail.com> wrote:
>>>>> 
>>>>> Dears, for one project, I need to implement something so Spark can read data from a C++ process. 
>>>>> To provide high performance, I really hope to implement this through shared memory between the C++ process and the JVM process.
>>>>> It seems it may be possible to use named memory-mapped files and JNI to do this, but I wonder whether there are any existing efforts or a more efficient approach to do this?
>>>>> Thank you very much!
>>>>> 
>>>>> Best Regards,
>>>>> Jia

Re: Shared memory between C++ process and Spark

Posted by Nick Pentreath <ni...@gmail.com>.
SparkNet may have some interesting ideas - https://github.com/amplab/SparkNet. I haven't had a deep look at it yet, but it seems to have some functionality allowing Caffe to read data from RDDs, though I'm not certain the memory is shared.



—
Sent from Mailbox

On Mon, Dec 7, 2015 at 9:55 PM, Robin East <ro...@xense.co.uk> wrote:

> Hi Annabel
> I certainly did read your post. My point was that Spark can read from HDFS but is in no way tied to that storage layer. A very interesting use case that sounds very similar to Jia's (as mentioned by another poster) is contained in https://issues.apache.org/jira/browse/SPARK-10399. The comments section provides a specific example of processing very large images using a pre-existing C++ library.
> Robin
> Sent from my iPhone

Re: Shared memory between C++ process and Spark

Posted by Robin East <ro...@xense.co.uk>.
I’m not sure what point you’re trying to prove, and I’m not particularly interested in getting into a protracted discussion. Here is what you wrote: “The architecture of Spark is to run on top of HDFS.” I interpreted that as a statement implying that Spark has to run on HDFS, which is definitely not the case. If you didn’t mean that, then we are both in agreement.
-------------------------------------------------------------------------------
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





> On 7 Dec 2015, at 19:56, Annabel Melongo <me...@yahoo.com> wrote:
> 
> Robin,
> 
> To prove my point, this is an unresolved issue still in the implementation stage.



Re: Shared memory between C++ process and Spark

Posted by Annabel Melongo <me...@yahoo.com.INVALID>.
Robin,
To prove my point, this is an unresolved issue still in the implementation stage. 


    On Monday, December 7, 2015 2:49 PM, Robin East <ro...@xense.co.uk> wrote:

Hi Annabel
I certainly did read your post. My point was that Spark can read from HDFS but is in no way tied to that storage layer. A very interesting use case that sounds very similar to Jia's (as mentioned by another poster) is contained in https://issues.apache.org/jira/browse/SPARK-10399. The comments section provides a specific example of processing very large images using a pre-existing C++ library.
Robin

Sent from my iPhone

Re: Shared memory between C++ process and Spark

Posted by Robin East <ro...@xense.co.uk>.
Hi Annabel

I certainly did read your post. My point was that Spark can read from HDFS but is in no way tied to that storage layer. A very interesting use case that sounds very similar to Jia's (as mentioned by another poster) is contained in https://issues.apache.org/jira/browse/SPARK-10399. The comments section provides a specific example of processing very large images using a pre-existing C++ library.

Robin

Sent from my iPhone
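
On the zero-copy point: a small native shim can wrap a memory region the C++ application already owns in a direct ByteBuffer via the standard JNI call NewDirectByteBuffer, so the JVM reads it in place. A sketch, with all names hypothetical:

    object ZeroCopyRegion {
      System.loadLibrary("regionbridge")   // hypothetical native shim

      // C++ side (sketch):
      //   void* addr = /* region the C++ app already owns, or mmap(...) */;
      //   return env->NewDirectByteBuffer(addr, size);   // no copy is made
      @native def mapNamedRegion(name: String): java.nio.ByteBuffer
    }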

> On 7 Dec 2015, at 18:50, Annabel Melongo <me...@yahoo.com.INVALID> wrote:
> 
> Jia,
> 
> I'm so confused on this. The architecture of Spark is to run on top of HDFS. What you're requesting, reading and writing to a C++ process, is not part of that requirement.


Re: Shared memory between C++ process and Spark

Posted by Annabel Melongo <me...@yahoo.com.INVALID>.
Jia,
I'm so confused on this. The architecture of Spark is to run on top of HDFS. What you're requesting, reading and writing to a C++ process, is not part of that requirement.

 


    On Monday, December 7, 2015 1:42 PM, Jia <ja...@gmail.com> wrote:

Thanks, Annabel, but I may need to clarify that I have no intention to write and run a Spark UDF in C++; I'm just wondering whether Spark can read and write data to a C++ process with zero copy.

Best Regards,
Jia

On Dec 7, 2015, at 12:26 PM, Annabel Melongo <me...@yahoo.com> wrote:

My guess is that Jia wants to run C++ on top of Spark. If that's the case, I'm afraid this is not possible. Spark has support for Java, Python, Scala and R.
The best way to achieve this is to run your application in C++ and used the data created by said application to do manipulation within Spark. 


    On Monday, December 7, 2015 1:15 PM, Jia <ja...@gmail.com> wrote:
 

 Thanks, Dewful!
My impression is that Tachyon is a very nice in-memory file system that can connect to multiple storages.However, because our data is also hold in memory, I suspect that connecting to Spark directly may be more efficient in performance.But definitely I need to look at Tachyon more carefully, in case it has a very efficient C++ binding mechanism.
Best Regards,Jia
On Dec 7, 2015, at 11:46 AM, Dewful <de...@gmail.com> wrote:

Maybe looking into something like Tachyon would help, I see some sample c++ bindings, not sure how much of the current functionality they support...Hi, Robin, Thanks for your reply and thanks for copying my question to user mailing list.Yes, we have a distributed C++ application, that will store data on each node in the cluster, and we hope to leverage Spark to do more fancy analytics on those data. But we need high performance, that’s why we want shared memory.Suggestions will be highly appreciated!
Best Regards,Jia
On Dec 7, 2015, at 10:54 AM, Robin East <ro...@xense.co.uk> wrote:

-dev, +user (this is not a question about development of Spark itself so you’ll get more answers in the user mailing list)
First up let me say that I don’t really know how this could be done - I’m sure it would be possible with enough tinkering but it’s not clear what you are trying to achieve. Spark is a distributed processing system, it has multiple JVMs running on different machines that each run a small part of the overall processing. Unless you have some sort of idea to have multiple C++ processes collocated with the distributed JVMs using named memory mapped files doesn’t make architectural sense. 
-------------------------------------------------------------------------------Robin EastSpark GraphX in Action Michael Malak and Robin EastManning Publications Co.http://www.manning.com/books/spark-graphx-in-action





On 6 Dec 2015, at 20:43, Jia <ja...@gmail.com> wrote:
Dears, for one project, I need to implement something so Spark can read data from a C++ process. 
To provide high performance, I really hope to implement this through shared memory between the C++ process and Java JVM process.
It seems it may be possible to use named memory mapped files and JNI to do this, but I wonder whether there is any existing efforts or more efficient approach to do this?
Thank you very much!

Best Regards,
Jia



Re: Shared memory between C++ process and Spark

Posted by Jia <ja...@gmail.com>.
Thanks, Annabel, but I should clarify that I have no intention of writing and running Spark UDFs in C++; I'm just wondering whether Spark can read and write data to a C++ process with zero copy.

Best Regards,
Jia
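
(For illustration: plain memory-mapped reads need no JNI at all on the JVM side, since java.nio can map the same named file a C++ process writes. A minimal Scala sketch follows; the /dev/shm path and the int-at-offset-0 layout are assumptions, not an agreed format.)

    import java.io.RandomAccessFile
    import java.nio.channels.FileChannel

    // Map the region the C++ process created, e.g. via shm_open/mmap; on
    // Linux, POSIX shared memory shows up under /dev/shm (path is assumed).
    val file = new RandomAccessFile("/dev/shm/cpp-spark-region", "r")
    val buf  = file.getChannel.map(FileChannel.MapMode.READ_ONLY, 0, file.length())

    // Reads go straight against the shared pages -- no copy into a byte[] yet.
    val header = buf.getInt(0)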
 


On Dec 7, 2015, at 12:26 PM, Annabel Melongo <me...@yahoo.com> wrote:

> My guess is that Jia wants to run C++ on top of Spark. If that's the case, I'm afraid this is not possible. Spark has support for Java, Python, Scala and R.
> 
> The best way to achieve this is to run your application in C++ and used the data created by said application to do manipulation within Spark.
> 
> 
> 
> On Monday, December 7, 2015 1:15 PM, Jia <ja...@gmail.com> wrote:
> 
> 
> Thanks, Dewful!
> 
> My impression is that Tachyon is a very nice in-memory file system that can connect to multiple storages.
> However, because our data is also hold in memory, I suspect that connecting to Spark directly may be more efficient in performance.
> But definitely I need to look at Tachyon more carefully, in case it has a very efficient C++ binding mechanism.
> 
> Best Regards,
> Jia
> 
> On Dec 7, 2015, at 11:46 AM, Dewful <de...@gmail.com> wrote:
> 
>> Maybe looking into something like Tachyon would help, I see some sample c++ bindings, not sure how much of the current functionality they support...
>> Hi, Robin, 
>> Thanks for your reply and thanks for copying my question to user mailing list.
>> Yes, we have a distributed C++ application, that will store data on each node in the cluster, and we hope to leverage Spark to do more fancy analytics on those data. But we need high performance, that’s why we want shared memory.
>> Suggestions will be highly appreciated!
>> 
>> Best Regards,
>> Jia
>> 
>> On Dec 7, 2015, at 10:54 AM, Robin East <ro...@xense.co.uk> wrote:
>> 
>>> -dev, +user (this is not a question about development of Spark itself so you’ll get more answers in the user mailing list)
>>> 
>>> First up let me say that I don’t really know how this could be done - I’m sure it would be possible with enough tinkering but it’s not clear what you are trying to achieve. Spark is a distributed processing system, it has multiple JVMs running on different machines that each run a small part of the overall processing. Unless you have some sort of idea to have multiple C++ processes collocated with the distributed JVMs using named memory mapped files doesn’t make architectural sense. 
>>> -------------------------------------------------------------------------------
>>> Robin East
>>> Spark GraphX in Action Michael Malak and Robin East
>>> Manning Publications Co.
>>> http://www.manning.com/books/spark-graphx-in-action
>>> 
>>> 
>>> 
>>> 
>>> 
>>>> On 6 Dec 2015, at 20:43, Jia <ja...@gmail.com> wrote:
>>>> 
>>>> Dears, for one project, I need to implement something so Spark can read data from a C++ process. 
>>>> To provide high performance, I really hope to implement this through shared memory between the C++ process and Java JVM process.
>>>> It seems it may be possible to use named memory mapped files and JNI to do this, but I wonder whether there is any existing efforts or more efficient approach to do this?
>>>> Thank you very much!
>>>> 
>>>> Best Regards,
>>>> Jia
>>>> 



Re: Shared memory between C++ process and Spark

Posted by Annabel Melongo <me...@yahoo.com.INVALID>.
My guess is that Jia wants to run C++ on top of Spark. If that's the case, I'm afraid this is not possible. Spark has support for Java, Python, Scala and R.
The best way to achieve this is to run your application in C++ and use the data created by said application to do the manipulation within Spark.
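
(For illustration, a rough sketch of that route on the Spark side, assuming the C++ application writes its results out as binary files; the path is a placeholder, and decoding the bytes is up to whatever record format the C++ side uses.)

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("read-cpp-output"))

    // Each file the C++ application wrote becomes one (path, stream) record.
    val files = sc.binaryFiles("hdfs:///cpp-app/output/*.bin")

    // Materialise the bytes per file; parsing them is up to the C++ format.
    val sizes = files.map { case (path, stream) => (path, stream.toArray().length) }
    sizes.collect().foreach(println)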


    On Monday, December 7, 2015 1:15 PM, Jia <ja...@gmail.com> wrote:
 

Thanks, Dewful!
My impression is that Tachyon is a very nice in-memory file system that can connect to multiple storage backends. However, because our data is also held in memory, I suspect that connecting to Spark directly may be more efficient. But I definitely need to look at Tachyon more carefully, in case it has a very efficient C++ binding mechanism.
Best Regards,
Jia
On Dec 7, 2015, at 11:46 AM, Dewful <de...@gmail.com> wrote:

Maybe looking into something like Tachyon would help; I see some sample C++ bindings, but I'm not sure how much of the current functionality they support...
Hi, Robin,
Thanks for your reply and thanks for copying my question to the user mailing list.
Yes, we have a distributed C++ application that will store data on each node in the cluster, and we hope to leverage Spark to do more fancy analytics on those data. But we need high performance; that’s why we want shared memory.
Suggestions will be highly appreciated!
Best Regards,
Jia
On Dec 7, 2015, at 10:54 AM, Robin East <ro...@xense.co.uk> wrote:

-dev, +user (this is not a question about development of Spark itself so you’ll get more answers in the user mailing list)
First up let me say that I don’t really know how this could be done - I’m sure it would be possible with enough tinkering but it’s not clear what you are trying to achieve. Spark is a distributed processing system, it has multiple JVMs running on different machines that each run a small part of the overall processing. Unless you have some sort of idea to have multiple C++ processes collocated with the distributed JVMs using named memory mapped files doesn’t make architectural sense. 
-------------------------------------------------------------------------------
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





On 6 Dec 2015, at 20:43, Jia <ja...@gmail.com> wrote:
Dears, for one project, I need to implement something so Spark can read data from a C++ process. 
To provide high performance, I really hope to implement this through shared memory between the C++ process and Java JVM process.
It seems it may be possible to use named memory mapped files and JNI to do this, but I wonder whether there is any existing efforts or more efficient approach to do this?
Thank you very much!

Best Regards,
Jia



Re: Shared memory between C++ process and Spark

Posted by Jia <ja...@gmail.com>.
Hi, Kazuaki,

It’s very similar to my requirement, thanks!
It seems they want to write to a C++ process with zero copy, and I want to do both read and write with zero copy.
Does anyone know how to obtain more information, such as the current status of this JIRA entry?

Best Regards,
Jia
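
(For illustration: on the JVM side a READ_WRITE mapping covers both directions. This is only a sketch -- the path, region size, and two-slot layout are assumptions the C++ process would have to agree on, and real use needs synchronisation between the two sides.)

    import java.io.RandomAccessFile
    import java.nio.channels.FileChannel

    // One shared region for both directions; a writable mapping grows the
    // backing file to the requested size (1 MB here, an arbitrary choice).
    val file = new RandomAccessFile("/dev/shm/spark-cpp-exchange", "rw")
    val buf  = file.getChannel.map(FileChannel.MapMode.READ_WRITE, 0, 1 << 20)

    val fromCpp = buf.getLong(0)   // slot 0: a value the C++ process wrote
    buf.putLong(8, fromCpp + 1)    // slot 1: a reply the C++ process can read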




On Dec 7, 2015, at 12:26 PM, Kazuaki Ishizaki <IS...@jp.ibm.com> wrote:

> Is this JIRA entry related to what you want?
> https://issues.apache.org/jira/browse/SPARK-10399
> 
> Regards,
> Kazuaki Ishizaki
> 
> 
> 
> From:        Jia <ja...@gmail.com>
> To:        Dewful <de...@gmail.com>
> Cc:        "user @spark" <us...@spark.apache.org>, dev@spark.apache.org, Robin East <ro...@xense.co.uk>
> Date:        2015/12/08 03:17
> Subject:        Re: Shared memory between C++ process and Spark
> 
> 
> 
> Thanks, Dewful!
> 
> My impression is that Tachyon is a very nice in-memory file system that can connect to multiple storages.
> However, because our data is also hold in memory, I suspect that connecting to Spark directly may be more efficient in performance.
> But definitely I need to look at Tachyon more carefully, in case it has a very efficient C++ binding mechanism.
> 
> Best Regards,
> Jia
> 
> On Dec 7, 2015, at 11:46 AM, Dewful <de...@gmail.com> wrote:
> Maybe looking into something like Tachyon would help, I see some sample c++ bindings, not sure how much of the current functionality they support...
> Hi, Robin, 
> Thanks for your reply and thanks for copying my question to user mailing list.
> Yes, we have a distributed C++ application, that will store data on each node in the cluster, and we hope to leverage Spark to do more fancy analytics on those data. But we need high performance, that’s why we want shared memory.
> Suggestions will be highly appreciated!
> 
> Best Regards,
> Jia
> 
> On Dec 7, 2015, at 10:54 AM, Robin East <ro...@xense.co.uk> wrote:
> 
> -dev, +user (this is not a question about development of Spark itself so you’ll get more answers in the user mailing list)
> 
> First up let me say that I don’t really know how this could be done - I’m sure it would be possible with enough tinkering but it’s not clear what you are trying to achieve. Spark is a distributed processing system, it has multiple JVMs running on different machines that each run a small part of the overall processing. Unless you have some sort of idea to have multiple C++ processes collocated with the distributed JVMs using named memory mapped files doesn’t make architectural sense. 
> -------------------------------------------------------------------------------
> Robin East
> Spark GraphX in Action Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
> 
> 
> 
> 
> 
> On 6 Dec 2015, at 20:43, Jia <ja...@gmail.com> wrote:
> 
> Dears, for one project, I need to implement something so Spark can read data from a C++ process. 
> To provide high performance, I really hope to implement this through shared memory between the C++ process and Java JVM process.
> It seems it may be possible to use named memory mapped files and JNI to do this, but I wonder whether there is any existing efforts or more efficient approach to do this?
> Thank you very much!
> 
> Best Regards,
> Jia
> 



Re: Shared memory between C++ process and Spark

Posted by Kazuaki Ishizaki <IS...@jp.ibm.com>.
Is this JIRA entry related to what you want?
https://issues.apache.org/jira/browse/SPARK-10399

Regards,
Kazuaki Ishizaki



From:   Jia <ja...@gmail.com>
To:     Dewful <de...@gmail.com>
Cc:     "user @spark" <us...@spark.apache.org>, dev@spark.apache.org, Robin 
East <ro...@xense.co.uk>
Date:   2015/12/08 03:17
Subject:        Re: Shared memory between C++ process and Spark



Thanks, Dewful!

My impression is that Tachyon is a very nice in-memory file system that 
can connect to multiple storages.
However, because our data is also hold in memory, I suspect that 
connecting to Spark directly may be more efficient in performance.
But definitely I need to look at Tachyon more carefully, in case it has a 
very efficient C++ binding mechanism.

Best Regards,
Jia

On Dec 7, 2015, at 11:46 AM, Dewful <de...@gmail.com> wrote:

Maybe looking into something like Tachyon would help, I see some sample 
c++ bindings, not sure how much of the current functionality they 
support...
Hi, Robin, 
Thanks for your reply and thanks for copying my question to user mailing 
list.
Yes, we have a distributed C++ application, that will store data on each 
node in the cluster, and we hope to leverage Spark to do more fancy 
analytics on those data. But we need high performance, that’s why we want 
shared memory.
Suggestions will be highly appreciated!

Best Regards,
Jia

On Dec 7, 2015, at 10:54 AM, Robin East <ro...@xense.co.uk> wrote:

-dev, +user (this is not a question about development of Spark itself so 
you’ll get more answers in the user mailing list)

First up let me say that I don’t really know how this could be done - I’m
sure it would be possible with enough tinkering but it’s not clear what
you are trying to achieve. Spark is a distributed processing system, it 
has multiple JVMs running on different machines that each run a small part 
of the overall processing. Unless you have some sort of idea to have 
multiple C++ processes collocated with the distributed JVMs using named 
memory mapped files doesn’t make architectural sense. 
-------------------------------------------------------------------------------
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





On 6 Dec 2015, at 20:43, Jia <ja...@gmail.com> wrote:

Dears, for one project, I need to implement something so Spark can read 
data from a C++ process. 
To provide high performance, I really hope to implement this through 
shared memory between the C++ process and Java JVM process.
It seems it may be possible to use named memory mapped files and JNI to do 
this, but I wonder whether there is any existing efforts or more efficient 
approach to do this?
Thank you very much!

Best Regards,
Jia



Re: Shared memory between C++ process and Spark

Posted by Jia <ja...@gmail.com>.
Thanks, Dewful!

My impression is that Tachyon is a very nice in-memory file system that can connect to multiple storage backends.
However, because our data is also held in memory, I suspect that connecting to Spark directly may be more efficient.
But I definitely need to look at Tachyon more carefully, in case it has a very efficient C++ binding mechanism.

Best Regards,
Jia
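
(For illustration: once the Tachyon client jar is on the classpath, reading Tachyon-resident data from Spark is mostly a URI scheme. A sketch assuming an existing SparkContext sc; the master hostname and path are placeholders, and 19998 was Tachyon’s default master port.)

    // Hostname and path are placeholders for a real Tachyon deployment.
    val lines = sc.textFile("tachyon://tachyon-master:19998/cpp-app/data")
    println(lines.count())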

On Dec 7, 2015, at 11:46 AM, Dewful <de...@gmail.com> wrote:

> Maybe looking into something like Tachyon would help, I see some sample c++ bindings, not sure how much of the current functionality they support...
> 
> Hi, Robin, 
> Thanks for your reply and thanks for copying my question to user mailing list.
> Yes, we have a distributed C++ application, that will store data on each node in the cluster, and we hope to leverage Spark to do more fancy analytics on those data. But we need high performance, that’s why we want shared memory.
> Suggestions will be highly appreciated!
> 
> Best Regards,
> Jia
> 
> On Dec 7, 2015, at 10:54 AM, Robin East <ro...@xense.co.uk> wrote:
> 
>> -dev, +user (this is not a question about development of Spark itself so you’ll get more answers in the user mailing list)
>> 
>> First up let me say that I don’t really know how this could be done - I’m sure it would be possible with enough tinkering but it’s not clear what you are trying to achieve. Spark is a distributed processing system, it has multiple JVMs running on different machines that each run a small part of the overall processing. Unless you have some sort of idea to have multiple C++ processes collocated with the distributed JVMs using named memory mapped files doesn’t make architectural sense. 
>> -------------------------------------------------------------------------------
>> Robin East
>> Spark GraphX in Action Michael Malak and Robin East
>> Manning Publications Co.
>> http://www.manning.com/books/spark-graphx-in-action
>> 
>> 
>> 
>> 
>> 
>>> On 6 Dec 2015, at 20:43, Jia <ja...@gmail.com> wrote:
>>> 
>>> Dears, for one project, I need to implement something so Spark can read data from a C++ process. 
>>> To provide high performance, I really hope to implement this through shared memory between the C++ process and Java JVM process.
>>> It seems it may be possible to use named memory mapped files and JNI to do this, but I wonder whether there is any existing efforts or more efficient approach to do this?
>>> Thank you very much!
>>> 
>>> Best Regards,
>>> Jia



Re: Shared memory between C++ process and Spark

Posted by Dewful <de...@gmail.com>.
Maybe looking into something like Tachyon would help; I see some sample C++
bindings, but I'm not sure how much of the current functionality they support...
Hi, Robin,
Thanks for your reply and thanks for copying my question to user mailing
list.
Yes, we have a distributed C++ application, that will store data on each
node in the cluster, and we hope to leverage Spark to do more fancy
analytics on those data. But we need high performance, that’s why we want
shared memory.
Suggestions will be highly appreciated!

Best Regards,
Jia

On Dec 7, 2015, at 10:54 AM, Robin East <ro...@xense.co.uk> wrote:

-dev, +user (this is not a question about development of Spark itself so
you’ll get more answers in the user mailing list)

First up let me say that I don’t really know how this could be done - I’m
sure it would be possible with enough tinkering but it’s not clear what you
are trying to achieve. Spark is a distributed processing system, it has
multiple JVMs running on different machines that each run a small part of
the overall processing. Unless you have some sort of idea to have multiple
C++ processes collocated with the distributed JVMs using named memory
mapped files doesn’t make architectural sense.
-------------------------------------------------------------------------------
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





On 6 Dec 2015, at 20:43, Jia <ja...@gmail.com> wrote:

Dears, for one project, I need to implement something so Spark can read
data from a C++ process.
To provide high performance, I really hope to implement this through shared
memory between the C++ process and Java JVM process.
It seems it may be possible to use named memory mapped files and JNI to do
this, but I wonder whether there is any existing efforts or more efficient
approach to do this?
Thank you very much!

Best Regards,
Jia



Re: Shared memory between C++ process and Spark

Posted by Jia <ja...@gmail.com>.
Thanks, Robin, you have a very good point!
We feel that the data copy and allocation overhead may become a performance bottleneck, and we are evaluating it right now.
We will do the shared-memory work only if we’re sure about the potential performance gain and sure that there is nothing existing in the Spark community that we can leverage to do this.

Best Regards,
Jia


On Dec 7, 2015, at 11:56 AM, Robin East <ro...@xense.co.uk> wrote:

> I guess you could write a custom RDD that can read data from a memory-mapped file - not really my area of expertise so I’ll leave it to other members of the forum to chip in with comments as to whether that makes sense. 
> 
> But if you want ‘fancy analytics’ then won’t the processing time more than out-weigh the savings from using memory mapped files? Particularly if your analytics involve any kind of aggregation of data across data nodes. Have you looked at a Lambda architecture which could involve Spark but doesn’t necessarily mean you would go to the trouble of implementing a custom memory-mapped file reading feature.
> -------------------------------------------------------------------------------
> Robin East
> Spark GraphX in Action Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
> 
> 
> 
> 
> 
>> On 7 Dec 2015, at 17:32, Jia <ja...@gmail.com> wrote:
>> 
>> Hi, Robin, 
>> Thanks for your reply and thanks for copying my question to user mailing list.
>> Yes, we have a distributed C++ application, that will store data on each node in the cluster, and we hope to leverage Spark to do more fancy analytics on those data. But we need high performance, that’s why we want shared memory.
>> Suggestions will be highly appreciated!
>> 
>> Best Regards,
>> Jia
>> 
>> On Dec 7, 2015, at 10:54 AM, Robin East <ro...@xense.co.uk> wrote:
>> 
>>> -dev, +user (this is not a question about development of Spark itself so you’ll get more answers in the user mailing list)
>>> 
>>> First up let me say that I don’t really know how this could be done - I’m sure it would be possible with enough tinkering but it’s not clear what you are trying to achieve. Spark is a distributed processing system, it has multiple JVMs running on different machines that each run a small part of the overall processing. Unless you have some sort of idea to have multiple C++ processes collocated with the distributed JVMs using named memory mapped files doesn’t make architectural sense. 
>>> -------------------------------------------------------------------------------
>>> Robin East
>>> Spark GraphX in Action Michael Malak and Robin East
>>> Manning Publications Co.
>>> http://www.manning.com/books/spark-graphx-in-action
>>> 
>>> 
>>> 
>>> 
>>> 
>>>> On 6 Dec 2015, at 20:43, Jia <ja...@gmail.com> wrote:
>>>> 
>>>> Dears, for one project, I need to implement something so Spark can read data from a C++ process. 
>>>> To provide high performance, I really hope to implement this through shared memory between the C++ process and Java JVM process.
>>>> It seems it may be possible to use named memory mapped files and JNI to do this, but I wonder whether there is any existing efforts or more efficient approach to do this?
>>>> Thank you very much!
>>>> 
>>>> Best Regards,
>>>> Jia
>>>> 


Re: Shared memory between C++ process and Spark

Posted by Robin East <ro...@xense.co.uk>.
I guess you could write a custom RDD that can read data from a memory-mapped file - not really my area of expertise so I’ll leave it to other members of the forum to chip in with comments as to whether that makes sense. 

But if you want ‘fancy analytics’ then won’t the processing time more than outweigh the savings from using memory-mapped files? Particularly if your analytics involve any kind of aggregation of data across data nodes. Have you looked at a Lambda architecture, which could involve Spark but wouldn’t necessarily mean going to the trouble of implementing a custom memory-mapped file reading feature?
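
(Along those lines, a minimal sketch of what such a custom RDD could look like. Everything here is an assumption made for illustration: one mapped file per partition under /dev/shm, records length-prefixed by the C++ writer, and note that copying each record into an Array[Byte] already gives up strict zero-copy.)

    import java.io.RandomAccessFile
    import java.nio.channels.FileChannel

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // One partition per node-local mapped file (paths are placeholders).
    case class MappedFilePartition(index: Int, path: String) extends Partition

    class MappedFileRDD(sc: SparkContext, paths: Seq[String])
        extends RDD[Array[Byte]](sc, Nil) {

      override protected def getPartitions: Array[Partition] =
        paths.zipWithIndex.map { case (p, i) => MappedFilePartition(i, p) }.toArray

      override def compute(split: Partition, context: TaskContext): Iterator[Array[Byte]] = {
        val part = split.asInstanceOf[MappedFilePartition]
        val file = new RandomAccessFile(part.path, "r")
        val buf  = file.getChannel.map(FileChannel.MapMode.READ_ONLY, 0, file.length())
        new Iterator[Array[Byte]] {
          def hasNext: Boolean = buf.remaining() >= 4
          def next(): Array[Byte] = {
            val len = buf.getInt()        // 4-byte length prefix (assumed format)
            val rec = new Array[Byte](len)
            buf.get(rec)                  // copies out of the shared mapping
            rec
          }
        }
      }
    }

    // e.g. new MappedFileRDD(sc, Seq("/dev/shm/part-0", "/dev/shm/part-1")).count()
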
-------------------------------------------------------------------------------
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





> On 7 Dec 2015, at 17:32, Jia <ja...@gmail.com> wrote:
> 
> Hi, Robin, 
> Thanks for your reply and thanks for copying my question to user mailing list.
> Yes, we have a distributed C++ application, that will store data on each node in the cluster, and we hope to leverage Spark to do more fancy analytics on those data. But we need high performance, that’s why we want shared memory.
> Suggestions will be highly appreciated!
> 
> Best Regards,
> Jia
> 
> On Dec 7, 2015, at 10:54 AM, Robin East <robin.east@xense.co.uk> wrote:
> 
>> -dev, +user (this is not a question about development of Spark itself so you’ll get more answers in the user mailing list)
>> 
>> First up let me say that I don’t really know how this could be done - I’m sure it would be possible with enough tinkering but it’s not clear what you are trying to achieve. Spark is a distributed processing system, it has multiple JVMs running on different machines that each run a small part of the overall processing. Unless you have some sort of idea to have multiple C++ processes collocated with the distributed JVMs using named memory mapped files doesn’t make architectural sense. 
>> -------------------------------------------------------------------------------
>> Robin East
>> Spark GraphX in Action Michael Malak and Robin East
>> Manning Publications Co.
>> http://www.manning.com/books/spark-graphx-in-action
>> 
>> 
>> 
>> 
>> 
>>> On 6 Dec 2015, at 20:43, Jia <jacquelinezou@gmail.com> wrote:
>>> 
>>> Dears, for one project, I need to implement something so Spark can read data from a C++ process. 
>>> To provide high performance, I really hope to implement this through shared memory between the C++ process and Java JVM process.
>>> It seems it may be possible to use named memory mapped files and JNI to do this, but I wonder whether there is any existing efforts or more efficient approach to do this?
>>> Thank you very much!
>>> 
>>> Best Regards,
>>> Jia
>>> 


Re: Shared memory between C++ process and Spark

Posted by Jia <ja...@gmail.com>.
Hi, Robin, 
Thanks for your reply and thanks for copying my question to the user mailing list.
Yes, we have a distributed C++ application that will store data on each node in the cluster, and we hope to leverage Spark to do more fancy analytics on those data. But we need high performance; that’s why we want shared memory.
Suggestions will be highly appreciated!

Best Regards,
Jia

On Dec 7, 2015, at 10:54 AM, Robin East <ro...@xense.co.uk> wrote:

> -dev, +user (this is not a question about development of Spark itself so you’ll get more answers in the user mailing list)
> 
> First up let me say that I don’t really know how this could be done - I’m sure it would be possible with enough tinkering but it’s not clear what you are trying to achieve. Spark is a distributed processing system, it has multiple JVMs running on different machines that each run a small part of the overall processing. Unless you have some sort of idea to have multiple C++ processes collocated with the distributed JVMs using named memory mapped files doesn’t make architectural sense. 
> -------------------------------------------------------------------------------
> Robin East
> Spark GraphX in Action Michael Malak and Robin East
> Manning Publications Co.
> http://www.manning.com/books/spark-graphx-in-action
> 
> 
> 
> 
> 
>> On 6 Dec 2015, at 20:43, Jia <ja...@gmail.com> wrote:
>> 
>> Dears, for one project, I need to implement something so Spark can read data from a C++ process. 
>> To provide high performance, I really hope to implement this through shared memory between the C++ process and Java JVM process.
>> It seems it may be possible to use named memory mapped files and JNI to do this, but I wonder whether there is any existing efforts or more efficient approach to do this?
>> Thank you very much!
>> 
>> Best Regards,
>> Jia
>> 


Re: Shared memory between C++ process and Spark

Posted by Robin East <ro...@xense.co.uk>.
-dev, +user (this is not a question about development of Spark itself so you’ll get more answers in the user mailing list)

First up let me say that I don’t really know how this could be done - I’m sure it would be possible with enough tinkering but it’s not clear what you are trying to achieve. Spark is a distributed processing system; it has multiple JVMs running on different machines that each run a small part of the overall processing. Unless you have some sort of idea to have multiple C++ processes collocated with the distributed JVMs, using named memory mapped files doesn’t make architectural sense.
-------------------------------------------------------------------------------
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action





> On 6 Dec 2015, at 20:43, Jia <ja...@gmail.com> wrote:
> 
> Dears, for one project, I need to implement something so Spark can read data from a C++ process. 
> To provide high performance, I really hope to implement this through shared memory between the C++ process and Java JVM process.
> It seems it may be possible to use named memory mapped files and JNI to do this, but I wonder whether there is any existing efforts or more efficient approach to do this?
> Thank you very much!
> 
> Best Regards,
> Jia
> 

