Posted to user@spark.apache.org by "D.Y Feng" <yy...@gmail.com> on 2014/01/25 03:06:32 UTC

Can I share the RDD between multiprocess

How can I share an RDD between multiple processes?

-- 


DY.Feng(叶毅锋)
yyfeng88625@twitter
Department of Applied Mathematics
Guangzhou University, China
dyfeng@stu.gzhu.edu.cn

Re: Can I share the RDD between multiprocess

Posted by Mayur Rustagi <ma...@gmail.com>.
Will Job server work here?
http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server

Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi




RE: Can I share the RDD between multiprocess

Posted by Kapil Malik <km...@adobe.com>.
Thanks a lot Mark and Christopher for your prompt replies and clarification.

Regards,

Kapil Malik | kmalik@adobe.com


RE: Can I share the RDD between multiprocess

Posted by Christopher Nguyen <ct...@adatao.com>.
Kapil, that's right, your #2 is the pattern I was referring to. Of course
it could be Tomcat or something even lighter weight as long as you define
some suitable client/server protocol.

Sent while mobile. Pls excuse typos etc.

Re: Can I share the RDD between multiprocess

Posted by Ruchir Jha <ru...@gmail.com>.
Look at: https://github.com/ooyala/spark-jobserver



Re: Can I share the RDD between multiprocess

Posted by coolfrood <aa...@quantcast.com>.
Reviving this discussion again...

I'm interested in using Spark as the engine for a web service.

The SparkContext and its RDDs only exist in the JVM that started it.  While
RDDs are resilient, this means the context owner isn't resilient, so I may
be able to serve requests out of a single "service" JVM, but I'll lose all
my RDDs if the service dies.

It's possible to share RDDs by writing them into Tachyon, but with that I'll
end up having at least 2 copies of the same data in memory; even more if I
access the data from multiple contexts.

Is there a way around this?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-I-share-the-RDD-between-multiprocess-tp916p11901.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Can I share the RDD between multiprocess

Posted by Mark Hamstra <ma...@clearstorydata.com>.
It's a basic strategy that several organizations using Spark have followed,
but there isn't yet a canonical implementation or example of such a server
in the Spark source code.  That is likely to change before the 1.0 release,
and the included job server is likely to be based on an updated/expanded
version of an existing pull request: https://github.com/apache/incubator-spark/pull/222.



RE: Can I share the RDD between multiprocess

Posted by Kapil Malik <km...@adobe.com>.
Hi Christopher,

“make a "server" out of that JVM, and serve up (via HTTP/THRIFT, etc.) some kind of reference to those RDDs to multiple clients of that server”

Can you kindly hint at any starting points regarding your suggestion?
In my understanding, the SparkContext constructor creates an Akka actor system and starts a Jetty UI server. So can we somehow use or tweak the same to serve multiple clients? Or can we simply construct a SparkContext inside a Java server (like Tomcat)?

Regards,

Kapil Malik | kmalik@adobe.com | 33430 / 8800836581




Re: Can I share the RDD between multiprocess

Posted by Christopher Nguyen <ct...@adatao.com>.
D.Y., it depends on what you mean by "multiprocess".

RDD lifecycles are currently limited to a single SparkContext. So to
"share" RDDs you need to somehow access the same SparkContext.

This means one way to share RDDs is to make sure your accessors are in the
same JVM that started the SparkContext.

Another is to make a "server" out of that JVM, and serve up (via
HTTP/THRIFT, etc.) some kind of reference to those RDDs to multiple clients
of that server, even though there is only one SparkContext (held by the
server). We have built a server product using this pattern so I know it can
work well.
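[Editor's sketch] The server pattern described above can be illustrated, Spark aside, in plain Python: a single "owner" process holds the cached datasets (standing in for the SparkContext and its RDDs) and serves computed results to clients over HTTP, so clients share the data without ever holding it themselves. The names here (DATASETS, start_server, the /sum/<name> route) are invented for illustration, not part of any Spark API.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# The single owner process holds the cached datasets, just as one JVM
# holds the SparkContext and its RDDs.  Clients never receive the data
# structure itself -- only results computed against a named reference.
DATASETS = {"numbers": list(range(10))}

class DatasetHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The path names an operation and a dataset, e.g. /sum/numbers.
        op, _, name = self.path.strip("/").partition("/")
        data = DATASETS.get(name)
        if op != "sum" or data is None:
            self.send_error(404)
            return
        body = json.dumps({"result": sum(data)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the example quiet

def start_server():
    """Start the owner's HTTP server on a free port; return the port."""
    httpd = HTTPServer(("127.0.0.1", 0), DatasetHandler)
    threading.Thread(target=httpd.serve_forever, daemon=True).start()
    return httpd.server_address[1]
```

A client in any other process would then call, say, `urlopen(f"http://127.0.0.1:{port}/sum/numbers")`; swapping HTTP for Thrift or another protocol changes only the transport, not the pattern.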

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen




Re: Can I share the RDD between multiprocess

Posted by Tathagata Das <ta...@gmail.com>.
Using pure Spark, you will have to write an RDD to a Hadoop-compatible
filesystem (rdd.saveAs*), and read it back from a different process
(sparkContext.*File ...).

You can also take a look at the Tachyon project
(https://github.com/amplab/tachyon/wiki), which makes this super fast by
using an in-memory caching layer (outside Spark).

TD

On Fri, Jan 24, 2014 at 6:53 PM, Binh Nguyen <ng...@gmail.com> wrote:

> RDD is immutable so you should be able to.
>
>
> On Fri, Jan 24, 2014 at 6:06 PM, D.Y Feng <yy...@gmail.com> wrote:
>
>> How can I share the RDD between multiprocess?
>>
>> --
>>
>>
>> DY.Feng(叶毅锋)
>> yyfeng88625@twitter
>> Department of Applied Mathematics
>> Guangzhou University,China
>> dyfeng@stu.gzhu.edu.cn
>>
>>
>
>
>
> --
>
> Binh Nguyen
>
>

Re: Can I share the RDD between multiprocess

Posted by Binh Nguyen <ng...@gmail.com>.
RDDs are immutable, so you should be able to.





-- 

Binh Nguyen