Posted to issues@spark.apache.org by "Murat Eken (JIRA)" <ji...@apache.org> on 2015/01/26 12:36:34 UTC

[jira] [Commented] (SPARK-2389) globally shared SparkContext / shared Spark "application"

    [ https://issues.apache.org/jira/browse/SPARK-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14291724#comment-14291724 ] 

Murat Eken commented on SPARK-2389:
-----------------------------------

+1. We're using a Spark cluster as a real-time query engine, and unfortunately we're running into the same issues Robert mentions. Although Spark provides a plethora of solutions for making the cluster itself fault-tolerant and resilient, we need the same resilience for the front layer from which the Spark cluster is accessed; that means multiple Spark client instances, and hence multiple SparkContexts, connecting to the same cluster and sharing the same computing power.
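
To make the setup concrete, here is a minimal sketch of what each of our frontend instances effectively has to do today (Scala, Spark 1.x; the application name, master URL and path are made up). Every instance registers as a separate application and therefore ends up with its own executors and its own cache:

{code:scala}
// Sketch of the status quo: every frontend JVM builds its own SparkContext,
// which means its own executors and its own block-manager cache.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("query-frontend")      // each instance becomes a separate "application"
  .setMaster("spark://master:7077")  // assumed standalone master URL

val sc = new SparkContext(conf)

// Anything cached here is invisible to the SparkContext on any other frontend
// node, so every node pays the load time and the memory cost separately.
val events = sc.textFile("hdfs:///data/events").cache()
{code}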

Performance is crucial for us, hence our choice to cache the data in memory and utilize the full hardware resources in the executors. Alternative solutions, such as keeping the data in Tachyon or restarting executors for each query, just don't give the same performance. We're looking into using https://github.com/spark-jobserver/spark-jobserver, but that's not a proper solution either, as the jobserver would still be a single point of failure in our setup, which is a problem for us.
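
For concreteness, the two options we compared look roughly like this (Spark 1.x storage levels, reusing the sc from the sketch above; the path is again made up). As far as I understand, OFF_HEAP is the Tachyon-backed level, which keeps the data outside the executor heap at the cost of a serialization round trip on every access:

{code:scala}
import org.apache.spark.storage.StorageLevel

// What we do today: keep the working set in executor memory for low-latency queries.
val cached = sc.textFile("hdfs:///data/events").persist(StorageLevel.MEMORY_ONLY)

// The Tachyon-backed alternative we evaluated (experimental in Spark 1.x): the data
// can outlive an executor, but every access pays for (de)serialization.
val offHeap = sc.textFile("hdfs:///data/events").persist(StorageLevel.OFF_HEAP)
{code}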

I'd appreciate it if a Spark developer could comment on the feasibility of this change request; if it turns out to be difficult or even impossible due to architectural choices that have already been made, it would be good to know so that we can consider our alternatives.

> globally shared SparkContext / shared Spark "application"
> ---------------------------------------------------------
>
>                 Key: SPARK-2389
>                 URL: https://issues.apache.org/jira/browse/SPARK-2389
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Robert Stupp
>
> The documentation (in the Cluster Mode Overview) states:
> bq. Each application gets its own executor processes, which *stay up for the duration of the whole application* and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that *data cannot be shared* across different Spark applications (instances of SparkContext) without writing it to an external storage system.
> IMO this is a limitation that should be lifted so that any number of --driver-- client processes can share executors and share (persistent / cached) data.
> This is especially useful if you have a bunch of frontend servers (dumb web app servers) that want to use Spark as a _big computing machine_. Most important is the fact that Spark is quite good at caching/persisting data in memory / on disk, thus removing load from backend data stores.
> In other words: it would be really great to let different --driver-- client JVMs operate on the same RDDs and benefit from Spark's caching/persistence.
> It would, however, require introducing some administration mechanisms to
> * start a shared context
> * update the executor configuration (# of worker nodes, # of cpus, etc) on the fly
> * stop a shared context
> Even "conventional" batch MR applications would benefit if ran fequently against the same data set.
> As an implicit requirement, RDD persistence could get a TTL for its materialized state.
> With such a feature, the overall performance of today's web applications could then be increased by adding more web app servers, more Spark nodes, more NoSQL nodes, etc.


