Posted to user@spark.apache.org by Tobias Pfeiffer <tg...@preferred.jp> on 2014/09/04 03:30:30 UTC

Multi-tenancy for Spark (Streaming) Applications

Hi,

I am not sure if "multi-tenancy" is the right word, but I am thinking about
a Spark application where multiple users can, say, log into some web
interface and specify a data processing pipeline with streaming source,
processing steps, and output.

Now as far as I know, there can be only one StreamingContext per JVM, and I
also cannot add sources or processing steps once it has been started. Are
there any ideas/suggestions for how to achieve dynamic adding and removing
of input sources and processing pipelines? Do I need a separate 'java'
process per user?
Also, can I realize such a thing when using YARN for dynamic allocation?
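Just to illustrate the constraint I mean, here is a minimal sketch; the
socket source and the map step are only placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SingleUserPipeline {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("pipeline-for-one-user")
        val ssc  = new StreamingContext(conf, Seconds(10))

        // all sources and processing steps have to be defined here ...
        val lines = ssc.socketTextStream("localhost", 9999)  // placeholder source
        lines.map(_.toUpperCase).print()                      // placeholder step

        ssc.start()            // ... because the DStream graph is fixed from here on
        ssc.awaitTermination()
      }
    }

As far as I understand, the whole graph above is fixed once start() has been
called, and I would like to create and tear down many such pipelines at
runtime, one per user.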

Thanks
Tobias

Re: Multi-tenancy for Spark (Streaming) Applications

Posted by Tobias Pfeiffer <tg...@preferred.jp>.
Hi,

by now I think I understand a bit better how spark-submit and YARN play
together, and how the Spark driver and its executors interact on YARN.

Now for my use case, as described on <
https://spark.apache.org/docs/latest/submitting-applications.html>, I would
probably have an end-user-facing gateway that submits my Spark (Streaming)
application to the YARN cluster in yarn-cluster mode.

I have a couple of questions regarding that setup:
* That gateway does not need to be written in Scala or Java; it actually
has no contact with the Spark libraries, since it just executes a program on
the command line ("./spark-submit ..."), right? (See the first sketch after
this list.)
* Since my application is a streaming application, it won't finish by
itself. What is the best way to terminate the application on the cluster
from my gateway program? Can I just send SIGTERM to the spark-submit
process? Is that recommended?
* I guess there are many possibilities to achieve this, but what is a good
way to send commands/instructions to the running Spark application? If I
want to push commands from the gateway to the Spark driver, I guess I need
to get its IP address first; how would I do that? If I want the Spark driver
to pull its instructions instead, what is a good way to do so? Any
suggestions? (See the second sketch after this list.)
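For the first point, I picture the gateway doing something like the
following (the main class, jar path and per-user argument are made up; the
only assumption is that spark-submit is installed on the gateway machine):

    import scala.sys.process._

    // The gateway never links against Spark; it only runs the command line.
    val cmd = Seq(
      "/opt/spark/bin/spark-submit",
      "--master", "yarn-cluster",
      "--class", "com.example.StreamingPipeline",  // hypothetical main class
      "/opt/jobs/streaming-pipeline.jar",          // hypothetical application jar
      "--user-id", "42"                            // hypothetical per-user argument
    )
    val submit = Process(cmd).run()  // returns immediately, spark-submit runs in background

Since the gateway could be written in any language, the same could of course
be done with a plain fork/exec.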
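For the second and third point, one idea I had is to let the driver itself
pull a stop/control signal from some external place instead of the gateway
pushing it. In the rough sketch below, checkForStopCommand() is a
placeholder for whatever mechanism that would be (a marker file on HDFS, a
ZooKeeper node, an HTTP poll, ...):

    import org.apache.spark.streaming.StreamingContext

    // Driver-side control poller, started just before ssc.start().
    def startControlPoller(ssc: StreamingContext,
                           checkForStopCommand: () => Boolean): Unit = {
      val poller = new Thread("control-poller") {
        override def run(): Unit = {
          while (!checkForStopCommand()) {
            Thread.sleep(5000)                     // poll every few seconds
          }
          // process the data already received, then shut everything down
          ssc.stop(stopSparkContext = true, stopGracefully = true)
        }
      }
      poller.setDaemon(true)
      poller.start()
    }

With something like that, spark-submit should terminate when the streaming
application itself stops, so maybe SIGTERM is not needed at all. But I don't
know if that is the intended way to do it.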

Thanks,
Tobias

Re: Multi-tenancy for Spark (Streaming) Applications

Posted by Tobias Pfeiffer <tg...@preferred.jp>.
Hi,

On Thu, Sep 4, 2014 at 10:33 AM, Tathagata Das <ta...@gmail.com>
wrote:

> In the current state of Spark Streaming, creating separate Java processes
> each having a streaming context is probably the best approach to
> dynamically adding and removing input sources. All of these should be
> able to use a YARN cluster for resource allocation.
>

So, for example, I would write a server application that accepts a command
like "createNewInstance" and then calls spark-submit, pushing my actual
application to the YARN cluster? Or could I use spark-jobserver?

Thanks
Tobias

Re: Multi-tenancy for Spark (Streaming) Applications

Posted by Tathagata Das <ta...@gmail.com>.
In the current state of Spark Streaming, creating separate Java processes
each having a streaming context is probably the best approach to
dynamically adding and removing input sources. All of these should be
able to use a YARN cluster for resource allocation.


On Wed, Sep 3, 2014 at 6:30 PM, Tobias Pfeiffer <tg...@preferred.jp> wrote:

> Hi,
>
> I am not sure if "multi-tenancy" is the right word, but I am thinking
> about a Spark application where multiple users can, say, log into some web
> interface and specify a data processing pipeline with streaming source,
> processing steps, and output.
>
> Now as far as I know, there can be only one StreamingContext per JVM, and I
> also cannot add sources or processing steps once it has been started. Are
> there any ideas/suggestions for how to achieve dynamic adding and removing
> of input sources and processing pipelines? Do I need a separate
> 'java' process per user?
> Also, can I realize such a thing when using YARN for dynamic allocation?
>
> Thanks
> Tobias
>