Posted to user@spark.apache.org by DrKhu <kh...@gmail.com> on 2014/09/08 17:30:50 UTC

How do you perform blocking IO in apache spark job?

What if, when I traverse an RDD, I need to compute values in the dataset
by calling an external (blocking) service? How do you think that could be
achieved?

val values: Future[RDD[Double]] = Future sequence tasks

I've tried to create a list of Futures, but as an RDD is not Traversable,
Future.sequence is not suitable.
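(For illustration: Future.sequence works on an ordinary Scala collection
of futures, which an RDD is not.)

    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import scala.concurrent.{Await, Future}

    // Fine for a List; there is no equivalent instance for an RDD.
    val tasks: List[Future[Double]] = List(Future(1.0), Future(2.0))
    val all: Future[List[Double]] = Future.sequence(tasks)
    val values: List[Double] = Await.result(all, Duration.Inf)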

I just wonder if anyone has had such a problem, and how did you solve it?
What I'm trying to achieve is parallelism on a single worker node, so I
can call that external service 3000 times per second.

Probably there is another solution more suitable for Spark, like having
multiple worker nodes on a single host.

It's interesting to know: how do you cope with such a challenge? Thanks.





Re: How do you perform blocking IO in apache spark job?

Posted by Jörn Franke <jo...@gmail.com>.
Hi,

So the external service itself creates threads and blocks until they
finish execution? In that case you should not do the threading yourself,
but call the component via JNI directly from Spark; Spark will take care
of the threading for you.
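Something like the following (class, method, and library names are
hypothetical):

    // Hypothetical JNI binding to a native library "calc".
    object NativeCalc {
      System.loadLibrary("calc") // expects libcalc.so / calc.dll on java.library.path
      @native def calculate(input: Double): Double
    }

    // Spark's own task parallelism drives the concurrency; no manual threads.
    val results = rdd.map(x => NativeCalc.calculate(x))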

Best regards




Re: How do you perform blocking IO in apache spark job?

Posted by DrKhu <kh...@gmail.com>.
Hi Jörn, first of all, thanks for your intent to help.

This external service is a native component that is stateless and
performs the calculation based on the data I provide. The data is in an
RDD.

I have that component on each worker node, and I would like to get as
much parallelism as possible on a single worker node.
Using Scala futures I can get it locally, with as many threads as my
machine allows (roughly as sketched below). But how do I do the same in
Spark? Is there a way to call that native component on each worker in
multiple threads?
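(Roughly, the plain-Scala version I mean; callNative is a stand-in name
for the JNI-backed call:)

    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration
    import scala.concurrent.{Await, Future}

    def callNative(x: Double): Double = ???   // stand-in for the JNI call

    val inputs = (1 to 1000).map(_.toDouble)
    val futures = inputs.map(x => Future(callNative(x)))  // one pooled task per call
    val results = Await.result(Future.sequence(futures), Duration.Inf)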

Thanks in advance.





Re: How do you perform blocking IO in apache spark job?

Posted by Jörn Franke <jo...@gmail.com>.
Hi,

What does the external service provide? Data? Calculations? Can the
service push data to you via Kafka and Spark Streaming? Can you fetch the
necessary data from the service beforehand? The solution to your question
depends on your answers.

I would not recommend connecting to a blocking service during Spark job
execution. What do you do if a node crashes? Is the order of service
calls relevant for you?

Best regards

Re: How do you perform blocking IO in apache spark job?

Posted by DrKhu <kh...@gmail.com>.
Thanks, Sean, I'll try to explain what I'm trying to do.

The native component I'm talking about is native code that I call using
JNI. I've written a small test; the original code block did not survive
the plain-text archive, so a rough reconstruction follows.
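(Reconstruction; callNative is a stand-in name for the JNI-backed call,
and the timing logic is assumed.)

    def callNative(x: Double): Double = ???   // stand-in for the JNI call

    val n = 1000
    val start = System.currentTimeMillis()
    val result = (1 to n).map(i => callNative(i.toDouble))  // sequential calls
    val secs = (System.currentTimeMillis() - start) / 1000.0
    println(s"$n calls in $secs s => ${n / secs} req/sec")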



Here, I traverse the collection to call the native component N (1000)
times. The timing output is also lost in the archive; it showed that I'm
able to get 10 req/sec by calling the native component.

And I would like to achieve the same throughput (not less) on a single
node using Spark. So I started a one-node cluster and ran the next piece
of code on it (again reconstructed below):
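(Reconstruction under the same assumptions; sc is the SparkContext, and
partitions = 1000, filtered, and top(10) come from the surrounding text.)

    val n = 1000
    val start = System.currentTimeMillis()

    val doubles = sc
      .parallelize(1 to n, numSlices = 1000)       // partitions = 1000
      .map(i => callNative(i.toDouble))
    val filtered = doubles.filter(_ > 0.0)         // assumed predicate
    val top = filtered.top(10)(Ordering.Double)    // action: triggers the job

    val secs = (System.currentTimeMillis() - start) / 1000.0
    println(s"$n calls in $secs s => ${n / secs} req/sec")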



Here I've provided partitions = 1000, but the response time was not the
same; it was a lot worse (the timing output did not survive the archive).

The operation filtered.top(10)(Ordering.Double) is an action; as I
understand it, that is when the closure inside the map transformation
starts to execute, and the call to the native component blocks there. If
I could make it non-blocking, I would expect an increase in performance.

What do you think?
How would you improve the code? Or which Spark configuration settings
should I look at? (Sorry, I'm quite new to Spark.)





Re: How do you perform blocking IO in apache spark job?

Posted by Sean Owen <so...@cloudera.com>.
What is the driver-side Future for? Are you trying to make the remote
Spark workers execute more requests to your service concurrently? It's
not clear from your messages whether it's something like a web service,
or just local native code.

So the time spent in your processing -- whatever returns Double -- is
mostly waiting for a blocking service to return? I assume the external
service is not at capacity yet and can handle more concurrent requests,
or else there's no point in adding parallelism.

First I'd figure out how many parallel requests the service can handle
before starting to slow down; call it N. It won't help to make more
than N requests in parallel. So first I'd make sure you really are not
yet at that point.

You can make more partitions with repartition(), to have at least N
partitions. Then you want to make sure there are enough executors,
with access to enough cores, to run N tasks concurrently on the
cluster. That should maximize parallelism.
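A minimal sketch of that approach (N, callService, and the submit flag
are illustrative assumptions):

    val N = 48   // assumed: the service's concurrency sweet spot

    val results = rdd
      .repartition(N)             // at least N tasks can now run in parallel
      .map(x => callService(x))   // one blocking call per element

    // The cluster must actually offer N cores, e.g. on a standalone cluster:
    //   spark-submit --total-executor-cores 48 ...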

You can indeed write remote functions that parallelize themselves with
Future (not on the driver side) but I think ideally you get the
parallelism from Spark, absent a reason not to.
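A sketch of that "parallelize inside the task" variant (callService and
the pool size are assumptions):

    import java.util.concurrent.Executors
    import scala.concurrent.duration.Duration
    import scala.concurrent.{Await, ExecutionContext, Future}

    val results = rdd.mapPartitions { iter =>
      // One pool per task; size it to what the service tolerates per node.
      val pool = Executors.newFixedThreadPool(16)
      implicit val ec = ExecutionContext.fromExecutorService(pool)
      try {
        val futures = iter.map(x => Future(callService(x))).toList
        futures.map(f => Await.result(f, Duration.Inf)).iterator
      } finally pool.shutdown()
    }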

