Posted to dev@spark.apache.org by Rachana Srivastava <Ra...@markmonitor.com> on 2015/09/12 05:07:28 UTC

Multithreaded vs Spark Executor

Hello all,

We are getting a stream of input data from a Kafka queue using the Spark Streaming API.  For each data element we want to run parallel threads to process a set of feature lists (nearly 100 features or more).  Since the computation of each feature list is independent of the others, we would like to execute these feature computations in parallel on the input data we get from the Kafka queue.
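
Roughly, the ingestion side of our pipeline looks like the sketch below (simplified; the topic name, group id, and batch interval here are placeholders, not our real settings):

import java.util.Collections;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

SparkConf conf = new SparkConf().setAppName("feature-pipeline");
JavaStreamingContext jssc =
    new JavaStreamingContext(conf, Durations.seconds(5));

// One receiver thread on one topic; the message values are the input elements.
Map<String, Integer> topicMap = Collections.singletonMap("input-topic", 1);
JavaPairReceiverInputDStream<String, String> messages =
    KafkaUtils.createStream(jssc, "zookeeper:2181", "feature-group", topicMap);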

Our questions are:

1. Should we write a thread pool and manage the execution of these features on different threads in parallel?  Our only concern is that, because of data locality, we are confined to the node that is assigned the input data from the Kafka stream, so we cannot leverage distributed nodes to process these features for a single input element.

2. Or, since we are using a JavaRDD as the feature list, will the execution of these features be managed internally by the Spark executors?

Thanks,

Rachana

Re: Multithreaded vs Spark Executor

Posted by Richard Eggert <ri...@gmail.com>.
Parallel processing is what Spark was made for. Let it do its job. Spawning
your own threads independently of what Spark is doing seems like you'd just
be asking for trouble.

I think you can accomplish what you want by taking the Cartesian product of
the data element RDD and the feature list RDD and then performing the
computation as a map operation that takes each (data element, feature)
tuple as input.
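
For example, something like this (a rough sketch only; the "Feature" type
and its "evaluate" method are made up to show the shape of it, and it assumes
the ~100x blow-up of each batch from the Cartesian product is acceptable):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

// Assume, inside foreachRDD on the Kafka DStream: "events" is one
// micro-batch of input elements (JavaRDD<String>), and "features" is
// the ~100 independent feature definitions (JavaRDD<Feature>).
JavaPairRDD<String, Feature> pairs = events.cartesian(features);

// Each (element, feature) pair is now an independent record, so Spark
// can schedule the feature evaluations across all executors.
JavaRDD<Double> scores = pairs.map(p -> p._2().evaluate(p._1()));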

Rich