Posted to user@spark.apache.org by Utkarsh Sengar <ut...@gmail.com> on 2015/08/27 22:32:09 UTC

Porting a multi-threaded, compute-intensive job to Spark

I am working on code which uses an ExecutorService to parallelize tasks
(think machine learning computations done over a small dataset over and over
again).
My goal is to execute some code as fast as possible, many times over, and
store the results somewhere (the total will be on the order of 100M runs at
least).

The logic looks something like this (it's a simplified example):

dbconn = new dbconn() //This is reused by all threads
for a in listOfSize1000:
   for b in listofSize10:
      for c in listOfSize2:
         taskcompletionexecutorservice.submit(new runner(a, b, c, dbconn))

At the end, taskcompletionexecutorservice.take() is called and I store the
Result from the "Future<Result>" in a db.
But this approach stops scaling after a point.
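
For concreteness, the current (non-Spark) version looks roughly like this.
It is only a sketch: Runner (which implements Callable<Result>), Result,
DbConn and the element types A, B and C are placeholders for my real
classes, and the pool size is arbitrary.

import java.util.List;
import java.util.concurrent.*;

class MultiThreadedVersion {
    void runAll(List<A> listOfSize1000, List<B> listofSize10, List<C> listOfSize2)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(16);
        CompletionService<Result> taskcompletionexecutorservice =
                new ExecutorCompletionService<>(pool);

        DbConn dbconn = new DbConn();   // reused by all threads
        int submitted = 0;
        for (A a : listOfSize1000)
            for (B b : listofSize10)
                for (C c : listOfSize2) {
                    taskcompletionexecutorservice.submit(new Runner(a, b, c, dbconn));
                    submitted++;
                }

        // Drain results as tasks finish and store them in the db.
        for (int i = 0; i < submitted; i++) {
            Result r = taskcompletionexecutorservice.take().get(); // blocks until a task completes
            dbconn.store(r);
        }
        pool.shutdown();
    }
}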

So this is what I am doing right now in Spark (which is a brutal hack, but
I am looking for suggestions on how best to structure this):

sparkContext.parallelize(listOfSize1000).filter(a -> {
   dbconn = new dbconn() //Cannot init it outside parallelize since it's not serializable
   for b in listofSize10:
      for c in listOfSize2:
         Result r = new runner(a, b, c, dbconn)
         dbconn.store(r)

    return true //It serves no purpose.
}).count();

This approach looks inefficient to me since it's not truly parallelizing on
the smallest unit of work, although the job works alright. Also, count() is
not really doing anything for me; I added it just to trigger the execution.
That part was inspired by the Pi estimation example here:
http://spark.apache.org/examples.html

So, any suggestions on how I can better structure my Spark runner so that I
can use the Spark executors efficiently?
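
In case it helps frame the question, this untested sketch is roughly the
kind of restructuring I had in mind: expand every (a, b, c) combination into
its own RDD record so Spark schedules the smallest unit of work, create the
(non-serializable) db connection once per partition on the executor, and use
foreachPartition as the action instead of the dummy count(). As above,
Runner, Result, DbConn, A, B and C are placeholders, sparkContext is assumed
to be a JavaSparkContext, and Tuple3 is scala.Tuple3.

// Expand all (a, b, c) combinations up front; the A/B/C elements themselves
// need to be serializable for this to work.
List<Tuple3<A, B, C>> combos = new ArrayList<>();
for (A a : listOfSize1000)
   for (B b : listofSize10)
      for (C c : listOfSize2)
         combos.add(new Tuple3<>(a, b, c));

JavaRDD<Tuple3<A, B, C>> rdd = sparkContext.parallelize(combos, 200); // partition count is a guess

// foreachPartition is an action, so no dummy count() is needed, and the db
// connection is created once per partition on the executor side.
rdd.foreachPartition(iter -> {
   DbConn dbconn = new DbConn();
   while (iter.hasNext()) {
      Tuple3<A, B, C> t = iter.next();
      Result r = new Runner(t._1(), t._2(), t._3(), dbconn).call();
      dbconn.store(r);
   }
   dbconn.close(); // assuming DbConn has some cleanup method
});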

-- 
Thanks,
-Utkarsh