Posted to dev@spark.apache.org by Liquan Pei <li...@gmail.com> on 2014/06/21 02:47:58 UTC

Current status of Sparrow

Hi

What is the current status of Sparrow integration with Spark? I would like
to integrate Sparrow with Spark 1.0 on a 100 node cluster. Any suggestions?

Thanks a lot for your help!
Liquan

Re: Current status of Sparrow

Posted by Kay Ousterhout <ke...@eecs.berkeley.edu>.
Hi Liquan,

Sparrow is not currently integrated into the Spark distribution, so if
you'd like to use Spark with Sparrow, you need to use a forked version of
Spark (https://github.com/kayousterhout/spark/tree/sparrow).  The fork is
based on an older version of Spark, so some work will be needed to bring it
up to date with the latest release; I can help with this.

Unfortunately, there are also a few practical problems with using Sparrow
with Spark that may or may not matter for your target workload.
Sparrow distributes scheduling across many Sparrow schedulers, each
associated with its own Spark driver (this is where Sparrow's
improvements come from: there is no longer a single driver serving as the
bottleneck for your application, and all of the schedulers/drivers share
the same slots for scheduling tasks).  As a result, data stored in Spark's
block manager on one Spark driver (and created as part of a job scheduled
by the associated Sparrow scheduler) cannot be accessed by other Spark
drivers.  If you're storing data in Tachyon, or if different jobs in your
workload have disjoint working sets, this won't be an issue.
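To make the caveat concrete, here is a small conceptual sketch in Python.
This is not Spark or Sparrow code, and every name in it is hypothetical; it
just models the structure described above: schedulers/drivers draw from a
shared pool of task slots, but each driver keeps a private block manager,
so a block cached by one driver is invisible to the others.

```python
# Conceptual model only -- hypothetical names, NOT the Spark/Sparrow API.

class Cluster:
    """All schedulers/drivers share the same pool of task slots."""
    def __init__(self, num_slots):
        self.free_slots = num_slots

    def try_launch_task(self):
        if self.free_slots > 0:
            self.free_slots -= 1
            return True
        return False

    def finish_task(self):
        self.free_slots += 1


class SparrowDriver:
    """Each Sparrow scheduler is paired with its own Spark driver,
    and each driver has its own private block manager."""
    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster
        self.block_manager = {}  # private to this driver

    def run_job(self, block_id, data):
        if self.cluster.try_launch_task():
            # Data created by a job is cached only in the block
            # manager of the driver that scheduled it.
            self.block_manager[block_id] = data
            self.cluster.finish_task()

    def get_block(self, block_id):
        return self.block_manager.get(block_id)


cluster = Cluster(num_slots=4)
driver_a = SparrowDriver("A", cluster)
driver_b = SparrowDriver("B", cluster)

driver_a.run_job("rdd_0", [1, 2, 3])

print(driver_a.get_block("rdd_0"))  # the driver that created the block sees it
print(driver_b.get_block("rdd_0"))  # a different driver sees nothing: None
```

In this toy model, routing a second job for the same data to driver_b would
recompute it from scratch, which is why workloads with disjoint working
sets per driver (or data held in an external store such as Tachyon) avoid
the problem.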

-Kay


On Fri, Jun 20, 2014 at 5:47 PM, Liquan Pei <li...@gmail.com> wrote:

> Hi
>
> What is the current status of Sparrow integration with Spark? I would like
> to integrate Sparrow with Spark 1.0 on a 100 node cluster. Any suggestions?
>
> Thanks a lot for your help!
> Liquan
>