You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@spark.apache.org by Debasish Das <de...@gmail.com> on 2014/08/20 18:02:55 UTC

Akka usage in Spark

Hi,

There have been some recent changes in the way akka is used in spark and I
feel they are major changes...

Is there a design document / JIRA / experiment on large datasets that
highlight the impact of changes (1.0 vs 1.1) ? Basically it will be great
to understand where akka is used in the code base...

If I don't have to broadcast big variables but use akka's programming model
(use actors directly) on Spark's actorsystem is that allowed ? I understand
that it might look hacky :-)

Thanks.
Deb

Re: Akka usage in Spark

Posted by Mayur Rustagi <ma...@gmail.com>.

The stream receiver seems to leverage actor receivers
     http://spark.apache.org/docs/0.8.1/streaming-custom-receivers.html
But spark system doesnt lend itself to a messaging kind of a structure..
more of a DAG kind
Just curious are you looking for the actor subsystem to act on messages or
just looking to use them as a local/distributed message bus

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Thu, Aug 21, 2014 at 4:04 AM, Debasish Das <de...@gmail.com>
wrote:

> Yeah that's the one we discussed...sorry I pointed to a different one that
> I was reading...
>
>
> On Wed, Aug 20, 2014 at 3:28 PM, DB Tsai <db...@dbtsai.com> wrote:
>
> > To be specific, I was discussing this PR with Debasish which reduces
> > lots of issues when sending big objects to executors without using
> > broadcast explicitly.
> >
> > Broadcast RDD object once per TaskSet (instead of sending it for every
> > task)
> > https://issues.apache.org/jira/browse/SPARK-2521
> >
> > Sincerely,
> >
> > DB Tsai
> > -------------------------------------------------------
> > My Blog: https://www.dbtsai.com
> > LinkedIn: https://www.linkedin.com/in/dbtsai
> >
> >
> > On Wed, Aug 20, 2014 at 3:19 PM, Debasish Das <de...@gmail.com>
> > wrote:
> > > Hi Patrick,
> > >
> > > Last few days I came across some bugs which got exposed due to ALS runs
> > on
> > > large scale data...although it was not related to the akka changes but
> > > during the debug I found across some akka related changes that might
> have
> > > an impact of overall performance...one example is the following:
> > >
> > > https://github.com/apache/spark/pull/1907
> > >
> > > @dbtsai explained it to me a bit yesterday that in 1.1 RDDs are no
> longer
> > > sent through akka msgs but over http-channels...If there is a document
> > > detailing the architecture that is currently in-place (like how the
> core
> > > changed from 1.0 to 1.1) it will help a lot in debugging the jobs which
> > are
> > > built upon the libraries like mllib and optimize them further for
> > > efficiency...
> > >
> > > For using the Spark actor system directly:
> > >
> > > I spent few weeks December 2013 to make the Scalafish code (
> > > https://github.com/azymnis/scalafish) operational on 10 nodes...It
> uses
> > > scalding for matrix partitioning and actorSystem to coordinate the
> > > updates...It is a cool use of akka but getting an actor system
> > operational
> > > is difficult...
> > >
> > > Since Spark already has tested version of actor system running on both
> > > standalone and yarn modes, I am planning to port scalafish to spark
> using
> > > actor model...That's one of the use-cases I am looking for...
> > >
> > > Another use-case that I am considering is to send msgs directly from
> > kafka
> > > queues to spark actorSystem for processing to get Storm like
> > > latency...basically window sizes of 1-2 ms and no overhead of using an
> > RDD
> > > if possible...
> > >
> > > Thanks.
> > > Deb
> > >
> > >
> > > On Wed, Aug 20, 2014 at 1:42 PM, Patrick Wendell <pw...@gmail.com>
> > wrote:
> > >
> > >> Hey Deb,
> > >>
> > >> Can you be specific what changes you are mentioning? We have not, to
> my
> > >> knowledge, made major architectural changes around akka use.
> > >>
> > >> I think in general we don't want people to be using Spark's actor
> system
> > >> directly - it is an internal communication component in Spark and
> could
> > >> e.g. be re-factored later to not use akka at all. Could you elaborate
> a
> > bit
> > >> more on your use case?
> > >>
> > >> - Patrick
> > >>
> > >>
> > >> On Wed, Aug 20, 2014 at 9:02 AM, Debasish Das <
> debasish.das83@gmail.com
> > >
> > >> wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> There have been some recent changes in the way akka is used in spark
> > and I
> > >>> feel they are major changes...
> > >>>
> > >>> Is there a design document / JIRA / experiment on large datasets that
> > >>> highlight the impact of changes (1.0 vs 1.1) ? Basically it will be
> > great
> > >>> to understand where akka is used in the code base...
> > >>>
> > >>> If I don't have to broadcast big variables but use akka's programming
> > >>> model
> > >>> (use actors directly) on Spark's actorsystem is that allowed ? I
> > >>> understand
> > >>> that it might look hacky :-)
> > >>>
> > >>> Thanks.
> > >>> Deb
> > >>>
> > >>
> > >>
> >
>

Re: Akka usage in Spark

Posted by Debasish Das <de...@gmail.com>.

Yeah that's the one we discussed...sorry I pointed to a different one that
I was reading...


On Wed, Aug 20, 2014 at 3:28 PM, DB Tsai <db...@dbtsai.com> wrote:

> To be specific, I was discussing this PR with Debasish which reduces
> lots of issues when sending big objects to executors without using
> broadcast explicitly.
>
> Broadcast RDD object once per TaskSet (instead of sending it for every
> task)
> https://issues.apache.org/jira/browse/SPARK-2521
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Wed, Aug 20, 2014 at 3:19 PM, Debasish Das <de...@gmail.com>
> wrote:
> > Hi Patrick,
> >
> > Last few days I came across some bugs which got exposed due to ALS runs
> on
> > large scale data...although it was not related to the akka changes but
> > during the debug I found across some akka related changes that might have
> > an impact of overall performance...one example is the following:
> >
> > https://github.com/apache/spark/pull/1907
> >
> > @dbtsai explained it to me a bit yesterday that in 1.1 RDDs are no longer
> > sent through akka msgs but over http-channels...If there is a document
> > detailing the architecture that is currently in-place (like how the core
> > changed from 1.0 to 1.1) it will help a lot in debugging the jobs which
> are
> > built upon the libraries like mllib and optimize them further for
> > efficiency...
> >
> > For using the Spark actor system directly:
> >
> > I spent few weeks December 2013 to make the Scalafish code (
> > https://github.com/azymnis/scalafish) operational on 10 nodes...It uses
> > scalding for matrix partitioning and actorSystem to coordinate the
> > updates...It is a cool use of akka but getting an actor system
> operational
> > is difficult...
> >
> > Since Spark already has tested version of actor system running on both
> > standalone and yarn modes, I am planning to port scalafish to spark using
> > actor model...That's one of the use-cases I am looking for...
> >
> > Another use-case that I am considering is to send msgs directly from
> kafka
> > queues to spark actorSystem for processing to get Storm like
> > latency...basically window sizes of 1-2 ms and no overhead of using an
> RDD
> > if possible...
> >
> > Thanks.
> > Deb
> >
> >
> > On Wed, Aug 20, 2014 at 1:42 PM, Patrick Wendell <pw...@gmail.com>
> wrote:
> >
> >> Hey Deb,
> >>
> >> Can you be specific what changes you are mentioning? We have not, to my
> >> knowledge, made major architectural changes around akka use.
> >>
> >> I think in general we don't want people to be using Spark's actor system
> >> directly - it is an internal communication component in Spark and could
> >> e.g. be re-factored later to not use akka at all. Could you elaborate a
> bit
> >> more on your use case?
> >>
> >> - Patrick
> >>
> >>
> >> On Wed, Aug 20, 2014 at 9:02 AM, Debasish Das <debasish.das83@gmail.com
> >
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> There have been some recent changes in the way akka is used in spark
> and I
> >>> feel they are major changes...
> >>>
> >>> Is there a design document / JIRA / experiment on large datasets that
> >>> highlight the impact of changes (1.0 vs 1.1) ? Basically it will be
> great
> >>> to understand where akka is used in the code base...
> >>>
> >>> If I don't have to broadcast big variables but use akka's programming
> >>> model
> >>> (use actors directly) on Spark's actorsystem is that allowed ? I
> >>> understand
> >>> that it might look hacky :-)
> >>>
> >>> Thanks.
> >>> Deb
> >>>
> >>
> >>
>

Re: Akka usage in Spark

Posted by DB Tsai <db...@dbtsai.com>.

To be specific, I was discussing this PR with Debasish which reduces
lots of issues when sending big objects to executors without using
broadcast explicitly.

Broadcast RDD object once per TaskSet (instead of sending it for every task)
https://issues.apache.org/jira/browse/SPARK-2521

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Wed, Aug 20, 2014 at 3:19 PM, Debasish Das <de...@gmail.com> wrote:
> Hi Patrick,
>
> Last few days I came across some bugs which got exposed due to ALS runs on
> large scale data...although it was not related to the akka changes but
> during the debug I found across some akka related changes that might have
> an impact of overall performance...one example is the following:
>
> https://github.com/apache/spark/pull/1907
>
> @dbtsai explained it to me a bit yesterday that in 1.1 RDDs are no longer
> sent through akka msgs but over http-channels...If there is a document
> detailing the architecture that is currently in-place (like how the core
> changed from 1.0 to 1.1) it will help a lot in debugging the jobs which are
> built upon the libraries like mllib and optimize them further for
> efficiency...
>
> For using the Spark actor system directly:
>
> I spent few weeks December 2013 to make the Scalafish code (
> https://github.com/azymnis/scalafish) operational on 10 nodes...It uses
> scalding for matrix partitioning and actorSystem to coordinate the
> updates...It is a cool use of akka but getting an actor system operational
> is difficult...
>
> Since Spark already has tested version of actor system running on both
> standalone and yarn modes, I am planning to port scalafish to spark using
> actor model...That's one of the use-cases I am looking for...
>
> Another use-case that I am considering is to send msgs directly from kafka
> queues to spark actorSystem for processing to get Storm like
> latency...basically window sizes of 1-2 ms and no overhead of using an RDD
> if possible...
>
> Thanks.
> Deb
>
>
> On Wed, Aug 20, 2014 at 1:42 PM, Patrick Wendell <pw...@gmail.com> wrote:
>
>> Hey Deb,
>>
>> Can you be specific what changes you are mentioning? We have not, to my
>> knowledge, made major architectural changes around akka use.
>>
>> I think in general we don't want people to be using Spark's actor system
>> directly - it is an internal communication component in Spark and could
>> e.g. be re-factored later to not use akka at all. Could you elaborate a bit
>> more on your use case?
>>
>> - Patrick
>>
>>
>> On Wed, Aug 20, 2014 at 9:02 AM, Debasish Das <de...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> There have been some recent changes in the way akka is used in spark and I
>>> feel they are major changes...
>>>
>>> Is there a design document / JIRA / experiment on large datasets that
>>> highlight the impact of changes (1.0 vs 1.1) ? Basically it will be great
>>> to understand where akka is used in the code base...
>>>
>>> If I don't have to broadcast big variables but use akka's programming
>>> model
>>> (use actors directly) on Spark's actorsystem is that allowed ? I
>>> understand
>>> that it might look hacky :-)
>>>
>>> Thanks.
>>> Deb
>>>
>>
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org

Re: Akka usage in Spark

Posted by Debasish Das <de...@gmail.com>.

Hi Patrick,

Last few days I came across some bugs which got exposed due to ALS runs on
large scale data...although it was not related to the akka changes but
during the debug I found across some akka related changes that might have
an impact of overall performance...one example is the following:

https://github.com/apache/spark/pull/1907

@dbtsai explained it to me a bit yesterday that in 1.1 RDDs are no longer
sent through akka msgs but over http-channels...If there is a document
detailing the architecture that is currently in-place (like how the core
changed from 1.0 to 1.1) it will help a lot in debugging the jobs which are
built upon the libraries like mllib and optimize them further for
efficiency...

For using the Spark actor system directly:

I spent few weeks December 2013 to make the Scalafish code (
https://github.com/azymnis/scalafish) operational on 10 nodes...It uses
scalding for matrix partitioning and actorSystem to coordinate the
updates...It is a cool use of akka but getting an actor system operational
is difficult...

Since Spark already has tested version of actor system running on both
standalone and yarn modes, I am planning to port scalafish to spark using
actor model...That's one of the use-cases I am looking for...

Another use-case that I am considering is to send msgs directly from kafka
queues to spark actorSystem for processing to get Storm like
latency...basically window sizes of 1-2 ms and no overhead of using an RDD
if possible...

Thanks.
Deb

On Wed, Aug 20, 2014 at 1:42 PM, Patrick Wendell <pw...@gmail.com> wrote:

> Hey Deb,
>
> Can you be specific what changes you are mentioning? We have not, to my
> knowledge, made major architectural changes around akka use.
>
> I think in general we don't want people to be using Spark's actor system
> directly - it is an internal communication component in Spark and could
> e.g. be re-factored later to not use akka at all. Could you elaborate a bit
> more on your use case?
>
> - Patrick
>
>
> On Wed, Aug 20, 2014 at 9:02 AM, Debasish Das <de...@gmail.com>
> wrote:
>
>> Hi,
>>
>> There have been some recent changes in the way akka is used in spark and I
>> feel they are major changes...
>>
>> Is there a design document / JIRA / experiment on large datasets that
>> highlight the impact of changes (1.0 vs 1.1) ? Basically it will be great
>> to understand where akka is used in the code base...
>>
>> If I don't have to broadcast big variables but use akka's programming
>> model
>> (use actors directly) on Spark's actorsystem is that allowed ? I
>> understand
>> that it might look hacky :-)
>>
>> Thanks.
>> Deb
>>
>
>

Re: Akka usage in Spark

Posted by Patrick Wendell <pw...@gmail.com>.

Hey Deb,

Can you be specific what changes you are mentioning? We have not, to my
knowledge, made major architectural changes around akka use.

I think in general we don't want people to be using Spark's actor system
directly - it is an internal communication component in Spark and could
e.g. be re-factored later to not use akka at all. Could you elaborate a bit
more on your use case?

- Patrick

On Wed, Aug 20, 2014 at 9:02 AM, Debasish Das <de...@gmail.com>
wrote:

> Hi,
>
> There have been some recent changes in the way akka is used in spark and I
> feel they are major changes...
>
> Is there a design document / JIRA / experiment on large datasets that
> highlight the impact of changes (1.0 vs 1.1) ? Basically it will be great
> to understand where akka is used in the code base...
>
> If I don't have to broadcast big variables but use akka's programming model
> (use actors directly) on Spark's actorsystem is that allowed ? I understand
> that it might look hacky :-)
>
> Thanks.
> Deb
>