Posted to dev@spark.apache.org by Jun Feng Liu <li...@cn.ibm.com> on 2014/12/10 09:25:37 UTC

HA support for Spark

Do we have any high-availability support at the Spark driver level? For
example, we would like the Spark driver to be able to move to another node
and continue execution when a failure happens. I can see that RDD
checkpointing helps serialize the state of an RDD, and I can imagine
loading the checkpoint from another node after an error, but it seems we
would lose track of all the task status, and even the executor
information, maintained in the SparkContext. I am not sure whether there
is anything existing I can leverage to do that. Thanks for any
suggestions.
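As a toy illustration of the gap described here (pure Python, not Spark code; every name below is hypothetical): a checkpoint can persist the data, but the live bookkeeping a driver holds only in memory does not survive a restart.

```python
import os
import pickle
import tempfile

# "Driver" state: the dataset can be checkpointed, but the task/executor
# bookkeeping lives only in the driver's memory.
dataset = [x * x for x in range(10)]
task_status = {"task-0": "RUNNING", "task-1": "FINISHED"}  # in-memory only

# Checkpoint: only the dataset is written out.
ckpt = os.path.join(tempfile.mkdtemp(), "rdd.ckpt")
with open(ckpt, "wb") as f:
    pickle.dump(dataset, f)

# Simulate a driver failure and recovery on another node: reload the
# checkpoint into a fresh process.
del dataset, task_status
with open(ckpt, "rb") as f:
    recovered = pickle.load(f)

recovered_task_status = {}  # task/executor state was never checkpointed

print(recovered[:3])          # data survives: [0, 1, 4]
print(recovered_task_status)  # bookkeeping is lost: {}
```

The data comes back intact, but the replacement "driver" starts with empty task and executor bookkeeping, which is exactly the loss described above.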
 
Best Regards

Jun Feng Liu
IBM China Systems & Technology Laboratory in Beijing
Phone: 86-10-82452683
E-mail: liujunf@cn.ibm.com
BLD 28, ZGC Software Park
No.8 Rd. Dong Bei Wang West, Dist. Haidian, Beijing 100193
China

Re: HA support for Spark

Posted by Jun Feng Liu <li...@cn.ibm.com>.
Interesting. Are you saying that the StreamingContext checkpoint can regenerate the DAG?

 





Re: HA support for Spark

Posted by Tathagata Das <ta...@gmail.com>.
Spark Streaming essentially does this by saving the DAG of DStreams, which
can deterministically regenerate the DAG of RDDs upon recovery from
failure. Along with that the progress information (which batches have
finished, which batches are queued, etc.) is also saved, so that upon
recovery the system can restart from where it was before failure. This was
conceptually easy to do because the RDDs are very deterministically
generated in every batch. Extending this to a very general Spark program
with arbitrary RDD computations is definitely conceptually possible but not
that easy to do.
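The recovery scheme described above can be sketched in miniature (pure Python, not Spark's actual classes; the checkpoint layout is invented for illustration): because each batch is generated deterministically from its batch time, saving only the generation recipe plus progress metadata is enough for a new driver to recompute the pending batches.

```python
import json
import os
import tempfile

def generate_batch(batch_time):
    # Deterministic: the same batch time always yields the same "RDD".
    return [batch_time * 10 + i for i in range(3)]

# Progress metadata saved alongside the DStream DAG: which batches have
# finished and which are still queued.
checkpoint = {"finished": [0, 1], "queued": [2, 3]}
path = os.path.join(tempfile.mkdtemp(), "progress.json")
with open(path, "w") as f:
    json.dump(checkpoint, f)

# --- the driver fails; a new driver starts on another node ---
with open(path) as f:
    recovered = json.load(f)

# Recovery: deterministically regenerate exactly the batches that were
# queued at failure time, picking up where the old driver left off.
replayed = {t: generate_batch(t) for t in recovered["queued"]}
print(replayed)  # {2: [20, 21, 22], 3: [30, 31, 32]}
```

This only works because `generate_batch` is deterministic, which mirrors the point above: arbitrary RDD computations in a general Spark program offer no such guarantee.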


Re: HA support for Spark

Posted by Jun Feng Liu <li...@cn.ibm.com>.
Right, perhaps we would also need to preserve some DAG information? I am
wondering whether there is any existing work on this.



                                                                           

Re: HA support for Spark

Posted by Sandy Ryza <sa...@cloudera.com>.
I think that if we were able to maintain the full set of created RDDs as
well as some scheduler and block manager state, it would be enough for most
apps to recover.
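One way to read this suggestion (a toy sketch in pure Python; none of this is Spark API, and the journal format is invented): if the driver journaled the lineage of every created RDD, i.e. the operations rather than the data, a replacement driver could rebuild each RDD by replaying that lineage, with scheduler and block manager state restored separately.

```python
import json

# Named transformations that both the old and new driver agree on.
OPS = {
    "map_square": lambda xs: [x * x for x in xs],
    "filter_even": lambda xs: [x for x in xs if x % 2 == 0],
}

# What a journal entry for one RDD might look like: a recipe, not
# materialized data.
journal = json.dumps({
    "rdd_id": 7,
    "source": [1, 2, 3, 4, 5],
    "lineage": ["map_square", "filter_even"],
})

def rebuild(entry_json):
    """Replay the journaled lineage to reconstruct the RDD's contents."""
    entry = json.loads(entry_json)
    data = entry["source"]
    for op_name in entry["lineage"]:
        data = OPS[op_name](data)
    return data

print(rebuild(journal))  # [4, 16]
```

The catch, as noted elsewhere in the thread, is that real closures are arbitrary JVM code rather than a fixed table of named operations, which is what makes this hard for a general Spark program.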


Re: HA support for Spark

Posted by Jun Feng Liu <li...@cn.ibm.com>.
Well, it should not be mission impossible, considering how many HA
solutions exist today. I would be interested to know whether there are any
specific difficulties.
 




Re: HA support for Spark

Posted by Reynold Xin <rx...@databricks.com>.
This would be plausible for specific purposes such as Spark Streaming or
Spark SQL, but I don't think it is doable for the general Spark driver,
since it is just a normal JVM process with arbitrary program state.
