Posted to dev@spark.apache.org by Jun Feng Liu <li...@cn.ibm.com> on 2014/12/10 14:47:12 UTC
Tachyon in Spark
Does Spark today really leverage Tachyon lineage to process data? It seems
like the application should call the createDependency function in TachyonFS to
create a new lineage node, but I did not find any place in the Spark code
that calls it. Did I miss anything?
Best Regards
Jun Feng Liu
IBM China Systems & Technology Laboratory in Beijing
Phone: 86-10-82452683
E-mail: liujunf@cn.ibm.com
BLD 28,ZGC Software Park
No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193
China
Re: Tachyon in Spark
Posted by Jun Feng Liu <li...@cn.ibm.com>.
Thanks for the response. I got the point: it sounds like today's Spark lineage
is not pushed down to Tachyon lineage. It would be good to see how that works.
Jun Feng Liu.
Haoyuan Li <haoyuan.li@gmail.com>
2014-12-13 00:17
To
Jun Feng Liu/China/IBM@IBMCN,
cc
Reynold Xin <rx...@databricks.com>, Andrew Ash <an...@andrewash.com>,
"dev@spark.apache.org" <de...@spark.apache.org>
Subject
Re: Tachyon in Spark
Junfeng, by "off-heap solution", did you mean "rdd.persist(OFF_HEAP)"?
That feature is separate from the lineage feature. You can already use
rdd.persist(OFF_HEAP) with Tachyon, without a problem, on any Spark version
later than 1.0.0.
Regarding Reynold's last email, those are good points. Tachyon has provided
this for a while. We are working on enhancing this feature and on its
integration with Spark.
Thanks,
Haoyuan
--
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/
Re: Tachyon in Spark
Posted by Haoyuan Li <ha...@gmail.com>.
Junfeng, by "off-heap solution", did you mean "rdd.persist(OFF_HEAP)"?
That feature is separate from the lineage feature. You can already use
rdd.persist(OFF_HEAP) with Tachyon, without a problem, on any Spark version
later than 1.0.0.
Regarding Reynold's last email, those are good points. Tachyon has provided
this for a while. We are working on enhancing this feature and on its
integration with Spark.
Thanks,
Haoyuan
On Fri, Dec 12, 2014 at 5:06 AM, Jun Feng Liu <li...@cn.ibm.com> wrote:
>
> I think lineage is the key feature Tachyon uses to reproduce an RDD when
> an error happens. Otherwise, data would have to be replicated among
> Tachyon nodes to provide redundancy for fault tolerance, and I think
> Tachyon is trying to avoid that path. Does it mean the off-heap solution
> is not ready yet if Tachyon lineage does not work right now?
>
--
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/
Re: Tachyon in Spark
Posted by Jun Feng Liu <li...@cn.ibm.com>.
I think lineage is the key feature Tachyon uses to reproduce an RDD when
an error happens. Otherwise, data would have to be replicated among
Tachyon nodes to provide redundancy for fault tolerance, and I think
Tachyon is trying to avoid that path. Does it mean the off-heap solution
is not ready yet if Tachyon lineage does not work right now?
Best Regards
Jun Feng Liu
IBM China Systems & Technology Laboratory in Beijing
Phone: 86-10-82452683
E-mail: liujunf@cn.ibm.com
BLD 28,ZGC Software Park
No.8 Rd.Dong Bei Wang West, Dist.Haidian Beijing 100193
China
Reynold Xin <rx...@databricks.com>
2014/12/12 10:22
To
Andrew Ash <an...@andrewash.com>,
cc
Jun Feng Liu/China/IBM@IBMCN, "dev@spark.apache.org"
<de...@spark.apache.org>
Subject
Re: Tachyon in Spark
Actually HY emailed me offline about this and this is supported in the
latest version of Tachyon. It is a hard problem to push this into storage;
need to think about how to handle isolation, resource allocation, etc.
https://github.com/amplab/tachyon/blob/master/core/src/main/java/tachyon/master/Dependency.java
Re: Tachyon in Spark
Posted by Reynold Xin <rx...@databricks.com>.
Actually, HY emailed me offline about this, and lineage is supported in the
latest version of Tachyon. Pushing it into storage is a hard problem; we
need to think about how to handle isolation, resource allocation, etc.
https://github.com/amplab/tachyon/blob/master/core/src/main/java/tachyon/master/Dependency.java
On Thu, Dec 11, 2014 at 3:54 PM, Reynold Xin <rx...@databricks.com> wrote:
> I don't think the lineage thing is even turned on in Tachyon - it was
> mostly a research prototype, so I don't think it'd make sense for us to use
> that.
Re: Tachyon in Spark
Posted by Reynold Xin <rx...@databricks.com>.
I don't think the lineage thing is even turned on in Tachyon - it was
mostly a research prototype, so I don't think it'd make sense for us to use
that.
On Thu, Dec 11, 2014 at 3:51 PM, Andrew Ash <an...@andrewash.com> wrote:
> I'm interested in understanding this as well. One of the main ways Tachyon
> is supposed to realize performance gains without sacrificing durability is
> by storing the lineage of data rather than full copies of it (similar to
> Spark). But if Spark isn't sending lineage information into Tachyon, then
> I'm not sure how this isn't a durability concern.
Re: Tachyon in Spark
Posted by Andrew Ash <an...@andrewash.com>.
I'm interested in understanding this as well. One of the main ways Tachyon
is supposed to realize performance gains without sacrificing durability is
by storing the lineage of data rather than full copies of it (similar to
Spark). But if Spark isn't sending lineage information into Tachyon, then
I'm not sure how this isn't a durability concern.
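To make the trade-off concrete, here is a rough sketch of lineage-based
recovery in plain Python. It is only an illustration under stated assumptions:
LineageNode, compute and lose_cache are hypothetical names invented for this
example, and nothing here models the actual TachyonFS createDependency call or
Spark's RDD internals. Each dataset keeps a pointer to its parent and the
function that derived it, so a lost in-memory copy is rebuilt by recomputation
rather than read from a replica.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class LineageNode:
    """Hypothetical dataset that records how it was derived (its lineage)."""
    name: str
    parent: Optional["LineageNode"] = None
    transform: Optional[Callable[[List[int]], List[int]]] = None
    source: Optional[List[int]] = None  # durable input; only set on root nodes
    cache: Optional[List[int]] = None   # in-memory copy that may be lost

    def compute(self) -> List[int]:
        """Return the data, rebuilding it from lineage if the cache is gone."""
        if self.cache is not None:
            return self.cache
        if self.parent is None:
            # Root node: re-read the durable source instead of recomputing.
            if self.source is None:
                raise ValueError(f"root node {self.name!r} has no source")
            self.cache = list(self.source)
        else:
            if self.transform is None:
                raise ValueError(f"node {self.name!r} has no transform")
            # Derived node: recompute from the parent (recursively if needed).
            self.cache = self.transform(self.parent.compute())
        return self.cache

    def lose_cache(self) -> None:
        """Simulate a failure that drops the in-memory copy."""
        self.cache = None


root = LineageNode("input", source=[1, 2, 3, 4])
doubled = LineageNode("doubled", parent=root,
                      transform=lambda xs: [2 * x for x in xs])

before = doubled.compute()
doubled.lose_cache()   # the derived data is lost...
root.lose_cache()      # ...and so is the cached copy of its parent
after = doubled.compute()  # both are rebuilt from the durable source
```

The point of the sketch is that only the root needs durable storage; every
derived dataset costs a recomputation on failure instead of extra copies. If
the storage layer is never told the lineage, as asked about above, it has no
such recipe and can only fall back on whatever copies it already holds.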
On Wed, Dec 10, 2014 at 5:47 AM, Jun Feng Liu <li...@cn.ibm.com> wrote:
> Does Spark today really leverage Tachyon lineage to process data? It seems
> like the application should call the createDependency function in TachyonFS
> to create a new lineage node, but I did not find any place in the Spark
> code that calls it. Did I miss anything?