Posted to dev@spark.apache.org by Mridul Muralidharan <mr...@gmail.com> on 2014/04/06 01:30:02 UTC

ephemeral storage level in spark ?

Hi,

  We have a requirement to use (potentially) ephemeral storage that is
not within the VM but is strongly tied to a worker node. The source of
truth for a block would still be within Spark, but to actually do the
computation we would need to copy the data to an external device. The
data might lie around there for a while, so data locality really helps:
a later computation on the same block can avoid a second copy if the
data is already present on the device.

I was wondering if the recently added storage level for Tachyon would
help in this case (note: Tachyon itself won't help; only the storage
level might).
What sort of guarantees does it provide? How extensible is it? Or is
it strongly tied to Tachyon, with only a generic name?


Thanks,
Mridul

Re: ephemeral storage level in spark ?

Posted by Matei Zaharia <ma...@gmail.com>.

The off-heap storage level is currently tied to Tachyon, but it might support other forms of off-heap storage later. However, it’s not really designed to be mixed with the other storage levels. For this use case, you may want to rely on memory locality and have some custom code to push the data to the accelerator. If you can think of a general way to extend the storage level concept to handle this, though, do send a proposal.

Matei
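
As an illustration only (this is not code from the thread), the "custom code to push the data to the accelerator" idea could be sketched roughly as below. It assumes a per-worker cache that remembers which blocks have already been copied to the device, so a repeated computation on the same block skips the copy. The class DeviceBlockCache, the function fake_copy, and the block ID used here are all hypothetical; the "device" is simulated with a plain tuple so the sketch runs without Spark or any real hardware.

```python
from typing import Callable, Dict, Hashable, List


class DeviceBlockCache:
    """Tracks which blocks already live on a (simulated) external device.

    Hypothetical helper: one instance per worker process. Spark remains the
    source of truth; the device copy is ephemeral and can be dropped freely.
    """

    def __init__(self, copy_to_device: Callable[[List[float]], object]):
        self._copy_to_device = copy_to_device  # assumed host-to-device copy
        self._resident: Dict[Hashable, object] = {}
        self.copies = 0  # number of host-to-device copies performed so far

    def get(self, block_id: Hashable, host_data: List[float]) -> object:
        """Return the device handle for block_id, copying only on a miss."""
        if block_id not in self._resident:
            self._resident[block_id] = self._copy_to_device(host_data)
            self.copies += 1
        return self._resident[block_id]

    def evict(self, block_id: Hashable) -> None:
        """Drop the device copy; the block can always be copied again."""
        self._resident.pop(block_id, None)


def fake_copy(data: List[float]) -> tuple:
    # Stand-in for a real transfer; a tuple plays the role of device memory.
    return tuple(data)


cache = DeviceBlockCache(fake_copy)
block = [1.0, 2.0, 3.0]

first = sum(cache.get("rdd_0_0", block))   # miss: copies to the device
second = sum(cache.get("rdd_0_0", block))  # hit: reuses the device copy
cache.evict("rdd_0_0")
third = sum(cache.get("rdd_0_0", block))   # miss again after eviction
```

In a real job, an object like this would presumably be held once per executor and called from inside a per-partition function, relying on Spark's preference for scheduling tasks where a cached partition already lives. Whether that locality is reliable enough for a given accelerator setup is exactly the open question in this thread.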

On Apr 5, 2014, at 5:14 PM, Mridul Muralidharan <mr...@gmail.com> wrote:

> No, I am thinking along lines of writing to an accelerator card or
> dedicated card with its own memory.
> 
> Regards,
> Mridul

Re: ephemeral storage level in spark ?

Posted by Mridul Muralidharan <mr...@gmail.com>.

No, I am thinking along the lines of writing to an accelerator card or
a dedicated card with its own memory.

Regards,
Mridul
On Apr 6, 2014 5:19 AM, "Haoyuan Li" <ha...@gmail.com> wrote:

> Hi Mridul,
>
> Do you mean the scenario that different Spark applications need to read the
> same raw data, which is stored in a remote cluster or machines. And the
> goal is to load the remote raw data only once?
>
> Haoyuan

Re: ephemeral storage level in spark ?

Posted by Haoyuan Li <ha...@gmail.com>.

Hi Mridul,

Do you mean the scenario where different Spark applications need to read
the same raw data, which is stored in a remote cluster or on remote
machines, and the goal is to load the remote raw data only once?

Haoyuan


On Sat, Apr 5, 2014 at 4:30 PM, Mridul Muralidharan <mr...@gmail.com> wrote:

> Hi,
>
>   We have a requirement to use a (potential) ephemeral storage, which
> is not within the VM, which is strongly tied to a worker node. So
> source of truth for a block would still be within spark; but to
> actually do computation, we would need to copy data to external device
> (where it might lie around for a while : so data locality really
> really helps if we can avoid a subsequent copy if it is already
> present on computations on same block again).
>
> I was wondering if the recently added storage level for tachyon would
> help in this case (note, tachyon wont help; just the storage level
> might).
> What sort of guarantees does it provide ? How extensible is it ? Or is
> it strongly tied to tachyon with only a generic name ?
>
>
> Thanks,
> Mridul
>



-- 
Haoyuan Li
Algorithms, Machines, People Lab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/