Posted to dev@spark.apache.org by Kazuaki Ishizaki <IS...@jp.ibm.com> on 2016/12/26 01:12:29 UTC

Sharing data in columnar storage between two applications

Here is an interesting discussion to share data in columnar storage 
between two applications.
https://github.com/apache/spark/pull/15219#issuecomment-265835049

One of the ideas is to prepare separate interfaces (or traits) for 
reading and for writing. Each application can then implement only the 
class it needs (e.g. read or write). For example, FiloDB wants to provide 
a columnar storage that can be read from Spark. In that case, it is easy 
to implement only the read APIs for Spark. Two such classes (one for 
reading, one for writing) could be prepared.
However, this may lead to an incompatibility in ColumnarBatch. 
ColumnarBatch keeps a set of ColumnVectors that can be read or written, 
so the ColumnVector class must have both read and write APIs. How can we 
plug in a new ColumnVector that has only read APIs? Here is an example 
that causes an incompatibility:
https://gist.github.com/kiszk/00ab7d0c69f0e598e383cdc8e72bcc4d
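A minimal sketch of the interface split, assuming hypothetical names 
(ReadableColumnVector, WritableColumnVector, ArrayBackedVector, 
ColumnarBatchSketch are illustrations, not Spark's actual classes): a 
read-only trait with a writable sub-trait lets a reader-only application 
implement just the read APIs, while a batch typed against the read-only 
interface can still hold either kind.

```java
// Hypothetical sketch: split the column-vector contract into a read-only
// interface and a writable sub-interface.
import java.util.ArrayList;
import java.util.List;

interface ReadableColumnVector {
    int getInt(int rowId);  // read APIs only
    int numRows();
}

interface WritableColumnVector extends ReadableColumnVector {
    void putInt(int rowId, int value);  // write APIs layered on top
}

// A read-only vector backed by an existing int array (e.g. data owned by
// another application). It never needs to implement the write APIs.
class ArrayBackedVector implements ReadableColumnVector {
    private final int[] data;
    ArrayBackedVector(int[] data) { this.data = data; }
    public int getInt(int rowId) { return data[rowId]; }
    public int numRows() { return data.length; }
}

// A batch declared against the read-only interface accepts both kinds of
// vectors, so read-only implementations remain compatible.
class ColumnarBatchSketch {
    private final List<ReadableColumnVector> columns = new ArrayList<>();
    void addColumn(ReadableColumnVector v) { columns.add(v); }
    ReadableColumnVector column(int i) { return columns.get(i); }
}
```

The incompatibility in the gist above arises when the batch is instead 
typed against the writable class, because then a read-only vector cannot 
be added at all.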

Another possible idea is for both applications to support the Apache 
Arrow APIs. Other approaches are also possible.

What approach would be good for all applications?

Regards,
Kazuaki Ishizaki


Re: Sharing data in columnar storage between two applications

Posted by Mark Hamstra <ma...@clearstorydata.com>.
Yes, this is part of Matei's current research, for which code is not yet
publicly available at all, much less in a form suitable for production use.

On Mon, Dec 26, 2016 at 2:29 AM, Evan Chan <ve...@gmail.com> wrote:

> Looks pretty interesting, but might take a while honestly.
>
> On Dec 25, 2016, at 5:24 PM, Mark Hamstra <ma...@clearstorydata.com> wrote:
>
> Not so much between applications as between multiple frameworks within
> an application, but still related:
> https://cs.stanford.edu/~matei/papers/2017/cidr_weld.pdf

Re: Sharing data in columnar storage between two applications

Posted by Mark Hamstra <ma...@clearstorydata.com>.
Not so much between applications as between multiple frameworks within
an application, but still related:
https://cs.stanford.edu/~matei/papers/2017/cidr_weld.pdf
