Posted to dev@spark.apache.org by Evan Chan <ve...@gmail.com> on 2014/08/23 00:40:50 UTC

[Spark SQL] off-heap columnar store

Hey guys,

What is the plan for getting Tachyon/off-heap support for the columnar
compressed store?  It's not in 1.1, is it?

In particular:
 - being able to set TACHYON as the caching mode
 - loading of hot columns or all columns
 - write-through of columnar store data to HDFS or backing store
 - being able to start a context and query directly from Tachyon's
cached columnar data

I think most of this was in Shark 0.9.1.
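
For context, here is a rough sketch of the gap as I see it (the table name
and query are placeholders; please correct me if today's behavior differs):

  import org.apache.spark.storage.StorageLevel

  // Plain RDD persistence can already go to Tachyon via OFF_HEAP, but that
  // caches serialized rows, not the compressed columnar format.
  val events = sqlContext.sql("SELECT * FROM events")
  events.persist(StorageLevel.OFF_HEAP)

  // The columnar cache built by cacheTable lives only on the JVM heap today;
  // a TACHYON / off-heap option here is essentially what I'm asking about.
  sqlContext.cacheTable("events")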

Also, how likely is the wire format for the columnar compressed data
to change?  That would be a problem for write-through or persistence.

thanks,
Evan



Re: [Spark SQL] off-heap columnar store

Posted by Henry Saputra <he...@gmail.com>.
Hi Michael,

This is great news.
Any initial proposal or design about the caching to Tachyon that you
can share so far?

I don't think there is a JIRA ticket open to track this feature yet.

- Henry

On Mon, Aug 25, 2014 at 1:13 PM, Michael Armbrust
<mi...@databricks.com> wrote:
>>
>> What is the plan for getting Tachyon/off-heap support for the columnar
>> compressed store?  It's not in 1.1, is it?
>
>
> It is not in 1.1 and there are no concrete plans for adding it at this
> point.  Currently, there is more engineering investment going into caching
> parquet data in Tachyon instead.  This approach is going to have much
> better support for nested data, leverages other work being done on parquet,
> and alleviates your concerns about wire format compatibility.
>
> That said, if someone really wants to try and implement it, I don't think
> it would be very hard.  The primary issue is going to be designing a clean
> interface that is not too tied to this one implementation.
>
>
>> Also, how likely is the wire format for the columnar compressed data
>> to change?  That would be a problem for write-through or persistence.
>>
>
> We aren't making any guarantees at the moment that it won't change.  It's
> currently only intended for temporary caching of data.



Re: [Spark SQL] off-heap columnar store

Posted by Evan Chan <ve...@gmail.com>.
On Sun, Aug 31, 2014 at 8:27 PM, Ian O'Connell <ia...@ianoconnell.com> wrote:
> I'm not sure what you mean here? Parquet is at its core just a format, you
> could store that data anywhere.
>
> Though it sounds like you're saying, correct me if I'm wrong: you basically
> want a columnar abstraction layer where you can provide a different backing
> implementation to keep the columns rather than parquet-mr?
>
> I.e. you want to be able to produce a schema RDD from something like
> Vertica, where updates should act as a write-through cache back to Vertica
> itself?

Something like that.

I'd like,

1)  An API to produce a schema RDD from an RDD of columns, not rows.
However, an RDD[Column] would not make sense, since it would be
spread out across partitions.  Perhaps what is needed is a
Seq[RDD[ColumnSegment]], where each RDD holds the segments for one
column and each segment represents a range of rows (rough sketch
after item 2).  This would be populated from something like Vertica
or Cassandra.

2)  A variant of 1) where you could read this data from Tachyon.
Tachyon is supposed to support a columnar representation of data; it
did for Shark 0.9.x.
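
To make (1) concrete, here is a very rough sketch of the shape I have in
mind; ColumnSegment and schemaRDDFromColumns are made-up names, not
existing Spark APIs:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{SQLContext, SchemaRDD}

  // Hypothetical: one RDD per column, each holding that column's
  // compressed segments, where a segment covers a contiguous range of rows.
  case class ColumnSegment(columnName: String,
                           firstRow: Long,
                           rowCount: Int,
                           compressedBytes: Array[Byte])

  // Hypothetical constructor for (1); a real interface would also need a
  // schema argument and a way to align segments across columns.
  def schemaRDDFromColumns(sqlContext: SQLContext,
                           columns: Seq[RDD[ColumnSegment]]): SchemaRDD = ???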

The goal is basically to load columnar data from something like
Cassandra into Tachyon, getting both the compression ratio of columnar
storage and the speed of InMemoryColumnarTableScan.  If data is
appended to the Tachyon representation, it should be possible to write
it back to the source.  Write-back is not as high a priority, though.

A workaround would be to read data from Cassandra/Vertica/etc. and
write it back out as Parquet, but this would take a long time and
incur huge I/O overhead.

>
> I'm sorry, it just sounds like it's worth clearly defining what your key
> requirement/goal is.
>
>
> On Thu, Aug 28, 2014 at 11:31 PM, Evan Chan <ve...@gmail.com> wrote:
>>
>> >
>> >> The reason I'm asking about the columnar compressed format is that
>> >> there are some problems for which Parquet is not practical.
>> >
>> >
>> > Can you elaborate?
>>
>> Sure.
>>
>> - Organization or company has no Hadoop, but a significant investment in
>> some other NoSQL store.
>> - Need to efficiently add a new column to existing data
>> - Need to mark some existing rows as deleted or replace small bits of
>> existing data
>>
>> For these use cases, it would be much more efficient and practical if
>> we didn't have to take the data out of its original datastore and
>> convert it to Parquet first.  Doing so adds significant latency and
>> causes Ops headaches from having to maintain HDFS.  It would be great
>> to be able to load data directly into the columnar format, into the
>> InMemoryColumnarCache.
>>
>>
>



Re: [Spark SQL] off-heap columnar store

Posted by Ian O'Connell <ia...@ianoconnell.com>.
I'm not sure what you mean here? Parquet is at its core just a format, you
could store that data anywhere.

Though it sounds like you're saying, correct me if I'm wrong: you basically
want a columnar abstraction layer where you can provide a different backing
implementation to keep the columns rather than parquet-mr?

I.e. you want to be able to produce a schema RDD from something like
Vertica, where updates should act as a write-through cache back to Vertica
itself?

I'm sorry, it just sounds like it's worth clearly defining what your key
requirement/goal is.


On Thu, Aug 28, 2014 at 11:31 PM, Evan Chan <ve...@gmail.com> wrote:

> >
> >> The reason I'm asking about the columnar compressed format is that
> >> there are some problems for which Parquet is not practical.
> >
> >
> > Can you elaborate?
>
> Sure.
>
> - Organization or company has no Hadoop, but a significant investment in
> some other NoSQL store.
> - Need to efficiently add a new column to existing data
> - Need to mark some existing rows as deleted or replace small bits of
> existing data
>
> For these use cases, it would be much more efficient and practical if
> we didn't have to take the data out of its original datastore and
> convert it to Parquet first.  Doing so adds significant latency and
> causes Ops headaches from having to maintain HDFS.  It would be great
> to be able to load data directly into the columnar format, into the
> InMemoryColumnarCache.
>
>
>

Re: [Spark SQL] off-heap columnar store

Posted by Evan Chan <ve...@gmail.com>.
>
>> The reason I'm asking about the columnar compressed format is that
>> there are some problems for which Parquet is not practical.
>
>
> Can you elaborate?

Sure.

- Organization or company has no Hadoop, but a significant investment in
some other NoSQL store.
- Need to efficiently add a new column to existing data
- Need to mark some existing rows as deleted or replace small bits of
existing data

For these use cases, it would be much more efficient and practical if
we didn't have to take the data out of its original datastore and
convert it to Parquet first.  Doing so adds significant latency and
causes Ops headaches from having to maintain HDFS.  It would be great
to be able to load data directly into the columnar format, into the
InMemoryColumnarCache.
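
For reference, the only route into the in-memory columnar format that I
know of today is roughly the following; the source RDD, schema, and table
name are placeholders:

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)

  // rowRdd (an RDD[Row] pulled from the NoSQL store) and schema (its
  // StructType) are assumed to exist already.
  val table = sqlContext.applySchema(rowRdd, schema)
  table.registerTempTable("events")

  // cacheTable builds the in-memory columnar representation, but it lives
  // on the JVM heap and the data still has to be materialized row by row.
  sqlContext.cacheTable("events")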



Re: [Spark SQL] off-heap columnar store

Posted by Michael Armbrust <mi...@databricks.com>.
>
> Any initial proposal or design about the caching to Tachyon that you
> can share so far?


Caching parquet files in tachyon with saveAsParquetFile and then reading
them with parquetFile should already work. You can use SQL on these tables
by using registerTempTable.
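
A minimal sketch of that flow (the tachyon:// URL, the input SchemaRDD and
the table name are placeholders, and it assumes the Tachyon client jar is
on the classpath):

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)

  // Write a SchemaRDD out as Parquet into Tachyon...
  someSchemaRDD.saveAsParquetFile("tachyon://host:19998/tables/events")

  // ...then read it back and register it so it can be queried with SQL.
  val events = sqlContext.parquetFile("tachyon://host:19998/tables/events")
  events.registerTempTable("events")
  sqlContext.sql("SELECT COUNT(*) FROM events").collect()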

Some of the general parquet work that we have been doing includes: #1935
<https://github.com/apache/spark/pull/1935>, SPARK-2721
<https://issues.apache.org/jira/browse/SPARK-2721>, SPARK-3036
<https://issues.apache.org/jira/browse/SPARK-3036>, SPARK-3037
<https://issues.apache.org/jira/browse/SPARK-3037> and #1819
<https://github.com/apache/spark/pull/1819>

> The reason I'm asking about the columnar compressed format is that
> there are some problems for which Parquet is not practical.


Can you elaborate?

Re: [Spark SQL] off-heap columnar store

Posted by Evan Chan <ve...@gmail.com>.
What would be the timeline for the parquet caching work?

The reason I'm asking about the columnar compressed format is that
there are some problems for which Parquet is not practical.

On Mon, Aug 25, 2014 at 1:13 PM, Michael Armbrust
<mi...@databricks.com> wrote:
>> What is the plan for getting Tachyon/off-heap support for the columnar
>> compressed store?  It's not in 1.1, is it?
>
>
> It is not in 1.1 and there are no concrete plans for adding it at this
> point.  Currently, there is more engineering investment going into caching
> parquet data in Tachyon instead.  This approach is going to have much better
> support for nested data, leverages other work being done on parquet, and
> alleviates your concerns about wire format compatibility.
>
> That said, if someone really wants to try and implement it, I don't think it
> would be very hard.  The primary issue is going to be designing a clean
> interface that is not too tied to this one implementation.
>
>>
>> Also, how likely is the wire format for the columnar compressed data
>> to change?  That would be a problem for write-through or persistence.
>
>
> We aren't making any guarantees at the moment that it won't change.  It's
> currently only intended for temporary caching of data.



Re: [Spark SQL] off-heap columnar store

Posted by Michael Armbrust <mi...@databricks.com>.
>
> What is the plan for getting Tachyon/off-heap support for the columnar
> compressed store?  It's not in 1.1, is it?


It is not in 1.1 and there are no concrete plans for adding it at this
point.  Currently, there is more engineering investment going into caching
parquet data in Tachyon instead.  This approach is going to have much
better support for nested data, leverages other work being done on parquet,
and alleviates your concerns about wire format compatibility.

That said, if someone really wants to try and implement it, I don't think
it would be very hard.  The primary issue is going to be designing a clean
interface that is not too tied to this one implementation.
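
To be concrete about what I mean by an interface, something shaped roughly
like this; every name below is made up and nothing like it exists in Spark
today:

  // Purely hypothetical sketch of a pluggable store for columnar batches.
  trait ColumnarBlockStore {
    def putBatch(table: String, batchId: Long, columns: Array[Array[Byte]]): Unit
    def getBatch(table: String, batchId: Long, columnIndexes: Seq[Int]): Array[Array[Byte]]
    def dropTable(table: String): Unit
  }

The hard part is deciding where compression, statistics, and the schema
live so that this isn't just a mirror of the current in-memory
implementation.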


> Also, how likely is the wire format for the columnar compressed data
> to change?  That would be a problem for write-through or persistence.
>

We aren't making any guarantees at the moment that it won't change.  It's
currently only intended for temporary caching of data.