Posted to dev@spark.apache.org by Mick Davies <mi...@gmail.com> on 2015/02/01 12:03:31 UTC

Caching tables at column level

I have been working a lot recently with denormalised tables that have a
large number of columns, nearly 600. We use this form to avoid joins.

I have tried to use CACHE TABLE with this data, but it proves too
expensive, as it appears to cache every column in the table.
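
Concretely, what we ran was roughly the following (a sketch; the
SQLContext and the table name are illustrative):

    // Caching the registered table materialises all ~600 columns
    // in the in-memory columnar store.
    sqlContext.cacheTable("fullTable")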

For data sets such as the one I am using, you find that certain columns
are hot, referenced frequently in queries, while others are used very
infrequently.

Therefore it would be great if caches could be column-based. I realise
that this may not be optimal for all use cases, but I think it could be
quite a common need. Has something like this been considered?

Thanks Mick



Re: Caching tables at column level

Posted by Mick Davies <mi...@gmail.com>.
Thanks - we have tried this and it works nicely.



Re: Caching tables at column level

Posted by Michael Armbrust <mi...@databricks.com>.
It's not completely transparent, but you can do something like the
following today:

CACHE TABLE hotData AS SELECT columns, I, care, about FROM fullTable
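
If you are driving this from the Scala API rather than SQL, the
equivalent should be something like the following (a sketch; it assumes
an existing SQLContext named sqlContext with fullTable already
registered, and the column names follow the example above):

    // Select only the hot columns, register the result as a new
    // table, and cache that instead of the full 600-column table.
    val hotData = sqlContext.sql(
      "SELECT columns, I, care, about FROM fullTable")
    hotData.registerTempTable("hotData")
    sqlContext.cacheTable("hotData")

    // Later, free the memory when the hot set is no longer needed:
    sqlContext.uncacheTable("hotData")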
