Posted to user@spark.apache.org by vinod kumar <vi...@gmail.com> on 2015/07/10 06:35:02 UTC

Caching in spark

Hi Guys,

Could anyone please show me how to use Spark's caching feature through
Spark SQL queries?

-Vinod

Re: Caching in spark

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
There was a discussion about this earlier; let me re-post it for you.

For the following code:

     val df = sqlContext.parquetFile(path)

df remains columnar (it actually just reads from the columnar Parquet
file on disk).

For the following code:

     val cdf = df.cache()

cdf is also columnar, but that format is different from Parquet's: when a
DataFrame is cached, Spark SQL turns it into a private in-memory columnar
format.

Some more details about the in-memory columnar structure: it's columnar,
but much simpler than the format Parquet uses. The columnar byte arrays are
split into batches with a fixed row count (configured by
"spark.sql.inMemoryColumnarStorage.batchSize"). Also, each column is
compressed with a compression scheme chosen according to the data type and
statistics of that column. Supported compression schemes include RLE,
DeltaInt, DeltaLong, BooleanBitSet, and DictionaryEncoding.
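
For instance, here is a minimal sketch of tuning the batch size before
caching (assuming Spark 1.x APIs; the app name, path, and batch size below
are just placeholders):

     import org.apache.spark.{SparkConf, SparkContext}
     import org.apache.spark.sql.SQLContext

     val sc = new SparkContext(new SparkConf().setAppName("columnar-cache-demo"))
     val sqlContext = new SQLContext(sc)

     // Rows per columnar batch when caching (the default is 10000).
     sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "20000")

     // Reading Parquet just scans the columnar file on disk...
     val df = sqlContext.parquetFile("hdfs:///tmp/events.parquet")

     // ...while cache() stores it in the in-memory columnar format.
     val cdf = df.cache()
     cdf.count() // an action is needed to actually materialize the cache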

You may find the implementation here:
https://github.com/apache/spark/tree/master/sql/core/src/main/scala/org/apache/spark/sql/columnar

This was originally written by Cheng.

Thanks
Best Regards

On Sun, Jul 12, 2015 at 11:37 PM, Ruslan Dautkhanov <da...@gmail.com>
wrote:

> Hi Akhil,
>
> I'm curious whether RDDs are stored internally in a columnar format as
> well, or whether an RDD is converted to the columnar format only when it
> is cached in the SQL context.
> What about DataFrames?
>
> Thanks!
>
>
> --
> Ruslan Dautkhanov
>
> On Fri, Jul 10, 2015 at 2:07 AM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>>
>> https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
>>
>> Thanks
>> Best Regards
>>
>> On Fri, Jul 10, 2015 at 10:05 AM, vinod kumar <vi...@gmail.com>
>> wrote:
>>
>>> Hi Guys,
>>>
>>> Could anyone please show me how to use Spark's caching feature through
>>> Spark SQL queries?
>>>
>>> -Vinod
>>>
>>
>>
>

Re: Caching in spark

Posted by Ruslan Dautkhanov <da...@gmail.com>.
Hi Akhil,

I'm curious whether RDDs are stored internally in a columnar format as well,
or whether an RDD is converted to the columnar format only when it is cached
in the SQL context.
What about DataFrames?

Thanks!


-- 
Ruslan Dautkhanov

On Fri, Jul 10, 2015 at 2:07 AM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

>
> https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
>
> Thanks
> Best Regards
>
> On Fri, Jul 10, 2015 at 10:05 AM, vinod kumar <vi...@gmail.com>
> wrote:
>
>> Hi Guys,
>>
>> Could anyone please show me how to use Spark's caching feature through
>> Spark SQL queries?
>>
>> -Vinod
>>
>
>

Re: Caching in spark

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
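
For example, here is a minimal sketch of caching through SQL statements
themselves (assuming Spark 1.x APIs; the table name "people" and the
Parquet path are placeholders):

     // Register a DataFrame as a temporary table so it can be queried by name.
     val people = sqlContext.parquetFile("hdfs:///tmp/people.parquet")
     people.registerTempTable("people")

     // Cache it either through the API...
     sqlContext.cacheTable("people")
     // ...or with the equivalent SQL statement:
     sqlContext.sql("CACHE TABLE people")

     // Later queries against the table scan the in-memory columnar cache.
     sqlContext.sql("SELECT name FROM people WHERE age > 21").show()

     // Free the memory when the table is no longer needed.
     sqlContext.uncacheTable("people")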

Thanks
Best Regards

On Fri, Jul 10, 2015 at 10:05 AM, vinod kumar <vi...@gmail.com>
wrote:

> Hi Guys,
>
> Could anyone please show me how to use Spark's caching feature through
> Spark SQL queries?
>
> -Vinod
>