Posted to user@spark.apache.org by vinod kumar <vi...@gmail.com> on 2015/07/10 06:35:02 UTC
Caching in spark
Hi Guys,
Can anyone please share how to use the caching feature of Spark via Spark
SQL queries?
-Vinod
Re: Caching in spark
Posted by Akhil Das <ak...@sigmoidanalytics.com>.
There was an earlier discussion on this; let me re-post it for you.
For the following code:
val df = sqlContext.parquetFile(path)
df remains columnar (actually it just reads from the columnar Parquet
file on disk).
For the following code:
val cdf = df.cache()
cdf is also columnar but that's different from Parquet. When a DataFrame
is cached, Spark SQL turns it into a private in-memory columnar format.
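
To make the two snippets above concrete, here is a minimal sketch of the read-then-cache flow. It assumes the Spark 1.x SQLContext API used in this thread; the object name, app name, local master, and the Parquet path taken from the first program argument are placeholders of mine, not something from the original discussion.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CacheDataFrameSketch {
  def main(args: Array[String]): Unit = {
    // App name and local master are placeholders for illustration.
    val conf = new SparkConf().setAppName("cache-dataframe-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Assumption: the first program argument is the path to an existing Parquet file.
    val path = args(0)

    // Reads straight from the columnar Parquet file on disk.
    val df = sqlContext.parquetFile(path)

    // Marks the DataFrame for caching; cache() returns the same DataFrame.
    val cdf = df.cache()

    // The first action scans Parquet and builds the in-memory columnar data.
    println(cdf.count())
    // Later actions on cdf can be answered from the cached data.
    println(cdf.count())

    // Drop the cached data when it is no longer needed.
    cdf.unpersist()
    sc.stop()
  }
}
```

Note that cache() on its own does not read anything; the in-memory columnar data is only built when the first action runs.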
Some more details about the in-memory columnar structure: it's columnar,
but much simpler than the one Parquet uses. The columnar byte arrays are
split into batches with a fixed row count (configured by
"spark.sql.inMemoryColumnarStorage.batchSize"). Also, each column is
compressed with a compression scheme chosen according to the data type and
statistics information of that column. Supported compression schemes
include RLE, DeltaInt, DeltaLong, BooleanBitSet, and DictionaryEncoding.
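
As a rough illustration of the two ideas in the paragraph above (fixed-row-count batches, then a per-column compression scheme such as RLE), here is a toy sketch in plain Scala. This is not Spark's implementation: the real code works on byte buffers and picks a scheme from column statistics. The names ColumnBatchSketch, toBatches, rleEncode and rleDecode are made up for this example.

```scala
object ColumnBatchSketch {
  // One run of identical values: (value, run length).
  type Run[A] = (A, Int)

  // Split a column's values into batches of at most batchSize rows each.
  def toBatches[A](column: Seq[A], batchSize: Int): List[Seq[A]] =
    column.grouped(batchSize).toList

  // Run-length encode one batch: consecutive equal values collapse into one run.
  def rleEncode[A](batch: Seq[A]): Seq[Run[A]] =
    batch.foldLeft(List.empty[Run[A]]) {
      case ((value, count) :: rest, next) if value == next => (value, count + 1) :: rest
      case (runs, next) => (next, 1) :: runs
    }.reverse

  // Expand runs back into the original values.
  def rleDecode[A](runs: Seq[Run[A]]): Seq[A] =
    runs.flatMap { case (value, count) => Seq.fill(count)(value) }

  def main(args: Array[String]): Unit = {
    val column = Seq("US", "US", "US", "DE", "DE", "US", "FR", "FR")
    // With batchSize = 4 the column is cut into two batches, each encoded separately.
    val batches = toBatches(column, batchSize = 4)
    batches.map(rleEncode).foreach(println)
    // prints:
    // List((US,3), (DE,1))
    // List((DE,1), (US,1), (FR,2))
  }
}
```

Because each batch is encoded on its own, the "DE" run that straddles the batch boundary is stored as two separate runs. That is the trade-off of fixed-size batches in this toy version.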
You may find the implementation here:
https://github.com/apache/spark/tree/master/sql/core/src/main/scala/org/apache/spark/sql/columnar
This was originally written by Cheng.
Thanks
Best Regards
On Sun, Jul 12, 2015 at 11:37 PM, Ruslan Dautkhanov <da...@gmail.com>
wrote:
> Hi Akhil,
>
> I'm curious whether RDDs are also stored internally in a columnar format.
> Or is an RDD converted to columnar format only when it is cached in a
> SQL context?
> What about DataFrames?
>
> Thanks!
>
>
> --
> Ruslan Dautkhanov
Re: Caching in spark
Posted by Ruslan Dautkhanov <da...@gmail.com>.
Hi Akhil,
I'm curious whether RDDs are also stored internally in a columnar format.
Or is an RDD converted to columnar format only when it is cached in a
SQL context?
What about DataFrames?
Thanks!
--
Ruslan Dautkhanov
On Fri, Jul 10, 2015 at 2:07 AM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:
>
> https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
>
> Thanks
> Best Regards
Re: Caching in spark
Posted by Akhil Das <ak...@sigmoidanalytics.com>.
https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
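
For the "via Spark SQL queries" part of the question, a minimal sketch along the lines of that guide section might look like the following. It assumes the Spark 1.x SQLContext API; the table name "people", the sample rows, and the chosen batch size are hypothetical values picked for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CacheTableSketch {
  def main(args: Array[String]): Unit = {
    // App name and local master are placeholders for illustration.
    val conf = new SparkConf().setAppName("cache-table-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Optional tuning of the in-memory columnar cache (values are examples only).
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")

    // Hypothetical sample data registered under a hypothetical table name.
    val people = sc.parallelize(Seq((1, "Ann"), (2, "Bob"), (3, "Cy"))).toDF("id", "name")
    people.registerTempTable("people")

    // Option 1: cache the table through the SQLContext API.
    sqlContext.cacheTable("people")
    sqlContext.sql("SELECT name FROM people WHERE id > 1").show()
    println(sqlContext.isCached("people"))
    sqlContext.uncacheTable("people")

    // Option 2: cache the table with SQL statements.
    sqlContext.sql("CACHE TABLE people")
    sqlContext.sql("SELECT COUNT(*) FROM people").show()
    sqlContext.sql("UNCACHE TABLE people")

    sc.stop()
  }
}
```

Either way, queries against "people" read from the in-memory columnar data once it has been built, and uncaching frees that memory again.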
Thanks
Best Regards
On Fri, Jul 10, 2015 at 10:05 AM, vinod kumar <vi...@gmail.com>
wrote:
> Hi Guys,
>
> Can anyone please share how to use the caching feature of Spark via Spark
> SQL queries?
>
> -Vinod
>