Posted to user@spark.apache.org by Sadhan Sood <sa...@gmail.com> on 2014/11/13 00:16:32 UTC

Cache sparkSql data without uncompressing it in memory

We noticed that, when caching data from our Hive tables, which store their
data in compressed sequence file format, the data gets uncompressed in memory
as it is cached. Is there a way to turn this off and cache the compressed
data as is?

Re: Cache sparkSql data without uncompressing it in memory

Posted by Cheng Lian <li...@gmail.com>.
Hm… Have you tuned |spark.storage.memoryFraction|? By default, 60% of
memory is used for caching. You can find the details here:
http://spark.apache.org/docs/latest/configuration.html
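
For illustration, a minimal sketch of raising that fraction when building the
SparkContext; the 0.8 value and application name below are assumptions for
this example, not settings taken from the thread:

    import org.apache.spark.{SparkConf, SparkContext}

    // Give the block manager a larger share of the executor heap for caching.
    // The default is 0.6; 0.8 is only an illustrative value.
    val conf = new SparkConf()
      .setAppName("cache-tuning-sketch")            // hypothetical app name
      .set("spark.storage.memoryFraction", "0.8")

    val sc = new SparkContext(conf)

The same setting can also be passed on the command line, e.g.
spark-submit --conf spark.storage.memoryFraction=0.8 ...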

On 11/15/14 5:43 AM, Sadhan Sood wrote:

> Thanks Cheng, that was helpful. I noticed from the UI that only half of
> the memory per executor was being used for caching; is that right? We
> have a 2 TB sequence file dataset that we wanted to cache in a cluster
> with ~5 TB of memory, but caching still failed: from the UI it appeared
> to use 2.5 TB of memory and write almost 12 TB to disk (at which point
> it was useless) during the mapPartitions stage. Also, we couldn't run
> more than 2 executors per box (60 GB of memory per box) or they died
> very quickly with the smaller memory per executor (not sure why?),
> although I/O seemed to go much faster, which makes sense because of
> the additional parallel reads.
>
> On Thu, Nov 13, 2014 at 10:50 PM, Cheng Lian <lian.cs.zju@gmail.com> wrote:
>
>     No, the columnar buffer is built in small batches; the batch size
>     is controlled by the |spark.sql.inMemoryColumnarStorage.batchSize|
>     property. The default value for this in master and branch-1.2 is
>     10,000 rows per batch.
>
>     On 11/14/14 1:27 AM, Sadhan Sood wrote:
>
>>     Thanks Cheng. Just one more question: does that mean we still
>>     need enough memory in the cluster to uncompress the data before
>>     it can be compressed again, or does it just read the raw data
>>     as is?
>>
>>     On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian
>>     <lian.cs.zju@gmail.com> wrote:
>>
>>         Currently there’s no way to cache the compressed sequence
>>         file directly. Spark SQL uses an in-memory columnar format
>>         when caching table rows, so we must read all the raw data
>>         and convert it into the columnar format. However, you can
>>         enable in-memory columnar compression by setting
>>         |spark.sql.inMemoryColumnarStorage.compressed| to |true|.
>>         This property is already set to true by default in the
>>         master branch and branch-1.2.
>>
>>         On 11/13/14 7:16 AM, Sadhan Sood wrote:
>>
>>>         We noticed that, when caching data from our Hive tables,
>>>         which store their data in compressed sequence file format,
>>>         the data gets uncompressed in memory as it is cached. Is
>>>         there a way to turn this off and cache the compressed data
>>>         as is?

Re: Cache sparkSql data without uncompressing it in memory

Posted by Sadhan Sood <sa...@gmail.com>.
Thanks Cheng, that was helpful. I noticed from the UI that only half of the
memory per executor was being used for caching; is that right? We have a
2 TB sequence file dataset that we wanted to cache in a cluster with ~5 TB
of memory, but caching still failed: from the UI it appeared to use 2.5 TB
of memory and write almost 12 TB to disk (at which point it was useless)
during the mapPartitions stage. Also, we couldn't run more than 2 executors
per box (60 GB of memory per box) or they died very quickly with the smaller
memory per executor (not sure why?), although I/O seemed to go much faster,
which makes sense because of the additional parallel reads.

On Thu, Nov 13, 2014 at 10:50 PM, Cheng Lian <li...@gmail.com> wrote:

>  No, the columnar buffer is built in small batches; the batch size is
> controlled by the spark.sql.inMemoryColumnarStorage.batchSize property.
> The default value for this in master and branch-1.2 is 10,000 rows per
> batch.
>
> On 11/14/14 1:27 AM, Sadhan Sood wrote:
>
>   Thanks Cheng. Just one more question: does that mean we still need
> enough memory in the cluster to uncompress the data before it can be
> compressed again, or does it just read the raw data as is?
>
> On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian <li...@gmail.com>
> wrote:
>
>>  Currently there’s no way to cache the compressed sequence file
>> directly. Spark SQL uses an in-memory columnar format when caching table
>> rows, so we must read all the raw data and convert it into the columnar
>> format. However, you can enable in-memory columnar compression by setting
>> spark.sql.inMemoryColumnarStorage.compressed to true. This property is
>> already set to true by default in the master branch and branch-1.2.
>>
>> On 11/13/14 7:16 AM, Sadhan Sood wrote:
>>
>> We noticed that, when caching data from our Hive tables, which store
>> their data in compressed sequence file format, the data gets uncompressed
>> in memory as it is cached. Is there a way to turn this off and cache the
>> compressed data as is?

Re: Cache sparkSql data without uncompressing it in memory

Posted by Cheng Lian <li...@gmail.com>.
No, the columnar buffer is built in small batches; the batch size is
controlled by the |spark.sql.inMemoryColumnarStorage.batchSize| property.
The default value for this in master and branch-1.2 is 10,000 rows per
batch.
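
As a sketch of how that batch size could be tuned before caching a table
(not from the original message; the table name and the 1,000-row value are
assumptions for this example):

    import org.apache.spark.sql.hive.HiveContext

    // Assumes an existing SparkContext named `sc` (e.g. in spark-shell).
    val sqlContext = new HiveContext(sc)

    // Build smaller columnar batches; the default on master/branch-1.2 is
    // 10,000 rows per batch, and 1,000 here is only an illustrative value.
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "1000")

    // `my_hive_table` is a hypothetical table name; its rows are converted
    // to the in-memory columnar format one batch at a time while caching.
    sqlContext.cacheTable("my_hive_table")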

On 11/14/14 1:27 AM, Sadhan Sood wrote:

> Thanks Cheng. Just one more question: does that mean we still need
> enough memory in the cluster to uncompress the data before it can be
> compressed again, or does it just read the raw data as is?
>
> On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian <lian.cs.zju@gmail.com> wrote:
>
>     Currently there’s no way to cache the compressed sequence file
>     directly. Spark SQL uses an in-memory columnar format when caching
>     table rows, so we must read all the raw data and convert it into
>     the columnar format. However, you can enable in-memory columnar
>     compression by setting
>     |spark.sql.inMemoryColumnarStorage.compressed| to |true|. This
>     property is already set to true by default in the master branch
>     and branch-1.2.
>
>     On 11/13/14 7:16 AM, Sadhan Sood wrote:
>
>>     We noticed that, when caching data from our Hive tables, which
>>     store their data in compressed sequence file format, the data
>>     gets uncompressed in memory as it is cached. Is there a way to
>>     turn this off and cache the compressed data as is?

Re: Cache sparkSql data without uncompressing it in memory

Posted by Sadhan Sood <sa...@gmail.com>.
Thanks Cheng. Just one more question: does that mean we still need enough
memory in the cluster to uncompress the data before it can be compressed
again, or does it just read the raw data as is?

On Wed, Nov 12, 2014 at 10:05 PM, Cheng Lian <li...@gmail.com> wrote:

>  Currently there’s no way to cache the compressed sequence file directly.
> Spark SQL uses an in-memory columnar format when caching table rows, so we
> must read all the raw data and convert it into the columnar format. However,
> you can enable in-memory columnar compression by setting
> spark.sql.inMemoryColumnarStorage.compressed to true. This property is
> already set to true by default in the master branch and branch-1.2.
>
> On 11/13/14 7:16 AM, Sadhan Sood wrote:
>
>   We noticed that, when caching data from our Hive tables, which store
> their data in compressed sequence file format, the data gets uncompressed
> in memory as it is cached. Is there a way to turn this off and cache the
> compressed data as is?

Re: Cache sparkSql data without uncompressing it in memory

Posted by Cheng Lian <li...@gmail.com>.
Currently there’s no way to cache the compressed sequence file directly.
Spark SQL uses an in-memory columnar format when caching table rows, so we
must read all the raw data and convert it into the columnar format.
However, you can enable in-memory columnar compression by setting
|spark.sql.inMemoryColumnarStorage.compressed| to |true|. This property
is already set to true by default in the master branch and branch-1.2.
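
A minimal sketch of what enabling this could look like (not from the original
message; the table name is hypothetical, and on master/branch-1.2 the setConf
call is redundant because the property already defaults to true):

    import org.apache.spark.sql.hive.HiveContext

    // Assumes an existing SparkContext named `sc` (e.g. in spark-shell).
    val sqlContext = new HiveContext(sc)

    // Compress the in-memory columnar buffers that are built while caching.
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")

    // `my_hive_table` is a hypothetical table name. The compressed sequence
    // file is still decompressed and re-encoded into columnar format on read;
    // only the resulting column buffers are compressed in memory.
    sqlContext.cacheTable("my_hive_table")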

On 11/13/14 7:16 AM, Sadhan Sood wrote:

> We noticed that, when caching data from our Hive tables, which store
> their data in compressed sequence file format, the data gets uncompressed
> in memory as it is cached. Is there a way to turn this off and cache the
> compressed data as is?
