Posted to user@spark.apache.org by Prithish <pr...@gmail.com> on 2016/11/14 09:05:22 UTC

AVRO File size when caching in-memory

Can someone please explain why this happens?

When I read a 600kb AVRO file and cache this in memory (using cacheTable),
it shows up as 11mb (storage tab in Spark UI). I have tried this with
different file sizes, and the size in-memory is always proportionate. I
thought Spark compresses when using cacheTable.
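
For reference, the steps are roughly the following, run in spark-shell (the path and table name below are placeholders, not my real ones, and the package coordinate assumes the Scala 2.11 build):

// spark-shell --packages com.databricks:spark-avro_2.11:3.0.1
// Sketch of the steps described above; the path and table name are placeholders.
val df = spark.read
  .format("com.databricks.spark.avro")
  .load("/tmp/sample.avro")          // ~600kb on disk

df.createOrReplaceTempView("events")
spark.catalog.cacheTable("events")   // lazy: nothing is cached yet
spark.table("events").count()        // materialize the cache

// The Storage tab (http://localhost:4040/storage/) now reports roughly 11mb for "events".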

RE: AVRO File size when caching in-memory

Posted by Shreya Agarwal <sh...@microsoft.com>.
Ah, yes. Nested schemas should be avoided if you want the best memory usage.

Sent from my Windows 10 phone

From: Prithish<ma...@gmail.com>
Sent: Wednesday, November 16, 2016 12:48 AM
To: Takeshi Yamamuro<ma...@gmail.com>
Cc: Shreya Agarwal<ma...@microsoft.com>; user@spark.apache.org<ma...@spark.apache.org>
Subject: Re: AVRO File size when caching in-memory

It's something like the schema shown below (with several additional levels/sublevels)

root
 |-- sentAt: long (nullable = true)
 |-- sharing: string (nullable = true)
 |-- receivedAt: long (nullable = true)
 |-- ip: string (nullable = true)
 |-- story: struct (nullable = true)
 |    |-- super: string (nullable = true)
 |    |-- lang: string (nullable = true)
 |    |-- setting: string (nullable = true)
 |    |-- myapp: struct (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- ver: string (nullable = true)
 |    |    |-- build: string (nullable = true)
 |    |-- comp: struct (nullable = true)
 |    |    |-- notes: string (nullable = true)
 |    |    |-- source: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- sub: string (nullable = true)
 |    |-- loc: struct (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- country: string (nullable = true)
 |    |    |-- lat: double (nullable = true)
 |    |    |-- long: double (nullable = true)

On Wed, Nov 16, 2016 at 2:08 PM, Takeshi Yamamuro <li...@gmail.com>> wrote:
Hi,

What's the schema as interpreted by Spark?
The compression logic of Spark's caching depends on the column types.
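
For example, you can check what Spark sees like this ("events" is just a placeholder for whatever table you cached). AFAIK the per-column encodings only apply to flat primitive columns, while struct columns are stored as opaque binary, so they do not compress the same way:

// Print the schema and column types Spark has inferred for the cached table.
spark.table("events").printSchema()
spark.table("events").dtypes.foreach { case (name, tpe) => println(s"$name: $tpe") }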

// maropu


On Wed, Nov 16, 2016 at 5:26 PM, Prithish <pr...@gmail.com>> wrote:
Thanks for your response.

I did some more tests and I am seeing that when I have a flatter structure for my AVRO, the cache memory use is close to the CSV. But when I use a few levels of nesting, the cache memory usage blows up. This is really critical for planning the cluster we will be using. To avoid using a larger cluster, it looks like we will have to keep the structure as flat as possible.
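
By keeping it flat I mean selecting the nested fields out into top-level columns before caching, something like this (field names are from my schema, the aliases are just examples, and the path is a placeholder):

import org.apache.spark.sql.functions.col

val df = spark.read
  .format("com.databricks.spark.avro")
  .load("/tmp/sample.avro")                    // placeholder path

// Pull the nested struct fields up into flat, primitive top-level columns.
val flat = df.select(
  col("sentAt"),
  col("sharing"),
  col("receivedAt"),
  col("ip"),
  col("story.lang").as("story_lang"),
  col("story.myapp.id").as("myapp_id"),
  col("story.myapp.ver").as("myapp_ver"),
  col("story.loc.city").as("loc_city"),
  col("story.loc.country").as("loc_country"),
  col("story.loc.lat").as("loc_lat"),
  col("story.loc.long").as("loc_long")         // ...and so on for the other fields
)

flat.createOrReplaceTempView("events_flat")
spark.catalog.cacheTable("events_flat")
spark.table("events_flat").count()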

On Wed, Nov 16, 2016 at 1:18 PM, Shreya Agarwal <sh...@microsoft.com>> wrote:
(Adding user@spark back to the discussion)

Well, the CSV vs AVRO might be simpler to explain. CSV has a lot of scope for compression. On the other hand avro and parquet are already compressed and just store extra schema info, afaik. Avro and parquet are both going to make your data smaller, parquet through compressed columnar storage, and avro through its binary data format.

Next, talking about the 62kb becoming 1224kb. I actually do not see such a massive blow up. The avro you shared is 28kb on my system and becomes 53.7kb when cached in memory deserialized and 52.9kb when cached in memory serialized. Exact same numbers with parquet as well. This is expected behavior, if I am not wrong.

In fact, now that I think about it, even larger blow-ups might be valid, since your data must have been deserialized from the compressed avro format, making it bigger. The order of magnitude of difference in size would depend on the type of data you have and how well it was compressible.

The purpose of these formats is to store data to persistent storage in a way that's faster to read from, not to reduce cache-memory usage.
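
If it helps, the comparison I am describing is roughly the following (a sketch; the path is a placeholder for the sample you sent):

import org.apache.spark.storage.StorageLevel

val df = spark.read
  .format("com.databricks.spark.avro")
  .load("/tmp/sample.avro")          // placeholder path

// Deserialized in-memory cache; check the size in the Storage tab, then release it.
df.persist(StorageLevel.MEMORY_ONLY)
df.count()
df.unpersist(blocking = true)

// Serialized in-memory cache of the same data, for comparison.
df.persist(StorageLevel.MEMORY_ONLY_SER)
df.count()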

Maybe others here have more info to share.

Regards,
Shreya

Sent from my Windows 10 phone

From: Prithish<ma...@gmail.com>
Sent: Tuesday, November 15, 2016 11:04 PM
To: Shreya Agarwal<ma...@microsoft.com>
Subject: Re: AVRO File size when caching in-memory

I did another test and am noting my observations here. These were done with the same data in Avro and CSV formats.

In AVRO, the file size on disk was 62kb and after caching, the in-memory size is 1224kb
In CSV, the file size on disk was 690kb and after caching, the in-memory size is 290kb

I'm guessing that Spark caching is not able to compress when the source is Avro. Not sure if this is just a premature conclusion on my part. Waiting to hear your observations.
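
As far as I can tell, these are the relevant knobs for the cached columnar format (defaults as I understand them):

// Compression of the in-memory columnar cache is on by default.
spark.conf.get("spark.sql.inMemoryColumnarStorage.compressed")   // "true"
spark.conf.get("spark.sql.inMemoryColumnarStorage.batchSize")    // "10000"

// It can be set explicitly before caching, e.g.
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")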

On Wed, Nov 16, 2016 at 12:14 PM, Prithish <pr...@gmail.com>> wrote:
Thanks for your response.

I have attached the code (which I ran using spark-shell) as well as a sample Avro file. After you run this code, the data is cached in memory and you can go to the "Storage" tab in the Spark UI (localhost:4040) and see the size it uses. In this example the size is small, but in my actual scenario the source file is 30GB and the in-memory size comes to around 800GB. I am trying to understand whether this is expected when using Avro.
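
If it is easier than the UI, I believe the same figure can also be read programmatically once the table is cached, something like:

// Print the in-memory size of everything currently cached (same figures as the Storage tab).
spark.sparkContext.getRDDStorageInfo.foreach { info =>
  println(f"${info.name}: ${info.memSize / 1024.0}%.1f kb in memory, " +
    f"${info.numCachedPartitions} cached partition(s)")
}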

On Wed, Nov 16, 2016 at 10:37 AM, Shreya Agarwal <sh...@microsoft.com>> wrote:
I haven't ever used Avro. But if you can send over a quick code sample, I can run it and see if I can repro it and maybe debug.

From: Prithish [mailto:prithish@gmail.com<ma...@gmail.com>]
Sent: Tuesday, November 15, 2016 8:44 PM
To: Jörn Franke <jo...@gmail.com>>
Cc: User <us...@spark.apache.org>>
Subject: Re: AVRO File size when caching in-memory

Anyone?

On Tue, Nov 15, 2016 at 10:45 AM, Prithish <pr...@gmail.com>> wrote:
I am using Spark 2.0.1 and the Databricks Avro library 3.0.1. I am running this on the latest AWS EMR release.

On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke <jo...@gmail.com>> wrote:
Spark version? Are you using Tungsten?

> On 14 Nov 2016, at 10:05, Prithish <pr...@gmail.com>> wrote:
>
> Can someone please explain why this happens?
>
> When I read a 600kb AVRO file and cache this in memory (using cacheTable), it shows up as 11mb (storage tab in Spark UI). I have tried this with different file sizes, and the size in-memory is always proportionate. I thought Spark compresses when using cacheTable.








--
---
Takeshi Yamamuro


Re: AVRO File size when caching in-memory

Posted by Prithish <pr...@gmail.com>.
It's something like the schema shown below (with several additional
levels/sublevels)

root
 |-- sentAt: long (nullable = true)
 |-- sharing: string (nullable = true)
 |-- receivedAt: long (nullable = true)
 |-- ip: string (nullable = true)
 |-- story: struct (nullable = true)
 |    |-- super: string (nullable = true)
 |    |-- lang: string (nullable = true)
 |    |-- setting: string (nullable = true)
 |    |-- myapp: struct (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- ver: string (nullable = true)
 |    |    |-- build: string (nullable = true)
 |    |-- comp: struct (nullable = true)
 |    |    |-- notes: string (nullable = true)
 |    |    |-- source: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- sub: string (nullable = true)
 |    |-- loc: struct (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- country: string (nullable = true)
 |    |    |-- lat: double (nullable = true)
 |    |    |-- long: double (nullable = true)

On Wed, Nov 16, 2016 at 2:08 PM, Takeshi Yamamuro <li...@gmail.com>
wrote:

> Hi,
>
> What's the schema interpreted by spark?
> A compression logic of the spark caching depends on column types.
>
> // maropu
>
>
> On Wed, Nov 16, 2016 at 5:26 PM, Prithish <pr...@gmail.com> wrote:
>
>> Thanks for your response.
>>
>> I did some more tests and I am seeing that when I have a flatter
>> structure for my AVRO, the cache memory use is close to the CSV. But, when
>> I use few levels of nesting, the cache memory usage blows up. This is
>> really critical for planning the cluster we will be using. To avoid using a
>> larger cluster, looks like, we will have to consider keeping the structure
>> flat as much as possible.
>>
>> On Wed, Nov 16, 2016 at 1:18 PM, Shreya Agarwal <sh...@microsoft.com>
>> wrote:
>>
>>> (Adding user@spark back to the discussion)
>>>
>>>
>>>
>>> Well, the CSV vs AVRO might be simpler to explain. CSV has a lot of
>>> scope for compression. On the other hand avro and parquet are already
>>> compressed and just store extra schema info, afaik. Avro and parquet are
>>> both going to make your data smaller, parquet through compressed columnar
>>> storage, and avro through its binary data format.
>>>
>>>
>>>
>>> Next, talking about the 62kb becoming 1224kb. I actually do not see such
>>> a massive blow up. The avro you shared is 28kb on my system and becomes
>>> 53.7kb when cached in memory deserialized and 52.9kb when cached In memory
>>> serialized. Exact same numbers with parquet as well. This is expected
>>> behavior, if I am not wrong.
>>>
>>>
>>>
>>> In fact, now that I think about it, even larger blow ups might be valid,
>>> since your data must have been deserialized from the compressed avro
>>> format, making it bigger. The order of magnitude of difference in size
>>> would depend on the type of data you have and how well it was compressable.
>>>
>>>
>>>
>>> The purpose of these formats is to store data to persistent storage in a
>>> way that's faster to read from, not to reduce cache-memory usage.
>>>
>>>
>>>
>>> Maybe others here have more info to share.
>>>
>>>
>>>
>>> Regards,
>>>
>>> Shreya
>>>
>>>
>>>
>>> Sent from my Windows 10 phone
>>>
>>>
>>>
>>> *From: *Prithish <pr...@gmail.com>
>>> *Sent: *Tuesday, November 15, 2016 11:04 PM
>>> *To: *Shreya Agarwal <sh...@microsoft.com>
>>> *Subject: *Re: AVRO File size when caching in-memory
>>>
>>>
>>> I did another test and noting my observations here. These were done with
>>> the same data in avro and csv formats.
>>>
>>> In AVRO, the file size on disk was 62kb and after caching, the in-memory
>>> size is 1224kb
>>> In CSV, the file size on disk was 690kb and after caching, the in-memory
>>> size is 290kb
>>>
>>> I'm guessing that the spark caching is not able to compress when the
>>> source is avro. Not sure if this is just my immature conclusion. Waiting to
>>> hear your observation.
>>>
>>> On Wed, Nov 16, 2016 at 12:14 PM, Prithish <pr...@gmail.com> wrote:
>>>
>>>> Thanks for your response.
>>>>
>>>> I have attached the code (that I ran using the Spark-shell) as well as
>>>> a sample avro file. After you run this code, the data is cached in memory
>>>> and you can go to the "storage" tab on the Spark-ui (localhost:4040) and
>>>> see the size it uses. In this example the size is small, but in my actual
>>>> scenario, the source file size is 30GB and the in-memory size comes to
>>>> around 800GB. I am trying to understand if this is expected when using avro
>>>> or not.
>>>>
>>>> On Wed, Nov 16, 2016 at 10:37 AM, Shreya Agarwal <
>>>> shreyagr@microsoft.com> wrote:
>>>>
>>>>> I haven’t used Avro ever. But if you can send over a quick sample
>>>>> code, I can run and see if I repro it and maybe debug.
>>>>>
>>>>>
>>>>>
>>>>> *From:* Prithish [mailto:prithish@gmail.com]
>>>>> *Sent:* Tuesday, November 15, 2016 8:44 PM
>>>>> *To:* Jörn Franke <jo...@gmail.com>
>>>>> *Cc:* User <us...@spark.apache.org>
>>>>> *Subject:* Re: AVRO File size when caching in-memory
>>>>>
>>>>>
>>>>>
>>>>> Anyone?
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Nov 15, 2016 at 10:45 AM, Prithish <pr...@gmail.com> wrote:
>>>>>
>>>>> I am using 2.0.1 and databricks avro library 3.0.1. I am running this
>>>>> on the latest AWS EMR release.
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke <jo...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> spark version? Are you using tungsten?
>>>>>
>>>>>
>>>>> > On 14 Nov 2016, at 10:05, Prithish <pr...@gmail.com> wrote:
>>>>> >
>>>>> > Can someone please explain why this happens?
>>>>> >
>>>>> > When I read a 600kb AVRO file and cache this in memory (using
>>>>> cacheTable), it shows up as 11mb (storage tab in Spark UI). I have tried
>>>>> this with different file sizes, and the size in-memory is always
>>>>> proportionate. I thought Spark compresses when using cacheTable.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>

Re: AVRO File size when caching in-memory

Posted by Takeshi Yamamuro <li...@gmail.com>.
Hi,

What's the schema as interpreted by Spark?
The compression logic of Spark's caching depends on the column types.

// maropu


On Wed, Nov 16, 2016 at 5:26 PM, Prithish <pr...@gmail.com> wrote:

> Thanks for your response.
>
> I did some more tests and I am seeing that when I have a flatter structure
> for my AVRO, the cache memory use is close to the CSV. But, when I use few
> levels of nesting, the cache memory usage blows up. This is really critical
> for planning the cluster we will be using. To avoid using a larger cluster,
> looks like, we will have to consider keeping the structure flat as much as
> possible.
>
> On Wed, Nov 16, 2016 at 1:18 PM, Shreya Agarwal <sh...@microsoft.com>
> wrote:
>
>> (Adding user@spark back to the discussion)
>>
>>
>>
>> Well, the CSV vs AVRO might be simpler to explain. CSV has a lot of scope
>> for compression. On the other hand avro and parquet are already compressed
>> and just store extra schema info, afaik. Avro and parquet are both going to
>> make your data smaller, parquet through compressed columnar storage, and
>> avro through its binary data format.
>>
>>
>>
>> Next, talking about the 62kb becoming 1224kb. I actually do not see such
>> a massive blow up. The avro you shared is 28kb on my system and becomes
>> 53.7kb when cached in memory deserialized and 52.9kb when cached In memory
>> serialized. Exact same numbers with parquet as well. This is expected
>> behavior, if I am not wrong.
>>
>>
>>
>> In fact, now that I think about it, even larger blow ups might be valid,
>> since your data must have been deserialized from the compressed avro
>> format, making it bigger. The order of magnitude of difference in size
>> would depend on the type of data you have and how well it was compressable.
>>
>>
>>
>> The purpose of these formats is to store data to persistent storage in a
>> way that's faster to read from, not to reduce cache-memory usage.
>>
>>
>>
>> Maybe others here have more info to share.
>>
>>
>>
>> Regards,
>>
>> Shreya
>>
>>
>>
>> Sent from my Windows 10 phone
>>
>>
>>
>> *From: *Prithish <pr...@gmail.com>
>> *Sent: *Tuesday, November 15, 2016 11:04 PM
>> *To: *Shreya Agarwal <sh...@microsoft.com>
>> *Subject: *Re: AVRO File size when caching in-memory
>>
>>
>> I did another test and noting my observations here. These were done with
>> the same data in avro and csv formats.
>>
>> In AVRO, the file size on disk was 62kb and after caching, the in-memory
>> size is 1224kb
>> In CSV, the file size on disk was 690kb and after caching, the in-memory
>> size is 290kb
>>
>> I'm guessing that the spark caching is not able to compress when the
>> source is avro. Not sure if this is just my immature conclusion. Waiting to
>> hear your observation.
>>
>> On Wed, Nov 16, 2016 at 12:14 PM, Prithish <pr...@gmail.com> wrote:
>>
>>> Thanks for your response.
>>>
>>> I have attached the code (that I ran using the Spark-shell) as well as a
>>> sample avro file. After you run this code, the data is cached in memory and
>>> you can go to the "storage" tab on the Spark-ui (localhost:4040) and see
>>> the size it uses. In this example the size is small, but in my actual
>>> scenario, the source file size is 30GB and the in-memory size comes to
>>> around 800GB. I am trying to understand if this is expected when using avro
>>> or not.
>>>
>>> On Wed, Nov 16, 2016 at 10:37 AM, Shreya Agarwal <shreyagr@microsoft.com
>>> > wrote:
>>>
>>>> I haven’t used Avro ever. But if you can send over a quick sample code,
>>>> I can run and see if I repro it and maybe debug.
>>>>
>>>>
>>>>
>>>> *From:* Prithish [mailto:prithish@gmail.com]
>>>> *Sent:* Tuesday, November 15, 2016 8:44 PM
>>>> *To:* Jörn Franke <jo...@gmail.com>
>>>> *Cc:* User <us...@spark.apache.org>
>>>> *Subject:* Re: AVRO File size when caching in-memory
>>>>
>>>>
>>>>
>>>> Anyone?
>>>>
>>>>
>>>>
>>>> On Tue, Nov 15, 2016 at 10:45 AM, Prithish <pr...@gmail.com> wrote:
>>>>
>>>> I am using 2.0.1 and databricks avro library 3.0.1. I am running this
>>>> on the latest AWS EMR release.
>>>>
>>>>
>>>>
>>>> On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke <jo...@gmail.com>
>>>> wrote:
>>>>
>>>> spark version? Are you using tungsten?
>>>>
>>>>
>>>> > On 14 Nov 2016, at 10:05, Prithish <pr...@gmail.com> wrote:
>>>> >
>>>> > Can someone please explain why this happens?
>>>> >
>>>> > When I read a 600kb AVRO file and cache this in memory (using
>>>> cacheTable), it shows up as 11mb (storage tab in Spark UI). I have tried
>>>> this with different file sizes, and the size in-memory is always
>>>> proportionate. I thought Spark compresses when using cacheTable.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>


-- 
---
Takeshi Yamamuro

Re: AVRO File size when caching in-memory

Posted by Prithish <pr...@gmail.com>.
Thanks for your response.

I did some more tests and I am seeing that when I have a flatter structure
for my AVRO, the cache memory use is close to the CSV. But when I use a few
levels of nesting, the cache memory usage blows up. This is really critical
for planning the cluster we will be using. To avoid using a larger cluster,
it looks like we will have to keep the structure as flat as possible.

On Wed, Nov 16, 2016 at 1:18 PM, Shreya Agarwal <sh...@microsoft.com>
wrote:

> (Adding user@spark back to the discussion)
>
>
>
> Well, the CSV vs AVRO might be simpler to explain. CSV has a lot of scope
> for compression. On the other hand avro and parquet are already compressed
> and just store extra schema info, afaik. Avro and parquet are both going to
> make your data smaller, parquet through compressed columnar storage, and
> avro through its binary data format.
>
>
>
> Next, talking about the 62kb becoming 1224kb. I actually do not see such a
> massive blow up. The avro you shared is 28kb on my system and becomes
> 53.7kb when cached in memory deserialized and 52.9kb when cached In memory
> serialized. Exact same numbers with parquet as well. This is expected
> behavior, if I am not wrong.
>
>
>
> In fact, now that I think about it, even larger blow ups might be valid,
> since your data must have been deserialized from the compressed avro
> format, making it bigger. The order of magnitude of difference in size
> would depend on the type of data you have and how well it was compressable.
>
>
>
> The purpose of these formats is to store data to persistent storage in a
> way that's faster to read from, not to reduce cache-memory usage.
>
>
>
> Maybe others here have more info to share.
>
>
>
> Regards,
>
> Shreya
>
>
>
> Sent from my Windows 10 phone
>
>
>
> *From: *Prithish <pr...@gmail.com>
> *Sent: *Tuesday, November 15, 2016 11:04 PM
> *To: *Shreya Agarwal <sh...@microsoft.com>
> *Subject: *Re: AVRO File size when caching in-memory
>
>
> I did another test and noting my observations here. These were done with
> the same data in avro and csv formats.
>
> In AVRO, the file size on disk was 62kb and after caching, the in-memory
> size is 1224kb
> In CSV, the file size on disk was 690kb and after caching, the in-memory
> size is 290kb
>
> I'm guessing that the spark caching is not able to compress when the
> source is avro. Not sure if this is just my immature conclusion. Waiting to
> hear your observation.
>
> On Wed, Nov 16, 2016 at 12:14 PM, Prithish <pr...@gmail.com> wrote:
>
>> Thanks for your response.
>>
>> I have attached the code (that I ran using the Spark-shell) as well as a
>> sample avro file. After you run this code, the data is cached in memory and
>> you can go to the "storage" tab on the Spark-ui (localhost:4040) and see
>> the size it uses. In this example the size is small, but in my actual
>> scenario, the source file size is 30GB and the in-memory size comes to
>> around 800GB. I am trying to understand if this is expected when using avro
>> or not.
>>
>> On Wed, Nov 16, 2016 at 10:37 AM, Shreya Agarwal <sh...@microsoft.com>
>> wrote:
>>
>>> I haven’t used Avro ever. But if you can send over a quick sample code,
>>> I can run and see if I repro it and maybe debug.
>>>
>>>
>>>
>>> *From:* Prithish [mailto:prithish@gmail.com]
>>> *Sent:* Tuesday, November 15, 2016 8:44 PM
>>> *To:* Jörn Franke <jo...@gmail.com>
>>> *Cc:* User <us...@spark.apache.org>
>>> *Subject:* Re: AVRO File size when caching in-memory
>>>
>>>
>>>
>>> Anyone?
>>>
>>>
>>>
>>> On Tue, Nov 15, 2016 at 10:45 AM, Prithish <pr...@gmail.com> wrote:
>>>
>>> I am using 2.0.1 and databricks avro library 3.0.1. I am running this on
>>> the latest AWS EMR release.
>>>
>>>
>>>
>>> On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke <jo...@gmail.com>
>>> wrote:
>>>
>>> spark version? Are you using tungsten?
>>>
>>>
>>> > On 14 Nov 2016, at 10:05, Prithish <pr...@gmail.com> wrote:
>>> >
>>> > Can someone please explain why this happens?
>>> >
>>> > When I read a 600kb AVRO file and cache this in memory (using
>>> cacheTable), it shows up as 11mb (storage tab in Spark UI). I have tried
>>> this with different file sizes, and the size in-memory is always
>>> proportionate. I thought Spark compresses when using cacheTable.
>>>
>>>
>>>
>>>
>>>
>>
>>
>

RE: AVRO File size when caching in-memory

Posted by Shreya Agarwal <sh...@microsoft.com>.
(Adding user@spark back to the discussion)

Well, the CSV vs AVRO might be simpler to explain. CSV has a lot of scope for compression. On the other hand avro and parquet are already compressed and just store extra schema info, afaik. Avro and parquet are both going to make your data smaller, parquet through compressed columnar storage, and avro through its binary data format.

Next, talking about the 62kb becoming 1224kb. I actually do not see such a massive blow up. The avro you shared is 28kb on my system and becomes 53.7kb when cached in memory deserialized and 52.9kb when cached in memory serialized. Exact same numbers with parquet as well. This is expected behavior, if I am not wrong.

In fact, now that I think about it, even larger blow-ups might be valid, since your data must have been deserialized from the compressed avro format, making it bigger. The order of magnitude of difference in size would depend on the type of data you have and how well it was compressible.

The purpose of these formats is to store data to persistent storage in a way that's faster to read from, not to reduce cache-memory usage.

Maybe others here have more info to share.

Regards,
Shreya

Sent from my Windows 10 phone

From: Prithish<ma...@gmail.com>
Sent: Tuesday, November 15, 2016 11:04 PM
To: Shreya Agarwal<ma...@microsoft.com>
Subject: Re: AVRO File size when caching in-memory

I did another test and noting my observations here. These were done with the same data in avro and csv formats.

In AVRO, the file size on disk was 62kb and after caching, the in-memory size is 1224kb
In CSV, the file size on disk was 690kb and after caching, the in-memory size is 290kb

I'm guessing that the spark caching is not able to compress when the source is avro. Not sure if this is just my immature conclusion. Waiting to hear your observation.

On Wed, Nov 16, 2016 at 12:14 PM, Prithish <pr...@gmail.com>> wrote:
Thanks for your response.

I have attached the code (that I ran using the Spark-shell) as well as a sample avro file. After you run this code, the data is cached in memory and you can go to the "storage" tab on the Spark-ui (localhost:4040) and see the size it uses. In this example the size is small, but in my actual scenario, the source file size is 30GB and the in-memory size comes to around 800GB. I am trying to understand if this is expected when using avro or not.

On Wed, Nov 16, 2016 at 10:37 AM, Shreya Agarwal <sh...@microsoft.com>> wrote:
I haven't used Avro ever. But if you can send over a quick sample code, I can run and see if I repro it and maybe debug.

From: Prithish [mailto:prithish@gmail.com<ma...@gmail.com>]
Sent: Tuesday, November 15, 2016 8:44 PM
To: Jörn Franke <jo...@gmail.com>>
Cc: User <us...@spark.apache.org>>
Subject: Re: AVRO File size when caching in-memory

Anyone?

On Tue, Nov 15, 2016 at 10:45 AM, Prithish <pr...@gmail.com>> wrote:
I am using 2.0.1 and databricks avro library 3.0.1. I am running this on the latest AWS EMR release.

On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke <jo...@gmail.com>> wrote:
spark version? Are you using tungsten?

> On 14 Nov 2016, at 10:05, Prithish <pr...@gmail.com>> wrote:
>
> Can someone please explain why this happens?
>
> When I read a 600kb AVRO file and cache this in memory (using cacheTable), it shows up as 11mb (storage tab in Spark UI). I have tried this with different file sizes, and the size in-memory is always proportionate. I thought Spark compresses when using cacheTable.





Re: AVRO File size when caching in-memory

Posted by Prithish <pr...@gmail.com>.
Anyone?

On Tue, Nov 15, 2016 at 10:45 AM, Prithish <pr...@gmail.com> wrote:

> I am using 2.0.1 and databricks avro library 3.0.1. I am running this on
> the latest AWS EMR release.
>
> On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke <jo...@gmail.com> wrote:
>
>> spark version? Are you using tungsten?
>>
>> > On 14 Nov 2016, at 10:05, Prithish <pr...@gmail.com> wrote:
>> >
>> > Can someone please explain why this happens?
>> >
>> > When I read a 600kb AVRO file and cache this in memory (using
>> cacheTable), it shows up as 11mb (storage tab in Spark UI). I have tried
>> this with different file sizes, and the size in-memory is always
>> proportionate. I thought Spark compresses when using cacheTable.
>>
>
>

Re: AVRO File size when caching in-memory

Posted by Prithish <pr...@gmail.com>.
I am using Spark 2.0.1 and the Databricks Avro library 3.0.1. I am running
this on the latest AWS EMR release.

On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke <jo...@gmail.com> wrote:

> spark version? Are you using tungsten?
>
> > On 14 Nov 2016, at 10:05, Prithish <pr...@gmail.com> wrote:
> >
> > Can someone please explain why this happens?
> >
> > When I read a 600kb AVRO file and cache this in memory (using
> cacheTable), it shows up as 11mb (storage tab in Spark UI). I have tried
> this with different file sizes, and the size in-memory is always
> proportionate. I thought Spark compresses when using cacheTable.
>

Re: AVRO File size when caching in-memory

Posted by Jörn Franke <jo...@gmail.com>.
Spark version? Are you using Tungsten?

> On 14 Nov 2016, at 10:05, Prithish <pr...@gmail.com> wrote:
> 
> Can someone please explain why this happens?
> 
> When I read a 600kb AVRO file and cache this in memory (using cacheTable), it shows up as 11mb (storage tab in Spark UI). I have tried this with different file sizes, and the size in-memory is always proportionate. I thought Spark compresses when using cacheTable. 

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org