Posted to user@spark.apache.org by Prithish <pr...@gmail.com> on 2016/10/27 05:19:24 UTC

Question about In-Memory size (cache / cacheTable)

Hello,

I am trying to understand why the in-memory size differs so much in the
cases below. Specifically, why is the in-memory size so much higher for
Avro and Parquet? Is there any optimization that would reduce it?

I used cacheTable on each of these:

Avro file (600 KB) - in-memory size was 12 MB
Parquet file (600 KB) - in-memory size was 12 MB
CSV file (3 MB, same data as the files above) - in-memory size was 600 KB
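
To put those numbers side by side, here is a rough calculation of the
in-memory to on-disk ratio for each case. It is only a sketch: the sizes are
the rounded figures listed above, treating 1 MB as 1024 KB is an assumption,
and the names (reported, expansion_ratio) are made up for illustration.

```python
# Sizes as reported above, converted to kilobytes (assuming 1 MB = 1024 KB).
KB_PER_MB = 1024

reported = {
    # format: (on-disk size in KB, in-memory size in KB after cacheTable)
    "avro": (600, 12 * KB_PER_MB),
    "parquet": (600, 12 * KB_PER_MB),
    "csv": (3 * KB_PER_MB, 600),
}


def expansion_ratio(on_disk_kb, in_memory_kb):
    """Return the in-memory size divided by the on-disk size."""
    return in_memory_kb / on_disk_kb


ratios = {fmt: expansion_ratio(*sizes) for fmt, sizes in reported.items()}

for fmt, ratio in ratios.items():
    print(f"{fmt}: {ratio:.2f}x")
```

So the Avro and Parquet tables grow to roughly 20 times their file size once
cached, while the CSV table shrinks to about one fifth of its file size.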

Because of this, we would need a cluster with much more memory if we were
to cache the Avro files.

Thanks for your help.

Prit