Posted to user@spark.apache.org by Hemant Bhanawat <he...@gmail.com> on 2016/09/22 06:36:50 UTC

Memory usage by Spark jobs

I am working on profiling TPC-H queries for Spark 2.0. I see a lot of
temporary object creation (sometimes as large as the data itself), which is
justified for the kind of processing Spark does. But from a production
perspective, is there a guideline on how much memory should be allocated for
processing a given amount of, let's say, Parquet data? Also, has anyone
investigated memory usage for the individual SQL operators like Filter,
group by, order by, Exchange, etc.?
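To make the question concrete, here is a minimal sketch of the kind of sizing
decision I mean (Spark 2.0; the numbers and paths are placeholders, not a
recommendation):

    // Minimal sketch of the memory knobs in question (Spark 2.0).
    // Numbers and paths are placeholders, not a recommendation.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("tpch-profiling")
      .config("spark.executor.memory", "8g")          // heap per executor
      .config("spark.memory.fraction", "0.6")         // share of heap for execution + storage
      .config("spark.memory.storageFraction", "0.5")  // storage's share of that unified region
      .getOrCreate()

    // e.g. a TPC-H table stored as Parquet
    val lineitem = spark.read.parquet("/data/tpch/lineitem")  // hypothetical path
    lineitem.groupBy("l_returnflag").count().show()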

Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
www.snappydata.io

Re: Memory usage by Spark jobs

Posted by Jörn Franke <jo...@gmail.com>.
You should also take into account that Spark has different options for representing data in memory, such as Java-serialized objects, Kryo-serialized objects, Tungsten (columnar, optionally compressed), etc. The Tungsten representation depends heavily on the underlying data and its sort order, especially if compressed.
Then you might also think about broadcast data, etc.

As such, I am not aware of a specific guide, but there is also no magic behind it. Could be a good JIRA task :)
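For the record, a rough sketch of those options (Spark 2.0 APIs; the table
paths and join keys are placeholders):

    // Kryo instead of Java serialization for serialized RDD storage and shuffles
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.broadcast
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder()
      .appName("in-memory-representations")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate()

    val customer = spark.read.parquet("/data/tpch/customer")  // hypothetical paths
    val nation   = spark.read.parquet("/data/tpch/nation")

    // DataFrame cache uses the Tungsten columnar format, compressed by default
    // (spark.sql.inMemoryColumnarStorage.compressed = true)
    customer.persist(StorageLevel.MEMORY_ONLY)

    // RDD-level cache with serialized storage; Kryo because of spark.serializer above
    customer.rdd.persist(StorageLevel.MEMORY_ONLY_SER)

    // Broadcast a small dimension table so the join is done map-side
    val joined = customer.join(broadcast(nation),
      customer("c_nationkey") === nation("n_nationkey"))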

> On 22 Sep 2016, at 08:36, Hemant Bhanawat <he...@gmail.com> wrote:
> 
> I am working on profiling TPCH queries for Spark 2.0.  I see lot of temporary object creation (sometimes size as much as the data size) which is justified for the kind of processing Spark does. But, from production perspective, is there a guideline on how much memory should be allocated for processing a specific data size of let's say parquet data? Also, has someone investigated memory usage for the individual SQL operators like Filter, group by, order by, Exchange etc.? 
> 
> Hemant Bhanawat
> www.snappydata.io 
