Posted to user@spark.apache.org by Adrian Bridgett <ad...@opensignal.com> on 2016/09/19 21:05:50 UTC

very high maxresults setting (no collect())

Hi,

We've recently started seeing a huge increase in 
spark.driver.maxResultSize - we're now having to set it to 3GB (and 
increase our driver memory a lot, to 12GB or so).  This is on v1.6.1 
with the Mesos scheduler.

All the docs I can find say that this is to do with .collect() being 
called on a large RDD - which isn't the case AFAIK, certainly there's 
nothing in the code - so it's rather puzzling me as to what's going 
on.  I suspect the number of tasks is a factor (about 14,000 tasks in 
each of about a dozen stages).  Adding a coalesce seemed to help, but 
now we're hitting the problem again after a few minor code tweaks.
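For reference, this is roughly how we're bumping the limits when 
submitting (a minimal sketch - the 3g/12g values are just what we've 
been trying, not recommendations):

spark-submit \
  --driver-memory 12g \
  --conf spark.driver.maxResultSize=3g \
  ... rest of the job as usual ...

(or equivalently spark.driver.memory / spark.driver.maxResultSize in 
spark-defaults.conf).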

What else could be contributing to this?  Thoughts I've had:
- number of tasks (see the back-of-envelope sums below)
- metrics?
- um, a bit stuck!
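Back-of-envelope on the task-count theory (purely illustrative 
numbers - I don't know the real per-task overhead): every finished 
task sends its serialised result plus accumulator/metric updates back 
to the driver, and AIUI all of that counts towards 
spark.driver.maxResultSize.  If each update were even ~75KB, then 
14,000 tasks x ~75KB ≈ 1GB for a single stage - enough to blow 
through the 1GB default without a single collect().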

The code looks like this:
import org.apache.spark.sql.functions.{avg, count}
import sqlContext.implicits._  // for the $"..." column syntax

val df = ....
df.persist()
val rows = df.count()

// actually we loop over this a few times
val output = df.groupBy("id").agg(
    avg($"score").as("avg_score"),
    count($"id").as("rows")
  )
  .select(
    $"id",
    $"avg_score",
    $"rows"
  )
  .sort($"id")
output.coalesce(1000).write.format("com.databricks.spark.csv").save("/tmp/...")
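(That coalesce is on the output; if it's really the task count that 
hurts, coalescing straight after the read might help more.  A sketch, 
assuming the ~14,000 partitions come from the parquet input splits - 
the path and number here are made up:

val df = sqlContext.read.parquet("/data/...").coalesce(2000)
df.persist()

Fewer partitions per stage means fewer task results/metric updates 
shipped back to the driver.)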

Cheers for any help/pointers!  There are a couple of memory leak tickets 
fixed in v1.6.2 that may affect the driver so I may try an upgrade (the 
executors are fine).

Adrian



Re: very high maxresults setting (no collect())

Posted by Adrian Bridgett <ad...@opensignal.com>.
Hi Michael,

No Spark upgrade - we've been changing some of our data pipelines, so 
the data volumes have probably been getting a bit larger.  Just in 
the last few weeks we've seen quite a few jobs needing a larger 
maxResultSize; some have gone from "fine with the 1GB default" to 
needing 3GB.  I'm wondering what besides a collect() could cause this 
(there's certainly no explicit collect() in the code).

Mesos; parquet source data; a broadcast of a small table early on 
which is then joined in; then just a few aggregations, a select, a 
coalesce and a spark-csv write - roughly the shape sketched below.  
The executors go along nicely (as does the driver at first), then we 
start to hit memory pressure on the driver in the output loop and the 
job slows to a crawl (we eventually have to kill it and restart with 
more memory).
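For the record, roughly this shape (a hand-wavy sketch, not the real 
job - the paths, table names and column are invented):

import org.apache.spark.sql.functions.broadcast

val small  = sqlContext.read.parquet("/ref/...")   // small lookup table
val big    = sqlContext.read.parquet("/data/...")
val joined = big.join(broadcast(small), "id")      // broadcast join: no shuffle of `big`
// ...then the aggregations / select / sort / coalesce / spark-csv
// write from my first mail...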

Adrian



Re: very high maxresults setting (no collect())

Posted by Michael Gummelt <mg...@mesosphere.io>.
When you say "started seeing", do you mean after a Spark version upgrade?
After running a new job?



-- 
Michael Gummelt
Software Engineer
Mesosphere