Posted to user@spark.apache.org by Adrian Bridgett <ad...@opensignal.com> on 2016/09/19 21:05:50 UTC
very high maxresults setting (no collect())
Hi,
We've recently started seeing a huge increase in the
spark.driver.maxResultSize our jobs need - we're now setting it to 3GB (and
increasing our driver memory a lot, to 12GB or so). This is on v1.6.1 with
the Mesos scheduler.
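For reference, this is roughly how we bump the limit (a minimal sketch,
assuming programmatic configuration; "aggregation-job" is a placeholder
name, and driver memory itself has to be set before the driver JVM starts,
e.g. via --driver-memory 12g on spark-submit):

    import org.apache.spark.{SparkConf, SparkContext}

    // Raise the cap on the total size of serialized task results the
    // driver will accept (the default is 1g).
    val conf = new SparkConf()
      .setAppName("aggregation-job")
      .set("spark.driver.maxResultSize", "3g")
    val sc = new SparkContext(conf)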
All the docs I can see say that this is to do with .collect() being
called on a large RDD (which isn't the case AFAIK - certainly nothing in
the code), so it's rather puzzling me as to what's going on. I thought
that the number of tasks was coming into it (about 14000 tasks in each
of about a dozen stages). Adding a coalesce seemed to help, but now we
are hitting the problem again after a few minor code tweaks.
What else could be contributing to this? Thoughts I've had:
- number of tasks
- metrics?
- um, a bit stuck!
The code looks like this:
import sqlContext.implicits._
import org.apache.spark.sql.functions.{avg, count}

val df = ...  // source DataFrame
df.persist()
val rows = df.count()

// actually we loop over this a few times
val output = df.groupBy("id").agg(
    avg($"score").as("avg_score"),
    count($"id").as("rows")
  ).select(
    $"id",
    $"avg_score",
    $"rows"
  ).sort($"id")
output.coalesce(1000).write.format("com.databricks.spark.csv").save("/tmp/...")
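One thing I may try: as I understand it, every task ships a small
serialized result (including accumulator/metrics updates) back to the
driver, and those all count against maxResultSize, so fewer tasks means
less result traffic. A minimal sketch, assuming the ~14000-task stages
come from the shuffles - the config name is real, the value is only
illustrative, and the coalesce(1000) above only trims the tail of the job:

    // Fewer shuffle tasks => fewer per-task results for the driver to
    // hold on to (stock Spark 1.6 defaults this to 200).
    sqlContext.setConf("spark.sql.shuffle.partitions", "1000")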
Cheers for any help/pointers! There are a couple of memory leak tickets
fixed in v1.6.2 that may affect the driver so I may try an upgrade (the
executors are fine).
Adrian
Re: very high maxresults setting (no collect())
Posted by Adrian Bridgett <ad...@opensignal.com>.
Hi Michael,
No Spark upgrade; we've been changing some of our data pipelines, so the
data volumes have probably been getting a bit larger. Just in the last
few weeks we've seen quite a few jobs needing a larger maxResultSize -
some have gone from fine with the 1GB default to needing 3GB. Wondering
what besides a collect could cause this (there's certainly no
explicit collect()).
Mesos, parquet source data, a broadcast of a small table earlier which
is joined, then just a few aggregations, select, coalesce and spark-csv
write. The executors go along nicely (as does the driver), then we
start to hit memory pressure on the driver in the output loop and the
job slows to a crawl (we eventually have to kill it and restart with
more memory).
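For concreteness, the shape of the pipeline is roughly this (a minimal
sketch with hypothetical paths and names, not our actual code):

    import org.apache.spark.sql.functions.broadcast

    val lookup = sqlContext.read.parquet("/data/lookup")  // small table
    val events = sqlContext.read.parquet("/data/events")  // large table

    // Broadcast the small side so the join is done map-side on the
    // executors; after that it's just the aggregate/select/coalesce/write
    // shown earlier in the thread.
    val joined = events.join(broadcast(lookup), "id")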
Adrian
Re: very high maxresults setting (no collect())
Posted by Michael Gummelt <mg...@mesosphere.io>.
When you say "started seeing", do you mean after a Spark version upgrade?
After running a new job?
--
Michael Gummelt
Software Engineer
Mesosphere