Posted to user@spark.apache.org by Denarian Kislata <de...@gmail.com> on 2022/05/05 08:07:03 UTC

Something about Spark which has bothered me for a very long time, which I've never understood

Greetings, and thanks in advance.

For ~8 years I've been a Spark user, and I've seen this same problem at
more SaaS startups than I can count, and although it's straightforward to
fix, I've never understood _why_ it happens.

I'm hoping someone can explain the why behind it.

Unfortunately I don't have screenshots handy, but the common problem can
be described succinctly:

Someone created a table stored as a large number of small S3 objects in a
single bucket, partitioned by date. Some job loads this table by date
partition, and inevitably at some point there is a scan/exchange in the
execution plan which shows ~1M partitions in the Spark UI.

It's a common issue with a known solution, but the odd thing I always see
in this case, and wonder about, is why the Ganglia metrics look so strange
for that particular part of the job.

To clarify what strange means: for some cluster of some size, CPU
utilization sits at about 40% user / 2% sys while the network is at about
5% of quoted throughput. It's not a long-fat network, and I usually
haven't seen high retransmission rates or throttling from S3 in the logs.

In these cases, heap is usually ~25% of maximum. It's so strange to see
utilization like this across the board, because some resource must be
saturated, and although I haven't gotten to the point of connecting to an
executor and reading saturation metrics, my intuition is that it should
actually be some pooled resource or semaphore which is exhausted. An
obvious culprit would be something related to s3a, but whenever I peek at
that, nothing really stands out configuration-wise. Maybe there's
something I don't understand about the connections and threads
configuration for s3a and I should go read that implementation tomorrow,
but I thought I'd ask here first.
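
For reference, these are the s3a pool knobs I mean, set via Spark's
hadoop config prefix; the numbers below are purely illustrative and the
defaults vary by Hadoop release:

    import org.apache.spark.sql.SparkSession

    // Pooled resources s3a exposes as configuration; values are
    // illustrative, not recommendations.
    val spark = SparkSession.builder()
      .appName("s3a-pool-sizing")
      .config("spark.hadoop.fs.s3a.connection.maximum", "96") // HTTP connection pool size
      .config("spark.hadoop.fs.s3a.threads.max", "64")        // threads for parallel S3 operations
      .config("spark.hadoop.fs.s3a.max.total.tasks", "128")   // queued S3 work before callers block
      .getOrCreate()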

Has anyone else seen this? I'm not interested in _fixing_ it really (the
right fix is to correct the input data...), but I'd love to understand
_why_ it happens.

Thanks!
Den

I'm curious what people think about this.

Re: Something about Spark which has bothered me for a very long time, which I've never understood

Posted by "Lalwani, Jayesh" <jl...@amazon.com.INVALID>.
Have you tried taking several thread dumps across executors to see if the executors are consistently waiting for a resource?
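
The Executors tab in the Spark UI has a per-executor "Thread Dump" link, which is the easiest way to check. A rough programmatic sketch (run from a spark-shell where spark is already bound, and only useful if some task slots are free while the scan runs) would be something like:

    import scala.collection.JavaConverters._

    // Rough sketch: run a tiny job that dumps the stack trace of every thread
    // on whichever executors pick up the tasks. Duplicates per executor are
    // fine for eyeballing; the goal is to spot threads stuck waiting on S3 calls.
    val sc = spark.sparkContext
    val dumps = sc
      .parallelize(1 to sc.defaultParallelism, sc.defaultParallelism)
      .mapPartitions { _ =>
        Thread.getAllStackTraces.asScala.iterator.map { case (t, frames) =>
          s"${t.getName} [${t.getState}]\n" + frames.take(8).map("  " + _).mkString("\n")
        }
      }
      .collect()

    dumps.foreach(println)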

I suspect it's S3. S3's list operation doesn't scale with the number of keys in a folder. You aren't being throttled by S3. S3 is just slow when you have a lot of small objects. I suspect you'll just see the threads waiting for S3 to send back a response.
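
Back-of-the-envelope: ListObjectsV2 returns at most 1,000 keys per page, and the pages for a given prefix have to be fetched sequentially via continuation tokens, so just enumerating ~1M keys takes on the order of a thousand round trips before a single byte of data is read. The latency figure below is an assumed round number, not a measurement:

    // Rough listing cost for ~1M small objects under a single prefix.
    val objects     = 1000000
    val keysPerPage = 1000    // hard cap on keys per ListObjectsV2 response
    val roundTripMs = 50      // assumed per-request latency; illustrative only
    val pages       = objects / keysPerPage
    val listSeconds = pages * roundTripMs / 1000.0
    println(f"$pages pages, roughly $listSeconds%.0f s spent just listing keys")
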
From: Denarian Kislata <de...@gmail.com>
Date: Thursday, May 5, 2022 at 8:03 AM
To: "user@spark.apache.org" <us...@spark.apache.org>
Subject: [EXTERNAL] Something about Spark which has bothered me for a very long time, which I've never understood

