Posted to dev@spark.apache.org by Alex Cruise <al...@cluonflux.com> on 2023/05/12 16:04:19 UTC

planInputPartitions being called twice

(I posted this on Slack originally)

Hey folks, I’m writing a batch connector for an in-house data lake and
doing some performance work now… I’ve noticed that my ScanBuilder creates a
Scan exactly once, but that Scan’s toBatch method is called three times,
returning the identical Batch object every time. The Batch’s
planInputPartitions method is then called twice, doing a large amount of
redundant work. I’m targeting Spark 3.3.2 currently because EMR doesn’t
support Spark 3.4.x yet.

This is all on a single node, in local mode. planInputPartitions() is itself
a somewhat expensive operation, so I’d rather not have it called twice.

I haven’t implemented SupportsRuntimeFiltering yet, but I’m not confident
it would help with this specific problem.

The javadoc for planInputPartitions says it’ll "be called only once, to
launch one Spark job". OTOH,
https://github.com/vertica/spark-connector/issues/171#issuecomment-1051162865
says it’s normal for it to be called twice.

Well, at least it’s called on the same instance both times, so I can just
cache the results I guess… annoying though.
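
For concreteness, the caching workaround could look roughly like the sketch
below. This is a minimal illustration, not actual connector code: CachingBatch
and its Supplier argument are hypothetical stand-ins so that the snippet
compiles without Spark on the classpath. A real class would implement
org.apache.spark.sql.connector.read.Batch and return InputPartition[] instead.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a connector's Batch implementation (assumption:
// the real class would implement org.apache.spark.sql.connector.read.Batch).
final class CachingBatch<P> {
    // The expensive planning step, supplied by the connector.
    private final Supplier<P[]> expensivePlanner;
    // Null until the first successful planning call.
    private volatile P[] cached;

    CachingBatch(Supplier<P[]> expensivePlanner) {
        this.expensivePlanner = expensivePlanner;
    }

    // Same shape as Batch.planInputPartitions(): does the expensive work on
    // the first call and hands back the stored array on every later call.
    P[] planInputPartitions() {
        P[] result = cached;
        if (result == null) {
            synchronized (this) {
                result = cached;
                if (result == null) {
                    result = expensivePlanner.get();
                    cached = result;
                }
            }
        }
        return result;
    }
}
```

This only helps because both calls land on the same instance; if a different
Batch were created per call, the cache would have to live on the Scan (or
higher) instead. It also returns the same array each time, which assumes no
caller mutates it.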

Is there a well-known better way to avoid this inefficiency? Is it a bug?

Thanks!

-0xe1a