Posted to users@hudi.apache.org by Karl Wridgway <ka...@wridgway.com> on 2020/09/12 04:08:49 UTC

PySpark and Spark query optimisations

Hi team,

I was wondering if it's possible to leverage Spark's built-in optimisations
for COPY_ON_WRITE tables with PySpark?

The documentation here: https://hudi.apache.org/docs/querying_data.html
describes how to do this for Scala/Java:

"If using spark’s built in support, additionally a path filter needs to be
pushed into sparkContext as follows. This method retains Spark built-in
optimizations for reading parquet files like vectorized reading on Hudi
Hive tables.

spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter],
classOf[org.apache.hadoop.fs.PathFilter]);
"

Regards,
Karl