Posted to users@hudi.apache.org by Karl Wridgway <ka...@wridgway.com> on 2020/09/12 04:08:49 UTC
PySpark and Spark query optimisations
Hi team,
I was wondering if it's possible to leverage Spark's built-in optimisations
for COPY_ON_WRITE tables from PySpark?
The documentation here: https://hudi.apache.org/docs/querying_data.html
describes how to do this for Scala/Java:
"If using spark’s built in support, additionally a path filter needs to be
pushed into sparkContext as follows. This method retains Spark built-in
optimizations for reading parquet files like vectorized reading on Hudi
Hive tables.
spark.sparkContext.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class",
classOf[org.apache.hudi.hadoop.HoodieROTablePathFilter],
classOf[org.apache.hadoop.fs.PathFilter]);
"
Regards,
Karl