You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Haijia Zhou <ha...@yahoo.com.INVALID> on 2021/02/15 04:23:27 UTC

[SPARK-SQL] Does Spark 3.0 support parquet predicate pushdown for array of structs?

Hi, I know Spark 3.0 has added Parquet predicate pushdown for nested structures (SPARK-17636) Does it also support predicate pushdown for an array of structs?  For example, say I have a spark table 'individuals' (in parquet format) with the following schema 
root |-- individual_id: string (nullable = true) |-- devices: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- type: string (nullable = true) | | |-- carrier_name: string (nullable = true) | | |-- model: string (nullable = true) | | |-- vendor: string (nullable = true) | | |-- year_released: integer (nullable = true) | | |-- primary_hardware_type: string (nullable = true) | | |-- browser_name: string (nullable = true) | | |-- browser_version: string (nullable = true) | | |-- manufacturer: string (nullable = true)

I can then use the following code to find the number of individuals who have at least one device that was released after 2010
select count(*) as total_count from individuals  where exists(devices, dev -> dev.year_released > 2010)
The query runs well with spark 3.0 but it had to read all the columns of the nested structure 'devices', as shown below.res14: org.apache.spark.sql.execution.SparkPlan =AdaptiveSparkPlan isFinalPlan=false+- HashAggregate(keys=[], functions=[finalmerge_count(merge count#59L) AS count(1)#55L], output=[total_count#54L]) +- Exchange SinglePartition, true, [id=#35] +- HashAggregate(keys=[], functions=[partial_count(1) AS count#59L], output=[count#59L]) +- Project +- Filter exists(devices#48, lambdafunction((lambda dev#56.year_released > 2018), lambda dev#56, false)) +- FileScan parquet [ids#48] Batched: true, DataFilters: [exists(devices#48, lambdafunction((lambda dev#56.year_released > 2018), lambda dev#56, false))], Format: Parquet, Location: InMemoryFileIndex[s3://..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<devices:array<struct<id_last_seen:date,type:string,value:string,carrier_name:string,model:string,vendor:string,in...

Any thoughts?
Thanks
Haijia