Posted to issues@spark.apache.org by "SAVIO SALVARINO TELES DE OLIVEIRA (Jira)" <ji...@apache.org> on 2022/06/26 13:26:00 UTC
[jira] [Commented] (SPARK-38288) Aggregate push down doesn't work using Spark SQL jdbc datasource with postgresql
[ https://issues.apache.org/jira/browse/SPARK-38288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17558879#comment-17558879 ]
SAVIO SALVARINO TELES DE OLIVEIRA commented on SPARK-38288:
-----------------------------------------------------------
I have the same problem with Spark 3.2.1.
Code to read orders table from PostgreSQL:
{code:java}
orders = (spark.read
    .format('jdbc')
    .option('url', 'jdbc:postgresql://...')
    .option('driver', 'org.postgresql.Driver')
    .option('dbtable', 'orders')
    .option('user', '***')
    .option('password', '****')
    .option('pushDownAggregate', 'true')
    .load()){code}
Now, I'm trying to group by two columns (owner_name and client_id):
{code:java}
from pyspark.sql.functions import max

orders.groupby("owner_name", "client_id").agg(max('order_date').alias("max_order_date")).limit(10).explain('extended'){code}
But the optimized plan still uses the V1 Relation instead of RelationV2:
{code:java}
== Optimized Logical Plan ==
GlobalLimit 10
+- LocalLimit 10
+- Aggregate [owner_name#2717, client_id#2724], [owner_name#2717, client_id#2724, max(order_date#2728) AS max_order_date#2910]
+- Project [owner_name#2717, client_id#2724, order_date#2728]
+- Relation [owner_name#2717,client_id#2724,order_date#2728] JDBCRelation(orders) [numPartitions=1]{code}
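For what it's worth, in Spark 3.2 the plain format('jdbc') reader goes through the V1 JDBC relation, which does not support aggregate pushdown, so the pushDownAggregate option has no effect there. Pushdown only kicks in through the V2 JDBCTableCatalog. A sketch of the catalog-based approach (the catalog name "postgres", the schema "public", and the connection details are placeholders):
{code:java}
# Register a V2 JDBC catalog so scans go through the V2 code path.
# Catalog name 'postgres' and the URL below are placeholders.
spark.conf.set("spark.sql.catalog.postgres",
               "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
spark.conf.set("spark.sql.catalog.postgres.url", "jdbc:postgresql://...")
spark.conf.set("spark.sql.catalog.postgres.driver", "org.postgresql.Driver")
spark.conf.set("spark.sql.catalog.postgres.pushDownAggregate", "true")

# Read through the catalog instead of format('jdbc'); the optimized plan
# should then show a RelationV2 scan with a non-empty PushedAggregates list.
orders = spark.table("postgres.public.orders"){code}
With this setup, re-running explain('extended') on the grouped query should show RelationV2 and the pushed aggregate, assuming the query stays within the documented pushdown limitations.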
> Aggregate push down doesn't work using Spark SQL jdbc datasource with postgresql
> -------------------------------------------------------------------------------
>
> Key: SPARK-38288
> URL: https://issues.apache.org/jira/browse/SPARK-38288
> Project: Spark
> Issue Type: Question
> Components: SQL
> Affects Versions: 3.2.1
> Reporter: Luis Lozano Coira
> Priority: Major
> Labels: DataSource, Spark-SQL
>
> I am connecting to PostgreSQL using the Spark SQL jdbc datasource. I started the Spark shell with the Postgres driver on the classpath, and I can connect and execute queries without problems. I am using this statement:
> {code:java}
> val df = spark.read.format("jdbc")
>   .option("url", "jdbc:postgresql://host:port/")
>   .option("driver", "org.postgresql.Driver")
>   .option("dbtable", "test")
>   .option("user", "postgres")
>   .option("password", "*******")
>   .option("pushDownAggregate", true)
>   .load()
> {code}
> I am adding the pushDownAggregate option because I would like the aggregations to be delegated to the source, but for some reason this is not happening.
> Reviewing this pull request, it seems that this feature should be merged into 3.2. [https://github.com/apache/spark/pull/29695]
> I am writing the aggregations with the documented limitations in mind. Here is an example where I don't see the pushdown happening:
> {code:java}
> df.groupBy("name").max("age").show()
> {code}
> The results of the queryExecution are shown below:
> {code:java}
> scala> df.groupBy("name").max("age").queryExecution.executedPlan
> res19: org.apache.spark.sql.execution.SparkPlan =
> AdaptiveSparkPlan isFinalPlan=false
> +- HashAggregate(keys=[name#274], functions=[max(age#246)], output=[name#274, max(age)#544])
> +- Exchange hashpartitioning(name#274, 200), ENSURE_REQUIREMENTS, [id=#205]
> +- HashAggregate(keys=[name#274], functions=[partial_max(age#246)], output=[name#274, max#548])
> +- Scan JDBCRelation(test) [numPartitions=1] [age#246,name#274] PushedAggregates: [], PushedFilters: [], PushedGroupby: [], ReadSchema: struct<age:int,name:string>
> scala> dfp.groupBy("name").max("age").queryExecution.toString
> res20: String =
> "== Parsed Logical Plan ==
> Aggregate [name#274], [name#274, max(age#246) AS max(age)#581]
> +- Relation [age#246] JDBCRelation(test) [numPartitions=1]
> == Analyzed Logical Plan ==
> name: string, max(age): int
> Aggregate [name#274], [name#274, max(age#246) AS max(age)#581]
> +- Relation [age#24...
> {code}
> What could be the problem? Should pushDownAggregate work in this case?
--
This message was sent by Atlassian Jira
(v8.20.7#820007)