Posted to reviews@spark.apache.org by "beliefer (via GitHub)" <gi...@apache.org> on 2023/05/27 13:37:10 UTC

[GitHub] [spark] beliefer opened a new pull request, #41342: [WIP][SPARK-43829][CONNECT] Improve SparkConnectPlanner by reuse Dataset and avoid construct new Dataset

beliefer opened a new pull request, #41342:
URL: https://github.com/apache/spark/pull/41342

   ### What changes were proposed in this pull request?
   Currently, `SparkConnectPlanner.transformRelation` always returns the `LogicalPlan` of a `Dataset`,
   and `SparkConnectStreamHandler.handlePlan` then constructs a new `Dataset` around that plan.
   In some cases, `SparkConnectStreamHandler.handlePlan` could instead reuse the `Dataset` that `SparkConnectPlanner.transformRelation` already created.
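
   As a rough illustration of the flow described above, here is a minimal, self-contained sketch that uses only public Spark APIs. The `planLikeTransform` helper and the `ReuseDatasetSketch` object are invented for this sketch; the actual players (`SparkConnectPlanner.transformRelation`, `SparkConnectStreamHandler.handlePlan`, and the internal `private[sql]` helper `Dataset.ofRows`) are referenced only in comments.

   ```scala
   import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
   import org.apache.spark.sql.{DataFrame, SparkSession}

   object ReuseDatasetSketch {
     // Toy stand-in for the planner: it already has a DataFrame in hand while
     // transforming the relation, but today only the LogicalPlan is handed back.
     def planLikeTransform(spark: SparkSession): (DataFrame, LogicalPlan) = {
       val df = spark.range(10).toDF("id").filter("id > 3")
       (df, df.queryExecution.logical)
     }

     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().master("local[*]").appName("reuse-sketch").getOrCreate()

       val (alreadyBuilt, plan) = planLikeTransform(spark)
       println(plan.treeString)

       // Today (simplified): the handler only receives `plan`, so it wraps it in a
       // brand-new Dataset (inside Spark via the private[sql] Dataset.ofRows helper).
       // The idea of this PR: when a Dataset was already constructed during planning,
       // hand that Dataset back and execute it directly instead of re-wrapping the plan.
       alreadyBuilt.show()

       spark.stop()
     }
   }
   ```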
   
   
   ### Why are the changes needed?
   Improve `SparkConnectPlanner` by reusing the `Dataset` built during planning instead of constructing a new `Dataset` for execution.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No. This only changes the internal implementation.
   
   
   ### How was this patch tested?
   Existing test cases.
   




[GitHub] [spark] beliefer commented on pull request #41342: [SPARK-43829][CONNECT] Improve SparkConnectPlanner by reuse Dataset and avoid construct new Dataset

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #41342:
URL: https://github.com/apache/spark/pull/41342#issuecomment-1591189110

   > @beliefer thanks for doing this. I will take a look :)...
   
   Have you had time to take a look?
   
   




[GitHub] [spark] beliefer commented on pull request #41342: [SPARK-43829][CONNECT] Improve SparkConnectPlanner by reuse Dataset and avoid construct new Dataset

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #41342:
URL: https://github.com/apache/spark/pull/41342#issuecomment-1566560446

   ping @HyukjinKwon @hvanhovell @zhengruifeng cc @amaliujia  @LuciferYang 




[GitHub] [spark] hvanhovell commented on pull request #41342: [SPARK-43829][CONNECT] Improve SparkConnectPlanner by reuse Dataset and avoid construct new Dataset

Posted by "hvanhovell (via GitHub)" <gi...@apache.org>.
hvanhovell commented on PR #41342:
URL: https://github.com/apache/spark/pull/41342#issuecomment-1568041469

   @beliefer thanks for doing this. I will take a look :)...
   
   Architecturally, I think we need to move to a situation where we do not create Datasets/DataFrames in the planner at all. It should all be logical plans. The only place where we may need a Dataset/DataFrame is for execution. The reason for this is that I want the two implementations to share as much code as possible, by making Connect the primary API and having the current API an extension of that.
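
   To make that direction concrete, here is a minimal sketch of what "all logical plans" could look like: Catalyst nodes are composed directly, and no Dataset/DataFrame is touched until the execution boundary. This is only an illustration of the idea, not code from this PR or from Spark Connect.

   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
   import org.apache.spark.sql.catalyst.expressions.{GreaterThan, Literal}
   import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan, Project}

   object PlansOnlySketch {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().master("local[*]").appName("plans-only").getOrCreate()
       import spark.implicits._

       // Start from some source plan (borrowed here from a small local DataFrame).
       val source: LogicalPlan = Seq(1, 2, 3, 4, 5).toDF("id").queryExecution.logical

       // Every "transformation" is plan -> plan; no Dataset/DataFrame is created.
       val filtered: LogicalPlan = Filter(GreaterThan(UnresolvedAttribute("id"), Literal(2)), source)
       val query: LogicalPlan = Project(Seq(UnresolvedAttribute("id")), filtered)

       // Only at the execution boundary would the plan be wrapped into a DataFrame
       // (inside Spark that is the private[sql] Dataset.ofRows helper).
       println(query.treeString)

       spark.stop()
     }
   }
   ```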




[GitHub] [spark] zhengruifeng commented on pull request #41342: [SPARK-43829][CONNECT] Improve SparkConnectPlanner by reuse Dataset and avoid construct new Dataset

Posted by "zhengruifeng (via GitHub)" <gi...@apache.org>.
zhengruifeng commented on PR #41342:
URL: https://github.com/apache/spark/pull/41342#issuecomment-1568411839

   We have discussed this topic offline before: if we add a dedicated plan for DataFrame ops (e.g. https://github.com/apache/spark/pull/38395) in the future, then we should not make such changes to the Dataset handling.




[GitHub] [spark] beliefer commented on pull request #41342: [SPARK-43829][CONNECT] Improve SparkConnectPlanner by reuse Dataset and avoid construct new Dataset

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #41342:
URL: https://github.com/apache/spark/pull/41342#issuecomment-1599976393

   ping @hvanhovell Could you take a look?




Re: [PR] [SPARK-43829][CONNECT] Improve SparkConnectPlanner by reuse Dataset and avoid construct new Dataset [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #41342:
URL: https://github.com/apache/spark/pull/41342#issuecomment-1769672939

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!




[GitHub] [spark] beliefer commented on pull request #41342: [SPARK-43829][CONNECT] Improve SparkConnectPlanner by reuse Dataset and avoid construct new Dataset

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #41342:
URL: https://github.com/apache/spark/pull/41342#issuecomment-1568066323

   @hvanhovell Thank you for the first review.
   
   Because Spark SQL doesn't provide a convenient way to construct a logical plan for each API, we had to create Datasets/DataFrames in the planner.
   
   There are two ways to avoid creating duplicate Datasets/DataFrames:
   1. Add many new logical nodes and Dataset APIs that return logical plans directly. That approach deeply affects the SQL module.
   2. Create the Dataset/DataFrame only once in the planner and reuse it for execution (see the sketch below). This PR follows this approach.
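
   A hedged sketch of what option 2 means in practice (the `OncePlanner` class and its `transformAsDataFrame` method are made up for this illustration; they are not the actual `SparkConnectPlanner` API):

   ```scala
   import org.apache.spark.sql.{DataFrame, SparkSession}

   // Toy planner that builds its DataFrame once and hands that same instance to
   // the execution step, instead of returning only a plan and forcing the caller
   // to wrap it in a second Dataset.
   final class OncePlanner(spark: SparkSession) {
     private var built: Option[DataFrame] = None

     def transformAsDataFrame(): DataFrame = {
       if (built.isEmpty) {
         built = Some(spark.range(100).toDF("id").filter("id % 2 = 0"))
       }
       built.get
     }
   }

   object OncePlannerDemo {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().master("local[*]").appName("once-planner").getOrCreate()
       val planner = new OncePlanner(spark)

       // Execution reuses exactly the DataFrame the planner constructed.
       val df = planner.transformAsDataFrame()
       assert(df eq planner.transformAsDataFrame()) // same instance, nothing re-wrapped

       df.show()
       spark.stop()
     }
   }
   ```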




[GitHub] [spark] beliefer commented on pull request #41342: [SPARK-43829][CONNECT] Improve SparkConnectPlanner by reuse Dataset and avoid construct new Dataset

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #41342:
URL: https://github.com/apache/spark/pull/41342#issuecomment-1610557865

   ping @hvanhovell Have you had time to take a look? I have rebased many times.




[GitHub] [spark] HyukjinKwon commented on a diff in pull request #41342: [SPARK-43829][CONNECT] Improve SparkConnectPlanner by reuse Dataset and avoid construct new Dataset

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on code in PR #41342:
URL: https://github.com/apache/spark/pull/41342#discussion_r1250437230


##########
connector/connect/server/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala:
##########
@@ -101,6 +101,40 @@ class SparkConnectPlanner(val sessionHolder: SessionHolder) extends Logging {
   }
 
   // The root of the query plan is a relation and we apply the transformations to it.
+  def transformRelationAsDataset(rel: proto.Relation): Dataset[Row] = {

Review Comment:
   This doesn't look like it makes the code much better ... it basically saves the call to `Dataset.ofRows` in the individual transformations, which doesn't seem worthwhile.





[GitHub] [spark] beliefer commented on pull request #41342: [SPARK-43829][CONNECT] Improve SparkConnectPlanner by reuse Dataset and avoid construct new Dataset

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #41342:
URL: https://github.com/apache/spark/pull/41342#issuecomment-1628913990

   ping @hvanhovell Have you had time to take a look?




[GitHub] [spark] beliefer commented on pull request #41342: [SPARK-43829][CONNECT] Improve SparkConnectPlanner by reuse Dataset and avoid construct new Dataset

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on PR #41342:
URL: https://github.com/apache/spark/pull/41342#issuecomment-1569974676

   > We have discussed this topic offline before: if we add a dedicated plan for DataFrame ops (e.g. #38395) in the future, then we should not make such changes to the Dataset handling.
   
   The dedicated plan would require more work and seems to introduce more intrusive changes into Spark.




Re: [PR] [SPARK-43829][CONNECT] Improve SparkConnectPlanner by reuse Dataset and avoid construct new Dataset [spark]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed pull request #41342: [SPARK-43829][CONNECT] Improve SparkConnectPlanner by reuse Dataset and avoid construct new Dataset
URL: https://github.com/apache/spark/pull/41342

