You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/18 23:48:09 UTC

[GitHub] [spark] HeartSaVioR edited a comment on pull request #35552: [SPARK-38237][SQL][SS] Introduce a new config to require all cluster keys on Aggregate

HeartSaVioR edited a comment on pull request #35552:
URL: https://github.com/apache/spark/pull/35552#issuecomment-1045348369


   Thanks all for the valuable inputs. I really appreciated all the inputs!
   
   First of all, I have to admit I missed the partial aggregation. My bad.
   
   If I understand correctly, we seem to be on the same page that 
   
   1) There are various cases skew happens before aggregation (e.g. join), and ClusteredDistribution wouldn't deal with skew since it is already clustered as it is required.
   2) In above cases, there are some cases Spark cannot deal automatically. (e.g. insufficient stats)
   3) We do partial aggregate hence simply changing the required child distribution to HashClusteredDistribution wouldn't help.
   
   (Please correct me if there is something we don't agree upon.)
   
   I agree this fix doesn't seem to go with right way, but seems like it is still valuable to discuss further for the better fix, regardless of the condition we address it in this PR or defer to other PR in the future.
   
   A.
   (First of all, sorry for the ignorance. I haven't dealt with pure SQL statement, so my question could be very silly.)
   
   I'd like to verify that there are alternatives end users can always leverage them for any cases. I agree DataFrame has no problem on this given existence of `repartition()`. But how about SQL statement? Do we provide the same for SQL statement? Could end users inject the hint anywhere before operator and it will take effect on the exact place?
   
   B.
   3) is valid and a great point I totally missed. But then there would be another question, "why we do partial aggregate even users express the intention they want to do full shuffle because they indicate a skew?" I suspect we have to skip it.
   
   If we go with skipping the partial aggregate altogether when end users give a hint (e.g. config, or better way if exists), would it work and would it be the acceptable solution for us?
   
   C. (out of scope on this PR, but to be future proof)
   Let's expand this to the broader operators leveraging ClusteredDistribution. 3) doesn't even exist for them and we have a problem there as well. What would be the way to fix it if users indicate the issue and want to deal with?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org