You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/10/06 21:26:20 UTC

[GitHub] [spark] viirya edited a comment on pull request #29950: [SPARK-32945][SQL] Avoid collapsing projects if reaching max allowed common exprs

viirya edited a comment on pull request #29950:
URL: https://github.com/apache/spark/pull/29950#issuecomment-704561415


   > Perhaps the max number of common expressions is not the best metric here?
   > 
   > Lets compare two cases:
   > 
   >     1. On the lower project you have a `JsonToStructs` and on upper Project you get 3 fields from that struct. This would mean 2 redundant computations and the "metric" you are looking at is 3.
   > 
   >     2. On the lower project you have two `JsonToStructs` and on upper Project you get 2 fields from both stucts. This would also mean 2 redundant computations and the "metric" you are looking at is 2.
   > 
   > 
   > Adding more `JsonToStructs` to the lower level would increase the number redundant computations without increasing the max value.
   > So as an alternative I would propose "the number of redundant computations" (sum of values in the `exprMap` minus its size) as a metric to use.
   > 
   > Although I must admit, that in that case we might cache more values for the number of extra computations we save.
   > So both of them have their benefits.
   
   Yes, in the case you add the number of redundant computations each time you add one more `JsonToStructs`. But overall it should not cause noticeable performance issue because you simply three times the running cost of `JsonToStructs` (assume each `JsonToStructs` has 2 redundant computations). In our case we notice performance issue when one `JsonToStructs` is repeated more than 50 times.
   
   The number of redundant computations is misleading. If we have 100 `JsonToStructs` in lower project, each of them has 2 redundant computations in upper project, it doesn't mean we 100 times the running cost of `JsonToStructs`. In other words, it is hard to tell the performance difference between 10 and 20 redundant computations. If the redundant computations come from the same expression, then we have 10 times v.s. 20 times running cost, but if they are from 10 expressions? We might just have 2 to 3 times running cost.
   
   
   
   
   
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org