Posted to gitbox@hive.apache.org by GitBox <gi...@apache.org> on 2021/02/03 09:52:25 UTC

[GitHub] [hive] rbalamohan opened a new pull request #1940: HIVE-24710: Optimise PTF iteration for count(*) to reduce CPU and IO …

rbalamohan opened a new pull request #1940:
URL: https://github.com/apache/hive/pull/1940


   https://issues.apache.org/jira/browse/HIVE-24710
   
   {noformat}
   select x, y, count(*) over (partition by x order by y range between 86400 PRECEDING and CURRENT ROW) r0 from foo
   {noformat}
   
   When "y" contains many duplicate values, the window frame becomes very large and the internal implementation of PTFOperator ends up doing O(n^2) work. For example, in some queries we had 2.5 M entries in the window, which caused a single task to run forever. In addition, there is a high amount of IO from reading and discarding rows from RowContainers (note that we only need the count and nothing from the materialized row).
   
   1. In such cases, there is no need to iterate over the RowContainers repeatedly (internally this does O(n^2) operations and takes forever when the window frame is very large). This can be optimised to reduce CPU burn and IO.
   2. BasePartitionEvaluator::calcFunctionValue need not materialize ROW when the parameters are empty. This codepath can also be optimised.
   
   ### What changes were proposed in this pull request?
   - For count(*), the PR follows a fast path and simply takes the count from PTFPartitionIterator.
   - When the parameters are empty/null, it runs via an optimised iterator that does not materialize anything in ROW. This helps reduce IO cost.
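
A minimal sketch of the idea, not the code in this PR: for a `range between N PRECEDING and CURRENT ROW` frame over a partition already sorted by "y", count(*) can be derived from the frame boundaries alone, without visiting or materializing the rows in between. The class `RangeCountSketch` and both method names are hypothetical and exist only for illustration; Hive's PTFPartitionIterator and RowContainers are not modelled here.

```java
// Illustrative only: a plain sorted array stands in for one window partition.
class RangeCountSketch {

    // Naive approach: for every row, rescan the partition and count the rows
    // whose y lies in [y_i - preceding, y_i]. This is O(n^2) per partition.
    static long[] naiveCounts(long[] sortedY, long preceding) {
        long[] out = new long[sortedY.length];
        for (int i = 0; i < sortedY.length; i++) {
            long lo = sortedY[i] - preceding;
            long count = 0;
            for (int j = 0; j < sortedY.length; j++) {
                if (sortedY[j] >= lo && sortedY[j] <= sortedY[i]) {
                    count++;
                }
            }
            out[i] = count;
        }
        return out;
    }

    // Fast path: both frame boundaries only move forward as i advances,
    // so the count is just (end - start) and the total work is O(n).
    static long[] fastCounts(long[] sortedY, long preceding) {
        int n = sortedY.length;
        long[] out = new long[n];
        int start = 0; // first index inside the frame
        int end = 0;   // one past the last peer of the current row
        for (int i = 0; i < n; i++) {
            while (sortedY[start] < sortedY[i] - preceding) {
                start++;
            }
            while (end < n && sortedY[end] <= sortedY[i]) {
                end++;
            }
            out[i] = end - start;
        }
        return out;
    }
}
```

With many duplicate "y" values the naive version revisits the same large frame for every row, while the fast version touches each boundary position at most once. Rows with the same "y" as the current row are peers and are included in the frame, which is why `end` advances past all of them.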
   
   ### How was this patch tested?
   Tested on a small internal cluster.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: gitbox-unsubscribe@hive.apache.org
For additional commands, e-mail: gitbox-help@hive.apache.org


[GitHub] [hive] rbalamohan merged pull request #1940: HIVE-24710: Optimise PTF iteration for count(*) to reduce CPU and IO …

Posted by GitBox <gi...@apache.org>.
rbalamohan merged pull request #1940:
URL: https://github.com/apache/hive/pull/1940


   




[GitHub] [hive] rbalamohan commented on pull request #1940: HIVE-24710: Optimise PTF iteration for count(*) to reduce CPU and IO …

Posted by GitBox <gi...@apache.org>.
rbalamohan commented on pull request #1940:
URL: https://github.com/apache/hive/pull/1940#issuecomment-780953443


   Thanks for the review @ashutoshc 




[GitHub] [hive] ashutoshc commented on pull request #1940: HIVE-24710: Optimise PTF iteration for count(*) to reduce CPU and IO …

Posted by GitBox <gi...@apache.org>.
ashutoshc commented on pull request #1940:
URL: https://github.com/apache/hive/pull/1940#issuecomment-780917826


   +1 LGTM

