Posted to issues@phoenix.apache.org by "Bin Shi (JIRA)" <ji...@apache.org> on 2018/09/07 17:08:00 UTC
[jira] [Commented] (PHOENIX-4594) Perform binary search on guideposts during query compilation

    [ https://issues.apache.org/jira/browse/PHOENIX-4594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607374#comment-16607374 ] 

Bin Shi commented on PHOENIX-4594:
----------------------------------

It seems that we have more issues in Phoenix Stats.
 # In BaseResultIterators.getParallelScans(...), we use a linear search over the guide posts to find where the intersection begins – this is the issue described in this Jira and in previous comments. The proposed solution, "make a pass through all guideposts and put them into a List in which we can perform a binary search", has one downside besides the memory cost: it may do unnecessary work decoding guide posts that lie beyond the scan range, whereas the original code doesn't have this problem. If this unnecessary work turns out to be expensive in some cases, we could decode guide posts in batches and binary search within a moving window to find where the intersection begins. I won't go that far for the time being, because I haven't seen that it is useful when GUIDE_POST_WIDTH is below 100MB.
 # In BaseResultIterators.getParallelScans(...), in all cases (serial plan or not, point lookup or not, explain plan or a real query executed on the server side, useStatsForParallelization flag turned on or off), we always go through the guide posts, collect estimates (of row count and size) and create scans based on guide posts. We should deliberately differentiate among these cases. For example, for a point lookup it isn't necessary to collect estimates from guide posts because they are never used afterwards, and generating scans at the region level is enough. As another example, if useStatsForParallelization is turned off, we may still collect estimates from guide posts, but generating scans at the region level is enough.
 # Regarding the overall Phoenix Stats design: while I do think Phoenix Stats is useful for Query Complexity Estimation and Query Optimization, according to [Statistics Collection|https://phoenix.apache.org/update_statistics.html] (see the “parallelization” section) on the Phoenix website, Phoenix Stats is designed to provide a means of gaining intra-region parallelization (thus increasing performance and reducing query latency). *I doubt that Phoenix Stats is a good design for achieving this goal, but I may be missing context, which could explain my misunderstanding.*
 
In my understanding, the current design works well only when a region server is processing one or a few queries and the overall load is light, so that increasing the parallelization within a query, by assigning each chunk of data between guideposts to a thread/handler in a separate scan, helps performance. When the overall load on a region server is high, due to high resource consumption or multiple queries being processed on the server, increasing the parallelization within a query could lead to higher system overhead from context switching and L1 cache misses as we increase the # of threads/handlers on the region server (which should still be bounded by the # of CPU cores), or to unpredictable latency when threads/handlers are saturated and a few pieces of the scans wait for free threads/handlers, and eventually hurt performance instead of improving it.

Given the above, whether the current design works well depends on the access pattern and the scenario, but processing multiple queries under high or varying load on region servers should be very common.
 # Because of 3., in Phoenix Stats we might need to achieve a high degree of parallelization at two levels: the client side, where compilation and query optimization happen, decides the parallel scans at the region level, and each individual region server decides the intra-region parallelization based on its current load and guide post info.
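
To make item 1 concrete, here is a minimal sketch of the binary search once the guide posts have been decoded into a sorted List. The class and method names (GuidepostSearch, compareKeys, firstAtOrAfter) are hypothetical and are not Phoenix APIs. It assumes guide posts are row keys ordered as unsigned bytes, and that the intersection begins at the first guide post at or after the scan start key; whether that boundary should be inclusive is also an assumption.

```java
import java.util.List;

/** Hypothetical helper for illustration only; not part of Phoenix. */
final class GuidepostSearch {

    /** Lexicographic comparison of two keys, treating each byte as unsigned. */
    static int compareKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        // On a shared prefix, the shorter key sorts first.
        return a.length - b.length;
    }

    /**
     * Returns the index of the first guide post that is >= startKey, or
     * guideposts.size() if every guide post sorts before startKey.
     * Assumes guideposts is sorted ascending according to compareKeys.
     */
    static int firstAtOrAfter(List<byte[]> guideposts, byte[] startKey) {
        int lo = 0;
        int hi = guideposts.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1; // unsigned shift avoids int overflow
            if (compareKeys(guideposts.get(mid), startKey) < 0) {
                lo = mid + 1; // guide post is before the scan start; skip it
            } else {
                hi = mid;     // candidate; keep looking to the left
            }
        }
        return lo;
    }
}
```

For guide posts "b", "d", "f", "h" and a scan start key of "e", firstAtOrAfter returns 2 (the index of "f"). The batched, moving-window variant would apply the same search to each decoded batch instead of to the whole list.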

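The case differentiation suggested in item 2 could be sketched roughly as below. This is only an illustration under my own assumptions: ScanPlanning, Granularity, Decision and decide are hypothetical names, not existing Phoenix classes. Only the two examples given above (point lookup, and useStatsForParallelization turned off) are encoded; the serial plan and explain plan cases are left out because the comment does not say how they should be treated.

```java
/** Hypothetical sketch for illustration only; not part of Phoenix. */
final class ScanPlanning {

    /** Level at which parallel scans are split. */
    enum Granularity { REGION, GUIDEPOST }

    /** Result of the planning decision. */
    static final class Decision {
        final Granularity granularity;
        final boolean collectEstimates;

        Decision(Granularity granularity, boolean collectEstimates) {
            this.granularity = granularity;
            this.collectEstimates = collectEstimates;
        }
    }

    static Decision decide(boolean pointLookup, boolean useStatsForParallelization) {
        if (pointLookup) {
            // Estimates are not used later, so skip the guide posts entirely.
            return new Decision(Granularity.REGION, false);
        }
        if (!useStatsForParallelization) {
            // Estimates may still be collected, but scans stay at region level.
            return new Decision(Granularity.REGION, true);
        }
        // Otherwise behave as today: split scans on guide posts.
        return new Decision(Granularity.GUIDEPOST, true);
    }
}
```
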
> Perform binary search on guideposts during query compilation
> ------------------------------------------------------------
>
>                 Key: PHOENIX-4594
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-4594
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: James Taylor
>            Assignee: Abhishek Singh Chouhan
>            Priority: Major
>
> If there are many guideposts, performance will suffer during query compilation because we do a linear search of the guideposts to find the intersection with the scan ranges. Instead, in BaseResultIterators.getParallelScans() we should populate an array of guideposts and perform a binary search. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)