Posted to dev@phoenix.apache.org by "Kadir Ozdemir (Jira)" <ji...@apache.org> on 2023/04/14 03:24:00 UTC

[jira] [Resolved] (PHOENIX-6832) Uncovered Global Secondary Indexes

     [ https://issues.apache.org/jira/browse/PHOENIX-6832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kadir Ozdemir resolved PHOENIX-6832.
------------------------------------
    Fix Version/s: 5.2.0
       Resolution: Fixed

> Uncovered Global Secondary Indexes
> ----------------------------------
>
>                 Key: PHOENIX-6832
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-6832
>             Project: Phoenix
>          Issue Type: Improvement
>            Reporter: Kadir Ozdemir
>            Assignee: Kadir Ozdemir
>            Priority: Major
>             Fix For: 5.2.0
>
>
> An index is called an uncovered index if it cannot serve a query by itself. The sole purpose of an uncovered index is to identify the data table rows to be scanned for the query. This implies that the DDL for an uncovered index does not have an INCLUDE clause.
> Conversely, an index is called a covered index if it can serve a query by itself. Note that a covered index does not necessarily cover all queries; it only means that it can cover some query. A covered index can still cover some queries even if the index DDL does not have an INCLUDE clause. This is because a given query may reference only PK and/or indexed columns, and thus a covered index without any included columns can serve this query by itself (i.e., without joining index rows with data table rows). Another use case for covered indexes without included columns is count(*) queries. Currently, Phoenix uses indexes for count(*) queries by default.
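
The covering rule described above can be illustrated with a small sketch. This is not Phoenix code: the function name index_covers_query and the column names are hypothetical, and the check is reduced to a set comparison, under the assumption that an index row carries the PK columns, the indexed columns, and any included columns.

```python
def index_covers_query(referenced, pk_columns, indexed_columns, included_columns=()):
    """Return True when every column the query references is stored in the index row.

    Hypothetical helper for illustration only; it is not a Phoenix API.
    """
    available = set(pk_columns) | set(indexed_columns) | set(included_columns)
    return set(referenced) <= available


# Hypothetical schema: PK is "id", the index is on "email", nothing is INCLUDEd.
pk = ["id"]
indexed = ["email"]

# Only PK and indexed columns are referenced: the index can serve the query alone.
covers_lookup = index_covers_query(["id", "email"], pk, indexed)

# "name" lives only in the data table: index rows must be joined with data table rows.
covers_wide = index_covers_query(["id", "email", "name"], pk, indexed)

# A count(*) query references no particular column, so the index alone is enough.
covers_count = index_covers_query([], pk, indexed)
```

In this model an index without included columns covers the first and third queries but not the second, which is the situation in which an uncovered index is used only to locate data table rows.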
> Since uncovered indexes will be used only to identify the data table rows affected by a given query, and the column values will be picked up from the data table, we can provide a solution that is much simpler than the solution for covered indexes. It takes advantage of the fact that the data table is the source of truth and that an index table is used only to map secondary keys to primary keys in order to eliminate full table scans. The correctness of such a solution is ensured if, for every data table row, there exists an index row. Our solution for updating the data tables and their global secondary indexes in a consistent fashion would then be a two-phase update approach, where we first insert the index table rows and, only if those writes succeed, update the data table rows.
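
A minimal in-memory sketch of the two-phase ordering described above, assuming plain dictionaries in place of HBase tables. The names Table, IndexWriteError and upsert are hypothetical and do not correspond to Phoenix classes; the sketch only shows that the data table write is attempted after the index write has succeeded.

```python
class IndexWriteError(Exception):
    """Raised by the toy table when a write is configured to fail."""


class Table:
    """Toy stand-in for a table: a dict plus an optional failure switch."""

    def __init__(self, fail_writes=False):
        self.rows = {}
        self.fail_writes = fail_writes

    def put(self, key, value):
        if self.fail_writes:
            raise IndexWriteError("simulated write failure")
        self.rows[key] = value


def upsert(data_table, index_table, pk, row, indexed_column):
    # Phase 1: write the index row first. Its key maps the secondary key to the PK.
    index_table.put((row[indexed_column], pk), None)
    # Phase 2: reached only if phase 1 did not raise.
    data_table.put(pk, row)


# Successful update: both the index row and the data row exist afterwards.
data, index = Table(), Table()
upsert(data, index, "r1", {"email": "a@x"}, "email")

# Failed index write: the data table is left untouched.
data2, broken_index = Table(), Table(fail_writes=True)
try:
    upsert(data2, broken_index, "r2", {"email": "b@x"}, "email")
    index_failure_seen = False
except IndexWriteError:
    index_failure_seen = True
```

With this ordering a failure can leave an index row without a data row, but never a data row without an index row, which is the invariant the description relies on.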
> This approach does not require reading the existing data table rows, which is currently required for covered indexes. It also does not require two-phase commit writes for updating and maintaining global secondary index table rows. Eliminating the data table read operation and the RPC call that updates the verification status on the corresponding index row would cut index write latency overhead by at least 50% for global uncovered indexes when compared to global covered indexes. This is because global covered indexes require one data table read and two index write operations for every data table update, whereas global uncovered indexes would require only one index write. For batch writes, the expected performance and latency improvement would be much higher than 50%, since a batch of random row updates would no longer require random seeks on the data table to read existing data table rows.
> PHOENIX-6458, PHOENIX-6501 and PHOENIX-6663 improve the performance and efficiency of joining index rows with their data table rows when a covered index cannot cover a given query. We can leverage these improvements to support uncovered indexes.
> Uncovered indexes would be a significant performance improvement for write-intensive workloads. Another common use case where uncovered indexes are desirable is upsert select on the data table, where a subset of rows is updated in a batch. In this use case, the select query performance is greatly improved by a covered index, but the upsert part suffers from the covered index write overhead, especially when the selected data table rows are not stored consecutively on disk, which is the most common case.
> As mentioned before, the DDL for creating an uncovered index does not have the INCLUDE clause. We can add the UNCOVERED keyword to indicate that the index to be created is an uncovered index, for example, CREATE UNCOVERED INDEX.
> As in the case of covered indexes, we can perform read repair for uncovered indexes too. The difference is that instead of using the verify status of index rows, we would check whether the corresponding data table row exists for a given index row. Since we would always retrieve the data table rows to join back with index rows for uncovered indexes, the read repair cost would be incurred only for deleting invalid index rows. Also, the existing index reverse verification and repair feature supported by IndexTool can be used to do bulk repair operations from time to time.
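
The read repair idea above can be sketched as follows, again with in-memory structures rather than Phoenix internals. The function name scan_with_read_repair is hypothetical; the index is modeled as a set of (secondary key, PK) pairs and the data table as a dict keyed by PK. The only check made is the one the description names: whether the data table row exists.

```python
def scan_with_read_repair(index_rows, data_rows, secondary_value):
    """Join matching index rows with data rows, deleting index rows that have no data row."""
    results = []
    # sorted() materializes the matching keys first, so the set can be modified safely.
    for key in sorted(k for k in index_rows if k[0] == secondary_value):
        _, pk = key
        row = data_rows.get(pk)
        if row is None:
            # Invalid index row: no corresponding data table row, so repair by deleting it.
            index_rows.discard(key)
            continue
        results.append((pk, row))
    return results


# "r2" has an index row but no data row, e.g. after a failed data table write.
index_rows = {("a@x", "r1"), ("a@x", "r2")}
data_rows = {"r1": {"email": "a@x"}}

found = scan_with_read_repair(index_rows, data_rows, "a@x")
```

Because the data table lookup is needed anyway to fetch column values, the only extra work in this sketch is the delete of the orphaned index row.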



--
This message was sent by Atlassian Jira
(v8.20.10#820010)