Posted to dev@phoenix.apache.org by "Kadir OZDEMIR (Jira)" <ji...@apache.org> on 2019/10/18 02:56:00 UTC

[jira] [Reopened] (PHOENIX-5527) Unverified index rows should not be deleted due to replication lag

     [ https://issues.apache.org/jira/browse/PHOENIX-5527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kadir OZDEMIR reopened PHOENIX-5527:
------------------------------------

In the original design for consistent indexes, we do a three-phase write. In the first phase, we write full index rows with unverified status, then we write the data table rows, and finally we overwrite the index row status and set it to verified. All these writes get the same timestamp so that index and data table entries are consistent. This timestamp is the wall clock time of the server at the time the data table row is read to prepare the index mutations.

Now if an index row is replicated before its data row and is scanned at the destination, this row can be deleted by read repair. The delete timestamp will be the same as the existing row timestamp. Since deletes always trump puts when the timestamps are the same, the index row will not be visible even if the data row is replicated later. To reduce the occurrences of this event, we set the age threshold for deleting unverified rows to 7 days as a stopgap solution for now. However, the side effect of this is an increase in the number of unverified rows and unnecessary read repairs.
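
To make the masking concrete, here is a minimal sketch of the failure sequence at the destination cluster. It is a toy in-memory model, not HBase or Phoenix code: ToyRow and replay_old_design are hypothetical names, and the model assumes only that a delete marker hides every put whose timestamp is less than or equal to its own, regardless of arrival order.

```python
class ToyRow:
    """Toy model of one index row: puts keyed by timestamp plus delete markers."""

    def __init__(self):
        self.puts = {}        # timestamp -> value
        self.delete_ts = []   # timestamps of delete markers

    def put(self, ts, value):
        self.puts[ts] = value

    def delete(self, ts):
        # Assumption: a delete marker masks every put with timestamp <= ts.
        self.delete_ts.append(ts)

    def read(self):
        # Return the newest put that no delete marker masks, or None.
        cutoff = max(self.delete_ts, default=-1)
        visible = {ts: v for ts, v in self.puts.items() if ts > cutoff}
        return visible[max(visible)] if visible else None


def replay_old_design(ts):
    """Replay the original design at the destination, all writes at timestamp ts."""
    row = ToyRow()
    row.put(ts, "full index row, unverified")  # index row replicated first
    row.delete(ts)                             # read repair deletes it at the same ts
    row.put(ts, "full index row, verified")    # verified write arrives late
    return row.read()


print(replay_old_design(100))
```

In this model the final read finds nothing, because the late put carries the same timestamp as the delete marker.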

There is a better solution for this replication lag problem as follows:

1. Instead of writing the full index row in the first phase, write it in the last phase. So, in the first phase, we write only the unverified status for the index row, and in the last phase we write the full index row.

2. The timestamp of the unverified row is the timestamp of the full index row (and also of the data table row) minus 1. This ensures that if the unverified row is deleted by read repair, the delete will not mask the verified row.

This change does not impact the correctness of the design. Now, if the index row is replicated before the data table row and is scanned, it can be deleted safely, as this will delete only the unverified status. When the full index row is replicated, it will be visible to scans.
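
The effect of the minus-1 timestamp can be sketched the same way. This is again a toy model under the assumption that a delete marker hides only puts with a timestamp less than or equal to its own; ToyRow and replay_new_design are hypothetical names, not Phoenix APIs.

```python
class ToyRow:
    """Toy model of one index row: puts keyed by timestamp plus delete markers."""

    def __init__(self):
        self.puts = {}        # timestamp -> value
        self.delete_ts = []   # timestamps of delete markers

    def put(self, ts, value):
        self.puts[ts] = value

    def delete(self, ts):
        # Assumption: a delete marker masks every put with timestamp <= ts.
        self.delete_ts.append(ts)

    def read(self):
        cutoff = max(self.delete_ts, default=-1)
        visible = {ts: v for ts, v in self.puts.items() if ts > cutoff}
        return visible[max(visible)] if visible else None


def replay_new_design(ts, read_repair_runs=True):
    """Replay the proposed design at the destination for data timestamp ts."""
    row = ToyRow()
    row.put(ts - 1, "unverified status only")  # phase 1 write, at ts - 1
    if read_repair_runs:
        row.delete(ts - 1)                     # read repair deletes at the row's own ts
    row.put(ts, "full index row, verified")    # phase 3 write, at ts
    return row.read()


print(replay_new_design(100))
print(replay_new_design(100, read_repair_runs=False))
```

In both runs the full index row remains readable, since the delete marker sits one timestamp below it.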

This also improves the overall design in terms of efficiency. In the presence of concurrent writes, we skip the last write phase. These writes leave the index rows in unverified status. Similarly, if the first or second phase write fails, we do not proceed with the third phase.

Since, with this change, we will be writing only the empty column for index tables in these failure cases, storage usage will improve because we will write less index data.

The actual fix for the replication lag should be not to replicate index tables in the first place, and to derive them from the data table writes as we do on the local cluster. When we have the actual fix, we may remove the subtraction of 1 from the unverified row timestamp (although we may also want to keep it, as it can protect the index rows against deletions caused by some crazy race conditions).

The patch for this is attached. I ran the tests locally and all passed except for one failure in a newly introduced IT (EmptyColumnIT). The patch is quite small and straightforward. I am hoping to get a +1 quickly from one of you, [~gjacoby], [~vincentpoon], [~abhishek.chouhan], [~larsh].

> Unverified index rows should not be deleted due to replication lag 
> -------------------------------------------------------------------
>
>                 Key: PHOENIX-5527
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-5527
>             Project: Phoenix
>          Issue Type: Improvement
>    Affects Versions: 5.0.0, 4.14.3
>            Reporter: Kadir OZDEMIR
>            Assignee: Kadir OZDEMIR
>            Priority: Major
>             Fix For: 4.15.0, 5.1.0
>
>         Attachments: PHOENIX-5527.master.001.patch
>
>
> The current default delete time for unverified index rows is 10 minutes. If an index table row is replicated before its data table row and the replicated row is unverified at the time of replication, it can be deleted when it is scanned on the destination cluster. To prevent these deletes due to replication lag issues, we should increase the default time to 7 days. This value is configurable using the configuration parameter phoenix.global.index.row.age.threshold.to.delete.ms.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)