Posted to issues@ignite.apache.org by "Aleksey Plekhanov (Jira)" <ji...@apache.org> on 2023/10/19 15:08:00 UTC

[jira] [Updated] (IGNITE-20697) Move physical records from WAL to another storage

     [ https://issues.apache.org/jira/browse/IGNITE-20697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aleksey Plekhanov updated IGNITE-20697:
---------------------------------------
    Description: 
Currently, physical records take up most of the WAL size. But physical records in WAL files are required only for crash recovery, and they are useful only for a short period of time (since the last checkpoint).
The size of physical records written between two checkpoints exceeds the size of all modified pages, since we store a page snapshot record for each modified page plus page delta records if a page is modified more than once between checkpoints.
We process a WAL file several times in the stable workflow (without crashes and rebalances):
 # We write records to WAL files
 # We copy WAL files to the archive
 # We compact WAL files (remove physical records + compress)

So, in total, we write all physical records twice and read them at least twice (a back-of-envelope estimate follows below).
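
To make this concrete, here is a rough estimate of the extra disk I/O caused by physical records. All workload numbers (dirty page count, updates per page, delta record size) are assumptions for illustration, not measurements; 4 KB is the default Ignite page size.
{code:java}
/** Back-of-envelope estimate; all workload numbers are assumptions. */
public class WalPhysicalRecordsEstimate {
    public static void main(String[] args) {
        long pageSize = 4096;         // default Ignite page size, bytes
        long dirtyPages = 100_000;    // pages modified between two checkpoints (assumed)
        double updatesPerPage = 3.0;  // average modifications per dirty page (assumed)
        long avgDeltaRecord = 128;    // average page delta record size, bytes (assumed)

        // Current scheme: a full page snapshot for the first modification of a
        // page, plus a delta record for every subsequent modification.
        long snapshotBytes = dirtyPages * pageSize;
        long deltaBytes = (long) (dirtyPages * (updatesPerPage - 1) * avgDeltaRecord);
        long physicalInWal = snapshotBytes + deltaBytes;

        // WAL lifecycle: written once, copied to the archive (read + write),
        // then read again for compaction => 2 writes + 2 reads of these bytes.
        long currentExtraIo = 4 * physicalInWal;

        // Proposed scheme: each dirty page is written once more, to a delta file.
        long proposedExtraIo = dirtyPages * pageSize;

        System.out.printf("physical records in WAL: %,d MB%n", physicalInWal >> 20);
        System.out.printf("current extra disk I/O:  %,d MB%n", currentExtraIo >> 20);
        System.out.printf("proposed extra disk I/O: %,d MB%n", proposedExtraIo >> 20);
    }
}
{code}
With these assumptions, roughly 1.6 GB of WAL-related I/O per checkpoint interval would be replaced by a single write of about 400 MB to a delta file.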

To reduce disk workload, we can move physical records to another storage and stop writing them to WAL files. To provide the same crash recovery guarantees, we can write each modified page twice during checkpoint: first to some delta file and then to the page storage. In this case, if we crash during a write to the page storage, we can recover any page from the delta file (instead of from the WAL, as we do now). A sketch of this write ordering is shown below.
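
Here is a minimal sketch of the proposed double-write, assuming hypothetical DeltaFile and PageStore abstractions (illustration only, not the real Ignite interfaces):
{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;

// Hypothetical abstractions for illustration only.
interface DeltaFile {
    void write(long pageId, ByteBuffer page) throws IOException;
    void sync() throws IOException;
}

interface PageStore {
    void write(long pageId, ByteBuffer page) throws IOException;
}

class CheckpointPageWriter {
    /** Writes one dirty page durably: delta file first, page store second. */
    void writePage(long pageId, ByteBuffer page, DeltaFile delta, PageStore store)
        throws IOException {
        delta.write(pageId, page); // page is now recoverable outside the page store
        delta.sync();              // the delta copy must be durable before we ...
        page.rewind();
        store.write(pageId, page); // ... overwrite the old page in the page store
    }
}
{code}
On restart after a crash, recovery would first copy pages from the delta file of the last (incomplete) checkpoint back into the page storage and only then replay logical WAL records, giving the same guarantee that physical WAL records provide today. In a real implementation the delta file would likely be synced once per batch rather than per page; the sketch syncs per page only to make the ordering constraint explicit.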

This proposal has pros and cons.
Pros:
 - Less stored data (we don't store page snapshots and delta records for every modification, only the final state of each page)
 - Reduced disk workload (we additionally write all modified pages once, instead of 2 writes and 2 reads of a larger amount of data)
 - Potentially reduced latency (instead of writing physical records synchronously during data modification, we write only logical records to the WAL, and physical pages are written by the checkpointer threads)

Cons:
 - Increased checkpoint duration (we have to write twice the amount of data during checkpoint)

Let's try to implement it and benchmark.

  was:
Currently, physical records take up most of the WAL size. But physical records in WAL files are required only for crash recovery, and they are useful only for a short period of time (since the last checkpoint).
The size of physical records written between two checkpoints exceeds the size of all modified pages, since we store a page snapshot record for each modified page plus page delta records if a page is modified more than once between checkpoints.
We process a WAL file several times in the normal workflow (without crashes):
1) We write records to WAL files
2) We copy WAL files to the archive
3) We compact WAL files (remove physical records + compress)
So, in total, we write all physical records twice and read them twice.
To reduce disk workload, we can move physical records to another storage and stop writing them to WAL files.
To provide the same crash recovery guarantees, we can write each modified page twice during checkpoint: first to some delta file and then to the page storage. In this case, if we crash during a write to the page storage, we can recover any page from the delta file (instead of from the WAL, as we do now).
This proposal has pros and cons.
Pros:
- Less stored data (we don't store page snapshots and delta records for every modification, only the final state of each page)
- Reduced disk workload (we additionally write all modified pages once, instead of 2 writes and 2 reads of a larger amount of data)
- Potentially reduced latency (instead of writing physical records synchronously during data modification, we write only logical records to the WAL, and physical pages are written by the checkpointer threads)
Cons:
- Increased checkpoint duration (we have to write twice the amount of data during checkpoint)
Let's try it and benchmark.


> Move physical records from WAL to another storage 
> --------------------------------------------------
>
>                 Key: IGNITE-20697
>                 URL: https://issues.apache.org/jira/browse/IGNITE-20697
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Aleksey Plekhanov
>            Assignee: Aleksey Plekhanov
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)