Posted to issues@hbase.apache.org by "Viraj Jasani (Jira)" <ji...@apache.org> on 2021/11/18 18:00:00 UTC

[jira] [Updated] (HBASE-26466) Immutable timeseries usecase - Create new region rather than split existing one

     [ https://issues.apache.org/jira/browse/HBASE-26466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viraj Jasani updated HBASE-26466:
---------------------------------
    Description: 
For the immutable data insertion use case (specifically time-series data), the region split mechanism doesn't seem to provide better availability when the ingestion rate is very high. When we ingest a lot of data, the region split policy tries to split the hot region based on size (either the combined size of all stores, or the size of any single store exceeding the configured max file size) if we consider the default {_}SteppingSplitPolicy{_}. The most recent hot region tends to receive all the latest inserts. When the region is split, the first half of the region (say daughterA) stays on the same server, whereas the second half (daughterB), which is likely to become another hot region because all new updates arrive in the second half in sequential-write fashion, is moved out to another server in the cluster. Hence, once the new daughter region is created, client traffic is redirected to another server. Client requests pile up from the time the region split is triggered until the new daughters come online, and once that is done, the client has to query meta for the updated daughter region and redirect traffic to the new server.
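
To make the split behaviour above concrete, here is a rough, simplified model of a size-based split check and of where sequential keys land after a split. It is only a sketch: the class and method names are hypothetical (not HBase's actual policy classes), the flush size and max file size are assumed illustrative values, and only the "any single store exceeds the threshold" case is modelled.

```java
import java.util.List;

/** Simplified model of a size-based split decision; not HBase's actual classes. */
final class SplitSizeSketch {
    // Assumed illustrative values: 128 MB flush size, 10 GB max file size.
    static final long FLUSH_SIZE = 128L * 1024 * 1024;
    static final long MAX_FILE_SIZE = 10L * 1024 * 1024 * 1024;

    /** Stepping rule: small threshold while the table has one region on the server, else max file size. */
    static long sizeToCheck(int tableRegionsOnServer) {
        return tableRegionsOnServer == 1 ? 2 * FLUSH_SIZE : MAX_FILE_SIZE;
    }

    /** Request a split when any single store exceeds the threshold. */
    static boolean shouldSplit(List<Long> storeSizes, int tableRegionsOnServer) {
        long limit = sizeToCheck(tableRegionsOnServer);
        for (long size : storeSizes) {
            if (size > limit) {
                return true;
            }
        }
        return false;
    }

    /** Keys below the split point go to daughterA; keys at or above it go to daughterB. */
    static String daughterFor(String rowKey, String splitPoint) {
        return rowKey.compareTo(splitPoint) < 0 ? "daughterA" : "daughterB";
    }
}
```

With monotonically increasing row keys (for example, timestamp-prefixed keys), every key written after the split sorts at or above the split point, so daughterFor keeps returning daughterB, which is the daughter that was moved to another server.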

If we could have a configurable region creation strategy that 1) keeps splits disabled for the given table, and 2) dynamically creates a new region with a lexicographically higher start key on the same server and updates the existing region's boundary, the client would have to look up meta only once and could continue ingestion without any SLA degradation caused by region split transitions.
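
As a thought experiment, the strategy above could be modelled roughly as follows. This is a toy in-memory model under stated assumptions, not a proposal for the actual implementation: every class and method name is hypothetical, string keys stand in for byte[] row keys, and an empty string stands in for EMPTY_END_ROW.

```java
import java.util.TreeMap;

/** Toy model of the "append a region instead of splitting" idea; all names are hypothetical. */
final class AppendRegionSketch {
    static final String EMPTY_END_ROW = ""; // stand-in for the open-ended end key

    static final class Region {
        final String startKey;
        String endKey;       // mutable on purpose: the idea requires shrinking a live region's range
        final String server;

        Region(String startKey, String endKey, String server) {
            this.startKey = startKey;
            this.endKey = endKey;
            this.server = server;
        }
    }

    // Regions ordered by start key, similar in spirit to how meta orders them.
    final TreeMap<String, Region> regions = new TreeMap<>();

    AppendRegionSketch(String server) {
        // The table starts with a single region covering the whole key space.
        regions.put("", new Region("", EMPTY_END_ROW, server));
    }

    /** Close the current last region at newStartKey and open a new last region on the same server. */
    Region appendRegion(String newStartKey) {
        Region last = regions.lastEntry().getValue();
        if (newStartKey.compareTo(last.startKey) <= 0) {
            throw new IllegalArgumentException("new start key must sort after " + last.startKey);
        }
        last.endKey = newStartKey;
        Region created = new Region(newStartKey, EMPTY_END_ROW, last.server);
        regions.put(newStartKey, created);
        return created;
    }

    /** Find the region with the greatest start key that is less than or equal to rowKey. */
    Region locate(String rowKey) {
        return regions.floorEntry(rowKey).getValue();
    }
}
```

In this model, appendRegion never moves data or changes servers, so a writer of ever-increasing keys only has to learn about the new last region once.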

Note: a region split might also encounter complications that require the procedure to be rolled back from some step or to continue with internal retries, further delaying ingestion from clients.

There are some complications around updating a live region's start and end keys, as this key range is immutable. We could brainstorm ideas for making them optionally mutable, and any issues that would raise. For instance, a client might continue writing data to the region whose end key was updated, but writes will fail for out-of-range keys, and hence the client will look up meta for the updated key-space range of the table (the new region is created with end key EMPTY_END_ROW).
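
The client-side flow in that last example can be sketched as below. Again, this is a hypothetical toy model and not the HBase client API: a TreeMap stands in for meta, string keys stand in for byte[] row keys, an empty string stands in for EMPTY_END_ROW, and the out-of-range rejection is simulated by checking the key against the region's current range.

```java
import java.util.TreeMap;

/** Toy client flow for a stale cached region location; names are hypothetical. */
final class StaleLocationSketch {
    static final String EMPTY_END_ROW = "";

    static final class Range {
        final String startKey;
        final String endKey;

        Range(String startKey, String endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }

        boolean contains(String rowKey) {
            return rowKey.compareTo(startKey) >= 0
                && (endKey.equals(EMPTY_END_ROW) || rowKey.compareTo(endKey) < 0);
        }
    }

    final TreeMap<String, Range> meta = new TreeMap<>(); // stand-in for meta: start key -> range
    Range cached;                                        // the client's cached location
    int metaLookups = 0;

    StaleLocationSketch() {
        meta.put("", new Range("", EMPTY_END_ROW));
        cached = lookup("");
    }

    Range lookup(String rowKey) {
        metaLookups++;
        return meta.floorEntry(rowKey).getValue();
    }

    /** Server side: shrink the last range and append a new open-ended one. */
    void appendRegion(String newStartKey) {
        Range last = meta.lastEntry().getValue();
        meta.put(last.startKey, new Range(last.startKey, newStartKey));
        meta.put(newStartKey, new Range(newStartKey, EMPTY_END_ROW));
    }

    /** Client write: try the cached region; on an out-of-range rejection, refresh from meta once. */
    Range write(String rowKey) {
        Range current = meta.get(cached.startKey); // the range the hosting server currently enforces
        if (!current.contains(rowKey)) {
            cached = lookup(rowKey);               // rejected: refresh the cached location
        }
        return cached;
    }
}
```

After one refresh, subsequent writes of increasing keys hit the cached open-ended region directly, so metaLookups stays flat until the next region is appended.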

  was:
For insertion of immutable data usecase (specifically time-series data), region split mechanism doesn't seem to provide better availability when ingestion rate is very high. When we ingest lot of data, the region split policy tries to split the given hot region based on the size (either size of all stores combined or size of any single store exceeding max file size configured) if we consider default {_}SteppingSplitPolicy{_}. The latest hot regions tend to receive all latest inserts. When the region is split, the first half of the region (say daughterA) stays on the same server whereas the second half (daughterB) region – likely to become another hot region because all new latest updates come to second half region in the sequential write fashion – is moved out to other servers in the cluster. Hence, once new daughter region is created, client traffic will be redirected to another server. Client requests will be piled up when region split is triggered till new daughters come alive and once done, client will have to request meta for updated daughter region and redirect traffic to new server.

If we could have configurable region creation strategy that 1) keeps the split disabled for the given table, and 2) create new region dynamically with lexicographically higher start key on the same server and update it's own region boundary, the client will have to look up meta once and continue ingestion without any degraded SLA caused by region split transitions.

Note: region split might also encounter some complications, requiring the procedure to be rolled back from some step, or continue with internal retries, eventually further delaying the ingestion from clients.

 

There are some complications around updating live region's start and end keys as this key range is immutable. We could brainstorm ideas around making them optionally mutable and any issues around them. For instance, client might continue writing data to the region with updated end key but writes will fail and hence, they will lookup in meta for updated key-space range of the table.


> Immutable timeseries usecase - Create new region rather than split existing one
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-26466
>                 URL: https://issues.apache.org/jira/browse/HBASE-26466
>             Project: HBase
>          Issue Type: Brainstorming
>            Reporter: Viraj Jasani
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.1#820001)