Posted to issues@hbase.apache.org by "Bryan Beaudreault (Jira)" <ji...@apache.org> on 2022/04/12 18:12:00 UTC

[jira] [Commented] (HBASE-26522) Improve documentation of hbase 1.x to 2.x potential incompatibilities

    [ https://issues.apache.org/jira/browse/HBASE-26522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17521283#comment-17521283 ] 

Bryan Beaudreault commented on HBASE-26522:
-------------------------------------------

For the watchers currently working on their hbase2 upgrades, I added another item to the list. If you have many regions and use replication, make sure to upgrade to 2.4.10+; otherwise you may be affected by https://issues.apache.org/jira/browse/HBASE-26590

We saw this on a table with 32k+ regions. Replication uses HTable.batch, which sequentially fetches region locations for all actions in a batch. This would be fine if they were all cached, but it becomes a problem when the cache is empty. Replication uses a 5k batch size by default. In the worst case, every action in a batch targets a different region, resulting in 5k meta lookups for that batch. The default operation timeout for replication is 10s. If many regionservers hit this at once, it's very possible to exceed the timeout just fetching region locations. In that case, the actions themselves never get executed and a DoNotRetryException is thrown instead. This causes the meta cache to be cleared, which starts the bad pattern over again.

Most of that is how branch-1 works as well, but the caching change in place before HBASE-26590 slowed this down enough that replication never made progress.
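
To put the worst case described above into numbers, here is a minimal Java sketch. It only does the arithmetic from this comment (5k batch size, 32k regions, 10s operation timeout); the class and method names are made up for illustration and are not part of HBase.

```java
public class ReplicationMetaBudget {
    // Worst case: every action in the batch targets a different region,
    // so the number of meta lookups is capped by batch size and region count.
    static int worstCaseMetaLookups(int batchSize, int regionCount) {
        return Math.min(batchSize, regionCount);
    }

    // Average time each sequential lookup may take before the whole
    // operation timeout is spent on locating regions alone.
    static double perLookupBudgetMillis(long operationTimeoutMillis, int lookups) {
        return (double) operationTimeoutMillis / lookups;
    }

    public static void main(String[] args) {
        int batchSize = 5_000;      // default replication batch size, per the comment
        int regionCount = 32_000;   // "32k+ regions" in the affected table
        long timeoutMillis = 10_000; // default replication operation timeout, per the comment

        int lookups = worstCaseMetaLookups(batchSize, regionCount);
        double budget = perLookupBudgetMillis(timeoutMillis, lookups);
        System.out.println("worst-case meta lookups per batch: " + lookups);
        System.out.println("average budget per lookup (ms): " + budget);
    }
}
```

With these figures, each meta lookup has to average under about 2 ms, and that is before any time is left for the writes themselves. With many regionservers hitting meta at the same time, that budget is easy to miss.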

> Improve documentation of hbase 1.x to 2.x potential incompatibilities
> ---------------------------------------------------------------------
>
>                 Key: HBASE-26522
>                 URL: https://issues.apache.org/jira/browse/HBASE-26522
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Minor
>
> We're working on a major upgrade of almost 900 tables across 100 production clusters (and corresponding QA environment clusters). We've upgraded about 25% of our QA environment and run into a series of incompatibilities along the way. Most of them have been easy to get around, but I wanted to create this Jira to collect them so that we can make an update to the docs for future upgraders.
> My plan is to periodically edit this description to add to the list. If anyone else has anything to contribute, feel free to edit as well or add a comment. 
> Incompatibilities to document:
>  -  HBASE-15676 changed the serialized byte string used for the fuzzy mask. FuzzyRowFilters created by older clients will not match any rows in an hbase2 cluster. This was fixed in HBASE-26537 but should be documented in our upgrade guide.
>  - CDH5 try/catches bad HTableDescriptor.getDurability calls and returns USE_DEFAULT. In hbase2, if someone creates a table with a bad durability (i.e. DEFAULT instead of USE_DEFAULT), it results in a failure which causes the CreateTableProcedure to retry infinitely with no backoff. This rapid retry caused a bunch of pain on the cluster that encountered it, because the datanodes could not keep up with the millions of calls to create and delete .regioninfo files.
>  - This isn't quite an incompatibility, but HBASE-19389 introduced a concurrency mitigation which may have surprising results coming from older versions. The defaults are pretty conservative – when writing more than 100 columns, no more than 10 concurrent writes or 20 pending writes at once.
>  - Increments sent from branch-1 clients may get erroneously stored with a timestamp of 0 on hbase2+ clusters: HBASE-26713
>  - CheckAndMutate with a "null" compare value used to ignore CompareOp. This was fixed in HBASE-26742, so checkAndMutate results may change between versions.
>  - Client will not know how to handle dangling rep_barrier rows in meta: HBASE-26797
>  - The default hbase split policy is SteppingSplitPolicy. This is overall a good policy which is more likely to split small tables to ensure they are spread across more servers. If you upgrade, you may notice your tables suddenly getting split more than you're used to. This may be an issue if you use a row key prefix, because hbase isn't aware of your prefix and may split in the middle of one. You can get around this by defining a RegionSplitRestriction. See HBASE-25766
>  - Regression in meta requests may impact replication on clusters with many regions. Fixed in 2.4.10+, per HBASE-26590
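
For the split-policy item in the list above, here is a rough Java sketch of what a key-prefix split restriction does. It is not HBase's implementation; the class name, method name and sample row key are hypothetical. The idea is to truncate a proposed split point to the prefix length, so that all rows sharing a prefix stay in the same region.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PrefixSplitSketch {
    // Truncate a proposed split point to at most prefixLength bytes.
    // Rows that share the full prefix all sort at or after the truncated
    // key, so the split can no longer land in the middle of a prefix group.
    static byte[] restrictToPrefix(byte[] splitPoint, int prefixLength) {
        return Arrays.copyOf(splitPoint, Math.min(prefixLength, splitPoint.length));
    }

    public static void main(String[] args) {
        // Hypothetical row key: a 6-byte prefix followed by a date suffix.
        byte[] proposed = "user42|2022-04-12".getBytes(StandardCharsets.UTF_8);
        byte[] restricted = restrictToPrefix(proposed, 6);
        System.out.println(new String(restricted, StandardCharsets.UTF_8));
    }
}
```

Without a restriction like this, the region could be split at the full proposed key, leaving some "user42" rows on each side of the boundary.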



--
This message was sent by Atlassian Jira
(v8.20.1#820001)