Posted to dev@hbase.apache.org by Gavin McDonald <gm...@apache.org> on 2022/04/20 16:23:51 UTC

Re: [DISCUSS] ci-hbase capacity and what we'd like to see tested

Hi Sean, all 

Where are we with this? I haven't seen any INFRA JIRAs about this so far.

I'm eager to help and get this going.

Gav... (ASF Infra)

On 2022/03/15 13:45:25 Sean Busbey wrote:
> It sounds like we have consensus on the general approach.
> 
> I'll file some jiras for the work. I have some time to put towards
> implementing things. If folks want to help move things along faster
> they're welcome to pitch in.
> 
> On Sun, Mar 13, 2022 at 1:35 PM Andrew Purtell <an...@gmail.com> wrote:
> >
> > I like the idea of multi-node testing, because the mini cluster does an admirable job but cannot truly emulate a production deploy, due to various singletons in our code or that of our dependencies. It would also be pretty nice if k8s were the substrate, and ultimately were used to inject chaos too (via hbase-it), because it would help us detect if we violate required discipline for that common deployment target, like inappropriate caching of DNS resolutions concurrent with pod cycling, to pick a historical example.
> >
> > Even just periodic execution of ITBLL would be nice.
> >
> > So I guess the next question is what that requires of us, the larger community. Who proposes the work? Who performs it? Should we open some JIRAs to kick things off?
> >
> > > On Mar 10, 2022, at 8:32 AM, Sean Busbey <bu...@apache.org> wrote:
> > >
> > > Hi folks!
> > >
> > > Quick background: all of the automated testing for nightly and PR
> > > contributions is now running on a dedicated Jenkins instance
> > > (ci-hbase.apache.org). We moved our existing 10 dedicated nodes off of
> > > the ci-hadoop controller, and thanks to a new anonymous donor we were
> > > able to add an additional 10 nodes.
> > >
> > > The new donor gave enough of a contribution that we can make some
> > > decisions as a community about expanding these resources further.
> > >
> > > The 10 new nodes run 2 executors each (same as our old nodes), are
> > > considered "medium" by the provider we're getting them from, and have
> > > this shape:
> > >
> > > 64GB DDR4 ECC RAM
> > > Intel® Xeon® E-2176G hexa-core (Coffee Lake) processor with Hyper-Threading
> > > 2 x 960 GB NVMe SSD Datacenter Edition (RAID 1)
> > >
> > > To give an idea of what the current testing workload of our project
> > > looks like, we can use the built-in Jenkins utilization tooling for
> > > our general-purpose label 'hbase'[0].
> > >
> > > If you look at the last 4 days of utilization[1], we have a couple of
> > > periods with a small backlog of ~2 executors' worth of work. The
> > > measurements are very rolled up, so it's hard to pick out specifics.
> > > On the chart of the last day or so[2] we can see two periods of 1-2
> > > hours where we have a backlog of 2-4 executors' worth of work.
> > >
> > > For comparison, the chart from immediately after we had to burn off
> > > ~3 days of backlog (our worker nodes were offline at the end of
> > > February) shows no queue[3].
> > >
> > > I think we could benefit from adding 1-2 additional medium worker
> > > nodes, but the long periods where ~half our executors sit idle make me
> > > think some refactoring or timing changes might be a better way to
> > > improve our current steady-state workload.
> > >
> > > One thing that we currently lack is robust integration testing of a
> > > cluster deployment. At the moment our nightly jobs spin up a
> > > single-node Hadoop, then a single-node HBase on top of it, and then
> > > run a trivial functionality test[4].
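> > >
> > > Roughly, that check amounts to something like the sketch below. This
> > > is a simplification of [4], not the exact script; the HADOOP_HOME /
> > > HBASE_HOME paths and the table name are placeholders, and the site
> > > configuration (hbase.rootdir pointing at the local HDFS, etc.) is
> > > elided.
> > >
> > >   # format and start a single-node HDFS (Hadoop 3 daemon commands)
> > >   "$HADOOP_HOME/bin/hdfs" namenode -format
> > >   "$HADOOP_HOME/bin/hdfs" --daemon start namenode
> > >   "$HADOOP_HOME/bin/hdfs" --daemon start datanode
> > >   # start a single-node HBase on top of it
> > >   "$HBASE_HOME/bin/start-hbase.sh"
> > >   # trivial functionality check: create a table, write a cell, read it back
> > >   echo "create 'smoke', 'f'; put 'smoke', 'r1', 'f:c', 'v'; scan 'smoke'" \
> > >     | "$HBASE_HOME/bin/hbase" shell -n
> > >   # tear everything back down
> > >   "$HBASE_HOME/bin/stop-hbase.sh"
> > >   "$HADOOP_HOME/bin/hdfs" --daemon stop datanode
> > >   "$HADOOP_HOME/bin/hdfs" --daemon stop namenode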
> > >
> > > The host provider we use for Jenkins worker nodes has a large node shaped like:
> > > 160GB RAM
> > > Intel® Xeon® W-2295 18-core (Cascade Lake W) processor with Hyper-Threading
> > > 2 x 960GB NVMe drives as RAID1
> > >
> > > A pretty short path to improvement would be to get 1 or 2 of these
> > > nodes and move our integration test to the minikube project[5], which
> > > runs a local Kubernetes environment. We could then deploy a small but
> > > multi-node Hadoop and HBase cluster and run e.g. ITBLL against it, in
> > > addition to whatever checking of cli commands, shell expectations,
> > > etc.
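> > >
> > > To make that concrete, one possible shape for the job is sketched
> > > below. The node counts, resource sizes, manifest names, client pod
> > > name, and ITBLL arguments are all placeholders, not a worked-out
> > > design.
> > >
> > >   # bring up a small multi-node Kubernetes cluster on the large worker
> > >   minikube start --nodes=4 --cpus=8 --memory=32g
> > >   # deploy a small HDFS and a multi-node HBase on top of it
> > >   # (hadoop-cluster.yaml / hbase-cluster.yaml are hypothetical
> > >   # manifests; a YARN/MapReduce service would also be needed for ITBLL)
> > >   kubectl apply -f hadoop-cluster.yaml
> > >   kubectl apply -f hbase-cluster.yaml
> > >   kubectl wait --for=condition=Ready pod -l app=hbase --timeout=600s
> > >   # run ITBLL from a client pod against the deployed cluster; the Loop
> > >   # arguments (iterations, mappers, nodes per mapper, output dir,
> > >   # reducers) are illustrative
> > >   kubectl exec hbase-client -- hbase \
> > >     org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList \
> > >     Loop 1 4 1000000 itbll-out 4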
> > >
> > > What do y'all think?
> > >
> > > [0]: https://ci-hbase.apache.org/label/hbase/load-statistics
> > > [1]: https://issues.apache.org/jira/secure/attachment/13040941/ci-hbase-long-graph-20220310.png
> > > [2]: https://issues.apache.org/jira/secure/attachment/13040940/ci-hbase-medium-graph-20220310.png
> > > [3]: https://issues.apache.org/jira/secure/attachment/13040939/ci-hbase-medium-graph-20220223.png
> > > [4]: https://github.com/apache/hbase/blob/master/dev-support/hbase_nightly_pseudo-distributed-test.sh
> > > [5]: https://minikube.sigs.k8s.io/docs/
>