Posted to user@kudu.apache.org by Franco VENTURI <fv...@comcast.net> on 2020/04/01 20:02:25 UTC

Re: Tablet Server with almost 1TB of WALs (and large number of open files)

Adar, Andrew,
first of all thanks for looking into this.
I just checked the Kudu memory usage in that TS, and it is currently at 104GB used out of a total of 128GB (about 81.5% used), so it definitely seems high to me.
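
(In case anyone wants to reproduce that check: one way to get the number is the server's metrics endpoint, e.g. something along these lines, assuming the default web UI port; 'generic_current_allocated_bytes' is the tcmalloc view of the memory currently allocated by the process, and the "Memory (detail)" page at /mem-trackers breaks the same total down by component.)

curl -k -s -S 'https://localhost:8050/metrics?metrics=generic_current_allocated_bytes'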


As for the maintenance manager dashboard, I wrote a quick script that parses its output (a rough sketch of it is below the two listings), and this is what I see:

- top 10 tablets by RAM anchored:
curl -k -s -S https://localhost:8050/maintenance-manager | ./parse-maintenance-manager | sort -k3 -nr | head -10
FlushMRSOp(ee2abaeb126a44d0b1565c7fdd8e40da) True 10171187 0 0.0
FlushMRSOp(21d9fa855898438ab0845173fafe6b8c) True 3082813 0 0.0
FlushMRSOp(097c728dc2034407aa6f9bf71e80ee97) True 2181038 0 0.0
FlushMRSOp(de5432259dd44e3ba67e9d3f3eea0327) True 2160066 0 0.564864
FlushMRSOp(1e0ea408fc1b4ad5be0ecc9d3d5e2d1d) True 2118123 0 0.0
FlushMRSOp(8a1e4cd273ba4d22937ab25a91668ea2) True 2034237 0 0.0994112
FlushMRSOp(7b49a6c733e94b968aeb228b0cf5c0cc) True 2034237 0 0.0497825
FlushMRSOp(7bac650115f74d66a1c87ee91dc3f751) True 1688207 0 0.0414489
FlushMRSOp(a5f13d9e94aa49e7933d9f84e62ee4e3) True 1656750 0 0.441552
FlushMRSOp(897f9e2db1954523ac7781f9c45645f0) True 1562378 0 0.0845236

- top 10 tablets by log retained:
curl -k -s -S https://localhost:8050/maintenance-manager | ./parse-maintenance-manager | sort -k4 -nr | head -10
FlushDeltaMemStoresOp(b3facf00fcff403293d36c1032811e6e) True 1740 33039035924 1.0
FlushDeltaMemStoresOp(354088f7954047908b4e68e0627836b8) True 763 32416265666 1.0
FlushDeltaMemStoresOp(c5369eb17772432bbe326abf902c8055) True 763 32137092792 1.0
FlushDeltaMemStoresOp(4540cba1b331429a8cbacf08acf2a321) True 763 32137092792 1.0
FlushDeltaMemStoresOp(cb302d0772a64a5795913196bdf43ed3) True 1740 32094143119 1.0
FlushDeltaMemStoresOp(4779c42e5ef6493ba4d51dc44c97f7f7) True 763 31965294100 1.0
FlushDeltaMemStoresOp(90a0465b7b4f4ed7a3c5e43d993cf52e) True 1740 31825707663 1.0
FlushDeltaMemStoresOp(f318cbf0cdcc4e10979fc1366027dde5) True 1740 31664646389 1.0
FlushDeltaMemStoresOp(a4ac51eefc59467abf45938529080c17) True 763 30805652930 1.0
FlushDeltaMemStoresOp(b00cb81794594d9b8366980a24bf79ad) True 763 30676803911 1.0
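
(For what it's worth, 'parse-maintenance-manager' is nothing fancy; a rough sketch of the idea is below, not the exact script. It just flattens the HTML table on /maintenance-manager into one whitespace-separated row per operation, so it can be piped into sort as above:)

#!/bin/sh
# Crude scrape of the /maintenance-manager page: make each <tr>...</tr> a
# single line, strip the remaining tags, keep only the rows describing
# maintenance ops, and print the first five columns
# (name, runnable, RAM anchored, logs retained, perf improvement).
tr -d '\n' \
  | sed -e 's#</tr>#\n#g' -e 's#<[^>]*>#  #g' \
  | grep 'Op(' \
  | awk '{print $1, $2, $3, $4, $5}'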

Again, tablet b3facf00fcff403293d36c1032811e6e is at the top of the list for log retained, with about 30.77GB; at first glance, though, none of the tablets retaining a large amount of WALs appears in the RAM anchored list.

Finally, since we have moved into April and many of those tables are range-partitioned on a date column (either by quarter or by month), I expect that almost all of the inserts and updates now land in the empty tablets for April (or Q2), so I am pretty sure that the tablets whose WALs are so large have now stopped growing.
We are in the process of moving one of those large replicas from last month/quarter to a different TS to try to balance things a little; I'll let you know how it goes.
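
(For reference, the manual move can be done with the Kudu CLI along these lines; the tablet ID and the source tablet server UUID below are the ones from the 'kudu table list' output further down in the thread, while the master addresses and the destination UUID are placeholders to fill in:)

# example: move the replica of the largest tablet off ts07 to a less loaded tablet server
kudu tablet change_config move_replica <master-addresses> \
    b3facf00fcff403293d36c1032811e6e \
    e4a4195a39df41f0b04887fdcae399d8 \
    <destination-tserver-uuid>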

Yesterday we also contacted our vendor's sales engineer to see if we can get a patch for KUDU-3002.

Franco


> On March 31, 2020 at 6:36 PM Andrew Wong <aw...@cloudera.com> wrote:
> 
>     The maintenance manager dashboard of the tablet server should have some information to determine what's going on. If it shows DMS flush operations anchoring a significant chunk of WALs, that would explain where some of the WAL disk usage is going. If the "Memory (detail)" page also shows the server using a significant fraction of its memory limit, then together these two symptoms point towards this being KUDU-3002.
> 
>     With or without a patch, it's worth checking whether there's anything that can be done about the memory pressure. If there's tablet skew across the cluster (some servers with many replicas, some with few), consider rebalancing to reduce the load on the overloaded servers. At the very least, looking at the maintenance manager dashboard should give you an idea of which tablet replicas can be moved away from the tablet server (e.g. move the replicas with more WALs retained by DMS flush ops, preferably to a tablet server that is consuming less memory, per its "Memory (detail)" page).
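> 
>     For example, the CLI rebalancer can first be pointed at the cluster in report-only mode to see what it would move (substitute your own master addresses):
> 
>     kudu cluster rebalance <master1>:7051,<master2>:7051,<master3>:7051 --report_only
> 
>     Dropping --report_only then performs the moves for real.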
> 
>     On Tue, Mar 31, 2020 at 2:03 PM Adar Lieber-Dembo <adar@cloudera.com> wrote:
> 
> >         Definitely seems like KUDU-3002 could apply here. Andrew, are you
> >         aware of a straightforward way to determine whether a given table (or
> >         cluster) would benefit from the fix for KUDU-3002?
> > 
> >         One possibility is that, if your vendor is in the business of
> >         providing patch releases, they could supply the fix for KUDU-3002 as a
> >         patch for your release of Kudu. Looking at the fix, it's not super
> >         invasive and should be pretty easy to backport into an older Kudu
> >         release. Have you talked to them about that?
> > 
> >         On Tue, Mar 31, 2020 at 5:58 AM Franco VENTURI <fventuri@comcast.net> wrote:
> >         >
> >         > Adar, Andrew,
> >         > thanks for your detailed and prompt replies.
> >         >
> >         > "Fortunately" (for your questions) we have another TS whose WALs disk is currently about 80% full (and three more whose WALs disk is above 50%), and I suspect that it will be the next one we'll have to restart in a few nights.
> >         >
> >         > On this TS the output from 'lsof' this morning shows 175,550 files open, of which 84,876 are Kudu data files and 90,340 are Kudu WALs.
> >         >
> >         > For this server these are the top 10 tablets by WAL sizes (in kB):
> >         >
> >         > du -sk * | sort -nr | head -10
> >         > 31400552 b3facf00fcff403293d36c1032811e6e
> >         > 31204488 354088f7954047908b4e68e0627836b8
> >         > 30584928 90a0465b7b4f4ed7a3c5e43d993cf52e
> >         > 30536168 c5369eb17772432bbe326abf902c8055
> >         > 30535900 4540cba1b331429a8cbacf08acf2a321
> >         > 30503820 cb302d0772a64a5795913196bdf43ed3
> >         > 30428040 f318cbf0cdcc4e10979fc1366027dde5
> >         > 30379552 4779c42e5ef6493ba4d51dc44c97f7f7
> >         > 29671692 a4ac51eefc59467abf45938529080c17
> >         > 29539940 b00cb81794594d9b8366980a24bf79ad
> >         >
> >         > and these are the top 10 tablets by number of WAL segments:
> >         >
> >         > for t in *; do echo "$(ls $t | grep -c '^wal-') $t"; done | sort -nr | head -10
> >         > 3813 b3facf00fcff403293d36c1032811e6e
> >         > 3784 354088f7954047908b4e68e0627836b8
> >         > 3716 90a0465b7b4f4ed7a3c5e43d993cf52e
> >         > 3705 c5369eb17772432bbe326abf902c8055
> >         > 3705 4540cba1b331429a8cbacf08acf2a321
> >         > 3700 cb302d0772a64a5795913196bdf43ed3
> >         > 3698 f318cbf0cdcc4e10979fc1366027dde5
> >         > 3685 4779c42e5ef6493ba4d51dc44c97f7f7
> >         > 3600 a4ac51eefc59467abf45938529080c17
> >         > 3585 b00cb81794594d9b8366980a24bf79ad
> >         >
> >         > as you can see, the largest tablets by WAL size are also the largest ones by number of WAL segments.
> >         >
> >         > Taking a more detailed look at the largest of these tablets (b3facf00fcff403293d36c1032811e6e), these are the TS's that host a replica of it, per the output of 'kudu table list':
> >         >
> >         > T b3facf00fcff403293d36c1032811e6e
> >         > L e4a4195a39df41f0b04887fdcae399d8 ts07:7050
> >         > V 147fcef6fb49437aa19f7a95fb26c091 ts11:7050
> >         > V 59fe260f21da48059ff5683c364070ce ts31:7050
> >         >
> >         > where ts07 (the leader) is the TS whose WALs disk is about 80% full.
> >         >
> >         > I looked at the 'ops_behind_leader' metric for that tablet on the other two TS's (ts11 and ts31) by querying their metrics, and they are both 0.
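> >         >
> >         > (If anyone wants to repeat that check: the metric can be pulled from each tablet server's metrics endpoint, e.g. something along these lines, assuming the default web UI port, and then looking up the entry for the tablet in question in the JSON output:)
> >         >
> >         > curl -k -s -S 'https://ts11:8050/metrics?metrics=ops_behind_leader'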
> >         >
> >         >
> >         > As for the memory pressure, the leader (ts07) shows the following metrics:
> >         >
> >         > leader_memory_pressure_rejections: 22397
> >         > transaction_memory_pressure_rejections: 0
> >         > follower_memory_pressure_rejections: 0
> >         >
> >         >
> >         > Finally, a couple of non-technical comments about KUDU-3002 (https://issues.apache.org/jira/browse/KUDU-3002):
> >         >
> >         > - I can see it has been fixed in Kudu 1.12.0; however we (as probably most other enterprise customers) depend on a vendor distribution, so it won't really be available to us until the vendor packages it (I think the current version of Kudu in their runtime is 1.11.0, so I guess 1.12.0 could only be a month or two away)
> >         >
> >         > - The other major problem we have is that vendor distributions like the one we are using bundle a couple dozen products together, so if we want to upgrade Kudu to the latest available version, we also have to upgrade everything else, like HDFS (major upgrade from 2.6 to 3.x), Kafka (major upgrade), HBase (major upgrade), etc.; in many cases these upgrades also bring significant changes/deprecations in other components, like Parquet, which means we have to change (and in some cases rewrite) our code that uses Parquet or Kafka, since these products are rapidly evolving, many times in ways that break compatibility with older versions; in other words, it's a big mess.
> >         >
> >         >
> >         > I apologize for the final rant; I understand that it is not your or Kudu's fault, and I don't know if there's an easy solution to this conundrum within the constraints of a vendor-supported approach, but for us it makes zero-maintenance cloud solutions attractive, at the cost of sacrificing the flexibility and "customizability" of an in-house solution.
> >         >
> >         > Franco
> >         >
> >         > On March 30, 2020 at 2:22 PM Andrew Wong <awong@cloudera.com> wrote:
> >         >
> >         > Alternatively, if the servers in question are under constant memory
> >         > pressure and receive a fair number of updates, they may be
> >         > prioritizing flushing of inserted rows at the expense of updates,
> >         > causing the tablets to retain a great number of WAL segments
> >         > (containing older updates) for durability's sake.
> >         >
> >         >
> >         > Just an FYI in case it helps confirm or rule it out, this refers to KUDU-3002, which will be fixed in the upcoming release. Can you determine whether your tablet servers are under memory pressure?
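> >         >
> >         > (One quick way to check the configured limit and pressure threshold is something like the following against the tablet server's web UI; memory_limit_hard_bytes and memory_pressure_percentage are the relevant gflags:)
> >         >
> >         > curl -k -s -S 'https://<ts-host>:8050/varz' | grep -E 'memory_limit_hard_bytes|memory_pressure_percentage'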
> >         >
> >         > On Mon, Mar 30, 2020 at 11:17 AM Adar Lieber-Dembo <adar@cloudera.com> wrote:
> >         >
> >         > > - the number of open files in the Kudu process in the tablet servers has increased to now more than 150,000 (as counted using 'lsof'); we raised the limit of maximum number of open files twice already to avoid a crash, but we (and our vendor) are concerned that something might not be right with such a high number of open files.
> >         >
> >         > Using lsof, can you figure out which files are open? WAL segments?
> >         > Data files? Something else? Given the high WAL usage, I'm guessing
> >         > it's the former and these are actually one and the same problem, but
> >         > would be good to confirm nonetheless.
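> >         >
> >         > (Something along these lines gives a rough breakdown, assuming the WAL and data directories live under paths containing 'wal' and 'data'; adjust the patterns to your fs_wal_dir and fs_data_dirs:)
> >         >
> >         > lsof -p $(pidof kudu-tserver) | awk '/\/wal/ {w++} /\/data/ {d++} END {print "wal:", w+0, "data:", d+0}'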
> >         >
> >         > > - in some of the tablet servers the disk space used by the WALs is significantly (and concerningly) higher than in most of the other tablet servers; we use a 1TB SSD drive (about 950GB usable) to store the WALs on each tablet server, and this week was the second time that we saw a tablet server almost fill the whole WAL disk. We had to stop and restart the tablet server, so its tablets would be migrated to different TS's and we could manually clean up the WALs directory, but this is definitely not something we would like to do in the future. We took a look inside the WAL directory on that TS before wiping it, and we observed that there were a few tablets whose WALs were in excess of 30GB. Another piece of information is that the table that the largest of these tablets belongs to receives about 15M transactions a day, of which about 25% are new inserts and the rest are updates of existing rows.
> >         >
> >         > Sounds like there are at least several tablets with follower replicas
> >         > that have fallen behind their leaders and are trying to catch up. In
> >         > these situations, a leader will preserve as many WAL segments as
> >         > necessary in order to catch up the lagging follower replica, at least
> >         > until some threshold is reached (at which point the master will bring
> >         > a new replica online and the lagging replica will be evicted). These
> >         > calculations are done in terms of the number of WAL segments; in the
> >         > affected tablets, do you recall how many WAL segment files there were
> >         > before you deleted the directories?
> >         >
> >         > Alternatively, if the servers in question are under constant memory
> >         > pressure and receive a fair number of updates, they may be
> >         > prioritizing flushing of inserted rows at the expense of updates,
> >         > causing the tablets to retain a great number of WAL segments
> >         > (containing older updates) for durability's sake. If you recall the
> >         > affected tablet IDs, do your logs indicate the nature of the
> >         > background operations performed for those tablets?
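> >         >
> >         > (E.g. something like the following against the tablet server's INFO log, assuming the default glog file naming, shows which maintenance ops ran on a given tablet:)
> >         >
> >         > grep b3facf00fcff403293d36c1032811e6e /path/to/log/dir/kudu-tserver.INFO | grep -E 'FlushMRSOp|FlushDeltaMemStoresOp|CompactRowSetsOp'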
> >         >
> >         > Some of these questions can also be answered via Kudu metrics. There's
> >         > the ops_behind_leader tablet-level metric, which can tell you how far
> >         > behind a replica may be. Unfortunately I can't find a metric for
> >         > average number of WAL segments retained (or a histogram); I thought we
> >         > had that, but maybe not.
> >         >
> >         >
> >         >
> >         > --
> >         > Andrew Wong
> >         >
> >         >
> >         >
> > 
> >     > 
> 
>     --
>     Andrew Wong
>