Posted to user@accumulo.apache.org by Mike Miller <mm...@apache.org> on 2021/01/11 18:58:14 UTC

Re: Continuous tablets unloaded and fails to balance from accumulo master

FYI the Master has a debug log that will print up to 10 tablets that have
outstanding migrations.
https://github.com/apache/accumulo/blob/0a9837f3f8395d89c5cd7bab7805c4aae28919be/server/base/src/main/java/org/apache/accumulo/server/master/balancer/TabletBalancer.java#L172


On Thu, Nov 26, 2020 at 3:30 AM Hart, Andrew <an...@cgi.com> wrote:

> Just for completeness, the solution in the end was to stop then start the
> tservers one at a time until the error cleared.  I never found a way to
> work out which tserver was causing the issue.
>
>
>
> *From:* Hart, Andrew [mailto:and.hart@cgi.com]
> *Sent:* 07 October 2020 13:54
> *To:* user@accumulo.apache.org
> *Subject:* RE: Continuous tablets unloaded and fails to balance from
> accumulo master
>
>
>
> EXTERNAL SENDER: Do not click any links or open any attachments unless
> you trust the sender and know the content is safe.
> EXPÉDITEUR EXTERNE: Ne cliquez sur aucun lien et n’ouvrez aucune pièce
> jointe à moins qu’ils ne proviennent d’un expéditeur fiable, ou que vous
> ayez l'assurance que le contenu provient d'une source sûre.
>
>
>
> Thanks for your suggestions
>
>
>
> I tried restarting the tserver that held the tablets marked as assigned to
> a dead server, but it did not appear to do anything to them, because they
> were not part of any table.
>
>
>
> Scanning for missing loc entries – the command you suggested produced no
> output other than a zootraceclient was loaded statement.
>
>
>
> Restarting the master works for one balance only, and then it returns to
> “1 tablets unloaded”.  This has been my workaround for the last few weeks.
>
>
>
> I assume the tables are old and deleted since their IDs in the metadata are
> lower than currently created ones and the ID doesn’t appear in tables -l.
>
>
>
> I like your GC idea; I will look into that.  I may have cloned tables in
> the past to fix some other problem but it is not something I would normally
> do.
>
>
>
> Thanks again for your ideas.
>
>
>
> *From:* Mike Miller <mm...@apache.org>
> *Sent:* 06 October 2020 19:53
> *To:* user@accumulo.apache.org
> *Subject:* Re: Continuous tablets unloaded and fails to balance from
> accumulo master
>
>
>
> It would help if you provided what commands you are running and some of
> the output (if possible) - or at least more detail of what you are seeing.
> It's hard to provide specifics, because it's hard to understand how you got
> into this state, what you have done, and what the current state is.
>
> If tablets are assigned to a dead server, but you think that server is ok,
> did you try taking that server down?  Once the server is detected as down,
> that should trigger reassignments - at that point you can restart the
> server.
>
> Scanning the accumulo.metadata table - does every extent have a loc entry?
> Something like:
>
> accumulo shell -u root -p secret -e 'scan -t accumulo.metadata -np -c loc'
> | grep -v loc
>
> Have you tried restarting the master?
>
> If the tables are "old" and deleted - what are you onlining?  Have you
> tried to delete an offline table?
>
> Is your GC running to completion? Do you clone tables?  One issue may be
> that Accumulo gc needs to check that a file is not shared between tables,
> maybe it's running into issues completing that check?
>
>
>
> On Tue, Oct 6, 2020 at 12:57 PM Christopher <ct...@apache.org> wrote:
>
> I'm not sure CheckForMetadataProblems can check for all that many
> different types of problems. It is limited.
> If you have tablets still in the metadata table for tables that no longer
> exist, that indicates you probably had some sort of crash and possible
> corruption of your metadata.
> The only option would be to manually delete those entries.
> A command to automatically prune these would probably be dangerous...
> running it when there's a transient ZooKeeper problem, for example, could
> end up deleting all your tables... which would be bad. Although it is
> dangerous, manual surgery on the metadata table to remove these entries, as
> you suggested, is probably the best option.
>
>
>
> On Tue, Oct 6, 2020 at 12:03 PM Hart, Andrew <an...@cgi.com> wrote:
>
> I am still trying to find the one “unloaded tablet” that is preventing the
> cluster balancing; however, there are a lot of unassigned tablets.
>
>
>
> I have been getting rid of them by onlining tables and completing failed
> table deletes but I am still left with many tablets that are unassigned.
> They seem to be mostly from old deleted tables and so I am not sure why
> they are there at all.
>
> The unassigned tablets are shown in accumulo
> org.apache.accumulo.server.util.FindOfflineTablets and in accumulo admin
> checkTablets
>
> And as I said, some are assigned to a dead server, but actually the server
> isn’t dead at all.
>
>
>
> CheckForMetadataProblems reports “All is well”
>
>
>
> I thought that if I could clear up this mess I could then eventually get
> to just one unassigned tablet which would be the “1 tablets are unloaded”
> one.  (I would then clone the table or copy the data out or something)
>
>
>
> So the problem remains.  The cluster doesn’t balance due to migrations.  I
> don’t find a tablet with a future entry and I can’t find it in unassigned
> or offline tablets due to the large number of other (presumably defunct)
> tablets with unassigned problems in tables that no longer exist.
>
>
>
> There are warnings in the documentation about manually editing the
> accumulo metadata table but it seems that the only option is to go in with
> a deletemany on any rows that start with the ID of an old deleted table.
> There does not seem to be an “accumulo admin pruneDefunctTablets -t tid”
> command! :D
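
Before any manual deletemany, it may help to see which table-ID prefixes in
the metadata rows no longer match a live table. The following is only a rough
sketch, assuming the metadata row IDs have been exported to a list and the
live table IDs have been collected by hand (for example from tables -l). The
function name and the sample IDs are hypothetical, and the sketch only
reports counts. It deletes nothing.

```python
import re
from collections import Counter


def orphaned_rows_by_table(rows, live_table_ids):
    """Count metadata rows whose table-ID prefix is not a live table ID."""
    orphans = Counter()
    for row in rows:
        # Rows starting with "~" (such as "~del" markers) are not tablet rows.
        if row.startswith("~"):
            continue
        # A tablet row is "<tableId>;<endRow>" or "<tableId><" for the last
        # tablet, so the table ID is everything before the first ';' or '<'.
        table_id = re.split(r"[;<]", row, maxsplit=1)[0]
        if table_id not in live_table_ids:
            orphans[table_id] += 1
    return dict(orphans)


# Made-up sample: table "n" is live, tables "2" and "7" no longer exist.
sample_rows = ["2;a", "2;m", "2<", "n;9", "n<", "7<", "~del/2/t-0001/F0001.rf"]
print(orphaned_rows_by_table(sample_rows, {"n"}))
```

A table ID that shows up here with many rows would be a candidate for closer
inspection, not an automatic target for deletion.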
>
>
>
>
>
>
>
> *From:* Mike Miller <mm...@apache.org>
> *Sent:* 06 October 2020 16:27
> *To:* user@accumulo.apache.org
> *Subject:* Re: Continuous tablets unloaded and fails to balance from
> accumulo master
>
>
>
> Do you want to merge old tablets that don't exist anymore?  I am not sure
> what you are asking... you might have better luck if you provide some more
> info and ask on Slack: https://accumulo.apache.org/contact-us/#slack
>
>
>
> On Tue, Oct 6, 2020 at 7:25 AM Hart, Andrew <an...@cgi.com> wrote:
>
> What is the way to remove tablets that still exist in accumulo but do not
> have an online, offline or deleting table?
>
>
>
> Some of these tablets say ASSIGNED TO DEAD SERVER but the tserver they
> refer to is up and working properly.
>
>
>
> *From:* Hart, Andrew <an...@cgi.com>
> *Sent:* 25 September 2020 13:52
> *To:* user@accumulo.apache.org
> *Subject:* RE: Continuous tablets unloaded and fails to balance from
> accumulo master
>
>
>
> Thanks for your help.  In looking for this I think I have found that there
> are deleted tables that still have a lot of tablets in the metadata table.
>
> I need to solve that before coming back to find the 1 unloaded tablet.
>
>
>
> Cheers And.
>
>
>
> *From:* Mike Miller <mm...@apache.org>
> *Sent:* 24 September 2020 16:08
> *To:* user@accumulo.apache.org
> *Subject:* Re: Continuous tablets unloaded and fails to balance from
> accumulo master
>
>
>
> That might be OK, could just mean it hasn't been assigned yet.  The only
> way I can think of is to populate a list of all tablets from the metadata
> table and find the one without a "loc" column family.
>
>
>
> On Thu, Sep 24, 2020 at 10:55 AM Hart, Andrew <an...@cgi.com> wrote:
>
> No, no future entries in the table.
>
>
>
> *From:* Mike Miller <mm...@apache.org>
> *Sent:* 24 September 2020 15:10
> *To:* user@accumulo.apache.org
> *Subject:* Re: Continuous tablets unloaded and fails to balance from
> accumulo master
>
>
>
> You should be able to figure out the unloaded tablet from the
> "accumulo.metadata" table.  The metadata table will list the tablet
> location using the "loc" column family to indicate it has loaded a tablet
> that it was assigned.
>
> For example the tablet "n;9" will have an entry like:
>
> n;9 loc:1000041fbf00006 []    ip-172-31-87-51.ec2.internal:9997
>
>
>
> From my understanding, the unloaded tablet should have a "future" column
> family, meaning it has been assigned a new location but not loaded yet.  If
> the tablet doesn't have a "loc" or "future" column family then that is a
> problem.
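
The loc/future rule above can be sketched as a small script. This is a
minimal sketch, assuming the metadata scan output has been saved as text with
one entry per line in the "row family:qualifier [visibility] value" shape of
the n;9 example. The function name and every sample line other than the n;9
loc entry are made up for illustration, and this is not an Accumulo tool.

```python
from collections import defaultdict


def classify_tablets(scan_lines):
    """Map each metadata row to a state based on its column families."""
    families = defaultdict(set)
    for line in scan_lines:
        parts = line.split()
        # Keep only lines shaped like "row family:qualifier [vis] value".
        if len(parts) < 3 or ":" not in parts[1] or not parts[2].startswith("["):
            continue
        row, column = parts[0], parts[1]
        families[row].add(column.split(":", 1)[0])

    states = {}
    for row, fams in families.items():
        if "loc" in fams:
            states[row] = "loaded"
        elif "future" in fams:
            states[row] = "assigned, not yet loaded"
        else:
            states[row] = "no loc or future"
    return states


# Only the first line comes from the thread; the rest are hypothetical.
sample = [
    "n;9 loc:1000041fbf00006 []    ip-172-31-87-51.ec2.internal:9997",
    "n;9 srv:dir []    /t-0000001",
    "n< future:1000041fbf00006 []    ip-172-31-87-51.ec2.internal:9997",
    "p< srv:dir []    /default_tablet",
]
for row, state in classify_tablets(sample).items():
    print(row, "->", state)
```

The input has to come from a scan of all columns. If the scan is restricted
to the loc family, rows without a loc entry never appear in the output at all.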
>
>
>
> On Thu, Sep 24, 2020 at 6:32 AM Hart, Andrew <an...@cgi.com> wrote:
>
> Hi,
>
>
>
> I am getting “Not balancing due to 1 outstanding migrations” and “[Normal
> tablets]: 1 tablets unloaded”.
>
> This means that the cluster never balances unless I restart the master,
> after which I get a one-off balance and then it returns to the above messages.
>
>
>
> How do I identify the tablet that is unloaded?  It isn’t in the logs that
> I can see.  Is it possible to tell from the contents of the
> accumulo.metadata table?
>
>
>
> Is there a way to use FindOfflineTablets?
>
>
>
> And.
>
>