Posted to user@accumulo.apache.org by "Hart, Andrew" <an...@cgi.com> on 2020/11/26 08:30:19 UTC

RE: Continuous tablets unloaded and fails to balance from accumulo master

Just for completeness, the solution in the end was to stop then start the tservers one at a time until the error cleared.  I never found a way to work out which tserver was causing the issue.

From: Hart, Andrew [mailto:and.hart@cgi.com]
Sent: 07 October 2020 13:54
To: user@accumulo.apache.org
Subject: RE: Continuous tablets unloaded and fails to balance from accumulo master

EXTERNAL SENDER: Do not click any links or open any attachments unless you trust the sender and know the content is safe.
EXPÉDITEUR EXTERNE: Ne cliquez sur aucun lien et n’ouvrez aucune pièce jointe à moins qu’ils ne proviennent d’un expéditeur fiable, ou que vous ayez l'assurance que le contenu provient d'une source sûre.


Thanks for your suggestions

I tried restarting the tserver that held the tablets marked as assigned to a dead server, but nothing happened to those tablets, apparently because they are not part of any table.

Scanning for missing loc entries: the command you suggested produced no output other than a statement that a zootraceclient was loaded.

Restarting the master works for one balance only, and then it returns to “1 tablets are unloaded”.  This has been my workaround for the last few weeks.

I assume the tables are old and deleted, since their IDs in the metadata are lower than those of currently created tables and the IDs don’t appear in tables -l.
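
A note on comparing table IDs by age: a plain string comparison can mislead if the IDs are base-36 numbers, which is how Accumulo appears to allocate them (that encoding is an assumption here, not something stated in this thread). A minimal Python sketch under that assumption; the function names are made up for illustration:

```python
def table_id_value(table_id):
    """Interpret a user table ID as a base-36 integer (assumed encoding).

    System table IDs such as "!0" are not base-36 and raise ValueError.
    """
    return int(table_id, 36)


def sort_table_ids(table_ids):
    """Return table IDs ordered from oldest (lowest) to newest (highest)."""
    return sorted(table_ids, key=table_id_value)
```

For instance, "z" sorts after "1a" as text, but as a base-36 number it is the lower (older) ID.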

I like your GC idea; I will look into that.  I may have cloned tables in the past to fix some other problem, but it is not something I would normally do.

Thanks again for your ideas.

From: Mike Miller <mm...@apache.org>
Sent: 06 October 2020 19:53
To: user@accumulo.apache.org
Subject: Re: Continuous tablets unloaded and fails to balance from accumulo master

It would help if you provided what commands you are running and some of the output (if possible) - or at least more detail of what you are seeing.  It's hard to provide specifics, because it's hard to understand how you got into this state, what you have done, and what the current state is.

If tablets are assigned to a dead server, but you think that server is ok, did you try taking that server down?  Once the server is detected as down, that should trigger reassignments - at that point you can restart the server.

Scanning the accumulo.metadata table - does every extent have a loc entry? Something like:

accumulo shell -u root -p secret -e 'scan -t accumulo.metadata -np -c loc' | grep -v loc

Have you tried restarting the master?

If the tables are "old" and deleted - what are you onlining?  Have you tried to delete an offline table?

Is your GC running to completion? Do you clone tables?  One issue may be that the Accumulo GC needs to check that a file is not shared between tables; maybe it's running into issues completing that check?

On Tue, Oct 6, 2020 at 12:57 PM Christopher <ct...@apache.org> wrote:
I'm not sure CheckForMetadataProblems can check for all that many different types of problems. It is limited.
If you have tablets still in the metadata table for tables that no longer exist, that indicates you probably had some sort of crash and possible corruption of your metadata.
The only option would be to manually delete those entries.
A command to automatically prune these would probably be dangerous... running it when there's a transient ZooKeeper problem, for example, could end up deleting all your tables... which would be bad. Although it is dangerous, manual surgery on the metadata table to remove these entries, as you suggested, is probably the best option.

On Tue, Oct 6, 2020 at 12:03 PM Hart, Andrew <an...@cgi.com> wrote:
I am still trying to find the one “unloaded tablet” that is preventing the cluster from balancing; however, there are a lot of unassigned tablets.

I have been getting rid of them by onlining tables and completing failed table deletes but I am still left with many tablets that are unassigned.  They seem to be mostly from old deleted tables and so I am not sure why they are there at all.
The unassigned tablets are shown by accumulo org.apache.accumulo.server.util.FindOfflineTablets and by accumulo admin checkTablets.
And as I said, some are assigned to a dead server, but actually the server isn’t dead at all.

CheckForMetadataProblems reports “All is well”

I thought that if I could clear up this mess I could then eventually get to just one unassigned tablet which would be the “1 tablets are unloaded” one.  (I would then clone the table or copy the data out or something)

So the problem remains.  The cluster doesn’t balance due to migrations.  I don’t find a tablet with a future entry and I can’t find it in unassigned or offline tablets due to the large number of other (presumably defunct) tablets with unassigned problems in tables that no longer exist.

There are warnings in the documentation about manually editing the accumulo metadata table but it seems that the only option is to go in with a deletemany on any rows that start with an old deleted table.  There does not seem to be an “accumulo admin pruneDefunctTablets –t tid” command! :D
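
To narrow down which metadata rows belong to tables that no longer exist, the row IDs can be compared against the table listing before anything is deleted by hand. The following is a minimal Python sketch, not an Accumulo tool. It assumes that tablet rows in accumulo.metadata look like "tableId;endRow" or "tableId<", and that each line of the tables -l output has the form "name => id"; both formats and all function names are assumptions made for illustration. It only reports candidate IDs and deletes nothing. If the listing is incomplete (for example during a transient ZooKeeper problem), live tables would be reported as orphaned, so the result needs checking by eye.

```python
import re

# Table ID is everything before the first ';' or '<' in a tablet row.
_TABLET_ROW = re.compile(r"^([^;<]+)[;<]")


def table_id_of(row):
    """Return the table ID prefix of a tablet row, or None for other rows."""
    if row.startswith("~"):
        return None  # non-tablet rows such as deletion markers
    match = _TABLET_ROW.match(row)
    return match.group(1) if match else None


def parse_tables_listing(listing_lines):
    """Parse 'name => id' lines (assumed format) into an {id: name} dict."""
    known = {}
    for line in listing_lines:
        name, sep, table_id = line.partition("=>")
        if sep:
            known[table_id.strip()] = name.strip()
    return known


def orphaned_table_ids(metadata_rows, listing_lines):
    """Return table IDs seen in metadata rows but absent from the listing."""
    known = set(parse_tables_listing(listing_lines))
    found = {table_id_of(row) for row in metadata_rows}
    found.discard(None)
    return sorted(found - known)
```

The returned IDs are only candidates for the manual deletemany discussed above; each one should be confirmed against the live table list first.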



From: Mike Miller <mm...@apache.org>
Sent: 06 October 2020 16:27
To: user@accumulo.apache.org
Subject: Re: Continuous tablets unloaded and fails to balance from accumulo master

Do you want to merge old tablets that don't exist anymore?  I am not sure what you are asking... you might have better luck if you provide some more info and ask on Slack: https://accumulo.apache.org/contact-us/#slack

On Tue, Oct 6, 2020 at 7:25 AM Hart, Andrew <an...@cgi.com> wrote:
What is the way to remove tablets that still exist in accumulo but do not have an online, offline or deleting table?

Some of these tablets say ASSIGNED TO DEAD SERVER but the tserver they refer to is up and working properly.

From: Hart, Andrew <an...@cgi.com>
Sent: 25 September 2020 13:52
To: user@accumulo.apache.org
Subject: RE: Continuous tablets unloaded and fails to balance from accumulo master

Thanks for your help.  In looking for this I think I have found that there are deleted tables that still have a lot of tablets in the metadata table.
I need to solve that before coming back to find the 1 unloaded tablet.

Cheers And.

From: Mike Miller <mm...@apache.org>
Sent: 24 September 2020 16:08
To: user@accumulo.apache.org
Subject: Re: Continuous tablets unloaded and fails to balance from accumulo master

That might be OK, could just mean it hasn't been assigned yet.  The only way I can think of is to populate a list of all tablets from the metadata table and find the one without a "loc" column family.

On Thu, Sep 24, 2020 at 10:55 AM Hart, Andrew <an...@cgi.com> wrote:
No, no future entries in the table.

From: Mike Miller <mm...@apache.org>
Sent: 24 September 2020 15:10
To: user@accumulo.apache.org
Subject: Re: Continuous tablets unloaded and fails to balance from accumulo master

You should be able to figure out the unloaded tablet from the "accumulo.metadata" table.  The metadata table will list the tablet location using the "loc" column family to indicate it has loaded a tablet that it was assigned.
For example the tablet "n;9" will have an entry like:
n;9 loc:1000041fbf00006 []    ip-172-31-87-51.ec2.internal:9997

From my understanding, the unloaded tablet should have a "future" column family, meaning it has been assigned a new location but not loaded yet.  If the tablet doesn't have a "loc" or "future" column family then that is a problem.
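
The loc/future rule described here can be sketched as a small Python filter over the text output of a full metadata scan (for example scan -t accumulo.metadata -np with no column filter). This is only an illustration under stated assumptions: each output line is taken to look like the sample above ("row family:qualifier [visibility] value"), row IDs are assumed to contain no whitespace, and the function name and state labels are made up rather than Accumulo terminology.

```python
from collections import defaultdict


def classify_tablets(scan_lines):
    """Group scan output by row and classify each tablet by its families.

    Returns a dict mapping row ID to one of:
    "hosted" (has loc), "assigned, not yet loaded" (has future, no loc),
    or "unassigned" (has neither).
    """
    families = defaultdict(set)
    for line in scan_lines:
        parts = line.split()
        # Expect at least: row, family:qualifier, [visibility]
        if len(parts) < 3 or ":" not in parts[1]:
            continue  # skip banners, blank lines, anything unparseable
        row = parts[0]
        if row.startswith("~"):
            continue  # skip non-tablet rows such as deletion markers
        families[row].add(parts[1].split(":", 1)[0])

    states = {}
    for row, fams in families.items():
        if "loc" in fams:
            states[row] = "hosted"
        elif "future" in fams:
            states[row] = "assigned, not yet loaded"
        else:
            states[row] = "unassigned"
    return states
```

Rows reported as "unassigned" are the ones with neither a "loc" nor a "future" column family, which is the problem case described above.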

On Thu, Sep 24, 2020 at 6:32 AM Hart, Andrew <an...@cgi.com> wrote:
Hi,

I am getting “Not balancing due to 1 outstanding migrations” and “[Normal tablets]: 1 tablets unloaded”.
This means that the cluster never balances unless I restart the master, after which I get a one-off balance and then it returns to the above messages.

How do I identify the tablet that is unloaded?  It isn’t in the logs that I can see.  Is it possible to tell from the contents of the accumulo.metadata table?

Is there a way to use FindOfflineTablets?

And.

Re: Continuous tablets unloaded and fails to balance from accumulo master

Posted by Mike Miller <mm...@apache.org>.
FYI the Master has a debug log that will print up to 10 tablets that have
outstanding migrations.
https://github.com/apache/accumulo/blob/0a9837f3f8395d89c5cd7bab7805c4aae28919be/server/base/src/main/java/org/apache/accumulo/server/master/balancer/TabletBalancer.java#L172

