Posted to user@accumulo.apache.org by "Ligade, Shailesh [USA]" <Li...@bah.com> on 2022/04/12 12:35:31 UTC

Accumulo 1.10.0

Hello, last weekend we ran out of HDFS space; all volumes were 100% full, which was crazy. This Accumulo instance has many tables with good data.

Although Accumulo was up, it had 3 unassigned tablets.

So I added a few nodes to HDFS/Accumulo, and now about 33% of HDFS capacity is free. I issued the HDFS rebalance command (just in case), so all good there. The Accumulo unassigned tablets went away, but the tables show no assigned tablets on the Accumulo monitor.
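
The rebalance itself was a standard balancer run along these lines (the threshold shown is just an example value):

    # move blocks onto the new DataNodes until each node is within 10% of average utilization
    hdfs balancer -threshold 10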

On the active master I am seeing this error:

ERROR: Error processing table state for store Normal Tablets
java.lang.RuntimeException: org.apache.accumulo.server.master.state.TabletLocationState$BadLocationStateException: found two locations for the same extent xxxxxxxx

The question is: am I getting this because the balancer is running, and will it recover once the balancer finishes? What can be done to save this cluster?

Thanks

-S

Re: Accumulo 1.10.0

Posted by "Ligade, Shailesh [USA]" <Li...@bah.com>.
Thanks Ed,

Stopping the tserver didn't help. A weird issue I saw is that both locations are on the same tserver. So I guess I have to do it the hard way... :-(

-S

Re: Accumulo 1.10.0

Posted by "Ligade, Shailesh [USA]" <Li...@bah.com>.
Thanks Ed,

After a few rounds of deleting the duplicate locations, the cluster is up and kicking. I didn't have to restart Accumulo after the deletes.

Thanks again

-S

Re: Accumulo 1.10.0

Posted by "Ligade, Shailesh [USA]" <Li...@bah.com>.
Thanks

I noticed that issue with the write permission; anyway, I was able to run all the deletes. In reality I ran CheckForMetadataProblems to generate a script and then ran all the deletes. The shell didn't throw any errors. Then I restarted both of my masters, but when the master came up I am seeing the same duplicate entry error in the master log, so either I didn't do something right, maybe the master didn't get bounced properly, or ZooKeeper or something still has that data.
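
For what it's worth, that check is the server-side utility class, invoked roughly like this (run as the Accumulo service user; depending on the install it may also need credential options):

    # scans the metadata table and reports problems such as duplicate tablet assignments
    accumulo org.apache.accumulo.server.util.CheckForMetadataProblems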

Any suggestions?
-S

RE: Accumulo 1.10.0

Posted by dev1 <de...@etcoleman.com>.
If you still have an issue, check that your user has WRITE permission on the metadata table (even root needs to be added explicitly). If you grant permissions, you will likely want to remove them once you are done, to prevent inadvertently modifying the table later if you make a mistake with a command intended for another table. (Besides, it is good security practice to operate with the minimum required permissions.)
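
In the shell that would look roughly like this, with root used as an example user:

    grant Table.WRITE -t accumulo.metadata -u root
    # ... do the metadata fixes, then remove the extra permission again
    revoke Table.WRITE -t accumulo.metadata -u root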


Re: Accumulo 1.10.0

Posted by "Ligade, Shailesh [USA]" <Li...@bah.com>.
I think I figured it out.

I have to be on the accumulo.metadata table in order for the delete command to work; -t accumulo.metadata did not work, not sure why.
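
So the sequence that works looks roughly like this (the row, qualifier and server values are the redacted placeholders from the earlier scan, not real ids):

    table accumulo.metadata
    scan -c loc -b a;xxxx
    # delete takes <row> <family> <qualifier>; the qualifier is the part after "loc:" in the scan output
    delete a;xxxx:yyyy loc aaaaa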

Thanks

-S

Re: Accumulo 1.10.0

Posted by "Ligade, Shailesh [USA]" <Li...@bah.com>.
Thanks Ed,

a quick question,

Now that I want to delete those duplicates (there are many of them),

the scan output is:

a;xxxx:yyyy loc:aaaaa [] tablet1:9997
a;xxxx:yyyy loc:zzzzzz [] tablet2:9997

What is the right delete command? When I issue

delete a; loc aaaaa -t accumulo.metadata

I get the help message, so it doesn't think it is a valid command.

I tried

delete a;xxxx loca aaaaa -t accumulo.metadata  or
delete a;xxxx:yyyy loc aaaaa -t accumulo.metadata

I still get the help message.

Thanks in advance,

-S



RE: Accumulo 1.10.0

Posted by dev1 <de...@etcoleman.com>.
I would suspect that the metadata table became corrupted when the system went unstable, and two tablet servers somehow ended up both thinking that they were responsible for the same extent(s). This should not be because of the balancer running.

If you scan the accumulo.metadata table using the shell (scan -t accumulo.metadata -c loc) or (scan -t accumulo.metadata -c loc -b [TABLE_ID#]:[EXTENT]), there will be duplicated loc entries.
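
As a concrete sketch with placeholder values (the table id, session and host below are stand-ins, not real ids), a tablet with two assignments shows up as two loc columns under the same row:

    scan -t accumulo.metadata -c loc
    a;xxxx:yyyy loc:aaaaa [] tablet1:9997
    a;xxxx:yyyy loc:zzzzzz [] tablet2:9997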

I am uncertain of the best way to fix this and do not have a place to try things out, but here are possible actions.

Shut down / bounce the tservers that have the duplicated assignments. You could start with just one and see what happens. When the tservers go offline, the tablets should be reassigned, and maybe only one (re)assignment will occur.
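
One reasonably graceful way to bounce a single tserver is to ask the master to stop it by address (hostname and port are placeholders; bring it back afterwards with whatever start script your deployment uses):

    # asks the master to cleanly shut down the tablet server at this address
    accumulo admin stop tserver1.example.com:9997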

Try bouncing the manager (master)
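
For the master, something along these lines should work (the restart step depends on how your cluster is managed):

    accumulo admin stopMaster
    # then restart the master process on that node, e.g. via bin/start-all.sh or your service manager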

If those don't work, then a very aggressive / dangerous / last-resort option:

Delete the specific loc entries from the metadata table (delete [row_id] loc [value] -t accumulo.metadata). This will cause a future entry in ZooKeeper; to get that to reassign, it might be enough to bounce the master, or you may need to shut down / restart the cluster.
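
A sketch of that form in the shell (bracketed values are placeholders; if delete does not accept -t, switch to the table first with the table command, as shown):

    table accumulo.metadata
    # [value] is the column qualifier shown after "loc:" in the scan output (the tserver session id)
    delete [row_id] loc [value]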

Ed Coleman
