Posted to user@accumulo.apache.org by "Ligade, Shailesh [USA]" <Li...@bah.com> on 2022/02/09 16:49:09 UTC

accumulo 1.10.0 masters won't start

Hello,

Both of my masters are stuck with ZooKeeper errors:

IOException: Packet len 2791093 is out of range!
KeeperErrorCode = ConnectionLoss for /accumulo/<instance_id>/fate


If I use zkCli to see what is under fate, I get

IOException Packet len 2791161 is out of range
Unable to read additional data from server sessionid xxxx, likely server has closed socket

hdfs fsck is all good

How can I clear this fate?

The master process is up and I can get into the accumulo shell, but there are no fate transactions (fate print returns empty)

Any idea how to bring the master up?

Thanks

S

Re: accumulo 1.10.0 masters won't start

Posted by dev1 <de...@etcoleman.com>.

Tservers will hold locks.
________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 16, 2022 9:10 AM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start

Well, now the master doesn't come up and throws all sorts of ZooKeeper errors. The only changes I made were setting jute.maxbuffer to a max of 0x600000 (in both zookeeper java.env and accumulo-site) and setting instance.zookeeper.timeout to 90s.

Even if both masters are down, I still see table_locks under the znode. Is that normal?

Appreciate your help

-S
________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 16, 2022 8:05 AM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start

Thanks Ed,

We have two clusters: one is the source, the other is the peer. I am testing this large table on the peer first (eventually I need to do the same on the source cluster). I am not stopping ingest on the source cluster, so replication to the peer table will continue; however, not much ingest is happening while I am doing this.

I tried the range compaction along with a range merge; however, the merge takes forever (even over a single range; I didn't try many, just the first few), and before it finishes I get a ZooKeeper error and both masters crash. After that I bump the jute setting (in both java.env on ZooKeeper and accumulo-env on Accumulo) to get it back. So I left merges alone and am just trying the 72k compactions. Since compactions are not backing up, I am doing a minimal sleep after every 100 compact commands. But sometimes during compactions I do get ZooKeeper errors and the master crashes.

I do get your idea of creating a new table with fewer splits, so that the new table will be compacted. However, for that I will need to stop ingest on the primary and then set up replication on the new cluster again. I was avoiding that, but I guess that may be my only option.

A last question: if I let this table continue to grow tablets, what is the worst-case scenario? How might it affect system performance?

Thanks

-S
________________________________
From: dev1 <de...@etcoleman.com>
Sent: Tuesday, February 15, 2022 3:11 PM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>
Subject: [External] RE: accumulo 1.10.0 masters won't start

Can you compact the table?  How aggressive do you want to get? I do not understand why you are getting the ZooKeeper errors – is it related to the number of tablets, or is it something else?  (an iterator that was attached with a very large set of arguments, a very large list or some sort of binary data – say a bloom filter)

If it were me (and you need to balance your own goals and requirements, which might dictate a less aggressive approach): at this point I’m assuming that getting things back online without data loss is the top priority, and that I was sure the problem was not related to something that I attached to the table.

I would do the following if I had room and could compact the table(s). It could also depend on how long a compaction would take and whether I could wait. (It is generally preferable to work on files that have had any deleted data removed, and compacting can reduce the total number of files, because files from minor compactions and bulk ingest are combined into a single file for each tablet.)

Stop ingest.

Flush the source table – allow any compactions to settle. (Optional if compacting, but should be a quick command to execute)

(Optional – compact the original.)

Clone the source table.

Compact the clone so that the clone does not share any files with the source

Optionally – use the exporttable command to generate a list of files from the clone – you may not need it, but it could be handy

Take the clone offline.

Move the files under /accumulo/tables/[clone-id]/dirs to one or more staging directories (in hdfs) – the export list could help.

Delete the clone table – (I believe that the delete will not check for the existence of files if it is offline.) If not, then it would be necessary to use an empty rfile as a placeholder.

Create a new table and set splits – this could be your desired number – or use just enough splits that each tserver has 1 tablet.

Set the default table split size to some multiple of the desired final size – this limits splitting during the imports. Not critical, but may be faster.

Take the new table offline and then back online – this will immediately migrate the splits – or you could just wait for the migration to finish.

Bulk import the files from the staging area(s) – likely in batches.  You will likely have ~72K files – so maybe ~7,000 files / batch?

Once all files have been imported set the split threshold to desired size.

Check that permissions, users, iterators, table config parameters are present on the new table and match the source.

Rename the source table to old_xxx or whatever

Rename the new table to the source table, verify things are okay and delete the original.
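The batching suggested in the bulk import step above can be sketched as follows. This is only an illustration of splitting roughly 72K files into groups of about 7,000; the staging paths and file names are hypothetical, and each group would still need to be placed in its own staging directory and imported with whatever bulk import mechanism the cluster uses.

```python
from typing import Iterator, List


def batches(files: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size files."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(files), batch_size):
        yield files[start:start + batch_size]


# Hypothetical names standing in for the ~72K rfiles moved to staging.
staged = [f"/staging/F{i:07d}.rf" for i in range(72_000)]
plan = list(batches(staged, 7_000))

# Number of batches, size of the first batch, size of the last batch.
print(len(plan), len(plan[0]), len(plan[-1]))
```

With these numbers the plan has 11 batches: ten full batches of 7,000 files and a final batch of 2,000.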

If you don’t have the space, you could skip operating on the clone, but then you can’t fall back to the original if things go wrong.

Another way would be to use the importtable command, but you need to make sure that it doesn’t just recreate the original splits, otherwise you end up with the same 72K files.

Ed Coleman

From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Tuesday, February 15, 2022 1:11 PM
To: user@accumulo.apache.org
Subject: Re: accumulo 1.10.0 masters won't start

Well,

I am trying to merge a large table (8T with 72k tablets, with default tablet size 1G)

Since I keep getting those ZooKeeper errors related to size, I keep bumping jute.maxbuffer, and now it is all the way up to 8m

But I still can't merge even a small subset (-b and -e). Now the error is: Xid out of order. Got Xid xx with err -101 expected Xid yy for a packet with details:

After this the master crashes

Any suggestions on how to go about merging this large table?

Thanks

-S
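As a rough back-of-the-envelope check on the figures in the message above (8T, 72k tablets, 1G tablet size), the following sketch computes the average tablet size and the tablet count the table would have if every tablet were at the split threshold. It assumes "8T" and "1G" mean 8 TiB and 1 GiB and that data is spread evenly, neither of which the thread confirms.

```python
TIB = 1024 ** 4
GIB = 1024 ** 3
MIB = 1024 ** 2

table_bytes = 8 * TIB        # "8T", read here as 8 TiB (assumption)
tablets = 72_000             # "72k tablets"
split_threshold = 1 * GIB    # "default tablet size 1G", read as 1 GiB (assumption)

# Average size per tablet if data were spread evenly (assumption).
avg_tablet_bytes = table_bytes / tablets

# Tablet count if every tablet held exactly one split threshold of data.
target_tablets = table_bytes // split_threshold

print(f"average tablet: {avg_tablet_bytes / MIB:.1f} MiB")
print(f"tablets at threshold size: {target_tablets}")
```

Under those assumptions the average tablet holds a little over 116 MiB, well below the 1 GiB threshold, and the same data at threshold size would need 8192 tablets, about a ninth of the current count.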

________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Thursday, February 10, 2022 9:20 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start

Thanks Ed,

That saved the day. The confusing part of setting that property is the documentation: it is unclear whether it needs hex or bytes, etc. Even the example they provided here

https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit




states that they are setting the value to 2 MB, but to me the value looks like 200k (with five zeros)

------------------------------------

Add the following line to increase the file size limit to 2MB:

SOLR_OPTS="$SOLR_OPTS -Djute.maxbuffer=0x200000"

-------------------------------------
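For reference, the value in the quoted Solr example is hexadecimal, and 0x200000 is 2,097,152 bytes, which is 2 MiB rather than 200k. The sketch below converts the settings mentioned in this thread. The helper name is hypothetical, and it only mirrors the idea that a setting may be written in decimal or with a 0x prefix; how a given ZooKeeper version actually parses the property should be checked against its documentation.

```python
def parse_jute_maxbuffer(value: str) -> int:
    """Parse a decimal or 0x-prefixed hexadecimal byte count."""
    # Base 0 lets Python infer the base from the prefix ("0x" means hex).
    return int(value, 0)


# Values that appear in this thread, in the form they were written.
for setting in ("0x200000", "0xfffff", "4194304", "0x600000"):
    size = parse_jute_maxbuffer(setting)
    print(f"{setting:>9} = {size:>8} bytes = {size / 1024**2:.2f} MiB")
```

The 1048575 bytes reported in the master log corresponds to 0xfffff, just under 1 MiB.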

Anyway, a master has been up and running for an hour now, so I am just trying to understand what was changed so that I can revert it after things stabilize.

Thanks a bunch.

-S

________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 5:54 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start

You might want to set the accumulo (zookeeper client) side - by setting ACCUMULO_JAVA_OPTS that is processed in accumulo-env.sh (or just edit that file?)

Looking at the Zookeeper documentation it describes what looks like you are seeing:

When jute.maxbuffer in the client side is less than the server side, the client wants to read the data exceeds jute.maxbuffer in the client side, the client side will get java.io.IOException: Unreasonable length or Packet len is out of range!

Also, a search showed jira tickets that had a server side limit of 4MB, but client limits of 1MB - you may want to see if 4194304 (or larger) as a value works.
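The mismatch described above can be illustrated with a simplified model of the length check. It compares the packet lengths reported in this thread against the limits that were tried. The function is a hypothetical stand-in, not ZooKeeper's code, and the exact boundary handling in a real client may differ.

```python
def fits(packet_len: int, max_buffer: int) -> bool:
    """Simplified model: a packet is accepted only if it is within the limit."""
    return 0 <= packet_len <= max_buffer


# Packet lengths reported in the errors in this thread.
packets = (2791093, 2791161)

# Limits mentioned in this thread, in bytes.
limits = {
    "300000": 300_000,
    "default 0xfffff": 0xFFFFF,
    "0x200000": 0x200000,
    "4194304": 4_194_304,
}

for name, limit in limits.items():
    ok = all(fits(p, limit) for p in packets)
    print(f"{name:>16}: {'accepted' if ok else 'out of range'}")
```

In this model the roughly 2.79 MB packets are rejected by the 300000, default and 2 MiB limits and accepted only from 4194304 upward. Whichever side has the smaller limit determines where the error is raised.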

________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 5:25 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start

jute.maxbuffer is a ZooKeeper property - it needs to be set on the zookeeper configuration.  If this is still correct, then it looks like there are a few options: https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit

But maybe the ZooKeeper documentation for your version can provide additional guidance?


________________________________

From: Shailesh Ligade <SL...@FBI.GOV>>
Sent: Wednesday, February 9, 2022 5:02 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: RE: accumulo 1.10.0 masters won't start

Thanks

Even if I set jute.maxbuffer on zookeeper in conf/java.env file to

-Djute.maxbuffer=300000

I see this in the accumulo master log:

INFO: jute.maxbuffer value is 1048575 Bytes

I am not sure where to set that on the accumulo side.

I set instance.zookeeper.timeout value to 90s in accumulo-site.xml

But still get those zookeeper KeeperErrorCode errors

-S

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 4:27 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: [EXTERNAL EMAIL] - Re: accumulo 1.10.0 masters won't start

I would not recommend setting the goal state directly until there are no other alternatives.

It is hard to recommend what to do, because it is unclear what put you into the current situation and what action / impact you might have had trying to fix things -

why did the goal state become unset in the first place?

what did you stuff into the fates that increased the need for larger jute buffers?

It could be that the number of tables and servers pushed you over the limit - or it could be something else.

What I would do.

Shutdown accumulo and make sure all services / tservers are stopped.

Shutdown any other services that might be using ZooKeeper.

Shutdown ZooKeeper.

Set the larger jute.maxbuffer and increase the timeout values across the board and in any dependent services.

Start hdfs - if you needed to shut it down.

Start just zookeeper - and use zkCli.sh to examine the state of things.  If that looks okay:

Start just the master - how far does it come up?  It will not be able to load the root / metadata tables, but it may give some indication of state.

I'd then cycle between stopping the master, trying to clean-up things using zkCli.sh using any guidance with errors the master is generating. If that looks promising, then:

With the master stopped - start the tservers and check a few logs. If there are exceptions, determine whether they point to an issue or are just something transient that is handled.

Once the tservers are up and looking okay - start the master.

One of the things to grab as soon as you can get the shell to run - get a listing of the tables and the ids.  If the worst happens, you can use that to map the existing data into a "new" instance. Hopefully it will not come to that and you will not need it - but if you don't have it and you need it, well... The table names and id are all in ZooKeeper.

Ed Coleman

________________________________

From: Shailesh Ligade <SL...@FBI.GOV>>
Sent: Wednesday, February 9, 2022 3:47 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: RE: accumulo 1.10.0 masters won't start

Thanks I can try that,

At this point, my goal is to get accumulo up. I was just wondering: if I set a different goal state such as SAFE_MODE, will it come up by ignoring fate and other issues? If it comes up, can I switch back to NORMAL, and will that work? I understand there may be some data loss.

-S

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 3:36 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: [EXTERNAL EMAIL] - Re: accumulo 1.10.0 masters won't start

For values in zoo.cfg see: https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#sc_advancedConfiguration

maxSessionTimeout

In the accumulo config  - #instance.zookeeper.timeout=30s

The zookeeper setting controls the max time that the ZK servers will grant - the accumulo setting is how much time accumulo will ask for.
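The relationship between the two settings can be sketched as a clamp: the client asks for a timeout and the server grants a value within its configured bounds. The sketch assumes ZooKeeper's documented defaults of 2 and 20 times tickTime for minSessionTimeout and maxSessionTimeout, and a tickTime of 2000 ms, which is only the common sample value and may not match this cluster. The function name is hypothetical.

```python
def negotiated_timeout_ms(requested_ms: int,
                          tick_time_ms: int = 2000,  # assumed sample tickTime
                          min_ticks: int = 2,        # default minSessionTimeout multiple
                          max_ticks: int = 20) -> int:  # default maxSessionTimeout multiple
    """Clamp the timeout a client asks for to the server's allowed range."""
    lowest = min_ticks * tick_time_ms
    highest = max_ticks * tick_time_ms
    return max(lowest, min(requested_ms, highest))


# Accumulo asking for 30s and for 90s under the assumed server defaults.
for requested in (30_000, 90_000):
    print(requested, "->", negotiated_timeout_ms(requested))
```

Under these assumptions a request for 90s would be granted only 40s, so raising the Accumulo setting alone has no effect beyond the server's maximum.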


________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 3:03 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start

Thanks for the response.

No, I have not updated any timeout.

Does that go in zoo.cfg? I can see there is min/maxSessionTimeout 2/20; is that what you are referring to?

-S

________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 2:51 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start

Have you tried to increase the zoo session timeout value? I think it's zookeeper.session.timeout.ms

________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 2:47 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start

Thanks

That fixed the goal state issue, but I am still getting errors with ZooKeeper, e.g.

KeeperErrorCode = ConnectionLoss for

/accumulo/<instance-id>/config/tserver.hold.time.max

/accumulo/<instance-id>/tables

/accumulo/<instance-id>/tables/1/name

/accumulo/<instance-id>/fate

/accumulo/<instance-id>/masters/goal_state

So it is all over …some I see good values in zookeeper…so not sure..  🙁

-S

________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 2:22 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start

There is a utility - SetGoalState that can be run from the command line

accumulo SetGoalState NORMAL

(or SAFE_MODE, CLEAN_STOP)

It sets a value in ZK at /accumulo/instance-id/masters/goal_state

Ed Coleman

________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 1:54 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start

Well,

I just went ahead and deleted fate in ZooKeeper and restarted the master. It was doing better, but now I am getting a different error:

ERROR: Problem getting real goal state from zookeeper: java.lang.IllegalArgumentException: No enum constant org.apache.accumulo.core.master.thrift.MasterGoalState

I hope I didn't delete goal_state accidentally ;-( Currently ls on goal_state returns []. Is there a way to add some value there?

-S

________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 1:32 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start

Did you try setting the increased size in the zkCli.sh command (or wherever it gets its environment from)?

The ZK docs indicate that it needs to be set to the same size on all servers and clients.

You should be able to use zkCli.sh to at least see what's going on - if that does not work, then it seems unlikely that the master would either.

Can you:

  *   list the nodes under /accumulo/[instance id]/fate?
  *   use the stat command on each of the nodes - the size is one of the fields.
  *   list nodes under any of the /accumulo/[instance_id]/fate/tx-#####
  *   there should be a node named debug - doing a get on that should show the op name.

Ed Coleman

________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 12:54 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start

Thanks

I added

-Djute.maxbuffer=30000000

in conf/java.env and restarted all ZooKeepers, but I am still getting the same error. The documentation is kind of fuzzy on setting this property, as it states the value in hex (default 0xfffff), so I am not 100% sure if 30000000 is OK, but at least I could see that ZooKeeper was up.

-S

________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 12:26 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start

Does the monitor or any of the logs show errors that relate to exceeding the ZooKeeper jute buffer size?

If so, have you tried increasing the ZooKeeper jute.maxbuffer limit (https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#Unsafe+Options)?

Ed Coleman

________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 11:49 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: accumulo 1.10.0 masters won't start

Hello,

Both of my masters are stuck with ZooKeeper errors:

IOException: Packet len 2791093 is out of range!

KeeperErrorCode = ConnectionLoss for /accumulo/<instance_id>/fate

If I use zkCli to see what is under fate, I get

IOException Packet len 2791161 is out of range

Unable to read additional data from server sessionid xxxx, likely server has closed socket

hdfs fsck is all good

How can I clear this fate?

The master process is up and I can get into the accumulo shell, but there are no fate transactions (fate print returns empty)

Any idea how to bring the master up?

Thanks

S

Re: accumulo 1.10.0 masters won't start

Posted by "Ligade, Shailesh [USA]" <Li...@bah.com>.

Well, now the master doesn't come up and throws all sorts of ZooKeeper errors. The only changes I made were setting jute.maxbuffer to a max of 0x600000 (in both zookeeper java.env and accumulo-site) and setting instance.zookeeper.timeout to 90s.

Even if both masters are down, I still see table_locks under the znode. Is that normal?

Appreciate your help

-S
________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 16, 2022 8:05 AM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start

Thanks Ed,

We have two clusters: one is the source, the other is the peer. I am testing this large table on the peer first (eventually I need to do the same on the source cluster). I am not stopping ingest on the source cluster, so replication to the peer table will continue; however, not much ingest is happening while I am doing this.


I tried the range compaction along with a range merge; however, the merge takes forever (even over a single range; I didn't try many, just the first few), and before it finishes I get a ZooKeeper error and both masters crash. After that I bump the jute setting (in both java.env on ZooKeeper and accumulo-env on Accumulo) to get it back. So I left merges alone and am just trying the 72k compactions. Since compactions are not backing up, I am doing a minimal sleep after every 100 compact commands. But sometimes during compactions I do get ZooKeeper errors and the master crashes.

I do get your idea of creating a new table with fewer splits, so that the new table will be compacted. However, for that I will need to stop ingest on the primary and then set up replication on the new cluster again. I was avoiding that, but I guess that may be my only option.

A last question: if I let this table continue to grow tablets, what is the worst-case scenario? How might it affect system performance?

Thanks

-S
________________________________
From: dev1 <de...@etcoleman.com>
Sent: Tuesday, February 15, 2022 3:11 PM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>
Subject: [External] RE: accumulo 1.10.0 masters won't start


Can you compact the table?  How aggressive do you want to get? I do not understand why you are getting the ZooKeeper errors – is it related to the number of tablets, or is it something else?  (an iterator that was attached with a very large set of arguments, a very large list or some sort of binary data – say a bloom filter)



If it were me (and you need to balance your own goals and requirements, which might dictate a less aggressive approach): at this point I’m assuming that getting things back online without data loss is the top priority, and that I was sure the problem was not related to something that I attached to the table.

I would do the following if I had room and could compact the table(s). It could also depend on how long a compaction would take and whether I could wait. (It is generally preferable to work on files that have had any deleted data removed, and compacting can reduce the total number of files, because files from minor compactions and bulk ingest are combined into a single file for each tablet.)



Stop ingest.

Flush the source table – allow any compactions to settle. (Optional if compacting, but should be a quick command to execute)

(Optional – compact the original.)

Clone the source table.

Compact the clone so that the clone does not share any files with the source

Optionally – use the exporttable command to generate a list of files from the clone – you may not need it, but it could be handy

Take the clone offline.

Move the files under /accumulo/tables/[clone-id]/dirs to one or more staging directories (in hdfs) – the export list could help.

Delete the clone table – (I believe that the delete will not check for the existence of files if it is offline.) If not, then it would be necessary to use an empty rfile as a placeholder.

Create a new table and set splits – this could be your desired number – or use just enough splits that each tserver has 1 tablet.

Set the default table split size to some multiple of the desired final size – this limits splitting during the imports. Not critical, but may be faster.

Take the new table offline and then back online – this will immediately migrate the splits – or you could just wait for the migration to finish.

Bulk import the files from the staging area(s) – likely in batches.  You will likely have ~72K files – so maybe ~7,000 files / batch?

Once all files have been imported set the split threshold to desired size.

Check that permissions, users, iterators, table config parameters are present on the new table and match the source.

Rename the source table to old_xxx or whatever

Rename the new table to the source table, verify things are okay and delete the original.



If you don’t have the space, you could skip operating on the clone, but then you can’t fall back to the original if things go wrong.



Another way would be to use the importtable command, but you need to make sure that it doesn’t just recreate the original splits, otherwise you end up with the same 72K files.



Ed Coleman





From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Tuesday, February 15, 2022 1:11 PM
To: user@accumulo.apache.org
Subject: Re: accumulo 1.10.0 masters won't start



Well,



I am trying to merge a large table (8T with 72k tablets, with default tablet size 1G)



Since I keep getting those ZooKeeper errors related to size, I keep bumping jute.maxbuffer, and now it is all the way up to 8m

But I still can't merge even a small subset (-b and -e). Now the error is: Xid out of order. Got Xid xx with err -101 expected Xid yy for a packet with details:

After this the master crashes

Any suggestions on how to go about merging this large table?



Thanks



-S

________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Thursday, February 10, 2022 9:20 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



Thanks Ed,



That saved the day. The confusing part of setting that property is the documentation: it is unclear whether it needs hex or bytes, etc. Even the example they provided here

https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit




states that they are setting the value to 2 MB, but to me the value looks like 200k (with five zeros)



------------------------------------

Add the following line to increase the file size limit to 2MB:

SOLR_OPTS="$SOLR_OPTS -Djute.maxbuffer=0x200000"

-------------------------------------



Anyway, a master has been up and running for an hour now, so I am just trying to understand what was changed so that I can revert it after things stabilize.



Thanks a bunch.



-S



________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 5:54 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start



You might want to set the accumulo (zookeeper client) side - by setting ACCUMULO_JAVA_OPTS that is processed in accumulo-env.sh (or just edit that file?)



Looking at the Zookeeper documentation it describes what looks like you are seeing:



When jute.maxbuffer in the client side is less than the server side, the client wants to read the data exceeds jute.maxbuffer in the client side, the client side will get java.io.IOException: Unreasonable length or Packet len is out of range!



Also, a search showed jira tickets that had a server side limit of 4MB, but client limits of 1MB - you may want to see if 4194304 (or larger) as a value works.



________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 5:25 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



jute.maxbuffer is a ZooKeeper property - it needs to be set on the zookeeper configuration.  If this is still correct, then it looks like there are a few options: https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit



But maybe the ZooKeeper documentation for your version can provide additional guidance?






________________________________

From: Shailesh Ligade <SL...@FBI.GOV>
Sent: Wednesday, February 9, 2022 5:02 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: RE: accumulo 1.10.0 masters won't start



Thanks



Even if I set jute.maxbuffer on zookeeper in the conf/java.env file to

-Djute.maxbuffer=300000

I see this in the accumulo master log:

INFO: jute.maxbuffer value is 1048575 Bytes

I am not sure where to set that on the accumulo side.

I set instance.zookeeper.timeout to 90s in accumulo-site.xml,

but I still get those zookeeper KeeperErrorCode errors.



-S



From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 4:27 PM
To: user@accumulo.apache.org
Subject: [EXTERNAL EMAIL] - Re: accumulo 1.10.0 masters won't start



I would not recommend setting the goal state directly until there are no other alternatives.

It is hard to recommend what to do, because it is unclear what put you into the current situation and what impact your attempts to fix things might have had:

Why did the goal state become unset in the first place?

What did you put into the fates that increased the need for larger jute buffers?



It could be that the number of tables and servers pushed you over the limit - or it could be something else.



What I would do:

Shut down accumulo and make sure all services / tservers are stopped.

Shut down any other services that might be using ZooKeeper.

Shut down ZooKeeper.

Set the larger jute.maxbuffer and increase the timeout values across the board and in any dependent services.

Start hdfs - if you needed to shut it down.

Start just zookeeper - and use zkCli.sh to examine the state of things.  If that looks okay:

Start just the master - how far does it come up?  It will not be able to load the root / metadata tables, but it may give some indication of state.

I'd then cycle between stopping the master and trying to clean things up with zkCli.sh, guided by the errors the master is generating. If that looks promising, then:

With the master stopped - start the tservers and check a few logs. If there are exceptions, determine whether they point to an issue or are just something transient and handled.



Once the tservers are up and looking okay - start the master.



One of the things to grab as soon as you can get the shell to run - get a listing of the tables and the ids.  If the worst happens, you can use that to map the existing data into a "new" instance. Hopefully it will not come to that and you will not need it - but if you don't have it and you need it, well... The table names and id are all in ZooKeeper.



Ed Coleman



________________________________

From: Shailesh Ligade <SL...@FBI.GOV>
Sent: Wednesday, February 9, 2022 3:47 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: RE: accumulo 1.10.0 masters won't start



Thanks I can try that,



At this point, my goal is to get accumulo up. I was just wondering: if I set a different goal state like SAFE_MODE, will it come up by ignoring fate and other issues? If it comes up, can I then switch back to NORMAL? Will that work? I understand there may be some data loss..



-S



From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 3:36 PM
To: user@accumulo.apache.org
Subject: [EXTERNAL EMAIL] - Re: accumulo 1.10.0 masters won't start



For values in zoo.cfg see: https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#sc_advancedConfiguration

maxSessionTimeout

In the accumulo config - instance.zookeeper.timeout (default 30s)



The zookeeper setting controls the max time that the ZK servers will grant - the accumulo setting is how much time accumulo will ask for.
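
To make the "grant" versus "ask" distinction concrete, here is a small Python sketch of the arithmetic as I understand it from the ZooKeeper admin guide: the server clamps the requested timeout into the range minSessionTimeout to maxSessionTimeout, which default to 2 and 20 times tickTime. The tickTime of 2000 ms is an assumption (it is the value in the sample zoo.cfg, not something taken from this cluster), and the function name is made up for illustration.

```python
def negotiated_timeout_ms(requested_ms, tick_time_ms=2000,
                          min_session_ms=None, max_session_ms=None):
    """Clamp the client's requested session timeout into the server's range.

    When min/max are not given, fall back to the documented defaults of
    2 * tickTime and 20 * tickTime.
    """
    low = min_session_ms if min_session_ms is not None else 2 * tick_time_ms
    high = max_session_ms if max_session_ms is not None else 20 * tick_time_ms
    return max(low, min(requested_ms, high))


if __name__ == "__main__":
    # Accumulo asking for its default of 30s fits inside 4s..40s.
    print(negotiated_timeout_ms(30_000))
    # Asking for 90s is cut back to the 40s ceiling ...
    print(negotiated_timeout_ms(90_000))
    # ... unless maxSessionTimeout is raised in zoo.cfg as well.
    print(negotiated_timeout_ms(90_000, max_session_ms=120_000))
```

If that reading is right, raising only the accumulo side has no effect beyond the ZooKeeper ceiling, so both settings need to move together.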







________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 3:03 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start



Thanks for the response,

No, I have not updated any timeout.

Is that going in zoo.cfg? I can see there is min/maxSessionTimeout 2/20, is that what you are referring to?



-S

________________________________

From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 2:51 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start



Have you tried to increase the zoo session timeout value? I think it's zookeeper.session.timeout.ms



________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 2:47 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start



Thanks



That fixed the goal state issue, but I am still getting errors with zookeeper, e.g.

KeeperErrorCode = ConnectionLoss for

/accumulo/<instance-id>/config/tserver.hold.time.max

/accumulo/<instance-id>/tables

/accumulo/<instance-id>/tables/1/name

/accumulo/<instance-id>/fate

/accumulo/<instance-id>/masters/goal_state

So it is all over the place. For some of these I see good values in zookeeper, so I am not sure..  🙁



-S



________________________________

From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 2:22 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start



There is a utility - SetGoalState - that can be run from the command line:



accumulo SetGoalState NORMAL



(or SAFE_MODE, CLEAN_STOP)



It sets a value in ZK at /accumulo/[instance-id]/masters/goal_state (managers/goal_state in newer releases)



Ed Coleman



________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 1:54 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start



Well,



I just went ahead and deleted fate in zookeeper and restarted the master.. it was doing better, but then I got a different error:

ERROR: Problem getting real goal state from zookeeper: java.lang.IllegalArgumentException: No enum constant org.apache.accumulo.core.master.thrift.MasterGoalState

I hope I didn't delete goal_state accidentally ...;-( Currently ls on goal_state is []. Is there a way to add some value there?



-S

________________________________

From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 1:32 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start



Did you try setting the increased size for the zkCli.sh command (or wherever it gets its environment from)?



The ZK docs indicate that it needs to be set to the same size on all servers and clients.



You should be able to use zkCli.sh to at least see what's going on - if that does not work, then it seems unlikely that the master would either.



Can you:

  *   list the nodes under /accumulo/[instance_id]/fate?
  *   use the stat command on each of the nodes - the size is one of the fields.
  *   list nodes under any of the /accumulo/[instance_id]/fate/tx-#####
  *   there should be a node named debug - doing a get on that should show the op name.
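
If the stat output for many nodes gets hard to read, the two fields of interest can be pulled out with a few lines of Python. This is only a sketch: the sample text is made up to mirror the "key = value" layout that zkCli.sh prints, the numbers in it are placeholders rather than values from this cluster, and the function names are hypothetical.

```python
def parse_stat(text: str) -> dict:
    """Parse 'key = value' lines, as printed by the zkCli.sh stat command."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:  # skip lines that are not key/value pairs
            fields[key.strip()] = value.strip()
    return fields


def size_summary(text: str) -> tuple:
    """Return (dataLength, numChildren) for one node's stat output."""
    fields = parse_stat(text)
    return int(fields["dataLength"]), int(fields["numChildren"])


# Placeholder stat output; real output has more fields than shown here.
SAMPLE = """\
cZxid = 0x100000002
dataVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 3
"""

if __name__ == "__main__":
    data_length, num_children = size_summary(SAMPLE)
    print(f"dataLength={data_length} numChildren={num_children}")
```

A node with a very large dataLength or a very large numChildren would be the first place to look when responses exceed the buffer limit.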



Ed Coleman

________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 12:54 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start



Thanks



I added



-Djute.maxbuffer=30000000



in conf/java.env and restarted all zookeepers, but I am still getting the same error.. The documentation is kind of fuzzy on setting this property, as it states the value in hex (default 0xfffff), so I am not 100% sure if 30000000 is ok, but at least I could see zookeeper was up.



-S



________________________________

From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 12:26 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start



Does the monitor or any of the logs show errors that relate to exceeding the ZooKeeper jute buffer size?



If so, have you tried increasing the ZooKeeper jute.maxbuffer limit (https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#Unsafe+Options)?



Ed Coleman





________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 11:49 AM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: accumulo 1.10.0 masters won't start



Hello,



Both of my masters are stuck on a zookeeper error:



IOException: Packet len 2791093 is out of range!

KeeperErrorCode = ConnectionLoss for /accumulo/<instance_id>/fate





If I use zkCli to see what is under fate, I get



IOException Packet len 2791161 is out of range

Unable to read additional data from server sessionid xxxx, likely server has closed socket



hdfs fsck is all good



How can I clear this fate?



The master process is up and I can get into the accumulo shell, but there are no fates (fate print returns empty)



Any idea how to bring the master up?



Thanks



S

RE: accumulo 1.10.0 masters won't start

Posted by dev1 <de...@etcoleman.com>.

You could move. That should just be a metadata op.

From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 16, 2022 11:50 AM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start

Thanks Ed,

Uhmm, copying all that 7T of data from the unflattened hdfs layout to a flattened one will take some time.. I guess it makes sense to just copy/flatten 100 rf files, import, and keep repeating; that will work without filling a single datanode... I wish there was an accumulo command that would take the structure as is..

-S
________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 16, 2022 10:49 AM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>
Subject: [External] RE: accumulo 1.10.0 masters won't start

I would use importdirectory [src_dir] [fail_dir] -t new_table false

I would move the files from under accumulo (with accumulo either shut down, or at least with the table offline) into hdfs directories for each batch (10K or so): batch1, batch2,… I think importdirectory expects a flat directory of just files.

Then I can import one batch, check for errors, repeat.

The table you create – the splits will be whatever you set.  Again, maybe 1 split for each tserver (that’s about optimum for the batch import).  Set the split size higher.  The import command will then place the files according to the splits on the new table – so in your case, it's likely multiple files from your current splits will be assigned to 1 tserver – effectively being a merge.

My approach to these things is to create scripts based on the info that I have, but to run them myself so that I can see if things are progressing and make adjustments if not.  I use individual scripts so that I have positive control.  Pipe, grep, awk to build the commands, review the files as a sanity check and then run them.
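
The "build the commands, review, then run" approach could be sketched in Python instead of grep/awk, roughly as below. Everything named here is hypothetical: the staging and failure directories, the table name, the example file paths and the batch size. The script only produces command text for review and runs nothing. The importdirectory line follows the form given above, the hdfs lines use the standard hdfs dfs -mkdir and -mv commands, and it assumes rfile names are unique across tablet directories so that flattening does not overwrite anything.

```python
from typing import Iterable, List, Tuple


def batches(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_commands(rfiles: List[str], staging_root: str, fail_root: str,
                   table: str, batch_size: int) -> Tuple[List[str], List[str]]:
    """Return (hdfs commands, accumulo shell commands) as reviewable text."""
    hdfs_cmds: List[str] = []
    shell_cmds: List[str] = []
    for n, batch in enumerate(batches(rfiles, batch_size), start=1):
        src = f"{staging_root}/batch{n}"
        fail = f"{fail_root}/batch{n}"
        # One flat source directory and one failure directory per batch.
        hdfs_cmds.append(f"hdfs dfs -mkdir -p {src} {fail}")
        for path in batch:
            hdfs_cmds.append(f"hdfs dfs -mv {path} {src}/")
        shell_cmds.append(f"importdirectory {src} {fail} -t {table} false")
    return hdfs_cmds, shell_cmds


if __name__ == "__main__":
    # Made-up file list; in practice this would come from an hdfs listing.
    files = [f"/accumulo/tables/2a/t-{i:07d}/F{i:07d}.rf" for i in range(5)]
    hdfs_cmds, shell_cmds = build_commands(
        files, "/staging", "/staging-fail", "new_table", batch_size=2)
    print("\n".join(hdfs_cmds))
    print("\n".join(shell_cmds))
```

Writing the two lists to separate files keeps the review step simple: move one batch, run its single import line, check the failure directory, then continue.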

Ed Coleman

From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 16, 2022 9:37 AM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start

Thanks

Since hdfs fsck is fine on /accumulo, I can back up my tables to some location within hdfs (not under accumulo) and reinitialize accumulo.

Then I can recreate my tables/users on the new instance.

What would be the command to import/load the existing hdfs data into a newly created table? The importtable command creates a new table as well, so I guess I need to test it somewhere.

Also, while loading the old data into the new table, what can I do to get rid of these splits/tablets?

I think this will be the faster approach for me to recover..

Thank you so much

-S

________________________________

From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 16, 2022 9:29 AM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

> What happens if I let the #tablets grow?

It sounds like you might be in the worst case now?  There is overhead for each tablet - at what point the master / other things fall over is not something I've tried to find out. Even scanning the metadata table and the gc process are doing a lot of work to track that many files / tablets, and it is likely unnecessary.

What is the command / arguments that you are using for compactions?  The comment about a minimal sleep after every 100 compact commands is confusing to me.

Can you buffer the replication?

You might be able to:

 - create a new table.

 - point the replication to write to the new table.

 - ingest data from the old into the new.

You should look towards picking a split threshold so that you have 1 or maybe a few tablets per tserver (with some reasonable split size.)  Split sizes of 3G or 5G are not uncommon - and larger is reasonable.
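
As a rough guide for picking a threshold, the expected tablet count is about the table size divided by the split size. The Python sketch below uses the 8T table size and 1G default mentioned elsewhere in this thread; treating T and G as binary units is an assumption, the function names are made up, and real tablet counts also depend on pre-splits and data distribution.

```python
import math

GIB = 1024 ** 3
TIB = 1024 ** 4


def estimated_tablets(table_bytes: int, split_threshold_bytes: int) -> int:
    """Approximate tablet count if every tablet grew to the split threshold."""
    return max(1, math.ceil(table_bytes / split_threshold_bytes))


def threshold_for(table_bytes: int, target_tablets: int) -> int:
    """Split threshold (bytes) that would give about target_tablets tablets."""
    return math.ceil(table_bytes / target_tablets)


if __name__ == "__main__":
    table = 8 * TIB
    for gib in (1, 3, 5):
        count = estimated_tablets(table, gib * GIB)
        print(f"{gib}G threshold -> about {count} tablets")
```

With a 1G threshold this gives roughly 8 thousand tablets for 8T, so a table with 72k tablets has many tablets far below the threshold (on the order of 100 MB each on average).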

Ed Coleman

________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 16, 2022 8:05 AM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start

Thanks Ed,

We have two clusters, one is the source and the other is the peer. I am testing this large table on the peer first (eventually I need to do the same on the source cluster). I am not stopping ingest on the source cluster, so replication to the peer table will continue; however, while I am doing this not much ingest is happening.

I tried range compaction along with range merge; however, the merge takes forever (even over a single range.. I didn't try many, just the first few), and before it finishes I get a zookeeper error and both masters crash. After that I bump the jute setting (in both java.env on zookeeper and accumulo-env on accumulo) to get it back. So I left merges alone and am just trying compactions across the 72k tablets; since compactions are not backing up, I am doing a minimal sleep after every 100 compact commands. But sometimes during compactions I do get zookeeper errors and the master crashes.

I do get your idea of creating a new table with fewer splits, so that the new table will be compacted. However, for that I will need to stop ingest on the primary and then set up replication on the new cluster again.. I was avoiding that, but I guess that may be my only option.

A last question: if I let this table continue to grow tablets, what is the worst case scenario? How might it affect system performance?

Thanks

-S

________________________________

From: dev1 <de...@etcoleman.com>
Sent: Tuesday, February 15, 2022 3:11 PM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>
Subject: [External] RE: accumulo 1.10.0 masters won't start

Can you compact the table?  How aggressive do you want to get? I do not understand why you are getting the ZooKeeper errors – is it related to the number of tablets, or is it something else (an iterator that was attached with a very large set of arguments, a very large list or some sort of binary data – say a bloom filter)?

Here is what I would do if it were me, though you need to balance your own goals and requirements, which might dictate a less aggressive approach.  At this point I’m assuming that getting things back online without data loss is the top priority (and that I was sure the problem was not related to something that I attached to the table).

If I have room, I would compact the table(s) first. That also depends on how long a compaction would take and whether I could wait.  It is generally preferable to work on files that have had any deleted data removed, and compacting reduces the total number of files, because files from minor compactions and bulk ingest are combined into a single file for each tablet.

Stop ingest.

Flush the source table – allow any compactions to settle. (Optional if compacting, but should be a quick command to execute)

(Optional – compact the original.)

Clone the source table.

Compact the clone so that the clone does not share any files with the source

Optionally – use the exporttable command to generate a list of files from the clone – you may not need it, but it could be handy

Take the clone offline.

Move the files under /accumulo/tables/[clone-id]/dirs to one or more staging directories (in hdfs) – the export list could help.

Delete the clone table – (I believe that the delete will not check for the existence of files if it is offline.) If not, then it would be necessary to use an empty rfile as a placeholder.

Create a new table and set splits – this could be your desired number – or use just enough splits that each tserver has 1 tablet.

Set the default table split size to some multiple of the desired final size – this limits splitting during the imports. Not critical, but may be faster.

Take the new table offline and then back online – this will immediately migrate the splits – or you could just wait for the migration to finish.

Bulk import the files from the staging area(s) – likely in batches.  You will likely have ~72K files – so maybe ~7,000 files / batch?

Once all files have been imported set the split threshold to desired size.

Check that permissions, users, iterators, table config parameters are present on the new table and match the source.

Rename the source table to old_xxx or whatever

Rename the new table to the source table, verify things are okay and delete the original.

If you don’t have the space, you could skip operating on the clone, but then you can’t fall back to the original if things go wrong.

Another way would be to use importtable, but you need to make sure that it doesn’t just recreate the original splits, otherwise you end up with the same 72K files.

Ed Coleman

From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Tuesday, February 15, 2022 1:11 PM
To: user@accumulo.apache.org
Subject: Re: accumulo 1.10.0 masters won't start

Well,

I am trying to merge a large table (8T with 72k tablets, with the default tablet size of 1G).

Since I keep getting those zookeeper errors related to size, I keep bumping jute.maxbuffer, and now it is all the way up to 8m.

But I still can't merge even a small subset (-b and -e). Now the error is: Xid out of order. Got Xid xx with err -101 expected Xid yy for a packet with details:

After this the master crashes.

Any suggestions on how to go about merging this large table?

Thanks

-S

________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Thursday, February 10, 2022 9:20 AM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start

Thanks Ed,

That saved the day. The confusing part of setting that property is that the documentation is unclear on whether it needs hex or bytes, etc. Even the example they provide here

https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit

states they are setting the value to 2mb, but the value really looks like 200k (with five 0s)

------------------------------------

Add the following line to increase the file size limit to 2MB:

SOLR_OPTS="$SOLR_OPTS -Djute.maxbuffer=0x200000"

-------------------------------------
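
For what it's worth, the Solr example does work out to 2MB once the 0x prefix is read as hexadecimal: 0x200000 is 2 * 16^5 = 2097152 bytes, whereas decimal 200000 would be about 0.19 MiB. A small Python sketch of the conversion follows; it relies on int(value, 0), which accepts both plain decimal and 0x-prefixed strings. The function names are made up, and whether a particular JVM or script passes the hex form through unchanged is not something this checks.

```python
def to_bytes(value: str) -> int:
    """Read a size string as bytes; base 0 accepts '0x...' hex or decimal."""
    return int(value, 0)


def describe(value: str) -> str:
    """Format a size string with its byte count and size in MiB."""
    n = to_bytes(value)
    return f"{value} = {n} bytes = {n / (1024 * 1024):.2f} MiB"


if __name__ == "__main__":
    # Values that appear in this thread and in the quoted documentation.
    for v in ("0x200000", "200000", "0xfffff", "4194304", "30000000"):
        print(describe(v))
```

Reading the values this way also shows that 0xfffff is the same number as the 1048575 bytes reported in the master log.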

Anyways, a master has been up and running for an hour now.. so I am just trying to understand what was changed, and will revert it after things stabilize.

Thanks a bunch.

-S

________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 5:54 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start

You might want to set the accumulo (zookeeper client) side - by setting ACCUMULO_JAVA_OPTS that is processed in accumulo-env.sh (or just edit that file?)

Looking at the Zookeeper documentation it describes what looks like you are seeing:

When jute.maxbuffer in the client side is less than the server side, the client wants to read the data exceeds jute.maxbuffer in the client side, the client side will get java.io.IOException: Unreasonable length or Packet len is out of range!

Also, a search showed jira tickets that had a server side limit of 4MB, but client limits of 1MB - you may want to see if 4194304 (or larger) as a value works,

________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 5:25 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start

jute.maxbuffer is a ZooKeeper property - it needs to be set on the zookeeper configuration.  If this is still correct, then it looks like there are a few options https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit<https://urldefense.com/v3/__https:/solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html*increasing-the-file-size-limit__;Iw!!May37g!ewlkGRNFLrKEpeF1Lz8vRt_oBtpgi8hVvvnCrp1Dq4_8Xprb4tEHWiHVFW7EVUT3vA$>

But maybe the ZooKeeper documentation for your version can provide additional guidance?

Setting Up an External ZooKeeper Ensemble | Apache Solr Reference Guide 7.4 - The Apache Software Foundation<https://urldefense.com/v3/__https:/solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html*increasing-the-file-size-limit__;Iw!!May37g!ewlkGRNFLrKEpeF1Lz8vRt_oBtpgi8hVvvnCrp1Dq4_8Xprb4tEHWiHVFW7EVUT3vA$>

The solution to this problem is to set up an external ZooKeeper ensemble, which is a number of servers running ZooKeeper that communicate with each other to coordinate the activities of the cluster.

solr.apache.org

________________________________

From: Shailesh Ligade <SL...@FBI.GOV>>
Sent: Wednesday, February 9, 2022 5:02 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: RE: accumulo 1.10.0 masters won't start

Thanks

Even if I set jute.maxbuffer on zookeeper in conf/java.env file to

-Djute.maxbuffer=300000

I see in accumulo master log as

INFO: jute.maxbuffer value is 1048575 Bytes    not sure where to set that on accumulo side.

I set instance.zookeeper.timeout value to 90s in accumulo-site.xml

But still get those zookeeper KeeperErrorCode errors

-S

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 4:27 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: [EXTERNAL EMAIL] - Re: accumulo 1.10.0 masters won't start

I would not recommend setting the goal state directly unlit there are no other alternatives.

It is hard to recommend what to do, because it is unclear what put you into the current situation and what action / impact you might have had trying to fix things -

why did the goal state become unset in the first place?

what did you stuff into the fates that increased the need for larger jute buffers?

It could be that the number of tables and servers pushed you over the limit - or it could be something else.

What I would do.

Shutdown accumulo and make sure all services / tservers are stopped.

Shutdown any other services that might be using ZooKeeper.

Shutdown ZooKeeper.

Set the larger jute.buffer and increase the timeout values across the board and in any dependent services.

Start hdfs - if you needed to shut it down.

Start just zookeeper - and use zkCli.sh to examine the state of things.  If that looks okay.

Start just the master - how far does it come up?  It will not be able to load the root / metadata tables, but it may give some indication of state,

I'd then cycle between stopping the master, trying to clean-up things using zkCli.sh using any guidance with errors the master is generating. If that looks promising, then:

With the master stopped - start the tservers and check a few logs if there are exceptions determine if they are they something that is pointing to an issue - or just something that is transient and handled.

Once the tservers are up and looking okay - start the master.

One of the things to grab as soon as you can get the shell to run - get a listing of the tables and the ids.  If the worst happens, you can use that to map the existing data into a "new" instance. Hopefully it will not come to that and you will not need it - but if you don't have it and you need it, well... The table names and id are all in ZooKeeper.

Ed Coleman

________________________________

From: Shailesh Ligade <SL...@FBI.GOV>>
Sent: Wednesday, February 9, 2022 3:47 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: RE: accumulo 1.10.0 masters won't start

Thanks I can try that,

At this point, my goal is to get accumulo up. I was just wondering if I can set different goal like SAFE_MODE will it come up by ignoring fate and other issues? If that comes up, can I switch back to NORMAL, will that work? I understand there may be some data loss..

-S

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 3:36 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: [EXTERNAL EMAIL] - Re: accumulo 1.10.0 masters won't start

For values in zoo.cfg see: https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#sc_advancedConfiguration<https://urldefense.com/v3/__https:/usg02.safelinks.protection.office365.us/?url=https*3A*2F*2Fzookeeper.apache.org*2Fdoc*2Fr3.5.9*2FzookeeperAdmin.html*23sc_advancedConfiguration&data=04*7C01*7CSLIGADE*40FBI.GOV*7Cb7b8be92faf64fbc95ff08d9ec13044d*7C022914a9b95f4b7bbace551ce1a04071*7C0*7C0*7C637800390698440068*7CUnknown*7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0*3D*7C3000&sdata=QaGFX7kcHJeiIN73G5bfDDEQNgxN0F7QdyJ9fO3SJzA*3D&reserved=0__;JSUlJSUlJSUlJSUlJSUlJSUlJSU!!May37g!ewlkGRNFLrKEpeF1Lz8vRt_oBtpgi8hVvvnCrp1Dq4_8Xprb4tEHWiHVFW6UcuwVgA$>

maxSessionTimeout

In the accumulo config  - #instance.zookeepers.timeout=30s

The zookeeper setting controls the max time that the ZK servers will grant - the accumulo setting is how much time accumulo will ask for.

ZooKeeper: Because Coordinating Distributed Systems is a Zoo<https://urldefense.com/v3/__https:/usg02.safelinks.protection.office365.us/?url=https*3A*2F*2Fzookeeper.apache.org*2Fdoc*2Fr3.5.9*2FzookeeperAdmin.html*23sc_advancedConfiguration&data=04*7C01*7CSLIGADE*40FBI.GOV*7Cb7b8be92faf64fbc95ff08d9ec13044d*7C022914a9b95f4b7bbace551ce1a04071*7C0*7C0*7C637800390698440068*7CUnknown*7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0*3D*7C3000&sdata=QaGFX7kcHJeiIN73G5bfDDEQNgxN0F7QdyJ9fO3SJzA*3D&reserved=0__;JSUlJSUlJSUlJSUlJSUlJSUlJSU!!May37g!ewlkGRNFLrKEpeF1Lz8vRt_oBtpgi8hVvvnCrp1Dq4_8Xprb4tEHWiHVFW6UcuwVgA$>

Trace Mask Bit Values ; 0b0000000000 : Unused, reserved for future use. 0b0000000010 : Logs client requests, excluding ping requests. 0b0000000100 : Unused, reserved ...

zookeeper.apache.org

________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 3:03 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start

thanks for response,

no i have not update any timeout

is that going in zoo.cfg? I can see there is min/maxSessionTimeout 2/20, is that what are you refering to?

-S

________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 2:51 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start

Have you tried to increase the zoo session timeout value? I think it's zookeeper.session.timeout.ms

________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 2:47 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start

Thanks

That fixed goal sate issue but now still getting

Errors with zookeeper

e.g.

KeeperErrorCode = ConnectionLoss for

/accumulo/<instane-id>/config/tserver.hold.time.max

/accumulo/<instane-id>/tables

/accumulo/<instane-id>/tables/1/name

/accumulo/<instane-id>/fate

/accumulo/<instane-id>/masters/goal_state

So it is all over …some I see good values in zookeeper…so not sure..  🙁

-S

________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 2:22 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start

The is a utility - SetGoalState that can be run from the command line

accumulo SetGoalState NORMAL

(or SAFE_MODE, CLEAN_STOP)

It sets a value in ZK at /accumulo/instance-id/managers/goal_state

Ed Coleman

________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 1:54 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start

Well,

I just went ahead and deleted fate in ZooKeeper and restarted the master. It was doing better, but now I am getting a different error:

ERROR: Problem getting real goal state from zookeeper: java.lang.IllegalArgumentException: No enum constant org.apache.accumulo.core.master.thrift.MasterGoalState

I hope I didn't delete goal_state accidentally ;-( Currently ls on goal_state returns []. Is there a way to add a value there?

-S

________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 1:32 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start

Did you try setting the increased size in the zkCli.sh command (or wherever it gets its environment from)?

The ZK docs indicate that it needs to be set to the same size on all servers and clients.

You should be able to use zkCli.sh to at least see what's going on - if that does not work, then it seems unlikely that the master would either.

Can you:

  *   list the nodes under /accumulo/[instance_id]/fate?
  *   use the stat command on each of the nodes - the size is one of the fields.
  *   list nodes under any of the /accumulo/[instance_id]/fate/tx-##### nodes?
  *   there should be a node named debug - doing a get on that should show the op name.
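If the stat output for many nodes is saved to text, a small script can pull out the fields that matter here. This is only a sketch: it assumes the "key = value" layout that zkCli.sh prints for stat, the sample numbers are invented, and the function names are hypothetical.

```python
import re

# Sample text in the shape printed by zkCli.sh "stat"; all values are made up.
SAMPLE_STAT = """\
cZxid = 0x100000002
ctime = Wed Feb 09 11:00:00 EST 2022
mZxid = 0x100000002
mtime = Wed Feb 09 11:00:00 EST 2022
pZxid = 0x1000004d2
cversion = 1234
dataVersion = 0
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 1234
"""


def parse_stat(text: str) -> dict:
    """Turn 'key = value' lines into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^\s*(\w+)\s*=\s*(.*)$", line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def node_summary(text: str) -> tuple:
    """Return (dataLength, numChildren) as integers."""
    fields = parse_stat(text)
    return int(fields["dataLength"]), int(fields["numChildren"])


if __name__ == "__main__":
    print(node_summary(SAMPLE_STAT))
```

Both numbers are worth watching: a node can be large because of its own data (dataLength) or because listing it returns a very long list of child names (numChildren).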

Ed Coleman

________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 12:54 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start

Thanks

I added

-Djute.maxbuffer=30000000

in conf/java.env and restarted all ZooKeepers, but I am still getting the same error. The documentation is kind of fuzzy on setting this property, as it states the value in hex (default 0xfffff), so I am not 100% sure whether 30000000 is OK, but at least I could see that ZooKeeper was up.

-S

________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 12:26 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start

Does the monitor or any of the logs show errors that relate to exceeding the ZooKeeper jute buffer size?

If so, have you tried increasing the ZooKeeper jute.maxbuffer limit (https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#Unsafe+Options)?

Ed Coleman

________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 11:49 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: accumulo 1.10.0 masters won't start

Hello,

Both of my masters are stuck on a ZooKeeper error:

IOException: Packet len 2791093 is out of range!

KeeperErrorCode = ConnectionLoss for /accumulo/<instance_id>/fate

If I use zkCli to see what is under fate, I get

IOException Packet len 2791161 is out of range

Unable to read additional data from server sessionid xxxx, likely server has closed socket

hdfs fsck is all good

How can I clear this fate?

The master process is up and I can get into the Accumulo shell, but there are no fate transactions (fate print returns empty)

Any idea how to bring the master up?

Thanks

S

Re: accumulo 1.10.0 masters won't start

Posted by "Ligade, Shailesh [USA]" <Li...@bah.com>.

Thanks Ed,

Hmm, copying all 7T of data from the unflattened HDFS layout to a flattened one will take some time. I guess it makes sense to copy/flatten 100 rf files, import them, and keep repeating; that should work without filling a single datanode. I wish there were an Accumulo command that would take the directory structure as is.

-S
________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 16, 2022 10:49 AM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>
Subject: [External] RE: accumulo 1.10.0 masters won't start

I would use importdirectory [src_dir] [fail_dir] -t new_table false

I would move the files from under accumulo (either shutdown, or at least have the table offline) into hdfs directories for each batch. (10K or so) batch1, batch2,… I think importdirectory expects a flat directory of just files.

Then I can import one batch, check for errors, repeat.

For the table you create, the splits will be whatever you set.  Again, maybe 1 split for each tserver (that’s about optimum for the batch import).  Set the split size higher.  The import command will then place the files according to the splits on the new table, so in your case it is likely that multiple files from your current splits will be assigned to 1 tserver, effectively being a merge.

My approach to these things is to create scripts based on the info that I have, but have them so that I run them so I can see if things are progressing and make adjustments if not.  I use individual scripts so that I have positive control.  Pipe, grep, awk to build the commands, review the files as a sanity check and then run them.
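As one possible shape for such a script, the sketch below only builds command text for review and runs nothing. The rfile paths, the /staging root, and the helper names are made up. The importdirectory line copies the form quoted above and has not been checked against the 1.10 shell. Flattening also assumes the rfile names are unique across tablet directories.

```python
from typing import Iterable, List, Tuple


def batches(paths: List[str], size: int) -> Iterable[List[str]]:
    """Yield consecutive slices of at most `size` paths."""
    for start in range(0, len(paths), size):
        yield paths[start:start + size]


def build_commands(paths: List[str], staging_root: str, table: str,
                   batch_size: int = 10_000) -> Tuple[List[str], List[str]]:
    """Return (hdfs_cmds, shell_cmds) as text to review before running."""
    hdfs_cmds, shell_cmds = [], []
    for n, batch in enumerate(batches(paths, batch_size), start=1):
        src = f"{staging_root}/batch{n}"
        fail = f"{staging_root}/fail{n}"
        # One flat source directory and one empty failure directory per batch.
        hdfs_cmds.append(f"hdfs dfs -mkdir -p {src} {fail}")
        for path in batch:
            hdfs_cmds.append(f"hdfs dfs -mv {path} {src}/")
        shell_cmds.append(f"importdirectory {src} {fail} -t {table} false")
    return hdfs_cmds, shell_cmds


if __name__ == "__main__":
    # Hypothetical file list; in practice it would come from an hdfs listing.
    files = [f"/accumulo/tables/9/t-{i:04d}/F{i:07d}.rf" for i in range(5)]
    hdfs_cmds, shell_cmds = build_commands(files, "/staging", "new_table", 2)
    print("\n".join(hdfs_cmds + shell_cmds))
```

Writing the two lists to separate files keeps the review step: move one batch, import it, check the failure directory, then continue.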

Ed Coleman

From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 16, 2022 9:37 AM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start

Thanks

Since hdfs fsck is fine on /accumulo, I can back up my tables to some location within HDFS (not under /accumulo) and reinitialize Accumulo.

Then I can recreate my tables/users on the new instance.

What would be the command to import/load the existing HDFS data into a newly created table? The importtable command creates a new table as well, so I guess I need to test it somewhere.

Also, while loading the old data into the new table, what can I do to get rid of these splits/tablets?

I think this will be a faster approach for me to recover.

Thank you so much

-S

________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 16, 2022 9:29 AM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start

> What happens if I let the #tablets grow?

It sounds like you might be in the worst case now?  There is overhead for each tablet - at what point the master / other things fall over is not something I've tried to find out. Even scanning the metadata table and the gc process are doing a lot of work to track that many files / tablets, and it is likely unnecessary.

What is the command / arguments that you are using for compactions?  The comment about a minimal sleep after every 100 compact commands is confusing to me.

Can you buffer the replication?

You might be able to:

 - create a new table.

 - point the replication to write to the new table.

 - ingest data from the old into the new.

You should look towards picking a split threshold so that you have 1 or maybe a few tablets per tserver (with some reasonable split size.)  Split sizes of 3G or 5G are not uncommon - and larger is reasonable.
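A quick back-of-the-envelope check of what different split thresholds imply, using the table size (8T) and tablet count (72k) reported elsewhere in this thread. It assumes "8T" means 8 TiB and that data is spread evenly, so treat the results as rough estimates only.

```python
import math

TIB = 1024 ** 4
GIB = 1024 ** 3
MIB = 1024 ** 2


def min_tablets(table_bytes: int, split_threshold_bytes: int) -> int:
    """Rough lower bound: tablets needed if each one were filled to the threshold."""
    return math.ceil(table_bytes / split_threshold_bytes)


def avg_tablet_bytes(table_bytes: int, tablets: int) -> float:
    """Average amount of data per tablet."""
    return table_bytes / tablets


if __name__ == "__main__":
    table = 8 * TIB
    for gib in (1, 3, 5):
        print(f"{gib}G threshold: about {min_tablets(table, gib * GIB)} tablets")
    print(f"72k tablets: about {avg_tablet_bytes(table, 72_000) / MIB:.0f} MiB each")
```

Under these assumptions the current tablets are far smaller than the 1G threshold, which suggests the tablet count came from something other than size-based splitting, such as pre-set splits.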

Ed Coleman

________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 16, 2022 8:05 AM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start

Thanks Ed,

We have two clusters: one is the source, the other is the peer. I am testing this large table on the peer first (eventually I need to do the same on the source cluster). I am not stopping ingest on the source cluster, so replication will continue to the peer table; however, while I am doing this not much ingest is happening.

I tried range compaction along with range merge. However, a merge takes forever (even over a single range; I didn't try many, just the first few), and before it finishes I get a ZooKeeper error and both masters crash. After that I have to bump the jute setting (in both java.env on ZooKeeper and accumulo-env on Accumulo) to get things back. So I left merges alone and am just trying the 72k compactions. Since compactions are not backing up, I am doing a minimal sleep after every 100 compact commands. But sometimes during compactions I do get ZooKeeper errors and the master crashes.

I do get your idea of creating a new table with fewer splits, so that the new table will be compacted. However, for that I will need to stop ingest on the primary and then set up replication on the new cluster again. I was avoiding that, but I guess that may be my only option.

A last question: if I let this table continue to grow tablets, what is the worst-case scenario? How might it affect system performance?

Thanks

-S

________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Tuesday, February 15, 2022 3:11 PM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>>
Subject: [External] RE: accumulo 1.10.0 masters won't start

Can you compact the table?  How aggressive do you want to get? I do not understand why you are getting the ZooKeeper errors – is it related to the number of tablets, or is it something else?  (an iterator that was attached with a very large set of arguments, a very large list or some sort of binary data – say a bloom filter)

Here is what I would do if it were me, though your goals and requirements might dictate a less aggressive approach.  At this point I’m assuming that getting things back online without data loss is the top priority, and that the problem is not related to something attached to the table.

This assumes I have room and can compact the table(s). It could also depend on how long a compaction would take and whether I could wait.  It is generally preferable to work on files that have had any deleted data removed, and compacting reduces the total number of files, because files from minor compactions and bulk ingest are combined into a single file for each tablet.

Stop ingest.

Flush the source table – allow any compactions to settle. (Optional if compacting, but should be a quick command to execute)

(Optional – compact the original.)

Clone the source table.

Compact the clone so that the clone does not share any files with the source

Optionally – use the exporttable command to generate a list of files from the clone – you may not need it, but could be handy

Take the clone offline.

Move the files under /accumulo/tables/[clone-id]/dirs to one or more staging directories (in hdfs) – the export list could help.

Delete the clone table – (I believe that the delete will not check for the existence of files if it is offline.) If not, then it would be necessary to use an empty rfile as a placeholder.

Create a new table and set splits – this could be your desired number – or use just enough splits that each tserver has 1 tablet.

Set the default table split size to some multiple of the desired final size – this limits splitting during the imports. Not critical, but may be faster.

Take the new table offline and then back online – this will immediately migrate the splits – or you could just wait for the migration to finish.

Bulk import the files from the staging area(s) – likely in batches.  You will likely have ~72K files – so maybe ~7,000 files / batch?

Once all files have been imported set the split threshold to desired size.

Check that permissions, users, iterators, table config parameters are present on the new table and match the source.

Rename the source table to old_xxx or whatever

Rename the new table to the source table, verify things are okay and delete the original.

If you don’t have the space, you could skip operating on the clone, but then you can’t fall back to the original if things go wrong.

Another way would be to use importtable, but you need to make sure that it doesn’t just recreate the original splits, otherwise you end up with the same 72K files.
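One way to choose "just enough splits" for the new table is to thin out the existing split points (for example, as listed by the shell's getsplits command) to the number of tablets wanted. A minimal sketch, assuming the existing tablets hold roughly equal amounts of data; the function name and sample rows are hypothetical.

```python
from typing import List


def thin_splits(existing_splits: List[str], target_tablets: int) -> List[str]:
    """Pick target_tablets - 1 evenly spaced split points from a sorted list."""
    if target_tablets < 1:
        raise ValueError("target_tablets must be at least 1")
    wanted = target_tablets - 1
    if wanted >= len(existing_splits):
        return list(existing_splits)  # nothing to thin out
    current_tablets = len(existing_splits) + 1
    picked = []
    for i in range(1, target_tablets):
        # Split i closes a group of about current_tablets / target_tablets tablets.
        idx = (i * current_tablets) // target_tablets - 1
        picked.append(existing_splits[idx])
    return picked


if __name__ == "__main__":
    splits = [f"row{i:02d}" for i in range(1, 10)]  # 9 splits, 10 tablets
    print(thin_splits(splits, 5))
```

With 9 existing splits and a target of 5 tablets, every second split point is kept, so each new tablet covers two of the old ones.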

Ed Coleman

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Tuesday, February 15, 2022 1:11 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start

Well,

I am trying to merge a large table (8T with 72k tablets, with default tablet size 1G)

Since I keep getting those ZooKeeper errors related to size, I keep bumping jute.maxbuffer, and now it is all the way up to 8m.

But I still can't merge even a small subset (-b and -e). Now the error is: Xid out of order. Got Xid xx with err -101 expected Xid yy for a packet with details:

After this the master crashes.

Any suggestions on how to go about merging this large table?

Thanks

-S

________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Thursday, February 10, 2022 9:20 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start

Thanks Ed,

That saved the day. The confusing part of setting that property is that the documentation does not say whether it needs hex, bytes, etc. Even the example they provide here

https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit

states they are setting the value to 2mb but the value really looks like 200k (with 5 0)

------------------------------------

Add the following line to increase the file size limit to 2MB:

SOLR_OPTS="$SOLR_OPTS -Djute.maxbuffer=0x200000"

-------------------------------------
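To compare the values mentioned in this thread, a small conversion helps. The sketch assumes the property accepts either a plain decimal number or a 0x-prefixed hex number, as a Java integer would. The helper names are made up.

```python
def decode_size(text: str) -> int:
    """Accept decimal ("4194304") or 0x-prefixed hex ("0x200000")."""
    text = text.strip().lower()
    return int(text, 16) if text.startswith("0x") else int(text, 10)


def describe(text: str) -> str:
    """Show a setting as bytes and MiB."""
    n = decode_size(text)
    return f"{text} = {n} bytes = {n / 2**20:.2f} MiB"


if __name__ == "__main__":
    for value in ("0xfffff", "0x200000", "30000000"):
        print(describe(value))
```

Read as hex, 0x200000 is 2097152 bytes, i.e. 2 MiB, so the Solr example does match its description. 0xfffff is 1048575, the same number the master log reports in this thread.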

Anyway, a master has been up and running for an hour now, so I am just trying to understand what was changed so I can revert it after things stabilize.

Thanks a bunch.

-S

________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 5:54 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start

You might want to set the accumulo (zookeeper client) side - by setting ACCUMULO_JAVA_OPTS that is processed in accumulo-env.sh (or just edit that file?)

Looking at the Zookeeper documentation it describes what looks like you are seeing:

When jute.maxbuffer in the client side is less than the server side, the client wants to read the data exceeds jute.maxbuffer in the client side, the client side will get java.io.IOException: Unreasonable length or Packet len is out of range!

Also, a search showed Jira tickets that had a server-side limit of 4MB but client limits of 1MB - you may want to see if 4194304 (or larger) works as a value.
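A simplified way to picture the mismatch: compare the reported packet length against the limit configured on each side. This is only a model of the comparison, not ZooKeeper's actual code; the function name is hypothetical and the numbers are taken from this thread.

```python
def limits_exceeded(packet_len: int, limits: dict) -> list:
    """Return the sides whose configured limit is smaller than the packet."""
    return [side for side, limit in limits.items() if packet_len > limit]


# Length from the "Packet len ... is out of range" error in this thread.
PACKET_LEN = 2791161

# Server raised to 30000000 while the client still uses 1048575.
print(limits_exceeded(PACKET_LEN, {"server": 30000000, "client": 1048575}))

# Both sides set to 4194304, the value suggested above.
print(limits_exceeded(PACKET_LEN, {"server": 4194304, "client": 4194304}))
```

In the first case only the client side is too small, which fits the error persisting after the ZooKeeper servers alone were reconfigured.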

________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 5:25 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start

jute.maxbuffer is a ZooKeeper property - it needs to be set on the zookeeper configuration.  If this is still correct, then it looks like there are a few options: https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit

But maybe the ZooKeeper documentation for your version can provide additional guidance?

________________________________

From: Shailesh Ligade <SL...@FBI.GOV>>
Sent: Wednesday, February 9, 2022 5:02 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: RE: accumulo 1.10.0 masters won't start

Thanks

Even if I set jute.maxbuffer on zookeeper in conf/java.env file to

-Djute.maxbuffer=300000

I see in accumulo master log as

INFO: jute.maxbuffer value is 1048575 Bytes

I am not sure where to set that on the Accumulo side.

I set instance.zookeeper.timeout value to 90s in accumulo-site.xml

But still get those zookeeper KeeperErrorCode errors

-S

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 4:27 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: [EXTERNAL EMAIL] - Re: accumulo 1.10.0 masters won't start

I would not recommend setting the goal state directly until there are no other alternatives.

It is hard to recommend what to do, because it is unclear what put you into the current situation and what action / impact you might have had trying to fix things -

why did the goal state become unset in the first place?

what did you stuff into the fates that increased the need for larger jute buffers?

It could be that the number of tables and servers pushed you over the limit - or it could be something else.

What I would do.

Shutdown accumulo and make sure all services / tservers are stopped.

Shutdown any other services that might be using ZooKeeper.

Shutdown ZooKeeper.

Set the larger jute.maxbuffer and increase the timeout values across the board and in any dependent services.

Start hdfs - if you needed to shut it down.

Start just zookeeper - and use zkCli.sh to examine the state of things.  If that looks okay.

Start just the master - how far does it come up?  It will not be able to load the root / metadata tables, but it may give some indication of state.

I'd then cycle between stopping the master and trying to clean up things with zkCli.sh, using any guidance from the errors the master is generating. If that looks promising, then:

With the master stopped - start the tservers and check a few logs. If there are exceptions, determine whether they point to an issue or are just something transient that is handled.

Once the tservers are up and looking okay - start the master.

One of the things to grab as soon as you can get the shell to run - get a listing of the tables and the ids.  If the worst happens, you can use that to map the existing data into a "new" instance. Hopefully it will not come to that and you will not need it - but if you don't have it and you need it, well... The table names and id are all in ZooKeeper.

Ed Coleman

________________________________

From: Shailesh Ligade <SL...@FBI.GOV>>
Sent: Wednesday, February 9, 2022 3:47 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: RE: accumulo 1.10.0 masters won't start

Thanks I can try that,

At this point, my goal is to get Accumulo up. I was just wondering: if I set a different goal state like SAFE_MODE, will it come up by ignoring fate and other issues? If it comes up, can I switch back to NORMAL? Will that work? I understand there may be some data loss.

-S

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 3:36 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: [EXTERNAL EMAIL] - Re: accumulo 1.10.0 masters won't start

For values in zoo.cfg see: https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#sc_advancedConfiguration

maxSessionTimeout

In the accumulo config  - #instance.zookeeper.timeout=30s

The zookeeper setting controls the max time that the ZK servers will grant - the accumulo setting is how much time accumulo will ask for.
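The interaction between the two settings can be sketched as a clamp: the server grants the requested timeout only within its min/max bounds, which the ZooKeeper docs give as 2 and 20 times tickTime by default. The tickTime of 2000 ms used here is an assumption (a common zoo.cfg value), and the function name is made up.

```python
from typing import Optional


def negotiated_timeout_ms(requested_ms: int, tick_time_ms: int = 2000,
                          min_ms: Optional[int] = None,
                          max_ms: Optional[int] = None) -> int:
    """Clamp the client's requested session timeout to the server's bounds."""
    if min_ms is None:
        min_ms = 2 * tick_time_ms   # default minSessionTimeout
    if max_ms is None:
        max_ms = 20 * tick_time_ms  # default maxSessionTimeout
    return max(min_ms, min(requested_ms, max_ms))


if __name__ == "__main__":
    print(negotiated_timeout_ms(30_000))                  # Accumulo's 30s request
    print(negotiated_timeout_ms(90_000))                  # a 90s request, default bounds
    print(negotiated_timeout_ms(90_000, max_ms=120_000))  # after raising the server max
```

Under these assumptions, asking for 90s on the Accumulo side has no effect beyond 40s unless maxSessionTimeout is also raised in zoo.cfg.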


RE: accumulo 1.10.0 masters won't start

Posted by dev1 <de...@etcoleman.com>.

I would use importdirectory [src_dir] [fail_dir] -t new_table false

I would move the files from under accumulo (either shutdown, or at least have the table offline) into hdfs directories for each batch. (10K or so) batch1, batch2,… I think importdirectory expects a flat directory of just files.
Then I can import one batch, check for errors, repeat.

For the table you create, the splits will be whatever you set.  Again, maybe 1 split for each tserver (that’s about optimum for the batch import).  Set the split size higher.  The import command will then place the files according to the splits on the new table, so in your case it is likely that multiple files from your current splits will be assigned to 1 tserver, effectively being a merge.

My approach to these things is to create scripts based on the info that I have, but have them so that I run them so I can see if things are progressing and make adjustments if not.  I use individual scripts so that I have positive control.  Pipe, grep, awk to build the commands, review the files as a sanity check and then run them.

Ed Coleman

From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 16, 2022 9:37 AM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start

Thanks

Since hdfs fsck is fine on /accumulo, I can back up my tables to some location within HDFS (not under /accumulo) and reinitialize Accumulo.
Then I can recreate my tables/users on the new instance.
What would be the command to import/load the existing HDFS data into a newly created table? The importtable command creates a new table as well, so I guess I need to test it somewhere.
Also, while loading the old data into the new table, what can I do to get rid of these splits/tablets?

I think this will be a faster approach for me to recover.

Thank you so much

-S
________________________________
From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 16, 2022 9:29 AM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start

> What happens if I let the #tablets grow?

It sounds like you might be in the worst case now?  There is overhead for each tablet - at what point the master / other things fall over is not something I've tried to find out. Even scanning the metadata table and the gc process are doing a lot of work to track that many files / tablets, and it is likely unnecessary.

What is the command / arguments that you are using for compactions?  The comment about a minimal sleep after every 100 compact commands is confusing to me.

Can you buffer the replication?

You might be able to:
 - create a new table.
 - point the replication to write to the new table.
 - ingest data from the old into the new.

You should look towards picking a split threshold so that you have 1 or maybe a few tablets per tserver (with some reasonable split size.)  Split sizes of 3G or 5G are not uncommon - and larger is reasonable.

Ed Coleman
________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 16, 2022 8:05 AM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start

Thanks Ed,

We have two clusters: one is the source, the other is the peer. I am testing this large table on the peer first (eventually I need to do the same on the source cluster). I am not stopping ingest on the source cluster, so replication will continue to the peer table; however, while I am doing this not much ingest is happening.

I tried range compaction along with range merge. However, a merge takes forever (even over a single range; I didn't try many, just the first few), and before it finishes I get a ZooKeeper error and both masters crash. After that I have to bump the jute setting (in both java.env on ZooKeeper and accumulo-env on Accumulo) to get things back. So I left merges alone and am just trying the 72k compactions. Since compactions are not backing up, I am doing a minimal sleep after every 100 compact commands. But sometimes during compactions I do get ZooKeeper errors and the master crashes.

I do get your idea of creating a new table with fewer splits, so that the new table will be compacted. However, for that I will need to stop ingest on the primary and then set up replication on the new cluster again. I was avoiding that, but I guess that may be my only option.

A last question: if I let this table continue to grow tablets, what is the worst-case scenario? How might it affect system performance?

Thanks

-S
________________________________
From: dev1 <de...@etcoleman.com>>
Sent: Tuesday, February 15, 2022 3:11 PM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>>
Subject: [External] RE: accumulo 1.10.0 masters won't start


Can you compact the table?  How aggressive do you want to get? I do not understand why you are getting the ZooKeeper errors – is it related to the number of tablets, or is it something else?  (an iterator that was attached with a very large set of arguments, a very large list or some sort of binary data – say a bloom filter)



Here is what I would do if it were me – you need to balance your goals and requirements, which might dictate a less aggressive approach.  At this point I’m assuming that getting things back online without data loss is the top priority (and that I was sure the problem was not related to something that I attached to the table).



This also assumes I have room and can compact the table(s). It could also depend on how long a compaction would take and if I could wait.  It is generally preferable to work on files that have had any deleted data removed, and compaction can reduce the total number of files, because files from minor compactions and bulk ingest are combined into a single file for that tablet.



Stop ingest.

Flush the source table – allow any compactions to settle. (Optional if compacting, but should be a quick command to execute)

(Optional – compact the original.)

Clone the source table.

Compact the clone so that the clone does not share any files with the source

Optionally – use the exporttable command to generate a list of files from the clone – you may not need it, but it could be handy

Take the clone offline.

Move the files under /accumulo/tables/[clone-id]/dirs to one or more staging directories (in hdfs) – the export list could help.

Delete the clone table – (I believe that the delete will not check for the existence of files if it is offline.) If not, then it would be necessary to use an empty rfile as a placeholder.

Create a new table and set splits – this could be your desired number – or use just enough splits that each tserver has 1 tablet.

Set the default table split size to some multiple of the desired final size – this limits splitting during the imports. Not critical, but may be faster.

Take the new table offline and then back online – this will immediately migrate the splits – or you could just wait for the migration to finish.

Bulk import the files from the staging area(s) – likely in batches.  You will likely have ~72K files – so maybe ~7,000 files / batch?

Once all files have been imported set the split threshold to desired size.

Check that permissions, users, iterators, table config parameters are present on the new table and match the source.

Rename the source table to old_xxx or whatever

Rename the new table to the source table, verify things are okay and delete the original.
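
The staging and batching steps above could be planned with a small script like this one. It is only a sketch: it prints `hdfs dfs` commands instead of running them, the function name `plan_batches` and the `batch-NNN` directory layout are made up for illustration, and the batch size is whatever you pick (the ~7,000 above is a guess, not a tested limit).

```shell
#!/usr/bin/env bash
# Read rfile paths on stdin and print the hdfs commands that would spread
# them over staging sub-directories of at most batch_size files each.
# Nothing is executed; review the output before piping it to a shell.
plan_batches() {
  local staging="$1" batch_size="$2" n=0 dir f
  while IFS= read -r f; do
    dir=$(printf '%s/batch-%03d' "$staging" $(( n / batch_size )))
    # First file of a batch: create the batch directory.
    if (( n % batch_size == 0 )); then
      printf 'hdfs dfs -mkdir -p %s\n' "$dir"
    fi
    printf 'hdfs dfs -mv %s %s/\n' "$f" "$dir"
    n=$(( n + 1 ))
  done
}

# Example with three hypothetical file names and two files per batch.
printf '%s\n' /tmp/clone/a.rf /tmp/clone/b.rf /tmp/clone/c.rf | plan_batches /staging 2
```

With roughly 72K files and 7,000 files per batch this comes to 11 batch directories, each of which can then be bulk imported into the new table on its own.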



If you don’t have the space, you could skip operating on the clone, but then you can’t fall back to the original if things go wrong.



Another way would be to use importtable, but you need to make sure that it doesn’t just recreate the original splits, otherwise you end up with the same 72K files.



Ed Coleman





From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Tuesday, February 15, 2022 1:11 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start



Well,



I am trying to merge a large table (8T with 72k tablets, with default tablet size 1G)



Since I keep getting those zookeeper errors related to size, I keep on bumping jute.maxbuffer, and now it is all the way up to 8m



But still I can't merge even a small subset (-b and -e). Now the error is Xid out of order. Got Xid xx with err -101 expected Xid yy for a packet with details:



after this master crashes



Any suggestions on how to go about merging this large table?



Thanks



-S

________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Thursday, February 10, 2022 9:20 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



Thanks Ed,



That saved the day. The confusing part of setting that property is that the documentation does not make clear whether it needs hex or bytes, etc. Even the example they provided here

https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit

states they are setting the value to 2mb but the value really looks like 200k (with 5 zeros)



------------------------------------

Add the following line to increase the file size limit to 2MB:

SOLR_OPTS="$SOLR_OPTS -Djute.maxbuffer=0x200000"

-------------------------------------
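
The value in that example is hexadecimal, which is where the confusion comes from: `0x200000` is not 200,000. A quick way to check, using only shell arithmetic (the values are the ones that appear in this thread):

```shell
#!/usr/bin/env bash
# Shell arithmetic understands the 0x prefix, so it can convert the
# hexadecimal jute.maxbuffer values to plain byte counts.
echo $(( 0x200000 ))   # 2097152 bytes = 2 MiB, the Solr guide's example
echo $(( 0xfffff ))    # 1048575 bytes, the value reported in the master log
echo $(( 0x600000 ))   # 6291456 bytes = 6 MiB
```

Whether a given ZooKeeper or Accumulo version accepts the hex form on the command line is worth confirming in its own documentation; a plain decimal byte count such as 4194304 avoids the question.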



Anyway, a master has been up and running for an hour now, so I am just trying to understand what was changed so I can revert it after things stabilize.



Thanks a bunch.



-S



________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 5:54 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start



You might want to set the accumulo (zookeeper client) side - by setting ACCUMULO_JAVA_OPTS that is processed in accumulo-env.sh (or just edit that file?)



Looking at the ZooKeeper documentation, it describes what looks like what you are seeing:



When jute.maxbuffer in the client side is less than the server side, the client wants to read the data exceeds jute.maxbuffer in the client side, the client side will get java.io.IOException: Unreasonable length or Packet len is out of range!



Also, a search showed jira tickets that had a server side limit of 4MB, but client limits of 1MB - you may want to see if 4194304 (or larger) works as a value.
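
As a sketch of what that could look like: the snippet below appends the property to `ACCUMULO_JAVA_OPTS` (the variable name comes from the message above; the exact variable that `accumulo-env.sh` honors differs between Accumulo versions, so check your copy of the file). It also includes a small helper, `fits_in_buffer`, which is a made-up name, to compare a reported packet length against a limit.

```shell
#!/usr/bin/env bash
# Client-side limit, in bytes. 4194304 is the 4MB value suggested above.
JUTE_MAXBUFFER=4194304

# Append rather than overwrite, so existing JVM options are kept.
export ACCUMULO_JAVA_OPTS="${ACCUMULO_JAVA_OPTS:-} -Djute.maxbuffer=${JUTE_MAXBUFFER}"

# Succeeds when a packet of the given length is strictly below the limit.
fits_in_buffer() {
  local packet_len="$1" limit="$2"
  (( packet_len < limit ))
}

# Packet length from the original error, against the old and new limits.
fits_in_buffer 2791093 1048575 && echo "fits in 1048575" || echo "too large for 1048575"
fits_in_buffer 2791093 "$JUTE_MAXBUFFER" && echo "fits in $JUTE_MAXBUFFER" || echo "too large for $JUTE_MAXBUFFER"
```

The same value needs to be in place on the ZooKeeper servers too, since the limit is checked on both sides.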



________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 5:25 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



jute.maxbuffer is a ZooKeeper property - it needs to be set on the zookeeper configuration.  If this is still correct, then it looks like there are a few options: https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit



But maybe the ZooKeeper documentation for your version can provide additional guidance?





________________________________

From: Shailesh Ligade <SL...@FBI.GOV>>
Sent: Wednesday, February 9, 2022 5:02 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: RE: accumulo 1.10.0 masters won't start



Thanks



Even if I set jute.maxbuffer on zookeeper in conf/java.env file to



-Djute.maxbuffer=300000



I see in accumulo master log as



INFO: jute.maxbuffer value is 1048575 Bytes

Not sure where to set that on the accumulo side.



I set instance.zookeeper.timeout value to 90s in accumulo-site.xml



But still get those zookeeper KeeperErrorCode errors



-S



From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 4:27 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: [EXTERNAL EMAIL] - Re: accumulo 1.10.0 masters won't start



I would not recommend setting the goal state directly unless there are no other alternatives.



It is hard to recommend what to do, because it is unclear what put you into the current situation and what action / impact you might have had trying to fix things -



why did the goal state become unset in the first place?

what did you stuff into the fates that increased the need for larger jute buffers?



It could be that the number of tables and servers pushed you over the limit - or it could be something else.



What I would do.



Shutdown accumulo and make sure all services / tservers are stopped.

Shutdown any other services that might be using ZooKeeper.

Shutdown ZooKeeper.



Set the larger jute.maxbuffer and increase the timeout values across the board and in any dependent services.



Start hdfs - if you needed to shut it down.

Start just zookeeper - and use zkCli.sh to examine the state of things.  If that looks okay:

Start just the master - how far does it come up?  It will not be able to load the root / metadata tables, but it may give some indication of state.



I'd then cycle between stopping the master and trying to clean up things using zkCli.sh, guided by the errors the master is generating. If that looks promising, then:



With the master stopped - start the tservers and check a few logs. If there are exceptions, determine whether they point to an issue - or are just something transient and handled.



Once the tservers are up and looking okay - start the master.



One of the things to grab as soon as you can get the shell to run - get a listing of the tables and their ids.  If the worst happens, you can use that to map the existing data into a "new" instance. Hopefully it will not come to that and you will not need it - but if you don't have it and you need it, well... The table names and ids are all in ZooKeeper.



Ed Coleman



________________________________

From: Shailesh Ligade <SL...@FBI.GOV>>
Sent: Wednesday, February 9, 2022 3:47 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: RE: accumulo 1.10.0 masters won't start



Thanks I can try that,



At this point, my goal is to get accumulo up. I was just wondering: if I set a different goal state like SAFE_MODE, will it come up by ignoring fate and other issues? If it comes up, can I switch back to NORMAL, and will that work? I understand there may be some data loss.



-S



From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 3:36 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: [EXTERNAL EMAIL] - Re: accumulo 1.10.0 masters won't start



For values in zoo.cfg see: https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#sc_advancedConfiguration



maxSessionTimeout



In the accumulo config - #instance.zookeeper.timeout=30s



The zookeeper setting controls the max time that the ZK servers will grant - the accumulo setting is how much time accumulo will ask for.
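
Put differently, the session timeout a client ends up with is its requested value clamped to the server's range. A sketch of that arithmetic, assuming the ZooKeeper defaults of 2 and 20 ticks for `minSessionTimeout` and `maxSessionTimeout` and a `tickTime` of 2000 ms (check `zoo.cfg` for the real values; the function name `negotiated_timeout` is made up):

```shell
#!/usr/bin/env bash
# Clamp the timeout a client asks for to the range the server allows.
# All values are in milliseconds.
negotiated_timeout() {
  local requested="$1" min="$2" max="$3"
  if (( requested < min )); then
    echo "$min"
  elif (( requested > max )); then
    echo "$max"
  else
    echo "$requested"
  fi
}

tick=2000
# Accumulo asking for 90s against the assumed default server limits.
negotiated_timeout 90000 $(( 2 * tick )) $(( 20 * tick ))
```

Under those assumptions a 90s request is cut to 40s unless `maxSessionTimeout` is raised on the ZooKeeper side as well.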








________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 3:03 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



thanks for response,



no, I have not updated any timeout



is that going in zoo.cfg? I can see there is min/maxSessionTimeout 2/20, is that what you are referring to?



-S

________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 2:51 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start



Have you tried to increase the zoo session timeout value? I think it's zookeeper.session.timeout.ms



________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 2:47 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



Thanks



That fixed the goal state issue but now I am still getting



Errors with zookeeper

e.g.



KeeperErrorCode = ConnectionLoss for



/accumulo/<instance-id>/config/tserver.hold.time.max

/accumulo/<instance-id>/tables

/accumulo/<instance-id>/tables/1/name

/accumulo/<instance-id>/fate

/accumulo/<instance-id>/masters/goal_state



So it is all over …some I see good values in zookeeper…so not sure..  🙁



-S



________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 2:22 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start



There is a utility - SetGoalState - that can be run from the command line



accumulo SetGoalState NORMAL



(or SAFE_MODE, CLEAN_STOP)



It sets a value in ZK at /accumulo/instance-id/masters/goal_state
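
A small wrapper can guard against typos in the state name before anything is written to ZooKeeper. This is a sketch: `valid_goal_state` and `set_goal_state` are made-up helper names, and the `accumulo SetGoalState` invocation is copied from the message above rather than verified against a particular release.

```shell
#!/usr/bin/env bash
# Succeeds only for the three goal states named in this thread.
valid_goal_state() {
  case "$1" in
    NORMAL|SAFE_MODE|CLEAN_STOP) return 0 ;;
    *) return 1 ;;
  esac
}

# Refuse unknown states; otherwise hand off to the accumulo launcher.
set_goal_state() {
  if ! valid_goal_state "$1"; then
    echo "unknown goal state: $1" >&2
    return 1
  fi
  accumulo SetGoalState "$1"
}
```

Calling `set_goal_state NORMAL` would then run the utility, while a misspelled state is rejected before the launcher is invoked.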



Ed Coleman



________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 1:54 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



Well,



i just went ahead and deleted fate in zookeeper and restarted the master..it was doing better but then i am getting different error



ERROR: Problem getting real goal state from zookeeper: java.lang.IllegalArgumentException: No enum constant org.apache.accumulo.core.master.thrift.MasterGoalState



I hope I didn't delete goal_state accidentally ...;-( currently ls on goal_state is [], is there a way to add some value there?



-S

________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 1:32 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start



Did you try setting the increased size in the zkCli.sh command (or wherever it gets its environment from)?



The ZK docs indicate that it needs to be set to the same size on all servers and clients.



You should be able to use zkCli.sh to at least see what's going on - if that does not work, then it seems unlikely that the master would either.



Can you:

  *   list the nodes under /accumulo/[instance id]/fate?
  *   use the stat command on each of the nodes - the size is one of the fields.
  *   list nodes under any of the /accumulo/[instance id]/fate/tx-#####
  *   there should be a node named debug - doing a get on that should show the op name.
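
Those checks could be run non-interactively along the following lines. This sketch only prints the commands; `print_fate_checks` is a made-up helper, the server address and instance id are placeholders, and passing the larger buffer through `CLIENT_JVMFLAGS` assumes the stock `zkCli.sh` / `zkEnv.sh` scripts.

```shell
#!/usr/bin/env bash
# Print zkCli.sh invocations for inspecting the fate node of an instance.
# $1 = ZooKeeper host:port, $2 = Accumulo instance id, $3 = jute.maxbuffer bytes
print_fate_checks() {
  local zk="$1" fate="/accumulo/$2/fate" flags="-Djute.maxbuffer=$3"
  local cmd
  for cmd in ls stat; do
    printf 'CLIENT_JVMFLAGS="%s" zkCli.sh -server %s %s %s\n' "$flags" "$zk" "$cmd" "$fate"
  done
}

print_fate_checks localhost:2181 INSTANCE_ID 4194304
```

For each `tx-...` child that `ls` returns, a `get` on its `debug` node should show the operation name, as described above.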



Ed Coleman

________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 12:54 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



Thanks



I added



-Djute.maxbuffer=30000000



In conf/java.env and restarted all zookeepers but I am still getting the same error. The documentation is kind of fuzzy on setting this property, as it states the value in hex (default 0xfffff), so I am not 100% sure if 30000000 is ok, but at least I could see zookeeper was up



-S



________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 12:26 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start



Does the monitor or any of the logs show errors that relate to exceeding the ZooKeeper jute buffer size?



If so, have you tried increasing the ZooKeeper jute.maxbuffer limit (https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#Unsafe+Options)?



Ed Coleman





________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 11:49 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: accumulo 1.10.0 masters won't start



Hello,



Both of my masters are stuck with an error on ZooKeeper:



IOException: Packet len 2791093 is out of range!

KeeperErrorCode = ConnectionLoss for /accumulo/<instance_id>/fate





If I use zkCli to see what is under fate, I get



IOException Packet len 2791161 is out of range

Unable to read additional data from server sessionid xxxx, likely server has closed socket



hdfs fsck is all good



How can I clear this fate?



master process is up and I can get into accumulo shell, but there are no fate (fate print returns empty)



Any idea how to bring the master up?



Thanks



S

Re: accumulo 1.10.0 masters won't start

Posted by "Ligade, Shailesh [USA]" <Li...@bah.com>.

Thanks

Since hdfs fsck is fine on /accumulo, I can back up my tables to some location within hdfs (not under accumulo) and reinitialize accumulo.
Then I can recreate my tables/users on the new instance.
What would be the command to import/load the existing hdfs data into the newly created table? The importtable command creates a new table as well, so I guess I need to test it somewhere.
Also, while loading the old data into the new table, what can I do to get rid of these splits/tablets?

I think this will be the faster approach for me to recover.

Thank you so much

-S
________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 16, 2022 9:29 AM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

> What happens if I let the #tablets grown?

It sounds like you might be in the worst case now?  There is over head for each tablet - at what point the master / other things fall over is not something I've tried to find out. Even scanning the metadata table and gc process are doing a lot of work to track process that many files / tablets and it likely unnecessary.

What is the command / arguments that you are using for compactions?  The comment minimal sleep after 100 compaction commands is confusing to me.

Can you buffer the replication?

You might be able to:
 - create a new table.
 - point the replication to write to the new table.
 - ingest data from the old into the new.

You should look towards picking a split threshold so that you have 1 or maybe a few tablets per tserver (with some reasonable split size.)  Split sizes of 3G or 5G are not uncommon - and larger is reasonable.

Ed Coleman
________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 16, 2022 8:05 AM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start

Thanks Ed,

We have two clusters, one is source other is peer. I am testing this large table on peer first (eventually i need to do the same on source cluster). i am not stopping ingest on source cluster so replication will continue the peer table, however while i am doing this not much ingest is happening.


I tried the range compaction along with range merge however, merge takes forever (even over single range.. i didn't try many just first few) and before it finishes i get zookeeper error and both master crash. After I bump that jute setting (on both java.env on zookeeper and accumulo-env on accumlo) to get it back. So left merges alone and just trying 72k compaction, since compactions are not backing up, i am doing minimal sleep after every 100 compact commands. But sometimes during compactions, i do get zookeeper errors and master crash.

I do get your idea or create new table with less splits that way the new table will be compacted. However, for that i will need to stop ingest on primary, and then setup replication on the new cluster again..i was avoiding that. but i guess that may be my only option.

A last question, if I let this table continue to grow tablets, what is the worst case scenerio? How it may affect system performance?

Thanks

-S
________________________________
From: dev1 <de...@etcoleman.com>
Sent: Tuesday, February 15, 2022 3:11 PM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>
Subject: [External] RE: accumulo 1.10.0 masters won't start


Can you compact the table?  How aggressive do you want to get? I do not understand why you are getting the ZooKeeper errors – is it related to the number of tablets, or is it something else?  (an iterator that was attached with a very large set of arguments, a very large list or some sort of binary data – say a bloom filter)



If it were me – you need to balance your goals and requirements that might dictate a less aggressive approach.  At this point I’m assuming that getting things back on line without data loss is the top priority. And if I was sure that it was not related to something that I attached to the table)



If I have room and can compact the table(s). It could also depend on how long a compaction would take and if I could wait.  It is generally preferable to work on files that have any deleted data removed and can reduce the total number of files when files from minor compactions and bulk ingest files are combined into a single file for that tablet)



Stop ingest.

Flush the source table – allow any compactions to settle. (Optional if compacting, but should be a quick command to execute)

(Optional – compact the original.)

Clone the source table.

Compact the clone so that the clone does not share any files with the source

Optionally – use the exportable command to generate a list of files from the clone – you may not need it, but could be handy

Take the clone offline.

Move the files under /accumulo/tables/[clone-id]/dirs to one or more staging directories (in hdfs) – the export list could help.

Delete the clone table – (I believe that the delete will not check for the existence of files if it is offline.) If not, then it would be necessary to use an empty rfile as a placeholder.

Create a new table and set splits – this could be your desired number – or use just enough splits that each tserver has 1 tablet.

Set the default table split size to some multiple of the desired final size – this limits splitting during the imports. Not critical, but may be faster.

Take the new table offline and then back online – this will immediately migrate the splits – or you could just wait for the migration to finish.

Bulk import the files from the staging area(s) – likely in batches.  You will likely have ~72K files – so maybe ~7,000 files / batch?

Once all files have been imported set the split threshold to desired size.

Check that permissions, users, iterators, table config parameters are present on the new table and match the source.

Rename the source table to old_xxx or whatever

Rename the new table to the source table, verify things are okay and delete the original.



If you don’t have the space, you could skip operating on the clone, but then you can’t fall back to the original if things go wrong.



Another way would be to use the importable, but you need to make sure that it doesn’t just recreate the original splits, otherwise you end up with the same 72K files.



Ed Coleman





From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Tuesday, February 15, 2022 1:11 PM
To: user@accumulo.apache.org
Subject: Re: accumulo 1.10.0 masters won't start



Well,



I am trying to merge a large table (8T with 72k tablets, with default tablet size 1G)



Since I keep getting those zookeeper errors realted to size, i keep on bumping the jute.maxbuffer adn now it is all the way to 8m



But still i can't merge even for small subset (-b and -e) Now the error is Xid out of order, Gox Xid xx with err -101 expected Xid yy for a packet with details:



after this master crashes



Any suggestion how to go about and how to merge this large table?



Thanks



-S

________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Thursday, February 10, 2022 9:20 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



Thanks Ed,



That saved the day. The confusing part setting up that property is documentation if it needs hex or bytes etc. Even the example they provided here

https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit<https://urldefense.com/v3/__https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html*increasing-the-file-size-limit__;Iw!!May37g!dgoe7b7JQfco1hl_OGnhmp98GibG_1En56_s_KMHeTxlLgRECJIoqNtDeascfM1hqA$>

Setting Up an External ZooKeeper Ensemble | Apache Solr Reference Guide 7.4 - The Apache Software Foundation<https://urldefense.com/v3/__https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html*increasing-the-file-size-limit__;Iw!!May37g!dgoe7b7JQfco1hl_OGnhmp98GibG_1En56_s_KMHeTxlLgRECJIoqNtDeascfM1hqA$>

The solution to this problem is to set up an external ZooKeeper ensemble, which is a number of servers running ZooKeeper that communicate with each other to coordinate the activities of the cluster.

solr.apache.org



states they are setting the value to 2mb but the value really looks like 200k (with 5 0)



------------------------------------

Add the following line to increase the file size limit to 2MB:

SOLR_OPTS="$SOLR_OPTS -Djute.maxbuffer=0x200000"

-------------------------------------



Anyways, a master is up and running for an hour now..so just trying to understand what was changed and revert it after it stabilize.



Thanks a bunch.



-S



________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 5:54 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start



You might want to set the accumulo (zookeeper client) side - by setting ACCUMULO_JAVA_OPTS that is processed in accumulo-env.sh (or just edit that file?)



Looking at the Zookeeper documentation it describes what looks like you are seeing:



When jute.maxbuffer in the client side is less than the server side, the client wants to read the data exceeds jute.maxbuffer in the client side, the client side will get java.io.IOException: Unreasonable length or Packet len is out of range!



Also, a search showed jira tickets that had a server side limit of 4MB, but client limits of 1MB - you may want to see if 4194304 (or larger) as a value works,



________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 5:25 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



jute.maxbuffer is a ZooKeeper property - it needs to be set on the zookeeper configuration.  If this is still correct, then it looks like there are a few options https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit<https://urldefense.com/v3/__https:/solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html*increasing-the-file-size-limit__;Iw!!May37g!ewlkGRNFLrKEpeF1Lz8vRt_oBtpgi8hVvvnCrp1Dq4_8Xprb4tEHWiHVFW7EVUT3vA$>



But maybe the ZooKeeper documentation for your version can provide additional guidance?

Setting Up an External ZooKeeper Ensemble | Apache Solr Reference Guide 7.4 - The Apache Software Foundation<https://urldefense.com/v3/__https:/solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html*increasing-the-file-size-limit__;Iw!!May37g!ewlkGRNFLrKEpeF1Lz8vRt_oBtpgi8hVvvnCrp1Dq4_8Xprb4tEHWiHVFW7EVUT3vA$>

The solution to this problem is to set up an external ZooKeeper ensemble, which is a number of servers running ZooKeeper that communicate with each other to coordinate the activities of the cluster.

solr.apache.org





________________________________

From: Shailesh Ligade <SL...@FBI.GOV>>
Sent: Wednesday, February 9, 2022 5:02 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: RE: accumulo 1.10.0 masters won't start



Thanks



Even if I set jute.maxbuffer on zookeeper in conf/java.env file to



-Djute.maxbuffer=300000



I see in accumulo master log as



INFO: jute.maxbuffer value is 1048575 Bytes    not sure where to set that on accumulo side.



I set instance.zookeeper.timeout value to 90s in accumulo-site.xml



But still get those zookeeper KeeperErrorCode errors



-S



From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 4:27 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: [EXTERNAL EMAIL] - Re: accumulo 1.10.0 masters won't start



I would not recommend setting the goal state directly unlit there are no other alternatives.



It is hard to recommend what to do, because it is unclear what put you into the current situation and what action / impact you might have had trying to fix things -



why did the goal state become unset in the first place?

what did you stuff into the fates that increased the need for larger jute buffers?



It could be that the number of tables and servers pushed you over the limit - or it could be something else.



What I would do.



Shutdown accumulo and make sure all services / tservers are stopped.

Shutdown any other services that might be using ZooKeeper.

Shutdown ZooKeeper.



Set the larger jute.maxbuffer and increase the timeout values across the board and in any dependent services.



Start hdfs - if you needed to shut it down.

Start just zookeeper - and use zkCli.sh to examine the state of things.  If that looks okay:

Start just the master - how far does it come up?  It will not be able to load the root / metadata tables, but it may give some indication of state.



I'd then cycle between stopping the master and trying to clean things up with zkCli.sh, guided by any errors the master is generating. If that looks promising, then:



With the master stopped - start the tservers and check a few logs. If there are exceptions, determine whether they point to an issue - or are just something transient and handled.



Once the tservers are up and looking okay - start the master.



One of the things to grab as soon as you can get the shell to run - get a listing of the tables and the ids.  If the worst happens, you can use that to map the existing data into a "new" instance. Hopefully it will not come to that and you will not need it - but if you don't have it and you need it, well... The table names and id are all in ZooKeeper.



Ed Coleman



________________________________

From: Shailesh Ligade <SL...@FBI.GOV>>
Sent: Wednesday, February 9, 2022 3:47 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: RE: accumulo 1.10.0 masters won't start



Thanks I can try that,



At this point, my goal is to get accumulo up. I was just wondering: if I set a different goal state like SAFE_MODE, will it come up by ignoring fate and other issues? If it comes up, can I switch back to NORMAL, and will that work? I understand there may be some data loss.



-S



From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 3:36 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: [EXTERNAL EMAIL] - Re: accumulo 1.10.0 masters won't start



For values in zoo.cfg see: https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#sc_advancedConfiguration



maxSessionTimeout



In the accumulo config  - #instance.zookeeper.timeout=30s



The zookeeper setting controls the max time that the ZK servers will grant - the accumulo setting is how much time accumulo will ask for.
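
To make that relationship concrete: per the ZooKeeper admin guide, the server limits whatever timeout a client requests to the range [minSessionTimeout, maxSessionTimeout], which default to 2 and 20 times tickTime. A minimal sketch of that rule, assuming the stock tickTime of 2000 ms (the function name and parameters are made up for illustration, not a ZooKeeper API):

```python
def negotiated_timeout_ms(requested_ms, tick_time_ms=2000,
                          min_ticks=2, max_ticks=20):
    """Clamp a requested session timeout to the server's allowed range.

    Assumes the documented defaults: min = 2 * tickTime, max = 20 * tickTime.
    """
    lowest = min_ticks * tick_time_ms
    highest = max_ticks * tick_time_ms
    return max(lowest, min(requested_ms, highest))


print(negotiated_timeout_ms(30_000))  # 30000
print(negotiated_timeout_ms(90_000))  # 40000
```

Under those assumed defaults, asking for 90s from the accumulo side would still be capped at 40s unless maxSessionTimeout is raised in zoo.cfg.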








________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 3:03 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



Thanks for the response,



No, I have not updated any timeout.



Is that going in zoo.cfg? I can see there is min/maxSessionTimeout 2/20, is that what you are referring to?



-S

________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 2:51 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start



Have you tried to increase the zoo session timeout value? I think it's zookeeper.session.timeout.ms



________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 2:47 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



Thanks



That fixed the goal state issue, but I am still getting errors with ZooKeeper, e.g.



KeeperErrorCode = ConnectionLoss for



/accumulo/<instance-id>/config/tserver.hold.time.max

/accumulo/<instance-id>/tables

/accumulo/<instance-id>/tables/1/name

/accumulo/<instance-id>/fate

/accumulo/<instance-id>/masters/goal_state



So it is all over …some I see good values in zookeeper…so not sure..  🙁



-S



________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 2:22 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start



There is a utility - SetGoalState - that can be run from the command line



accumulo SetGoalState NORMAL



(or SAFE_MODE, CLEAN_STOP)



It sets a value in ZK at /accumulo/instance-id/masters/goal_state (on 1.10 the node lives under masters, as the error paths reported in this thread show)
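
The "No enum constant" error quoted in this thread is what one would expect if the goal_state node exists but holds empty data, since an empty string is not one of the three names. A rough Python analogue, purely for illustration (the real class is the Java enum MasterGoalState; the class and function below are stand-ins, not Accumulo code):

```python
from enum import Enum


class GoalState(Enum):
    """Stand-in for the three goal states named above."""
    CLEAN_STOP = "CLEAN_STOP"
    SAFE_MODE = "SAFE_MODE"
    NORMAL = "NORMAL"


def parse_goal_state(node_data: bytes) -> GoalState:
    """Turn the bytes stored in the goal_state node into an enum member."""
    name = node_data.decode("utf-8")
    try:
        return GoalState[name]
    except KeyError:
        # Mirrors the shape of the Java message: the constant name is blank
        # when the node data is empty.
        raise ValueError(f"No enum constant GoalState.{name}") from None


print(parse_goal_state(b"NORMAL"))
```

Calling parse_goal_state(b"") raises, which is why writing a valid name back into the node (via SetGoalState) clears that particular error.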



Ed Coleman



________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 1:54 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



Well,



I just went ahead and deleted fate in zookeeper and restarted the master. It was doing better, but now I am getting a different error:



ERROR: Problem getting real goal state from zookeeper: java.lang.IllegalArgumentException: No enum constant org.apache.accumulo.core.master.thrift.MasterGoalState



I hope I didn't delete goal_state accidentally ... ;-( Currently ls on goal_state is []. Is there a way to add some value there?



-S

________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 1:32 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start



Did you try setting the increased size in the zkCli.sh command (or wherever it gets its environment from)?



The ZK docs indicate that it needs to be set to the same size on all servers and clients.



You should be able to use zkCli.sh to at least see what's going on - if that does not work, then it seems unlikely that the master would either.



Can you:

  *   list the nodes under /accumulo/[instance id]/fate?
  *   use the stat command on each of the nodes - the size is one of the fields.
  *   list nodes under any of the /accumulo/[instance id]/fate/tx-#####
  *   there should be a node named debug - doing a get on that should show the op name.



Ed Coleman

________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 12:54 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



Thanks



I added



-Djute.maxbuffer=30000000



In conf/java.env and restarted all zookeepers, but I am still getting the same error. The documentation is kind of fuzzy on setting this property, as it states the value in hex (default 0xfffff), so I am not 100% sure if 30000000 is ok, but at least I could see zookeeper was up.



-S



________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 12:26 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start



Does the monitor or any of the logs show errors that relate to exceeding the ZooKeeper jute buffer size?



If so, have you tried increasing the ZooKeeper jute.maxbuffer limit (https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#Unsafe+Options)?



Ed Coleman





________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 11:49 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: accumulo 1.10.0 masters won't start



Hello,



Both of my masters are stuck with an error on zookeeper:



IOException: Packet len 2791093 is out of range!

KeeperErrorCode = ConnectionLoss for /accumulo/<instance_id>/fate





If I use zkCli to see what is under fate, I get



IOException Packet len 2791161 is out of range

Unable to read additional data from server sessionid xxxx, likely server has closed socket



hdfs fsck is all good



How can I clear this fate?



The master process is up and I can get into the accumulo shell, but there are no fate operations (fate print returns empty)



Any idea how to bring the master up?



Thanks



S

Re: accumulo 1.10.0 masters won't start

Posted by dev1 <de...@etcoleman.com>.

> What happens if I let the #tablets grow?

It sounds like you might be in the worst case now?  There is overhead for each tablet - at what point the master / other things fall over is not something I've tried to find out. Even scanning the metadata table and the gc process are doing a lot of work to track and process that many files / tablets, and it is likely unnecessary.

What is the command / arguments that you are using for compactions?  The comment about a minimal sleep after every 100 compact commands is confusing to me.

Can you buffer the replication?

You might be able to:
 - create a new table.
 - point the replication to write to the new table.
 - ingest data from the old into the new.

You should look towards picking a split threshold so that you have 1 or maybe a few tablets per tserver (with some reasonable split size.)  Split sizes of 3G or 5G are not uncommon - and larger is reasonable.
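
As a back-of-the-envelope check on split sizes, here is a small sketch using the figures mentioned elsewhere in this thread (roughly 8T of data, about 72k tablets at a 1G threshold). It assumes data is spread evenly across tablets, which it never quite is, and the helper name is made up for illustration:

```python
import math

TB = 1024 ** 4
GB = 1024 ** 3


def approx_tablets(table_bytes, split_threshold_bytes):
    """Rough tablet count if every tablet were filled to the split threshold."""
    return math.ceil(table_bytes / split_threshold_bytes)


for gb in (1, 3, 5):
    print(f"{gb}G threshold -> ~{approx_tablets(8 * TB, gb * GB)} tablets")
```

Even at 1G this estimate comes to about 8,192 tablets, far below the 72k reported, which suggests many existing tablets are well under the threshold. Raising the threshold alone would not shrink the count; the existing tablets would still have to be merged or the data re-imported into a table with fewer splits.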

Ed Coleman
________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 16, 2022 8:05 AM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start

Thanks Ed,

We have two clusters, one is the source and the other is the peer. I am testing this large table on the peer first (eventually I need to do the same on the source cluster). I am not stopping ingest on the source cluster, so replication will continue to the peer table; however, while I am doing this not much ingest is happening.


I tried range compaction along with range merge; however, the merge takes forever (even over a single range - I didn't try many, just the first few), and before it finishes I get a zookeeper error and both masters crash. I then bump the jute setting (in both java.env on zookeeper and accumulo-env on accumulo) to get it back. So I left merges alone and am just trying the 72k compactions. Since compactions are not backing up, I am doing a minimal sleep after every 100 compact commands. But sometimes during compactions I still get zookeeper errors and the masters crash.

I do get your idea of creating a new table with fewer splits, so that the new table will be compacted. However, for that I will need to stop ingest on the primary and then set up replication on the new cluster again. I was avoiding that, but I guess that may be my only option.

A last question: if I let this table continue to grow tablets, what is the worst-case scenario? How might it affect system performance?

Thanks

-S
________________________________
From: dev1 <de...@etcoleman.com>
Sent: Tuesday, February 15, 2022 3:11 PM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>
Subject: [External] RE: accumulo 1.10.0 masters won't start


Can you compact the table?  How aggressive do you want to get? I do not understand why you are getting the ZooKeeper errors – is it related to the number of tablets, or is it something else?  (an iterator that was attached with a very large set of arguments, a very large list or some sort of binary data – say a bloom filter)



If it were me – you need to balance your goals and requirements, which might dictate a less aggressive approach.  At this point I’m assuming that getting things back on line without data loss is the top priority (and that I was sure it was not related to something that I attached to the table).



If I had room, I would compact the table(s) first. That could also depend on how long a compaction would take and whether I could wait.  It is generally preferable to work on files that have had any deleted data removed, and compacting reduces the total number of files, because files from minor compactions and bulk ingest are combined into a single file for each tablet.



Stop ingest.

Flush the source table – allow any compactions to settle. (Optional if compacting, but should be a quick command to execute)

(Optional – compact the original.)

Clone the source table.

Compact the clone so that the clone does not share any files with the source

Optionally – use the exporttable command to generate a list of files from the clone – you may not need it, but could be handy

Take the clone offline.

Move the files under /accumulo/tables/[clone-id]/dirs to one or more staging directories (in hdfs) – the export list could help.

Delete the clone table – (I believe that the delete will not check for the existence of files if it is offline.) If not, then it would be necessary to use an empty rfile as a placeholder.

Create a new table and set splits – this could be your desired number – or use just enough splits that each tserver has 1 tablet.

Set the default table split size to some multiple of the desired final size – this limits splitting during the imports. Not critical, but may be faster.

Take the new table offline and then back online – this will immediately migrate the splits – or you could just wait for the migration to finish.

Bulk import the files from the staging area(s) – likely in batches.  You will likely have ~72K files – so maybe ~7,000 files / batch?

Once all files have been imported set the split threshold to desired size.

Check that permissions, users, iterators, table config parameters are present on the new table and match the source.

Rename the source table to old_xxx or whatever

Rename the new table to the source table, verify things are okay and delete the original.
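
The batching step above (about 72K files in groups of roughly 7,000) can be planned with a few lines of code. This is only a sketch of the grouping; the file names are placeholders, and moving each batch into its own staging directory and running the bulk import on it is left out:

```python
def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder names standing in for the rfiles moved to the staging area.
files = [f"file_{i:05d}.rf" for i in range(72_000)]

plan = list(batches(files, 7_000))
print(len(plan), len(plan[0]), len(plan[-1]))  # 11 7000 2000
```

Smaller batches make it easier to stop and inspect the table between imports if something looks wrong.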



If you don’t have the space, you could skip operating on the clone, but then you can’t fall back to the original if things go wrong.



Another way would be to use importtable, but you need to make sure that it doesn’t just recreate the original splits, otherwise you end up with the same 72K files.



Ed Coleman





From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Tuesday, February 15, 2022 1:11 PM
To: user@accumulo.apache.org
Subject: Re: accumulo 1.10.0 masters won't start



Well,



I am trying to merge a large table (8T with 72k tablets, with default tablet size 1G)



Since I keep getting those zookeeper errors related to size, I keep on bumping jute.maxbuffer, and now it is all the way to 8m



But still I can't merge even a small subset (-b and -e). Now the error is Xid out of order. Got Xid xx with err -101 expected Xid yy for a packet with details:



after this master crashes



Any suggestion how to go about and how to merge this large table?



Thanks



-S

________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Thursday, February 10, 2022 9:20 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



Thanks Ed,



That saved the day. The confusing part of setting that property is that the documentation is unclear about whether it needs hex or bytes, etc. Even the example they provide here

https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit



states they are setting the value to 2MB, but the value really looks like 200k (with five zeros)



------------------------------------

Add the following line to increase the file size limit to 2MB:

SOLR_OPTS="$SOLR_OPTS -Djute.maxbuffer=0x200000"

-------------------------------------
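
On the hex-versus-decimal question: the 0x prefix in the Solr example marks the value as hexadecimal, so 0x200000 is 2 MiB, not 200k. A quick way to check any candidate value is sketched below; it assumes the property accepts both plain decimal and 0x-prefixed hex (my understanding is that ZooKeeper reads it through Java's Integer.getInteger, but treat that as an assumption), and the helper name is made up:

```python
def parse_maxbuffer(text):
    """Parse a jute.maxbuffer value written as decimal or 0x-prefixed hex."""
    return int(text, 0)  # base 0 lets Python infer the base from the prefix


for raw in ("0x200000", "200000", "0xfffff", "4194304"):
    n = parse_maxbuffer(raw)
    print(f"{raw:>9} = {n:>8} bytes = {n / 2**20:.2f} MiB")
```

The 0xfffff entry is the 1048575-byte value that the accumulo master log reports as its jute.maxbuffer.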



Anyways, a master has been up and running for an hour now, so I am just trying to understand what was changed and revert it after it stabilizes.



Thanks a bunch.



-S



________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 5:54 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start



You might want to set the accumulo (zookeeper client) side - by setting ACCUMULO_JAVA_OPTS that is processed in accumulo-env.sh (or just edit that file?)



Looking at the Zookeeper documentation it describes what looks like you are seeing:



When jute.maxbuffer in the client side is less than the server side, the client wants to read the data exceeds jute.maxbuffer in the client side, the client side will get java.io.IOException: Unreasonable length or Packet len is out of range!



Also, a search showed jira tickets that had a server side limit of 4MB, but client limits of 1MB - you may want to see if 4194304 (or larger) as a value works.
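
A quick sanity check of the values tried so far against the packet sizes reported in the errors in this thread. This sketch conservatively assumes the limit must be strictly larger than the packet; the exact boundary condition in the client may differ by version, and the names are made up for illustration:

```python
# Packet lengths taken from the "Packet len ... is out of range" errors above.
OBSERVED_PACKET_LENS = [2_791_093, 2_791_161]


def big_enough(maxbuffer, packet_len):
    """Conservative check: the limit must exceed the packet length."""
    return maxbuffer > packet_len


candidates = {
    "300000": 300_000,
    "client default 0xfffff": 0xFFFFF,
    "4194304": 4_194_304,
}

for label, value in candidates.items():
    ok = all(big_enough(value, p) for p in OBSERVED_PACKET_LENS)
    print(f"{label}: {'ok' if ok else 'too small'}")
```

Both 300000 and the 1048575 client default are below the roughly 2.8 MB packets, which is consistent with the errors continuing until the client side was raised as well.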



________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 5:25 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



jute.maxbuffer is a ZooKeeper property - it needs to be set on the zookeeper configuration.  If this is still correct, then it looks like there are a few options https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit<https://urldefense.com/v3/__https:/solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html*increasing-the-file-size-limit__;Iw!!May37g!ewlkGRNFLrKEpeF1Lz8vRt_oBtpgi8hVvvnCrp1Dq4_8Xprb4tEHWiHVFW7EVUT3vA$>



But maybe the ZooKeeper documentation for your version can provide additional guidance?

Setting Up an External ZooKeeper Ensemble | Apache Solr Reference Guide 7.4 - The Apache Software Foundation<https://urldefense.com/v3/__https:/solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html*increasing-the-file-size-limit__;Iw!!May37g!ewlkGRNFLrKEpeF1Lz8vRt_oBtpgi8hVvvnCrp1Dq4_8Xprb4tEHWiHVFW7EVUT3vA$>

The solution to this problem is to set up an external ZooKeeper ensemble, which is a number of servers running ZooKeeper that communicate with each other to coordinate the activities of the cluster.

solr.apache.org





________________________________

From: Shailesh Ligade <SL...@FBI.GOV>>
Sent: Wednesday, February 9, 2022 5:02 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: RE: accumulo 1.10.0 masters won't start



Thanks



Even if I set jute.maxbuffer on zookeeper in conf/java.env file to



-Djute.maxbuffer=300000



I see in accumulo master log as



INFO: jute.maxbuffer value is 1048575 Bytes    not sure where to set that on accumulo side.



I set instance.zookeeper.timeout value to 90s in accumulo-site.xml



But still get those zookeeper KeeperErrorCode errors



-S



From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 4:27 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: [EXTERNAL EMAIL] - Re: accumulo 1.10.0 masters won't start



I would not recommend setting the goal state directly unlit there are no other alternatives.



It is hard to recommend what to do, because it is unclear what put you into the current situation and what action / impact you might have had trying to fix things -



why did the goal state become unset in the first place?

what did you stuff into the fates that increased the need for larger jute buffers?



It could be that the number of tables and servers pushed you over the limit - or it could be something else.



What I would do.



Shutdown accumulo and make sure all services / tservers are stopped.

Shutdown any other services that might be using ZooKeeper.

Shutdown ZooKeeper.



Set the larger jute.buffer and increase the timeout values across the board and in any dependent services.



Start hdfs - if you needed to shut it down.

Start just zookeeper - and use zkCli.sh to examine the state of things.  If that looks okay.

Start just the master - how far does it come up?  It will not be able to load the root / metadata tables, but it may give some indication of state,



I'd then cycle between stopping the master, trying to clean-up things using zkCli.sh using any guidance with errors the master is generating. If that looks promising, then:



With the master stopped - start the tservers and check a few logs if there are exceptions determine if they are they something that is pointing to an issue - or just something that is transient and handled.



Once the tservers are up and looking okay - start the master.



One of the things to grab as soon as you can get the shell to run - get a listing of the tables and the ids.  If the worst happens, you can use that to map the existing data into a "new" instance. Hopefully it will not come to that and you will not need it - but if you don't have it and you need it, well... The table names and id are all in ZooKeeper.



Ed Coleman



________________________________

From: Shailesh Ligade <SL...@FBI.GOV>>
Sent: Wednesday, February 9, 2022 3:47 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: RE: accumulo 1.10.0 masters won't start



Thanks I can try that,



At this point, my goal is to get accumulo up. I was just wondering if I can set different goal like SAFE_MODE will it come up by ignoring fate and other issues? If that comes up, can I switch back to NORMAL, will that work? I understand there may be some data loss..



-S



From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 3:36 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: [EXTERNAL EMAIL] - Re: accumulo 1.10.0 masters won't start



For values in zoo.cfg see: https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#sc_advancedConfiguration<https://urldefense.com/v3/__https:/usg02.safelinks.protection.office365.us/?url=https*3A*2F*2Fzookeeper.apache.org*2Fdoc*2Fr3.5.9*2FzookeeperAdmin.html*23sc_advancedConfiguration&data=04*7C01*7CSLIGADE*40FBI.GOV*7Cb7b8be92faf64fbc95ff08d9ec13044d*7C022914a9b95f4b7bbace551ce1a04071*7C0*7C0*7C637800390698440068*7CUnknown*7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0*3D*7C3000&sdata=QaGFX7kcHJeiIN73G5bfDDEQNgxN0F7QdyJ9fO3SJzA*3D&reserved=0__;JSUlJSUlJSUlJSUlJSUlJSUlJSU!!May37g!ewlkGRNFLrKEpeF1Lz8vRt_oBtpgi8hVvvnCrp1Dq4_8Xprb4tEHWiHVFW6UcuwVgA$>



maxSessionTimeout



In the accumulo config  - #instance.zookeepers.timeout=30s



The zookeeper setting controls the max time that the ZK servers will grant - the accumulo setting is how much time accumulo will ask for.



ZooKeeper: Because Coordinating Distributed Systems is a Zoo<https://urldefense.com/v3/__https:/usg02.safelinks.protection.office365.us/?url=https*3A*2F*2Fzookeeper.apache.org*2Fdoc*2Fr3.5.9*2FzookeeperAdmin.html*23sc_advancedConfiguration&data=04*7C01*7CSLIGADE*40FBI.GOV*7Cb7b8be92faf64fbc95ff08d9ec13044d*7C022914a9b95f4b7bbace551ce1a04071*7C0*7C0*7C637800390698440068*7CUnknown*7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0*3D*7C3000&sdata=QaGFX7kcHJeiIN73G5bfDDEQNgxN0F7QdyJ9fO3SJzA*3D&reserved=0__;JSUlJSUlJSUlJSUlJSUlJSUlJSU!!May37g!ewlkGRNFLrKEpeF1Lz8vRt_oBtpgi8hVvvnCrp1Dq4_8Xprb4tEHWiHVFW6UcuwVgA$>

Trace Mask Bit Values ; 0b0000000000 : Unused, reserved for future use. 0b0000000010 : Logs client requests, excluding ping requests. 0b0000000100 : Unused, reserved ...

zookeeper.apache.org





________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 3:03 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



thanks for response,



no i have not update any timeout



is that going in zoo.cfg? I can see there is min/maxSessionTimeout 2/20, is that what are you refering to?



-S

________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 2:51 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start



Have you tried to increase the zoo session timeout value? I think it's zookeeper.session.timeout.ms



________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 2:47 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



Thanks



That fixed goal sate issue but now still getting



Errors with zookeeper

e.g.



KeeperErrorCode = ConnectionLoss for



/accumulo/<instane-id>/config/tserver.hold.time.max

/accumulo/<instane-id>/tables

/accumulo/<instane-id>/tables/1/name

/accumulo/<instane-id>/fate

/accumulo/<instane-id>/masters/goal_state



So it is all over …some I see good values in zookeeper…so not sure..  🙁



-S



________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 2:22 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start



The is a utility - SetGoalState that can be run from the command line



accumulo SetGoalState NORMAL



(or SAFE_MODE, CLEAN_STOP)



It sets a value in ZK at /accumulo/instance-id/managers/goal_state



Ed Coleman



________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 1:54 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



Well,



i just went ahead and deleted fate in zookeeper and restarted the master..it was doing better but then i am getting different error



ERROR: Problem getting real goal state from zookeeper: java.lang.IllegalArgumentException: No enum constant org.apache.accumulo.core.master.thrift.MasterGoalState



I hope i didn't delete goal_state accidently ...;-( currently ls on goal_state is [], is there a way to add some value there?



-S

________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 1:32 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start



Did you try setting the increased size in the zkCli.sh command (or wherever it gets it environment from?)



The ZK docs indicate that it needs to be set to the same size on all servers and clients.



You should be able to use zkCli.sh to at least see what's going on - if that does not work, then it seems unlikely that the master would either.



Can you:

  *   list the nodes under /accumulo/[instance id]/fate?
  *   use the stat command on each of the nodes - the size is one of the fields.
  *   list nodes under any of the /accumulo/[instance_id/fate/tx-#####
  *   there should be a node named debug - doing a get on that should show the op name.



Ed Coleman

________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 12:54 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



Thanks



I added



-Djute.maxbuffer=30000000



In conf/java.env and restart all zookeepers but still getting the same error.. documentation is kind of fuzzy on setting this property as it states in hex (default 0xffff) so not 100% sure if 30000000 is ok, but atleast I could see zookeeper was up



-S



________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 12:26 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start



Does the monitor or any of the logs show errors that relate to exceeding the ZooKeeper jute buffer size?



Is so, have you tried increasing the ZooKeeper jute.maxbuffer limit(https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#Unsafe+Options<https://urldefense.com/v3/__https:/usg02.safelinks.protection.office365.us/?url=https*3A*2F*2Furldefense.com*2Fv3*2F__https*3A*2F*2Fzookeeper.apache.org*2Fdoc*2Fr3.5.9*2FzookeeperAdmin.html*Unsafe*Options__*3BIys!!May37g!dTGCMHPLPDBXwSqtLa5cIPHiTIQF7IjLCVyvGxfi1sgPbrsOI8RCEsuZ9u-jJtayEg*24&data=04*7C01*7CSLIGADE*40FBI.GOV*7Cb7b8be92faf64fbc95ff08d9ec13044d*7C022914a9b95f4b7bbace551ce1a04071*7C0*7C0*7C637800390698440068*7CUnknown*7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0*3D*7C3000&sdata=meZvmEpBktGc95qzM46QmtNWp5NJ8noozSTv896k7qw*3D&reserved=0__;JSUlJSUlJSUlJSUqKiUlJSUlJSUlJSUlJSUlJQ!!May37g!ewlkGRNFLrKEpeF1Lz8vRt_oBtpgi8hVvvnCrp1Dq4_8Xprb4tEHWiHVFW58KT9_bg$>)?



Ed Coleman





________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 11:49 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: accumulo 1.10.0 masters won't start



Hello,



My both masters are stuck error on zookeeper:



IOException: Packet len 2791093 is out of range!

KeeperErrorCode = ConnectionLoss for /accumulo/<instance_id>/fate





if use zkCli to see what is under fate, i get



IOException Packet len 2791161 is out of range

Unable to read additional data from server sessionid xxxx, likely server has closed socket



hdfs fsck is all good



How can I clear this fate?



The master process is up and I can get into the Accumulo shell, but there are no fate transactions (fate print returns empty).



Any idea how to bring the master up?



Thanks



S

Re: accumulo 1.10.0 masters won't start

Posted by "Ligade, Shailesh [USA]" <Li...@bah.com>.

Thanks Ed,

We have two clusters: one is the source, the other is the peer. I am testing this large table on the peer first (eventually I need to do the same on the source cluster). I am not stopping ingest on the source cluster, so replication will continue to the peer table; however, not much ingest is happening while I am doing this.


I tried range compaction along with range merge; however, a merge takes forever (even over a single range; I didn't try many, just the first few), and before it finishes I get a ZooKeeper error and both masters crash. I then have to bump that jute setting (in both java.env on ZooKeeper and accumulo-env on Accumulo) to get things back. So I left the merges alone and am just trying to compact the 72k tablets. Since compactions are not backing up, I am doing a minimal sleep after every 100 compact commands. But sometimes during compactions I still get ZooKeeper errors and the master crashes.

I do get your idea of creating a new table with fewer splits, so that the new table will be compacted. However, for that I will need to stop ingest on the primary and then set up replication on the new cluster again. I was avoiding that, but I guess that may be my only option.

One last question: if I let this table continue to grow tablets, what is the worst-case scenario? How might it affect system performance?

Thanks

-S
________________________________
From: dev1 <de...@etcoleman.com>
Sent: Tuesday, February 15, 2022 3:11 PM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>
Subject: [External] RE: accumulo 1.10.0 masters won't start


Can you compact the table?  How aggressive do you want to get? I do not understand why you are getting the ZooKeeper errors - is it related to the number of tablets, or is it something else (an iterator that was attached with a very large set of arguments, a very large list, or some sort of binary data, say a bloom filter)?



Here is what I would do if it were me, although you need to balance your goals and requirements, which might dictate a less aggressive approach.  At this point I'm assuming that getting things back online without data loss is the top priority (and that I was sure the problem was not related to something that I attached to the table).



This assumes I have room and can compact the table(s). It could also depend on how long a compaction would take and whether I could wait.  It is generally preferable to work on files that have had any deleted data removed, and compaction can reduce the total number of files, because files from minor compactions and bulk ingest are combined into a single file for each tablet.



Stop ingest.

Flush the source table – allow any compactions to settle. (Optional if compacting, but should be a quick command to execute)

(Optional – compact the original.)

Clone the source table.

Compact the clone so that the clone does not share any files with the source

Optionally - use the exporttable command to generate a list of files from the clone - you may not need it, but it could be handy.

Take the clone offline.

Move the files under /accumulo/tables/[clone-id]/dirs to one or more staging directories (in hdfs) – the export list could help.

Delete the clone table – (I believe that the delete will not check for the existence of files if it is offline.) If not, then it would be necessary to use an empty rfile as a placeholder.

Create a new table and set splits – this could be your desired number – or use just enough splits that each tserver has 1 tablet.

Set the default table split size to some multiple of the desired final size – this limits splitting during the imports. Not critical, but may be faster.

Take the new table offline and then back online – this will immediately migrate the splits – or you could just wait for the migration to finish.

Bulk import the files from the staging area(s) – likely in batches.  You will likely have ~72K files – so maybe ~7,000 files / batch?

Once all files have been imported set the split threshold to desired size.

Check that permissions, users, iterators, table config parameters are present on the new table and match the source.

Rename the source table to old_xxx or whatever

Rename the new table to the source table, verify things are okay and delete the original.
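
The batching in the bulk import step can be sketched as below. This is only an illustration: the helper names, the table name and the /staging paths are hypothetical, and the importdirectory line assumes the 1.x shell syntax (directory, failure directory, setTime flag). The script only prints commands; it does not touch Accumulo or HDFS.

```shell
#!/bin/sh
# Hypothetical helper names and staging paths; nothing here runs Accumulo.

# Ceiling division: how many batches are needed for a given file count.
batch_count() {
    files=$1
    per_batch=$2
    echo $(( (files + per_batch - 1) / per_batch ))
}

# Print one Accumulo shell importdirectory line per staging directory.
# Arguments: table name, total files, files per batch.
print_import_commands() {
    table=$1
    n=$(batch_count "$2" "$3")
    echo "table ${table}"
    i=1
    while [ "$i" -le "$n" ]; do
        echo "importdirectory /staging/batch_${i} /staging/fail_${i} false"
        i=$((i + 1))
    done
}

print_import_commands bigtable_new 72000 7000
```

With roughly 72,000 files and 7,000 files per batch this works out to 11 batches, the last one only partly full.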



If you don’t have the space, you could skip operating on the clone, but then you can’t fall back to the original if things go wrong.



Another way would be to use importtable, but you need to make sure that it doesn't just recreate the original splits, otherwise you end up with the same 72K files.



Ed Coleman





From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Tuesday, February 15, 2022 1:11 PM
To: user@accumulo.apache.org
Subject: Re: accumulo 1.10.0 masters won't start



Well,



I am trying to merge a large table (8T with 72k tablets, with default tablet size 1G)
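
For scale, a rough back-of-the-envelope calculation may help. It assumes that "8T" means 8 TiB, "72k" means 72,000 tablets and "1G" means a 1 GiB split threshold; the function names are made up for this sketch.

```shell
#!/bin/sh
# Assumptions: 8 TiB of data, expressed in MiB so integer arithmetic is enough.
total_mib=$((8 * 1024 * 1024))

# Average tablet size in MiB for a given number of tablets.
avg_tablet_mib() {
    echo $(( total_mib / $1 ))
}

# Number of tablets if every tablet held exactly the threshold (given in MiB).
tablets_at_threshold() {
    echo $(( total_mib / $1 ))
}

echo "average tablet size: $(avg_tablet_mib 72000) MiB"
echo "tablets if each held 1 GiB: $(tablets_at_threshold 1024)"
```

Under those assumptions the average tablet holds about 116 MiB, well below the 1 GiB threshold, and roughly 8192 evenly filled tablets would hold the same data.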



Since I keep getting those ZooKeeper errors related to size, I keep bumping jute.maxbuffer, and now it is all the way up to 8m.



But I still can't merge, even for a small subset (-b and -e). Now the error is: Xid out of order. Got Xid xx with err -101 expected Xid yy for a packet with details:



after this master crashes



Any suggestions on how to go about merging this large table?



Thanks



-S

________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Thursday, February 10, 2022 9:20 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



Thanks Ed,



That saved the day. The confusing part of setting that property is that the documentation is unclear on whether it needs hex or bytes, etc. Even the example they provided here

https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit



states they are setting the value to 2 MB, but the value really looks like 200k (with five zeros)



------------------------------------

Add the following line to increase the file size limit to 2MB:

SOLR_OPTS="$SOLR_OPTS -Djute.maxbuffer=0x200000"

-------------------------------------
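
The 0x prefix in the quoted Solr line marks the value as hexadecimal, so it is not 200,000. A minimal sketch for checking what a given value means in bytes follows; to_bytes is a made-up helper name, and it relies on printf accepting C-style integer constants.

```shell
#!/bin/sh
# Convert a jute.maxbuffer value (hex such as 0x200000, or plain decimal)
# to a decimal byte count.
to_bytes() {
    printf '%d\n' "$1"
}

to_bytes 0x200000   # the value from the Solr guide
to_bytes 0xfffff    # ZooKeeper's documented default
to_bytes 300000     # a plain decimal value, for comparison
```

0x200000 is 2097152 bytes, which is 2 MiB, and 0xfffff is 1048575 bytes, the same number the Accumulo master log reports in this thread. As far as I know ZooKeeper accepts either notation for this property, but that is worth confirming against the documentation for the version in use.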



Anyway, a master has been up and running for an hour now, so I am just trying to understand what changed and will revert it after it stabilizes.



Thanks a bunch.



-S



________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 5:54 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start



You might want to set the accumulo (zookeeper client) side - by setting ACCUMULO_JAVA_OPTS that is processed in accumulo-env.sh (or just edit that file?)



Looking at the Zookeeper documentation it describes what looks like you are seeing:



When jute.maxbuffer in the client side is less than the server side, the client wants to read the data exceeds jute.maxbuffer in the client side, the client side will get java.io.IOException: Unreasonable length or Packet len is out of range!



Also, a search showed jira tickets that had a server side limit of 4MB, but client limits of 1MB - you may want to see if 4194304 (or larger) as a value works.
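
A sketch of setting the same limit on both sides might look like the following. The variable names are assumptions: JVMFLAGS is what ZooKeeper's scripts read after sourcing conf/java.env, while ACCUMULO_GENERAL_OPTS is my guess for a 1.x-style accumulo-env.sh (the message above mentions ACCUMULO_JAVA_OPTS, and the right name depends on the Accumulo version).

```shell
# ZooKeeper side, in conf/java.env: JVMFLAGS set here is picked up by the
# server JVM (and by zkCli.sh when started from the same install).
export JVMFLAGS="-Djute.maxbuffer=4194304 ${JVMFLAGS:-}"

# Accumulo side, in accumulo-env.sh (assumed variable name for 1.x):
# options shared by all Accumulo processes, which act as ZooKeeper clients.
export ACCUMULO_GENERAL_OPTS="-Djute.maxbuffer=4194304 ${ACCUMULO_GENERAL_OPTS:-}"
```

Whichever variable is used, the point from the ZooKeeper documentation is that servers and clients should agree on the value, and every process has to be restarted to pick it up.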



________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 5:25 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



jute.maxbuffer is a ZooKeeper property - it needs to be set in the ZooKeeper configuration.  If this is still correct, then it looks like there are a few options: https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit



But maybe the ZooKeeper documentation for your version can provide additional guidance?





________________________________

From: Shailesh Ligade <SL...@FBI.GOV>>
Sent: Wednesday, February 9, 2022 5:02 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: RE: accumulo 1.10.0 masters won't start



Thanks



Even if I set jute.maxbuffer on zookeeper in conf/java.env file to



-Djute.maxbuffer=300000



I see this in the Accumulo master log:



INFO: jute.maxbuffer value is 1048575 Bytes

I am not sure where to set that on the Accumulo side.



I set instance.zookeeper.timeout value to 90s in accumulo-site.xml



But still get those zookeeper KeeperErrorCode errors



-S



From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 4:27 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: [EXTERNAL EMAIL] - Re: accumulo 1.10.0 masters won't start



I would not recommend setting the goal state directly until there are no other alternatives.



It is hard to recommend what to do, because it is unclear what put you into the current situation and what action / impact you might have had trying to fix things -



why did the goal state become unset in the first place?

what did you stuff into the fates that increased the need for larger jute buffers?



It could be that the number of tables and servers pushed you over the limit - or it could be something else.



What I would do.



Shutdown accumulo and make sure all services / tservers are stopped.

Shutdown any other services that might be using ZooKeeper.

Shutdown ZooKeeper.



Set the larger jute.maxbuffer and increase the timeout values across the board and in any dependent services.



Start hdfs - if you needed to shut it down.

Start just ZooKeeper and use zkCli.sh to examine the state of things.  If that looks okay:

Start just the master - how far does it come up?  It will not be able to load the root / metadata tables, but it may give some indication of state.



I'd then cycle between stopping the master, trying to clean-up things using zkCli.sh using any guidance with errors the master is generating. If that looks promising, then:



With the master stopped, start the tservers and check a few logs. If there are exceptions, determine whether they point to an issue or are just something transient and handled.



Once the tservers are up and looking okay - start the master.



One of the things to grab as soon as you can get the shell to run - get a listing of the tables and the ids.  If the worst happens, you can use that to map the existing data into a "new" instance. Hopefully it will not come to that and you will not need it - but if you don't have it and you need it, well... The table names and id are all in ZooKeeper.



Ed Coleman



________________________________

From: Shailesh Ligade <SL...@FBI.GOV>>
Sent: Wednesday, February 9, 2022 3:47 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: RE: accumulo 1.10.0 masters won't start



Thanks I can try that,



At this point, my goal is to get accumulo up. I was just wondering if I can set different goal like SAFE_MODE will it come up by ignoring fate and other issues? If that comes up, can I switch back to NORMAL, will that work? I understand there may be some data loss..



-S



From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 3:36 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: [EXTERNAL EMAIL] - Re: accumulo 1.10.0 masters won't start



For values in zoo.cfg see: https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#sc_advancedConfiguration



maxSessionTimeout



In the accumulo config  - #instance.zookeepers.timeout=30s



The zookeeper setting controls the max time that the ZK servers will grant - the accumulo setting is how much time accumulo will ask for.
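
The interaction between the two settings can be sketched as a simple clamp. This assumes the ZooKeeper defaults of minSessionTimeout = 2 x tickTime and maxSessionTimeout = 20 x tickTime (which may be what the "2/20" seen in zoo.cfg refers to); negotiated_timeout_ms is a made-up name and the tickTime of 2000 ms is only an example.

```shell
#!/bin/sh
# Sketch of how a requested session timeout is bounded by the server.
# Arguments: requested timeout in ms, tickTime in ms.
negotiated_timeout_ms() {
    requested=$1
    tick=$2
    min=$((tick * 2))    # assumed default minSessionTimeout
    max=$((tick * 20))   # assumed default maxSessionTimeout
    if [ "$requested" -lt "$min" ]; then
        echo "$min"
    elif [ "$requested" -gt "$max" ]; then
        echo "$max"
    else
        echo "$requested"
    fi
}

negotiated_timeout_ms 30000 2000   # Accumulo asks for 30s
negotiated_timeout_ms 90000 2000   # Accumulo asks for 90s
```

With a 2000 ms tickTime the 30s request is granted as asked, while a 90s request would be capped at 40s unless maxSessionTimeout is raised on the ZooKeeper servers.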







________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 3:03 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



thanks for response,



No, I have not updated any timeout.



Does that go in zoo.cfg? I can see there is min/maxSessionTimeout 2/20; is that what you are referring to?



-S

________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 2:51 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start



Have you tried to increase the zoo session timeout value? I think it's zookeeper.session.timeout.ms



________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 2:47 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



Thanks



That fixed the goal state issue, but I am still getting



errors with ZooKeeper,

e.g.



KeeperErrorCode = ConnectionLoss for



/accumulo/<instance-id>/config/tserver.hold.time.max

/accumulo/<instance-id>/tables

/accumulo/<instance-id>/tables/1/name

/accumulo/<instance-id>/fate

/accumulo/<instance-id>/masters/goal_state



So it is all over the place. For some of these I see good values in ZooKeeper, so I am not sure. 🙁



-S



________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 2:22 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start



There is a utility - SetGoalState - that can be run from the command line



accumulo SetGoalState NORMAL



(or SAFE_MODE, CLEAN_STOP)



It sets a value in ZK at /accumulo/<instance-id>/masters/goal_state



Ed Coleman



________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 1:54 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



Well,



I just went ahead and deleted fate in ZooKeeper and restarted the master. It was doing better, but now I am getting a different error:



ERROR: Problem getting real goal state from zookeeper: java.lang.IllegalArgumentException: No enum constant org.apache.accumulo.core.master.thrift.MasterGoalState



I hope I didn't delete goal_state accidentally ;-( Currently ls on goal_state is []. Is there a way to add some value there?



-S

________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 1:32 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start



Did you try setting the increased size in the zkCli.sh command (or wherever it gets its environment from)?



The ZK docs indicate that it needs to be set to the same size on all servers and clients.



You should be able to use zkCli.sh to at least see what's going on - if that does not work, then it seems unlikely that the master would either.



Can you:

  *   List the nodes under /accumulo/[instance_id]/fate.
  *   Use the stat command on each of the nodes - the size is one of the fields.
  *   List the nodes under any of the /accumulo/[instance_id]/fate/tx-##### nodes.
  *   There should be a node named debug - doing a get on that should show the op name.
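
The checks above could be written out as zkCli.sh invocations along these lines. This is a dry-run sketch: it only prints the commands, and ZK_HOST, INSTANCE_ID, TX_NODE and the 4194304 buffer size are placeholders rather than values from the thread. It also assumes that zkCli.sh honours CLIENT_JVMFLAGS for client-side JVM options.

```shell
#!/bin/sh
# Print (do not run) the zkCli.sh commands for inspecting the fate node.
ZK_HOST="${ZK_HOST:-localhost:2181}"          # placeholder host:port
INSTANCE_ID="${INSTANCE_ID:-INSTANCE_ID}"     # placeholder instance id
FATE="/accumulo/${INSTANCE_ID}/fate"

print_fate_checks() {
    tx=$1   # a child name taken from the output of the first ls
    # Raise the client-side buffer so zkCli.sh can read the large response.
    echo "export CLIENT_JVMFLAGS=-Djute.maxbuffer=4194304"
    echo "zkCli.sh -server ${ZK_HOST} ls ${FATE}"
    echo "zkCli.sh -server ${ZK_HOST} stat ${FATE}/${tx}"
    echo "zkCli.sh -server ${ZK_HOST} ls ${FATE}/${tx}"
    echo "zkCli.sh -server ${ZK_HOST} get ${FATE}/${tx}/debug"
}

print_fate_checks TX_NODE
```

Printing the commands first makes it easy to review paths before anything is run against a ZooKeeper ensemble that is already in a fragile state.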



Ed Coleman

________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 12:54 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



Thanks



I added



-Djute.maxbuffer=30000000



In conf/java.env and restarted all the ZooKeepers, but I am still getting the same error. The documentation is kind of fuzzy on setting this property, as it states the value in hex (default 0xfffff), so I am not 100% sure if 30000000 is OK, but at least I could see ZooKeeper was up.



-S



________________________________

From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 12:26 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start



Does the monitor or any of the logs show errors that relate to exceeding the ZooKeeper jute buffer size?



If so, have you tried increasing the ZooKeeper jute.maxbuffer limit (https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#Unsafe+Options)?



Ed Coleman





________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 11:49 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: accumulo 1.10.0 masters won't start



Hello,



Both of my masters are stuck with a ZooKeeper error:



IOException: Packet len 2791093 is out of range!

KeeperErrorCode = ConnectionLoss for /accumulo/<instance_id>/fate





If I use zkCli to see what is under fate, I get



IOException Packet len 2791161 is out of range

Unable to read additional data from server sessionid xxxx, likely server has closed socket



hdfs fsck is all good



How can I clear this fate?



The master process is up and I can get into the Accumulo shell, but there are no fate transactions (fate print returns empty).



Any idea how to bring the master up?



Thanks



S

RE: accumulo 1.10.0 masters won't start

Posted by dev1 <de...@etcoleman.com>.

Can you compact the table?  How aggressive do you want to get? I do not understand why you are getting the ZooKeeper errors - is it related to the number of tablets, or is it something else (an iterator that was attached with a very large set of arguments, a very large list, or some sort of binary data, say a bloom filter)?

Here is what I would do if it were me, although you need to balance your goals and requirements, which might dictate a less aggressive approach.  At this point I'm assuming that getting things back online without data loss is the top priority (and that I was sure the problem was not related to something that I attached to the table).

This assumes I have room and can compact the table(s). It could also depend on how long a compaction would take and whether I could wait.  It is generally preferable to work on files that have had any deleted data removed, and compaction can reduce the total number of files, because files from minor compactions and bulk ingest are combined into a single file for each tablet.

Stop ingest.
Flush the source table – allow any compactions to settle. (Optional if compacting, but should be a quick command to execute)
(Optional – compact the original.)
Clone the source table.
Compact the clone so that the clone does not share any files with the source
Optionally - use the exporttable command to generate a list of files from the clone - you may not need it, but it could be handy.
Take the clone offline.
Move the files under /accumulo/tables/[clone-id]/dirs to one or more staging directories (in hdfs) – the export list could help.
Delete the clone table – (I believe that the delete will not check for the existence of files if it is offline.) If not, then it would be necessary to use an empty rfile as a placeholder.
Create a new table and set splits – this could be your desired number – or use just enough splits that each tserver has 1 tablet.
Set the default table split size to some multiple of the desired final size – this limits splitting during the imports. Not critical, but may be faster.
Take the new table offline and then back online – this will immediately migrate the splits – or you could just wait for the migration to finish.
Bulk import the files from the staging area(s) – likely in batches.  You will likely have ~72K files – so maybe ~7,000 files / batch?
Once all files have been imported set the split threshold to desired size.
Check that permissions, users, iterators, table config parameters are present on the new table and match the source.
Rename the source table to old_xxx or whatever
Rename the new table to the source table, verify things are okay and delete the original.
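
As a rough illustration, the Accumulo shell side of the steps above could be laid out as follows. The table names, the splits file path and the 10G/1G thresholds are placeholders, not every step is shown (the exporttable, offline/online and HDFS file moves are left as comments or omitted), and the script only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Print the Accumulo shell commands for the rebuild; review before running any.
# Argument: name of the source table (placeholder in the example call).
print_rebuild_plan() {
    src=$1
    clone="${src}_clone"
    new="${src}_new"
    echo "flush -t ${src} -w"
    echo "clonetable ${src} ${clone}"
    echo "compact -t ${clone} -w"
    echo "offline -t ${clone}"
    # ... move the clone's files to staging directories in HDFS here ...
    echo "deletetable -f ${clone}"
    echo "createtable ${new}"
    echo "addsplits -t ${new} -sf /tmp/new_splits.txt"
    echo "config -t ${new} -s table.split.threshold=10G"
    # ... bulk import the staged files in batches here ...
    echo "config -t ${new} -s table.split.threshold=1G"
    echo "renametable ${src} old_${src}"
    echo "renametable ${new} ${src}"
}

print_rebuild_plan bigtable
```

Keeping the original under an old_ name until the new table has been verified preserves the fallback described in the steps.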

If you don’t have the space, you could skip operating on the clone, but then you can’t fall back to the original if things go wrong.

Another way would be to use importtable, but you need to make sure that it doesn't just recreate the original splits, otherwise you end up with the same 72K files.

Ed Coleman


From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Tuesday, February 15, 2022 1:11 PM
To: user@accumulo.apache.org
Subject: Re: accumulo 1.10.0 masters won't start

Well,

I am trying to merge a large table (8T with 72k tablets, with default tablet size 1G)

Since I keep getting those ZooKeeper errors related to size, I keep bumping jute.maxbuffer, and now it is all the way up to 8m.

But I still can't merge, even for a small subset (-b and -e). Now the error is: Xid out of order. Got Xid xx with err -101 expected Xid yy for a packet with details:

after this master crashes

Any suggestions on how to go about merging this large table?

Thanks

-S
________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Thursday, February 10, 2022 9:20 AM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start

Thanks Ed,

That saved the day. The confusing part of setting that property is that the documentation is unclear on whether it needs hex or bytes, etc. Even the example they provided here
https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit

states they are setting the value to 2 MB, but the value really looks like 200k (with five zeros)

------------------------------------

Add the following line to increase the file size limit to 2MB:

SOLR_OPTS="$SOLR_OPTS -Djute.maxbuffer=0x200000"
-------------------------------------

Anyway, a master has been up and running for an hour now, so I am just trying to understand what changed and will revert it after it stabilizes.

Thanks a bunch.

-S

________________________________
From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 5:54 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: [External] Re: accumulo 1.10.0 masters won't start

You might want to set the accumulo (zookeeper client) side - by setting ACCUMULO_JAVA_OPTS that is processed in accumulo-env.sh (or just edit that file?)

Looking at the Zookeeper documentation it describes what looks like you are seeing:

When jute.maxbuffer in the client side is less than the server side, the client wants to read the data exceeds jute.maxbuffer in the client side, the client side will get java.io.IOException: Unreasonable length or Packet len is out of range!

Also, a search showed jira tickets that had a server side limit of 4MB, but client limits of 1MB - you may want to see if 4194304 (or larger) as a value works.


________________________________
From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 5:25 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start

jute.maxbuffer is a ZooKeeper property - it needs to be set in the ZooKeeper configuration.  If this is still correct, then it looks like there are a few options: https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit

But maybe the ZooKeeper documentation for your version can provide additional guidance?


________________________________
From: Shailesh Ligade <SL...@FBI.GOV>>
Sent: Wednesday, February 9, 2022 5:02 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: RE: accumulo 1.10.0 masters won't start


Thanks



Even if I set jute.maxbuffer on zookeeper in conf/java.env file to



-Djute.maxbuffer=300000



I see this in the Accumulo master log:



INFO: jute.maxbuffer value is 1048575 Bytes

I am not sure where to set that on the Accumulo side.



I set instance.zookeeper.timeout value to 90s in accumulo-site.xml



But still get those zookeeper KeeperErrorCode errors



-S



From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 4:27 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: [EXTERNAL EMAIL] - Re: accumulo 1.10.0 masters won't start



I would not recommend setting the goal state directly until there are no other alternatives.



It is hard to recommend what to do, because it is unclear what put you into the current situation and what action / impact you might have had trying to fix things -



why did the goal state become unset in the first place?

what did you stuff into the fates that increased the need for larger jute buffers?



It could be that the number of tables and servers pushed you over the limit - or it could be something else.



What I would do.



Shutdown accumulo and make sure all services / tservers are stopped.

Shutdown any other services that might be using ZooKeeper.

Shutdown ZooKeeper.



Set the larger jute.maxbuffer and increase the timeout values across the board and in any dependent services.



Start hdfs - if you needed to shut it down.

Start just ZooKeeper and use zkCli.sh to examine the state of things.  If that looks okay:

Start just the master - how far does it come up?  It will not be able to load the root / metadata tables, but it may give some indication of state.



I'd then cycle between stopping the master, trying to clean-up things using zkCli.sh using any guidance with errors the master is generating. If that looks promising, then:



With the master stopped, start the tservers and check a few logs. If there are exceptions, determine whether they point to an issue or are just something transient and handled.



Once the tservers are up and looking okay - start the master.



One of the things to grab as soon as you can get the shell to run - get a listing of the tables and the ids.  If the worst happens, you can use that to map the existing data into a "new" instance. Hopefully it will not come to that and you will not need it - but if you don't have it and you need it, well... The table names and id are all in ZooKeeper.



Ed Coleman



________________________________

From: Shailesh Ligade <SL...@FBI.GOV>>
Sent: Wednesday, February 9, 2022 3:47 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: RE: accumulo 1.10.0 masters won't start



Thanks I can try that,



At this point, my goal is to get accumulo up. I was just wondering if I can set different goal like SAFE_MODE will it come up by ignoring fate and other issues? If that comes up, can I switch back to NORMAL, will that work? I understand there may be some data loss..



-S



From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 3:36 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: [EXTERNAL EMAIL] - Re: accumulo 1.10.0 masters won't start



For values in zoo.cfg see: https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#sc_advancedConfiguration



maxSessionTimeout



In the accumulo config  - #instance.zookeepers.timeout=30s



The zookeeper setting controls the max time that the ZK servers will grant - the accumulo setting is how much time accumulo will ask for.
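
As a rough illustration of the grant-versus-ask relationship described above, the sketch below clamps a requested session timeout into the server's allowed window. It assumes the defaults described in the ZooKeeper admin guide (minSessionTimeout = 2 x tickTime, maxSessionTimeout = 20 x tickTime) and a tickTime of 2000 ms, which is only the value commonly seen in the sample zoo.cfg and may not match this cluster. The function name is made up for this example.

```python
def negotiated_timeout_ms(requested_ms, tick_time_ms=2000,
                          min_session_ms=None, max_session_ms=None):
    """Clamp a client's requested session timeout to the server's window.

    Assumption: unset bounds fall back to 2x and 20x tickTime, as the
    ZooKeeper admin guide describes for min/maxSessionTimeout.
    """
    low = 2 * tick_time_ms if min_session_ms is None else min_session_ms
    high = 20 * tick_time_ms if max_session_ms is None else max_session_ms
    return max(low, min(requested_ms, high))


if __name__ == "__main__":
    # Accumulo asking for 30s fits inside the assumed 4s..40s window.
    print(negotiated_timeout_ms(30_000))
    # Asking for 90s is capped at the server maximum unless that is raised.
    print(negotiated_timeout_ms(90_000))
    print(negotiated_timeout_ms(90_000, max_session_ms=120_000))
```

Under these assumptions, raising only instance.zookeeper.timeout to 90s would not be enough on its own; maxSessionTimeout in zoo.cfg would also need to allow it.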








________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 3:03 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start



Thanks for the response.



No, I have not updated any timeout.



Does that go in zoo.cfg? I can see there is min/maxSessionTimeout of 2/20; is that what you are referring to?



-S

________________________________

From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 2:51 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start



Have you tried to increase the zoo session timeout value? I think it's zookeeper.session.timeout.ms



________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 2:47 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start



Thanks



That fixed the goal state issue, but I am still getting errors with zookeeper, e.g.



KeeperErrorCode = ConnectionLoss for



/accumulo/<instance-id>/config/tserver.hold.time.max

/accumulo/<instance-id>/tables

/accumulo/<instance-id>/tables/1/name

/accumulo/<instance-id>/fate

/accumulo/<instance-id>/masters/goal_state



So it is all over the place. For some of these I see good values in zookeeper, so not sure..  🙁



-S



________________________________

From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 2:22 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start



There is a utility, SetGoalState, that can be run from the command line:



accumulo SetGoalState NORMAL



(or SAFE_MODE, CLEAN_STOP)



It sets a value in ZK at /accumulo/<instance-id>/masters/goal_state (the node is named managers/goal_state in newer Accumulo releases).



Ed Coleman



________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 1:54 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start



Well,



I just went ahead and deleted fate in zookeeper and restarted the master. It was doing better, but now I am getting a different error:



ERROR: Problem getting real goal state from zookeeper: java.lang.IllegalArgumentException: No enum constant org.apache.accumulo.core.master.thrift.MasterGoalState



I hope I didn't delete goal_state accidentally ;-( Currently ls on goal_state returns []. Is there a way to add some value there?



-S

________________________________

From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 1:32 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start



Did you try setting the increased size in the zkCli.sh command (or wherever it gets its environment from)?



The ZK docs indicate that it needs to be set to the same size on all servers and clients.



You should be able to use zkCli.sh to at least see what's going on - if that does not work, then it seems unlikely that the master would either.



Can you:

  *   list the nodes under /accumulo/[instance id]/fate?
  *   use the stat command on each of the nodes - the size is one of the fields.
  *   list nodes under any of the /accumulo/[instance id]/fate/tx-##### nodes.
  *   there should be a node named debug - doing a get on that should show the op name.
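
To make the stat step a little more concrete, here is a small sketch that pulls the numeric fields (including dataLength and numChildren) out of the text that zkCli.sh prints for a stat command. The sample text and its numbers are invented for illustration and are not output from this cluster; the helper name is hypothetical.

```python
def parse_zk_stat(text):
    """Return the integer fields of a zkCli.sh `stat` printout as a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if not sep:
            continue  # skip lines that are not "name = value"
        try:
            # base 0 accepts both plain decimals and 0x-prefixed zxids
            fields[key.strip()] = int(value.strip(), 0)
        except ValueError:
            pass  # ctime / mtime are dates, not integers
    return fields


# Hypothetical printout, shaped like zkCli.sh stat output.
SAMPLE = """\
cZxid = 0x100000003
ctime = Wed Feb 09 11:00:00 EST 2022
dataVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 41250
"""

if __name__ == "__main__":
    stat = parse_zk_stat(SAMPLE)
    print(stat["dataLength"], stat["numChildren"])
```

A node with a small dataLength but a very large numChildren may still produce an oversized reply when its children are listed, so both fields are worth a look.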



Ed Coleman

________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 12:54 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start



Thanks



I added



-Djute.maxbuffer=30000000



in conf/java.env and restarted all zookeepers, but I am still getting the same error. The documentation is kind of fuzzy on setting this property, as it states the default in hex (0xfffff), so I am not 100% sure if 30000000 is ok, but at least I could see zookeeper was up.



-S



________________________________

From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 12:26 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start



Does the monitor or any of the logs show errors that relate to exceeding the ZooKeeper jute buffer size?



If so, have you tried increasing the ZooKeeper jute.maxbuffer limit (https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#Unsafe+Options)?



Ed Coleman





________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 11:49 AM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: accumulo 1.10.0 masters won't start



Hello,



Both of my masters are stuck with zookeeper errors:



IOException: Packet len 2791093 is out of range!

KeeperErrorCode = ConnectionLoss for /accumulo/<instance_id>/fate





If I use zkCli to see what is under fate, I get



IOException Packet len 2791161 is out of range

Unable to read additional data from server sessionid xxxx, likely server has closed socket



hdfs fsck is all good



How can I clear this fate?



The master process is up and I can get into the accumulo shell, but there are no fate operations (fate print returns empty).



Any idea how to bring the master up?



Thanks



S

Re: accumulo 1.10.0 masters won't start

Posted by "Ligade, Shailesh [USA]" <Li...@bah.com>.

Well,

I am trying to merge a large table (8T with 72k tablets, with the default tablet size of 1G).

Since I keep getting those zookeeper errors related to size, I keep bumping jute.maxbuffer, and now it is all the way up to 8m.

But I still can't merge, even for a small subset (-b and -e). Now the error is: Xid out of order. Got Xid xx with err -101 expected Xid yy for a packet with details:

After this the master crashes.

Any suggestions on how to go about merging this large table?

Thanks

-S
________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Thursday, February 10, 2022 9:20 AM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start

Thanks Ed,

That saved the day. The confusing part of setting that property is that the documentation is unclear about whether it needs hex or bytes, etc. Even the example they provide here

https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit

states they are setting the value to 2mb, but at first glance the value looks like 200k (a 2 followed by five zeros):

------------------------------------

Add the following line to increase the file size limit to 2MB:

SOLR_OPTS="$SOLR_OPTS -Djute.maxbuffer=0x200000"

-------------------------------------
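
A note on the notation in the quoted example: the 0x prefix marks a hexadecimal value, so 0x200000 is 2 x 16^5 = 2,097,152 bytes, which is 2 MiB rather than 200,000. The short sketch below converts a few settings mentioned in this thread. It assumes the JVM accepts both decimal and 0x-prefixed values for this property, which is my understanding of how ZooKeeper reads it but is worth checking against the documentation for your version; the function name is made up for the example.

```python
def jute_maxbuffer_bytes(value):
    """Interpret a jute.maxbuffer setting given as decimal or 0x-prefixed hex."""
    return int(value, 0)  # base 0: "0x200000" is hex, "300000" is decimal


if __name__ == "__main__":
    for setting in ("0x200000", "0xfffff", "300000", "4194304"):
        size = jute_maxbuffer_bytes(setting)
        print(f"{setting:>9} = {size:>8} bytes = {size / 1024 / 1024:.2f} MiB")
```

Note that 0xfffff works out to 1048575, the same value the master log reported as its jute.maxbuffer.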

Anyway, a master has been up and running for an hour now, so I am just trying to understand what was changed and will revert it after things stabilize.

Thanks a bunch.

-S

________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 5:54 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

You might want to set the accumulo (zookeeper client) side - by setting ACCUMULO_JAVA_OPTS that is processed in accumulo-env.sh (or just edit that file?)

Looking at the Zookeeper documentation it describes what looks like you are seeing:

When jute.maxbuffer in the client side is less than the server side, the client wants to read the data exceeds jute.maxbuffer in the client side, the client side will get java.io.IOException: Unreasonable length or Packet len is out of range!

Also, a search showed jira tickets that had a server side limit of 4MB, but client limits of 1MB - you may want to see if 4194304 (or larger) as a value works.
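
To illustrate the client-side check described in that documentation excerpt, here is a minimal sketch using the numbers from this thread: the reported packet length of 2791093, the 1048575 the master logged as its own jute.maxbuffer, and the suggested 4194304. The exact comparison ZooKeeper performs is an assumption on my part, and both function names are hypothetical.

```python
def can_read(packet_len, client_maxbuffer):
    """True if a reply of packet_len bytes fits under the client's limit."""
    return 0 <= packet_len < client_maxbuffer


def effective_limit(*limits):
    """The smallest limit among the servers and clients involved wins."""
    return min(limits)


if __name__ == "__main__":
    packet_len = 2791093  # from "Packet len 2791093 is out of range!"

    # Server raised to 30000000, but the Accumulo client is still at 1048575.
    limit = effective_limit(30_000_000, 1_048_575)
    print(limit, can_read(packet_len, limit))

    # Both sides raised to 4194304.
    limit = effective_limit(4_194_304, 4_194_304)
    print(limit, can_read(packet_len, limit))
```

This is why raising the value only in ZooKeeper's conf/java.env did not change the error: the reader on the Accumulo side was still applying its own smaller limit.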

________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 5:25 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start

jute.maxbuffer is a ZooKeeper property - it needs to be set on the zookeeper configuration.  If this is still correct, then it looks like there are a few options: https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit

But maybe the ZooKeeper documentation for your version can provide additional guidance?


________________________________
From: Shailesh Ligade <SL...@FBI.GOV>
Sent: Wednesday, February 9, 2022 5:02 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: RE: accumulo 1.10.0 masters won't start


Thanks



Even if I set jute.maxbuffer on zookeeper in conf/java.env file to



-Djute.maxbuffer=300000



I see in the accumulo master log:



INFO: jute.maxbuffer value is 1048575 Bytes



I am not sure where to set that on the accumulo side.



I set instance.zookeeper.timeout value to 90s in accumulo-site.xml



But still get those zookeeper KeeperErrorCode errors



-S



From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 4:27 PM
To: user@accumulo.apache.org
Subject: [EXTERNAL EMAIL] - Re: accumulo 1.10.0 masters won't start



I would not recommend setting the goal state directly until there are no other alternatives.



It is hard to recommend what to do, because it is unclear what put you into the current situation and what impact your attempts to fix things might have had:



Why did the goal state become unset in the first place?

What did you put into the fates that increased the need for larger jute buffers?



It could be that the number of tables and servers pushed you over the limit - or it could be something else.



What I would do:



Shut down accumulo and make sure all services / tservers are stopped.

Shut down any other services that might be using ZooKeeper.

Shut down ZooKeeper.



Set the larger jute.maxbuffer and increase the timeout values across the board and in any dependent services.



Start hdfs - if you needed to shut it down.

Start just zookeeper - and use zkCli.sh to examine the state of things.  If that looks okay:

Start just the master - how far does it come up?  It will not be able to load the root / metadata tables, but it may give some indication of state.



I'd then cycle between stopping the master and trying to clean things up using zkCli.sh, guided by any errors the master is generating. If that looks promising, then:



With the master stopped - start the tservers and check a few logs. If there are exceptions, determine whether they point to a real issue or are just transient and handled.



Once the tservers are up and looking okay - start the master.



One of the things to grab as soon as you can get the shell to run is a listing of the tables and their ids.  If the worst happens, you can use that to map the existing data into a "new" instance. Hopefully it will not come to that and you will not need it - but if you don't have it and you need it, well... The table names and ids are all in ZooKeeper.



Ed Coleman




Re: accumulo 1.10.0 masters won't start

Posted by "Ligade, Shailesh [USA]" <Li...@bah.com>.

Thanks Ed,

That saved the day. The confusing part of setting that property is that the documentation is unclear about whether it needs hex or bytes, etc. Even the example they provide here

https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit

states they are setting the value to 2mb, but at first glance the value looks like 200k (a 2 followed by five zeros):

------------------------------------

Add the following line to increase the file size limit to 2MB:

SOLR_OPTS="$SOLR_OPTS -Djute.maxbuffer=0x200000"

-------------------------------------

Anyway, a master has been up and running for an hour now, so I am just trying to understand what was changed and will revert it after things stabilize.

Thanks a bunch.

-S

________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 5:54 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

You might want to set the accumulo (zookeeper client) side - by setting ACCUMULO_JAVA_OPTS that is processed in accumulo-env.sh (or just edit that file?)

Looking at the Zookeeper documentation it describes what looks like you are seeing:

When jute.maxbuffer in the client side is less than the server side, the client wants to read the data exceeds jute.maxbuffer in the client side, the client side will get java.io.IOException: Unreasonable length or Packet len is out of range!

Also, a search showed jira tickets that had a server side limit of 4MB, but client limits of 1MB - you may want to see if 4194304 (or larger) as a value works,

________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 5:25 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start

jute.maxbuffer is a ZooKeeper property - it needs to be set on the zookeeper configuration.  If this is still correct, then it looks like there are a few options https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit<https://urldefense.com/v3/__https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html*increasing-the-file-size-limit__;Iw!!May37g!ewlkGRNFLrKEpeF1Lz8vRt_oBtpgi8hVvvnCrp1Dq4_8Xprb4tEHWiHVFW7EVUT3vA$>

But maybe the ZooKeeper documentation for your version can provide additional guidance?
Setting Up an External ZooKeeper Ensemble | Apache Solr Reference Guide 7.4 - The Apache Software Foundation<https://urldefense.com/v3/__https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html*increasing-the-file-size-limit__;Iw!!May37g!ewlkGRNFLrKEpeF1Lz8vRt_oBtpgi8hVvvnCrp1Dq4_8Xprb4tEHWiHVFW7EVUT3vA$>
The solution to this problem is to set up an external ZooKeeper ensemble, which is a number of servers running ZooKeeper that communicate with each other to coordinate the activities of the cluster.
solr.apache.org


________________________________
From: Shailesh Ligade <SL...@FBI.GOV>
Sent: Wednesday, February 9, 2022 5:02 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: RE: accumulo 1.10.0 masters won't start


Thanks



Even if I set jute.maxbuffer on zookeeper in conf/java.env file to



-Djute.maxbuffer=300000



I see in accumulo master log as



INFO: jute.maxbuffer value is 1048575 Bytes    not sure where to set that on accumulo side.



I set instance.zookeeper.timeout value to 90s in accumulo-site.xml



But still get those zookeeper KeeperErrorCode errors



-S



From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 4:27 PM
To: user@accumulo.apache.org
Subject: [EXTERNAL EMAIL] - Re: accumulo 1.10.0 masters won't start



I would not recommend setting the goal state directly unlit there are no other alternatives.



It is hard to recommend what to do, because it is unclear what put you into the current situation and what action / impact you might have had trying to fix things -



why did the goal state become unset in the first place?

what did you stuff into the fates that increased the need for larger jute buffers?



It could be that the number of tables and servers pushed you over the limit - or it could be something else.



What I would do.



Shutdown accumulo and make sure all services / tservers are stopped.

Shutdown any other services that might be using ZooKeeper.

Shutdown ZooKeeper.



Set the larger jute.buffer and increase the timeout values across the board and in any dependent services.



Start hdfs - if you needed to shut it down.

Start just zookeeper - and use zkCli.sh to examine the state of things.  If that looks okay.

Start just the master - how far does it come up?  It will not be able to load the root / metadata tables, but it may give some indication of state,



I'd then cycle between stopping the master, trying to clean-up things using zkCli.sh using any guidance with errors the master is generating. If that looks promising, then:



With the master stopped - start the tservers and check a few logs if there are exceptions determine if they are they something that is pointing to an issue - or just something that is transient and handled.



Once the tservers are up and looking okay - start the master.



One of the things to grab as soon as you can get the shell to run - get a listing of the tables and the ids.  If the worst happens, you can use that to map the existing data into a "new" instance. Hopefully it will not come to that and you will not need it - but if you don't have it and you need it, well... The table names and id are all in ZooKeeper.



Ed Coleman



________________________________

From: Shailesh Ligade <SL...@FBI.GOV>>
Sent: Wednesday, February 9, 2022 3:47 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: RE: accumulo 1.10.0 masters won't start



Thanks I can try that,



At this point, my goal is to get accumulo up. I was just wondering if I can set different goal like SAFE_MODE will it come up by ignoring fate and other issues? If that comes up, can I switch back to NORMAL, will that work? I understand there may be some data loss..



-S



From: dev1 <de...@etcoleman.com>>
Sent: Wednesday, February 9, 2022 3:36 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: [EXTERNAL EMAIL] - Re: accumulo 1.10.0 masters won't start



For values in zoo.cfg see: https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#sc_advancedConfiguration<https://urldefense.com/v3/__https://usg02.safelinks.protection.office365.us/?url=https*3A*2F*2Fzookeeper.apache.org*2Fdoc*2Fr3.5.9*2FzookeeperAdmin.html*23sc_advancedConfiguration&data=04*7C01*7CSLIGADE*40FBI.GOV*7Cb7b8be92faf64fbc95ff08d9ec13044d*7C022914a9b95f4b7bbace551ce1a04071*7C0*7C0*7C637800390698440068*7CUnknown*7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0*3D*7C3000&sdata=QaGFX7kcHJeiIN73G5bfDDEQNgxN0F7QdyJ9fO3SJzA*3D&reserved=0__;JSUlJSUlJSUlJSUlJSUlJSUlJSU!!May37g!ewlkGRNFLrKEpeF1Lz8vRt_oBtpgi8hVvvnCrp1Dq4_8Xprb4tEHWiHVFW6UcuwVgA$>



maxSessionTimeout



In the accumulo config  - #instance.zookeepers.timeout=30s



The zookeeper setting controls the max time that the ZK servers will grant - the accumulo setting is how much time accumulo will ask for.



ZooKeeper: Because Coordinating Distributed Systems is a Zoo<https://urldefense.com/v3/__https://usg02.safelinks.protection.office365.us/?url=https*3A*2F*2Fzookeeper.apache.org*2Fdoc*2Fr3.5.9*2FzookeeperAdmin.html*23sc_advancedConfiguration&data=04*7C01*7CSLIGADE*40FBI.GOV*7Cb7b8be92faf64fbc95ff08d9ec13044d*7C022914a9b95f4b7bbace551ce1a04071*7C0*7C0*7C637800390698440068*7CUnknown*7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0*3D*7C3000&sdata=QaGFX7kcHJeiIN73G5bfDDEQNgxN0F7QdyJ9fO3SJzA*3D&reserved=0__;JSUlJSUlJSUlJSUlJSUlJSUlJSU!!May37g!ewlkGRNFLrKEpeF1Lz8vRt_oBtpgi8hVvvnCrp1Dq4_8Xprb4tEHWiHVFW6UcuwVgA$>

Trace Mask Bit Values ; 0b0000000000 : Unused, reserved for future use. 0b0000000010 : Logs client requests, excluding ping requests. 0b0000000100 : Unused, reserved ...

zookeeper.apache.org





________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>>
Sent: Wednesday, February 9, 2022 3:03 PM
To: user@accumulo.apache.org<ma...@accumulo.apache.org> <us...@accumulo.apache.org>>
Subject: Re: accumulo 1.10.0 masters won't start



thanks for response,



no i have not update any timeout



is that going in zoo.cfg? I can see there is min/maxSessionTimeout 2/20, is that what are you refering to?



-S

________________________________

From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 2:51 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start



Have you tried to increase the zoo session timeout value? I think it's zookeeper.session.timeout.ms



________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 2:47 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start



Thanks



That fixed the goal state issue, but I am still getting errors with ZooKeeper, e.g.

KeeperErrorCode = ConnectionLoss for

/accumulo/<instance-id>/config/tserver.hold.time.max

/accumulo/<instance-id>/tables

/accumulo/<instance-id>/tables/1/name

/accumulo/<instance-id>/fate

/accumulo/<instance-id>/masters/goal_state

So it is all over the place. For some of these I see good values in ZooKeeper, so I am not sure..  🙁



-S



________________________________

From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 2:22 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start



There is a utility, SetGoalState, that can be run from the command line:

accumulo SetGoalState NORMAL

(or SAFE_MODE, CLEAN_STOP)

It sets a value in ZK at /accumulo/<instance-id>/masters/goal_state in 1.10 (newer releases use managers instead of masters).



Ed Coleman



________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 1:54 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start



Well,

I just went ahead and deleted fate in ZooKeeper and restarted the master. It was doing better, but now I am getting a different error:

ERROR: Problem getting real goal state from zookeeper: java.lang.IllegalArgumentException: No enum constant org.apache.accumulo.core.master.thrift.MasterGoalState

I hope I didn't delete goal_state accidentally ;-( Currently ls on goal_state is []. Is there a way to add some value there?
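As a rough illustration of why an empty goal_state node could produce this error: the message has the shape Java's Enum.valueOf gives when no constant matches, and the name after the class is blank. The sketch below mimics that lookup in Python. It assumes (this is not confirmed in the thread) that the master turns the node's bytes into a name and looks it up among NORMAL, SAFE_MODE and CLEAN_STOP; the class and function names here are hypothetical. Note also that ls only lists child nodes, so [] is expected for goal_state; a get on the node shows the stored value.

```python
from enum import Enum


class GoalState(Enum):
    """Hypothetical stand-in for the goal states named in this thread."""
    NORMAL = "NORMAL"
    SAFE_MODE = "SAFE_MODE"
    CLEAN_STOP = "CLEAN_STOP"


def parse_goal_state(znode_data: bytes) -> GoalState:
    """Look up the goal state by name, failing the way an enum lookup would."""
    name = znode_data.decode("utf-8")
    try:
        return GoalState[name]
    except KeyError:
        # An empty or unknown name matches no constant.
        raise ValueError(f"No enum constant GoalState.{name}") from None


print(parse_goal_state(b"NORMAL"))
try:
    parse_goal_state(b"")  # node exists but holds no data
except ValueError as err:
    print(err)
```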



-S

________________________________

From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 1:32 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start



Did you try setting the increased size for the zkCli.sh command (or wherever it gets its environment from)?



The ZK docs indicate that it needs to be set to the same size on all servers and clients.



You should be able to use zkCli.sh to at least see what's going on - if that does not work, then it seems unlikely that the master would either.



Can you:

  *   list the nodes under /accumulo/[instance_id]/fate?
  *   use the stat command on each of the nodes - the size is one of the fields.
  *   list the nodes under any of the /accumulo/[instance_id]/fate/tx-##### nodes?
  *   there should be a node named debug - doing a get on that should show the op name.
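When working through the checks above, it can help to pull the two size-related fields out of the stat output. The following is a small sketch that parses the `key = value` lines zkCli prints for stat; the sample text is abbreviated and its numbers are made up for illustration, not taken from this cluster. One thing to keep in mind (as a general observation, not something established in this thread): a very large numChildren can make the reply to ls itself large, independent of dataLength.

```python
def parse_stat(text):
    """Turn the `key = value` lines printed by zkCli's stat command into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:  # skip anything that is not a `key = value` line
            fields[key.strip()] = value.strip()
    return fields


# Illustrative, abbreviated output only; the values are invented.
SAMPLE_STAT = """\
cZxid = 0x2000001a5
mZxid = 0x2000001a5
dataVersion = 0
dataLength = 0
numChildren = 250000
"""

stat = parse_stat(SAMPLE_STAT)
# dataLength is the size of the node's own data; numChildren is the child count.
print(int(stat["dataLength"]), int(stat["numChildren"]))
```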



Ed Coleman

________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 12:54 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start



Thanks



I added



-Djute.maxbuffer=30000000



in conf/java.env and restarted all ZooKeepers, but I am still getting the same error. The documentation is kind of fuzzy on setting this property, as it states the value in hex (default 0xfffff), so I am not 100% sure if 30000000 is OK, but at least I could see ZooKeeper was up.
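For the hex-versus-decimal question, a quick conversion shows how the numbers in this thread compare. This is only arithmetic; it makes no claim about how ZooKeeper itself parses the property, and the helper name is made up.

```python
def parse_size(text):
    """Parse a size written either as decimal ("1048575") or as hex ("0xfffff")."""
    return int(text, 0)  # base 0 lets the literal's prefix choose the radix


default_limit = parse_size("0xfffff")   # the documented default, just under 1 MB
packet_len = 2791093                    # from the "Packet len ... is out of range" error

print(default_limit)                    # the default expressed in decimal
print(hex(30000000))                    # the value that was set, expressed in hex
print(packet_len > default_limit)       # the failing packet is larger than the default
print(parse_size("30000000") > packet_len)  # 30000000 is large enough for that packet
```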



-S



________________________________

From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 12:26 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start



Does the monitor or any of the logs show errors that relate to exceeding the ZooKeeper jute buffer size?



If so, have you tried increasing the ZooKeeper jute.maxbuffer limit (https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#Unsafe+Options)?



Ed Coleman





________________________________

From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 11:49 AM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: accumulo 1.10.0 masters won't start



Hello,



Both of my masters are stuck with an error on ZooKeeper:



IOException: Packet len 2791093 is out of range!

KeeperErrorCode = ConnectionLoss for /accumulo/<instance_id>/fate





If I use zkCli to see what is under fate, I get



IOException Packet len 2791161 is out of range

Unable to read additional data from server sessionid xxxx, likely server has closed socket



hdfs fsck is all good



How can I clear this fate?



The master process is up and I can get into the accumulo shell, but there are no fate operations (fate print returns empty)



Any idea how to bring the master up?



Thanks



S

Re: accumulo 1.10.0 masters won't start

Posted by dev1 <de...@etcoleman.com>.

You might want to set it on the accumulo (ZooKeeper client) side as well, by setting ACCUMULO_JAVA_OPTS, which is processed in accumulo-env.sh (or just edit that file?)

The ZooKeeper documentation describes what you appear to be seeing:

When jute.maxbuffer in the client side is less than the server side, the client wants to read the data exceeds jute.maxbuffer in the client side, the client side will get java.io.IOException: Unreasonable length or Packet len is out of range!

Also, a search showed jira tickets that had a server side limit of 4MB, but client limits of 1MB - you may want to see whether a value of 4194304 (or larger) works.
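The client/server mismatch described above can be sketched as a simple check: a reply is only readable if it fits under the limit on both ends. This is a simplified model, not ZooKeeper's actual code; the function name is invented, and the strict comparison is an assumption about the boundary case that does not matter for the values used here.

```python
def reply_is_readable(packet_len, client_max, server_max):
    """Simplified model: the packet must fit under the limit on both ends."""
    return packet_len < client_max and packet_len < server_max


PACKET_LEN = 2791093        # from the error message in this thread
CLIENT_DEFAULT = 1048575    # the value the accumulo master log reported

# Raising only the server side leaves the client unable to read the reply.
print(reply_is_readable(PACKET_LEN, CLIENT_DEFAULT, 30000000))
# Setting 4194304 on both sides is enough for a packet of this size.
print(reply_is_readable(PACKET_LEN, 4194304, 4194304))
```

In other words, changing conf/java.env on the ZooKeeper servers alone would not be expected to clear the error while the accumulo processes and zkCli still run with the smaller client-side value.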


Re: accumulo 1.10.0 masters won't start

Posted by dev1 <de...@etcoleman.com>.

jute.maxbuffer is a ZooKeeper property - it needs to be set in the ZooKeeper configuration.  If this is still correct, then it looks like there are a few options: https://solr.apache.org/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit

But maybe the ZooKeeper documentation for your version can provide additional guidance?



RE: accumulo 1.10.0 masters won't start

Posted by Shailesh Ligade <SL...@FBI.GOV>.

Thanks

Even if I set jute.maxbuffer on ZooKeeper in the conf/java.env file to

-Djute.maxbuffer=300000

I see this in the accumulo master log:

INFO: jute.maxbuffer value is 1048575 Bytes

I am not sure where to set that on the accumulo side.

I set instance.zookeeper.timeout to 90s in accumulo-site.xml

But I still get those ZooKeeper KeeperErrorCode errors

-S

From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 4:27 PM
To: user@accumulo.apache.org
Subject: [EXTERNAL EMAIL] - Re: accumulo 1.10.0 masters won't start

I would not recommend setting the goal state directly until there are no other alternatives.

It is hard to recommend what to do, because it is unclear what put you into the current situation and what impact your attempts to fix things might have had:

why did the goal state become unset in the first place?
what did you put into the fates that increased the need for larger jute buffers?

It could be that the number of tables and servers pushed you over the limit - or it could be something else.

What I would do:

Shut down accumulo and make sure all services / tservers are stopped.
Shut down any other services that might be using ZooKeeper.
Shut down ZooKeeper.

Set the larger jute.maxbuffer and increase the timeout values across the board and in any dependent services.

Start hdfs - if you needed to shut it down.
Start just zookeeper - and use zkCli.sh to examine the state of things.  If that looks okay:
Start just the master - how far does it come up?  It will not be able to load the root / metadata tables, but it may give some indication of state.

I'd then cycle between stopping the master and trying to clean things up using zkCli.sh, guided by the errors the master is generating. If that looks promising, then:

With the master stopped - start the tservers and check a few logs. If there are exceptions, determine whether they point to an issue or are just transient and handled.

Once the tservers are up and looking okay - start the master.

One of the things to grab as soon as you can get the shell to run - get a listing of the tables and their ids.  If the worst happens, you can use that to map the existing data into a "new" instance. Hopefully it will not come to that and you will not need it - but if you don't have it and you need it, well... The table names and ids are all in ZooKeeper.

Ed Coleman

Re: accumulo 1.10.0 masters won't start

Posted by dev1 <de...@etcoleman.com>.

I would not recommend setting the goal state directly unless there are no other alternatives.

It is hard to recommend what to do, because it is unclear what put you into the current situation and what impact your attempts to fix things may have had:

Why did the goal state become unset in the first place?
What did you put into the fates that increased the need for larger jute buffers?

It could be that the number of tables and servers pushed you over the limit, or it could be something else.

What I would do:

Shut down Accumulo and make sure all services / tservers are stopped.
Shut down any other services that might be using ZooKeeper.
Shut down ZooKeeper.

Set the larger jute.maxbuffer and increase the timeout values across the board and in any dependent services.

Start HDFS, if you needed to shut it down.
Start just ZooKeeper, and use zkCli.sh to examine the state of things. If that looks okay:
Start just the master. How far does it come up? It will not be able to load the root / metadata tables, but it may give some indication of state.

I'd then cycle between stopping the master and trying to clean things up with zkCli.sh, guided by any errors the master is generating. If that looks promising, then:

With the master stopped, start the tservers and check a few logs. If there are exceptions, determine whether they point to an issue or are just transient and handled.

Once the tservers are up and looking okay, start the master.

One of the things to grab as soon as you can get the shell to run is a listing of the tables and their ids. If the worst happens, you can use that to map the existing data into a "new" instance. Hopefully it will not come to that and you will not need it, but if you don't have it and you need it, well... The table names and ids are all in ZooKeeper.
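As a rough illustration of that last point, the sketch below builds a table id to table name map from a snapshot of znode paths and their data. It assumes the name is stored as UTF-8 text in a node shaped like /accumulo/<instance-id>/tables/<table-id>/name, the shape that appears in the error messages in this thread; the function name and the sample snapshot are hypothetical.

```python
import re

# Matches the per-table name node seen in this thread, e.g.
# /accumulo/<instance-id>/tables/1/name
NAME_NODE = re.compile(r"^/accumulo/[^/]+/tables/([^/]+)/name$")


def table_names_by_id(znodes: dict[str, bytes]) -> dict[str, str]:
    """Map table id -> table name from a {znode path: data} snapshot."""
    mapping = {}
    for path, data in znodes.items():
        match = NAME_NODE.match(path)
        if match:
            mapping[match.group(1)] = data.decode("utf-8")
    return mapping


# Hypothetical snapshot; the instance id, table ids, and names are made up.
snapshot = {
    "/accumulo/example-instance/tables/1/name": b"trace",
    "/accumulo/example-instance/tables/2b/name": b"events",
    "/accumulo/example-instance/tables/2b/conf/table.split.threshold": b"1G",
    "/accumulo/example-instance/fate": b"",
}

for table_id, name in sorted(table_names_by_id(snapshot).items()):
    print(f"{table_id}\t{name}")
```

Saving a listing like this somewhere outside ZooKeeper is the point; how the snapshot is collected (shell, zkCli.sh, or a client library) is left open here.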

Ed Coleman

________________________________
From: Shailesh Ligade <SL...@FBI.GOV>
Sent: Wednesday, February 9, 2022 3:47 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: RE: accumulo 1.10.0 masters won't start


Thanks, I can try that.

At this point, my goal is to get Accumulo up. If I set a different goal state, such as SAFE_MODE, will it come up and ignore the fate and other issues? If it does come up, can I then switch back to NORMAL? I understand there may be some data loss.



-S



RE: accumulo 1.10.0 masters won't start

Posted by Shailesh Ligade <SL...@FBI.GOV>.

Thanks, I can try that.

At this point, my goal is to get Accumulo up. If I set a different goal state, such as SAFE_MODE, will it come up and ignore the fate and other issues? If it does come up, can I then switch back to NORMAL? I understand there may be some data loss.

-S

From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 3:36 PM
To: user@accumulo.apache.org
Subject: [EXTERNAL EMAIL] - Re: accumulo 1.10.0 masters won't start

For values in zoo.cfg see: https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#sc_advancedConfiguration

maxSessionTimeout

In the Accumulo config: #instance.zookeeper.timeout=30s

The ZooKeeper setting controls the maximum session time that the ZK servers will grant; the Accumulo setting is how much time Accumulo will ask for.

Re: accumulo 1.10.0 masters won't start

Posted by dev1 <de...@etcoleman.com>.

For values in zoo.cfg see: https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#sc_advancedConfiguration

maxSessionTimeout

In the Accumulo config: #instance.zookeeper.timeout=30s

The ZooKeeper setting controls the maximum session time that the ZK servers will grant; the Accumulo setting is how much time Accumulo will ask for.
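A minimal sketch of that grant-versus-ask relationship, assuming the server simply clamps the requested timeout to its configured bounds and that minSessionTimeout and maxSessionTimeout are left at their documented defaults of 2 and 20 times tickTime. The tickTime of 2000 ms is an assumption (check zoo.cfg), and the function name is illustrative.

```python
def negotiated_timeout_ms(requested_ms, tick_time_ms=2000,
                          min_ticks=2, max_ticks=20):
    """Clamp the client's requested session timeout to the server's bounds."""
    lower = min_ticks * tick_time_ms
    upper = max_ticks * tick_time_ms
    return max(lower, min(requested_ms, upper))


print(negotiated_timeout_ms(30_000))  # a 30s request, as in the Accumulo default
print(negotiated_timeout_ms(90_000))  # a larger 90s request
```

With these assumed values the upper bound is 40 seconds, so asking for more on the Accumulo side would have no effect until maxSessionTimeout is raised on the ZooKeeper servers as well.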



________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 3:03 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start

Thanks for the response.

No, I have not updated any timeout.

Does that go in zoo.cfg? I can see min/maxSessionTimeout set to 2/20; is that what you are referring to?

-S
________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 2:51 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

Have you tried to increase the zoo session timeout value? I think it's zookeeper.session.timeout.ms

________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 2:47 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start


Thanks

That fixed the goal state issue, but I am still getting errors with ZooKeeper, e.g.

KeeperErrorCode = ConnectionLoss for

/accumulo/<instance-id>/config/tserver.hold.time.max
/accumulo/<instance-id>/tables
/accumulo/<instance-id>/tables/1/name
/accumulo/<instance-id>/fate
/accumulo/<instance-id>/masters/goal_state

So it is all over the place. For some of these I see good values in ZooKeeper, so I am not sure. 🙁



-S

________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 2:22 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

There is a utility, SetGoalState, that can be run from the command line:

accumulo SetGoalState NORMAL

(or SAFE_MODE, CLEAN_STOP)

It sets a value in ZK at /accumulo/<instance-id>/masters/goal_state (the node is under managers/ rather than masters/ in 2.1 and later).
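The "No enum constant" error quoted in this thread is what one would expect if the goal_state node exists but holds an empty or unrecognized value. The sketch below mimics that lookup in Python; it is an illustration only, not Accumulo's code, and the class and function names are made up.

```python
from enum import Enum


class GoalState(Enum):
    """The three goal states named in this thread."""
    CLEAN_STOP = "CLEAN_STOP"
    SAFE_MODE = "SAFE_MODE"
    NORMAL = "NORMAL"


def parse_goal_state(raw: bytes) -> GoalState:
    """Turn the znode's data into a goal state, rejecting anything else."""
    text = raw.decode("utf-8")
    try:
        return GoalState[text]
    except KeyError:
        raise ValueError(f"no goal state named {text!r}") from None


print(parse_goal_state(b"NORMAL"))
```

Calling parse_goal_state(b"") raises ValueError, the analogue of the Java exception: the lookup is by exact name, so an empty node cannot map to any state.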

Ed Coleman

________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 1:54 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start

Well,

I just went ahead and deleted fate in ZooKeeper and restarted the master. It was doing better, but now I am getting a different error:

ERROR: Problem getting real goal state from zookeeper: java.lang.IllegalArgumentException: No enum constant org.apache.accumulo.core.master.thrift.MasterGoalState

I hope I didn't delete goal_state accidentally ;-( Currently ls on goal_state returns []. Is there a way to add a value there?

-S
________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 1:32 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

Did you try setting the increased size in the zkCli.sh command (or wherever it gets its environment from)?

The ZK docs indicate that it needs to be set to the same size on all servers and clients.

You should be able to use zkCli.sh to at least see what's going on. If that does not work, then it seems unlikely that the master would work either.

Can you:

  *   list the nodes under /accumulo/[instance_id]/fate?
  *   use the stat command on each of the nodes - the size is one of the fields.
  *   list the nodes under any of the /accumulo/[instance_id]/fate/tx-##### nodes?
  *   there should be a node named debug - doing a get on that should show the op name.
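When checking many nodes, the stat output can be reduced to the two fields that matter for buffer size. The sketch below parses the "key = value" lines that zkCli prints for stat and pulls out dataLength and numChildren; the sample output is hypothetical and the numbers are invented.

```python
def parse_stat(output: str) -> dict[str, str]:
    """Parse 'key = value' lines, as printed by zkCli's stat, into a dict."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


# Hypothetical stat output; the numbers are invented for illustration.
sample = """\
cversion = 412
dataVersion = 0
dataLength = 0
numChildren = 150000
"""

stat = parse_stat(sample)
print("data bytes:", int(stat["dataLength"]))
print("children:", int(stat["numChildren"]))
```

A node with little data but very many children can still produce a large reply to ls, since the reply carries every child name, so numChildren is worth checking alongside dataLength.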

Ed Coleman
________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 12:54 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start


Thanks

I added

-Djute.maxbuffer=30000000

in conf/java.env and restarted all ZooKeepers, but I am still getting the same error. The documentation is kind of fuzzy on setting this property, as it states the default in hex (0xfffff), so I am not 100% sure that 30000000 is OK, but at least I could see ZooKeeper was up.
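To put the hex and decimal figures side by side, this small sketch converts the values mentioned in this thread to bytes. It only does arithmetic; whether a given ZooKeeper version accepts hex, decimal, or both for jute.maxbuffer is not something it can confirm, although the decimal figure in the master log (1048575) suggests a plain byte count is understood.

```python
def to_bytes(setting: str) -> int:
    """Parse a size written either as decimal or as 0x-prefixed hex."""
    return int(setting, 0)


for setting in ("0xfffff", "30000000", "0x600000"):
    size = to_bytes(setting)
    print(f"{setting} = {size} bytes ({size / 2**20:.2f} MiB)")
```

The documented default of 0xfffff works out to 1048575 bytes, just under 1 MiB, which matches the value the master logs.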



-S

________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 12:26 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

Does the monitor or any of the logs show errors that relate to exceeding the ZooKeeper jute buffer size?

If so, have you tried increasing the ZooKeeper jute.maxbuffer limit (https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#Unsafe+Options)?

Ed Coleman


________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 11:49 AM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: accumulo 1.10.0 masters won't start

Hello,

My both masters are stuck error on zookeeper:

IOException: Packet len 2791093 is out of range!
KeeperErrorCode = ConnectionLoss for /accumulo/<instance_id>/fate


if use zkCli to see what is under fate, i get

IOException Packet len 2791161 is out of range
Unable to read additional data from server sessionid xxxx, likely server has closed socket

hdfs fsck is all good

How can I clear this fate?

master process is up and I can get into accumulo shell, but there are no fate (fate print returns empty)

Any idea how to bring the master up?

Thanks

S

Re: accumulo 1.10.0 masters won't start

Posted by "Ligade, Shailesh [USA]" <Li...@bah.com>.

Thanks for the response.

No, I have not updated any timeout.

Does that go in zoo.cfg? I can see there is min/maxSessionTimeout 2/20; is that what you are referring to?
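
For reference, a rough sketch of how the session timeout is negotiated, as far as the ZooKeeper admin docs describe it: the server clamps whatever the client requests into the range [minSessionTimeout, maxSessionTimeout], and those bounds default to 2 and 20 times tickTime. The tickTime of 2000 ms is an assumption (the value in the sample zoo.cfg), the 90 s request and the 120 s maximum are hypothetical values, and negotiated_timeout_ms is a hypothetical helper, not ZooKeeper code.

```python
# Illustration only; not taken from ZooKeeper's source.
# Assumption: tickTime is 2000 ms, as in the sample zoo.cfg. Check your own zoo.cfg.

def negotiated_timeout_ms(requested_ms, tick_time_ms=2000,
                          min_session_timeout_ms=None,
                          max_session_timeout_ms=None):
    """Clamp the client's requested timeout into the server's allowed range."""
    # When the bounds are not set explicitly they default to 2x and 20x tickTime.
    lo = 2 * tick_time_ms if min_session_timeout_ms is None else min_session_timeout_ms
    hi = 20 * tick_time_ms if max_session_timeout_ms is None else max_session_timeout_ms
    return max(lo, min(requested_ms, hi))


if __name__ == "__main__":
    # A client asking for 90 s against the default bounds is capped at 20 * tickTime.
    print(negotiated_timeout_ms(90_000))
    # Raising maxSessionTimeout (hypothetical value) lets the full request through.
    print(negotiated_timeout_ms(90_000, max_session_timeout_ms=120_000))
```

If this reading is right, raising a client-side timeout has no effect beyond the server's maxSessionTimeout, so both sides may need to change.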

-S
________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 2:51 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

Have you tried to increase the zoo session timeout value? I think it's zookeeper.session.timeout.ms

________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 2:47 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start


Thanks

That fixed the goal state issue, but now I am still getting errors with ZooKeeper, e.g.

KeeperErrorCode = ConnectionLoss for

/accumulo/<instance-id>/config/tserver.hold.time.max
/accumulo/<instance-id>/tables
/accumulo/<instance-id>/tables/1/name
/accumulo/<instance-id>/fate
/accumulo/<instance-id>/masters/goal_state

So it is all over the place; for some of these I see good values in ZooKeeper, so I am not sure.  🙁



-S

________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 2:22 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

There is a utility, SetGoalState, that can be run from the command line:

accumulo SetGoalState NORMAL

(or SAFE_MODE, CLEAN_STOP)

It sets a value in ZK at /accumulo/instance-id/masters/goal_state (managers/goal_state in 2.1 and later).

Ed Coleman

________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 1:54 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start

Well,

i just went ahead and deleted fate in zookeeper and restarted the master..it was doing better but then i am getting different error

ERROR: Problem getting real goal state from zookeeper: java.lang.IllegalArgumentException: No enum constant org.apache.accumulo.core.master.thrift.MasterGoalState

I hope i didn't delete goal_state accidently ...;-( currently ls on goal_state is [], is there a way to add some value there?

-S
________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 1:32 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

Did you try setting the increased size in the zkCli.sh command (or wherever it gets it environment from?)

The ZK docs indicate that it needs to be set to the same size on all servers and clients.

You should be able to use zkCli.sh to at least see what's going on - if that does not work, then it seems unlikely that the master would either.

Can you:

  *   list the nodes under /accumulo/[instance id]/fate?
  *   use the stat command on each of the nodes - the size is one of the fields.
  *   list nodes under any of the /accumulo/[instance_id]/fate/tx-#####
  *   there should be a node named debug - doing a get on that should show the op name.

Ed Coleman
________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 12:54 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start


Thanks



I added



-Djute.maxbuffer=30000000



In conf/java.env and restarted all ZooKeepers, but I am still getting the same error. The documentation is kind of fuzzy on setting this property, as it states the default in hex (0xfffff), so I am not 100% sure 30000000 is OK, but at least I could see ZooKeeper was up.



-S

________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 12:26 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

Does the monitor or any of the logs show errors that relate to exceeding the ZooKeeper jute buffer size?

If so, have you tried increasing the ZooKeeper jute.maxbuffer limit (https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#Unsafe+Options)?

Ed Coleman


________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 11:49 AM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: accumulo 1.10.0 masters won't start

Hello,

My both masters are stuck error on zookeeper:

IOException: Packet len 2791093 is out of range!
KeeperErrorCode = ConnectionLoss for /accumulo/<instance_id>/fate


if use zkCli to see what is under fate, i get

IOException Packet len 2791161 is out of range
Unable to read additional data from server sessionid xxxx, likely server has closed socket

hdfs fsck is all good

How can I clear this fate?

master process is up and I can get into accumulo shell, but there are no fate (fate print returns empty)

Any idea how to bring the master up?

Thanks

S

Re: accumulo 1.10.0 masters won't start

Posted by dev1 <de...@etcoleman.com>.

Have you tried to increase the zoo session timeout value? I think it's zookeeper.session.timeout.ms

________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 2:47 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start

Thanks

That fixed the goal state issue, but now I am still getting errors with ZooKeeper, e.g.

KeeperErrorCode = ConnectionLoss for

/accumulo/<instance-id>/config/tserver.hold.time.max
/accumulo/<instance-id>/tables
/accumulo/<instance-id>/tables/1/name
/accumulo/<instance-id>/fate
/accumulo/<instance-id>/masters/goal_state

So it is all over the place; for some of these I see good values in ZooKeeper, so I am not sure.  🙁

-S

________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 2:22 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

There is a utility, SetGoalState, that can be run from the command line:

accumulo SetGoalState NORMAL

(or SAFE_MODE, CLEAN_STOP)

It sets a value in ZK at /accumulo/instance-id/masters/goal_state (managers/goal_state in 2.1 and later).

Ed Coleman

________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 1:54 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start

Well,

i just went ahead and deleted fate in zookeeper and restarted the master..it was doing better but then i am getting different error

ERROR: Problem getting real goal state from zookeeper: java.lang.IllegalArgumentException: No enum constant org.apache.accumulo.core.master.thrift.MasterGoalState

I hope i didn't delete goal_state accidently ...;-( currently ls on goal_state is [], is there a way to add some value there?

-S
________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 1:32 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

Did you try setting the increased size in the zkCli.sh command (or wherever it gets it environment from?)

The ZK docs indicate that it needs to be set to the same size on all servers and clients.

You should be able to use zkCli.sh to at least see what's going on - if that does not work, then it seems unlikely that the master would either.

Can you:

  *   list the nodes under /accumulo/[instance id]/fate?
  *   use the stat command on each of the nodes - the size is one of the fields.
  *   list nodes under any of the /accumulo/[instance_id]/fate/tx-#####
  *   there should be a node named debug - doing a get on that should show the op name.

Ed Coleman
________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 12:54 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start

Thanks

I added

-Djute.maxbuffer=30000000

In conf/java.env and restarted all ZooKeepers, but I am still getting the same error. The documentation is kind of fuzzy on setting this property, as it states the default in hex (0xfffff), so I am not 100% sure 30000000 is OK, but at least I could see ZooKeeper was up.

-S

________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 12:26 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

Does the monitor or any of the logs show errors that relate to exceeding the ZooKeeper jute buffer size?

If so, have you tried increasing the ZooKeeper jute.maxbuffer limit (https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#Unsafe+Options)?

Ed Coleman

________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 11:49 AM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: accumulo 1.10.0 masters won't start

Hello,

My both masters are stuck error on zookeeper:

IOException: Packet len 2791093 is out of range!
KeeperErrorCode = ConnectionLoss for /accumulo/<instance_id>/fate

if use zkCli to see what is under fate, i get

IOException Packet len 2791161 is out of range
Unable to read additional data from server sessionid xxxx, likely server has closed socket

hdfs fsck is all good

How can I clear this fate?

master process is up and I can get into accumulo shell, but there are no fate (fate print returns empty)

Any idea how to bring the master up?

Thanks

S

Re: accumulo 1.10.0 masters won't start

Posted by "Ligade, Shailesh [USA]" <Li...@bah.com>.

Thanks

That fixed the goal state issue, but now I am still getting errors with ZooKeeper, e.g.

KeeperErrorCode = ConnectionLoss for

/accumulo/<instance-id>/config/tserver.hold.time.max
/accumulo/<instance-id>/tables
/accumulo/<instance-id>/tables/1/name
/accumulo/<instance-id>/fate
/accumulo/<instance-id>/masters/goal_state

So it is all over the place; for some of these I see good values in ZooKeeper, so I am not sure.  🙁

-S

________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 2:22 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

There is a utility, SetGoalState, that can be run from the command line:

accumulo SetGoalState NORMAL

(or SAFE_MODE, CLEAN_STOP)

It sets a value in ZK at /accumulo/instance-id/masters/goal_state (managers/goal_state in 2.1 and later).

Ed Coleman

________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 1:54 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start

Well,

i just went ahead and deleted fate in zookeeper and restarted the master..it was doing better but then i am getting different error

ERROR: Problem getting real goal state from zookeeper: java.lang.IllegalArgumentException: No enum constant org.apache.accumulo.core.master.thrift.MasterGoalState

I hope i didn't delete goal_state accidently ...;-( currently ls on goal_state is [], is there a way to add some value there?

-S
________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 1:32 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

Did you try setting the increased size in the zkCli.sh command (or wherever it gets it environment from?)

The ZK docs indicate that it needs to be set to the same size on all servers and clients.

You should be able to use zkCli.sh to at least see what's going on - if that does not work, then it seems unlikely that the master would either.

Can you:

  *   list the nodes under /accumulo/[instance id]/fate?
  *   use the stat command on each of the nodes - the size is one of the fields.
  *   list nodes under any of the /accumulo/[instance_id]/fate/tx-#####
  *   there should be a node named debug - doing a get on that should show the op name.

Ed Coleman
________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 12:54 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start


Thanks



I added



-Djute.maxbuffer=30000000



In conf/java.env and restarted all ZooKeepers, but I am still getting the same error. The documentation is kind of fuzzy on setting this property, as it states the default in hex (0xfffff), so I am not 100% sure 30000000 is OK, but at least I could see ZooKeeper was up.



-S

________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 12:26 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

Does the monitor or any of the logs show errors that relate to exceeding the ZooKeeper jute buffer size?

If so, have you tried increasing the ZooKeeper jute.maxbuffer limit (https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#Unsafe+Options)?

Ed Coleman


________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 11:49 AM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: accumulo 1.10.0 masters won't start

Hello,

My both masters are stuck error on zookeeper:

IOException: Packet len 2791093 is out of range!
KeeperErrorCode = ConnectionLoss for /accumulo/<instance_id>/fate


if use zkCli to see what is under fate, i get

IOException Packet len 2791161 is out of range
Unable to read additional data from server sessionid xxxx, likely server has closed socket

hdfs fsck is all good

How can I clear this fate?

master process is up and I can get into accumulo shell, but there are no fate (fate print returns empty)

Any idea how to bring the master up?

Thanks

S

Re: accumulo 1.10.0 masters won't start

Posted by dev1 <de...@etcoleman.com>.

There is a utility, SetGoalState, that can be run from the command line:

accumulo SetGoalState NORMAL

(or SAFE_MODE, CLEAN_STOP)

It sets a value in ZK at /accumulo/instance-id/masters/goal_state (managers/goal_state in 2.1 and later).
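
A small sketch of why an empty goal_state node would produce the "No enum constant" error quoted below: presumably the master reads the node's data and looks it up by name in the MasterGoalState enum, and an empty string matches none of the constants. The Python enum here only stands in for the Java one; the three names are the ones listed above, the numeric values are arbitrary, and parse_goal_state is a hypothetical helper.

```python
from enum import Enum


class MasterGoalState(Enum):
    """Stand-in for the Java enum named in the error message (values are arbitrary)."""
    CLEAN_STOP = 0
    SAFE_MODE = 1
    NORMAL = 2


def parse_goal_state(znode_data: bytes) -> MasterGoalState:
    """Look up the goal state by name, roughly as Java's Enum.valueOf would."""
    name = znode_data.decode("utf-8")
    try:
        return MasterGoalState[name]
    except KeyError:
        # Mirrors the shape of Java's IllegalArgumentException message.
        raise ValueError(f"No enum constant MasterGoalState.{name}") from None


if __name__ == "__main__":
    print(parse_goal_state(b"NORMAL"))   # a populated node parses fine
    try:
        parse_goal_state(b"")            # an empty node does not
    except ValueError as err:
        print(err)
```

Under that assumption, writing one of the three names back into the node is what clears the error.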

Ed Coleman

________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 1:54 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start

Well,

i just went ahead and deleted fate in zookeeper and restarted the master..it was doing better but then i am getting different error

ERROR: Problem getting real goal state from zookeeper: java.lang.IllegalArgumentException: No enum constant org.apache.accumulo.core.master.thrift.MasterGoalState

I hope i didn't delete goal_state accidently ...;-( currently ls on goal_state is [], is there a way to add some value there?

-S
________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 1:32 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

Did you try setting the increased size in the zkCli.sh command (or wherever it gets it environment from?)

The ZK docs indicate that it needs to be set to the same size on all servers and clients.

You should be able to use zkCli.sh to at least see what's going on - if that does not work, then it seems unlikely that the master would either.

Can you:

  *   list the nodes under /accumulo/[instance id]/fate?
  *   use the stat command on each of the nodes - the size is one of the fields.
  *   list nodes under any of the /accumulo/[instance_id]/fate/tx-#####
  *   there should be a node named debug - doing a get on that should show the op name.

Ed Coleman
________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 12:54 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start


Thanks



I added



-Djute.maxbuffer=30000000



In conf/java.env and restarted all ZooKeepers, but I am still getting the same error. The documentation is kind of fuzzy on setting this property, as it states the default in hex (0xfffff), so I am not 100% sure 30000000 is OK, but at least I could see ZooKeeper was up.



-S

________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 12:26 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

Does the monitor or any of the logs show errors that relate to exceeding the ZooKeeper jute buffer size?

If so, have you tried increasing the ZooKeeper jute.maxbuffer limit (https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#Unsafe+Options)?

Ed Coleman


________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 11:49 AM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: accumulo 1.10.0 masters won't start

Hello,

My both masters are stuck error on zookeeper:

IOException: Packet len 2791093 is out of range!
KeeperErrorCode = ConnectionLoss for /accumulo/<instance_id>/fate


if use zkCli to see what is under fate, i get

IOException Packet len 2791161 is out of range
Unable to read additional data from server sessionid xxxx, likely server has closed socket

hdfs fsck is all good

How can I clear this fate?

master process is up and I can get into accumulo shell, but there are no fate (fate print returns empty)

Any idea how to bring the master up?

Thanks

S

Re: accumulo 1.10.0 masters won't start

Posted by "Ligade, Shailesh [USA]" <Li...@bah.com>.

Well,

I just went ahead and deleted fate in ZooKeeper and restarted the master. It was doing better, but now I am getting a different error:

ERROR: Problem getting real goal state from zookeeper: java.lang.IllegalArgumentException: No enum constant org.apache.accumulo.core.master.thrift.MasterGoalState

I hope I didn't delete goal_state accidentally ...;-( Currently ls on goal_state is []. Is there a way to add some value there?

-S
________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 1:32 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

Did you try setting the increased size in the zkCli.sh command (or wherever it gets it environment from?)

The ZK docs indicate that it needs to be set to the same size on all servers and clients.

You should be able to use zkCli.sh to at least see what's going on - if that does not work, then it seems unlikely that the master would either.

Can you:

  *   list the nodes under /accumulo/[instance id]/fate?
  *   use the stat command on each of the nodes - the size is one of the fields.
  *   list nodes under any of the /accumulo/[instance_id]/fate/tx-#####
  *   there should be a node named debug - doing a get on that should show the op name.

Ed Coleman
________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 12:54 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start


Thanks



I added



-Djute.maxbuffer=30000000



In conf/java.env and restarted all ZooKeepers, but I am still getting the same error. The documentation is kind of fuzzy on setting this property, as it states the default in hex (0xfffff), so I am not 100% sure 30000000 is OK, but at least I could see ZooKeeper was up.



-S

________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 12:26 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

Does the monitor or any of the logs show errors that relate to exceeding the ZooKeeper jute buffer size?

If so, have you tried increasing the ZooKeeper jute.maxbuffer limit (https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#Unsafe+Options)?

Ed Coleman


________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 11:49 AM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: accumulo 1.10.0 masters won't start

Hello,

My both masters are stuck error on zookeeper:

IOException: Packet len 2791093 is out of range!
KeeperErrorCode = ConnectionLoss for /accumulo/<instance_id>/fate


if use zkCli to see what is under fate, i get

IOException Packet len 2791161 is out of range
Unable to read additional data from server sessionid xxxx, likely server has closed socket

hdfs fsck is all good

How can I clear this fate?

master process is up and I can get into accumulo shell, but there are no fate (fate print returns empty)

Any idea how to bring the master up?

Thanks

S

Re: accumulo 1.10.0 masters won't start

Posted by "Ligade, Shailesh [USA]" <Li...@bah.com>.

Hmm, I guess I can't even list anything under fate without that error.

Yes, I updated java.env on all ZooKeepers.

Can I just delete the fate node, recreate it, and see if the master comes up?

-S
________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 1:32 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

Did you try setting the increased size in the zkCli.sh command (or wherever it gets it environment from?)

The ZK docs indicate that it needs to be set to the same size on all servers and clients.

You should be able to use zkCli.sh to at least see what's going on - if that does not work, then it seems unlikely that the master would either.

Can you:

  *   list the nodes under /accumulo/[instance id]/fate?
  *   use the stat command on each of the nodes - the size is one of the fields.
  *   list nodes under any of the /accumulo/[instance_id]/fate/tx-#####
  *   there should be a node named debug - doing a get on that should show the op name.

Ed Coleman
________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 12:54 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start


Thanks



I added



-Djute.maxbuffer=30000000



In conf/java.env and restarted all ZooKeepers, but I am still getting the same error. The documentation is kind of fuzzy on setting this property, as it states the default in hex (0xfffff), so I am not 100% sure 30000000 is OK, but at least I could see ZooKeeper was up.



-S

________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 12:26 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

Does the monitor or any of the logs show errors that relate to exceeding the ZooKeeper jute buffer size?

If so, have you tried increasing the ZooKeeper jute.maxbuffer limit (https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#Unsafe+Options)?

Ed Coleman


________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 11:49 AM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: accumulo 1.10.0 masters won't start

Hello,

My both masters are stuck error on zookeeper:

IOException: Packet len 2791093 is out of range!
KeeperErrorCode = ConnectionLoss for /accumulo/<instance_id>/fate


if use zkCli to see what is under fate, i get

IOException Packet len 2791161 is out of range
Unable to read additional data from server sessionid xxxx, likely server has closed socket

hdfs fsck is all good

How can I clear this fate?

master process is up and I can get into accumulo shell, but there are no fate (fate print returns empty)

Any idea how to bring the master up?

Thanks

S

Re: accumulo 1.10.0 masters won't start

Posted by dev1 <de...@etcoleman.com>.

Did you try setting the increased size for the zkCli.sh command (or wherever it gets its environment from)?

The ZK docs indicate that it needs to be set to the same size on all servers and clients.

You should be able to use zkCli.sh to at least see what's going on - if that does not work, then it seems unlikely that the master would work either.

Can you:

  *   list the nodes under /accumulo/[instance_id]/fate?
  *   use the stat command on each of the nodes - the size is one of the fields.
  *   list the nodes under any of the /accumulo/[instance_id]/fate/tx-##### nodes.
  *   there should be a node named debug - doing a get on that should show the op name.
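
To make the stat step concrete, here is a minimal sketch that pulls dataLength and numChildren out of zkCli's stat output. The sample text is made up for illustration (it is not from this cluster), and parse_stat is a hypothetical helper.

```python
def parse_stat(stat_output: str) -> dict:
    """Parse 'key = value' lines from zkCli stat output into a dict of strings."""
    fields = {}
    for line in stat_output.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:  # skip anything that is not a 'key = value' line
            fields[key.strip()] = value.strip()
    return fields


# Made-up sample in the shape zkCli prints; the values are illustrative only.
SAMPLE = """\
cZxid = 0x100000002
cversion = 7
dataVersion = 0
ephemeralOwner = 0x0
dataLength = 0
numChildren = 3
"""

if __name__ == "__main__":
    stat = parse_stat(SAMPLE)
    print(int(stat["dataLength"]), int(stat["numChildren"]))
```

Both fields are worth checking: a parent with a very large numChildren could plausibly overflow the buffer on ls even when every dataLength is small.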

Ed Coleman
________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 12:54 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: Re: accumulo 1.10.0 masters won't start


Thanks



I added



-Djute.maxbuffer=30000000



In conf/java.env and restart all zookeepers but still getting the same error.. documentation is kind of fuzzy on setting this property as it states in hex (default 0xffff) so not 100% sure if 30000000 is ok, but atleast I could see zookeeper was up



-S

________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 12:26 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

Does the monitor or any of the logs show errors that relate to exceeding the ZooKeeper jute buffer size?

If so, have you tried increasing the ZooKeeper jute.maxbuffer limit (https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#Unsafe+Options)?

Ed Coleman


________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 11:49 AM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: accumulo 1.10.0 masters won't start

Hello,

My both masters are stuck error on zookeeper:

IOException: Packet len 2791093 is out of range!
KeeperErrorCode = ConnectionLoss for /accumulo/<instance_id>/fate


if use zkCli to see what is under fate, i get

IOException Packet len 2791161 is out of range
Unable to read additional data from server sessionid xxxx, likely server has closed socket

hdfs fsck is all good

How can I clear this fate?

master process is up and I can get into accumulo shell, but there are no fate (fate print returns empty)

Any idea how to bring the master up?

Thanks

S

Re: accumulo 1.10.0 masters won't start

Posted by "Ligade, Shailesh [USA]" <Li...@bah.com>.

Thanks

I added

-Djute.maxbuffer=30000000

In conf/java.env and restarted all ZooKeepers, but I am still getting the same error. The documentation is kind of fuzzy on setting this property, as it states the default in hex (0xfffff), so I am not 100% sure 30000000 is OK, but at least I could see ZooKeeper was up.
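
On the hex-versus-decimal question: the ZooKeeper server appears to read this property with Java's Integer.getInteger, which accepts decimal as well as 0x-prefixed hex, so a plain decimal value such as 30000000 should be safe either way. The sketch below mimics that decoding in Python and checks the two packet lengths from the error messages against the documented default. decode_int and fits are hypothetical helpers, the octal handling of Java's decoder is left out, and the "length >= limit is rejected" rule is an assumption about the client-side check.

```python
def decode_int(text: str) -> int:
    """Decode a decimal or 0x-prefixed hex integer (a subset of Java's Integer.decode)."""
    text = text.strip()
    if text.lower().startswith("0x"):
        return int(text, 16)
    return int(text, 10)


DEFAULT_MAXBUFFER = decode_int("0xfffff")   # documented default, just under 1 MB


def fits(packet_len: int, maxbuffer: int) -> bool:
    """Assumed check: a packet is rejected when its length is >= the limit."""
    return 0 <= packet_len < maxbuffer


if __name__ == "__main__":
    for packet_len in (2791093, 2791161):   # lengths from the error messages
        print(packet_len,
              "default:", fits(packet_len, DEFAULT_MAXBUFFER),
              "30000000:", fits(packet_len, decode_int("30000000")))
```

Both lengths exceed the default and fit comfortably under 30000000, so if the error persists after the change, the setting is probably not reaching the process that reports it.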

-S

________________________________
From: dev1 <de...@etcoleman.com>
Sent: Wednesday, February 9, 2022 12:26 PM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: [External] Re: accumulo 1.10.0 masters won't start

Does the monitor or any of the logs show errors that relate to exceeding the ZooKeeper jute buffer size?

If so, have you tried increasing the ZooKeeper jute.maxbuffer limit (https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#Unsafe+Options)?

Ed Coleman


________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 11:49 AM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: accumulo 1.10.0 masters won't start

Hello,

My both masters are stuck error on zookeeper:

IOException: Packet len 2791093 is out of range!
KeeperErrorCode = ConnectionLoss for /accumulo/<instance_id>/fate


if use zkCli to see what is under fate, i get

IOException Packet len 2791161 is out of range
Unable to read additional data from server sessionid xxxx, likely server has closed socket

hdfs fsck is all good

How can I clear this fate?

master process is up and I can get into accumulo shell, but there are no fate (fate print returns empty)

Any idea how to bring the master up?

Thanks

S

Re: accumulo 1.10.0 masters won't start

Posted by dev1 <de...@etcoleman.com>.

Does the monitor or any of the logs show errors that relate to exceeding the ZooKeeper jute buffer size?

If so, have you tried increasing the ZooKeeper jute.maxbuffer limit (https://zookeeper.apache.org/doc/r3.5.9/zookeeperAdmin.html#Unsafe+Options)?
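
A back-of-the-envelope check on the two packet lengths in the original post, under assumptions about the wire format that are not confirmed for this ZooKeeper version: a 16-byte reply header, a 4-byte child count, a 4-byte length prefix per child name, and 19-character child names under fate. The 68-byte difference between the two reported lengths is assumed to be a serialized Stat that zkCli's ls receives along with the child list. All of the constants and the estimate_children helper are hypothetical.

```python
REPLY_HEADER = 16    # assumed: xid (4) + zxid (8) + err (4)
VECTOR_COUNT = 4     # assumed: field holding the number of children
LENGTH_PREFIX = 4    # assumed: per-string length field
NAME_LEN = 19        # assumed length of each child name under fate
STAT_SIZE = 68       # assumed size of a serialized Stat


def estimate_children(packet_len: int, with_stat: bool = False) -> float:
    """Estimate how many child names fit in a get-children reply of this size."""
    payload = packet_len - REPLY_HEADER - VECTOR_COUNT
    if with_stat:
        payload -= STAT_SIZE
    return payload / (LENGTH_PREFIX + NAME_LEN)


if __name__ == "__main__":
    print(estimate_children(2791093))                  # length reported by the master
    print(estimate_children(2791161, with_stat=True))  # length reported by zkCli
```

Under these assumptions both lengths work out to the same whole number, roughly 121 thousand children. That suggests, without proving it, that the limit is being hit by the number of nodes under fate rather than by the size of any single node.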

Ed Coleman


________________________________
From: Ligade, Shailesh [USA] <Li...@bah.com>
Sent: Wednesday, February 9, 2022 11:49 AM
To: user@accumulo.apache.org <us...@accumulo.apache.org>
Subject: accumulo 1.10.0 masters won't start

Hello,

My both masters are stuck error on zookeeper:

IOException: Packet len 2791093 is out of range!
KeeperErrorCode = ConnectionLoss for /accumulo/<instance_id>/fate


if use zkCli to see what is under fate, i get

IOException Packet len 2791161 is out of range
Unable to read additional data from server sessionid xxxx, likely server has closed socket

hdfs fsck is all good

How can I clear this fate?

master process is up and I can get into accumulo shell, but there are no fate (fate print returns empty)

Any idea how to bring the master up?

Thanks

S