Posted to user@accumulo.apache.org by "Wonders, Jonathan (Serco NA)" <Jo...@serco-na.com> on 2022/04/22 19:30:44 UTC

Tablet Server Session Id Out of Range

Serco Business

Greetings,

The team I work with is encountering an issue when starting an Accumulo 1.7.x cluster and when running troubleshooting commands such as bin/accumulo admin checkTablets. The primary symptom is a NumberFormatException thrown within ZookeeperLockChecker when it parses the tablet server session id (Long.parseLong) from the input string "ff804d767efe0004" (which is out of range when interpreted as a positive signed long).

From what I can gather, our ZooKeeper cluster has been running for so long that the epoch component of the session id has grown to the point where the session id, interpreted as a signed long, is negative. Within the ZooKeeper code, the session id is treated as an unsigned long (e.g., Long.toHexString), which leads me to think that the Accumulo code is not parsing the value correctly. This discrepancy is present in all versions since the introduction of the ZookeeperLockChecker class.
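
The parsing mismatch described above can be reproduced in isolation. The following is a minimal sketch that uses only java.lang.Long and the session id string from the report; the class and method names are made up for illustration, and this is not Accumulo's actual code.

```java
public class SessionIdParseDemo {

    // Returns true if the hex string fits in a signed 64-bit long.
    public static boolean parsesAsSigned(String hex) {
        try {
            Long.parseLong(hex, 16);
            return true;
        } catch (NumberFormatException e) {
            // Thrown for any 16-digit hex value whose top bit is set.
            return false;
        }
    }

    public static void main(String[] args) {
        String sessionId = "ff804d767efe0004"; // value quoted in the report

        System.out.println("signed parse succeeds: " + parsesAsSigned(sessionId));

        // Unsigned parsing (Java 8+) accepts the same digits and stores the
        // bit pattern in a long, which is negative when read as signed.
        long bits = Long.parseUnsignedLong(sessionId, 16);
        System.out.println("negative as a long:    " + (bits < 0));

        // Long.toHexString renders the long as unsigned, so the round trip
        // returns the original string.
        System.out.println("round trip:            " + Long.toHexString(bits));
    }
}
```

Long.parseUnsignedLong requires Java 8 or later, which may matter for older deployments.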

There does not appear to be an easy way to work around this problem. Currently, our best idea for recovering the data from this cluster is to set up a separate ZooKeeper cluster, migrate the data we have in ZooKeeper to the new cluster, and then switch the configuration over to point to the new ZooKeeper cluster. I would appreciate any ideas or suggestions from the community.

Thanks,
Jonathan





RE: Tablet Server Session Id Out of Range

Posted by dl...@comcast.net.
Please see: https://lists.apache.org/thread/p7mwtkfpbyb551pw5k7yg61jopf50m8s

 

From: Josef Roehrl - PHEMI <jr...@phemi.com> 
Sent: Monday, June 20, 2022 7:02 PM
To: user@accumulo.apache.org
Subject: Re: Tablet Server Session Id Out of Range

 


Re: Tablet Server Session Id Out of Range

Posted by Josef Roehrl - PHEMI <jr...@phemi.com>.
Hi Jonathan,

We too have exactly this issue as of a couple of days ago.

This is in 1.7.2, with ZooKeeper 3.4.5, on a cluster that has been shut
down for a long time.

Accumulo writes the root_tablet/lastlocation node with the name of the
tserver concatenated with the ephemeral owner from ZooKeeper. For us, this
hex value is now a negative 64-bit long, and parseLong throws an exception
on it. Note that parseUnsignedLong would have worked. Also note that, at
least in Accumulo 2.0, the code that does the same thing was changed to
handle the value as a string rather than a long, avoiding the whole issue.
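
The two alternatives mentioned above (unsigned parsing, or treating the id as a string) can be sketched as follows. This is an illustration under assumptions, not the Accumulo 1.7 or 2.0 code: the class and method names are hypothetical, and the lock owner is passed in as a plain long standing in for the ephemeral owner value read from ZooKeeper.

```java
public class SessionIdCompareDemo {

    // String-based check: never parses the stored value, so the sign of the
    // underlying long is irrelevant. Long.toHexString renders the owner as
    // unsigned hex without leading zeros.
    public static boolean sameSessionByString(String storedHex, long ephemeralOwner) {
        return Long.toHexString(ephemeralOwner).equalsIgnoreCase(storedHex);
    }

    // Numeric check: parses the stored value as unsigned hex (Java 8+) and
    // compares the resulting bit patterns.
    public static boolean sameSessionByUnsignedParse(String storedHex, long ephemeralOwner) {
        return Long.parseUnsignedLong(storedHex, 16) == ephemeralOwner;
    }

    public static void main(String[] args) {
        long owner = 0xff804d767efe0004L; // negative when read as a signed long
        String stored = "ff804d767efe0004";

        System.out.println("string match:  " + sameSessionByString(stored, owner));
        System.out.println("numeric match: " + sameSessionByUnsignedParse(stored, owner));
    }
}
```

The string comparison assumes the stored value was written without leading zeros, as Long.toHexString would produce; the numeric comparison does not depend on that.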

Is there not some way/hack to reset the session id to something reasonable
somewhere in zookeeper?

Regards,


-- 

Josef Roehrl

Professional Services
Solutions Architect



*I recognize that my working hours may not be the same as yours.
Please feel free to respond only during your working hours.
Has it really been 45+ years programming?
So little time. So much to learn and do.*
*PHEMI Systems*

777 Hornby Street, Suite 600
Vancouver, BC
V6Z 1S4
604-336-1119
Website <http://www.phemi.com/> Twitter <https://twitter.com/PHEMISystems>
Linkedin
<http://www.linkedin.com/company/3561810?trk=tyah&amp;trkInfo=tarId%3A1403279580554%2Ctas%3Aphemi%20hea%2Cidx%3A1-1-1>

RE: [EXTERNAL] Re: Tablet Server Session Id Out of Range

Posted by "Wonders, Jonathan (Serco NA)" <Jo...@serco-na.com>.

Completely understand that this system is running EOL versions and that this is not a good practice. I won't attempt to rationalize it. I do appreciate the help understanding this issue despite it being related to EOL versions and hope it will be a driving force to upgrade in a timely manner going forward.

From: Christopher <ct...@apache.org>
Sent: Saturday, April 23, 2022 10:32 AM
To: accumulo-user <us...@accumulo.apache.org>
Subject: Re: [EXTERNAL] Re: Tablet Server Session Id Out of Range


Re: [EXTERNAL] Re: Tablet Server Session Id Out of Range

Posted by Christopher <ct...@apache.org>.
Upgrading ZooKeeper will be necessary to work through the issue, but it may
not be sufficient. At the very least, you should upgrade to the latest
version of ZooKeeper 3.4, which I believe is 3.4.14. Updating bugfix/patch
releases should be part of routine maintenance. Upgrading to the next minor
or major version may take more planning, but it should be noted that both
Accumulo 1.7 and ZooKeeper 3.4 are EOL by their respective upstream
communities. There are a great many bugs that have already been fixed since
those versions, that you may encounter if you don't have a maintenance plan
in place that includes upgrading to supported versions.


On Fri, Apr 22, 2022 at 4:37 PM Wonders, Jonathan (Serco NA) <
Jonathan.Wonders@serco-na.com> wrote:

> Serco Business
>
> Mike,
>
>
>
> Thanks for the suggestion. We are running Zookeeper 3.4.5 (probably
> unsurprisingly).
>
>
>
> That does sound like the issue we are seeing. I might have misinterpreted
> what I read about zookeeper because I was under the impression that the
> session ids were a function of the number of leader election cycles as
> opposed to absolute time.
>
>
>
> Is upgrading zookeeper essentially the only way to work through this issue?
>
>
>
> Thanks,
>
> Jonathan

RE: [EXTERNAL] Re: Tablet Server Session Id Out of Range

Posted by "Wonders, Jonathan (Serco NA)" <Jo...@serco-na.com>.

Mike,

Thanks for the suggestion. We are running Zookeeper 3.4.5 (probably unsurprisingly).

That does sound like the issue we are seeing. I might have misinterpreted what I read about zookeeper because I was under the impression that the session ids were a function of the number of leader election cycles as opposed to absolute time.

Is upgrading zookeeper essentially the only way to work through this issue?

Thanks,
Jonathan

From: Michael Wall <mj...@apache.org>
Sent: Friday, April 22, 2022 4:19 PM
To: Accumulo User List <us...@accumulo.apache.org>
Subject: [EXTERNAL] Re: Tablet Server Session Id Out of Range


Re: Tablet Server Session Id Out of Range

Posted by Michael Wall <mj...@apache.org>.
Hi Jonathan,

What version of ZooKeeper are you running?  Sounds like you may be hitting
https://issues.apache.org/jira/plugins/servlet/mobile#issue/ZOOKEEPER-1622.
If so, try shutting down Accumulo, then ZooKeeper.  Then upgrade ZooKeeper
and restart.

Mike
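
For readers wondering how a session id can turn negative because of the clock, here is a simplified sketch of the mechanism as ZOOKEEPER-1622 is commonly understood: the server seeds session ids from the current time in milliseconds, and a signed right shift copies the sign bit into the top byte. It is a reconstruction under that assumption, not ZooKeeper's source; the class name, method names, server id, and sample timestamp are all hypothetical.

```java
public class SessionIdSeedDemo {

    // Assumed pre-fix behavior: the signed shift (>>) sign-extends, so once
    // bit 39 of the millisecond clock is set the top byte becomes 0xff.
    public static long seedWithSignedShift(long serverId, long nowMillis) {
        long sid = (nowMillis << 24) >> 8;
        return sid | (serverId << 56);
    }

    // Assumed fixed behavior: the unsigned shift (>>>) leaves the top byte
    // clear, so it holds only the server id.
    public static long seedWithUnsignedShift(long serverId, long nowMillis) {
        long sid = (nowMillis << 24) >>> 8;
        return sid | (serverId << 56);
    }

    public static void main(String[] args) {
        long nowMillis = 0x1804D767EFEL; // hypothetical clock reading in April 2022
        long serverId = 1;               // hypothetical server id

        System.out.println(Long.toHexString(seedWithSignedShift(serverId, nowMillis)));
        System.out.println(Long.toHexString(seedWithUnsignedShift(serverId, nowMillis)));
    }
}
```

With these inputs the signed-shift variant yields a value beginning with ff, the same shape as the id in the original report, while the unsigned-shift variant stays positive. That is consistent with the problem depending on absolute time rather than on how long the ensemble has been running.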


