Posted to user@accumulo.apache.org by "Dickson, Matt MR" <ma...@defence.gov.au> on 2017/02/20 21:52:13 UTC

accumulo.root invalid table reference [SEC=UNOFFICIAL]

UNOFFICIAL

I have a situation where all tablet servers are progressively being declared dead.  From the logs the tservers report errors like:

2017-02-....  DEBUG: Scan failed, thrift error org.apache.thrift.transport.TTransportException null
(!0;1vm\\125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)

1vm was a table id that was deleted several months ago, so it appears there is some invalid reference somewhere.

Scanning the metadata table with "scan -b 1vm" returns no rows for 1vm.

A scan of the accumulo.root table returns approximately 15 rows that start with:

!0:1vm;<ip addr>::2016103 blah

How are the root table entries used and would it be safe to remove these entries since they reference a deleted table?

Thanks in advance,
Matt


Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]

Posted by Josh Elser <jo...@gmail.com>.
+1 to that. Great suggestion, Mike, and great find, Matt!

I think this would be a great thing to capture in the Accumulo User 
Manual if you're interested.

http://accumulo.apache.org/1.8/accumulo_user_manual.html#_troubleshooting

Michael Wall wrote:
> Hi Matt,
>
> Glad you got the metadata table to come up.  So some more questions for you.
>
> How many nodes do you have?
> How many tservers?
> How many tablets are hosted per tserver across all tables?
>
> If you deleted a table, those entries in the metadata table should be
> gone.  Are you still seeing stuff from the deleted table in the metadata
> table?  If all metadata entries are in one tablet, then there are no
> splits for the metadata table and running merge will not help.  After we
> see the answers to the questions above, I will try to recommend
> something else.
>
> Mike
>
> On Tue, Feb 21, 2017 at 6:22 PM Dickson, Matt MR
> <matt.dickson@defence.gov.au <ma...@defence.gov.au>> wrote:
>
>
>     *UNOFFICIAL*
>
>     Firstly, thank you for your advice, it's been very helpful.
>     Increasing the tablet server memory has allowed the metadata table
>     to come online.  From using the rfile-info and looking at the splits
>     for the metadata table it appears that all the metadata table
>     entries are in one tablet.  All tablet servers then query the one
>     node hosting that tablet.
>     I suspect the cause of this was a poorly designed table that at one
>     point the Accumulo gui reported 1.02T tablets for.  We've
>     subsequently deleted that table but it might be that there were so
>     many entries in the metadata table that all splits on it were due to
>     this massive table that had the table id 1vm.
>     To rectify this, is it safe to run a merge on the metadata table to
>     force it to redistribute?
>
>     ------------------------------------------------------------------------
>     *From:* Michael Wall [mailto:mjwall@gmail.com
>     <ma...@gmail.com>]
>     *Sent:* Wednesday, 22 February 2017 02:44
>
>     *To:* user@accumulo.apache.org <ma...@accumulo.apache.org>
>     *Subject:* Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>     Matt,
>
>     If I am reading this correctly, you have a tablet that is being
>     loaded onto a tserver.  That tserver dies, so the tablet is then
>     assigned to another tserver.  While the tablet is being loaded, that
>     tserver dies and so on.  Is that correct?
>
>     Can you identify the tablet that is bouncing around?  If so, try
>     using rfile-info -d to inspect the rfiles associated with that
>     tablet.  Also look at the rfiles that compose that tablet to see if
>     anything sticks out.
>
>     Any logs that would help explain why the tablet server is dying?
>     Can you increase the memory of the tserver?
>
>     Mike
>
>     On Tue, Feb 21, 2017 at 10:35 AM Josh Elser <josh.elser@gmail.com
>     <ma...@gmail.com>> wrote:
>
>         ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
>         communicating with ZooKeeper, will retry
>         SessionExpiredException: KeeperErrorCode = Session expired for
>         /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
>
>         There can be a number of causes for this, but here are the most
>         likely ones.
>
>         * JVM gc pauses
>         * ZooKeeper max client connections
>         * Operating System/Hardware-level pauses
>
>         The former should be noticeable by the Accumulo log. There is a
>         daemon
>         running which watches for pauses that happen and then reports
>         them. If
>         this is happening, you might have to give the process some more Java
>         heap, tweak your CMS/G1 parameters, etc.
>
>         For maxClientConnections, see
>         https://community.hortonworks.com/articles/51191/understanding-apache-zookeeper-connection-rate-lim.html
>
>         For the latter, swappiness is the most likely candidate
>         (assuming this
>         is hopping across different physical nodes), as are "transparent
>         huge
>         pages". If it is limited to a single host, things like bad NICs,
>         hard
>         drives, and other hardware issues might be a source of slowness.
>
>         On Mon, Feb 20, 2017 at 10:18 PM, Dickson, Matt MR
>         <matt.dickson@defence.gov.au
>         <ma...@defence.gov.au>> wrote:
>          > UNOFFICIAL
>          >
>          > It looks like an issue with one of the metadata table
>         tablets. On startup
>          > the server that hosts a particular metadata tablet gets
>         scanned by all other
>          > tablet servers in the cluster.  This then crashes that tablet
>         server with an
>          > error in the tserver log;
>          >
>          > ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
>          > communicating with ZooKeeper, will retry
>          > SessionExpiredException: KeeperErrorCode = Session expired for
>          >
>         /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
>          >
>          > That metadata table tablet is then transferred to another
>         host which then
>          > fails also, and so on.
>          >
>          > While the server is hosting this metadata tablet, we see the
>         following log
>          > statement from all tserver.logs in the cluster:
>          >
>          > .... [impl.ThriftScanner] DEBUG: Scan failed, thrift error
>          > org.apache.thrift.transport.TTransportException  null
>          > (!0;1vm\\;125.323.233.23::2016103<,server.com.org:9997
>         <http://server.com.org:9997>,2342423df12341d)
>          > Hope that helps complete the picture.
>          >
>          >
>          > ________________________________
>          > From: Christopher [mailto:ctubbsii@apache.org
>         <ma...@apache.org>]
>          > Sent: Tuesday, 21 February 2017 13:17
>          >
>          > To: user@accumulo.apache.org <ma...@accumulo.apache.org>
>          > Subject: Re: accumulo.root invalid table reference
>         [SEC=UNOFFICIAL]
>          >
>          > Removing them is probably a bad idea. The root table entries
>         correspond to
>          > split points in the metadata table. There is no need for the
>         tables which
>          > existed when the metadata table split to still exist for this
>         to continue to
>          > act as a valid split point.
>          >
>          > Would need to see the exception stack trace, or at least an
>         error message,
>          > to troubleshoot the shell scanning error you saw.
>          >
>          >
>          > On Mon, Feb 20, 2017, 20:00 Dickson, Matt MR
>         <matt.dickson@defence.gov.au <ma...@defence.gov.au>>
>          > wrote:
>          >>
>          >> UNOFFICIAL
>          >>
>          >> In case it is ok to remove these from the root table, how
>         can I scan the
>          >> root table for rows with a rowid starting with !0;1vm?
>          >>
>          >> Running "scan -b !0;1vm" throws an exception and exits the
>         shell.
>          >>
>          >>
>          >> -----Original Message-----
>          >> From: Dickson, Matt MR [mailto:matt.dickson@defence.gov.au
>         <ma...@defence.gov.au>]
>          >> Sent: Tuesday, 21 February 2017 09:30
>          >> To: 'user@accumulo.apache.org <ma...@accumulo.apache.org>'
>          >> Subject: RE: accumulo.root invalid table reference
>         [SEC=UNOFFICIAL]
>          >>
>          >> UNOFFICIAL
>          >>
>          >>
>          >> Does that mean I should have entries for 1vm in the metadata
>         table
>          >> corresponding to the root table?
>          >>
>          >> We are running 1.6.5
>          >>
>          >>
>          >> -----Original Message-----
>          >> From: Josh Elser [mailto:josh.elser@gmail.com
>         <ma...@gmail.com>]
>          >> Sent: Tuesday, 21 February 2017 09:22
>          >> To: user@accumulo.apache.org <ma...@accumulo.apache.org>
>          >> Subject: Re: accumulo.root invalid table reference
>         [SEC=UNOFFICIAL]
>          >>
>          >> The root table should only reference the tablets in the
>         metadata table.
>          >> It's a hierarchy: like metadata is for the user tables, root
>         is for the
>          >> metadata table.
>          >>
>          >> What version are ya running, Matt?
>          >>
>          >> Dickson, Matt MR wrote:
>          >> > *UNOFFICIAL*
>          >> >
>          >> > I have a situation where all tablet servers are
>         progressively being
>          >> > declared dead. From the logs the tservers report errors like:
>          >> > 2017-02-.... DEBUG: Scan failed thrift error
>          >> > org.apache.thrift.transport.TTransportException null
>          >> > (!0;1vm\\125.323.233.23::2016103<,server.com.org:9997
>         <http://server.com.org:9997>,2342423df12341d)
>          >> > 1vm was a table id that was deleted several months ago so
>         it appears
>          >> > there is some invalid reference somewhere.
>          >> > Scanning the metadata table "scan -b 1vm" returns no rows
>         returned for
>          >> > 1vm.
>          >> > A scan of the accumulo.root table returns approximately 15
>         rows that
>          >> > start with: !0:1vm;<ip addr>::2016103 blah  How are
>         the root
>          >> > table entries used and would it be safe to remove these
>         entries since
>          >> > they reference a deleted table?
>          >> > Thanks in advance,
>          >> > Matt
>          >
>          > --
>          > Christopher
>

Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]

Posted by Michael Wall <mj...@gmail.com>.
Hi Matt,

Glad you got the metadata table to come up.  So some more questions for you.

How many nodes do you have?
How many tservers?
How many tablets are hosted per tserver across all tables?

If you deleted a table, those entries in the metadata table should be
gone.  Are you still seeing stuff from the deleted table in the metadata
table?  If all metadata entries are in one tablet, then there are no splits
for the metadata table and running merge will not help.  After we see the
answers to the questions above, I will try to recommend something else.

Mike

On Tue, Feb 21, 2017 at 6:22 PM Dickson, Matt MR <
matt.dickson@defence.gov.au> wrote:

> *UNOFFICIAL*
> Firstly, thank you for your advice, it's been very helpful.
>
> Increasing the tablet server memory has allowed the metadata table to come
> online.  From using the rfile-info and looking at the splits for the
> metadata table it appears that all the metadata table entries are in one
> tablet.  All tablet servers then query the one node hosting that tablet.
>
> I suspect the cause of this was a poorly designed table that at one point
> the Accumulo gui reported 1.02T tablets for.  We've subsequently deleted
> that table but it might be that there were so many entries in the metadata
> table that all splits on it were due to this massive table that had the
> table id 1vm.
>
> To rectify this, is it safe to run a merge on the metadata table to force
> it to redistribute?
>
> ------------------------------
> *From:* Michael Wall [mailto:mjwall@gmail.com]
> *Sent:* Wednesday, 22 February 2017 02:44
>
> *To:* user@accumulo.apache.org
> *Subject:* Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
> Matt,
>
> If I am reading this correctly, you have a tablet that is being loaded
> onto a tserver.  That tserver dies, so the tablet is then assigned to
> another tserver.  While the tablet is being loaded, that tserver dies and
> so on.  Is that correct?
>
> Can you identify the tablet that is bouncing around?  If so, try using
> rfile-info -d to inspect the rfiles associated with that tablet.  Also look
> at the rfiles that compose that tablet to see if anything sticks out.
>
> Any logs that would help explain why the tablet server is dying?  Can you
> increase the memory of the tserver?
>
> Mike
>
> On Tue, Feb 21, 2017 at 10:35 AM Josh Elser <jo...@gmail.com> wrote:
>
> ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
> communicating with ZooKeeper, will retry
> SessionExpiredException: KeeperErrorCode = Session expired for
> /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
>
> There can be a number of causes for this, but here are the most likely
> ones.
>
> * JVM gc pauses
> * ZooKeeper max client connections
> * Operating System/Hardware-level pauses
>
> The former should be noticeable by the Accumulo log. There is a daemon
> running which watches for pauses that happen and then reports them. If
> this is happening, you might have to give the process some more Java
> heap, tweak your CMS/G1 parameters, etc.
>
> For maxClientConnections, see
>
> https://community.hortonworks.com/articles/51191/understanding-apache-zookeeper-connection-rate-lim.html
>
> For the latter, swappiness is the most likely candidate (assuming this
> is hopping across different physical nodes), as are "transparent huge
> pages". If it is limited to a single host, things like bad NICs, hard
> drives, and other hardware issues might be a source of slowness.
>
> On Mon, Feb 20, 2017 at 10:18 PM, Dickson, Matt MR
> <ma...@defence.gov.au> wrote:
> > UNOFFICIAL
> >
> > It looks like an issue with one of the metadata table tablets. On startup
> > the server that hosts a particular metadata tablet gets scanned by all
> other
> > tablet servers in the cluster.  This then crashes that tablet server
> with an
> > error in the tserver log;
> >
> > ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
> > communicating with ZooKeeper, will retry
> > SessionExpiredException: KeeperErrorCode = Session expired for
> >
> /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
> >
> > That metadata table tablet is then transferred to another host which then
> > fails also, and so on.
> >
> > While the server is hosting this metadata tablet, we see the following
> log
> > statement from all tserver.logs in the cluster:
> >
> > .... [impl.ThriftScanner] DEBUG: Scan failed, thrift error
> > org.apache.thrift.transport.TTransportException  null
> > (!0;1vm\\;125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
> > Hope that helps complete the picture.
> >
> >
> > ________________________________
> > From: Christopher [mailto:ctubbsii@apache.org]
> > Sent: Tuesday, 21 February 2017 13:17
> >
> > To: user@accumulo.apache.org
> > Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
> >
> > Removing them is probably a bad idea. The root table entries correspond
> to
> > split points in the metadata table. There is no need for the tables which
> > existed when the metadata table split to still exist for this to
> continue to
> > act as a valid split point.
> >
> > Would need to see the exception stack trace, or at least an error
> message,
> > to troubleshoot the shell scanning error you saw.
> >
> >
> > On Mon, Feb 20, 2017, 20:00 Dickson, Matt MR <
> matt.dickson@defence.gov.au>
> > wrote:
> >>
> >> UNOFFICIAL
> >>
> >> In case it is ok to remove these from the root table, how can I scan the
> >> root table for rows with a rowid starting with !0;1vm?
> >>
> >> Running "scan -b !0;1vm" throws an exception and exits the shell.
> >>
> >>
> >> -----Original Message-----
> >> From: Dickson, Matt MR [mailto:matt.dickson@defence.gov.au]
> >> Sent: Tuesday, 21 February 2017 09:30
> >> To: 'user@accumulo.apache.org'
> >> Subject: RE: accumulo.root invalid table reference [SEC=UNOFFICIAL]
> >>
> >> UNOFFICIAL
> >>
> >>
> >> Does that mean I should have entries for 1vm in the metadata table
> >> corresponding to the root table?
> >>
> >> We are running 1.6.5
> >>
> >>
> >> -----Original Message-----
> >> From: Josh Elser [mailto:josh.elser@gmail.com]
> >> Sent: Tuesday, 21 February 2017 09:22
> >> To: user@accumulo.apache.org
> >> Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
> >>
> >> The root table should only reference the tablets in the metadata table.
> >> It's a hierarchy: like metadata is for the user tables, root is for the
> >> metadata table.
> >>
> >> What version are ya running, Matt?
> >>
> >> Dickson, Matt MR wrote:
> >> > *UNOFFICIAL*
> >> >
> >> > I have a situation where all tablet servers are progressively being
> >> > declared dead. From the logs the tservers report errors like:
> >> > 2017-02-.... DEBUG: Scan failed thrift error
> >> > org.apache.thrift.transport.TTransportException null
> >> > (!0;1vm\\125.323.233.23::2016103<,server.com.org:9997
> ,2342423df12341d)
> >> > 1vm was a table id that was deleted several months ago so it appears
> >> > there is some invalid reference somewhere.
> >> > Scanning the metadata table "scan -b 1vm" returns no rows returned for
> >> > 1vm.
> >> > A scan of the accumulo.root table returns approximately 15 rows that
> >> > start with: !0:1vm;<ip addr>::2016103 blah  How are the root
> >> > table entries used and would it be safe to remove these entries since
> >> > they reference a deleted table?
> >> > Thanks in advance,
> >> > Matt
> >
> > --
> > Christopher
>
>

Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]

Posted by Michael Wall <mj...@gmail.com>.
Matt,

This is going to be a long one, sorry.  I will attempt to replicate your
issue and show you how I accomplished what I think you are trying to do.
I'll be using a single-jar mini Accumulo cluster I created at
https://github.com/mjwall/standalone-mac/tree/1.6.6.  Note it is 1.6.6.

Built and ran with

mvn clean package && java -jar
target/standalone-1.6.6-mac-shaded-0.0.1-SNAPSHOT.jar

Once it starts up, here is what you get:

Starting a Mini Accumulo Cluster:
InstanceName:       smac
Root user password: secret
Temp dir is:
 /var/folders/cd/l8dpphgn3j1gfpr2gs6yb9vjjpd1pt/T/1487858319075-0
Zookeeper is:       localhost:2181
Monitor:            http://localhost:56202
Starting a shell

Shell - Apache Accumulo Interactive Shell
-
- version: 1.6.6
- instance name: smac
- instance id: 2a19ecc3-dd7f-4a8e-9505-5577f9eff2c7
-
- type 'help' for a list of available commands
-
root@smac>

The monitor url is shown above, so I hit that to look around

It shows 2 metadata tablets at this point, which I can confirm with the
following scan:

root@smac> scan -t accumulo.root -c ~tab
!0;~ ~tab:~pr []    \x00
!0< ~tab:~pr []    \x01~

Now create a couple of tables with some splits

root@smac> createtable blah
root@smac blah> addsplits 1 2 3 4 5 6 7 8 9 0 a b c d e f g h i j k l m n o
p q r s t u v w x y z
root@smac blah> createtable blerp
root@smac blerp> addsplits 1 2 3 4 5 6 7 8 9 0 a b c d e f g h i j k l m n
o p q r s t u v w x y z

Just for reference, here are the tables

root@smac blerp> tables -l
accumulo.metadata    =>        !0
accumulo.root        =>        +r
blah                 =>         1
blerp                =>         2

So let's create some additional metadata tablets before we delete any
tables.

addsplits -t accumulo.metadata 1;1 1;3 1;5 1;7 1;9 1;a 1;c 1;e 1;g 1;i 1;k
1;m 1;o 1;q 1;s 1;u 1;w 1;y
addsplits -t accumulo.metadata 2;1 2;3 2;5 2;7 2;9 2;a 2;c 2;e 2;g 2;i 2;k
2;m 2;o 2;q 2;s 2;u 2;w 2;y
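
Typing out lists like that is error-prone; if it helps, the same split
points can be generated with a few lines of Python (a convenience sketch,
relying only on the fact that metadata rows for a table with id X start
with "X;"):

```python
# Reproduce the split-point lists from the two addsplits commands above:
# table ids 1 and 2, with every other end-row character as a split.
table_ids = ["1", "2"]
end_row_chars = "13579acegikmoqsuwy"
splits = [f"{tid};{c}" for tid in table_ids for c in end_row_chars]

print(" ".join(splits[:3]))  # 1;1 1;3 1;5
print(len(splits))           # 36
```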

So there are now 37 metadata tablets in the monitor.  Scanning
accumulo.root shows that

root@smac blerp> scan -t accumulo.root -c ~tab
!0;1;1 ~tab:~pr []    \x00
!0;1;3 ~tab:~pr []    \x011;1
!0;1;5 ~tab:~pr []    \x011;3
!0;1;7 ~tab:~pr []    \x011;5
!0;1;9 ~tab:~pr []    \x011;7
!0;1;a ~tab:~pr []    \x011;9
!0;1;c ~tab:~pr []    \x011;a
!0;1;e ~tab:~pr []    \x011;c
!0;1;g ~tab:~pr []    \x011;e
!0;1;i ~tab:~pr []    \x011;g
!0;1;k ~tab:~pr []    \x011;i
!0;1;m ~tab:~pr []    \x011;k
!0;1;o ~tab:~pr []    \x011;m
!0;1;q ~tab:~pr []    \x011;o
!0;1;s ~tab:~pr []    \x011;q
!0;1;u ~tab:~pr []    \x011;s
!0;1;w ~tab:~pr []    \x011;u
!0;1;y ~tab:~pr []    \x011;w
!0;2;1 ~tab:~pr []    \x011;y
!0;2;3 ~tab:~pr []    \x012;1
!0;2;5 ~tab:~pr []    \x012;3
!0;2;7 ~tab:~pr []    \x012;5
!0;2;9 ~tab:~pr []    \x012;7
!0;2;a ~tab:~pr []    \x012;9
!0;2;c ~tab:~pr []    \x012;a
!0;2;e ~tab:~pr []    \x012;c
!0;2;g ~tab:~pr []    \x012;e
!0;2;i ~tab:~pr []    \x012;g
!0;2;k ~tab:~pr []    \x012;i
!0;2;m ~tab:~pr []    \x012;k
!0;2;o ~tab:~pr []    \x012;m
!0;2;q ~tab:~pr []    \x012;o
!0;2;s ~tab:~pr []    \x012;q
!0;2;u ~tab:~pr []    \x012;s
!0;2;w ~tab:~pr []    \x012;u
!0;2;y ~tab:~pr []    \x012;w
!0;~ ~tab:~pr []    \x012;y
!0< ~tab:~pr []    \x01~

There are associated metadata entries as well

root@smac blerp> scan -t accumulo.metadata -b 1; -e 2; -c ~tab
1;0 ~tab:~pr []    \x00
1;1 ~tab:~pr []    \x010
1;2 ~tab:~pr []    \x011
1;3 ~tab:~pr []    \x012
1;4 ~tab:~pr []    \x013
1;5 ~tab:~pr []    \x014
1;6 ~tab:~pr []    \x015
1;7 ~tab:~pr []    \x016
1;8 ~tab:~pr []    \x017
1;9 ~tab:~pr []    \x018
1;a ~tab:~pr []    \x019
1;b ~tab:~pr []    \x01a
1;c ~tab:~pr []    \x01b
1;d ~tab:~pr []    \x01c
1;e ~tab:~pr []    \x01d
1;f ~tab:~pr []    \x01e
1;g ~tab:~pr []    \x01f
1;h ~tab:~pr []    \x01g
1;i ~tab:~pr []    \x01h
1;j ~tab:~pr []    \x01i
1;k ~tab:~pr []    \x01j
1;l ~tab:~pr []    \x01k
1;m ~tab:~pr []    \x01l
1;n ~tab:~pr []    \x01m
1;o ~tab:~pr []    \x01n
1;p ~tab:~pr []    \x01o
1;q ~tab:~pr []    \x01p
1;r ~tab:~pr []    \x01q
1;s ~tab:~pr []    \x01r
1;t ~tab:~pr []    \x01s
1;u ~tab:~pr []    \x01t
1;v ~tab:~pr []    \x01u
1;w ~tab:~pr []    \x01v
1;x ~tab:~pr []    \x01w
1;y ~tab:~pr []    \x01x
1;z ~tab:~pr []    \x01y
1< ~tab:~pr []    \x01z

Let's delete the 2 tables:

root@smac blerp> deletetable blerp
deletetable { blerp } (yes|no)? yes
Table: [blerp] has been deleted.
root@smac> deletetable blah
deletetable { blah } (yes|no)? yes
Table: [blah] has been deleted.

The metadata table is clean

root@smac> scan -t accumulo.metadata -b 1; -e 2; -c ~tab
root@smac> scan -t accumulo.metadata -b 2;  -c ~tab

The root table now has empty tablets

root@smac> scan -t accumulo.root -c ~tab
!0;1;1 ~tab:~pr []    \x00
!0;1;3 ~tab:~pr []    \x011;1
!0;1;5 ~tab:~pr []    \x011;3
!0;1;7 ~tab:~pr []    \x011;5
!0;1;9 ~tab:~pr []    \x011;7
!0;1;a ~tab:~pr []    \x011;9
!0;1;c ~tab:~pr []    \x011;a
!0;1;e ~tab:~pr []    \x011;c
!0;1;g ~tab:~pr []    \x011;e
!0;1;i ~tab:~pr []    \x011;g
!0;1;k ~tab:~pr []    \x011;i
!0;1;m ~tab:~pr []    \x011;k
!0;1;o ~tab:~pr []    \x011;m
!0;1;q ~tab:~pr []    \x011;o
!0;1;s ~tab:~pr []    \x011;q
!0;1;u ~tab:~pr []    \x011;s
!0;1;w ~tab:~pr []    \x011;u
!0;1;y ~tab:~pr []    \x011;w
!0;2;1 ~tab:~pr []    \x011;y
!0;2;3 ~tab:~pr []    \x012;1
!0;2;5 ~tab:~pr []    \x012;3
!0;2;7 ~tab:~pr []    \x012;5
!0;2;9 ~tab:~pr []    \x012;7
!0;2;a ~tab:~pr []    \x012;9
!0;2;c ~tab:~pr []    \x012;a
!0;2;e ~tab:~pr []    \x012;c
!0;2;g ~tab:~pr []    \x012;e
!0;2;i ~tab:~pr []    \x012;g
!0;2;k ~tab:~pr []    \x012;i
!0;2;m ~tab:~pr []    \x012;k
!0;2;o ~tab:~pr []    \x012;m
!0;2;q ~tab:~pr []    \x012;o
!0;2;s ~tab:~pr []    \x012;q
!0;2;u ~tab:~pr []    \x012;s
!0;2;w ~tab:~pr []    \x012;u
!0;2;y ~tab:~pr []    \x012;w
!0;~ ~tab:~pr []    \x012;y
!0< ~tab:~pr []    \x01~

I believe this to be the situation you are in.  Is that correct?

So let's merge away the splits for tables 1 and 2 into the last split for
table 2, 2;y.

root@smac> merge -?
2017-02-23 09:44:21,860 [shell.Shell.audit] INFO : root@smac> merge -?
usage: merge [-] [-?] [-b <begin-row>] [-e <end-row>] [-f] [-s <arg>] [-t
<table>] [-v]
description: merges tablets in a table
  -,--all                        allow an entire table to be merged into
one tablet without prompting the user for confirmation
  -?,--help                      display this help
  -b,--begin-row <begin-row>     begin row (exclusive)
  -e,--end-row <end-row>         end row (inclusive)
  -f,--force                     merge small tablets to large tablets, even
if it goes over the given size
  -s,--size <arg>                merge tablets to the given size over the
entire table
  -t,--table <table>             table to be merged
  -v,--verbose                   verbose output during merge
root@smac> merge -t accumulo.metadata -b 1;0 -e 2;y -v

There are now 3 metadata tablets in the monitor, as this scan shows:

root@smac> scan -t accumulo.root -c ~tab
2017-02-23 09:45:48,988 [shell.Shell.audit] INFO : root@smac> scan -t
accumulo.root -c ~tab
!0;2;y ~tab:~pr []    \x00
!0;~ ~tab:~pr []    \x012;y
!0< ~tab:~pr []    \x01~

Can you provide more details on how your situation differs from this
walkthrough?




On Wed, Feb 22, 2017 at 9:18 PM Dickson, Matt MR <
matt.dickson@defence.gov.au> wrote:

> *UNOFFICIAL*
> We are on 1.6.5; could it be that merge is not available in this
> version?
>
>
> ------------------------------
> *From:* Christopher [mailto:ctubbsii@apache.org]
> *Sent:* Thursday, 23 February 2017 12:46
>
> *To:* user@accumulo.apache.org
> *Subject:* Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
> On Wed, Feb 22, 2017 at 8:18 PM Dickson, Matt MR <
> matt.dickson@defence.gov.au> wrote:
>
> UNOFFICIAL
>
> I ran the compaction with no luck.
>
> I've had a close look at the split points on the metadata table and
> confirmed that due to the initial large table we now have 90% of the
> metadata for existing tables hosted on one tablet which creates a hotspot.
> I've now manually added better split points to the metadata table that has
> created tablets with only 4-5M entries rather than 12M+.
>
> The split points I created isolate the metadata for large tables to
> separate tablets but ideally I'd like to split these further which raises 3
> questions.
>
> 1. If I have table 1xo, is there a smart way to determine the mid point of
> the data in the metadata table eg 1xo;xxxx to allow me to create a split
> based on that?
>
> 2. I tried to merge tablets on the metadata table where the size was
> smaller than 1M but was met with a warning stating merge on the metadata
> table was not allowed. Due to the deletion of the large table we have
> several tablets with zero entries and they will never be populated.
>
>
> Hmm. That seems to ring a bell. It was a goal of moving the root tablet
> into its own table, that users would be able to merge the metadata table.
> However, we may still have an unnecessary constraint on that in the
> interface, which is no longer needed. If merging on the metadata table
> doesn't work, please file a JIRA at
> https://issues.apache.org/browse/ACCUMULO with any error messages you
> saw, so we can track it as a bug.
>
>
> 3. How should Accumulo deal with the deletion of a massive table? Should
> the metadata table redistribute the tablets to avoid hotspotting on a
> single tserver, which appears to be what's happening?
>
> Thanks for all the help so far.
>
> -----Original Message-----
> From: Josh Elser [mailto:josh.elser@gmail.com]
> Sent: Thursday, 23 February 2017 10:00
> To: user@accumulo.apache.org
> Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>
> There's likely a delete "tombstone" in another file referenced by that
> tablet which is masking those entries. If you compact the tablet, you
> should see them all disappear.
>
> Yes, you should be able to split/merge the metadata table just like any
> other table. Beware, the implications of this are system wide instead of
> localized to a single user table :)
>
> Dickson, Matt MR wrote:
> > *UNOFFICIAL*
> >
> > When I inspect the rfiles associated with the metadata table using the
> > rfile-info there are a lot of entries for the old deleted table, 1vm.
> > Querying the metadata table returns nothing for the deleted table.
> > When a table is deleted should the rfiles have any records referencing
> > the old table?
> > Also, am I able to manually create new split point on the metadata
> > table to force it to break up the large tablet?
> > ----------------------------------------------------------------------
> > --
> > *From:* Christopher [mailto:ctubbsii@apache.org]
> > *Sent:* Wednesday, 22 February 2017 15:46
> > *To:* user@accumulo.apache.org
> > *Subject:* Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
> >
> > It should be safe to merge on the metadata table. That was one of the
> > goals of moving the root tablet into its own table. I'm pretty sure we
> > have a build test to ensure it works.
> >
> > On Tue, Feb 21, 2017, 18:22 Dickson, Matt MR
> > <matt.dickson@defence.gov.au <ma...@defence.gov.au>>
> wrote:
> >
> >
> >     *UNOFFICIAL*
> >
> >     Firstly, thank you for your advice, it's been very helpful.
> >     Increasing the tablet server memory has allowed the metadata table
> >     to come online. From using the rfile-info and looking at the splits
> >     for the metadata table it appears that all the metadata table
> >     entries are in one tablet. All tablet servers then query the one
> >     node hosting that tablet.
> >     I suspect the cause of this was a poorly designed table that at one
> >     point the Accumulo gui reported 1.02T tablets for. We've
> >     subsequently deleted that table but it might be that there were so
> >     many entries in the metadata table that all splits on it were due to
> >     this massive table that had the table id 1vm.
> >     To rectify this, is it safe to run a merge on the metadata table to
> >     force it to redistribute?
> >
> >
>  ------------------------------------------------------------------------
> >     *From:* Michael Wall [mailto:mjwall@gmail.com
> >     <ma...@gmail.com>]
> >     *Sent:* Wednesday, 22 February 2017 02:44
> >
> >     *To:* user@accumulo.apache.org <ma...@accumulo.apache.org>
> >     *Subject:* Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
> >     Matt,
> >
> >     If I am reading this correctly, you have a tablet that is being
> >     loaded onto a tserver. That tserver dies, so the tablet is then
> >     assigned to another tserver. While the tablet is being loaded, that
> >     tserver dies and so on. Is that correct?
> >
> >     Can you identify the tablet that is bouncing around? If so, try
> >     using rfile-info -d to inspect the rfiles associated with that
> >     tablet. Also look at the rfiles that compose that tablet to see if
> >     anything sticks out.
> >
> >     Any logs that would help explain why the tablet server is dying? Can
> >     you increase the memory of the tserver?
> >
> >     Mike
> >
> >     On Tue, Feb 21, 2017 at 10:35 AM Josh Elser <josh.elser@gmail.com
> >     <ma...@gmail.com>> wrote:
> >
> >         ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
> >         communicating with ZooKeeper, will retry
> >         SessionExpiredException: KeeperErrorCode = Session expired for
> >
> > /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
> >
> >         There can be a number of causes for this, but here are the most
> >         likely ones.
> >
> >         * JVM gc pauses
> >         * ZooKeeper max client connections
> >         * Operating System/Hardware-level pauses
> >
> >         The former should be noticeable by the Accumulo log. There is a
> >         daemon
> >         running which watches for pauses that happen and then reports
> >         them. If
> >         this is happening, you might have to give the process some more
> Java
> >         heap, tweak your CMS/G1 parameters, etc.
> >
> >         For maxClientConnections, see
> >
> > https://community.hortonworks.com/articles/51191/understanding-apache-zookeeper-connection-rate-lim.html
> >
> >         For the latter, swappiness is the most likely candidate
> >         (assuming this
> >         is hopping across different physical nodes), as are "transparent
> >         huge
> >         pages". If it is limited to a single host, things like bad NICs,
> >         hard
> >         drives, and other hardware issues might be a source of slowness.
> >
> >         On Mon, Feb 20, 2017 at 10:18 PM, Dickson, Matt MR
> >         <matt.dickson@defence.gov.au
> >         <ma...@defence.gov.au>> wrote:
> >          > UNOFFICIAL
> >          >
> >          > It looks like an issue with one of the metadata table
> >         tablets. On startup
> >          > the server that hosts a particular metadata tablet gets
> >         scanned by all other
> >          > tablet servers in the cluster. This then crashes that tablet
> >         server with an
> >          > error in the tserver log;
> >          >
> >          > ... [zookeeper.ZooCache] WARN: Saw (possibly) transient
> exception
> >          > communicating with ZooKeeper, will retry
> >          > SessionExpiredException: KeeperErrorCode = Session expired for
> >          >
> >
>  /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
> >          >
> >          > That metadata table tablet is then transferred to another
> >         host which then
> >          > fails also, and so on.
> >          >
> >          > While the server is hosting this metadata tablet, we see the
> >         following log
> >          > statement from all tserver.logs in the cluster:
> >          >
> >          > .... [impl.ThriftScanner] DEBUG: Scan failed, thrift error
> >          > org.apache.thrift.transport.TTransportException null
> >          > (!0;1vm\\;125.323.233.23::2016103<,server.com.org:9997
> >         <http://server.com.org:9997>,2342423df12341d)
> >          > Hope that helps complete the picture.
> >          >
> >          >
> >          > ________________________________
> >          > From: Christopher [mailto:ctubbsii@apache.org
> >         <ma...@apache.org>]
> >          > Sent: Tuesday, 21 February 2017 13:17
> >          >
> >          > To: user@accumulo.apache.org <mailto:user@accumulo.apache.org
> >
> >          > Subject: Re: accumulo.root invalid table reference
> >         [SEC=UNOFFICIAL]
> >          >
> >          > Removing them is probably a bad idea. The root table entries
> >         correspond to
> >          > split points in the metadata table. There is no need for the
> >         tables which
> >          > existed when the metadata table split to still exist for this
> >         to continue to
> >          > act as a valid split point.
> >          >
> >          > Would need to see the exception stack trace, or at least an
> >         error message,
> >          > to troubleshoot the shell scanning error you saw.
> >          >
> >          >
> >          > On Mon, Feb 20, 2017, 20:00 Dickson, Matt MR
> >         <matt.dickson@defence.gov.au <mailto:matt.dickson@defence.gov.au
> >>
> >          > wrote:
> >          >>
> >          >> UNOFFICIAL
> >          >>
> >          >> In case it is ok to remove these from the root table, how
> >         can I scan the
> >          >> root table for rows with a rowid starting with !0;1vm?
> >          >>
> >          >> Running "scan -b !0;1vm" throws an exception and exits the
> >         shell.
> >          >>
> >          >>
> >          >> -----Original Message-----
> >          >> From: Dickson, Matt MR [mailto:matt.dickson@defence.gov.au
> >         <ma...@defence.gov.au>]
> >          >> Sent: Tuesday, 21 February 2017 09:30
> >          >> To: 'user@accumulo.apache.org <mailto:
> user@accumulo.apache.org>'
> >          >> Subject: RE: accumulo.root invalid table reference
> >         [SEC=UNOFFICIAL]
> >          >>
> >          >> UNOFFICIAL
> >          >>
> >          >>
> >          >> Does that mean I should have entries for 1vm in the metadata
> >         table
> >          >> corresponding to the root table?
> >          >>
> >          >> We are running 1.6.5
> >          >>
> >          >>
> >          >> -----Original Message-----
> >          >> From: Josh Elser [mailto:josh.elser@gmail.com
> >         <ma...@gmail.com>]
> >          >> Sent: Tuesday, 21 February 2017 09:22
> >          >> To: user@accumulo.apache.org <mailto:
> user@accumulo.apache.org>
> >          >> Subject: Re: accumulo.root invalid table reference
> >         [SEC=UNOFFICIAL]
> >          >>
> >          >> The root table should only reference the tablets in the
> >         metadata table.
> >          >> It's a hierarchy: like metadata is for the user tables, root
> >         is for the
> >          >> metadata table.
> >          >>
> >          >> What version are ya running, Matt?
> >          >>
> >          >> Dickson, Matt MR wrote:
> >          >> > *UNOFFICIAL*
> >          >> >
> >          >> > I have a situation where all tablet servers are
> >         progressively being
> >          >> > declared dead. From the logs the tservers report errors
> like:
> >          >> > 2017-02-.... DEBUG: Scan failed thrift error
> >          >> > org.apache.thrift.transport.TTransportException null
> >          >> > (!0;1vm\\125.323.233.23::2016103<,server.com.org:9997
> >         <http://server.com.org:9997>,2342423df12341d)
> >          >> > 1vm was a table id that was deleted several months ago so
> >         it appears
> >          >> > there is some invalid reference somewhere.
> >          >> > Scanning the metadata table "scan -b 1vm" returns no rows
> >         returned for
> >          >> > 1vm.
> >          >> > A scan of the accumulo.root table returns approximately 15
> >         rows that
> >          >> > start with: !0:1vm;<ip addr>::2016103 blah. How are
> >         the root
> >          >> > table entries used and would it be safe to remove these
> >         entries since
> >          >> > they reference a deleted table?
> >          >> > Thanks in advance,
> >          >> > Matt
> >          >> > //
> >          >
> >          > --
> >          > Christopher
> >
> > --
> > Christopher
>
> --
> Christopher
>

RE: accumulo.root invalid table reference [SEC=UNOFFICIAL]

Posted by Ed Coleman <de...@etcoleman.com>.
Mainly for the record (and apparently will not help with 1.6.5)

 

Christopher wrote: "Hmm. That seems to ring a bell. It was a goal of moving
the root tablet into its own table, that users would be able to merge the
metadata table ..."

 

I was able to merge a metadata tablet using Accumulo 1.7.2

 

I was doing some development on a small stand-alone instance, with HDFS and
Accumulo 1.7.2 all running on localhost - but everything was a "real"
instance, in case it matters. The metadata table had split a number of
times with the test data I inserted, and as I was cleaning things up for
another round of testing I decided to try merging the metadata back into a
single tablet.

 

Running merge worked, and I ended up with a single metadata tablet again -
not an exhaustive test by any means, but it does indicate that the option is
available. As my tests expand to a multiple-node cluster, I'll try to
remember to test this again.
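For the record, the sequence involved is short; here is a sketch of what that check-and-merge looked like in the Accumulo shell (1.7.x; the instance and user names are placeholders, and the commands assume a user with admin permissions on the metadata table):

```
$ accumulo shell -u root
root@instance> getsplits -t accumulo.metadata   # list current metadata split points
root@instance> merge -t accumulo.metadata       # no -b/-e range given: merge the whole table
root@instance> getsplits -t accumulo.metadata   # empty output means a single tablet again
```

As discussed elsewhere in this thread, whether merge is permitted on accumulo.metadata depends on the version, so treat this as illustrative rather than guaranteed.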

 

 

From: Dickson, Matt MR [mailto:matt.dickson@defence.gov.au] 
Sent: Wednesday, February 22, 2017 9:18 PM
To: 'user@accumulo.apache.org' <us...@accumulo.apache.org>
Subject: RE: accumulo.root invalid table reference [SEC=UNOFFICIAL]

 

UNOFFICIAL

We are on 1.6.5; could it be that the merge is not available in this
version?

 

 

  _____  

From: Christopher [mailto:ctubbsii@apache.org] 
Sent: Thursday, 23 February 2017 12:46
To: user@accumulo.apache.org <ma...@accumulo.apache.org> 
Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]

On Wed, Feb 22, 2017 at 8:18 PM Dickson, Matt MR
<matt.dickson@defence.gov.au <ma...@defence.gov.au> > wrote:

UNOFFICIAL

I ran the compaction with no luck.

I've had a close look at the split points on the metadata table and
confirmed that, due to the initial large table, we now have 90% of the
metadata for existing tables hosted on one tablet, which creates a hotspot.
I've now manually added better split points to the metadata table, which has
created tablets with only 4-5M entries rather than 12M+.

The split points I created isolate the metadata for large tables to separate
tablets, but ideally I'd like to split these further, which raises three
questions.

1. If I have table 1xo, is there a smart way to determine the midpoint of
the data in the metadata table, e.g. 1xo;xxxx, to allow me to create a split
based on that?
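One rough way to approach question 1 from the shell is to scan the metadata rows for the table and pick a row near the middle of the output as a split point. A sketch, where the table id 1xo comes from the question above and the chosen row `1xo;m...` is purely hypothetical:

```
root@instance> scan -t accumulo.metadata -b '1xo;' -e '1xo<'   # all metadata rows for table 1xo
# pick a row roughly halfway through the output, then add it as a split:
root@instance> addsplits -t accumulo.metadata '1xo;m...'       # hypothetical midpoint row
```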

2. I tried to merge tablets on the metadata table where the size was smaller
than 1M but was met with a warning stating that merge on the metadata table
was not allowed. Due to the deletion of the large table, we have several
tablets with zero entries, and they will never be populated.

 

Hmm. That seems to ring a bell. It was a goal of moving the root tablet into
its own table, that users would be able to merge the metadata table.
However, we may still have an unnecessary constraint on that in the
interface, which is no longer needed. If merging on the metadata table
doesn't work, please file a JIRA at
https://issues.apache.org/browse/ACCUMULO with any error messages you saw,
so we can track it as a bug.

 

3. How should Accumulo deal with the deletion of a massive table? Should the
metadata table redistribute the tablets to avoid hotspotting on a single
tserver, which appears to be what's happening?

Thanks for all the help so far.

-----Original Message-----
From: Josh Elser [mailto:josh.elser@gmail.com <ma...@gmail.com>
]
Sent: Thursday, 23 February 2017 10:00
To: user@accumulo.apache.org <ma...@accumulo.apache.org> 
Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]

There's likely a delete "tombstone" in another file referenced by that
tablet which is masking those entries. If you compact the tablet, you should
see them all disappear.
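For completeness, the compaction described above can be targeted at just the affected range; a sketch in the Accumulo shell (the 1vm range here mirrors the deleted table id from this thread):

```
root@instance> compact -t accumulo.metadata -b '1vm' -e '1vm<' -w   # -w waits for the compaction to finish
```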

Yes, you should be able to split/merge the metadata table just like any
other table. Beware, the implications of this are system-wide instead of
localized to a single user table :)

Dickson, Matt MR wrote:
> *UNOFFICIAL*
>
> When I inspect the rfiles associated with the metadata table using the
> rfile-info there are a lot of entries for the old deleted table, 1vm.
> Querying the metadata table returns nothing for the deleted table.
> When a table is deleted should the rfiles have any records referencing
> the old table?
> Also, am I able to manually create new split points on the metadata
> table to force it to break up the large tablet?
> ----------------------------------------------------------------------
> --
> *From:* Christopher [mailto:ctubbsii@apache.org
<ma...@apache.org> ]
> *Sent:* Wednesday, 22 February 2017 15:46
> *To:* user@accumulo.apache.org <ma...@accumulo.apache.org> 
> *Subject:* Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>
> It should be safe to merge on the metadata table. That was one of the
> goals of moving the root tablet into its own table. I'm pretty sure we
> have a build test to ensure it works.
>
> On Tue, Feb 21, 2017, 18:22 Dickson, Matt MR
> <matt.dickson@defence.gov.au <ma...@defence.gov.au>
<mailto:matt.dickson@defence.gov.au <ma...@defence.gov.au> >>
wrote:
>
>     __
>
>     *UNOFFICIAL*
>
>     Firstly, thank you for your advice; it's been very helpful.
>     Increasing the tablet server memory has allowed the metadata table
>     to come online. From using the rfile-info and looking at the splits
>     for the metadata table it appears that all the metadata table
>     entries are in one tablet. All tablet servers then query the one
>     node hosting that tablet.
>     I suspect the cause of this was a poorly designed table for which, at
>     one point, the Accumulo GUI reported 1.02T tablets. We've
>     subsequently deleted that table, but it might be that there were so
>     many entries in the metadata table that all splits on it were due to
>     this massive table, which had the table id 1vm.
>     To rectify this, is it safe to run a merge on the metadata table to
>     force it to redistribute?
>
>
------------------------------------------------------------------------
>     *From:* Michael Wall [mailto:mjwall@gmail.com
<ma...@gmail.com> 
>     <mailto:mjwall@gmail.com <ma...@gmail.com> >]
>     *Sent:* Wednesday, 22 February 2017 02:44
>
>     *To:* user@accumulo.apache.org <ma...@accumulo.apache.org>
<mailto:user@accumulo.apache.org <ma...@accumulo.apache.org> >
>     *Subject:* Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>     Matt,
>
>     If I am reading this correctly, you have a tablet that is being
>     loaded onto a tserver. That tserver dies, so the tablet is then
>     assigned to another tserver. While the tablet is being loaded, that
>     tserver dies and so on. Is that correct?
>
>     Can you identify the tablet that is bouncing around? If so, try
>     using rfile-info -d to inspect the rfiles associated with that
>     tablet. Also look at the rfiles that compose that tablet to see if
>     anything sticks out.
>
>     Any logs that would help explain why the tablet server is dying? Can
>     you increase the memory of the tserver?
>
>     Mike
>
>     On Tue, Feb 21, 2017 at 10:35 AM Josh Elser <josh.elser@gmail.com
<ma...@gmail.com> 
>     <mailto:josh.elser@gmail.com <ma...@gmail.com> >> wrote:
>
>         ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
>         communicating with ZooKeeper, will retry
>         SessionExpiredException: KeeperErrorCode = Session expired for
>
> /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
>
>         There can be a number of causes for this, but here are the most
>         likely ones.
>
>         * JVM gc pauses
>         * ZooKeeper max client connections
>         * Operating System/Hardware-level pauses
>
>         The former should be noticeable by the Accumulo log. There is a
>         daemon
>         running which watches for pauses that happen and then reports
>         them. If
>         this is happening, you might have to give the process some more
Java
>         heap, tweak your CMS/G1 parameters, etc.
>
>         For maxClientConnections, see
>
> https://community.hortonworks.com/articles/51191/understanding-apache-zookeeper-connection-rate-lim.html
>
>         For the latter, swappiness is the most likely candidate
>         (assuming this
>         is hopping across different physical nodes), as are "transparent
>         huge
>         pages". If it is limited to a single host, things like bad NICs,
>         hard
>         drives, and other hardware issues might be a source of slowness.
>
>         On Mon, Feb 20, 2017 at 10:18 PM, Dickson, Matt MR
>         <matt.dickson@defence.gov.au <ma...@defence.gov.au> 
>         <mailto:matt.dickson@defence.gov.au
<ma...@defence.gov.au> >> wrote:
>          > UNOFFICIAL
>          >
>          > It looks like an issue with one of the metadata table
>         tablets. On startup
>          > the server that hosts a particular metadata tablet gets
>         scanned by all other
>          > tablet servers in the cluster. This then crashes that tablet
>         server with an
>          > error in the tserver log;
>          >
>          > ... [zookeeper.ZooCache] WARN: Saw (possibly) transient
exception
>          > communicating with ZooKeeper, will retry
>          > SessionExpiredException: KeeperErrorCode = Session expired for
>          >
>
/accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
>          >
>          > That metadata table tablet is then transferred to another
>         host which then
>          > fails also, and so on.
>          >
>          > While the server is hosting this metadata tablet, we see the
>         following log
>          > statement from all tserver.logs in the cluster:
>          >
>          > .... [impl.ThriftScanner] DEBUG: Scan failed, thrift error
>          > org.apache.thrift.transport.TTransportException null
>          > (!0;1vm\\;125.323.233.23::2016103<,server.com.org:9997
<http://server.com.org:9997> 
>         <http://server.com.org:9997>,2342423df12341d)
>          > Hope that helps complete the picture.
>          >
>          >
>          > ________________________________
>          > From: Christopher [mailto:ctubbsii@apache.org
<ma...@apache.org> 
>         <mailto:ctubbsii@apache.org <ma...@apache.org> >]
>          > Sent: Tuesday, 21 February 2017 13:17
>          >
>          > To: user@accumulo.apache.org <ma...@accumulo.apache.org>
<mailto:user@accumulo.apache.org <ma...@accumulo.apache.org> >
>          > Subject: Re: accumulo.root invalid table reference
>         [SEC=UNOFFICIAL]
>          >
>          > Removing them is probably a bad idea. The root table entries
>         correspond to
>          > split points in the metadata table. There is no need for the
>         tables which
>          > existed when the metadata table split to still exist for this
>         to continue to
>          > act as a valid split point.
>          >
>          > Would need to see the exception stack trace, or at least an
>         error message,
>          > to troubleshoot the shell scanning error you saw.
>          >
>          >
>          > On Mon, Feb 20, 2017, 20:00 Dickson, Matt MR
>         <matt.dickson@defence.gov.au <ma...@defence.gov.au>
<mailto:matt.dickson@defence.gov.au <ma...@defence.gov.au> >>
>          > wrote:
>          >>
>          >> UNOFFICIAL
>          >>
>          >> In case it is ok to remove these from the root table, how
>         can I scan the
>          >> root table for rows with a rowid starting with !0;1vm?
>          >>
>          >> Running "scan -b !0;1vm" throws an exception and exits the
>         shell.
>          >>
>          >>
>          >> -----Original Message-----
>          >> From: Dickson, Matt MR [mailto:matt.dickson@defence.gov.au
<ma...@defence.gov.au> 
>         <mailto:matt.dickson@defence.gov.au
<ma...@defence.gov.au> >]
>          >> Sent: Tuesday, 21 February 2017 09:30
>          >> To: 'user@accumulo.apache.org
<ma...@accumulo.apache.org>  <mailto:user@accumulo.apache.org
<ma...@accumulo.apache.org> >'
>          >> Subject: RE: accumulo.root invalid table reference
>         [SEC=UNOFFICIAL]
>          >>
>          >> UNOFFICIAL
>          >>
>          >>
>          >> Does that mean I should have entries for 1vm in the metadata
>         table
>          >> corresponding to the root table?
>          >>
>          >> We are running 1.6.5
>          >>
>          >>
>          >> -----Original Message-----
>          >> From: Josh Elser [mailto:josh.elser@gmail.com
<ma...@gmail.com> 
>         <mailto:josh.elser@gmail.com <ma...@gmail.com> >]
>          >> Sent: Tuesday, 21 February 2017 09:22
>          >> To: user@accumulo.apache.org <ma...@accumulo.apache.org>
<mailto:user@accumulo.apache.org <ma...@accumulo.apache.org> >
>          >> Subject: Re: accumulo.root invalid table reference
>         [SEC=UNOFFICIAL]
>          >>
>          >> The root table should only reference the tablets in the
>         metadata table.
>          >> It's a hierarchy: like metadata is for the user tables, root
>         is for the
>          >> metadata table.
>          >>
>          >> What version are ya running, Matt?
>          >>
>          >> Dickson, Matt MR wrote:
>          >> > *UNOFFICIAL*
>          >> >
>          >> > I have a situation where all tablet servers are
>         progressively being
>          >> > declared dead. From the logs the tservers report errors
like:
>          >> > 2017-02-.... DEBUG: Scan failed thrift error
>          >> > org.apache.thrift.transport.TTransportException null
>          >> > (!0;1vm\\125.323.233.23::2016103<,server.com.org:9997
<http://server.com.org:9997> 
>         <http://server.com.org:9997>,2342423df12341d)
>          >> > 1vm was a table id that was deleted several months ago so
>         it appears
>          >> > there is some invalid reference somewhere.
>          >> > Scanning the metadata table "scan -b 1vm" returns no rows
>         returned for
>          >> > 1vm.
>          >> > A scan of the accumulo.root table returns approximately 15
>         rows that
>          >> > start with: !0:1vm;<ip addr>::2016103 blah. How are
>         the root
>          >> > table entries used and would it be safe to remove these
>         entries since
>          >> > they reference a deleted table?
>          >> > Thanks in advance,
>          >> > Matt
>          >> > //
>          >
>          > --
>          > Christopher
>
> --
> Christopher

-- 

Christopher


RE: accumulo.root invalid table reference [SEC=UNOFFICIAL]

Posted by "Dickson, Matt MR" <ma...@defence.gov.au>.
UNOFFICIAL

We are on 1.6.5; could it be that the merge is not available in this version?


________________________________
From: Christopher [mailto:ctubbsii@apache.org]
Sent: Thursday, 23 February 2017 12:46
To: user@accumulo.apache.org
Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]

On Wed, Feb 22, 2017 at 8:18 PM Dickson, Matt MR <ma...@defence.gov.au>> wrote:
UNOFFICIAL

I ran the compaction with no luck.

I've had a close look at the split points on the metadata table and confirmed that, due to the initial large table, we now have 90% of the metadata for existing tables hosted on one tablet, which creates a hotspot.  I've now manually added better split points to the metadata table, which has created tablets with only 4-5M entries rather than 12M+.

The split points I created isolate the metadata for large tables to separate tablets, but ideally I'd like to split these further, which raises three questions.

1. If I have table 1xo, is there a smart way to determine the midpoint of the data in the metadata table, e.g. 1xo;xxxx, to allow me to create a split based on that?

2. I tried to merge tablets on the metadata table where the size was smaller than 1M but was met with a warning stating that merge on the metadata table was not allowed. Due to the deletion of the large table, we have several tablets with zero entries, and they will never be populated.


Hmm. That seems to ring a bell. It was a goal of moving the root tablet into its own table, that users would be able to merge the metadata table. However, we may still have an unnecessary constraint on that in the interface, which is no longer needed. If merging on the metadata table doesn't work, please file a JIRA at https://issues.apache.org/browse/ACCUMULO with any error messages you saw, so we can track it as a bug.

3. How should Accumulo deal with the deletion of a massive table? Should the metadata table redistribute the tablets to avoid hotspotting on a single tserver, which appears to be what's happening?

Thanks for all the help so far.

-----Original Message-----
From: Josh Elser [mailto:josh.elser@gmail.com<ma...@gmail.com>]
Sent: Thursday, 23 February 2017 10:00
To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]

There's likely a delete "tombstone" in another file referenced by that tablet which is masking those entries. If you compact the tablet, you should see them all disappear.

Yes, you should be able to split/merge the metadata table just like any other table. Beware, the implications of this are system-wide instead of localized to a single user table :)

Dickson, Matt MR wrote:
> *UNOFFICIAL*
>
> When I inspect the rfiles associated with the metadata table using the
> rfile-info there are a lot of entries for the old deleted table, 1vm.
> Querying the metadata table returns nothing for the deleted table.
> When a table is deleted should the rfiles have any records referencing
> the old table?
> Also, am I able to manually create new split points on the metadata
> table to force it to break up the large tablet?
> ----------------------------------------------------------------------
> --
> *From:* Christopher [mailto:ctubbsii@apache.org<ma...@apache.org>]
> *Sent:* Wednesday, 22 February 2017 15:46
> *To:* user@accumulo.apache.org<ma...@accumulo.apache.org>
> *Subject:* Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>
> It should be safe to merge on the metadata table. That was one of the
> goals of moving the root tablet into its own table. I'm pretty sure we
> have a build test to ensure it works.
>
> On Tue, Feb 21, 2017, 18:22 Dickson, Matt MR
> <ma...@defence.gov.au> <ma...@defence.gov.au>>> wrote:
>
>     __
>
>     *UNOFFICIAL*
>
>     Firstly, thank you for your advice; it's been very helpful.
>     Increasing the tablet server memory has allowed the metadata table
>     to come online. From using the rfile-info and looking at the splits
>     for the metadata table it appears that all the metadata table
>     entries are in one tablet. All tablet servers then query the one
>     node hosting that tablet.
>     I suspect the cause of this was a poorly designed table for which, at
>     one point, the Accumulo GUI reported 1.02T tablets. We've
>     subsequently deleted that table, but it might be that there were so
>     many entries in the metadata table that all splits on it were due to
>     this massive table, which had the table id 1vm.
>     To rectify this, is it safe to run a merge on the metadata table to
>     force it to redistribute?
>
>     ------------------------------------------------------------------------
>     *From:* Michael Wall [mailto:mjwall@gmail.com<ma...@gmail.com>
>     <ma...@gmail.com>>]
>     *Sent:* Wednesday, 22 February 2017 02:44
>
>     *To:* user@accumulo.apache.org<ma...@accumulo.apache.org> <ma...@accumulo.apache.org>>
>     *Subject:* Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>     Matt,
>
>     If I am reading this correctly, you have a tablet that is being
>     loaded onto a tserver. That tserver dies, so the tablet is then
>     assigned to another tserver. While the tablet is being loaded, that
>     tserver dies and so on. Is that correct?
>
>     Can you identify the tablet that is bouncing around? If so, try
>     using rfile-info -d to inspect the rfiles associated with that
>     tablet. Also look at the rfiles that compose that tablet to see if
>     anything sticks out.
>
>     Any logs that would help explain why the tablet server is dying? Can
>     you increase the memory of the tserver?
>
>     Mike
>
>     On Tue, Feb 21, 2017 at 10:35 AM Josh Elser <jo...@gmail.com>
>     <ma...@gmail.com>>> wrote:
>
>         ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
>         communicating with ZooKeeper, will retry
>         SessionExpiredException: KeeperErrorCode = Session expired for
>
> /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
>
>         There can be a number of causes for this, but here are the most
>         likely ones.
>
>         * JVM gc pauses
>         * ZooKeeper max client connections
>         * Operating System/Hardware-level pauses
>
>         The former should be noticeable by the Accumulo log. There is a
>         daemon
>         running which watches for pauses that happen and then reports
>         them. If
>         this is happening, you might have to give the process some more Java
>         heap, tweak your CMS/G1 parameters, etc.
>
>         For maxClientConnections, see
>
> https://community.hortonworks.com/articles/51191/understanding-apache-zookeeper-connection-rate-lim.html
>
>         For the latter, swappiness is the most likely candidate
>         (assuming this
>         is hopping across different physical nodes), as are "transparent
>         huge
>         pages". If it is limited to a single host, things like bad NICs,
>         hard
>         drives, and other hardware issues might be a source of slowness.
>
>         On Mon, Feb 20, 2017 at 10:18 PM, Dickson, Matt MR
>         <ma...@defence.gov.au>
>         <ma...@defence.gov.au>>> wrote:
>          > UNOFFICIAL
>          >
>          > It looks like an issue with one of the metadata table
>         tablets. On startup
>          > the server that hosts a particular metadata tablet gets
>         scanned by all other
>          > tablet servers in the cluster. This then crashes that tablet
>         server with an
>          > error in the tserver log;
>          >
>          > ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
>          > communicating with ZooKeeper, will retry
>          > SessionExpiredException: KeeperErrorCode = Session expired for
>          >
>         /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
>          >
>          > That metadata table tablet is then transferred to another
>         host which then
>          > fails also, and so on.
>          >
>          > While the server is hosting this metadata tablet, we see the
>         following log
>          > statement from all tserver.logs in the cluster:
>          >
>          > .... [impl.ThriftScanner] DEBUG: Scan failed, thrift error
>          > org.apache.thrift.transport.TTransportException null
>          > (!0;1vm\\;125.323.233.23::2016103<,server.com.org:9997<http://server.com.org:9997>
>         <http://server.com.org:9997>,2342423df12341d)
>          > Hope that helps complete the picture.
>          >
>          >
>          > ________________________________
>          > From: Christopher [mailto:ctubbsii@apache.org<ma...@apache.org>
>         <ma...@apache.org>>]
>          > Sent: Tuesday, 21 February 2017 13:17
>          >
>          > To: user@accumulo.apache.org<ma...@accumulo.apache.org> <ma...@accumulo.apache.org>>
>          > Subject: Re: accumulo.root invalid table reference
>         [SEC=UNOFFICIAL]
>          >
>          > Removing them is probably a bad idea. The root table entries
>         correspond to
>          > split points in the metadata table. There is no need for the
>         tables which
>          > existed when the metadata table split to still exist for this
>         to continue to
>          > act as a valid split point.
>          >
>          > Would need to see the exception stack trace, or at least an
>         error message,
>          > to troubleshoot the shell scanning error you saw.
>          >
>          >
>          > On Mon, Feb 20, 2017, 20:00 Dickson, Matt MR
>         <ma...@defence.gov.au> <ma...@defence.gov.au>>>
>          > wrote:
>          >>
>          >> UNOFFICIAL
>          >>
>          >> In case it is ok to remove these from the root table, how
>         can I scan the
>          >> root table for rows with a rowid starting with !0;1vm?
>          >>
>          >> Running "scan -b !0;1vm" throws an exception and exits the
>         shell.
>          >>
>          >>
>          >> -----Original Message-----
>          >> From: Dickson, Matt MR [mailto:matt.dickson@defence.gov.au<ma...@defence.gov.au>
>         <ma...@defence.gov.au>>]
>          >> Sent: Tuesday, 21 February 2017 09:30
>          >> To: 'user@accumulo.apache.org<ma...@accumulo.apache.org> <ma...@accumulo.apache.org>>'
>          >> Subject: RE: accumulo.root invalid table reference
>         [SEC=UNOFFICIAL]
>          >>
>          >> UNOFFICIAL
>          >>
>          >>
>          >> Does that mean I should have entries for 1vm in the metadata
>         table
>          >> corresponding to the root table?
>          >>
>          >> We are running 1.6.5
>          >>
>          >>
>          >> -----Original Message-----
>          >> From: Josh Elser [mailto:josh.elser@gmail.com]
>          >> Sent: Tuesday, 21 February 2017 09:22
>          >> To: user@accumulo.apache.org
>          >> Subject: Re: accumulo.root invalid table reference
>         [SEC=UNOFFICIAL]
>          >>
>          >> The root table should only reference the tablets in the
>         metadata table.
>          >> It's a hierarchy: like metadata is for the user tables, root
>         is for the
>          >> metadata table.
>          >>
>          >> What version are ya running, Matt?
>          >>
>          >> Dickson, Matt MR wrote:
>          >> > *UNOFFICIAL*
>          >> >
>          >> > I have a situation where all tablet servers are
>         progressively being
>          >> > declared dead. From the logs the tservers report errors like:
>          >> > 2017-02-.... DEBUG: Scan failed thrift error
>          >> > org.apache.thrift.transport.TTransportException null
>          >> > (!0;1vm\\125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
>          >> > 1vm was a table id that was deleted several months ago so
>         it appears
>          >> > there is some invalid reference somewhere.
>          >> > Scanning the metadata table with "scan -b 1vm" returns no rows
>         for
>          >> > 1vm.
>          >> > A scan of the accumulo.root table returns approximately 15
>         rows that
>          >> > start with: !0:1vm;<ip addr>::2016103 blah. How are
>         the root
>          >> > table entries used and would it be safe to remove these
>         entries since
>          >> > they reference a deleted table?
>          >> > Thanks in advance,
>          >> > Matt
>          >
>          > --
>          > Christopher
>
> --
> Christopher
--
Christopher

Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]

Posted by Christopher <ct...@apache.org>.
On Wed, Feb 22, 2017 at 8:18 PM Dickson, Matt MR <
matt.dickson@defence.gov.au> wrote:

> UNOFFICIAL
>
> I ran the compaction with no luck.
>
> I've had a close look at the split points on the metadata table and
> confirmed that, due to the initial large table, 90% of the metadata for
> existing tables is now hosted on one tablet, which creates a hotspot.
> I've manually added better split points to the metadata table, which has
> created tablets with only 4-5M entries rather than 12M+.
>
> The split points I created isolate the metadata for large tables to
> separate tablets but ideally I'd like to split these further which raises 3
> questions.
>
> 1. If I have table 1xo, is there a smart way to determine the midpoint of
> the data in the metadata table, e.g. 1xo;xxxx, to allow me to create a split
> based on that?
>
> 2. I tried to merge tablets on the metadata table where the size was
> smaller than 1M but was met with a warning stating merge on the metadata
> table was not allowed. Due to the deletion of the large table we have
> several tablets with zero entries, and they will never be populated.
>
>
Hmm. That seems to ring a bell. One goal of moving the root tablet into
its own table was that users would be able to merge the metadata table.
However, we may still have a now-unnecessary constraint on that in the
interface. If merging on the metadata table
doesn't work, please file a JIRA at
https://issues.apache.org/browse/ACCUMULO with any error messages you saw,
so we can track it as a bug.


> 3. How should Accumulo deal with the deletion of a massive table? Should
> the metadata table redistribute its tablets to avoid hotspotting on a
> single tserver, which appears to be what's happening?
>
> Thanks for all the help so far.
>
> -----Original Message-----
> From: Josh Elser [mailto:josh.elser@gmail.com]
> Sent: Thursday, 23 February 2017 10:00
> To: user@accumulo.apache.org
> Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>
> There's likely a delete "tombstone" in another file referenced by that
> tablet which is masking those entries. If you compact the tablet, you
> should see them all disappear.
>
> Yes, you should be able to split/merge the metadata table just like any
> other table. Beware, the implications of this are system-wide instead of
> localized to a single user table :)
>
> Dickson, Matt MR wrote:
> > *UNOFFICIAL*
> >
> > When I inspect the rfiles associated with the metadata table using the
> > rfile-info there are a lot of entries for the old deleted table, 1vm.
> > Querying the metadata table returns nothing for the deleted table.
> > When a table is deleted should the rfiles have any records referencing
> > the old table?
> > Also, am I able to manually create a new split point on the metadata
> > table to force it to break up the large tablet?
> > ----------------------------------------------------------------------
> > --
> > *From:* Christopher [mailto:ctubbsii@apache.org]
> > *Sent:* Wednesday, 22 February 2017 15:46
> > *To:* user@accumulo.apache.org
> > *Subject:* Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
> >
> > It should be safe to merge on the metadata table. That was one of the
> > goals of moving the root tablet into its own table. I'm pretty sure we
> > have a build test to ensure it works.
> >
> > On Tue, Feb 21, 2017, 18:22 Dickson, Matt MR
> > <matt.dickson@defence.gov.au> wrote:
> >
> >     __
> >
> >     *UNOFFICIAL*
> >
> >     Firstly, thank you for your advice, it's been very helpful.
> >     Increasing the tablet server memory has allowed the metadata table
> >     to come online. From using the rfile-info and looking at the splits
> >     for the metadata table it appears that all the metadata table
> >     entries are in one tablet. All tablet servers then query the one
> >     node hosting that tablet.
> >     I suspect the cause of this was a poorly designed table that at one
> >     point the Accumulo gui reported 1.02T tablets for. We've
> >     subsequently deleted that table but it might be that there were so
> >     many entries in the metadata table that all splits on it were due to
> >     this massive table that had the table id 1vm.
> >     To rectify this, is it safe to run a merge on the metadata table to
> >     force it to redistribute?
> >
> >
>  ------------------------------------------------------------------------
> >     *From:* Michael Wall [mailto:mjwall@gmail.com]
> >     *Sent:* Wednesday, 22 February 2017 02:44
> >
> >     *To:* user@accumulo.apache.org
> >     *Subject:* Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
> >     Matt,
> >
> >     If I am reading this correctly, you have a tablet that is being
> >     loaded onto a tserver. That tserver dies, so the tablet is then
> >     assigned to another tserver. While the tablet is being loaded, that
> >     tserver dies, and so on. Is that correct?
> >
> >     Can you identify the tablet that is bouncing around? If so, try
> >     using rfile-info -d to inspect the rfiles associated with that
> >     tablet. Also look at the rfiles that compose that tablet to see if
> >     anything sticks out.
> >
> >     Any logs that would help explain why the tablet server is dying? Can
> >     you increase the memory of the tserver?
> >
> >     Mike
> >
> >     On Tue, Feb 21, 2017 at 10:35 AM Josh Elser <josh.elser@gmail.com> wrote:
> >
> >         ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
> >         communicating with ZooKeeper, will retry
> >         SessionExpiredException: KeeperErrorCode = Session expired for
> >
> > /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.me
> > mory
> >
> >         There can be a number of causes for this, but here are the most
> >         likely ones.
> >
> >         * JVM gc pauses
> >         * ZooKeeper max client connections
> >         * Operating System/Hardware-level pauses
> >
> >         The former should be noticeable by the Accumulo log. There is a
> >         daemon
> >         running which watches for pauses that happen and then reports
> >         them. If
> >         this is happening, you might have to give the process some more
> Java
> >         heap, tweak your CMS/G1 parameters, etc.
> >
> >         For maxClientConnections, see
> >
> > https://community.hortonworks.com/articles/51191/understanding-apache-
> > zookeeper-connection-rate-lim.html
> >
> >         For the latter, swappiness is the most likely candidate
> >         (assuming this
> >         is hopping across different physical nodes), as are "transparent
> >         huge
> >         pages". If it is limited to a single host, things like bad NICs,
> >         hard
> >         drives, and other hardware issues might be a source of slowness.
> >
> >         On Mon, Feb 20, 2017 at 10:18 PM, Dickson, Matt MR
> >         <matt.dickson@defence.gov.au> wrote:
> >          > UNOFFICIAL
> >          >
> >          > It looks like an issue with one of the metadata table
> >         tablets. On startup
> >          > the server that hosts a particular metadata tablet gets
> >         scanned by all other
> >          > tablet servers in the cluster. This then crashes that tablet
> >         server with an
> >          > error in the tserver log;
> >          >
> >          > ... [zookeeper.ZooCache] WARN: Saw (possibly) transient
> exception
> >          > communicating with ZooKeeper, will retry
> >          > SessionExpiredException: KeeperErrorCode = Session expired for
> >          >
> >
>  /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
> >          >
> >          > That metadata table tablet is then transferred to another
> >         host which then
> >          > fails also, and so on.
> >          >
> >          > While the server is hosting this metadata tablet, we see the
> >         following log
> >          > statement from all tserver.logs in the cluster:
> >          >
> >          > .... [impl.ThriftScanner] DEBUG: Scan failed, thrift error
> >          > org.apache.thrift.transport.TTransportException null
> >          > (!0;1vm\\;125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
> >          > Hope that helps complete the picture.
> >          >
> >          >
> >          > ________________________________
> >          > From: Christopher [mailto:ctubbsii@apache.org]
> >          > Sent: Tuesday, 21 February 2017 13:17
> >          >
> >          > To: user@accumulo.apache.org
> >          > Subject: Re: accumulo.root invalid table reference
> >         [SEC=UNOFFICIAL]
> >          >
> >          > Removing them is probably a bad idea. The root table entries
> >         correspond to
> >          > split points in the metadata table. There is no need for the
> >         tables which
> >          > existed when the metadata table split to still exist for this
> >         to continue to
> >          > act as a valid split point.
> >          >
> >          > Would need to see the exception stack trace, or at least an
> >         error message,
> >          > to troubleshoot the shell scanning error you saw.
> >          >
> >          >
> >          > On Mon, Feb 20, 2017, 20:00 Dickson, Matt MR
> >         <matt.dickson@defence.gov.au>
> >          > wrote:
> >          >>
> >          >> UNOFFICIAL
> >          >>
> >          >> In case it is ok to remove these from the root table, how
> >         can I scan the
> >          >> root table for rows with a rowid starting with !0;1vm?
> >          >>
> >          >> Running "scan -b !0;1vm" throws an exception and exits the
> >         shell.
> >          >>
> >          >>
> >          >> -----Original Message-----
> >          >> From: Dickson, Matt MR [mailto:matt.dickson@defence.gov.au]
> >          >> Sent: Tuesday, 21 February 2017 09:30
> >          >> To: 'user@accumulo.apache.org'
> >          >> Subject: RE: accumulo.root invalid table reference
> >         [SEC=UNOFFICIAL]
> >          >>
> >          >> UNOFFICIAL
> >          >>
> >          >>
> >          >> Does that mean I should have entries for 1vm in the metadata
> >         table
> >          >> corresponding to the root table?
> >          >>
> >          >> We are running 1.6.5
> >          >>
> >          >>
> >          >> -----Original Message-----
> >          >> From: Josh Elser [mailto:josh.elser@gmail.com]
> >          >> Sent: Tuesday, 21 February 2017 09:22
> >          >> To: user@accumulo.apache.org
> >          >> Subject: Re: accumulo.root invalid table reference
> >         [SEC=UNOFFICIAL]
> >          >>
> >          >> The root table should only reference the tablets in the
> >         metadata table.
> >          >> It's a hierarchy: like metadata is for the user tables, root
> >         is for the
> >          >> metadata table.
> >          >>
> >          >> What version are ya running, Matt?
> >          >>
> >          >> Dickson, Matt MR wrote:
> >          >> > *UNOFFICIAL*
> >          >> >
> >          >> > I have a situation where all tablet servers are
> >         progressively being
> >          >> > declared dead. From the logs the tservers report errors
> like:
> >          >> > 2017-02-.... DEBUG: Scan failed thrift error
> >          >> > org.apache.thrift.transport.TTransportException null
> >          >> > (!0;1vm\\125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
> >          >> > 1vm was a table id that was deleted several months ago so
> >         it appears
> >          >> > there is some invalid reference somewhere.
> >          >> > Scanning the metadata table with "scan -b 1vm" returns no rows
> >         for
> >          >> > 1vm.
> >          >> > A scan of the accumulo.root table returns approximately 15
> >         rows that
> >          >> > start with: !0:1vm;<ip addr>::2016103 blah. How are
> >         the root
> >          >> > table entries used and would it be safe to remove these
> >         entries since
> >          >> > they reference a deleted table?
> >          >> > Thanks in advance,
> >          >> > Matt
> >          >
> >          > --
> >          > Christopher
> >
> > --
> > Christopher
>
-- 
Christopher

RE: accumulo.root invalid table reference [SEC=UNOFFICIAL]

Posted by "Dickson, Matt MR" <ma...@defence.gov.au>.
UNOFFICIAL

I ran the compaction with no luck.  

I've had a close look at the split points on the metadata table and confirmed that, due to the initial large table, 90% of the metadata for existing tables is now hosted on one tablet, which creates a hotspot.  I've manually added better split points to the metadata table, which has created tablets with only 4-5M entries rather than 12M+.

The split points I created isolate the metadata for large tables to separate tablets but ideally I'd like to split these further which raises 3 questions.

1. If I have table 1xo, is there a smart way to determine the midpoint of the data in the metadata table, e.g. 1xo;xxxx, to allow me to create a split based on that?

2. I tried to merge tablets on the metadata table where the size was smaller than 1M, but was met with a warning stating that merge on the metadata table is not allowed. Due to the deletion of the large table we have several tablets with zero entries, and they will never be populated.

3. How should Accumulo deal with the deletion of a massive table? Should the metadata table redistribute its tablets to avoid hotspotting on a single tserver, which appears to be what's happening?

Thanks for all the help so far.
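For question 1, one low-tech approach is to dump the row IDs of that table's metadata entries and take the median as the new split point. The sketch below is illustrative only: the shell scan mentioned in the comment and the example rows are assumptions, not output from this cluster.

```python
# Sketch: choose a midpoint split for one table's slice of accumulo.metadata.
# Input: the row IDs of that table's metadata entries, e.g. dumped from the
# shell with something like:  scan -t accumulo.metadata -b '1xo;' -e '1xo<' -np
def midpoint_split(row_ids):
    """Return the median row ID, usable as an argument to addsplits."""
    rows = sorted(set(row_ids))
    if not rows:
        raise ValueError("no metadata rows found for this table")
    return rows[len(rows) // 2]

# Hypothetical metadata rows for table id 1xo ('1xo<' is the last tablet):
rows = ["1xo;aaa", "1xo;ggg", "1xo;mmm", "1xo;ttt", "1xo<"]
print(midpoint_split(rows))  # -> 1xo;mmm
```

Repeating this per large table gives roughly evenly loaded metadata tablets without guessing split rows by hand.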

-----Original Message-----
From: Josh Elser [mailto:josh.elser@gmail.com]
Sent: Thursday, 23 February 2017 10:00
To: user@accumulo.apache.org
Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]

There's likely a delete "tombstone" in another file referenced by that tablet which is masking those entries. If you compact the tablet, you should see them all disappear.

Yes, you should be able to split/merge the metadata table just like any other table. Beware, the implications of this are system-wide instead of localized to a single user table :)

Dickson, Matt MR wrote:
> *UNOFFICIAL*
>
> When I inspect the rfiles associated with the metadata table using the 
> rfile-info there are a lot of entries for the old deleted table, 1vm.
> Querying the metadata table returns nothing for the deleted table.
> When a table is deleted should the rfiles have any records referencing 
> the old table?
> Also, am I able to manually create a new split point on the metadata
> table to force it to break up the large tablet?
> ----------------------------------------------------------------------
> --
> *From:* Christopher [mailto:ctubbsii@apache.org]
> *Sent:* Wednesday, 22 February 2017 15:46
> *To:* user@accumulo.apache.org
> *Subject:* Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>
> It should be safe to merge on the metadata table. That was one of the 
> goals of moving the root tablet into its own table. I'm pretty sure we 
> have a build test to ensure it works.
>
> On Tue, Feb 21, 2017, 18:22 Dickson, Matt MR 
> <matt.dickson@defence.gov.au> wrote:
>
>     __
>
>     *UNOFFICIAL*
>
>     Firstly, thank you for your advice, it's been very helpful.
>     Increasing the tablet server memory has allowed the metadata table
>     to come online. From using the rfile-info and looking at the splits
>     for the metadata table it appears that all the metadata table
>     entries are in one tablet. All tablet servers then query the one
>     node hosting that tablet.
>     I suspect the cause of this was a poorly designed table that at one
>     point the Accumulo gui reported 1.02T tablets for. We've
>     subsequently deleted that table but it might be that there were so
>     many entries in the metadata table that all splits on it were due to
>     this massive table that had the table id 1vm.
>     To rectify this, is it safe to run a merge on the metadata table to
>     force it to redistribute?
>
>     ------------------------------------------------------------------------
>     *From:* Michael Wall [mailto:mjwall@gmail.com]
>     *Sent:* Wednesday, 22 February 2017 02:44
>
>     *To:* user@accumulo.apache.org
>     *Subject:* Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>     Matt,
>
>     If I am reading this correctly, you have a tablet that is being
>     loaded onto a tserver. That tserver dies, so the tablet is then
>     assigned to another tserver. While the tablet is being loaded, that
>     tserver dies, and so on. Is that correct?
>
>     Can you identify the tablet that is bouncing around? If so, try
>     using rfile-info -d to inspect the rfiles associated with that
>     tablet. Also look at the rfiles that compose that tablet to see if
>     anything sticks out.
>
>     Any logs that would help explain why the tablet server is dying? Can
>     you increase the memory of the tserver?
>
>     Mike
>
>     On Tue, Feb 21, 2017 at 10:35 AM Josh Elser <josh.elser@gmail.com> wrote:
>
>         ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
>         communicating with ZooKeeper, will retry
>         SessionExpiredException: KeeperErrorCode = Session expired for
>         
> /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.me
> mory
>
>         There can be a number of causes for this, but here are the most
>         likely ones.
>
>         * JVM gc pauses
>         * ZooKeeper max client connections
>         * Operating System/Hardware-level pauses
>
>         The former should be noticeable by the Accumulo log. There is a
>         daemon
>         running which watches for pauses that happen and then reports
>         them. If
>         this is happening, you might have to give the process some more Java
>         heap, tweak your CMS/G1 parameters, etc.
>
>         For maxClientConnections, see
>         
> https://community.hortonworks.com/articles/51191/understanding-apache-
> zookeeper-connection-rate-lim.html
>
>         For the latter, swappiness is the most likely candidate
>         (assuming this
>         is hopping across different physical nodes), as are "transparent
>         huge
>         pages". If it is limited to a single host, things like bad NICs,
>         hard
>         drives, and other hardware issues might be a source of slowness.
>
>         On Mon, Feb 20, 2017 at 10:18 PM, Dickson, Matt MR
>         <matt.dickson@defence.gov.au> wrote:
>          > UNOFFICIAL
>          >
>          > It looks like an issue with one of the metadata table
>         tablets. On startup
>          > the server that hosts a particular metadata tablet gets
>         scanned by all other
>          > tablet servers in the cluster. This then crashes that tablet
>         server with an
>          > error in the tserver log;
>          >
>          > ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
>          > communicating with ZooKeeper, will retry
>          > SessionExpiredException: KeeperErrorCode = Session expired for
>          >
>         /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
>          >
>          > That metadata table tablet is then transferred to another
>         host which then
>          > fails also, and so on.
>          >
>          > While the server is hosting this metadata tablet, we see the
>         following log
>          > statement from all tserver.logs in the cluster:
>          >
>          > .... [impl.ThriftScanner] DEBUG: Scan failed, thrift error
>          > org.apache.thrift.transport.TTransportException null
>          > (!0;1vm\\;125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
>          > Hope that helps complete the picture.
>          >
>          >
>          > ________________________________
>          > From: Christopher [mailto:ctubbsii@apache.org]
>          > Sent: Tuesday, 21 February 2017 13:17
>          >
>          > To: user@accumulo.apache.org
>          > Subject: Re: accumulo.root invalid table reference
>         [SEC=UNOFFICIAL]
>          >
>          > Removing them is probably a bad idea. The root table entries
>         correspond to
>          > split points in the metadata table. There is no need for the
>         tables which
>          > existed when the metadata table split to still exist for this
>         to continue to
>          > act as a valid split point.
>          >
>          > Would need to see the exception stack trace, or at least an
>         error message,
>          > to troubleshoot the shell scanning error you saw.
>          >
>          >
>          > On Mon, Feb 20, 2017, 20:00 Dickson, Matt MR
>         <matt.dickson@defence.gov.au>
>          > wrote:
>          >>
>          >> UNOFFICIAL
>          >>
>          >> In case it is ok to remove these from the root table, how
>         can I scan the
>          >> root table for rows with a rowid starting with !0;1vm?
>          >>
>          >> Running "scan -b !0;1vm" throws an exception and exits the
>         shell.
>          >>
>          >>
>          >> -----Original Message-----
>          >> From: Dickson, Matt MR [mailto:matt.dickson@defence.gov.au]
>          >> Sent: Tuesday, 21 February 2017 09:30
>          >> To: 'user@accumulo.apache.org'
>          >> Subject: RE: accumulo.root invalid table reference
>         [SEC=UNOFFICIAL]
>          >>
>          >> UNOFFICIAL
>          >>
>          >>
>          >> Does that mean I should have entries for 1vm in the metadata
>         table
>          >> corresponding to the root table?
>          >>
>          >> We are running 1.6.5
>          >>
>          >>
>          >> -----Original Message-----
>          >> From: Josh Elser [mailto:josh.elser@gmail.com]
>          >> Sent: Tuesday, 21 February 2017 09:22
>          >> To: user@accumulo.apache.org
>          >> Subject: Re: accumulo.root invalid table reference
>         [SEC=UNOFFICIAL]
>          >>
>          >> The root table should only reference the tablets in the
>         metadata table.
>          >> It's a hierarchy: like metadata is for the user tables, root
>         is for the
>          >> metadata table.
>          >>
>          >> What version are ya running, Matt?
>          >>
>          >> Dickson, Matt MR wrote:
>          >> > *UNOFFICIAL*
>          >> >
>          >> > I have a situation where all tablet servers are
>         progressively being
>          >> > declared dead. From the logs the tservers report errors like:
>          >> > 2017-02-.... DEBUG: Scan failed thrift error
>          >> > org.apache.thrift.transport.TTransportException null
>          >> > (!0;1vm\\125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
>          >> > 1vm was a table id that was deleted several months ago so
>         it appears
>          >> > there is some invalid reference somewhere.
>          >> > Scanning the metadata table with "scan -b 1vm" returns no rows
>         for
>          >> > 1vm.
>          >> > A scan of the accumulo.root table returns approximately 15
>         rows that
>          >> > start with: !0:1vm;<ip addr>::2016103 blah. How are
>         the root
>          >> > table entries used and would it be safe to remove these
>         entries since
>          >> > they reference a deleted table?
>          >> > Thanks in advance,
>          >> > Matt
>          >
>          > --
>          > Christopher
>
> --
> Christopher

Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]

Posted by Josh Elser <jo...@gmail.com>.
There's likely a delete "tombstone" in another file referenced by that 
tablet which is masking those entries. If you compact the tablet, you 
should see them all disappear.

Yes, you should be able to split/merge the metadata table just like any
other table. Beware, the implications of this are system-wide instead of
localized to a single user table :)
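The tombstone behavior can be sketched with a toy model (purely illustrative, not Accumulo's actual implementation; the table IDs and the dict-based "files" here are invented):

```python
# Toy model of delete tombstones in a log-structured store: a scan merges
# files oldest-to-newest and a tombstone (None) hides older entries; a full
# compaction rewrites everything into one file, physically dropping both
# the masked data and the tombstone itself.
older = {"1vm;row1": "data", "1vm;row2": "data", "2ab;row1": "data"}
newer = {"1vm;row1": None, "1vm;row2": None}  # None marks a delete

def scan(files):
    """Merge files in order; later (newer) files win, tombstoned keys are hidden."""
    merged = {}
    for f in files:
        merged.update(f)
    return {k: v for k, v in merged.items() if v is not None}

def compact(files):
    """Rewrite to a single file with tombstones and the data they mask removed."""
    return [scan(files)]

print(scan([older, newer]))     # {'2ab;row1': 'data'}
print(compact([older, newer]))  # [{'2ab;row1': 'data'}]
```

This is also why rfile-info can still show entries for a deleted table while a scan shows nothing: dumping an rfile bypasses the merge, so masked data stays visible until a compaction rewrites the files.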

Dickson, Matt MR wrote:
> *UNOFFICIAL*
>
> When I inspect the rfiles associated with the metadata table using the
> rfile-info there are a lot of entries for the old deleted table, 1vm.
> Querying the metadata table returns nothing for the deleted table.
> When a table is deleted should the rfiles have any records referencing
> the old table?
> Also, am I able to manually create a new split point on the metadata table
> to force it to break up the large tablet?
> ------------------------------------------------------------------------
> *From:* Christopher [mailto:ctubbsii@apache.org]
> *Sent:* Wednesday, 22 February 2017 15:46
> *To:* user@accumulo.apache.org
> *Subject:* Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>
> It should be safe to merge on the metadata table. That was one of the
> goals of moving the root tablet into its own table. I'm pretty sure we
> have a build test to ensure it works.
>
> On Tue, Feb 21, 2017, 18:22 Dickson, Matt MR
> <matt.dickson@defence.gov.au> wrote:
>
>     __
>
>     *UNOFFICIAL*
>
>     Firstly, thank you for your advice, it's been very helpful.
>     Increasing the tablet server memory has allowed the metadata table
>     to come online. From using the rfile-info and looking at the splits
>     for the metadata table it appears that all the metadata table
>     entries are in one tablet. All tablet servers then query the one
>     node hosting that tablet.
>     I suspect the cause of this was a poorly designed table that at one
>     point the Accumulo gui reported 1.02T tablets for. We've
>     subsequently deleted that table but it might be that there were so
>     many entries in the metadata table that all splits on it were due to
>     this massive table that had the table id 1vm.
>     To rectify this, is it safe to run a merge on the metadata table to
>     force it to redistribute?
>
>     ------------------------------------------------------------------------
>     *From:* Michael Wall [mailto:mjwall@gmail.com]
>     *Sent:* Wednesday, 22 February 2017 02:44
>
>     *To:* user@accumulo.apache.org
>     *Subject:* Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>     Matt,
>
>     If I am reading this correctly, you have a tablet that is being
>     loaded onto a tserver. That tserver dies, so the tablet is then
>     assigned to another tserver. While the tablet is being loaded, that
>     tserver dies, and so on. Is that correct?
>
>     Can you identify the tablet that is bouncing around? If so, try
>     using rfile-info -d to inspect the rfiles associated with that
>     tablet. Also look at the rfiles that compose that tablet to see if
>     anything sticks out.
>
>     Any logs that would help explain why the tablet server is dying? Can
>     you increase the memory of the tserver?
>
>     Mike
>
>     On Tue, Feb 21, 2017 at 10:35 AM Josh Elser <josh.elser@gmail.com> wrote:
>
>         ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
>         communicating with ZooKeeper, will retry
>         SessionExpiredException: KeeperErrorCode = Session expired for
>         /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
>
>         There can be a number of causes for this, but here are the most
>         likely ones.
>
>         * JVM gc pauses
>         * ZooKeeper max client connections
>         * Operating System/Hardware-level pauses
>
>         The former should be noticeable by the Accumulo log. There is a
>         daemon
>         running which watches for pauses that happen and then reports
>         them. If
>         this is happening, you might have to give the process some more Java
>         heap, tweak your CMS/G1 parameters, etc.
>
>         For maxClientConnections, see
>         https://community.hortonworks.com/articles/51191/understanding-apache-zookeeper-connection-rate-lim.html
>
>         For the latter, swappiness is the most likely candidate
>         (assuming this
>         is hopping across different physical nodes), as are "transparent
>         huge
>         pages". If it is limited to a single host, things like bad NICs,
>         hard
>         drives, and other hardware issues might be a source of slowness.
>
>         On Mon, Feb 20, 2017 at 10:18 PM, Dickson, Matt MR
>         <matt.dickson@defence.gov.au> wrote:
>          > UNOFFICIAL
>          >
>          > It looks like an issue with one of the metadata table
>         tablets. On startup
>          > the server that hosts a particular metadata tablet gets
>         scanned by all other
>          > tablet servers in the cluster. This then crashes that tablet
>         server with an
>          > error in the tserver log;
>          >
>          > ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
>          > communicating with ZooKeeper, will retry
>          > SessionExpiredException: KeeperErrorCode = Session expired for
>          >
>         /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
>          >
>          > That metadata table tablet is then transferred to another
>         host which then
>          > fails also, and so on.
>          >
>          > While the server is hosting this metadata tablet, we see the
>         following log
>          > statement from all tserver.logs in the cluster:
>          >
>          > .... [impl.ThriftScanner] DEBUG: Scan failed, thrift error
>          > org.apache.thrift.transport.TTransportException null
>          > (!0;1vm\\;125.323.233.23::2016103<,server.com.org:9997
>         <http://server.com.org:9997>,2342423df12341d)
>          > Hope that helps complete the picture.
>          >
>          >
>          > ________________________________
>          > From: Christopher [mailto:ctubbsii@apache.org
>         <ma...@apache.org>]
>          > Sent: Tuesday, 21 February 2017 13:17
>          >
>          > To: user@accumulo.apache.org <ma...@accumulo.apache.org>
>          > Subject: Re: accumulo.root invalid table reference
>         [SEC=UNOFFICIAL]
>          >
>          > Removing them is probably a bad idea. The root table entries
>         correspond to
>          > split points in the metadata table. There is no need for the
>         tables which
>          > existed when the metadata table split to still exist for this
>         to continue to
>          > act as a valid split point.
>          >
>          > Would need to see the exception stack trace, or at least an
>         error message,
>          > to troubleshoot the shell scanning error you saw.
>          >
>          >
>          > On Mon, Feb 20, 2017, 20:00 Dickson, Matt MR
>         <matt.dickson@defence.gov.au <ma...@defence.gov.au>>
>          > wrote:
>          >>
>          >> UNOFFICIAL
>          >>
>          >> In case it is ok to remove these from the root table, how
>         can I scan the
>          >> root table for rows with a rowid starting with !0;1vm?
>          >>
>          >> Running "scan -b !0;1vm" throws an exception and exits the
>         shell.
>          >>
>          >>
>          >> -----Original Message-----
>          >> From: Dickson, Matt MR [mailto:matt.dickson@defence.gov.au
>         <ma...@defence.gov.au>]
>          >> Sent: Tuesday, 21 February 2017 09:30
>          >> To: 'user@accumulo.apache.org <ma...@accumulo.apache.org>'
>          >> Subject: RE: accumulo.root invalid table reference
>         [SEC=UNOFFICIAL]
>          >>
>          >> UNOFFICIAL
>          >>
>          >>
>          >> Does that mean I should have entries for 1vm in the metadata
>         table
>          >> corresponding to the root table?
>          >>
>          >> We are running 1.6.5
>          >>
>          >>
>          >> -----Original Message-----
>          >> From: Josh Elser [mailto:josh.elser@gmail.com
>         <ma...@gmail.com>]
>          >> Sent: Tuesday, 21 February 2017 09:22
>          >> To: user@accumulo.apache.org <ma...@accumulo.apache.org>
>          >> Subject: Re: accumulo.root invalid table reference
>         [SEC=UNOFFICIAL]
>          >>
>          >> The root table should only reference the tablets in the
>         metadata table.
>          >> It's a hierarchy: like metadata is for the user tables, root
>         is for the
>          >> metadata table.
>          >>
>          >> What version are ya running, Matt?
>          >>
>          >> Dickson, Matt MR wrote:
>          >> > *UNOFFICIAL*
>          >> >
>          >> > I have a situation where all tablet servers are
>         progressively being
>          >> > declared dead. From the logs the tservers report errors like:
>          >> > 2017-02-.... DEBUG: Scan failed thrift error
>          >> > org.apache.thrift.trasport.TTransportException null
>          >> > (!0;1vm\\125.323.233.23::2016103<,server.com.org:9997
>         <http://server.com.org:9997>,2342423df12341d)
>          >> > 1vm was a table id that was deleted several months ago so
>         it appears
>          >> > there is some invalid reference somewhere.
>          >> > Scanning the metadata table "scan -b 1vm" returns no rows
>         returned for
>          >> > 1vm.
>          >> > A scan of the accumulo.root table returns approximately 15
>         rows that
>          >> > start with; !0:1vm;<i/p addr>/::2016103 /blah/ // How are
>         the root
>          >> > table entries used and would it be safe to remove these
>         entries since
>          >> > they reference a deleted table?
>          >> > Thanks in advance,
>          >> > Matt
>          >> > //
>          >
>          > --
>          > Christopher
>
> --
> Christopher

RE: accumulo.root invalid table reference [SEC=UNOFFICIAL]

Posted by "Dickson, Matt MR" <ma...@defence.gov.au>.
UNOFFICIAL

When I inspect the rfiles associated with the metadata table using rfile-info, I see a lot of entries for the old deleted table, 1vm. Querying the metadata table itself returns nothing for the deleted table.

When a table is deleted should the rfiles have any records referencing the old table?

Also, am I able to manually create a new split point on the metadata table to force it to break up the large tablet?
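(For reference, a split can be added by hand from the Accumulo shell with addsplits. This is a sketch only; the split row below is illustrative, not taken from this thread.)

```shell
# Add a split point to the metadata table to break up the large tablet.
# The row '2;' is a placeholder -- pick a row known to fall inside the
# big tablet, e.g. one seen in rfile-info output.
root@instance> addsplits -t accumulo.metadata '2;'

# Confirm the new tablet boundaries:
root@instance> getsplits -t accumulo.metadata
```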

________________________________
From: Christopher [mailto:ctubbsii@apache.org]
Sent: Wednesday, 22 February 2017 15:46
To: user@accumulo.apache.org
Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]

It should be safe to merge on the metadata table. That was one of the goals of moving the root tablet into its own table. I'm pretty sure we have a build test to ensure it works.

On Tue, Feb 21, 2017, 18:22 Dickson, Matt MR <ma...@defence.gov.au>> wrote:

UNOFFICIAL

Firstly, thankyou for your advice its been very helpful.

Increasing the tablet server memory has allowed the metadata table to come online.  From using the rfile-info and looking at the splits for the metadata table it appears that all the metadata table entries are in one tablet.  All tablet servers then query the one node hosting that tablet.

I suspect the cause of this was a poorly designed table that at one point the Accumulo gui reported 1.02T tablets for.  We've subsequently deleted that table but it might be that there were so many entries in the metadata table that all splits on it were due to this massive table that had the table id 1vm.

To rectify this, is it safe to run a merge on the metadata table to force it to redistribute?

________________________________
From: Michael Wall [mailto:mjwall@gmail.com<ma...@gmail.com>]
Sent: Wednesday, 22 February 2017 02:44

To: user@accumulo.apache.org<ma...@accumulo.apache.org>
Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
Matt,

If I am reading this correctly, you have a tablet that is being loading onto a tserver.  That tserver dies, so the tablet is then assigned to another tablet.  While the tablet is being loading, that tserver dies and so on.  Is that correct?

Can you identify the tablet that is bouncing around?  If so, try using rfile-info -d to inspect the rfiles associated with that tablet.  Also look at the rfiles that compose that tablet to see if anything sticks out.

Any logs that would help explain why the tablet server is dying?  Can you increase the memory of the tserver?

Mike

On Tue, Feb 21, 2017 at 10:35 AM Josh Elser <jo...@gmail.com>> wrote:
... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
communicating with ZooKeeper, will retry
SessionExpiredException: KeeperErrorCode = Session expired for
/accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory

There can be a number of causes for this, but here are the most likely ones.

* JVM gc pauses
* ZooKeeper max client connections
* Operating System/Hardware-level pauses

The former should be noticeable by the Accumulo log. There is a daemon
running which watches for pauses that happen and then reports them. If
this is happening, you might have to give the process some more Java
heap, tweak your CMS/G1 parameters, etc.

For maxClientConnections, see
https://community.hortonworks.com/articles/51191/understanding-apache-zookeeper-connection-rate-lim.html

For the latter, swappiness is the most likely candidate (assuming this
is hopping across different physical nodes), as are "transparent huge
pages". If it is limited to a single host, things like bad NICs, hard
drives, and other hardware issues might be a source of slowness.

On Mon, Feb 20, 2017 at 10:18 PM, Dickson, Matt MR
<ma...@defence.gov.au>> wrote:
> UNOFFICIAL
>
> It looks like an issue with one of the metadata table tablets. On startup
> the server that hosts a particular metadata tablet gets scanned by all other
> tablet servers in the cluster.  This then crashes that tablet server with an
> error in the tserver log;
>
> ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
> communicating with ZooKeeper, will retry
> SessionExpiredException: KeeperErrorCode = Session expired for
> /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
>
> That metadata table tablet is then transferred to another host which then
> fails also, and so on.
>
> While the server is hosting this metadata tablet, we see the following log
> statement from all tserver.logs in the cluster:
>
> .... [impl.ThriftScanner] DEBUG: Scan failed, thrift error
> org.apache.thrift.transport.TTransportException  null
> (!0;1vm\\;125.323.233.23::2016103<,server.com.org:9997<http://server.com.org:9997>,2342423df12341d)
> Hope that helps complete the picture.
>
>
> ________________________________
> From: Christopher [mailto:ctubbsii@apache.org<ma...@apache.org>]
> Sent: Tuesday, 21 February 2017 13:17
>
> To: user@accumulo.apache.org<ma...@accumulo.apache.org>
> Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>
> Removing them is probably a bad idea. The root table entries correspond to
> split points in the metadata table. There is no need for the tables which
> existed when the metadata table split to still exist for this to continue to
> act as a valid split point.
>
> Would need to see the exception stack trace, or at least an error message,
> to troubleshoot the shell scanning error you saw.
>
>
> On Mon, Feb 20, 2017, 20:00 Dickson, Matt MR <ma...@defence.gov.au>>
> wrote:
>>
>> UNOFFICIAL
>>
>> In case it is ok to remove these from the root table, how can I scan the
>> root table for rows with a rowid starting with !0;1vm?
>>
>> Running "scan -b !0;1vm" throws an exception and exits the shell.
>>
>>
>> -----Original Message-----
>> From: Dickson, Matt MR [mailto:matt.dickson@defence.gov.au<ma...@defence.gov.au>]
>> Sent: Tuesday, 21 February 2017 09:30
>> To: 'user@accumulo.apache.org<ma...@accumulo.apache.org>'
>> Subject: RE: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>>
>> UNOFFICIAL
>>
>>
>> Does that mean I should have entries for 1vm in the metadata table
>> corresponding to the root table?
>>
>> We are running 1.6.5
>>
>>
>> -----Original Message-----
>> From: Josh Elser [mailto:josh.elser@gmail.com<ma...@gmail.com>]
>> Sent: Tuesday, 21 February 2017 09:22
>> To: user@accumulo.apache.org<ma...@accumulo.apache.org>
>> Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>>
>> The root table should only reference the tablets in the metadata table.
>> It's a hierarchy: like metadata is for the user tables, root is for the
>> metadata table.
>>
>> What version are ya running, Matt?
>>
>> Dickson, Matt MR wrote:
>> > *UNOFFICIAL*
>> >
>> > I have a situation where all tablet servers are progressively being
>> > declared dead. From the logs the tservers report errors like:
>> > 2017-02-.... DEBUG: Scan failed thrift error
>> > org.apache.thrift.trasport.TTransportException null
>> > (!0;1vm\\125.323.233.23::2016103<,server.com.org:9997<http://server.com.org:9997>,2342423df12341d)
>> > 1vm was a table id that was deleted several months ago so it appears
>> > there is some invalid reference somewhere.
>> > Scanning the metadata table "scan -b 1vm" returns no rows returned for
>> > 1vm.
>> > A scan of the accumulo.root table returns approximately 15 rows that
>> > start with; !0:1vm;<i/p addr>/::2016103 /blah/ // How are the root
>> > table entries used and would it be safe to remove these entries since
>> > they reference a deleted table?
>> > Thanks in advance,
>> > Matt
>> > //
>
> --
> Christopher
--
Christopher

Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]

Posted by Christopher <ct...@apache.org>.
It should be safe to merge on the metadata table. That was one of the goals
of moving the root tablet into its own table. I'm pretty sure we have a
build test to ensure it works.
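A minimal sketch of such a merge from the Accumulo shell (exact options vary slightly by version; check `help merge` on your install):

```shell
# Record the current splits first so the change can be reviewed.
root@instance> getsplits -t accumulo.metadata

# Merge the metadata table's tablets back together (requires
# alter-table permission on the metadata table).
root@instance> merge -t accumulo.metadata

# Afterwards, re-check how the metadata tablets are laid out:
root@instance> getsplits -t accumulo.metadata
```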

On Tue, Feb 21, 2017, 18:22 Dickson, Matt MR <ma...@defence.gov.au>
wrote:

> *UNOFFICIAL*
> Firstly, thankyou for your advice its been very helpful.
>
> Increasing the tablet server memory has allowed the metadata table to come
> online.  From using the rfile-info and looking at the splits for the
> metadata table it appears that all the metadata table entries are in one
> tablet.  All tablet servers then query the one node hosting that tablet.
>
> I suspect the cause of this was a poorly designed table that at one point
> the Accumulo gui reported 1.02T tablets for.  We've subsequently deleted
> that table but it might be that there were so many entries in the metadata
> table that all splits on it were due to this massive table that had the
> table id 1vm.
>
> To rectify this, is it safe to run a merge on the metadata table to force
> it to redistribute?
>
> ------------------------------
> *From:* Michael Wall [mailto:mjwall@gmail.com]
> *Sent:* Wednesday, 22 February 2017 02:44
>
> *To:* user@accumulo.apache.org
> *Subject:* Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
> Matt,
>
> If I am reading this correctly, you have a tablet that is being loading
> onto a tserver.  That tserver dies, so the tablet is then assigned to
> another tablet.  While the tablet is being loading, that tserver dies and
> so on.  Is that correct?
>
> Can you identify the tablet that is bouncing around?  If so, try using
> rfile-info -d to inspect the rfiles associated with that tablet.  Also look
> at the rfiles that compose that tablet to see if anything sticks out.
>
> Any logs that would help explain why the tablet server is dying?  Can you
> increase the memory of the tserver?
>
> Mike
>
> On Tue, Feb 21, 2017 at 10:35 AM Josh Elser <jo...@gmail.com> wrote:
>
> ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
> communicating with ZooKeeper, will retry
> SessionExpiredException: KeeperErrorCode = Session expired for
> /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
>
> There can be a number of causes for this, but here are the most likely
> ones.
>
> * JVM gc pauses
> * ZooKeeper max client connections
> * Operating System/Hardware-level pauses
>
> The former should be noticeable by the Accumulo log. There is a daemon
> running which watches for pauses that happen and then reports them. If
> this is happening, you might have to give the process some more Java
> heap, tweak your CMS/G1 parameters, etc.
>
> For maxClientConnections, see
>
> https://community.hortonworks.com/articles/51191/understanding-apache-zookeeper-connection-rate-lim.html
>
> For the latter, swappiness is the most likely candidate (assuming this
> is hopping across different physical nodes), as are "transparent huge
> pages". If it is limited to a single host, things like bad NICs, hard
> drives, and other hardware issues might be a source of slowness.
>
> On Mon, Feb 20, 2017 at 10:18 PM, Dickson, Matt MR
> <ma...@defence.gov.au> wrote:
> > UNOFFICIAL
> >
> > It looks like an issue with one of the metadata table tablets. On startup
> > the server that hosts a particular metadata tablet gets scanned by all
> other
> > tablet servers in the cluster.  This then crashes that tablet server
> with an
> > error in the tserver log;
> >
> > ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
> > communicating with ZooKeeper, will retry
> > SessionExpiredException: KeeperErrorCode = Session expired for
> >
> /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
> >
> > That metadata table tablet is then transferred to another host which then
> > fails also, and so on.
> >
> > While the server is hosting this metadata tablet, we see the following
> log
> > statement from all tserver.logs in the cluster:
> >
> > .... [impl.ThriftScanner] DEBUG: Scan failed, thrift error
> > org.apache.thrift.transport.TTransportException  null
> > (!0;1vm\\;125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
> > Hope that helps complete the picture.
> >
> >
> > ________________________________
> > From: Christopher [mailto:ctubbsii@apache.org]
> > Sent: Tuesday, 21 February 2017 13:17
> >
> > To: user@accumulo.apache.org
> > Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
> >
> > Removing them is probably a bad idea. The root table entries correspond
> to
> > split points in the metadata table. There is no need for the tables which
> > existed when the metadata table split to still exist for this to
> continue to
> > act as a valid split point.
> >
> > Would need to see the exception stack trace, or at least an error
> message,
> > to troubleshoot the shell scanning error you saw.
> >
> >
> > On Mon, Feb 20, 2017, 20:00 Dickson, Matt MR <
> matt.dickson@defence.gov.au>
> > wrote:
> >>
> >> UNOFFICIAL
> >>
> >> In case it is ok to remove these from the root table, how can I scan the
> >> root table for rows with a rowid starting with !0;1vm?
> >>
> >> Running "scan -b !0;1vm" throws an exception and exits the shell.
> >>
> >>
> >> -----Original Message-----
> >> From: Dickson, Matt MR [mailto:matt.dickson@defence.gov.au]
> >> Sent: Tuesday, 21 February 2017 09:30
> >> To: 'user@accumulo.apache.org'
> >> Subject: RE: accumulo.root invalid table reference [SEC=UNOFFICIAL]
> >>
> >> UNOFFICIAL
> >>
> >>
> >> Does that mean I should have entries for 1vm in the metadata table
> >> corresponding to the root table?
> >>
> >> We are running 1.6.5
> >>
> >>
> >> -----Original Message-----
> >> From: Josh Elser [mailto:josh.elser@gmail.com]
> >> Sent: Tuesday, 21 February 2017 09:22
> >> To: user@accumulo.apache.org
> >> Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
> >>
> >> The root table should only reference the tablets in the metadata table.
> >> It's a hierarchy: like metadata is for the user tables, root is for the
> >> metadata table.
> >>
> >> What version are ya running, Matt?
> >>
> >> Dickson, Matt MR wrote:
> >> > *UNOFFICIAL*
> >> >
> >> > I have a situation where all tablet servers are progressively being
> >> > declared dead. From the logs the tservers report errors like:
> >> > 2017-02-.... DEBUG: Scan failed thrift error
> >> > org.apache.thrift.trasport.TTransportException null
> >> > (!0;1vm\\125.323.233.23::2016103<,server.com.org:9997
> ,2342423df12341d)
> >> > 1vm was a table id that was deleted several months ago so it appears
> >> > there is some invalid reference somewhere.
> >> > Scanning the metadata table "scan -b 1vm" returns no rows returned for
> >> > 1vm.
> >> > A scan of the accumulo.root table returns approximately 15 rows that
> >> > start with; !0:1vm;<i/p addr>/::2016103 /blah/ // How are the root
> >> > table entries used and would it be safe to remove these entries since
> >> > they reference a deleted table?
> >> > Thanks in advance,
> >> > Matt
> >> > //
> >
> > --
> > Christopher
>
> --
Christopher

RE: accumulo.root invalid table reference [SEC=UNOFFICIAL]

Posted by "Dickson, Matt MR" <ma...@defence.gov.au>.
UNOFFICIAL

Firstly, thank you for your advice; it's been very helpful.

Increasing the tablet server memory has allowed the metadata table to come online.  Using rfile-info and looking at the splits for the metadata table, it appears that all of the metadata table entries are in one tablet, so all tablet servers query the one node hosting that tablet.

I suspect the cause of this was a poorly designed table for which the Accumulo monitor GUI at one point reported 1.02T tablets.  We've subsequently deleted that table, but there may have been so many entries in the metadata table that all splits on it fell within this massive table, which had the table id 1vm.

To rectify this, is it safe to run a merge on the metadata table to force it to redistribute?

________________________________
From: Michael Wall [mailto:mjwall@gmail.com]
Sent: Wednesday, 22 February 2017 02:44
To: user@accumulo.apache.org
Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]

Matt,

If I am reading this correctly, you have a tablet that is being loading onto a tserver.  That tserver dies, so the tablet is then assigned to another tablet.  While the tablet is being loading, that tserver dies and so on.  Is that correct?

Can you identify the tablet that is bouncing around?  If so, try using rfile-info -d to inspect the rfiles associated with that tablet.  Also look at the rfiles that compose that tablet to see if anything sticks out.

Any logs that would help explain why the tablet server is dying?  Can you increase the memory of the tserver?

Mike

On Tue, Feb 21, 2017 at 10:35 AM Josh Elser <jo...@gmail.com>> wrote:
... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
communicating with ZooKeeper, will retry
SessionExpiredException: KeeperErrorCode = Session expired for
/accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory

There can be a number of causes for this, but here are the most likely ones.

* JVM gc pauses
* ZooKeeper max client connections
* Operating System/Hardware-level pauses

The former should be noticeable by the Accumulo log. There is a daemon
running which watches for pauses that happen and then reports them. If
this is happening, you might have to give the process some more Java
heap, tweak your CMS/G1 parameters, etc.

For maxClientConnections, see
https://community.hortonworks.com/articles/51191/understanding-apache-zookeeper-connection-rate-lim.html

For the latter, swappiness is the most likely candidate (assuming this
is hopping across different physical nodes), as are "transparent huge
pages". If it is limited to a single host, things like bad NICs, hard
drives, and other hardware issues might be a source of slowness.

On Mon, Feb 20, 2017 at 10:18 PM, Dickson, Matt MR
<ma...@defence.gov.au>> wrote:
> UNOFFICIAL
>
> It looks like an issue with one of the metadata table tablets. On startup
> the server that hosts a particular metadata tablet gets scanned by all other
> tablet servers in the cluster.  This then crashes that tablet server with an
> error in the tserver log;
>
> ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
> communicating with ZooKeeper, will retry
> SessionExpiredException: KeeperErrorCode = Session expired for
> /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
>
> That metadata table tablet is then transferred to another host which then
> fails also, and so on.
>
> While the server is hosting this metadata tablet, we see the following log
> statement from all tserver.logs in the cluster:
>
> .... [impl.ThriftScanner] DEBUG: Scan failed, thrift error
> org.apache.thrift.transport.TTransportException  null
> (!0;1vm\\;125.323.233.23::2016103<,server.com.org:9997<http://server.com.org:9997>,2342423df12341d)
> Hope that helps complete the picture.
>
>
> ________________________________
> From: Christopher [mailto:ctubbsii@apache.org<ma...@apache.org>]
> Sent: Tuesday, 21 February 2017 13:17
>
> To: user@accumulo.apache.org<ma...@accumulo.apache.org>
> Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>
> Removing them is probably a bad idea. The root table entries correspond to
> split points in the metadata table. There is no need for the tables which
> existed when the metadata table split to still exist for this to continue to
> act as a valid split point.
>
> Would need to see the exception stack trace, or at least an error message,
> to troubleshoot the shell scanning error you saw.
>
>
> On Mon, Feb 20, 2017, 20:00 Dickson, Matt MR <ma...@defence.gov.au>>
> wrote:
>>
>> UNOFFICIAL
>>
>> In case it is ok to remove these from the root table, how can I scan the
>> root table for rows with a rowid starting with !0;1vm?
>>
>> Running "scan -b !0;1vm" throws an exception and exits the shell.
>>
>>
>> -----Original Message-----
>> From: Dickson, Matt MR [mailto:matt.dickson@defence.gov.au<ma...@defence.gov.au>]
>> Sent: Tuesday, 21 February 2017 09:30
>> To: 'user@accumulo.apache.org<ma...@accumulo.apache.org>'
>> Subject: RE: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>>
>> UNOFFICIAL
>>
>>
>> Does that mean I should have entries for 1vm in the metadata table
>> corresponding to the root table?
>>
>> We are running 1.6.5
>>
>>
>> -----Original Message-----
>> From: Josh Elser [mailto:josh.elser@gmail.com<ma...@gmail.com>]
>> Sent: Tuesday, 21 February 2017 09:22
>> To: user@accumulo.apache.org<ma...@accumulo.apache.org>
>> Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>>
>> The root table should only reference the tablets in the metadata table.
>> It's a hierarchy: like metadata is for the user tables, root is for the
>> metadata table.
>>
>> What version are ya running, Matt?
>>
>> Dickson, Matt MR wrote:
>> > *UNOFFICIAL*
>> >
>> > I have a situation where all tablet servers are progressively being
>> > declared dead. From the logs the tservers report errors like:
>> > 2017-02-.... DEBUG: Scan failed thrift error
>> > org.apache.thrift.trasport.TTransportException null
>> > (!0;1vm\\125.323.233.23::2016103<,server.com.org:9997<http://server.com.org:9997>,2342423df12341d)
>> > 1vm was a table id that was deleted several months ago so it appears
>> > there is some invalid reference somewhere.
>> > Scanning the metadata table "scan -b 1vm" returns no rows returned for
>> > 1vm.
>> > A scan of the accumulo.root table returns approximately 15 rows that
>> > start with; !0:1vm;<i/p addr>/::2016103 /blah/ // How are the root
>> > table entries used and would it be safe to remove these entries since
>> > they reference a deleted table?
>> > Thanks in advance,
>> > Matt
>> > //
>
> --
> Christopher

Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]

Posted by Michael Wall <mj...@gmail.com>.
Matt,

If I am reading this correctly, you have a tablet that is being loaded
onto a tserver.  That tserver dies, so the tablet is then assigned to
another tserver.  While the tablet is being loaded, that tserver dies,
and so on.  Is that correct?

Can you identify the tablet that is bouncing around?  If so, try using
rfile-info -d to inspect the rfiles associated with that tablet.  Also look
at the rfiles that compose that tablet to see if anything sticks out.
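A sketch of that inspection (the paths and file name below are illustrative; the metadata table's rfiles live under its table id, !0, in HDFS):

```shell
# Locate the rfiles backing the metadata table (table id !0).
hadoop fs -ls -R '/accumulo/tables/!0'

# Dump the keys/values of a suspect rfile; -d prints every entry, so
# entries for the deleted table id (1vm) will be visible if present.
accumulo rfile-info -d '/accumulo/tables/!0/default_tablet/A0000abc.rf'
```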

Any logs that would help explain why the tablet server is dying?  Can you
increase the memory of the tserver?

Mike

On Tue, Feb 21, 2017 at 10:35 AM Josh Elser <jo...@gmail.com> wrote:

> ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
> communicating with ZooKeeper, will retry
> SessionExpiredException: KeeperErrorCode = Session expired for
> /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
>
> There can be a number of causes for this, but here are the most likely
> ones.
>
> * JVM gc pauses
> * ZooKeeper max client connections
> * Operating System/Hardware-level pauses
>
> The former should be noticeable by the Accumulo log. There is a daemon
> running which watches for pauses that happen and then reports them. If
> this is happening, you might have to give the process some more Java
> heap, tweak your CMS/G1 parameters, etc.
>
> For maxClientConnections, see
>
> https://community.hortonworks.com/articles/51191/understanding-apache-zookeeper-connection-rate-lim.html
>
> For the latter, swappiness is the most likely candidate (assuming this
> is hopping across different physical nodes), as are "transparent huge
> pages". If it is limited to a single host, things like bad NICs, hard
> drives, and other hardware issues might be a source of slowness.
>
> On Mon, Feb 20, 2017 at 10:18 PM, Dickson, Matt MR
> <ma...@defence.gov.au> wrote:
> > UNOFFICIAL
> >
> > It looks like an issue with one of the metadata table tablets. On startup
> > the server that hosts a particular metadata tablet gets scanned by all
> other
> > tablet servers in the cluster.  This then crashes that tablet server
> with an
> > error in the tserver log;
> >
> > ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
> > communicating with ZooKeeper, will retry
> > SessionExpiredException: KeeperErrorCode = Session expired for
> >
> /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
> >
> > That metadata table tablet is then transferred to another host which then
> > fails also, and so on.
> >
> > While the server is hosting this metadata tablet, we see the following
> log
> > statement from all tserver.logs in the cluster:
> >
> > .... [impl.ThriftScanner] DEBUG: Scan failed, thrift error
> > org.apache.thrift.transport.TTransportException  null
> > (!0;1vm\\;125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
> > Hope that helps complete the picture.
> >
> >
> > ________________________________
> > From: Christopher [mailto:ctubbsii@apache.org]
> > Sent: Tuesday, 21 February 2017 13:17
> >
> > To: user@accumulo.apache.org
> > Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
> >
> > Removing them is probably a bad idea. The root table entries correspond
> to
> > split points in the metadata table. There is no need for the tables which
> > existed when the metadata table split to still exist for this to
> continue to
> > act as a valid split point.
> >
> > Would need to see the exception stack trace, or at least an error
> message,
> > to troubleshoot the shell scanning error you saw.
> >
> >
> > On Mon, Feb 20, 2017, 20:00 Dickson, Matt MR <
> matt.dickson@defence.gov.au>
> > wrote:
> >>
> >> UNOFFICIAL
> >>
> >> In case it is ok to remove these from the root table, how can I scan the
> >> root table for rows with a rowid starting with !0;1vm?
> >>
> >> Running "scan -b !0;1vm" throws an exception and exits the shell.
> >>
> >>
> >> -----Original Message-----
> >> From: Dickson, Matt MR [mailto:matt.dickson@defence.gov.au]
> >> Sent: Tuesday, 21 February 2017 09:30
> >> To: 'user@accumulo.apache.org'
> >> Subject: RE: accumulo.root invalid table reference [SEC=UNOFFICIAL]
> >>
> >> UNOFFICIAL
> >>
> >>
> >> Does that mean I should have entries for 1vm in the metadata table
> >> corresponding to the root table?
> >>
> >> We are running 1.6.5
> >>
> >>
> >> -----Original Message-----
> >> From: Josh Elser [mailto:josh.elser@gmail.com]
> >> Sent: Tuesday, 21 February 2017 09:22
> >> To: user@accumulo.apache.org
> >> Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
> >>
> >> The root table should only reference the tablets in the metadata table.
> >> It's a hierarchy: like metadata is for the user tables, root is for the
> >> metadata table.
> >>
> >> What version are ya running, Matt?
> >>
> >> Dickson, Matt MR wrote:
> >> > *UNOFFICIAL*
> >> >
> >> > I have a situation where all tablet servers are progressively being
> >> > declared dead. From the logs the tservers report errors like:
> >> > 2017-02-.... DEBUG: Scan failed thrift error
> >> > org.apache.thrift.transport.TTransportException null
> >> > (!0;1vm\\125.323.233.23::2016103<,server.com.org:9997
> ,2342423df12341d)
> >> > 1vm was a table id that was deleted several months ago so it appears
> >> > there is some invalid reference somewhere.
> >> > Scanning the metadata table with "scan -b 1vm" returns no rows for
> >> > 1vm.
> >> > A scan of the accumulo.root table returns approximately 15 rows that
> >> > start with; !0:1vm;<ip addr>::2016103 blah. How are the root
> >> > table entries used and would it be safe to remove these entries since
> >> > they reference a deleted table?
> >> > Thanks in advance,
> >> > Matt
> >
> > --
> > Christopher
>

Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]

Posted by Josh Elser <jo...@gmail.com>.
... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
communicating with ZooKeeper, will retry
SessionExpiredException: KeeperErrorCode = Session expired for
/accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory

There can be a number of causes for this, but here are the most likely ones.

* JVM gc pauses
* ZooKeeper max client connections
* Operating System/Hardware-level pauses

The former should be noticeable in the Accumulo log. There is a daemon
running which watches for pauses and reports them when they happen. If
this is happening, you might have to give the process some more Java
heap, tweak your CMS/G1 parameters, etc.
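
For the GC-pause case, options along these lines will surface pauses in a
GC log (a sketch only: Java 7/8-era flags matching Accumulo 1.6.x, with
illustrative values, not tuning advice; the log path is an example):

```
# Example JVM options to make GC pauses visible (illustrative values)
-Xmx4g -XX:+UseConcMarkSweepGC \
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
-XX:+PrintGCApplicationStoppedTime \
-Xloggc:/var/log/accumulo/tserver.gc.log
```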

For maxClientConnections, see
https://community.hortonworks.com/articles/51191/understanding-apache-zookeeper-connection-rate-lim.html
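
That limit is the `maxClientCnxns` property in zoo.cfg; ZooKeeper's default
is 60 connections per client host. A sketch (the raised value shown is
illustrative, not a recommendation):

```
# zoo.cfg -- per-client-host connection limit (default 60); raise it
# when many tservers or clients connect from the same host
maxClientCnxns=250
```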

For the latter, swappiness is the most likely candidate (assuming this
is hopping across different physical nodes), as are "transparent huge
pages". If it is limited to a single host, things like bad NICs, hard
drives, and other hardware issues might be a source of slowness.
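
A quick way to check the OS-level candidates above, assuming a typical
Linux host (these are the standard procfs/sysfs paths; the THP interface
may be absent on some kernels, hence the fallback):

```shell
# Current swappiness: higher values make the kernel more willing to
# swap out tserver memory, which can cause exactly this kind of pause
cat /proc/sys/vm/swappiness

# Transparent huge pages: the bracketed word is the active mode,
# e.g. "[always] madvise never"
cat /sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null \
  || echo "THP interface not present on this kernel"
```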

On Mon, Feb 20, 2017 at 10:18 PM, Dickson, Matt MR
<ma...@defence.gov.au> wrote:
> UNOFFICIAL
>
> It looks like an issue with one of the metadata table tablets. On startup
> the server that hosts a particular metadata tablet gets scanned by all other
> tablet servers in the cluster.  This then crashes that tablet server with an
> error in the tserver log;
>
> ... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception
> communicating with ZooKeeper, will retry
> SessionExpiredException: KeeperErrorCode = Session expired for
> /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory
>
> That metadata table tablet is then transferred to another host which then
> fails also, and so on.
>
> While the server is hosting this metadata tablet, we see the following log
> statement from all tserver.logs in the cluster:
>
> .... [impl.ThriftScanner] DEBUG: Scan failed, thrift error
> org.apache.thrift.transport.TTransportException  null
> (!0;1vm\\;125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
> Hope that helps complete the picture.
>
>
> ________________________________
> From: Christopher [mailto:ctubbsii@apache.org]
> Sent: Tuesday, 21 February 2017 13:17
>
> To: user@accumulo.apache.org
> Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>
> Removing them is probably a bad idea. The root table entries correspond to
> split points in the metadata table. There is no need for the tables which
> existed when the metadata table split to still exist for this to continue to
> act as a valid split point.
>
> Would need to see the exception stack trace, or at least an error message,
> to troubleshoot the shell scanning error you saw.
>
>
> On Mon, Feb 20, 2017, 20:00 Dickson, Matt MR <ma...@defence.gov.au>
> wrote:
>>
>> UNOFFICIAL
>>
>> In case it is ok to remove these from the root table, how can I scan the
>> root table for rows with a rowid starting with !0;1vm?
>>
>> Running "scan -b !0;1vm" throws an exception and exits the shell.
>>
>>
>> -----Original Message-----
>> From: Dickson, Matt MR [mailto:matt.dickson@defence.gov.au]
>> Sent: Tuesday, 21 February 2017 09:30
>> To: 'user@accumulo.apache.org'
>> Subject: RE: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>>
>> UNOFFICIAL
>>
>>
>> Does that mean I should have entries for 1vm in the metadata table
>> corresponding to the root table?
>>
>> We are running 1.6.5
>>
>>
>> -----Original Message-----
>> From: Josh Elser [mailto:josh.elser@gmail.com]
>> Sent: Tuesday, 21 February 2017 09:22
>> To: user@accumulo.apache.org
>> Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>>
>> The root table should only reference the tablets in the metadata table.
>> It's a hierarchy: like metadata is for the user tables, root is for the
>> metadata table.
>>
>> What version are ya running, Matt?
>>
>> Dickson, Matt MR wrote:
>> > *UNOFFICIAL*
>> >
>> > I have a situation where all tablet servers are progressively being
>> > declared dead. From the logs the tservers report errors like:
>> > 2017-02-.... DEBUG: Scan failed thrift error
>> > org.apache.thrift.transport.TTransportException null
>> > (!0;1vm\\125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
>> > 1vm was a table id that was deleted several months ago so it appears
>> > there is some invalid reference somewhere.
>> > Scanning the metadata table with "scan -b 1vm" returns no rows for
>> > 1vm.
>> > A scan of the accumulo.root table returns approximately 15 rows that
>> > start with; !0:1vm;<ip addr>::2016103 blah. How are the root
>> > table entries used and would it be safe to remove these entries since
>> > they reference a deleted table?
>> > Thanks in advance,
>> > Matt
>
> --
> Christopher

RE: accumulo.root invalid table reference [SEC=UNOFFICIAL]

Posted by "Dickson, Matt MR" <ma...@defence.gov.au>.
UNOFFICIAL

It looks like an issue with one of the metadata table tablets. On startup the server that hosts a particular metadata tablet gets scanned by all other tablet servers in the cluster.  This then crashes that tablet server with an error in the tserver log;

... [zookeeper.ZooCache] WARN: Saw (possibly) transient exception communicating with ZooKeeper, will retry
SessionExpiredException: KeeperErrorCode = Session expired for /accumulo/4234234234234234/namespaces/+accumulo/conf/table.scan.max.memory

That metadata table tablet is then transferred to another host which then fails also, and so on.

While the server is hosting this metadata tablet, we see the following log statement from all tserver.logs in the cluster:

.... [impl.ThriftScanner] DEBUG: Scan failed, thrift error org.apache.thrift.transport.TTransportException  null (!0;1vm\\;125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
Hope that helps complete the picture.


________________________________
From: Christopher [mailto:ctubbsii@apache.org]
Sent: Tuesday, 21 February 2017 13:17
To: user@accumulo.apache.org
Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]


Removing them is probably a bad idea. The root table entries correspond to split points in the metadata table. There is no need for the tables which existed when the metadata table split to still exist for this to continue to act as a valid split point.

Would need to see the exception stack trace, or at least an error message, to troubleshoot the shell scanning error you saw.

On Mon, Feb 20, 2017, 20:00 Dickson, Matt MR <ma...@defence.gov.au> wrote:
UNOFFICIAL

In case it is ok to remove these from the root table, how can I scan the root table for rows with a rowid starting with !0;1vm?

Running "scan -b !0;1vm" throws an exception and exits the shell.


-----Original Message-----
From: Dickson, Matt MR [mailto:matt.dickson@defence.gov.au]
Sent: Tuesday, 21 February 2017 09:30
To: 'user@accumulo.apache.org'
Subject: RE: accumulo.root invalid table reference [SEC=UNOFFICIAL]

UNOFFICIAL


Does that mean I should have entries for 1vm in the metadata table corresponding to the root table?

We are running 1.6.5


-----Original Message-----
From: Josh Elser [mailto:josh.elser@gmail.com]
Sent: Tuesday, 21 February 2017 09:22
To: user@accumulo.apache.org
Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]

The root table should only reference the tablets in the metadata table.
It's a hierarchy: like metadata is for the user tables, root is for the metadata table.

What version are ya running, Matt?

Dickson, Matt MR wrote:
> *UNOFFICIAL*
>
> I have a situation where all tablet servers are progressively being
> declared dead. From the logs the tservers report errors like:
> 2017-02-.... DEBUG: Scan failed thrift error
> org.apache.thrift.transport.TTransportException null
> (!0;1vm\\125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
> 1vm was a table id that was deleted several months ago so it appears
> there is some invalid reference somewhere.
> Scanning the metadata table with "scan -b 1vm" returns no rows for 1vm.
> A scan of the accumulo.root table returns approximately 15 rows that
> start with; !0:1vm;<ip addr>::2016103 blah. How are the root
> table entries used and would it be safe to remove these entries since
> they reference a deleted table?
> Thanks in advance,
> Matt
--
Christopher

Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]

Posted by Christopher <ct...@apache.org>.
Removing them is probably a bad idea. The root table entries correspond to
split points in the metadata table. There is no need for the tables which
existed when the metadata table split to still exist for this to continue
to act as a valid split point.
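
In other words, a root-table row like the ones Matt sees is only a tablet
boundary marker. A sketch of how such an entry decomposes (row abbreviated
per the thread):

```
!0;1vm;125.323.233.23::2016103
^^  ^
|   +-- end row of one accumulo.metadata tablet, which happens to
|       begin with the old table id 1vm
+------ table id of accumulo.metadata ("!0")
```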

Would need to see the exception stack trace, or at least an error message,
to troubleshoot the shell scanning error you saw.

On Mon, Feb 20, 2017, 20:00 Dickson, Matt MR <ma...@defence.gov.au>
wrote:

> UNOFFICIAL
>
> In case it is ok to remove these from the root table, how can I scan the
> root table for rows with a rowid starting with !0;1vm?
>
> Running "scan -b !0;1vm" throws an exception and exits the shell.
>
>
> -----Original Message-----
> From: Dickson, Matt MR [mailto:matt.dickson@defence.gov.au]
> Sent: Tuesday, 21 February 2017 09:30
> To: 'user@accumulo.apache.org'
> Subject: RE: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>
> UNOFFICIAL
>
>
> Does that mean I should have entries for 1vm in the metadata table
> corresponding to the root table?
>
> We are running 1.6.5
>
>
> -----Original Message-----
> From: Josh Elser [mailto:josh.elser@gmail.com]
> Sent: Tuesday, 21 February 2017 09:22
> To: user@accumulo.apache.org
> Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]
>
> The root table should only reference the tablets in the metadata table.
> It's a hierarchy: like metadata is for the user tables, root is for the
> metadata table.
>
> What version are ya running, Matt?
>
> Dickson, Matt MR wrote:
> > *UNOFFICIAL*
> >
> > I have a situation where all tablet servers are progressively being
> > declared dead. From the logs the tservers report errors like:
> > 2017-02-.... DEBUG: Scan failed thrift error
> > org.apache.thrift.transport.TTransportException null
> > (!0;1vm\\125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
> > 1vm was a table id that was deleted several months ago so it appears
> > there is some invalid reference somewhere.
> > Scanning the metadata table with "scan -b 1vm" returns no rows for
> > 1vm.
> > A scan of the accumulo.root table returns approximately 15 rows that
> > start with; !0:1vm;<ip addr>::2016103 blah. How are the root
> > table entries used and would it be safe to remove these entries since
> > they reference a deleted table?
> > Thanks in advance,
> > Matt
>
-- 
Christopher

RE: accumulo.root invalid table reference [SEC=UNOFFICIAL]

Posted by "Dickson, Matt MR" <ma...@defence.gov.au>.
UNOFFICIAL

In case it is ok to remove these from the root table, how can I scan the root table for rows with a rowid starting with !0;1vm?  

Running "scan -b !0;1vm" throws an exception and exits the shell. 
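
One thing that may sidestep the shell error (a sketch, assuming the shell
was tripping over the unquoted `!` and `;` in the row): name the table
explicitly with -t and quote the begin/end rows, e.g.

```
root@instance> scan -t accumulo.root -b '!0;1vm' -e '!0;1vm~'
```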


-----Original Message-----
From: Dickson, Matt MR [mailto:matt.dickson@defence.gov.au]
Sent: Tuesday, 21 February 2017 09:30
To: 'user@accumulo.apache.org'
Subject: RE: accumulo.root invalid table reference [SEC=UNOFFICIAL]

UNOFFICIAL


Does that mean I should have entries for 1vm in the metadata table corresponding to the root table?

We are running 1.6.5
 

-----Original Message-----
From: Josh Elser [mailto:josh.elser@gmail.com]
Sent: Tuesday, 21 February 2017 09:22
To: user@accumulo.apache.org
Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]

The root table should only reference the tablets in the metadata table. 
It's a hierarchy: like metadata is for the user tables, root is for the metadata table.

What version are ya running, Matt?

Dickson, Matt MR wrote:
> *UNOFFICIAL*
>
> I have a situation where all tablet servers are progressively being 
> declared dead. From the logs the tservers report errors like:
> 2017-02-.... DEBUG: Scan failed thrift error 
> org.apache.thrift.transport.TTransportException null
> (!0;1vm\\125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
> 1vm was a table id that was deleted several months ago so it appears 
> there is some invalid reference somewhere.
> Scanning the metadata table with "scan -b 1vm" returns no rows for 1vm.
> A scan of the accumulo.root table returns approximately 15 rows that
> start with; !0:1vm;<ip addr>::2016103 blah. How are the root
> table entries used and would it be safe to remove these entries since
> they reference a deleted table?
> Thanks in advance,
> Matt

RE: accumulo.root invalid table reference [SEC=UNOFFICIAL]

Posted by "Dickson, Matt MR" <ma...@defence.gov.au>.
UNOFFICIAL


Does that mean I should have entries for 1vm in the metadata table corresponding to the root table?

We are running 1.6.5
 

-----Original Message-----
From: Josh Elser [mailto:josh.elser@gmail.com]
Sent: Tuesday, 21 February 2017 09:22
To: user@accumulo.apache.org
Subject: Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]

The root table should only reference the tablets in the metadata table. 
It's a hierarchy: like metadata is for the user tables, root is for the metadata table.

What version are ya running, Matt?

Dickson, Matt MR wrote:
> *UNOFFICIAL*
>
> I have a situation where all tablet servers are progressively being 
> declared dead. From the logs the tservers report errors like:
> 2017-02-.... DEBUG: Scan failed thrift error 
> org.apache.thrift.transport.TTransportException null
> (!0;1vm\\125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
> 1vm was a table id that was deleted several months ago so it appears 
> there is some invalid reference somewhere.
> Scanning the metadata table with "scan -b 1vm" returns no rows for 1vm.
> A scan of the accumulo.root table returns approximately 15 rows that
> start with; !0:1vm;<ip addr>::2016103 blah. How are the root
> table entries used and would it be safe to remove these entries since
> they reference a deleted table?
> Thanks in advance,
> Matt

Re: accumulo.root invalid table reference [SEC=UNOFFICIAL]

Posted by Josh Elser <jo...@gmail.com>.
The root table should only reference the tablets in the metadata table. 
It's a hierarchy: like metadata is for the user tables, root is for the 
metadata table.
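
That hierarchy can be sketched as:

```
accumulo.root      ->  holds the tablet entries for accumulo.metadata
accumulo.metadata  ->  holds the tablet entries for user tables (e.g. 1vm)
```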

What version are ya running, Matt?

Dickson, Matt MR wrote:
> *UNOFFICIAL*
>
> I have a situation where all tablet servers are progressively being
> declared dead. From the logs the tservers report errors like:
> 2017-02-.... DEBUG: Scan failed thrift error
> org.apache.thrift.transport.TTransportException null
> (!0;1vm\\125.323.233.23::2016103<,server.com.org:9997,2342423df12341d)
> 1vm was a table id that was deleted several months ago so it appears
> there is some invalid reference somewhere.
> Scanning the metadata table with "scan -b 1vm" returns no rows for 1vm.
> A scan of the accumulo.root table returns approximately 15 rows that
> start with;
> !0:1vm;<ip addr>::2016103 blah
> How are the root table entries used and would it be safe to remove these
> entries since they reference a deleted table?
> Thanks in advance,
> Matt