Posted to notifications@accumulo.apache.org by GitBox <gi...@apache.org> on 2021/05/06 18:06:51 UTC

[GitHub] [accumulo] Viv1986 opened a new issue #2085: Unassigned tables on every restart accumulo 1.10.1

Viv1986 opened a new issue #2085:
URL: https://github.com/apache/accumulo/issues/2085


   Hi all,
   
   I have Accumulo 1.10.1, and for the last month every restart has broken the +r table (the root table). After it is restored, all tables in the cluster become unassigned, and I need to wait nearly an hour until all tables are assigned again.
   There is 1 master and 2 tablet servers. Can anybody help me with that?
   
   2021/05/06 17:20:50,343 | tserver:xxxxxx140.internal.xxxx.net | 1 | WARN | exception trying to assign tablet +r<< hdfs://xxxxxxxx120.xxxxxxxxxxxx-dev.local:8020/apps/accumulo/data/tables/+r/root_tablet
       java.lang.RuntimeException: Error recovering tablet +r<< from log files
           at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:499)
           at org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2413)
           at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler$3.run(TabletServer.java:1775)
       Caused by: java.io.IOException: Unable to find recovery files for extent +r<< logEntry: +r<< hdfs://xxxxxxxx120.xxxxxxxx-dev.local:8020/apps/accumulo/data/wal/xxxxxxxxxx.xxxxxxxxx-dev.local+9997/e31761f2-e600-49f9-9a5c-8972aa37005b
           at org.apache.accumulo.tserver.TabletServer.recover(TabletServer.java:3311)
           at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:437)
           ... 2 more
   2021/05/06 17:20:50,345 | tserver:xxxxxxxx140.internal.cloudapp.net | 1 | WARN | Error recovering tablet +r<< from log files
   2021/05/06 17:20:50,349 | tserver:xxxxxxxxx140.internal.cloudapp.net | 1 | WARN | failed to open tablet +r<< reporting failure to master
   2021/05/06 17:20:50,351 | tserver:xxxxxxxxx140.internal.cloudapp.net | 1 | WARN | rescheduling tablet load in 1.00 seconds
   2021/05/06 17:20:50,383 | master:xxxxxxxxx120.internal.cloudapp.net | 1 | ERROR | xxxxxxxxx140.geospatialstrg-dev.local:9997 reports assignment failed for tablet +r<<
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [accumulo] milleruntime commented on issue #2085: Unassigned tables on every restart accumulo 1.10.1

Posted by GitBox <gi...@apache.org>.

milleruntime commented on issue #2085:
URL: https://github.com/apache/accumulo/issues/2085#issuecomment-847047942


   There are a number of factors that could affect tablet migration (tablet size, number of tablets, number of tservers, recovery, configuration). Are you using the default load balancer, DefaultLoadBalancer (see the `table.balancer` property)? If so, it will attempt to keep an even number of tablets across all tservers. When a tablet is migrated from one tserver to another, it is unloaded and then loaded on the new tserver. If the tablet has a Write Ahead Log (WAL) associated with it, the new tserver will run recovery on that WAL. If you are having issues with WALs (as the previous comments suggest), this would hold up loading of the tablets with the problematic WALs.
   
   If you have resolved the WAL problems, then you might want to look into using a different load balancer. See this blog post about the GroupBalancer: https://accumulo.apache.org/blog/2015/03/20/balancing-groups-of-tablets.html
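
A toy illustration of the even-count balancing described in the comment above. This is a minimal sketch, not Accumulo's DefaultLoadBalancer code: `plan_migrations` and the tserver names are hypothetical, and the model looks only at tablet counts. Each planned move stands for one unload on the source tserver and one load on the destination, which is where WAL recovery would run if the tablet has a WAL.

```python
def plan_migrations(tablets_per_tserver):
    """Return (source, destination) moves that even out tablet counts.

    Toy model only: real balancers also weigh table, load and timing.
    """
    counts = dict(tablets_per_tserver)
    moves = []
    if not counts:
        return moves
    while True:
        busiest = max(counts, key=counts.get)
        idlest = min(counts, key=counts.get)
        # Stop once no tserver holds more than one tablet above another.
        if counts[busiest] - counts[idlest] <= 1:
            return moves
        counts[busiest] -= 1
        counts[idlest] += 1
        moves.append((busiest, idlest))


if __name__ == "__main__":
    # Hypothetical cluster: everything landed on one of two tservers.
    for source, destination in plan_migrations({"tserver1": 10, "tserver2": 0}):
        print(f"unload on {source}, load on {destination}")
```

In this model, a cluster where one tserver holds everything needs many moves before it is even, and each move pays the unload/load cost.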



[GitHub] [accumulo] Viv1986 commented on issue #2085: Unassigned tables on every restart accumulo 1.10.1

Posted by GitBox <gi...@apache.org>.

Viv1986 commented on issue #2085:
URL: https://github.com/apache/accumulo/issues/2085#issuecomment-839942726


   @EdColeman I tried, but every time I get:
   
   Hi. This is the qmail-send program at apache.org.
   I'm afraid I wasn't able to deliver your message to the following addresses.
   This is a permanent error; I've given up. Sorry it didn't work out.
   
   <us...@accumulo.apache.org>:
   Must be sent from an @apache.org address or a subscriber address or an address in LDAP.
   



[GitHub] [accumulo] ctubbsii commented on issue #2085: Unassigned tables on every restart accumulo 1.10.1

Posted by GitBox <gi...@apache.org>.

ctubbsii commented on issue #2085:
URL: https://github.com/apache/accumulo/issues/2085#issuecomment-833762468


   Based on the error message, it looks like you're missing a recovery file in HDFS for the root tablet. I would first check for HDFS being corrupt. Did you see this after a failure? Have you finished recovering HDFS from that failure before starting Accumulo? Were any files unrecoverable in HDFS? Did all nodes come back online for HDFS?



[GitHub] [accumulo] EdColeman closed issue #2085: Unassigned tables on every restart accumulo 1.10.1

Posted by GitBox <gi...@apache.org>.

EdColeman closed issue #2085:
URL: https://github.com/apache/accumulo/issues/2085


   



[GitHub] [accumulo] Viv1986 commented on issue #2085: Unassigned tables on every restart accumulo 1.10.1

Posted by GitBox <gi...@apache.org>.

Viv1986 commented on issue #2085:
URL: https://github.com/apache/accumulo/issues/2085#issuecomment-838854440


   I also found that Accumulo deleted those WAL files after the tables were reassigned, but the storage still has some old files that have been there for months or even a year.



[GitHub] [accumulo] Viv1986 edited a comment on issue #2085: Unassigned tables on every restart accumulo 1.10.1

Posted by GitBox <gi...@apache.org>.

Viv1986 edited a comment on issue #2085:
URL: https://github.com/apache/accumulo/issues/2085#issuecomment-834273003


   @ctubbsii I didn't find any errors in HDFS; everything recovered OK. The problem is that I can't reboot the cluster or a table: it becomes unavailable the whole time until all tables are reassigned.
   hdfs fsck /
   Status: HEALTHY
    Number of data-nodes:  2
    Number of racks:               1
    Total dirs:                    8560
    Total symlinks:                0
   
   Replicated Blocks:
    Total size:    136475049809 B (Total open files size: 372 B)
    Total files:   8701 (Files currently being written: 8)
    Total blocks (validated):      9240 (avg. block size 14770027 B) (Total open file blocks (not validated): 6)
    Minimally replicated blocks:   9240 (100.0 %)
    Over-replicated blocks:        0 (0.0 %)
    Under-replicated blocks:       9240 (100.0 %)
    Mis-replicated blocks:         0 (0.0 %)
    Default replication factor:    3
    Average block replication:     2.0
    Missing blocks:                0
    Corrupt blocks:                0
    Missing replicas:              9746 (34.52845 %)
   
   Erasure Coded Block Groups:
    Total size:    0 B
    Total files:   0
    Total block groups (validated):        0
    Minimally erasure-coded block groups:  0
    Over-erasure-coded block groups:       0
    Under-erasure-coded block groups:      0
    Unsatisfactory placement block groups: 0
    Average block group size:      0.0
    Missing block groups:          0
    Corrupt block groups:          0
    Missing internal blocks:       0
   FSCK ended at Fri May 07 11:16:08 UTC 2021 in 590 milliseconds
   
   If I reboot only one table, I get
   Some tablets were unloaded in an unsafe manner. Write-ahead logs are being recovered.
   and the recovery procedure for the same root table begins, followed by reassignment of the tables.
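
One thing the fsck summary above does show: the default replication factor (3) is higher than the number of data-nodes (2), which most likely explains why every block is reported as under-replicated even though none are missing or corrupt. Whether that has anything to do with the WAL recovery error is not established in this thread. The following is a minimal Python sketch for pulling those fields out of saved `hdfs fsck /` output; the function names and the choice of fields are illustrative assumptions, not part of any Hadoop tooling.

```python
import re

# fsck summary labels we care about, mapped to short (hypothetical) keys.
FIELDS = {
    "Number of data-nodes": "datanodes",
    "Default replication factor": "replication",
    "Under-replicated blocks": "under_replicated",
    "Missing blocks": "missing_blocks",
    "Corrupt blocks": "corrupt_blocks",
}


def parse_fsck_summary(text):
    """Collect the first integer value for each label listed in FIELDS."""
    summary = {}
    for line in text.splitlines():
        match = re.match(r"\s*([^:]+):\s*(\d+)", line)
        if not match:
            continue
        key = FIELDS.get(match.group(1).strip())
        if key is not None and key not in summary:
            summary[key] = int(match.group(2))
    return summary


def replication_notes(summary):
    """Return human-readable observations about the parsed summary."""
    notes = []
    datanodes = summary.get("datanodes")
    replication = summary.get("replication")
    if datanodes is not None and replication is not None and replication > datanodes:
        notes.append(
            f"replication factor {replication} exceeds {datanodes} data-nodes; "
            "blocks cannot reach full replication"
        )
    if summary.get("missing_blocks") or summary.get("corrupt_blocks"):
        notes.append("missing or corrupt blocks reported; inspect HDFS first")
    return notes


# Excerpt of the output posted in this thread.
SAMPLE = """\
 Number of data-nodes:  2
 Under-replicated blocks:       9240 (100.0 %)
 Default replication factor:    3
 Missing blocks:                0
 Corrupt blocks:                0
"""

if __name__ == "__main__":
    for note in replication_notes(parse_fsck_summary(SAMPLE)):
        print(note)
```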



[GitHub] [accumulo] EdColeman commented on issue #2085: Unassigned tables on every restart accumulo 1.10.1

Posted by GitBox <gi...@apache.org>.

EdColeman commented on issue #2085:
URL: https://github.com/apache/accumulo/issues/2085#issuecomment-839910056


   Please move this discussion to the [user mailing list](https://accumulo.apache.org/contact-us/).  On the list, please describe what steps you are taking and the behavior of the system - other details like # of tservers / tablets per tserver and type of ingest (bulk or batch) would be really helpful.  That will give you a wider audience, including those who are not seeing this discussion.



[GitHub] [accumulo] EdColeman commented on issue #2085: Unassigned tables on every restart accumulo 1.10.1

Posted by GitBox <gi...@apache.org>.

EdColeman commented on issue #2085:
URL: https://github.com/apache/accumulo/issues/2085#issuecomment-839946104


   Did you subscribe first? 



[GitHub] [accumulo] EdColeman commented on issue #2085: Unassigned tables on every restart accumulo 1.10.1

Posted by GitBox <gi...@apache.org>.

EdColeman commented on issue #2085:
URL: https://github.com/apache/accumulo/issues/2085#issuecomment-834580352


   Do you have a lot of data in the write-ahead logs?  Is your ingest streaming or do you do mostly bulk ingest?
   
   If you do not have a lot of data in the WAL, and IF you can determine that the tables' rfiles are intact and accessible, you could try a few things - but these will likely lead to some data loss - especially for anything in the WALs.
   
   Does the directory from your error message exist, and does it have any files in it?
   
   `hdfs://xxxxxxxx120.xxxxxxxx-dev.local:8020/apps/accumulo/data/wal/xxxxxxxxxx.xxxxxxxxx-dev.local+9997/e31761f2-e600-49f9-9a5c-8972aa37005b`
   
   I am assuming that because this is the root table, you cannot scan anything with the accumulo shell to examine the metadata table.
   
   There is some information in the [user manual](https://accumulo.apache.org/1.10/accumulo_user_manual.html#_troubleshooting) on what needs to be done if the root (or metadata) table(s) have a reference to a corrupt WAL - basically, shut down Accumulo, move the tables' rfiles, reinitialize Accumulo, and then bulk import the files that you moved.
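
To check the WAL path quoted above, it can help to split it into its parts first. The sketch below assumes the layout visible in the error message (`.../wal/<tserver host>+<port>/<WAL id>`); `describe_wal` is a hypothetical helper, and it only builds the `hdfs dfs -ls` command strings to run by hand rather than contacting HDFS.

```python
from urllib.parse import urlparse


def describe_wal(uri):
    """Split a WAL URI from a tserver log into its visible components."""
    parsed = urlparse(uri)
    parts = parsed.path.rstrip("/").split("/")
    wal_id = parts[-1]
    # The parent directory is named "<host>+<port>" in the logged path.
    host, _, port = parts[-2].rpartition("+")
    parent_dir = "/".join(parts[:-1])
    return {
        "namenode": parsed.netloc,
        "tserver_host": host,
        "tserver_port": port,
        "wal_id": wal_id,
        # Commands to run manually; nothing is executed here.
        "ls_file": f"hdfs dfs -ls {parsed.path}",
        "ls_parent": f"hdfs dfs -ls {parent_dir}",
    }


if __name__ == "__main__":
    # Path copied from the error message in this thread (hosts are masked).
    wal = (
        "hdfs://xxxxxxxx120.xxxxxxxx-dev.local:8020/apps/accumulo/data/wal/"
        "xxxxxxxxxx.xxxxxxxxx-dev.local+9997/e31761f2-e600-49f9-9a5c-8972aa37005b"
    )
    for name, value in describe_wal(wal).items():
        print(f"{name}: {value}")
```

Listing both the file and its parent directory shows whether the WAL itself is present and which other WALs that tserver left behind.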



[GitHub] [accumulo] Viv1986 commented on issue #2085: Unassigned tables on every restart accumulo 1.10.1

Posted by GitBox <gi...@apache.org>.

Viv1986 commented on issue #2085:
URL: https://github.com/apache/accumulo/issues/2085#issuecomment-839883746


   @EdColeman it reassigns in any case: if I start the master first, last, or even if I don't restart the master at all.
   The tables run compactions from time to time, not manually, but some time ago I added disks and started a major compaction, and it ran OK without any warnings.



[GitHub] [accumulo] EdColeman commented on issue #2085: Unassigned tables on every restart accumulo 1.10.1

Posted by GitBox <gi...@apache.org>.

EdColeman commented on issue #2085:
URL: https://github.com/apache/accumulo/issues/2085#issuecomment-847047456


   Please ask this on the user mailing list.  You will be more likely to get a response if you include additional information.  There are a lot of factors that trigger reassignment, and without any information on why the reassignment is occurring and what the system is doing, it is very difficult to do more than guess.




[GitHub] [accumulo] EdColeman commented on issue #2085: Unassigned tables on every restart accumulo 1.10.1

Posted by GitBox <gi...@apache.org>.

EdColeman commented on issue #2085:
URL: https://github.com/apache/accumulo/issues/2085#issuecomment-840154332


   Closing this - the discussion has moved to the mailing list and there is not enough actionable information in the issue to take any action.



[GitHub] [accumulo] EdColeman commented on issue #2085: Unassigned tables on every restart accumulo 1.10.1

Posted by GitBox <gi...@apache.org>.

EdColeman commented on issue #2085:
URL: https://github.com/apache/accumulo/issues/2085#issuecomment-839753763


   When you do a restart, do you start the tservers and THEN the master(s)? If you start the master first, it will assign everything to the first tserver(s) that it sees, resulting in an unbalanced cluster. It then takes a while for everything to be processed and rebalanced, because only one (or a few) tservers are processing everything.  There is a property, added in 1.10, to have the master wait for a required number of tservers.  It is best if the tservers are all up and then the master starts - that way all tservers are available for assignments.
   
   Other factors could be whether the tables are performing batch writes (vs bulk import) - that will generate WAL files that need to be processed during recovery.
   Do you ever compact the tables? That will reduce the number of files as well as process any info in the WAL, so that there is less work to perform during recovery.
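
The comment above does not name the startup property. In the 1.10 property documentation the relevant settings appear to be `master.startup.tserver.avail.min.count` and `master.startup.tserver.avail.max.wait`, but treat those names as an assumption and confirm them against the documentation for your version. The behaviour being described can be modelled roughly as follows; this is an illustrative Python sketch, not Accumulo's code, and `wait_for_tservers` and its parameters are hypothetical.

```python
import time


def wait_for_tservers(count_live, min_count, max_wait_s, poll_s=1.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Block until at least min_count tservers are live or max_wait_s passes.

    Returns True if the threshold was reached, False on timeout.
    A min_count of 0 or less disables the wait entirely.
    """
    if min_count <= 0:
        return True
    deadline = clock() + max_wait_s
    while True:
        if count_live() >= min_count:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


if __name__ == "__main__":
    # Hypothetical cluster where tservers register one at a time.
    registrations = iter([0, 1, 2])
    reached = wait_for_tservers(lambda: next(registrations), min_count=2,
                                max_wait_s=10, poll_s=0.01)
    print("start assigning tablets; threshold reached:", reached)
```

What the real master does when the wait times out is not modelled here; the point is only that assignment is delayed until enough tservers are available to share the work.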



[GitHub] [accumulo] ctubbsii commented on issue #2085: Unassigned tables on every restart accumulo 1.10.1

Posted by GitBox <gi...@apache.org>.

ctubbsii commented on issue #2085:
URL: https://github.com/apache/accumulo/issues/2085#issuecomment-834673129


   Before you do any recovery, I would examine the logs for any references to the missing file, to see if you can determine what might have happened to it. If HDFS is healthy and has no corrupt or missing data, I'm very curious how the file could have gone missing.



[GitHub] [accumulo] ctubbsii commented on issue #2085: Unassigned tables on every restart accumulo 1.10.1

Posted by GitBox <gi...@apache.org>.

ctubbsii commented on issue #2085:
URL: https://github.com/apache/accumulo/issues/2085#issuecomment-838926397


   Having a WAL file that isn't empty is normal for a running system, so I wouldn't be concerned about that. As for the orphaned files that have been around for months, that can happen if there's a failure after the accumulo-gc removes a deleted file entry from the metadata table, but before it actually deletes it in HDFS. This is a "fail safe" order of operations, but can result in orphaned files. It doesn't tell you much other than the fact that there was a failure in the past, though. At this point, I'm not sure if I have any other suggestions. You could try having a discussion on the [user mailing list](https://accumulo.apache.org/contact-us/) to get ideas from the wider Accumulo community than just those developers following the issue tracker.
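
The order of operations described above can be sketched with a toy model. This is not the accumulo-gc implementation: the sets stand in for the metadata table and HDFS, and `gc_remove`, `find_orphans`, and the file paths are hypothetical names used only for illustration.

```python
def gc_remove(path, delete_markers, hdfs_files, crash_between_steps=False):
    """Two-step removal in the order described in the comment.

    Returns True if both steps ran, False if the simulated crash hit.
    """
    # Step 1: remove the deleted-file entry from the (modelled) metadata table.
    delete_markers.discard(path)
    if crash_between_steps:
        # Simulated failure: the file is still in HDFS, but nothing tracks it.
        return False
    # Step 2: delete the file itself from (modelled) HDFS.
    hdfs_files.discard(path)
    return True


def find_orphans(hdfs_files, referenced, delete_markers):
    """Files present in storage that no tablet uses and no marker schedules."""
    return hdfs_files - referenced - delete_markers


if __name__ == "__main__":
    referenced = {"/tables/1/live.rf"}
    markers = {"/tables/1/old.rf"}
    files = {"/tables/1/live.rf", "/tables/1/old.rf"}
    gc_remove("/tables/1/old.rf", markers, files, crash_between_steps=True)
    print("orphans:", find_orphans(files, referenced, markers))
```

In this model an interrupted run leaves an extra file behind, which matches the orphaned months-old files reported earlier in the thread.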



[GitHub] [accumulo] Viv1986 commented on issue #2085: Unassigned tables on every restart accumulo 1.10.1

Posted by GitBox <gi...@apache.org>.

Viv1986 commented on issue #2085:
URL: https://github.com/apache/accumulo/issues/2085#issuecomment-839628189


   OK, my latest tests show that all clusters have this problem, and it is not related to the error described in my first message. I use Ambari, and it restarts the tablet servers and master with kill -9, so it seems that reassigning tables is expected after such a restart.
   I have another cluster with the same config but with heap size = 1.5G and 40.75B records, and reassignment there takes about 15 minutes; a cluster with the same config but a 40G Java heap size and 52.71B records takes about 4 hours to reassign. What affects table reassignment speed?



[GitHub] [accumulo] Viv1986 commented on issue #2085: Unassigned tables on every restart accumulo 1.10.1

Posted by GitBox <gi...@apache.org>.

Viv1986 commented on issue #2085:
URL: https://github.com/apache/accumulo/issues/2085#issuecomment-846920178


   Can anybody tell me what affects the speed of table reassignment? With the same configuration I get ~14 tables per minute on the current cluster and ~1500 per minute on another.




[GitHub] [accumulo] Viv1986 commented on issue #2085: Unassigned tables on every restart accumulo 1.10.1

Posted by GitBox <gi...@apache.org>.

Viv1986 commented on issue #2085:
URL: https://github.com/apache/accumulo/issues/2085#issuecomment-838804779


   @ctubbsii I examined all the logs and found nothing. Also, with hdfs dfs -ls I am able to find that WAL file, and it has a non-zero size. So, are there any other suggestions besides reinitialization?

