Posted to notifications@asterixdb.apache.org by "Murtadha Hubail (JIRA)" <ji...@apache.org> on 2017/11/03 16:57:00 UTC

[jira] [Closed] (ASTERIXDB-2145) Recovery process fails on 100 datasets

     [ https://issues.apache.org/jira/browse/ASTERIXDB-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Murtadha Hubail closed ASTERIXDB-2145.
--------------------------------------
    Resolution: Duplicate

> Recovery process fails on 100 datasets
> --------------------------------------
>
>                 Key: ASTERIXDB-2145
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-2145
>             Project: Apache AsterixDB
>          Issue Type: Bug
>            Reporter: Taewoo Kim
>            Assignee: Ian Maxon
>            Priority: Major
>
> The Cloudberry DB instance currently has 112 datasets in one dataverse. When that instance was restarted, the NC showed the following error and stopped.
> java.lang.IllegalStateException: Failed to redo
> at org.apache.asterix.app.nc.RecoveryManager.redo(RecoveryManager.java:712)
> at org.apache.asterix.app.nc.RecoveryManager.startRecoveryRedoPhase(RecoveryManager.java:378)
> at org.apache.asterix.app.nc.RecoveryManager.replayPartitionsLogs(RecoveryManager.java:187)
> at org.apache.asterix.app.nc.RecoveryManager.startLocalRecovery(RecoveryManager.java:179)
> at org.apache.asterix.app.nc.task.LocalRecoveryTask.perform(LocalRecoveryTask.java:43)
> at org.apache.asterix.app.replication.message.StartupTaskResponseMessage.handle(StartupTaskResponseMessage.java:56)
> at org.apache.asterix.messaging.NCMessageBroker.receivedMessage(NCMessageBroker.java:92)
> at org.apache.hyracks.control.nc.work.ApplicationMessageWork.run(ApplicationMessageWork.java:51)
> at org.apache.hyracks.control.common.work.WorkQueue$WorkerThread.run(WorkQueue.java:127)
> Caused by: org.apache.hyracks.api.exceptions.HyracksDataException: Cannot allocate dataset 191 memory since memory budget would be exceeded.
> at org.apache.asterix.common.context.DatasetLifecycleManager.allocateMemory(DatasetLifecycleManager.java:568)
> at org.apache.hyracks.storage.common.buffercache.ResourceHeapBufferAllocator.reserveAllocation(ResourceHeapBufferAllocator.java:53)
> at org.apache.hyracks.storage.am.lsm.common.impls.VirtualBufferCache.open(VirtualBufferCache.java:307)
> at org.apache.hyracks.storage.am.lsm.common.impls.MultitenantVirtualBufferCache.open(MultitenantVirtualBufferCache.java:119)
> at org.apache.hyracks.storage.am.lsm.btree.impls.LSMBTree.allocateMemoryComponent(LSMBTree.java:611)
> at org.apache.hyracks.storage.am.lsm.common.impls.AbstractLSMIndex.allocateMemoryComponents(AbstractLSMIndex.java:389)
> at org.apache.hyracks.storage.am.lsm.common.impls.LSMHarness.modify(LSMHarness.java:421)
> at org.apache.hyracks.storage.am.lsm.common.impls.LSMHarness.forceModify(LSMHarness.java:368)
> at org.apache.hyracks.storage.am.lsm.common.impls.LSMTreeIndexAccessor.forceUpsert(LSMTreeIndexAccessor.java:181)
> at org.apache.asterix.app.nc.RecoveryManager.redo(RecoveryManager.java:707)
> ... 8 more
> So, I increased the storage.memorycomponent.globalbudget parameter from 3GB to 5GB. The NC still failed with the following log output, and the recovery process could not finish.
> ... similar log records ...
> Oct 25, 2017 9:33:44 AM org.apache.asterix.transaction.management.resource.PersistentLocalResourceRepository loadDataverse
> INFO: Loading dataverse:berry
> Oct 25, 2017 9:33:44 AM org.apache.asterix.transaction.management.resource.PersistentLocalResourceRepository loadIndex
> INFO: Loading index:meta_idx_meta
> Oct 25, 2017 9:33:44 AM org.apache.asterix.transaction.management.resource.PersistentLocalResourceRepository loadIndex
> INFO: Resource loaded 161:storage/partition_1/berry/meta_idx_meta
> Oct 25, 2017 9:34:09 AM org.apache.hyracks.util.ExitUtil$ExitThread run
> INFO: JVM exiting with status 2; bye!
> So, I checked the parameter information page and found that the default value of storage.memorycomponent.numpages is 1/16 of the global component budget. Therefore, I decreased this parameter so that more datasets fit in memory, and the instance was finally able to start. So, it seems that the recovery process tries to load and keep all datasets in memory, and this needs to be checked.
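
The arithmetic behind this observation can be sketched as follows. This is a rough illustration, not AsterixDB code: the function names are hypothetical, "3GB" and "5GB" are assumed to mean GiB, and the model assumes one memory component budget per dataset, ignoring per-partition and per-index details. Taking the reporter's statement at face value (the default per-dataset budget is 1/16 of the global budget), raising the global budget would also raise the per-dataset budget, so the number of datasets that fit stays at 16. That would be consistent with the change from 3GB to 5GB not helping.

```python
GIB = 1024 ** 3  # assumption: the report's "GB" is treated as GiB


def default_per_dataset_budget(global_budget_bytes, fraction=16):
    """Per-dataset budget under the default described in the report
    (1/16 of the global budget). Hypothetical helper, not an AsterixDB API."""
    return global_budget_bytes // fraction


def max_resident_datasets(global_budget_bytes, per_dataset_budget_bytes):
    """Number of per-dataset budgets that fit in the global budget."""
    return global_budget_bytes // per_dataset_budget_bytes


for budget_gib in (3, 5):
    global_budget = budget_gib * GIB
    per_dataset = default_per_dataset_budget(global_budget)
    count = max_resident_datasets(global_budget, per_dataset)
    print(f"{budget_gib} GiB global budget: {count} datasets fit with the default")

# Largest per-dataset budget that still lets all 112 datasets fit in 3 GiB.
needed = (3 * GIB) // 112
print(f"per-dataset budget for 112 datasets in 3 GiB: {needed / 1024 ** 2:.1f} MiB")
```

Under these assumptions, only lowering the per-dataset budget (which is what decreasing storage.memorycomponent.numpages does) raises the number of datasets that can be held at once.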



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)