Posted to user@geode.apache.org by Kapil Goyal <go...@vmware.com> on 2016/10/04 08:16:29 UTC

OplogCompactor race issue?

Hi All,

We have been testing a single cache node with a lot of data recently and frequently run into this error:

[info 2016/09/29 06:16:06.823 UTC <OplogCompactor nsxDiskStore for oplog oplog#6> tid=0x19] OplogCompactor for nsxDiskStore compaction oplog id(s): oplog#6
[info 2016/09/29 06:16:08.232 UTC <OplogCompactor nsxDiskStore for oplog oplog#6> tid=0x19] compaction did 6,310 creates and updates in 1,408 ms
[info 2016/09/29 06:16:08.248 UTC <Oplog Delete Task> tid=0x19] Deleted oplog#6 crf for disk store nsxDiskStore.
[info 2016/09/29 06:16:08.256 UTC <Oplog Delete Task> tid=0x19] Deleted oplog#6 krf for disk store nsxDiskStore.
[info 2016/09/29 06:16:08.256 UTC <Oplog Delete Task> tid=0x19] Deleted oplog#6 drf for disk store nsxDiskStore.
[info 2016/09/29 06:17:03.887 UTC <Event Processor for GatewaySender_AsyncEventQueue_txLogEventQueue> tid=0x19] Created oplog#8 drf for disk store nsxDiskStore.
[info 2016/09/29 06:17:03.911 UTC <Event Processor for GatewaySender_AsyncEventQueue_txLogEventQueue> tid=0x19] Created oplog#8 crf for disk store nsxDiskStore.
[info 2016/09/29 06:17:04.031 UTC <Idle OplogCompactor> tid=0x19] Created oplog#7 krf for disk store nsxDiskStore.
[info 2016/09/29 06:17:04.314 UTC <OplogCompactor nsxDiskStore for oplog oplog#7> tid=0x19] OplogCompactor for nsxDiskStore compaction oplog id(s): oplog#7
[error 2016/09/29 06:17:16.075 UTC <OplogCompactor nsxDiskStore for oplog oplog#7> tid=0x19] A DiskAccessException has occurred while writing to the disk for disk store nsxDiskStore. The cache will be closed.
com.gemstone.gemfire.cache.DiskAccessException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by java.io.IOException: Stream Closed
        at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5235)

From the logs it appears there may be a race between threads "Idle OplogCompactor" and "OplogCompactor nsxDiskStore for oplog oplog#7". I see that both are doing operations related to oplog#7. The former logs creation of a KRF file, while the latter is trying to access either the DRF or the CRF file. Now, is it possible that "Idle OplogCompactor" closed the DRF/CRF files for oplog#7 as part of creating the KRF for the same? This is what GemFire docs say about it:

"After the oplog is closed, GemFire also attempts to create a krf file, which contains the key names as well as the offset for the value within the crf file."

Based on the above, it's possible that oplog#7 had already been closed and its KRF already created when the compactor tried to access the files.

Have any of you run into this error before? Any suggestions?

Thanks
Kapil
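
[Editor's note: the "Stream Closed" IOException quoted above is the standard java.io.RandomAccessFile behavior when a file is accessed after another code path has closed it. A minimal JDK-only sketch (no GemFire classes; the file name is invented for illustration) reproduces the same failure mode as the getFilePointer call in the stack trace:]

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class ClosedRafDemo {
    // Returns true if touching a RandomAccessFile after close() throws
    // IOException, mirroring the "Stream Closed" failure seen when the
    // compactor accesses files another thread has already closed.
    public static boolean closedAccessThrows() throws IOException {
        File f = File.createTempFile("oplog-demo", ".crf");
        f.deleteOnExit();
        RandomAccessFile raf = new RandomAccessFile(f, "rw");
        raf.close(); // stands in for the other thread closing the CRF/DRF
        try {
            raf.getFilePointer(); // same call as in the reported stack trace
            return false;
        } catch (IOException expected) {
            return true; // JDK reports this as "Stream Closed"
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(closedAccessThrows());
    }
}
```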

Re: OplogCompactor race issue?

Posted by Kapil Goyal <go...@vmware.com>.
Thanks Darrel.

I will try out your suggestion and get back with results.

Regards
Kapil

From: Darrel Schneider <ds...@pivotal.io>
Reply-To: user@geode.incubator.apache.org
Date: Thursday, October 6, 2016 at 4:12 PM
To: user@geode.incubator.apache.org
Subject: Re: OplogCompactor race issue?

the ticket is GEODE-1969

On Thu, Oct 6, 2016 at 4:01 PM, Darrel Schneider <ds...@pivotal.io> wrote:
The call stack shows that Oplog.flush is calling "force." It only does this when the system property "gemfire.syncWrites" is set to "true." I think that is why you found this problem. I think all/most of our testing runs with this sys prop at the default of "false."

Currently the code in Oplog.flush that calls "force" does it outside of the sync on "lock."
If you move the "force" code up into this sync block it should fix your problem.
For example old code:
      synchronized (this.lock/* olf */) {
        if (olf.RAFClosed) {
          return;
        }
        ByteBuffer bb = olf.writeBuf;
        if (bb != null && bb.position() != 0) {
          bb.flip();
          int flushed = 0;
          do {
            flushed += olf.channel.write(bb);
          } while (bb.hasRemaining());
          // update bytesFlushed after entire writeBuffer is flushed to fix bug
          // 41201
          olf.bytesFlushed += flushed;
          bb.clear();
        }
      }
      if (doSync) {
        if (SYNC_WRITES) {
          // Synch Meta Data as well as content
          olf.channel.force(true);
        }
      }
new code with fix:
      synchronized (this.lock/* olf */) {
        if (olf.RAFClosed) {
          return;
        }
        ByteBuffer bb = olf.writeBuf;
        if (bb != null && bb.position() != 0) {
          bb.flip();
          int flushed = 0;
          do {
            flushed += olf.channel.write(bb);
          } while (bb.hasRemaining());
          // update bytesFlushed after entire writeBuffer is flushed to fix bug
          // 41201
          olf.bytesFlushed += flushed;
          bb.clear();
        }
        if (doSync) {
          if (SYNC_WRITES) {
            // Synch Meta Data as well as content
            olf.channel.force(true);
          }
        }
      }

I will file a ticket on this. Let me know if this fix works for you. Also be aware that setting "gemfire.syncWrites" to "true" can really slow down the rate at which you can write to disk.


On Thu, Oct 6, 2016 at 2:47 PM, Kapil Goyal <go...@vmware.com> wrote:
Hi Dan,

We are doing put operations within transactions.

Here is the stack trace:

[warn 2016/09/29 06:17:16.297 UTC <Event Processor for GatewaySender_AsyncEventQueue_txLogEventQueue> tid=0x19] com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDispatcher@55bded67: Exception during processing batch 448
com.gemstone.gemfire.internal.cache.wan.GatewaySenderException: com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDispatcher@55bded67: Exception during processing batch 448, caused by com.gemstone.gemfire.cache.CacheClosedException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by com.gemstone.gemfire.cache.DiskAccessException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by java.io.IOException: Stream Closed
        at com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDispatcher.dispatchBatch(GatewaySenderEventCallbackDispatcher.java:173)
        at com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDispatcher.dispatchBatch(GatewaySenderEventCallbackDispatcher.java:83)
        at com.gemstone.gemfire.internal.cache.wan.AbstractGatewaySenderEventProcessor.processQueue(AbstractGatewaySenderEventProcessor.java:579)
        at com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.run(SerialGatewaySenderEventProcessor.java:219)
Caused by: com.gemstone.gemfire.cache.CacheClosedException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by com.gemstone.gemfire.cache.DiskAccessException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by java.io.IOException: Stream Closed
        at com.gemstone.gemfire.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:1299)
        at com.gemstone.gemfire.CancelCriterion.checkCancelInProgress(CancelCriterion.java:82)
        at com.gemstone.gemfire.internal.cache.TXManagerImpl.checkClosed(TXManagerImpl.java:606)
        at com.gemstone.gemfire.internal.cache.TXManagerImpl.begin(TXManagerImpl.java:279)
        at com.vmware.nsx.management.container.dao.gemfire.GemFireTxLogDao.processTxLog(GemFireTxLogDao.java:119)
        at com.vmware.nsx.management.container.dao.gemfire.TxLogAsyncEventListener.processEvents(TxLogAsyncEventListener.java:93)
        at com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDispatcher.dispatchBatch(GatewaySenderEventCallbackDispatcher.java:164)
        ... 3 more
Caused by: com.gemstone.gemfire.cache.DiskAccessException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by java.io.IOException: Stream Closed
        at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5235)
        at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5223)
        at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5215)
        at com.gemstone.gemfire.internal.cache.Oplog.basicClose(Oplog.java:3312)
        at com.gemstone.gemfire.internal.cache.Oplog.destroy(Oplog.java:3375)
        at com.gemstone.gemfire.internal.cache.Oplog.handleEmptyAndOldest(Oplog.java:5877)
        at com.gemstone.gemfire.internal.cache.Oplog.handleNoLiveValues(Oplog.java:5924)
        at com.gemstone.gemfire.internal.cache.Oplog.basicModify(Oplog.java:4751)
        at com.gemstone.gemfire.internal.cache.Oplog.copyForwardModifyForCompact(Oplog.java:4594)
        at com.gemstone.gemfire.internal.cache.Oplog.compact(Oplog.java:6087)
        at com.gemstone.gemfire.internal.cache.DiskStoreImpl$OplogCompactor.compact(DiskStoreImpl.java:2970)
        at com.gemstone.gemfire.internal.cache.DiskStoreImpl$OplogCompactor.run(DiskStoreImpl.java:3034)
        at com.gemstone.gemfire.internal.cache.DiskStoreImpl$6.run(DiskStoreImpl.java:4701)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Stream Closed
        at java.io.RandomAccessFile.getFilePointer(Native Method)
        at com.gemstone.gemfire.internal.cache.persistence.UninterruptibleRandomAccessFile.getFilePointer(UninterruptibleRandomAccessFile.java:59)
        at com.gemstone.gemfire.internal.cache.persistence.UninterruptibleRandomAccessFile$UninterruptibleFileChannelImpl.doUninterruptibly(UninterruptibleRandomAccessFile.java:102)
        at com.gemstone.gemfire.internal.cache.persistence.UninterruptibleRandomAccessFile$UninterruptibleFileChannelImpl.force(UninterruptibleRandomAccessFile.java:226)
        at com.gemstone.gemfire.internal.cache.Oplog.flush(Oplog.java:5202)
        at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5232)
        ... 17 more

Thanks
Kapil

-----Original Message-----
From: Dan Smith [mailto:dsmith@pivotal.io]
Sent: Tuesday, October 04, 2016 10:08 AM
To: user@geode.incubator.apache.org
Subject: Re: OplogCompactor race issue?

Hi Kapil,

I haven't seen this particular issue before. Do you have any stack traces of the exception from the OplogCompactor? What sort of operations are you doing? puts?

-Dan




Re: OplogCompactor race issue?

Posted by Darrel Schneider <ds...@pivotal.io>.
the ticket is GEODE-1969


Re: OplogCompactor race issue?

Posted by Darrel Schneider <ds...@pivotal.io>.
The call stack shows that Oplog.flush is calling "force." It only does this
when the system property "gemfire.syncWrites" is set to "true." I think
that is why you found this problem. I think all/most of our testing runs
with this sys prop at the default of "false."

Currently the code in Oplog.flush that calls "force" does it outside of the
sync on "lock."
If you move the "force" code up into this sync block it should fix your
problem.
For example old code:
      synchronized (this.lock/* olf */) {
        if (olf.RAFClosed) {
          return;
        }
        ByteBuffer bb = olf.writeBuf;
        if (bb != null && bb.position() != 0) {
          bb.flip();
          int flushed = 0;
          do {
            flushed += olf.channel.write(bb);
          } while (bb.hasRemaining());
          // update bytesFlushed after entire writeBuffer is flushed to fix
          // bug 41201
          olf.bytesFlushed += flushed;
          bb.clear();
        }
      }
      if (doSync) {
        if (SYNC_WRITES) {
          // Synch Meta Data as well as content
          olf.channel.force(true);
        }
      }
new code with fix:
      synchronized (this.lock/* olf */) {
        if (olf.RAFClosed) {
          return;
        }
        ByteBuffer bb = olf.writeBuf;
        if (bb != null && bb.position() != 0) {
          bb.flip();
          int flushed = 0;
          do {
            flushed += olf.channel.write(bb);
          } while (bb.hasRemaining());
          // update bytesFlushed after entire writeBuffer is flushed to fix
          // bug 41201
          olf.bytesFlushed += flushed;
          bb.clear();
        }
        if (doSync) {
          if (SYNC_WRITES) {
            // Synch Meta Data as well as content
            olf.channel.force(true);
          }
        }
      }

I will file a ticket on this. Let me know if this fix works for you. Also
be aware that setting "gemfire.syncWrites" to "true" can really slow down
the rate at which you can write to disk.
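
[Editor's note: as a rough illustration of why this fix works, here is a simplified, self-contained sketch of the corrected pattern. The class and field names (SafeFlushSketch, closed) are invented stand-ins for the real Oplog internals, not actual GemFire code. With force(true) inside the synchronized block, the closed-file check and the channel access run under the same lock that close() takes, so close() can no longer interleave between them:]

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;

// Simplified stand-in for an oplog file; names are illustrative only.
public class SafeFlushSketch {
    private final Object lock = new Object();
    private final RandomAccessFile raf;
    private final FileChannel channel;
    private boolean closed; // analogue of olf.RAFClosed

    public SafeFlushSketch(File f) throws IOException {
        raf = new RandomAccessFile(f, "rw");
        channel = raf.getChannel();
    }

    // Fixed pattern: the closed check and force(true) share one
    // synchronized block, so close() cannot run between them.
    public void flush(boolean doSync) throws IOException {
        synchronized (lock) {
            if (closed) {
                return; // file already closed; nothing to flush
            }
            // ... write any buffered bytes to the channel here ...
            if (doSync) {
                channel.force(true); // fsync while still holding the lock
            }
        }
    }

    public void close() throws IOException {
        synchronized (lock) {
            closed = true;
            raf.close();
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("safeflush", ".crf");
        f.deleteOnExit();
        SafeFlushSketch s = new SafeFlushSketch(f);
        s.flush(true);  // force runs under the lock
        s.close();
        s.flush(true);  // closed check now guards force: no exception
        System.out.println("ok");
    }
}
```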


> > [info 2016/09/29 06:16:08.256 UTC <Oplog Delete Task> tid=0x19]
> > Deleted
> > oplog#6 krf for disk store nsxDiskStore.
> > [info 2016/09/29 06:16:08.256 UTC <Oplog Delete Task> tid=0x19]
> > Deleted
> > oplog#6 drf for disk store nsxDiskStore.
> > [info 2016/09/29 06:17:03.887 UTC <Event Processor for
> > GatewaySender_AsyncEventQueue_txLogEventQueue> tid=0x19] Created
> > oplog#8 drf for disk store nsxDiskStore.
> > [info 2016/09/29 06:17:03.911 UTC <Event Processor for
> > GatewaySender_AsyncEventQueue_txLogEventQueue> tid=0x19] Created
> > oplog#8 crf for disk store nsxDiskStore.
> > [info 2016/09/29 06:17:04.031 UTC <Idle OplogCompactor> tid=0x19]
> > Created
> > oplog#7 krf for disk store nsxDiskStore.
> > [info 2016/09/29 06:17:04.314 UTC <OplogCompactor nsxDiskStore for
> > oplog oplog#7> tid=0x19] OplogCompactor for nsxDiskStore compaction
> oplog id(s):
> > oplog#7
> > [error 2016/09/29 06:17:16.075 UTC <OplogCompactor nsxDiskStore for
> > oplog oplog#7> tid=0x19] A DiskAccessException has occurred while
> > writing to the disk for disk store nsxDiskStore. The cache will be
> closed.
> > ?com.gemstone.gemfire.cache.DiskAccessException: For DiskStore:
> > nsxDiskStore: Failed writing key to
> > "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by
> > java.io.IOException: Stream Closed ?at
> > com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5235)
> >
> > From the logs it appears there may be a race between threads "Idle
> > OplogCompactor" and "OplogCompactor nsxDiskStore for oplog oplog#7". I
> > see that both are doing operations related to oplog#7. The former logs
> > creation of a KRF file, while the latter is trying to access either
> > the DRF or the CRF file. Now, is it possible that "Idle
> > OplogCompactor" closed the DRF/CRF files for oplog#7 as part of
> > creating the KRF for the same? This is what GemFire docs say about it:
> >
> > "After the oplog is closed, GemFire also attempts to create a krf
> > file, which contains the key names as well as the offset for the value
> > within the crf file."
> >
> > Based on the above, it's possible that oplog#7 was already closed and
> > its KRF was already created, when the compactor tried to access the
> files.
> >
> > Have any of you run into this error before? Any suggestions?
> >
> > Thanks
> > Kapil
>

RE: OplogCompactor race issue?

Posted by Kapil Goyal <go...@vmware.com>.
Hi Dan,

We are doing put operations within transactions.

Here is the stack trace:

[warn 2016/09/29 06:17:16.297 UTC <Event Processor for GatewaySender_AsyncEventQueue_txLogEventQueue> tid=0x19] com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDispatcher@55bded67: Exception during processing batch 448
com.gemstone.gemfire.internal.cache.wan.GatewaySenderException: com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDispatcher@55bded67: Exception during processing batch 448, caused by com.gemstone.gemfire.cache.CacheClosedException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by com.gemstone.gemfire.cache.DiskAccessException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by java.io.IOException: Stream Closed
        at com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDispatcher.dispatchBatch(GatewaySenderEventCallbackDispatcher.java:173)
        at com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDispatcher.dispatchBatch(GatewaySenderEventCallbackDispatcher.java:83)
        at com.gemstone.gemfire.internal.cache.wan.AbstractGatewaySenderEventProcessor.processQueue(AbstractGatewaySenderEventProcessor.java:579)
        at com.gemstone.gemfire.internal.cache.wan.serial.SerialGatewaySenderEventProcessor.run(SerialGatewaySenderEventProcessor.java:219)
Caused by: com.gemstone.gemfire.cache.CacheClosedException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by com.gemstone.gemfire.cache.DiskAccessException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by java.io.IOException: Stream Closed
        at com.gemstone.gemfire.internal.cache.GemFireCacheImpl$Stopper.generateCancelledException(GemFireCacheImpl.java:1299)
        at com.gemstone.gemfire.CancelCriterion.checkCancelInProgress(CancelCriterion.java:82)
        at com.gemstone.gemfire.internal.cache.TXManagerImpl.checkClosed(TXManagerImpl.java:606)
        at com.gemstone.gemfire.internal.cache.TXManagerImpl.begin(TXManagerImpl.java:279)
        at com.vmware.nsx.management.container.dao.gemfire.GemFireTxLogDao.processTxLog(GemFireTxLogDao.java:119)
        at com.vmware.nsx.management.container.dao.gemfire.TxLogAsyncEventListener.processEvents(TxLogAsyncEventListener.java:93)
        at com.gemstone.gemfire.internal.cache.wan.GatewaySenderEventCallbackDispatcher.dispatchBatch(GatewaySenderEventCallbackDispatcher.java:164)
        ... 3 more
Caused by: com.gemstone.gemfire.cache.DiskAccessException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by java.io.IOException: Stream Closed
        at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5235)
        at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5223)
        at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5215)
        at com.gemstone.gemfire.internal.cache.Oplog.basicClose(Oplog.java:3312)
        at com.gemstone.gemfire.internal.cache.Oplog.destroy(Oplog.java:3375)
        at com.gemstone.gemfire.internal.cache.Oplog.handleEmptyAndOldest(Oplog.java:5877)
        at com.gemstone.gemfire.internal.cache.Oplog.handleNoLiveValues(Oplog.java:5924)
        at com.gemstone.gemfire.internal.cache.Oplog.basicModify(Oplog.java:4751)
        at com.gemstone.gemfire.internal.cache.Oplog.copyForwardModifyForCompact(Oplog.java:4594)
        at com.gemstone.gemfire.internal.cache.Oplog.compact(Oplog.java:6087)
        at com.gemstone.gemfire.internal.cache.DiskStoreImpl$OplogCompactor.compact(DiskStoreImpl.java:2970)
        at com.gemstone.gemfire.internal.cache.DiskStoreImpl$OplogCompactor.run(DiskStoreImpl.java:3034)
        at com.gemstone.gemfire.internal.cache.DiskStoreImpl$6.run(DiskStoreImpl.java:4701)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Stream Closed
        at java.io.RandomAccessFile.getFilePointer(Native Method)
        at com.gemstone.gemfire.internal.cache.persistence.UninterruptibleRandomAccessFile.getFilePointer(UninterruptibleRandomAccessFile.java:59)
        at com.gemstone.gemfire.internal.cache.persistence.UninterruptibleRandomAccessFile$UninterruptibleFileChannelImpl.doUninterruptibly(UninterruptibleRandomAccessFile.java:102)
        at com.gemstone.gemfire.internal.cache.persistence.UninterruptibleRandomAccessFile$UninterruptibleFileChannelImpl.force(UninterruptibleRandomAccessFile.java:226)
        at com.gemstone.gemfire.internal.cache.Oplog.flush(Oplog.java:5202)
        at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5232)
        ... 17 more
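[Editor's note] The "Stream Closed" leaf of the trace comes from java.io.RandomAccessFile rejecting calls made after close(). A minimal, stdlib-only sketch (not GemFire code; the class name and temp-file name are invented for illustration) that reproduces the same failure mode, single-threaded:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Minimal sketch (not GemFire code): shows the "Stream Closed" failure
// mode from the trace, where getFilePointer() is invoked on a
// RandomAccessFile that some other code path has already closed.
public class StreamClosedDemo {

    // Returns the exception message produced by touching a closed file.
    static String touchClosedFile() throws IOException {
        File f = File.createTempFile("oplog-demo", ".crf");
        f.deleteOnExit();
        RandomAccessFile raf = new RandomAccessFile(f, "rw");
        raf.write(42);
        raf.close(); // stands in for the oplog files being closed, e.g. after krf creation
        try {
            raf.getFilePointer(); // same native call as at the bottom of the stack trace
            return "no error";
        } catch (IOException e) {
            return e.getMessage(); // "Stream Closed" on OpenJDK
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(touchClosedFile());
    }
}
```

Seeing this exact message in the trace points at a handle closed out from under the compactor, rather than a disk-level I/O failure.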

Thanks
Kapil

-----Original Message-----
From: Dan Smith [mailto:dsmith@pivotal.io] 
Sent: Tuesday, October 04, 2016 10:08 AM
To: user@geode.incubator.apache.org
Subject: Re: OplogCompactor race issue?

Hi Kapil,

I haven't seen this particular issue before. Do you have any stack traces of the exception from the OplogCompactor? What sort of operations are you doing? puts?

-Dan

On Tue, Oct 4, 2016 at 1:16 AM, Kapil Goyal <go...@vmware.com> wrote:
> Hi All,
>
> We have been testing a single cache node with a lot of data recently and frequently run into this error:
>
> [info 2016/09/29 06:16:06.823 UTC <OplogCompactor nsxDiskStore for oplog oplog#6> tid=0x19] OplogCompactor for nsxDiskStore compaction oplog id(s): oplog#6
> [info 2016/09/29 06:16:08.232 UTC <OplogCompactor nsxDiskStore for oplog oplog#6> tid=0x19] compaction did 6,310 creates and updates in 1,408 ms
> [info 2016/09/29 06:16:08.248 UTC <Oplog Delete Task> tid=0x19] Deleted oplog#6 crf for disk store nsxDiskStore.
> [info 2016/09/29 06:16:08.256 UTC <Oplog Delete Task> tid=0x19] Deleted oplog#6 krf for disk store nsxDiskStore.
> [info 2016/09/29 06:16:08.256 UTC <Oplog Delete Task> tid=0x19] Deleted oplog#6 drf for disk store nsxDiskStore.
> [info 2016/09/29 06:17:03.887 UTC <Event Processor for GatewaySender_AsyncEventQueue_txLogEventQueue> tid=0x19] Created oplog#8 drf for disk store nsxDiskStore.
> [info 2016/09/29 06:17:03.911 UTC <Event Processor for GatewaySender_AsyncEventQueue_txLogEventQueue> tid=0x19] Created oplog#8 crf for disk store nsxDiskStore.
> [info 2016/09/29 06:17:04.031 UTC <Idle OplogCompactor> tid=0x19] Created oplog#7 krf for disk store nsxDiskStore.
> [info 2016/09/29 06:17:04.314 UTC <OplogCompactor nsxDiskStore for oplog oplog#7> tid=0x19] OplogCompactor for nsxDiskStore compaction oplog id(s): oplog#7
> [error 2016/09/29 06:17:16.075 UTC <OplogCompactor nsxDiskStore for oplog oplog#7> tid=0x19] A DiskAccessException has occurred while writing to the disk for disk store nsxDiskStore. The cache will be closed.
> com.gemstone.gemfire.cache.DiskAccessException: For DiskStore: nsxDiskStore: Failed writing key to "/common/nsxapi/data/self/BACKUPnsxDiskStore_7", caused by java.io.IOException: Stream Closed
>         at com.gemstone.gemfire.internal.cache.Oplog.flushAll(Oplog.java:5235)
>
> From the logs it appears there may be a race between the threads "Idle OplogCompactor" and "OplogCompactor nsxDiskStore for oplog oplog#7". I see that both are doing operations related to oplog#7. The former logs creation of a KRF file, while the latter is trying to access either the DRF or the CRF file. Now, is it possible that "Idle OplogCompactor" closed the DRF/CRF files for oplog#7 as part of creating the KRF for the same oplog? This is what the GemFire docs say about it:
>
> "After the oplog is closed, GemFire also attempts to create a krf file, which contains the key names as well as the offset for the value within the crf file."
>
> Based on the above, it's possible that oplog#7 was already closed and its KRF already created when the compactor tried to access the files.
>
> Have any of you run into this error before? Any suggestions?
>
> Thanks
> Kapil

Re: OplogCompactor race issue?

Posted by Dan Smith <ds...@pivotal.io>.
Hi Kapil,

I haven't seen this particular issue before. Do you have any stack
traces of the exception from the OplogCompactor? What sort of
operations are you doing? puts?

-Dan
