Posted to issues@hbase.apache.org by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org> on 2011/01/26 09:46:57 UTC

[jira] Created: (HBASE-3481) max seq id in flushed file can be larger than its correct value causing data loss during recovery

max seq id in flushed file can be larger than its correct value causing data loss during recovery
-------------------------------------------------------------------------------------------------

                 Key: HBASE-3481
                 URL: https://issues.apache.org/jira/browse/HBASE-3481
             Project: HBase
          Issue Type: Bug
            Reporter: Kannan Muthukkaruppan
            Priority: Critical


[While doing some cluster kill tests, I noticed some missing data after log recovery. Upon investigating further, and pretty printing contents of HFiles and recovered logs, this is my analysis of the situation/bug. Please confirm the theory and pitch in with suggestions.]

When memstores are flushed, the max sequence id recorded in the HFile should be the max sequence id of all KVs in the memstore. However, we seem to simply obtain the current sequence id from the HRegion, and stamp the HFile's MAX_SEQ_ID with it.

From HRegion.java:
{code}
    sequenceId = (wal == null)? myseqid: wal.startCacheFlush();
{code}

where, startCacheFlush() is:

{code}
public long startCacheFlush() {
    this.cacheFlushLock.lock();
    return obtainSeqNum();
}
{code}

where, obtainSeqNum() is simply: 

{code}
private long obtainSeqNum() {
    return this.logSeqNum.incrementAndGet();
}
{code}

So let's say a memstore contains edits with sequence numbers 1..10.

Meanwhile, say more Puts come along and go through this flow (in pseudo-code):

{code}
   1. HLog.append();
       1.1 obtainSeqNum()
       1.2 writeToWAL()

   2. updateMemStore()
{code}

So it is possible that the sequence number has already been incremented to say 15 if there are 5 more outstanding puts. Say the writeToWAL() is still in progress for these puts. In this case, none of these edits (11..15) would have been written to the memstore yet.

At this point if a cache flush of the memstore happens, then we'll record its MAX_SEQ_ID as 16 in the store file instead of 10 (because that's what obtainSeqNum() would return as the next sequence number to use, right?).
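
To make the arithmetic concrete, here is a toy simulation of that window (illustrative only, not HBase code):

{code}
import java.util.concurrent.atomic.AtomicLong;

// Toy model: the flush stamps the *next* sequence number (16), not the
// largest one actually present in the memstore (10).
public class SeqIdWindowDemo {
  public static void main(String[] args) {
    AtomicLong logSeqNum = new AtomicLong(0);
    long maxInMemstore = 0;
    for (int i = 0; i < 10; i++) {                  // edits 1..10: in WAL and memstore
      maxInMemstore = logSeqNum.incrementAndGet();
    }
    for (int i = 0; i < 5; i++) {                   // edits 11..15: seq ids assigned,
      logSeqNum.incrementAndGet();                  // writeToWAL() still in flight,
    }                                               // memstore not yet updated
    long flushSeqId = logSeqNum.incrementAndGet();  // what startCacheFlush() returns
    System.out.println("MAX_SEQ_ID stamped = " + flushSeqId);       // 16
    System.out.println("true max in memstore = " + maxInMemstore);  // 10
  }
}
{code}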

Assume that the edits 11..15 eventually complete, so the HLogs do contain the data for edits 11..15.

Now, if the region server were to crash at this point and we run log recovery, the splits all go through correctly, and a correct recovered.edits file is generated with the edits 11..15.

Next, when the region is opened, the HRegion notes that one of the store files says MAX_SEQ_ID is 16. So, when it replays the recovered.edits file, it skips replaying edits 11..15. Or in other words, data loss.
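
Roughly, the replay-time check that causes the skip looks like this (toy version; the real filter lives in HRegion's recovered-edits replay path):

{code}
// Toy version of the replay filter: an edit is replayed only if its seq id
// is greater than the max seq id recorded in the store files.
public class ReplayFilterDemo {
  static boolean shouldReplay(long editSeqId, long maxSeqIdInStores) {
    return editSeqId > maxSeqIdInStores;
  }
  public static void main(String[] args) {
    for (long seqId = 11; seqId <= 15; seqId++) {
      // With the wrongly stamped MAX_SEQ_ID of 16, all five edits are skipped.
      System.out.println("edit " + seqId + " replayed? " + shouldReplay(seqId, 16L));
    }
  }
}
{code}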

----

Re: [jira] Commented: (HBASE-3481) max seq id in flushed file can be larger than its correct value causing data loss during recovery

Posted by Ryan Rawson <ry...@gmail.com>.
In HRegion.internalFlushCache we have this logic:

    this.updatesLock.writeLock().lock();
    final long currentMemStoreSize = this.memstoreSize.get();
    List<StoreFlusher> storeFlushers = new ArrayList<StoreFlusher>(stores.size());
    try {
      sequenceId = (wal == null)? myseqid: wal.startCacheFlush();
      completeSequenceId = this.getCompleteCacheFlushSequenceId(sequenceId);

      for (Store s : stores.values()) {
        storeFlushers.add(s.getStoreFlusher(completeSequenceId));
      }

      // prepare flush (take a snapshot)
      for (StoreFlusher flusher : storeFlushers) {
        flusher.prepare();
      }
    } finally {
      this.updatesLock.writeLock().unlock();
    }

We take a write lock, no more puts/deletes/whatever can be done to this hregion.

We then grab a seqid (wal.startCacheFlush). We now snapshot everything.

We then release the update lock, and mutations can happen to the region again.

The flush sequence id should lie exactly between the snapshot and the memstore.

Given this code, I'm not sure how to explain what you are seeing...
But this logic seems spot on and correct.

On Wed, Jan 26, 2011 at 1:14 AM, Kannan Muthukkaruppan (JIRA)
<ji...@apache.org> wrote:
>
>    [ https://issues.apache.org/jira/browse/HBASE-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986911#action_12986911 ]
>
> Kannan Muthukkaruppan commented on HBASE-3481:
> ----------------------------------------------
>
> Is the last seq id readily available from the memstore KVs, or is it already stashed away somewhere? I agree that would be the cleanest/best fix.
>
> (Happy to accept a patch if you want to post one up. Else, I'll look further on this tomorrow morning).

[jira] Commented: (HBASE-3481) max seq id in flushed file can be larger than its correct value causing data loss during recovery

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987688#action_12987688 ] 

Todd Lipcon commented on HBASE-3481:
------------------------------------

Any way we can build a test for this? (even if it doesn't fail reliably every time, would be good to catch bugs like this in the future)


[jira] Updated: (HBASE-3481) max seq id in flushed file can be larger than its correct value causing data loss during recovery

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3481:
-------------------------

    Assignee: ryan rawson
      Status: Patch Available  (was: Open)

Marking patch available and assigning to Ryan.  Want to commit RR?


[jira] Commented: (HBASE-3481) max seq id in flushed file can be larger than its correct value causing data loss during recovery

Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987209#action_12987209 ] 

Kannan Muthukkaruppan commented on HBASE-3481:
----------------------------------------------

Since a single Put on the client gets promoted to multiPut(), this bug affects both regular puts and increment column value.
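
For context, a 0.90-era client-side sketch of that promotion (illustrative; the exact batching internals may differ):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Even a lone put() goes through HTable's write buffer and is shipped to
// the region server as a batch, so it exercises the multi-put server path.
public class SinglePutDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "t1");
    Put p = new Put(Bytes.toBytes("row1"));
    p.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
    table.put(p);          // buffered; sent immediately when autoflush is on
    table.flushCommits();  // buffered puts go out as one multi-put
  }
}
{code}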



[jira] Commented: (HBASE-3481) max seq id in flushed file can be larger than its correct value causing data loss during recovery

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987965#action_12987965 ] 

Todd Lipcon commented on HBASE-3481:
------------------------------------

BTW, I think I ran into another ramification of this bug. I had a region split,
but even after the parent was closed, it remained in the {{lastSeqWritten}}
map in HLog. So, my log shows:

2011-01-27 11:34:21,784 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=46, maxlogs=32; forcing flush of 5 regions(s): 1064fef659ccf634bfb6dbd24a3d3e32, 2e84570f830560a49276e236ebfb53ed, 61b89db95b24fbc6cabf6661cfbc9534, 6f219526af6c8a9e180324bd5d03a08b, d7e912ab428
2011-01-27 11:34:21,784 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of d7e912ab428b25dbb388edda934591a0r=null, requester=null
2011-01-27 11:34:45,904 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Too many hlogs: logs=47, maxlogs=32; forcing flush of 1 regions(s): d7e912ab428b25dbb388edda934591a0
2011-01-27 11:34:45,934 WARN org.apache.hadoop.hbase.regionserver.LogRoller: Failed to schedule flush of d7e912ab428b25dbb388edda934591a0r=null, requester=null
...
and I ended up with hundreds of HLogs. This was with a mixed put/ICV workload.
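
A toy model of the bookkeeping involved (illustrative only; the real map lives in HLog, is keyed by region name, and tracks each region's oldest unflushed edit):

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Toy model: a closed (split) parent left behind in the map keeps the
// minimum seq id stale, so old log files can never be archived and the
// "too many hlogs" flush requests target a region that no longer exists.
public class StaleSeqEntryDemo {
  public static void main(String[] args) {
    Map<String, Long> lastSeqWritten = new ConcurrentSkipListMap<String, Long>();
    lastSeqWritten.put("activeRegion", 500L);
    lastSeqWritten.put("splitParent", 42L);  // closed, but never removed
    long oldest = Long.MAX_VALUE;
    for (long seq : lastSeqWritten.values()) {
      oldest = Math.min(oldest, seq);
    }
    // Every log containing edits newer than 'oldest' must be kept, so the
    // stale entry (42) pins all logs written since then.
    System.out.println("oldest unflushed seq id = " + oldest);
  }
}
{code}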


[jira] Commented: (HBASE-3481) max seq id in flushed file can be larger than its correct value causing data loss during recovery

Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987198#action_12987198 ] 

Kannan Muthukkaruppan commented on HBASE-3481:
----------------------------------------------

Ryan wrote: <<< We take a write lock, no more puts/deletes/whatever can be done to this hregion>>>.

Interestingly, I don't see incrementColumnValue() taking the region's updatesLock, which seems like a problem too (but not the one I am running into -- since in my test I didn't use incrementColumnValue()).

We can file the incrementColumnValue() issue as a separate JIRA (if I am not missing something obvious there).
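
For illustration, a toy model of the exclusion the flush relies on (hypothetical sketch, not the actual patch):

{code}
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Toy model: mutations hold the read lock across the WAL append and the
// memstore update; the flush holds the write lock. A flush can then never
// observe a seq id for an edit that hasn't reached the memstore. A path
// like incrementColumnValue() skipping the read lock breaks exactly this.
public class UpdatesLockDemo {
  private final ReentrantReadWriteLock updatesLock = new ReentrantReadWriteLock();
  private long maxSeqIdInMemstore = 0;

  void mutate(long seqId) {            // put / ICV path
    updatesLock.readLock().lock();
    try {
      // WAL append and memstore update happen together under the lock
      maxSeqIdInMemstore = Math.max(maxSeqIdInMemstore, seqId);
    } finally {
      updatesLock.readLock().unlock();
    }
  }

  long startFlush() {                  // internalFlushCache path
    updatesLock.writeLock().lock();
    try {
      return maxSeqIdInMemstore;       // snapshot sees no half-applied edit
    } finally {
      updatesLock.writeLock().unlock();
    }
  }
}
{code}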

regards,
Kannan 


[jira] Commented: (HBASE-3481) max seq id in flushed file can be larger than its correct value causing data loss during recovery

Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987208#action_12987208 ] 

Kannan Muthukkaruppan commented on HBASE-3481:
----------------------------------------------

The multiPut() code path also doesn't seem to take the updatesLock. I believe this is the bug I am hitting...



[jira] Updated: (HBASE-3481) max seq id in flushed file can be larger than its correct value causing data loss during recovery

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ryan rawson updated HBASE-3481:
-------------------------------

       Resolution: Fixed
    Fix Version/s: 0.92.0
           Status: Resolved  (was: Patch Available)


[jira] Commented: (HBASE-3481) max seq id in flushed file can be larger than its correct value causing data loss during recovery

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986907#action_12986907 ] 

ryan rawson commented on HBASE-3481:
------------------------------------

If we avoid the skip behaviour (which isn't just an optimization), we will
introduce duplicate KVs into the HFiles; for people who depend on the
version count to keep a reasonable, important number of versions, that
will be trouble.

Why not just nab the largest seqid during flush and provide it at the
end? The footer is written out at close() time, and we have time
between the last KV being appended and close() being called to add a
correct SEQ_ID.
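
A rough sketch of that idea (shapes assumed, not the real 0.90 API; in particular it assumes each KV can report the seq id it was assigned at append time, which is exactly the open question earlier in the thread):

{code}
// Sketch only: track the largest seq id actually appended during the flush,
// then stamp it into the file metadata just before close().
long flushSnapshot(SortedSet<KeyValue> snapshot, StoreFile.Writer writer)
    throws IOException {
  long maxSeqId = -1;
  for (KeyValue kv : snapshot) {
    writer.append(kv);
    maxSeqId = Math.max(maxSeqId, kv.getSequenceId()); // assumed accessor
  }
  writer.appendMetadata(maxSeqId, false); // MAX_SEQ_ID written with the file info
  writer.close();
  return maxSeqId;
}
{code}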



[jira] Commented: (HBASE-3481) max seq id in flushed file can be larger than its correct value causing data loss during recovery

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987768#action_12987768 ] 

ryan rawson commented on HBASE-3481:
------------------------------------

That is a good point; I opened HBASE-3486 to do that. We already have
durability unit tests; they should be fixed to fail w/o this patch.



[jira] Commented: (HBASE-3481) max seq id in flushed file can be larger than its correct value causing data loss during recovery

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986899#action_12986899 ] 

ryan rawson commented on HBASE-3481:
------------------------------------

Your analysis sounds correct. The right thing to do here is to provide the MAX_SEQ_ID from the memstore KVs, not from the "current" HLog seqid.

good find!
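
As a hedged illustration of that suggestion (hypothetical class and method names, not the actual HBase code): track the largest sequence id actually applied to the memstore, and stamp the flushed HFile with that value rather than a freshly obtained one.

{code}
import java.util.concurrent.atomic.AtomicLong;

// Sketch only -- a hypothetical tracker, not part of HBase.
class MemStoreSeqTracker {
  private final AtomicLong maxAppliedSeqId = new AtomicLong(-1);

  // Call after an edit has been written to the WAL *and* applied to the
  // memstore, so the tracker never runs ahead of the memstore contents.
  void editApplied(long seqId) {
    long cur;
    while ((cur = maxAppliedSeqId.get()) < seqId) {
      if (maxAppliedSeqId.compareAndSet(cur, seqId)) {
        break;
      }
    }
  }

  // The value to stamp into the flushed HFile's MAX_SEQ_ID: the largest
  // sequence id actually present in the memstore being flushed.
  long maxSeqIdForFlush() {
    return maxAppliedSeqId.get();
  }
}
{code}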


[jira] Commented: (HBASE-3481) max seq id in flushed file can be larger than its correct value causing data loss during recovery

Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987027#action_12987027 ] 

Kannan Muthukkaruppan commented on HBASE-3481:
----------------------------------------------

Ryan: Your observation seems quite correct! So the outstanding puts case I was worried about shouldn't happen because of the updatesLock.

So the mystery thickens -- I'll need to dig further into why these edits got skipped during replay. The edits in question were definitely present in the recovered.edits file and had older sequence numbers than the HFile's MAX_SEQ_ID.
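
For readers following along, a minimal sketch of the lock discipline being referred to (hypothetical types and field names; only the locking pattern is the point): a Put holds the region's updatesLock read lock across both the WAL append and the memstore update, so a flush, which takes the write lock first, should never observe an edit that is in the WAL but not yet in the memstore.

{code}
import java.io.IOException;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch only -- hypothetical types, not the actual HRegion code.
class WritePathSketch {
  private final ReentrantReadWriteLock updatesLock = new ReentrantReadWriteLock();

  void put(Edit edit, Wal wal, MemStore memstore) throws IOException {
    updatesLock.readLock().lock();
    try {
      long seqId = wal.append(edit);  // obtain a seq id and write to the WAL
      memstore.add(edit, seqId);      // apply to the memstore under the same lock
    } finally {
      updatesLock.readLock().unlock();
    }
  }

  interface Wal { long append(Edit e) throws IOException; }
  interface MemStore { void add(Edit e, long seqId); }
  static final class Edit {}
}
{code}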


[jira] Commented: (HBASE-3481) max seq id in flushed file can be larger than its correct value causing data loss during recovery

Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987083#action_12987083 ] 

Kannan Muthukkaruppan commented on HBASE-3481:
----------------------------------------------

Dug some more... the bad MAX_SEQ_ID that the HFile got stamped with corresponds to an edit that looks special.

{code}
<METAROW/METAFAMILY:/1296023308559/Put/vlen=17; >
{code}

Another interesting thing is that in the  .logs, this seq id  appears out of order...

{code}
...
#1081, pos=6403894 kannan1/kannan1,,1296023211936.dfe45b39b1d8c360b8e7f554b576ab17./71170147=[#edits: 18 = <0992bdb67af8d2b2172618496243433c:287686/actions:0/1296023308555/Put/vlen=597; 0992bdb67af8d2b2172618496243433c:287686/actions:1/1296023308555/Put/vlen=416; 0992bdb67af8d2b2172618496243433c:287686/actions:2/1296023308555/Put/vlen=436; 0992bdb67af8d2b2172618496243433c:287686/actions:3/1296023308555/Put/vlen=329; 0992bdb67af8d2b2172618496243433c:287686/actions:4/1296023308555/Put/vlen=256; 0992bdb67af8d2b2172618496243433c:287686/actions:5/1296023308555/Put/vlen=792; 0992bdb67af8d2b2172618496243433c:287686/actions:6/1296023308555/Put/vlen=231; 0992bdb67af8d2b2172618496243433c:287686/actions:7/1296023308555/Put/vlen=790; 0992bdb67af8d2b2172618496243433c:287686/actions:8/1296023308555/Put/vlen=800; 0992bdb67af8d2b2172618496243433c:287686/actions:9/1296023308555/Put/vlen=664; 0992bdb67af8d2b2172618496243433c:287686/actions:10/1296023308555/Put/vlen=618; 0992bdb67af8d2b2172618496243433c:287686/actions:11/1296023308555/Put/vlen=533; 0992bdb67af8d2b2172618496243433c:287686/actions:12/1296023308555/Put/vlen=636; 0992bdb67af8d2b2172618496243433c:287686/actions:13/1296023308555/Put/vlen=554; 0992bdb67af8d2b2172618496243433c:287686/actions:14/1296023308555/Put/vlen=353; 0992bdb67af8d2b2172618496243433c:287686/actions:15/1296023308555/Put/vlen=307; 0992bdb67af8d2b2172618496243433c:287686/actions:16/1296023308555/Put/vlen=410; 0992bdb67af8d2b2172618496243433c:287686/actions:17/1296023308555/Put/vlen=579; >]

#1082, pos=6404082 kannan1/kannan1,bbbbbbbb,1296023211937.0a3f2362b0264b0192c311f10c006d84./71169066=[#edits: 1 = <METAROW/METAFAMILY:/1296023308559/Put/vlen=17; >]

#1083, pos=6406845 kannan1/kannan1,99999999,1296023211936.8ff83822490b1f8175dc16aaf4c170a6./71170148=[#edits: 4 = <a2adc743be4a350b7009af5143acfbf3:287685/actions:0/1296023308555/Put/vlen=552; a2adc743be4a350b7009af5143acfbf3:287685/actions:1/1296023308555/Put/vlen=715; a2adc743be4a350b7009af5143acfbf3:287685/actions:2/1296023308555/Put/vlen=806; a2adc743be4a350b7009af5143acfbf3:287685/actions:3/1296023308555/Put/vlen=296; >]
{code}


The flushed file's MAX_SEQ_ID is recorded as 71169066 (corresponding to entry #1082 above). However, that edit is not a regular Put key, but this special row: <METAROW/METAFAMILY:/1296023308559/Put/vlen=17; >

Also, notice that the sequence ids of the surrounding edits (entries #1081 & #1083) are both higher (namely, 71170147 & 71170148).

It looks like this sequence id is being obtained outside of the updatesLock, but I haven't confirmed that in the code.
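
One plausible reading (an assumption based on the 0.90-era HLog, not verified against this cluster's build): that METAROW/METAFAMILY edit is the cache-flush-complete marker appended by HLog.completeCacheFlush. Its value would be the 17-byte string "HBASE::CACHEFLUSH" -- matching vlen=17 above -- and it would carry the sequence id handed out earlier by startCacheFlush(), which would explain why it appears out of order relative to its neighbors.

{code}
import java.nio.charset.StandardCharsets;

// Sketch supporting the vlen=17 observation; the constant is an assumption
// modeled on the 0.90-era HLog, not a verified excerpt.
class FlushMarkerSketch {
  static final byte[] COMPLETE_CACHE_FLUSH =
      "HBASE::CACHEFLUSH".getBytes(StandardCharsets.UTF_8);

  public static void main(String[] args) {
    // The marker's value length matches the vlen=17 printed in the log dump.
    System.out.println(COMPLETE_CACHE_FLUSH.length);  // prints 17
  }
}
{code}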


[jira] Commented: (HBASE-3481) max seq id in flushed file can be larger than its correct value causing data loss during recovery

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987449#action_12987449 ] 

ryan rawson commented on HBASE-3481:
------------------------------------

I'll commit in the AM.


[jira] Commented: (HBASE-3481) max seq id in flushed file can be larger than its correct value causing data loss during recovery

Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986911#action_12986911 ] 

Kannan Muthukkaruppan commented on HBASE-3481:
----------------------------------------------

Is the last seq id readily available from the memstore KVs, or is it already stashed away somewhere? I agree that would be the cleanest/best fix.

(Happy to accept a patch if you want to post one up. Otherwise, I'll look into this further tomorrow morning.)


[jira] Commented: (HBASE-3481) max seq id in flushed file can be larger than its correct value causing data loss during recovery

Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986900#action_12986900 ] 

Kannan Muthukkaruppan commented on HBASE-3481:
----------------------------------------------

Maybe the quickest fix is to avoid the "skip" optimization during replay of recovered.edits.

I think this should restore correctness.

As for HLog reclamation (i.e. an HLog should only be reclaimed if it contains no data for an active memstore), I don't think it relies on this MAX_SEQ_ID inside store files, but rather on a separate mechanism that tracks the minimum edit contained in each memstore. So that case is probably OK.
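
For reference, the "skip" optimization in question has roughly this shape (a sketch with hypothetical names, not the exact replay loop): edits at or below the store files' recorded MAX_SEQ_ID are assumed to be persisted already and are dropped, which is precisely what loses edits 11..15 when MAX_SEQ_ID was stamped too high.

{code}
import java.util.List;

// Sketch only -- hypothetical types; the real loop lives in HRegion's
// recovered-edits replay.
class ReplaySketch {
  interface WalEntry { long seqId(); }

  long replay(List<WalEntry> recoveredEdits, long maxSeqIdInStores) {
    long skipped = 0;
    for (WalEntry entry : recoveredEdits) {
      if (entry.seqId() <= maxSeqIdInStores) {
        skipped++;  // assumed to already live in a flushed HFile
        continue;   // the optimization that turns a bad MAX_SEQ_ID into data loss
      }
      applyToMemstore(entry);  // otherwise re-apply the edit
    }
    return skipped;
  }

  void applyToMemstore(WalEntry entry) {
    // elided: re-insert the edit into the memstore
  }
}
{code}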


[jira] Commented: (HBASE-3481) max seq id in flushed file can be larger than its correct value causing data loss during recovery

Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987335#action_12987335 ] 

Kannan Muthukkaruppan commented on HBASE-3481:
----------------------------------------------

Unit tests ran well (I ran against 0.89), and so did the kill testing. I think we are good with respect to this patch.


[jira] Commented: (HBASE-3481) max seq id in flushed file can be larger than its correct value causing data loss during recovery

Posted by "Kannan Muthukkaruppan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987241#action_12987241 ] 

Kannan Muthukkaruppan commented on HBASE-3481:
----------------------------------------------

You beat me to the patch, Ryan :) Thanks. The patch looks good. I imported it and am running unit tests, and will also run some kill tests shortly.


[jira] Updated: (HBASE-3481) max seq id in flushed file can be larger than its correct value causing data loss during recovery

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ryan rawson updated HBASE-3481:
-------------------------------

    Attachment: HBASE-3481.txt

Here is a patch adding in those locks.
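
Without having cross-checked the attached diff, a hedged sketch of the shape of such a fix (hypothetical names, not the patch itself): hand out the sequence number and record the corresponding entry under the same lock, so a flush can never draw a number while an append holding a lower, still-unwritten number is in flight.

{code}
import java.util.concurrent.atomic.AtomicLong;

// Illustrative only -- not the attached HBASE-3481.txt diff.
class SeqNumUnderLockSketch {
  private final Object updateLock = new Object();
  private final AtomicLong logSeqNum = new AtomicLong(0);

  long append(Runnable writeEntry) {
    synchronized (updateLock) {
      long seqId = logSeqNum.incrementAndGet(); // draw the number...
      writeEntry.run();                         // ...and record the entry under the same lock
      return seqId;
    }
  }

  long startCacheFlush() {
    synchronized (updateLock) {
      // No append can be holding an unwritten lower number at this point.
      return logSeqNum.incrementAndGet();
    }
  }
}
{code}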


[jira] Updated: (HBASE-3481) max seq id in flushed file can be larger than its correct value causing data loss during recovery

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3481:
-------------------------

         Priority: Blocker  (was: Critical)
    Fix Version/s: 0.90.1

Marking as a blocker because of data loss (and since it seems like Kannan and Ryan figured it out -- smile). Bringing into 0.90.1.
