Posted to reviews@kudu.apache.org by "Todd Lipcon (Code Review)" <ge...@cloudera.org> on 2017/06/08 03:10:16 UTC

[kudu-CR] log block manager: fix corruption after re-opening compacted metadata

Hello Adar Dembo,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/7113

to review the following change.

Change subject: log_block_manager: fix corruption after re-opening compacted metadata
......................................................................

log_block_manager: fix corruption after re-opening compacted metadata

This fixes an issue discovered on a cluster due to the following
sequence of events:

- a block manager compacts a metadata file while starting up
- when it reopens the metadata file after replacing it with the
  compacted one, it gets a file_cache hit. Thus, the WritablePBContainer
  continues to write to the _deleted_ file instead of the compacted one.
  Metadata entries at this point are lost (which could cause block loss
  in the case of lost CREATE records, or dangling blocks in the case of
  lost DELETEs)
- if the server continues to run for a while, the FD will be evicted
  from the cache and eventually re-opened. At that point, a further
  DELETE record could end up writing to an offset past the end of the
  file, since the write offset was incremented by the "lost" records
  above.
- on the next restart, the metadata file would have a "gap" of zero
  bytes, which would surface as a checksum failure and failure to start
  up.

The fix is relatively simple: when we replace the metadata file we need
to evict the cache entry so that when we "reopen", it actually starts
appending to the _new_ file and not the old deleted one.
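
As a purely illustrative aside (not part of the Kudu code): the failure mode
and the fix can be sketched with a tiny standalone program using raw POSIX
calls; the path and record strings below are made up, and the real change of
course goes through Kudu's Env and FileCache rather than open()/rename().

    // Toy illustration: a writer that keeps using a cached fd after the file
    // has been replaced via rename() appends to the deleted inode -- exactly
    // the "lost records" problem described above.
    #include <fcntl.h>
    #include <unistd.h>
    #include <string>

    int main() {
      const std::string path = "/tmp/metadata_demo";
      const std::string tmp = path + ".tmp";

      // "Cached" writable fd for the original metadata file.
      int cached_fd = open(path.c_str(), O_CREAT | O_WRONLY | O_APPEND, 0644);
      write(cached_fd, "CREATE a\n", 9);

      // Compaction: write a compacted copy and rename it over the original.
      int tmp_fd = open(tmp.c_str(), O_CREAT | O_WRONLY | O_TRUNC, 0644);
      write(tmp_fd, "CREATE a\n", 9);
      close(tmp_fd);
      rename(tmp.c_str(), path.c_str());

      // BUG: the cached fd still refers to the old, now-unlinked inode, so
      // this record never reaches the file visible at `path`.
      write(cached_fd, "DELETE a\n", 9);

      // FIX (what evicting the cache entry achieves): drop the stale fd and
      // reopen the path, so further appends land in the new file.
      close(cached_fd);
      cached_fd = open(path.c_str(), O_WRONLY | O_APPEND);
      write(cached_fd, "DELETE a\n", 9);
      close(cached_fd);
      return 0;
    }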

The bulk of the changes here are to tests:
- the stress test now enforces a minimum number of live blocks before it
  starts deleting them. It also more aggressively compacts, and has a
  smaller cache. With these changes, I was sometimes able to reproduce
  the issue.
- A more targeted test issues a canned sequence of block creations and
  deletions that can reliably reproduce the above issue.

Change-Id: I14b2c64685e24d27591258911db4aeb9e8020a4d
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/util/file_cache-test.cc
M src/kudu/util/file_cache.cc
M src/kudu/util/file_cache.h
6 files changed, 150 insertions(+), 59 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/13/7113/1
-- 
To view, visit http://gerrit.cloudera.org:8080/7113
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I14b2c64685e24d27591258911db4aeb9e8020a4d
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>

[kudu-CR] log block manager: fix corruption after re-opening compacted metadata

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7113

to look at the new patch set (#2).

Change subject: log_block_manager: fix corruption after re-opening compacted metadata
......................................................................

log_block_manager: fix corruption after re-opening compacted metadata

This fixes an issue discovered on a cluster due to the following
sequence of events:

- a block manager compacts a metadata file while starting up
- when it reopens the metadata file after replacing it with the
  compacted one, it gets a file_cache hit. Thus, the WritablePBContainer
  continues to write to the _deleted_ file instead of the compacted one.
  Metadata entries at this point are lost (which could cause block loss
  in the case of lost CREATE records, or dangling blocks in the case of
  lost DELETEs)
- if the server continues to run for a while, the FD will be evicted
  from the cache and eventually re-opened. At that point, a further
  DELETE record could end up writing to an offset past the end of the
  file, since the write offset was incremented by the "lost" records
  above.
- on the next restart, the metadata file would have a "gap" of zero
  bytes, which would surface as a checksum failure and failure to start
  up.

The fix is relatively simple: when we replace the metadata file we need
to invalidate and evict the cache entry so that when we "reopen", it
actually starts appending to the _new_ file and not the old deleted one.

The bulk of the changes here are to tests:
- the stress test now enforces a minimum number of live blocks before it
  starts deleting them. It also more aggressively compacts, and has a
  smaller cache. With these changes, I was sometimes able to reproduce
  the issue.
- A more targeted test issues a canned sequence of block creations and
  deletions that can reliably reproduce the above issue.

Change-Id: I14b2c64685e24d27591258911db4aeb9e8020a4d
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/util/file_cache-test.cc
M src/kudu/util/file_cache.cc
M src/kudu/util/file_cache.h
6 files changed, 191 insertions(+), 65 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/13/7113/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7113
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I14b2c64685e24d27591258911db4aeb9e8020a4d
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] log block manager: fix corruption after re-opening compacted metadata

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: log_block_manager: fix corruption after re-opening compacted metadata
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7113/3/src/kudu/util/file_cache.cc
File src/kudu/util/file_cache.cc:

Line 525: void FileCache<FileType>::Invalidate(const string& file_name) {
> What about a concurrent DeleteFile() call? Should DeleteFile() also return 
Plausibly? But really I think all of this should be "YMMV, don't do stuff concurrently to a file". All of our use cases for this are externally synchronized already (i.e. our one use case is done at startup).

How do you feel about just making an admonishment in the function doc that it is not safe to use concurrently with other operations on the same path, and the "invalidation" thing is just a safeguard so that if someone tries to do it they'll get an error? I could even change it to be a CHECK failure to use an invalidated fd if that feels better?

Basically I don't want to grow an extra concurrency feature (and associated tests, docs, etc) that we don't actually need, especially considering this is also the long pole in a release right now.


-- 
To view, visit http://gerrit.cloudera.org:8080/7113
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I14b2c64685e24d27591258911db4aeb9e8020a4d
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] log block manager: fix corruption after re-opening compacted metadata

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: log_block_manager: fix corruption after re-opening compacted metadata
......................................................................


Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/7113/1/src/kudu/fs/log_block_manager-test.cc
File src/kudu/fs/log_block_manager-test.cc:

Line 1208: TEST_F(LogBlockManagerTest, TestDeleteFromContainerAfterMetadataCompaction) {
> warning: redundant 'test_info_' declaration [readability-redundant-declarat
Maybe add the word "Fuzz" in there, since it's sort of a fuzz test (i.e. the random deletions are meant to generate a different set of cache hits/misses every run).


http://gerrit.cloudera.org:8080/#/c/7113/1/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

Line 2330:   file_cache_.Evict(metadata_file_name);
Curious as to whether you thought about implementing the rename through the file cache? For this single-threaded non-concurrent case an explicit Evict() is fine, but if done through the cache it seems like it might be easier to support a concurrent use case (i.e. for realtime metadata compaction).


http://gerrit.cloudera.org:8080/#/c/7113/1/src/kudu/util/file_cache-test.cc
File src/kudu/util/file_cache-test.cc:

Line 227:   // Create a test file, then delete it. It will be deleted immediately.
Comment doesn't really make sense; we're renaming the file, not deleting it.


http://gerrit.cloudera.org:8080/#/c/7113/1/src/kudu/util/file_cache.h
File src/kudu/util/file_cache.h:

PS1, Line 147: It's recommended that the above pattern
             :   // only be used in cases when it can be guaranteed that no other objects
             :   // retain an open descriptor to the same path.
Can we enforce that there are no open readers of this file at the time of eviction? Would the LBM be okay with such semantics?


-- 
To view, visit http://gerrit.cloudera.org:8080/7113
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I14b2c64685e24d27591258911db4aeb9e8020a4d
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes

[kudu-CR] log block manager: fix corruption after re-opening compacted metadata

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7113

to look at the new patch set (#3).

Change subject: log_block_manager: fix corruption after re-opening compacted metadata
......................................................................

log_block_manager: fix corruption after re-opening compacted metadata

This fixes an issue discovered on a cluster due to the following
sequence of events:

- a block manager compacts a metadata file while starting up
- when it reopens the metadata file after replacing it with the
  compacted one, it gets a file_cache hit. Thus, the WritablePBContainer
  continues to write to the _deleted_ file instead of the compacted one.
  Metadata entries at this point are lost (which could cause block loss
  in the case of lost CREATE records, or dangling blocks in the case of
  lost DELETEs)
- if the server continues to run for a while, the FD will be evicted
  from the cache and eventually re-opened. At that point, a further
  DELETE record could end up writing to an offset past the end of the
  file, since the write offset was incremented by the "lost" records
  above.
- on the next restart, the metadata file would have a "gap" of zero
  bytes, which would surface as a checksum failure and failure to start
  up.

The fix is relatively simple: when we replace the metadata file we need
to invalidate and evict the cache entry so that when we "reopen", it
actually starts appending to the _new_ file and not the old deleted one.

The bulk of the changes here are to tests:
- the stress test now enforces a minimum number of live blocks before it
  starts deleting them. It also more aggressively compacts, and has a
  smaller cache. With these changes, I was sometimes able to reproduce
  the issue.
- A more targeted test issues a canned sequence of block creations and
  deletions that can reliably reproduce the above issue.

Change-Id: I14b2c64685e24d27591258911db4aeb9e8020a4d
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/util/file_cache-test.cc
M src/kudu/util/file_cache.cc
M src/kudu/util/file_cache.h
6 files changed, 194 insertions(+), 65 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/13/7113/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7113
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I14b2c64685e24d27591258911db4aeb9e8020a4d
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] log block manager: fix corruption after re-opening compacted metadata

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: log_block_manager: fix corruption after re-opening compacted metadata
......................................................................


Patch Set 3:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/7113/3//COMMIT_MSG
Commit Message:

Line 32: The bulk of the changes here are to tests:
As this is a severe bug, I'd like to better understand why our existing black box testing didn't catch it:

block_manager-stress-test: only restarted once, so if we compacted a metadata file during that restart and corrupted it in the subsequent test run, we never actually reloaded it from disk (via a second restart) to notice. Is that correct?

file_cache-stress-test: only tests FileCache methods, so it doesn't do any file renaming.

Any cluster itest: AFAICT, the test must satisfy all of the following:
- Create blocks (i.e. flush)
- Delete blocks (i.e. compaction or tablet deletion)
- Restart at least twice, once to compact metadata, and a second time to try and read the compacted metadata.
- There needs to be some additional block creation/deletion between the two restarts to corrupt the new metadata file.

Do none of our cluster itests satisfy these criteria?


http://gerrit.cloudera.org:8080/#/c/7113/3/src/kudu/fs/log_block_manager-test.cc
File src/kudu/fs/log_block_manager-test.cc:

PS3, Line 1242: Since we have more metadata files than
              :   // we have file_cache capacity, this will also generate a mix of cache hits,
              :   // misses, and re-insertions.
Without the fix, how often does this test trigger the bug?


http://gerrit.cloudera.org:8080/#/c/7113/1/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

Line 2330:   file_cache_.Invalidate(metadata_file_name);
> I did think about it, but seemed like it would be a bit more complicated to
Rambling thoughts about cache-driven rename below:

The problem is that after Env->RenameFile(), the user has no idea when file operations on an existing descriptor will switch over from the old on-disk file to the new one (it depends on how much time elapses and how much contention there is on the cache).

To which you would say "right, the contract is that you must call Invalidate() after renaming the file". And I would agree, but wouldn't it be better if you were _forced_ to do that by virtue of having to call something like FileCache::RenameFile() instead? Then there'd be no opportunity to shoot yourself in the foot.

All that said, I gave some thought to how to safely handle renames within the file cache, and it is complicated at best. I think you'd have to extend the LRU cache entry key so that you could cache both the pre-rename and post-rename fds at the same time. However, you also have to defer the on-disk rename to when the pre-rename descriptor is closed; if you renamed immediately, cache contention would eventually evict the old descriptor's fd such that the old file is lost. And if you defer the on-disk rename, you're basically destroying any semblance of durability in the "create temp file, write to it, sync it, rename it over real file, sync directory" sequence. You'd have to incorporate the "sync directory" part into the file cache too, which means you lose opportunities to coalesce them (or grow even more complex).

So I'm coming around to agreeing with you that the "right way" is just too complex, especially given the narrow use case.


http://gerrit.cloudera.org:8080/#/c/7113/3/src/kudu/util/file_cache-test.cc
File src/kudu/util/file_cache-test.cc:

Line 242:   // fd.
Nit: odd wrapping?


Line 245:   ASSERT_EQ(size, kData1.size());
Nit: expected should come first (as per L259).


http://gerrit.cloudera.org:8080/#/c/7113/3/src/kudu/util/file_cache.cc
File src/kudu/util/file_cache.cc:

Line 143:   void MarkInvalidated() {
Nit: add a comment explaining what it means to be invalidated.


Line 525: void FileCache<FileType>::Invalidate(const string& file_name) {
There are three interesting things going on here and I want to make sure I understand their impact:
- The retrieved descriptor is marked as invalidated. This means that anybody holding a ref to the descriptor will get IllegalState on any subsequent file operation.
- The descriptor is also erased from the map. If it was only a weak ref, this makes sense (may as well purge it now rather than in the next lookup). And if it was strong... we need to ensure that future calls to OpenExistingFile() create a new (uninvalidated) descriptor. This second part ought to be mentioned in the "If there is any descriptor..." comment.
- Lastly, we erase any cached entry for this filename thereby closing the underlying fd for this file. AFAICT this is so that if OpenExistingFile() is called with the same filename, there's no chance that it reuses the previous fd (and thus reads from the other UNIX file). If this is true, the comment before Erase() seems wrong: Erase() doesn't deal with descriptors but with the underlying fd (or RWFile/RandomAccessFile, or FileType, or whatever you want to call it).

If this is correct, shouldn't Erase() be called with the spinlock held? Otherwise an OpenExistingFile() called by another thread on the same file could slip in and reuse the bad fd, no? On second thought, maybe you wanted Erase() outside the lock because it can lead to an fd close (and thus an fsync). If so, how can we address this race?

If the answer is "you shouldn't call Invalidate() on a file that you're actively working on", perhaps we should provide a stronger contract, such as enforcing that there's no descriptor associated with the file (before Erase() and possibly after too, to catch any abusers).


-- 
To view, visit http://gerrit.cloudera.org:8080/7113
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I14b2c64685e24d27591258911db4aeb9e8020a4d
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] log block manager: fix corruption after re-opening compacted metadata

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has submitted this change and it was merged.

Change subject: log_block_manager: fix corruption after re-opening compacted metadata
......................................................................


log_block_manager: fix corruption after re-opening compacted metadata

This fixes an issue discovered on a cluster due to the following
sequence of events:

- a block manager compacts a metadata file while starting up
- when it reopens the metadata file after replacing it with the
  compacted one, it gets a file_cache hit. Thus, the WritablePBContainer
  continues to write to the _deleted_ file instead of the compacted one.
  Metadata entries at this point are lost (which could cause block loss
  in the case of lost CREATE records, or dangling blocks in the case of
  lost DELETEs)
- if the server continues to run for a while, the FD will be evicted
  from the cache and eventually re-opened. At that point, a further
  DELETE record could end up writing to an offset past the end of the
  file, since the write offset was incremented by the "lost" records
  above.
- on the next restart, the metadata file would have a "gap" of zero
  bytes, which would surface as a checksum failure and failure to start
  up.

The fix is relatively simple: when we replace the metadata file we need
to invalidate and evict the cache entry so that when we "reopen", it
actually starts appending to the _new_ file and not the old deleted one.

The bulk of the changes here are to tests:
- the stress test now enforces a minimum number of live blocks before it
  starts deleting them. It also more aggressively compacts, and has a
  smaller cache. With these changes, I was sometimes able to reproduce
  the issue.
- A more targeted test issues a canned sequence of block creations and
  deletions that can reliably reproduce the above issue.

Change-Id: I14b2c64685e24d27591258911db4aeb9e8020a4d
Reviewed-on: http://gerrit.cloudera.org:8080/7113
Reviewed-by: Adar Dembo <ad...@cloudera.com>
Tested-by: Kudu Jenkins
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/util/file_cache-test.cc
M src/kudu/util/file_cache.cc
M src/kudu/util/file_cache.h
6 files changed, 218 insertions(+), 66 deletions(-)

Approvals:
  Adar Dembo: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/7113
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I14b2c64685e24d27591258911db4aeb9e8020a4d
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] log block manager: fix corruption after re-opening compacted metadata

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: log_block_manager: fix corruption after re-opening compacted metadata
......................................................................


Patch Set 5:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/7113/5/src/kudu/util/file_cache.cc
File src/kudu/util/file_cache.cc:

Line 141:     while (true){
> Nit: while (true) {
Done


Line 173:   std::atomic<uint8_t> flags_ {0};
> This change was motivated by the new calls to invalidated() in  ReopenFileI
Right, the TSAN build failed in the previous precommit.


Line 317:     CHECK(!base_.invalidated());
> Should be before the lookup, no?
I think given the actual order in which things happen, it's better to do the check _after_ we grab the fd. Otherwise we could see "not invalidated", then grab the FD and get the FD that has been re-added for the new file. Seems I messed this up below though.


Line 420: 
> Nit: make empty line use consistent with the other ReopenFileIfNecessary(),
Done


Line 508: Status FileCache<FileType>::DeleteFile(const string& file_name) {
> To ward against concurrent Invalidate() and DeleteFile(), should we CHECK(!
Added the CHECK in FindDescriptorUnlocked, which is called below.


-- 
To view, visit http://gerrit.cloudera.org:8080/7113
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I14b2c64685e24d27591258911db4aeb9e8020a4d
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] log block manager: fix corruption after re-opening compacted metadata

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: log_block_manager: fix corruption after re-opening compacted metadata
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7113/3/src/kudu/util/file_cache.cc
File src/kudu/util/file_cache.cc:

Line 525: void FileCache<FileType>::Invalidate(const string& file_name) {
> Yes, your understanding is correct.
What about a concurrent DeleteFile() call? Should DeleteFile() also return IllegalState if there's an invalidated descriptor for the file in question? Would that be sufficient?

BTW, you can take a look at file_cache-stress-test and see whether there's an interesting way to add some Invalidate() calls there.


-- 
To view, visit http://gerrit.cloudera.org:8080/7113
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I14b2c64685e24d27591258911db4aeb9e8020a4d
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] log block manager: fix corruption after re-opening compacted metadata

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7113

to look at the new patch set (#5).

Change subject: log_block_manager: fix corruption after re-opening compacted metadata
......................................................................

log_block_manager: fix corruption after re-opening compacted metadata

This fixes an issue discovered on a cluster due to the following
sequence of events:

- a block manager compacts a metadata file while starting up
- when it reopens the metadata file after replacing it with the
  compacted one, it gets a file_cache hit. Thus, the WritablePBContainer
  continues to write to the _deleted_ file instead of the compacted one.
  Metadata entries at this point are lost (which could cause block loss
  in the case of lost CREATE records, or dangling blocks in the case of
  lost DELETEs)
- if the server continues to run for a while, the FD will be evicted
  from the cache and eventually re-opened. At that point, a further
  DELETE record could end up writing to an offset past the end of the
  file, since the write offset was incremented by the "lost" records
  above.
- on the next restart, the metadata file would have a "gap" of zero
  bytes, which would surface as a checksum failure and failure to start
  up.

The fix is relatively simple: when we replace the metadata file we need
to invalidate and evict the cache entry so that when we "reopen", it
actually starts appending to the _new_ file and not the old deleted one.

The bulk of the changes here are to tests:
- the stress test now enforces a minimum number of live blocks before it
  starts deleting them. It also more aggressively compacts, and has a
  smaller cache. With these changes, I was sometimes able to reproduce
  the issue.
- A more targeted test issues a canned sequence of block creations and
  deletions that can reliably reproduce the above issue.

Change-Id: I14b2c64685e24d27591258911db4aeb9e8020a4d
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/util/file_cache-test.cc
M src/kudu/util/file_cache.cc
M src/kudu/util/file_cache.h
6 files changed, 218 insertions(+), 66 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/13/7113/5
-- 
To view, visit http://gerrit.cloudera.org:8080/7113
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I14b2c64685e24d27591258911db4aeb9e8020a4d
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] log block manager: fix corruption after re-opening compacted metadata

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: log_block_manager: fix corruption after re-opening compacted metadata
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7113/3/src/kudu/util/file_cache.cc
File src/kudu/util/file_cache.cc:

Line 525: void FileCache<FileType>::Invalidate(const string& file_name) {
> Plausibly? But really I think all of this should be "YMMV, don't do stuff c
Agreed that safe and concurrent handling isn't necessary, so doc changes + CHECK failures sound totally reasonable to me; that's what I was getting at when I suggested providing a "stronger contract".


-- 
To view, visit http://gerrit.cloudera.org:8080/7113
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I14b2c64685e24d27591258911db4aeb9e8020a4d
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] log block manager: fix corruption after re-opening compacted metadata

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: log_block_manager: fix corruption after re-opening compacted metadata
......................................................................


Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/7113/3//COMMIT_MSG
Commit Message:

Line 32: The bulk of the changes here are to tests:
> As this is a severe bug, I'd like to better understand why our existing bla
block_manager-stress-test missed it for a couple reasons from what I can tell:
- it only restarted once, as you mentioned
- it set the fd cache to 512, which was enough to cause some churn in the FBM, but not much in the LBM. For this bug to trigger, you need "just the right amount" of churn -- the cache has to be big enough that you will reuse a cached FD when re-appending to the metadata after restart, but small enough that you will then evict and re-open the same FD again before the next restart
- block deletion is faster than block creation, so the deleter threads were "keeping up with" the writer threads, and on restart we always had only a very small number of live containers. Most were 100% dead and therefore rarely got compacted, but rather just removed.

Even with the changes I made, it seemed to reproduce with some runtimes (e.g. 4 seconds) but not with slightly longer or shorter ones. It was hard to get it to reproduce reliably, so I went and added the more specific test.

As for other tests (non-specific to block manager), the typical amounts of data are small enough that the FD cache doesn't really churn. Given you need the specific interleaving described above, we didn't have any likelihood of hitting it in our other various tests which do multiple restarts.

As for tests outside of the automated ctests, we did actually hit this on our automated long-lived YCSB runs that we run internally at Cloudera. We didn't hit it sooner due to some test infra subtleties where it wasn't properly tracking master for the last week or two. When it rains, it pours, though, and it seems that at least 3 of our test clusters have now hit it, so we do at least have some ability to feel confident in our tests.


http://gerrit.cloudera.org:8080/#/c/7113/3/src/kudu/fs/log_block_manager-test.cc
File src/kudu/fs/log_block_manager-test.cc:

PS3, Line 1242: Since we have more metadata files than
              :   // we have file_cache capacity, this will also generate a mix of cache hits,
              :   // misses, and re-insertions.
> Without the fix, how often does this test trigger the bug?
pretty much every time (at least it triggered it every time I ran it)


http://gerrit.cloudera.org:8080/#/c/7113/3/src/kudu/util/file_cache-test.cc
File src/kudu/util/file_cache-test.cc:

Line 242:   // fd.
> Nit: odd wrapping?
Done


Line 245:   ASSERT_EQ(size, kData1.size());
> Nit: expected should come first (as per L259).
Done


http://gerrit.cloudera.org:8080/#/c/7113/3/src/kudu/util/file_cache.cc
File src/kudu/util/file_cache.cc:

Line 143:   void MarkInvalidated() {
> Nit: add a comment explaining what it means to be invalidated.
Done


Line 525: void FileCache<FileType>::Invalidate(const string& file_name) {
> There are three interesting things going on here and I want to make sure I 
Yes, your understanding is correct.

I think to handle the race we can do the following order:

1. under the lock: mark any existing descriptor as invalidated, or insert an invalidated descriptor if there is nothing in the map
2. erase the entry from the cache
3. under the lock again: erase the descriptor from the map

Thinking about a concurrent OpenExistingFile() call at each of the points:
before 1: continues to open the old file (i.e. serialized before Invalidate())
before 2: fetches the invalidated descriptor and gets an 'invalidated' error on first access (i.e. serialized before Invalidate())
before 3: continues to get the invalidated descriptor (same as above)
after 3: creates a new Descriptor and gets a cache miss, thus opening the _new_ path (serialized after Invalidate())

For concurrent access to _existing_ descriptors:
before 1: nothing notable (serialized before)
any time after 1: gets an error because it's invalidated

If we expected this to be used concurrently I'd write up a stress test for it, but the expected use case is externally synchronized, so I don't think it's worth the effort.
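
To make the ordering above concrete, here is a rough sketch of how those three steps might look inside FileCache<FileType>::Invalidate(); the member names (lock_, descriptors_, cache_) and the helper calls marked "assumed" are illustrative assumptions, not the actual file_cache.cc internals.

    template <class FileType>
    void FileCache<FileType>::Invalidate(const string& file_name) {
      {
        // 1. Under the lock: mark any existing descriptor invalidated, or
        //    insert an already-invalidated placeholder if none exists.
        std::lock_guard<simple_spinlock> l(lock_);
        auto desc = FindDescriptorUnlocked(file_name);    // assumed helper
        if (!desc) {
          desc = std::make_shared<Descriptor>(this, file_name);
          InsertDescriptorUnlocked(file_name, desc);      // assumed helper
        }
        desc->MarkInvalidated();
      }
      // 2. Outside the lock: purge the cached entry so no later open can reuse
      //    the old fd (this may close the underlying file).
      cache_->Erase(file_name);
      {
        // 3. Under the lock again: drop the descriptor so the next
        //    OpenExistingFile() builds a fresh, uninvalidated one.
        std::lock_guard<simple_spinlock> l(lock_);
        descriptors_.erase(file_name);
      }
    }

With that ordering, a racing OpenExistingFile() either sees the old (now invalidated) descriptor and fails, or misses entirely and opens the new file, which matches the case analysis above.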


-- 
To view, visit http://gerrit.cloudera.org:8080/7113
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I14b2c64685e24d27591258911db4aeb9e8020a4d
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] log block manager: fix corruption after re-opening compacted metadata

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7113

to look at the new patch set (#4).

Change subject: log_block_manager: fix corruption after re-opening compacted metadata
......................................................................

log_block_manager: fix corruption after re-opening compacted metadata

This fixes an issue discovered on a cluster due to the following
sequence of events:

- a block manager compacts a metadata file while starting up
- when it reopens the metadata file after replacing it with the
  compacted one, it gets a file_cache hit. Thus, the WritablePBContainer
  continues to write to the _deleted_ file instead of the compacted one.
  Metadata entries at this point are lost (which could cause block loss
  in the case of lost CREATE records, or dangling blocks in the case of
  lost DELETEs)
- if the server continues to run for a while, the FD will be evicted
  from the cache and eventually re-opened. At that point, a further
  DELETE record could end up writing to an offset past the end of the
  file, since the write offset was incremented by the "lost" records
  above.
- on the next restart, the metadata file would have a "gap" of zero
  bytes, which would surface as a checksum failure and failure to start
  up.

The fix is relatively simple: when we replace the metadata file we need
to invalidate and evict the cache entry so that when we "reopen", it
actually starts appending to the _new_ file and not the old deleted one.

The bulk of the changes here are to tests:
- the stress test now enforces a minimum number of live blocks before it
  starts deleting them. It also more aggressively compacts, and has a
  smaller cache. With these changes, I was sometimes able to reproduce
  the issue.
- A more targeted test issues a canned sequence of block creations and
  deletions that can reliably reproduce the above issue.

Change-Id: I14b2c64685e24d27591258911db4aeb9e8020a4d
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/util/file_cache-test.cc
M src/kudu/util/file_cache.cc
M src/kudu/util/file_cache.h
6 files changed, 215 insertions(+), 65 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/13/7113/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7113
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I14b2c64685e24d27591258911db4aeb9e8020a4d
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] log block manager: fix corruption after re-opening compacted metadata

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: log_block_manager: fix corruption after re-opening compacted metadata
......................................................................


Patch Set 6: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/7113
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I14b2c64685e24d27591258911db4aeb9e8020a4d
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] log block manager: fix corruption after re-opening compacted metadata

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: log_block_manager: fix corruption after re-opening compacted metadata
......................................................................


Patch Set 5:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/7113/5/src/kudu/util/file_cache.cc
File src/kudu/util/file_cache.cc:

Line 141:     while (true){
Nit: while (true) {

In MarkInvalidated() too.


Line 173:   std::atomic<uint8_t> flags_ {0};
This change was motivated by the new calls to invalidated() in  ReopenFileIfNecessary(), right? Previously we could guarantee that the cache spinlock was always held during calls to deleted(), but that's not the case anymore.


Line 317:     CHECK(!base_.invalidated());
Should be before the lookup, no?


Line 420: 
Nit: make empty line use consistent with the other ReopenFileIfNecessary(), since the functions are almost the same otherwise.


Line 508: Status FileCache<FileType>::DeleteFile(const string& file_name) {
To ward against concurrent Invalidate() and DeleteFile(), should we CHECK(!invalidated()) here too?

It's not an issue in OpenExistingFile() because we'll notice the use of an invalidated descriptor during a subsequent file operation, but here there's no such thing.


-- 
To view, visit http://gerrit.cloudera.org:8080/7113
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I14b2c64685e24d27591258911db4aeb9e8020a4d
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] log block manager: fix corruption after re-opening compacted metadata

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7113

to look at the new patch set (#6).

Change subject: log_block_manager: fix corruption after re-opening compacted metadata
......................................................................

log_block_manager: fix corruption after re-opening compacted metadata

This fixes an issue discovered on a cluster due to the following
sequence of events:

- a block manager compacts a metadata file while starting up
- when it reopens the metadata file after replacing it with the
  compacted one, it gets a file_cache hit. Thus, the WritablePBContainer
  continues to write to the _deleted_ file instead of the compacted one.
  Metadata entries at this point are lost (which could cause block loss
  in the case of lost CREATE records, or dangling blocks in the case of
  lost DELETEs)
- if the server continues to run for a while, the FD will be evicted
  from the cache and eventually re-opened. At that point, a further
  DELETE record could end up writing to an offset past the end of the
  file, since the write offset was incremented by the "lost" records
  above.
- on the next restart, the metadata file would have a "gap" of zero
  bytes, which would surface as a checksum failure and failure to start
  up.

The fix is relatively simple: when we replace the metadata file we need
to invalidate and evict the cache entry so that when we "reopen", it
actually starts appending to the _new_ file and not the old deleted one.

The bulk of the changes here are to tests:
- the stress test now enforces a minimum number of live blocks before it
  starts deleting them. It also more aggressively compacts, and has a
  smaller cache. With these changes, I was sometimes able to reproduce
  the issue.
- A more targeted test issues a canned sequence of block creations and
  deletions that can reliably reproduce the above issue.

Change-Id: I14b2c64685e24d27591258911db4aeb9e8020a4d
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/util/file_cache-test.cc
M src/kudu/util/file_cache.cc
M src/kudu/util/file_cache.h
6 files changed, 218 insertions(+), 66 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/13/7113/6
-- 
To view, visit http://gerrit.cloudera.org:8080/7113
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I14b2c64685e24d27591258911db4aeb9e8020a4d
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] log block manager: fix corruption after re-opening compacted metadata

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: log_block_manager: fix corruption after re-opening compacted metadata
......................................................................


Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/7113/1/src/kudu/fs/log_block_manager-test.cc
File src/kudu/fs/log_block_manager-test.cc:

Line 1208: TEST_F(LogBlockManagerTest, TestDeleteFromContainerAfterMetadataCompaction) {
> Maybe add the word "Fuzz" in there, since it's sort of a fuzz test (i.e. th
hrm, I don't think of this really as a fuzz test because it's not generating some complicated sequence or bit pattern. You're right it's not 100% deterministic, but neither are most of our tests and we don't label them all fuzz.


http://gerrit.cloudera.org:8080/#/c/7113/1/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

Line 2330:   file_cache_.Evict(metadata_file_name);
> Curious as to whether you thought about implementing the rename through the
I did think about it, but it seemed like it would be a bit more complicated to support error handling cases, etc.

I changed it around a bit to be "safer", though. Now, the method is Invalidate(), and any existing open descriptors are marked such that if you try to continue using them post-invalidation you get an error rather than transparently re-opening the new file.


http://gerrit.cloudera.org:8080/#/c/7113/1/src/kudu/util/file_cache-test.cc
File src/kudu/util/file_cache-test.cc:

Line 227:   // Create a test file, then delete it. It will be deleted immediately.
> Comment doesn't really make sense; we're renaming the file, not deleting it
Copy pasta. done.


http://gerrit.cloudera.org:8080/#/c/7113/1/src/kudu/util/file_cache.h
File src/kudu/util/file_cache.h:

PS1, Line 147: It's recommended that the above pattern
             :   // only be used in cases when it can be guaranteed that no other objects
             :   // retain an open descriptor to the same path.
> Can we enforce that there are no open readers of this file at the time of e
I changed it around to mark the existing descriptor as 'invalidated' and then ensure that any attempt to use an invalidated descriptor gets an illegal-state error.
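
As a rough illustration of that behavior (the exact member names such as filename_ are assumptions, not the actual file_cache.cc code), each descriptor file operation would start with something like:

    // Fail fast if Invalidate() has been called on this path since the
    // descriptor was handed out.
    if (base_.invalidated()) {
      return Status::IllegalState("file was invalidated", filename_);
    }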


-- 
To view, visit http://gerrit.cloudera.org:8080/7113
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I14b2c64685e24d27591258911db4aeb9e8020a4d
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes