You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Adar Dembo (Code Review)" <ge...@cloudera.org> on 2017/05/09 06:55:29 UTC

[kudu-CR] KUDU-1549: compact LBM container metadata files at startup

Hello David Ribeiro Alves, Todd Lipcon,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/6826

to review the following change.

Change subject: KUDU-1549: compact LBM container metadata files at startup
......................................................................

KUDU-1549: compact LBM container metadata files at startup

Here's another patch to speed up LBM startup by reducing the amount of
metadata processed from disk. The idea is to find metadata files with lots
of "dead" blocks (i.e. CREATE + DELETE pairs) and to "compact" them by
rewriting the files without the dead blocks' records. The set of containers
to compact is determined by the ratio of live blocks to total blocks, and
there's a new gflag to configure that.

To make this work, I had to adjust the accounting in BlockCreated() yet
again. The new approach preserves next_block_offset_, but uses
fs_aligned_length() directly for byte-related stats. Without this change,
the aligned container stats (total_bytes and live_bytes_aligned) are
completely out of whack in a container whose metadata was compacted.

Like the dead container deletion patch, this one is quick and dirty in that
the new logic is unsuitable for real-time compaction. Also like that patch,
compaction in real-time would require non-trivial synchronization changes.

Testing is done in several places:
- BlockManagerStressTest: a "real" workload that'll compact some containers
  with additional inconsistencies injected by the LBMCorruptor.
- BlockManagerTest.TestMetadataOkayDespiteFailedWrites: filesystem errors
  are injected into metadata compaction.
- LogBlockManagerTest.TestCompactFullContainerMetadataAtStartup: explicit
  test for metadata compaction (and for the stat accounting fix).

Change-Id: I981f7d9e7eb96fb40cef30ad96c5960b72d07756
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/util/env_posix.cc
M src/kudu/util/status.h
7 files changed, 282 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/26/6826/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6826
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I981f7d9e7eb96fb40cef30ad96c5960b72d07756
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-1549: compact LBM container metadata files at startup

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: KUDU-1549: compact LBM container metadata files at startup
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6826/2/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

PS2, Line 87: DEFINE_double(log_container_live_metadata_before_compact_ratio, 0.10,
            :               "Desired ratio of live block metadata in log containers. If a "
            :               "container's live to total block ratio dips below this value, "
            :               "the container's metadata file will be compacted at startup.");
> hmm, I guess i was thinking experimental because who knows if a ratio is ev
I can't say I have a strong opinion on the tagging given how often I hem and haw about which tags to use, so I don't mind making it experimental if you think that's better.

I'll make the change to 0.5. Note that it's not a strict linear improvement due to the fixed cost associated with replacing a metadata file (e.g. an fsync() on the new file). That cost becomes higher and higher the fewer and fewer live blocks are in the file.

Offhand I don't know whether metadata processing is O(size) or O(record count). On spinning disks I would expect the cost of IO to outweigh the cost of deserialization and processing. That would mean O(size) is more significant than O(record count). Do you have any suggestions for how to figure this out?

If it helps, after reading through the protobuf encoding guide (https://developers.google.com/protocol-buffers/docs/encoding), I think CREATE records are 17-44 bytes and DELETE records are 13-22 bytes. The variety in size is a function of the timestamp (both), and the offset/length (CREATE).


http://gerrit.cloudera.org:8080/#/c/6826/2/src/kudu/fs/log_block_manager.h
File src/kudu/fs/log_block_manager.h:

PS2, Line 295:   // 3. Containers in 'low_live_block_containers' will have their metadata
             :   //    files compacted.
             :  
> hmm, trying to follow this suggestion. So, you're saying that since we alre
Correct: the LBM maintains a global map of LogBlocks, each of which is a live block. Apart from the timestamps, each LogBlock has enough information to be converted back into a CREATE BlockRecordPB. If I were to implement metadata file compaction in real time, I'd leverage this rather than rereading the records from disk.

Yes, the contents of vector<BlockRecordPB> is just live blocks, not all blocks.


-- 
To view, visit http://gerrit.cloudera.org:8080/6826
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I981f7d9e7eb96fb40cef30ad96c5960b72d07756
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1549: compact LBM container metadata files at startup

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has submitted this change and it was merged.

Change subject: KUDU-1549: compact LBM container metadata files at startup
......................................................................


KUDU-1549: compact LBM container metadata files at startup

Here's another patch to speed up LBM startup by reducing the amount of
metadata processed from disk. The idea is to find metadata files with lots
of "dead" blocks (i.e. CREATE + DELETE pairs) and to "compact" them by
rewriting the files without the dead blocks' records. The set of containers
to compact is determined by the ratio of live blocks to total blocks, and
there's a new gflag to configure that.

To make this work, I had to adjust the accounting in BlockCreated() yet
again. The new approach preserves next_block_offset_, but uses
fs_aligned_length() directly for byte-related stats. Without this change,
the aligned container stats (total_bytes and live_bytes_aligned) are
completely out of whack in a container whose metadata was compacted.

Like the dead container deletion patch, this one is quick and dirty in that
the new logic is unsuitable for real-time compaction. Also like that patch,
compaction in real-time would require non-trivial synchronization changes.

Testing is done in several places:
- BlockManagerStressTest: a "real" workload that'll compact some containers
  with additional inconsistencies injected by the LBMCorruptor.
- BlockManagerTest.TestMetadataOkayDespiteFailedWrites: filesystem errors
  are injected into metadata compaction.
- LogBlockManagerTest.TestCompactFullContainerMetadataAtStartup: explicit
  test for metadata compaction (and for the stat accounting fix).

Change-Id: I981f7d9e7eb96fb40cef30ad96c5960b72d07756
Reviewed-on: http://gerrit.cloudera.org:8080/6826
Tested-by: Kudu Jenkins
Reviewed-by: Todd Lipcon <to...@apache.org>
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/fs_report.cc
M src/kudu/fs/fs_report.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/util/env_posix.cc
8 files changed, 331 insertions(+), 44 deletions(-)

Approvals:
  Todd Lipcon: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/6826
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I981f7d9e7eb96fb40cef30ad96c5960b72d07756
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-1549: compact LBM container metadata files at startup

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-1549: compact LBM container metadata files at startup
......................................................................


Patch Set 5: Code-Review+2

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6826/2/src/kudu/fs/fs_report.h
File src/kudu/fs/fs_report.h:

Line 127:     // Note: the BlockRecordPB is passed by pointer so that it can be swapped
> I'm pining away for protobuf move constructors.
yea, we should add -DPROTO_EXPERIMENTAL_ENABLE_MOVE to our protobuf build now that we are on protobuf 3.3. I'll do that in a separate patch while I'm thinking of it.


http://gerrit.cloudera.org:8080/#/c/6826/2/src/kudu/fs/log_block_manager.h
File src/kudu/fs/log_block_manager.h:

PS2, Line 295:   //    files compacted.
             :   //
             :  
> Correct: the LBM maintains a global map of LogBlocks, each of which is a li
ok, not too worried then, will cross fingers and we can address if it becomes problematic for some reason


-- 
To view, visit http://gerrit.cloudera.org:8080/6826
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I981f7d9e7eb96fb40cef30ad96c5960b72d07756
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1549: compact LBM container metadata files at startup

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6826

to look at the new patch set (#2).

Change subject: KUDU-1549: compact LBM container metadata files at startup
......................................................................

KUDU-1549: compact LBM container metadata files at startup

Here's another patch to speed up LBM startup by reducing the amount of
metadata processed from disk. The idea is to find metadata files with lots
of "dead" blocks (i.e. CREATE + DELETE pairs) and to "compact" them by
rewriting the files without the dead blocks' records. The set of containers
to compact is determined by the ratio of live blocks to total blocks, and
there's a new gflag to configure that.

To make this work, I had to adjust the accounting in BlockCreated() yet
again. The new approach preserves next_block_offset_, but uses
fs_aligned_length() directly for byte-related stats. Without this change,
the aligned container stats (total_bytes and live_bytes_aligned) are
completely out of whack in a container whose metadata was compacted.

Like the dead container deletion patch, this one is quick and dirty in that
the new logic is unsuitable for real-time compaction. Also like that patch,
compaction in real-time would require non-trivial synchronization changes.

Testing is done in several places:
- BlockManagerStressTest: a "real" workload that'll compact some containers
  with additional inconsistencies injected by the LBMCorruptor.
- BlockManagerTest.TestMetadataOkayDespiteFailedWrites: filesystem errors
  are injected into metadata compaction.
- LogBlockManagerTest.TestCompactFullContainerMetadataAtStartup: explicit
  test for metadata compaction (and for the stat accounting fix).

Change-Id: I981f7d9e7eb96fb40cef30ad96c5960b72d07756
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/fs_report.cc
M src/kudu/fs/fs_report.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/util/env_posix.cc
M src/kudu/util/status.h
9 files changed, 302 insertions(+), 54 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/26/6826/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6826
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I981f7d9e7eb96fb40cef30ad96c5960b72d07756
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-1549: compact LBM container metadata files at startup

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: KUDU-1549: compact LBM container metadata files at startup
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6826/2/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

PS2, Line 87: DEFINE_double(log_container_live_metadata_before_compact_ratio, 0.10,
            :               "Desired ratio of live block metadata in log containers. If a "
            :               "container's live to total block ratio dips below this value, "
            :               "the container's metadata file will be compacted at startup.");
> looking at a data/ dir on a cluster that has been around for quite some tim
Thanks for the research.

If we're indeed as seek-dominated as you observed, then the only major improvements would come by reducing the number of files we read at startup. Here are some more thoughts, incorporating some of the ideas you suggested in our discussion earlier today:
1. Raise the container max data file size. Won't help on older versions of el6 with ext4, but will help everywhere else. It makes sense for the max data file size to be a function of the disk size anyway. And it's a pretty cheap way to extract more scalability.
2. Reuse container data file holes, explicitly to avoid creating so many containers. As a side effect, metadata file compaction now becomes more important (and costly).
3. Eschew one metadata file per data file altogether and maintain just one metadata file. Deleting "dead" containers would no longer be an improvement for metadata startup cost. Metadata compaction would be a lot more expensive. Block records themselves would be larger, because each record now needs to point to a particular data file, though this can be mitigated in various ways. A variant of this would be to do away with the 1-1 relationship between metadata and data files and make it more like m-n.

Interesting observation re: defragmentation. I think that makes a compelling case for preallocating the metadata files. That'd reduce seeks at startup time, though it'd also reduce the efficacy of metadata compaction. No need to be as aggressive as for the data files (where we default to preallocating 32 MB); maybe 1 MB would be enough. With a 4096 block size on ext4 and el6, we use a 1353 block limit, which works out to a little over ~100K as an upper bound for a fully dead container. So maybe preallocating 128K would be sufficient. Though it's unclear exactly when we'd truncate away the preallocated excess; the only time that makes perfect sense is when the container is dead, at which point we'll remove it anyway. Maybe something heuristic-based using  the timestamp of the last record in the container?

I think there's consensus that this patch is useful in its current form, so let's continue this discussion in a different venue so that I might merge this. I'm also happy to file JIRAs (or add to KUDU-1549) for whichever of the above you think sound most promising.


-- 
To view, visit http://gerrit.cloudera.org:8080/6826
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I981f7d9e7eb96fb40cef30ad96c5960b72d07756
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1549: compact LBM container metadata files at startup

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6826

to look at the new patch set (#4).

Change subject: KUDU-1549: compact LBM container metadata files at startup
......................................................................

KUDU-1549: compact LBM container metadata files at startup

Here's another patch to speed up LBM startup by reducing the amount of
metadata processed from disk. The idea is to find metadata files with lots
of "dead" blocks (i.e. CREATE + DELETE pairs) and to "compact" them by
rewriting the files without the dead blocks' records. The set of containers
to compact is determined by the ratio of live blocks to total blocks, and
there's a new gflag to configure that.

To make this work, I had to adjust the accounting in BlockCreated() yet
again. The new approach preserves next_block_offset_, but uses
fs_aligned_length() directly for byte-related stats. Without this change,
the aligned container stats (total_bytes and live_bytes_aligned) are
completely out of whack in a container whose metadata was compacted.

Like the dead container deletion patch, this one is quick and dirty in that
the new logic is unsuitable for real-time compaction. Also like that patch,
compaction in real-time would require non-trivial synchronization changes.

Testing is done in several places:
- BlockManagerStressTest: a "real" workload that'll compact some containers
  with additional inconsistencies injected by the LBMCorruptor.
- BlockManagerTest.TestMetadataOkayDespiteFailedWrites: filesystem errors
  are injected into metadata compaction.
- LogBlockManagerTest.TestCompactFullContainerMetadataAtStartup: explicit
  test for metadata compaction (and for the stat accounting fix).

Change-Id: I981f7d9e7eb96fb40cef30ad96c5960b72d07756
---
M src/kudu/fs/block_manager-stress-test.cc
M src/kudu/fs/block_manager-test.cc
M src/kudu/fs/fs_report.cc
M src/kudu/fs/fs_report.h
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
M src/kudu/fs/log_block_manager.h
M src/kudu/util/env_posix.cc
8 files changed, 331 insertions(+), 44 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/26/6826/4
-- 
To view, visit http://gerrit.cloudera.org:8080/6826
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I981f7d9e7eb96fb40cef30ad96c5960b72d07756
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-1549: compact LBM container metadata files at startup

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-1549: compact LBM container metadata files at startup
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6826/2/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

PS2, Line 87: DEFINE_double(log_container_live_metadata_before_compact_ratio, 0.10,
            :               "Desired ratio of live block metadata in log containers. If a "
            :               "container's live to total block ratio dips below this value, "
            :               "the container's metadata file will be compacted at startup.");
> Thanks for the research.
Yea, I think it would be good to capture the ideas above in a new JIRA. Definitely don't want to derail this work.


-- 
To view, visit http://gerrit.cloudera.org:8080/6826
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I981f7d9e7eb96fb40cef30ad96c5960b72d07756
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1549: compact LBM container metadata files at startup

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: KUDU-1549: compact LBM container metadata files at startup
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6826/2//COMMIT_MSG
Commit Message:

Line 11: of "dead" blocks (i.e. CREATE + DELETE pairs) and to "compact" them by
> "garbage collect" here would be more appropriate relative to our other uses
We discussed this on Slack and agreed that since neither "garbage collection" nor "compaction" were 100% precise, we should just keep it as-is.


PS2, Line 22: Like the dead container deletion patch, this one is quick and dirty in that
            : the new logic is unsuitable for real-time compaction. Also like that patch,
            : compaction in real-time would require non-trivial synchronization changes.
> how much space/optimization are we leaving on the table if we don't restart
The only impact is to startup times: the more dead LBM container metadata we can remove, the faster the LBM can start. If the server is never restarted, we'll definitely have more LBM container metadata on disk, but it isn't used until startup, at which point the compaction will kick in and, for qualifying containers, remove 90% of the records, which will speed up subsequent startup.


-- 
To view, visit http://gerrit.cloudera.org:8080/6826
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I981f7d9e7eb96fb40cef30ad96c5960b72d07756
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1549: compact LBM container metadata files at startup

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-1549: compact LBM container metadata files at startup
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6826/2/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

PS2, Line 87: DEFINE_double(log_container_live_metadata_before_compact_ratio, 0.10,
            :               "Desired ratio of live block metadata in log containers. If a "
            :               "container's live to total block ratio dips below this value, "
            :               "the container's metadata file will be compacted at startup.");
> I'll tag it advanced. I don't think experimental is appropriate given that 
hmm, I guess i was thinking experimental because who knows if a ratio is even the right metric? It's pretty internal-facing, so seems like the kind of thing we may never want to allow a user to tweak? (is it reasonable to have something be experimental forever?)

As for a better value, I was thinking something like 0.5 -- it's worth compacting if we think we can reduce startup time by at least 50%, perhaps?

Do we know how to map a metadata file size/block-count to a time, approximately? ie is the timing O(size) or O(record count)?


http://gerrit.cloudera.org:8080/#/c/6826/2/src/kudu/fs/log_block_manager.h
File src/kudu/fs/log_block_manager.h:

PS2, Line 295:   // 3. Containers in 'low_live_block_containers' will have their metadata
             :   //    files compacted.
             :  
> I went back and forth on this. I do like the idea of not making any on-disk
hmm, trying to follow this suggestion. So, you're saying that since we already have the list of live blocks for each container, we could just dump that out rather than rereading the original block *records*?

Also maybe I misread the original code. Is the vector<BlockRecordPB> here just the remaining _live_ blocks? If so, maybe it's not so bad, since it's bounded at 2x the normal "steady state" usage of the LBM, and we're always assuming that is a small fraction of overall TS heap. In the first read through I was thinking this woudl be _all_ records, not just te live ones.


-- 
To view, visit http://gerrit.cloudera.org:8080/6826
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I981f7d9e7eb96fb40cef30ad96c5960b72d07756
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1549: compact LBM container metadata files at startup

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-1549: compact LBM container metadata files at startup
......................................................................


Patch Set 2:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/6826/2/src/kudu/fs/fs_report.h
File src/kudu/fs/fs_report.h:

Line 127:     Entry(std::string c, BlockRecordPB* r);
this is a little odd -- if it takes a pointer I expect it to store the pointer, but the member is still a value. Add a comment? This is just a hack because protobuf doesn't support move constructor?


http://gerrit.cloudera.org:8080/#/c/6826/2/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

PS2, Line 87: DEFINE_double(log_container_live_metadata_before_compact_ratio, 0.10,
            :               "Desired ratio of live block metadata in log containers. If a "
            :               "container's live to total block ratio dips below this value, "
            :               "the container's metadata file will be compacted at startup.");
should probably be tagged advanced or even experimental?

0.1 seems like a pretty low value. Isn't the cost relatively low of writing out a new file, considering they are small?


Line 417:       BlockRecordPB* record,
comment on the pointer semantics. it's not clear what's being "saved" here


PS2, Line 1964: offsets also match
how could the offsets of two live blocks match?


PS2, Line 1969: .swap(records);
= std::move(records)? or use .emplace?


Line 2258:   for (const auto& e : low_live_block_containers) {
can you extract the body of this if to a new function, perhaps? or too many local variable references?


Line 2293:       continue;
if this were extracted to a new function, this logic would be a bit easier to follow (RETURN_NOT_OK is usable etc)


http://gerrit.cloudera.org:8080/#/c/6826/2/src/kudu/fs/log_block_manager.h
File src/kudu/fs/log_block_manager.h:

PS2, Line 295:   // 3. Containers in 'low_live_block_containers' will have their metadata
             :   //    files compacted.
             :  
In a typical startup doesn't this mean a fairly large amount of memory usage, vs doing it incrementally after loading each container? I see the appeal of doing all mutative work in one pass at the end, but maybe it's not worth it here, considering these aren't optional "repairs" of unexpected state, but rather just a normal thing we expect would happen on basically every startup?


http://gerrit.cloudera.org:8080/#/c/6826/2/src/kudu/util/status.h
File src/kudu/util/status.h:

Line 60: #define KUDU_WARN_NOT_OK_AND_CONTINUE(to_call, warning_prefix) {       \
The naming here is a little confusing, considering 'continue' might be interpreted as 'warn and keep going' rather than 'warn and use the continue keyword to go to the top of the loop'. maybe CONTINUE_LOOP to clarify? or perhaps extracting a functio where this is used could make it so we can use the _AND_RETURN variant instead


-- 
To view, visit http://gerrit.cloudera.org:8080/6826
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I981f7d9e7eb96fb40cef30ad96c5960b72d07756
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1549: compact LBM container metadata files at startup

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: KUDU-1549: compact LBM container metadata files at startup
......................................................................


Patch Set 2:

(2 comments)

just took a cursory first look, trying to understand the intent. will come back for a deeper look

http://gerrit.cloudera.org:8080/#/c/6826/2//COMMIT_MSG
Commit Message:

Line 11: of "dead" blocks (i.e. CREATE + DELETE pairs) and to "compact" them by
"garbage collect" here would be more appropriate relative to our other uses of both concepts elsewhere


PS2, Line 22: Like the dead container deletion patch, this one is quick and dirty in that
            : the new logic is unsuitable for real-time compaction. Also like that patch,
            : compaction in real-time would require non-trivial synchronization changes.
how much space/optimization are we leaving on the table if we don't restart? i.e. what's the impact of doing not doing this?


-- 
To view, visit http://gerrit.cloudera.org:8080/6826
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I981f7d9e7eb96fb40cef30ad96c5960b72d07756
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1549: compact LBM container metadata files at startup

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: KUDU-1549: compact LBM container metadata files at startup
......................................................................


Patch Set 2:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/6826/2/src/kudu/fs/fs_report.h
File src/kudu/fs/fs_report.h:

Line 127:     Entry(std::string c, BlockRecordPB* r);
> this is a little odd -- if it takes a pointer I expect it to store the poin
I'm pining away for protobuf move constructors.

(will add comment).


http://gerrit.cloudera.org:8080/#/c/6826/2/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

PS2, Line 87: DEFINE_double(log_container_live_metadata_before_compact_ratio, 0.10,
            :               "Desired ratio of live block metadata in log containers. If a "
            :               "container's live to total block ratio dips below this value, "
            :               "the container's metadata file will be compacted at startup.");
> should probably be tagged advanced or even experimental?
I'll tag it advanced. I don't think experimental is appropriate given that modifying it doesn't cause new code paths to execute; it just adjusts how often they run.

As for the default value...I don't have a strong opinion, but I'll note that the higher it is, the more metadata write amplification we suffer. That's why I was conservative to start. Did you have a particular value in mind?


Line 417:       BlockRecordPB* record,
> comment on the pointer semantics. it's not clear what's being "saved" here
Done


PS2, Line 1964: offsets also match
> how could the offsets of two live blocks match?
It's contrived, but the two blocks could be of zero length. I'll add a comment.


PS2, Line 1969: .swap(records);
> = std::move(records)? or use .emplace?
Done


Line 2258:   for (const auto& e : low_live_block_containers) {
> can you extract the body of this if to a new function, perhaps? or too many
Done


Line 2293:       continue;
> if this were extracted to a new function, this logic would be a bit easier 
Agreed, done.


http://gerrit.cloudera.org:8080/#/c/6826/2/src/kudu/fs/log_block_manager.h
File src/kudu/fs/log_block_manager.h:

PS2, Line 295:   // 3. Containers in 'low_live_block_containers' will have their metadata
             :   //    files compacted.
             :  
> In a typical startup doesn't this mean a fairly large amount of memory usag
I went back and forth on this. I do like the idea of not making any on-disk changes until we've established that there are no fatal errors, and that's why I ended up on this side of the fence.

That said, here's an alternative: if you'll allow me to grow LogBlock by 8 bytes, I can add the block creation timestamp to it, and then I can use the LogBlocks that we've already allocated in the compaction. There's another advantage to this approach: it opens the door to doing compaction at real time without the onerous burden of rereading records from disk.

Why didn't I do this already? I couldn't bring myself to grow our memory footprint by potentially dozens of MB without also implementing real time metadata compaction.


http://gerrit.cloudera.org:8080/#/c/6826/2/src/kudu/util/status.h
File src/kudu/util/status.h:

Line 60: #define KUDU_WARN_NOT_OK_AND_CONTINUE(to_call, warning_prefix) {       \
> The naming here is a little confusing, considering 'continue' might be inte
I extracted a function like you suggested, so this has been removed.


-- 
To view, visit http://gerrit.cloudera.org:8080/6826
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I981f7d9e7eb96fb40cef30ad96c5960b72d07756
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1549: compact LBM container metadata files at startup

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-1549: compact LBM container metadata files at startup
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6826/2/src/kudu/fs/log_block_manager.cc
File src/kudu/fs/log_block_manager.cc:

PS2, Line 87: DEFINE_double(log_container_live_metadata_before_compact_ratio, 0.10,
            :               "Desired ratio of live block metadata in log containers. If a "
            :               "container's live to total block ratio dips below this value, "
            :               "the container's metadata file will be compacted at startup.");
> I can't say I have a strong opinion on the tagging given how often I hem an
looking at a data/ dir on a cluster that has been around for quite some time, most of the metadata files seem to be around 400KB. Assuming 100MB/sec sequential throughput and 10ms seek, it definitely seems like the startup time would be seek-dominated (10 or 20ms seek depending whether various internal metadata pages are hot in cache, plus only 4ms of sequential read time). 

Unfortunatley, if that's true, this might mean that this optimization won't yield much of an improvement in the case that the metadata files are cold after a full machine reboot. However, in a "warm reboot" where the metadata files are in cache, or in the case when the metadata is on SSD, it should be a good CPU savings.

I did a quick test on the above node (2000 metadata files) by calling drop_caches=3 and then timing 'cat *.metadata > /dev/null'. It took 2m14s. Interestingly, though, 'head --bytes=4 *.metadata > /dev/null' took only 54s. Looking at 'filefrag *.metadata' it seems most of the metadata files have ~15 extents, so each file is probably doing way more than the expected number of seeks.

If the above fragmentation is normal (seems quite plausible given the files are written slowly over time, old ones are appended to, etc) then it seems like rewriting them should have a nice perf boost for future startups just by virtue of defragmenting the underlying FS (regardless of any actual compaction). Going as far as to use FIEMAP is probably overkill, but being aggressive to compact probably isn't crazy.

(Sorry for the long comment... curious your thoughts)


-- 
To view, visit http://gerrit.cloudera.org:8080/6826
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I981f7d9e7eb96fb40cef30ad96c5960b72d07756
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes