Posted to commits@kudu.apache.org by to...@apache.org on 2017/05/19 00:32:48 UTC

[3/3] kudu git commit: log: reduce segment size from 64MB to 8MB

log: reduce segment size from 64MB to 8MB

Currently, we always retain a minimum of one segment of the WAL, even if
a tablet is cold and has not performed any writes in a long time. If a
server has hundreds or thousands of tablets, keeping a 64MB segment for
each tablet adds up to a lot of "wasted" disk space, especially if the
WAL is configured to use an expensive disk such as an SSD.

In addition to wasting space, the current code always re-reads all live
WAL segments at startup. Solving this properly has been an open problem
for quite some time, with various subtleties (described in KUDU-38). So,
as a band-aid, reducing the segment size linearly reduces the amount of
work during bootstrap of "cold" tablets.

In summary, the expectation is that, in a "cold" server, this will
reduce WAL disk space usage and startup time by approximately a factor
of 8.

I verified this by making these same changes in the configuration of a server
with ~6000 cold tablets. The disk usage of the WAL disk went from 381GB to
48GB, with a similar factor reduction in startup time.
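
As a rough sanity check on those numbers, here is a back-of-the-envelope
sketch (assumption: each cold tablet retains exactly one fully
preallocated segment; the 6000-tablet count is from the test above):

  // Minimum idle WAL footprint, assuming one retained segment per tablet.
  #include <cstdint>
  #include <iostream>

  int main() {
    const int64_t kColdTablets = 6000;
    for (int64_t segment_mb : {64, 8}) {
      std::cout << segment_mb << "MB segments: ~"
                << kColdTablets * segment_mb / 1024
                << "GB minimum WAL footprint" << std::endl;
    }
    return 0;
  }

This prints ~375GB and ~46GB, which lines up well with the observed
381GB and 48GB (the small excess presumably being tablets that retained
more than one segment).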

Because the segment size is reduced by a factor of 8, the patch also
increases the max retention segment count by the same factor (from 10 to
80). This has one risk: we currently keep an open file descriptor for
every retained segment. However, in typical workloads, the number of
tablets with a high rate of active writes is not in the hundreds or
thousands, so the total number of file descriptors is still likely to
be manageable. Nevertheless, this patch adds a TODO that we should
consider a FileCache for these descriptors if this usage becomes
problematic.
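
To make the descriptor risk concrete, a hypothetical worst-case estimate
(assuming, as described above, one open descriptor per retained
segment):

  // Worst case: every actively-written tablet retains the maximum
  // number of segments, each pinning one open file descriptor.
  #include <iostream>

  int main() {
    const int kMaxSegmentsToRetain = 80;  // new default in this patch
    for (int hot_tablets : {10, 100, 1000}) {
      std::cout << hot_tablets << " hot tablets -> up to "
                << hot_tablets * kMaxSegmentsToRetain
                << " open WAL descriptors" << std::endl;
    }
    return 0;
  }

At 100 hot tablets that is up to 8,000 descriptors, comfortably handled
with a raised nofile limit; at 1,000 it would strain typical limits,
which is what motivates the FileCache TODO.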

So, if reducing to 8MB is good, why not go further, like 4MB or 1MB? The
assumption is that 8MB is still large enough to get good sequential IO
throughput on both read and write, and large enough to limit the FD
usage as described above. If we add an FD cache at some point, we could
consider reducing it further, particularly if running on SSDs where the
sequential throughput is less affected by size.

Although it's possible that a user had configured
log_max_segments_to_retain based on their own disk space requirements,
the flag is marked
experimental, so I don't think we have to worry about compatibility in
that respect. We should consider changing the units here to be MB rather
than segment-count, so that the value is more robust.
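
For reference, a user who wants the previous behavior back can override
both flags at startup; for example (a hypothetical tablet server
invocation, using the flag names defined in this patch):

  kudu-tserver --log_segment_size_mb=64 --log_max_segments_to_retain=10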

Change-Id: Iadcda6b085e69cae5a15d54bb4c945d7605d5f98
Reviewed-on: http://gerrit.cloudera.org:8080/6911
Tested-by: Kudu Jenkins
Reviewed-by: Mike Percy <mp...@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/65c3aec9
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/65c3aec9
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/65c3aec9

Branch: refs/heads/master
Commit: 65c3aec9aeb7cb684cac98add6f48b737ae0de85
Parents: 1aa0ebb
Author: Todd Lipcon <to...@cloudera.com>
Authored: Wed May 17 12:23:01 2017 -0700
Committer: Todd Lipcon <to...@apache.org>
Committed: Fri May 19 00:32:28 2017 +0000

----------------------------------------------------------------------
 src/kudu/consensus/consensus_queue.cc           |  1 -
 src/kudu/consensus/log.cc                       |  4 +-
 src/kudu/consensus/log_util.cc                  |  4 +-
 .../integration-tests/raft_consensus-itest.cc   | 43 ++++++++++++++++++++
 4 files changed, 48 insertions(+), 4 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/65c3aec9/src/kudu/consensus/consensus_queue.cc
----------------------------------------------------------------------
diff --git a/src/kudu/consensus/consensus_queue.cc b/src/kudu/consensus/consensus_queue.cc
index e752eb2..8fa6bef 100644
--- a/src/kudu/consensus/consensus_queue.cc
+++ b/src/kudu/consensus/consensus_queue.cc
@@ -445,7 +445,6 @@ Status PeerMessageQueue::RequestForPeer(const string& uuid,
       request->mutable_ops()->AddAllocated(msg->get());
     }
     msg_refs->swap(messages);
-    DCHECK_LE(request->ByteSize(), FLAGS_consensus_max_batch_size_bytes);
   }
 
   DCHECK(preceding_id.IsInitialized());

http://git-wip-us.apache.org/repos/asf/kudu/blob/65c3aec9/src/kudu/consensus/log.cc
----------------------------------------------------------------------
diff --git a/src/kudu/consensus/log.cc b/src/kudu/consensus/log.cc
index 23f80e2..f6dd641 100644
--- a/src/kudu/consensus/log.cc
+++ b/src/kudu/consensus/log.cc
@@ -63,7 +63,7 @@ DEFINE_int32(log_min_segments_to_retain, 1,
 TAG_FLAG(log_min_segments_to_retain, runtime);
 TAG_FLAG(log_min_segments_to_retain, advanced);
 
-DEFINE_int32(log_max_segments_to_retain, 10,
+DEFINE_int32(log_max_segments_to_retain, 80,
              "The maximum number of past log segments to keep at all times for "
              "the purposes of catching up other peers.");
 TAG_FLAG(log_max_segments_to_retain, runtime);
@@ -1051,6 +1051,8 @@ Status Log::SwitchToAllocatedSegment() {
   }
 
   // Open the segment we just created in readable form and add it to the reader.
+  // TODO(todd): consider using a global FileCache here? With short log segments and
+  // lots of tablets, this file descriptor usage may add up.
   unique_ptr<RandomAccessFile> readable_file;
 
   RandomAccessFileOptions opts;

http://git-wip-us.apache.org/repos/asf/kudu/blob/65c3aec9/src/kudu/consensus/log_util.cc
----------------------------------------------------------------------
diff --git a/src/kudu/consensus/log_util.cc b/src/kudu/consensus/log_util.cc
index 8ef177c..bf20545 100644
--- a/src/kudu/consensus/log_util.cc
+++ b/src/kudu/consensus/log_util.cc
@@ -45,8 +45,8 @@
 #include "kudu/util/pb_util.h"
 #include "kudu/util/scoped_cleanup.h"
 
-DEFINE_int32(log_segment_size_mb, 64,
-             "The default segment size for log roll-overs, in MB");
+DEFINE_int32(log_segment_size_mb, 8,
+             "The default size for log segments, in MB");
 TAG_FLAG(log_segment_size_mb, advanced);
 
 DEFINE_bool(log_force_fsync_all, false,

http://git-wip-us.apache.org/repos/asf/kudu/blob/65c3aec9/src/kudu/integration-tests/raft_consensus-itest.cc
----------------------------------------------------------------------
diff --git a/src/kudu/integration-tests/raft_consensus-itest.cc b/src/kudu/integration-tests/raft_consensus-itest.cc
index d461451..ae3ca79 100644
--- a/src/kudu/integration-tests/raft_consensus-itest.cc
+++ b/src/kudu/integration-tests/raft_consensus-itest.cc
@@ -2508,6 +2508,49 @@ TEST_F(RaftConsensusITest, TestSlowLeader) {
   SleepFor(MonoDelta::FromSeconds(60));
 }
 
+// Test write batches just below the maximum limit.
+TEST_F(RaftConsensusITest, TestLargeBatches) {
+  vector<string> ts_flags = {
+    // We write 128KB cells in this test, so bump the limit, and disable compression.
+    "--max_cell_size_bytes=1000000",
+    "--log_segment_size_mb=1",
+    "--log_compression_codec=none",
+    "--log_min_segments_to_retain=100", // disable GC of logs.
+  };
+  BuildAndStart(ts_flags);
+
+  const int64_t kBatchSize = 40; // Write 40 * 128kb = 5MB per batch.
+  const int64_t kNumBatchesToWrite = 100;
+  TestWorkload workload(cluster_.get());
+  workload.set_table_name(kTableId);
+  workload.set_payload_bytes(128 * 1024); // Write ops of size 128KB.
+  workload.set_write_batch_size(kBatchSize);
+  workload.set_num_write_threads(1);
+  workload.Setup();
+  workload.Start();
+  LOG(INFO) << "Waiting until we've written enough data...";
+  while (workload.rows_inserted() < kBatchSize * kNumBatchesToWrite) {
+    SleepFor(MonoDelta::FromMilliseconds(100));
+  }
+  workload.StopAndJoin();
+
+  // Verify replication.
+  ClusterVerifier v(cluster_.get());
+  NO_FATALS(v.CheckCluster());
+  NO_FATALS(v.CheckRowCount(workload.table_name(),
+                            ClusterVerifier::EXACTLY,
+                            workload.rows_inserted()));
+
+  int num_wals = inspect_->CountFilesInWALDirForTS(0, tablet_id_, "wal-*");
+  int num_batches = workload.rows_inserted() / kBatchSize;
+  // The number of WALs should be similar to 'num_batches'. We can't make
+  // an exact assertion because async preallocation may take a small amount
+  // of time, in which case it's possible to put more than one batch in a
+  // single WAL.
+  ASSERT_GE(num_wals, num_batches / 2);
+  ASSERT_LE(num_wals, num_batches + 2);
+}
+
 void RaftConsensusITest::InsertPayloadIgnoreErrors(int start_row, int num_rows, int payload_size) {
   shared_ptr<KuduTable> table;
   CHECK_OK(client_->OpenTable(kTableId, &table));