Posted to commits@kudu.apache.org by mp...@apache.org on 2018/09/26 17:56:37 UTC

[2/4] kudu-site git commit: Publish commit(s) from site source repo: 83530755d Blogpost describing index skip scan optimization.

http://git-wip-us.apache.org/repos/asf/kudu-site/blob/12782cec/blog/page/4/index.html
----------------------------------------------------------------------
diff --git a/blog/page/4/index.html b/blog/page/4/index.html
index 21724ba..3f6dd71 100644
--- a/blog/page/4/index.html
+++ b/blog/page/4/index.html
@@ -117,6 +117,27 @@
 <!-- Articles -->
 <article>
   <header>
+    <h1 class="entry-title"><a href="/2016/10/20/weekly-update.html">Apache Kudu Weekly Update October 20th, 2016</a></h1>
+    <p class="meta">Posted 20 Oct 2016 by Todd Lipcon</p>
+  </header>
+  <div class="entry-content">
+    
+    <p>Welcome to the twenty-second edition of the Kudu Weekly Update. This weekly blog post
+covers ongoing development and news in the Apache Kudu project.</p>
+
+
+    
+  </div>
+  <div class="read-full">
+    <a class="btn btn-info" href="/2016/10/20/weekly-update.html">Read full post...</a>
+  </div>
+</article>
+
+
+
+<!-- Articles -->
+<article>
+  <header>
     <h1 class="entry-title"><a href="/2016/10/11/weekly-update.html">Apache Kudu Weekly Update October 11th, 2016</a></h1>
     <p class="meta">Posted 11 Oct 2016 by Todd Lipcon</p>
   </header>
@@ -209,320 +230,6 @@ scan path to speed up queries.</p>
 
 
 
-<!-- Articles -->
-<article>
-  <header>
-    <h1 class="entry-title"><a href="/2016/08/31/intro-flume-kudu-sink.html">An Introduction to the Flume Kudu Sink</a></h1>
-    <p class="meta">Posted 31 Aug 2016 by Ara Abrahamian</p>
-  </header>
-  <div class="entry-content">
-    
-    <p>This post discusses the Kudu Flume Sink. First, I’ll give some background on why we considered
-using Kudu, what Flume does for us, and how Flume fits with Kudu in our project.</p>
-
-<h2 id="why-kudu">Why Kudu</h2>
-
-<p>Traditionally in the Hadoop ecosystem we’ve dealt with various <em>batch processing</em> technologies such
-as MapReduce and the many libraries and tools built on top of it in various languages (Apache Pig,
-Apache Hive, Apache Oozie and many others). The main problem with this approach is that the whole
-data set has to be reprocessed in batches, again and again, as soon as new data gets added. Things get
-really complicated when a few such tasks need to get chained together, or when the same data set
-needs to be processed in various ways by different jobs, while all compete for the shared cluster
-resources.</p>
-
-<p>The opposite of this approach is <em>stream processing</em>: process the data as soon as it arrives, not
-in batches. Streaming systems such as Spark Streaming, Storm, Kafka Streams, and many others make
-this possible. But writing streaming services is not trivial. The streaming systems are becoming
-more and more capable and support more complex constructs, but they are not yet easy to use. All
-queries and processes need to be carefully planned and implemented.</p>
-
-<p>To summarize, <em>batch processing</em> is:</p>
-
-<ul>
-  <li>file-based</li>
-  <li>a paradigm that processes large chunks of data as a group</li>
-  <li>high latency and high throughput, both for ingest and query</li>
-  <li>typically easy to program, but hard to orchestrate</li>
-  <li>well suited for writing ad-hoc queries, although they are typically high latency</li>
-</ul>
-
-<p>While <em>stream processing</em> is:</p>
-
-<ul>
-  <li>a totally different paradigm, which involves single events and time windows instead of large groups of events</li>
-  <li>still file-based and not a long-term database</li>
-  <li>not batch-oriented, but incremental</li>
-  <li>ultra-fast ingest and ultra-fast query (query results basically pre-calculated)</li>
-  <li>not so easy to program, relatively easy to orchestrate</li>
-  <li>impossible to write ad-hoc queries</li>
-</ul>
-
-<p>And a Kudu-based <em>near real-time</em> approach is:</p>
-
-<ul>
-  <li>flexible and expressive, thanks to SQL support via Apache Impala (incubating)</li>
-  <li>a table-oriented, mutable data store that feels like a traditional relational database</li>
-  <li>very easy to program; you can even pretend it’s good old MySQL</li>
-  <li>low-latency and relatively high throughput, both for ingest and query</li>
-</ul>
-
-<p>At Argyle Data, we’re dealing with complex fraud detection scenarios. We need to ingest massive
-amounts of data, run machine learning algorithms and generate reports. When we created our current
-architecture two years ago we decided to opt for a database as the backbone of our system. That
-database is Apache Accumulo. It’s a key-value based database which runs on top of Hadoop HDFS,
-quite similar to HBase but with some important improvements such as cell level security and ease
-of deployment and management. To enable querying of this data for quite complex reporting and
-analytics, we used Presto, a distributed query engine with a pluggable architecture open-sourced
-by Facebook. We wrote a connector for it to let it run queries against the Accumulo database. This
-architecture has served us well, but there were a few problems:</p>
-
-<ul>
-  <li>we need to ingest even more massive volumes of data in real-time</li>
-  <li>we need to perform complex machine-learning calculations on even larger data sets</li>
-  <li>we need to support ad-hoc queries, plus long-term data warehouse functionality</li>
-</ul>
-
-<p>So, we’ve started gradually moving the core machine-learning pipeline to a streaming-based
-solution. This way we can ingest and process larger data sets faster, in real time. But then how
-would we take care of ad-hoc queries and long-term persistence? This is where Kudu comes in. While
-the machine learning pipeline ingests and processes real-time data, we store a copy of the same
-ingested data in Kudu for long-term access and ad-hoc queries. Kudu is our <em>data warehouse</em>. By
-using Kudu and Impala, we can retire our in-house Presto connector and rely on Impala’s
-super-fast query engine.</p>
-
-<p>But how would we make sure data is reliably ingested into the streaming pipeline <em>and</em> the
-Kudu-based data warehouse? This is where Apache Flume comes in.</p>
-
-<h2 id="why-flume">Why Flume</h2>
-
-<p>According to their <a href="http://flume.apache.org/">website</a> “Flume is a distributed, reliable, and
-available service for efficiently collecting, aggregating, and moving large amounts of log data.
-It has a simple and flexible architecture based on streaming data flows. It is robust and fault
-tolerant with tunable reliability mechanisms and many failover and recovery mechanisms.” As you
-can see, nowhere is Hadoop mentioned, yet Flume is typically used for ingesting data into Hadoop
-clusters.</p>
-
-<p><img src="https://blogs.apache.org/flume/mediaresource/ab0d50f6-a960-42cc-971e-3da38ba3adad" alt="png" /></p>
-
-<p>Flume has an extensible architecture. An instance of Flume, called an <em>agent</em>, can have multiple
-<em>channels</em>, with each having multiple <em>sources</em> and <em>sinks</em> of various types. Sources queue data
-in channels, which in turn write out data to sinks. Such <em>pipelines</em> can be chained together to
-create even more complex ones. There may be more than one agent and agents can be configured to
-support failover and recovery.</p>
-
-<p>Flume comes with a bunch of built-in channel, source, and sink types. The memory channel is the
-default (an in-memory queue with no persistence to disk), but other options such as Kafka- and
-File-based channels are also provided. As for sources, Avro, JMS, Thrift, and the spooling directory
-source are some of the built-in ones. Flume also ships with many sinks, including sinks for writing
-data to HDFS, HBase, Hive, Kafka, as well as to other Flume agents.</p>
-
-<p>In the rest of this post I’ll go over the Kudu Flume sink and show you how to configure Flume to
-write ingested data to a Kudu table. The sink has been part of the Kudu distribution since the 0.8
-release and the source code can be found <a href="https://github.com/apache/kudu/tree/master/java/kudu-flume-sink">here</a>.</p>
-
-<h2 id="configuring-the-kudu-flume-sink">Configuring the Kudu Flume Sink</h2>
-
-<p>Here is a sample flume configuration file:</p>
-
-<pre><code>agent1.sources  = source1
-agent1.channels = channel1
-agent1.sinks = sink1
-
-agent1.sources.source1.type = exec
-agent1.sources.source1.command = /usr/bin/vmstat 1
-agent1.sources.source1.channels = channel1
-
-agent1.channels.channel1.type = memory
-agent1.channels.channel1.capacity = 10000
-agent1.channels.channel1.transactionCapacity = 1000
-
-agent1.sinks.sink1.type = org.apache.flume.sink.kudu.KuduSink
-agent1.sinks.sink1.masterAddresses = localhost
-agent1.sinks.sink1.tableName = stats
-agent1.sinks.sink1.channel = channel1
-agent1.sinks.sink1.batchSize = 50
-agent1.sinks.sink1.producer = org.apache.kudu.flume.sink.SimpleKuduEventProducer
-</code></pre>
-
-<p>We define a source called <code>source1</code> which simply executes a <code>vmstat</code> command to continuously generate
-virtual memory statistics for the machine and queue them as events into an in-memory channel called
-<code>channel1</code>, which in turn is used for writing these events to a Kudu table called <code>stats</code>. We are using
-<code>org.apache.kudu.flume.sink.SimpleKuduEventProducer</code> as the producer. <code>SimpleKuduEventProducer</code> is
-the built-in and default producer, but it’s implemented as a showcase for how to write Flume
-events into Kudu tables. For any serious functionality we’d have to write a custom producer. We
-need to make this producer and the <code>KuduSink</code> class available to Flume. We can do that by simply
-copying the <code>kudu-flume-sink-&lt;VERSION&gt;.jar</code> jar file from the Kudu distribution to the
-<code>$FLUME_HOME/plugins.d/kudu-sink/lib</code> directory in the Flume installation. The jar file contains
-<code>KuduSink</code> and all of its dependencies (including Kudu java client classes).</p>
-
-<p>At a minimum, the Kudu Flume Sink needs to know where the Kudu masters are
-(<code>agent1.sinks.sink1.masterAddresses = localhost</code>) and which Kudu table should be used for writing
-Flume events to (<code>agent1.sinks.sink1.tableName = stats</code>). The Kudu Flume Sink doesn’t create this
-table; it has to exist before the Kudu Flume Sink is started.</p>
-
-<p>You may also notice the <code>batchSize</code> parameter. The sink takes up to that many Flume events
-from the channel at a time and flushes the entire batch in one shot; for example, with
-<code>batchSize = 50</code>, up to 50 events are applied per flush. Tuning batchSize properly can have a huge
-impact on the ingest performance of the Kudu cluster.</p>
-
-<p>Here is a complete list of KuduSink parameters:</p>
-
-<table>
-  <thead>
-    <tr>
-      <th>Parameter Name</th>
-      <th>Default</th>
-      <th>Description</th>
-    </tr>
-  </thead>
-  <tbody>
-    <tr>
-      <td>masterAddresses</td>
-      <td>N/A</td>
-      <td>Comma-separated list of “host:port” pairs of the masters (port optional)</td>
-    </tr>
-    <tr>
-      <td>tableName</td>
-      <td>N/A</td>
-      <td>The name of the table in Kudu to write to</td>
-    </tr>
-    <tr>
-      <td>producer</td>
-      <td>org.apache.kudu.flume.sink.SimpleKuduEventProducer</td>
-      <td>The fully qualified class name of the Kudu event producer the sink should use</td>
-    </tr>
-    <tr>
-      <td>batchSize</td>
-      <td>100</td>
-      <td>Maximum number of events the sink should take from the channel per transaction, if available</td>
-    </tr>
-    <tr>
-      <td>timeoutMillis</td>
-      <td>30000</td>
-      <td>Timeout period for Kudu operations, in milliseconds</td>
-    </tr>
-    <tr>
-      <td>ignoreDuplicateRows</td>
-      <td>true</td>
-      <td>Whether to ignore errors indicating that we attempted to insert duplicate rows into Kudu</td>
-    </tr>
-  </tbody>
-</table>
-
-<p>Let’s take a look at the source code for the built-in producer class:</p>
-
-<pre><code class="language-java">public class SimpleKuduEventProducer implements KuduEventProducer {
-  private byte[] payload;
-  private KuduTable table;
-  private String payloadColumn;
-
-  public SimpleKuduEventProducer(){
-  }
-
-  @Override
-  public void configure(Context context) {
-    payloadColumn = context.getString("payloadColumn","payload");
-  }
-
-  @Override
-  public void configure(ComponentConfiguration conf) {
-  }
-
-  @Override
-  public void initialize(Event event, KuduTable table) {
-    this.payload = event.getBody();
-    this.table = table;
-  }
-
-  @Override
-  public List&lt;Operation&gt; getOperations() throws FlumeException {
-    try {
-      Insert insert = table.newInsert();
-      PartialRow row = insert.getRow();
-      row.addBinary(payloadColumn, payload);
-
-      return Collections.singletonList((Operation) insert);
-    } catch (Exception e){
-      throw new FlumeException("Failed to create Kudu Insert object!", e);
-    }
-  }
-
-  @Override
-  public void close() {
-  }
-}
-</code></pre>
-
-<p><code>SimpleKuduEventProducer</code> implements the <code>org.apache.kudu.flume.sink.KuduEventProducer</code> interface,
-which itself looks like this:</p>
-
-<pre><code class="language-java">public interface KuduEventProducer extends Configurable, ConfigurableComponent {
-  /**
-   * Initialize the event producer.
-   * @param event to be written to Kudu
-   * @param table the KuduTable object used for creating Kudu Operation objects
-   */
-  void initialize(Event event, KuduTable table);
-
-  /**
-   * Get the operations that should be written out to Kudu as a result of this
-   * event. This list is written to Kudu using the Kudu client API.
-   * @return List of {@link org.kududb.client.Operation} which
-   * are written as such to Kudu
-   */
-  List&lt;Operation&gt; getOperations();
-
-  /*
-   * Clean up any state. This will be called when the sink is being stopped.
-   */
-  void close();
-}
-</code></pre>
-
-<p><code>public void configure(Context context)</code> is called when an instance of our producer is instantiated
-by the KuduSink. SimpleKuduEventProducer’s implementation looks for a producer parameter named
-<code>payloadColumn</code> and uses its value (“payload” if not overridden in the Flume configuration file) as the
-column which will hold the value of the Flume event payload. If you recall from above, we had
-configured the KuduSink to listen for events generated from the <code>vmstat</code> command. Each output row
-from that command will be stored as a new row containing a <code>payload</code> column in the <code>stats</code> table.
-Apart from <code>payloadColumn</code>, <code>SimpleKuduEventProducer</code> has no other configuration parameters; if
-it did, we would define them by prefixing their names with <code>producer.</code>
-(<code>agent1.sinks.sink1.producer.parameter1</code> for example).</p>
-
-<p>The main producer logic resides in the <code>public List&lt;Operation&gt; getOperations()</code> method. In
-SimpleKuduEventProducer’s implementation we simply insert the binary body of the Flume event into
-the Kudu table. Here we call Kudu’s <code>newInsert()</code> to initiate an insert, but we could have used
-an <code>Upsert</code> if updating an existing row was also an option; in fact, there’s another producer
-implementation available for doing just that: <code>SimpleKeyedKuduEventProducer</code>. In the real world
-you will most likely need to write your own custom producer, but you can base your implementation
-on the built-in ones.</p>
-
-<p>In the future, we plan to add more flexible event producer implementations so that creation of a
-custom event producer is not required to write data to Kudu. See
-<a href="https://gerrit.cloudera.org/#/c/4034/">here</a> for a work-in-progress generic event producer for
-Avro-encoded Events.</p>
-
-<h2 id="conclusion">Conclusion</h2>
-
-<p>Kudu is a scalable data store which lets us ingest insane amounts of data per second. Apache Flume
-helps us aggregate data from various sources, and the Kudu Flume Sink lets us easily store
-the aggregated Flume events into Kudu. Together they enable us to create a data warehouse out of
-disparate sources.</p>
-
-<p><em>Ara Abrahamian is a software engineer at Argyle Data building fraud detection systems using
-sophisticated machine learning methods. Ara is the original author of the Flume Kudu Sink that
-is included in the Kudu distribution. You can follow him on Twitter at
-<a href="https://twitter.com/ara_e">@ara_e</a>.</em></p>
-
-
-    
-  </div>
-  <div class="read-full">
-    <a class="btn btn-info" href="/2016/08/31/intro-flume-kudu-sink.html">Read full post...</a>
-  </div>
-</article>
-
-
-
 <!-- Pagination links -->
 
 <nav>
@@ -543,6 +250,8 @@ is included in the Kudu distribution. You can follow him on Twitter at
     <h3>Recent posts</h3>
     <ul>
     
+      <li> <a href="/2018/09/26/index-skip-scan-optimization-in-kudu.html">Index Skip Scan Optimization in Kudu</a> </li>
+    
       <li> <a href="/2018/09/11/simplified-pipelines-with-kudu.html">Simplified Data Pipelines with Kudu</a> </li>
     
       <li> <a href="/2018/08/06/getting-started-with-kudu-an-oreilly-title.html">Getting Started with Kudu - an O'Reilly Title</a> </li>
@@ -571,8 +280,6 @@ is included in the Kudu distribution. You can follow him on Twitter at
     
       <li> <a href="/2016/11/01/weekly-update.html">Apache Kudu Weekly Update November 1st, 2016</a> </li>
     
-      <li> <a href="/2016/10/20/weekly-update.html">Apache Kudu Weekly Update October 20th, 2016</a> </li>
-    
     </ul>
   </div>
 </div>

http://git-wip-us.apache.org/repos/asf/kudu-site/blob/12782cec/blog/page/5/index.html
----------------------------------------------------------------------
diff --git a/blog/page/5/index.html b/blog/page/5/index.html
index 1e4c02d..eba0ce9 100644
--- a/blog/page/5/index.html
+++ b/blog/page/5/index.html
@@ -117,6 +117,320 @@
 <!-- Articles -->
 <article>
   <header>
+    <h1 class="entry-title"><a href="/2016/08/31/intro-flume-kudu-sink.html">An Introduction to the Flume Kudu Sink</a></h1>
+    <p class="meta">Posted 31 Aug 2016 by Ara Abrahamian</p>
+  </header>
+  <div class="entry-content">
+    
+    <p>This post discusses the Kudu Flume Sink. First, I’ll give some background on why we considered
+using Kudu, what Flume does for us, and how Flume fits with Kudu in our project.</p>
+
+<h2 id="why-kudu">Why Kudu</h2>
+
+<p>Traditionally in the Hadoop ecosystem we’ve dealt with various <em>batch processing</em> technologies such
+as MapReduce and the many libraries and tools built on top of it in various languages (Apache Pig,
+Apache Hive, Apache Oozie and many others). The main problem with this approach is that the whole
+data set has to be reprocessed in batches, again and again, as soon as new data gets added. Things get
+really complicated when a few such tasks need to get chained together, or when the same data set
+needs to be processed in various ways by different jobs, while all compete for the shared cluster
+resources.</p>
+
+<p>The opposite of this approach is <em>stream processing</em>: process the data as soon as it arrives, not
+in batches. Streaming systems such as Spark Streaming, Storm, Kafka Streams, and many others make
+this possible. But writing streaming services is not trivial. The streaming systems are becoming
+more and more capable and support more complex constructs, but they are not yet easy to use. All
+queries and processes need to be carefully planned and implemented.</p>
+
+<p>To summarize, <em>batch processing</em> is:</p>
+
+<ul>
+  <li>file-based</li>
+  <li>a paradigm that processes large chunks of data as a group</li>
+  <li>high latency and high throughput, both for ingest and query</li>
+  <li>typically easy to program, but hard to orchestrate</li>
+  <li>well suited for writing ad-hoc queries, although they are typically high latency</li>
+</ul>
+
+<p>While <em>stream processing</em> is:</p>
+
+<ul>
+  <li>a totally different paradigm, which involves single events and time windows instead of large groups of events</li>
+  <li>still file-based and not a long-term database</li>
+  <li>not batch-oriented, but incremental</li>
+  <li>ultra-fast ingest and ultra-fast query (query results basically pre-calculated)</li>
+  <li>not so easy to program, relatively easy to orchestrate</li>
+  <li>impossible to write ad-hoc queries</li>
+</ul>
+
+<p>And a Kudu-based <em>near real-time</em> approach is:</p>
+
+<ul>
+  <li>flexible and expressive, thanks to SQL support via Apache Impala (incubating)</li>
+  <li>a table-oriented, mutable data store that feels like a traditional relational database</li>
+  <li>very easy to program; you can even pretend it’s good old MySQL</li>
+  <li>low-latency and relatively high throughput, both for ingest and query</li>
+</ul>
+
+<p>At Argyle Data, we’re dealing with complex fraud detection scenarios. We need to ingest massive
+amounts of data, run machine learning algorithms and generate reports. When we created our current
+architecture two years ago we decided to opt for a database as the backbone of our system. That
+database is Apache Accumulo. It’s a key-value based database which runs on top of Hadoop HDFS,
+quite similar to HBase but with some important improvements such as cell level security and ease
+of deployment and management. To enable querying of this data for quite complex reporting and
+analytics, we used Presto, a distributed query engine with a pluggable architecture open-sourced
+by Facebook. We wrote a connector for it to let it run queries against the Accumulo database. This
+architecture has served us well, but there were a few problems:</p>
+
+<ul>
+  <li>we need to ingest even more massive volumes of data in real-time</li>
+  <li>we need to perform complex machine-learning calculations on even larger data sets</li>
+  <li>we need to support ad-hoc queries, plus long-term data warehouse functionality</li>
+</ul>
+
+<p>So, we’ve started gradually moving the core machine-learning pipeline to a streaming-based
+solution. This way we can ingest and process larger data sets faster, in real time. But then how
+would we take care of ad-hoc queries and long-term persistence? This is where Kudu comes in. While
+the machine learning pipeline ingests and processes real-time data, we store a copy of the same
+ingested data in Kudu for long-term access and ad-hoc queries. Kudu is our <em>data warehouse</em>. By
+using Kudu and Impala, we can retire our in-house Presto connector and rely on Impala’s
+super-fast query engine.</p>
+
+<p>But how would we make sure data is reliably ingested into the streaming pipeline <em>and</em> the
+Kudu-based data warehouse? This is where Apache Flume comes in.</p>
+
+<h2 id="why-flume">Why Flume</h2>
+
+<p>According to their <a href="http://flume.apache.org/">website</a> “Flume is a distributed, reliable, and
+available service for efficiently collecting, aggregating, and moving large amounts of log data.
+It has a simple and flexible architecture based on streaming data flows. It is robust and fault
+tolerant with tunable reliability mechanisms and many failover and recovery mechanisms.” As you
+can see, nowhere is Hadoop mentioned, yet Flume is typically used for ingesting data into Hadoop
+clusters.</p>
+
+<p><img src="https://blogs.apache.org/flume/mediaresource/ab0d50f6-a960-42cc-971e-3da38ba3adad" alt="png" /></p>
+
+<p>Flume has an extensible architecture. An instance of Flume, called an <em>agent</em>, can have multiple
+<em>channels</em>, with each having multiple <em>sources</em> and <em>sinks</em> of various types. Sources queue data
+in channels, which in turn write out data to sinks. Such <em>pipelines</em> can be chained together to
+create even more complex ones. There may be more than one agent and agents can be configured to
+support failover and recovery.</p>
+
+<p>Flume comes with a bunch of built-in channel, source, and sink types. The memory channel is the
+default (an in-memory queue with no persistence to disk), but other options such as Kafka- and
+File-based channels are also provided. As for sources, Avro, JMS, Thrift, and the spooling directory
+source are some of the built-in ones. Flume also ships with many sinks, including sinks for writing
+data to HDFS, HBase, Hive, Kafka, as well as to other Flume agents.</p>
+
+<p>In the rest of this post I’ll go over the Kudu Flume sink and show you how to configure Flume to
+write ingested data to a Kudu table. The sink has been part of the Kudu distribution since the 0.8
+release and the source code can be found <a href="https://github.com/apache/kudu/tree/master/java/kudu-flume-sink">here</a>.</p>
+
+<h2 id="configuring-the-kudu-flume-sink">Configuring the Kudu Flume Sink</h2>
+
+<p>Here is a sample flume configuration file:</p>
+
+<div class="highlighter-rouge">agent1.sources  = source1
+agent1.channels = channel1
+agent1.sinks = sink1
+
+agent1.sources.source1.type = exec
+agent1.sources.source1.command = /usr/bin/vmstat 1
+agent1.sources.source1.channels = channel1
+
+agent1.channels.channel1.type = memory
+agent1.channels.channel1.capacity = 10000
+agent1.channels.channel1.transactionCapacity = 1000
+
+agent1.sinks.sink1.type = org.apache.flume.sink.kudu.KuduSink
+agent1.sinks.sink1.masterAddresses = localhost
+agent1.sinks.sink1.tableName = stats
+agent1.sinks.sink1.channel = channel1
+agent1.sinks.sink1.batchSize = 50
+agent1.sinks.sink1.producer = org.apache.kudu.flume.sink.SimpleKuduEventProducer
+</div>
+
+<p>We define a source called <code class="highlighter-rouge">source1</code> which simply executes a <code class="highlighter-rouge">vmstat</code> command to continuously generate
+virtual memory statistics for the machine and queue them as events into an in-memory channel called
+<code class="highlighter-rouge">channel1</code>, which in turn is used for writing these events to a Kudu table called <code class="highlighter-rouge">stats</code>. We are using
+<code class="highlighter-rouge">org.apache.kudu.flume.sink.SimpleKuduEventProducer</code> as the producer. <code class="highlighter-rouge">SimpleKuduEventProducer</code> is
+the built-in and default producer, but it’s implemented as a showcase for how to write Flume
+events into Kudu tables. For any serious functionality we’d have to write a custom producer. We
+need to make this producer and the <code class="highlighter-rouge">KuduSink</code> class available to Flume. We can do that by simply
+copying the <code class="highlighter-rouge">kudu-flume-sink-&lt;VERSION&gt;.jar</code> jar file from the Kudu distribution to the
+<code class="highlighter-rouge">$FLUME_HOME/plugins.d/kudu-sink/lib</code> directory in the Flume installation. The jar file contains
+<code class="highlighter-rouge">KuduSink</code> and all of its dependencies (including Kudu java client classes).</p>
+
+<p>At a minimum, the Kudu Flume Sink needs to know where the Kudu masters are
+(<code class="highlighter-rouge">agent1.sinks.sink1.masterAddresses = localhost</code>) and which Kudu table should be used for writing
+Flume events to (<code class="highlighter-rouge">agent1.sinks.sink1.tableName = stats</code>). The Kudu Flume Sink doesn’t create this
+table; it has to exist before the Kudu Flume Sink is started.</p>
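+
+<p>As a quick illustration (this sketch is not part of the original post and assumes a
+one-column schema), the <code>stats</code> table could be created ahead of time with the Kudu Java
+client. Making the binary <code>payload</code> column the primary key matches the single-column
+inserts that <code>SimpleKuduEventProducer</code> performs, and duplicate rows can then be ignored
+by the sink (see the <code>ignoreDuplicateRows</code> parameter below):</p>
+
+<pre><code class="language-java">import java.util.Arrays;
+
+import org.apache.kudu.ColumnSchema;
+import org.apache.kudu.Schema;
+import org.apache.kudu.Type;
+import org.apache.kudu.client.CreateTableOptions;
+import org.apache.kudu.client.KuduClient;
+
+public class CreateStatsTable {
+  public static void main(String[] args) throws Exception {
+    KuduClient client = new KuduClient.KuduClientBuilder("localhost").build();
+    try {
+      // Single BINARY key column holding the raw Flume event body.
+      Schema schema = new Schema(Arrays.asList(
+          new ColumnSchema.ColumnSchemaBuilder("payload", Type.BINARY)
+              .key(true)
+              .build()));
+      // Range-partition on the key; adjust the partitioning for real workloads.
+      client.createTable("stats", schema,
+          new CreateTableOptions().setRangePartitionColumns(Arrays.asList("payload")));
+    } finally {
+      client.shutdown();
+    }
+  }
+}
+</code></pre>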
+
+<p>You may also notice the <code class="highlighter-rouge">batchSize</code> parameter. The sink takes up to that many Flume events
+from the channel at a time and flushes the entire batch in one shot; for example, with
+<code class="highlighter-rouge">batchSize = 50</code>, up to 50 events are applied per flush. Tuning batchSize properly can have a huge
+impact on the ingest performance of the Kudu cluster.</p>
+
+<p>Here is a complete list of KuduSink parameters:</p>
+
+<table>
+  <thead>
+    <tr>
+      <th>Parameter Name</th>
+      <th>Default</th>
+      <th>Description</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>masterAddresses</td>
+      <td>N/A</td>
+      <td>Comma-separated list of “host:port” pairs of the masters (port optional)</td>
+    </tr>
+    <tr>
+      <td>tableName</td>
+      <td>N/A</td>
+      <td>The name of the table in Kudu to write to</td>
+    </tr>
+    <tr>
+      <td>producer</td>
+      <td>org.apache.kudu.flume.sink.SimpleKuduEventProducer</td>
+      <td>The fully qualified class name of the Kudu event producer the sink should use</td>
+    </tr>
+    <tr>
+      <td>batchSize</td>
+      <td>100</td>
+      <td>Maximum number of events the sink should take from the channel per transaction, if available</td>
+    </tr>
+    <tr>
+      <td>timeoutMillis</td>
+      <td>30000</td>
+      <td>Timeout period for Kudu operations, in milliseconds</td>
+    </tr>
+    <tr>
+      <td>ignoreDuplicateRows</td>
+      <td>true</td>
+      <td>Whether to ignore errors indicating that we attempted to insert duplicate rows into Kudu</td>
+    </tr>
+  </tbody>
+</table>
+
+<p>Let’s take a look at the source code for the built-in producer class:</p>
+
+<div class="highlighter-rouge"><span class="kd">public</span> <span class="kd">class</span> <span class="nc">SimpleKuduEventProducer</span> <span class="kd">implements</span> <span class="n">KuduEventProducer</span> <span class="o">{</span>
+  <span class="kd">private</span> <span class="kt">byte</span><span class="o">[]</span> <span class="n">payload</span><span class="o">;</span>
+  <span class="kd">private</span> <span class="n">KuduTable</span> <span class="n">table</span><span class="o">;</span>
+  <span class="kd">private</span> <span class="n">String</span> <span class="n">payloadColumn</span><span class="o">;</span>
+
+  <span class="kd">public</span> <span class="nf">SimpleKuduEventProducer</span><span class="o">(){</span>
+  <span class="o">}</span>
+
+  <span class="nd">@Override</span>
+  <span class="kd">public</span> <span class="kt">void</span> <span class="nf">configure</span><span class="o">(</span><span class="n">Context</span> <span class="n">context</span><span class="o">)</span> <span class="o">{</span>
+    <span class="n">payloadColumn</span> <span class="o">=</span> <span class="n">context</span><span class="o">.</span><span class="na">getString</span><span class="o">(</span><span class="s">"payloadColumn"</span><span class="o">,</span><span class="s">"payload"</span><span class="o">);</span>
+  <span class="o">}</span>
+
+  <span class="nd">@Override</span>
+  <span class="kd">public</span> <span class="kt">void</span> <span class="nf">configure</span><span class="o">(</span><span class="n">ComponentConfiguration</span> <span class="n">conf</span><span class="o">)</span> <span class="o">{</span>
+  <span class="o">}</span>
+
+  <span class="nd">@Override</span>
+  <span class="kd">public</span> <span class="kt">void</span> <span class="nf">initialize</span><span class="o">(</span><span class="n">Event</span> <span class="n">event</span><span class="o">,</span> <span class="n">KuduTable</span> <span class="n">table</span><span class="o">)</span> <span class="o">{</span>
+    <span class="k">this</span><span class="o">.</span><span class="na">payload</span> <span class="o">=</span> <span class="n">event</span><span class="o">.</span><span class="na">getBody</span><span class="o">();</span>
+    <span class="k">this</span><span class="o">.</span><span class="na">table</span> <span class="o">=</span> <span class="n">table</span><span class="o">;</span>
+  <span class="o">}</span>
+
+  <span class="nd">@Override</span>
+  <span class="kd">public</span> <span class="n">List</span><span class="o">&lt;</span><span class="n">Operation</span><span class="o">&gt;</span> <span class="nf">getOperations</span><span class="o">()</span> <span class="kd">throws</span> <span class="n">FlumeException</span> <span class="o">{</span>
+    <span class="k">try</span> <span class="o">{</span>
+      <span class="n">Insert</span> <span class="n">insert</span> <span class="o">=</span> <span class="n">table</span><span class="o">.</span><span class="na">newInsert</span><span class="o">();</span>
+      <span class="n">PartialRow</span> <span class="n">row</span> <span class="o">=</span> <span class="n">insert</span><span class="o">.</span><span class="na">getRow</span><span class="o">();</span>
+      <span class="n">row</span><span class="o">.</span><span class="na">addBinary</span><span class="o">(</span><span class="n">payloadColumn</span><span class="o">,</span> <span class="n">payload</span><span class="o">);</span>
+
+      <span class="k">return</span> <span class="n">Collections</span><span class="o">.</span><span class="na">singletonList</span><span class="o">((</span><span class="n">Operation</span><span class="o">)</span> <span class="n">insert</span><span class="o">);</span>
+    <span class="o">}</span> <span class="k">catch</span> <span class="o">(</span><span class="n">Exception</span> <span class="n">e</span><span class="o">){</span>
+      <span class="k">throw</span> <span class="k">new</span> <span class="nf">FlumeException</span><span class="o">(</span><span class="s">"Failed to create Kudu Insert object!"</span><span class="o">,</span> <span class="n">e</span><span class="o">);</span>
+    <span class="o">}</span>
+  <span class="o">}</span>
+
+  <span class="nd">@Override</span>
+  <span class="kd">public</span> <span class="kt">void</span> <span class="nf">close</span><span class="o">()</span> <span class="o">{</span>
+  <span class="o">}</span>
+<span class="o">}</span>
+</div>
+
+<p><code class="highlighter-rouge">SimpleKuduEventProducer</code> implements the <code class="highlighter-rouge">org.apache.kudu.flume.sink.KuduEventProducer</code> interface,
+which itself looks like this:</p>
+
+<div class="highlighter-rouge"><span class="kd">public</span> <span class="kd">interface</span> <span class="nc">KuduEventProducer</span> <span class="kd">extends</span> <span class="n">Configurable</span><span class="o">,</span> <span class="n">ConfigurableComponent</span> <span class="o">{</span>
+  <span class="cm">/**
+   * Initialize the event producer.
+   * @param event to be written to Kudu
+   * @param table the KuduTable object used for creating Kudu Operation objects
+   */</span>
+  <span class="kt">void</span> <span class="nf">initialize</span><span class="o">(</span><span class="n">Event</span> <span class="n">event</span><span class="o">,</span> <span class="n">KuduTable</span> <span class="n">table</span><span class="o">);</span>
+
+  <span class="cm">/**
+   * Get the operations that should be written out to Kudu as a result of this
+   * event. This list is written to Kudu using the Kudu client API.
+   * @return List of {@link org.kududb.client.Operation} which
+   * are written as such to Kudu
+   */</span>
+  <span class="n">List</span><span class="o">&lt;</span><span class="n">Operation</span><span class="o">&gt;</span> <span class="nf">getOperations</span><span class="o">();</span>
+
+  <span class="cm">/*
+   * Clean up any state. This will be called when the sink is being stopped.
+   */</span>
+  <span class="kt">void</span> <span class="nf">close</span><span class="o">();</span>
+<span class="o">}</span>
+</div>
+
+<p><code class="highlighter-rouge">public void configure(Context context)</code> is called when an instance of our producer is instantiated
+by the KuduSink. SimpleKuduEventProducer’s implementation looks for a producer parameter named
+<code class="highlighter-rouge">payloadColumn</code> and uses its value (“payload” if not overridden in Flume configuration file) as the
+column which will hold the value of the Flume event payload. If you recall from above, we had
+configured the KuduSink to listen for events generated from the <code class="highlighter-rouge">vmstat</code> command. Each output row
+from that command will be stored as a new row containing a <code class="highlighter-rouge">payload</code> column in the <code class="highlighter-rouge">stats</code> table.
+<code class="highlighter-rouge">SimpleKuduEventProducer</code> does not have any configuration parameters, but if it had any we would
+define them by prefixing it with <code class="highlighter-rouge">producer.</code> (<code class="highlighter-rouge">agent1.sinks.sink1.producer.parameter1</code> for
+example).</p>
+
+<p>The main producer logic resides in the <code class="highlighter-rouge">public List&lt;Operation&gt; getOperations()</code> method. In
+SimpleKuduEventProducer’s implementation we simply insert the binary body of the Flume event into
+the Kudu table. Here we call Kudu’s <code class="highlighter-rouge">newInsert()</code> to initiate an insert, but we could have used
+an <code class="highlighter-rouge">Upsert</code> if updating an existing row was also an option; in fact, there’s another producer
+implementation available for doing just that: <code class="highlighter-rouge">SimpleKeyedKuduEventProducer</code>. In the real world
+you will most likely need to write your own custom producer, but you can base your implementation
+on the built-in ones.</p>
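+
+<p>As a purely hypothetical sketch of what such a custom producer might look like (the column
+names and header handling below are invented for illustration and do not ship with Kudu), a
+producer that stores each event in a multi-column row could be written as:</p>
+
+<pre><code class="language-java">import java.nio.charset.StandardCharsets;
+import java.util.Collections;
+import java.util.List;
+
+import org.apache.flume.Context;
+import org.apache.flume.Event;
+import org.apache.flume.FlumeException;
+import org.apache.flume.conf.ComponentConfiguration;
+import org.apache.kudu.client.Insert;
+import org.apache.kudu.client.KuduTable;
+import org.apache.kudu.client.Operation;
+import org.apache.kudu.client.PartialRow;
+import org.apache.kudu.flume.sink.KuduEventProducer;
+
+public class VmstatKuduEventProducer implements KuduEventProducer {
+  private Event event;
+  private KuduTable table;
+
+  @Override
+  public void configure(Context context) {}
+
+  @Override
+  public void configure(ComponentConfiguration conf) {}
+
+  @Override
+  public void initialize(Event event, KuduTable table) {
+    this.event = event;
+    this.table = table;
+  }
+
+  @Override
+  public List&lt;Operation&gt; getOperations() throws FlumeException {
+    try {
+      // Hypothetical target schema: (host STRING, captured_at BIGINT, line STRING).
+      Insert insert = table.newInsert();
+      PartialRow row = insert.getRow();
+      row.addString("host", event.getHeaders().getOrDefault("host", "unknown"));
+      row.addLong("captured_at", System.currentTimeMillis());
+      row.addString("line", new String(event.getBody(), StandardCharsets.UTF_8));
+      return Collections.singletonList((Operation) insert);
+    } catch (Exception e) {
+      throw new FlumeException("Failed to create Kudu Insert object!", e);
+    }
+  }
+
+  @Override
+  public void close() {}
+}
+</code></pre>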
+
+<p>In the future, we plan to add more flexible event producer implementations so that creation of a
+custom event producer is not required to write data to Kudu. See
+<a href="https://gerrit.cloudera.org/#/c/4034/">here</a> for a work-in-progress generic event producer for
+Avro-encoded Events.</p>
+
+<h2 id="conclusion">Conclusion</h2>
+
+<p>Kudu is a scalable data store which lets us ingest insane amounts of data per second. Apache Flume
+helps us aggregate data from various sources, and the Kudu Flume Sink lets us easily store
+the aggregated Flume events into Kudu. Together they enable us to create a data warehouse out of
+disparate sources.</p>
+
+<p><em>Ara Abrahamian is a software engineer at Argyle Data building fraud detection systems using
+sophisticated machine learning methods. Ara is the original author of the Flume Kudu Sink that
+is included in the Kudu distribution. You can follow him on Twitter at
+<a href="https://twitter.com/ara_e">@ara_e</a>.</em></p>
+
+
+    
+  </div>
+  <div class="read-full">
+    <a class="btn btn-info" href="/2016/08/31/intro-flume-kudu-sink.html">Read full post...</a>
+  </div>
+</article>
+
+
+
+<!-- Articles -->
+<article>
+  <header>
     <h1 class="entry-title"><a href="/2016/08/23/new-range-partitioning-features.html">New Range Partitioning Features in Kudu 0.10</a></h1>
     <p class="meta">Posted 23 Aug 2016 by Dan Burkert</p>
   </header>
@@ -201,27 +515,6 @@ covers ongoing development and news in the Apache Kudu project.</p>
 
 
 
-<!-- Articles -->
-<article>
-  <header>
-    <h1 class="entry-title"><a href="/2016/07/26/weekly-update.html">Apache Kudu Weekly Update July 26, 2016</a></h1>
-    <p class="meta">Posted 26 Jul 2016 by Jean-Daniel Cryans</p>
-  </header>
-  <div class="entry-content">
-    
-    <p>Welcome to the eighteenth edition of the Kudu Weekly Update. This weekly blog post
-covers ongoing development and news in the Apache Kudu project.</p>
-
-
-    
-  </div>
-  <div class="read-full">
-    <a class="btn btn-info" href="/2016/07/26/weekly-update.html">Read full post...</a>
-  </div>
-</article>
-
-
-
 <!-- Pagination links -->
 
 <nav>
@@ -242,6 +535,8 @@ covers ongoing development and news in the Apache Kudu project.</p>
     <h3>Recent posts</h3>
     <ul>
     
+      <li> <a href="/2018/09/26/index-skip-scan-optimization-in-kudu.html">Index Skip Scan Optimization in Kudu</a> </li>
+    
       <li> <a href="/2018/09/11/simplified-pipelines-with-kudu.html">Simplified Data Pipelines with Kudu</a> </li>
     
       <li> <a href="/2018/08/06/getting-started-with-kudu-an-oreilly-title.html">Getting Started with Kudu - an O'Reilly Title</a> </li>
@@ -270,8 +565,6 @@ covers ongoing development and news in the Apache Kudu project.</p>
     
       <li> <a href="/2016/11/01/weekly-update.html">Apache Kudu Weekly Update November 1st, 2016</a> </li>
     
-      <li> <a href="/2016/10/20/weekly-update.html">Apache Kudu Weekly Update October 20th, 2016</a> </li>
-    
     </ul>
   </div>
 </div>

http://git-wip-us.apache.org/repos/asf/kudu-site/blob/12782cec/blog/page/6/index.html
----------------------------------------------------------------------
diff --git a/blog/page/6/index.html b/blog/page/6/index.html
index 5801003..b2b5e52 100644
--- a/blog/page/6/index.html
+++ b/blog/page/6/index.html
@@ -117,6 +117,27 @@
 <!-- Articles -->
 <article>
   <header>
+    <h1 class="entry-title"><a href="/2016/07/26/weekly-update.html">Apache Kudu Weekly Update July 26, 2016</a></h1>
+    <p class="meta">Posted 26 Jul 2016 by Jean-Daniel Cryans</p>
+  </header>
+  <div class="entry-content">
+    
+    <p>Welcome to the eighteenth edition of the Kudu Weekly Update. This weekly blog post
+covers ongoing development and news in the Apache Kudu project.</p>
+
+
+    
+  </div>
+  <div class="read-full">
+    <a class="btn btn-info" href="/2016/07/26/weekly-update.html">Read full post...</a>
+  </div>
+</article>
+
+
+
+<!-- Articles -->
+<article>
+  <header>
     <h1 class="entry-title"><a href="/2016/07/25/asf-graduation.html">The Apache Software Foundation Announces Apache&reg; Kudu&trade; as a Top-Level Project</a></h1>
     <p class="meta">Posted 25 Jul 2016 by Jean-Daniel Cryans</p>
   </header>
@@ -209,27 +230,6 @@ of 0.9.0 are encouraged to update to the new version at their earliest convenien
 
 
 
-<!-- Articles -->
-<article>
-  <header>
-    <h1 class="entry-title"><a href="/2016/06/27/weekly-update.html">Apache Kudu (incubating) Weekly Update June 27, 2016</a></h1>
-    <p class="meta">Posted 27 Jun 2016 by Todd Lipcon</p>
-  </header>
-  <div class="entry-content">
-    
-    <p>Welcome to the fifteenth edition of the Kudu Weekly Update. This weekly blog post
-covers ongoing development and news in the Apache Kudu (incubating) project.</p>
-
-
-    
-  </div>
-  <div class="read-full">
-    <a class="btn btn-info" href="/2016/06/27/weekly-update.html">Read full post...</a>
-  </div>
-</article>
-
-
-
 <!-- Pagination links -->
 
 <nav>
@@ -250,6 +250,8 @@ covers ongoing development and news in the Apache Kudu (incubating) project.</p>
     <h3>Recent posts</h3>
     <ul>
     
+      <li> <a href="/2018/09/26/index-skip-scan-optimization-in-kudu.html">Index Skip Scan Optimization in Kudu</a> </li>
+    
       <li> <a href="/2018/09/11/simplified-pipelines-with-kudu.html">Simplified Data Pipelines with Kudu</a> </li>
     
       <li> <a href="/2018/08/06/getting-started-with-kudu-an-oreilly-title.html">Getting Started with Kudu - an O'Reilly Title</a> </li>
@@ -278,8 +280,6 @@ covers ongoing development and news in the Apache Kudu (incubating) project.</p>
     
       <li> <a href="/2016/11/01/weekly-update.html">Apache Kudu Weekly Update November 1st, 2016</a> </li>
     
-      <li> <a href="/2016/10/20/weekly-update.html">Apache Kudu Weekly Update October 20th, 2016</a> </li>
-    
     </ul>
   </div>
 </div>

http://git-wip-us.apache.org/repos/asf/kudu-site/blob/12782cec/blog/page/7/index.html
----------------------------------------------------------------------
diff --git a/blog/page/7/index.html b/blog/page/7/index.html
index d0dfe49..0692c27 100644
--- a/blog/page/7/index.html
+++ b/blog/page/7/index.html
@@ -117,6 +117,27 @@
 <!-- Articles -->
 <article>
   <header>
+    <h1 class="entry-title"><a href="/2016/06/27/weekly-update.html">Apache Kudu (incubating) Weekly Update June 27, 2016</a></h1>
+    <p class="meta">Posted 27 Jun 2016 by Todd Lipcon</p>
+  </header>
+  <div class="entry-content">
+    
+    <p>Welcome to the fifteenth edition of the Kudu Weekly Update. This weekly blog post
+covers ongoing development and news in the Apache Kudu (incubating) project.</p>
+
+
+    
+  </div>
+  <div class="read-full">
+    <a class="btn btn-info" href="/2016/06/27/weekly-update.html">Read full post...</a>
+  </div>
+</article>
+
+
+
+<!-- Articles -->
+<article>
+  <header>
     <h1 class="entry-title"><a href="/2016/06/24/multi-master-1-0-0.html">Master fault tolerance in Kudu 1.0</a></h1>
     <p class="meta">Posted 24 Jun 2016 by Adar Dembo</p>
   </header>
@@ -202,37 +223,6 @@ covers ongoing development and news in the Apache Kudu (incubating) project.</p>
 
 
 
-<!-- Articles -->
-<article>
-  <header>
-    <h1 class="entry-title"><a href="/2016/06/10/apache-kudu-0-9-0-released.html">Apache Kudu (incubating) 0.9.0 released</a></h1>
-    <p class="meta">Posted 10 Jun 2016 by Jean-Daniel Cryans</p>
-  </header>
-  <div class="entry-content">
-    
-    <p>The Apache Kudu (incubating) team is happy to announce the release of Kudu
-0.9.0!</p>
-
-<p>This latest version adds basic UPSERT functionality and an improved Apache Spark Data Source
-that doesn’t rely on the MapReduce I/O formats. It also improves Tablet Server
-restart time as well as write performance under high load. Finally, Kudu now enforces
-the specification of a partitioning scheme for new tables.</p>
-
-<ul>
-  <li>Read the detailed <a href="http://kudu.apache.org/releases/0.9.0/docs/release_notes.html">Kudu 0.9.0 release notes</a></li>
-  <li>Download the <a href="http://kudu.apache.org/releases/0.9.0/">Kudu 0.9.0 source release</a></li>
-</ul>
-
-
-    
-  </div>
-  <div class="read-full">
-    <a class="btn btn-info" href="/2016/06/10/apache-kudu-0-9-0-released.html">Read full post...</a>
-  </div>
-</article>
-
-
-
 <!-- Pagination links -->
 
 <nav>
@@ -253,6 +243,8 @@ the specification of a partitioning scheme for new tables.</p>
     <h3>Recent posts</h3>
     <ul>
     
+      <li> <a href="/2018/09/26/index-skip-scan-optimization-in-kudu.html">Index Skip Scan Optimization in Kudu</a> </li>
+    
       <li> <a href="/2018/09/11/simplified-pipelines-with-kudu.html">Simplified Data Pipelines with Kudu</a> </li>
     
       <li> <a href="/2018/08/06/getting-started-with-kudu-an-oreilly-title.html">Getting Started with Kudu - an O'Reilly Title</a> </li>
@@ -281,8 +273,6 @@ the specification of a partitioning scheme for new tables.</p>
     
       <li> <a href="/2016/11/01/weekly-update.html">Apache Kudu Weekly Update November 1st, 2016</a> </li>
     
-      <li> <a href="/2016/10/20/weekly-update.html">Apache Kudu Weekly Update October 20th, 2016</a> </li>
-    
     </ul>
   </div>
 </div>

http://git-wip-us.apache.org/repos/asf/kudu-site/blob/12782cec/blog/page/8/index.html
----------------------------------------------------------------------
diff --git a/blog/page/8/index.html b/blog/page/8/index.html
index ce0a7e1..aa53f05 100644
--- a/blog/page/8/index.html
+++ b/blog/page/8/index.html
@@ -117,6 +117,37 @@
 <!-- Articles -->
 <article>
   <header>
+    <h1 class="entry-title"><a href="/2016/06/10/apache-kudu-0-9-0-released.html">Apache Kudu (incubating) 0.9.0 released</a></h1>
+    <p class="meta">Posted 10 Jun 2016 by Jean-Daniel Cryans</p>
+  </header>
+  <div class="entry-content">
+    
+    <p>The Apache Kudu (incubating) team is happy to announce the release of Kudu
+0.9.0!</p>
+
+<p>This latest version adds basic UPSERT functionality and an improved Apache Spark Data Source
+that doesn’t rely on the MapReduce I/O formats. It also improves Tablet Server
+restart time as well as write performance under high load. Finally, Kudu now enforces
+the specification of a partitioning scheme for new tables.</p>
+
+<ul>
+  <li>Read the detailed <a href="http://kudu.apache.org/releases/0.9.0/docs/release_notes.html">Kudu 0.9.0 release notes</a></li>
+  <li>Download the <a href="http://kudu.apache.org/releases/0.9.0/">Kudu 0.9.0 source release</a></li>
+</ul>
+
+
+    
+  </div>
+  <div class="read-full">
+    <a class="btn btn-info" href="/2016/06/10/apache-kudu-0-9-0-released.html">Read full post...</a>
+  </div>
+</article>
+
+
+
+<!-- Articles -->
+<article>
+  <header>
     <h1 class="entry-title"><a href="/2016/06/06/weekly-update.html">Apache Kudu (incubating) Weekly Update June 6, 2016</a></h1>
     <p class="meta">Posted 06 Jun 2016 by Jean-Daniel Cryans</p>
   </header>
@@ -200,27 +231,6 @@ covers ongoing development and news in the Apache Kudu (incubating) project.</p>
 
 
 
-<!-- Articles -->
-<article>
-  <header>
-    <h1 class="entry-title"><a href="/2016/05/16/weekly-update.html">Apache Kudu (incubating) Weekly Update May 16, 2016</a></h1>
-    <p class="meta">Posted 16 May 2016 by Todd Lipcon</p>
-  </header>
-  <div class="entry-content">
-    
-    <p>Welcome to the ninth edition of the Kudu Weekly Update. This weekly blog post
-covers ongoing development and news in the Apache Kudu (incubating) project.</p>
-
-
-    
-  </div>
-  <div class="read-full">
-    <a class="btn btn-info" href="/2016/05/16/weekly-update.html">Read full post...</a>
-  </div>
-</article>
-
-
-
 <!-- Pagination links -->
 
 <nav>
@@ -241,6 +251,8 @@ covers ongoing development and news in the Apache Kudu (incubating) project.</p>
     <h3>Recent posts</h3>
     <ul>
     
+      <li> <a href="/2018/09/26/index-skip-scan-optimization-in-kudu.html">Index Skip Scan Optimization in Kudu</a> </li>
+    
       <li> <a href="/2018/09/11/simplified-pipelines-with-kudu.html">Simplified Data Pipelines with Kudu</a> </li>
     
       <li> <a href="/2018/08/06/getting-started-with-kudu-an-oreilly-title.html">Getting Started with Kudu - an O'Reilly Title</a> </li>
@@ -269,8 +281,6 @@ covers ongoing development and news in the Apache Kudu (incubating) project.</p>
     
       <li> <a href="/2016/11/01/weekly-update.html">Apache Kudu Weekly Update November 1st, 2016</a> </li>
     
-      <li> <a href="/2016/10/20/weekly-update.html">Apache Kudu Weekly Update October 20th, 2016</a> </li>
-    
     </ul>
   </div>
 </div>

http://git-wip-us.apache.org/repos/asf/kudu-site/blob/12782cec/blog/page/9/index.html
----------------------------------------------------------------------
diff --git a/blog/page/9/index.html b/blog/page/9/index.html
index ce14d37..85c617d 100644
--- a/blog/page/9/index.html
+++ b/blog/page/9/index.html
@@ -117,6 +117,27 @@
 <!-- Articles -->
 <article>
   <header>
+    <h1 class="entry-title"><a href="/2016/05/16/weekly-update.html">Apache Kudu (incubating) Weekly Update May 16, 2016</a></h1>
+    <p class="meta">Posted 16 May 2016 by Todd Lipcon</p>
+  </header>
+  <div class="entry-content">
+    
+    <p>Welcome to the ninth edition of the Kudu Weekly Update. This weekly blog post
+covers ongoing development and news in the Apache Kudu (incubating) project.</p>
+
+
+    
+  </div>
+  <div class="read-full">
+    <a class="btn btn-info" href="/2016/05/16/weekly-update.html">Read full post...</a>
+  </div>
+</article>
+
+
+
+<!-- Articles -->
+<article>
+  <header>
     <h1 class="entry-title"><a href="/2016/05/09/weekly-update.html">Apache Kudu (incubating) Weekly Update May 9, 2016</a></h1>
     <p class="meta">Posted 09 May 2016 by Jean-Daniel Cryans</p>
   </header>
@@ -197,29 +218,6 @@ covers ongoing development and news in the Apache Kudu (incubating) project.</p>
 
 
 
-<!-- Articles -->
-<article>
-  <header>
-    <h1 class="entry-title"><a href="/2016/04/19/kudu-0-8-0-predicate-improvements.html">Predicate Improvements in Kudu 0.8</a></h1>
-    <p class="meta">Posted 19 Apr 2016 by Dan Burkert</p>
-  </header>
-  <div class="entry-content">
-    
-    <p>The recently released Kudu version 0.8 ships with a host of new improvements to
-scan predicates. Performance and usability have been improved, especially for
-tables taking advantage of <a href="http://kudu.apache.org/docs/schema_design.html#data-distribution">advanced partitioning
-options</a>.</p>
-
-
-    
-  </div>
-  <div class="read-full">
-    <a class="btn btn-info" href="/2016/04/19/kudu-0-8-0-predicate-improvements.html">Read full post...</a>
-  </div>
-</article>
-
-
-
 <!-- Pagination links -->
 
 <nav>
@@ -240,6 +238,8 @@ options</a>.</p>
     <h3>Recent posts</h3>
     <ul>
     
+      <li> <a href="/2018/09/26/index-skip-scan-optimization-in-kudu.html">Index Skip Scan Optimization in Kudu</a> </li>
+    
       <li> <a href="/2018/09/11/simplified-pipelines-with-kudu.html">Simplified Data Pipelines with Kudu</a> </li>
     
       <li> <a href="/2018/08/06/getting-started-with-kudu-an-oreilly-title.html">Getting Started with Kudu - an O'Reilly Title</a> </li>
@@ -268,8 +268,6 @@ options</a>.</p>
     
       <li> <a href="/2016/11/01/weekly-update.html">Apache Kudu Weekly Update November 1st, 2016</a> </li>
     
-      <li> <a href="/2016/10/20/weekly-update.html">Apache Kudu Weekly Update October 20th, 2016</a> </li>
-    
     </ul>
   </div>
 </div>

http://git-wip-us.apache.org/repos/asf/kudu-site/blob/12782cec/faq.html
----------------------------------------------------------------------
diff --git a/faq.html b/faq.html
index 5cd722f..65885fa 100644
--- a/faq.html
+++ b/faq.html
@@ -345,8 +345,8 @@ enforcing “external consistency” in two different ways: one that optimizes f
 requires the user to perform additional work and another that requires no additional
 work but can result in some additional latency.</li>
   <li>Scans have “Read Committed” consistency by default. If the user requires strict-serializable
-scans, they can choose the <code>READ_AT_SNAPSHOT</code> mode and, optionally, provide a timestamp. The default
-option is non-blocking, but the <code>READ_AT_SNAPSHOT</code> option may block when reading from non-leader
+scans, they can choose the <code class="highlighter-rouge">READ_AT_SNAPSHOT</code> mode and, optionally, provide a timestamp. The default
+option is non-blocking, but the <code class="highlighter-rouge">READ_AT_SNAPSHOT</code> option may block when reading from non-leader
 replicas (see the sketch below).</li>
 </ul>
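+
+<p>As an illustrative sketch (not part of the FAQ itself; the scanner-builder calls are from
+the Kudu Java client, while <code class="highlighter-rouge">table</code> and <code class="highlighter-rouge">timestampMicros</code> are placeholders), choosing
+the snapshot read mode looks roughly like this:</p>
+
+<pre><code class="language-java">KuduScanner scanner = client.newScannerBuilder(table)
+    .readMode(AsyncKuduScanner.ReadMode.READ_AT_SNAPSHOT)
+    .snapshotTimestampMicros(timestampMicros) // optional; omit to scan at the current time
+    .build();
+</code></pre>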
 
@@ -369,7 +369,7 @@ further information and caveats.</p>
 
 <p>Kudu provides direct access via Java and C++ APIs. An experimental Python API is
 also available and is expected to be fully supported in the future. The easiest
-way to load data into Kudu is to use a <code>CREATE TABLE ... AS SELECT * FROM ...</code>
+way to load data into Kudu is to use a <code class="highlighter-rouge">CREATE TABLE ... AS SELECT * FROM ...</code>
 statement in Impala. Although Kudu has not been extensively tested to work with
 ingest tools such as Flume, Sqoop, or Kafka, several of these have been
 experimentally tested. Explicit support for these ingest tools is expected with
@@ -378,7 +378,7 @@ Kudu’s first generally available release.</p>
 <h4 id="whats-the-most-efficient-way-to-bulk-load-data-into-kudu">What’s the most efficient way to bulk load data into Kudu?</h4>
 
 <p>The easiest way to load data into Kudu is if the data is already managed by Impala.
-In this case, a simple <code>INSERT INTO TABLE some_kudu_table SELECT * FROM some_csv_table</code>
+In this case, a simple <code class="highlighter-rouge">INSERT INTO TABLE some_kudu_table SELECT * FROM some_csv_table</code>
 does the trick.</p>
 
 <p>You can also use Kudu’s MapReduce OutputFormat to load data from HDFS, HBase, or
@@ -530,8 +530,8 @@ features.</p>
 Impala can help if you have it available. You can use it to copy your data into
 Parquet format using a statement like:</p>
 
-<pre><code>INSERT INTO TABLE some_parquet_table SELECT * FROM kudu_table
-</code></pre>
+<div class="highlighter-rouge">INSERT INTO TABLE some_parquet_table SELECT * FROM kudu_table
+</div>
 
 <p>then use <a href="http://hadoop.apache.org/docs/r1.2.1/distcp2.html">distcp</a>
 to copy the Parquet data to another cluster.</p>

http://git-wip-us.apache.org/repos/asf/kudu-site/blob/12782cec/feed.xml
----------------------------------------------------------------------
diff --git a/feed.xml b/feed.xml
index 218dfab..49afcef 100644
--- a/feed.xml
+++ b/feed.xml
@@ -1,4 +1,107 @@
-<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom"><generator uri="http://jekyllrb.com" version="2.5.3">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2018-09-11T17:54:59+02:00</updated><id>/</id><entry><title>Simplified Data Pipelines with Kudu</title><link href="/2018/09/11/simplified-pipelines-with-kudu.html" rel="alternate" type="text/html" title="Simplified Data Pipelines with Kudu" /><published>2018-09-11T00:00:00+02:00</published><updated>2018-09-11T00:00:00+02:00</updated><id>/2018/09/11/simplified-pipelines-with-kudu</id><content type="html" xml:base="/2018/09/11/simplified-pipelines-with-kudu.html">&lt;p&gt;I’ve been working with Hadoop now for over seven years and fortunately, or unfortunately, have run
+<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom"><generator uri="http://jekyllrb.com" version="2.5.3">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2018-09-26T10:55:43-07:00</updated><id>/</id><entry><title>Index Skip Scan Optimization in Kudu</title><link href="/2018/09/26/index-skip-scan-optimization-in-kudu.html" rel="alternate" type="text/html" title="Index Skip Scan Optimization in Kudu" /><published>2018-09-26T00:00:00-07:00</published><updated>2018-09-26T00:00:00-07:00</updated><id>/2018/09/26/index-skip-scan-optimization-in-kudu</id><content type="html" xml:base="/2018/09/26/index-skip-scan-optimization-in-kudu.html">&lt;p&gt;This summer I got the opportunity to intern with the Apache Kudu team at Cloudera.
+My project was to optimize the Kudu scan path by implementing a technique called
+index skip scan (a.k.a. scan-to-seek, see section 4.1 in [1]). I wanted to share
+my experience and the progress we’ve made so far on the approach.&lt;/p&gt;
+
+&lt;!--more--&gt;
+
+&lt;p&gt;Let’s begin by discussing the current query flow in Kudu.
+Consider the following table:&lt;/p&gt;
+
+&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span class=&quot;k&quot;&gt;CREATE&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;TABLE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;metrics&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
+    &lt;span class=&quot;k&quot;&gt;host&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;STRING&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;tstamp&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;INT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
+    &lt;span class=&quot;n&quot;&gt;clusterid&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;INT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
+    &lt;span class=&quot;k&quot;&gt;role&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;STRING&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
+    &lt;span class=&quot;k&quot;&gt;PRIMARY&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;KEY&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;host&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tstamp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clusterid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
+&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
+
+&lt;p&gt;&lt;img src=&quot;/img/index-skip-scan/example-table.png&quot; alt=&quot;png&quot; class=&quot;img-responsive&quot; /&gt;
+&lt;em&gt;Sample rows of table &lt;code class=&quot;highlighter-rouge&quot;&gt;metrics&lt;/code&gt; (sorted by key columns).&lt;/em&gt;&lt;/p&gt;
+
+&lt;p&gt;In this case, by default, Kudu internally builds a primary key index (implemented as a
+&lt;a href=&quot;https://en.wikipedia.org/wiki/B-tree&quot;&gt;B-tree&lt;/a&gt;) for the table &lt;code class=&quot;highlighter-rouge&quot;&gt;metrics&lt;/code&gt;.
+As shown in the table above, the index data is sorted by the composite of all key columns.
+When the user query contains the first key column (&lt;code class=&quot;highlighter-rouge&quot;&gt;host&lt;/code&gt;), Kudu uses the index (as the index data is
+primarily sorted on the first key column).&lt;/p&gt;
+
+&lt;p&gt;Now, what if the user query does not contain the first key column and instead only contains the &lt;code class=&quot;highlighter-rouge&quot;&gt;tstamp&lt;/code&gt; column?
+In that case, the &lt;code class=&quot;highlighter-rouge&quot;&gt;tstamp&lt;/code&gt; values are sorted within each &lt;code class=&quot;highlighter-rouge&quot;&gt;host&lt;/code&gt;,
+but are not globally sorted, so it’s non-trivial to use the index to filter rows.
+Instead, a full tablet scan is done by default. Other databases may optimize such scans by building secondary indexes
+(though building one on a column already covered by the primary key might seem redundant). However, this isn’t an option for Kudu,
+given its lack of secondary index support.&lt;/p&gt;
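+
+&lt;p&gt;To see this concretely, here is a toy illustration in plain Python (made-up sample
+data, not Kudu code): in rows sorted by the composite key, the
+&lt;code class=&quot;highlighter-rouge&quot;&gt;tstamp&lt;/code&gt; values are ordered only within each
+&lt;code class=&quot;highlighter-rouge&quot;&gt;host&lt;/code&gt;.&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;# Toy illustration (not Kudu code): rows sorted by the
+# composite primary key (host, tstamp, clusterid).
+rows = sorted([
+    ('helium', 100, 1), ('helium', 110, 2),
+    ('neon',   100, 3), ('neon',    90, 4),
+    ('argon',  100, 5), ('argon',  105, 6),
+])
+
+tstamps = [tstamp for (host, tstamp, clusterid) in rows]
+print(tstamps)  # [100, 105, 100, 110, 90, 100] -- sorted per host, not globally
+&lt;/div&gt;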
+
+&lt;p&gt;The question is, can Kudu do better than a full tablet scan here?&lt;/p&gt;
+
+&lt;p&gt;The answer is yes! Let’s observe the column preceding the &lt;code class=&quot;highlighter-rouge&quot;&gt;tstamp&lt;/code&gt; column. We will refer to it as the
+“prefix column” and its specific value as the “prefix key”. In this example, &lt;code class=&quot;highlighter-rouge&quot;&gt;host&lt;/code&gt; is the prefix column.
+Note that the prefix keys are sorted in the index and that all rows of a given prefix key are also sorted by the
+remaining key columns. Therefore, we can use the index to skip to the rows that have distinct prefix keys,
+and also satisfy the predicate on the &lt;code class=&quot;highlighter-rouge&quot;&gt;tstamp&lt;/code&gt; column.
+For example, consider the query:&lt;/p&gt;
+
+&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-sql&quot; data-lang=&quot;sql&quot;&gt;&lt;span class=&quot;k&quot;&gt;SELECT&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;clusterid&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;FROM&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;metrics&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;WHERE&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tstamp&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
+
+&lt;p&gt;&lt;img src=&quot;/img/index-skip-scan/skip-scan-example-table.png&quot; alt=&quot;png&quot; class=&quot;img-responsive&quot; /&gt;
+&lt;em&gt;Skip scan flow illustration. The rows in green are scanned and the rest are skipped.&lt;/em&gt;&lt;/p&gt;
+
+&lt;p&gt;The tablet server can use the index to &lt;strong&gt;skip&lt;/strong&gt; to the first row with a distinct prefix key (&lt;code class=&quot;highlighter-rouge&quot;&gt;host = helium&lt;/code&gt;) that
+matches the predicate (&lt;code class=&quot;highlighter-rouge&quot;&gt;tstamp = 100&lt;/code&gt;) and then &lt;strong&gt;scan&lt;/strong&gt; through the rows until the predicate no longer matches. At that
+point we would know that no more rows with &lt;code class=&quot;highlighter-rouge&quot;&gt;host = helium&lt;/code&gt; will satisfy the predicate, and we can skip to the next
+prefix key. This holds true for all distinct keys of &lt;code class=&quot;highlighter-rouge&quot;&gt;host&lt;/code&gt;. Hence, this method is popularly known as
+&lt;strong&gt;skip scan optimization&lt;/strong&gt; [2, 3].&lt;/p&gt;
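+
+&lt;p&gt;To make the seek/scan alternation concrete, below is a simplified, standalone Python
+sketch of the technique over sorted (host, tstamp, clusterid) tuples. It illustrates the
+idea only: in Kudu the “seek” is a primary key index seek inside the tablet, not a binary
+search over an in-memory list.&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;from bisect import bisect_left
+
+def skip_scan(rows, target):
+    # rows: (host, tstamp, clusterid) tuples, sorted by the composite key.
+    out, i = [], 0
+    while i &lt; len(rows):
+        host = rows[i][0]
+        # Seek to the first row of this prefix key with tstamp &gt;= target.
+        j = bisect_left(rows, (host, target), i)
+        # Scan while the predicate still matches within this prefix key.
+        while j &lt; len(rows) and rows[j][0] == host and rows[j][1] == target:
+            out.append(rows[j])
+            j += 1
+        # Skip to the first row of the next distinct prefix key.
+        i = bisect_left(rows, (host, float('inf')), i)
+    return out
+&lt;/div&gt;
+
+&lt;p&gt;On rows like the sample table above, &lt;code class=&quot;highlighter-rouge&quot;&gt;skip_scan(rows, 100)&lt;/code&gt;
+inspects only a handful of positions instead of every row.&lt;/p&gt;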
+
+&lt;h1 id=&quot;performance&quot;&gt;Performance&lt;/h1&gt;
+
+&lt;p&gt;This optimization can speed up queries significantly, depending on the cardinality (number of distinct values) of the
+prefix column. The lower the prefix column cardinality, the better the skip scan performance. In fact, when the
+prefix column cardinality is high, skip scan is not a viable approach. The performance graph (obtained using the example
+schema and query pattern mentioned earlier) is shown below.&lt;/p&gt;
+
+&lt;p&gt;In our experiments on tablets of up to 10 million rows (as shown below), skip scan performance
+begins to fall behind full tablet scan performance once the prefix column cardinality
+exceeds sqrt(number_of_rows_in_tablet).
+Therefore, to retain the benefits of skip scan where possible while keeping performance predictable at
+high prefix column cardinality, we have tentatively chosen to dynamically disable skip scan once the number of skips to
+distinct prefix keys exceeds sqrt(number_of_rows_in_tablet).
+It will be an interesting project to further explore sophisticated heuristics to decide when
+to dynamically disable skip scan.&lt;/p&gt;
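+
+&lt;p&gt;In sketch form, the guard is a one-liner (the names here are hypothetical, not the
+identifiers used in the patch):&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;import math
+
+def keep_skip_scanning(num_skips, rows_in_tablet):
+    # Tentative heuristic: fall back to a full tablet scan once the number
+    # of skips to distinct prefix keys crosses sqrt(#rows in the tablet).
+    return num_skips &lt;= math.sqrt(rows_in_tablet)
+&lt;/div&gt;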
+
+&lt;p&gt;&lt;img src=&quot;/img/index-skip-scan/skip-scan-performance-graph.png&quot; alt=&quot;png&quot; class=&quot;img-responsive&quot; /&gt;&lt;/p&gt;
+
+&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;
+
+&lt;p&gt;Skip scan optimization in Kudu can lead to large performance benefits that scale with the size of
+data in Kudu tablets. The work is available as a work-in-progress &lt;a href=&quot;https://gerrit.cloudera.org/#/c/10983/&quot;&gt;patch&lt;/a&gt;.
+The implementation in the patch currently handles only equality predicates on the non-first primary key
+columns. Note that although the example above has a single prefix
+column (&lt;code class=&quot;highlighter-rouge&quot;&gt;host&lt;/code&gt;), the approach generalizes to any number of prefix columns.&lt;/p&gt;
+
+&lt;p&gt;This work also lays the groundwork to leverage the skip scan approach and optimize query processing time in the
+following use cases:&lt;/p&gt;
+
+&lt;ul&gt;
+  &lt;li&gt;Range predicates&lt;/li&gt;
+  &lt;li&gt;In-list predicates&lt;/li&gt;
+&lt;/ul&gt;
+
+&lt;p&gt;This was my first time working on an open source project. I thoroughly enjoyed working on this challenging problem,
+right from understanding the scan path in Kudu to working on a full-fledged implementation of
+the skip scan optimization. I am very grateful to the Kudu team for guiding and supporting me throughout the
+internship period.&lt;/p&gt;
+
+&lt;h1 id=&quot;references&quot;&gt;References&lt;/h1&gt;
+
+&lt;p&gt;&lt;a href=&quot;https://storage.googleapis.com/pub-tools-public-publication-data/pdf/42851.pdf&quot;&gt;[1]&lt;/a&gt;: Gupta, Ashish, et al. “Mesa:
+Geo-replicated, near real-time, scalable data warehousing.” Proceedings of the VLDB Endowment 7.12 (2014): 1259-1270.&lt;/p&gt;
+
+&lt;p&gt;&lt;a href=&quot;https://oracle-base.com/articles/9i/index-skip-scanning/&quot;&gt;[2]&lt;/a&gt;: Index Skip Scanning - Oracle Database&lt;/p&gt;
+
+&lt;p&gt;&lt;a href=&quot;https://www.sqlite.org/optoverview.html#skipscan&quot;&gt;[3]&lt;/a&gt;: Skip Scan - SQLite&lt;/p&gt;</content><author><name>Anupama Gupta</name></author><summary>This summer I got the opportunity to intern with the Apache Kudu team at Cloudera.
+My project was to optimize the Kudu scan path by implementing a technique called
+index skip scan (a.k.a. scan-to-seek, see section 4.1 in [1]). I wanted to share
+my experience and the progress we’ve made so far on the approach.</summary></entry><entry><title>Simplified Data Pipelines with Kudu</title><link href="/2018/09/11/simplified-pipelines-with-kudu.html" rel="alternate" type="text/html" title="Simplified Data Pipelines with Kudu" /><published>2018-09-11T00:00:00-07:00</published><updated>2018-09-11T00:00:00-07:00</updated><id>/2018/09/11/simplified-pipelines-with-kudu</id><content type="html" xml:base="/2018/09/11/simplified-pipelines-with-kudu.html">&lt;p&gt;I’ve been working with Hadoop now for over seven years and fortunately, or unfortunately, have run
 across a lot of structured data use cases.  What we, at &lt;a href=&quot;https://phdata.io/&quot;&gt;phData&lt;/a&gt;, have found is
 that end users are typically comfortable with tabular data and prefer to access their data in a
 structured manner using tables.
@@ -38,7 +141,7 @@ and users to focus on solving business problems, rather than being bothered by t
 the backend.&lt;/p&gt;</content><author><name>Mac Noland</name></author><summary>I’ve been working with Hadoop now for over seven years and fortunately, or unfortunately, have run
 across a lot of structured data use cases.  What we, at phData, have found is
 that end users are typically comfortable with tabular data and prefer to access their data in a
-structured manner using tables.</summary></entry><entry><title>Getting Started with Kudu - an O’Reilly Title</title><link href="/2018/08/06/getting-started-with-kudu-an-oreilly-title.html" rel="alternate" type="text/html" title="Getting Started with Kudu - an O&#39;Reilly Title" /><published>2018-08-06T00:00:00+02:00</published><updated>2018-08-06T00:00:00+02:00</updated><id>/2018/08/06/getting-started-with-kudu-an-oreilly-title</id><content type="html" xml:base="/2018/08/06/getting-started-with-kudu-an-oreilly-title.html">&lt;p&gt;The following article by Brock Noland was reposted from the
+structured manner using tables.</summary></entry><entry><title>Getting Started with Kudu - an O’Reilly Title</title><link href="/2018/08/06/getting-started-with-kudu-an-oreilly-title.html" rel="alternate" type="text/html" title="Getting Started with Kudu - an O&#39;Reilly Title" /><published>2018-08-06T00:00:00-07:00</published><updated>2018-08-06T00:00:00-07:00</updated><id>/2018/08/06/getting-started-with-kudu-an-oreilly-title</id><content type="html" xml:base="/2018/08/06/getting-started-with-kudu-an-oreilly-title.html">&lt;p&gt;The following article by Brock Noland was reposted from the
 &lt;a href=&quot;https://www.phdata.io/getting-started-with-kudu/&quot;&gt;phData&lt;/a&gt;
 blog with their permission.&lt;/p&gt;
 
@@ -52,9 +155,9 @@ challenge at that time.
 In that context, on October 11th, 2012, Todd Lipcon performed Apache Kudu’s initial
 commit. The commit message was:&lt;/p&gt;
 
-&lt;pre&gt;&lt;code&gt;Code for writing cfiles seems to basically work
+&lt;div class=&quot;highlighter-rouge&quot;&gt;Code for writing cfiles seems to basically work
 Need to write code for reading cfiles, still
-&lt;/code&gt;&lt;/pre&gt;
+&lt;/div&gt;
 
 &lt;p&gt;And Kudu development was off and running. Around this same time Todd, on his
 internal Wiki page, started listing out the papers he was reading to develop
@@ -90,7 +193,7 @@ of Kudu. Specifically you will learn:&lt;/p&gt;
 
 &lt;p&gt;Looking forward, I am excited to see Kudu gain additional features and adoption
 and eventually the second revision of this title. In the meantime, if you have
-feedback or questions, please reach out on the &lt;code&gt;#getting-started-kudu&lt;/code&gt; channel of
+feedback or questions, please reach out on the &lt;code class=&quot;highlighter-rouge&quot;&gt;#getting-started-kudu&lt;/code&gt; channel of
 the &lt;a href=&quot;https://getkudu-slack.herokuapp.com/&quot;&gt;Kudu Slack&lt;/a&gt; or if you prefer non-real-time
 communication, please use the user@ mailing list!&lt;/p&gt;</content><author><name>Brock Noland</name></author><summary>The following article by Brock Noland was reposted from the
 phData
@@ -101,7 +204,7 @@ Hadoop platform was hard. Organizations required strong Software Engineering
 capabilities to successfully implement complex Lambda architectures or even
 simply implement continuous ingest. Updating or deleting data was simply a
 nightmare. General Data Protection Regulation (GDPR) would have been an extreme
-challenge at that time.</summary></entry><entry><title>Instrumentation in Apache Kudu</title><link href="/2018/07/10/instrumentation-in-kudu.html" rel="alternate" type="text/html" title="Instrumentation in Apache Kudu" /><published>2018-07-10T00:00:00+02:00</published><updated>2018-07-10T00:00:00+02:00</updated><id>/2018/07/10/instrumentation-in-kudu</id><content type="html" xml:base="/2018/07/10/instrumentation-in-kudu.html">&lt;p&gt;Last week, the &lt;a href=&quot;http://opentracing.io/&quot;&gt;OpenTracing&lt;/a&gt; community invited me to
+challenge at that time.</summary></entry><entry><title>Instrumentation in Apache Kudu</title><link href="/2018/07/10/instrumentation-in-kudu.html" rel="alternate" type="text/html" title="Instrumentation in Apache Kudu" /><published>2018-07-10T00:00:00-07:00</published><updated>2018-07-10T00:00:00-07:00</updated><id>/2018/07/10/instrumentation-in-kudu</id><content type="html" xml:base="/2018/07/10/instrumentation-in-kudu.html">&lt;p&gt;Last week, the &lt;a href=&quot;http://opentracing.io/&quot;&gt;OpenTracing&lt;/a&gt; community invited me to
 their monthly Google Hangout meetup to give an informal talk on tracing and
 instrumentation in Apache Kudu.&lt;/p&gt;
 
@@ -136,7 +239,7 @@ While Kudu doesn’t currently support distributed tracing using OpenTracing,
 it does have quite a lot of other types of instrumentation, metrics, and
 diagnostics logging. The OpenTracing team was interested to hear about some of
 the approaches that Kudu has used, and so I gave a brief introduction to topics
-including:</summary></entry><entry><title>Apache Kudu 1.7.0 released</title><link href="/2018/03/23/apache-kudu-1-7-0-released.html" rel="alternate" type="text/html" title="Apache Kudu 1.7.0 released" /><published>2018-03-23T00:00:00+01:00</published><updated>2018-03-23T00:00:00+01:00</updated><id>/2018/03/23/apache-kudu-1-7-0-released</id><content type="html" xml:base="/2018/03/23/apache-kudu-1-7-0-released.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.7.0!&lt;/p&gt;
+including:</summary></entry><entry><title>Apache Kudu 1.7.0 released</title><link href="/2018/03/23/apache-kudu-1-7-0-released.html" rel="alternate" type="text/html" title="Apache Kudu 1.7.0 released" /><published>2018-03-23T00:00:00-07:00</published><updated>2018-03-23T00:00:00-07:00</updated><id>/2018/03/23/apache-kudu-1-7-0-released</id><content type="html" xml:base="/2018/03/23/apache-kudu-1-7-0-released.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.7.0!&lt;/p&gt;
 
 &lt;p&gt;Apache Kudu 1.7.0 is a minor release that offers new features, performance
 optimizations, incremental improvements, and bug fixes.&lt;/p&gt;
@@ -207,7 +310,7 @@ Maven repository and are
 Apache Kudu 1.7.0 is a minor release that offers new features, performance
 optimizations, incremental improvements, and bug fixes.
 
-Release highlights:</summary></entry><entry><title>Apache Kudu 1.6.0 released</title><link href="/2017/12/08/apache-kudu-1-6-0-released.html" rel="alternate" type="text/html" title="Apache Kudu 1.6.0 released" /><published>2017-12-08T00:00:00+01:00</published><updated>2017-12-08T00:00:00+01:00</updated><id>/2017/12/08/apache-kudu-1-6-0-released</id><content type="html" xml:base="/2017/12/08/apache-kudu-1-6-0-released.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.6.0!&lt;/p&gt;
+Release highlights:</summary></entry><entry><title>Apache Kudu 1.6.0 released</title><link href="/2017/12/08/apache-kudu-1-6-0-released.html" rel="alternate" type="text/html" title="Apache Kudu 1.6.0 released" /><published>2017-12-08T00:00:00-08:00</published><updated>2017-12-08T00:00:00-08:00</updated><id>/2017/12/08/apache-kudu-1-6-0-released</id><content type="html" xml:base="/2017/12/08/apache-kudu-1-6-0-released.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.6.0!&lt;/p&gt;
 
 &lt;p&gt;Apache Kudu 1.6.0 is a minor release that offers new features, performance
 optimizations, incremental improvements, and bug fixes.&lt;/p&gt;
@@ -266,7 +369,7 @@ Maven repository and are
 Apache Kudu 1.6.0 is a minor release that offers new features, performance
 optimizations, incremental improvements, and bug fixes.
 
-Release highlights:</summary></entry><entry><title>Slides: A brave new world in mutable big data: Relational storage</title><link href="/2017/10/23/nosql-kudu-spanner-slides.html" rel="alternate" type="text/html" title="Slides: A brave new world in mutable big data: Relational storage" /><published>2017-10-23T00:00:00+02:00</published><updated>2017-10-23T00:00:00+02:00</updated><id>/2017/10/23/nosql-kudu-spanner-slides</id><content type="html" xml:base="/2017/10/23/nosql-kudu-spanner-slides.html">&lt;p&gt;Since the Apache Kudu project made its debut in 2015, there have been
+Release highlights:</summary></entry><entry><title>Slides: A brave new world in mutable big data: Relational storage</title><link href="/2017/10/23/nosql-kudu-spanner-slides.html" rel="alternate" type="text/html" title="Slides: A brave new world in mutable big data: Relational storage" /><published>2017-10-23T00:00:00-07:00</published><updated>2017-10-23T00:00:00-07:00</updated><id>/2017/10/23/nosql-kudu-spanner-slides</id><content type="html" xml:base="/2017/10/23/nosql-kudu-spanner-slides.html">&lt;p&gt;Since the Apache Kudu project made its debut in 2015, there have been
 a few common questions that kept coming up at every presentation:&lt;/p&gt;
 
 &lt;ul&gt;
@@ -326,7 +429,7 @@ a few common questions that kept coming up at every presentation:
 
   Is Kudu an open source version of Google’s Spanner system?
   Is Kudu NoSQL or SQL?
-  Why does Kudu have a relational data model? Isn’t SQL dead?</summary></entry><entry><title>Consistency in Apache Kudu, Part 1</title><link href="/2017/09/18/kudu-consistency-pt1.html" rel="alternate" type="text/html" title="Consistency in Apache Kudu, Part 1" /><published>2017-09-18T00:00:00+02:00</published><updated>2017-09-18T00:00:00+02:00</updated><id>/2017/09/18/kudu-consistency-pt1</id><content type="html" xml:base="/2017/09/18/kudu-consistency-pt1.html">&lt;p&gt;In this series of short blog posts we will introduce Kudu’s consistency model,
+  Why does Kudu have a relational data model? Isn’t SQL dead?</summary></entry><entry><title>Consistency in Apache Kudu, Part 1</title><link href="/2017/09/18/kudu-consistency-pt1.html" rel="alternate" type="text/html" title="Consistency in Apache Kudu, Part 1" /><published>2017-09-18T00:00:00-07:00</published><updated>2017-09-18T00:00:00-07:00</updated><id>/2017/09/18/kudu-consistency-pt1</id><content type="html" xml:base="/2017/09/18/kudu-consistency-pt1.html">&lt;p&gt;In this series of short blog posts we will introduce Kudu’s consistency model,
 its design and ultimate goals, current features, and next steps.
 On the way, we’ll shed some light on the more relevant components and how they
 fit together.&lt;/p&gt;
@@ -445,29 +548,29 @@ have increasing timestamps, depending on the user’s choices.&lt;/p&gt;
 &lt;p&gt;Row mutations performed by a single client &lt;em&gt;instance&lt;/em&gt; are guaranteed to have increasing timestamps
 thus reflecting their potential causal relationship. This property is always enforced. However
 there are two major &lt;em&gt;“knobs”&lt;/em&gt; that are available to the user to make performance trade-offs, the
-&lt;code&gt;Read&lt;/code&gt; mode, and the &lt;code&gt;External Consistency&lt;/code&gt; mode (see &lt;a href=&quot;https://kudu.apache.org/docs/transaction_semantics.html&quot;&gt;here&lt;/a&gt;
+&lt;code class=&quot;highlighter-rouge&quot;&gt;Read&lt;/code&gt; mode, and the &lt;code class=&quot;highlighter-rouge&quot;&gt;External Consistency&lt;/code&gt; mode (see &lt;a href=&quot;https://kudu.apache.org/docs/transaction_semantics.html&quot;&gt;here&lt;/a&gt;
 for more information on how to use the relevant APIs).&lt;/p&gt;
 
-&lt;p&gt;The first and most important knob, the &lt;code&gt;Read&lt;/code&gt; mode, pertains to what is the guaranteed recency of
+&lt;p&gt;The first and most important knob, the &lt;code class=&quot;highlighter-rouge&quot;&gt;Read&lt;/code&gt; mode, governs the guaranteed recency of
 data resulting from scans. Since Kudu uses replication for availability and fault-tolerance, there
 are always multiple replicas of any data item.
 Not all replicas must be up-to-date, so if the user cares about recency, e.g. if the user requires
 that any data read includes all previously written data &lt;em&gt;from a single client instance&lt;/em&gt;, then it must
-choose the &lt;code&gt;READ_AT_SNAPSHOT&lt;/code&gt; read mode. With this mode enabled the client is guaranteed to observe
+choose the &lt;code class=&quot;highlighter-rouge&quot;&gt;READ_AT_SNAPSHOT&lt;/code&gt; read mode. With this mode enabled the client is guaranteed to observe
  &lt;strong&gt;“READ YOUR OWN WRITES”&lt;/strong&gt; semantics, i.e. scans from a client will always include all previous mutations
 performed by that client. Note that this property is local to a single client instance, not a global
 property.&lt;/p&gt;
 
-&lt;p&gt;The second “knob”, the &lt;code&gt;External Consistency&lt;/code&gt; mode, defines the semantics of how reads and writes
-are performed across multiple client instances. By default, &lt;code&gt;External Consistency&lt;/code&gt; is set to
- &lt;code&gt;CLIENT_PROPAGATED&lt;/code&gt;, meaning it’s up to the user to coordinate a set of &lt;em&gt;timestamp tokens&lt;/em&gt; with clients (even
+&lt;p&gt;The second “knob”, the &lt;code class=&quot;highlighter-rouge&quot;&gt;External Consistency&lt;/code&gt; mode, defines the semantics of how reads and writes
+are performed across multiple client instances. By default, &lt;code class=&quot;highlighter-rouge&quot;&gt;External Consistency&lt;/code&gt; is set to
+ &lt;code class=&quot;highlighter-rouge&quot;&gt;CLIENT_PROPAGATED&lt;/code&gt;, meaning it’s up to the user to coordinate a set of &lt;em&gt;timestamp tokens&lt;/em&gt; with clients (even
 across different machines) if they are performing writes/reads that are somehow causally linked.
 If done correctly this enables &lt;strong&gt;STRICT SERIALIZABILITY&lt;/strong&gt;[5], i.e. &lt;strong&gt;LINEARIZABILITY&lt;/strong&gt;[6] and
 &lt;strong&gt;SERIALIZABILITY&lt;/strong&gt;[7] at the same time, at the cost of having the user coordinate the timestamp
 tokens across clients (a survey of the meaning of these, and other definitions can be found
 &lt;a href=&quot;http://www.ics.forth.gr/tech-reports/2013/2013.TR439_Survey_on_Consistency_Conditions.pdf&quot;&gt;here&lt;/a&gt;).
-The alternative setting for &lt;code&gt;External Consistency&lt;/code&gt; is to have it set to
-&lt;code&gt;COMMIT_WAIT&lt;/code&gt; (experimental), which guarantees the same properties through a different means, by
+The alternative setting for &lt;code class=&quot;highlighter-rouge&quot;&gt;External Consistency&lt;/code&gt; is to have it set to
+&lt;code class=&quot;highlighter-rouge&quot;&gt;COMMIT_WAIT&lt;/code&gt; (experimental), which guarantees the same properties through a different means, by
 implementing Google Spanner’s &lt;em&gt;TrueTime&lt;/em&gt;. This comes at the cost of higher latency (depending on how
 tightly synchronized the system clocks of the various tablet servers are), but doesn’t require users
 to propagate timestamps programmatically.&lt;/p&gt;
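+
+&lt;p&gt;As a rough, hypothetical sketch of the &lt;code class=&quot;highlighter-rouge&quot;&gt;CLIENT_PROPAGATED&lt;/code&gt;
+pattern (made-up helper names, not the real client API; the clients expose getters and
+setters for the latest observed timestamp):&lt;/p&gt;
+
+&lt;div class=&quot;highlighter-rouge&quot;&gt;# Hypothetical pseudocode: propagate a timestamp token between
+# two causally linked clients under CLIENT_PROPAGATED.
+client_a.write(row)
+token = client_a.latest_observed_timestamp()  # capture after the write
+
+# ... the token travels with the causal message (RPC, queue, etc.) ...
+
+client_b.observe_timestamp(token)  # adopt the token before reading
+rows = client_b.scan(table)        # guaranteed to reflect the write
+&lt;/div&gt;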
@@ -505,7 +608,7 @@ On the way, we’ll shed some light on the more relevant components and how they
 fit together.
 
 In Part 1 of the series (this one), we’ll cover motivation and design trade-offs, the end goals and
-the current status.</summary></entry><entry><title>Apache Kudu 1.5.0 released</title><link href="/2017/09/08/apache-kudu-1-5-0-released.html" rel="alternate" type="text/html" title="Apache Kudu 1.5.0 released" /><published>2017-09-08T00:00:00+02:00</published><updated>2017-09-08T00:00:00+02:00</updated><id>/2017/09/08/apache-kudu-1-5-0-released</id><content type="html" xml:base="/2017/09/08/apache-kudu-1-5-0-released.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.5.0!&lt;/p&gt;
+the current status.</summary></entry><entry><title>Apache Kudu 1.5.0 released</title><link href="/2017/09/08/apache-kudu-1-5-0-released.html" rel="alternate" type="text/html" title="Apache Kudu 1.5.0 released" /><published>2017-09-08T00:00:00-07:00</published><updated>2017-09-08T00:00:00-07:00</updated><id>/2017/09/08/apache-kudu-1-5-0-released</id><content type="html" xml:base="/2017/09/08/apache-kudu-1-5-0-released.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.5.0!&lt;/p&gt;
 
 &lt;p&gt;Apache Kudu 1.5.0 is a minor release which offers several new features,
 improvements, optimizations, and bug fixes.&lt;/p&gt;
@@ -523,9 +626,9 @@ scenarios&lt;/li&gt;
 additional reductions planned for the future&lt;/li&gt;
   &lt;li&gt;a new configuration dashboard on the web UI which provides a high-level
 summary of important configuration values&lt;/li&gt;
-  &lt;li&gt;a new &lt;code&gt;kudu tablet move&lt;/code&gt; command which moves a tablet replica from one tablet
+  &lt;li&gt;a new &lt;code class=&quot;highlighter-rouge&quot;&gt;kudu tablet move&lt;/code&gt; command which moves a tablet replica from one tablet
 server to another&lt;/li&gt;
-  &lt;li&gt;a new &lt;code&gt;kudu local_replica data_size&lt;/code&gt; command which summarizes the space usage
+  &lt;li&gt;a new &lt;code class=&quot;highlighter-rouge&quot;&gt;kudu local_replica data_size&lt;/code&gt; command which summarizes the space usage
 of a local tablet&lt;/li&gt;
   &lt;li&gt;all on-disk data is now checksummed by default, which provides error detection
 for improved confidence when running Kudu on unreliable hardware&lt;/li&gt;
@@ -546,7 +649,7 @@ repository.&lt;/li&gt;
 Apache Kudu 1.5.0 is a minor release which offers several new features,
 improvements, optimizations, and bug fixes.
 
-Highlights include:</summary></entry><entry><title>Apache Kudu 1.4.0 released</title><link href="/2017/06/13/apache-kudu-1-4-0-released.html" rel="alternate" type="text/html" title="Apache Kudu 1.4.0 released" /><published>2017-06-13T00:00:00+02:00</published><updated>2017-06-13T00:00:00+02:00</updated><id>/2017/06/13/apache-kudu-1-4-0-released</id><content type="html" xml:base="/2017/06/13/apache-kudu-1-4-0-released.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.4.0!&lt;/p&gt;
+Highlights include:</summary></entry><entry><title>Apache Kudu 1.4.0 released</title><link href="/2017/06/13/apache-kudu-1-4-0-released.html" rel="alternate" type="text/html" title="Apache Kudu 1.4.0 released" /><published>2017-06-13T00:00:00-07:00</published><updated>2017-06-13T00:00:00-07:00</updated><id>/2017/06/13/apache-kudu-1-4-0-released</id><content type="html" xml:base="/2017/06/13/apache-kudu-1-4-0-released.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.4.0!&lt;/p&gt;
 
 &lt;p&gt;Apache Kudu 1.4.0 is a minor release which offers several new features,
 improvements, optimizations, and bug fixes.&lt;/p&gt;
@@ -560,7 +663,7 @@ improvements, optimizations, and bug fixes.&lt;/p&gt;
   &lt;li&gt;a new C++ client API to efficiently map primary keys to their associated partitions
 and hosts&lt;/li&gt;
   &lt;li&gt;support for long-running fault-tolerant scans in the Java client&lt;/li&gt;
-  &lt;li&gt;a new &lt;code&gt;kudu fs check&lt;/code&gt; command which can perform offline consistency checks
+  &lt;li&gt;a new &lt;code class=&quot;highlighter-rouge&quot;&gt;kudu fs check&lt;/code&gt; command which can perform offline consistency checks
 and repairs on the local on-disk storage of a Tablet Server or Master.&lt;/li&gt;
   &lt;li&gt;many optimizations to reduce disk space usage, improve write throughput,
 and improve throughput of background maintenance operations.&lt;/li&gt;
@@ -581,31 +684,4 @@ repository.&lt;/li&gt;
 Apache Kudu 1.4.0 is a minor release which offers several new features,
 improvements, optimizations, and bug fixes.
 
-Highlights include:</summary></entry><entry><title>Apache Kudu 1.3.1 released</title><link href="/2017/04/19/apache-kudu-1-3-1-released.html" rel="alternate" type="text/html" title="Apache Kudu 1.3.1 released" /><published>2017-04-19T00:00:00+02:00</published><updated>2017-04-19T00:00:00+02:00</updated><id>/2017/04/19/apache-kudu-1-3-1-released</id><content type="html" xml:base="/2017/04/19/apache-kudu-1-3-1-released.html">&lt;p&gt;The Apache Kudu team is happy to announce the release of Kudu 1.3.1!&lt;/p&gt;
-
-&lt;p&gt;Apache Kudu 1.3.1 is a bug fix release which fixes critical issues discovered
-in Apache Kudu 1.3.0. In particular, this fixes a bug in which data could be
-incorrectly deleted after certain sequences of node failures. Several other
-bugs are also fixed. See the release notes for details.&lt;/p&gt;
-
-&lt;p&gt;Users of Kudu 1.3.0 are encouraged to upgrade to 1.3.1 immediately.&lt;/p&gt;
-
-&lt;ul&gt;
-  &lt;li&gt;Download the &lt;a href=&quot;/releases/1.3.1/&quot;&gt;Kudu 1.3.1 source release&lt;/a&gt;&lt;/li&gt;
-  &lt;li&gt;Convenience binary artifacts for the Java client and various Java
-integrations (eg Spark, Flume) are also now available via the ASF Maven
-repository.&lt;/li&gt;
-&lt;/ul&gt;</content><author><name>Todd Lipcon</name></author><summary>The Apache Kudu team is happy to announce the release of Kudu 1.3.1!
-
-Apache Kudu 1.3.1 is a bug fix release which fixes critical issues discovered
-in Apache Kudu 1.3.0. In particular, this fixes a bug in which data could be
-incorrectly deleted after certain sequences of node failures. Several other
-bugs are also fixed. See the release notes for details.
-
-Users of Kudu 1.3.0 are encouraged to upgrade to 1.3.1 immediately.
-
-
-  Download the Kudu 1.3.1 source release
-  Convenience binary artifacts for the Java client and various Java
-integrations (eg Spark, Flume) are also now available via the ASF Maven
-repository.</summary></entry></feed>
+Highlights include:</summary></entry></feed>

http://git-wip-us.apache.org/repos/asf/kudu-site/blob/12782cec/img/index-skip-scan/example-table.png
----------------------------------------------------------------------
diff --git a/img/index-skip-scan/example-table.png b/img/index-skip-scan/example-table.png
new file mode 100644
index 0000000..585ae4d
Binary files /dev/null and b/img/index-skip-scan/example-table.png differ