You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@kudu.apache.org by mp...@apache.org on 2018/09/26 17:51:44 UTC

kudu git commit: Blogpost describing index skip scan optimization.

Repository: kudu
Updated Branches:
  refs/heads/gh-pages 9f7058cc7 -> 83530755d


Blogpost describing index skip scan optimization.

Link to the version with images:
https://github.com/AnupamaGupta01/kudu/blob/blogpost-2/_posts/2018-09-25-index-skip-scan-optimization-in-kudu.md

Change-Id: I2250652dcba3d1b0a06f1ffb7f23c11bf533d35e
Reviewed-on: http://gerrit.cloudera.org:8080/11263
Reviewed-by: Mike Percy <mp...@apache.org>
Tested-by: Mike Percy <mp...@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/83530755
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/83530755
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/83530755

Branch: refs/heads/gh-pages
Commit: 83530755dbcddd44debe9b3ade286f58b96d5c82
Parents: 9f7058c
Author: anupama <an...@cloudera.com>
Authored: Fri Aug 17 14:53:47 2018 -0700
Committer: Mike Percy <mp...@apache.org>
Committed: Wed Sep 26 17:51:16 2018 +0000

----------------------------------------------------------------------
 ...9-26-index-skip-scan-optimization-in-kudu.md | 114 +++++++++++++++++++
 img/index-skip-scan/example-table.png           | Bin 0 -> 73625 bytes
 img/index-skip-scan/skip-scan-example-table.png | Bin 0 -> 48512 bytes
 .../skip-scan-performance-graph.png             | Bin 0 -> 59142 bytes
 4 files changed, 114 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/83530755/_posts/2018-09-26-index-skip-scan-optimization-in-kudu.md
----------------------------------------------------------------------
diff --git a/_posts/2018-09-26-index-skip-scan-optimization-in-kudu.md b/_posts/2018-09-26-index-skip-scan-optimization-in-kudu.md
new file mode 100644
index 0000000..ba91926
--- /dev/null
+++ b/_posts/2018-09-26-index-skip-scan-optimization-in-kudu.md
@@ -0,0 +1,114 @@
+---
+layout: post
+title: "Index Skip Scan Optimization in Kudu"
+author: Anupama Gupta
+---
+
+This summer I got the opportunity to intern with the Apache Kudu team at Cloudera.
+My project was to optimize the Kudu scan path by implementing a technique called
+index skip scan (a.k.a. scan-to-seek, see section 4.1 in [1]). I wanted to share
+my experience and the progress we've made so far on the approach.
+
+<!--more-->
+
+Let's begin with discussing the current query flow in Kudu.
+Consider the following table:
+
+{% highlight SQL %}
+CREATE TABLE metrics (
+    host STRING,
+    tstamp INT,
+    clusterid INT,
+    role STRING,
+    PRIMARY KEY (host, tstamp, clusterid)
+);
+{% endhighlight %}
+
+![png]({{ site.github.url }}/img/index-skip-scan/example-table.png){: .img-responsive}
+*Sample rows of table `metrics` (sorted by key columns).*
+
+
+In this case, by default, Kudu internally builds a primary key index (implemented as a
+[B-tree](https://en.wikipedia.org/wiki/B-tree)) for the table `metrics`.
+As shown in the table above, the index data is sorted by the composite of all key columns.
+When the user query contains the first key column (`host`), Kudu uses the index (as the index data is
+primarily sorted on the first key column).
+
+Now, what if the user query does not contain the first key column and instead only contains the `tstamp` column?
+In the above case, the `tstamp` column values are sorted with respect to `host`,
+but are not globally sorted, and as such, it's non-trivial to use the index to filter rows.
+Instead, a full tablet scan is done by default. Other databases may optimize such scans by building secondary indexes
+(though it might be redundant to build one on one of the primary keys). However, this isn't an option for Kudu,
+given its lack of secondary index support.
+
+The question is, can Kudu do better than a full tablet scan here?
+
+The answer is yes! Let's observe the column preceding the `tstamp` column. We will refer to it as the
+"prefix column" and its specific value as the "prefix key". In this example, `host` is the prefix column.
+Note that the prefix keys are sorted in the index and that all rows of a given prefix key are also sorted by the
+remaining key columns. Therefore, we can use the index to skip to the rows that have distinct prefix keys,
+and also satisfy the predicate on the `tstamp` column.
+For example, consider the query:
+{% highlight SQL %}
+SELECT clusterid FROM metrics WHERE tstamp = 100;
+{% endhighlight %}
+
+![png]({{ site.github.url }}/img/index-skip-scan/skip-scan-example-table.png){: .img-responsive}
+*Skip scan flow illustration. The rows in green are scanned and the rest are skipped.*
+
+The tablet server can use the index to **skip** to the first row with a distinct prefix key (`host = helium`) that
+matches the predicate (`tstamp = 100`) and then **scan** through the rows until the predicate no longer matches. At that
+point we would know that no more rows with `host = helium` will satisfy the predicate, and we can skip to the next
+prefix key. This holds true for all distinct keys of `host`. Hence, this method is popularly known as
+**skip scan optimization**[2, 3].
+
+Performance
+==========
+
+This optimization can speed up queries significantly, depending on the cardinality (number of distinct values) of the
+prefix column. The lower the prefix column cardinality, the better the skip scan performance. In fact, when the
+prefix column cardinality is high, skip scan is not a viable approach. The performance graph (obtained using the example
+schema and query pattern mentioned earlier) is shown below.
+
+Based on our experiments, on up to 10 million rows per tablet (as shown below), we found that the skip scan performance
+begins to get worse with respect to the full tablet scan performance when the prefix column cardinality
+exceeds sqrt(number_of_rows_in_tablet).
+Therefore, in order to use skip scan performance benefits when possible and maintain a consistent performance in cases
+of large prefix column cardinality, we have tentatively chosen to dynamically disable skip scan when the number of skips for
+distinct prefix keys exceeds sqrt(number_of_rows_in_tablet).
+It will be an interesting project to further explore sophisticated heuristics to decide when
+to dynamically disable skip scan.
+
+![png]({{ site.github.url }}/img/index-skip-scan/skip-scan-performance-graph.png){: .img-responsive}
+
+Conclusion
+==========
+
+Skip scan optimization in Kudu can lead to huge performance benefits that scale with the size of
+data in Kudu tablets. This is a work-in-progress [patch](https://gerrit.cloudera.org/#/c/10983/).
+The implementation in the patch works only for equality predicates on the non-first primary key
+columns. An important point to note is that although, in the above specific example, the number of prefix
+columns is one (`host`), this approach is generalized to work with any number of prefix columns.
+
+This work also lays the groundwork to leverage the skip scan approach and optimize query processing time in the
+following use cases:
+
+- Range predicates
+- In-list predicates
+
+This was my first time working on an open source project. I thoroughly enjoyed working on this challenging problem,
+right from understanding the scan path in Kudu to working on a full-fledged implementation of
+the skip scan optimization. I am very grateful to the Kudu team for guiding and supporting me throughout the
+internship period.
+
+References
+==========
+
+[[1]](https://storage.googleapis.com/pub-tools-public-publication-data/pdf/42851.pdf): Gupta, Ashish, et al. "Mesa:
+Geo-replicated, near real-time, scalable data warehousing." Proceedings of the VLDB Endowment 7.12 (2014): 1259-1270.
+
+[[2]](https://oracle-base.com/articles/9i/index-skip-scanning/): Index Skip Scanning - Oracle Database
+
+[[3]](https://www.sqlite.org/optoverview.html#skipscan): Skip Scan - SQLite
+
+

http://git-wip-us.apache.org/repos/asf/kudu/blob/83530755/img/index-skip-scan/example-table.png
----------------------------------------------------------------------
diff --git a/img/index-skip-scan/example-table.png b/img/index-skip-scan/example-table.png
new file mode 100644
index 0000000..585ae4d
Binary files /dev/null and b/img/index-skip-scan/example-table.png differ

http://git-wip-us.apache.org/repos/asf/kudu/blob/83530755/img/index-skip-scan/skip-scan-example-table.png
----------------------------------------------------------------------
diff --git a/img/index-skip-scan/skip-scan-example-table.png b/img/index-skip-scan/skip-scan-example-table.png
new file mode 100644
index 0000000..3b965ac
Binary files /dev/null and b/img/index-skip-scan/skip-scan-example-table.png differ

http://git-wip-us.apache.org/repos/asf/kudu/blob/83530755/img/index-skip-scan/skip-scan-performance-graph.png
----------------------------------------------------------------------
diff --git a/img/index-skip-scan/skip-scan-performance-graph.png b/img/index-skip-scan/skip-scan-performance-graph.png
new file mode 100644
index 0000000..5004b65
Binary files /dev/null and b/img/index-skip-scan/skip-scan-performance-graph.png differ