Posted to commits@kudu.apache.org by ab...@apache.org on 2018/09/11 15:54:25 UTC

kudu git commit: [blog] Data Pipelines Simplified with Kudu

Repository: kudu
Updated Branches:
  refs/heads/gh-pages 1488e1788 -> 9f7058cc7


[blog] Data Pipelines Simplified with Kudu

Change-Id: I222d2462da86c3aad3fa9afd71f686faaa9aa025
Reviewed-on: http://gerrit.cloudera.org:8080/11417
Reviewed-by: Attila Bukor <ab...@apache.org>
Tested-by: Attila Bukor <ab...@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/9f7058cc
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/9f7058cc
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/9f7058cc

Branch: refs/heads/gh-pages
Commit: 9f7058cc77b2c710589194ed53e2e7119cbc9cc9
Parents: 1488e17
Author: Jordan Birdsell <jt...@apache.org>
Authored: Mon Sep 10 21:27:34 2018 -0400
Committer: Attila Bukor <ab...@apache.org>
Committed: Tue Sep 11 15:44:35 2018 +0000

----------------------------------------------------------------------
 ...2018-09-11-simplified-pipelines-with-kudu.md | 44 ++++++++++++++++++++
 1 file changed, 44 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/9f7058cc/_posts/2018-09-11-simplified-pipelines-with-kudu.md
----------------------------------------------------------------------
diff --git a/_posts/2018-09-11-simplified-pipelines-with-kudu.md b/_posts/2018-09-11-simplified-pipelines-with-kudu.md
new file mode 100644
index 0000000..c1e2685
--- /dev/null
+++ b/_posts/2018-09-11-simplified-pipelines-with-kudu.md
@@ -0,0 +1,44 @@
+---
+layout: post
+title: Simplified Data Pipelines with Kudu
+author: Mac Noland
+---
+
+I’ve been working with Hadoop for over seven years now and, fortunately or unfortunately, have run
+across a lot of structured data use cases.  What we at [phData](https://phdata.io/) have found is
+that end users are typically comfortable with tabular data and prefer to access their data in a
+structured manner using tables.
+<!--more-->
+
+When working on new structured data projects, the first question we always get from people new to
+Hadoop is, _“how do I update or delete a record?”_  The second question we get is, _“when adding
+records, why don’t they show up in Impala right away?”_  For those of us who have worked with HDFS
+and Impala on HDFS for years, these are questions with simple answers that are hard to explain.
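+
+With Kudu, both answers become short ones.  As a minimal sketch using the Kudu Python client (the
+master address, table name, and columns here are hypothetical):
+
+```python
+import kudu
+
+# Connect to a (hypothetical) Kudu master and open an existing table.
+client = kudu.connect(host='kudu-master.example.com', port=7051)
+table = client.table('orders')
+
+session = client.new_session()
+
+# Update a record in place, keyed by primary key.
+session.apply(table.new_update({'order_id': 42, 'status': 'shipped'}))
+
+# Hard-delete a record by primary key.
+session.apply(table.new_delete({'order_id': 7}))
+
+# Once the flush succeeds, the changes are visible to readers --
+# no cache refresh or compaction job required.
+session.flush()
+```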
+
+The pre-Kudu years were filled with hundreds (or thousands) of self-join views (or materialization
+jobs) and compaction jobs, along with scheduled jobs to periodically refresh the Impala cache so new
+records would show up.  While doable, across tens of thousands of tables this became a distraction
+from solving real business problems.
+
+With the introduction of Kudu, mixing record-level updates, deletes, and inserts with support for
+large scans is now something we can sustainably manage at scale.  HBase is very good at record-level
+updates, deletes, and inserts, but doesn’t scale well for analytic use cases that often do full
+table scans.  Moreover, for streaming use cases, changes are available in near real-time.  End
+users, accustomed to having to _“wait”_ for their data, can now consume the data as it arrives in
+their table.
+
+A common data ingest pattern where Kudu becomes necessary is change data capture (CDC): capturing
+inserts, updates, and hard deletes, and streaming them into Kudu where they can be applied
+immediately.  Pre-Kudu, this pipeline was very tedious to implement.  Now, with tools like
+[StreamSets](https://streamsets.com/), you can get up and running in a few hours.
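+
+The apply step itself is straightforward.  Here is a rough sketch with the Kudu Python client,
+assuming a hypothetical stream of CDC events as (operation, row) pairs (in practice a tool like
+StreamSets wires this up for you):
+
+```python
+import kudu
+
+client = kudu.connect(host='kudu-master.example.com', port=7051)
+table = client.table('customers')
+session = client.new_session()
+
+# Hypothetical CDC events pulled off a stream.
+events = [
+    ('insert', {'id': 1, 'name': 'Ada'}),
+    ('update', {'id': 1, 'name': 'Ada Lovelace'}),
+    ('delete', {'id': 1}),
+]
+
+for op, row in events:
+    if op == 'delete':
+        session.apply(table.new_delete(row))
+    else:
+        # Upsert covers both inserts and updates, so replayed or
+        # out-of-order events don't fail the pipeline.
+        session.apply(table.new_upsert(row))
+
+session.flush()
+```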
+
+A second common workflow is near real-time analytics.  We’ve streamed data off mining trucks, oil
+wells, and manufacturing lines, and needed to make that data available to end users immediately.  No
+longer do we need to batch up writes, flush to HDFS, and then refresh the cache in Impala.  As
+mentioned before, with Kudu the data is available as soon as it lands.  This has been a significant
+enhancement for end users, who previously had to _“wait”_ for data.
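+
+As a sketch of that read-after-write behavior (again with hypothetical names), a row written by the
+ingest job can be scanned back immediately, with no flush to HDFS or cache refresh in between:
+
+```python
+import kudu
+
+client = kudu.connect(host='kudu-master.example.com', port=7051)
+table = client.table('sensor_readings')
+
+# Write a reading...
+session = client.new_session()
+session.apply(table.new_insert({'sensor_id': 99, 'ts': 1536678000, 'temp_c': 21.5}))
+session.flush()
+
+# ...and it is immediately visible to an analytic scan.
+scanner = table.scanner()
+scanner.add_predicate(table['sensor_id'] == 99)
+print(scanner.open().read_all_tuples())
+```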
+
+In summary, Kudu has made a tremendous impact by removing the operational distractions of merging in
+changes and refreshing the caches of downstream consumers.  This allows data engineers and users to
+focus on solving business problems rather than being bogged down by the tedium of the backend.