Posted to commits@beam.apache.org by me...@apache.org on 2018/05/25 08:17:59 UTC

[beam-site] branch mergebot updated (1e2ed9a -> 778f349)

This is an automated email from the ASF dual-hosted git repository.

mergebot-role pushed a change to branch mergebot
in repository https://gitbox.apache.org/repos/asf/beam-site.git.


    from 1e2ed9a  This closes #447
     add 419cff6  Prepare repository for deployment.
     new 091a415  [BEAM-4361] Document usage of HBase TableSnapshotInputFormat
     new 778f349  This closes #445

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 content/contribute/become-a-committer/index.html |  2 +-
 src/documentation/io/built-in-hadoop.md          | 67 ++++++++++++++++++++++++
 2 files changed, 68 insertions(+), 1 deletion(-)

-- 
To stop receiving notification emails like this one, please contact
mergebot-role@apache.org.

[beam-site] 02/02: This closes #445

Posted by me...@apache.org.

mergebot-role pushed a commit to branch mergebot
in repository https://gitbox.apache.org/repos/asf/beam-site.git

commit 778f3494107e157e8ec1a4d431e0dbb049d1533c
Merge: 419cff6 091a415
Author: Mergebot <me...@apache.org>
AuthorDate: Fri May 25 01:17:29 2018 -0700

    This closes #445

 src/documentation/io/built-in-hadoop.md | 67 +++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)


[beam-site] 01/02: [BEAM-4361] Document usage of HBase TableSnapshotInputFormat

Posted by me...@apache.org.

mergebot-role pushed a commit to branch mergebot
in repository https://gitbox.apache.org/repos/asf/beam-site.git

commit 091a41569bca643db12ed0b6fd202d1dda6eff2f
Author: timrobertson100 <ti...@gmail.com>
AuthorDate: Tue May 22 16:33:54 2018 +0200

    [BEAM-4361] Document usage of HBase TableSnapshotInputFormat
---
 src/documentation/io/built-in-hadoop.md | 67 +++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/src/documentation/io/built-in-hadoop.md b/src/documentation/io/built-in-hadoop.md
index 82fc47f..bcfa267 100644
--- a/src/documentation/io/built-in-hadoop.md
+++ b/src/documentation/io/built-in-hadoop.md
@@ -269,4 +269,71 @@ PCollection<Text, DynamoDBItemWritable> dynamoDBData =
 
 ```py
   # The Beam SDK for Python does not support Hadoop InputFormat IO.
+```
+
+### Apache HBase - TableSnapshotInputFormat
+
+To read data from an HBase table snapshot, use `org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat`.
+Reading from a table snapshot bypasses the HBase region servers, instead reading HBase data files directly from the filesystem.
+This is useful for cases such as reading historical data or offloading work from the HBase cluster.
+In some scenarios this may prove faster than accessing content through the region servers using `HBaseIO`.
+
+A table snapshot can be taken using the HBase shell or programmatically:
+```java
+try (
+    Connection connection = ConnectionFactory.createConnection(hbaseConf);
+    Admin admin = connection.getAdmin()
+  ) {
+  admin.snapshot(
+    "my_snapshot",
+    TableName.valueOf("my_table"),
+    HBaseProtos.SnapshotDescription.Type.FLUSH);
+}
+```
+
+```py
+  # The Beam SDK for Python does not support Hadoop InputFormat IO.
+```
+
+A `TableSnapshotInputFormat` is configured as follows:
+
+```java
+// Construct a typical HBase scan
+Scan scan = new Scan();
+scan.setCaching(1000);
+scan.setBatch(1000);
+scan.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("col_1"));
+scan.addColumn(Bytes.toBytes("CF"), Bytes.toBytes("col_2"));
+
+Configuration hbaseConf = HBaseConfiguration.create();
+hbaseConf.set(HConstants.ZOOKEEPER_QUORUM, "zk1:2181");
+hbaseConf.set("hbase.rootdir", "/hbase");
+hbaseConf.setClass(
+    "mapreduce.job.inputformat.class", TableSnapshotInputFormat.class, InputFormat.class);
+hbaseConf.setClass("key.class", ImmutableBytesWritable.class, Writable.class);
+hbaseConf.setClass("value.class", Result.class, Writable.class);
+ClientProtos.Scan proto = ProtobufUtil.toScan(scan);
+hbaseConf.set(TableInputFormat.SCAN, Base64.encodeBytes(proto.toByteArray()));
+
+// Make use of existing utility methods
+Job job = Job.getInstance(hbaseConf); // creates internal clone of hbaseConf
+TableSnapshotInputFormat.setInput(job, "my_snapshot", new Path("/tmp/snapshot_restore"));
+hbaseConf = job.getConfiguration(); // extract the modified clone
+```
+
+```py
+  # The Beam SDK for Python does not support Hadoop InputFormat IO.
+```
+
+Call the Read transform as follows:
+
+```java
+PCollection<KV<ImmutableBytesWritable, Result>> hbaseSnapshotData =
+  p.apply("read",
+    HadoopInputFormatIO.<ImmutableBytesWritable, Result>read()
+      .withConfiguration(hbaseConf));
+```
+
+```py
+  # The Beam SDK for Python does not support Hadoop InputFormat IO.
 ```
\ No newline at end of file
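
Note on the scan encoding in the patch above: Hadoop configuration values are strings, which is presumably why the serialized scan is Base64-encoded before being stored under `TableInputFormat.SCAN`. The following is a minimal, standalone sketch of that round trip. It is not code from the commit: it assumes `java.util.Base64` and a plain `HashMap` as stand-ins for HBase's `Base64` utility and the Hadoop `Configuration`, and the class name, the key `example.scan`, and the byte values are all hypothetical.

```java
import java.util.Arrays;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

public class ScanConfigSketch {
  // Hypothetical key for illustration; the patch uses TableInputFormat.SCAN.
  static final String SCAN_KEY = "example.scan";

  // Encode arbitrary bytes as a string that is safe to store in a string-only config.
  static String encode(byte[] serializedScan) {
    return Base64.getEncoder().encodeToString(serializedScan);
  }

  // Recover the original bytes from the stored string.
  static byte[] decode(String encoded) {
    return Base64.getDecoder().decode(encoded);
  }

  public static void main(String[] args) {
    // Stand-in for proto.toByteArray(); includes bytes that are not valid text.
    byte[] serializedScan = {0x0a, 0x02, 'C', 'F', (byte) 0xff, 0x00};

    // Stand-in for the Hadoop Configuration, which only holds string values.
    Map<String, String> conf = new HashMap<>();
    conf.put(SCAN_KEY, encode(serializedScan));

    // The reader side decodes the string back to the identical bytes.
    byte[] restored = decode(conf.get(SCAN_KEY));
    System.out.println(Arrays.equals(serializedScan, restored));
  }
}
```

In the real pipeline the input format performs the decoding itself, so only the encoding step appears in user code.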
