Posted to commits@accumulo.apache.org by el...@apache.org on 2013/11/13 05:25:05 UTC

[2/8] git commit: ACCUMULO-1783 More documentation to reading data. Other layout and css fixes.

ACCUMULO-1783 More documentation to reading data. Other layout and css fixes.


Project: http://git-wip-us.apache.org/repos/asf/accumulo-pig/repo
Commit: http://git-wip-us.apache.org/repos/asf/accumulo-pig/commit/04de5a46
Tree: http://git-wip-us.apache.org/repos/asf/accumulo-pig/tree/04de5a46
Diff: http://git-wip-us.apache.org/repos/asf/accumulo-pig/diff/04de5a46

Branch: refs/heads/ACCUMULO-1783
Commit: 04de5a466d472510de41fc30d6e92f5ee7223d42
Parents: 5c6c207
Author: Josh Elser <el...@apache.org>
Authored: Tue Nov 12 16:25:25 2013 -0800
Committer: Josh Elser <el...@apache.org>
Committed: Tue Nov 12 16:25:25 2013 -0800

----------------------------------------------------------------------
 site/_layouts/default.html           |   2 +-
 site/_layouts/docs.html              |   9 ++
 site/css/base.css                    |   1 +
 site/docs/index.md                   |  12 +++
 site/docs/introduction.md            | 100 +++++++++-------------
 site/docs/map-storage.md             | 138 ++++++++++++++++++++++++++++++
 site/images/accumulo-data-model.png  | Bin 0 -> 15838 bytes
 site/images/accumulo-data-model.tiff | Bin 0 -> 19782 bytes
 8 files changed, 203 insertions(+), 59 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/accumulo-pig/blob/04de5a46/site/_layouts/default.html
----------------------------------------------------------------------
diff --git a/site/_layouts/default.html b/site/_layouts/default.html
index 0acdd7f..40824ce 100644
--- a/site/_layouts/default.html
+++ b/site/_layouts/default.html
@@ -24,7 +24,7 @@
     <body>
         <div id="content">
             <div id="header">
-                <span class="h1">Accumulo storage with Pig</span>
+                <span class="h1">{% if page.title %} {{ page.title }} {% else %} Accumulo storage with Pig {% endif %}</span>
                 <span style="float:right;"><img src="/images/pig.gif" height="70px" alt="pig" /><img src="/images/accumulo.png" height="70px" alt="accumulo" style="padding-left:15px"/></span>
             </div>
             {{ content }}

http://git-wip-us.apache.org/repos/asf/accumulo-pig/blob/04de5a46/site/_layouts/docs.html
----------------------------------------------------------------------
diff --git a/site/_layouts/docs.html b/site/_layouts/docs.html
new file mode 100644
index 0000000..e68f2bc
--- /dev/null
+++ b/site/_layouts/docs.html
@@ -0,0 +1,9 @@
+---
+layout: default
+---
+
+<div>
+{{ content }}
+</div>
+
+<p><a href="{{site.url}}">Home</a></p>

http://git-wip-us.apache.org/repos/asf/accumulo-pig/blob/04de5a46/site/css/base.css
----------------------------------------------------------------------
diff --git a/site/css/base.css b/site/css/base.css
index 4f6890f..5746a27 100644
--- a/site/css/base.css
+++ b/site/css/base.css
@@ -45,6 +45,7 @@ pre.code {
     overflow: auto;
     padding-left: 20px;
     padding-top: 15px;
+    padding-bottom: 15px;
     background-color: #909090;
     
     border: 1px solid black;

http://git-wip-us.apache.org/repos/asf/accumulo-pig/blob/04de5a46/site/docs/index.md
----------------------------------------------------------------------
diff --git a/site/docs/index.md b/site/docs/index.md
new file mode 100644
index 0000000..4ced400
--- /dev/null
+++ b/site/docs/index.md
@@ -0,0 +1,12 @@
+---
+layout: docs
+title: Accumulo storage with Pig
+permalink: /docs/
+---
+<div>
+{% for page in site.pages %}
+    {% if page.url != '/docs/' and page.url != '/index.html' %}
+        <p><a href="{{site.url}}{{page.url}}">{{page.title}}</a></p>
+    {% endif %}
+{% endfor %}
+</div>

http://git-wip-us.apache.org/repos/asf/accumulo-pig/blob/04de5a46/site/docs/introduction.md
----------------------------------------------------------------------
diff --git a/site/docs/introduction.md b/site/docs/introduction.md
index a46a880..19b2cb6 100644
--- a/site/docs/introduction.md
+++ b/site/docs/introduction.md
@@ -1,71 +1,55 @@
 ---
-layout: default
-title: Accumulo Storage with Pig
+layout: docs
+title: Introduction
 permalink: /docs/introduction/
 ---
-[Apache Accumulo](http://accumulo.apache.org) 
+## Pig 
 
-<pre class="code">
-<span class="comment">-- Read a reduced set of our flight data</span>
-<span class="variable">flight_data</span> = <span class="keyword">LOAD</span> <span class="constants">'accumulo://flights?instance=accumulo&amp;user=pig&amp;password=password&amp;zookeepers=localhost&amp;fetch_columns=destination,departure_time,scheduled_departure_time,flight_number,taxi_in,taxi_out,origin'</span>
-<span class="keyword">USING</span> org.apache.accumulo.pig.AccumuloStorage() <span class="keyword">AS</span> (rowkey:<span class="type">chararray</span>, data:<span class="type">map[]</span>);
+One of the big reasons that [Apache Pig](http://pig.apache.org) exists is to provide a much lower-cost entry point to
+running MapReduce. Writing a MapReduce job typically ends up being expressed in hundreds of lines of code to solve
+problems that are often categorized as [embarrassingly parallel](http://en.wikipedia.org/wiki/Embarrassingly_parallel).
+As such, these problems are typically easy to think about conceptually and can be used as a point of introspection to a
+data set or combination of data sets. Thus, it doesn't make sense to write large amounts of Java code for what may be
+repeated one-off questions.
 
-<span class="comment">-- Also read airport information</span>
-<span class="variable">airports</span> = <span class="keyword">LOAD</span> <span class="constants">'accumulo://airports?instance=accumulo&amp;user=pig&amp;password=password&amp;zookeepers=localhost'</span> <span class="keyword">USING</span>
-org.apache.accumulo.pig.AccumuloStorage() <span class="keyword">AS</span> (rowkey:<span class="type">chararray</span>, data:<span class="type">map[]</span>);
+## Accumulo
 
-<span class="comment">-- Permute the map</span>
-<span class="variable">flight_data</span> = <span class="keyword">FOREACH</span> <span class="variable">flight_data</span> <span class="keyword">GENERATE</span> rowkey, data#<span class="constants">'origin'</span> <span class="keyword">AS</span> origin, data#<span class="constants">'destination'</span> <span class="keyword">AS</span> destination, data#<span class="constants">'departure_time'</span> <span class="keyword">AS</span> departure_time,
-data#<span class="constants">'scheduled_departure_time'</span> <span class="keyword">AS</span> scheduled_departure_time, data#<span class="constants">'flight_number'</span> <span class="keyword">AS</span> flight_number, data#<span class="constants">'taxi_in'</span> <span class="keyword">AS</span> taxi_in, data#<span class="constants">'taxi_out'</span> <span class="keyword">AS</span> taxi_out;
+Accumulo is one of many storage solutions in the Apache Hadoop ecosystem, so it makes sense that Pig should also be able
+to read/write data from/to Accumulo. Very similar to [Apache HBase](http://hbase.apache.org), Accumulo is a Key-Value datastore,
+where a Key is made up of [multiple parts](http://accumulo.apache.org/1.5/accumulo_user_manual.html#_data_model). 
 
-<span class="comment">-- Permute the map</span>
-<span class="variable">airports</span> = <span class="keyword">FOREACH</span> <span class="variable">airports</span> <span class="keyword">GENERATE</span> data#<span class="constants">'name'</span> <span class="keyword">AS</span> name, data#<span class="constants">'state'</span> <span class="keyword">AS</span> state, data#<span class="constants">'code'</span> <span class="keyword">AS</span> code, data#<span class="constants">'country'</span> <span class="keyword">AS</span> country, data#<span class="constants">'city'</span> <span class="keyword">AS</span> city;
+![Data Model](/images/accumulo-data-model.png)
 
-<span class="comment">-- Add airport information about the origin of the flight</span>
-<span class="variable">flights_with_origin</span> = <span class="keyword">JOIN</span> <span class="variable">flight_data</span> <span class="keyword">BY</span> origin, <span class="variable">airports</span> <span class="keyword">BY</span> code;
+This data model lends itself well to working with columnar data, handling changing, sparse column definitions on the fly. As
+new columns are created, Accumulo can process these automatically without any user intervention. Many columns can exist
+across many rows, and rows can contain different sets of columns. This can be thought of as storing a Map of data in
+each row.
 
-<span class="comment">-- Store this information back into Accumulo in a new table</span>
-<span class="keyword">STORE</span> <span class="variable">flights_with_origin</span> <span class="keyword">INTO</span> <span class="constants">'accumulo://flights_with_airports?instance=accumulo1.4&amp;user=root&amp;password=secret&amp;zookeepers=localhost'</span> \
-<span class="keyword">USING</span> org.apache.accumulo.pig.AccumuloStorage(<span class="constants">'origin,destination,departure_time,scheduled_departure_time,flight_number,taxi_in,taxi_out,name,state,code,country,city'</span>);
-</pre>
+## File Storage
 
-<p> Vestibulum vulputate nisi non imperdiet elementum. Pellentesque at
-consequat nisi. Fusce ut luctus justo. Aenean tincidunt ut risus
-condimentum convallis. Praesent eget tristique risus. Cras pellentesque sed
-libero ac elementum. Quisque tempus commodo neque, laoreet accumsan lectus
-sollicitudin eget. In convallis neque nisi, a iaculis neque interdum ac.
-Suspendisse in ante lacinia dolor faucibus auctor.
-</p>
+Accumulo provides many other desirable features when it comes to data management over "hand rolled" solutions using
+flat-files in HDFS.
 
-<p>Nulla fringilla quis turpis a gravida. Quisque tellus arcu, sagittis et sapien
-ut, imperdiet scelerisque est. Duis sapien mi, elementum vitae sem quis, varius
-tincidunt tortor. In commodo semper magna. Donec ultrices nunc est, nec
-volutpat leo porta scelerisque. Praesent tellus leo, scelerisque eget tortor
-eget, posuere sodales nulla. Mauris imperdiet magna eget tristique consequat.
-Nullam adipiscing at arcu in vestibulum. Donec consectetur justo sed odio
-vehicula, vel lobortis libero vehicula. Fusce rutrum justo lorem, sed bibendum
-ipsum ultrices eget. Praesent lobortis justo quis sem adipiscing rutrum ac eget
-nisi. Pellentesque et justo in leo rutrum rhoncus a ut neque. Fusce faucibus,
-orci nec venenatis dapibus, est leo ornare eros, ac adipiscing erat felis sit
-amet tellus. Nulla vehicula ipsum sit amet accumsan tempor.
-</p>
+### Sorted
 
-<p>Nulla ac est tincidunt, lacinia quam nec, mollis ante. Nulla ut tincidunt
-massa, vel laoreet elit. Aliquam erat volutpat. Mauris varius dolor in eros
-blandit adipiscing. Nam ultrices tellus quam, eu porta quam varius ac.
-Phasellus in massa fringilla, mattis nisi vel, condimentum diam. Cras porttitor
-eget arcu vel tempor.
-</p>
+All data stored in Accumulo is sorted lexicographically. New data which is written to Accumulo is also written in sorted
+order, while reads against Accumulo perform a merged read against these sorted streams of data to provide a globally
+sorted view over the table. This feature opens the door to many algorithms that can run efficiently over sorted data
+sets.
 
-<p>Ut id vestibulum lorem. Fusce vitae metus sed magna tincidunt vestibulum. Fusce
-in eros ac nulla vestibulum venenatis vitae vitae nisi. Donec elementum neque
-ac viverra cursus. Morbi tincidunt venenatis tellus, id facilisis nibh viverra
-eget. Aenean pellentesque gravida orci, sed elementum nisl vulputate at.
-Suspendisse ut orci vitae tortor viverra egestas id scelerisque ante. Praesent
-vel tempor justo, id tempor lacus. Proin convallis vehicula mauris. Suspendisse
-tincidunt et libero vitae condimentum. Nam arcu urna, sollicitudin nec diam
-congue, ultricies hendrerit mi. Vivamus viverra elit in libero rutrum commodo.
-Ut eget varius arcu, ac venenatis tellus. Quisque rutrum blandit velit in
-sollicitudin. Maecenas nibh purus, consectetur at elementum at, dictum et
-dolor. 
-</p>
+### Tablets
+
+Accumulo organizes data in tables. Each table is made up of multiple tablets. Each tablet contains at minimum one row
+and can be composed of many files in HDFS. In practice, most tablets in Accumulo will contain many rows. As new data is
+inserted into a table, Accumulo will manage how this data is written to HDFS. As new data is ingested and written to a
+tablet, that tablet will eventually split from one tablet into many. As these splits occur, Accumulo manages each of
+these files in HDFS for you transparently, alleviating the need to implement data retention and organizational logic
+in the application.
+
+## Indexing
+
+In addition to the trivial "map in row" layout, more advanced table schemas exist such as inverted indexes,
+document-partitioned indexes, and edge lists to name a few. All of these can be expressed using the same "5 tuple" Key
+data model that Accumulo provides.
+
+Next, how to [use Pig to manipulate "map in row" datasets in Accumulo](/docs/map-storage).

http://git-wip-us.apache.org/repos/asf/accumulo-pig/blob/04de5a46/site/docs/map-storage.md
----------------------------------------------------------------------
diff --git a/site/docs/map-storage.md b/site/docs/map-storage.md
new file mode 100644
index 0000000..b5a16e7
--- /dev/null
+++ b/site/docs/map-storage.md
@@ -0,0 +1,138 @@
+---
+layout: docs
+title: Using Map Storage
+permalink: /docs/map-storage/
+---
+## Connection
+
+The _AccumuloStorage_ class provides the ability to read data from an Accumulo table. The string argument to the
+[LOAD](http://pig.apache.org/docs/r0.12.0/basic.html#load) command is a URI which contains Accumulo connection
+information and some query options. The URI scheme is always "accumulo" and the path is the Accumulo table name to read
+from. The query string is used to provide the previously mentioned connection and query options. These options are the
+same regardless of whether _AccumuloStorage_ is being used for reading or writing.
+
+* `instance` - The Accumulo instance name
+* `user` - The Accumulo user
+* `password` - The password for the Accumulo user
+* `zookeepers` - A comma-separated list of ZooKeeper hosts for the Accumulo instance
+
+## Reading
+
+Some basic Accumulo read parameters are exposed for use. All of the following are optional.
+
+* `fetch_columns` - A comma-separated list of elements, each a column family optionally followed by a colon and a column qualifier,
+pairs, e.g. `foo:bar,column1,column5`. **Default: All columns**.
+* `begin` - The row to begin scanning from. **Default: beginning of the table (null)**.
+* `end` - The row to stop scanning at. **Default: end of the table (null)**.
+* `auths` - A comma-separated list of Authorizations to use for the provided user. **Default: all authorizations the
+user has**.
+
+_AccumuloStorage_ returns data in the following schema.
+
+<pre class="code">
+(rowkey:<span class="type">chararray</span>, data:<span class="type">map[]</span>)
+</pre>
+
+Each key in the map is a column (family and qualifier) within the given rowkey, and each value is the Accumulo value for that
+rowkey+column. By default, the map key has a colon separator between the column family and column qualifier. A
+boolean argument can be provided to the _AccumuloStorage_ constructor. If this boolean is true, the map key is composed only
+of the column qualifier, and one map is produced for each column family within the row. For example:
+
+<table border="1">
+    <tr><th>Row</th><th>ColumnFamily</th><th>ColumnQualifier</th><th>Value</th></tr>
+    <tr><td>1</td><td>measurements</td><td>height</td><td>72inches</td></tr>
+    <tr><td>1</td><td>measurements</td><td>weight</td><td>180lbs</td></tr>
+    <tr><td>1</td><td>location</td><td>city</td><td>San Francisco</td></tr>
+    <tr><td>1</td><td>location</td><td>state</td><td>California</td></tr>
+</table>
+
+By default, this table will generate the following tuple:
+
+<pre class="code">
+("1", {"measurements:height"#"72inches", "measurements:weight"#"180lbs", "location:city"#"San Francisco", "location:state"#"California"})
+</pre>
+
+If the previously mentioned boolean argument is provided as true, the following will be generated instead:
+
+<pre class="code">
+("1", {"height"#"72inches", "weight"#"180lbs"}, {"city"#"San Francisco", "state"#"California"})
+</pre>
+
+## Writing
+
+Some basic Accumulo write parameters are exposed for use. Like read operations, all of the following are optional.
+
+* `write_buffer_size` - The size, in bytes, to buffer Mutations before sending to an Accumulo server. **Default:
+10,000,000 (10MB)**.
+* `write_threads` - The number of threads to use when sending Mutations to Accumulo servers. **Default: 10**.
+* `write_latency_ms` - The number of milliseconds to wait before forcibly flushing Mutations to Accumulo. **Default:
+10,000 (10 seconds)**.
+
+### Data as map
+
+### Data as fields 
+
+<pre class="code">
+<span class="comment">-- Read a reduced set of our flight data</span>
+<span class="variable">flight_data</span> = <span class="keyword">LOAD</span> <span class="constants">'accumulo://flights?instance=accumulo&amp;user=pig&amp;password=password&amp;zookeepers=localhost&amp;fetch_columns=destination,departure_time,scheduled_departure_time,flight_number,taxi_in,taxi_out,origin'</span>
+<span class="keyword">USING</span> org.apache.accumulo.pig.AccumuloStorage() <span class="keyword">AS</span> (rowkey:<span class="type">chararray</span>, data:<span class="type">map[]</span>);
+
+<span class="comment">-- Also read airport information</span>
+<span class="variable">airports</span> = <span class="keyword">LOAD</span> <span class="constants">'accumulo://airports?instance=accumulo&amp;user=pig&amp;password=password&amp;zookeepers=localhost'</span> <span class="keyword">USING</span>
+org.apache.accumulo.pig.AccumuloStorage() <span class="keyword">AS</span> (rowkey:<span class="type">chararray</span>, data:<span class="type">map[]</span>);
+
+<span class="comment">-- Permute the map</span>
+<span class="variable">flight_data</span> = <span class="keyword">FOREACH</span> <span class="variable">flight_data</span> <span class="keyword">GENERATE</span> rowkey, data#<span class="constants">'origin'</span> <span class="keyword">AS</span> origin, data#<span class="constants">'destination'</span> <span class="keyword">AS</span> destination, data#<span class="constants">'departure_time'</span> <span class="keyword">AS</span> departure_time,
+data#<span class="constants">'scheduled_departure_time'</span> <span class="keyword">AS</span> scheduled_departure_time, data#<span class="constants">'flight_number'</span> <span class="keyword">AS</span> flight_number, data#<span class="constants">'taxi_in'</span> <span class="keyword">AS</span> taxi_in, data#<span class="constants">'taxi_out'</span> <span class="keyword">AS</span> taxi_out;
+
+<span class="comment">-- Permute the map</span>
+<span class="variable">airports</span> = <span class="keyword">FOREACH</span> <span class="variable">airports</span> <span class="keyword">GENERATE</span> data#<span class="constants">'name'</span> <span class="keyword">AS</span> name, data#<span class="constants">'state'</span> <span class="keyword">AS</span> state, data#<span class="constants">'code'</span> <span class="keyword">AS</span> code, data#<span class="constants">'country'</span> <span class="keyword">AS</span> country, data#<span class="constants">'city'</span> <span class="keyword">AS</span> city;
+
+<span class="comment">-- Add airport information about the origin of the flight</span>
+<span class="variable">flights_with_origin</span> = <span class="keyword">JOIN</span> <span class="variable">flight_data</span> <span class="keyword">BY</span> origin, <span class="variable">airports</span> <span class="keyword">BY</span> code;
+
+<span class="comment">-- Store this information back into Accumulo in a new table</span>
+<span class="keyword">STORE</span> <span class="variable">flights_with_origin</span> <span class="keyword">INTO</span> <span class="constants">'accumulo://flights_with_airports?instance=accumulo1.4&amp;user=root&amp;password=secret&amp;zookeepers=localhost'</span>
+<span class="keyword">USING</span> org.apache.accumulo.pig.AccumuloStorage(<span class="constants">'origin,destination,departure_time,scheduled_departure_time,flight_number,taxi_in,taxi_out,name,state,code,country,city'</span>);
+</pre>
+
+<p> Vestibulum vulputate nisi non imperdiet elementum. Pellentesque at
+consequat nisi. Fusce ut luctus justo. Aenean tincidunt ut risus
+condimentum convallis. Praesent eget tristique risus. Cras pellentesque sed
+libero ac elementum. Quisque tempus commodo neque, laoreet accumsan lectus
+sollicitudin eget. In convallis neque nisi, a iaculis neque interdum ac.
+Suspendisse in ante lacinia dolor faucibus auctor.
+</p>
+
+<p>Nulla fringilla quis turpis a gravida. Quisque tellus arcu, sagittis et sapien
+ut, imperdiet scelerisque est. Duis sapien mi, elementum vitae sem quis, varius
+tincidunt tortor. In commodo semper magna. Donec ultrices nunc est, nec
+volutpat leo porta scelerisque. Praesent tellus leo, scelerisque eget tortor
+eget, posuere sodales nulla. Mauris imperdiet magna eget tristique consequat.
+Nullam adipiscing at arcu in vestibulum. Donec consectetur justo sed odio
+vehicula, vel lobortis libero vehicula. Fusce rutrum justo lorem, sed bibendum
+ipsum ultrices eget. Praesent lobortis justo quis sem adipiscing rutrum ac eget
+nisi. Pellentesque et justo in leo rutrum rhoncus a ut neque. Fusce faucibus,
+orci nec venenatis dapibus, est leo ornare eros, ac adipiscing erat felis sit
+amet tellus. Nulla vehicula ipsum sit amet accumsan tempor.
+</p>
+
+<p>Nulla ac est tincidunt, lacinia quam nec, mollis ante. Nulla ut tincidunt
+massa, vel laoreet elit. Aliquam erat volutpat. Mauris varius dolor in eros
+blandit adipiscing. Nam ultrices tellus quam, eu porta quam varius ac.
+Phasellus in massa fringilla, mattis nisi vel, condimentum diam. Cras porttitor
+eget arcu vel tempor.
+</p>
+
+<p>Ut id vestibulum lorem. Fusce vitae metus sed magna tincidunt vestibulum. Fusce
+in eros ac nulla vestibulum venenatis vitae vitae nisi. Donec elementum neque
+ac viverra cursus. Morbi tincidunt venenatis tellus, id facilisis nibh viverra
+eget. Aenean pellentesque gravida orci, sed elementum nisl vulputate at.
+Suspendisse ut orci vitae tortor viverra egestas id scelerisque ante. Praesent
+vel tempor justo, id tempor lacus. Proin convallis vehicula mauris. Suspendisse
+tincidunt et libero vitae condimentum. Nam arcu urna, sollicitudin nec diam
+congue, ultricies hendrerit mi. Vivamus viverra elit in libero rutrum commodo.
+Ut eget varius arcu, ac venenatis tellus. Quisque rutrum blandit velit in
+sollicitudin. Maecenas nibh purus, consectetur at elementum at, dictum et
+dolor. 
+</p>

http://git-wip-us.apache.org/repos/asf/accumulo-pig/blob/04de5a46/site/images/accumulo-data-model.png
----------------------------------------------------------------------
diff --git a/site/images/accumulo-data-model.png b/site/images/accumulo-data-model.png
new file mode 100644
index 0000000..04ee2a9
Binary files /dev/null and b/site/images/accumulo-data-model.png differ

http://git-wip-us.apache.org/repos/asf/accumulo-pig/blob/04de5a46/site/images/accumulo-data-model.tiff
----------------------------------------------------------------------
diff --git a/site/images/accumulo-data-model.tiff b/site/images/accumulo-data-model.tiff
new file mode 100644
index 0000000..597c488
Binary files /dev/null and b/site/images/accumulo-data-model.tiff differ