Posted to commits@jena.apache.org by bu...@apache.org on 2014/11/26 11:05:14 UTC

svn commit: r930555 - in /websites/staging/jena/trunk/content: ./ documentation/hadoop/index.html

Author: buildbot
Date: Wed Nov 26 10:05:14 2014
New Revision: 930555

Log:
Staging update by buildbot for jena

Modified:
    websites/staging/jena/trunk/content/   (props changed)
    websites/staging/jena/trunk/content/documentation/hadoop/index.html

Propchange: websites/staging/jena/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Wed Nov 26 10:05:14 2014
@@ -1 +1 @@
-1641785
+1641787

Modified: websites/staging/jena/trunk/content/documentation/hadoop/index.html
==============================================================================
--- websites/staging/jena/trunk/content/documentation/hadoop/index.html (original)
+++ websites/staging/jena/trunk/content/documentation/hadoop/index.html Wed Nov 26 10:05:14 2014
@@ -166,7 +166,7 @@ underlying plumbing.</p>
 <li><a href="demo.html">RDF Stats Demo</a></li>
 </ul>
 </li>
-<li><a href="artifacts.html">Maven Artifacts for Jena JDBC</a></li>
+<li><a href="artifacts.html">Maven Artifacts</a></li>
 </ul>
 <h2 id="overview">Overview</h2>
 <p>RDF Tools for Apache Hadoop is published as a set of Maven modules via its <a href="artifacts.html">maven artifacts</a>.  The source for these libraries
@@ -201,10 +201,12 @@ on what you are trying to do.  Typically
 </pre></div>
 
 
-<p>Our libraries depend on the relevant Hadoop libraries but since these libraries are provided by the cluster those dependencies are marked as <code>provided</code> and thus are not transitive.  This means that you will typically also need to add the following additional dependencies:</p>
+<p>Our libraries depend on the relevant Hadoop libraries but since these libraries are typically provided by the Hadoop cluster those dependencies are marked as <code>provided</code> and thus are not transitive.  This means that you will typically also need to add the following additional dependencies:</p>
 <div class="codehilite"><pre><span class="c">&lt;!-- Hadoop Dependencies --&gt;</span>
-<span class="c">&lt;!-- Note these will be provided on the Hadoop cluster hence the provided </span>
-<span class="c">        scope --&gt;</span>
+<span class="c">&lt;!-- </span>
+<span class="c">    Note these will be provided on the Hadoop cluster hence the provided </span>
+<span class="c">    scope </span>
+<span class="c">--&gt;</span>
 <span class="nt">&lt;dependency&gt;</span>
   <span class="nt">&lt;groupId&gt;</span>org.apache.hadoop<span class="nt">&lt;/groupId&gt;</span>
   <span class="nt">&lt;artifactId&gt;</span>hadoop-common<span class="nt">&lt;/artifactId&gt;</span>
@@ -240,9 +242,7 @@ then outputs each node with an initial c
 <span class="o">/**</span>
  <span class="o">*</span> <span class="n">A</span> <span class="n">mapper</span> <span class="k">for</span> <span class="n">counting</span> <span class="n">node</span> <span class="n">usages</span> <span class="n">within</span> <span class="n">triples</span> <span class="n">designed</span> <span class="n">primarily</span> <span class="k">for</span> <span class="n">use</span>
  <span class="o">*</span> <span class="n">in</span> <span class="n">conjunction</span> <span class="n">with</span> <span class="p">{@</span><span class="n">link</span> <span class="n">NodeCountReducer</span><span class="p">}</span>
- <span class="o">*</span> 
- <span class="o">*</span> 
- <span class="o">*</span> 
+ <span class="o">*</span>
  <span class="o">*</span> <span class="p">@</span><span class="n">param</span> <span class="o">&lt;</span><span class="n">TKey</span><span class="o">&gt;</span> <span class="n">Key</span> <span class="n">type</span>
  <span class="o">*/</span>
 <span class="n">public</span> <span class="n">class</span> <span class="n">TripleNodeCountMapper</span><span class="o">&lt;</span><span class="n">TKey</span><span class="o">&gt;</span> <span class="n">extends</span> <span class="n">AbstractNodeTupleNodeCountMapper</span><span class="o">&lt;</span><span class="n">TKey</span><span class="p">,</span> <span class="n">Triple</span><span class="p">,</span> <span class="n">TripleWritable</span><span class="o">&gt;</span> <span class="p">{</span>
@@ -291,60 +291,62 @@ then outputs each node with an initial c
 <p>Finally we then need to define an actual Hadoop job we can submit to run this.  Here we take advantage of the <a href="io.html">IO</a> library to provide
 us with support for our desired RDF input format:</p>
 <div class="codehilite"><pre><span class="n">package</span> <span class="n">org</span><span class="p">.</span><span class="n">apache</span><span class="p">.</span><span class="n">jena</span><span class="p">.</span><span class="n">hadoop</span><span class="p">.</span><span class="n">rdf</span><span class="p">.</span><span class="n">stats</span><span class="p">;</span>
-</pre></div>
 
+<span class="n">import</span> <span class="n">org</span><span class="p">.</span><span class="n">apache</span><span class="p">.</span><span class="n">hadoop</span><span class="p">.</span><span class="n">conf</span><span class="p">.</span><span class="n">Configuration</span><span class="p">;</span>
+<span class="n">import</span> <span class="n">org</span><span class="p">.</span><span class="n">apache</span><span class="p">.</span><span class="n">hadoop</span><span class="p">.</span><span class="n">fs</span><span class="p">.</span><span class="n">Path</span><span class="p">;</span>
+<span class="n">import</span> <span class="n">org</span><span class="p">.</span><span class="n">apache</span><span class="p">.</span><span class="n">hadoop</span><span class="p">.</span><span class="n">io</span><span class="p">.</span><span class="n">LongWritable</span><span class="p">;</span>
+<span class="n">import</span> <span class="n">org</span><span class="p">.</span><span class="n">apache</span><span class="p">.</span><span class="n">hadoop</span><span class="p">.</span><span class="n">mapreduce</span><span class="p">.</span><span class="n">Job</span><span class="p">;</span>
+<span class="n">import</span> <span class="n">org</span><span class="p">.</span><span class="n">apache</span><span class="p">.</span><span class="n">hadoop</span><span class="p">.</span><span class="n">mapreduce</span><span class="p">.</span><span class="n">lib</span><span class="p">.</span><span class="n">input</span><span class="p">.</span><span class="n">FileInputFormat</span><span class="p">;</span>
+<span class="n">import</span> <span class="n">org</span><span class="p">.</span><span class="n">apache</span><span class="p">.</span><span class="n">hadoop</span><span class="p">.</span><span class="n">mapreduce</span><span class="p">.</span><span class="n">lib</span><span class="p">.</span><span class="n">output</span><span class="p">.</span><span class="n">FileOutputFormat</span><span class="p">;</span>
+<span class="n">import</span> <span class="n">org</span><span class="p">.</span><span class="n">apache</span><span class="p">.</span><span class="n">jena</span><span class="p">.</span><span class="n">hadoop</span><span class="p">.</span><span class="n">rdf</span><span class="p">.</span><span class="n">io</span><span class="p">.</span><span class="n">input</span><span class="p">.</span><span class="n">TriplesInputFormat</span><span class="p">;</span>
+<span class="n">import</span> <span class="n">org</span><span class="p">.</span><span class="n">apache</span><span class="p">.</span><span class="n">jena</span><span class="p">.</span><span class="n">hadoop</span><span class="p">.</span><span class="n">rdf</span><span class="p">.</span><span class="n">io</span><span class="p">.</span><span class="n">output</span><span class="p">.</span><span class="n">ntriples</span><span class="p">.</span><span class="n">NTriplesNodeOutputFormat</span><span class="p">;</span>
+<span class="n">import</span> <span class="n">org</span><span class="p">.</span><span class="n">apache</span><span class="p">.</span><span class="n">jena</span><span class="p">.</span><span class="n">hadoop</span><span class="p">.</span><span class="n">rdf</span><span class="p">.</span><span class="n">mapreduce</span><span class="p">.</span><span class="n">count</span><span class="p">.</span><span class="n">NodeCountReducer</span><span class="p">;</span>
+<span class="n">import</span> <span class="n">org</span><span class="p">.</span><span class="n">apache</span><span class="p">.</span><span class="n">jena</span><span class="p">.</span><span class="n">hadoop</span><span class="p">.</span><span class="n">rdf</span><span class="p">.</span><span class="n">mapreduce</span><span class="p">.</span><span class="n">count</span><span class="p">.</span><span class="n">TripleNodeCountMapper</span><span class="p">;</span>
+<span class="n">import</span> <span class="n">org</span><span class="p">.</span><span class="n">apache</span><span class="p">.</span><span class="n">jena</span><span class="p">.</span><span class="n">hadoop</span><span class="p">.</span><span class="n">rdf</span><span class="p">.</span><span class="n">types</span><span class="p">.</span><span class="n">NodeWritable</span><span class="p">;</span>
+
+<span class="n">public</span> <span class="n">class</span> <span class="n">RdfMapReduceExample</span> <span class="p">{</span>
 
-<p>import org.apache.hadoop.conf.Configuration;
-import org.apache.hadoop.fs.Path;
-import org.apache.hadoop.io.LongWritable;
-import org.apache.hadoop.mapreduce.Job;
-import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
-import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
-import org.apache.jena.hadoop.rdf.io.input.TriplesInputFormat;
-import org.apache.jena.hadoop.rdf.io.output.ntriples.NTriplesNodeOutputFormat;
-import org.apache.jena.hadoop.rdf.mapreduce.count.NodeCountReducer;
-import org.apache.jena.hadoop.rdf.mapreduce.count.TripleNodeCountMapper;
-import org.apache.jena.hadoop.rdf.types.NodeWritable;</p>
-<p>public class RdfMapReduceExample {</p>
-<div class="codehilite"><pre><span class="n">public</span> <span class="n">static</span> <span class="n">void</span> <span class="n">main</span><span class="p">(</span><span class="n">String</span><span class="p">[]</span> <span class="n">args</span><span class="p">)</span> <span class="p">{</span>
-    <span class="k">try</span> <span class="p">{</span>
-        <span class="o">//</span> <span class="n">Get</span> <span class="n">Hadoop</span> <span class="n">configuration</span>
-        <span class="n">Configuration</span> <span class="n">config</span> <span class="p">=</span> <span class="n">new</span> <span class="n">Configuration</span><span class="p">(</span><span class="n">true</span><span class="p">);</span>
-
-        <span class="o">//</span> <span class="n">Create</span> <span class="n">job</span>
-        <span class="n">Job</span> <span class="n">job</span> <span class="p">=</span> <span class="n">Job</span><span class="p">.</span><span class="n">getInstance</span><span class="p">(</span><span class="n">config</span><span class="p">);</span>
-        <span class="n">job</span><span class="p">.</span><span class="n">setJarByClass</span><span class="p">(</span><span class="n">RdfMapReduceExample</span><span class="p">.</span><span class="n">class</span><span class="p">);</span>
-        <span class="n">job</span><span class="p">.</span><span class="n">setJobName</span><span class="p">(</span>&quot;<span class="n">RDF</span> <span class="n">Triples</span> <span class="n">Node</span> <span class="n">Usage</span> <span class="n">Count</span>&quot;<span class="p">);</span>
-
-        <span class="o">//</span> <span class="n">Map</span><span class="o">/</span><span class="n">Reduce</span> <span class="n">classes</span>
-        <span class="n">job</span><span class="p">.</span><span class="n">setMapperClass</span><span class="p">(</span><span class="n">TripleNodeCountMapper</span><span class="p">.</span><span class="n">class</span><span class="p">);</span>
-        <span class="n">job</span><span class="p">.</span><span class="n">setMapOutputKeyClass</span><span class="p">(</span><span class="n">NodeWritable</span><span class="p">.</span><span class="n">class</span><span class="p">);</span>
-        <span class="n">job</span><span class="p">.</span><span class="n">setMapOutputValueClass</span><span class="p">(</span><span class="n">LongWritable</span><span class="p">.</span><span class="n">class</span><span class="p">);</span>
-        <span class="n">job</span><span class="p">.</span><span class="n">setReducerClass</span><span class="p">(</span><span class="n">NodeCountReducer</span><span class="p">.</span><span class="n">class</span><span class="p">);</span>
-
-        <span class="o">//</span> <span class="n">Input</span> <span class="n">and</span> <span class="n">Output</span>
-        <span class="n">job</span><span class="p">.</span><span class="n">setInputFormatClass</span><span class="p">(</span><span class="n">TriplesInputFormat</span><span class="p">.</span><span class="n">class</span><span class="p">);</span>
-        <span class="n">job</span><span class="p">.</span><span class="n">setOutputFormatClass</span><span class="p">(</span><span class="n">NTriplesNodeOutputFormat</span><span class="p">.</span><span class="n">class</span><span class="p">);</span>
-        <span class="n">FileInputFormat</span><span class="p">.</span><span class="n">setInputPaths</span><span class="p">(</span><span class="n">job</span><span class="p">,</span> <span class="n">new</span> <span class="n">Path</span><span class="p">(</span>&quot;<span class="o">/</span><span class="n">example</span><span class="o">/</span><span class="n">input</span><span class="o">/</span>&quot;<span class="p">));</span>
-        <span class="n">FileOutputFormat</span><span class="p">.</span><span class="n">setOutputPath</span><span class="p">(</span><span class="n">job</span><span class="p">,</span> <span class="n">new</span> <span class="n">Path</span><span class="p">(</span>&quot;<span class="o">/</span><span class="n">example</span><span class="o">/</span><span class="n">output</span><span class="o">/</span>&quot;<span class="p">));</span>
-
-        <span class="o">//</span> <span class="n">Launch</span> <span class="n">the</span> <span class="n">job</span> <span class="n">and</span> <span class="n">await</span> <span class="n">completion</span>
-        <span class="n">job</span><span class="p">.</span><span class="n">submit</span><span class="p">();</span>
-        <span class="k">if</span> <span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">monitorAndPrintJob</span><span class="p">())</span> <span class="p">{</span>
-            <span class="o">//</span> <span class="n">OK</span>
-            <span class="n">System</span><span class="p">.</span><span class="n">out</span><span class="p">.</span><span class="n">println</span><span class="p">(</span>&quot;<span class="n">Completed</span>&quot;<span class="p">);</span>
-        <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
-            <span class="o">//</span> <span class="n">Failed</span>
-            <span class="n">System</span><span class="p">.</span><span class="n">err</span><span class="p">.</span><span class="n">println</span><span class="p">(</span>&quot;<span class="n">Failed</span>&quot;<span class="p">);</span>
+    <span class="n">public</span> <span class="n">static</span> <span class="n">void</span> <span class="n">main</span><span class="p">(</span><span class="n">String</span><span class="p">[]</span> <span class="n">args</span><span class="p">)</span> <span class="p">{</span>
+        <span class="k">try</span> <span class="p">{</span>
+            <span class="o">//</span> <span class="n">Get</span> <span class="n">Hadoop</span> <span class="n">configuration</span>
+            <span class="n">Configuration</span> <span class="n">config</span> <span class="p">=</span> <span class="n">new</span> <span class="n">Configuration</span><span class="p">(</span><span class="n">true</span><span class="p">);</span>
+
+            <span class="o">//</span> <span class="n">Create</span> <span class="n">job</span>
+            <span class="n">Job</span> <span class="n">job</span> <span class="p">=</span> <span class="n">Job</span><span class="p">.</span><span class="n">getInstance</span><span class="p">(</span><span class="n">config</span><span class="p">);</span>
+            <span class="n">job</span><span class="p">.</span><span class="n">setJarByClass</span><span class="p">(</span><span class="n">RdfMapReduceExample</span><span class="p">.</span><span class="n">class</span><span class="p">);</span>
+            <span class="n">job</span><span class="p">.</span><span class="n">setJobName</span><span class="p">(</span>&quot;<span class="n">RDF</span> <span class="n">Triples</span> <span class="n">Node</span> <span class="n">Usage</span> <span class="n">Count</span>&quot;<span class="p">);</span>
+
+            <span class="o">//</span> <span class="n">Map</span><span class="o">/</span><span class="n">Reduce</span> <span class="n">classes</span>
+            <span class="n">job</span><span class="p">.</span><span class="n">setMapperClass</span><span class="p">(</span><span class="n">TripleNodeCountMapper</span><span class="p">.</span><span class="n">class</span><span class="p">);</span>
+            <span class="n">job</span><span class="p">.</span><span class="n">setMapOutputKeyClass</span><span class="p">(</span><span class="n">NodeWritable</span><span class="p">.</span><span class="n">class</span><span class="p">);</span>
+            <span class="n">job</span><span class="p">.</span><span class="n">setMapOutputValueClass</span><span class="p">(</span><span class="n">LongWritable</span><span class="p">.</span><span class="n">class</span><span class="p">);</span>
+            <span class="n">job</span><span class="p">.</span><span class="n">setReducerClass</span><span class="p">(</span><span class="n">NodeCountReducer</span><span class="p">.</span><span class="n">class</span><span class="p">);</span>
+
+            <span class="o">//</span> <span class="n">Input</span> <span class="n">and</span> <span class="n">Output</span>
+            <span class="n">job</span><span class="p">.</span><span class="n">setInputFormatClass</span><span class="p">(</span><span class="n">TriplesInputFormat</span><span class="p">.</span><span class="n">class</span><span class="p">);</span>
+            <span class="n">job</span><span class="p">.</span><span class="n">setOutputFormatClass</span><span class="p">(</span><span class="n">NTriplesNodeOutputFormat</span><span class="p">.</span><span class="n">class</span><span class="p">);</span>
+            <span class="n">FileInputFormat</span><span class="p">.</span><span class="n">setInputPaths</span><span class="p">(</span><span class="n">job</span><span class="p">,</span> <span class="n">new</span> <span class="n">Path</span><span class="p">(</span>&quot;<span class="o">/</span><span class="n">example</span><span class="o">/</span><span class="n">input</span><span class="o">/</span>&quot;<span class="p">));</span>
+            <span class="n">FileOutputFormat</span><span class="p">.</span><span class="n">setOutputPath</span><span class="p">(</span><span class="n">job</span><span class="p">,</span> <span class="n">new</span> <span class="n">Path</span><span class="p">(</span>&quot;<span class="o">/</span><span class="n">example</span><span class="o">/</span><span class="n">output</span><span class="o">/</span>&quot;<span class="p">));</span>
+
+            <span class="o">//</span> <span class="n">Launch</span> <span class="n">the</span> <span class="n">job</span> <span class="n">and</span> <span class="n">await</span> <span class="n">completion</span>
+            <span class="n">job</span><span class="p">.</span><span class="n">submit</span><span class="p">();</span>
+            <span class="k">if</span> <span class="p">(</span><span class="n">job</span><span class="p">.</span><span class="n">monitorAndPrintJob</span><span class="p">())</span> <span class="p">{</span>
+                <span class="o">//</span> <span class="n">OK</span>
+                <span class="n">System</span><span class="p">.</span><span class="n">out</span><span class="p">.</span><span class="n">println</span><span class="p">(</span>&quot;<span class="n">Completed</span>&quot;<span class="p">);</span>
+            <span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
+                <span class="o">//</span> <span class="n">Failed</span>
+                <span class="n">System</span><span class="p">.</span><span class="n">err</span><span class="p">.</span><span class="n">println</span><span class="p">(</span>&quot;<span class="n">Failed</span>&quot;<span class="p">);</span>
+            <span class="p">}</span>
+        <span class="p">}</span> <span class="k">catch</span> <span class="p">(</span><span class="n">Throwable</span> <span class="n">e</span><span class="p">)</span> <span class="p">{</span>
+            <span class="n">e</span><span class="p">.</span><span class="n">printStackTrace</span><span class="p">();</span>
         <span class="p">}</span>
-    <span class="p">}</span> <span class="k">catch</span> <span class="p">(</span><span class="n">Throwable</span> <span class="n">e</span><span class="p">)</span> <span class="p">{</span>
-        <span class="n">e</span><span class="p">.</span><span class="n">printStackTrace</span><span class="p">();</span>
     <span class="p">}</span>
 <span class="p">}</span>
 </pre></div>
 
 
-<p>}</p>
+<p>So this really is no different from configuring any other Hadoop job: we simply point to the relevant input and output formats and provide our mapper and reducer.  Note that here we use the <code>TriplesInputFormat</code>, which can handle RDF in any Jena-supported format; if you know your RDF is in a specific format, it is usually more efficient to use a more specific input format.  Please see the <a href="io.html">IO</a> page for more detail on the available input formats and the differences between them.</p>
+<p>We recommend that you next take a look at our <a href="demo.html">RDF Stats Demo</a> which shows how to do some more complex computations by chaining multiple jobs together.</p>
 <h2 id="apis">APIs</h2>
 <p>There are three main libraries each with their own API:</p>
 <ul>