Posted to commits@jena.apache.org by bu...@apache.org on 2015/01/05 14:06:29 UTC

svn commit: r935123 - in /websites/staging/jena/trunk/content: ./ documentation/hadoop/artifacts.html documentation/hadoop/common.html documentation/hadoop/index.html documentation/hadoop/io.html

Author: buildbot
Date: Mon Jan  5 13:06:28 2015
New Revision: 935123

Log:
Staging update by buildbot for jena

Modified:
    websites/staging/jena/trunk/content/   (props changed)
    websites/staging/jena/trunk/content/documentation/hadoop/artifacts.html
    websites/staging/jena/trunk/content/documentation/hadoop/common.html
    websites/staging/jena/trunk/content/documentation/hadoop/index.html
    websites/staging/jena/trunk/content/documentation/hadoop/io.html

Propchange: websites/staging/jena/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Mon Jan  5 13:06:28 2015
@@ -1 +1 @@
-1649075
+1649520

Modified: websites/staging/jena/trunk/content/documentation/hadoop/artifacts.html
==============================================================================
--- websites/staging/jena/trunk/content/documentation/hadoop/artifacts.html (original)
+++ websites/staging/jena/trunk/content/documentation/hadoop/artifacts.html Mon Jan  5 13:06:28 2015
@@ -19,7 +19,7 @@
     limitations under the License.
 -->
 
-  <title>Apache Jena - Maven Artifacts for Jena RDF Tools for Apache Hadoop</title>
+  <title>Apache Jena - Maven Artifacts for Apache Jena Elephas</title>
   <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
   <meta name="viewport" content="width=device-width, initial-scale=1.0">
 
@@ -141,8 +141,8 @@
 	<div class="row">
 	<div class="col-md-12">
 	<div id="breadcrumbs"></div>
-	<h1 class="title">Maven Artifacts for Jena RDF Tools for Apache Hadoop</h1>
-  <p>The Jena RDF Tools for Hadoop libraries are a collection of maven artifacts which can be used individually
+	<h1 class="title">Maven Artifacts for Apache Jena Elephas</h1>
+  <p>The Apache Jena Elephas libraries for Apache Hadoop are a collection of maven artifacts which can be used individually
 or together as desired.  These are available from the same locations as any other Jena
 artifact, see <a href="/download/maven.html">Using Jena with Maven</a> for more information.</p>
 <h1 id="hadoop-dependencies">Hadoop Dependencies</h1>
@@ -169,11 +169,11 @@ declare these basic dependencies as <cod
 
 <h1 id="jena-rdf-tools-for-apache-hadoop-artifacts">Apache Jena Elephas Artifacts</h1>
 <h2 id="common-api">Common API</h2>
-<p>The <code>jena-hadoop-rdf-common</code> artifact provides common classes for enabling RDF on Hadoop.  This is mainly
+<p>The <code>jena-elephas-common</code> artifact provides common classes for enabling RDF on Hadoop.  This is mainly
 composed of relevant <code>Writable</code> implementations for the various supported RDF primitives.</p>
 <div class="codehilite"><pre><span class="nt">&lt;dependency&gt;</span>
   <span class="nt">&lt;groupId&gt;</span>org.apache.jena<span class="nt">&lt;/groupId&gt;</span>
-  <span class="nt">&lt;artifactId&gt;</span>jena-hadoop-rdf-common<span class="nt">&lt;/artifactId&gt;</span>
+  <span class="nt">&lt;artifactId&gt;</span>jena-elephas-common<span class="nt">&lt;/artifactId&gt;</span>
   <span class="nt">&lt;version&gt;</span>x.y.z<span class="nt">&lt;/version&gt;</span>
 <span class="nt">&lt;/dependency&gt;</span>
 </pre></div>
@@ -183,7 +183,7 @@ composed of relevant <code>Writable</cod
 <p>The <a href="io.html">IO API</a> artifact provides support for reading and writing RDF in Hadoop:</p>
 <div class="codehilite"><pre><span class="nt">&lt;dependency&gt;</span>
   <span class="nt">&lt;groupId&gt;</span>org.apache.jena<span class="nt">&lt;/groupId&gt;</span>
-  <span class="nt">&lt;artifactId&gt;</span>jena-hadoop-rdf-io<span class="nt">&lt;/artifactId&gt;</span>
+  <span class="nt">&lt;artifactId&gt;</span>jena-elephas-io<span class="nt">&lt;/artifactId&gt;</span>
   <span class="nt">&lt;version&gt;</span>x.y.z<span class="nt">&lt;/version&gt;</span>
 <span class="nt">&lt;/dependency&gt;</span>
 </pre></div>
@@ -194,7 +194,7 @@ composed of relevant <code>Writable</cod
 to help you get started writing Map/Reduce jobs over RDF data more quickly:</p>
 <div class="codehilite"><pre><span class="nt">&lt;dependency&gt;</span>
   <span class="nt">&lt;groupId&gt;</span>org.apache.jena<span class="nt">&lt;/groupId&gt;</span>
-  <span class="nt">&lt;artifactId&gt;</span>jena-hadoop-rdf-mapreduce<span class="nt">&lt;/artifactId&gt;</span>
+  <span class="nt">&lt;artifactId&gt;</span>jena-elephas-mapreduce<span class="nt">&lt;/artifactId&gt;</span>
   <span class="nt">&lt;version&gt;</span>x.y.z<span class="nt">&lt;/version&gt;</span>
 <span class="nt">&lt;/dependency&gt;</span>
 </pre></div>
@@ -205,7 +205,7 @@ to help you get started writing Map/Redu
 own RDF data:</p>
 <div class="codehilite"><pre><span class="nt">&lt;dependency&gt;</span>
   <span class="nt">&lt;groupId&gt;</span>org.apache.jena<span class="nt">&lt;/groupId&gt;</span>
-  <span class="nt">&lt;artifactId&gt;</span>jena-hadoop-rdf-stats<span class="nt">&lt;/artifactId&gt;</span>
+  <span class="nt">&lt;artifactId&gt;</span>jena-elephas-stats<span class="nt">&lt;/artifactId&gt;</span>
   <span class="nt">&lt;version&gt;</span>x.y.z<span class="nt">&lt;/version&gt;</span>
   <span class="nt">&lt;classifier&gt;</span>hadoop-job<span class="nt">&lt;/classifier&gt;</span>
 <span class="nt">&lt;/dependency&gt;</span>

Modified: websites/staging/jena/trunk/content/documentation/hadoop/common.html
==============================================================================
--- websites/staging/jena/trunk/content/documentation/hadoop/common.html (original)
+++ websites/staging/jena/trunk/content/documentation/hadoop/common.html Mon Jan  5 13:06:28 2015
@@ -19,7 +19,7 @@
     limitations under the License.
 -->
 
-  <title>Apache Jena - RDF Tools for Apache Hadoop - Common API</title>
+  <title>Apache Jena - Apache Jena Elephas - Common API</title>
   <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
   <meta name="viewport" content="width=device-width, initial-scale=1.0">
 
@@ -141,8 +141,8 @@
 	<div class="row">
 	<div class="col-md-12">
 	<div id="breadcrumbs"></div>
-	<h1 class="title">RDF Tools for Apache Hadoop - Common API</h1>
-  <p>The Common API provides the basic data model for representing RDF data within Hadoop applications.  This primarily takes the form of <code>Writable</code> implementations and the necessary machinery to efficiently serialise and deserialise these.</p>
+	<h1 class="title">Apache Jena Elephas - Common API</h1>
+  <p>The Common API provides the basic data model for representing RDF data within Apache Hadoop applications.  This primarily takes the form of <code>Writable</code> implementations and the necessary machinery to efficiently serialise and deserialise these.</p>
 <p>Currently we represent the three main RDF primitives - Nodes, Triples and Quads - though in future a wider range of primitives may be supported if we receive contributions to implement them.</p>
 <h1 id="rdf-primitives">RDF Primitives</h1>
 <h2 id="nodes">Nodes</h2>
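[Editor's note outside the diff: the Common API page above describes `Writable` wrappers for RDF primitives. A minimal, untested sketch of using one follows; the `TripleWritable` class and all package names are assumptions based on the jena-elephas-common source layout and may differ between Jena versions.]

```java
// Sketch only: wraps a Jena Triple so it can be used as a Hadoop
// key/value.  Package names (org.apache.jena.hadoop.rdf.types etc.)
// are assumed, not confirmed by this page.
import org.apache.jena.graph.NodeFactory;
import org.apache.jena.graph.Triple;
import org.apache.jena.hadoop.rdf.types.TripleWritable;

public class TripleWritableSketch {
    public static void main(String[] args) {
        Triple t = Triple.create(
            NodeFactory.createURI("http://example.org/subject"),
            NodeFactory.createURI("http://example.org/predicate"),
            NodeFactory.createLiteral("object"));
        // The Writable wrapper handles serialisation across the cluster
        TripleWritable writable = new TripleWritable(t);
        System.out.println(writable.get());
    }
}
```

Requires the jena-elephas-common and Hadoop client jars on the classpath.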

Modified: websites/staging/jena/trunk/content/documentation/hadoop/index.html
==============================================================================
--- websites/staging/jena/trunk/content/documentation/hadoop/index.html (original)
+++ websites/staging/jena/trunk/content/documentation/hadoop/index.html Mon Jan  5 13:06:28 2015
@@ -19,7 +19,7 @@
     limitations under the License.
 -->
 
-  <title>Apache Jena - RDF Tools for Apache Hadoop</title>
+  <title>Apache Jena - Apache Jena Elephas</title>
   <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
   <meta name="viewport" content="width=device-width, initial-scale=1.0">
 
@@ -141,9 +141,8 @@
 	<div class="row">
 	<div class="col-md-12">
 	<div id="breadcrumbs"></div>
-	<h1 class="title">RDF Tools for Apache Hadoop</h1>
-  <p>RDF Tools for Apache Hadoop is a set of libraries which provide various basic building blocks which enable
-you to start writing Hadoop based applications which work with RDF data.</p>
+	<h1 class="title">Apache Jena Elephas</h1>
+  <p>Apache Jena Elephas is a set of libraries which provide basic building blocks that enable you to start writing Apache Hadoop based applications which work with RDF data.</p>
 <p>Historically there has been no serious support for RDF within the Hadoop ecosystem and what support has existed has
 often been limited and task specific.  These libraries aim to be as generic as possible and provide the necessary
 infrastructure that enables developers to create their application specific logic without worrying about the
@@ -168,7 +167,7 @@ underlying plumbing.</p>
 <li><a href="artifacts.html">Maven Artifacts</a></li>
 </ul>
 <h2 id="overview">Overview</h2>
-<p>RDF Tools for Apache Hadoop is published as a set of Maven module via its <a href="artifacts.html">maven artifacts</a>.  The source for these libraries
+<p>Apache Jena Elephas is published as a set of Maven modules via its <a href="artifacts.html">maven artifacts</a>.  The source for these libraries
 may be <a href="/download/index.cgi">downloaded</a> as part of the source distribution.  These modules are built against the Hadoop 2.x APIs and no
 backwards compatibility for 1.x is provided.</p>
 <p>The core aim of these libraries is to provide the basic building blocks that allow users to start writing Hadoop applications that
@@ -189,12 +188,12 @@ a number of basic statistics over arbitr
 on what you are trying to do.  Typically you will likely need at least the IO library and possibly the Map/Reduce library:</p>
 <div class="codehilite"><pre><span class="nt">&lt;dependency&gt;</span>
   <span class="nt">&lt;groupId&gt;</span>org.apache.jena<span class="nt">&lt;/groupId&gt;</span>
-  <span class="nt">&lt;artifactId&gt;</span>jena-hadoop-rdf-io<span class="nt">&lt;/artifactId&gt;</span>
+  <span class="nt">&lt;artifactId&gt;</span>jena-elephas-io<span class="nt">&lt;/artifactId&gt;</span>
   <span class="nt">&lt;version&gt;</span>x.y.z<span class="nt">&lt;/version&gt;</span>
 <span class="nt">&lt;/dependency&gt;</span>
 <span class="nt">&lt;dependency&gt;</span>
   <span class="nt">&lt;groupId&gt;</span>org.apache.jena<span class="nt">&lt;/groupId&gt;</span>
-  <span class="nt">&lt;artifactId&gt;</span>jena-hadoop-rdf-mapreduce<span class="nt">&lt;/artifactId&gt;</span>
+  <span class="nt">&lt;artifactId&gt;</span>jena-elephas-mapreduce<span class="nt">&lt;/artifactId&gt;</span>
   <span class="nt">&lt;version&gt;</span>x.y.z<span class="nt">&lt;/version&gt;</span>
 <span class="nt">&lt;/dependency&gt;</span>
 </pre></div>
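[Editor's note outside the diff: with the two dependencies above in place, a minimal job might be wired up as follows. This is a hedged sketch: the input/output format class names and the example paths are assumptions drawn from the Elephas IO module layout, not from this page.]

```java
// Sketch of a minimal Elephas job: read any supported triple
// serialisation, write NTriples.  Class names are assumed.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.jena.hadoop.rdf.io.input.TriplesInputFormat;
import org.apache.jena.hadoop.rdf.io.output.ntriples.NTriplesOutputFormat;

public class ElephasJobSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(true));
        job.setJobName("RDF Example Job");
        // Elephas supplies the RDF-aware input/output formats
        job.setInputFormatClass(TriplesInputFormat.class);
        job.setOutputFormatClass(NTriplesOutputFormat.class);
        FileInputFormat.setInputPaths(job, new Path("/input/data.ttl"));
        FileOutputFormat.setOutputPath(job, new Path("/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```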

Modified: websites/staging/jena/trunk/content/documentation/hadoop/io.html
==============================================================================
--- websites/staging/jena/trunk/content/documentation/hadoop/io.html (original)
+++ websites/staging/jena/trunk/content/documentation/hadoop/io.html Mon Jan  5 13:06:28 2015
@@ -19,7 +19,7 @@
     limitations under the License.
 -->
 
-  <title>Apache Jena - RDF Tools for Apache Hadoop - IO API</title>
+  <title>Apache Jena - Apache Jena Elephas - IO API</title>
   <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
   <meta name="viewport" content="width=device-width, initial-scale=1.0">
 
@@ -141,8 +141,8 @@
 	<div class="row">
 	<div class="col-md-12">
 	<div id="breadcrumbs"></div>
-	<h1 class="title">RDF Tools for Apache Hadoop - IO API</h1>
-  <p>The IO API provides support for reading and writing RDF within Hadoop applications.  This is done by providing <code>InputFormat</code> and <code>OutputFormat</code> implementations that cover all the RDF serialisations that Jena supports.</p>
+	<h1 class="title">Apache Jena Elephas - IO API</h1>
+  <p>The IO API provides support for reading and writing RDF within Apache Hadoop applications.  This is done by providing <code>InputFormat</code> and <code>OutputFormat</code> implementations that cover all the RDF serialisations that Jena supports.</p>
 <div class="toc">
 <ul>
 <li><a href="#background-on-hadoop-io">Background on Hadoop IO</a><ul>
@@ -202,8 +202,9 @@
 </ol>
 <p>There is then also the question of whether a serialisation encodes triples, quads or can encode both.  Where a serialisation encodes both we provide two variants of it so you can choose whether you want to process it as triples/quads.</p>
 <h3 id="blank-nodes-in-input">Blank Nodes in Input</h3>
-<p>Note that readers familiar with RDF may be wondering how we cope with blank nodes when splitting input and that is an important concern.</p>
-<p>Essentially Jena contains functionality that allows it to predictably generate identifiers from the original identifier present in the file e.g. <code>_:blank</code>.  This means that wherever <code>_:blank</code> appears  in the original file we are guaranteed to assign it the same internal identifier.  Note that this functionality uses a seed value to ensure that blank nodes coming from different input files are not assigned the same identifier.  When used with Hadoop this seed is chosen based on a combination of the Job ID and the input file path.  This means that the same file processed by different jobs will produce different blank node identifiers each time.</p>
+<p>Note that readers familiar with RDF may be wondering how we cope with blank nodes when splitting input and this is an important issue to address.</p>
+<p>Essentially Jena contains functionality that allows it to predictably generate identifiers from the original identifier present in the file e.g. <code>_:blank</code>.  This means that wherever <code>_:blank</code> appears  in the original file we are guaranteed to assign it the same internal identifier.  Note that this functionality uses a seed value to ensure that blank nodes coming from different input files are not assigned the same identifier.</p>
+<p>When used with Hadoop this seed is chosen based on a combination of the Job ID and the input file path.  This means that the same file processed by different jobs will produce different blank node identifiers each time.  However within a job every read of the file will predictably generate blank node identifiers so splitting does not prevent correct blank node identification.</p>
 <p>Additionally the binary serialisation we use for our RDF primitives (described on the <a href="common.html">Common API</a>) page guarantees that internal identifiers are preserved as-is when communicating values across the cluster.</p>
 <h3 id="mixed-inputs">Mixed Inputs</h3>
 <p>In many cases your input data may be in a variety of different RDF formats, in which case we have you covered.  The <code>TriplesInputFormat</code>, <code>QuadsInputFormat</code> and <code>TriplesOrQuadsInputFormat</code> can handle a mixture of triples, quads, or both as desired.  Note that in the case of <code>TriplesOrQuadsInputFormat</code> any triples are up-cast into quads in the default graph.</p>
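[Editor's note outside the diff: wiring up mixed inputs as described above can be sketched briefly; the input format's package name is assumed from the Elephas IO module layout.]

```java
// Sketch: accept any mixture of triple and quad serialisations in one
// job; triples are up-cast into quads in the default graph.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.jena.hadoop.rdf.io.input.TriplesOrQuadsInputFormat;

public class MixedInputSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(true));
        // Mappers see every input record as a quad, whatever format
        // the source file used
        job.setInputFormatClass(TriplesOrQuadsInputFormat.class);
    }
}
```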
@@ -222,7 +223,7 @@
 <p>If you are concerned about potential data corruption as a result of this then you should make sure to always choose a whole file output format but be aware that this can exhaust memory if your output is large.</p>
 <h4 id="blank-node-divergence-in-multi-stage-pipelines">Blank Node Divergence in multi-stage pipelines</h4>
 <p>The other thing to consider with regards to blank nodes in output is that Hadoop will by default create multiple output files (one for each reducer) so even if consistent and valid blank nodes are output they may be spread over multiple files.</p>
-<p>In multi-stage pipelines you will need to manually concatenate these files back together (assuming they are in a format that allows this e.g. NTriples) as otherwise when you pass them as input to the next job the blank node identifiers will diverge from each other.  <a href="https://issues.apache.org/jira/browse/JENA-820">JENA-820</a> discusses this problem and introduces a special configuration setting that can be used to resolve this.  Note that even with this setting enabled some formats are not capable of respecting it, see the later section on <a href="#job-configuration-options">Job Configuration Options</a> for more details.</p>
+<p>In multi-stage pipelines you may need to manually concatenate these files back together (assuming they are in a format that allows this e.g. NTriples) as otherwise when you pass them as input to the next job the blank node identifiers will diverge from each other.  <a href="https://issues.apache.org/jira/browse/JENA-820">JENA-820</a> discusses this problem and introduces a special configuration setting that can be used to resolve this.  Note that even with this setting enabled some formats are not capable of respecting it, see the later section on <a href="#job-configuration-options">Job Configuration Options</a> for more details.</p>
 <p>An alternative workaround is to always use RDF Thrift as the intermediate output format since it preserves blank node identifiers precisely as they are seen.  This also has the advantage that RDF Thrift is extremely fast to read and write which can speed up multi-stage pipelines considerably.</p>
 <h3 id="node-output-format">Node Output Format</h3>
 <p>We also include a special <code>NTriplesNodeOutputFormat</code> which is capable of outputting pairs composed of a <code>NodeWritable</code> key and any value type.  Think of this as being similar to the standard Hadoop <code>TextOutputFormat</code> except it understands how to format nodes as valid NTriples serialisation.  This format is useful when performing simple statistical analysis such as node usage counts or other calculations over nodes.</p>
@@ -332,14 +333,15 @@
 
 
 <h4 id="global-blank-node-identity">Global Blank Node Identity</h4>
-<p>The default behaviour of these libraries is to allocate file scoped blank node identifiers in such a way that the same syntactic identifier read from the same file (even if by different nodes/processes) is allocated the same blank node ID.  However the same syntactic identifier in different files should result in different blank nodes.  However as discussed earlier in the case of multi-stage jobs the intermediate outputs may be split over several files which can cause the blank node identifiers to diverge from each other when they are read back in.</p>
-<p>For multi-stage jobs this is often (but not always) incorrect and undesirable behaviour in which case you can set the <code>rdf.io.input.bnodes.global-identity</code> property to true:</p>
+<p>The default behaviour of this library is to allocate file scoped blank node identifiers in such a way that the same syntactic identifier read from the same file is allocated the same blank node ID even across input splits within a job.  Conversely the same syntactic identifier in different input files will result in different blank nodes within a job.</p>
+<p>However as discussed earlier in the case of multi-stage jobs the intermediate outputs may be split over several files which can cause the blank node identifiers to diverge from each other when they are read back in by subsequent jobs.  For multi-stage jobs this is often (but not always) incorrect and undesirable behaviour in which case you will need to set the <code>rdf.io.input.bnodes.global-identity</code> property to true for the subsequent jobs:</p>
 <div class="codehilite"><pre><span class="n">job</span><span class="p">.</span><span class="n">getConfiguration</span><span class="p">().</span><span class="n">setBoolean</span><span class="p">(</span><span class="n">RdfIOConstants</span><span class="p">.</span><span class="n">GLOBAL_BNODE_IDENTITY</span><span class="p">,</span> <span class="n">true</span><span class="p">);</span>
 </pre></div>
 
 
-<p>Note however that not all formats are capable of honouring this option, notably RDF/XML and JSON-LD.</p>
-<p>As noted earlier an alternative workaround is to use RDF Thrift as the intermediate format since it guarantees to preserve blank node identifiers precisely.</p>
+<p><strong>Important</strong> - This should only be set for the later jobs in a multi-stage pipeline and should rarely (if ever) be set for single jobs or the first job of a pipeline.</p>
+<p>Even with this setting enabled, not all formats are capable of honouring it; in particular RDF/XML and JSON-LD will ignore this option and should be avoided as intermediate output formats.</p>
+<p>As noted earlier an alternative workaround to enabling this setting is to instead use RDF Thrift as the intermediate output format since it guarantees to preserve blank node identifiers as-is on both reads and writes.</p>
 <h4 id="output-batch-size">Output Batch Size</h4>
 <p>The batch size for batched output formats can be controlled by setting the <code>rdf.io.output.batch-size</code> property as desired.  The default value for this if not explicitly configured is 10,000:</p>
 <div class="codehilite"><pre><span class="n">job</span><span class="p">.</span><span class="n">getConfiguration</span><span class="p">().</span><span class="n">setInt</span><span class="p">(</span><span class="n">RdfIOConstants</span><span class="p">.</span><span class="n">OUTPUT_BATCH_SIZE</span><span class="p">,</span> 25000<span class="p">);</span>