Posted to commits@jena.apache.org by bu...@apache.org on 2014/12/01 15:51:35 UTC

svn commit: r931174 - in /websites/staging/jena/trunk/content: ./ documentation/hadoop/io.html

Author: buildbot
Date: Mon Dec  1 14:51:35 2014
New Revision: 931174

Log:
Staging update by buildbot for jena

Modified:
    websites/staging/jena/trunk/content/   (props changed)
    websites/staging/jena/trunk/content/documentation/hadoop/io.html

Propchange: websites/staging/jena/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Mon Dec  1 14:51:35 2014
@@ -1 +1 @@
-1642168
+1642695

Modified: websites/staging/jena/trunk/content/documentation/hadoop/io.html
==============================================================================
--- websites/staging/jena/trunk/content/documentation/hadoop/io.html (original)
+++ websites/staging/jena/trunk/content/documentation/hadoop/io.html Mon Dec  1 14:51:35 2014
@@ -157,7 +157,10 @@
 </ul>
 </li>
 <li><a href="#output">Output</a><ul>
-<li><a href="#blank-nodes-in-output">Blank Nodes in Output</a></li>
+<li><a href="#blank-nodes-in-output">Blank Nodes in Output</a><ul>
+<li><a href="#blank-node-divergence-in-multi-stage-pipelines">Blank Node Divergence in multi-stage pipelines</a></li>
+</ul>
+</li>
 <li><a href="#node-output-format">Node Output Format</a></li>
 </ul>
 </li>
@@ -166,16 +169,20 @@
 <li><a href="#output_1">Output</a></li>
 </ul>
 </li>
-<li><a href="#configuration-options">Configuration Options</a><ul>
+<li><a href="#job-setup">Job Setup</a><ul>
+<li><a href="#job-configuration-options">Job Configuration Options</a><ul>
 <li><a href="#input-lines-per-batch">Input Lines per Batch</a></li>
 <li><a href="#max-line-length">Max Line Length</a></li>
 <li><a href="#ignoring-bad-tuples">Ignoring Bad Tuples</a></li>
+<li><a href="#global-blank-node-identity">Global Blank Node Identity</a></li>
 <li><a href="#output-batch-size">Output Batch Size</a></li>
 </ul>
 </li>
 </ul>
 </li>
 </ul>
+</li>
+</ul>
 </div>
 <h1 id="background-on-hadoop-io">Background on Hadoop IO</h1>
 <p>If you are already familiar with the Hadoop IO paradigm then please skip this section; if not, please read on, as otherwise some of the later information will not make much sense.</p>
@@ -184,8 +191,9 @@
 <p>In some cases there are file formats that may be processed in multiple ways i.e. you can <em>split</em> them into pieces or you can process them as a whole.  Which approach you wish to use will depend on whether you have a single file or many files to process.  In the case of many files, processing files as a whole may provide better overall throughput than processing them as chunks.  However your mileage may vary, especially if your input data contains many files of uneven size.</p>
 <h2 id="compressed-io">Compressed IO</h2>
 <p>Hadoop natively provides support for compressed input and output, provided your Hadoop cluster is appropriately configured.  The advantage of compressing the input/output data is that there is less IO workload on the cluster; however, this comes with the disadvantage that most compression formats block Hadoop's ability to <em>split</em> up the input.</p>
+<p>Hadoop generally handles compression automatically, and all our input and output formats are capable of handling compressed input and output as necessary.</p>
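+<p>For example, if you want a job's final output to be compressed this can be requested via the standard Hadoop output compression settings.  A minimal sketch is shown below; the choice of <code>GzipCodec</code> is purely illustrative and any codec available on your cluster may be used:</p>
+<div class="codehilite"><pre><span class="c1">// Ask Hadoop to compress the job output, here using the Gzip codec</span>
+<span class="n">FileOutputFormat</span><span class="p">.</span><span class="n">setCompressOutput</span><span class="p">(</span><span class="n">job</span><span class="p">,</span> <span class="n">true</span><span class="p">);</span>
+<span class="n">FileOutputFormat</span><span class="p">.</span><span class="n">setOutputCompressorClass</span><span class="p">(</span><span class="n">job</span><span class="p">,</span> <span class="n">GzipCodec</span><span class="p">.</span><span class="k">class</span><span class="p">);</span>
+</pre></div>
+
+
+<p>Compressed input needs no extra configuration since the appropriate codec is typically selected automatically based on the file extension of the input files.</p>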
 <h1 id="rdf-io-in-hadoop">RDF IO in Hadoop</h1>
-<p>There are a wide range of RDF serialisations supported by ARQ, please see the <a href="../io/">RDF IO</a> for an overview of the formats that Jena supports.</p>
+<p>There is a wide range of RDF serialisations supported by ARQ; please see the <a href="../io/">RDF IO</a> documentation for an overview of the formats that Jena supports.  In this section we go into much more depth on exactly how we support RDF IO in Hadoop.</p>
 <h2 id="input">Input</h2>
 <p>One of the difficulties posed when wrapping these for Hadoop IO is that the formats have very different properties in terms of our ability to <em>split</em> them into distinct chunks for Hadoop to process.  So we categorise the possible ways to process RDF inputs as follows:</p>
 <ol>
@@ -213,6 +221,10 @@
 <p>As with input, blank nodes are a complicating factor in producing RDF output.  For whole file output formats this is not an issue, but it does need to be considered for line and batch based formats.</p>
 <p>However, what we have found in practice is that the Jena writers will predictably map internal identifiers to the blank node identifiers in the output serialisations.  What this means is that, even when processing output in batches, we've found that the line/batch based formats correctly preserve blank node identity.</p>
 <p>If you are concerned about potential data corruption as a result of this then you should make sure to always choose a whole file output format, but be aware that this can exhaust memory if your output is large.</p>
+<h4 id="blank-node-divergence-in-multi-stage-pipelines">Blank Node Divergence in multi-stage pipelines</h4>
+<p>The other thing to consider with regard to blank nodes in output is that Hadoop will by default create multiple output files (one for each reducer), so even if consistent and valid blank nodes are output they may be spread over multiple files.</p>
+<p>In multi-stage pipelines you will need to manually concatenate these files back together (assuming they are in a format that allows this e.g. NTriples), as otherwise when you pass them as input to the next job the blank node identifiers will diverge from each other.  <a href="https://issues.apache.org/jira/browse/JENA-820">JENA-820</a> discusses this problem and introduces a special configuration setting that can be used to resolve it.  Note that even with this setting enabled some formats are not capable of respecting it; see the later section on <a href="#job-configuration-options">Job Configuration Options</a> for more details.</p>
+<p>An alternative workaround is to always use RDF Thrift as the intermediate output format since it preserves blank node identifiers precisely as they are seen.  This also has the advantage that RDF Thrift is extremely fast to read and write, which can speed up multi-stage pipelines considerably.</p>
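+<p>As a rough sketch, a first job could write its intermediate results as RDF Thrift and the follow-on job read them back in as shown below.  The <code>ThriftTripleOutputFormat</code> and <code>ThriftTripleInputFormat</code> class names are assumed for illustration here; check the Javadocs for the exact implementations available:</p>
+<div class="codehilite"><pre><span class="c1">// First job writes its intermediate results as RDF Thrift</span>
+<span class="c1">// (format class names assumed for illustration)</span>
+<span class="n">firstJob</span><span class="p">.</span><span class="n">setOutputFormatClass</span><span class="p">(</span><span class="n">ThriftTripleOutputFormat</span><span class="p">.</span><span class="k">class</span><span class="p">);</span>
+<span class="n">FileOutputFormat</span><span class="p">.</span><span class="n">setOutputPath</span><span class="p">(</span><span class="n">firstJob</span><span class="p">,</span> <span class="k">new</span> <span class="n">Path</span><span class="p">(</span><span class="s">&quot;/users/example/intermediate&quot;</span><span class="p">));</span>
+
+<span class="c1">// Follow-on job reads the intermediate results back in; blank node</span>
+<span class="c1">// identifiers are preserved exactly as they were written</span>
+<span class="n">secondJob</span><span class="p">.</span><span class="n">setInputFormatClass</span><span class="p">(</span><span class="n">ThriftTripleInputFormat</span><span class="p">.</span><span class="k">class</span><span class="p">);</span>
+<span class="n">FileInputFormat</span><span class="p">.</span><span class="n">setInputPaths</span><span class="p">(</span><span class="n">secondJob</span><span class="p">,</span> <span class="k">new</span> <span class="n">Path</span><span class="p">(</span><span class="s">&quot;/users/example/intermediate&quot;</span><span class="p">));</span>
+</pre></div>
+
+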
 <h3 id="node-output-format">Node Output Format</h3>
 <p>We also include a special <code>NTriplesNodeOutputFormat</code> which is capable of outputting pairs composed of a <code>NodeWritable</code> key and any value type.  Think of this as being similar to the standard Hadoop <code>TextOutputFormat</code>, except that it understands how to format nodes as valid NTriples serialisation.  This format is useful when performing simple statistical analysis such as node usage counts or other calculations over nodes.</p>
 <p>In the case where the value of the key value pair is also an RDF primitive, proper NTriples formatting is also applied to each of the nodes in the value.</p>
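+<p>For example, a job that computes node usage counts might emit <code>NodeWritable</code> keys with <code>LongWritable</code> count values and use this format for its final output.  A minimal sketch of the relevant job setup is shown below (it assumes your mapper and reducer already produce these pair types):</p>
+<div class="codehilite"><pre><span class="c1">// Write each node formatted as valid NTriples alongside its usage count</span>
+<span class="n">job</span><span class="p">.</span><span class="n">setOutputFormatClass</span><span class="p">(</span><span class="n">NTriplesNodeOutputFormat</span><span class="p">.</span><span class="k">class</span><span class="p">);</span>
+<span class="n">job</span><span class="p">.</span><span class="n">setOutputKeyClass</span><span class="p">(</span><span class="n">NodeWritable</span><span class="p">.</span><span class="k">class</span><span class="p">);</span>
+<span class="n">job</span><span class="p">.</span><span class="n">setOutputValueClass</span><span class="p">(</span><span class="n">LongWritable</span><span class="p">.</span><span class="k">class</span><span class="p">);</span>
+<span class="n">FileOutputFormat</span><span class="p">.</span><span class="n">setOutputPath</span><span class="p">(</span><span class="n">job</span><span class="p">,</span> <span class="k">new</span> <span class="n">Path</span><span class="p">(</span><span class="s">&quot;/users/example/node-counts&quot;</span><span class="p">));</span>
+</pre></div>
+
+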
@@ -275,9 +287,28 @@
   <tr><td>RDF Thrift</td><td>Yes</td><td>No</td><td>No</td></tr>
 </table>
 
-<h2 id="configuration-options">Configuration Options</h2>
+<h2 id="job-setup">Job Setup</h2>
+<p>To use RDF as an input and/or output format you will need to configure your Job appropriately; this requires setting the input/output format and setting the data paths:</p>
+<div class="codehilite"><pre><span class="c1">// Create a job using default configuration</span>
+<span class="n">Job</span> <span class="n">job</span> <span class="o">=</span> <span class="n">Job</span><span class="p">.</span><span class="n">createInstance</span><span class="p">(</span><span class="k">new</span> <span class="n">Configuration</span><span class="p">(</span><span class="n">true</span><span class="p">));</span>
+
+<span class="c1">// Use Turtle as the input format</span>
+<span class="n">job</span><span class="p">.</span><span class="n">setInputFormatClass</span><span class="p">(</span><span class="n">TurtleInputFormat</span><span class="p">.</span><span class="k">class</span><span class="p">);</span>
+<span class="n">FileInputFormat</span><span class="p">.</span><span class="n">setInputPath</span><span class="p">(</span><span class="n">job</span><span class="p">,</span> <span class="s">&quot;/users/example/input&quot;</span><span class="p">);</span>
+
+<span class="c1">// Use NTriples as the output format</span>
+<span class="n">job</span><span class="p">.</span><span class="n">setOutputFormatClass</span><span class="p">(</span><span class="n">NTriplesOutputFormat</span><span class="p">.</span><span class="k">class</span><span class="p">);</span>
+<span class="n">FileOutputFormat</span><span class="p">.</span><span class="n">setOutputPath</span><span class="p">(</span><span class="n">job</span><span class="p">,</span> <span class="s">&quot;/users/example/output&quot;</span><span class="p">);</span>
+
+<span class="c1">// Other job configuration...</span>
+</pre></div>
+
+
+<p>This example reads its input in Turtle format from the directory <code>/users/example/input</code> and writes the end results as NTriples to the directory <code>/users/example/output</code>.</p>
+<p>Take a look at the <a href="../javadoc/hadoop/io/">Javadocs</a> to find the available input and output format implementations.</p>
+<h3 id="job-configuration-options">Job Configuration Options</h3>
 <p>There are several useful configuration options that can be used to tweak the behaviour of the RDF IO functionality if desired.</p>
-<h3 id="input-lines-per-batch">Input Lines per Batch</h3>
+<h4 id="input-lines-per-batch">Input Lines per Batch</h4>
 <p>Since our line based input formats use the standard Hadoop <code>NLineInputFormat</code> to decide how to split up inputs, we support the standard <code>mapreduce.input.lineinputformat.linespermap</code> configuration setting for changing the number of lines processed per map.</p>
 <p>You can set this directly in your configuration:</p>
 <div class="codehilite"><pre><span class="n">job</span><span class="p">.</span><span class="n">getConfiguration</span><span class="p">().</span><span class="n">setInt</span><span class="p">(</span><span class="n">NLineInputFormat</span><span class="p">.</span><span class="n">LINES_PER_MAP</span><span class="p">,</span> 100<span class="p">);</span>
@@ -289,19 +320,28 @@
 </pre></div>
 
 
-<h3 id="max-line-length">Max Line Length</h3>
+<h4 id="max-line-length">Max Line Length</h4>
 <p>When using line based inputs it may be desirable to ignore lines that exceed a certain length (for example if you are not interested in really long literals).  Again we use the standard Hadoop configuration setting <code>mapreduce.input.linerecordreader.line.maxlength</code> to control this behaviour:</p>
 <div class="codehilite"><pre><span class="n">job</span><span class="p">.</span><span class="n">getConfiguration</span><span class="p">().</span><span class="n">setInt</span><span class="p">(</span><span class="n">HadoopIOConstants</span><span class="p">.</span><span class="n">MAX_LINE_LENGTH</span><span class="p">,</span> 8192<span class="p">);</span>
 </pre></div>
 
 
-<h3 id="ignoring-bad-tuples">Ignoring Bad Tuples</h3>
+<h4 id="ignoring-bad-tuples">Ignoring Bad Tuples</h4>
 <p>In many cases you may have data that you know contains invalid tuples; in such cases it can be useful to simply ignore the bad tuples and continue.  By default we enable this behaviour and will skip over bad tuples, though they will be logged as an error.  If you want, you can disable this behaviour by setting the <code>rdf.io.input.ignore-bad-tuples</code> configuration setting:</p>
 <div class="codehilite"><pre><span class="n">job</span><span class="p">.</span><span class="n">getConfiguration</span><span class="p">().</span><span class="n">setBoolean</span><span class="p">(</span><span class="n">RdfIOConstants</span><span class="p">.</span><span class="n">INPUT_IGNORE_BAD_TUPLES</span><span class="p">,</span> <span class="n">false</span><span class="p">);</span>
 </pre></div>
 
 
-<h3 id="output-batch-size">Output Batch Size</h3>
+<h4 id="global-blank-node-identity">Global Blank Node Identity</h4>
+<p>The default behaviour of these libraries is to allocate file scoped blank node identifiers in such a way that the same syntactic identifier read from the same file (even if by different nodes/processes) is allocated the same blank node ID, while the same syntactic identifier in different files results in different blank nodes.  However, as discussed earlier, in the case of multi-stage jobs the intermediate outputs may be split over several files, which can cause the blank node identifiers to diverge from each other when they are read back in.</p>
+<p>For multi-stage jobs this is often (but not always) incorrect and undesirable behaviour, in which case you can set the <code>rdf.io.input.bnodes.global-identity</code> property to true:</p>
+<div class="codehilite"><pre><span class="n">job</span><span class="p">.</span><span class="n">getConfiguration</span><span class="p">.</span><span class="n">setBoolean</span><span class="p">(</span><span class="n">RdfIOConstants</span><span class="p">.</span><span class="n">GLOBAL_BNODE_IDENTITY</span><span class="p">,</span> <span class="n">true</span><span class="p">);</span>
+</pre></div>
+
+
+<p>Note, however, that not all formats are capable of honouring this option; notably RDF/XML and JSON-LD cannot.</p>
+<p>As noted earlier an alternative workaround is to use RDF Thrift as the intermediate format since it guarantees to preserve blank node identifiers precisely.</p>
+<h4 id="output-batch-size">Output Batch Size</h4>
 <p>The batch size for batched output formats can be controlled by setting the <code>rdf.io.output.batch-size</code> property as desired.  The default value, if not explicitly configured, is 10,000:</p>
 <div class="codehilite"><pre><span class="n">job</span><span class="p">.</span><span class="n">getConfiguration</span><span class="p">.</span><span class="n">setInt</span><span class="p">(</span><span class="n">RdfIOConstants</span><span class="p">.</span><span class="n">OUTPUT_BATCH_SIZE</span><span class="p">,</span> 25000<span class="p">);</span>
 </pre></div>