Posted to commits@jena.apache.org by bu...@apache.org on 2014/11/27 15:58:33 UTC

svn commit: r930777 - in /websites/staging/jena/trunk/content: ./ documentation/hadoop/io.html

Author: buildbot
Date: Thu Nov 27 14:58:33 2014
New Revision: 930777

Log:
Staging update by buildbot for jena

Modified:
    websites/staging/jena/trunk/content/   (props changed)
    websites/staging/jena/trunk/content/documentation/hadoop/io.html

Propchange: websites/staging/jena/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Thu Nov 27 14:58:33 2014
@@ -1 +1 @@
-1642124
+1642168

Modified: websites/staging/jena/trunk/content/documentation/hadoop/io.html
==============================================================================
--- websites/staging/jena/trunk/content/documentation/hadoop/io.html (original)
+++ websites/staging/jena/trunk/content/documentation/hadoop/io.html Thu Nov 27 14:58:33 2014
@@ -158,6 +158,7 @@
 </li>
 <li><a href="#output">Output</a><ul>
 <li><a href="#blank-nodes-in-output">Blank Nodes in Output</a></li>
+<li><a href="#node-output-format">Node Output Format</a></li>
 </ul>
 </li>
 <li><a href="#rdf-serialisation-support">RDF Serialisation Support</a><ul>
@@ -165,6 +166,13 @@
 <li><a href="#output_1">Output</a></li>
 </ul>
 </li>
+<li><a href="#configuration-options">Configuration Options</a><ul>
+<li><a href="#input-lines-per-batch">Input Lines per Batch</a></li>
+<li><a href="#max-line-length">Max Line Length</a></li>
+<li><a href="#ignoring-bad-tuples">Ignoring Bad Tuples</a></li>
+<li><a href="#output-batch-size">Output Batch Size</a></li>
+</ul>
+</li>
 </ul>
 </li>
 </ul>
@@ -191,7 +199,8 @@
 <p>Essentially Jena contains functionality that allows it to predictably generate identifiers from the original identifier present in the file e.g. <code>_:blank</code>.  This means that wherever <code>_:blank</code> appears  in the original file we are guaranteed to assign it the same internal identifier.  Note that this functionality uses a seed value to ensure that blank nodes coming from different input files are not assigned the same identifier.  When used with Hadoop this seed is chosen based on a combination of the Job ID and the input file path.  This means that the same file processed by different jobs will produce different blank node identifiers each time.</p>
 <p>Additionally the binary serialisation we use for our RDF primitives (described on the <a href="common.html">Common API</a> page) guarantees that internal identifiers are preserved as-is when communicating values across the cluster.</p>
 <h3 id="mixed-inputs">Mixed Inputs</h3>
-<p>In many cases your input data may be in a variety of different</p>
+<p>In many cases your input data may be in a variety of different RDF formats, in which case we have you covered.  The <code>TriplesInputFormat</code>, <code>QuadsInputFormat</code> and <code>TriplesOrQuadsInputFormat</code> can handle a mixture of triples, quads, or both as desired.  Note that in the case of <code>TriplesOrQuadsInputFormat</code> any triples are up-cast into quads in the default graph.</p>
+<p>With mixed inputs the specific input format to use for each file is determined by its file extension; unrecognised extensions will result in an <code>IOException</code>.  Compression is handled automatically: you simply need to name your files appropriately to indicate the type of compression used e.g. <code>example.ttl.gz</code> would be treated as GZipped Turtle (a decent compression tool will have done this for you).  The downside of mixed inputs is that the format is decided quite late, which means inputs are always processed as whole files because the format is not determined until after the inputs have been asked to split.</p>
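As a minimal sketch, a job accepting heterogeneous triple and quad inputs might be configured as follows.  Note the package name for <code>TriplesOrQuadsInputFormat</code> is an assumption based on the usual jena-hadoop-rdf layout and should be verified against your version:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Assumed package; check against your jena-hadoop-rdf version
import org.apache.jena.hadoop.rdf.io.input.TriplesOrQuadsInputFormat;

public class MixedInputJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Mixed RDF Input Example");
        job.setJarByClass(MixedInputJob.class);

        // TriplesOrQuadsInputFormat picks a parser per file based on its
        // extension; triples are up-cast into quads in the default graph.
        // Compressed inputs (e.g. example.ttl.gz) are detected automatically.
        job.setInputFormatClass(TriplesOrQuadsInputFormat.class);

        // Mixed .nt, .ttl, .nq, .trig etc. files can live side by side here
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // ... set mapper, reducer and output configuration as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```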
 <h2 id="output">Output</h2>
 <p>As with input we also need to be careful about how we output RDF data.  Similar to input some serialisations can be output in a streaming fashion while other serialisations require us to store up all the data and then write it out in one go at the end.  We use the same categorisations for output though the meanings are slightly different:</p>
 <ol>
@@ -204,9 +213,12 @@
 <p>As with input blank nodes provide a complicating factor in producing RDF output.  For whole file output formats this is not an issue but it does need to be considered for line and batch based formats.</p>
 <p>However what we have found in practice is that the Jena writers predictably map internal identifiers to the blank node identifiers in the output serialisations.  This means that even when processing output in batches the line/batch based formats correctly preserve blank node identity.</p>
 <p>If you are concerned about potential data corruption as a result of this then you should make sure to always choose a whole file output format but be aware that this can exhaust memory if your output is large.</p>
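Opting for a whole file format is then just a matter of choosing a different output format class on the job.  A hypothetical sketch (the class and package names here are assumptions drawn from the same codebase's naming conventions, not confirmed API):

```java
import org.apache.hadoop.mapreduce.Job;

// Assumed package and class name; check against your jena-hadoop-rdf version
import org.apache.jena.hadoop.rdf.io.output.rdfxml.RdfXmlOutputFormat;

public class WholeFileOutputConfig {
    // Selects a whole file output format: blank node identity is safe,
    // but the entire output must be held in memory before being written
    public static void configureOutput(Job job) {
        job.setOutputFormatClass(RdfXmlOutputFormat.class);
    }
}
```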
+<h3 id="node-output-format">Node Output Format</h3>
+<p>We also include a special <code>NTriplesNodeOutputFormat</code> which is capable of outputting pairs composed of a <code>NodeWritable</code> key and any value type.  Think of this as being similar to the standard Hadoop <code>TextOutputFormat</code> except it understands how to format nodes as valid NTriples serialisation.  This format is useful when performing simple statistical analysis such as node usage counts or other calculations over nodes.</p>
+<p>In the case where the value of the key value pair is also an RDF primitive, proper NTriples formatting is applied to each of the nodes in the value.</p>
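For example, the output side of a node usage count job might pair this format with <code>NodeWritable</code> keys and plain numeric values.  A sketch, with package names assumed from the jena-hadoop-rdf codebase:

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;

// Assumed packages; check against your jena-hadoop-rdf version
import org.apache.jena.hadoop.rdf.types.NodeWritable;
import org.apache.jena.hadoop.rdf.io.output.ntriples.NTriplesNodeOutputFormat;

public class NodeCountOutputConfig {
    // Configures the output side of a node usage count job: keys are
    // emitted as valid NTriples node serialisations, values as counts
    public static void configureOutput(Job job) {
        job.setOutputKeyClass(NodeWritable.class);
        job.setOutputValueClass(LongWritable.class);
        job.setOutputFormatClass(NTriplesNodeOutputFormat.class);
    }
}
```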
 <h2 id="rdf-serialisation-support">RDF Serialisation Support</h2>
 <h3 id="input_1">Input</h3>
-<p>The following table categorised how each supported RDF serialisation is processed for input.  Note that in some cases we offer multiple ways to process a serialisation.</p>
+<p>The following table categorises how each supported RDF serialisation is processed for input.  Note that in some cases we offer multiple ways to process a serialisation.</p>
 <table>
   <tr>
     <th>RDF Serialisation</th>
@@ -235,6 +247,64 @@
 </table>
 
 <h3 id="output_1">Output</h3>
+<p>The following table categorises how each supported RDF serialisation can be processed for output.  As with input some serialisations may be processed in multiple ways.</p>
+<table>
+  <tr>
+    <th>RDF Serialisation</th>
+    <th>Line Based</th>
+    <th>Batch Based</th>
+    <th>Whole File</th>
+  </tr>
+  <tr>
+    <th colspan="4">Triple Formats</th>
+  </tr>
+  <tr><td>NTriples</td><td>Yes</td><td>No</td><td>No</td></tr>
+  <tr><td>Turtle</td><td>Yes</td><td>Yes</td><td>No</td></tr>
+  <tr><td>RDF/XML</td><td>No</td><td>No</td><td>Yes</td></tr>
+  <tr><td>RDF/JSON</td><td>No</td><td>No</td><td>Yes</td></tr>
+  <tr>
+    <th colspan="4">Quad Formats</th>
+  </tr>
+  <tr><td>NQuads</td><td>Yes</td><td>No</td><td>No</td></tr>
+  <tr><td>TriG</td><td>Yes</td><td>Yes</td><td>No</td></tr>
+  <tr><td>TriX</td><td>Yes</td><td>No</td><td>No</td></tr>
+  <tr>
+    <th colspan="4">Triple/Quad Formats</th>
+  </tr>
+  <tr><td>JSON-LD</td><td>No</td><td>No</td><td>Yes</td></tr>
+  <tr><td>RDF Thrift</td><td>Yes</td><td>No</td><td>No</td></tr>
+</table>
+
+<h2 id="configuration-options">Configuration Options</h2>
+<p>There are several useful configuration options that can be used to tweak the behaviour of the RDF IO functionality if desired.</p>
+<h3 id="input-lines-per-batch">Input Lines per Batch</h3>
+<p>Since our line based input formats use the standard Hadoop <code>NLineInputFormat</code> to decide how to split up inputs we support the standard <code>mapreduce.input.lineinputformat.linespermap</code> configuration setting for changing the number of lines processed per map.</p>
+<p>You can set this directly in your configuration:</p>
+<div class="codehilite"><pre><span class="n">job</span><span class="p">.</span><span class="n">getConfiguration</span><span class="p">().</span><span class="n">setInt</span><span class="p">(</span><span class="n">NLineInputFormat</span><span class="p">.</span><span class="n">LINES_PER_MAP</span><span class="p">,</span> 100<span class="p">);</span>
+</pre></div>
+
+
+<p>Or you can use the convenience method of <code>NLineInputFormat</code> like so:</p>
+<div class="codehilite"><pre><span class="n">NLineInputFormat</span><span class="p">.</span><span class="n">setNumLinesPerMap</span><span class="p">(</span><span class="n">job</span><span class="p">,</span> 100<span class="p">);</span>
+</pre></div>
+
+
+<h3 id="max-line-length">Max Line Length</h3>
+<p>When using line based inputs it may be desirable to ignore lines that exceed a certain length (for example if you are not interested in really long literals).  Again we use the standard Hadoop configuration setting <code>mapreduce.input.linerecordreader.line.maxlength</code> to control this behaviour:</p>
+<div class="codehilite"><pre><span class="n">job</span><span class="p">.</span><span class="n">getConfiguration</span><span class="p">().</span><span class="n">setInt</span><span class="p">(</span><span class="n">HadoopIOConstants</span><span class="p">.</span><span class="n">MAX_LINE_LENGTH</span><span class="p">,</span> 8192<span class="p">);</span>
+</pre></div>
+
+
+<h3 id="ignoring-bad-tuples">Ignoring Bad Tuples</h3>
+<p>In many cases you may have data that you know contains invalid tuples; in such cases it can be useful to simply ignore the bad tuples and continue.  By default this behaviour is enabled, so bad tuples are skipped, though they will be logged as errors.  If you wish you can disable this behaviour via the <code>rdf.io.input.ignore-bad-tuples</code> configuration setting:</p>
+<div class="codehilite"><pre><span class="n">job</span><span class="p">.</span><span class="n">getConfiguration</span><span class="p">().</span><span class="n">setBoolean</span><span class="p">(</span><span class="n">RdfIOConstants</span><span class="p">.</span><span class="n">INPUT_IGNORE_BAD_TUPLES</span><span class="p">,</span> <span class="n">false</span><span class="p">);</span>
+</pre></div>
+
+
+<h3 id="output-batch-size">Output Batch Size</h3>
+<p>The batch size for batched output formats can be controlled by setting the <code>rdf.io.output.batch-size</code> property as desired.  If not explicitly configured the default value is 10,000:</p>
+<div class="codehilite"><pre><span class="n">job</span><span class="p">.</span><span class="n">getConfiguration</span><span class="p">().</span><span class="n">setInt</span><span class="p">(</span><span class="n">RdfIOConstants</span><span class="p">.</span><span class="n">OUTPUT_BATCH_SIZE</span><span class="p">,</span> 25000<span class="p">);</span>
+</pre></div>
   </div>
 </div>