Posted to commits@jena.apache.org by rv...@apache.org on 2014/11/26 18:05:19 UTC

svn commit: r1641857 - /jena/site/trunk/content/documentation/hadoop/common.mdtext

Author: rvesse
Date: Wed Nov 26 17:05:19 2014
New Revision: 1641857

URL: http://svn.apache.org/r1641857
Log:
Start adding Common API page for RDF Tools for Hadoop docs

Added:
    jena/site/trunk/content/documentation/hadoop/common.mdtext

Added: jena/site/trunk/content/documentation/hadoop/common.mdtext
URL: http://svn.apache.org/viewvc/jena/site/trunk/content/documentation/hadoop/common.mdtext?rev=1641857&view=auto
==============================================================================
--- jena/site/trunk/content/documentation/hadoop/common.mdtext (added)
+++ jena/site/trunk/content/documentation/hadoop/common.mdtext Wed Nov 26 17:05:19 2014
@@ -0,0 +1,37 @@
+Title: RDF Tools for Apache Hadoop - Common API
+
+The Common API provides the basic data model for representing RDF data within Hadoop applications.  This primarily takes the form of `Writable` implementations and the necessary machinery to efficiently serialise and deserialise these.
+
+Currently we represent the three main RDF primitives - Nodes, Triples and Quads - though in future a wider range of primitives may be supported if we receive contributions to implement them.
+
+# RDF Primitives
+
+## Nodes
+
+The `Writable` type for nodes is, predictably enough, called `NodeWritable`.  It implements the `WritableComparable` interface, which means it can be used as a key, a value, or both in Map/Reduce.  In standard Hadoop style, a `get()` method returns the actual value as a Jena `Node` instance while a corresponding `set()` method allows the value to be set.
+
+Note that nodes are lazily converted to and from the underlying binary representation, so there is minimal overhead if you create a `NodeWritable` instance that is never actually read or written.
+
+`NodeWritable` supports, and automatically registers itself for, Hadoop's [`WritableComparator`](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/io/WritableComparator.html) mechanism.  This allows it to provide highly efficient binary comparisons on nodes, which helps reduce phases run faster by avoiding unnecessary deserialisation into POJOs.
+
+However, the downside is that the sort order for nodes may not be as natural as the sort order obtained using POJOs or when sorting with SPARQL.  Ultimately this is a performance trade-off, and in our experiments the benefits far outweigh the lack of a more natural sort order.
+
+## Triples
+
+The `Writable` type for triples is simply called `TripleWritable`, and it also implements the `WritableComparable` interface, meaning it may be used as a key, a value, or both.  Again, the standard Hadoop conventions of a `get()` and `set()` method to get/set the value as a Jena `Triple` are followed.
+
+Like the other primitives, it is lazily converted to and from the underlying binary representation, and it also supports and registers itself for Hadoop's `WritableComparator` mechanism.
+
+## Quads
+
+Finally, the `Writable` type for quads is simply called `QuadWritable`, and it implements the `WritableComparable` interface, making it usable as a key, a value, or both.  As with the other primitives, the standard Hadoop conventions of a `get()` and `set()` method are provided to get/set the value as a Jena `Quad`.
+
+Like the other primitives, it is lazily converted to and from the underlying binary representation, and it also supports and registers itself for Hadoop's `WritableComparator` mechanism.
+
+## Arbitrarily sized tuples
+
+In some cases you may have data that is RDF-like but not itself RDF, or that is a mix of triples and quads, in which case you may wish to use the `NodeTupleWritable`.  This represents an arbitrarily sized tuple consisting of zero or more `Node` instances; there is no restriction on the number of nodes per tuple and no requirement that tuple data be uniform.
+
+Like the other primitives, it implements `WritableComparable` and so can be used as a key, a value, or both.  However, this primitive does not support binary comparisons, meaning it may not perform as well as the other primitives.
+
+In this case the `get()` and `set()` methods get/set a `Tuple<Node>` instance, which is a convenience container class provided by ARQ.  Currently the implementation does not support lazy conversion, so the full `Tuple<Node>` is reconstructed as soon as a `NodeTupleWritable` instance is deserialised.
\ No newline at end of file
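
The page added above describes two mechanisms that may be easier to follow with a small illustration: lazy conversion between a value and its binary form, and comparison on raw bytes without deserialising. The following is a minimal sketch using only the Java standard library. It is not the Jena or Hadoop implementation: the class `LazyTermSketch`, its methods, and the use of a plain `String` as a stand-in for a Jena `Node` are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;

/**
 * Illustrative sketch only: NOT the Jena or Hadoop implementation.
 * LazyTermSketch and all of its methods are hypothetical names, and a
 * String stands in for a Jena Node.
 */
public class LazyTermSketch {
    private String value; // decoded form, null until first needed
    private byte[] bytes; // serialised form, null until first needed

    public LazyTermSketch(String value) {
        this.value = value;
    }

    /** Wraps already-serialised bytes without decoding them. */
    public static LazyTermSketch fromBytes(byte[] bytes) {
        LazyTermSketch term = new LazyTermSketch(null);
        term.bytes = bytes.clone();
        return term;
    }

    /** True once the decoded value has been materialised. */
    public boolean isDecoded() {
        return value != null;
    }

    /** Decodes on first access only (the lazy conversion). */
    public String get() {
        if (value == null) {
            value = new String(bytes, StandardCharsets.UTF_8);
        }
        return value;
    }

    /** Replaces the value and discards any stale serialised form. */
    public void set(String value) {
        this.value = value;
        this.bytes = null;
    }

    /** Serialises on first access only. */
    public byte[] toBytes() {
        if (bytes == null) {
            bytes = value.getBytes(StandardCharsets.UTF_8);
        }
        return bytes.clone();
    }

    /** Lexicographic comparison of unsigned bytes; nothing is decoded. */
    public static int compareRaw(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] nine = new LazyTermSketch("9").toBytes();
        byte[] ten = new LazyTermSketch("10").toBytes();

        // On raw bytes "10" sorts before "9", unlike the numeric order.
        System.out.println(compareRaw(ten, nine) < 0);

        // Wrapping bytes does not decode them until get() is called.
        LazyTermSketch lazy = fromBytes(ten);
        System.out.println(lazy.isDecoded());
        System.out.println(lazy.get());
    }
}
```

The `compareRaw` call shows the trade-off the page mentions: comparing serialised bytes avoids building objects, but the resulting order (here, "10" before "9") need not match a more natural ordering of the decoded values.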