Posted to commits@jena.apache.org by an...@apache.org on 2021/09/24 21:43:01 UTC

[jena-site] 02/02: Doc for RDF Binary with Protobuf

This is an automated email from the ASF dual-hosted git repository.

andy pushed a commit to branch protobuf
in repository https://gitbox.apache.org/repos/asf/jena-site.git

commit e6614b52907ac4901090eeecbdca8f856b94250d
Author: Andy Seaborne <an...@apache.org>
AuthorDate: Fri Sep 24 22:41:54 2021 +0100

    Doc for RDF Binary with Protobuf
---
 source/documentation/io/__index.md      |  28 +++---
 source/documentation/io/rdf-binary.md   | 161 ++++++++++++++++++++++++++++++--
 source/documentation/io/rdf-input.md    |  18 ++--
 source/documentation/io/rdf-output.md   |  27 +++---
 source/documentation/io/streaming-io.md |   5 +-
 5 files changed, 195 insertions(+), 44 deletions(-)

diff --git a/source/documentation/io/__index.md b/source/documentation/io/__index.md
index 2f34b40..59a76da 100644
--- a/source/documentation/io/__index.md
+++ b/source/documentation/io/__index.md
@@ -34,7 +34,7 @@ See "[Reading JSON-LD 1.1](json-ld-11.html)" for additional setup and use for
 reading JSON-LD 1.1. JSON-LD 1.0 is the current default in Jena.
 
 RDF Binary is a binary encoding of RDF (graphs and datasets) that can be useful
-for fast parsing.  See [RDF Binary using Apache Thrift](rdf-binary.html).
+for fast parsing.  See [RDF Binary](rdf-binary.html).
 
 ## Command line tools
 
@@ -49,18 +49,20 @@ These can be called directly as Java programs:
 The file extensions understood are:
 
 | &nbsp;Extension&nbsp; |&nbsp; Language&nbsp; |
-|-----------|------------|
-| `.ttl`    | Turtle     |
-| `.nt`     | N-Triples  |
-| `.nq`     | N-Quads    |
-| `.trig`   | TriG       |
-| `.rdf`    | RDF/XML    |
-| `.owl`    | RDF/XML    |
-| `.jsonld` | JSON-LD    |
-| `.trdf`   | RDF Thrift |
-| `.rt`     | RDF Thrift |
-| `.rj`     | RDF/JSON   |
-| `.trix`   | TriX       |
+|-----------|--------------|
+| `.ttl`    | Turtle       |
+| `.nt`     | N-Triples    |
+| `.nq`     | N-Quads      |
+| `.trig`   | TriG         |
+| `.rdf`    | RDF/XML      |
+| `.owl`    | RDF/XML      |
+| `.jsonld` | JSON-LD      |
+| `.trdf`   | RDF Thrift   |
+| `.rt`     | RDF Thrift   |
+| `.rpb`    | RDF Protobuf |
+| `.pbrdf`  | RDF Protobuf |
+| `.rj`     | RDF/JSON     |
+| `.trix`   | TriX         |
 
 `.n3` is supported but only as a synonym for Turtle.
 
diff --git a/source/documentation/io/rdf-binary.md b/source/documentation/io/rdf-binary.md
index 8f36982..7b1fa89 100644
--- a/source/documentation/io/rdf-binary.md
+++ b/source/documentation/io/rdf-binary.md
@@ -3,7 +3,9 @@ title: RDF Binary using Apache Thrift
 ---
 
 "RDF Binary" is a efficient format for RDF and RDF-related data using
-[Apache Thrift](https://thrift.apache.org/) as the binary encoding.
+[Apache Thrift](https://thrift.apache.org/)
+or [Google Protocol Buffers](https://developers.google.com/protocol-buffers)
+as the binary data encoding.
 
 The W3C standard RDF syntaxes are text or XML based.  These incur costs in
 parsing; the most human-readable formats also incur high costs to write, and
@@ -16,14 +18,14 @@ terms, then builds data formats for RDF graphs, RDF datasets, and for
 SPARQL result sets.  This gives a basis for high-performance linked data
 systems.
 
-[Apache Thrift](https://thrift.apache.org/) provides an efficient, 
-wide-used binary encoding layer with a large number of language bindings.
+[Thrift](https://thrift.apache.org/) and
+[Protobuf](https://developers.google.com/protocol-buffers) provide efficient,
+widely used binary encoding layers, each with a large number of language
+bindings.
 
 For more details of [RDF Thrift](http://afs.github.io/rdf-thrift).
 
-This pages gives the details of RDF Binary encoding in [Apache Thrift](http://thrift.apache.org/).
-
-## Thrift encoding of RDF Terms {#encoding-terms}
+## Thrift encoding of RDF Terms {#encoding-terms-thrift}
 
 RDF Thrift uses the Thrift compact protocol.
 
@@ -84,7 +86,7 @@ Source: [BinaryRDF.thrift](https://github.com/apache/jena/blob/main/jena-arq/Gra
     12: RDF_Decimal     valDecimal
     }
 
-### Thrift encoding of Triples, Quads and rows. {#encoding-tuples}
+### Thrift encoding of Triples, Quads and rows. {#encoding-thrift-tuples}
 
     struct RDF_Triple {
     1: required RDF_Term S
@@ -104,7 +106,7 @@ Source: [BinaryRDF.thrift](https://github.com/apache/jena/blob/main/jena-arq/Gra
     2: required string uri ;
     }
 
-### Thrift encoding of RDF Graphs and RDF Datasets {#encoding-graphs-datasets}
+### Thrift encoding of RDF Graphs and RDF Datasets {#encoding-thrift-graphs-datasets}
 
     union RDF_StreamRow {
     1: RDF_PrefixDecl   prefixDecl
@@ -116,7 +118,7 @@ RDF Graphs are encoded as a stream of `RDF_Triple` and `RDF_PrefixDecl`.
 
 RDF Datasets are encoded as a stream of `RDF_Triple`, `RDF-Quad` and `RDF_PrefixDecl`.
 
-### Thrift encoding of SPARQL Result Sets {#encoding-result-sets}
+### Thrift encoding of SPARQL Result Sets {#encoding-thrift-result-sets}
 
 A SPARQL Result Set is encoded as a list of variables (the header), then
 a stream of rows (the results).
@@ -128,3 +130,144 @@ a stream of rows (the results).
     struct RDF_DataTuple {
     1: list<RDF_Term> row
     }
+
+## Protobuf encoding of RDF Terms {#encoding-terms-protobuf}
+
+The Protobuf schema is similar to the Thrift schema.
+
+Source:
+[binary-rdf.proto](https://github.com/apache/jena/blob/main/jena-arq/Grammar/RDF-Protobuf/binary-rdf.proto)
+
+Streaming is used to allow for arbitrarily large graphs. Therefore the stream items
+(`RDF_StreamRow` below) are written with an initial length (`writeDelimitedTo`
+in the Java API).
+
+See
+[Protobuf Techniques Streaming](https://developers.google.com/protocol-buffers/docs/techniques#streaming).
+
+```
+syntax = "proto3";
+
+option java_package         = "org.apache.jena.riot.protobuf.wire" ;
+
+// Prefer one file with static inner classes.
+option java_outer_classname = "PB_RDF" ;
+// Optimize for speed (default)
+option optimize_for = SPEED ;
+
+//option java_multiple_files = true;
+// ==== RDF Term Definitions 
+
+message RDF_IRI {
+  string iri = 1 ;
+} 
+ 
+// A prefix name (abbrev for an IRI)
+message RDF_PrefixName {
+  string prefix = 1 ;
+  string localName = 2 ;
+} 
+
+message RDF_BNode {
+  string label = 1 ;
+  // 2 * fixed64
+} 
+
+// Common abbreviations for datatypes and other URIs?
+// union with additional values. 
+
+message RDF_Literal {
+  string lex = 1 ;
+  oneof literalKind {
+    bool simple = 9 ;
+    string langtag = 2 ;
+    string datatype = 3 ;
+    RDF_PrefixName dtPrefix = 4 ;
+  }
+}
+
+message RDF_Decimal {
+  sint64  value = 1 ;
+  sint32  scale = 2 ;
+}
+
+message RDF_Var {
+  string name = 1 ;
+}
+
+message RDF_ANY { }
+
+message RDF_UNDEF { }
+
+message RDF_REPEAT { }
+
+message RDF_Term {
+  oneof term {
+    RDF_IRI        iri        = 1 ;
+    RDF_BNode      bnode      = 2 ;
+    RDF_Literal    literal    = 3 ;
+    RDF_PrefixName prefixName = 4 ;
+    RDF_Var        variable   = 5 ;
+    RDF_Triple     tripleTerm = 6 ;
+    RDF_ANY        any        = 7 ;
+    RDF_UNDEF      undefined  = 8 ;
+    RDF_REPEAT     repeat     = 9 ;
+    
+    // Value forms of literals.
+    sint64         valInteger = 20 ;
+    double         valDouble  = 21 ;
+    RDF_Decimal    valDecimal = 22 ;
+  }
+}
+
+// === StreamRDF items 
+
+message RDF_Triple {
+  RDF_Term S = 1 ;
+  RDF_Term P = 2 ;
+  RDF_Term O = 3 ;
+}
+
+message RDF_Quad {
+  RDF_Term S = 1 ;
+  RDF_Term P = 2 ;
+  RDF_Term O = 3 ;
+  RDF_Term G = 4 ;
+}
+
+// Prefix declaration
+message RDF_PrefixDecl {
+  string prefix = 1;
+  string uri    = 2 ;
+}
+
+// StreamRDF
+message RDF_StreamRow {
+  oneof row {
+    RDF_PrefixDecl   prefixDecl  = 1 ;
+    RDF_Triple       triple      = 2 ;
+    RDF_Quad         quad        = 3 ;
+    RDF_IRI          base        = 4 ;
+  }
+}
+
+message RDF_Stream {
+  repeated RDF_StreamRow row = 1 ;
+}
+
+// ==== SPARQL Result Sets
+
+message RDF_VarTuple {
+  repeated RDF_Var vars = 1 ;
+}
+
+message RDF_DataTuple {
+  repeated RDF_Term row = 1 ;
+}
+
+// ==== RDF Graph
+
+message RDF_Graph {
+  repeated RDF_Triple triple = 1 ;
+}
+```
diff --git a/source/documentation/io/rdf-input.md b/source/documentation/io/rdf-input.md
index feaf8d7..46ee2df 100644
--- a/source/documentation/io/rdf-input.md
+++ b/source/documentation/io/rdf-input.md
@@ -67,18 +67,18 @@ as:
 
 The following is a suggested Apache httpd .htaccess file:
 
-    AddType  text/turtle             .ttl
-    AddType  application/rdf+xml     .rdf
-    AddType  application/n-triples   .nt
+    AddType  text/turtle               .ttl
+    AddType  application/rdf+xml       .rdf
+    AddType  application/n-triples     .nt
 
-    AddType  application/ld+json     .jsonld
-    AddType  application/owl+xml     .owl
+    AddType  application/ld+json       .jsonld
 
-    AddType  text/trig               .trig
-    AddType  application/n-quads     .nq
+    AddType  text/trig                 .trig
+    AddType  application/n-quads       .nq
 
-    AddType  application/trix+xml    .trix
-    AddType  application/rdf+thrift  .trdf
+    AddType  application/trix+xml      .trix
+    AddType  application/rdf+thrift    .rt
+    AddType  application/rdf+protobuf  .rpb
 
 ### Example 1 : Using the RDFDataMgr {#using-rdfdatamgr}
 
diff --git a/source/documentation/io/rdf-output.md b/source/documentation/io/rdf-output.md
index 4f078cf..fe75c46 100644
--- a/source/documentation/io/rdf-output.md
+++ b/source/documentation/io/rdf-output.md
@@ -17,7 +17,7 @@ See [Reading RDF](rdf-input.html) for details of the RIOT Reader system.
   - [Turtle and Trig format options](#opt-turtle-trig)
   - [N-Triples and N-Quads](#n-triples-and-n-quads)
   - [JSON-LD](#json-ld)
-  - [RDF Binary](#rdf-thrift)
+  - [RDF Binary](#rdf-binary)
   - [RDF/XML](#rdfxml)
 - [Examples](#examples)
 - [Notes](#notes)
@@ -110,9 +110,10 @@ an `RDFFormat` internally.  The normal writers are:
 | RDFXML            | RDF/XML, pretty printed |
 | RDFJSON           |                         |
 | TRIX              |                         |
-| RDFTHRFT          | RDF Thrift              |
+| RDFTHRIFT         | RDF Binary Thrift       |
+| RDFPROTO          | RDF Binary Protobuf     |
 
-Pretty printed RDF/XML is also known as RDF/XML-ABBREV
+Pretty printed RDF/XML is also known as RDF/XML-ABBREV.
 
 ### Pretty Printed Languages
 
@@ -369,21 +370,25 @@ cases.
 What can be done, and how it can be, is explained in the 
 [sample code](https://github.com/apache/jena/tree/main/jena-arq/src-examples/arq/examples/riot/Ex_WriteJsonLD.java).
 
-### RDF Binary {#rdf-thrift}
+### RDF Binary {#rdf-binary}
 
 [This is a binary encoding](rdf-binary.html) using 
-[Apache Thrift](https://thrift.apache.org/) for RDF Graphs
+[Apache Thrift](https://thrift.apache.org/) or 
+[Google Protocol Buffers](https://developers.google.com/protocol-buffers)
+for RDF Graphs
 and RDF Datasets, as well as SPARQL Result Sets, and it provides faster parsing
 compared to the text-based standardised syntax such as N-triples, Turtle or RDF/XML.
 
-| RDFFormat        |
-|------------------|
-| RDFTHRIFT        |
-| RDFTHRIFT_VALUES |
+| RDFFormat         |
+|-------------------|
+| RDF_THRIFT        |
+| RDF_THRIFT_VALUES |
+| RDF_PROTO         |
+| RDF_PROTO_VALUES  |
 
-`RDFTHRIFT_VALUES` is a variant where numeric values are written as values,
+`RDF_THRIFT_VALUES` and `RDF_PROTO_VALUES` are variants where numeric values are written as values,
 not as lexical format and datatype.  See the 
-[description of RDF Thrift](http://afs.github.io/rdf-thrift)
+[description of RDF Binary](rdf-binary.html)
 for discussion.
 
 ### RDF/XML {#rdfxml}
diff --git a/source/documentation/io/streaming-io.md b/source/documentation/io/streaming-io.md
index dc866d8..73842a3 100644
--- a/source/documentation/io/streaming-io.md
+++ b/source/documentation/io/streaming-io.md
@@ -7,8 +7,8 @@ fashion. Streaming can be used for manipulating RDF at scale.  Jena
 provides high performance readers and writers for all standard RDF formats,
 and it can be extended with custom formats.
 
-The [RDF Binary using Apache Thrift](rdf-binary.html) provides the highest
-input parsing performance.  N-Triples/N-Quads provide the highest
+The [RDF Binary](rdf-binary.html) format provides the highest
+input parsing performance. N-Triples/N-Quads provide the highest
 input parsing performance using W3C Standards.
 
 Files ending in `.gz` are assumed to be gzip-compressed. Input and output
@@ -105,3 +105,4 @@ N-Triples and N-Quads are always written as a stream.
 | `RDFFormat.NQUADS_ASCII`   |                  |
 | `RDFFormat.TRIX`           | `Lang.TRIX`      |
 | `RDFFormat.RDF_THRIFT`     | `Lang.RDFTHRIFT` |
+| `RDFFormat.RDF_PROTO`      | `Lang.RDFPROTO`  |