Posted to dev@jena.apache.org by "Andy Seaborne (JIRA)" <ji...@apache.org> on 2014/03/14 14:57:43 UTC

[jira] [Comment Edited] (JENA-625) Data Tables for SPARQL

    [ https://issues.apache.org/jira/browse/JENA-625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13935035#comment-13935035 ] 

Andy Seaborne edited comment on JENA-625 at 3/14/14 1:57 PM:
-------------------------------------------------------------

h1. Data Tables for SPARQL

This project is about getting CSVs into a form that is amenable to SPARQL processing, and doing so in a way that is not specific to CSV files.  The project includes getting the right architecture in place for regular, table-shaped data.  The core abstraction is the PropertyTable.

A PropertyTable is a collection of data that is sufficiently regular in shape that it can be treated as a table.  That means each subject has a value for each one of a set of properties.  Irregularity in terms of missing values needs to be handled, but not multiple values for the same property.

With special storage, a PropertyTable 

* is more compact and more amenable to custom storage (e.g. a JSON document store)
* can have custom indexes on specific columns
* can guarantee access orders 

Providing these features is out of scope of the project but the architecture of the work must be mindful of these possibilities.

This project will involve basic mapping of CSV to RDF using a fixed algorithm, including interpreting data as numbers or strings.  The project is not attempting a fully configurable, template- or rule-based translation of CSV to RDF.  The W3C CSV working group is working on a standard version of that and will not deliver a sufficiently stable spec in the timeframe of this project.

(Background: see the RDB Direct Mapping http://www.w3.org/TR/rdb-direct-mapping/)

h2. Example

Suppose we have a CSV file:
{noformat}
Town,Population
Southton,123000
Northville,654000
{noformat}

which has one header row and two data rows.

As RDF this might be viewable as:

{noformat}
@prefix : <http://example/table> .
@prefix csv: <http://w3c/future-csv-vocab/> .

[ csv:row 1 ; :Town "Southton"   ; :Population 123000 ] .
[ csv:row 2 ; :Town "Northville" ; :Population 654000 ] .
{noformat}

or without the bnode abbreviation:
{noformat}
@prefix : <http://example/table> .
@prefix csv: <http://w3c/future-csv-vocab/> .

_:b0  csv:row 1 ;
      :Town "Southton" ;
      :Population 123000 .

_:b1  csv:row 2 ;
      :Town "Northville" ;
      :Population 654000 .
{noformat}


Each row models one "entity" (here, a population observation).  There is a subject (a blank node) and one predicate-value pair for each cell of the row.  Row numbers are added because row order can be important.

Now the CSV file is viewed as a graph - normal, unmodified SPARQL can be used.  Multiple CSV files can be multiple graphs in one dataset, giving query across different data sources.

{noformat}
# Towns over 500,000 people.
PREFIX : <http://example/table>

SELECT ?townName ?pop {
  GRAPH <http://example/population> {
    ?x :Town ?townName ;
       :Population ?pop .
    FILTER(?pop > 500000)
  }
}
{noformat}

Like database views, this is the abstraction the application sees; it may be stored internally in a different way.

(Example, out of scope: if the property table is in an external store - a key-value store, Lucene, or a JSON-style document store, say - then this BGP processing can be much faster.)

h2. Work Items

Notes:

* Not all work items are the same length.
* Work items do not have to be done in this order.

h3. Phase 1 : Architecture and System

h4. Parse CSV

Code to take a CSV file and emit its rows as tuples of RDF terms.

This needs to be:
* a robust process with good error messages
* stream-based

Jena already has a CSV parser ({{CSVInputIterator}}), specifically for SPARQL results in CSV, but the [Apache Commons CSV|http://commons.apache.org/csv/] parser is more flexible.  However, it is a pull-parser.

[This CSVParser|https://github.com/afs/AFS-Dev/blob/master/src/main/java/lib/CSVParser.java] is a push-parser, taking a stream destination for output (this would change from {{Sink<>}} to {{StreamRDF}}).  This can be incorporated into the project, with improvements, if it is the right processing style.
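To make the push style concrete, here is a minimal sketch.  It is illustrative only - {{RowSink}} and {{TinyCsvPushParser}} are made-up names, not the linked {{CSVParser}}, and quoting/escaping is deliberately omitted:

{noformat}
// A minimal sketch of push-style CSV parsing: the parser drives the process
// and emits each row to a caller-supplied handler.  RowSink plays the role
// that Sink<> / StreamRDF would play in the real code.
import java.io.BufferedReader ;
import java.io.IOException ;
import java.io.Reader ;
import java.util.Arrays ;
import java.util.List ;

interface RowSink {
    void row(List<String> fields) ;     // called once per CSV row
    void finish() ;                     // called at end of input
}

class TinyCsvPushParser {
    static void parse(Reader in, RowSink sink) throws IOException {
        BufferedReader br = new BufferedReader(in) ;
        String line ;
        while ( (line = br.readLine()) != null )
            sink.row(Arrays.asList(line.split(",", -1))) ;
        sink.finish() ;
    }
}
{noformat}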

Lots of tests.

h4. Design {{PropertyTable}}

A table of RDF terms (in Jena, an RDF term is a {{Node}}).  A possible shape is sketched after the list.

* {{PropertyTable}} interface (read-only)
** get by subject
** get by property-value
** based on 
** (it's a document database!)
* {{PropertyTableBuilder}} interface
* {{PropertyTableImpl}} using 
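A hedged sketch of the read-only interface - method names are suggestions, not a settled API; the design itself is this work item:

{noformat}
// A sketch only.  The two lookup methods correspond to "get by subject"
// and "get by property-value" above.
import java.util.List ;
import com.hp.hpl.jena.graph.Node ;   // org.apache.jena.graph.Node in later Jena

public interface PropertyTable {
    List<Node> getColumns() ;                               // the properties
    List<Row> getAllRows() ;                                // full scan
    Row getRow(Node subject) ;                              // get by subject
    List<Row> getMatchingRows(Node property, Node value) ;  // get by property-value

    interface Row {
        Node getSubject() ;
        Node get(Node property) ;   // null for a missing value
    }
}
{noformat}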

h4. CSV to PropertyTable.

Using the CSV processor that generates tuples of RDF terms, add code to take a tabular CSV file and create a {{PropertyTable}} using {{PropertyTableBuilder}}.

To support testing, there should be an RDF tuples to {{PropertyTable}} path as well.
See [RDF Tuple I/O|https://svn.apache.org/repos/asf/jena/Experimental/rdfpatch/src/main/java/org/apache/jena/riot/tio/].
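For illustration, the wiring might look like the following, reusing the {{RowSink}} sketch above.  This is heavily hedged: {{PropertyTableBuilder}} and its {{setColumns}}/{{addRow}}/{{build}} methods are assumptions, not a decided API.

{noformat}
// Hypothetical wiring: the push parser feeds rows to a builder.
import java.io.FileReader ;
import java.io.IOException ;
import java.util.List ;

class CsvToPropertyTable {
    static PropertyTable convert(String filename) throws IOException {
        final PropertyTableBuilder builder = new PropertyTableBuilder() ;  // assumed API
        TinyCsvPushParser.parse(new FileReader(filename), new RowSink() {
            boolean headerSeen = false ;
            public void row(List<String> fields) {
                if ( !headerSeen ) { builder.setColumns(fields) ; headerSeen = true ; }
                else builder.addRow(fields) ;   // string-to-RDF-term conversion happens here
            }
            public void finish() {}
        }) ;
        return builder.build() ;
    }
}
{noformat}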
  
h4. {{GraphPropertyTable}}

Implement the {{Graph}} interface (read-only) over a {{PropertyTable}}.
This subclasses {{GraphBase}} and implements {{find()}}.
{{find()}} needs to choose the access route based on the find arguments.

This will offer the {{PropertyTable}} interface, or an appropriate subset or variation, so that such a graph can be treated in a more table-like fashion.
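A hedged sketch of the {{find()}} dispatch, written against the {{PropertyTable}} shape suggested above (so the row and column accessors are assumptions); {{graphBaseFind}} and {{TripleMatch}} are the Jena 2.x extension point:

{noformat}
// The key idea is choosing the access route from whichever terms of the
// pattern are concrete.
import java.util.ArrayList ;
import java.util.List ;
import com.hp.hpl.jena.graph.Node ;
import com.hp.hpl.jena.graph.Triple ;
import com.hp.hpl.jena.graph.TripleMatch ;
import com.hp.hpl.jena.graph.impl.GraphBase ;
import com.hp.hpl.jena.util.iterator.ExtendedIterator ;
import com.hp.hpl.jena.util.iterator.WrappedIterator ;

public class GraphPropertyTable extends GraphBase {
    private final PropertyTable table ;
    public GraphPropertyTable(PropertyTable table) { this.table = table ; }

    @Override
    protected ExtendedIterator<Triple> graphBaseFind(TripleMatch m) {
        Node s = m.getMatchSubject() ;          // null means "any"
        Node p = m.getMatchPredicate() ;
        Node o = m.getMatchObject() ;
        List<Triple> acc = new ArrayList<Triple>() ;
        if ( s != null )                        // one row, by subject
            rowTriples(table.getRow(s), p, o, acc) ;
        else if ( p != null && o != null )      // column lookup
            for ( PropertyTable.Row r : table.getMatchingRows(p, o) )
                rowTriples(r, p, o, acc) ;
        else                                    // full scan
            for ( PropertyTable.Row r : table.getAllRows() )
                rowTriples(r, p, o, acc) ;
        return WrappedIterator.create(acc.iterator()) ;
    }

    private void rowTriples(PropertyTable.Row row, Node p, Node o, List<Triple> acc) {
        if ( row == null ) return ;
        for ( Node col : table.getColumns() ) {
            Node v = row.get(col) ;
            if ( v == null ) continue ;                 // missing value
            if ( p != null && !p.equals(col) ) continue ;
            if ( o != null && !o.equals(v) ) continue ;
            acc.add(Triple.create(row.getSubject(), col, v)) ;
        }
    }
}
{noformat}

Materialising the matches into a list keeps the sketch short; a real implementation would iterate lazily.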

h4. Wire up

At this point, it should all work - just create a {{GraphPropertyTable}} from a CSV file and pass it to SPARQL in the normal way.  Better ways to access the property table come later.
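An end-to-end usage sketch: {{CsvToPropertyTable}} and {{GraphPropertyTable}} are the hypothetical pieces sketched earlier; the rest is the normal Jena/ARQ API (Jena 2.x package names).

{noformat}
import com.hp.hpl.jena.graph.Graph ;
import com.hp.hpl.jena.query.* ;
import com.hp.hpl.jena.rdf.model.Model ;
import com.hp.hpl.jena.rdf.model.ModelFactory ;
import org.apache.jena.atlas.lib.StrUtils ;

public class CsvQueryExample {
    public static void main(String[] args) throws Exception {
        // Wrap the CSV-backed property table as a Graph, then as a Model.
        Graph g = new GraphPropertyTable(CsvToPropertyTable.convert("population.csv")) ;
        Model m = ModelFactory.createModelForGraph(g) ;

        Dataset ds = DatasetFactory.createMem() ;
        ds.addNamedModel("http://example/population", m) ;

        String qs = StrUtils.strjoinNL(
            "PREFIX : <http://example/table>",
            "SELECT ?townName ?pop {",
            "  GRAPH <http://example/population> {",
            "    ?x :Town ?townName ; :Population ?pop .",
            "    FILTER(?pop > 500000)",
            "  }",
            "}") ;

        QueryExecution qExec = QueryExecutionFactory.create(qs, ds) ;
        try { ResultSetFormatter.out(qExec.execSelect()) ; }
        finally { qExec.close() ; }
    }
}
{noformat}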

h4. Documentation

h4. Announce

h3. Phase 2 : Additional Features

h4. RIOT reader for CSV files.

Add {{.csv}} to RIOT so that {{model.read}} will work.  Note that there is an impedance mismatch here: for RDF data, the interface is "add triple", so the CSV reader will need to be aware of whether the destination is a {{GraphPropertyTable}} or a general {{Graph}}, in which case RDF triples are created for each row and inserted.
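Once registered, reading should need nothing special (a sketch - the registration itself is this work item):

{noformat}
// After ".csv" is registered with RIOT, the normal read path applies;
// the language is chosen from the file extension.
Model model = ModelFactory.createDefaultModel() ;
model.read("file:data.csv") ;
{noformat}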

h4. CSV->RDF tool.

Using the parser framework developed in the "Parse CSV" work item, convert CSV directly to formatted RDF syntax, with no intermediary graph or property table.  Create a command line tool that runs this so we have scalable CSV -> RDF for the direct mapping style used.
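A hedged sketch of the streaming pipeline, reusing the push-parser sketch above: rows in, triples out, nothing accumulated.  {{StreamRDF}} is RIOT's streaming destination interface; the row-to-triples logic mirrors the fixed mapping in the example.

{noformat}
// Rows stream straight to a StreamRDF destination; no graph is built.
// Values are emitted as plain literals here; number detection would go
// where the createLiteral call is.
import java.io.IOException ;
import java.io.Reader ;
import java.util.List ;
import org.apache.jena.riot.system.StreamRDF ;
import com.hp.hpl.jena.graph.Node ;
import com.hp.hpl.jena.graph.NodeFactory ;
import com.hp.hpl.jena.graph.Triple ;

class CsvToStreamRDF {
    static void convert(Reader csv, final StreamRDF dest, final String ns)
            throws IOException {
        dest.start() ;
        TinyCsvPushParser.parse(csv, new RowSink() {
            List<String> header = null ;
            public void row(List<String> fields) {
                if ( header == null ) { header = fields ; return ; }  // header row
                Node subject = NodeFactory.createAnon() ;             // one bnode per row
                for ( int i = 0 ; i < header.size() ; i++ )
                    dest.triple(Triple.create(
                        subject,
                        NodeFactory.createURI(ns + header.get(i)),
                        NodeFactory.createLiteral(fields.get(i)))) ;
            }
            public void finish() {}
        }) ;
        dest.finish() ;
    }
}
{noformat}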

Schema driven and customizable conversion is out of scope for the project.

h4. Documentation

h4. OpExecutor to work with OpGraph/OpBGP.

While access from SPARQL via {{Graph.find}} will work, it's not ideal.  This work item involves processing a whole SPARQL basic graph pattern (see {{OpExecutor.execute(OpBGP, ...)}}) when the target of the query is a {{GraphPropertyTable}}.  It will get a whole row, or rows, of table data and match the pattern against them to produce bindings.
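A hedged sketch of the dispatch: a custom {{OpExecutor}}, registered via an {{OpExecutorFactory}}, is ARQ's extension point, and {{matchPropertyTable}} is a placeholder for the matcher this item would build.

{noformat}
// Only the dispatch is shown; the row-at-a-time matcher is the real work.
import com.hp.hpl.jena.graph.Graph ;
import com.hp.hpl.jena.sparql.algebra.op.OpBGP ;
import com.hp.hpl.jena.sparql.core.BasicPattern ;
import com.hp.hpl.jena.sparql.engine.ExecutionContext ;
import com.hp.hpl.jena.sparql.engine.QueryIterator ;
import com.hp.hpl.jena.sparql.engine.main.OpExecutor ;

public class OpExecutorPropertyTable extends OpExecutor {
    protected OpExecutorPropertyTable(ExecutionContext execCxt) { super(execCxt) ; }

    @Override
    protected QueryIterator execute(OpBGP opBGP, QueryIterator input) {
        Graph g = execCxt.getActiveGraph() ;
        if ( g instanceof GraphPropertyTable )
            // Match the whole pattern against table rows, a row at a time,
            // instead of triple-by-triple find().
            return matchPropertyTable((GraphPropertyTable)g, opBGP.getPattern(), input) ;
        return super.execute(opBGP, input) ;   // default behaviour otherwise
    }

    private QueryIterator matchPropertyTable(GraphPropertyTable g,
                                             BasicPattern pattern,
                                             QueryIterator input) {
        // The substance of this work item; deliberately not implemented here.
        throw new UnsupportedOperationException("sketch only") ;
    }
}
{noformat}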

There are several important cases architecturally.



> Data Tables for SPARQL
> ----------------------
>
>                 Key: JENA-625
>                 URL: https://issues.apache.org/jira/browse/JENA-625
>             Project: Apache Jena
>          Issue Type: Improvement
>            Reporter: Andy Seaborne
>              Labels: gsoc, gsoc2014, java, linked_data, mentor, rdf, sparql
>
> Temporary tables are used for keeping intermediate results available for reuse with the same query or update, or use by a subsequent query.
> This project will provide temporary tables for one or both of these use cases:
> # implicit use of temporary tables for precomputed parts of basic graph patterns 
> # explicit use of named temporary tables (e.g. "FROM TABLE ...")
> This project requires problem definition, solution design as well as implementation.  Both use cases require modification of the SPARQL query engine.


