Posted to dev@jena.apache.org by Ying Jiang <jp...@gmail.com> on 2014/03/03 04:12:07 UTC

Re: [GSoC 2014] Data Tables for SPARQL

Hi Andy,

Thanks for your suggestions! I'm more interested in JENA-625 (Data
Tables for SPARQL). I've seen your new comments in JIRA and studied
the source code of Tarql. I'd like to paste your comments here with my
questions below to clarify the details of this project:

1. CSV to RDF terms (tuples of RDF terms are already supported
internally in Jena)
 - Questions:
1.1 Tarql uses the first row of the CSV as variable names. Should we use
the same idea?
1.2 As to "internal support of tuples of RDF terms in Jena", do you
mean com.hp.hpl.jena.sparql.algebra.table.TableData? Tarql uses
TableData to accommodate RDF term bindings from CSV (a rough sketch
follows below).
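
A rough sketch of the kind of thing 1.2 refers to - building a TableData
from already-parsed CSV rows. This assumes TableData's (List<Var>,
List<Binding>) constructor and ARQ's BindingFactory/BindingMap as in
current Jena; the class name and the CSV values are made up for
illustration:

-------------
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import com.hp.hpl.jena.graph.NodeFactory;
import com.hp.hpl.jena.sparql.algebra.table.TableData;
import com.hp.hpl.jena.sparql.core.Var;
import com.hp.hpl.jena.sparql.engine.binding.Binding;
import com.hp.hpl.jena.sparql.engine.binding.BindingFactory;
import com.hp.hpl.jena.sparql.engine.binding.BindingMap;

public class CsvToTableData {
    /** Turn already-parsed CSV rows into a TableData of RDF term bindings. */
    public static TableData fromRows(String[] header, List<String[]> rows) {
        // One variable per CSV column (the first row of the file).
        List<Var> vars = new ArrayList<Var>();
        for (String col : header)
            vars.add(Var.alloc(col));

        // One Binding per data row; every cell starts life as a plain string.
        List<Binding> bindings = new ArrayList<Binding>();
        for (String[] row : rows) {
            BindingMap b = BindingFactory.create();
            for (int i = 0; i < header.length; i++)
                b.add(vars.get(i), NodeFactory.createLiteral(row[i]));
            bindings.add(b);
        }
        return new TableData(vars, bindings);
    }

    public static void main(String[] args) {
        String[] header = { "Town", "Population" };
        List<String[]> rows = Arrays.asList(
            new String[] { "Southton", "123000" },
            new String[] { "Northville", "654000" });
        System.out.println(fromRows(header, rows).size() + " rows loaded");
    }
}
-------------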

2. Storage of the table (in-memory is enough, with reading from a file).
 - Questions:
2.1 What's the life cycle of the in-memory table? Should we discard
the table after the query execution, or keep it in-memory for later
reuse with the same query or update, or use by a subsequent query?
When will the table be discarded?

3. Modify the SPARQL grammar to support FROM TABLE and TABLE (for
inclusion inside a larger query, c.f. SPARQL VALUES clause).
 - Questions:
3.1 What're the differences between FROM TABLE and TABLE?
3.2 Tarql programmatically modifies the query (parsed by the standard
SPARQLParser11) with the CSV table data, without touching the original
SPARQL grammar parsing module (a rough sketch of that style follows
below). Should we adopt a different approach: modifying the parsing
grammar in the .jj files and just asking javacc to generate the new
parsing code?
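
For comparison, a minimal sketch of the "no grammar change" style: parse a
normal query with QueryFactory and attach CSV-derived bindings, assuming
ARQ's query-level VALUES hook (Query.setValuesDataBlock). This only
illustrates the idea, not necessarily how Tarql itself does it; the query
text and values are made up:

-------------
import java.util.Arrays;

import com.hp.hpl.jena.graph.NodeFactory;
import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.sparql.core.Var;
import com.hp.hpl.jena.sparql.engine.binding.Binding;
import com.hp.hpl.jena.sparql.engine.binding.BindingFactory;

public class InjectCsvAsValues {
    public static void main(String[] args) {
        // Parsed with the unmodified SPARQL 1.1 grammar.
        Query query = QueryFactory.create(
            "SELECT ?Town ?Population WHERE { FILTER(?Population > 500000) }");

        // Pretend this binding came from one parsed CSV data row.
        Var town = Var.alloc("Town");
        Var pop  = Var.alloc("Population");
        Binding row = BindingFactory.binding(town,
            NodeFactory.createLiteral("Northville"));
        // ... in practice, one Binding per CSV row with all columns bound.

        // Attach the table as a query-level VALUES block - no new syntax.
        query.setValuesDataBlock(Arrays.asList(town, pop),
                                 Arrays.asList(row));
        System.out.println(query);
    }
}
-------------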

4. Modify execution to include tables.
Questions: No questions for this now.

Best regards,
Ying Jiang

On Thu, Feb 27, 2014 at 10:49 PM, Andy Seaborne <an...@apache.org> wrote:
> On 26/02/14 15:14, Ying Jiang wrote:
>>
>> Hi,
>>
>> With the great guidance from the mentors, especially Andy, I had a
>> good time in GSoC 2013 working on jena-spatial [1]. I'm very grateful.
>> Really learnt a lot from that project.
>>
>> This year, I find the issue of "Extend CONSTRUCT to build quads" [2]
>> very interesting. I've used javacc before. I can understand the ARQ
>> module of parsing SPARQL strings. With a label of "gsoc2014", is it a
>> suitable project for Jena in GSoC 2014? Any more details about the
>> project? Thanks!
>>
>> Best regards,
>> Ying Jiang
>>
>> [1] http://jena.apache.org/documentation/query/spatial-query.html
>> [2] https://issues.apache.org/jira/browse/JENA-491
>>
>
> Hi there,
>
> Given your level of skill and expertise, this project is possibly a bit
> small for you.  It's not the same scale as jena-spatial. It's probably more
> suited to an undergraduate or someone looking to learn about working inside
> a moderately large existing codebase. You have a lot more software
> engineering experience.
>
> Can I interest you in one of:
>
> * JENA-625 especially the part about CSV ingestion.  There is now a W3C
> working group looking at tabular data on the web so we know this is
> interesting to the user community.
>
> * JENA-647 (only just added), which is server-side query templates for
> creating data views.
>
> In conjunction with someone (else) doing JENA-632 (custom JSON from SPARQL
> query), we would have a data delivery platform for creating domain-specific
> data delivery for webapps.
>
> (this was provided in the proprietary Talis platform as "SPARQL Stored
> Procedures" but that no longer exists.  No need to exactly follow that but
> it was a popular feature so it is useful).
>
> * JENA-624, which is about a new memory-based storage layer.  As a project,
> it's nearer in scale to jena-spatial.  This is less about RDF and linked data
> and more about systems programming.
>
>         Andy
>

Re: [GSoC 2014] Data Tables for SPARQL

Posted by Andy Seaborne <an...@apache.org>.
On 21/03/14 04:08, Ying Jiang wrote:
> Hi Andy,
>
> It's OK. Here's the copy of my proposal in the attachment.
>
> Cheers,
> Ying Jiang


No attachments on the mailing list :-(

(A bit of a nuisance but it does discourage mailing large files via the 
archive)

	Andy


Re: [GSoC 2014] Data Tables for SPARQL

Posted by Ying Jiang <jp...@gmail.com>.
Hi Andy,

It's OK. Here's the copy of my proposal in the attachment.

Cheers,
Ying Jiang

On Wed, Mar 19, 2014 at 10:13 PM, Andy Seaborne <an...@apache.org> wrote:
> On 19/03/14 04:22, Ying Jiang wrote:
>>
>> Dear Andy,
>>
>> I've submitted a proposal [1] to GSoC, according to our previous
>> discussions. Please let me know if anything can be improved.
>> Thanks a lot!
>
>
> Looks fine - and congratulations on the lecturer position.
>
>         Andy
>
>
>>
>> Cheers,
>> Ying Jiang
>>
>> [1]
>> http://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/jpz6311whu/5632763709358080
>
>
> These URLs are restricted until projects are accepted.  I can read it as:
>
> http://www.google-melange.com/gsoc/proposal/review/org/google/gsoc2014/jpz6311whu/5632763709358080
>
> not sure if that's organisation specific though.
>
> When projects are accepted, the proposal becomes public.
>
> Ying - in Apache, we do everything in public where possible.  Would you mind
> emailing dev@ with a copy?  (Remove anything you don't want on an archived
> list)
>
>
>>
>> On Mon, Mar 17, 2014 at 10:17 PM, Andy Seaborne <an...@apache.org> wrote:
>>>
>>> On 16/03/14 04:31, Ying Jiang wrote:
>>>>
>>>>
>>>> Dear Andy,
>>>>
>>>> I greatly appreciate your detailed explanations. I've studied all the
>>>> examples and the links you mentioned. I'll try to summarise here with
>>>> further questions below:
>>>>
>>>> 1. We have 2 possible ways for the project: "variables-as-columns" and
>>>> "property tables". I can understand both the ideas, thanks to your
>>>> instructions. The former one has its issues you pointed out, and the
>>>> latter one seems to make more sense for the users. Do you mean we
>>>> should discard the former one and focus on the latter in this project?
>>>
>>>
>>>
>>> Yes - "predicates-for-columns" = "property tables"
>>>
>>> From that, you can recover "variables-as-columns" by query pattern. The
>>> reverse is messy at best: either very unnatural variable names to stop
>>> clashes, or being careful about scoping (and that will confuse people).
>>>
>>>
>>>> 2. We can learn some lessons from the SQL-to-RDF work. But CSV
>>>> (even regular-shaped CSV) is different from a database in some ways,
>>>> which requires us to dig deeper into the details. Some questions
>>>> like:
>>>
>>>
>>>
>>> The W3C "CSV on the Web Working Group" [1] is working on a standard
>>> mechanism for converting CSV to other forms, RDF included.  The details
>>> of
>>> that mechanism aren't clear yet and won't be in time for the project -
>>> it's
>>> an area that (my current belief) will chop and change a fair bit in
>>> getting
>>> to a final specification.
>>>
>>>
>>> The area of CSV-RDF is bigger than a GSoC project anyway and fairly open
>>> ended given all the sorts of the things people do with CSV files (e.g.
>>> encoding author lists in fields).
>>>
>>> But there is a simpler case - one need is a "direct mapping" whereby a
>>> CSV file with no additional metadata is mapped to RDF.  I think we can
>>> focus
>>> on a design for this in the project.
>>>
>>> The translation is fixed: a blank node for each row (this addresses the
>>> primary key issue - see the alternative below), and the base URL of the
>>> CSV file is used to generate the predicate names.
>>>
>>> Then the project gets all the machinery working - otherwise the output
>>> will be CSV to RDF without the Jena architectural changes to support it
>>> in the long term.
>>>
>>> [1] https://www.w3.org/2013/csvw/wiki/Main_Page
>>>
>>>
>>>> 2.1 How to determine the data type of the column? All the values in
>>>> CSV are firstly parsed as Strings line by line. Suppose the parser
>>>> found a number string of "123000.0", how can we know whether it's an
>>>> integer, a float/double or even just a string in RDF?
>>>
>>>
>>>
>>> Initially, they can be strings.
>>>
>>> Later, and maybe as an option the user can turn on, a dynamic choice -
>>> which is a posh way of saying: attempt to parse it as an integer and, if
>>> that passes, it's an integer.  Spreadsheets do this guessing.
>>>
>>> "Duck datatyping" - if it looks like an integer (decimal, double, date)
>>> it
>>> is an integer (decimal, double, date).
>>>
>>> Actually, this is then the same as tokenizing and there is code to reuse
>>> to
>>> do that.
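
A hand-rolled sketch of that "duck datatyping" option (the real code would
more likely reuse the existing tokenizer, as Andy says). It assumes
NodeFactory's (lexical form, lang, datatype) literal constructor; the class
name is made up and only integer/double/string are tried:

-------------
import com.hp.hpl.jena.datatypes.xsd.XSDDatatype;
import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.graph.NodeFactory;

public class DuckTyping {
    /** If it looks like an integer it is one; then try double; else string. */
    public static Node nodeFor(String cell) {
        try {
            Long.parseLong(cell);
            return NodeFactory.createLiteral(cell, null, XSDDatatype.XSDinteger);
        } catch (NumberFormatException e) { /* not an integer */ }
        try {
            Double.parseDouble(cell);
            return NodeFactory.createLiteral(cell, null, XSDDatatype.XSDdouble);
        } catch (NumberFormatException e) { /* not numeric at all */ }
        return NodeFactory.createLiteral(cell);   // fall back to a plain string
    }

    public static void main(String[] args) {
        System.out.println(nodeFor("123000"));    // typed as xsd:integer
        System.out.println(nodeFor("123000.0"));  // typed as xsd:double
        System.out.println(nodeFor("Southton"));  // plain string literal
    }
}
-------------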
>>>
>>>
>>>> 2.2 How to deal with the namespaces? RDF requires that the subjects
>>>> and the predicates are URIs. We need to pass in the namespaces (or
>>>> just the default namespaces) to make URIs by combining the namespaces
>>>> with the values in CSV. Things may get more complicated if different
>>>> columns are to be bound with different namespaces.
>>>
>>>
>>>
>>> Subjects can be blank nodes, which is useful because each row is then a
>>> new blank node.
>>>
>>> One row written in RDF might be:
>>>
>>>
>>> [ csv:row 1 ; :Town "Southton" ; :Population 123000 ] .
>>>
>>> or
>>>
>>>
>>> _:b0  csv:row 1 ;
>>>        :Town "Southton" ;
>>>        :Population 123000 .
>>>
>>> It's the same RDF triples (3 of them).
>>>
>>> For predicates, suppose the URL of the CSV file is <FILE> then the
>>> columns
>>> can be  <FILE#Town> and <FILE#Population>.
>>>
>>> Rules or SPARQL Update can be used to turn that into a better data model
>>> if the user wants to write that code.
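
A small sketch of that translation with the plain Model API - a fresh blank
node per row, predicate URIs minted from the CSV file's URL. The csv:row
property is the placeholder vocabulary from the example above, and the file
URL is invented:

-------------
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

public class DirectMappingSketch {
    public static void main(String[] args) {
        String file = "file:///data/towns.csv";   // base URL of the CSV file
        Model m = ModelFactory.createDefaultModel();

        Property rowNum = m.createProperty("http://w3c/future-csv-vocab/row");
        Property town   = m.createProperty(file + "#Town");
        Property pop    = m.createProperty(file + "#Population");

        // Row 1: blank node subject, one predicate-value per cell.
        Resource r1 = m.createResource();         // fresh blank node
        r1.addLiteral(rowNum, 1)
          .addProperty(town, "Southton")
          .addLiteral(pop, 123000);

        m.write(System.out, "TURTLE");
    }
}
-------------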
>>>
>>>
>>>> 2.3 The HP 2006 report [1] says "Jena supports three kinds of property
>>>> tables as well as a triple store". The "town" example you provided
>>>> conforms to the "single-valued" property table. Shall we consider the
>>>> others (e.g. the "multi-valued" one and the "triple store" one) in
>>>> this project? Does Jena in the latest release still support these
>>>> property tables? If so, where is the related source code?
>>>
>>>
>>>
>>> Single-valued.
>>>
>>> In the CSV-WG it looks like duplicate column names are not going to be
>>> supported (at best, the parser has to make them unique by adding "1", "2"
>>> etc).
>>>
>>> Despite what the report says, the code didn't make it into the public
>>> Jena
>>> codebase.  (And we have removed the old RDB subsystem it refers to.)
>>>
>>>
>>>> 2.4 There's no "primary key" definition in CSV, and the RDF is not
>>>> OWL in any case. How do we know a column in CSV is uniquely defining? It
>>>> seems CSV lacks some kind of "metadata" about the columns and the
>>>> values. If we had such metadata, how would we pass in the namespace or
>>>> the IRI template of http://data/town/{Town} (something related to
>>>> question 2.2)?
>>>
>>>
>>>
>>> It's not necessary to have a defined primary key - the subject URI is
>>> generated.  It might be nice if available but that's metadata.
>>>
>>> So one of:
>>> 1/ The triples for each row have a blank node for subject
>>> 2/ The triples for row N have a URI which is <FILE#_N>.
>>>
>>> In both cases, the subject node is generated automatically.
>>>
>>>
>>>> 3. For the "property tables" way, it seems that all we need to do is
>>>> to resolve the problems in 2., and to code "GraphCSV" accordingly. I
>>>> can make the GraphCSV class by implementing the Graph interface. In
>>>> this way, for Jena ARP, a CSV table is actually a Graph, without any
>>>> differences from other types of Graphs. It looks like that there's no
>>>> need to introduce TABLE and FROM TABLE clauses in the SPARQL language
>>>> grammar. We can just use the existing GRAPH, FROM and FROM NAMED
>>>> clauses for the CSV "property tables", can't we?
>>>
>>>
>>>
>>> s/ARP/ARQ/ -- ARP is the RDF/XML parser; ARQ is the query engine :-)
>>>
>>> Yes - correct.
>>>
>>> In the later stages of the project, there is an item to make OpExecutor
>>> (which is the class that actually drives the SPARQL execution) do better
>>> for
>>> GraphCSV than just treating it as a Graph by accessing the PropertyTable
>>> behind it.
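
A very rough skeleton of what such a GraphCSV might look like - a read-only
Graph over triples generated from one CSV file, assuming GraphBase's
graphBaseFind(TripleMatch) hook as in current Jena. The CSV parsing is
elided; the real design would keep the compact property table behind it and
later let OpExecutor reach it directly:

-------------
import java.util.ArrayList;
import java.util.List;

import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.graph.Triple;
import com.hp.hpl.jena.graph.TripleMatch;
import com.hp.hpl.jena.graph.impl.GraphBase;
import com.hp.hpl.jena.util.iterator.ExtendedIterator;
import com.hp.hpl.jena.util.iterator.WrappedIterator;

/** Read-only Graph view over triples derived from one CSV file. */
public class GraphCSV extends GraphBase {
    private final List<Triple> triples = new ArrayList<Triple>();

    public GraphCSV(String csvFile) {
        // Parse the CSV here and fill 'triples' (blank node per row,
        // <FILE#Column> predicates) - or better, fill a PropertyTable
        // and generate triples on demand.
    }

    @Override
    protected ExtendedIterator<Triple> graphBaseFind(TripleMatch m) {
        // Naive scan; a PropertyTable-backed version would index by column.
        Node s = m.getMatchSubject();   // null means "match anything"
        Node p = m.getMatchPredicate();
        Node o = m.getMatchObject();
        List<Triple> result = new ArrayList<Triple>();
        for (Triple t : triples)
            if ((s == null || s.equals(t.getSubject()))
                && (p == null || p.equals(t.getPredicate()))
                && (o == null || o.equals(t.getObject())))
                result.add(t);
        return WrappedIterator.create(result.iterator());
    }
}
-------------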
>>>
>>> The big gain for PropertyTables is the space saving they enable as well
>>> as
>>> the possibility of making them persistent in a special storage system
>>> (not
>>> in this project but the design should not make that too hard at some
>>> later
>>> time).
>>>
>>>          Andy
>>>
>>>
>>>>
>>>> Best regards,
>>>> Ying Jiang
>>>>
>>>> [1] http://www.hpl.hp.com/techreports/2006/HPL-2006-140.pdf
>>>>
>>>>
>>>>
>>>> On Mon, Mar 10, 2014 at 10:50 PM, Andy Seaborne <an...@apache.org> wrote:
>>>>>
>>>>>
>>>>> Hi Ying,
>>>>>
>>>>> Good questions.  I'll try to give a response to the specific points
>>>>> you've brought up, but there is also a different approach I want to put
>>>>> forward for discussion.
>>>>>
>>>>> I'll write up a first draft of a project plan then we can see if the
>>>>> size
>>>>> and scope is realistic.
>>>>>
>>>>> You asked about whether variables are column names.  That is how TARQL
>>>>> and SPARQL VALUES work, but I've realised there is a different approach
>>>>> and it's one that will give a better system.  It is to translate the CSV
>>>>> to RDF, and this may be materialized or dynamically mapped. If
>>>>> "materialized" it's likely to be a lot bigger; as "property tables" or
>>>>> something inspired by that idea, it'll be more compact.
>>>>>
>>>>> Some issues with variables-as-columns include:
>>>>>
>>>>> 1/ Fixed variable names don't combine with other parts of a query
>>>>> pattern very well.
>>>>>
>>>>> If there is common use of the same name, it's a join - that's what a
>>>>> natural join in SQL is.  If there are two tables, then ?a is overloaded.
>>>>> If column names are used to derive a variable name, we may not want to
>>>>> equate them in the query, because column names in different CSV files
>>>>> weren't designed with that in mind.
>>>>>
>>>>> 2/ You can't describe (in RDF) the data very easily - e.g. annotate
>>>>> that a column holds years.
>>>>>
>>>>> 3/  It needs the language to change (i.e. TABLE to access it)
>>>>>
>>>>> In TARQL, which focuses on a controlled transform from CSV to RDF, it
>>>>> works out quite nicely - variables go into the CONSTRUCT template. It
>>>>> produces RDF.
>>>>>
>>>>> Property tables are a style of approach where the CSV data is accessed
>>>>> as
>>>>> RDF.
>>>>>
>>>>> The data table columns become predicate URIs.  The data table itself is
>>>>> an RDF graph of regular structure.  It can be accessed with normal
>>>>> (unmodified) SPARQL syntax. It would be better if the storage and
>>>>> execution of that part of the SPARQL query were adapted to such regular
>>>>> data.  Something for after getting an initial cut done.
>>>>>
>>>>> Suppose we have a CSV file:
>>>>> -------------------
>>>>> Town,Population
>>>>> Southton,123000
>>>>> Northville,654000
>>>>> -------------------
>>>>>
>>>>> One header row, two data rows.
>>>>>
>>>>> Aside: this is regular-shaped CSV (and some CSV files are definitely
>>>>> not regular at all!). There is the current editors' working draft from
>>>>> the CSV on the Web Working Group (not yet published, likely to change,
>>>>> only part of the picture, etc etc)
>>>>>
>>>>> http://w3c.github.io/csvw/syntax/
>>>>>
>>>>> which is defining a more regular data model out of CSV.  This is the
>>>>> target for the CSV work: table-shaped CSV; not arbitrary, irregularly
>>>>> shaped CSV.
>>>>>
>>>>> There is no way the working group will have standardised any CSV to RDF
>>>>> mapping in the lifetime of the GSoC project but the WG charter says it
>>>>> must
>>>>> be covered.  So the mapping below is made up and ahead of where the
>>>>> working
>>>>> group is currently but a standardized, "direct mapping" (no metadata,
>>>>> no
>>>>> templates) style is going to happen.  The mapping details may change
>>>>> but
>>>>> the
>>>>> general approach is clear.
>>>>>
>>>>> As RDF this might be
>>>>>
>>>>> -------------
>>>>> @prefix : <http://example/table> .
>>>>> @prefix csv: <http://w3c/future-csv-vocab/> .
>>>>>
>>>>> [ csv:row 1 ; :Town "Southton" ; :Population 123000 ] .
>>>>> [ csv:row 2 ; :Town "Northville" ; :Population 654000 ] .
>>>>> -------------
>>>>>
>>>>> or without the bnode abbreviation:
>>>>>
>>>>> -------------
>>>>> @prefix : <http://example/table> .
>>>>> @prefix csv: <http://w3c/future-csv-vocab/> .
>>>>>
>>>>> _:b0  csv:row 1 ;
>>>>>         :Town "Southton" ;
>>>>>         :Population 123000 .
>>>>>
>>>>> _:b1  csv:row 2 ;
>>>>>         :Town "Northville" ;
>>>>>         :Population 654000 .
>>>>> -------------
>>>>>
>>>>>
>>>>> Each row is modelling one "entity" (here, a population observation).
>>>>> There is a subject (a blank node) and one predicate-value pair for each
>>>>> cell of the row.  Row numbers are added because they can be important.
>>>>>
>>>>> Background:
>>>>>
>>>>> A related idea for property tables has come up before
>>>>>
>>>>>     http://www.hpl.hp.com/techreports/2006/HPL-2006-140.html
>>>>>
>>>>> That paper should only be taken as giving a flavour. The motivation was
>>>>> different, more about making RDF look like a regular database, especially
>>>>> when the data is regular.  At the workshop last week, I talked to Orri
>>>>> Erling (OpenLink/Virtuoso) and apparently, maybe by parallel evolution,
>>>>> Virtuoso does something similar.
>>>>>
>>>>>
>>>>> Aside:
>>>>> There is a whole design space (outside this project) for translating
>>>>> CSV
>>>>> to
>>>>> RDF.
>>>>>
>>>>> Just if anyone is interested: see the related SQL-to-RDF work:
>>>>>
>>>>> http://www.w3.org/TR/r2rml/
>>>>> http://www.w3.org/TR/rdb-direct-mapping/
>>>>>
>>>>> If the metadata said that one of the columns was uniquely defining (a
>>>>> primary key in SQL terms, or an inverse functional property in OWL
>>>>> terms), we wouldn't need blank nodes at all - we could use a URI
>>>>> template.  For example, if town names were unique (they are not!) an IRI
>>>>> template of http://data/town/{Town} would give:
>>>>>
>>>>> -------------
>>>>> @prefix : <http://example/table> .
>>>>> @prefix csv: <http://w3c/future-csv-vocab/> .
>>>>>
>>>>> <http://data/town/Southton>
>>>>>         csv:row 1 ;
>>>>>         rdfs:label "Southton" ;
>>>>>         :Population 123000 .
>>>>>
>>>>> <http://data/town/Northville>
>>>>>         csv:row 2 ;
>>>>>         rdfs:label "Northville" ;
>>>>>         :Population 654000 .
>>>>> -------------
>>>>>
>>>>> Doing this transformation in rules is one route.  JENA-650 connection?
>>>>> </aside>
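
A tiny sketch of filling such an IRI template (outside the scope of the
project, as the aside says); the class name and helper are made up, and a
real implementation would need proper IRI percent-encoding rather than
URLEncoder's form encoding:

-------------
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class IriTemplate {
    /** Fill a single-variable template like http://data/town/{Town}. */
    public static String fill(String template, String var, String value) {
        try {
            return template.replace("{" + var + "}",
                                    URLEncoder.encode(value, "UTF-8"));
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e);   // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(fill("http://data/town/{Town}", "Town", "Southton"));
        // -> http://data/town/Southton
    }
}
-------------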
>>>>>
>>>>> In SPARQL:
>>>>>
>>>>> Now the CSV file is viewed as a graph - normal, unmodified SPARQL can
>>>>> be used.  Multiple CSV files can be multiple graphs in one dataset to
>>>>> give queries across different data sources.
>>>>>
>>>>> # Towns over 500,000 people.
>>>>> SELECT ?townName ?pop {
>>>>>   GRAPH <http://example/population> {
>>>>>       ?x :Town ?townName ;
>>>>>          :Population ?pop .
>>>>>       FILTER(?pop > 500000)
>>>>>   }
>>>>> }
>>>>>
>>>>>
>>>>> A few comments inline - the bulk of this message is above.
>>>>>
>>>>> I hope this makes some sense.  Having spent time with people who really
>>>>> do work with CSV files last week around the linked geospatial workshop,
>>>>> the user needs and requirements are much clearer.
>>>>>
>>>>>           Andy
>>>>>
>>>>> PS I was on a panel that included mentioning the work you did last
>>>>> year.
>>>>> It
>>>>> went well.
>>>>>
>>>>> On 07/03/14 12:10, Ying Jiang wrote:
>>>>> ...
>>>>>
>>>>>>>> 2. Storage of the table (in-memory is enough, with reading from a
>>>>>>>> file).
>>>>>>>>      - Questions:
>>>>>>>> 2.1 What's the life cycle of the in-memory table? Should we discard
>>>>>>>> the table after the query execution, or keep it in-memory for later
>>>>>>>> reuse with the same query or update, or use by a subsequent query?
>>>>>>>> When will the table be discarded?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> That'll need refining but a way to read and reuse.  There needs to be
>>>>>>> a way for the app to pass in tables (a Map<String, ???> and a tool for
>>>>>>> reading CSVs to get the ???) because ...
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> When will the tables be passed in? TARQL loads the CSVs when parsing
>>>>>> the SPARQL query string. Shall we load the tables and create the Map
>>>>>> before querying and cache them for reuse? This could be similar to
>>>>>> querying a Dataset, and the simplest way goes something like:
>>>>>>
>>>>>> DataTableMap<String, DataTable> dtm =
>>>>>> DataTableSetFactory.createDataTableMap(); // The keys of dtm are the
>>>>>> URIs of the DataTables loaded.
>>>>>> dtm.addDataTable( "<ex:table_1>", "file:table_1.csv", true); // The
>>>>>> table data are loaded when added into the map.
>>>>>> dtm.addDataTable( "<ex:table_2>", "file:table_2.csv", false); // Or
>>>>>> the table data are *lazily* loaded during querying later on, i.e. not
>>>>>> loaded now.
>>>>>> Query query = QueryFactory.create(queryString) ; // A new .jj will be
>>>>>> created for parsing TABLE and FROM TABLE clauses. However the
>>>>>> QueryFactory interface remains the same as before.
>>>>>> QueryExecution qExec = QueryExecutionFactory.create(query, model,
>>>>>> dtm) ; // New create method for QueryExecutionFactory to accommodate
>>>>>> dtm
>>>>>> ... // dtm can be reused later on for other QueryExecutions, or be
>>>>>> discarded when the app ends.
>>>>>>
>>>>>> Is the above what you mean? Any comments?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Yes, using TABLE.
>>>>>
>>>>> With property tables it can be done as
>>>>>
>>>>> // Default graph of the dataset
>>>>>
>>>>> Model csv1 =
>>>>>     ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;
>>>>> QueryExecution qExec = QueryExecutionFactory.create(query, csv1) ;
>>>>>
>>>>> or for multiple CSV files and/or other RDF data:
>>>>>
>>>>> Model csv1 =
>>>>>     ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;
>>>>> Model csv2 =
>>>>>     ModelFactory.createModelForGraph(new GraphCSV("data2.csv")) ;
>>>>>
>>>>> Dataset dataset = ... ;
>>>>> dataset.addNamedModel("http://example/population", csv1) ;
>>>>> dataset.addNamedModel("http://example/table2", csv2) ;
>>>>>
>>>>> ... normal SPARQL execution ...
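
A sketch of what that "normal SPARQL execution" might look like against the
dataset above. GraphCSV here is the hypothetical class under discussion, the
predicate namespace is assumed to follow from the CSV file's URL, and the
FILTER only works once the Population values are typed as numbers (the
"duck datatyping" option) rather than plain strings:

-------------
import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.query.DatasetFactory;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.ResultSetFormatter;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;

public class QueryCsvDataset {
    public static void main(String[] args) {
        // Hypothetical GraphCSV from the skeleton earlier in the thread.
        Model csv1 = ModelFactory.createModelForGraph(new GraphCSV("data1.csv"));

        Dataset dataset = DatasetFactory.createMem();
        dataset.addNamedModel("http://example/population", csv1);

        // Predicate namespace assumed to be derived from the CSV file's URL.
        String qs =
            "PREFIX : <file:///data1.csv#>\n" +
            "SELECT ?townName ?pop WHERE {\n" +
            "  GRAPH <http://example/population> {\n" +
            "    ?x :Town ?townName ; :Population ?pop .\n" +
            "    FILTER(?pop > 500000)\n" +
            "  }\n" +
            "}";

        QueryExecution qExec = QueryExecutionFactory.create(qs, dataset);
        try {
            ResultSetFormatter.out(qExec.execSelect());
        } finally {
            qExec.close();
        }
    }
}
-------------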
>>>>>
>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> 3. Modify the SPARQL grammar to support FROM TABLE and TABLE (for
>>>>>>>> inclusion inside a larger query, c.f. SPARQL VALUES clause).
>>>>>>>>      - Questions:
>>>>>>>> 3.1 What're the differences between FROM TABLE and TABLE?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> FROM TABLE would be one way to get tables into the query, as would
>>>>>>> passing it in via the query context.
>>>>>>>
>>>>>>> Queries can't be assumed to
>>>>>>>
>>>>>>> TABLE in a query is accessing the table, using it to get the
>>>>>>>
>>>>>>> TARQL (and I've only read the documentation) is a query over a single
>>>>>>> CSV file.  This project should be about multiple CSVs and combining
>>>>>>> with other RDF data.
>>>>>>>
>>>>>>> A quick sketch and the syntax is not checked as sensible:
>>>>>>>
>>>>>>> SELECT ... {
>>>>>>>      # Fixed column names
>>>>>>>      TABLE <uri> {
>>>>>>>         BIND (URI(CONCAT('http://example.com/ns#', ?b)) AS ?uri)
>>>>>>>         BIND (STRLANG(?a, 'en') AS ?with_language_tag)
>>>>>>>         FILTER (?v > 57)
>>>>>>>      }
>>>>>>> }
>>>>>>>
>>>>>>> More ambitious to have column naming and FILTERs:
>>>>>>>
>>>>>>> SELECT ...
>>>>>>> WHERE {
>>>>>>>
>>>>>>>       TABLE <uri> { "col1" AS ?myVar1 ,
>>>>>>>                     "col10" AS ?V ,
>>>>>>>                     "col5" AS ?appName
>>>>>>>                     FILTER(?V > 57) }
>>>>>>> }
>>>>>>>
>>>>>>> creates a set of bindings based on access description.
>>>>>>>
>>>>>>
>>>>>> Are the <uri> after TABLE the keys of the Map<String, ???>? If so, I now
>>>>>> understand the TABLE clauses from the examples. However, I'm still not
>>>>>> sure about FROM TABLE. Could you please show me some query string
>>>>>> examples containing FROM TABLE clauses?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> FROM TABLE would set the map entry, c.f. FROM NAMED.
>>>>>
>>>>> In this case the name of the table (graph) is the location it comes from
>>>>> - it's not a general choice of name.  This is a common issue for FROM
>>>>> NAMED, not specific to CSV processing.
>>>>>
>>>
>

Re: [GSoC 2014] Data Tables for SPARQL

Posted by Andy Seaborne <an...@apache.org>.
On 19/03/14 04:22, Ying Jiang wrote:
> Dear Andy,
>
> I've submitted a proposal [1] to GSoC, according to our previous
> discussions. Please let me know if anything can be improved.
> Thanks a lot!

Looks fine - and congratulations on the lecturer position.

	Andy

>
> Cheers,
> Ying Jiang
>
> [1] http://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/jpz6311whu/5632763709358080

These URLs are restricted until projects are accepted.  I can read it as:

http://www.google-melange.com/gsoc/proposal/review/org/google/gsoc2014/jpz6311whu/5632763709358080

not sure if that's organisation specific though.

When projects are accepted, the proposal becomes public.

Ying - in Apache, we do everything in public where possible.  Would you 
mind emailing dev@ with a copy?  (Remove anything you don't want on an 
archived list)



Re: [GSoC 2014] Data Tables for SPARQL

Posted by Ying Jiang <jp...@gmail.com>.
Dear Andy,

I've submitted a proposal [1] to GSoC, according to our previous
discussions. Please let me know if anything can be improved.
Thanks a lot!

Cheers,
Ying Jiang

[1] http://www.google-melange.com/gsoc/proposal/review/student/google/gsoc2014/jpz6311whu/5632763709358080

On Mon, Mar 17, 2014 at 10:17 PM, Andy Seaborne <an...@apache.org> wrote:
> On 16/03/14 04:31, Ying Jiang wrote:
>>
>> Dear Andy,
>>
>> I greatly appreciate your detailed explanations. I've studied all the
>> examples and the links you mentioned. I'll try to summarise here with
>> further questions below:
>>
>> 1. We have 2 possible ways for the project: "variables-as-columns" and
>> "property tables". I can understand both the ideas, thanks to your
>> instructions. The former one has its issues you pointed out, and the
>> latter one seems to make more sense for the users. Do you mean we
>> should discard the former one and focus on the latter in this project?
>
>
> Yes - "predicates-for-columns" = "property tables"
>
> From that, you can recover "variables-as-columns" by query pattern. The
> reverse is messy at best. Either very unnatural variable names to stop
> clashes or beign careful about scoping (and that will confuse people).
>
>
>> 2. We can have some lessons learned from SQL-to-RDF work. But CSV
>> (even regular-shaped CSV) is different from database in some ways,
>> which requires us to dig in deeper on the details. Some questions
>> like:
>
>
> The W3C "CSV on the Web Working Group" [1] is working on a standard
> mechanism for converting CSV to other forms, RDF included.  The details of
> that mechanism aren't clear yet and won't be in time for the project - it's
> an area that (my current belief) will chop and change a fair bit in getting
> to a final specification.
>
>
> The area of CSV-RDF is bigger than a GSoC project anyway and fairly open
> ended given all the sorts of the things people do with CSV files (e.g.
> encoding author lists in fields).
>
> But there is a simpler case - one need is a "direct mapping" whereby a
> CSV file with no additional metadata is mapped to RDF.  I think we can focus
> on a design for this in the project.
>
> The translation is fixed : blank node for each row (addresses the primary
> key issue - and alternative below), the base URL of the CSV file is used to
> generate the predicate names.
>
> Then, the project gets all the machinery working - otherwise the output will
> CSV to RDF without the Jena architectural chnages to support it in the long
> term.
>
> [1] https://www.w3.org/2013/csvw/wiki/Main_Page
>
>
>> 2.1 How to determine the data type of the column? All the values in
>> CSV are firstly parsed as Strings line by line. Suppose the parser
>> found a number string of "123000.0", how can we know whether it's an
>> integer, a float/double or even just a string in RDF?
>
>
> Initially, they can be strings.
>
> Later, and maybe an option the user can turn on, then a dynamic choice which
> is a posh way of saying attempt to parse it as an integer and if it passes,
> it's an integer.  Spreadsheets do this guessing.
>
> "Duck datatyping" - if it looks like an integer (decimal, double, date) it
> is an integer (decimal, double, date).
>
> Actually, this is then the same as tokenizing and there is code to reuse to
> do that.
>
>
>> 2.2 How to deal with the namespaces? RDF requires that the subjects
>> and the predicates are URIs. We need to pass in the namespaces (or
>> just the default namespaces) to make URIs by combining the namespaces
>> with the values in CSV. Things may get more complicated if different
>> columns are to be bound with different namespaces.
>
>
> Subject a can be blank nodes which is useful because each row is then a new
> blank node.
>
> One row written in RDF might be:
>
>
> [ csv:row 1 ; :Town "Southton" ; :Population 123000 ] .
>
> or
>
>
> _:b0  csv:row 1 ;
>       :Town "Southton" ;
>       :Population 123000 .
>
> It's the same RDF triples (3 of them).
>
> For predicates, suppose the URL of the CSV file is <FILE> then the columns
> can be  <FILE#Town> and <FILE#Population>.
>
> Rules or SPARQL Update can be used to turn that into a better data model if
> the users wants to write that code.
>
>
>> 2.3 The hp 2006 report [1] says "Jena supports three kinds of property
>> tables as well as a triple store". The "town" example you provided
>> conforms to the "single-valued" property table. Shall we consider the
>> others (e.g. the "multi-valued" one and the "triple store" one) in
>> this project? Does Jena in the latest release still support these
>> property tables? If so, where're the related source codes?
>
>
> Single-valued.
>
> In the CSV-WG it looks like duplicate column names are not going to be
> supported (at best, the parser has to make then unique by adding "1", "2"
> etc).
>
> Despite what the report says, the code didn't make it into the public Jena
> codebase.  (And we have removed the old RDB subsystem it refers to.)
>
>
>> 2.4 There's no "primary key" definition in CSV. All the RDF are not
>> OWL in fact. How do we know the column in CSV is uniquely defining? It
>> seems CSV lacks of some kind of "metadata" of the columns and the
>> values. If we have such metadata, how to pass in the namespace of  the
>> IRI template of http://data/town/{Town} (something related to the
>> question 2.2)?
>
>
> It's not necessary to have a defined primary row - that is generated subject
> URI.  It might be nice if available but that's metadata.
>
> So one of:
> 1/ The triples for each row have a blank node for subject
> 2/ The triples for row N have a URI which is <FILE#_N>.
>
> In both cases, the subject node is generated automatically.
>
>
>> 3. For the "property tables" way, it seems that all we need to do is
>> to resolve the problems in 2., and to code "GraphCSV" accordingly. I
>> can make the GraphCSV class by implementing the Graph interface. In
>> this way, for Jena ARP, a CSV table is actually a Graph, without any
>> differences from other types of Graphs. It looks like that there's no
>> need to introduce TABLE and FROM TABLE clauses in the SPARQL language
>> grammar. We can just use the existing GRAPH, FROM and FROM NAMED
>> clauses for the CSV "property tables", can't we?
>
>
> s/ARP/ARQ/ -- ARP is the RDF/XML parser; ARQ is the query engine :-)
>
> Yes - correct.
>
> In the later stages of the project, there is an item to make OpExecutor
> (which is the class that actually drives the SPARQL execution) do better for
> GraphCSV than just treating it as a Graph by accessing the PropertyTable
> behind it.
>
> The big gain for PropertyTables is the space saving they enable as well as
> the possibility of making them persistent in a special storage system (not
> in this project but the design should not make that too hard at some later
> time).
>
>         Andy
>
>
>>
>> Best regards,
>> Ying Jiang
>>
>> [1] http://www.hpl.hp.com/techreports/2006/HPL-2006-140.pdf
>>
>>
>>
>> On Mon, Mar 10, 2014 at 10:50 PM, Andy Seaborne <an...@apache.org> wrote:
>>>
>>> Hi Ying,
>>>
>>> Good questions.  I'll try to give a response to the specific points
>>> you've
>>> brought up but also there is a different I want to put forward for
>>> discussion.
>>>
>>> I'll write up a first draft of a project plan then we can see if the size
>>> and scope is realistic.
>>>
>>> You asked about whether variables are column names.  That is how TARQL
>>> and
>>> SPARQL VALUES works but I've realised there is a different approach and
>>> it's
>>> one that will give a better system.  It is to translate the CSV to RDF,
>>> and
>>> this may be materialized or dynamically mapped. If "materialized" it's
>>> likely to be a lot bigger; as "property tables" or somethign inspired by
>>> that idea, it'll be more compact.
>>>
>>> There are some issues with variables-as-columns include:
>>>
>>> 1/ Fixed variable names don't combine with other part of a query pattern
>>> very well.
>>>
>>> If there is common use of the same name it a join - that's what a natural
>>> join in SQL is.  If there are two tables, then ?a is overloaded.  If
>>> column
>>> names are used to derive a variable name, we may not want to equate them
>>> in
>>> the query because column names in different CSV files weren't designed
>>> with
>>> that in mind.
>>>
>>> 2/ You can't describe (in RDF) the data very easily - e.g. annotate that
>>> a
>>> column is of years.
>>>
>>> 3/  It needs the language to change (i.e. TABLE to access it)
>>>
>>> In TARQL, which is focusing on a controlled transform from CSV to RDF, it
>>> works out quite nicely - variables go into the CONSTRUCT template. It
>>> produces RDF.
>>>
>>> Property tables are a style of approach where the CSV data is accessed as
>>> RDF.
>>>
>>> The data table columns be predicate URIs.  The data table itself is an
>>> RDF
>>> graph of regular structure.  It can be accessed with normal (unmodified)
>>> SPARQL syntax. It would be better if the storage and execution of that
>>> part
>>> of the SPARQL query were adapted to such regular data.  Something for
>>> after
>>> getting an initial cut down.
>>>
>>> Suppose we have a CSV file:
>>> -------------------
>>> Town,Population
>>> Southton,123000
>>> Northville,654000
>>> -------------------
>>>
>>> One header row, two data rows.
>>>
>>> Aside: this is regular-shaped CSV (and some CSV files are definitely not
>>> regular at all!). There is the current editors working draft from the CSV
>>> on
>>> the Web Working Group (not yet published, likely to change, only part of
>>> the
>>> picture, etc etc)
>>>
>>> http://w3c.github.io/csvw/syntax/
>>>
>>> which is defining a more regular data out of CSV.  This is the target for
>>> the CSV work: table shaped CSV; not arbitrary, irregularly shaped CSV.
>>>
>>> There is no way the working group will have standardised any CSV to RDF
>>> mapping in the lifetime of the GSoC project but the WG charter says it
>>> must
>>> be covered.  So the mapping below is made up and ahead of where the
>>> working
>>> group is currently but a standardized, "direct mapping" (no metadata, no
>>> templates) style is going to happen.  The mapping details may change but
>>> the
>>> general approach is clear.
>>>
>>> As RDF this might be
>>>
>>> -------------
>>> @prefix : <http://example/table> .
>>> @prefix csv: <http://w3c/future-csv-vocab/> .
>>>
>>> [ csv:row 1 ; :Town "Southton" ; :Population 123000 ] .
>>> [ csv:row 2 ; :Town "Northville" ; :Population 654000 ] .
>>> -------------
>>>
>>> or without the bnode abbreviation:
>>>
>>> -------------
>>> @prefix : <http://example/table> .
>>> @prefix csv: <http://w3c/future-csv-vocab/> .
>>>
>>> _:b0  csv:row 1 ;
>>>        :Town "Southton" ;
>>>        :Population 123000 .
>>>
>>> _:b1  csv:row 2 ;
>>>        :Town "Northville" ;
>>>        :Population 654000 .
>>> -------------
>>>
>>>
>>> Each row is modelling one "entity" (here, a population observation).
>>> There
>>> is a subject (a blank node) and one predicate-value for each cell of the
>>> row.  Row numbers are added because it can be important.
>>>
>>> Background:
>>>
>>> A related idea for property has come up before
>>>
>>>    http://www.hpl.hp.com/techreports/2006/HPL-2006-140.html
>>>
>>> That paper should only be taken as giving a flavour. The motivation was
>>> different, more about making RDF look like regular database especially
>>> when
>>> the data is regular.  At the workshop last week, I talk to Orri Erling
>>> (OpenLink/Virtuoso) and apparently, maybe by parallel evolution, Virtuoso
>>> does something similar.
>>>
>>>
>>> Aside:
>>> There is a whole design space (outside this project) for translating CSV
>>> to
>>> RDF.
>>>
>>> Just if anyone is interested: see the related SQL-to-RDF work:
>>>
>>> http://www.w3.org/TR/r2rml/
>>> http://www.w3.org/TR/rdb-direct-mapping/
>>>
>>> If the metadata said that one of the columns was uniquely defining (a
>>> primary key in SQL terms, or inverse functional property in OWL-terms),
>>> we
>>> wouldn't need blank nodes at all - we could use a URI template, for if
>>> town
>>> names were unique (they are not!) a IRI template of
>>> http://data/town/{Town}
>>> would give:
>>>
>>> -------------
>>> @prefix : <http://example/table> .
>>> @prefix csv: <http://w3c/future-csv-vocab/> .
>>>
>>> <http://data/town/Southton>
>>>        csv:row 1 ;
>>>        rdfs:label "Southton" ;
>>>        :Population 123000 .
>>>
>>> <http://data/town/Northville>
>>>        csv:row 2 ;
>>>        rdfs:label "Northville" ;
>>>        :Population 654000 .
>>> -------------
>>>
>>> Doing this transformation in rules is one route.  JENA-650 connection?
>>> </aside>
>>>
>>> In SPARQL:
>>>
>>> Now the CSV file is viewed as an graph - normal, unmodified SPARQL can be
>>> used.  Multiple CSVs files can be multiple graphs in one dataset to give
>>> query across different data sources.
>>>
>>> # Towns over 500,000 people.
>>> SELECT ?townName ?pop {
>>> { GRAPH <http://example/population> {
>>>      ?x :Town ?townName ;
>>>         :Popuation ?pop .
>>>      FILTER(?pop > 500000)
>>>    }
>>> }
>>>
>>>
>>> A few comments inline - the bulk of this message is above.
>>>
>>> I hope this makes some sense.  Having spent time with people who really
>>> do
>>> work with CSVs files last week around the linked geospatial workshop ,
>>> the
>>> user needs and requirements are much clearer.
>>>
>>>          Andy
>>>
>>> PS I was on a panel that included mentioning the work you did last year.
>>> It
>>> went well.
>>>
>>> On 07/03/14 12:10, Ying Jiang wrote:
>>> ...
>>>
>>>>>> 2. Storage of the table (in-memory is enough, with reading from a
>>>>>> file).
>>>>>>     - Questions:
>>>>>> 2.1 What's the life cycle of the in-memory table? Should we discard
>>>>>> the table after the query execution, or keep it in-memory for later
>>>>>> reuse with the same query or update, or use by a subsequent query?
>>>>>> When will the table be discarded?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> That'll need refining but a way to read and reuse.  There needs to be
>>>>> away
>>>>> for the app to pass in tables (a Map<Sting, ???> and a tool forerading
>>>>> CSVs
>>>>> to get the ???) because ...
>>>>
>>>>
>>>>
>>>> When will the tables be passed in? TARQL loads the CSVs when parsing
>>>> the SPARQL query string. Shall we load the tables and create the Map
>>>> before querying and cache them for resue? This could be similar to
>>>> querying a Dataset, and the simplest way goes something like:
>>>>
>>>> DataTableMap<String, DataTable> dtm =
>>>> DataTableSetFactory.createDataTableMap(); // The keys of dts are the
>>>> URI of the DataTables loaded.
>>>> dtm.addDataTable( "<ex:table_1>", "file:table_1.csv", true); // The
>>>> table data are loaded when added into the map.
>>>> dtm.addDataTable( "<ex:table_2>", "file:table_2.csv", false); // Or
>>>> the table data are *lazy* loaded during querying later on, i.e. not
>>>> loaded now.
>>>> Query query = QueryFactory.create(queryString) ; // New .jj will be
>>>> created for parsing TABLE and FROM TABLE clauses. However the
>>>> QueryFactory interface remains the same as before.
>>>> QueryExecution qExec = QueryExecutionFactory.create(query, model,
>>>> dtm) ; // New create method for QueryExecutionFactory to accomendate
>>>> dtm
>>>> ... //dtm can be reused later on for other QueryExecutions, or be
>>>> discarded when the app ends.
>>>>
>>>> Is the above what you mean? Any comments?
>>>
>>>
>>>
>>> Yes, using TABLE.
>>>
>>> With property tables it can be done as
>>>
>>> // Default graph of the dataset
>>>
>>> Model csv1 =
>>>    ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;
>>> QueryExecution qExec = QueryExecutionFactory.create(query, csv1) ;
>>>
>>> or for multiple CSV files and/or other RDF data:
>>>
>>> Model csv1 =
>>>    ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;
>>> Model csv2 =
>>>    ModelFactory.createModelForGraph(new GraphCSV("data2.csv")) ;
>>>
>>> Dataset dataset = ... ;
>>> dataset.addNamedModel("http://example/population", csv1) ;
>>> dataset.addNamedModel("http://example/table2", csv2) ;
>>>
>>> ... normal SPARQL execution ...
>>>
>>>
>>>>>
>>>>>
>>>>>> 3. Modify the SPARQL grammar to support FROM TABLE and TABLE (for
>>>>>> inclusion inside a larger query, c.f. SPARQL VALUES clause).
>>>>>>     - Questions:
>>>>>> 3.1 What're the differences between FROM TABLE and TABLE?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> FROM TABLE would be one way to get tables into the query as would
>>>>> passing
>>>>> it
>>>>> in in the query context.
>>>>>
>>>>> Queries can't be assumed to
>>>>>
>>>>> TABLE in a query is accessing the table, using it to get the
>>>>>
>>>>> TARQL, and I've only read the documentation, is a query over a single
>>>>> CSV
>>>>> file.  This project should be about multiple CSVs and combining with
>>>>> other
>>>>> RDF data.
>>>>>
>>>>> A quick sketch and the syntax is not checked as sensible:
>>>>>
>>>>> SELECT ... {
>>>>>     # Fixed column names
>>>>>     TABLE <uri> {
>>>>>        BIND (URI(CONCAT('http://example.com/ns#', ?b)) AS ?uri)
>>>>>        BIND (STRLANG(?a, 'en') AS ?with_language_tag)
>>>>>        FILTER (?v > 57)
>>>>>     }
>>>>> }
>>>>>
>>>>> More ambitious to have column naming and FILTERs:
>>>>>
>>>>> SELECT ...
>>>>> WHERE {
>>>>>
>>>>>      TABLE <uri> { "col1" AS ?myVar1 ,
>>>>>                    "col10" AS ?V ,
>>>>>                    "col5" AS ?appName
>>>>>                    FILTER(?V > 57) }
>>>>> }
>>>>>
>>>>> creates a set of bindings based on access description.
>>>>>
>>>>
>>>> Are the <uri> after TABLE the key of the Map<Sting, ???>? If so, I now
>>>> understand the TABLE clauses from the examples. However, still not
>>>> sure about FROM TABLE. Could you please show me some query string
>>>> examples containing the FROM TABLE clauses?
>>>
>>>
>>>
>>> FROM TABLE would set the map entry.  c.f. FROM NAMED
>>>
>>> In this case the name of the table (graph) is the location it comes from
>>> -
>>> it's not a general choice of name.  A common issue for FROM NAMED, not
>>> specific to CSV processing.
>>>
>

Re: [GSoC 2014] Data Tables for SPARQL

Posted by Andy Seaborne <an...@apache.org>.
On 16/03/14 04:31, Ying Jiang wrote:
> Dear Andy,
>
> I greatly appreciate your detailed explanations. I've studied all the
> examples and the links you mentioned. I'll try to summarise here with
> further questions below:
>
> 1. We have 2 possible ways for the project: "variables-as-columns" and
> "property tables". I can understand both the ideas, thanks to your
> instructions. The former one has its issues you pointed out, and the
> latter one seems to make more sense for the users. Do you mean we
> should discard the former one and focus on the latter in this project?

Yes - "predicates-for-columns" = "property tables"

From that, you can recover "variables-as-columns" with a query pattern.
The reverse is messy at best: either very unnatural variable names to
stop clashes, or being careful about scoping (and that will confuse people).
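
For example, the column view falls out of an ordinary basic graph pattern.
A rough sketch (the namespace and the hand-built model below just stand in
for the eventual GraphCSV-backed graph):

import com.hp.hpl.jena.query.* ;
import com.hp.hpl.jena.rdf.model.* ;
import com.hp.hpl.jena.sparql.util.StrUtils ;

public class ColumnsByPattern {
    public static void main(String[] args) {
        // Stand-in for the CSV-backed graph: the two mapped rows, built by hand.
        Model m = ModelFactory.createDefaultModel() ;
        Property town = m.createProperty("http://example/table#", "Town") ;
        Property pop  = m.createProperty("http://example/table#", "Population") ;
        m.createResource().addProperty(town, "Southton").addLiteral(pop, 123000) ;
        m.createResource().addProperty(town, "Northville").addLiteral(pop, 654000) ;

        // "variables-as-columns" recovered by pattern - one variable per column.
        String qs = StrUtils.strjoinNL(
            "PREFIX : <http://example/table#>",
            "SELECT ?Town ?Population",
            "WHERE { ?row :Town ?Town ; :Population ?Population }") ;
        QueryExecution qExec = QueryExecutionFactory.create(qs, m) ;
        ResultSetFormatter.out(qExec.execSelect()) ;
        qExec.close() ;
    }
}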

> 2. We can have some lessons learned from SQL-to-RDF work. But CSV
> (even regular-shaped CSV) is different from database in some ways,
> which requires us to dig in deeper on the details. Some questions
> like:

The W3C "CSV on the Web Working Group" [1] is working on a standard 
mechanism for converting CSV to other forms, RDF included.  The details 
of that mechanism aren't clear yet and won't be in time for the project 
- it's an area that (my current belief) will chop and change a fair bit 
in getting to a final specification.


The area of CSV-to-RDF is bigger than a GSoC project anyway, and fairly open
ended given all the sorts of things people do with CSV files (e.g.
encoding author lists in fields).

But there is a simpler case - one need is a "direct mapping" whereby a
CSV file with no additional metadata is mapped to RDF.  I think we can 
focus on a design for this in the project.

The translation is fixed: a blank node for each row (this addresses the
primary key issue - an alternative is below), and the base URL of the CSV
file is used to generate the predicate names.

Then, the project gets all the machinery working - otherwise the output
will be CSV to RDF without the Jena architectural changes to support it in
the long term.

[1] https://www.w3.org/2013/csvw/wiki/Main_Page
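
To make the fixed translation concrete, a throwaway sketch (not a design for
GraphCSV itself - it uses a naive split(",") instead of a real CSV parser and
leaves every cell as a plain string; datatypes are discussed below):

import java.io.BufferedReader ;
import java.io.IOException ;

import com.hp.hpl.jena.graph.* ;
import com.hp.hpl.jena.sparql.graph.GraphFactory ;

public class DirectMappingSketch {
    public static Graph csvToGraph(String fileUrl, BufferedReader in) throws IOException {
        Graph g = GraphFactory.createDefaultGraph() ;
        Node rowPredicate = NodeFactory.createURI("http://w3c/future-csv-vocab/row") ;

        // The header row gives the predicate local names: <FILE#columnName>
        String[] header = in.readLine().split(",") ;
        Node[] predicates = new Node[header.length] ;
        for ( int i = 0 ; i < header.length ; i++ )
            predicates[i] = NodeFactory.createURI(fileUrl + "#" + header[i].trim()) ;

        // One blank node per data row, one triple per cell, plus the row number.
        String line ;
        int rowNum = 0 ;
        while ( (line = in.readLine()) != null ) {
            rowNum++ ;
            Node row = NodeFactory.createAnon() ;
            g.add(Triple.create(row, rowPredicate,
                                NodeFactory.createLiteral(Integer.toString(rowNum)))) ;
            String[] cells = line.split(",") ;
            for ( int i = 0 ; i < cells.length && i < predicates.length ; i++ )
                g.add(Triple.create(row, predicates[i],
                                    NodeFactory.createLiteral(cells[i].trim()))) ;
        }
        return g ;
    }
}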

> 2.1 How to determine the data type of the column? All the values in
> CSV are firstly parsed as Strings line by line. Suppose the parser
> found a number string of "123000.0", how can we know whether it's an
> integer, a float/double or even just a string in RDF?

Initially, they can be strings.

Later, and maybe as an option the user can turn on, a dynamic choice -
which is a posh way of saying: attempt to parse it as an integer and, if
it passes, it's an integer.  Spreadsheets do this guessing.

"Duck datatyping" - if it looks like an integer (decimal, double, date) 
it is an integer (decimal, double, date).

Actually, this is then the same as tokenizing and there is code to reuse 
to do that.
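
A sketch of the simple guess-by-parsing version (illustrative only - the
real thing would likely reuse the existing tokenizer code):

import com.hp.hpl.jena.datatypes.xsd.XSDDatatype ;
import com.hp.hpl.jena.graph.Node ;
import com.hp.hpl.jena.graph.NodeFactory ;

public class DuckDatatyping {
    /** If it looks like an integer it is an integer; then try double; else a plain string. */
    public static Node cellToNode(String cell) {
        try {
            Long.parseLong(cell) ;
            return NodeFactory.createLiteral(cell, null, XSDDatatype.XSDinteger) ;
        } catch (NumberFormatException ex) {}   // not an integer
        try {
            Double.parseDouble(cell) ;
            return NodeFactory.createLiteral(cell, null, XSDDatatype.XSDdouble) ;
        } catch (NumberFormatException ex) {}   // not a number
        return NodeFactory.createLiteral(cell) ;
    }
}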

> 2.2 How to deal with the namespaces? RDF requires that the subjects
> and the predicates are URIs. We need to pass in the namespaces (or
> just the default namespaces) to make URIs by combining the namespaces
> with the values in CSV. Things may get more complicated if different
> columns are to be bound with different namespaces.

Subjects can be blank nodes, which is useful because each row is then a
new blank node.

One row written in RDF might be:

[ csv:row 1 ; :Town "Southton" ; :Population 123000 ] .

or

_:b0  csv:row 1 ;
       :Town "Southton" ;
       :Population 123000 .

It's the same RDF triples (3 of them).

For predicates, suppose the URL of the CSV file is <FILE> then the 
columns can be  <FILE#Town> and <FILE#Population>.

Rules or SPARQL Update can be used to turn that into a better data model
if the user wants to write that code.
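
e.g. (sketch only - the source prefix stands in for the CSV file URL, and
the target vocabulary and IRI scheme are made up):

import com.hp.hpl.jena.rdf.model.Model ;
import com.hp.hpl.jena.sparql.util.StrUtils ;
import com.hp.hpl.jena.update.UpdateAction ;

public class ReshapeMappedCSV {
    public static void reshape(Model mappedCsv) {
        String update = StrUtils.strjoinNL(
            "PREFIX src:  <http://example/population.csv#>",
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
            "PREFIX :     <http://example/town#>",
            "INSERT { ?town rdfs:label ?name ; :population ?pop }",
            "WHERE  { ?row src:Town ?name ; src:Population ?pop .",
            // Real data would want ENCODE_FOR_URI(?name) to keep the IRI legal.
            "         BIND(IRI(CONCAT('http://data/town/', ?name)) AS ?town) }") ;
        // Adds the reshaped triples alongside the row-shaped ones.
        UpdateAction.parseExecute(update, mappedCsv) ;
    }
}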

> 2.3 The hp 2006 report [1] says "Jena supports three kinds of property
> tables as well as a triple store". The "town" example you provided
> conforms to the "single-valued" property table. Shall we consider the
> others (e.g. the "multi-valued" one and the "triple store" one) in
> this project? Does Jena in the latest release still support these
> property tables? If so, where're the related source codes?

Single-valued.

In the CSV-WG it looks like duplicate column names are not going to be
supported (at best, the parser has to make them unique by adding "1",
"2" etc).

Despite what the report says, the code didn't make it into the public 
Jena codebase.  (And we have removed the old RDB subsystem it refers to.)

> 2.4 There's no "primary key" definition in CSV. All the RDF are not
> OWL in fact. How do we know the column in CSV is uniquely defining? It
> seems CSV lacks of some kind of "metadata" of the columns and the
> values. If we have such metadata, how to pass in the namespace of  the
> IRI template of http://data/town/{Town} (something related to the
> question 2.2)?

It's not necessary to have a defined primary key - the subject is
generated.  It might be nice if available but that's metadata.

So one of:
1/ The triples for each row have a blank node for subject
2/ The triples for row N have a URI which is <FILE#_N>.

In both cases, the subject node is generated automatically.
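
In code terms, roughly (sketch, made-up method):

// How a row subject might be chosen - sketch only.
Node subjectForRow(String fileUrl, int rowNum, boolean useBlankNodes) {
    return useBlankNodes
        ? NodeFactory.createAnon()                          // 1/ a fresh blank node per row
        : NodeFactory.createURI(fileUrl + "#_" + rowNum) ;  // 2/ <FILE#_N>
}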

> 3. For the "property tables" way, it seems that all we need to do is
> to resolve the problems in 2., and to code "GraphCSV" accordingly. I
> can make the GraphCSV class by implementing the Graph interface. In
> this way, for Jena ARP, a CSV table is actually a Graph, without any
> differences from other types of Graphs. It looks like that there's no
> need to introduce TABLE and FROM TABLE clauses in the SPARQL language
> grammar. We can just use the existing GRAPH, FROM and FROM NAMED
> clauses for the CSV "property tables", can't we?

s/ARP/ARQ/ -- ARP is the RDF/XML parser; ARQ is the query engine :-)

Yes - correct.

In the later stages of the project, there is an item to make OpExecutor 
(which is the class that actually drives the SPARQL execution) do better 
for GraphCSV than just treating it as a Graph by accessing the 
PropertyTable behind it.

The big gain for PropertyTables is the space saving they enable as well 
as the possibility of making them persistent in a special storage system 
(not in this project but the design should not make that too hard at 
some later time).
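
Very roughly, the thing sitting behind GraphCSV might look like this
(entirely hypothetical - not existing Jena API):

import java.util.List ;
import com.hp.hpl.jena.graph.Node ;

/** A cell is stored once, in its column, rather than as a full triple. */
public interface PropertyTable {
    List<Node> getColumns() ;                    // predicate URIs, one per CSV column
    int        getRowCount() ;
    Node       getRowSubject(int row) ;          // blank node (or <FILE#_N>) for the row
    Node       getValue(int row, Node column) ;  // the cell as an RDF term
}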

	Andy

>
> Best regards,
> Ying Jiang
>
> [1] http://www.hpl.hp.com/techreports/2006/HPL-2006-140.pdf
>
>
>
> On Mon, Mar 10, 2014 at 10:50 PM, Andy Seaborne <an...@apache.org> wrote:
>> Hi Ying,
>>
>> Good questions.  I'll try to give a response to the specific points you've
>> brought up but also there is a different I want to put forward for
>> discussion.
>>
>> I'll write up a first draft of a project plan then we can see if the size
>> and scope is realistic.
>>
>> You asked about whether variables are column names.  That is how TARQL and
>> SPARQL VALUES works but I've realised there is a different approach and it's
>> one that will give a better system.  It is to translate the CSV to RDF, and
>> this may be materialized or dynamically mapped. If "materialized" it's
>> likely to be a lot bigger; as "property tables" or somethign inspired by
>> that idea, it'll be more compact.
>>
>> There are some issues with variables-as-columns include:
>>
>> 1/ Fixed variable names don't combine with other part of a query pattern
>> very well.
>>
>> If there is common use of the same name it a join - that's what a natural
>> join in SQL is.  If there are two tables, then ?a is overloaded.  If column
>> names are used to derive a variable name, we may not want to equate them in
>> the query because column names in different CSV files weren't designed with
>> that in mind.
>>
>> 2/ You can't describe (in RDF) the data very easily - e.g. annotate that a
>> column is of years.
>>
>> 3/  It needs the language to change (i.e. TABLE to access it)
>>
>> In TARQL, which is focusing on a controlled transform from CSV to RDF, it
>> works out quite nicely - variables go into the CONSTRUCT template. It
>> produces RDF.
>>
>> Property tables are a style of approach where the CSV data is accessed as
>> RDF.
>>
>> The data table columns be predicate URIs.  The data table itself is an RDF
>> graph of regular structure.  It can be accessed with normal (unmodified)
>> SPARQL syntax. It would be better if the storage and execution of that part
>> of the SPARQL query were adapted to such regular data.  Something for after
>> getting an initial cut down.
>>
>> Suppose we have a CSV file:
>> -------------------
>> Town,Population
>> Southton,123000
>> Northville,654000
>> -------------------
>>
>> One header row, two data rows.
>>
>> Aside: this is regular-shaped CSV (and some CSV files are definitely not
>> regular at all!). There is the current editors working draft from the CSV on
>> the Web Working Group (not yet published, likely to change, only part of the
>> picture, etc etc)
>>
>> http://w3c.github.io/csvw/syntax/
>>
>> which is defining a more regular data out of CSV.  This is the target for
>> the CSV work: table shaped CSV; not arbitrary, irregularly shaped CSV.
>>
>> There is no way the working group will have standardised any CSV to RDF
>> mapping in the lifetime of the GSoC project but the WG charter says it must
>> be covered.  So the mapping below is made up and ahead of where the working
>> group is currently but a standardized, "direct mapping" (no metadata, no
>> templates) style is going to happen.  The mapping details may change but the
>> general approach is clear.
>>
>> As RDF this might be
>>
>> -------------
>> @prefix : <http://example/table> .
>> @prefix csv: <http://w3c/future-csv-vocab/> .
>>
>> [ csv:row 1 ; :Town "Southton" ; :Population 123000 ] .
>> [ csv:row 2 ; :Town "Northville" ; :Population 654000 ] .
>> -------------
>>
>> or without the bnode abbreviation:
>>
>> -------------
>> @prefix : <http://example/table> .
>> @prefix csv: <http://w3c/future-csv-vocab/> .
>>
>> _:b0  csv:row 1 ;
>>        :Town "Southton" ;
>>        :Population 123000 .
>>
>> _:b1  csv:row 2 ;
>>        :Town "Northville" ;
>>        :Population 654000 .
>> -------------
>>
>>
>> Each row is modelling one "entity" (here, a population observation). There
>> is a subject (a blank node) and one predicate-value for each cell of the
>> row.  Row numbers are added because it can be important.
>>
>> Background:
>>
>> A related idea for property has come up before
>>
>>    http://www.hpl.hp.com/techreports/2006/HPL-2006-140.html
>>
>> That paper should only be taken as giving a flavour. The motivation was
>> different, more about making RDF look like regular database especially when
>> the data is regular.  At the workshop last week, I talk to Orri Erling
>> (OpenLink/Virtuoso) and apparently, maybe by parallel evolution, Virtuoso
>> does something similar.
>>
>>
>> Aside:
>> There is a whole design space (outside this project) for translating CSV to
>> RDF.
>>
>> Just if anyone is interested: see the related SQL-to-RDF work:
>>
>> http://www.w3.org/TR/r2rml/
>> http://www.w3.org/TR/rdb-direct-mapping/
>>
>> If the metadata said that one of the columns was uniquely defining (a
>> primary key in SQL terms, or inverse functional property in OWL-terms), we
>> wouldn't need blank nodes at all - we could use a URI template, for if town
>> names were unique (they are not!) a IRI template of http://data/town/{Town}
>> would give:
>>
>> -------------
>> @prefix : <http://example/table> .
>> @prefix csv: <http://w3c/future-csv-vocab/> .
>>
>> <http://data/town/Southton>
>>        csv:row 1 ;
>>        rdfs:label "Southton" ;
>>        :Population 123000 .
>>
>> <http://data/town/Northville>
>>        csv:row 2 ;
>>        rdfs:label "Northville" ;
>>        :Population 654000 .
>> -------------
>>
>> Doing this transformation in rules is one route.  JENA-650 connection?
>> </aside>
>>
>> In SPARQL:
>>
>> Now the CSV file is viewed as an graph - normal, unmodified SPARQL can be
>> used.  Multiple CSVs files can be multiple graphs in one dataset to give
>> query across different data sources.
>>
>> # Towns over 500,000 people.
>> SELECT ?townName ?pop {
>>   GRAPH <http://example/population> {
>>      ?x :Town ?townName ;
>>         :Population ?pop .
>>      FILTER(?pop > 500000)
>>   }
>> }
>>
>>
>> A few comments inline - the bulk of this message is above.
>>
>> I hope this makes some sense.  Having spent time with people who really do
>> work with CSVs files last week around the linked geospatial workshop , the
>> user needs and requirements are much clearer.
>>
>>          Andy
>>
>> PS I was on a panel that included mentioning the work you did last year.  It
>> went well.
>>
>> On 07/03/14 12:10, Ying Jiang wrote:
>> ...
>>
>>>>> 2. Storage of the table (in-memory is enough, with reading from a file).
>>>>>     - Questions:
>>>>> 2.1 What's the life cycle of the in-memory table? Should we discard
>>>>> the table after the query execution, or keep it in-memory for later
>>>>> reuse with the same query or update, or use by a subsequent query?
>>>>> When will the table be discarded?
>>>>
>>>>
>>>>
>>>> That'll need refining but a way to read and reuse.  There needs to be
>>>> away
>>>> for the app to pass in tables (a Map<Sting, ???> and a tool forerading
>>>> CSVs
>>>> to get the ???) because ...
>>>
>>>
>>> When will the tables be passed in? TARQL loads the CSVs when parsing
>>> the SPARQL query string. Shall we load the tables and create the Map
>>> before querying and cache them for resue? This could be similar to
>>> querying a Dataset, and the simplest way goes something like:
>>>
>>> DataTableMap<String, DataTable> dtm =
>>> DataTableSetFactory.createDataTableMap(); // The keys of dts are the
>>> URI of the DataTables loaded.
>>> dtm.addDataTable( "<ex:table_1>", "file:table_1.csv", true); // The
>>> table data are loaded when added into the map.
>>> dtm.addDataTable( "<ex:table_2>", "file:table_2.csv", false); // Or
>>> the table data are *lazy* loaded during querying later on, i.e. not
>>> loaded now.
>>> Query query = QueryFactory.create(queryString) ; // New .jj will be
>>> created for parsing TABLE and FROM TABLE clauses. However the
>>> QueryFactory interface remains the same as before.
>>> QueryExecution qExec = QueryExecutionFactory.create(query, model,
>>> dtm) ; // New create method for QueryExecutionFactory to accomendate
>>> dtm
>>> ... //dtm can be reused later on for other QueryExecutions, or be
>>> discarded when the app ends.
>>>
>>> Is the above what you mean? Any comments?
>>
>>
>> Yes, using TABLE.
>>
>> With property tables it can be done as
>>
>> // Default graph of the dataset
>>
>> Model csv1 =
>>    ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;
>> QueryExecution qExec = QueryExecutionFactory.create(query, csv1) ;
>>
>> or for multiple CSV files and/or other RDF data:
>>
>> Model csv1 =
>>    ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;
>> Model csv2 =
>>    ModelFactory.createModelForGraph(new GraphCSV("data2.csv")) ;
>>
>> Dataset dataset = ... ;
>> dataset.addNamedModel("http://example/population", csv1) ;
>> dataset.addNamedModel("http://example/table2", csv2) ;
>>
>> ... normal SPARQL execution ...
>>
>>
>>>>
>>>>
>>>>> 3. Modify the SPARQL grammar to support FROM TABLE and TABLE (for
>>>>> inclusion inside a larger query, c.f. SPARQL VALUES clause).
>>>>>     - Questions:
>>>>> 3.1 What're the differences between FROM TABLE and TABLE?
>>>>
>>>>
>>>>
>>>> FROM TABLE would be one way to get tables into the query as would passing
>>>> it
>>>> in in the query context.
>>>>
>>>> Queries can't be assumed to
>>>>
>>>> TABLE in a query is accessing the table, using it to get the
>>>>
>>>> TARQL, and I've only read the documentation, is a query over a single CSV
>>>> file.  This project should be about multiple CSVs and combining with
>>>> other
>>>> RDF data.
>>>>
>>>> A quick sketch and the syntax is not checked as sensible:
>>>>
>>>> SELECT ... {
>>>>     # Fixed column names
>>>>     TABLE <uri> {
>>>>        BIND (URI(CONCAT('http://example.com/ns#', ?b)) AS ?uri)
>>>>        BIND (STRLANG(?a, 'en') AS ?with_language_tag)
>>>>        FILTER (?v > 57)
>>>>     }
>>>> }
>>>>
>>>> More ambitious to have column naming and FILTERs:
>>>>
>>>> SELECT ...
>>>> WHERE {
>>>>
>>>>      TABLE <uri> { "col1" AS ?myVar1 ,
>>>>                    "col10" AS ?V ,
>>>>                    "col5" AS ?appName
>>>>                    FILTER(?V > 57) }
>>>> }
>>>>
>>>> creates a set of bindings based on access description.
>>>>
>>>
>>> Are the <uri> after TABLE the key of the Map<Sting, ???>? If so, I now
>>> understand the TABLE clauses from the examples. However, still not
>>> sure about FROM TABLE. Could you please show me some query string
>>> examples containing the FROM TABLE clauses?
>>
>>
>> FROM TABLE would set the map entry.  c.f. FROM NAMED
>>
>> In this case the name of the table (graph) is the location it comes from -
>> it's not a general choice of name.  A common issue for FROM NAMED, not
>> specific to CSV processing.
>>


Re: [GSoC 2014] Data Tables for SPARQL

Posted by Ying Jiang <jp...@gmail.com>.
Dear Andy,

I greatly appreciate your detailed explanations. I've studied all the
examples and the links you mentioned. I'll try to summarise here with
further questions below:

1. We have 2 possible ways for the project: "variables-as-columns" and
"property tables". I can understand both the ideas, thanks to your
instructions. The former one has its issues you pointed out, and the
latter one seems to make more sense for the users. Do you mean we
should discard the former one and focus on the latter in this project?

2. We can take some lessons from the SQL-to-RDF work. But CSV
(even regular-shaped CSV) is different from a database in some ways,
which requires us to dig deeper into the details. Some questions:
2.1 How do we determine the data type of a column? All the values in
CSV are first parsed as Strings, line by line. Suppose the parser
finds a number string "123000.0"; how can we know whether it's an
integer, a float/double, or just a string in RDF?
2.2 How to deal with the namespaces? RDF requires that the subjects
and the predicates are URIs. We need to pass in the namespaces (or
just the default namespaces) to make URIs by combining the namespaces
with the values in CSV. Things may get more complicated if different
columns are to be bound with different namespaces.
2.3 The HP 2006 report [1] says "Jena supports three kinds of property
tables as well as a triple store". The "town" example you provided
conforms to the "single-valued" property table. Shall we consider the
others (e.g. the "multi-valued" one and the "triple store" one) in
this project? Does Jena in the latest release still support these
property tables? If so, where is the related source code?
2.4 There's no "primary key" definition in CSV. RDF data is not OWL,
in fact. How do we know that a column in CSV is uniquely defining? It
seems CSV lacks some kind of "metadata" about the columns and the
values. If we have such metadata, how do we pass in the namespace of
the IRI template http://data/town/{Town} (something related to
question 2.2)?

3. For the "property tables" way, it seems that all we need to do is
to resolve the problems in 2., and to code "GraphCSV" accordingly. I
can make the GraphCSV class by implementing the Graph interface. In
this way, for Jena ARP, a CSV table is actually a Graph, without any
differences from other types of Graphs. It looks like there's no
need to introduce TABLE and FROM TABLE clauses in the SPARQL language
grammar. We can just use the existing GRAPH, FROM and FROM NAMED
clauses for the CSV "property tables", can't we?

Best regards,
Ying Jiang

[1] http://www.hpl.hp.com/techreports/2006/HPL-2006-140.pdf



On Mon, Mar 10, 2014 at 10:50 PM, Andy Seaborne <an...@apache.org> wrote:
> Hi Ying,
>
> Good questions.  I'll try to give a response to the specific points you've
> brought up but also there is a different I want to put forward for
> discussion.
>
> I'll write up a first draft of a project plan then we can see if the size
> and scope is realistic.
>
> You asked about whether variables are column names.  That is how TARQL and
> SPARQL VALUES works but I've realised there is a different approach and it's
> one that will give a better system.  It is to translate the CSV to RDF, and
> this may be materialized or dynamically mapped. If "materialized" it's
> likely to be a lot bigger; as "property tables" or somethign inspired by
> that idea, it'll be more compact.
>
> There are some issues with variables-as-columns include:
>
> 1/ Fixed variable names don't combine with other part of a query pattern
> very well.
>
> If there is common use of the same name it a join - that's what a natural
> join in SQL is.  If there are two tables, then ?a is overloaded.  If column
> names are used to derive a variable name, we may not want to equate them in
> the query because column names in different CSV files weren't designed with
> that in mind.
>
> 2/ You can't describe (in RDF) the data very easily - e.g. annotate that a
> column is of years.
>
> 3/  It needs the language to change (i.e. TABLE to access it)
>
> In TARQL, which is focusing on a controlled transform from CSV to RDF, it
> works out quite nicely - variables go into the CONSTRUCT template. It
> produces RDF.
>
> Property tables are a style of approach where the CSV data is accessed as
> RDF.
>
> The data table columns be predicate URIs.  The data table itself is an RDF
> graph of regular structure.  It can be accessed with normal (unmodified)
> SPARQL syntax. It would be better if the storage and execution of that part
> of the SPARQL query were adapted to such regular data.  Something for after
> getting an initial cut down.
>
> Suppose we have a CSV file:
> -------------------
> Town,Population
> Southton,123000
> Northville,654000
> -------------------
>
> One header row, two data rows.
>
> Aside: this is regular-shaped CSV (and some CSV files are definitely not
> regular at all!). There is the current editors working draft from the CSV on
> the Web Working Group (not yet published, likely to change, only part of the
> picture, etc etc)
>
> http://w3c.github.io/csvw/syntax/
>
> which is defining a more regular data out of CSV.  This is the target for
> the CSV work: table shaped CSV; not arbitrary, irregularly shaped CSV.
>
> There is no way the working group will have standardised any CSV to RDF
> mapping in the lifetime of the GSoC project but the WG charter says it must
> be covered.  So the mapping below is made up and ahead of where the working
> group is currently but a standardized, "direct mapping" (no metadata, no
> templates) style is going to happen.  The mapping details may change but the
> general approach is clear.
>
> As RDF this might be
>
> -------------
> @prefix : <http://example/table> .
> @prefix csv: <http://w3c/future-csv-vocab/> .
>
> [ csv:row 1 ; :Town "Southton" ; :Population 123000 ] .
> [ csv:row 2 ; :Town "Northville" ; :Population 654000 ] .
> -------------
>
> or without the bnode abbreviation:
>
> -------------
> @prefix : <http://example/table> .
> @prefix csv: <http://w3c/future-csv-vocab/> .
>
> _:b0  csv:row 1 ;
>       :Town "Southton" ;
>       :Population 123000 .
>
> _:b1  csv:row 2 ;
>       :Town "Northville" ;
>       :Population 654000 .
> -------------
>
>
> Each row is modelling one "entity" (here, a population observation). There
> is a subject (a blank node) and one predicate-value for each cell of the
> row.  Row numbers are added because it can be important.
>
> Background:
>
> A related idea for property has come up before
>
>   http://www.hpl.hp.com/techreports/2006/HPL-2006-140.html
>
> That paper should only be taken as giving a flavour. The motivation was
> different, more about making RDF look like regular database especially when
> the data is regular.  At the workshop last week, I talk to Orri Erling
> (OpenLink/Virtuoso) and apparently, maybe by parallel evolution, Virtuoso
> does something similar.
>
>
> Aside:
> There is a whole design space (outside this project) for translating CSV to
> RDF.
>
> Just if anyone is interested: see the related SQL-to-RDF work:
>
> http://www.w3.org/TR/r2rml/
> http://www.w3.org/TR/rdb-direct-mapping/
>
> If the metadata said that one of the columns was uniquely defining (a
> primary key in SQL terms, or inverse functional property in OWL-terms), we
> wouldn't need blank nodes at all - we could use a URI template, for if town
> names were unique (they are not!) a IRI template of http://data/town/{Town}
> would give:
>
> -------------
> @prefix : <http://example/table> .
> @prefix csv: <http://w3c/future-csv-vocab/> .
>
> <http://data/town/Southton>
>       csv:row 1 ;
>       rdfs:label "Southton" ;
>       :Population 123000 .
>
> <http://data/town/Northville>
>       csv:row 2 ;
>       rdfs:label "Northville" ;
>       :Population 654000 .
> -------------
>
> Doing this transformation in rules is one route.  JENA-650 connection?
> </aside>
>
> In SPARQL:
>
> Now the CSV file is viewed as an graph - normal, unmodified SPARQL can be
> used.  Multiple CSVs files can be multiple graphs in one dataset to give
> query across different data sources.
>
> # Towns over 500,000 people.
> SELECT ?townName ?pop {
>   GRAPH <http://example/population> {
>     ?x :Town ?townName ;
>        :Population ?pop .
>     FILTER(?pop > 500000)
>   }
> }
>
>
> A few comments inline - the bulk of this message is above.
>
> I hope this makes some sense.  Having spent time with people who really do
> work with CSVs files last week around the linked geospatial workshop , the
> user needs and requirements are much clearer.
>
>         Andy
>
> PS I was on a panel that included mentioning the work you did last year.  It
> went well.
>
> On 07/03/14 12:10, Ying Jiang wrote:
> ...
>
>>>> 2. Storage of the table (in-memory is enough, with reading from a file).
>>>>    - Questions:
>>>> 2.1 What's the life cycle of the in-memory table? Should we discard
>>>> the table after the query execution, or keep it in-memory for later
>>>> reuse with the same query or update, or use by a subsequent query?
>>>> When will the table be discarded?
>>>
>>>
>>>
>>> That'll need refining but a way to read and reuse.  There needs to be
>>> away
>>> for the app to pass in tables (a Map<Sting, ???> and a tool forerading
>>> CSVs
>>> to get the ???) because ...
>>
>>
>> When will the tables be passed in? TARQL loads the CSVs when parsing
>> the SPARQL query string. Shall we load the tables and create the Map
>> before querying and cache them for resue? This could be similar to
>> querying a Dataset, and the simplest way goes something like:
>>
>> DataTableMap<String, DataTable> dtm =
>> DataTableSetFactory.createDataTableMap(); // The keys of dts are the
>> URI of the DataTables loaded.
>> dtm.addDataTable( "<ex:table_1>", "file:table_1.csv", true); // The
>> table data are loaded when added into the map.
>> dtm.addDataTable( "<ex:table_2>", "file:table_2.csv", false); // Or
>> the table data are *lazy* loaded during querying later on, i.e. not
>> loaded now.
>> Query query = QueryFactory.create(queryString) ; // New .jj will be
>> created for parsing TABLE and FROM TABLE clauses. However the
>> QueryFactory interface remains the same as before.
>> QueryExecution qExec = QueryExecutionFactory.create(query, model,
>> dtm) ; // New create method for QueryExecutionFactory to accomendate
>> dtm
>> ... //dtm can be reused later on for other QueryExecutions, or be
>> discarded when the app ends.
>>
>> Is the above what you mean? Any comments?
>
>
> Yes, using TABLE.
>
> With property tables it can be done as
>
> // Default graph of the dataset
>
> Model csv1 =
>   ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;
> QueryExecution qExec = QueryExecutionFactory.create(query, csv1) ;
>
> or for multiple CSV files and/or other RDF data:
>
> Model csv1 =
>   ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;
> Model csv2 =
>   ModelFactory.createModelForGraph(new GraphCSV("data2.csv")) ;
>
> Dataset dataset = ... ;
> dataset.addNamedModel("http://example/population", csv1) ;
> dataset.addNamedModel("http://example/table2", csv2) ;
>
> ... normal SPARQL execution ...
>
>
>>>
>>>
>>>> 3. Modify the SPARQL grammar to support FROM TABLE and TABLE (for
>>>> inclusion inside a larger query, c.f. SPARQL VALUES clause).
>>>>    - Questions:
>>>> 3.1 What're the differences between FROM TABLE and TABLE?
>>>
>>>
>>>
>>> FROM TABLE would be one way to get tables into the query as would passing
>>> it
>>> in in the query context.
>>>
>>> Queries can't be assumed to
>>>
>>> TABLE in a query is accessing the table, using it to get the
>>>
>>> TARQL, and I've only read the documentation, is a query over a single CSV
>>> file.  This project should be about multiple CSVs and combining with
>>> other
>>> RDF data.
>>>
>>> A quick sketch and the syntax is not checked as sensible:
>>>
>>> SELECT ... {
>>>    # Fixed column names
>>>    TABLE <uri> {
>>>       BIND (URI(CONCAT('http://example.com/ns#', ?b)) AS ?uri)
>>>       BIND (STRLANG(?a, 'en') AS ?with_language_tag)
>>>       FILTER (?v > 57)
>>>    }
>>> }
>>>
>>> More ambitious to have column naming and FILTERs:
>>>
>>> SELECT ...
>>> WHERE {
>>>
>>>     TABLE <uri> { "col1" AS ?myVar1 ,
>>>                   "col10" AS ?V ,
>>>                   "col5" AS ?appName
>>>                   FILTER(?V > 57) }
>>> }
>>>
>>> creates a set of bindings based on access description.
>>>
>>
>> Are the <uri> after TABLE the key of the Map<Sting, ???>? If so, I now
>> understand the TABLE clauses from the examples. However, still not
>> sure about FROM TABLE. Could you please show me some query string
>> examples containing the FROM TABLE clauses?
>
>
> FROM TABLE would set the map entry.  c.f. FROM NAMED
>
> In this case the name of the table (graph) is the location it comes from -
> it's not a general choice of name.  A common issue for FROM NAMED, not
> specific to CSV processing.
>

Re: [GSoC 2014] Data Tables for SPARQL

Posted by Andy Seaborne <an...@apache.org>.
Hi Ying,

Good questions.  I'll try to give a response to the specific points you've
brought up, but there is also a different idea I want to put forward for
discussion.

I'll write up a first draft of a project plan then we can see if the 
size and scope is realistic.

You asked about whether variables are column names.  That is how TARQL and
SPARQL VALUES work, but I've realised there is a different approach and it's
one that will give a better system.  It is to translate the CSV to RDF, and
this may be materialized or dynamically mapped. If "materialized" it's
likely to be a lot bigger; as "property tables", or something inspired by
that idea, it'll be more compact.

There are some issues with variables-as-columns, including:

1/ Fixed variable names don't combine with other parts of a query pattern
very well.

If there is common use of the same name, it is a join - that's what a
natural join in SQL is.  If there are two tables, then ?a is overloaded.
If column names are used to derive a variable name, we may not want to
equate them in the query because column names in different CSV files
weren't designed with that in mind.

2/ You can't describe (in RDF) the data very easily - e.g. annotate that 
a column is of years.

3/  It needs the language to change (i.e. TABLE to access it)

In TARQL, which is focusing on a controlled transform from CSV to RDF, 
it works out quite nicely - variables go into the CONSTRUCT template. 
It produces RDF.

Property tables are a style of approach where the CSV data is accessed 
as RDF.

The data table columns become predicate URIs.  The data table itself is an
RDF graph of regular structure.  It can be accessed with normal (unmodified)
SPARQL syntax. It would be better if the storage and execution of that part
of the SPARQL query were adapted to such regular data.  Something for after
getting an initial cut done.

Suppose we have a CSV file:
-------------------
Town,Population
Southton,123000
Northville,654000
-------------------

One header row, two data rows.

Aside: this is regular-shaped CSV (and some CSV files are definitely not 
regular at all!). There is the current editors working draft from the 
CSV on the Web Working Group (not yet published, likely to change, only 
part of the picture, etc etc)

http://w3c.github.io/csvw/syntax/

which is defining a more regular form of data out of CSV.  This is the
target for the CSV work: table-shaped CSV, not arbitrary, irregularly shaped CSV.

There is no way the working group will have standardised any CSV to RDF 
mapping in the lifetime of the GSoC project but the WG charter says it 
must be covered.  So the mapping below is made up and ahead of where the 
working group is currently but a standardized, "direct mapping" (no 
metadata, no templates) style is going to happen.  The mapping details 
may change but the general approach is clear.

As RDF this might be

-------------
@prefix : <http://example/table> .
@prefix csv: <http://w3c/future-csv-vocab/> .

[ csv:row 1 ; :Town "Southton" ; :Population 123000 ] .
[ csv:row 2 ; :Town "Northville" ; :Population 654000 ] .
-------------

or without the bnode abbreviation:

-------------
@prefix : <http://example/table> .
@prefix csv: <http://w3c/future-csv-vocab/> .

_:b0  csv:row 1 ;
       :Town "Southton" ;
       :Population 123000 .

_:b1  csv:row 2 ;
       :Town "Northville" ;
       :Population 654000 .
-------------


Each row is modelling one "entity" (here, a population observation). 
There is a subject (a blank node) and one predicate-value for each cell 
of the row.  Row numbers are added because it can be important.

Background:

A related idea, property tables, has come up before

   http://www.hpl.hp.com/techreports/2006/HPL-2006-140.html

That paper should only be taken as giving a flavour. The motivation was
different, more about making RDF look like a regular database, especially
when the data is regular.  At the workshop last week, I talked to Orri
Erling (OpenLink/Virtuoso) and apparently, maybe by parallel evolution,
Virtuoso does something similar.


Aside:
There is a whole design space (outside this project) for translating CSV 
to RDF.

Just if anyone is interested: see the related SQL-to-RDF work:

http://www.w3.org/TR/r2rml/
http://www.w3.org/TR/rdb-direct-mapping/

If the metadata said that one of the columns was uniquely defining (a
primary key in SQL terms, or an inverse functional property in OWL terms),
we wouldn't need blank nodes at all - we could use a URI template. For
example, if town names were unique (they are not!) an IRI template of
http://data/town/{Town} would give:

-------------
@prefix : <http://example/table> .
@prefix csv: <http://w3c/future-csv-vocab/> .

<http://data/town/Southton>
       csv:row 1 ;
       rdfs:label "Southton" ;
       :Population 123000 .

<http://data/town/Northville>
       csv:row 2 ;
       rdfs:label "Northville" ;
       :Population 654000 .
-------------

Doing this transformation in rules is one route.  JENA-650 connection?
</aside>

In SPARQL:

Now the CSV file is viewed as a graph - normal, unmodified SPARQL can
be used.  Multiple CSV files can be multiple graphs in one dataset, to
give queries across different data sources.

# Towns over 500,000 people.
SELECT ?townName ?pop {
  GRAPH <http://example/population> {
     ?x :Town ?townName ;
        :Population ?pop .
     FILTER(?pop > 500000)
  }
}


A few comments inline - the bulk of this message is above.

I hope this makes some sense.  Having spent time last week, around the
linked geospatial workshop, with people who really do work with CSV files,
the user needs and requirements are much clearer.

	Andy

PS I was on a panel that included mentioning the work you did last year. 
  It went well.

On 07/03/14 12:10, Ying Jiang wrote:
...
>>> 2. Storage of the table (in-memory is enough, with reading from a file).
>>>    - Questions:
>>> 2.1 What's the life cycle of the in-memory table? Should we discard
>>> the table after the query execution, or keep it in-memory for later
>>> reuse with the same query or update, or use by a subsequent query?
>>> When will the table be discarded?
>>
>>
>> That'll need refining but a way to read and reuse.  There needs to be away
>> for the app to pass in tables (a Map<Sting, ???> and a tool forerading CSVs
>> to get the ???) because ...
>
> When will the tables be passed in? TARQL loads the CSVs when parsing
> the SPARQL query string. Shall we load the tables and create the Map
> before querying and cache them for resue? This could be similar to
> querying a Dataset, and the simplest way goes something like:
>
> DataTableMap<String, DataTable> dtm =
> DataTableSetFactory.createDataTableMap(); // The keys of dts are the
> URI of the DataTables loaded.
> dtm.addDataTable( "<ex:table_1>", "file:table_1.csv", true); // The
> table data are loaded when added into the map.
> dtm.addDataTable( "<ex:table_2>", "file:table_2.csv", false); // Or
> the table data are *lazy* loaded during querying later on, i.e. not
> loaded now.
> Query query = QueryFactory.create(queryString) ; // New .jj will be
> created for parsing TABLE and FROM TABLE clauses. However the
> QueryFactory interface remains the same as before.
> QueryExecution qExec = QueryExecutionFactory.create(query, model,
> dtm) ; // New create method for QueryExecutionFactory to accomendate
> dtm
> ... //dtm can be reused later on for other QueryExecutions, or be
> discarded when the app ends.
>
> Is the above what you mean? Any comments?

Yes, using TABLE.

With property tables it can be done as

// Default graph of the dataset

Model csv1 =
   ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;
QueryExecution qExec = QueryExecutionFactory.create(query, csv1) ;

or for multiple CSV files and/or other RDF data:

Model csv1 =
   ModelFactory.createModelForGraph(new GraphCSV("data1.csv")) ;
Model csv2 =
   ModelFactory.createModelForGraph(new GraphCSV("data2.csv")) ;

Dataset dataset = ... ;
dataset.addNamedModel("http://example/population", csv1) ;
dataset.addNamedModel("http://example/table2", csv2) ;

... normal SPARQL execution ...
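
e.g. (sketch; queryString would be the "towns over 500,000" query above):

Query query = QueryFactory.create(queryString) ;
QueryExecution qExec = QueryExecutionFactory.create(query, dataset) ;
try {
    ResultSet results = qExec.execSelect() ;
    ResultSetFormatter.out(results) ;
} finally {
    qExec.close() ;
}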

>>
>>
>>> 3. Modify the SPARQL grammar to support FROM TABLE and TABLE (for
>>> inclusion inside a larger query, c.f. SPARQL VALUES clause).
>>>    - Questions:
>>> 3.1 What're the differences between FROM TABLE and TABLE?
>>
>>
>> FROM TABLE would be one way to get tables into the query as would passing it
>> in in the query context.
>>
>> Queries can't be assumed to
>>
>> TABLE in a query is accessing the table, using it to get the
>>
>> TARQL, and I've only read the documentation, is a query over a single CSV
>> file.  This project should be about multiple CSVs and combining with other
>> RDF data.
>>
>> A quick sketch and the syntax is not checked as sensible:
>>
>> SELECT ... {
>>    # Fixed column names
>>    TABLE <uri> {
>>       BIND (URI(CONCAT('http://example.com/ns#', ?b)) AS ?uri)
>>       BIND (STRLANG(?a, 'en') AS ?with_language_tag)
>>       FILTER (?v > 57)
>>    }
>> }
>>
>> More ambitious to have column naming and FILTERs:
>>
>> SELECT ...
>> WHERE {
>>
>>     TABLE <uri> { "col1" AS ?myVar1 ,
>>                   "col10" AS ?V ,
>>                   "col5" AS ?appName
>>                   FILTER(?V > 57) }
>> }
>>
>> creates a set of bindings based on access description.
>>
>
> Are the <uri> after TABLE the key of the Map<Sting, ???>? If so, I now
> understand the TABLE clauses from the examples. However, still not
> sure about FROM TABLE. Could you please show me some query string
> examples containing the FROM TABLE clauses?

FROM TABLE would set the map entry.  c.f. FROM NAMED

In this case the name of the table (graph) is the location it comes from 
- it's not a general choice of name.  A common issue for FROM NAMED, not 
specific to CSV processing.


Re: [GSoC 2014] Data Tables for SPARQL

Posted by Ying Jiang <jp...@gmail.com>.
Hi Andy,

Thanks for your explanations! Please check my further questions below:

On Tue, Mar 4, 2014 at 6:26 AM, Andy Seaborne <an...@apache.org> wrote:
> On 03/03/14 03:12, Ying Jiang wrote:
>>
>> Hi Andy,
>
>
> Hi Ying,
>
>
>>
>> Thanks for your suggestions! I'm more interested in JENA-625 (Data
>> Tables for SPARQL). I've seen your new comments in JIRA and studied
>> the source code of Tarql. I'd like to paste your comments here with my
>> questions below to clarify the details of this project:
>>
>> 1. CSV to RDF terms (tuples of RDF Terms is already supported
>> internally in Jena)
>>   - Questions:
>> 1.1 Tarql uses the first row of CSV as variable names. Should we use
>> the same idea?
>
>
> Seems like good start although care is needed because the column can be
> anything and SPARQL variables are restricted.
>
> If there is no header row, and we can require that app should say so by some
> mechanism, or if the app wants different names, then a way to provide that,
> falling back to something predicable if dull: ?_col1, ?_col2, ...
>
> See below - there's no need to have fixed variable names.
>
>
>> 1.2 As to "internal support of tuples of RDF terms in Jena", do you
>> mean com.hp.hpl.jena.sparql.algebra.table.TableData? Tarql uses
>> TableData to accommodate RDF term bindings from CSV.
>
>
> That and there is also some RDF tuples code to read/write a textual form as
> well:
>
> https://svn.apache.org/repos/asf/jena/Experimental/rdfpatch/src/main/java/org/apache/jena/riot/tio/
>
> (there are other versions of this code around - this is the ready to use
> form)
>
>
>> 2. Storage of the table (in-memory is enough, with reading from a file).
>>   - Questions:
>> 2.1 What's the life cycle of the in-memory table? Should we discard
>> the table after the query execution, or keep it in-memory for later
>> reuse with the same query or update, or use by a subsequent query?
>> When will the table be discarded?
>
>
> That'll need refining but a way to read and reuse.  There needs to be a way
> for the app to pass in tables (a Map<String, ???> and a tool for reading CSVs
> to get the ???) because ...

When will the tables be passed in? TARQL loads the CSVs when parsing
the SPARQL query string. Shall we load the tables and create the Map
before querying and cache them for reuse? This could be similar to
querying a Dataset, and the simplest way goes something like:

// The keys of dtm are the URIs of the DataTables loaded.
DataTableMap<String, DataTable> dtm = DataTableSetFactory.createDataTableMap();

// The table data are loaded when added into the map.
dtm.addDataTable("<ex:table_1>", "file:table_1.csv", true);
// Or the table data are *lazy* loaded during querying later on, i.e. not loaded now.
dtm.addDataTable("<ex:table_2>", "file:table_2.csv", false);

// A new .jj will be created for parsing TABLE and FROM TABLE clauses,
// but the QueryFactory interface remains the same as before.
Query query = QueryFactory.create(queryString) ;

// New create method for QueryExecutionFactory to accommodate dtm.
QueryExecution qExec = QueryExecutionFactory.create(query, model, dtm) ;

// ... dtm can be reused later on for other QueryExecutions, or be discarded when the app ends.

Is the above what you mean? Any comments?
>
>
>> 3. Modify the SPARQL grammar to support FROM TABLE and TABLE (for
>> inclusion inside a larger query, c.f. SPARQL VALUES clause).
>>   - Questions:
>> 3.1 What're the differences between FROM TABLE and TABLE?
>
>
> FROM TABLE would be one way to get tables into the query as would passing it
> in in the query context.
>
> Queries can't be assumed to
>
> TABLE in a query is accessing the table, using it to get the
>
> TARQL, and I've only read the documentation, is a query over a single CSV
> file.  This project should be about multiple CSVs and combining with other
> RDF data.
>
> A quick sketch and the syntax is not checked as sensible:
>
> SELECT ... {
>   # Fixed column names
>   TABLE <uri> {
>      BIND (URI(CONCAT('http://example.com/ns#', ?b)) AS ?uri)
>      BIND (STRLANG(?a, 'en') AS ?with_language_tag)
>      FILTER (?v > 57)
>   }
> }
>
> More ambitious to have column naming and FILTERs:
>
> SELECT ...
> WHERE {
>
>    TABLE <uri> { "col1" AS ?myVar1 ,
>                  "col10" AS ?V ,
>                  "col5" AS ?appName
>                  FILTER(?V > 57) }
> }
>
> creates a set of bindings based on access description.
>

Is the <uri> after TABLE the key of the Map<String, ???>? If so, I now
understand the TABLE clauses from the examples. However, I'm still not
sure about FROM TABLE. Could you please show me some query string
examples containing FROM TABLE clauses?

>
>
>> 3.2 Tarql programmatically modify the query (parsed from standard
>> SPARQLParser11) with CSV tabsle data without touching the orginal
>> SPARQL grammar parsing module. Should we adopt a different approach of
>> modifying the parsing grammar of .jj files and just ask javacc to
>> generate the new parsing code?
>
>
> I think the latter if possible.
>
> This, like all projects, will need to move to a detailed design but I don't
> hink it puts the project as a whole at risk.  The basis TARQL idea would be
> a great addition
>
>         Andy
>
>
>>
>> 4. Modify execution to include tables.
>> Questions: No questions for this now.
>>
>> Best regards,
>> Ying Jiang
>>
>> On Thu, Feb 27, 2014 at 10:49 PM, Andy Seaborne <an...@apache.org> wrote:
>>>
>>> On 26/02/14 15:14, Ying Jiang wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> With the great guidance from the mentors, especially Andy, I had a
>>>> good time in GSoC 2013 working on jena-spatial [1]. I'm very grateful.
>>>> Really learnt a lot from that project.
>>>>
>>>> This year, I find the issue of "Extend CONSTRUCT to build quads" [1]
>>>> very interesting. I've used javacc before. I can understand the ARQ
>>>> module of parsing SPARQL strings. With a label of "gsoc2014", is it a
>>>> suitable project for Jena in GSoC 2014? Any more details about the
>>>> project? Thanks!
>>>>
>>>> Best regards,
>>>> Ying Jiang
>>>>
>>>> [1] http://jena.apache.org/documentation/query/spatial-query.html
>>>> [2] https://issues.apache.org/jira/browse/JENA-491
>>>>
>>>
>>> Hi there,
>>>
>>> Given your level of skill and expertise, this project is possibly a bit
>>> small for you.  It's not the same scale as jena-spatial. It's probably
>>> more
>>> suited to an undergraduate or someone looking to learn about working
>>> inside
>>> a moderately large existing codebase. You have a lot more software
>>> engineering experience.
>>>
>>> Can I interest you in one of:
>>>
>>> * JENA-625 especially the part about CSV ingestion.  There is now a W3C
>>> working group looking at tabular data on the web so we know this is
>>> interesting to the user community.
>>>
>>> * JENA-647, (only just added) which is server side query templates for
>>> creating data views.
>>>
>>> In conjunction with someone (else) doing JENA-632 (custom JSON from
>>> SPARQL
>>> query), we would have a data delivery platform for creating domain
>>> specific
>>> data delivery for webapp.
>>>
>>> (this was provided in the proprietary Talis platform as "SPARQL Stored
>>> Procedures" but that no longer exists.  No need to exactly follow that
>>> but
>>> it was a popular feature so it is useful).
>>>
>>> * JENA-624 which is about a new memory-based storage layer.  As a
>>> project,
>>> its nearer in scale to jena-spatial.  This is less about RDF and linked
>>> data
>>> and more about systems programming.
>>>
>>>          Andy
>>>
>

Re: [GSoC 2014] Data Tables for SPARQL

Posted by Ying Jiang <jp...@gmail.com>.
Hi Andy,

For all the possibilities, are all of them supposed to be able to map
to a single interface, e.g. DataTable? It seems that either a CSV
table, or the result of a previous query, or the RDF tuple/result
syntax, can be transformed into a DataTable. If so, we can just code
towards the DataTable interface and provide suitable transformers for
the different possibilities. Can we?
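
Something like this is what I have in mind (just a rough first idea - all
names hypothetical):

import java.util.Iterator ;
import java.util.List ;

import com.hp.hpl.jena.sparql.core.Var ;
import com.hp.hpl.jena.sparql.engine.binding.Binding ;

public interface DataTable {
    List<Var> getVars() ;          // the "columns"
    Iterator<Binding> rows() ;     // one row = one Binding of RDF terms
}

// ... with one transformer per source, e.g.:
//   DataTable fromCSV(String url)
//   DataTable fromResultSet(ResultSet previousResults)   // a previous query
//   DataTable fromTupleFile(String url)                  // RDF tuple/result syntax (for testing)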

Any other data structures to be considered in this project? Or just tables?

Best regards,
Ying Jiang


On Tue, Mar 4, 2014 at 7:19 PM, Andy Seaborne <an...@apache.org> wrote:
> One extra observation:
>
> If the structure is
>
> + CSV -> RDF data table
> + RDF Data table and query execution
>
> and then execute the query with that data table (as well as everything else)
> then the RDF data table might come from CSV but it might come from other
> sources.
>
> Possibilities include:
> * A previous query - record the results of query and reuse in later queries
> (this is essentially a cache and also a way to avoid writing the same
> pattern over and over again).
>
> * A file in RDF tuple syntax or RDF result syntax (mainly for testing!)
>
> * a regular process that runs and pre-calculates certain important patterns
>
> This project does not have to cover all those possibilities - it should get
> the architecture right so that can all happen.
>
> Storing an RDF data table persistently (other than a text format), reusing
> TDB machinery would be nice but it is a very different part of the codebase
> to work with, so I'm suggesting the project doesn't try to include that this
> time.
>
>         Andy
>
>
>
>
> On 03/03/14 22:26, Andy Seaborne wrote:
>>
>> On 03/03/14 03:12, Ying Jiang wrote:
>>>
>>> Hi Andy,
>>
>>
>> Hi Ying,
>>
>>>
>>> Thanks for your suggestions! I'm more interested in JENA-625 (Data
>>> Tables for SPARQL). I've seen your new comments in JIRA and studied
>>> the source code of Tarql. I'd like to paste your comments here with my
>>> questions below to clarify the details of this project:
>>>
>>> 1. CSV to RDF terms (tuples of RDF Terms is already supported
>>> internally in Jena)
>>>   - Questions:
>>> 1.1 Tarql uses the first row of CSV as variable names. Should we use
>>> the same idea?
>>
>>
>> Seems like good start although care is needed because the column can be
>> anything and SPARQL variables are restricted.
>>
>> If there is no header row, and we can require that app should say so by
>> some mechanism, or if the app wants different names, then a way to
>> provide that, falling back to something predicable if dull: ?_col1,
>> ?_col2, ...
>>
>> See below - there's no need to have fixed variable names.
>>
>>> 1.2 As to "internal support of tuples of RDF terms in Jena", do you
>>> mean com.hp.hpl.jena.sparql.algebra.table.TableData? Tarql uses
>>> TableData to accommodate RDF term bindings from CSV.
>>
>>
>> That and there is also some RDF tuples code to read/write a textual form
>> as well:
>>
>>
>> https://svn.apache.org/repos/asf/jena/Experimental/rdfpatch/src/main/java/org/apache/jena/riot/tio/
>>
>>
>> (there are other versions of this code around - this is the ready to use
>> form)
>>
>>> 2. Storage of the table (in-memory is enough, with reading from a file).
>>>   - Questions:
>>> 2.1 What's the life cycle of the in-memory table? Should we discard
>>> the table after the query execution, or keep it in-memory for later
>>> reuse with the same query or update, or use by a subsequent query?
>>> When will the table be discarded?
>>
>>
>> That'll need refining but a way to read and reuse.  There needs to be
>> away for the app to pass in tables (a Map<Sting, ???> and a tool
>> forerading CSVs to get the ???) because ...
>>
>>> 3. Modify the SPARQL grammar to support FROM TABLE and TABLE (for
>>> inclusion inside a larger query, c.f. SPARQL VALUES clause).
>>>   - Questions:
>>> 3.1 What're the differences between FROM TABLE and TABLE?
>>
>>
>> FROM TABLE would be one way to get tables into the query as would
>> passing it in in the query context.
>>
>> Queries can't be assumed to
>>
>> TABLE in a query is accessing the table, using it to get the
>>
>> TARQL (and I've only read the documentation) is a query over a single
>> CSV file.  This project should be about multiple CSVs and combining with
>> other RDF data.
>>
>> A quick sketch (the syntax is not checked as sensible):
>>
>> SELECT ... {
>>    # Fixed column names
>>    TABLE <uri> {
>>       BIND (URI(CONCAT('http://example.com/ns#', ?b)) AS ?uri)
>>       BIND (STRLANG(?a, 'en') AS ?with_language_tag)
>>       FILTER (?v > 57)
>>    }
>> }
>>
>> More ambitious to have column naming and FILTERs:
>>
>> SELECT ...
>> WHERE {
>>
>>     TABLE <uri> { "col1" AS ?myVar1 ,
>>                   "col10" AS ?V ,
>>                   "col5" AS ?appName
>>                   FILTER(?V > 57) }
>> }
>>
>> creates a set of bindings based on access description.
>>
>>
>>> 3.2 Tarql programmatically modifies the query (parsed by the standard
>>> SPARQLParser11) with CSV table data, without touching the original
>>> SPARQL grammar parsing module. Should we adopt a different approach of
>>> modifying the parsing grammar of the .jj files and just ask javacc to
>>> generate the new parsing code?
>>
>>
>> I think the latter if possible.
>>
>> This, like all projects, will need to move to a detailed design but I
>> don't think it puts the project as a whole at risk.  The basic TARQL idea
>> would be a great addition.
>>
>>      Andy
>>
>>>
>>> 4. Modify execution to include tables.
>>> Questions: No questions for this now.
>>>
>>> Best regards,
>>> Ying Jiang
>>>
>>> On Thu, Feb 27, 2014 at 10:49 PM, Andy Seaborne <an...@apache.org> wrote:
>>>>
>>>> On 26/02/14 15:14, Ying Jiang wrote:
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> With the great guidance from the mentors, especially Andy, I had a
>>>>> good time in GSoC 2013 working on jena-spatial [1]. I'm very grateful.
>>>>> Really learnt a lot from that project.
>>>>>
>>>>> This year, I find the issue of "Extend CONSTRUCT to build quads" [2]
>>>>> very interesting. I've used javacc before. I can understand the ARQ
>>>>> module of parsing SPARQL strings. With a label of "gsoc2014", is it a
>>>>> suitable project for Jena in GSoC 2014? Any more details about the
>>>>> project? Thanks!
>>>>>
>>>>> Best regards,
>>>>> Ying Jiang
>>>>>
>>>>> [1] http://jena.apache.org/documentation/query/spatial-query.html
>>>>> [2] https://issues.apache.org/jira/browse/JENA-491
>>>>>
>>>>
>>>> Hi there,
>>>>
>>>> Given your level of skill and expertise, this project is possibly a bit
>>>> small for you.  It's not the same scale as jena-spatial. It's
>>>> probably more
>>>> suited to an undergraduate or someone looking to learn about working
>>>> inside
>>>> a moderately large existing codebase. You have a lot more software
>>>> engineering experience.
>>>>
>>>> Can I interest you in one of:
>>>>
>>>> * JENA-625 especially the part about CSV ingestion.  There is now a W3C
>>>> working group looking at tabular data on the web so we know this is
>>>> interesting to the user community.
>>>>
>>>> * JENA-647, (only just added) which is server side query templates for
>>>> creating data views.
>>>>
>>>> In conjunction with someone (else) doing JENA-632 (custom JSON from
>>>> SPARQL
>>>> query), we would have a data delivery platform for creating domain
>>>> specific
>>>> data delivery for webapp.
>>>>
>>>> (this was provided in the proprietary Talis platform as "SPARQL Stored
>>>> Procedures" but that no longer exists.  No need to exactly follow
>>>> that but
>>>> it was a popular feature so it is useful).
>>>>
>>>> * JENA-624 which is about a new memory-based storage layer.  As a
>>>> project,
>>>> it's nearer in scale to jena-spatial.  This is less about RDF and
>>>> linked data
>>>> and more about systems programming.
>>>>
>>>>          Andy
>>>>
>>
>

Re: [GSoC 2014] Data Tables for SPARQL

Posted by Andy Seaborne <an...@apache.org>.
One extra observation:

If the structure is

+ CSV -> RDF data table
+ RDF Data table and query execution

and the query is then executed with that data table (as well as everything 
else), then the RDF data table might come from CSV but it might also come 
from other sources.

Possibilities include:
* A previous query - record the results of a query and reuse them in later 
queries (this is essentially a cache and also a way to avoid writing the 
same pattern over and over again; a rough sketch follows below).

* A file in RDF tuple syntax or RDF result syntax (mainly for testing!)

* a regular process that runs and pre-calculates certain important patterns

This project does not have to cover all those possibilities - it should 
get the architecture right so that it can all happen.
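
To make the first possibility concrete, here is a rough sketch - purely
illustrative, not a design - of capturing the results of a SELECT query as an
in-memory TableData (the ARQ class mentioned elsewhere in this thread) so it
can be reused later; the class name CacheQueryAsTable and all the other
details are made up:

import java.util.ArrayList;
import java.util.List;

import com.hp.hpl.jena.query.Query;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QueryFactory;
import com.hp.hpl.jena.query.QuerySolution;
import com.hp.hpl.jena.query.ResultSet;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.sparql.algebra.table.TableData;
import com.hp.hpl.jena.sparql.core.Var;
import com.hp.hpl.jena.sparql.engine.binding.Binding;
import com.hp.hpl.jena.sparql.engine.binding.BindingFactory;
import com.hp.hpl.jena.sparql.engine.binding.BindingMap;

public class CacheQueryAsTable {
    /** Run a SELECT query and keep its results as an in-memory table. */
    public static TableData cache(Model model, String queryString) {
        Query query = QueryFactory.create(queryString);
        QueryExecution qExec = QueryExecutionFactory.create(query, model);
        try {
            ResultSet rs = qExec.execSelect();
            // The result variables become the table's variables.
            List<Var> vars = new ArrayList<>();
            for (String name : rs.getResultVars())
                vars.add(Var.alloc(name));
            // Copy each solution into a Binding row.
            List<Binding> rows = new ArrayList<>();
            while (rs.hasNext()) {
                QuerySolution soln = rs.next();
                BindingMap row = BindingFactory.create();
                for (Var v : vars) {
                    if (soln.contains(v.getVarName()))
                        row.add(v, soln.get(v.getVarName()).asNode());
                }
                rows.add(row);
            }
            return new TableData(vars, rows);  // reusable later, much like a VALUES block
        } finally {
            qExec.close();
        }
    }
}

Whether such a cached table then lives in the query context, in a registry,
or in a file is exactly the life-cycle question raised below.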

Storing an RDF data table persistently (other than a text format), 
reusing TDB machinery would be nice but it is a very different part of 
the codebase to work with, so I'm suggesting the project doesn't try to 
include that this time.

	Andy



On 03/03/14 22:26, Andy Seaborne wrote:
> On 03/03/14 03:12, Ying Jiang wrote:
>> Hi Andy,
>
> Hi Ying,
>
>>
>> Thanks for your suggestions! I'm more interested in JENA-625 (Data
>> Tables for SPARQL). I've seen your new comments in JIRA and studied
>> the source code of Tarql. I'd like to paste your comments here with my
>> questions below to clarify the details of this project:
>>
>> 1. CSV to RDF terms (tuples of RDF Terms is already supported
>> internally in Jena)
>>   - Questions:
>> 1.1 Tarql uses the first row of CSV as variable names. Should we use
>> the same idea?
>
> Seems like a good start, although care is needed because the column headings can
> be anything and SPARQL variable names are restricted.
>
> If there is no header row (and we can require that the app say so by
> some mechanism), or if the app wants different names, then there needs to be
> a way to provide that, falling back to something predictable if dull:
> ?_col1, ?_col2, ...
>
> See below - there's no need to have fixed variable names.
>
>> 1.2 As to "internal support of tuples of RDF terms in Jena", do you
>> mean com.hp.hpl.jena.sparql.algebra.table.TableData? Tarql uses
>> TableData to accommodate RDF term bindings from CSV.
>
> That and there is also some RDF tuples code to read/write a textual form
> as well:
>
> https://svn.apache.org/repos/asf/jena/Experimental/rdfpatch/src/main/java/org/apache/jena/riot/tio/
>
>
> (there are other versions of this code around - this is the ready to use
> form)
>
>> 2. Storage of the table (in-memory is enough, with reading from a file).
>>   - Questions:
>> 2.1 What's the life cycle of the in-memory table? Should we discard
>> the table after the query execution, or keep it in-memory for later
>> reuse with the same query or update, or use by a subsequent query?
>> When will the table be discarded?
>
> That'll need refining, but there should be a way to read and reuse.  There needs
> to be a way for the app to pass in tables (a Map<String, ???> and a tool
> for reading CSVs to get the ???) because ...
>
>> 3. Modify the SPARQL grammar to support FROM TABLE and TABLE (for
>> inclusion inside a larger query, c.f. SPARQL VALUES clause).
>>   - Questions:
>> 3.1 What're the differences between FROM TABLE and TABLE?
>
> FROM TABLE would be one way to get tables into the query, as would
> passing it in via the query context.
>
> Queries can't be assumed to
>
> TABLE in a query is accessing the table, using it to get the
>
> TARQL (and I've only read the documentation) is a query over a single
> CSV file.  This project should be about multiple CSVs and combining with
> other RDF data.
>
> A quick sketch (the syntax is not checked as sensible):
>
> SELECT ... {
>    # Fixed column names
>    TABLE <uri> {
>       BIND (URI(CONCAT('http://example.com/ns#', ?b)) AS ?uri)
>       BIND (STRLANG(?a, 'en') AS ?with_language_tag)
>       FILTER (?v > 57)
>    }
> }
>
> More ambitious to have column naming and FILTERs:
>
> SELECT ...
> WHERE {
>
>     TABLE <uri> { "col1" AS ?myVar1 ,
>                   "col10" AS ?V ,
>                   "col5" AS ?appName
>                   FILTER(?V > 57) }
> }
>
> creates a set of bindings based on access description.
>
>
>> 3.2 Tarql programmatically modifies the query (parsed by the standard
>> SPARQLParser11) with CSV table data, without touching the original
>> SPARQL grammar parsing module. Should we adopt a different approach of
>> modifying the parsing grammar of the .jj files and just ask javacc to
>> generate the new parsing code?
>
> I think the latter if possible.
>
> This, like all projects, will need to move to a detailed design but I
> don't think it puts the project as a whole at risk.  The basic TARQL idea
> would be a great addition.
>
>      Andy
>
>>
>> 4. Modify execution to include tables.
>> Questions: No questions for this now.
>>
>> Best regards,
>> Ying Jiang
>>
>> On Thu, Feb 27, 2014 at 10:49 PM, Andy Seaborne <an...@apache.org> wrote:
>>> On 26/02/14 15:14, Ying Jiang wrote:
>>>>
>>>> Hi,
>>>>
>>>> With the great guidance from the mentors, especially Andy, I had a
>>>> good time in GSoC 2013 working on jena-spatial [1]. I'm very grateful.
>>>> Really learnt a lot from that project.
>>>>
>>>> This year, I find the issue of "Extend CONSTRUCT to build quads" [2]
>>>> very interesting. I've used javacc before. I can understand the ARQ
>>>> module of parsing SPARQL strings. With a label of "gsoc2014", is it a
>>>> suitable project for Jena in GSoC 2014? Any more details about the
>>>> project? Thanks!
>>>>
>>>> Best regards,
>>>> Ying Jiang
>>>>
>>>> [1] http://jena.apache.org/documentation/query/spatial-query.html
>>>> [2] https://issues.apache.org/jira/browse/JENA-491
>>>>
>>>
>>> Hi there,
>>>
>>> Given your level of skill and expertise, this project is possibly a bit
>>> small for you.  It's not the same scale as jena-spatial. It's
>>> probably more
>>> suited to an undergraduate or someone looking to learn about working
>>> inside
>>> a moderately large existing codebase. You have a lot more software
>>> engineering experience.
>>>
>>> Can I interest you in one of:
>>>
>>> * JENA-625 especially the part about CSV ingestion.  There is now a W3C
>>> working group looking at tabular data on the web so we know this is
>>> interesting to the user community.
>>>
>>> * JENA-647, (only just added) which is server side query templates for
>>> creating data views.
>>>
>>> In conjunction with someone (else) doing JENA-632 (custom JSON from
>>> SPARQL
>>> query), we would have a data delivery platform for creating domain
>>> specific
>>> data delivery for webapp.
>>>
>>> (this was provided in the proprietary Talis platform as "SPARQL Stored
>>> Procedures" but that no longer exists.  No need to exactly follow
>>> that but
>>> it was a popular feature so it is useful).
>>>
>>> * JENA-624 which is about a new memory-based storage layer.  As a
>>> project,
>>> it's nearer in scale to jena-spatial.  This is less about RDF and
>>> linked data
>>> and more about systems programming.
>>>
>>>          Andy
>>>
>


Re: [GSoC 2014] Data Tables for SPARQL

Posted by Andy Seaborne <an...@apache.org>.
On 03/03/14 03:12, Ying Jiang wrote:
> Hi Andy,

Hi Ying,

>
> Thanks for your suggestions! I'm more interested in JENA-625 (Data
> Tables for SPARQL). I've seen your new comments in JIRA and studied
> the source code of Tarql. I'd like to paste your comments here with my
> questions below to clarify the details of this project:
>
> 1. CSV to RDF terms (tuples of RDF Terms is already supported
> internally in Jena)
>   - Questions:
> 1.1 Tarql uses the first row of CSV as variable names. Should we use
> the same idea?

Seems like a good start, although care is needed because the column headings can 
be anything and SPARQL variable names are restricted.

If there is no header row (and we can require that the app say so by 
some mechanism), or if the app wants different names, then there needs to be 
a way to provide that, falling back to something predictable if dull: ?_col1, 
?_col2, ...

See below - there's no need to have fixed variable names.

> 1.2 As to "internal support of tuples of RDF terms in Jena", do you
> mean com.hp.hpl.jena.sparql.algebra.table.TableData? Tarql uses
> TableData to accommodate RDF term bindings from CSV.

That and there is also some RDF tuples code to read/write a textual form 
as well:

https://svn.apache.org/repos/asf/jena/Experimental/rdfpatch/src/main/java/org/apache/jena/riot/tio/

(there are other versions of this code around - this is the ready to use 
form)
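
Purely as an illustration of what the CSV-to-table step might look like (none
of this is fixed), the sketch below reads a CSV into a TableData, taking
variable names from the header row and falling back to _col1, _col2, ... when
a heading is not usable as a SPARQL variable name; the CSV handling is
deliberately naive (no quoting, escaping or datatype handling), and it assumes
a header row is present:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import com.hp.hpl.jena.graph.NodeFactory;
import com.hp.hpl.jena.sparql.algebra.table.TableData;
import com.hp.hpl.jena.sparql.core.Var;
import com.hp.hpl.jena.sparql.engine.binding.Binding;
import com.hp.hpl.jena.sparql.engine.binding.BindingFactory;
import com.hp.hpl.jena.sparql.engine.binding.BindingMap;

public class CsvToTable {
    public static TableData read(String filename) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(filename))) {
            // Header row -> variable names, with a predictable fallback.
            String[] header = in.readLine().split(",");
            List<Var> vars = new ArrayList<>();
            for (int i = 0; i < header.length; i++) {
                String name = header[i].trim();
                if (!name.matches("[A-Za-z_][A-Za-z0-9_]*"))  // crude legality check
                    name = "_col" + (i + 1);
                vars.add(Var.alloc(name));
            }
            // Each data row -> one Binding; every cell becomes a plain literal.
            List<Binding> rows = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                String[] cells = line.split(",");
                BindingMap row = BindingFactory.create();
                for (int i = 0; i < vars.size() && i < cells.length; i++)
                    row.add(vars.get(i), NodeFactory.createLiteral(cells[i].trim()));
                rows.add(row);
            }
            return new TableData(vars, rows);
        }
    }
}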

> 2. Storage of the table (in-memory is enough, with reading from a file).
>   - Questions:
> 2.1 What's the life cycle of the in-memory table? Should we discard
> the table after the query execution, or keep it in-memory for later
> reuse with the same query or update, or use by a subsequent query?
> When will the table be discarded?

That'll need refining, but there should be a way to read and reuse.  There needs 
to be a way for the app to pass in tables (a Map<String, ???> and a tool 
for reading CSVs to get the ???) because ...

> 3. Modify the SPARQL grammar to support FROM TABLE and TABLE (for
> inclusion inside a larger query, c.f. SPARQL VALUES clause).
>   - Questions:
> 3.1 What're the differences between FROM TABLE and TABLE?

FROM TABLE would be one way to get tables into the query, as would 
passing it in via the query context.

Queries can't be assumed to

TABLE in a query is accessing the table, using it to get the

TARQL (and I've only read the documentation) is a query over a single 
CSV file.  This project should be about multiple CSVs and combining with 
other RDF data.
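
One possible shape for handing several named tables to an execution - again
purely illustrative, the context symbol and the registry class below are made
up - is a small registry keyed by URI and attached through the query context,
which the TABLE / FROM TABLE machinery could then consult:

import java.util.HashMap;
import java.util.Map;

import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.sparql.algebra.Table;
import com.hp.hpl.jena.sparql.util.Symbol;

public class TableRegistry {
    // Hypothetical context key - not an existing ARQ symbol.
    public static final Symbol symTableRegistry =
        Symbol.create("http://jena.apache.org/ARQ#tableRegistry");

    private final Map<String, Table> tables = new HashMap<>();

    public void put(String uri, Table table) { tables.put(uri, table); }
    public Table get(String uri)             { return tables.get(uri); }

    // Attach this registry to a particular query execution.
    public void attach(QueryExecution qExec) {
        qExec.getContext().set(symTableRegistry, this);
    }
}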

A quick sketch (the syntax is not checked as sensible):

SELECT ... {
   # Fixed column names
   TABLE <uri> {
      BIND (URI(CONCAT('http://example.com/ns#', ?b)) AS ?uri)
      BIND (STRLANG(?a, 'en') AS ?with_language_tag)
      FILTER (?v > 57)
   }
}

More ambitious to have column naming and FILTERs:

SELECT ...
WHERE {

    TABLE <uri> { "col1" AS ?myVar1 ,
                  "col10" AS ?V ,
                  "col5" AS ?appName
                  FILTER(?V > 57) }
}

creates a set of bindings based on access description.
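
For example (again only an illustration), if the table behind <uri> had,
among others, the columns col1, col5 and col10:

    col1,col5,col10
    alpha,fuseki,99
    beta,jena,12

and col10 were read or cast as xsd:integer, the second form would produce one
set of bindings per row passing the FILTER - here just ?myVar1 = "alpha",
?appName = "fuseki", ?V = 99 - while the "beta" row is dropped because
12 > 57 is false.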


> 3.2 Tarql programmatically modifies the query (parsed by the standard
> SPARQLParser11) with CSV table data, without touching the original
> SPARQL grammar parsing module. Should we adopt a different approach of
> modifying the parsing grammar of the .jj files and just ask javacc to
> generate the new parsing code?

I think the latter if possible.

This, like all projects, will need to move to a detailed design but I 
don't think it puts the project as a whole at risk.  The basic TARQL idea 
would be a great addition.

	Andy

>
> 4. Modify execution to include tables.
> Questions: No questions for this now.
>
> Best regards,
> Ying Jiang
>
> On Thu, Feb 27, 2014 at 10:49 PM, Andy Seaborne <an...@apache.org> wrote:
>> On 26/02/14 15:14, Ying Jiang wrote:
>>>
>>> Hi,
>>>
>>> With the great guidance from the mentors, especially Andy, I had a
>>> good time in GSoC 2013 working on jena-spatial [1]. I'm very grateful.
>>> Really learnt a lot from that project.
>>>
>>> This year, I find the issue of "Extend CONSTRUCT to build quads" [2]
>>> very interesting. I've used javacc before. I can understand the ARQ
>>> module of parsing SPARQL strings. With a label of "gsoc2014", is it a
>>> suitable project for Jena in GSoC 2014? Any more details about the
>>> project? Thanks!
>>>
>>> Best regards,
>>> Ying Jiang
>>>
>>> [1] http://jena.apache.org/documentation/query/spatial-query.html
>>> [2] https://issues.apache.org/jira/browse/JENA-491
>>>
>>
>> Hi there,
>>
>> Given your level of skill and expertise, this project is possibly a bit
>> small for you.  It's not the same scale as jena-spatial. It's probably more
>> suited to an undergraduate or someone looking to learn about working inside
>> a moderately large existing codebase. You have a lot more software
>> engineering experience.
>>
>> Can I interest you in one of:
>>
>> * JENA-625 especially the part about CSV ingestion.  There is now a W3C
>> working group looking at tabular data on the web so we know this is
>> interesting to the user community.
>>
>> * JENA-647, (only just added) which is server side query templates for
>> creating data views.
>>
>> In conjunction with someone (else) doing JENA-632 (custom JSON from SPARQL
>> query), we would have a data delivery platform for creating domain specific
>> data delivery for webapp.
>>
>> (this was provided in the proprietary Talis platform as "SPARQL Stored
>> Procedures" but that no longer exists.  No need to exactly follow that but
>> it was a popular feature so it is useful).
>>
>> * JENA-624 which is about a new memory-based storage layer.  As a project,
>> it's nearer in scale to jena-spatial.  This is less about RDF and linked data
>> and more about systems programming.
>>
>>          Andy
>>