Posted to dev@jena.apache.org by Stephen Allen <sa...@apache.org> on 2012/09/22 16:41:05 UTC

SPARQL Query Parsing Questions

I am working on JENA-330 (converting the Update parser to streaming)
and I had a couple of questions:

1) What version of cpp do you use to generate arq.jj and sparql_11.jj?
 My version inserts a bunch of extra newline characters.   cpp (GCC)
3.4.4 (cygming special, gdc 0.12, using dmd 0.125)

2) How important is the TripleCollector "mark" functionality?  It
appears to be in use in the Collection and PropertyList parsing stages
to ensure that statements are added to the QuadAcc in the same order
that they appear in the query.  However, RDF is unordered, so it
doesn't seem strictly necessary.  In a streaming situation, its
presence complicates things.  Can I simply eliminate this
functionality?  Or is it important for some reason I can't see?

Thanks!

-Stephen

Re: SPARQL Query Parsing Questions

Posted by Andy Seaborne <an...@apache.org>.

On 22/09/12 15:41, Stephen Allen wrote:
> I am working on JENA-330 (converting the Update parser to streaming)
> and I had a couple of questions:
>
> 1) What version of cpp do you use to generate arq.jj and sparql_11.jj?
>   My version inserts a bunch of extra newline characters.   cpp (GCC)
> 3.4.4 (cygming special, gdc 0.12, using dmd 0.125)

cpp (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3

but I used to use cpp under cygwin.

The cygwin output might need feeding through dos2unix.

> 2) How important is the TripleCollector "mark" functionality?  It
> appears to be in use in the Collection and PropertyList parsing stages
> to ensure that statements are added to the QuadAcc in the same order
> that they appear in the query.  However, RDF is unordered, so it
> doesn't seem strictly necessary.  In a streaming situation, its
> presence complicates things.  Can I simply eliminate this
> functionality?  Or is it important for some reason I can't see?

The mark is for RDF lists and nested structures.

  :s :p [ :q :r ] .
==>
  :s :p _:b0 .
  _:b0 :q :r .

  :s :p (1 2) .
==>
  :s :p _:b0 .
  _:b0 rdf:first 1 .
  _:b0 rdf:rest _:b1 .
  _:b1 rdf:first 2 .
  _:b1 rdf:rest rdf:nil .

It keeps the generated triples in the order in which they are
encountered in the AST.  A list element refers to the next element, so you
can't generate its rdf:rest until you know what to refer to.  The aim is
to keep each cell's rdf:first and rdf:rest together (for appearance's
sake, such as when printing the query or update).
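
To illustrate the idea of a mark, here is a minimal sketch in Java.  It is
not the ARQ code: the class MarkingCollector and its methods are
hypothetical, and triples are plain strings to keep it short.  The point
is that the outer triple can only be built after the nested structure has
been parsed, so it is inserted back at a remembered position, which
requires the accumulator to buffer everything after the mark.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical accumulator, for illustration only (not Jena's API).
final class MarkingCollector {
    private final List<String> triples = new ArrayList<>();

    // Remember the current position in the output.
    int mark() { return triples.size(); }

    // Append a triple at the end.
    void add(String triple) { triples.add(triple); }

    // Insert a triple at a previously remembered position.
    void insertAt(int mark, String triple) { triples.add(mark, triple); }

    List<String> triples() { return triples; }

    public static void main(String[] args) {
        // Parsing  :s :p [ :q :r ] .
        MarkingCollector acc = new MarkingCollector();
        int mark = acc.mark();               // taken before parsing the object
        acc.add("_:b0 :q :r .");             // emitted while parsing [ :q :r ]
        acc.insertAt(mark, ":s :p _:b0 ."); // the object node is known only now
        acc.triples().forEach(System.out::println);
    }
}
```

Running main prints the outer triple first, followed by the nested one,
matching the first expansion shown earlier.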

It's probably not necessary to do it with a mark.  It might be possible
to do it as a sliding window of two elements; I have done this on an
experimental data structure project so as to operate in the forward
direction.  Working forwards is a tail recursion and can be turned into
a loop.  Working on the way back out isn't streaming (it needs stack
depth).
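
A rough sketch of the forward, two-element-window idea for a flat list, in
Java.  This is an assumption about how it could look, not the experimental
code mentioned above: ForwardListEmitter, freshBlankNode and emitList are
made-up names, and triples are written as strings to a callback.  Only the
current cell and the next cell are held at any time, and the one-item
lookahead (hasNext) decides whether rdf:rest points at a new cell or at
rdf:nil.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical forward-only list emitter, for illustration only.
final class ForwardListEmitter {
    private int nextLabel = 0;

    private String freshBlankNode() { return "_:b" + (nextLabel++); }

    // Emits "subject predicate (items...)" as triples, in document order,
    // without buffering more than the current and next list cell.
    void emitList(String subject, String predicate,
                  Iterator<String> items, Consumer<String> sink) {
        if (!items.hasNext()) {
            // The empty list is just rdf:nil.
            sink.accept(subject + " " + predicate + " rdf:nil .");
            return;
        }
        String cell = freshBlankNode();
        sink.accept(subject + " " + predicate + " " + cell + " .");
        while (true) {
            String item = items.next();
            sink.accept(cell + " rdf:first " + item + " .");
            if (!items.hasNext()) {
                sink.accept(cell + " rdf:rest rdf:nil .");
                return;
            }
            // Allocate the next cell up front so rdf:rest can refer to it.
            String next = freshBlankNode();
            sink.accept(cell + " rdf:rest " + next + " .");
            cell = next;
        }
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("1", "2");
        new ForwardListEmitter()
            .emitList(":s", ":p", items.iterator(), System.out::println);
    }
}
```

For the list (1 2) this emits the same five triples, in the same order, as
the expansion shown earlier.  Nested lists are not handled here; an item
is assumed to be a simple term.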

It gets messy with nested structures:

   :s :p (1 ("a" "b") 2 )

keeping the rdf:firsts in order of 1, "a", "b", 2 is nice albeit not 
necessary.

One approach is that it's streaming except for compound structures.  You 
have to ask how you get a compound structure in the first place.

I think the important cases are

:s :p (
   "item 1"
   "item 2"
) .

:s :p [
   :q 1 ;
   :q 2 ;
] .

where it's easy to generate a huge item worth streaming.  If these can 
be handled, but the more complex ones don't stream, it's still a big win 
IMO.

	Andy

>
> Thanks!
>
> -Stephen
>