You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@jena.apache.org by Élie Roux <el...@telecom-bretagne.eu> on 2019/02/01 12:52:39 UTC

bottom-up semantics

Dear Jean users,

In short, I'm wondering if there could be an option somewhere for a
top-down SPARQL evaluation mechanism.

Long version: the dataset I'm dealing with contains data in the following form:

ex:Loc1 a :Location ;
        :locatedInWork ex:Work1 ;
        :startPage 123 ;
        :endPage   234 ;
        :startVolume 1 .

ex:Loc2 a :Location ;
        :locatedInWork ex:Work1 ;
        :startPage 234 ;
        :endPage   345 ;
        :startVolume 1 ;
        :endVolume   2 .

where the absence of :endVolume denotes that the endVolume is equal to
the startVolume. This might not be kosher in terms of semantics but
that's the dataset I'm dealing with.

Now, I want to select all the locations in volume 2 (including those
starting before volume 2 and ending after volume 2), the most natural
for me is to write something like:

  ?loc :locatedInWork ex:Work1 ;
       :startVolume ?startvol .
  OPTIONAL { ?loc  :endVolume   ?endvol . }
  FILTER ((BOUND(?endvol) && ?startvol <= 2 && ?endvol >= 2) ||
(!BOUND(?endvol) && ?startvol = 2))

which works fine, but is slow to the extreme (about 8s) due to the
very large amount of triples with the :endVolume property. Now, I
understand the slow performance is sort of expected due what's
referred to as the bottom-up semantics of SPARQL. My understanding is
that the first thing that will get evaluated will be ?loc :endVolume
?endvol which will return a huge amount of results.

Here are a few questions:

- Is my analysis correct?

- In your experience of writing queries, how often do you rely on the
bottom-up semantics? (my experience is never)

- The bottom-up semantics are very counter-intuititve to me, what do
you think is the reason it got into the SPARQL specs?

- I suppose digging into the Jena code to optimize this kind of
requests in Jena must be very deep dive, am I right?

- Is there any plan or dedicated resources to optimize this kind of requests?

- What would be the complexity of writing an alternate query
evaluation mechanism using top-down semantics?

- Would having an option to evaluate a sparql query using top-down
semantics make sense? (we can have discussions of where the option
would be handled, but I think it's helpful for me to get a general
answer)

- Blazegraph advertises that they are first evaluating if the results
of a query would be the same when using a top-down and bottom-up
semantics, and if they are the same they automatically switch to the
top-down semantics, how much time do you estimate one would have to
dive into the Jena code to propose a pull request for that?

Best,
-- 
Elie

Re: bottom-up semantics

Posted by Martynas Jusevičius <ma...@atomgraph.com>.

I'd suggest to start by checking the algebra of your query: http://sparql.org

On Fri, Feb 1, 2019 at 1:53 PM Élie Roux <el...@telecom-bretagne.eu> wrote:
>
> Dear Jean users,
>
> In short, I'm wondering if there could be an option somewhere for a
> top-down SPARQL evaluation mechanism.
>
> Long version: the dataset I'm dealing with contains data in the following form:
>
> ex:Loc1 a :Location ;
>         :locatedInWork ex:Work1 ;
>         :startPage 123 ;
>         :endPage   234 ;
>         :startVolume 1 .
>
> ex:Loc2 a :Location ;
>         :locatedInWork ex:Work1 ;
>         :startPage 234 ;
>         :endPage   345 ;
>         :startVolume 1 ;
>         :endVolume   2 .
>
> where the absence of :endVolume denotes that the endVolume is equal to
> the startVolume. This might not be kosher in terms of semantics but
> that's the dataset I'm dealing with.
>
> Now, I want to select all the locations in volume 2 (including those
> starting before volume 2 and ending after volume 2), the most natural
> for me is to write something like:
>
>   ?loc :locatedInWork ex:Work1 ;
>        :startVolume ?startvol .
>   OPTIONAL { ?loc  :endVolume   ?endvol . }
>   FILTER ((BOUND(?endvol) && ?startvol <= 2 && ?endvol >= 2) ||
> (!BOUND(?endvol) && ?startvol = 2))
>
> which works fine, but is slow to the extreme (about 8s) due to the
> very large amount of triples with the :endVolume property. Now, I
> understand the slow performance is sort of expected due what's
> referred to as the bottom-up semantics of SPARQL. My understanding is
> that the first thing that will get evaluated will be ?loc :endVolume
> ?endvol which will return a huge amount of results.
>
> Here are a few questions:
>
> - Is my analysis correct?
>
> - In your experience of writing queries, how often do you rely on the
> bottom-up semantics? (my experience is never)
>
> - The bottom-up semantics are very counter-intuititve to me, what do
> you think is the reason it got into the SPARQL specs?
>
> - I suppose digging into the Jena code to optimize this kind of
> requests in Jena must be very deep dive, am I right?
>
> - Is there any plan or dedicated resources to optimize this kind of requests?
>
> - What would be the complexity of writing an alternate query
> evaluation mechanism using top-down semantics?
>
> - Would having an option to evaluate a sparql query using top-down
> semantics make sense? (we can have discussions of where the option
> would be handled, but I think it's helpful for me to get a general
> answer)
>
> - Blazegraph advertises that they are first evaluating if the results
> of a query would be the same when using a top-down and bottom-up
> semantics, and if they are the same they automatically switch to the
> top-down semantics, how much time do you estimate one would have to
> dive into the Jena code to propose a pull request for that?
>
> Best,
> --
> Elie

Re: bottom-up semantics

Posted by David Jordan <jd...@gmail.com>.

I have an unusual request for this group, but I am trying to remember the
name of a particular application development platform that I believe was
based on the semantic web. I believe this is a commercial product, books
were published on it, but I simply cannot remember the name. They presented
at several of the semantic web conferences that I attended.

I live in a large community that has a very large clubhouse with lots of
activities scheduled. There are about 2000 people that participate in
multiple groups (clubs). The groups need to schedule room facilities, etc.
In a sense they need a social networking site that deals with people,
events, facilities, etc. There was this commercial application that
provided such capabilities, it may have just been a development platform,
and I believe it was based on RDF/OWL.

I have tried Google, but I have not found it yet. I am sure someone on here
is familiar with it. If I hear the name, I'll recognize it. Apologies that
this is not specific to Jena, which I have used and liked. We just don't
have the bandwidth to develop the needed software from scratch or I would
develop it myself with Jena or some other similar tool.

On Fri, Feb 1, 2019 at 7:53 AM Élie Roux <el...@telecom-bretagne.eu>
wrote:

> Dear Jean users,
>
> In short, I'm wondering if there could be an option somewhere for a
> top-down SPARQL evaluation mechanism.
>
> Long version: the dataset I'm dealing with contains data in the following
> form:
>
> ex:Loc1 a :Location ;
>         :locatedInWork ex:Work1 ;
>         :startPage 123 ;
>         :endPage   234 ;
>         :startVolume 1 .
>
> ex:Loc2 a :Location ;
>         :locatedInWork ex:Work1 ;
>         :startPage 234 ;
>         :endPage   345 ;
>         :startVolume 1 ;
>         :endVolume   2 .
>
> where the absence of :endVolume denotes that the endVolume is equal to
> the startVolume. This might not be kosher in terms of semantics but
> that's the dataset I'm dealing with.
>
> Now, I want to select all the locations in volume 2 (including those
> starting before volume 2 and ending after volume 2), the most natural
> for me is to write something like:
>
>   ?loc :locatedInWork ex:Work1 ;
>        :startVolume ?startvol .
>   OPTIONAL { ?loc  :endVolume   ?endvol . }
>   FILTER ((BOUND(?endvol) && ?startvol <= 2 && ?endvol >= 2) ||
> (!BOUND(?endvol) && ?startvol = 2))
>
> which works fine, but is slow to the extreme (about 8s) due to the
> very large amount of triples with the :endVolume property. Now, I
> understand the slow performance is sort of expected due what's
> referred to as the bottom-up semantics of SPARQL. My understanding is
> that the first thing that will get evaluated will be ?loc :endVolume
> ?endvol which will return a huge amount of results.
>
> Here are a few questions:
>
> - Is my analysis correct?
>
> - In your experience of writing queries, how often do you rely on the
> bottom-up semantics? (my experience is never)
>
> - The bottom-up semantics are very counter-intuititve to me, what do
> you think is the reason it got into the SPARQL specs?
>
> - I suppose digging into the Jena code to optimize this kind of
> requests in Jena must be very deep dive, am I right?
>
> - Is there any plan or dedicated resources to optimize this kind of
> requests?
>
> - What would be the complexity of writing an alternate query
> evaluation mechanism using top-down semantics?
>
> - Would having an option to evaluate a sparql query using top-down
> semantics make sense? (we can have discussions of where the option
> would be handled, but I think it's helpful for me to get a general
> answer)
>
> - Blazegraph advertises that they are first evaluating if the results
> of a query would be the same when using a top-down and bottom-up
> semantics, and if they are the same they automatically switch to the
> top-down semantics, how much time do you estimate one would have to
> dive into the Jena code to propose a pull request for that?
>
> Best,
> --
> Elie
>

Re: bottom-up semantics

Posted by "Lorenz B." <bu...@informatik.uni-leipzig.de>.

>   ?loc :workLocationVolume ?bvol .?loc :locatedInWork ex:Work1 ;
>        :startVolume ?startvol .
>   FILTER ((?bvol = ?volnum && NOT EXISTS {?loc :workLocationEndVolume
> ?evol}) || (?bvol <= ?volnum  && EXISTS {?loc :workLocationEndVolume
> ?evol FILTER (?evol <= ?volnum)}))
>
> In terms of logic is should be equivalent to the previous query,
> should there be a performance difference? My experiments show that
> this version is quite consistently twice as fast as the OPTIONAL
> version.
>
At least, this query avoids the left-join, so yes, there's a good chance
being executed faster.

Re: bottom-up semantics

Posted by Élie Roux <el...@telecom-bretagne.eu>.

Hello,

Thanks for your answer

>         (conditional
>           (bgp
>             (triple ?loc :locatedInWork ex:Work1)
>             (triple ?loc :startVolume ?startvol)
>           )
>           (bgp (triple ?loc :endVolume ?endvol)))))))

Am I right in understanding that in that case ?loc is bound in the
second part (the part with :endVolume)? If so then the slow
performance must come from somewhere else, I'll investigate further.

I found another way of writing the query which is much more complex
but a little bit more satisfying in the sense that it makes the
binding of ?loc very clear:

  ?loc :workLocationVolume ?bvol .?loc :locatedInWork ex:Work1 ;
       :startVolume ?startvol .
  FILTER ((?bvol = ?volnum && NOT EXISTS {?loc :workLocationEndVolume
?evol}) || (?bvol <= ?volnum  && EXISTS {?loc :workLocationEndVolume
?evol FILTER (?evol <= ?volnum)}))

In terms of logic is should be equivalent to the previous query,
should there be a performance difference? My experiments show that
this version is quite consistently twice as fast as the OPTIONAL
version.

Best,
-- 
Elie

Re: bottom-up semantics

Posted by "Lorenz B." <bu...@informatik.uni-leipzig.de>.

Hello

the query algebra has the following structure


|(project (?book ?title)||
||      (filter (|| (&& (&& (bound ?endvol) (<= ?startvol 2)) (>=
?endvol 2)) (&& (! (bound ?endvol)) (= ?startvol 2)))||
||        (leftjoin||
||          (bgp||
||            (triple ?loc :locatedInWork ex:Work1)||
||            (triple ?loc :startVolume ?startvol)||
||          )||
||          (bgp (triple ?loc :endVolume ?endvol)))))))|

(optimized)

|(project (?book ?title)||
||      (filter (|| (&& (&& (bound ?endvol) (<= ?startvol 2)) (>=
?endvol 2)) (&& (! (bound ?endvol)) (= ?startvol 2)))||
||        (conditional||
||          (bgp||
||            (triple ?loc :locatedInWork ex:Work1)||
||            (triple ?loc :startVolume ?startvol)||
||          )||
||          (bgp (triple ?loc :endVolume ?endvol)))))))||
|

You can see, an OPTIONAL is basically a left outer join.

If you're using TDB some statistics on the data could be taken into
account by an optimizer. You can check this by followoing the steps here [1]


[1] https://jena.apache.org/documentation/tdb/optimizer.html



> Dear Jean users,
>
> In short, I'm wondering if there could be an option somewhere for a
> top-down SPARQL evaluation mechanism.
>
> Long version: the dataset I'm dealing with contains data in the following form:
>
> ex:Loc1 a :Location ;
>         :locatedInWork ex:Work1 ;
>         :startPage 123 ;
>         :endPage   234 ;
>         :startVolume 1 .
>
> ex:Loc2 a :Location ;
>         :locatedInWork ex:Work1 ;
>         :startPage 234 ;
>         :endPage   345 ;
>         :startVolume 1 ;
>         :endVolume   2 .
>
> where the absence of :endVolume denotes that the endVolume is equal to
> the startVolume. This might not be kosher in terms of semantics but
> that's the dataset I'm dealing with.
>
> Now, I want to select all the locations in volume 2 (including those
> starting before volume 2 and ending after volume 2), the most natural
> for me is to write something like:
>
>   ?loc :locatedInWork ex:Work1 ;
>        :startVolume ?startvol .
>   OPTIONAL { ?loc  :endVolume   ?endvol . }
>   FILTER ((BOUND(?endvol) && ?startvol <= 2 && ?endvol >= 2) ||
> (!BOUND(?endvol) && ?startvol = 2))
>
> which works fine, but is slow to the extreme (about 8s) due to the
> very large amount of triples with the :endVolume property. Now, I
> understand the slow performance is sort of expected due what's
> referred to as the bottom-up semantics of SPARQL. My understanding is
> that the first thing that will get evaluated will be ?loc :endVolume
> ?endvol which will return a huge amount of results.
>
> Here are a few questions:
>
> - Is my analysis correct?
>
> - In your experience of writing queries, how often do you rely on the
> bottom-up semantics? (my experience is never)
>
> - The bottom-up semantics are very counter-intuititve to me, what do
> you think is the reason it got into the SPARQL specs?
>
> - I suppose digging into the Jena code to optimize this kind of
> requests in Jena must be very deep dive, am I right?
>
> - Is there any plan or dedicated resources to optimize this kind of requests?
>
> - What would be the complexity of writing an alternate query
> evaluation mechanism using top-down semantics?
>
> - Would having an option to evaluate a sparql query using top-down
> semantics make sense? (we can have discussions of where the option
> would be handled, but I think it's helpful for me to get a general
> answer)
>
> - Blazegraph advertises that they are first evaluating if the results
> of a query would be the same when using a top-down and bottom-up
> semantics, and if they are the same they automatically switch to the
> top-down semantics, how much time do you estimate one would have to
> dive into the Jena code to propose a pull request for that?
>
> Best,

-- 
Lorenz Bühmann
AKSW group, University of Leipzig
Group: http://aksw.org - semantic web research center