Posted to users@jena.apache.org by Élie Roux <el...@telecom-bretagne.eu> on 2019/02/01 12:52:39 UTC
bottom-up semantics
Dear Jena users,
In short, I'm wondering if there could be an option somewhere for a
top-down SPARQL evaluation mechanism.
Long version: the dataset I'm dealing with contains data in the following form:
ex:Loc1 a :Location ;
    :locatedInWork ex:Work1 ;
    :startPage 123 ;
    :endPage 234 ;
    :startVolume 1 .

ex:Loc2 a :Location ;
    :locatedInWork ex:Work1 ;
    :startPage 234 ;
    :endPage 345 ;
    :startVolume 1 ;
    :endVolume 2 .
where the absence of :endVolume denotes that the endVolume is equal to
the startVolume. This might not be kosher in terms of semantics but
that's the dataset I'm dealing with.
Now, I want to select all the locations in volume 2 (including those
starting before volume 2 and ending after volume 2). The most natural
way for me is to write something like:
?loc :locatedInWork ex:Work1 ;
     :startVolume ?startvol .
OPTIONAL { ?loc :endVolume ?endvol . }
FILTER ((BOUND(?endvol) && ?startvol <= 2 && ?endvol >= 2) ||
        (!BOUND(?endvol) && ?startvol = 2))
which works fine, but is extremely slow (about 8s) due to the
very large number of triples with the :endVolume property. Now, I
understand the slow performance is sort of expected due to what's
referred to as the bottom-up semantics of SPARQL. My understanding is
that the first thing that will get evaluated will be ?loc :endVolume
?endvol which will return a huge number of results.
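To make the concern concrete, here is a minimal Python sketch (not Jena code) of two ways an engine could evaluate the OPTIONAL part. The toy data, the function names and the `touched` counter are all assumptions made for illustration; which strategy a real engine actually picks depends on its optimizer.

```python
# Toy triple store: a list of (subject, predicate, object) tuples.
# All data and names below are hypothetical, for illustration only.

def make_data(n_other=1000):
    triples = [
        ("ex:Loc1", ":locatedInWork", "ex:Work1"),
        ("ex:Loc1", ":startVolume", 1),
        ("ex:Loc2", ":locatedInWork", "ex:Work1"),
        ("ex:Loc2", ":startVolume", 1),
        ("ex:Loc2", ":endVolume", 2),
    ]
    # Many locations in another work, each with an :endVolume triple.
    for i in range(n_other):
        loc = f"ex:Other{i}"
        triples += [
            (loc, ":locatedInWork", "ex:Work2"),
            (loc, ":startVolume", 1),
            (loc, ":endVolume", 3),
        ]
    return triples

def left_side(triples):
    """Rows for: ?loc :locatedInWork ex:Work1 ; :startVolume ?startvol ."""
    in_work = {s for s, p, o in triples
               if p == ":locatedInWork" and o == "ex:Work1"}
    return [{"loc": s, "startvol": o} for s, p, o in triples
            if p == ":startVolume" and s in in_work]

def materialize_then_join(triples):
    """Evaluate '?loc :endVolume ?endvol' on its own, then left-join."""
    right = [(s, o) for s, p, o in triples if p == ":endVolume"]
    touched = len(right)  # every :endVolume triple is read
    out = []
    for row in left_side(triples):
        ends = [o for s, o in right if s == row["loc"]]
        out += [{**row, "endvol": e} for e in ends] or [row]
    return out, touched

def substitute_then_lookup(triples):
    """Substitute each ?loc from the left side, then look up :endVolume."""
    index = {}  # stands in for the store's subject/predicate index
    for s, p, o in triples:
        if p == ":endVolume":
            index.setdefault(s, []).append(o)
    out, touched = [], 0
    for row in left_side(triples):
        ends = index.get(row["loc"], [])
        touched += len(ends)  # only matching :endVolume triples are read
        out += [{**row, "endvol": e} for e in ends] or [row]
    return out, touched
```

On this toy data both strategies return the same two rows; only the number of :endVolume triples read differs.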
Here are a few questions:
- Is my analysis correct?
- In your experience of writing queries, how often do you rely on the
bottom-up semantics? (my experience is never)
- The bottom-up semantics are very counter-intuitive to me. What do
you think is the reason they got into the SPARQL specs?
- I suppose digging into the Jena code to optimize this kind of
request must be a very deep dive, am I right?
- Is there any plan or dedicated resources to optimize this kind of request?
- What would be the complexity of writing an alternate query
evaluation mechanism using top-down semantics?
- Would having an option to evaluate a SPARQL query using top-down
semantics make sense? (We can discuss where the option would be
handled, but I think it's helpful for me to get a general answer.)
- Blazegraph advertises that it first checks whether the results of a
query would be the same under top-down and bottom-up semantics, and if
so automatically switches to top-down evaluation. How much time do you
estimate one would need to dive into the Jena code to propose a pull
request for that?
Best,
--
Elie
Re: bottom-up semantics
Posted by Martynas Jusevičius <ma...@atomgraph.com>.
I'd suggest starting by checking the algebra of your query: http://sparql.org
Re: bottom-up semantics
Posted by David Jordan <jd...@gmail.com>.
I have an unusual request for this group, but I am trying to remember the
name of a particular application development platform that I believe was
based on the semantic web. I believe this is a commercial product, books
were published on it, but I simply cannot remember the name. They presented
at several of the semantic web conferences that I attended.
I live in a large community that has a very large clubhouse with lots of
activities scheduled. There are about 2000 people that participate in
multiple groups (clubs). The groups need to schedule room facilities, etc.
In a sense they need a social networking site that deals with people,
events, facilities, etc. There was this commercial application that
provided such capabilities, it may have just been a development platform,
and I believe it was based on RDF/OWL.
I have tried Google, but I have not found it yet. I am sure someone on here
is familiar with it. If I hear the name, I'll recognize it. Apologies that
this is not specific to Jena, which I have used and liked. We just don't
have the bandwidth to develop the needed software from scratch or I would
develop it myself with Jena or some other similar tool.
Re: bottom-up semantics
Posted by "Lorenz B." <bu...@informatik.uni-leipzig.de>.
> ?loc :workLocationVolume ?bvol .
> ?loc :locatedInWork ex:Work1 ;
>      :startVolume ?startvol .
> FILTER ((?bvol = ?volnum && NOT EXISTS {?loc :workLocationEndVolume ?evol}) ||
>         (?bvol <= ?volnum && EXISTS {?loc :workLocationEndVolume ?evol
>                                      FILTER (?evol <= ?volnum)}))
>
> In terms of logic it should be equivalent to the previous query.
> Should there be a performance difference? My experiments show that
> this version is quite consistently twice as fast as the OPTIONAL
> version.
>
At least this query avoids the left join, so yes, there's a good chance
of it being executed faster.
Re: bottom-up semantics
Posted by Élie Roux <el...@telecom-bretagne.eu>.
Hello,
Thanks for your answer.
> (conditional
>   (bgp
>     (triple ?loc :locatedInWork ex:Work1)
>     (triple ?loc :startVolume ?startvol)
>   )
>   (bgp (triple ?loc :endVolume ?endvol)))
Am I right in understanding that in that case ?loc is bound in the
second part (the part with :endVolume)? If so then the slow
performance must come from somewhere else, I'll investigate further.
I found another way of writing the query which is much more complex
but a little bit more satisfying in the sense that it makes the
binding of ?loc very clear:
?loc :workLocationVolume ?bvol .
?loc :locatedInWork ex:Work1 ;
     :startVolume ?startvol .
FILTER ((?bvol = ?volnum && NOT EXISTS {?loc :workLocationEndVolume ?evol}) ||
        (?bvol <= ?volnum && EXISTS {?loc :workLocationEndVolume ?evol
                                     FILTER (?evol <= ?volnum)}))
In terms of logic it should be equivalent to the previous query.
Should there be a performance difference? My experiments show that
this version is quite consistently twice as fast as the OPTIONAL
version.
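One way to check the claimed equivalence is to write both filters as plain predicates and compare them on a few values. The sketch below is only that, a check in Python under stated assumptions: each location has at most one end volume (None when :endVolume is absent), ?bvol plays the role of ?startvol, and the function names are invented here. As written, the inner test `?evol <= ?volnum` does not agree with the `?endvol >= 2` of the OPTIONAL query; with `>=` the two filters coincide.

```python
def in_volume_optional(startvol, endvol, volnum):
    """FILTER of the OPTIONAL query; endvol is None when unbound."""
    if endvol is not None:
        return startvol <= volnum and endvol >= volnum
    return startvol == volnum

def in_volume_exists_as_written(startvol, endvol, volnum):
    """FILTER of the EXISTS query, inner test '?evol <= ?volnum'."""
    if endvol is None:  # NOT EXISTS branch
        return startvol == volnum
    return startvol <= volnum and endvol <= volnum

def in_volume_exists_with_ge(startvol, endvol, volnum):
    """Same as above, but with the inner test changed to '?evol >= ?volnum'."""
    if endvol is None:  # NOT EXISTS branch
        return startvol == volnum
    return startvol <= volnum and endvol >= volnum

# A location spanning volumes 1 to 3 contains volume 2.
print(in_volume_optional(1, 3, 2))           # True
print(in_volume_exists_as_written(1, 3, 2))  # False
print(in_volume_exists_with_ge(1, 3, 2))     # True
```

If the faster timings were measured with the `<=` form, the two queries may not have been returning the same rows, which is worth ruling out before comparing performance.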
Best,
--
Elie
Re: bottom-up semantics
Posted by "Lorenz B." <bu...@informatik.uni-leipzig.de>.
Hello,
The query algebra has the following structure:
(project (?book ?title)
  (filter (|| (&& (&& (bound ?endvol) (<= ?startvol 2)) (>= ?endvol 2))
              (&& (! (bound ?endvol)) (= ?startvol 2)))
    (leftjoin
      (bgp
        (triple ?loc :locatedInWork ex:Work1)
        (triple ?loc :startVolume ?startvol)
      )
      (bgp (triple ?loc :endVolume ?endvol)))))

(optimized)

(project (?book ?title)
  (filter (|| (&& (&& (bound ?endvol) (<= ?startvol 2)) (>= ?endvol 2))
              (&& (! (bound ?endvol)) (= ?startvol 2)))
    (conditional
      (bgp
        (triple ?loc :locatedInWork ex:Work1)
        (triple ?loc :startVolume ?startvol)
      )
      (bgp (triple ?loc :endVolume ?endvol)))))

As you can see, an OPTIONAL is basically a left outer join.
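Read from the inside out, the un-optimized algebra can be mimicked with a few list operations. The following is a rough Python sketch on the two sample locations from the thread; the helper names (`join`, `leftjoin`, `volume_filter`) are invented here and say nothing about how Jena implements these operators.

```python
# Sample data from the thread, as (subject, predicate, object) tuples.
TRIPLES = [
    ("ex:Loc1", ":locatedInWork", "ex:Work1"),
    ("ex:Loc1", ":startVolume", 1),
    ("ex:Loc2", ":locatedInWork", "ex:Work1"),
    ("ex:Loc2", ":startVolume", 1),
    ("ex:Loc2", ":endVolume", 2),
]

def match(pred, var):
    """Solutions for the triple pattern '?loc pred ?var'."""
    return [{"loc": s, var: o} for s, p, o in TRIPLES if p == pred]

def compatible(a, b):
    """Two solutions are compatible if they agree on shared variables."""
    return all(a[k] == b[k] for k in a.keys() & b.keys())

def join(left, right):
    return [{**a, **b} for a in left for b in right if compatible(a, b)]

def leftjoin(left, right):
    """Left outer join: keep a left row even when nothing on the right fits."""
    out = []
    for a in left:
        out.extend([{**a, **b} for b in right if compatible(a, b)] or [a])
    return out

def volume_filter(row, volnum=2):
    """The FILTER expression from the query; a missing key means unbound."""
    endvol = row.get("endvol")
    if endvol is not None:
        return row["startvol"] <= volnum and endvol >= volnum
    return row["startvol"] == volnum

# (bgp (triple ?loc :locatedInWork ex:Work1) (triple ?loc :startVolume ?startvol))
in_work = [{"loc": s} for s, p, o in TRIPLES
           if p == ":locatedInWork" and o == "ex:Work1"]
bgp = join(in_work, match(":startVolume", "startvol"))

# (filter ... (leftjoin bgp (bgp (triple ?loc :endVolume ?endvol))))
rows = leftjoin(bgp, match(":endVolume", "endvol"))
selected = [r["loc"] for r in rows if volume_filter(r)]
print(selected)  # ['ex:Loc2']
```

ex:Loc1 survives the left join with ?endvol unbound and is then rejected by the filter, because its start volume is 1 rather than 2.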
If you're using TDB, some statistics on the data could be taken into
account by an optimizer. You can check this by following the steps here [1].
[1] https://jena.apache.org/documentation/tdb/optimizer.html
--
Lorenz Bühmann
AKSW group, University of Leipzig
Group: http://aksw.org - semantic web research center