Posted to users@jena.apache.org by Matteo Casu <ma...@gmail.com> on 2012/02/17 17:39:09 UTC

Jena ResultSet

Dear list,

I'm doing some experiments comparing SPARQL querying on different
platforms, with different entailment regimes (mainly RDFS and OWL2QL). I
ran into the overhead problem of ResultSet (and also ResultSetFormatter).
Reading other threads, I learned that ARQ works as a buffer, and that the
execSelect method does not really compute the query. The real results are
retrieved by looping over the ResultSet. I naively thought that execSelect
would have run the query, and that ResultSet was only for printing or
viewing results.
Now, the question: which is the right way to measure the query time:
- to take the time before and after the call to execSelect(), or
- to take it before execSelect() and again at the end of the loop?

Any hint would be highly appreciated. Thanks in advance!

Mat
_______________________

Here is the snippet:

Query query = QueryFactory.create(queryString);
QueryExecution qexec = QueryExecutionFactory.create(query, model);
ResultSet results = qexec.execSelect();

while (results.hasNext()) {
    QuerySolution row = results.next();
    // here: print the row to a file
}

Re: Jena ResultSet

Posted by Andy Seaborne <an...@apache.org>.

On 17/02/12 17:17, Robert Vesse wrote:
> Generally my recommendation, and what we do internally, is to measure
> from the execSelect() call to the point where we finish iterating
> over the results. I would recommend doing nothing in the iteration
> other than incrementing a count; otherwise you may skew your figures,
> because what you do with each result may be far more computationally
> costly than just iterating over them.

ResultSetFormatter.consume will do what you need to do for timing.

/** This operation faithfully walks the results but does nothing with them.
*  @return The count of the number of solutions.
*/

It not only iterates over the rows, but it also touches every variable 
in the results.  See the code for details.

In TDB this matters:

SELECT (count(*) AS ?c) { ?s ?p ?o }

does not touch the nodes, just the internal ids, and never fetches the
representation of the URIs etc.  TDB returns a lazy-eval result row;
that query does not need the bytes for ?s etc.  This is fast for count(*).
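
To make the difference concrete, here is a minimal sketch in plain Java. It does not use Jena at all: the rows are ordinary maps whose values are only materialised on demand, standing in for lazy-eval result rows, and LazyRowDemo, countOnly, consumeAll and sampleRows are hypothetical names invented for the illustration. It only shows why walking the rows alone can be cheaper than walking them and touching every variable.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Stand-in for lazily evaluated result rows; not Jena or TDB code.
class LazyRowDemo {

    /** Counts rows without looking at any variable (in the spirit of count(*)). */
    static int countOnly(Iterator<Map<String, Supplier<String>>> rows) {
        int n = 0;
        while (rows.hasNext()) {
            rows.next();
            n++;
        }
        return n;
    }

    /** Counts rows and forces every variable value (in the spirit of consume). */
    static int consumeAll(Iterator<Map<String, Supplier<String>>> rows) {
        int n = 0;
        while (rows.hasNext()) {
            for (Supplier<String> value : rows.next().values()) {
                value.get(); // materialise the value, then discard it
            }
            n++;
        }
        return n;
    }

    /** Builds rows with ?s ?p ?o whose values bump a counter when fetched. */
    static List<Map<String, Supplier<String>>> sampleRows(int count, AtomicInteger fetches) {
        List<Map<String, Supplier<String>>> rows = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            final int id = i;
            Map<String, Supplier<String>> row = new LinkedHashMap<>();
            for (String name : new String[] {"s", "p", "o"}) {
                row.put(name, () -> {
                    fetches.incrementAndGet();
                    return "http://example.org/" + name + id;
                });
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        AtomicInteger fetches = new AtomicInteger();
        int counted = countOnly(sampleRows(4, fetches).iterator());
        System.out.println(counted + " rows, " + fetches.get() + " values fetched");
        int consumed = consumeAll(sampleRows(4, fetches).iterator());
        System.out.println(consumed + " rows, " + fetches.get() + " values fetched");
    }
}
```

If the benchmark is meant to include the cost of fetching the node representations, the second style of loop is the one to time.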

> We have a benchmarking tool that we use internally and we distinguish
> these two things as response time and runtime, the former being the
> time for the first result to be received and the latter being the
> time for all results to be received.  Often the two figures can be
> massively different, especially with queries that generate very
> large results.

That's also useful - the first row can be more expensive than the rest.
This is not an execSelect thing - the first hasNext() can trigger
anything from a little work to most of the query, depending on the
query.  ORDER BY and GROUP BY are extreme cases.

	Andy

Re: Jena ResultSet

Posted by Robert Vesse <rv...@yarcdata.com>.

Hi Matteo

It depends on exactly what you are trying to time and what store you are talking to.

For example, if the store is TDB then execSelect() simply causes the query plan to be generated, optimized and turned into a QueryIterator that can return the actual results of the query.  So timing around execSelect() alone would not really be appropriate, because TDB has not done any real work to answer the query at that stage.

However, if your store is a remote store accessed via SPARQL over HTTP (you used QueryExecutionFactory.sparqlService() to get a QueryExecution) then timing around execSelect() might be more useful, because the timing then reflects the time for the HTTP request to be made and for the store to start returning results.  This won't necessarily mean all results have been returned, since depending on the response format from the server ARQ may stream the results.

Generally my recommendation, and what we do internally, is to measure from the execSelect() call to the point where we finish iterating over the results.  I would recommend doing nothing in the iteration other than incrementing a count; otherwise you may skew your figures, because what you do with each result may be far more computationally costly than just iterating over them.

We have a benchmarking tool that we use internally, and it distinguishes these two things as response time and runtime: the former is the time for the first result to be received and the latter is the time for all results to be received.  Often the two figures can be massively different, especially with queries that generate very large results.
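
As a rough sketch of that measurement (this is not our internal tool, and it deliberately avoids Jena types so that it runs on its own): the Supplier stands in for the execSelect() call and the Iterator for the ResultSet.  QueryTiming, Timing and time are hypothetical names made up for the example.  The loop only increments a count, and the clock starts before the query is executed.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Supplier;

// Stand-in timing harness; the Supplier plays the role of execSelect().
class QueryTiming {

    /** Elapsed times in nanoseconds plus the number of rows seen. */
    static final class Timing {
        final long responseNanos; // start until the first row was received
        final long runtimeNanos;  // start until all rows were received
        final long rows;

        Timing(long responseNanos, long runtimeNanos, long rows) {
            this.responseNanos = responseNanos;
            this.runtimeNanos = runtimeNanos;
            this.rows = rows;
        }
    }

    /** Runs the query and iterates all results, doing nothing but counting. */
    static <T> Timing time(Supplier<Iterator<T>> exec) {
        long start = System.nanoTime();
        Iterator<T> results = exec.get(); // may do little or no real work yet
        long rows = 0;
        long response = -1;
        while (results.hasNext()) {
            results.next();
            if (rows == 0) {
                response = System.nanoTime() - start; // first row received
            }
            rows++;
        }
        long runtime = System.nanoTime() - start;
        if (response < 0) {
            response = runtime; // no rows: the "response" is the empty answer
        }
        return new Timing(response, runtime, rows);
    }

    public static void main(String[] args) {
        Timing t = QueryTiming.<String>time(
                () -> Arrays.asList("row1", "row2", "row3").iterator());
        System.out.println("rows=" + t.rows
                + " responseNanos=" + t.responseNanos
                + " runtimeNanos=" + t.runtimeNanos);
    }
}
```

With a real store you would pass something that calls execSelect() as the supplier, and repeat the measurement several times, since a single run is easily distorted by JVM warm-up and caching.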

Hope this helps

Rob
