Posted to users@jena.apache.org by Paul Tyson <ph...@sbcglobal.net> on 2015/04/11 23:40:23 UTC

named graph impact on query performance

Hi,

Any theoretical reasons or evidence that lots of named graphs in a TDB
repository will adversely affect query performance?

For example, 50,000 named graphs containing a total of 11 million triples.

Some queries will be for specific graphs, but most will be union queries
over the entire repository.

I'm going to find out in the next few days, but curious about community
experience with this.

Thanks,
--Paul

Re: named graph impact on query performance

Posted by Willie <mi...@gmail.com>.

I am working with a few million triples or so in TDB and experiencing pretty good performance.  

Irrespective of the queries, my experience has been that TDB performs better when it is set to query the union of the graphs by default, with TDB.getContext().setTrue(TDB.symUnionDefaultGraph), rather than unioning graphs in the query with FROM and FROM NAMED clauses. With that setting on, you can still execute queries against specific graphs (with FROM, FROM NAMED, and GRAPH clauses).

Cheers,
Willie

Re: named graph impact on query performance

Posted by Paul Tyson <ph...@sbcglobal.net>.

On Sun, 2015-04-12 at 03:18 +0530, Rose Beck wrote:
> I think performance largely depends on your application and queries.
> Any specific reason for using 50,000 named graphs?

Yes, but other than to say I believe it to be a viable architectural
choice I'd rather not go into details. Of course if performance sucks
that would jeopardize its viability, with TDB anyway.

Regards,
--Paul

Re: named graph impact on query performance

Posted by Rose Beck <ro...@gmail.com>.

I think performance largely depends on your application and queries.
Any specific reason for using 50,000 named graphs?

-- 
With Warm Regards,
Rose

Re: named graph impact on query performance

Posted by Andy Seaborne <an...@apache.org>.

On 12/04/15 11:00, Dave Reynolds wrote:
> On 11/04/15 22:40, Paul Tyson wrote:
>> Hi,
>>
>> Any theoretical reasons or evidence that lots of named graphs in a TDB
>> repository will adversely affect query performance?
>>
>> For example, 50,000 named graphs containing a total of 11 million triples.
>>
>> Some queries will be for specific graphs, but most will be union queries
>> over the entire repository.
>
> I know of other applications that make use of very large numbers of
> named graphs successfully.
>
> To first approximation the performance will depend on the overall number
> of triples, not how many graphs they are split into. The graph is just a
> fourth entry in the quad and for a union query you just ignore the graph
> column.
>
> To second approximation there is a difference between using graphs and
> using no graphs at all. If you use no graphs then TDB can run as a
> triple store and only needs a smaller number of smaller indexes (doesn't
> need to include G in the indexes) so more of them can fit in a given
> memory footprint. The difference is not enormous. Once you start to use
> graphs the number of them doesn't make that much difference.
>
> Especially at a modest scale like 11 million triples, you should be fine.
>
> Dave
>

With large numbers of named graphs, and the union default graph, updates 
will be slowed due to needing to update more indexes.

However, in query terms, they should be about the same. The query
engine will use only the indexes where the graph field is least
significant: SPOG and POSG, instead of SPO and POS.

While the named graph versions are wider (32 bytes a row rather than
24), that isn't usually noticeable.
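
To put rough numbers on the row widths above, here is a small back-of-the-envelope sketch. It assumes each row slot is a fixed 8-byte id (which is what 24 and 32 bytes imply for 3 and 4 slots) and counts raw row data only, ignoring block overhead and fill factor. The class and method names are made up for illustration.

```java
final class IndexRowWidth {
    // Assumption: every slot in an index row is a fixed 8-byte node id,
    // matching the 24-byte (3 slots) and 32-byte (4 slots) rows quoted above.
    static final long SLOT_BYTES = 8;

    static long rowBytes(int slots) {
        return slots * SLOT_BYTES;
    }

    // Raw row data for one index; tree block overhead is deliberately ignored.
    static long indexBytes(long rows, int slots) {
        return rows * rowBytes(slots);
    }

    public static void main(String[] args) {
        long rows = 11_000_000L; // the 11 million triples from the original question
        long triple = indexBytes(rows, 3);
        long quad = indexBytes(rows, 4);
        System.out.println("triple index rows: " + triple / 1_000_000 + " MB"); // 264 MB
        System.out.println("quad index rows:   " + quad / 1_000_000 + " MB");   // 352 MB
        System.out.println("extra per index:   " + (quad - triple) / 1_000_000 + " MB"); // 88 MB
    }
}
```

Under these assumptions the wider rows cost on the order of 88 MB per index at this scale, which fits the point that the difference is not usually noticeable.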

	Andy

Re: named graph impact on query performance

Posted by Dave Reynolds <da...@gmail.com>.

On 11/04/15 22:40, Paul Tyson wrote:
> Hi,
>
> Any theoretical reasons or evidence that lots of named graphs in a TDB
> repository will adversely affect query performance?
>
> For example, 50,000 named graphs containing a total of 11 million triples.
>
> Some queries will be for specific graphs, but most will be union queries
> over the entire repository.

I know of other applications that make use of very large numbers of
named graphs successfully.

To first approximation the performance will depend on the overall number 
of triples, not how many graphs they are split into. The graph is just a 
fourth entry in the quad and for a union query you just ignore the graph 
column.
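
One way to picture "ignoring the graph column" is a toy sorted index whose rows put the graph id last. This is only a sketch of the idea, not TDB's actual code: the class, the method names, and the use of plain long ids are all assumptions made for illustration. With the graph in the least significant position, a pattern such as (s, p, ?o) is a single range scan no matter how many graphs the matching triples are spread across.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class QuadIndexSketch {
    // Rows are {S, P, O, G} ids, kept sorted with the graph id last.
    private final TreeSet<long[]> spog = new TreeSet<>(QuadIndexSketch::compareRows);

    private static int compareRows(long[] a, long[] b) {
        for (int i = 0; i < 4; i++) {
            int c = Long.compare(a[i], b[i]);
            if (c != 0) return c;
        }
        return 0;
    }

    void add(long s, long p, long o, long g) {
        spog.add(new long[] {s, p, o, g});
    }

    // Union query for the pattern (s, p, ?o): one range scan, graph column ignored.
    List<Long> objectsInUnion(long s, long p) {
        long[] from = {s, p, Long.MIN_VALUE, Long.MIN_VALUE};
        long[] to = {s, p, Long.MAX_VALUE, Long.MAX_VALUE};
        List<Long> objects = new ArrayList<>();
        for (long[] row : spog.subSet(from, true, to, true)) {
            // The same triple held in several graphs sits in adjacent rows,
            // so comparing with the previous result drops the duplicates.
            if (objects.isEmpty() || objects.get(objects.size() - 1) != row[2]) {
                objects.add(row[2]);
            }
        }
        return objects;
    }

    public static void main(String[] args) {
        QuadIndexSketch index = new QuadIndexSketch();
        index.add(1, 2, 10, 100); // triple (1 2 10) in graph 100
        index.add(1, 2, 10, 101); // the same triple again in graph 101
        index.add(1, 2, 11, 102);
        index.add(1, 3, 12, 100); // different predicate, not matched below
        System.out.println(index.objectsInUnion(1, 2)); // prints [10, 11]
    }
}
```

The cost of the scan tracks the number of matching rows, not the number of distinct graph ids, which is the sense in which the graph count matters little for union queries.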

To second approximation there is a difference between using graphs and 
using no graphs at all. If you use no graphs then TDB can run as a 
triple store and only needs a smaller number of smaller indexes (doesn't 
need to include G in the indexes) so more of them can fit in a given 
memory footprint. The difference is not enormous. Once you start to use 
graphs the number of them doesn't make that much difference.

Especially at a modest scale like 11 million triples, you should be fine.

Dave