You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@marmotta.apache.org by Tom Johnson <to...@dp.la> on 2016/04/08 00:00:33 UTC

Re: Generic Query Plan Performance Cost

Reviving a somewhat dead thread with an update:

Previously, we had run into problems with the Posgresql query planner and
its use of generic query plans for prepared queries from Marmotta. We've
been able to mitigate this in the past by:

  - Upgrading to PSQL 9.4, which is much less aggressive about using
generic query plans; and
  - Adding an index that directly matches the order of the problem queries.
For reasons unknown to us, PSQL seemed to prefer an `spoc` index to an
otherwise identical `cspo` index.

As our quad store has grown, we're now seeing this solution fail, and are
back to fighting against the query planner.

The crux of our problem is that we use Marmotta primarily as an LDP server;
this server now contains in the millions of individual LDP-NRs, and about
twice as many LDP-RSs. The result is that we have approximately 10 million
named graphs, each containing a small number of triples (varying from 3 to
about 75), and a single LDP context with tens of millions of triples. When
we do a PUT request, the LDP context is queried with a prepared query like:

   prepare problem_query  as SELECT id FROM triples WHERE subject = $1 AND
predicate = $2 AND object = $3 AND context = $4 AND deleted = false;

The query planner has, at various points, chosen different indexes to use
to carry out this query. The behavior we are seeing now is that it uses a
`cpo` index, then filters the subjects. Presumably, it believes that
scanning the smaller index will save it time, and in most cases, between
zero and three subjects will need to be filtered.

When the LDP context is used, however, the result is generally that at
least 10 million subjects need to be filtered (this varies depending on
which of the predicates from the LDP graph are queried). This process makes
the query extremely expensive. Running time is counted in minutes.

We've been able to persuade the query planner to use an `spo` index, which
runs instantly, but only when we are not past the JDBC preparedThreshold.
Our current thinking is to disable prepared statements altogether, but we
expect to see a cost for this in other queries.

Have you all seen this before? It looks to us like it should be repeatable
for any large scale LDP marmotta service running on Postgresql, and we
thought it would be important to share our experience so this can be
patched on the Marmotta side.

Thanks for your time, and for the work you all do.

Best,

Tom Johnson
Metadata & Platform Architect
Digital Public Library of America

On Fri, Nov 6, 2015 at 3:17 AM, Sergio Fernández <wi...@apache.org> wrote:

> Hi Tom,
>
> On Thu, Nov 5, 2015 at 6:12 PM, Tom Johnson <to...@dp.la> wrote:
>>
>> We can confirm that upgrading to PSQL 9.2+ (we are using 9.4) has
>> significant performance benefits on LDP requests. GET requests that had
>> been taking between 10 and 50 seconds are now observed at around 1/10 of a
>> second, with much of that time apparently spent on the Marmotta side rather
>> than in PSQL.
>>
>
> Definitely PostgreSQL has brought quite some improvements in 9.4 that we
> strongly recommend it for Marmotta.
>
> After the upgrade, we saw vastly improved performance for a time, but
>> found that under a certain load similar query planning problems resurfaced.
>> Perhaps oddly, we were able to stem this by creating a new index on the
>> triples table, with a different field order from the existing four-column
>> index. The new index (on `spoc` rather than `cspo`) is used more readily in
>> the generic plan, and we've not seen slow queries (> 200ms) logged on LDP
>> requests since introducing it.
>>
>
> Those are really good news!
>
> Probably there is still room to improve the current implementation. But
> also take into account that currently the LDP module uses Sesame-based
> persistence with KiWi. Following the experience on providing a SQL native
> SPARQL evaluation strategy, we could try something similar for improving
> even more the LDP performance. But I'd prefer to keep that plan aside if
> possible.
>
> Cheers,
>
> --
> Sergio Fernández
> Partner Technology Manager
> Redlink GmbH
> m: +43 6602747925
> e: sergio.fernandez@redlink.co
> w: http://redlink.co
>

Re: Generic Query Plan Performance Cost

Posted by Sergio Fernández <wi...@apache.org>.
Thanks Tom for sharing with all of us your experiences; they're really
valuable in many aspect.

In the end the problem comes from the idea that every LDP-R holds its own
context/graph. The LDP data access relies in Sesame, and probably moving it
down to a native SQL access (like we did for SPARQL) would allow us to deal
more efficiently with such scenarios.

As a less dramatic solution for now, in the past I've been experimented
with some Postgres parameters (enable_seqscan, enable_mergejoinor
enable_hashjoin) with some good results.

Hope that helps.

Cheers,


On Fri, Apr 8, 2016 at 12:00 AM, Tom Johnson <to...@dp.la> wrote:

> Reviving a somewhat dead thread with an update:
>
> Previously, we had run into problems with the Posgresql query planner and
> its use of generic query plans for prepared queries from Marmotta. We've
> been able to mitigate this in the past by:
>
>   - Upgrading to PSQL 9.4, which is much less aggressive about using
> generic query plans; and
>   - Adding an index that directly matches the order of the problem
> queries. For reasons unknown to us, PSQL seemed to prefer an `spoc` index
> to an otherwise identical `cspo` index.
>
> As our quad store has grown, we're now seeing this solution fail, and are
> back to fighting against the query planner.
>
> The crux of our problem is that we use Marmotta primarily as an LDP
> server; this server now contains in the millions of individual LDP-NRs, and
> about twice as many LDP-RSs. The result is that we have approximately 10
> million named graphs, each containing a small number of triples (varying
> from 3 to about 75), and a single LDP context with tens of millions of
> triples. When we do a PUT request, the LDP context is queried with a
> prepared query like:
>
>    prepare problem_query  as SELECT id FROM triples WHERE subject = $1 AND
> predicate = $2 AND object = $3 AND context = $4 AND deleted = false;
>
> The query planner has, at various points, chosen different indexes to use
> to carry out this query. The behavior we are seeing now is that it uses a
> `cpo` index, then filters the subjects. Presumably, it believes that
> scanning the smaller index will save it time, and in most cases, between
> zero and three subjects will need to be filtered.
>
> When the LDP context is used, however, the result is generally that at
> least 10 million subjects need to be filtered (this varies depending on
> which of the predicates from the LDP graph are queried). This process makes
> the query extremely expensive. Running time is counted in minutes.
>
> We've been able to persuade the query planner to use an `spo` index, which
> runs instantly, but only when we are not past the JDBC preparedThreshold.
> Our current thinking is to disable prepared statements altogether, but we
> expect to see a cost for this in other queries.
>
> Have you all seen this before? It looks to us like it should be repeatable
> for any large scale LDP marmotta service running on Postgresql, and we
> thought it would be important to share our experience so this can be
> patched on the Marmotta side.
>
> Thanks for your time, and for the work you all do.
>
> Best,
>
> Tom Johnson
> Metadata & Platform Architect
> Digital Public Library of America
>
> On Fri, Nov 6, 2015 at 3:17 AM, Sergio Fernández <wi...@apache.org>
> wrote:
>
>> Hi Tom,
>>
>> On Thu, Nov 5, 2015 at 6:12 PM, Tom Johnson <to...@dp.la> wrote:
>>>
>>> We can confirm that upgrading to PSQL 9.2+ (we are using 9.4) has
>>> significant performance benefits on LDP requests. GET requests that had
>>> been taking between 10 and 50 seconds are now observed at around 1/10 of a
>>> second, with much of that time apparently spent on the Marmotta side rather
>>> than in PSQL.
>>>
>>
>> Definitely PostgreSQL has brought quite some improvements in 9.4 that we
>> strongly recommend it for Marmotta.
>>
>> After the upgrade, we saw vastly improved performance for a time, but
>>> found that under a certain load similar query planning problems resurfaced.
>>> Perhaps oddly, we were able to stem this by creating a new index on the
>>> triples table, with a different field order from the existing four-column
>>> index. The new index (on `spoc` rather than `cspo`) is used more readily in
>>> the generic plan, and we've not seen slow queries (> 200ms) logged on LDP
>>> requests since introducing it.
>>>
>>
>> Those are really good news!
>>
>> Probably there is still room to improve the current implementation. But
>> also take into account that currently the LDP module uses Sesame-based
>> persistence with KiWi. Following the experience on providing a SQL native
>> SPARQL evaluation strategy, we could try something similar for improving
>> even more the LDP performance. But I'd prefer to keep that plan aside if
>> possible.
>>
>> Cheers,
>>
>> --
>> Sergio Fernández
>> Partner Technology Manager
>> Redlink GmbH
>> m: +43 6602747925
>> e: sergio.fernandez@redlink.co
>> w: http://redlink.co
>>
>
>


-- 
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernandez@redlink.co
w: http://redlink.co