You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@syncope.apache.org by Dmitriy Brashevets <Dm...@united-security-providers.ch> on 2019/10/09 11:03:03 UTC

Performance improvements: SQL queries

Hi Apache Syncope team!

Currently, we're using the Apache Syncope 2.0.12 version.

AS provides the wide rest API and our team would like to have the org.apache.syncope.common.rest.api.service.AnyService#search working faster as we do the search of groups and users using FIQL queries.

Here is an example of the URI with a request path that has the FIQL query:

https://hostname/syncope/rest/users?fiql=%28realm%3D%3Dea696a4f-e77a-4ef1-be67-8f8093bc8686%3BlastChangeDate%3Dle%3D2019-10-10T13%253A57%253A00%252B0300%29&size=50&orderby=username+ASC&details=true&page=1


During the quick investigation of how this endpoint works under the hood, I found the places that potentially can be optimized.

The result of generated sql query that searches for AnyTO keys looks like this (in a particular case I'm taking into consideration the User anyTO):

{code:sql}
SELECT u.any_id,sv.username FROM (SELECT DISTINCT any_id FROM user_search WHERE (realm_id=? AND any_id IN ( SELECT DISTINCT any_id FROM user_search WHERE lastChangeDate<=?))) u,user_search sv WHERE u.any_id=sv.any_id AND u.any_id IN (SELECT any_id FROM user_search WHERE realm_id IN (SELECT id AS realm_id FROM Realm WHERE id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=?)) ORDER BY sv.username ASC
{code}

Firstly, we can optimize the multiple realm OR-queries that is used for effective realms to [1st Optimization]:
{code:sql}
SELECT u.any_id, sv.username
FROM (SELECT DISTINCT any_id
      FROM user_search
      WHERE (realm_id=? AND any_id IN (SELECT DISTINCT any_id FROM user_search WHERE lastChangeDate<=?))) u,
     user_search sv
WHERE u.any_id = sv.any_id
  AND u.any_id IN (SELECT any_id
                   FROM user_search
                  WHERE realm_id IN (SELECT id AS realm_id
                                      FROM Realm
                                      WHERE realm_id IN (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)))
ORDER BY sv.username ASC
{code}

Also, it is not clear for me why there is an inner join of two tables "u" and "sv"? As far as I can see the effective realms clause and fiql query can be done without inner join.  Thus, the query can be optimized to [2nd Optimization]:
{code:sql}
SELECT DISTINCT any_id, username
FROM user_search
WHERE (realm_id=? AND any_id IN (SELECT DISTINCT any_id FROM user_search WHERE lastChangeDate<=?))
  and realm_id IN
      (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
       ?, ?, ?, ?) ORDER BY user_search.username ASC;
{code}


Is it possbile that in "user_search" view two rows with the same `user_search.any_id` can present? If not then the query can be optimized to  [3rd Optimization]:
{code:sql}
SELECT any_id, username
FROM user_search
WHERE (realm_id=? AND any_id IN (SELECT any_id FROM user_search WHERE lastChangeDate<=?))
  and realm_id IN
      (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
       ?, ?, ?, ?) ORDER BY user_search.username ASC;
{code}


After fetching the any_id <-> username pairs org.apache.syncope.core.persistence.jpa.dao.AbstractAnySearchDAO#buildResult method plays its role.
Here the code selects each entity one by one in the loop by sorted `any_id` order keys.
{code:sql}
for (Object anyKey : raw) {
            String actualKey = anyKey instanceof Object[]
                    ? (String) ((Object[]) anyKey)[0]
                    : ((String) anyKey);

            @SuppressWarnings("unchecked")
            T any = kind == AnyTypeKind.USER
                    ? (T) userDAO.find(actualKey)
                    : kind == AnyTypeKind.GROUP
                            ? (T) groupDAO.find(actualKey)
                            : (T) anyObjectDAO.find(actualKey);
            if (any == null) {
                LOG.error("Could not find {} with id {}, even if returned by native query", kind, actualKey);
            } else if (!result.contains(any)) {
                result.add(any);
            }
        }
{code}

What if we can optimize this logic and instead of producing the multiple sql requests, where each sql request fetches only one user, we will use the IN clause? For this purpose we can add an additional method "findByKeys" into org.apache.syncope.core.persistence.jpa.dao.AbstractAnyDAO. It can look like this [4th Optimization]:
{code:sql}
    @Transactional(readOnly = true)
    @Override
    public Collection<A> findByKeys(Set<String> keys) {
        validateKeys(keys);

        Class<A> entityClass = anyUtils().anyClass();
        TypedQuery<A> query = entityManager().createQuery(
                "SELECT e FROM " + entityClass.getSimpleName()
                        + " e WHERE e.id IN (:keys)",
                entityClass);
        query.setParameter("keys", keys);
        List<A> foundAnys = query.getResultList();

        validateAllAnysPresent(keys, foundAnys);

        // TODO: adopt to multiple keys
        for (A any : foundAnys) {
            securityChecks(any);
        }

        return foundAnys;
    }
{code}
Note that this is a quick implementation and method like org.apache.syncope.core.persistence.jpa.dao.AbstractAnyDAO#securityChecks can be also optimized and adopted to consume multiple entities.
And we can use the new method "findByKeys" instead of the loop.
After all, entities are fetched by keys in a single query we can sort the entities in Java code using the sorted keys.

I prepared the PR where I implemented quickly https://github.com/DmitriyBrashevets/syncope/pull/1:
* [1st Optimization]
* [4th Optimization]


What I also want to achieve in result:
1) Understand if Apache Syncope team also sees the benefits in these optimizations and to ensure that everybody agrees with them and I haven't missed anything and everything written above is correct.
2) If everything is correct we can find other places in the code and also optimize them
3) Finally, we can measure the performance if AS team is agree with these changes. I hope that AS team already have an infrastructure with integration or e2e tests that do measure the performance.


Thanks in advance,
Dmitriy Brashevets

RE: Performance improvements: SQL queries

Posted by Dmitriy Brashevets <Dm...@united-security-providers.ch>.

I created the first PR regarding this: https://github.com/apache/syncope/pull/126/files

Regards,
Dmitriy

From: Francesco Chicchiriccò <il...@apache.org>
Sent: Wednesday, October 9, 2019 2:15 PM
To: user@syncope.apache.org
Subject: Re: Performance improvements: SQL queries

Hi Dmitriy,
thanks for your analysis.

Please consider that:

1. you should refer to latest Syncope 2.1 (2.1.5 at the moment), rather than old 2.0.12.
2. the queries are generated in the current form because of the various use cases they need to work for, and that are mostly checked by various tests
3. Syncope 2.1 also provides Postgresql JSONB support, which dramatically improves overall performance

Anyway, feel free to submit any PR that we'll be glad to review.
Regards.

On 09/10/19 13:03, Dmitriy Brashevets wrote:
Hi Apache Syncope team!

Currently, we're using the Apache Syncope 2.0.12 version.

AS provides the wide rest API and our team would like to have the org.apache.syncope.common.rest.api.service.AnyService#search working faster as we do the search of groups and users using FIQL queries.

Here is an example of the URI with a request path that has the FIQL query:

https://hostname/syncope/rest/users?fiql=%28realm%3D%3Dea696a4f-e77a-4ef1-be67-8f8093bc8686%3BlastChangeDate%3Dle%3D2019-10-10T13%253A57%253A00%252B0300%29&size=50&orderby=username+ASC&details=true&page=1

During the quick investigation of how this endpoint works under the hood, I found the places that potentially can be optimized.

The result of generated sql query that searches for AnyTO keys looks like this (in a particular case I'm taking into consideration the User anyTO):

{code:sql}
SELECT u.any_id,sv.username FROM (SELECT DISTINCT any_id FROM user_search WHERE (realm_id=? AND any_id IN ( SELECT DISTINCT any_id FROM user_search WHERE lastChangeDate<=?))) u,user_search sv WHERE u.any_id=sv.any_id AND u.any_id IN (SELECT any_id FROM user_search WHERE realm_id IN (SELECT id AS realm_id FROM Realm WHERE id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=?)) ORDER BY sv.username ASC
{code}

Firstly, we can optimize the multiple realm OR-queries that is used for effective realms to [1st Optimization]:
{code:sql}
SELECT u.any_id, sv.username
FROM (SELECT DISTINCT any_id
      FROM user_search
      WHERE (realm_id=? AND any_id IN (SELECT DISTINCT any_id FROM user_search WHERE lastChangeDate<=?))) u,
     user_search sv
WHERE u.any_id = sv.any_id
  AND u.any_id IN (SELECT any_id
                   FROM user_search
                  WHERE realm_id IN (SELECT id AS realm_id
                                      FROM Realm
                                      WHERE realm_id IN (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)))
ORDER BY sv.username ASC
{code}

Also, it is not clear for me why there is an inner join of two tables "u" and "sv"? As far as I can see the effective realms clause and fiql query can be done without inner join.  Thus, the query can be optimized to [2nd Optimization]:
{code:sql}
SELECT DISTINCT any_id, username
FROM user_search
WHERE (realm_id=? AND any_id IN (SELECT DISTINCT any_id FROM user_search WHERE lastChangeDate<=?))
  and realm_id IN
      (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
       ?, ?, ?, ?) ORDER BY user_search.username ASC;
{code}

Is it possbile that in "user_search" view two rows with the same `user_search.any_id` can present? If not then the query can be optimized to  [3rd Optimization]:
{code:sql}
SELECT any_id, username
FROM user_search
WHERE (realm_id=? AND any_id IN (SELECT any_id FROM user_search WHERE lastChangeDate<=?))
  and realm_id IN
      (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
       ?, ?, ?, ?) ORDER BY user_search.username ASC;
{code}

After fetching the any_id <-> username pairs org.apache.syncope.core.persistence.jpa.dao.AbstractAnySearchDAO#buildResult method plays its role.
Here the code selects each entity one by one in the loop by sorted `any_id` order keys.
{code:sql}
for (Object anyKey : raw) {
            String actualKey = anyKey instanceof Object[]
                    ? (String) ((Object[]) anyKey)[0]
                    : ((String) anyKey);

            @SuppressWarnings("unchecked")
            T any = kind == AnyTypeKind.USER
                    ? (T) userDAO.find(actualKey)
                    : kind == AnyTypeKind.GROUP
                            ? (T) groupDAO.find(actualKey)
                            : (T) anyObjectDAO.find(actualKey);
            if (any == null) {
                LOG.error("Could not find {} with id {}, even if returned by native query", kind, actualKey);
            } else if (!result.contains(any)) {
                result.add(any);
            }
        }
{code}

What if we can optimize this logic and instead of producing the multiple sql requests, where each sql request fetches only one user, we will use the IN clause? For this purpose we can add an additional method "findByKeys" into org.apache.syncope.core.persistence.jpa.dao.AbstractAnyDAO. It can look like this [4th Optimization]:
{code:sql}
    @Transactional(readOnly = true)
    @Override
    public Collection<A> findByKeys(Set<String> keys) {
        validateKeys(keys);

        Class<A> entityClass = anyUtils().anyClass();
        TypedQuery<A> query = entityManager().createQuery(
                "SELECT e FROM " + entityClass.getSimpleName()
                        + " e WHERE e.id IN (:keys)",
                entityClass);
        query.setParameter("keys", keys);
        List<A> foundAnys = query.getResultList();

        validateAllAnysPresent(keys, foundAnys);

        // TODO: adopt to multiple keys
        for (A any : foundAnys) {
            securityChecks(any);
        }

        return foundAnys;
    }
{code}
Note that this is a quick implementation and method like org.apache.syncope.core.persistence.jpa.dao.AbstractAnyDAO#securityChecks can be also optimized and adopted to consume multiple entities.
And we can use the new method "findByKeys" instead of the loop.
After all, entities are fetched by keys in a single query we can sort the entities in Java code using the sorted keys.

I prepared the PR where I implemented quickly https://github.com/DmitriyBrashevets/syncope/pull/1:
* [1st Optimization]
* [4th Optimization]

What I also want to achieve in result:
1) Understand if Apache Syncope team also sees the benefits in these optimizations and to ensure that everybody agrees with them and I haven't missed anything and everything written above is correct.
2) If everything is correct we can find other places in the code and also optimize them
3) Finally, we can measure the performance if AS team is agree with these changes. I hope that AS team already have an infrastructure with integration or e2e tests that do measure the performance.

Thanks in advance,
Dmitriy Brashevets

--

Francesco Chicchiriccò

Tirasa - Open Source Excellence

http://www.tirasa.net/

Member at The Apache Software Foundation

Syncope, Cocoon, Olingo, CXF, OpenJPA, PonyMail

http://home.apache.org/~ilgrosso/

Re: Performance improvements: SQL queries

Posted by Francesco Chicchiriccò <il...@apache.org>.

Hi Dmitriy,
thanks for your analysis.

Please consider that:

1. you should refer to latest Syncope 2.1 (2.1.5 at the moment), rather than old 2.0.12.
2. the queries are generated in the current form because of the various use cases they need to work for, and that are mostly checked by various tests
3. Syncope 2.1 also provides Postgresql JSONB support, which dramatically improves overall performance

Anyway, feel free to submit any PR that we'll be glad to review.
Regards.

On 09/10/19 13:03, Dmitriy Brashevets wrote:
>
> Hi Apache Syncope team!
>
>  
>
> Currently, we're using the Apache Syncope 2.0.12 version.
>
>  
>
> AS provides the wide rest API and our team would like to have the org.apache.syncope.common.rest.api.service.AnyService#search working faster as we do the search of groups and users using FIQL queries.
>
>  
>
> Here is an example of the URI with a request path that has the FIQL query:
>
>  
>
> https://hostname/syncope/rest/users?fiql=%28realm%3D%3Dea696a4f-e77a-4ef1-be67-8f8093bc8686%3BlastChangeDate%3Dle%3D2019-10-10T13%253A57%253A00%252B0300%29&size=50&orderby=username+ASC&details=true&page=1
>
>  
>
>  
>
> During the quick investigation of how this endpoint works under the hood, I found the places that potentially can be optimized.
>
>  
>
> The result of generated sql query that searches for AnyTO keys looks like this (in a particular case I'm taking into consideration the User anyTO):
>
>  
>
> {code:sql}
>
> SELECT u.any_id,sv.username FROM (SELECT DISTINCT any_id FROM user_search WHERE (realm_id=? AND any_id IN ( SELECT DISTINCT any_id FROM user_search WHERE lastChangeDate<=?))) u,user_search sv WHERE u.any_id=sv.any_id AND u.any_id IN (SELECT any_id FROM user_search WHERE realm_id IN (SELECT id AS realm_id FROM Realm WHERE id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=? OR id=?)) ORDER BY sv.username ASC
>
> {code}
>
>  
>
> Firstly, we can optimize the multiple realm OR-queries that is used for effective realms to [1st Optimization]:
>
> {code:sql}
>
> SELECT u.any_id, sv.username
>
> FROM (SELECT DISTINCT any_id
>
>       FROM user_search
>
>       WHERE (realm_id=? AND any_id IN (SELECT DISTINCT any_id FROM user_search WHERE lastChangeDate<=?))) u,
>
>      user_search sv
>
> WHERE u.any_id = sv.any_id
>
>   AND u.any_id IN (SELECT any_id
>
>                    FROM user_search
>
>                   WHERE realm_id IN (SELECT id AS realm_id
>
>                                       FROM Realm
>
>                                       WHERE realm_id IN (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)))
>
> ORDER BY sv.username ASC
>
> {code}
>
>  
>
> Also, it is not clear for me why there is an inner join of two tables "u" and "sv"? As far as I can see the effective realms clause and fiql query can be done without inner join.  Thus, the query can be optimized to [2nd Optimization]:
>
> {code:sql}
>
> SELECT DISTINCT any_id, username
>
> FROM user_search
>
> WHERE (realm_id=? AND any_id IN (SELECT DISTINCT any_id FROM user_search WHERE lastChangeDate<=?))
>
>   and realm_id IN
>
>       (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
>
>        ?, ?, ?, ?) ORDER BY user_search.username ASC;
>
> {code}
>
>  
>
>  
>
> Is it possbile that in "user_search" view two rows with the same `user_search.any_id` can present? If not then the query can be optimized to  [3rd Optimization]:
>
> {code:sql}
>
> SELECT any_id, username
>
> FROM user_search
>
> WHERE (realm_id=? AND any_id IN (SELECT any_id FROM user_search WHERE lastChangeDate<=?))
>
>   and realm_id IN
>
>       (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
>
>        ?, ?, ?, ?) ORDER BY user_search.username ASC;
>
> {code}
>
>  
>
>  
>
> After fetching the any_id <-> username pairs org.apache.syncope.core.persistence.jpa.dao.AbstractAnySearchDAO#buildResult method plays its role.
>
> Here the code selects each entity one by one in the loop by sorted `any_id` order keys.
>
> {code:sql}
>
> for (Object anyKey : raw) {
>
>             String actualKey = anyKey instanceof Object[]
>
>                     ? (String) ((Object[]) anyKey)[0]
>
>                     : ((String) anyKey);
>
>  
>
>             @SuppressWarnings("unchecked")
>
>             T any = kind == AnyTypeKind.USER
>
>                     ? (T) userDAO.find(actualKey)
>
>                     : kind == AnyTypeKind.GROUP
>
>                             ? (T) groupDAO.find(actualKey)
>
>                             : (T) anyObjectDAO.find(actualKey);
>
>             if (any == null) {
>
>                 LOG.error("Could not find {} with id {}, even if returned by native query", kind, actualKey);
>
>             } else if (!result.contains(any)) {
>
>                 result.add(any);
>
>             }
>
>         }
>
> {code}
>
>  
>
> What if we can optimize this logic and instead of producing the multiple sql requests, where each sql request fetches only one user, we will use the IN clause? For this purpose we can add an additional method "findByKeys" into org.apache.syncope.core.persistence.jpa.dao.AbstractAnyDAO. It can look like this [4th Optimization]:
>
> {code:sql}
>
>     @Transactional(readOnly = true)
>
>     @Override
>
>     public Collection<A> findByKeys(Set<String> keys) {
>
>         validateKeys(keys);
>
>  
>
>         Class<A> entityClass = anyUtils().anyClass();
>
>         TypedQuery<A> query = entityManager().createQuery(
>
>                 "SELECT e FROM " + entityClass.getSimpleName()
>
>                         + " e WHERE e.id IN (:keys)",
>
>                 entityClass);
>
>         query.setParameter("keys", keys);
>
>         List<A> foundAnys = query.getResultList();
>
>  
>
>         validateAllAnysPresent(keys, foundAnys);
>
>  
>
>         // TODO: adopt to multiple keys
>
>         for (A any : foundAnys) {
>
>             securityChecks(any);
>
>         }
>
>  
>
>         return foundAnys;
>
>     }
>
> {code}
>
> Note that this is a quick implementation and method like org.apache.syncope.core.persistence.jpa.dao.AbstractAnyDAO#securityChecks can be also optimized and adopted to consume multiple entities.
>
> And we can use the new method "findByKeys" instead of the loop.
>
> After all, entities are fetched by keys in a single query we can sort the entities in Java code using the sorted keys.
>
>  
>
> I prepared the PR where I implemented quickly https://github.com/DmitriyBrashevets/syncope/pull/1:
>
> * [1st Optimization]
>
> * [4th Optimization]
>
>  
>
>  
>
> What I also want to achieve in result:
>
> 1) Understand if Apache Syncope team also sees the benefits in these optimizations and to ensure that everybody agrees with them and I haven't missed anything and everything written above is correct.
>
> 2) If everything is correct we can find other places in the code and also optimize them
>
> 3) Finally, we can measure the performance if AS team is agree with these changes. I hope that AS team already have an infrastructure with integration or e2e tests that do measure the performance.
>
>  
>
>  
>
> Thanks in advance,
>
> Dmitriy Brashevets
>

-- 
Francesco Chicchiriccò

Tirasa - Open Source Excellence
http://www.tirasa.net/

Member at The Apache Software Foundation
Syncope, Cocoon, Olingo, CXF, OpenJPA, PonyMail
http://home.apache.org/~ilgrosso/