Posted to dev@syncope.apache.org by Francesco Chicchiriccò <il...@apache.org> on 2018/10/16 08:54:08 UTC

[DISCUSS] Manage millions of identities

Hi all,
I think it's time to discuss how we want to get prepared for 
scenarios where the number of identities to manage (users, for the vast 
majority) is considerably high - from 1 million upwards; the 
typical case being CIAM (Customer IAM).

In the IdM deliveries I've been involved in so far, scaling Apache Syncope 
up to hundreds of thousands of identities is not trivial, but doable: 
naturally, most of the optimization work must be done at the DBMS level, as 
that is obviously the component under the most stress.

I think we can agree that, in such scenarios, the most 
critical data are the ones bound to the actual identities (hence not 
connectors, resources, tasks, reports or any other configuration): 
consider that with 1 million users and 10 attributes per user, we 
have the following table sizing to deal with:

* SyncopeUser: 1M rows
* UPlainAttr: 10M rows
* UPlainAttrValue: 10M rows
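
For illustration, the sizing above is just the EAV (entity-attribute-value) layout's multiplication at work; a quick sketch (illustrative arithmetic only, single-valued attributes assumed):

```python
def table_sizing(users: int, attrs_per_user: int) -> dict:
    """Row counts of the EAV-style user tables for a given population."""
    return {
        "SyncopeUser": users,                       # one row per user
        "UPlainAttr": users * attrs_per_user,       # one row per attribute
        "UPlainAttrValue": users * attrs_per_user,  # one row per (single) value
    }

print(table_sizing(1_000_000, 10))
# {'SyncopeUser': 1000000, 'UPlainAttr': 10000000, 'UPlainAttrValue': 10000000}
```

Multi-valued attributes would push UPlainAttrValue even higher.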

Moreover, the search views [1] are all of the same order of magnitude 
(although one can enable the Elasticsearch extension in such cases, to 
improve performance).

This is what I think we need to change in order to get better results.
So far, I have been able to think of a couple of possibilities:

1. Leverage the JSON column support provided by PostgreSQL [2], MySQL 
[3], SQL Server [4] and Oracle DB [5] to extend the current 
OpenJPA-based persistence layer

Pros:
  * reduce the sizing problems by removing the need for the UPlainAttr and 
UPlainAttrValue tables, search views and joins
  * limited implementation effort, as most of the current JPA layer can 
be retained
  * keep enjoying the benefits of referential integrity and other 
constraints enforced by the DBMS (including UNIQUE)

Cons:
  * each DBMS provides JSON support in its own fashion: the implementation 
wouldn't be trivial (although we can make it incremental, and add support 
for one DBMS at a time)
  * scaling capabilities and performance might be overrated - even 
though there seem to be very promising references, at least for PostgreSQL 
[6][7]
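
To make the data-model change behind option 1 concrete, here is a minimal sketch (not Syncope code; the row layout and names are made up for illustration) of how the per-user EAV rows collapse into a single JSON document stored in one column - which is exactly what removes the UPlainAttr / UPlainAttrValue tables and joins:

```python
import json

# Hypothetical EAV rows (user key, schema name, value): one row per
# attribute value, mirroring the UPlainAttr / UPlainAttrValue layout.
EAV_ROWS = [
    ("u1", "email", "jdoe@example.com"),
    ("u1", "firstname", "John"),
    ("u1", "phone", "+1-555-0100"),
    ("u1", "phone", "+1-555-0101"),  # multi-valued attribute
]

def to_json_document(rows):
    """Collapse all EAV rows of one user into a single JSON document,
    i.e. what would live in a single JSONB column on SyncopeUser."""
    doc = {}
    for _user, schema, value in rows:
        doc.setdefault(schema, []).append(value)
    return json.dumps(doc, sort_keys=True)

print(to_json_document(EAV_ROWS))
# {"email": ["jdoe@example.com"], "firstname": ["John"], "phone": ["+1-555-0100", "+1-555-0101"]}
```

On the DBMS side, attribute searches would then become containment queries on that column - e.g. on PostgreSQL, `plainAttrs @> '{"email": ["jdoe@example.com"]}'` backed by a GIN index (PostgreSQL-specific syntax; other DBMSs expose different JSON operators, which is the portability con above).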

2. Implement a new persistence layer based on a different technology - I 
have done some experiments with Apache Cassandra [8] and the Datastax 
Java Driver [9]

Pros:
  * built natively for scalability and high availability
  * proven and widespread adoption
  * the Object Mapper [10] allows semi-transparent conversion between 
query results and domain objects, somewhat similar to JPA's EntityManager

Cons:
  * relations are obviously not available, only custom types [11]: the 
persistence model would have to be redesigned to cope with this
  * constraints are not available - most notably UNIQUE, which will 
require additional handling in code
  * implementation effort: the whole persistence layer would have to be 
redone, not only identity-related entities such as User, UPlainAttr, 
UPlainAttrValue...
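
To illustrate the UNIQUE con above: without DBMS-enforced constraints, the application itself must guarantee an atomic check-and-claim for unique attribute values. Here is a toy in-memory sketch of that responsibility (illustrative only, not Cassandra code; on Cassandra this would typically map to a lightweight transaction, i.e. `INSERT ... IF NOT EXISTS`):

```python
import threading

class UniqueRegistry:
    """Toy stand-in for application-level uniqueness handling on a store
    without UNIQUE constraints: emulates a compare-and-set on (schema, value)."""

    def __init__(self):
        self._values = {}            # (schema, value) -> owning user key
        self._lock = threading.Lock()

    def claim(self, schema: str, value: str, owner: str) -> bool:
        """Atomically claim a unique value; False if already taken by anyone."""
        with self._lock:
            if (schema, value) in self._values:
                return False
            self._values[(schema, value)] = owner
            return True

reg = UniqueRegistry()
assert reg.claim("email", "jdoe@example.com", "u1") is True
assert reg.claim("email", "jdoe@example.com", "u2") is False  # duplicate rejected
```

In a distributed setting the lock above does not exist, which is precisely why the extra code (and its failure modes) is listed as a con.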

Besides the two above, there are of course other options in the NoSQL 
world (Neo4j, MongoDB, ...), but I am afraid they all present challenges 
similar to Cassandra's.

WDYT?
Regards.

[1] 
https://github.com/apache/syncope/blob/master/core/persistence-jpa/src/main/resources/views.xml#L50-L94
[2] https://www.postgresql.org/docs/10/static/functions-json.html
[3] https://dev.mysql.com/doc/refman/8.0/en/json.html
[4] 
https://docs.microsoft.com/en-us/sql/relational-databases/json/json-data-sql-server?view=sql-server-2017
[5] https://docs.oracle.com/database/121/ADXDB/json.htm#ADXDB6246
[6] 
https://www.postgresql.eu/events/fosdem2018/sessions/session/1691/slides/63/High-Performance%20JSON_%20PostgreSQL%20Vs.%20MongoDB.pdf
[7] 
http://coussej.github.io/2016/01/14/Replacing-EAV-with-JSONB-in-PostgreSQL/
[8] http://cassandra.apache.org/
[9] https://github.com/datastax/java-driver
[10] 
https://docs.datastax.com/en/developer/java-driver/3.5/manual/object_mapper/
[11] 
http://cassandra.apache.org/doc/latest/cql/types.html?highlight=user%20defined%20types#user-defined-types

-- 
Francesco Chicchiriccò

Tirasa - Open Source Excellence
http://www.tirasa.net/

Member at The Apache Software Foundation
Syncope, Cocoon, Olingo, CXF, OpenJPA, PonyMail
http://home.apache.org/~ilgrosso/


Re: [DISCUSS] Manage millions of identities

Posted by Francesco Chicchiriccò <il...@apache.org>.
On 26/11/18 08:29, Francesco Chicchiriccò wrote:
> [...]
>
> FYI I am going to blog about the results obtained so far, which are 
> definitely good to me, and show that Syncope is now equipped with an 
> engine capable of gracefully handling (at least) 1 million users.

Here we go: 
http://blog.tirasa.net/benchmarking-apache-syncope-on-postgresql.html

Regards.

> [1] https://github.com/Tirasa/syncoperf
> [2] https://tirasa.github.io/syncoperf/




Re: [DISCUSS] Manage millions of identities

Posted by Francesco Chicchiriccò <il...@apache.org>.
On 07/11/18 13:48, Francesco Chicchiriccò wrote:
> [...]
>
> That's next step.

Hi all,
here's the status update:

* PostgreSQL JSONB support was successfully built - see SYNCOPE-1395
* MySQL JSON support (yet to come) likely requires MySQL 8 - see 
SYNCOPE-1401
* I have developed a performance test suite [1] (which can be run against 
either the "standard" or the "JSON" flavor) and reported the results 
obtained with PostgreSQL [2], up to 1 million users

FYI I am going to blog about the results obtained so far, which are 
definitely good to me, and show that Syncope is now equipped with an 
engine capable of gracefully handling (at least) 1 million users.

Regards.

[1] https://github.com/Tirasa/syncoperf
[2] https://tirasa.github.io/syncoperf/



Re: [DISCUSS] Manage millions of identities

Posted by Francesco Chicchiriccò <il...@apache.org>.
On 29/10/18 11:27, Francesco Chicchiriccò wrote:
> [...]
>
> I am currently in the middle of a spike which leverages PostgreSQL's 
> JSONB data type to replace *PlainAttr / *PlainAttrValue, and I am at 
> around 90%, feature-wise.

https://issues.apache.org/jira/browse/SYNCOPE-1395

> After that, I would also like to add a new module to the sources, with 
> the purpose of running performance tests with JMeter support: in this way 
> we will be able to effectively check the numbers of the available 
> implementations.

That's next step.

Regards.



Re: [DISCUSS] Manage millions of identities

Posted by Francesco Chicchiriccò <il...@apache.org>.
Hi Guido,

On 24/10/18 20:51, Guido Wimmel wrote:
> Hi,
>
> On 16/10/18 10:54, Francesco Chicchiriccò wrote:
>> [...]
>
> I'd expect it should be possible to make the current relational model 
> work for hundreds of thousands - millions of identities. This should 
> not be too much data for enterprise-grade databases like Oracle or 
> PostgreSQL.
> We have a deployment with approx. a million identities (however, we 
> mostly use basic features of Syncope, and had to do some tweaking on 
> the search queries).
> Maybe one could document the required optimizations / partially 
> integrate them into Syncope? (possibly additional indexes / optimized 
> queries / ...)

First of all, this is an interesting confirmation: (1) the current model 
can handle (in your experience) "hundreds of thousands - millions of 
identities" and (2) you have a deployment with approx. a million identities.

Documenting the required optimizations, or integrating something into 
Syncope, is definitely worthwhile: could you share these somehow, even as 
descriptions in a JIRA improvement issue?

> For even larger numbers, I'd find both suggestions interesting. I 
> think one would have to do spikes in order to evaluate the performance 
> gain for large numbers of identities for different functionalities. 
> Maybe one could even support both, so that users could choose 
> according to their requirements / risk tolerance.

I am currently in the middle of a spike which leverages PostgreSQL's 
JSONB data type to replace *PlainAttr / *PlainAttrValue, and I am at 
around 90%, feature-wise.
After that, I would also like to add a new module to the sources, with 
the purpose of running performance tests with JMeter support: in this way 
we will be able to effectively check the numbers of the available 
implementations.

Anyway, given how the code is structured from Syncope 2.0 
onwards, we are simply providing different implementations of the 
interfaces in syncope-core-persistence-api:

* syncope-core-persistence-jpa is the current implementation
* syncope-core-persistence-jpa-pgjsonb could be the name of the one I am 
working on (which is actually an extension of the former)
* syncope-core-persistence-jpa-mysqljson could follow the same approach, 
but for MySQL's JSON data type
* syncope-core-persistence-cassandra, syncope-core-persistence-mongodb, 
syncope-core-persistence-whatever could be provided at any point in time

This confirms that, in response to your suggestion and to concerns 
raised in other e-mails of this thread, there is always room to provide 
new implementations supporting virtually any persistence technology; and 
that users will be free to choose one based on their needs, by simply 
selecting the correct Maven dependency to include in their own projects.
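
To make the selection mechanism concrete, a sketch of what a deployer's pom.xml could look like (the pgjsonb artifact below is the hypothetical name from the list above, not a published module; only syncope-core-persistence-jpa exists today):

```xml
<!-- Current persistence implementation -->
<dependency>
  <groupId>org.apache.syncope.core</groupId>
  <artifactId>syncope-core-persistence-jpa</artifactId>
</dependency>

<!-- Hypothetical PostgreSQL JSONB flavor: swap the artifactId only -->
<!--
<dependency>
  <groupId>org.apache.syncope.core</groupId>
  <artifactId>syncope-core-persistence-jpa-pgjsonb</artifactId>
</dependency>
-->
```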

Regards.




Re: [DISCUSS] Manage millions of identities

Posted by Guido Wimmel <gu...@gmx.net>.
Hi,

On 16/10/18 10:54, Francesco Chicchiriccò wrote:
> [...]
>
> WDYT?
> Regards.

I'd expect it to be possible to make the current relational model 
work for hundreds of thousands to millions of identities. This should not 
be too much data for enterprise-grade databases like Oracle or PostgreSQL.
We have a deployment with approx. a million identities (however, we 
mostly use basic features of Syncope, and had to do some tweaking on the 
search queries).
Maybe one could document the required optimizations / partially 
integrate them into Syncope? (possibly additional indexes / optimized 
queries / ...)

For even larger numbers, I'd find both suggestions interesting. I think 
one would have to do spikes in order to evaluate the performance gain 
for large numbers of identities for different functionalities. Maybe one 
could even support both, so that users could choose according to their 
requirements / risk tolerance.

Cheers,
   Guido


Re: [DISCUSS] Manage millions of identities

Posted by Massimiliano Perrone <ma...@tirasa.net>.

On 16/10/2018 10:54, Francesco Chicchiriccò wrote:
> [...]
>
> WDYT?

Hi,
IMHO the better choice is the latter one.

More difficult, but definitive.

BR
Massi


-- 
Massimiliano Perrone
Tel +39 393 9121310
Tirasa S.r.l.
http://www.tirasa.net
"L'apprendere molte cose non insegna l'intelligenza"
(Eraclito)


Re: [DISCUSS] Manage millions of identities

Posted by Andrea Patricelli <an...@apache.org>.
Hi,

On 16/10/2018 10:54, Francesco Chicchiriccò wrote:
> [...]
>
> WDYT?

I would tend towards *solution 1*: since the relational / SQL paradigm is 
still widespread, I think it is necessary to support millions of 
entities on a relational database as well.

Nevertheless, I would also put some effort into doing (at least) some advanced 
spikes with the most widely used NoSQL technologies, like Apache 
Cassandra, MongoDB, Apache CouchDB(?), Neo4j.
They could, maybe, be the best solution for larger environments. About 
relations and constraints: in my (very little) experience with NoSQL 
technologies (mainly Elasticsearch), I have found that these very nice 
features of the relational paradigm are often added or supported because 
they are highly requested by users. I am referring to [12] [13] [14] [15]. 
That said, I share your doubts about moving to a new NoSQL (and less 
tested) persistence layer.

[12] https://docs.mongodb.com/manual/applications/data-models-relationships/
[13] 
https://www.elastic.co/guide/en/elasticsearch/guide/current/relations.html
[14] https://docs.mongodb.com/manual/core/index-unique/
[15] 
https://medium.com/@mustwin/cassandra-from-a-relational-world-7bbdb0a9f1d

-- 
Dott. Andrea Patricelli
Tel. +39 3204524292

Engineer @ Tirasa S.r.l.
Viale Vittoria Colonna 97 - 65127 Pescara
Tel +39 0859116307 / FAX +39 0859111173
http://www.tirasa.net

Apache Syncope PMC Member