Posted to oak-dev@jackrabbit.apache.org by Tomek Rekawek <re...@adobe.com> on 2016/08/16 10:25:24 UTC

normalising the rdb database schema

Hello,

I was wondering whether it’d make sense to normalise the RDB Document Store schema - get rid of the JSON/JSOP concatenated strings and store each key/value in a separate database row. Something like this:

id          STRING
key         STRING
revision    STRING (nullable)
value       (LONG) STRING
modcount    INTEGER

The id+key+revision would make a unique primary key.
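
To make this concrete, a rough DDL sketch could look like the one below (table and column names, types and sizes are only placeholders; since revision is nullable, a unique index would have to stand in for the composite primary key on most engines):

CREATE TABLE NODEPROPS (
  ID       VARCHAR(512) NOT NULL,   -- document id, e.g. "1:/node"
  PROPKEY  VARCHAR(255) NOT NULL,   -- property name, e.g. "_deleted"
  REVISION VARCHAR(64),             -- revision; NULL for unrevisioned keys like "_modified"
  VALUE    CLOB,                    -- property value, effectively unconstrained in length
  MODCOUNT BIGINT NOT NULL
);
-- most DBs don't accept a nullable column in a PRIMARY KEY, so a unique
-- index stands in (note: some engines won't treat NULL revisions as duplicates)
CREATE UNIQUE INDEX NODEPROPS_UK ON NODEPROPS (ID, PROPKEY, REVISION);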

So, an example node from the DocumentMK documentation [1]:

{
    "_id" : "1:/node",
    "_deleted" : {
        "r13f3875b5d1-0-1" : "false"
    },
    "_lastRev" : {
        "r0-0-1" : "r13f3875b5d1-0-1"
    },
    "_modified" : NumberLong(274208361),
    "_modCount" : NumberLong(1),
    "_children" : Boolean(true),
    "_revisions" : {
        "r13f3875b5d1-0-1" : "c"
    }
}

Would transform to the following database rows:

(id, key, revision, value, modcount)
("1:/node", "_deleted", "r13f3875b5d1-0-1", "false", 1)
("1:/node", "_lastRev", "r0-0-1", "r13f3875b5d1-0-1", 1)
("1:/node", "_modified", null, "274208361", 1)
("1:/node", "_children", null, "true", 1)
("1:/node", "_revisions", "r13f3875b5d1-0-1", "c", 1)

Creating a new document would require batching a few INSERTs. Updating a document would combine INSERTs (for the new properties) and UPDATEs (for the modified ones). Each update would end with a modcount increment for all rows related to the given document. Fetching a document would require reading all rows for a given id. I think all of these reads and writes can be done in batches, so we'd still end up with a single database call.
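
As an illustration, assuming the hypothetical NODEPROPS table sketched above, a small UpdateOp plus the subsequent fetch could translate roughly to the following batch (all values made up):

-- new revisioned property, inserted with a provisional modcount
INSERT INTO NODEPROPS (ID, PROPKEY, REVISION, VALUE, MODCOUNT)
VALUES ('1:/node', '_deleted', 'r13f3875b5d2-0-1', 'true', 1);
-- changed unrevisioned property, updated in place
UPDATE NODEPROPS SET VALUE = '274208400'
WHERE ID = '1:/node' AND PROPKEY = '_modified' AND REVISION IS NULL;
-- finally bump modcount on every row of the document (including the new one)
UPDATE NODEPROPS SET MODCOUNT = MODCOUNT + 1 WHERE ID = '1:/node';
-- fetching the document is then a single read of all its rows
SELECT PROPKEY, REVISION, VALUE, MODCOUNT FROM NODEPROPS WHERE ID = '1:/node';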

Advantages I can see here are:

* no need to parse/serialize JSON and JSOP strings (less load on the Oak instance),
* no need to periodically compact the JSOP strings,
* more granular updates become possible, so we can properly implement all the UpdateOp cases,
* we can make better use of database features, as the DBE is now aware of the document's internal structure (it's not a blob anymore). E.g. we can fetch only a few properties.

To me, such a design looks more natural and RDB-native. The schema is just a draft and I'm probably missing something, but I wanted to ask for general feedback on this approach. WDYT?

Regards,
Tomek

-- 
Tomek Rękawek | Adobe Research | www.adobe.com
rekawek@adobe.com


Re: normalising the rdb database schema

Posted by Julian Reschke <ju...@gmx.de>.
On 2016-08-16 12:25, Tomek Rekawek wrote:
> Hello,
>
> I was wondering whether it'd make sense to normalise the RDB Document Store schema - get rid of the JSON/JSOP concatenated strings and store each key/value in a separate database row. Something like this:
>
> id          STRING
> key         STRING
> revision    STRING (nullable)
> value       (LONG) STRING
> modcount    INTEGER
> ...

We need to be careful with the value field. Right now, the size of a 
property value is de facto unconstrained, and it would be risky to 
silently change that... (the same is true to some degree for the 
property names)

Best regards, Julian

Re: normalising the rdb database schema

Posted by Julian Reschke <ju...@gmx.de>.
On 2016-08-17 06:09, Chetan Mehrotra wrote:
> Hi Tomek,
>
> I like the idea of revisiting our current schema based on usage so
> far. However, a couple of points around potential issues with such a
> normalized approach:
>
> - This approach would lead to a thin and loooong table. As noted in
> [1], in a small repo of ~14 M nodes we have ~26 M properties. With
> multiple revisions (GC takes some time) this can go higher. This
> would then increase the memory requirement for the id index. Memory
> consumption increases further with an id+key+revision index. For any
> db to perform optimally the index should fit in RAM, so such a design
> would possibly reduce the max size of repository which can be
> supported (compared to the older one) for a given amount of memory.
>
> - The read for a specific id can be done in 1 remote call. But that
> would involve a select across multiple rows, which might increase
> the time taken, as it would involve 'm' index lookups and then 'm'
> reads of row data for any node having 'n' properties (m > n,
> assuming multiple revisions per property are present)

Yes, that scared me as well when I thought about the schema a long time 
ago. (Maybe it would make sense to consider something less drastic, 
keeping the JSON, but having one row per revision?)
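
(Purely as a sketch of that variant, with invented names: the JSON 
payload would stay opaque, but each revision would get its own row:)

CREATE TABLE NODES_BY_REV (
  ID       VARCHAR(512) NOT NULL,
  REVISION VARCHAR(64)  NOT NULL,
  DATA     CLOB,                    -- the JSON/JSOP payload written at that revision
  PRIMARY KEY (ID, REVISION)
);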

Another thing to think about is that with the current scheme, updating a 
document maps to a row update. With the proposed scheme, updating a 
document always requires inserting new rows. That sounds like something 
that could be very costly.

> May be we should explore the json support being introduced in multiple
> dbs.  DB2 [2], SQL Server [3], Oracle [4], Postgres [5], MySql [6].
> Problem here is that we would need DB specific implementation and also
> increases the testing effort!
> ...

That, and that the JSON support really doesn't help that much, as it 
doesn't address document updates.

Best regards, Julian

Re: normalising the rdb database schema

Posted by Julian Reschke <ju...@gmx.de>.
On 2016-08-17 18:06, Mike Nimer wrote:
> This reminds me of a system I worked on years ago that had a similar
> problem and solved it with a side table. It would store the object in
> the DB as XML (I told you it was a long time ago) and then we had a
> table on the side that stored each property in a row, if it was marked
> as indexed. The full XML object gave us quick read access. The side
> table allowed us to write complex SQL to find objects. Although I will
> say the SQL could get nasty with this table structure, lots of
> self-referencing joins.
>
> This might be crazy, but what if Oak did something similar? The current
> data could still be stored as a JSON object, like it is now, as the
> source of truth. And then an async task could be used to update the
> side table with all properties or only the indexed properties of each
> node. This could be disabled for very large implementations, or you
> could flip a flag to store all revisions or only the latest. This
> table could even be considered temporary and rebuilt as needed.
>
> With that said, the new JSON support being added to the different databases
> looks very promising and might solve the same problem.
> --mike

Yes, I once worked in a project with a similar design.

What you might miss is that we actually (currently) do not need to index 
anything additionally, as JCR/OAK indexes are handled by a layer above 
the DocumentStore - whether *that* is something that should be reviewed 
would be an interesting discussion as well.

Best regards, Julian




Re: normalising the rdb database schema

Posted by Mike Nimer <mn...@gmail.com>.
This reminds me of a system I worked on years ago that had a similar
problem and solved it with a side table. It would store the object in the
DB as XML (I told you it was a long time ago) and then we had a table on
the side that stored each property in a row, if it was marked as indexed.
The full XML object gave us quick read access. The side table allowed us to
write complex SQL to find objects. Although I will say the SQL could get
nasty with this table structure, lots of self-referencing joins.

This might be crazy, but what if Oak did something similar? The current
data could still be stored as a JSON object, like it is now, as the source
of truth. And then an async task could be used to update the side table with
all properties or only the indexed properties of each node. This could be
disabled for very large implementations, or you could flip a flag to store
all revisions or only the latest. This table could even be considered
temporary and rebuilt as needed.
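
As a rough sketch of such a side table (all names invented; the async task
would copy only the indexed, or only the latest-revision, properties into it):

CREATE TABLE NODE_SIDE_INDEX (
  ID      VARCHAR(512)  NOT NULL,   -- points back to the JSON "source of truth"
  PROPKEY VARCHAR(255)  NOT NULL,
  VALUE   VARCHAR(1024),            -- size-bounded copy of the value, enough for lookups
  PRIMARY KEY (ID, PROPKEY)
);
-- finding objects then becomes plain SQL over the side table
SELECT ID FROM NODE_SIDE_INDEX WHERE PROPKEY = 'jcr:title' AND VALUE = 'foo';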

With that said, the new JSON support being added to the different databases
looks very promising and might solve the same problem.
--mike




On Tue, Aug 16, 2016 at 11:09 PM, Chetan Mehrotra <chetan.mehrotra@gmail.com
> wrote:

> Hi Tomek,
>
> I like the idea of revisiting our current schema based on usage so
> far. However, a couple of points around potential issues with such a
> normalized approach:
>
> - This approach would lead to a thin and loooong table. As noted in
> [1], in a small repo of ~14 M nodes we have ~26 M properties. With
> multiple revisions (GC takes some time) this can go higher. This
> would then increase the memory requirement for the id index. Memory
> consumption increases further with an id+key+revision index. For any
> db to perform optimally the index should fit in RAM, so such a design
> would possibly reduce the max size of repository which can be
> supported (compared to the older one) for a given amount of memory.
>
> - The read for a specific id can be done in 1 remote call. But that
> would involve a select across multiple rows, which might increase
> the time taken, as it would involve 'm' index lookups and then 'm'
> reads of row data for any node having 'n' properties (m > n,
> assuming multiple revisions per property are present)
>
> Maybe we should explore the JSON support being introduced in multiple
> dbs: DB2 [2], SQL Server [3], Oracle [4], Postgres [5], MySQL [6].
> The problem here is that we would need DB-specific implementations,
> which also increases the testing effort!
>
> > we can make better use of database features, as the DBE is now aware
> > of the document's internal structure (it's not a blob anymore). E.g.
> > we can fetch only a few properties.
>
> In most cases the kind of properties stored in the blob part of a db
> row are read as a whole.
>
> Chetan Mehrotra
> [1] https://issues.apache.org/jira/browse/OAK-4471
> [2] http://www.ibm.com/developerworks/data/library/techarticle/dm-
> 1306nosqlforjson1/
> [3] https://msdn.microsoft.com/en-in/library/dn921897.aspx
> [4] https://docs.oracle.com/database/121/ADXDB/json.htm
> [5] https://www.postgresql.org/docs/9.3/static/functions-json.html
> [6] https://dev.mysql.com/doc/refman/5.7/en/json.html
>
>
> On Wed, Aug 17, 2016 at 7:19 AM, Michael Marth <mm...@adobe.com> wrote:
> > Hi Tomek,
> >
> > I like the idea (agree with Vikas’ comments / cautions as well).
> >
> > You are hinting at expected performance differences (maybe faster or
> slower than the current approach). That would probably be worthwhile to
> investigate in order to assess your idea.
> >
> > One more (hypothetical at this point) advantage of your approach: we
> could utilise DB-native indexes as a replacement for property indexes.
> >
> > Cheers
> > Michael
> >
> >
> >
> > On 16/08/16 07:42, "Tomek Rekawek" <re...@adobe.com> wrote:
> >
> >>Hi Vikas,
> >>
> >>thanks for the reply.
> >>
> >>> On 16 Aug 2016, at 14:38, Vikas Saurabh <vi...@gmail.com>
> wrote:
> >>
> >>> * It'd incur a very heavy migration impact on upgrade of RDB setups -
> >>> that, most probably, would translate to us having to support both
> >>> schemas. I don't feel that it'd be easy to flip the switch for existing
> >>> setups.
> >>
> >>That’s true. I think we should take a similar approach here as with the
> segment / segment-tar implementations (and we can use oak-upgrade to
> convert between them). At least for now.
> >>
> >>> * DocumentNodeStore implementation very freely touches prop:rev=value
> >>> for a given id… […] I think this would get
> >>> expensive for index (_id+propName+rev) maintenance.
> >>
> >>Indeed, probably we’ll have to analyse the indexing capabilities offered
> by different database engines more closely, choosing the one that offers
> good writing speed.
> >>
> >>Best regards,
> >>Tomek
> >>
> >>--
> >>Tomek Rękawek | Adobe Research | www.adobe.com
> >>rekawek@adobe.com
>

Re: normalising the rdb database schema

Posted by Tomek Rekawek <re...@adobe.com>.
Hi,

thanks for the feedback. With regards to the potential performance / indexing issues, I agree that having more rows in the table will certainly affect performance. On the other hand, I think that storing and querying a large number of rows is exactly the thing database engines should be optimized for (or maybe that's just wishful thinking). Anyway, I guess it's hard to tell until we have a PoC. Starting this thread I was wondering whether there's some obvious reason to stick with the current schema, and since all of the replies started with “I like the idea, but…” I think it's worth proceeding and at least trying.

I spent some time researching the JSON support in DBEs. I think such support was primarily added to allow querying JSON documents loaded into the database. E.g. Oracle / MS SQL / DB2 don't even allow modifying values stored in JSON fields. On the other hand, what we need is support for complex updates, so that we don't have to load a full document in order to apply UpdateOps to it. In the current implementation this can be done by appending the serialised diffs. If we switched to using a single JSON document per node, we wouldn't be able to update it in a single call. Instead we'd gain the ability to query the documents, which we don't really need. Because of that I'm a bit skeptical about using this feature.
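
For example, with the Postgres JSON operators from [5] below and a hypothetical layout of one JSON document per node in a DATA column, querying is straightforward, while partial in-place updates are not (at least not portably across the listed engines):

-- hypothetical NODES table with the whole document in a json column DATA
SELECT ID, DATA ->> '_modified' AS modified
FROM NODES
WHERE DATA -> '_deleted' ->> 'r13f3875b5d1-0-1' = 'false';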

Best regards,
Tomek

-- 
Tomek Rękawek | Adobe Research | www.adobe.com
rekawek@adobe.com

> On 17 Aug 2016, at 06:09, Chetan Mehrotra <ch...@gmail.com> wrote:
> 
> Hi Tomek,
> 
> I like the idea of revisiting our current schema based on usage so
> far. However, a couple of points around potential issues with such a
> normalized approach:
> 
> - This approach would lead to a thin and loooong table. As noted in
> [1], in a small repo of ~14 M nodes we have ~26 M properties. With
> multiple revisions (GC takes some time) this can go higher. This
> would then increase the memory requirement for the id index. Memory
> consumption increases further with an id+key+revision index. For any
> db to perform optimally the index should fit in RAM, so such a design
> would possibly reduce the max size of repository which can be
> supported (compared to the older one) for a given amount of memory.
> 
> - The read for a specific id can be done in 1 remote call. But that
> would involve a select across multiple rows, which might increase
> the time taken, as it would involve 'm' index lookups and then 'm'
> reads of row data for any node having 'n' properties (m > n,
> assuming multiple revisions per property are present)
> 
> Maybe we should explore the JSON support being introduced in multiple
> dbs: DB2 [2], SQL Server [3], Oracle [4], Postgres [5], MySQL [6].
> The problem here is that we would need DB-specific implementations,
> which also increases the testing effort!
> 
>> we can make better use of database features, as the DBE is now aware of the document's internal structure (it's not a blob anymore). E.g. we can fetch only a few properties.
> 
> In most cases the kind of properties stored in the blob part of a db
> row are read as a whole.
> 
> Chetan Mehrotra
> [1] https://issues.apache.org/jira/browse/OAK-4471
> [2] http://www.ibm.com/developerworks/data/library/techarticle/dm-1306nosqlforjson1/
> [3] https://msdn.microsoft.com/en-in/library/dn921897.aspx
> [4] https://docs.oracle.com/database/121/ADXDB/json.htm
> [5] https://www.postgresql.org/docs/9.3/static/functions-json.html
> [6] https://dev.mysql.com/doc/refman/5.7/en/json.html
> 
> 
> On Wed, Aug 17, 2016 at 7:19 AM, Michael Marth <mm...@adobe.com> wrote:
>> Hi Tomek,
>> 
>> I like the idea (agree with Vikas’ comments / cautions as well).
>> 
>> You are hinting at expected performance differences (maybe faster or slower than the current approach). That would probably be worthwhile to investigate in order to assess your idea.
>> 
>> One more (hypothetical at this point) advantage of your approach: we could utilise DB-native indexes as a replacement for property indexes.
>> 
>> Cheers
>> Michael
>> 
>> 
>> 
>> On 16/08/16 07:42, "Tomek Rekawek" <re...@adobe.com> wrote:
>> 
>>> Hi Vikas,
>>> 
>>> thanks for the reply.
>>> 
>>>> On 16 Aug 2016, at 14:38, Vikas Saurabh <vi...@gmail.com> wrote:
>>> 
>>>> * It'd incur a very heavy migration impact on upgrade of RDB setups -
>>>> that, most probably, would translate to us having to support both
>>>> schemas. I don't feel that it'd be easy to flip the switch for existing
>>>> setups.
>>> 
>>> That’s true. I think we should take a similar approach here as with the segment / segment-tar implementations (and we can use oak-upgrade to convert between them). At least for now.
>>> 
>>>> * DocumentNodeStore implementation very freely touches prop:rev=value
>>>> for a given id… […] I think this would get
>>>> expensive for index (_id+propName+rev) maintenance.
>>> 
>>> Indeed, probably we’ll have to analyse the indexing capabilities offered by different database engines more closely, choosing the one that offers good writing speed.
>>> 
>>> Best regards,
>>> Tomek
>>> 
>>> --
>>> Tomek Rękawek | Adobe Research | www.adobe.com
>>> rekawek@adobe.com


Re: normalising the rdb database schema

Posted by Chetan Mehrotra <ch...@gmail.com>.
Hi Tomek,

I like the idea of revisiting our current schema based on usage so
far. However, a couple of points around potential issues with such a
normalized approach:

- This approach would lead to a thin and loooong table. As noted in
[1], in a small repo of ~14 M nodes we have ~26 M properties. With
multiple revisions (GC takes some time) this can go higher. This would
then increase the memory requirement for the id index. Memory
consumption increases further with an id+key+revision index. For any
db to perform optimally the index should fit in RAM, so such a design
would possibly reduce the max size of repository which can be
supported (compared to the older one) for a given amount of memory.

- The read for a specific id can be done in 1 remote call. But that
would involve a select across multiple rows, which might increase the
time taken, as it would involve 'm' index lookups and then 'm' reads
of row data for any node having 'n' properties (m > n, assuming
multiple revisions per property are present)

Maybe we should explore the JSON support being introduced in multiple
dbs: DB2 [2], SQL Server [3], Oracle [4], Postgres [5], MySQL [6].
The problem here is that we would need DB-specific implementations,
which also increases the testing effort!

> we can make better use of database features, as the DBE is now aware of the document's internal structure (it's not a blob anymore). E.g. we can fetch only a few properties.

In most cases the kind of properties stored in the blob part of a db
row are read as a whole.

Chetan Mehrotra
[1] https://issues.apache.org/jira/browse/OAK-4471
[2] http://www.ibm.com/developerworks/data/library/techarticle/dm-1306nosqlforjson1/
[3] https://msdn.microsoft.com/en-in/library/dn921897.aspx
[4] https://docs.oracle.com/database/121/ADXDB/json.htm
[5] https://www.postgresql.org/docs/9.3/static/functions-json.html
[6] https://dev.mysql.com/doc/refman/5.7/en/json.html


On Wed, Aug 17, 2016 at 7:19 AM, Michael Marth <mm...@adobe.com> wrote:
> Hi Tomek,
>
> I like the idea (agree with Vikas’ comments / cautions as well).
>
> You are hinting at expected performance differences (maybe faster or slower than the current approach). That would probably be worthwhile to investigate in order to assess your idea.
>
> One more (hypothetical at this point) advantage of your approach: we could utilise DB-native indexes as a replacement for property indexes.
>
> Cheers
> Michael
>
>
>
> On 16/08/16 07:42, "Tomek Rekawek" <re...@adobe.com> wrote:
>
>>Hi Vikas,
>>
>>thanks for the reply.
>>
>>> On 16 Aug 2016, at 14:38, Vikas Saurabh <vi...@gmail.com> wrote:
>>
>>> * It'd incur a very heavy migration impact on upgrade of RDB setups -
>>> that, most probably, would translate to us having to support both
>>> schemas. I don't feel that it'd be easy to flip the switch for existing
>>> setups.
>>
>>That’s true. I think we should take a similar approach here as with the segment / segment-tar implementations (and we can use oak-upgrade to convert between them). At least for now.
>>
>>> * DocumentNodeStore implementation very freely touches prop:rev=value
>>> for a given id… […] I think this would get
>>> expensive for index (_id+propName+rev) maintenance.
>>
>>Indeed, probably we’ll have to analyse the indexing capabilities offered by different database engines more closely, choosing the one that offers good writing speed.
>>
>>Best regards,
>>Tomek
>>
>>--
>>Tomek Rękawek | Adobe Research | www.adobe.com
>>rekawek@adobe.com

Re: normalising the rdb database schema

Posted by Michael Marth <mm...@adobe.com>.
Hi Tomek,

I like the idea (agree with Vikas’ comments / cautions as well).

You are hinting at expected performance differences (maybe faster or slower than the current approach). That would probably be worthwhile to investigate in order to assess your idea.

One more (hypothetical at this point) advantage of your approach: we could utilise DB-native indexes as a replacement for property indexes.
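
(Just to sketch what I mean, assuming a normalized table along the lines Tomek proposed, here called NODEPROPS; a long value column would need a size-bounded companion column, called VALUE_PREFIX here, to be indexable at all:)

CREATE INDEX NODEPROPS_PROP_IDX ON NODEPROPS (PROPKEY, VALUE_PREFIX);
-- a property lookup could then be served by the DB-native index directly
SELECT ID FROM NODEPROPS WHERE PROPKEY = 'jcr:title' AND VALUE_PREFIX = 'foo';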

Cheers
Michael



On 16/08/16 07:42, "Tomek Rekawek" <re...@adobe.com> wrote:

>Hi Vikas,
>
>thanks for the reply.
>
>> On 16 Aug 2016, at 14:38, Vikas Saurabh <vi...@gmail.com> wrote:
>
>> * It'd incur a very heavy migration impact on upgrade of RDB setups -
>> that, most probably, would translate to us having to support both
>> schemas. I don't feel that it'd be easy to flip the switch for existing
>> setups.
>
>That’s true. I think we should take a similar approach here as with the segment / segment-tar implementations (and we can use oak-upgrade to convert between them). At least for now.
>
>> * DocumentNodeStore implementation very freely touches prop:rev=value
>> for a given id… […] I think this would get
>> expensive for index (_id+propName+rev) maintenance.
>
>Indeed, probably we’ll have to analyse the indexing capabilities offered by different database engines more closely, choosing the one that offers good writing speed.
>
>Best regards,
>Tomek
>
>-- 
>Tomek Rękawek | Adobe Research | www.adobe.com
>rekawek@adobe.com

Re: normalising the rdb database schema

Posted by Tomek Rekawek <re...@adobe.com>.
Hi Vikas,

thanks for the reply.

> On 16 Aug 2016, at 14:38, Vikas Saurabh <vi...@gmail.com> wrote:

> * It'd incur a very heavy migration impact on upgrade of RDB setups -
> that, most probably, would translate to us having to support both
> schemas. I don't feel that it'd be easy to flip the switch for existing
> setups.

That’s true. I think we should take a similar approach here as with the segment / segment-tar implementations (and we can use oak-upgrade to convert between them). At least for now.

> * DocumentNodeStore implementation very freely touches prop:rev=value
> for a given id… […] I think this would get
> expensive for index (_id+propName+rev) maintenance.

Indeed, probably we’ll have to analyse the indexing capabilities offered by different database engines more closely, choosing the one that offers good writing speed.

Best regards,
Tomek

-- 
Tomek Rękawek | Adobe Research | www.adobe.com
rekawek@adobe.com

Re: normalising the rdb database schema

Posted by Vikas Saurabh <vi...@gmail.com>.
Hi Tomek,

While at first glance I like the idea of normalizing the schema, there
are potential practical issues with the approach:
* It'd incur a very heavy migration impact on upgrade of RDB setups -
that, most probably, would translate to us having to support both
schemas. I don't feel that it'd be easy to flip the switch for existing
setups.
* DocumentNodeStore implementation very freely touches prop:rev=value
for a given id... i.e. it assumes there's minimal (if any) cost
involved in persisting those (a commit would tentatively set the
value and flip it to null - which in the new schema would be the same
as deleting the row - if the commit fails; see the sketch below). I
think this would get expensive for index (_id+propName+rev)
maintenance - note, in the current scheme of things a document gets
deleted only on revision gc, so the index cost is really very minimal
(well, apart from the _modified one)...
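
(A rough sketch of the kind of write churn I mean, assuming a normalized
table like the one Tomek proposed - here called NODEPROPS, with made-up
values:)

-- the commit tentatively sets the property at the commit revision...
INSERT INTO NODEPROPS (ID, PROPKEY, REVISION, VALUE, MODCOUNT)
VALUES ('1:/node', 'prop', 'r13f3875b5d2-0-1', '"some value"', 1);
-- ...and if the commit fails, flipping the value back to null amounts to
-- deleting the freshly inserted row (plus the matching index maintenance)
DELETE FROM NODEPROPS
WHERE ID = '1:/node' AND PROPKEY = 'prop' AND REVISION = 'r13f3875b5d2-0-1';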

Btw, I like the basic idea (and the advantages that you mentioned)...
just that I think we probably need to be careful if we go ahead with
this.

Thanks,
Vikas

On Tue, Aug 16, 2016 at 12:25 PM, Tomek Rekawek <re...@adobe.com> wrote:
> Hello,
>
> I was wondering whether it’d make sense to normalise the RDB Document Store schema - get rid of the JSON/JSOP concatenated strings and store each key/value in a separate database row. Something like this:
>
> id          STRING
> key         STRING
> revision    STRING (nullable)
> value       (LONG) STRING
> modcount    INTEGER
>
> The id+key+revision would make a unique primary key.
>
> So, an example node from the DocumentMK documentation [1]:
>
> {
>     "_id" : "1:/node",
>     "_deleted" : {
>         "r13f3875b5d1-0-1" : "false"
>     },
>     "_lastRev" : {
>         "r0-0-1" : "r13f3875b5d1-0-1"
>     },
>     "_modified" : NumberLong(274208361),
>     "_modCount" : NumberLong(1),
>     "_children" : Boolean(true),
>     "_revisions" : {
>         "r13f3875b5d1-0-1" : "c"
>     }
> }
>
> Would transform to the following database rows:
>
> (id, key, revision, value, modcount)
> ("1:/node", "_deleted", "r13f3875b5d1-0-1", "false", 1)
> ("1:/node", "_lastRev", "r0-0-1", "r13f3875b5d1-0-1", 1)
> ("1:/node", "_modified", null, "274208361", 1)
> ("1:/node", "_children", null, "true", 1)
> ("1:/node", "_revisions", "r13f3875b5d1-0-1", "c", 1)
>
> Creating a new document would require batching a few INSERTs. Updating a document would combine INSERTs (for the new properties) and UPDATEs (for the modified ones). Each update would end with a modcount increment for all rows related to the given document. Fetching a document would require reading all rows for a given id. I think all of these reads and writes can be done in batches, so we'd still end up with a single database call.
>
> Advantages I can see here are:
>
> * no need to parse/serialize JSON and JSOP strings (less load on the Oak instance),
> * no need to periodically compact the JSOP strings,
> * more granular updates become possible, so we can properly implement all the UpdateOp cases,
> * we can make better use of database features, as the DBE is now aware of the document's internal structure (it's not a blob anymore). E.g. we can fetch only a few properties.
>
> To me, such a design looks more natural and RDB-native. The schema is just a draft and I'm probably missing something, but I wanted to ask for general feedback on this approach. WDYT?
>
> Regards,
> Tomek
>
> --
> Tomek Rękawek | Adobe Research | www.adobe.com
> rekawek@adobe.com
>