Posted to dev@directmemory.apache.org by Christoph Engelbert <no...@apache.org> on 2013/03/24 19:13:26 UTC

WAL Implementation

Hey guys,

after a few very busy weeks at work bringing our new game to open
beta, I finally have some time to work on lovely open source stuff
again :-)

Currently I'm implementing a generic WAL (Write-Ahead Log / Journal),
in the first place for the persistence system at our company.

We collect statements in a queue and write them in a background
thread to linearize database load.
The problem with this approach is that if the DB servers are busy,
this queue can take some time to drain, and if the game servers crash
before the queue is cleared (or the background persister is killed
for whatever reason; yes, we had a bug where data wasn't written for
about 4 days), player data is lost.

The new system forces all statements to be written to disk before
being enqueued, so that journals can be replayed on game server
startup. I haven't found any ready-to-use implementation besides
those embedded in frameworks like Hadoop, databases (I guess it was
Derby), HornetQ, etc., so I started my own implementation.
I'll try to make it as generic as possible so that it isn't limited
to persistence (SQL statements) but could also be used for, say,
journaling memory access (or whatever).
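
A minimal sketch of the "write to disk before enqueueing" idea, assuming
a single journal file and length-prefixed UTF-8 statements. The class and
method names are hypothetical and not taken from the actual
implementation; the record format described later in this thread is not
used here.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: a statement is journaled durably before it is queued.
final class JournaledStatementQueue implements AutoCloseable {
    private final FileChannel journal;
    private final BlockingQueue<String> pending = new LinkedBlockingQueue<>();

    JournaledStatementQueue(Path journalFile) throws IOException {
        journal = FileChannel.open(journalFile,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND);
    }

    /** Appends the statement to the journal, forces it to disk, then enqueues it. */
    synchronized void submit(String statement) throws IOException {
        byte[] payload = statement.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length);
        buf.putInt(payload.length);
        buf.put(payload);
        buf.flip();
        while (buf.hasRemaining()) {
            journal.write(buf);
        }
        journal.force(false);   // ask the OS to flush file content to the device
        pending.add(statement); // only now may the background persister see it
    }

    /** Called by the background persister; returns null if nothing is queued. */
    String poll() {
        return pending.poll();
    }

    @Override
    public void close() throws IOException {
        journal.close();
    }
}
```

Forcing on every statement is the simplest safe choice; batching several
statements per force() call would trade some durability for throughput.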

Do you guys think it could be interesting for DM to implement
something like a WAL somewhere? Or do you have other interesting
ideas for what to do with it?

I'm looking forward to a hopefully intensive discussion. Maybe
someone else has found a WAL implementation that could be used /
analysed :-)

Chris / Noc

Re: WAL Implementation

Posted by Christoph Engelbert <no...@apache.org>.

On 24.03.2013 21:30, Jan Kotek wrote:
>
> There are two formats in MapDB; 
>
> Append-only has index size stored in memory, constructed by replay at startup.
>
> Journal (and direct) have separate index because of compaction. It traverses 
> all records and reinserts all data. This reclaims all unused disk space. 
>
> Recid (record id) is offset in file and is fixed once allocated. If index offsets 
> would be too high, I would have to keep disk space occupied to keep offset 
> valid. Leaving index table in separate file (without data) keeps offset small 
> and space reclaim possible. 
>
>> Every new journal file is set to the max filesize at creationtime
>> and is explicitly zero-filled.
>> If an entry won't fit in a standard journalfile a special
>> "full-overflow" journal file (only containing that single entry) is
>> created.
> Note:  journal==WAL in MapDB. Maybe not best terminology choice.
>
> MapDB does not have journal overflow. The user must explicitly call 'commit()'
> to open a new log file. Only a single WAL file is supported, and it is always
> replayed on commit. There is no option to have multiple not-yet-replayed logs.
> I have to keep things concurrent (fine-grained locking), and multiple logs
> would just make complexity skyrocket.

OK, I see: in theory every "transaction" is a single journal that
is replayed against the database on commit.

>> Every new journal file is set to the max filesize at creationtime
>> and is explicitly zero-filled.
> Sorry I have no time to study/comment your design

No prob ;-)

>
>> What is your exact design and what do you think is the better approach?
> Usually DBs use fixed-size pages (blocks), but this layer was removed in the
> last version to save space. Now the WAL is a sequence of 'modification
> commands'. Each says 'write long (or byteArray) at this offset'. Each operation
> (such as delete or update) is broken down into a sequence of modifications and
> written into the WAL. I keep some data in memory to keep track of modified or
> deleted records, but this is low overhead (typically 10 bytes per record).

The good thing is that I don't need to bother with DB internals,
because my WAL implementation sits in front of the normal DB stuff,
but the general design seems to be very similar. Every modification
is a single entry in the journal, with the difference that I just use
append-only.

> I have no time to discuss which approach is better. Just run some benchmarks
> and tell me if it is faster. Also, the current code is already obsolete; it
> uses a global ReadWrite lock which will soon be removed.

Well, I guess benchmarks are not everything: it needs to be fast, but
for me it also needs to be extremely safe. If you need to write
multiple files, it is not guaranteed that both are written (but I
guess a broken index can be rebuilt by crawling the journal).

>> PS: Your journal implementation is MapDB specific (at least a bit
>> because of the Serializer - but could be used yeah :))
> It depends on other classes such as Volume (ByteBuffer abstraction). But that 
> can be removed very easily. I think that code is fairly low-level and 
> portable.

Thanks for your comments, I'll take a closer look at the new version
when it's done.

Chris


Re: WAL Implementation

Posted by Jan Kotek <di...@kotek.net>.


There are two formats in MapDB; 

Append-only has index size stored in memory, constructed by replay at startup.

Journal (and direct) have a separate index because of compaction. Compaction
traverses all records and reinserts all data, which reclaims all unused disk
space.

Recid (record id) is an offset in the file and is fixed once allocated. If index
offsets were too high, I would have to keep disk space occupied to keep the
offsets valid. Leaving the index table in a separate file (without data) keeps
offsets small and makes space reclaim possible.

> Every new journal file is set to the max filesize at creationtime
> and is explicitly zero-filled.
> If an entry won't fit in a standard journalfile a special
> "full-overflow" journal file (only containing that single entry) is
> created.

Note:  journal==WAL in MapDB. Maybe not best terminology choice.

MapDB does not have journal overflow. The user must explicitly call 'commit()'
to open a new log file. Only a single WAL file is supported, and it is always
replayed on commit. There is no option to have multiple not-yet-replayed logs.
I have to keep things concurrent (fine-grained locking), and multiple logs
would just make complexity skyrocket.

> Every new journal file is set to the max filesize at creationtime
> and is explicitly zero-filled.

Sorry I have no time to study/comment your design

> What is your exact design and what do you think is the better approach?

Usually DBs use fixed-size pages (blocks), but this layer was removed in the
last version to save space. Now the WAL is a sequence of 'modification
commands'. Each says 'write long (or byteArray) at this offset'. Each operation
(such as delete or update) is broken down into a sequence of modifications and
written into the WAL. I keep some data in memory to keep track of modified or
deleted records, but this is low overhead (typically 10 bytes per record).
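
To illustrate the idea of a WAL as a sequence of 'modification commands',
here is a rough sketch. It is not MapDB code: the types and method names
are made up for illustration, and a plain ByteBuffer stands in for the
store.

```java
import java.nio.ByteBuffer;
import java.util.List;

// Hypothetical sketch, not MapDB's implementation: each command says
// "write long (or byte array) at this offset" and replay applies them in order.
final class ModificationLog {
    private ModificationLog() {}

    interface Command {
        void applyTo(ByteBuffer store);
    }

    /** Command: write an 8-byte long at the given absolute offset. */
    static Command writeLong(int offset, long value) {
        return store -> store.putLong(offset, value);
    }

    /** Command: write a byte array starting at the given absolute offset. */
    static Command writeBytes(int offset, byte[] bytes) {
        byte[] copy = bytes.clone(); // the log entry must not change afterwards
        return store -> {
            ByteBuffer view = store.duplicate(); // leave the store's position untouched
            view.position(offset);
            view.put(copy);
        };
    }

    /** Replays all commands in log order; later commands overwrite earlier ones. */
    static void replay(List<Command> log, ByteBuffer store) {
        for (Command command : log) {
            command.applyTo(store);
        }
    }
}
```

Because every command carries an absolute offset, replaying the same log
twice yields the same result, which makes replay after a crash safe to
repeat.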

I have no time to discuss which approach is better. Just run some benchmarks
and tell me if it is faster. Also, the current code is already obsolete; it
uses a global ReadWrite lock which will soon be removed.

> PS: Your journal implementation is MapDB specific (at least a bit
> because of the Serializer - but could be used yeah :))

It depends on other classes such as Volume (ByteBuffer abstraction). But that 
can be removed very easily. I think that code is fairly low-level and 
portable.

j.

On Sunday 24 March 2013 20:11:23 Christoph Engelbert wrote:
> Hey Jan
> 
> Thanks for your answer.
> 
> I just had a short look over the code and you're using a separate
> index file, aren't you? Is there any advantage?
> My current implementation is an append-only, fixed-size journal.
> This means I write as many entries to the file as fit in the given
> journal file size and then roll over to a new journal. Once all entries
> in a full journal file have been executed, the file is deleted or moved
> to an archive path.
> 
> Every new journal file is set to the max filesize at creationtime
> and is explicitly zero-filled.
> If an entry won't fit in a standard journalfile a special
> "full-overflow" journal file (only containing that single entry) is
> created.
> 
> The fileformat looks like this:
> 0x00 - 0x03    MagicHeader
> 0x04 - 0x07    Format-Version (currently 1 ;-))
> 0x08 - 0x0B    Filelength (to check if the filelength is corrupted
> by filesystem failure)
> 0x0C - 0x13    Logfile number (the number of the logfile for
> ordering multiple files while replaying)
> 0x14 - 0x14    Type of the Logfile (standard / full overflow)
> 0x15 - 0x18    Offset of the first dataset (normally 0x19 but can be
> used to inject additional properties in the header)
> 0x19 - ...         Journal records
> 
> JournalRecord (every position is calculated by record-base-offset +
> pos):
> 0x00 - 0x03    Records length (if first 4 bytes and last 4 bytes are
> equal the record isn't corrupted)
> 0x04 - 0x0B    Record ID, incrementing number
> 0x0C - 0x0C    Record type (application depending, defines type of data)
> 0x0D - 0x...     Records data
> 0x... - 0x...+4  Records length (needs to equals first four bytes of
> the record)
> 
> What is your exact design and what do you think is the better approach?
> 
> PS: Your journal implementation is MapDB specific (at least a bit
> because of the Serializer - but could be used yeah :))
> 
> Chris

Re: WAL Implementation

Posted by Christoph Engelbert <no...@apache.org>.

Hey Jan

Thanks for your answer.

I just had a short look over the code and you're using a separate
index file, aren't you? Is there any advantage?
My current implementation is an append-only, fixed-size journal.
This means I write as many entries to the file as fit in the given
journal file size and then roll over to a new journal. Once all entries
in a full journal file have been executed, the file is deleted or moved
to an archive path.

Every new journal file is set to the max file size at creation time
and is explicitly zero-filled.
If an entry won't fit in a standard journal file, a special
"full-overflow" journal file (containing only that single entry) is
created.
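
A rough sketch of the preallocation and rollover rules just described.
The class, enum and parameter names are hypothetical, and the rule for
detecting an oversized entry (larger than the space left after the
header) is an assumption.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

// Hypothetical sketch of journal file preallocation and record placement.
final class JournalFiles {
    private JournalFiles() {}

    enum Placement { CURRENT_FILE, ROLL_OVER, FULL_OVERFLOW }

    /** Creates a journal file of exactly maxSize bytes, explicitly filled with zeros. */
    static void preallocate(Path file, long maxSize) throws IOException {
        byte[] zeros = new byte[8192];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            raf.setLength(0); // start from an empty file
            long remaining = maxSize;
            while (remaining > 0) {
                int chunk = (int) Math.min(zeros.length, remaining);
                raf.write(zeros, 0, chunk);
                remaining -= chunk;
            }
            raf.getFD().sync(); // make the zero-filled blocks durable
        }
    }

    /** Decides where a record of recordSize bytes goes. */
    static Placement place(long recordSize, long writeOffset, long maxSize, long headerSize) {
        if (recordSize > maxSize - headerSize) {
            return Placement.FULL_OVERFLOW; // would not fit even in an empty standard file
        }
        if (writeOffset + recordSize > maxSize) {
            return Placement.ROLL_OVER;     // current file is too full, start a new one
        }
        return Placement.CURRENT_FILE;
    }
}
```

Writing the zeros explicitly, rather than only setting the length, means
the blocks are actually allocated up front instead of leaving a sparse
file.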

The file format looks like this:
0x00 - 0x03    Magic header
0x04 - 0x07    Format version (currently 1 ;-))
0x08 - 0x0B    File length (to check if the file length is corrupted
               by filesystem failure)
0x0C - 0x13    Logfile number (the number of the logfile for
               ordering multiple files while replaying)
0x14 - 0x14    Type of the logfile (standard / full overflow)
0x15 - 0x18    Offset of the first dataset (normally 0x19 but can be
               used to inject additional properties in the header)
0x19 - ...     Journal records
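
The header layout above could be encoded roughly like this. The field
widths follow the offsets in the table; the magic value, the type codes,
the big-endian byte order and all names are assumptions made for the
sketch.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the 25-byte journal file header.
final class JournalHeader {
    static final int MAGIC = 0x57414C31;      // assumed value ("WAL1"); none is given above
    static final int VERSION = 1;
    static final int SIZE = 0x19;             // default offset of the first record
    static final byte TYPE_STANDARD = 0;      // assumed encoding
    static final byte TYPE_FULL_OVERFLOW = 1; // assumed encoding

    final int fileLength;
    final long logNumber;
    final byte type;
    final int firstRecordOffset;

    JournalHeader(int fileLength, long logNumber, byte type, int firstRecordOffset) {
        this.fileLength = fileLength;
        this.logNumber = logNumber;
        this.type = type;
        this.firstRecordOffset = firstRecordOffset;
    }

    byte[] encode() {
        ByteBuffer buf = ByteBuffer.allocate(SIZE); // big-endian by default
        buf.putInt(MAGIC);             // 0x00 - 0x03
        buf.putInt(VERSION);           // 0x04 - 0x07
        buf.putInt(fileLength);        // 0x08 - 0x0B
        buf.putLong(logNumber);        // 0x0C - 0x13
        buf.put(type);                 // 0x14
        buf.putInt(firstRecordOffset); // 0x15 - 0x18
        return buf.array();
    }

    /** Parses a header and checks it against the real size of the file on disk. */
    static JournalHeader decode(byte[] bytes, long actualFileLength) {
        if (bytes.length < SIZE) {
            throw new IllegalArgumentException("header too short");
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        if (buf.getInt() != MAGIC) {
            throw new IllegalStateException("not a journal file");
        }
        if (buf.getInt() != VERSION) {
            throw new IllegalStateException("unsupported format version");
        }
        int fileLength = buf.getInt();
        long logNumber = buf.getLong();
        byte type = buf.get();
        int firstRecordOffset = buf.getInt();
        if (fileLength != actualFileLength) {
            throw new IllegalStateException("file length mismatch");
        }
        return new JournalHeader(fileLength, logNumber, type, firstRecordOffset);
    }
}
```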

JournalRecord (every position is calculated as record-base-offset +
pos):
0x00 - 0x03      Record length (if the first 4 bytes and the last 4 bytes
                 are equal, the record isn't corrupted)
0x04 - 0x0B      Record ID, incrementing number
0x0C - 0x0C      Record type (application dependent, defines type of data)
0x0D - 0x...     Record data
0x... - 0x...+4  Record length (needs to equal the first four bytes of
                 the record)
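
A sketch of how such a record could be written and checked. It assumes
that the length field counts the whole record including both length
fields, and that integers are big-endian; neither is stated above. The
names are hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the record layout with a leading and trailing length.
final class JournalRecordCodec {
    /** Bytes used by length (4) + record id (8) + type (1) + trailing length (4). */
    static final int OVERHEAD = 4 + 8 + 1 + 4;

    private JournalRecordCodec() {}

    static byte[] encode(long recordId, byte type, byte[] data) {
        int length = OVERHEAD + data.length; // assumption: length covers the whole record
        ByteBuffer buf = ByteBuffer.allocate(length);
        buf.putInt(length);   // 0x00 - 0x03
        buf.putLong(recordId); // 0x04 - 0x0B
        buf.put(type);        // 0x0C
        buf.put(data);        // 0x0D - ...
        buf.putInt(length);   // trailing copy of the length
        return buf.array();
    }

    /** Returns the record data at offset, or null if the record is missing or torn. */
    static byte[] readData(ByteBuffer journal, int offset) {
        int available = journal.limit() - offset;
        if (offset < 0 || available < OVERHEAD) {
            return null;
        }
        int length = journal.getInt(offset);
        if (length < OVERHEAD || length > available) {
            return null; // zero-filled space (length 0) or a corrupted length
        }
        if (journal.getInt(offset + length - 4) != length) {
            return null; // leading and trailing length differ: incomplete write
        }
        byte[] data = new byte[length - OVERHEAD];
        ByteBuffer view = journal.duplicate(); // do not disturb the caller's position
        view.position(offset + 13);
        view.get(data);
        return data;
    }
}
```

With zero-filled journal files, a leading length of 0 doubles as an
end-of-log marker during replay. Note that matching lengths only detect
incomplete writes; a checksum would be needed to catch corrupted data
inside a record.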

What is your exact design and what do you think is the better approach?

PS: Your journal implementation is MapDB specific (at least a bit
because of the Serializer - but could be used yeah :))

Chris

On 24.03.2013 19:41, Jan Kotek wrote:
> Hi,
>
> There is WAL implementation (called journal) in MapDB. It has an interesting 
> feature that modified data written into log, are not stored in memory, but can 
> be re-read directly from log. MapDB is not exactly DB, it is more like 
> persistent heap. 
>
> Here is WAL storage implementation:
> https://github.com/jankotek/MapDB/blob/master/src/main/java/org/mapdb/StorageJournaled.java
>
> There is also 'direct' (update on place) and append-only storage 
> implementation. Please note that I am currently reimplementing this store to 
> be lock-free. In couple of days this file will be completely replaced.
>
> Hope it helps.
> Jan
>

Re: WAL Implementation

Posted by Jan Kotek <di...@kotek.net>.

Hi,

There is a WAL implementation (called journal) in MapDB. It has an interesting
feature: modified data written into the log is not stored in memory, but can
be re-read directly from the log. MapDB is not exactly a DB; it is more like a
persistent heap.

Here is WAL storage implementation:
https://github.com/jankotek/MapDB/blob/master/src/main/java/org/mapdb/StorageJournaled.java

There are also 'direct' (update in place) and append-only storage
implementations. Please note that I am currently reimplementing this store to
be lock-free. In a couple of days this file will be completely replaced.

Hope it helps.
Jan


