You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@subversion.apache.org by Ivan Zhakov <iv...@visualsvn.com> on 2014/06/25 17:09:45 UTC

Numbers encoding in FSFS log addressing indexes

Subversion 1.8 and before in general uses human readable decimal
format to store numbers in FSFS repositories on disk. Log addressing
implementation on trunk introduces new encoding for storing numbers in
indexes. Quoting log addressing indexes format documentation [1]
[[[
Encoding
--------

The final index file format is tuned for space and decoding efficiency.
Indexes are stored as a sequence of variable integers.  The encoding is
as follows:

* Unsigned integers are stored in little endian order with a variable
  length 7b/8b encoding.  If most significant bit a byte has been set,
  the next byte has also belongs to the same value.

  0x00 .. 0x7f    -> 0x00 .. 0x7f               ( 7 bits stored in  8 bits)
  0x80 .. 0xff    -> 0x80 0x01 .. 0xff 0x01     (14 bits stored in 16 bits)
  0x100 .. 0x3fff -> 0x80 0x02 .. 0xff 0x7f     (14 bits stored in 16 bits)
  0x100000000     -> 0x80 0x80 0x80 0x80 0x10   (35 bits stored in 40 bits)

  Technically, we can represent integers of arbitrary lengths.  Currently,
  we only generate and parse up to 64 bits.

* Signed integers are mapped onto the unsigned value space as follows:

  x >= 0 ->  2 * x
  x < 0  -> -2 * x - 1

  Again, we can represent arbitrary length numbers that way but the code
  is currently restricted to 64 bits.

Most data is unsigned by nature but will be stored differentially using
signed integers.
]]]

I'm unhappy with choosen encoding since it's not human readable. Also
it is not so good for performance as storing 8 bytes for every number.

I think indexes should use one of the following format:
1. Use human readable decimal numbers with trailing newline: this will
   be consistent with original FSFS encoding and easier to investigate
   corruptions.

2. Just store 64-bit numbers as 8-byte in some fixed endianess (little endian
   for example). This will give us maximum performance since we get fixed
   length index records. While they still be somewhat human readable using
   HEX editors.

The current encoding is unacceptable, because it makes repository
maintenance and recovery nearly impossible.

[1] http://svn.apache.org/repos/asf/subversion/trunk/subversion/libsvn_fs_fs/structure-indexes

-- 
Ivan Zhakov
CTO | VisualSVN | http://www.visualsvn.com

Re: Numbers encoding in FSFS log addressing indexes

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Wed, Jun 25, 2014 at 8:07 PM, Daniel Shahaf <d....@daniel.shahaf.name>
wrote:

> Daniel Shahaf wrote on Wed, Jun 25, 2014 at 18:03:24 +0000:
> > Stefan Fuhrmann wrote on Wed, Jun 25, 2014 at 17:34:43 +0200:
> > > On Wed, Jun 25, 2014 at 5:09 PM, Ivan Zhakov <iv...@visualsvn.com>
> wrote:
> > > > Log addressing
> > > > implementation on trunk introduces new encoding for storing numbers
> in
> > > > indexes. Quoting log addressing indexes format documentation [1]
> > > >
> > >
> > > I'm not even sure there is documentation for our txdelta
> > > on-disk representation. So, FSFS indexes are doing a
> > > better job in that department, ATM.
> >
> > Why is this relevant to the subject at hand?  Good job for writing
> > documentation, but lack of documentation wasn't Ivan's concern.
>
> By the way, the svndiff0 format is documented in notes/svndiff.  The
> PLAIN, DELTA, and ENDREP headers are, as you know, documented in
> libsvn_fs_fs/structure.  Is anything missing from these documentations?
>

Sorry, I missed that one. I had been looking for notes/diff*
and nodes/tx* and been surprised to not find anything.

-- Stefan^2

Re: Numbers encoding in FSFS log addressing indexes

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Daniel Shahaf wrote on Wed, Jun 25, 2014 at 18:03:24 +0000:
> Stefan Fuhrmann wrote on Wed, Jun 25, 2014 at 17:34:43 +0200:
> > On Wed, Jun 25, 2014 at 5:09 PM, Ivan Zhakov <iv...@visualsvn.com> wrote:
> > > Log addressing
> > > implementation on trunk introduces new encoding for storing numbers in
> > > indexes. Quoting log addressing indexes format documentation [1]
> > >
> > 
> > I'm not even sure there is documentation for our txdelta
> > on-disk representation. So, FSFS indexes are doing a
> > better job in that department, ATM.
> 
> Why is this relevant to the subject at hand?  Good job for writing
> documentation, but lack of documentation wasn't Ivan's concern.

By the way, the svndiff0 format is documented in notes/svndiff.  The
PLAIN, DELTA, and ENDREP headers are, as you know, documented in
libsvn_fs_fs/structure.  Is anything missing from these documentations?

Daniel

Re: Numbers encoding in FSFS log addressing indexes

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Stefan Fuhrmann wrote on Wed, Jun 25, 2014 at 20:33:42 +0200:
> Why would someone quote the 7b/8b encoding scheme docs if not to use
> it *against* the current code?

To bring enough context into the discussion so that people who haven't
read the f7 design docs may participate in it.

Re: Numbers encoding in FSFS log addressing indexes

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Wed, Jun 25, 2014 at 8:03 PM, Daniel Shahaf <d....@daniel.shahaf.name>
wrote:

> Stefan Fuhrmann wrote on Wed, Jun 25, 2014 at 17:34:43 +0200:
> > On Wed, Jun 25, 2014 at 5:09 PM, Ivan Zhakov <iv...@visualsvn.com> wrote:
> >
> > > Subversion 1.8 and before in general uses human readable decimal
> > > format to store numbers in FSFS repositories on disk.
> >
> >
> > True. However, there are exceptions to that general rule.
> > The index data uses the same basic encoding as we
> > already use in txdelta. In both cases, encoding density
> > is critical I/O performance.
> >
>
> Is "density" the right word?  The density ratio between base-2⁷ encoding
> and base-10 encoding is a constant factor, is that constant significant?
>

It is. In source code repositories, the indexes are something
like 5% of the repository size - with great variation depending
on average representation delta size. A base-10 encoding
would use about 2.5x as much space (2+x for the number
plus the separating white space).

The crux is that FSFS is very random access because most
nodes get changed independently, i.e. not only their HEAD
rev differs greatly but also the location of their delta chains.
Format7 manages to eliminate a great portion of that randomness
for the cost of random index access.

To make that trade-off work, the index must be small. So, we
would not be talking about 5% extra data (and I/O) but rather
+100% addressing I/O overhead. I never actually measured
that but the principle holds. BTW, it is one of the advantages
of FSX, that it combines multiple items into a single container
allowing for much smaller indexes.


> Perhaps an ASCII hexadecimal integer would solve whatever the problems
> with ASCII decimals are that a txdelta (base-2⁷) integer solves?
>
> > For instance, if you disable deltification in the ruby repo
> > (but keeping compression active), it explodes to 9.7GB,
> > a factor of 22.8. From that it should be obvious how
> > important space efficient encoding is to Subversion.
> >
>
> What does deltification have to do with choosing between ASCII-encoding
> and svndiff-encoding of 64-bit integers?
>

The operation sequence becomes a significant contribution
to the repository size when less than 4% of your content
is left after deltification (it's actually less than 2%, the remainder
is meta data overhead). Using longer encodings may have
a significant impact on repo size. That's probably why the
current encoding was chosen way back when.

I chose this as a reference for what people deemed a good
rationale for diverging from the "human readable" rule.


> > > Log addressing
> > > implementation on trunk introduces new encoding for storing numbers in
> > > indexes. Quoting log addressing indexes format documentation [1]
> > >
> >
> > I'm not even sure there is documentation for our txdelta
> > on-disk representation. So, FSFS indexes are doing a
> > better job in that department, ATM.
>
> Why is this relevant to the subject at hand?  Good job for writing
> documentation, but lack of documentation wasn't Ivan's concern.
>

I was merely sandbagging against future "hard to maintain"
claims etc. Why would someone quote the 7b/8b encoding
scheme docs if not to use it *against* the current code?

-- Stefan^2.

Re: Numbers encoding in FSFS log addressing indexes

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Stefan Fuhrmann wrote on Wed, Jun 25, 2014 at 17:34:43 +0200:
> On Wed, Jun 25, 2014 at 5:09 PM, Ivan Zhakov <iv...@visualsvn.com> wrote:
> 
> > Subversion 1.8 and before in general uses human readable decimal
> > format to store numbers in FSFS repositories on disk.
> 
> 
> True. However, there are exceptions to that general rule.
> The index data uses the same basic encoding as we
> already use in txdelta. In both cases, encoding density
> is critical I/O performance.
> 

Is "density" the right word?  The density ratio between base-2⁷ encoding
and base-10 encoding is a constant factor, is that constant significant?

Perhaps an ASCII hexadecimal integer would solve whatever the problems
with ASCII decimals are that a txdelta (base-2⁷) integer solves?

> For instance, if you disable deltification in the ruby repo
> (but keeping compression active), it explodes to 9.7GB,
> a factor of 22.8. From that it should be obvious how
> important space efficient encoding is to Subversion.
> 

What does deltification have to do with choosing between ASCII-encoding
and svndiff-encoding of 64-bit integers?

> 
> > Log addressing
> > implementation on trunk introduces new encoding for storing numbers in
> > indexes. Quoting log addressing indexes format documentation [1]
> >
> 
> I'm not even sure there is documentation for our txdelta
> on-disk representation. So, FSFS indexes are doing a
> better job in that department, ATM.

Why is this relevant to the subject at hand?  Good job for writing
documentation, but lack of documentation wasn't Ivan's concern.

Cheers,

Daniel

Re: Numbers encoding in FSFS log addressing indexes

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Wed, Jun 25, 2014 at 5:09 PM, Ivan Zhakov <iv...@visualsvn.com> wrote:

> Subversion 1.8 and before in general uses human readable decimal
> format to store numbers in FSFS repositories on disk.


True. However, there are exceptions to that general rule.
The index data uses the same basic encoding as we
already use in txdelta. In both cases, encoding density
is critical I/O performance.

For instance, if you disable deltification in the ruby repo
(but keeping compression active), it explodes to 9.7GB,
a factor of 22.8. From that it should be obvious how
important space efficient encoding is to Subversion.


> Log addressing
> implementation on trunk introduces new encoding for storing numbers in
> indexes. Quoting log addressing indexes format documentation [1]
>

I'm not even sure there is documentation for our txdelta
on-disk representation. So, FSFS indexes are doing a
better job in that department, ATM.

-- Stefan^2.