Posted to users@subversion.apache.org by Daniel Shahaf <d....@daniel.shahaf.name> on 2022/11/11 14:09:07 UTC

Re: Triage recovery of damaged Subversion repo

Michael K wrote on Fri, Oct 28, 2022 at 17:25:19 -0500:
> I am working on an important Subversion repository that was hit by a
> targeted ransomware attack. Apparently the backups were deleted securely as
> well, though there is a backup from a few years back that was unaffected in
> different storage. In brief, the ransomware encrypted and overwrote (up to)
> the first 4 KB of data and also added some encrypted data and zero-padding
> to the end of every file. Since Subversion has many small files, the data
> has been slashed up badly and some is gone forever. But files larger than 4
> KB have original data remaining.
> 
> My goal is to build a working repository with as much of the original data
> that is remaining as I can, like a triage operation. I have a backup that
> was not affected, but it does not contain the last few years of data. I
> need to utilize the data that is affected by ransomware encryption.
> 
> Eventually I plan to write a program that will work over all the affected
> revs and revprops files required and output new files. I'm coming at this
> without previous knowledge of the inner workings of Subversion, but I am
> comfortable working in a hex editor and writing programs that process raw
> data. So for now, I have been learning about Subversion from reading the
> documentation and while working hands-on with the raw data of these files
> in a hex editor. I've learned a bit about the "representations" within the
> revs files. That will probably be helpful since those provide units that
> each revs file can be broken down into. I can use that knowledge to try
> keeping full "representations" and discard partial ones.
> 

Yes, rev files have quite a bit of internal structure: reps, node-rev
headers, changed-paths, P2L/L2P, final line.  These are generally easy
to parse out of surrounding contexts (revprop files use counted-length
strings, reps have their header and "ENDREP" trailer, L2P-INDEX and
P2L-INDEX know their own length and have ASCII before and after them,
and everything else is ASCII in specific formats).

Similarly, it should be easy to recognize where the appended cryptogram
and padding start, since the part from L2P-INDEX to the last line is
distinctive and self-checksummed.
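
As a rough illustration of using the final line to find where the original
file ends, the sketch below walks backwards through a damaged file until it
finds something that looks like a footer. It assumes the final line has the
layout "<l2p offset> <l2p md5> <p2l offset> <p2l md5>" followed by one byte
holding the footer's length; that layout should be checked against
subversion/libsvn_fs_fs/structure before relying on it. find_true_end is a
hypothetical helper name, not part of Subversion.

```python
import re

# Assumed footer layout (verify against libsvn_fs_fs/structure):
#   b"<l2p offset> <l2p md5> <p2l offset> <p2l md5>" + one length byte
FOOTER_RE = re.compile(rb"(\d+) ([0-9a-f]{32}) (\d+) ([0-9a-f]{32})")


def find_true_end(data: bytes):
    """Return the length of the original rev file, or None if no footer is found."""
    # Walk backwards so the appended cryptogram and zero padding are skipped first.
    for pos in range(len(data) - 1, 0, -1):
        n = data[pos]                      # candidate footer-length byte
        if n == 0 or n > pos:
            continue
        m = FOOTER_RE.fullmatch(data[pos - n:pos])
        if not m:
            continue
        l2p, p2l = int(m.group(1)), int(m.group(3))
        # Cross-check: the offsets must point at the index magic strings.
        if (data[l2p:l2p + 10] == b"L2P-INDEX\n"
                and data[p2l:p2l + 10] == b"P2L-INDEX\n"):
            return pos + 1
    return None
```

The sketch does not verify the MD5 values themselves; doing so would be a
further check that the candidate really is the footer. It also gives up on
files whose index sections fall inside the overwritten first 4 KB.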

I don't know by heart what elements will be serialized into the first
4KB of a rev file in logical addressing mode.  (By the way, it's worth
looking up in the implementation what physical order it writes the items
to the file in.  Chances are this wasn't left to chance.)  What you
might find there is:

- File reps.

  A rep is a compressed [see fsfs.conf, no relation to "self-compressed"
  in the sense of having no base] svndelta [see notes/svndiff], whose
  base, if there is one, might or might not be the preceding revision of
  the node [see notes/skip-deltas and fsfs.conf] [note: this means it's
  possible for rN+M of a file to be recoverable even if rN's rep is lost].

  In principle, you can even dive down this rabbit hole of abstractions to
  recover data from the surviving tail ends of partially-overwritten reps.

- Dir reps.  These are like file reps but the content of the file is
  an svn_hash_write2() hash mapping basenames to node-rev id's.  IIRC,
  the hashes are dumped in sorted order and the node-rev id's are also
  fairly predictable, and in any case they are repeated in the node-rev
  headers of the directory entries.  It might even be possible to
  reconstruct an overwritten dir rep from the remainder of the rev file.

- Node-rev headers.  Parts of these are predictable (e.g., the "pred:"
  value), or can be regenerated (e.g., the checksums), or inferred from
  other parts of the rev file (e.g., "type: dir" can easily be guessed
  if you still have the rep itself).

- Changed-paths.  That's just an index/cache, IIRC, of information
  derivable from the remainder of the file.
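
The counted-length format used by revprop files and dir reps can be read with
very little code. The sketch below assumes the usual svn_hash_write2() dump
layout of "K <len>", key, "V <len>", value, repeated and terminated by "END";
parse_hash_dump is a hypothetical helper name, and the sample values in the
usage note are made up.

```python
def parse_hash_dump(buf: bytes, pos: int = 0):
    """Parse 'K <len>\\n<key>\\nV <len>\\n<value>\\n' pairs up to 'END\\n'.

    Returns (entries, offset just past the END line). Raises ValueError
    on truncated or malformed input.
    """
    entries = {}
    while not buf.startswith(b"END\n", pos):
        fields = []
        for tag in (b"K ", b"V "):
            if not buf.startswith(tag, pos):
                raise ValueError(f"expected {tag!r} at offset {pos}")
            eol = buf.index(b"\n", pos)          # end of the "K <len>" line
            length = int(buf[pos + 2:eol])
            start = eol + 1
            fields.append(buf[start:start + length])
            pos = start + length + 1             # skip data and its newline
        entries[fields[0]] = fields[1]
    return entries, pos + len(b"END\n")
```

For example, parse_hash_dump(b"K 10\nsvn:author\nV 5\nalice\nEND\n") yields
a one-entry dict mapping b"svn:author" to b"alice". Because every string
carries its own length, a dump that starts inside the damaged region fails
loudly rather than silently producing garbage.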

> Currently, I am trying to add a single new empty revision that Subversion
> will accept after testing with the "svnadmin verify" and "svn info"
> commands. I fabricated data for a revprops file on this new revision, I
> adjusted the "current" file to the new revision number, and I'm working on
> the revs file. If I can achieve that, I'll move on to adding a new revision
> that contains some original data.
> 

I assume you mean this:

[[[
echo Hello world > foo
svnadmin create r
svnmucc -U file://$(pwd)/r -mm put foo iota   # 'svn import' would do the trick too
xxd < r/db/revs/1
]]]

Why would you need to /manually/ create a rev file with original data?
You can use 'svn commit' to create rev files (on top of the old, good
backup).  I'd have thought you'd focus on trying to extract data from
the partially-corrupted rev files (e.g., reconstruct the fulltexts of
reps where it's possible to do so).

Anyway, regarding creating rev files:

The rev files you get by default have bells and whistles turned on.  For
instance, they use DELTA and self-DELTA reps even though it's a lot
easier to fabricate a PLAIN rep, and you can use PLAIN anywhere you can
use DELTA.

For this reason, I'd recommend to try to create a 1.1-era rev file
first.  Pass «--compatible-version=1.1 --fs-type=fsfs» to «svnadmin
create» above.  (Subversion 1.1's FSFS is the oldest FSFS there is; see
`svnadmin info`.)
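
To show why a PLAIN rep is easy to fabricate, here is a minimal sketch. It
assumes a PLAIN rep is just the header line "PLAIN", the fulltext (the
complete content of that version of the file, as opposed to a delta against
some base), and the "ENDREP" trailer. make_plain_rep is a hypothetical helper
name; the length and MD5 it returns are the kind of values a node-rev header
would need, but the exact header fields should be taken from
subversion/libsvn_fs_fs/structure.

```python
import hashlib


def make_plain_rep(fulltext: bytes):
    """Wrap a fulltext in a PLAIN rep.

    Returns (rep bytes, fulltext length, MD5 hex digest of the fulltext).
    """
    rep = b"PLAIN\n" + fulltext + b"ENDREP\n"
    return rep, len(fulltext), hashlib.md5(fulltext).hexdigest()
```

Comparing the output for a small input against the rev file produced by the
svnmucc recipe above, in a repository created with the 1.1 compatibility
option, is a cheap way to confirm or refute these assumptions.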

Word of warning: when you test things, do NOT test with the r0 rev file.
The C code hard-codes the assumption that r0 is empty.

> I've learned about the footer of the revs files as I've come across errors
> when trying those commands. I know how the L2P_OFFSET and P2L_OFFSET work
> and I have remedied the errors when those offsets are incorrect. I also
> discovered some kind of item indexes from logical addressing (I think, not
> sure what they are called) which occur right after both "L2P_OFFSET" and
> "P2L_OFFSET" in the revs files.

Do you mean "L2P-INDEX" and "P2L-INDEX"?

>                                 By looking at many files, I figured out how
> to calculate the binary representation for that based on the rev number
> (strange calculation).

The checksum in the final line is just MD5.

>                        That got me past the error such as - "svn: E160054:
> Index rev / pack file revision numbers do not match" - from the svn info
> command.
> 
> And now I'm trying to get past the "L2P index checksum mismatch" error. I
> don't know yet how the "actual" checksum value is calculated. Thankfully
> Subversion's error message shows both the "expected" and "actual"
> checksums. So I've tried taking an MD5 hash on byte ranges of the L2P-INDEX
> area (and variations), but haven't gotten a match to that "actual" value
> yet.
> 
> If you could provide insight to where these 2 checksums come from, I'd be
> really grateful.

I think you're looking for the modified FNV-1A in structure-indexes
(which I suspect is svn_checksum_fnv1a_32), but anyway, try setting the
checksum's value to all-zeroes: by convention, such a checksum is
considered equal to everything in checksum comparisons.  You might even
be able to use «svnfsfs load-index» for that (after removing the
appended data or adjusting svnfsfs's source).

The on-disk format is documented in subversion/libsvn_fs_fs/structure
(grep for "logical").
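
The all-zeroes suggestion can be sketched as follows. The code assumes the
final line has the layout "<l2p offset> <l2p md5> <p2l offset> <p2l md5>"
followed by a single byte holding the footer's length, which should be
confirmed against the structure file; build_footer is a hypothetical helper
name and the offsets in the usage note are invented.

```python
ZERO_MD5 = "0" * 32   # all-zero digest, treated as matching anything per the advice above


def build_footer(l2p_offset: int, p2l_offset: int,
                 l2p_md5: str = ZERO_MD5, p2l_md5: str = ZERO_MD5) -> bytes:
    """Return the footer text followed by its one-byte length."""
    footer = f"{l2p_offset} {l2p_md5} {p2l_offset} {p2l_md5}".encode("ascii")
    if len(footer) > 255:
        raise ValueError("footer too long for a one-byte length")
    return footer + bytes([len(footer)])
```

For instance, build_footer(4500, 4620) produces a footer with both checksums
zeroed, to be appended after the P2L-INDEX data of a fabricated rev file.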

You can sidestep the entire L2P/P2L fabrication step by using physical
addressing.  The C code as it stands makes use_log_addressing
a per-fs-instance knob rather than a per-rev-file one, but for your
purposes you can patch the C sources to pretend ffd->use_log_addressing
were FALSE for a specific fs instance and revnum range (the revnums
whose rev files you'll be fabricating).  svn_fs_fs__item_offset() seems
a relevant callsite.

> Also, any other general thoughts on this project would be appreciated.

Enable post-commit email notifications with diffs?

Cheers,

Daniel

Re: Triage recovery of damaged Subversion repo

Posted by Michael K <mk...@gmail.com>.
Daniel, thanks for your reply! It is greatly appreciated.

"Yes, rev files have quite a bit of internal structure: reps, node-rev
headers, changed-paths, P2L/L2P, final line. These are generally easy
to parse out of surrounding contexts (revprop files use counted-length
strings, reps have their header and "ENDREP" trailer, L2P-INDEX and
P2L-INDEX know their own length and have ASCII before and after them,
and everything else is ASCII in specific formats)."

I have frequently looked over the documentation at
https://svn.apache.org/repos/asf/subversion/trunk/subversion/libsvn_fs_fs/structure.


I can definitely recognize the border between reps when I see an "ENDREP",
0A (newline), and then a "DELTA SVN". But then there are those that have
significant data between "ENDREP" and "DELTA SVN" and I don't understand
what is going on there yet. I don't know how I would split those if needed.
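
A minimal sketch of that boundary search, under stated assumptions: a rep
starts with a header line that is either "PLAIN", "DELTA", or "DELTA"
followed by three decimal numbers, and it ends with "ENDREP" and a newline.
Whatever lies between one "ENDREP" and the next header is not part of either
rep; presumably it is one of the other items rev files contain, such as
node-rev headers or changed-paths. find_candidate_reps is a hypothetical
name, and because rep bodies are binary, a header or trailer string can
appear inside one by coincidence, so the results are only candidates.

```python
import re

# Assumed rep header forms: "PLAIN\n", "DELTA\n", "DELTA <n> <n> <n>\n"
REP_HEADER = re.compile(rb"(?:PLAIN|DELTA(?: \d+ \d+ \d+)?)\n")
REP_TRAILER = b"ENDREP\n"


def find_candidate_reps(data: bytes, damaged_prefix: int = 4096):
    """Yield (start, end) byte ranges of reps lying wholly after the damaged prefix."""
    pos = damaged_prefix
    while True:
        m = REP_HEADER.search(data, pos)
        if not m:
            return
        end = data.find(REP_TRAILER, m.end())
        if end < 0:
            return                      # header without a trailer: stop scanning
        end += len(REP_TRAILER)
        yield m.start(), end
        pos = end                       # bytes up to the next header are skipped
```

The lengths recorded in surviving node-rev headers would be a more reliable
way to confirm each candidate than string matching alone.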

"Similarly, it should be easy to recognize where the appended cryptogram
and padding start, since the part from L2P-INDEX to the last line is
distinctive and self-checksummed."

Yes, I have been able to remove the trailing data that the ransomware added
at the end of files. I made a program to process all files and it does that
fine.

"I don't know by heart what elements will be serialized into the first
4KB of a rev file in logical addressing mode."

I'll mention that a great many of these rev files are smaller than 4 KB, so
they contain no original data.

"Why would you need to /manually/ create a rev file with original data?
You can use 'svn commit' to create rev files (on top of the old, good
backup). I'd have thought you'd focus on trying to extract data from
the partially-corrupted rev files (e.g., reconstruct the fulltexts of
reps where it's possible to do so)."

The old, good backup goes through rev 88214, while the original repo goes
through rev 241130, a difference of 152916 revisions. None of those
revisions work, presumably because of the ~4 KB of damage at the start of
each file. As far as I know, nothing can read them automatically, and
Subversion will not show any data or verify any revisions once it hits the
ransomware-affected files.

So I have been investigating a process to create rev files that include the
remaining original data from the revs, so that they are functional within
Subversion. If I can find a process to do that, then I can write a program
that will execute that over all the ransomware-affected original revisions
88215 thru 241130. I am comfortable writing programs that process raw data
from files in different ways. The plan would be to process all those files,
output new files (completely outside of Subversion), and then access that
within Subversion to check it. If something didn't work, I would rework the
program and run it again.

I just started with an "empty" revision so that I would first know I can
satisfy Subversion's minimum requirements for a revision (revprops and revs
files). Someone related to the original project actually gave me an example
repo for this purpose. Within that, they created a revision with a single
change. They then used a dump filter to filter out the contents of that
revision, and made a new repo from that. Then I look at the specific
revision file in a hex editor. I've been looking at lots of files in a hex
editor here.

Now, I am completely new to Subversion since this project. So if there is a
better way to do this using Subversion, I'm certainly open to that! If I
were to do an "svn commit", how would I include original data from the
damaged repo?

My thought was these rev files include "reps" units, and those units are
how I would include the original data in newly created rev files.

"(e.g., reconstruct the fulltexts of reps where it's possible to do so)"

Hmm I don't know what "the fulltexts" means.

I am also learning about SVNKit at the same time. Actually I was looking
there to try to figure out the 2 hashes in the footer of the rev files. But
it might be useful to use for this process.

"[note: this means it's possible for rN+M of a file to be recoverable even
if rN's rep is lost].
...
In principle, you can even dive down this rabbit hole of abstractions to
recover data from the surviving tail ends of partially-overwritten reps."

That is intriguing to know. But as for diving down rabbit holes, I likely
won't want to do that if it requires manual work per revision, or if it
requires a lot of coding work with very little to gain. At this point I
would love to get something that works and also contains some original data
so that I know this is feasible. Then if I can improve from there, great.

"The rev files you get by default have bells and whistles turned on. For
instance, they use DELTA and self-DELTA reps even though it's a lot
easier to fabricate a PLAIN rep, and you can use PLAIN anywhere you can
use DELTA."

"For this reason, I'd recommend to try to create a 1.1-era rev file
first. Pass «--compatible-version=1.1 --fs-type=fsfs» to «svnadmin
create» above. (Subversion 1.1's FSFS is the oldest FSFS there is; see
`svnadmin info`.)"

All right, I understand what you are saying. So that would create a new
repo that uses the old FSFS format. And that should be easier to fabricate.
But that would differ from the backup repo I have with good data. So maybe
that would have to be converted as well, if possible.

"Word of warning: when you test things, do NOT test with the r0 rev file.
The C code hard-codes the assumption that r0 is empty."

That is good to know, thank you.

"Do you mean "L2P-INDEX" and "P2L-INDEX"?"

I guess that is it. I see "L2P-INDEX", then 0A, then a value. Then later I
see "P2L-INDEX", then 0A, then a value. Those values have an odd
calculation apparently and I was able to derive them from the revision
number. I'm not sure how it functions as an index. But I have read someone
say they are meant to discourage analysis or something.

"The checksum in the final line is just MD5."

Which one? The final line of what? From what I understand, the end of a rev
file has L2P-INDEX offset, FNV-1A hash, P2L-INDEX offset, FNV-1A, then
terminal byte.
UPDATE - Ok now I know these are MD5, and the FNV-1A hashes are part of an
intermediate step before the rev file is created.

Setting the value to all zeroes seemed to work, as now I have a different
error message. Thank you!

"You can sidestep the entire L2P/P2L fabrication step by using physical
addressing."

That makes sense as well. I suppose the output of my program could be a
repo that uses physical addressing. Again that would differ from the backup
repo and so a conversion would be required if I would go that route.

Once again, huge thanks for your input!

