Posted to dev@subversion.apache.org by Stefan Fuhrmann <st...@apache.org> on 2017/03/01 12:52:27 UTC

Fast alternative hash (was: Files with identical SHA1 breaks the repo)

On 01.03.2017 05:17, Greg Stein wrote:
> I really like this idea.
>
> And we could take a copy of APR's sha1 code, and rejigger it to 
> perform *both* hashes during the same scan of the raw bytes. I would 
> expect the time taken to extend by (say) 1.1X rather than a full 2X. 
> The inner loop might cost a bit more, but we'd only scan the bytes 
> once. Very handy, when you're talking about megabytes in a stream-y 
> environment.
>
> (and medium-term, push this dual-sha1 computation back into APR)
>
>
> On Sun, Feb 26, 2017 at 10:08 AM, Garance A Drosehn <drosih@rpi.edu> wrote:
>
>     On 24 Feb 2017, at 15:46, Stefan Sperling wrote:
>     >
>     > I believe we should prepare a new working format for 1.10.0
>     > which addresses this problem. I don't see a good way of fixing
>     > it without a format bump. The bright side of this is that it
>     > gives us a good reason to get 1.10.0 ready ASAP.
>     >
>     > We can switch to a better hash algorithm with a WC format
>     > bump.
>
>     One of the previous messages mentioned that better hash
>     algorithms are more expensive.  So let me mention a tactic
>     that I used many years ago, when MD5 was the best digest
>     algorithm that I knew of, and I didn't trust it for the
>     larger files I was working with at the time:
>
>     Instead of going with a completely different hash algorithm,
>     just double-down on the one you're using.  What I did was to
>     calculate one digest the standard way, and then a second one
>     which summed up every-other-byte (or every 3rd byte, or ...).
>     So to get a collision, not only do two files have to get the
>     same digest-result for all their data, but they have to also
>     get the same digest-result when exactly half the data is
>     skipped over.
>
>     (I did this a long time ago, and forget the details.  What
>     I may have done for performance reasons was every-other-word,
>     not every-other-byte)
>
>     My thinking was that *any* single algorithm which processes
>     all the data is going to get collisions, eventually.  But it
>     will be much harder for someone to generate a duplicate file
>     where there will also be a collision when summing up only
>     half of the data.
>
>     I'm not claiming this is a great cure-all solution, but just
>     that it's an alternate tactic which might be interesting.
>     People could create repositories with just the one digest,
>     or upgrade it to use multiple digests if they have the need.
>
>     I found a few benchmarks which suggest that sha-256 is maybe
>     twice as expensive as sha-1, so calculating two sha-1 digests
>     might be a reasonable alternative.
>
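
As a rough illustration of the quoted tactic, here is a C sketch against
APR's apr_sha1.h: one pass over the data feeds one context with every
byte and a second context with every other byte. The dual_sha1() wrapper
and the chunk size are made up for this example; a real implementation
would fold the second digest into the sha1 inner loop as Greg suggests.

    #include <stddef.h>
    #include "apr_sha1.h"

    #define CHUNK 4096  /* even, so the every-other-byte phase stays aligned */

    static void
    dual_sha1(const unsigned char *data, size_t len,
              unsigned char full_digest[APR_SHA1_DIGESTSIZE],
              unsigned char half_digest[APR_SHA1_DIGESTSIZE])
    {
      apr_sha1_ctx_t full_ctx, half_ctx;
      unsigned char half_buf[CHUNK / 2];
      size_t offset = 0;

      apr_sha1_init(&full_ctx);
      apr_sha1_init(&half_ctx);

      while (offset < len)
        {
          size_t chunk = (len - offset < CHUNK) ? len - offset : CHUNK;
          unsigned int n = 0;
          size_t i;

          /* The full digest sees every byte of this chunk ... */
          apr_sha1_update_binary(&full_ctx, data + offset,
                                 (unsigned int)chunk);

          /* ... the second digest sees only every other byte of it. */
          for (i = 0; i < chunk; i += 2)
            half_buf[n++] = data[offset + i];
          apr_sha1_update_binary(&half_ctx, half_buf, n);

          offset += chunk;
        }

      apr_sha1_final(full_digest, &full_ctx);
      apr_sha1_final(half_digest, &half_ctx);
    }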

That is also known as bit-slicing. The neat thing is that
you create N (e.g. 4) interleaved sub-streams whose
checksums can be calculated *concurrently*, e.g. using
SIMD instructions. So you end up being ~3 times *faster*
than calculating the normal checksum.

Because the interleaved streams may look quite similar
(think bitmaps), you probably want to "salt" them. A simple
rotate or XOR might do, but I'm not an expert on this.
The goal is to end up with 4 reasonably independent
streams and hence sub-hashes.
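
For instance, a salt could be as cheap as a per-lane rotate plus XOR of
each 32-bit word; the constants and rotate amounts below are arbitrary,
chosen only to show the shape of such a function:

    #include <stdint.h>

    /* Hypothetical per-lane "salt": rotate each 32-bit word of a
       sub-stream and XOR in a lane-specific constant so that similar
       lanes still yield reasonably independent inputs. */
    static uint32_t
    salt_word(uint32_t w, unsigned int lane)   /* lane = 0..3 */
    {
      static const uint32_t lane_const[4] =
        { 0x9e3779b9, 0x85ebca6b, 0xc2b2ae35, 0x27d4eb2f };
      unsigned int r = 7 * (lane + 1);         /* rotate by 7/14/21/28 bits */

      w = (w << r) | (w >> (32 - r));          /* rotate left */
      return w ^ lane_const[lane];
    }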

So, the full sequence would look something like this:

* Split the text T into interleaved sub-streams T1, ..., T4
* Salt them: S1 = salt(T1, 1), ..., S4 = salt(T4, 4)
* Calculate the sub-stream hashes using bit-sliced code:
   D1, ..., D4 = sha1_4x(S1, ..., S4) = sha1(S1), ..., sha1(S4)
* Calculate the final checksum D = sha2(D1 | ... | D4)
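
Ignoring the SIMD part, the sequence might look like the sketch below,
with OpenSSL's one-shot SHA1()/SHA256() standing in for a real
bit-sliced sha1_4x(); the function name and the trivial XOR salt are
invented for illustration, and error handling is omitted:

    #include <stdlib.h>
    #include <openssl/sha.h>

    /* Sequential sketch of the T -> T1..T4 -> S1..S4 -> D1..D4 -> D
       pipeline; a real implementation would hash the four lanes
       concurrently. */
    static void
    interleaved_sha1_sha256(const unsigned char *t, size_t len,
                            unsigned char out[SHA256_DIGEST_LENGTH])
    {
      unsigned char *sub[4];
      size_t sub_len[4] = { 0, 0, 0, 0 };
      unsigned char d[4][SHA_DIGEST_LENGTH];
      size_t i, k;

      /* Split T into four interleaved sub-streams T1..T4. */
      for (k = 0; k < 4; k++)
        sub[k] = malloc(len / 4 + 1);
      for (i = 0; i < len; i++)
        sub[i % 4][sub_len[i % 4]++] = t[i];

      for (k = 0; k < 4; k++)
        {
          /* Salt each sub-stream (here: a per-lane XOR). */
          for (i = 0; i < sub_len[k]; i++)
            sub[k][i] ^= (unsigned char)(0x5c * (k + 1));

          /* D1..D4 = sha1(S1), ..., sha1(S4). */
          SHA1(sub[k], sub_len[k], d[k]);
          free(sub[k]);
        }

      /* Final checksum D = sha256(D1 | D2 | D3 | D4). */
      SHA256(&d[0][0], sizeof(d), out);
    }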

Not only would that solve the current sha1 issue, it would
also neatly address the fact that nowadays we can read
data from disk faster than we can checksum it.

-- Stefan^2.

Re: Fast alternative hash

Posted by Branko Čibej <br...@apache.org>.
On 01.03.2017 13:52, Stefan Fuhrmann wrote:
> [...]
>
> Not only would that solve the current sha1 issue, it would
> also neatly address the fact that nowadays we can read
> data from disk faster than we can checksum it.

For some definition of "solve" -- i.e., until a more generic attack
method is invented. :)

-- Brane