You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@cassandra.apache.org by "Tyagi, Preetika" <pr...@intel.com> on 2018/09/26 22:43:55 UTC

MD5 in the read path

Hi all,

I have a question about MD5 being used in the read path in Cassandra.
I wanted to understand what exactly it is being used for and why not something like CRC is used which is less complex in comparison to MD5.

Thanks,
Preetika


RE: MD5 in the read path

Posted by "Tyagi, Preetika" <pr...@intel.com>.
Makes sense. Thanks!

-----Original Message-----
From: Joseph Lynch [mailto:joe.e.lynch@gmail.com] 
Sent: Wednesday, September 26, 2018 9:02 PM
To: dev@cassandra.apache.org
Subject: Re: MD5 in the read path

>
> Thank you all for the response.
> For RandomPartitioner, MD5 is used to avoid collision. However, why is 
> it necessary for comparing data between different replicas? Is it not 
> feasible to use CRC for data comparison?
>
My understanding is that it is not necessary to use MD5 and we can switch out the message digest function as long as we have an upgrade path. I believe this is the goal of https://issues.apache.org/jira/browse/CASSANDRA-13292.

-Joey

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
For additional commands, e-mail: dev-help@cassandra.apache.org


Re: MD5 in the read path

Posted by Joseph Lynch <jo...@gmail.com>.
>
> Thank you all for the response.
> For RandomPartitioner, MD5 is used to avoid collision. However, why is it
> necessary for comparing data between different replicas? Is it not feasible
> to use CRC for data comparison?
>
My understanding is that it is not necessary to use MD5 and we can switch
out the message digest function as long as we have an upgrade path. I
believe this is the goal of
https://issues.apache.org/jira/browse/CASSANDRA-13292.

-Joey

RE: MD5 in the read path

Posted by "Tyagi, Preetika" <pr...@intel.com>.
Thank you all for the response.
For RandomPartitioner, MD5 is used to avoid collision. However, why is it necessary for comparing data between different replicas? Is it not feasible to use CRC for data comparison?

Thanks,
Preetika

-----Original Message-----
From: Elliott Sims [mailto:elliott@backblaze.com] 
Sent: Wednesday, September 26, 2018 7:58 PM
To: dev@cassandra.apache.org
Subject: Re: MD5 in the read path

Would xxHash be large enough for digests?  Looks like there's no 128-bit version yet, and it seems like 64 bits would be a bit short to avoid accidental collisions/matches.  FarmHash128 or MetroHash128 might be a good choice.  Not quite as fast as xxHash64, but not far off and still much, much faster than MD5 and somewhat faster than murmur3. May require some amount of benchmarking, since most of the performance comparisons are C and the performance of the Java implementations may vary drastically.

Looks like https://issues.apache.org/jira/browse/CASSANDRA-13291 already switched to Guava, which probably makes Murmur3_128 easier to switch to than the rest, and it may be enough faster than MD5 to be beyond the point of diminishing returns anyways.

 (far from an expert, but this thread prompted me to go poking through hash options out of curiosity)

On Wed, Sep 26, 2018 at 9:04 PM Joseph Lynch <jo...@gmail.com> wrote:

> Michael Kjellman and others (Jason, Sam, et al.) have already done a 
> lot of work in 4.0 to help change the use of MD5 to something more modern [1][2].
> Also I cut a ticket a little while back about the significant 
> performance penalty of using MD5 for digests when doing quorum reads 
> of wide partitions [1]. Given the profiling that Michael has done and 
> the production profiling we did I think it's fair to say that changing 
> the digest from MD5 to
> murmur3 or xxHash would lead to a noticeable performance improvement 
> for quorum reads, perhaps even something like a 2x throughput increase for e.g.
> wide partition workloads.
>
> The hard part is changing the digest hash without breaking older 
> versions, e.g. during a rolling restart you can't have one node give a 
> MD5 hash and the other give a xxHash hash as you'll end up with lots 
> of mismatches and read repairs ... so that would be the tricky part. I 
> believe that we just need to do what was done during the 3.0 storage 
> engine refactor (I can't remember the ticket but I'm pretty sure 
> Sylvain did the work) which checked the messaging version of the 
> destination node and sent the appropriate hash back.
>
> -Joey
>
> [1] https://issues.apache.org/jira/browse/CASSANDRA-13291
> [2] https://issues.apache.org/jira/browse/CASSANDRA-13292
> [3] https://issues.apache.org/jira/browse/CASSANDRA-14611
>
>
> On Wed, Sep 26, 2018 at 5:00 PM Elliott Sims <el...@backblaze.com>
> wrote:
>
> > They also don't matter for digests, as long as we're assuming all 
> > nodes
> in
> > the cluster are non-malicious (which is a pretty reasonable and 
> > probably necessary assumption).  Or at least, deliberate collisions don't.
> > Accidental collisions do, but 128 bits is sufficient to make that 
> > sufficiently unlikely (as in, chances are nobody will ever see a 
> > single
> > collision)
> >
> > On Wed, Sep 26, 2018 at 7:58 PM Brandon Williams <dr...@gmail.com>
> wrote:
> >
> > > Collisions don't matter in the partitioner.
> > >
> > > On Wed, Sep 26, 2018, 6:53 PM Anirudh Kubatoor <
> > anirudh.kubatoor@gmail.com
> > > >
> > > wrote:
> > >
> > > > Isn't MD5 broken from a security standpoint? From wikipedia 
> > > > *"One basic requirement of any cryptographic hash function is 
> > > > that it should be computationally infeasible <
> > > >
> > >
> >
> https://en.wikipedia.org/wiki/Computational_complexity_theory#Intractability
> > > > >
> > > > to
> > > > find two non-identical messages which hash to the same value. MD5
> fails
> > > > this requirement catastrophically; such collisions
> > > > <https://en.wikipedia.org/wiki/Collision_resistance> can be found in
> > > > seconds on an ordinary home computer"*
> > > >
> > > > Regards,
> > > > Anirudh
> > > >
> > > > On Wed, Sep 26, 2018 at 7:14 PM Jeff Jirsa <jj...@gmail.com> wrote:
> > > >
> > > > > In some installations, it's used for hashing the partition key to
> > find
> > > > the
> > > > > host ( RandomPartitioner )
> > > > > It's used for prepared statement IDs
> > > > > It's used for hashing the data for reads to know if the data
> matches
> > on
> > > > all
> > > > > different replicas.
> > > > >
> > > > > We don't use CRC because conflicts would be really bad. There's
> > > probably
> > > > > something in the middle that's slightly faster than md5 without the
> > > > > drawbacks of crc32
> > > > >
> > > > >
> > > > > On Wed, Sep 26, 2018 at 3:47 PM Tyagi, Preetika <
> > > > preetika.tyagi@intel.com>
> > > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I have a question about MD5 being used in the read path in
> > Cassandra.
> > > > > > I wanted to understand what exactly it is being used for and why
> > not
> > > > > > something like CRC is used which is less complex in comparison to
> > > MD5.
> > > > > >
> > > > > > Thanks,
> > > > > > Preetika
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: MD5 in the read path

Posted by Elliott Sims <el...@backblaze.com>.
Would xxHash be large enough for digests?  Looks like there's no 128-bit
version yet, and it seems like 64 bits would be a bit short to avoid
accidental collisions/matches.  FarmHash128 or MetroHash128 might be a good
choice.  Not quite as fast as xxHash64, but not far off and still much,
much faster than MD5 and somewhat faster than murmur3. May require some
amount of benchmarking, since most of the performance comparisons are C and
the performance of the Java implementations may vary drastically.

Looks like https://issues.apache.org/jira/browse/CASSANDRA-13291 already
switched to Guava, which probably makes Murmur3_128 easier to switch to
than the rest, and it may be enough faster than MD5 to be beyond the point
of diminishing returns anyways.

 (far from an expert, but this thread prompted me to go poking through hash
options out of curiosity)

On Wed, Sep 26, 2018 at 9:04 PM Joseph Lynch <jo...@gmail.com> wrote:

> Michael Kjellman and others (Jason, Sam, et al.) have already done a lot of
> work in 4.0 to help change the use of MD5 to something more modern [1][2].
> Also I cut a ticket a little while back about the significant performance
> penalty of using MD5 for digests when doing quorum reads of wide partitions
> [1]. Given the profiling that Michael has done and the production profiling
> we did I think it's fair to say that changing the digest from MD5 to
> murmur3 or xxHash would lead to a noticeable performance improvement for
> quorum reads, perhaps even something like a 2x throughput increase for e.g.
> wide partition workloads.
>
> The hard part is changing the digest hash without breaking older versions,
> e.g. during a rolling restart you can't have one node give a MD5 hash and
> the other give a xxHash hash as you'll end up with lots of mismatches and
> read repairs ... so that would be the tricky part. I believe that we just
> need to do what was done during the 3.0 storage engine refactor (I can't
> remember the ticket but I'm pretty sure Sylvain did the work) which checked
> the messaging version of the destination node and sent the appropriate hash
> back.
>
> -Joey
>
> [1] https://issues.apache.org/jira/browse/CASSANDRA-13291
> [2] https://issues.apache.org/jira/browse/CASSANDRA-13292
> [3] https://issues.apache.org/jira/browse/CASSANDRA-14611
>
>
> On Wed, Sep 26, 2018 at 5:00 PM Elliott Sims <el...@backblaze.com>
> wrote:
>
> > They also don't matter for digests, as long as we're assuming all nodes
> in
> > the cluster are non-malicious (which is a pretty reasonable and probably
> > necessary assumption).  Or at least, deliberate collisions don't.
> > Accidental collisions do, but 128 bits is sufficient to make that
> > sufficiently unlikely (as in, chances are nobody will ever see a single
> > collision)
> >
> > On Wed, Sep 26, 2018 at 7:58 PM Brandon Williams <dr...@gmail.com>
> wrote:
> >
> > > Collisions don't matter in the partitioner.
> > >
> > > On Wed, Sep 26, 2018, 6:53 PM Anirudh Kubatoor <
> > anirudh.kubatoor@gmail.com
> > > >
> > > wrote:
> > >
> > > > Isn't MD5 broken from a security standpoint? From wikipedia
> > > > *"One basic requirement of any cryptographic hash function is that it
> > > > should be computationally infeasible
> > > > <
> > > >
> > >
> >
> https://en.wikipedia.org/wiki/Computational_complexity_theory#Intractability
> > > > >
> > > > to
> > > > find two non-identical messages which hash to the same value. MD5
> fails
> > > > this requirement catastrophically; such collisions
> > > > <https://en.wikipedia.org/wiki/Collision_resistance> can be found in
> > > > seconds on an ordinary home computer"*
> > > >
> > > > Regards,
> > > > Anirudh
> > > >
> > > > On Wed, Sep 26, 2018 at 7:14 PM Jeff Jirsa <jj...@gmail.com> wrote:
> > > >
> > > > > In some installations, it's used for hashing the partition key to
> > find
> > > > the
> > > > > host ( RandomPartitioner )
> > > > > It's used for prepared statement IDs
> > > > > It's used for hashing the data for reads to know if the data
> matches
> > on
> > > > all
> > > > > different replicas.
> > > > >
> > > > > We don't use CRC because conflicts would be really bad. There's
> > > probably
> > > > > something in the middle that's slightly faster than md5 without the
> > > > > drawbacks of crc32
> > > > >
> > > > >
> > > > > On Wed, Sep 26, 2018 at 3:47 PM Tyagi, Preetika <
> > > > preetika.tyagi@intel.com>
> > > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I have a question about MD5 being used in the read path in
> > Cassandra.
> > > > > > I wanted to understand what exactly it is being used for and why
> > not
> > > > > > something like CRC is used which is less complex in comparison to
> > > MD5.
> > > > > >
> > > > > > Thanks,
> > > > > > Preetika
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: MD5 in the read path

Posted by Joseph Lynch <jo...@gmail.com>.
Michael Kjellman and others (Jason, Sam, et al.) have already done a lot of
work in 4.0 to help change the use of MD5 to something more modern [1][2].
Also I cut a ticket a little while back about the significant performance
penalty of using MD5 for digests when doing quorum reads of wide partitions
[1]. Given the profiling that Michael has done and the production profiling
we did I think it's fair to say that changing the digest from MD5 to
murmur3 or xxHash would lead to a noticeable performance improvement for
quorum reads, perhaps even something like a 2x throughput increase for e.g.
wide partition workloads.

The hard part is changing the digest hash without breaking older versions,
e.g. during a rolling restart you can't have one node give a MD5 hash and
the other give a xxHash hash as you'll end up with lots of mismatches and
read repairs ... so that would be the tricky part. I believe that we just
need to do what was done during the 3.0 storage engine refactor (I can't
remember the ticket but I'm pretty sure Sylvain did the work) which checked
the messaging version of the destination node and sent the appropriate hash
back.

-Joey

[1] https://issues.apache.org/jira/browse/CASSANDRA-13291
[2] https://issues.apache.org/jira/browse/CASSANDRA-13292
[3] https://issues.apache.org/jira/browse/CASSANDRA-14611


On Wed, Sep 26, 2018 at 5:00 PM Elliott Sims <el...@backblaze.com> wrote:

> They also don't matter for digests, as long as we're assuming all nodes in
> the cluster are non-malicious (which is a pretty reasonable and probably
> necessary assumption).  Or at least, deliberate collisions don't.
> Accidental collisions do, but 128 bits is sufficient to make that
> sufficiently unlikely (as in, chances are nobody will ever see a single
> collision)
>
> On Wed, Sep 26, 2018 at 7:58 PM Brandon Williams <dr...@gmail.com> wrote:
>
> > Collisions don't matter in the partitioner.
> >
> > On Wed, Sep 26, 2018, 6:53 PM Anirudh Kubatoor <
> anirudh.kubatoor@gmail.com
> > >
> > wrote:
> >
> > > Isn't MD5 broken from a security standpoint? From wikipedia
> > > *"One basic requirement of any cryptographic hash function is that it
> > > should be computationally infeasible
> > > <
> > >
> >
> https://en.wikipedia.org/wiki/Computational_complexity_theory#Intractability
> > > >
> > > to
> > > find two non-identical messages which hash to the same value. MD5 fails
> > > this requirement catastrophically; such collisions
> > > <https://en.wikipedia.org/wiki/Collision_resistance> can be found in
> > > seconds on an ordinary home computer"*
> > >
> > > Regards,
> > > Anirudh
> > >
> > > On Wed, Sep 26, 2018 at 7:14 PM Jeff Jirsa <jj...@gmail.com> wrote:
> > >
> > > > In some installations, it's used for hashing the partition key to
> find
> > > the
> > > > host ( RandomPartitioner )
> > > > It's used for prepared statement IDs
> > > > It's used for hashing the data for reads to know if the data matches
> on
> > > all
> > > > different replicas.
> > > >
> > > > We don't use CRC because conflicts would be really bad. There's
> > probably
> > > > something in the middle that's slightly faster than md5 without the
> > > > drawbacks of crc32
> > > >
> > > >
> > > > On Wed, Sep 26, 2018 at 3:47 PM Tyagi, Preetika <
> > > preetika.tyagi@intel.com>
> > > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I have a question about MD5 being used in the read path in
> Cassandra.
> > > > > I wanted to understand what exactly it is being used for and why
> not
> > > > > something like CRC is used which is less complex in comparison to
> > MD5.
> > > > >
> > > > > Thanks,
> > > > > Preetika
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: MD5 in the read path

Posted by Elliott Sims <el...@backblaze.com>.
They also don't matter for digests, as long as we're assuming all nodes in
the cluster are non-malicious (which is a pretty reasonable and probably
necessary assumption).  Or at least, deliberate collisions don't.
Accidental collisions do, but 128 bits is sufficient to make that
sufficiently unlikely (as in, chances are nobody will ever see a single
collision)

On Wed, Sep 26, 2018 at 7:58 PM Brandon Williams <dr...@gmail.com> wrote:

> Collisions don't matter in the partitioner.
>
> On Wed, Sep 26, 2018, 6:53 PM Anirudh Kubatoor <anirudh.kubatoor@gmail.com
> >
> wrote:
>
> > Isn't MD5 broken from a security standpoint? From wikipedia
> > *"One basic requirement of any cryptographic hash function is that it
> > should be computationally infeasible
> > <
> >
> https://en.wikipedia.org/wiki/Computational_complexity_theory#Intractability
> > >
> > to
> > find two non-identical messages which hash to the same value. MD5 fails
> > this requirement catastrophically; such collisions
> > <https://en.wikipedia.org/wiki/Collision_resistance> can be found in
> > seconds on an ordinary home computer"*
> >
> > Regards,
> > Anirudh
> >
> > On Wed, Sep 26, 2018 at 7:14 PM Jeff Jirsa <jj...@gmail.com> wrote:
> >
> > > In some installations, it's used for hashing the partition key to find
> > the
> > > host ( RandomPartitioner )
> > > It's used for prepared statement IDs
> > > It's used for hashing the data for reads to know if the data matches on
> > all
> > > different replicas.
> > >
> > > We don't use CRC because conflicts would be really bad. There's
> probably
> > > something in the middle that's slightly faster than md5 without the
> > > drawbacks of crc32
> > >
> > >
> > > On Wed, Sep 26, 2018 at 3:47 PM Tyagi, Preetika <
> > preetika.tyagi@intel.com>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I have a question about MD5 being used in the read path in Cassandra.
> > > > I wanted to understand what exactly it is being used for and why not
> > > > something like CRC is used which is less complex in comparison to
> MD5.
> > > >
> > > > Thanks,
> > > > Preetika
> > > >
> > > >
> > >
> >
>

Re: MD5 in the read path

Posted by Brandon Williams <dr...@gmail.com>.
Collisions don't matter in the partitioner.

On Wed, Sep 26, 2018, 6:53 PM Anirudh Kubatoor <an...@gmail.com>
wrote:

> Isn't MD5 broken from a security standpoint? From wikipedia
> *"One basic requirement of any cryptographic hash function is that it
> should be computationally infeasible
> <
> https://en.wikipedia.org/wiki/Computational_complexity_theory#Intractability
> >
> to
> find two non-identical messages which hash to the same value. MD5 fails
> this requirement catastrophically; such collisions
> <https://en.wikipedia.org/wiki/Collision_resistance> can be found in
> seconds on an ordinary home computer"*
>
> Regards,
> Anirudh
>
> On Wed, Sep 26, 2018 at 7:14 PM Jeff Jirsa <jj...@gmail.com> wrote:
>
> > In some installations, it's used for hashing the partition key to find
> the
> > host ( RandomPartitioner )
> > It's used for prepared statement IDs
> > It's used for hashing the data for reads to know if the data matches on
> all
> > different replicas.
> >
> > We don't use CRC because conflicts would be really bad. There's probably
> > something in the middle that's slightly faster than md5 without the
> > drawbacks of crc32
> >
> >
> > On Wed, Sep 26, 2018 at 3:47 PM Tyagi, Preetika <
> preetika.tyagi@intel.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > I have a question about MD5 being used in the read path in Cassandra.
> > > I wanted to understand what exactly it is being used for and why not
> > > something like CRC is used which is less complex in comparison to MD5.
> > >
> > > Thanks,
> > > Preetika
> > >
> > >
> >
>

Re: MD5 in the read path

Posted by Anirudh Kubatoor <an...@gmail.com>.
Isn't MD5 broken from a security standpoint? From wikipedia
*"One basic requirement of any cryptographic hash function is that it
should be computationally infeasible
<https://en.wikipedia.org/wiki/Computational_complexity_theory#Intractability>
to
find two non-identical messages which hash to the same value. MD5 fails
this requirement catastrophically; such collisions
<https://en.wikipedia.org/wiki/Collision_resistance> can be found in
seconds on an ordinary home computer"*

Regards,
Anirudh

On Wed, Sep 26, 2018 at 7:14 PM Jeff Jirsa <jj...@gmail.com> wrote:

> In some installations, it's used for hashing the partition key to find the
> host ( RandomPartitioner )
> It's used for prepared statement IDs
> It's used for hashing the data for reads to know if the data matches on all
> different replicas.
>
> We don't use CRC because conflicts would be really bad. There's probably
> something in the middle that's slightly faster than md5 without the
> drawbacks of crc32
>
>
> On Wed, Sep 26, 2018 at 3:47 PM Tyagi, Preetika <pr...@intel.com>
> wrote:
>
> > Hi all,
> >
> > I have a question about MD5 being used in the read path in Cassandra.
> > I wanted to understand what exactly it is being used for and why not
> > something like CRC is used which is less complex in comparison to MD5.
> >
> > Thanks,
> > Preetika
> >
> >
>

Re: MD5 in the read path

Posted by Jeff Jirsa <jj...@gmail.com>.
In some installations, it's used for hashing the partition key to find the
host ( RandomPartitioner )
It's used for prepared statement IDs
It's used for hashing the data for reads to know if the data matches on all
different replicas.

We don't use CRC because conflicts would be really bad. There's probably
something in the middle that's slightly faster than md5 without the
drawbacks of crc32


On Wed, Sep 26, 2018 at 3:47 PM Tyagi, Preetika <pr...@intel.com>
wrote:

> Hi all,
>
> I have a question about MD5 being used in the read path in Cassandra.
> I wanted to understand what exactly it is being used for and why not
> something like CRC is used which is less complex in comparison to MD5.
>
> Thanks,
> Preetika
>
>

Re: MD5 in the read path

Posted by Elliott Sims <el...@backblaze.com>.
Thanks to open source, you can answer yourself:
https://github.com/apache/cassandra/search?q=md5&unscoped_q=md5
At a glance, looks like it's used for digest verification, and to get a
good hash distribution on the RandomPartitioner

I haven't done the math, but I suspect CRC32's just not good enough either
in terms of result distribution for hashes or ability to catch multi-bit
errors without accidental collision (that is, it doesn't sufficiently
guarantee uniqueness).  There are other error checking algorithms that are
probably less computationally complex, but thanks to a lot of hardware and
software optimizations geared towards md5 specifically over the years you'd
be hard-pressed to find something that gives you comparable speed in
practice for a 128-bit result.