You are viewing a plain text version of this content. The canonical link for it is here.
Posted to modperl@perl.apache.org by Barrie Slaymaker <ba...@slaysys.com> on 2001/11/08 19:11:21 UTC

Cache::* and MD5 collisions [was: [OT] Data store options]

On Thu, Nov 08, 2001 at 08:59:55AM -0800, Bill Moseley wrote:
> Hi,
> 
> <verbose>
> I'm looking for a little discussion on selecting a data storage method, and
> I'm posting here because Cache::Cache often is discussed here (along with
> Apache::Session).  And people here are smart, of course ;).
> 
> Basically, I'm trying to understand when to use Cache::Cache, vs. Berkeley
> DB, and locking issues.

Even a bit more OT: one thing to watch out for, especially if you plan
on caching a *lot* of data, is that the Cache::* modules did not do
collision detection on MD5 collisions the last time I looked.  Forgive
me if that's changed recently.

The probability is extremely low, but they can and do occur "in the
wild" (I won't bring anyone's name up here :).  In other words, it's a
remote possibility that one URI might serve up another's content if the
two hash keys map to the same MD5 value.

Given an infinite number of web monkeys using Cache::* and an infinite
number of user monkeys clicking on links...

- Barrie

Re: Cache::* and MD5 collisions [was: [OT] Data store options]

Posted by "Jeffrey W. Baker" <jw...@acm.org>.
On Thu, 2001-11-08 at 10:11, Barrie Slaymaker wrote:
> On Thu, Nov 08, 2001 at 08:59:55AM -0800, Bill Moseley wrote:
> > Hi,
> > 
> > <verbose>
> > I'm looking for a little discussion on selecting a data storage method, and
> > I'm posting here because Cache::Cache often is discussed here (along with
> > Apache::Session).  And people here are smart, of course ;).
> > 
> > Basically, I'm trying to understand when to use Cache::Cache, vs. Berkeley
> > DB, and locking issues.
> 
> Even a bit more OT: one thing to watch out for, especially if you plan
> on caching a *lot* of data, is that the Cache::* modules did not do
> collision detection on MD5 collisions the last time I looked.  Forgive
> me if that's changed recently.
> 
> The probability is extremely low, but they can and do occur "in the
> wild" (I won't bring anyone's name up here :).  In other words, it's a
> remote possibility that one URI might serve up another's content if the
> two hash keys map to the same MD5 value.
> 
> Given an infinite number of web monkeys using Cache::* and an infinite
> number of user monkeys clicking on links...

You could switch to SHA1.  SHA1 collisions will be much more rare than
MD5 collisions.

-jwb


Re: Mac OSX 10.1 - mod_perl build instructions

Posted by Rick Myers <ri...@sumthin.nu>.
On Nov 08, 2001 at 11:31:39 -0800, Randal L. Schwartz wrote:
> 
>      sudo perl Makefile.PL \ 
>          APACHE_SRC=/opt/apache/1.3.22/src/apache_1.3.22/src \
>          NO_HTTPD=1 \ 
>          USE_APACI=1 \ 
>          PREP_HTTPD=1 \ 
>          EVERYTHING=1 
>      # hit return a bunch of times

That "hit return a bunch of times" has always bugged me.
It seems to me that the combination of NO_HTTPD=1,
PREP_HTTPD=1, and APACHE_SRC=??? has already answered the
three questions that it's going to ask.

Here's a little patch I wrote up for 1.25 to eliminate the
questions. It still applies to the current 1.xx cvs versions
with a bit of skew.

--- Makefile.PL.orig	Fri Mar  9 15:14:14 2001
+++ Makefile.PL	Fri Mar  9 15:17:08 2001
@@ -493,6 +493,9 @@
 }
 
 }
+elsif ($NO_HTTPD && !$PREP_HTTPD && !$DO_HTTPD && $APACHE_SRC) {
+  push(@adirs,$APACHE_SRC);
+}
 
 	if($PERL_EXTRA_CFLAGS) {
 	    $PERL_EXTRA_CFLAGS = join(" ", split(",",  $PERL_EXTRA_CFLAGS));
@@ -525,7 +528,7 @@
 
     if (-e $httpd_h) {
 	unless($NO_HTTPD and not $DYNAMIC and not $PREP_HTTPD) {
-	    unless($DO_HTTPD) {
+	    unless($DO_HTTPD || $PREP_HTTPD && $NO_HTTPD) {
 		$ans = prompt("Configure mod_perl with $adir ?", "y");
 		next unless $ans =~ /^y$/i;
 	    }

Rick Myers                            rik@sumthin.nu
----------------------------------------------------
The Feynman Problem       1) Write down the problem.
Solving Algorithm         2) Think real hard.
                          3) Write down the answer.

Re: Cache::* and MD5 collisions [was: [OT] Data store options]

Posted by Bill Moseley <mo...@hank.org>.
At 10:54 AM 11/08/01 -0800, Andrew Ho wrote:
>For example, say your keys are e-mail addresses and you just want to use
>an MD5 hash to spread your data files over directories so that no one
>directory has too many files in it. Say your original key is
>andrew@tellme.com (hex encoded MD5 hash of this is RfbmPiuRLyPGGt3oHBagt).
>Instead of just storing the key in the file
>R/Rf/Rfb/Rfbm/RfbmPiuRLyPGGt3oHBagt.dat, store the key in the file
>R/Rf/Rfb/Rfbm/andrew@tellme.com.dat. Presto... collisions are impossible.

That has the nice side effect that I can run through the directory tree and
get the key for every file.  I do need a way to read every key in the
store.  Order is not important.



Bill Moseley
mailto:moseley@hank.org

Re: Cache::* and MD5 collisions [was: [OT] Data store options]

Posted by Barrie Slaymaker <ba...@slaysys.com>.
On Thu, Nov 08, 2001 at 10:54:11AM -0800, Andrew Ho wrote:
> Let me point out that if you are using MD5 hashes for directory spreading
> (i.e. to spread a large number of files across a tree of directories so
> that no one directory is filled with too many files for efficient
> filesystem access), the easy solution is to just append the unique key
> that you are hashing to the files involved.

Neat.

That adds directory levels if you do URI->filesystem_path mapping for
the key, or runs in to namelength limits if you fold "/" and, say, "%"
and "." in to %FF style escape codes or base64 it.  And namelength
limitations are even more severe on many non-Unixish filesystems :).

I prefer the technique of storing the full text of the hash key in the
stored object's metadata (ie in the cache) and comparing it to the
requested key on retrieval.  When a collision is detected, you can then
do overflow processing (like most hashed dictionary data structures from
CS-land) and seek the cached copy somewhere else in the cache or just
treat it like a cache miss.

MD5 is a good thing, but relying on it's uniqueness for a site that needs
to be reliable is a bit risky in my mind.  YMMV, I just want folks to be
aware of the issues.

- Barrie

Re: Cache::* and MD5 collisions [was: [OT] Data store options]

Posted by Andrew Ho <an...@tellme.com>.
Hello,

DC>For example, file system caches fill their directories roughly equally
DC>when their paths are created from MD5 hashed keys. Doing something
DC>simple and unique like "URL-encoding" the key to make a legal identifier
DC>(legal in the sense that it is a valid filename) wouldn't distribute as
DC>evenly.

Let me point out that if you are using MD5 hashes for directory spreading
(i.e. to spread a large number of files across a tree of directories so
that no one directory is filled with too many files for efficient
filesystem access), the easy solution is to just append the unique key
that you are hashing to the files involved.

For example, say your keys are e-mail addresses and you just want to use
an MD5 hash to spread your data files over directories so that no one
directory has too many files in it. Say your original key is
andrew@tellme.com (hex encoded MD5 hash of this is RfbmPiuRLyPGGt3oHBagt).
Instead of just storing the key in the file
R/Rf/Rfb/Rfbm/RfbmPiuRLyPGGt3oHBagt.dat, store the key in the file
R/Rf/Rfb/Rfbm/andrew@tellme.com.dat. Presto... collisions are impossible.

Humbly,

Andrew

----------------------------------------------------------------------
Andrew Ho               http://www.tellme.com/       andrew@tellme.com
Engineer                   info@tellme.com          Voice 650-930-9062
Tellme Networks, Inc.       1-800-555-TELL            Fax 650-930-9101
----------------------------------------------------------------------




Re: Cache::* and MD5 collisions [was: [OT] Data store options]

Posted by DeWitt Clinton <de...@unto.net>.
On Thu, Nov 08, 2001 at 01:11:21PM -0500, Barrie Slaymaker wrote:

> Even a bit more OT: one thing to watch out for, especially if you
> plan on caching a *lot* of data, is that the Cache::* modules did
> not do collision detection on MD5 collisions the last time I looked.
> Forgive me if that's changed recently.

I knew that a collision was technically possible, but the odds seemed
astronomically low.  But if security is a concern, then I agree that
MD5-ing the key isn't the best way to create a unique identifier.  The
thing that I liked most about MD5 (other than its convenience) is that
it tends to distribute very evenly.  For example, file system caches
fill their directories roughly equally when their paths are created
from MD5 hashed keys.  Doing something simple and unique like
"URL-encoding" the key to make a legal identifier (legal in the sense
that it is a valid filename) wouldn't distribute as evenly.

Barrie raised this question privately to me earlier this year, and
this is what I wrote at the time:

   Good question.  They would overwrite each other.  You may want to
   check out this RFC for more information about the odds of that:

    http://www.faqs.org/rfcs/rfc1321.html                                         
   Specifically, they say that "it is conjectured that the difficulty
   of coming up with two messages having the same message digest is on
   the order of 2^64 operations, and that the difficulty of coming up
   with any message having a given message digest is on the order
   of 2^128 operations."

   This means you would need 18,446,744,073,709,551,616 keys to make
   it 100% likely that there would be a collision.

   Assuming you were only willing to risk a 0.01% chance that there is
   a collision, you could still fit 1,844,674,407,370,955 keys.  Since
   each value stored will take at least 1k in memory, that number of
   keys would require 1,844,674 Terrabytes of storage.
   
   In other words, it is highly unlikely that you could get a collision
   in real world usage.


Note that I'm not disagreeing with Barrie at all -- I just took the
easy way out and refuted the likelihood of an infinite number of
monkeys hitting my code.  :)

-DeWitt


Mac OSX 10.1 - mod_perl build instructions

Posted by "Randal L. Schwartz" <me...@stonehenge.com>.
Apparently, mod_perl wants to be built static into Apache on OSX, and
yet wants to use mod_so to load any additional thingies like
Apache::Request or Apache::Template.  So after many hours of trying
different combinations of things, I finally yelled out "Yippee Skippee"
(a phrase my dad firmly installed in my brain) when I stumbled
across this combination of things:

    So, here's the current, verynicelyworking config.status for
    Apache, and the build instructions for mod_perl with Apache:

    1) unpack mod_perl from the CPAN, and execute

     sudo perl Makefile.PL \ 
         APACHE_SRC=/opt/apache/1.3.22/src/apache_1.3.22/src \
         NO_HTTPD=1 \ 
         USE_APACI=1 \ 
         PREP_HTTPD=1 \ 
         EVERYTHING=1 
     # hit return a bunch of times
     sudo make all install

    2) then go to your apache installation, and execute

     sudo ./configure \ 
      "--with-layout=GNU" \ 
      "--prefix=/opt/apache/1.3.22" \ 
      "--activate-module=src/modules/perl/libperl.a" \ 
      "--enable-module=most" \ 
      "--enable-shared=max" \ 
      "--disable-shared=perl"
     sudo make all install

    fixing "prefix" to be wherever you want.  I'm sticking all local
    installs under /opt/PACKAGE/VERSION/* so I can quickly see what I've
    added.  I usually ln -s /opt/PACKAGE/VERSION/bin/* to /usr/local/bin
    so that I can invoke the commands, though.

And there it is... so nobody else has to tweak the same way I did.

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<me...@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!