You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@nutch.apache.org by "Paul Baclace (JIRA)" <ji...@apache.org> on 2005/10/21 06:26:44 UTC

[jira] Created: (NUTCH-122) block numbers need a better random number generator

block numbers need a better random number generator
---------------------------------------------------

         Key: NUTCH-122
         URL: http://issues.apache.org/jira/browse/NUTCH-122
     Project: Nutch
        Type: Bug
  Components: fetcher, indexer, searcher  
    Versions: 0.8-dev    
    Reporter: Paul Baclace


In order to support billions of block numbers, a better PRNG than java.util.Random is needed.  To reach billions with low probability of collision, 64 bit random numbers are needed (the Birthday Problem is the model for the number of bits needed; the result is that twice as many bits are needed as the number of bits to count the expected number of items.) The built-in java.util.Random keeps only 48 bits of state which is only sufficient for 2^24 items.  Using repeated calls to or more than one instance of Random does not increase its total entropy.  

  Analysis
    util.Random is a linear congruential generator (LCG) identical to drand48.
      util.Random keeps 48 bits of state and gangs together 2 consecutive values to return 64 bit values.
      LCGs suffer from periodicity in the low order bits which would make modular binning less than random.
        "low order bits" could mean least significant byte.
      LCGs have periods in the range 106 to 109 when using 32 bit words, a range of poor to fair.
      seed = ( 0x5DEECE66DL * seed + 0xBL ) & ((1L << 48) - 1);
        the origin of 0x5DEECE66D, a non-prime, is shrouded in the mists of time.
      Results of the Birthday Spacings Test look good.
     References
      http://www.math.utah.edu/~beebe/java/random/README
      http://www.pierssen.com/arcview/upload/esoterica/randomizer.html

Recommended alternative:    MersenneTwister
      Matsumoto and Nishimura (1998).
      Longest period of any known generator 2^19937 or about 10^6001.
      A period that exceeds the number of unique values seems ideal; obviously a shorter period than the number of unique values (like util.Random)  is a problem).
      Faster than java.util.Random (Random was recent tweaked, however).
      Excellent result for Diehard Birthday Spacings Test.
      Can be seeded with up to 624 32 bit integers.

Doug Cutting wrote on nutch-dev:
> It just occurred to me that perhaps we could simply use sequential block numbering. 
>  All block ids are generated centrally on the namenode.  

Response from Paul Baclace:
I'm not sure what the advantage of sequential block numbers would be
since long period PRNG block numbering does not even need to store
it's state, just pick a new starting place.

Sequential block numbering does have the downside that picking a datanode based on (BlockNum % DataNodeCount) would devolve into round robin.  Any attempt to pass the sequence through a hash ends up becoming a random number generator.

Sequential numbering provides contiguous numbers, but after G.C. that would be lost, so no advantage there.

When human beings eyeball block numbers, many with small differences are more likely to be misread than many that are totally different.

If block numbering is sequential, then there is a temptation to use 32 bits instead of 64, but 32 bits leads to wrap-around and uh oh. 

FSNamesystem uses Random to help pick a target datanode, but it could just use the randomness of block numbers.



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Updated: (NUTCH-122) block numbers need a better random number generator

Posted by "Paul Baclace (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-122?page=all ]

Paul Baclace updated NUTCH-122:
-------------------------------

    Attachment: MersenneTwister.java

I am attaching MersenneTwister.java

The license on the attached source:  
All implementations are based on the mt19937 C code from  Matsumoto:
  http://www.math.sci.hiroshima-u.ac.jp/~m-mat/eindex.html  
which has a BSD license (in the file mt19937ar.c).  The attached MersenneTwister.java is by Nick Galbreath and has a BSD license.  I modified it to be a subclass of java.util.Random, use SecureRandom to generate 128 bits of seed when the zero-arg constructor is called, and I added Matsumoto's validation test as a main().  It passes the validation test which also verifies that multiple, parallel instances can be made which produce identical sequences.  

My understanding is that the BSD license is compatible with ASF, so I prepended the ASF language.

It would be nice if MersenneTwister could be added to one of the Commons packages, but this version is in the org.apache.nutch.util package.


> block numbers need a better random number generator
> ---------------------------------------------------
>
>          Key: NUTCH-122
>          URL: http://issues.apache.org/jira/browse/NUTCH-122
>      Project: Nutch
>         Type: Bug
>   Components: fetcher, indexer, searcher
>     Versions: 0.8-dev
>     Reporter: Paul Baclace
>  Attachments: MersenneTwister.java
>
> In order to support billions of block numbers, a better PRNG than java.util.Random is needed.  To reach billions with low probability of collision, 64 bit random numbers are needed (the Birthday Problem is the model for the number of bits needed; the result is that twice as many bits are needed as the number of bits to count the expected number of items.) The built-in java.util.Random keeps only 48 bits of state which is only sufficient for 2^24 items.  Using repeated calls to or more than one instance of Random does not increase its total entropy.  
>   Analysis
>     util.Random is a linear congruential generator (LCG) identical to drand48.
>       util.Random keeps 48 bits of state and gangs together 2 consecutive values to return 64 bit values.
>       LCGs suffer from periodicity in the low order bits which would make modular binning less than random.
>         "low order bits" could mean least significant byte.
>       LCGs have periods in the range 106 to 109 when using 32 bit words, a range of poor to fair.
>       seed = ( 0x5DEECE66DL * seed + 0xBL ) & ((1L << 48) - 1);
>         the origin of 0x5DEECE66D, a non-prime, is shrouded in the mists of time.
>       Results of the Birthday Spacings Test look good.
>      References
>       http://www.math.utah.edu/~beebe/java/random/README
>       http://www.pierssen.com/arcview/upload/esoterica/randomizer.html
> Recommended alternative:    MersenneTwister
>       Matsumoto and Nishimura (1998).
>       Longest period of any known generator 2^19937 or about 10^6001.
>       A period that exceeds the number of unique values seems ideal; obviously a shorter period than the number of unique values (like util.Random)  is a problem).
>       Faster than java.util.Random (Random was recent tweaked, however).
>       Excellent result for Diehard Birthday Spacings Test.
>       Can be seeded with up to 624 32 bit integers.
> Doug Cutting wrote on nutch-dev:
> > It just occurred to me that perhaps we could simply use sequential block numbering. 
> >  All block ids are generated centrally on the namenode.  
> Response from Paul Baclace:
> I'm not sure what the advantage of sequential block numbers would be
> since long period PRNG block numbering does not even need to store
> it's state, just pick a new starting place.
> Sequential block numbering does have the downside that picking a datanode based on (BlockNum % DataNodeCount) would devolve into round robin.  Any attempt to pass the sequence through a hash ends up becoming a random number generator.
> Sequential numbering provides contiguous numbers, but after G.C. that would be lost, so no advantage there.
> When human beings eyeball block numbers, many with small differences are more likely to be misread than many that are totally different.
> If block numbering is sequential, then there is a temptation to use 32 bits instead of 64, but 32 bits leads to wrap-around and uh oh. 
> FSNamesystem uses Random to help pick a target datanode, but it could just use the randomness of block numbers.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Resolved: (NUTCH-122) block numbers need a better random number generator

Posted by "Sami Siren (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-122?page=all ]
     
Sami Siren resolved NUTCH-122:
------------------------------

    Resolution: Invalid

this is more related to hadoop

> block numbers need a better random number generator
> ---------------------------------------------------
>
>          Key: NUTCH-122
>          URL: http://issues.apache.org/jira/browse/NUTCH-122
>      Project: Nutch
>         Type: Bug

>   Components: fetcher, indexer, searcher
>     Versions: 0.8-dev
>     Reporter: Paul Baclace
>  Attachments: MersenneTwister.java, MersenneTwister.java
>
> In order to support billions of block numbers, a better PRNG than java.util.Random is needed.  To reach billions with low probability of collision, 64 bit random numbers are needed (the Birthday Problem is the model for the number of bits needed; the result is that twice as many bits are needed as the number of bits to count the expected number of items.) The built-in java.util.Random keeps only 48 bits of state which is only sufficient for 2^24 items.  Using repeated calls to or more than one instance of Random does not increase its total entropy.  
>   Analysis
>     util.Random is a linear congruential generator (LCG) identical to drand48.
>       util.Random keeps 48 bits of state and gangs together 2 consecutive values to return 64 bit values.
>       LCGs suffer from periodicity in the low order bits which would make modular binning less than random.
>         "low order bits" could mean least significant byte.
>       LCGs have periods in the range 106 to 109 when using 32 bit words, a range of poor to fair.
>       seed = ( 0x5DEECE66DL * seed + 0xBL ) & ((1L << 48) - 1);
>         the origin of 0x5DEECE66D, a non-prime, is shrouded in the mists of time.
>       Results of the Birthday Spacings Test look good.
>      References
>       http://www.math.utah.edu/~beebe/java/random/README
>       http://www.pierssen.com/arcview/upload/esoterica/randomizer.html
> Recommended alternative:    MersenneTwister
>       Matsumoto and Nishimura (1998).
>       Longest period of any known generator 2^19937 or about 10^6001.
>       A period that exceeds the number of unique values seems ideal; obviously a shorter period than the number of unique values (like util.Random)  is a problem).
>       Faster than java.util.Random (Random was recent tweaked, however).
>       Excellent result for Diehard Birthday Spacings Test.
>       Can be seeded with up to 624 32 bit integers.
> Doug Cutting wrote on nutch-dev:
> > It just occurred to me that perhaps we could simply use sequential block numbering. 
> >  All block ids are generated centrally on the namenode.  
> Response from Paul Baclace:
> I'm not sure what the advantage of sequential block numbers would be
> since long period PRNG block numbering does not even need to store
> it's state, just pick a new starting place.
> Sequential block numbering does have the downside that picking a datanode based on (BlockNum % DataNodeCount) would devolve into round robin.  Any attempt to pass the sequence through a hash ends up becoming a random number generator.
> Sequential numbering provides contiguous numbers, but after G.C. that would be lost, so no advantage there.
> When human beings eyeball block numbers, many with small differences are more likely to be misread than many that are totally different.
> If block numbering is sequential, then there is a temptation to use 32 bits instead of 64, but 32 bits leads to wrap-around and uh oh. 
> FSNamesystem uses Random to help pick a target datanode, but it could just use the randomness of block numbers.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Updated: (NUTCH-122) block numbers need a better random number generator

Posted by "Paul Baclace (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-122?page=all ]

Paul Baclace updated NUTCH-122:
-------------------------------

    Attachment: MersenneTwister.java

Resubmitting MersenneTwister.java, this time with the Grant ASF checked.


> block numbers need a better random number generator
> ---------------------------------------------------
>
>          Key: NUTCH-122
>          URL: http://issues.apache.org/jira/browse/NUTCH-122
>      Project: Nutch
>         Type: Bug
>   Components: fetcher, indexer, searcher
>     Versions: 0.8-dev
>     Reporter: Paul Baclace
>  Attachments: MersenneTwister.java, MersenneTwister.java
>
> In order to support billions of block numbers, a better PRNG than java.util.Random is needed.  To reach billions with low probability of collision, 64 bit random numbers are needed (the Birthday Problem is the model for the number of bits needed; the result is that twice as many bits are needed as the number of bits to count the expected number of items.) The built-in java.util.Random keeps only 48 bits of state which is only sufficient for 2^24 items.  Using repeated calls to or more than one instance of Random does not increase its total entropy.  
>   Analysis
>     util.Random is a linear congruential generator (LCG) identical to drand48.
>       util.Random keeps 48 bits of state and gangs together 2 consecutive values to return 64 bit values.
>       LCGs suffer from periodicity in the low order bits which would make modular binning less than random.
>         "low order bits" could mean least significant byte.
>       LCGs have periods in the range 106 to 109 when using 32 bit words, a range of poor to fair.
>       seed = ( 0x5DEECE66DL * seed + 0xBL ) & ((1L << 48) - 1);
>         the origin of 0x5DEECE66D, a non-prime, is shrouded in the mists of time.
>       Results of the Birthday Spacings Test look good.
>      References
>       http://www.math.utah.edu/~beebe/java/random/README
>       http://www.pierssen.com/arcview/upload/esoterica/randomizer.html
> Recommended alternative:    MersenneTwister
>       Matsumoto and Nishimura (1998).
>       Longest period of any known generator 2^19937 or about 10^6001.
>       A period that exceeds the number of unique values seems ideal; obviously a shorter period than the number of unique values (like util.Random)  is a problem).
>       Faster than java.util.Random (Random was recent tweaked, however).
>       Excellent result for Diehard Birthday Spacings Test.
>       Can be seeded with up to 624 32 bit integers.
> Doug Cutting wrote on nutch-dev:
> > It just occurred to me that perhaps we could simply use sequential block numbering. 
> >  All block ids are generated centrally on the namenode.  
> Response from Paul Baclace:
> I'm not sure what the advantage of sequential block numbers would be
> since long period PRNG block numbering does not even need to store
> it's state, just pick a new starting place.
> Sequential block numbering does have the downside that picking a datanode based on (BlockNum % DataNodeCount) would devolve into round robin.  Any attempt to pass the sequence through a hash ends up becoming a random number generator.
> Sequential numbering provides contiguous numbers, but after G.C. that would be lost, so no advantage there.
> When human beings eyeball block numbers, many with small differences are more likely to be misread than many that are totally different.
> If block numbering is sequential, then there is a temptation to use 32 bits instead of 64, but 32 bits leads to wrap-around and uh oh. 
> FSNamesystem uses Random to help pick a target datanode, but it could just use the randomness of block numbers.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira