Posted to common-issues@hadoop.apache.org by "Scott Carey (JIRA)" <ji...@apache.org> on 2009/07/23 05:37:15 UTC

[jira] Commented: (HADOOP-6166) Improve PureJavaCrc32

    [ https://issues.apache.org/jira/browse/HADOOP-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734445#action_12734445 ] 

Scott Carey commented on HADOOP-6166:
-------------------------------------

One could take Intel's idea and go 8 (or 6, or 12?) bytes at a time instead of 4. That would line up a bit better with the 12-bit lookup tables. 6 bytes might be the easier boundary, requiring 4 12-bit lookup tables, which is 32K; the tables in the inner loop of your "4_3" are 17K, and the current version's are 4K.

Going 12 bytes at a time would require 8 tables and 64K of space, and then we're randomly jumping around lookup tables that don't fit in an L1 D-cache on some processors.

The 8-bytes-at-a-time approach is available as C code (BSD license) here:
http://sourceforge.net/projects/slicing-by-8/files/

The trick with going past 4 bytes per loop iteration has to do with how the cycle works on the CRC. I think it changes a bit after the first four bytes of lookups, but I didn't read far enough into that code to be sure of what it is doing. They do it with 8 1K lookup tables, each indexed by a single byte.

They also get to XOR an entire int directly out of the byte array, rather than grabbing one byte at a time and shifting. We can do that with a ByteBuffer and getInt(), but when I tried that in the previous case the ByteBuffer creation overhead was too large, and the ByteBuffer access seemed very inefficient for some reason. How nice it would be if you could grab the next 4 bytes of a byte[] as an int (or the next 8 as a long) without wrapping it in a ByteBuffer, and let the compiler figure out the optimal processor load instruction.
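To illustrate the difference, here are the two load strategies as hypothetical helper methods (the names are mine, not from any patch here; the manual version is effectively what the current code does one byte at a time):

{code}
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class IntLoadDemo {  // hypothetical, for illustration only
  /** Manual assembly: four byte loads, three shifts, three ORs. */
  static int loadIntManually(byte[] b, int off) {
    return (b[off] & 0xff)
         | (b[off + 1] & 0xff) << 8
         | (b[off + 2] & 0xff) << 16
         | (b[off + 3] & 0xff) << 24;
  }

  /** ByteBuffer route: a single getInt(), but wrap() allocates per call
   *  unless the buffer is hoisted out of the loop. */
  static int loadIntViaBuffer(byte[] b, int off) {
    return ByteBuffer.wrap(b, off, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
  }
}
{code}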

I think the 6-bytes-at-a-time version, with 4 lookups into 12-bit LUTs, would look like this:

{code}
public class Crc32_6_4 extends Crc32Base {
  /** {@inheritDoc} */
  public void update(byte[] b, int off, int len) {
    while (len > 5) {  // consume 6 bytes per iteration
      // fold the first 4 bytes into the crc register
      crc ^= b[off++] & 0xff;
      crc ^= (b[off++] & 0xff) << 8;
      crc ^= (b[off++] & 0xff) << 16;
      crc ^= (b[off++] & 0xff) << 24;
      // bytes 5 and 6 lie past the 32-bit register: data only, no xor with crc
      int c0 = b[off++] & 0xff;
      c0 ^= (b[off++] & 0xff) << 8;

      // 48 bits total, consumed as four 12-bit indexes; the third index
      // combines the top byte of crc with the low nibble of c0
      crc = T12_3[crc & 0xfff]
          ^ T12_2[(crc >>> 12) & 0xfff]
          ^ T12_1[((crc >>> 24) | (c0 << 8)) & 0xfff]
          ^ T12_0[c0 >>> 4];
      len -= 6;
    }

    for (; len > 0; len--) {
      crc = (crc >>> 8) ^ Table8_0[(crc ^ b[off++]) & 0xff];
    }
  }
}
{code}

This assumes the tables are built right, and that the "wrap past 4 bytes means don't XOR with crc" reasoning is correct. I haven't tried this at all.
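For what it's worth, here is a sketch of how those four 12-bit tables might be generated for the reflected CRC-32 polynomial. Each 12-bit chunk of the 48-bit block has to be propagated through every bit that follows it, so the lowest chunk shifts the furthest. The table names match the sketch above; everything else is hypothetical and untested:

{code}
public class Crc32TableGen12 {
  private static final int POLY = 0xEDB88320; // reflected CRC-32 polynomial

  /** Advance a reflected CRC state through n zero bits. */
  private static int advance(int c, int n) {
    for (int i = 0; i < n; i++) {
      c = ((c & 1) != 0) ? (c >>> 1) ^ POLY : c >>> 1;
    }
    return c;
  }

  public static void main(String[] args) {
    final int size = 1 << 12;
    int[] T12_0 = new int[size], T12_1 = new int[size];
    int[] T12_2 = new int[size], T12_3 = new int[size];
    for (int i = 0; i < size; i++) {
      T12_0[i] = advance(i, 12); // bits 36..47 of the block: 12 bits follow
      T12_1[i] = advance(i, 24); // bits 24..35: 24 bits follow
      T12_2[i] = advance(i, 36); // bits 12..23: 36 bits follow
      T12_3[i] = advance(i, 48); // bits  0..11: 48 bits follow
    }
    // emit the arrays as Java source, or dump them for verification
  }
}
{code}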

All of the variants with larger lookup tables risk performing worse under concurrency, even if they are faster single-threaded, since cache pressure is greater when many threads run. We might want to use the benchmark in HADOOP-5318, which is heavily CRC-reliant, to check that we aren't regressing at higher concurrency due to cache pressure.


For reference, the inner loop of Intel's C code (referenced above, snippet below) with 8 tables looks like this:

{code}
/*++
 *
 * Copyright (c) 2004-2006 Intel Corporation - All Rights Reserved
 *
 * This software program is licensed subject to the BSD License, 
 * available at http://www.opensource.org/licenses/bsd-license.html
 *
 * Abstract: The main routine
 * 
 --*/
crc32c_sb8_64_bit(
	uint32_t* p_running_crc,
    const uint8_t*	p_buf,
    const uint32_t length,
	const uint32_t init_bytes,
	uint8_t			mode)
{
	uint32_t li;
	uint32_t crc, term1, term2;
	uint32_t running_length;
	uint32_t end_bytes;
	if(mode ==  MODE_CONT)
		crc = *p_running_crc;
	else	
		crc = CRC32C_INIT_REFLECTED;
	running_length = ((length - init_bytes)/8)*8;
	end_bytes = length - init_bytes - running_length; 

	for(li=0; li < init_bytes; li++) 
		crc = crc_tableil8_o32[(crc ^ *p_buf++) & 0x000000FF] ^ (crc >> 8);	
	for(li=0; li < running_length/8; li++) 
	{
		crc ^= *(uint32_t *)p_buf;
		p_buf += 4;
		term1 = crc_tableil8_o88[crc & 0x000000FF] ^
				crc_tableil8_o80[(crc >> 8) & 0x000000FF];
		term2 = crc >> 16;
		crc = term1 ^
			  crc_tableil8_o72[term2 & 0x000000FF] ^ 
			  crc_tableil8_o64[(term2 >> 8) & 0x000000FF];
		term1 = crc_tableil8_o56[(*(uint32_t *)p_buf) & 0x000000FF] ^
				crc_tableil8_o48[((*(uint32_t *)p_buf) >> 8) & 0x000000FF];
		
		term2 = (*(uint32_t *)p_buf) >> 16;
		crc =	crc ^ 
				term1 ^		
				crc_tableil8_o40[term2  & 0x000000FF] ^	
				crc_tableil8_o32[(term2 >> 8) & 0x000000FF];	
		p_buf += 4;
	}
	for(li=0; li < end_bytes; li++) 
		crc = crc_tableil8_o32[(crc ^ *p_buf++) & 0x000000FF] ^ (crc >> 8);
	if((mode == MODE_BEGIN) || (mode ==  MODE_CONT))
		return crc;		
    return crc ^ XOROT;	
}
{code}

That is pretty straightforward to implement if the next 4 tables (T5, T6, T7, T8 in the terminology of the current PureJavaCrc32) are generated the same way as the previous 4.
Intel makes sure the main loop's reads start on an 8-byte boundary (that is what the init_bytes loop is for), because the C compiler can make the load needed for the
{code}crc ^= *(uint32_t *)p_buf{code} part faster in that case. They also tend to favor shifting by 16 and by 8, avoiding shifts by 24, for some reason.
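A rough, untested Java sketch of what that inner loop might look like, assuming T1 is the existing single-byte table and T5 through T8 are generated the same way as T2 through T4 (the class name and table names are mine, not from any patch):

{code}
public class Crc32_8_8 extends Crc32Base {  // hypothetical name, untested
  /** {@inheritDoc} */
  public void update(byte[] b, int off, int len) {
    while (len > 7) {
      // the first word folds into the crc register; the second is data only
      int c = crc
            ^ ((b[off]     & 0xff)
            |  (b[off + 1] & 0xff) << 8
            |  (b[off + 2] & 0xff) << 16
            |  (b[off + 3] & 0xff) << 24);
      int d = (b[off + 4] & 0xff)
            | (b[off + 5] & 0xff) << 8
            | (b[off + 6] & 0xff) << 16
            | (b[off + 7] & 0xff) << 24;
      // the earliest byte passes through the most subsequent bytes,
      // so it uses the highest-numbered table
      crc = T8[c & 0xff]          ^ T7[(c >>> 8) & 0xff]
          ^ T6[(c >>> 16) & 0xff] ^ T5[c >>> 24]
          ^ T4[d & 0xff]          ^ T3[(d >>> 8) & 0xff]
          ^ T2[(d >>> 16) & 0xff] ^ T1[d >>> 24];
      off += 8;
      len -= 8;
    }
    for (; len > 0; len--) {
      crc = (crc >>> 8) ^ T1[(crc ^ b[off++]) & 0xff];
    }
  }
}
{code}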

I may try out the eight-bytes-at-once version with 8 lookup tables next week.

> Improve PureJavaCrc32
> ---------------------
>
>                 Key: HADOOP-6166
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6166
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Tsz Wo (Nicholas), SZE
>            Assignee: Tsz Wo (Nicholas), SZE
>         Attachments: c6166_20090722.patch, c6166_20090722_benchmark_32VM.txt, c6166_20090722_benchmark_64VM.txt
>
>
> Got some ideas to improve CRC32 calculation.
