Posted to mapreduce-user@hadoop.apache.org by Pedro Costa <ps...@gmail.com> on 2011/02/16 22:37:04 UTC
How to read compressed files?

Hi,

1 - I'm trying to read parts of a compressed file to generate message
digests, but I can't fetch the right parts. I searched for an example
that reads compressed files, but I couldn't find one.
My example has 3 partitions; these are the index entries for the file:
raw bytes: 54632 / offset: 0 / partLength: 20307
raw bytes: 53771 / offset: 20307 / partLength: 19882
raw bytes: 53568 / offset: 40189 / partLength: 19814

Here's my code:

[code]
readCompressedFile(InputStream input) {
	decompressor.reset();
	CompressionInputStream input2 = codec.createInputStream(input, decompressor);

	IndexRecord index = spillRec.getIndex(part);

	long size = index.rawLength;
	//long size2 = index.partLength;
	long offset = index.startOffset;
	hash[part] = hashGen.generateHash(input2, (int) offset, (int) size);
}



public String generateHash(CompressionInputStream input, int offset,
		int mapOutputLength) {
	MessageDigest md = null;
	StringBuffer buf = new StringBuffer();

	try {
		md = MessageDigest.getInstance("SHA-1");
		int totalBytes = 0;

		int size = mapOutputLength < (60 * 1024) ? mapOutputLength : (60 * 1024);

		byte[] buffer = new byte[size];

		int n = input.read(buffer, 0, size);

		if (n > 0)
			md.update(buffer);

		while (n > 0) {
			totalBytes += n;

			mapOutputLength -= n;

			// Handle the case where fewer bytes remain than the default buffer size.
			// We don't want the message digest to contain trash.
			size = mapOutputLength < (60 * 1024) ? mapOutputLength : (60 * 1024);

			if (size == 0)
				break;

			buffer = new byte[size];
			n = input.read(buffer, 0, size);

			if (n > 0) {
				md.update(buffer);
			}
		}
		System.out.println("END: " + totalBytes + " - ");

		// DO THE HASH

	} catch (NoSuchAlgorithmException e) {
		// TODO Auto-generated catch block
		e.printStackTrace();
	} catch (IOException e) {
		// TODO Auto-generated catch block
		e.printStackTrace();
	}

	return HASH;
}
[/code]

I can't get the right portions of the compressed file, and I don't
know why. What am I doing wrong?
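
A possible explanation, offered as a sketch rather than a confirmed
diagnosis: in the index entries above each offset advances by partLength,
which suggests that startOffset and partLength describe the compressed bytes
on disk, while rawLength is the size after decompression. The code above
never uses offset to position the stream, and md.update(buffer) hashes the
whole buffer even when read returns fewer bytes than requested. The example
below uses java.util.zip as a stand-in for the Hadoop codec and assumes each
partition was compressed independently; the names PartitionDigestSketch,
gzip, sha1Hex and digestPartition are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class PartitionDigestSketch {

    /** Compresses one partition on its own (stand-in for a spill-file segment). */
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        return bos.toByteArray();
    }

    /** SHA-1 over exactly the bytes the stream yields, as lowercase hex. */
    static String sha1Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[60 * 1024];
        int n;
        while ((n = in.read(buffer, 0, buffer.length)) != -1) {
            // Hash only the n bytes actually read, not the whole buffer.
            md.update(buffer, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /**
     * Digest of the decompressed content of one partition.
     * startOffset and partLength are positions in the COMPRESSED file.
     */
    static String digestPartition(byte[] file, int startOffset, int partLength)
            throws IOException, NoSuchAlgorithmException {
        // Restrict the raw stream to the compressed segment first...
        InputStream segment = new ByteArrayInputStream(file, startOffset, partLength);
        // ...then decompress only that segment and hash what comes out.
        try (InputStream in = new GZIPInputStream(segment)) {
            return sha1Hex(in);
        }
    }
}
```

With the Hadoop classes, the equivalent would presumably be to position the
underlying stream at startOffset and limit it to partLength bytes before
passing it to codec.createInputStream; that part is an assumption and is not
tested here.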

2 - When I read a compressed file through the CompressionInputStream class,
	CompressionInputStream input2 = codec.createInputStream(input, decompressor);

does that mean that calling the "read" method returns uncompressed data?
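
As far as I understand it, yes: a decompressing stream hands back
uncompressed bytes from read. The small check below illustrates the idea
with java.util.zip.GZIPInputStream as a stand-in, not with Hadoop's
CompressionInputStream itself; the class and method names are hypothetical.
It counts the bytes that read returns and compares that with the compressed
size.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class ReadYieldsRawBytesSketch {

    /** Compresses a byte array with gzip. */
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        return bos.toByteArray();
    }

    /** Counts how many bytes read() hands back from a decompressing stream. */
    static long countDecompressedBytes(byte[] compressed) throws IOException {
        long total = 0;
        byte[] buffer = new byte[4096];
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            int n;
            while ((n = in.read(buffer, 0, buffer.length)) != -1) {
                total += n;
            }
        }
        return total;
    }
}
```

If the same holds for the Hadoop codec, the number of bytes to expect from
read for one partition would be rawLength, not partLength.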




Thanks,


-- 
Pedro