Posted to mapreduce-user@hadoop.apache.org by java8964 java8964 <ja...@hotmail.com> on 2013/02/09 21:49:31 UTC

Question related to Decompressor interface

Hi,
I am currently researching options for encrypting data in MapReduce, as we plan to use the Amazon EMR or EC2 services for our data.
I think the compression codec is a good place to integrate the encryption logic, and I found that some people have had the same idea.
I googled around and found this code:
https://github.com/geisbruch/HadoopCryptoCompressor/
It doesn't seem to be maintained any more, but it gave me a starting point. I downloaded the source code and tried to run some tests with it.
It doesn't work out of the box; there are some bugs I had to fix to make it work. I believe it contains 'AES' as an example algorithm.
But right now I am facing a problem when I try to use it in my test MapReduce program. Here is the stack trace I got:
2013-02-08 23:16:47,038 INFO org.apache.hadoop.io.compress.crypto.CryptoBasicDecompressor: buf length = 512, and offset = 0, length = -132967308
java.lang.IndexOutOfBoundsException
    at java.nio.ByteBuffer.wrap(ByteBuffer.java:352)
    at org.apache.hadoop.io.compress.crypto.CryptoBasicDecompressor.setInput(CryptoBasicDecompressor.java:100)
    at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:97)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:83)
    at java.io.InputStream.read(InputStream.java:82)
    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:209)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173)
    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:114)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:458)
    at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:645)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
    at org.apache.hadoop.mapred.Child.main(Child.java:262)
I know the error is thrown from this custom CryptoBasicDecompressor class, but I really have questions about the interface it implements: Decompressor.
There is limited documentation about this interface, for example when and how the method setInput() will be invoked. If I want to write my own Decompressor, what do these methods mean in the interface? In the case above, I enabled some debug output; you can see that the byte[] array passed to setInput() only has a length of 512, but the third parameter (length) passed in is a negative number: -132967308. That caused the IndexOutOfBoundsException. If I check this method in Hadoop's GzipDecompressor class, the code would also throw an IndexOutOfBoundsException in this case, so this is a RuntimeException case. Why does it happen in my test case?
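For reference, this is roughly the shape of the interface I am trying to implement, based on my reading of the Hadoop 2.0 javadoc (the exact method signatures may differ slightly between versions):

import java.io.IOException;

// org.apache.hadoop.io.compress.Decompressor, abbreviated
public interface Decompressor {
  // The surrounding stream hands in a chunk of input here; validating
  // (off, len) against b.length (via ByteBuffer.wrap in my case) is where
  // the IndexOutOfBoundsException is being raised.
  void setInput(byte[] b, int off, int len);
  boolean needsInput();                 // true when more input is required
  void setDictionary(byte[] b, int off, int len);
  boolean needsDictionary();
  boolean finished();                   // true once the end of stream is reached
  int decompress(byte[] b, int off, int len) throws IOException;
  int getRemaining();                   // bytes left in the internal buffer
  void reset();
  void end();
}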
Here is my test case:
I have a simple log text file of about 700k. I encrypted it with the above code using 'AES'. I can encrypt and decrypt it and get my original content back. The file name is foo.log.crypto; this file extension is registered to invoke the CryptoBasicDecompressor in my test Hadoop cluster, which runs the CDH4.1.2 release (Hadoop 2.0). Everything works as I expected: the CryptoBasicDecompressor is invoked when the input file is foo.log.crypto, as you can see in the stack trace above. But I don't know why the third parameter (length) in setInput() is a negative number at runtime.
In addition, I have further questions about using a Compressor/Decompressor to handle encrypting/decrypting files. Ideally, I would like the encryption/decryption to support file splits. This probably depends on the algorithm used, is that right? If so, what kind of algorithm can do that? I am not sure whether it is like the compression case, where most codecs do not support splitting. If so, it may not be a good fit for my requirements.
If we have a 1G file, encrypted in Amazon S3, then after it is copied to the HDFS of Amazon EMR, can each block of the data be decrypted independently by each mapper and passed to the underlying RecordReader so it is processed fully concurrently? Has anyone done this before? If so, what encryption algorithm supports it? Any ideas?
Thanks
Yong

Re: Question related to Decompressor interface

Posted by Ted Dunning <td...@maprtech.com>.
All of these suggestions tend to founder on the problem of key management.

What you need to do is

1) define your threats.

2) define your architecture including key management.

3) demonstrate how the architecture defends against the threat environment.

I haven't seen more than a cursory comment about item (1) in this thread. Typically, the threats include:

a) compromise or theft of physical media by outsiders

b) compromise of one or more live machines in the cluster

c) insider attack by one employee working alone, but able to socially
engineer others into unwitting cooperation

Which threats did the OP need to defend against?


RE: Question related to Decompressor interface

Posted by David Parks <da...@yahoo.com>.
In the EncryptedWritableWrapper idea you would create an object that takes
any Writable object as its parameter.

 

Your EncryptedWritableWrapper would naturally implement Writable.

 

- When write(DataOutput out) is called on your object, create your own
DataOutputStream which writes data into a byte array that you control
(i.e. new DataOutputStream(new myByteArrayOutputStream()), keeping
references to the objects of course).

- Now encrypt the bytes and pass them on to the DataOutput object
you received in write(DataOutput out).

 

Decrypting is basically the same, using the readFields(DataInput in) method.

- Read in the bytes and decrypt them (you will probably have needed
to write out the length of the bytes previously so you know how much to read
in).

- Take the decrypted bytes and pass them to the readFields(...) method
of the Writable object you're wrapping.

 

The rest of Hadoop doesn't know or care whether the data is encrypted; your
Writable objects are just a bunch of bytes. Your Key and Value classes in
this case are now EncryptedWritableWrapper, and you'll need to know which
type of Writable to pass it in the code.
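
A rough, untested sketch of what such a wrapper might look like (the raw-key
constructor, the plain AES cipher transformation, and the missing no-argument
constructor that Hadoop's serialization machinery normally needs are all
simplifications for brevity):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import org.apache.hadoop.io.Writable;

// Wraps any Writable and encrypts its serialized form. Illustrative only:
// Hadoop also expects a no-arg constructor plus a way to construct the
// wrapped instance (e.g. via ReflectionUtils), and real cipher setup needs an IV.
public class EncryptedWritableWrapper<T extends Writable> implements Writable {

  private final T wrapped;
  private final SecretKeySpec key;   // e.g. a 16-byte AES key from your own key management

  public EncryptedWritableWrapper(T wrapped, byte[] rawKey) {
    this.wrapped = wrapped;
    this.key = new SecretKeySpec(rawKey, "AES");
  }

  public T get() {
    return wrapped;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    // 1. Serialize the wrapped Writable into a byte array we control.
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    wrapped.write(new DataOutputStream(buffer));
    // 2. Encrypt those bytes, then write length + ciphertext to the real output.
    byte[] encrypted = crypt(Cipher.ENCRYPT_MODE, buffer.toByteArray());
    out.writeInt(encrypted.length);
    out.write(encrypted);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    // 1. Read length + ciphertext and decrypt.
    byte[] encrypted = new byte[in.readInt()];
    in.readFully(encrypted);
    byte[] plain = crypt(Cipher.DECRYPT_MODE, encrypted);
    // 2. Let the wrapped Writable deserialize itself from the plaintext.
    wrapped.readFields(new DataInputStream(new ByteArrayInputStream(plain)));
  }

  private byte[] crypt(int mode, byte[] data) throws IOException {
    try {
      // Default AES transformation (ECB/PKCS5Padding) used only for brevity;
      // a real implementation should use a mode with an IV, e.g. AES/CBC or AES/GCM.
      Cipher cipher = Cipher.getInstance("AES");
      cipher.init(mode, key);
      return cipher.doFinal(data);
    } catch (GeneralSecurityException e) {
      throw new IOException("Encryption/decryption failed", e);
    }
  }
}

Used as the value class of a SequenceFile, each record is then encrypted and
decrypted independently, which is what keeps the file splittable.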

 

This would be good for encrypting in Hadoop. If your file comes in encrypted,
then it necessarily can't be split (you should aim to limit the maximum size
of the file on the source side). In the case of an encrypted input you would
need your own record reader to decrypt it; your description of the scenario
below is correct, and extending TextInputFormat would be the way to go.
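
A minimal sketch of that TextInputFormat extension, assuming the new
(mapreduce) API; the part that actually decrypts, whether a custom record
reader or a codec keyed off the file extension, is omitted, and the class
name is just illustrative:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Treats every input file as non-splittable, since a whole-file-encrypted
// input can only be decrypted from the beginning.
public class EncryptedTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;   // one mapper per encrypted file
  }
}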

 

If your input is just a plain text file and your goal is to store it in an
encrypted fashion, then the EncryptedWritable idea works and is a simpler
implementation.

 

 

 

From: java8964 java8964 [mailto:java8964@hotmail.com] 
Sent: Sunday, February 10, 2013 10:13 PM
To: user@hadoop.apache.org
Subject: RE: Question related to Decompressor interface

 

Hi, Dave:

 

Thanks for your reply. I am not sure how the EncryptedWritable would work;
can you share more ideas about it?

 

For example, say I have a text file as my raw source file and I need to
store it in HDFS. If I use any encryption to encrypt the whole file, then
there is no good InputFormat or RecordReader to process it, unless the whole
file is decrypted first at runtime and then processed with TextInputFormat,
right?

 

What you suggest is: when I encrypt the file, store it as a SequenceFile,
using anything I want as the key, then encrypt each line (record) and store
it as the value, and put each (key, value) pair into the sequence file, is
that right?

 

Then at runtime, each value can be decrypted from the sequence file by the
EncryptedWritable class and is ready for the next step. Is my understanding
correct?

 

In this case, of course, I don't need to worry about splits any more, as
each record is encrypted/decrypted separately.

 

I think it is a valid option, but the problem is that the data has to be
encrypted by this EncryptedWritable class. What I was thinking of is allowing
the data source to encrypt its data any way it wants, as long as it is
supported by the Java security package, and then only providing the private
key to the runtime to decrypt it.

 

Yong

  _____  

From: davidparks21@yahoo.com
To: user@hadoop.apache.org
Subject: RE: Question related to Decompressor interface
Date: Sun, 10 Feb 2013 09:36:40 +0700

I can't answer your question about the Decompressor interface, but I have a
query for you.

 

Why not just create an EncryptedWritable object? Encrypt/decrypt the bytes
in the read/write methods; that should be darn near trivial. Then stick with
good ol' SequenceFile, which, as you note, is splittable. Otherwise you'd
have to deal with making the output splittable, and given encrypted data,
the only solution that I see is basically rolling your own SequenceFile with
encrypted innards.

 

Come to think of it, a simple, standardized EncryptedWritable object out of
the box with Hadoop would be great. Or perhaps better yet, an
EncryptedWritableWrapper<T extends Writable> so we can convert any existing
Writable into an encrypted form.

 

Dave

 

 

From: java8964 java8964 [mailto:java8964@hotmail.com] 
Sent: Sunday, February 10, 2013 3:50 AM
To: user@hadoop.apache.org
Subject: Question related to Decompressor interface

 

HI, 

 

Currently I am researching about options of encrypting the data in the
MapReduce, as we plan to use the Amazon EMR or EC2 services for our data.

 

I am thinking that the compression codec is good place to integrate with the
encryption logic, and I found out there are some people having the same idea
as mine.

 

I google around and found out this code:

 

https://github.com/geisbruch/HadoopCryptoCompressor/

 

It doesn't seem maintained any more, but it gave me a starting point. I
download the source code, and try to do some tests with it.

 

It doesn't work out of box. There are some bugs I have to fix to make it
work. I believe it contains 'AES' as an example algorithm.

 

But right now, I faced a problem when I tried to use it in my testing
MapReduer program. Here is the stack trace I got:

 

2013-02-08 23:16:47,038 INFO
org.apache.hadoop.io.compress.crypto.CryptoBasicDecompressor: buf length =
512, and offset = 0, length = -132967308

java.lang.IndexOutOfBoundsException

    at java.nio.ByteBuffer.wrap(ByteBuffer.java:352)

    at
org.apache.hadoop.io.compress.crypto.CryptoBasicDecompressor.setInput(Crypto
BasicDecompressor.java:100)

    at
org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecomp
ressorStream.java:97)

    at
org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.jav
a:83)

    at java.io.InputStream.read(InputStream.java:82)

    at
org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:209)

    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173)

    at
org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineReco
rdReader.java:114)

    at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTas
k.java:458)

    at
org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.
java:76)

    at
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(Wrapp
edMapper.java:85)

    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)

    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:645)

    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)

    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)

    at java.security.AccessController.doPrivileged(Native Method)

    at javax.security.auth.Subject.doAs(Subject.java:396)

    at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.ja
va:1332)

    at org.apache.hadoop.mapred.Child.main(Child.java:262)

 

I know the error is thrown out of this custom CryptoBasicDecompressor class,
but I really have questions related to the interface it implemented:
Decompressor.

 

There is limited document about this interface, for example, when and how
the method setInput() will be invoked. If I want to write my own
Decompressor, what do these methods mean in the interface?

In the above case, I enable some debug information, you can see that in this
case, the byte[] array passed to setInput method, only have 512 as the
length, but the 3rd parameter of length passed in is a negative number:
-132967308. That caused the IndexOutOfBoundsException. If I check the
GzipDecompressor class of this method in the hadoop, the code will also
throw IndexOutoutBoundsException in this case, so this is a RuntimeException
case. Why it happened in my test case?

 

Here is my test case:

 

I have a simpel log text file about 700k. I encrypted it with above code
using 'AES'. I can encrypted and decrypted to get my original content. The
file name is foo.log.crypto, this file extension is registered to invoke
this CryptoBasicDecompressor in my testing hadoop using CDH4.1.2 release
(hadoop 2.0). Everything works as I expected. The CryptoBasicDecompressor is
invoked when the input file is foo.log.crypto, as you can see in the above
stack trace. But I don't know why the 3rd parameter (length) in setInput()
is a negative number at runtime.

 

In addition to that, I also have further questions related to using a
Compressor/Decompressor to handle encrypting/decrypting files. Ideally, I
wonder whether the encryption/decryption can support file splits. That
probably depends on the algorithm we are using, is that right? If so, what
kind of algorithm can do that? I am not sure whether it is like the
compression codec case, where most codecs do not support splits. If so, it
may not be good for my requirements.

 

If we have a 1G file, encrypted in Amazon S3, after it is copied to the
HDFS of Amazon EMR, can each block of the data be decrypted independently by
each mapper and then passed to the underlying RecordReader to be processed
fully concurrently? Has anyone done this before? If so, which encryption
algorithm supports it? Any ideas?

 

Thanks

 

Yong


RE: Question related to Decompressor interface

Posted by David Parks <da...@yahoo.com>.
In the EncryptedWritableWrapper idea you would create an object that takes
any Writable object as its parameter.

 

Your EncryptedWritableWrapper would naturally implement Writable.

 

-  When write(DataOutput out) is called on your object, create your own
   DataOutputStream which writes the data into a byte array that you control
   (i.e. new DataOutputStream(new myByteArrayOutputStream()), keeping
   references to the objects of course).

-  Now encrypt the bytes and pass them on to the DataOutput object you
   received in write(DataOutput out).

 

To decrypt, it is basically the same with the readFields(DataInput in)
method:

-  Read in the bytes and decrypt them (you will probably have needed to
   write out the length of the bytes previously so you know how much to read
   in).

-  Take the decrypted bytes and pass them to the readFields(...) method of
   the Writable object you're wrapping.

 

The rest of Hadoop doesn't know or care whether the data is encrypted; your
Writable objects are just a bunch of bytes. Your Key and Value classes in
this case are now EncryptedWritableWrapper, and you'll need to know which
type of Writable to pass it in the code.

 

This would be good for encrypting in Hadoop. If your file comes in encrypted
then it necessarily can't be split (you should aim to limit the maximum size
of the file on the source side). In the case of an encrypted input you would
need your own record reader to decrypt it; your description of the scenario
in your message is correct, and extending TextInputFormat would be the way
to go.

 

If your input is just a plain text file and your goal is to store it in an
encrypted fashion, then the EncryptedWritable idea works and is a simpler
implementation.
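
Here is a minimal, untested sketch of what such a wrapper might look like
(the class name and constructor are only for illustration; a real version
would also need a no-argument constructor so Hadoop can instantiate it by
reflection, and a cipher mode with an explicit IV rather than the bare "AES"
transformation):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInput;
    import java.io.DataInputStream;
    import java.io.DataOutput;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.security.GeneralSecurityException;

    import javax.crypto.Cipher;
    import javax.crypto.spec.SecretKeySpec;

    import org.apache.hadoop.io.Writable;

    // Hypothetical wrapper: serializes the wrapped Writable into a buffer,
    // encrypts the buffer, and writes length + ciphertext to the real output.
    public class EncryptedWritableWrapper<T extends Writable> implements Writable {

      private final T wrapped;
      private final SecretKeySpec key;  // e.g. new SecretKeySpec(rawKeyBytes, "AES")

      public EncryptedWritableWrapper(T wrapped, SecretKeySpec key) {
        this.wrapped = wrapped;
        this.key = key;
      }

      @Override
      public void write(DataOutput out) throws IOException {
        try {
          // 1. Let the wrapped Writable serialize itself into a buffer we control.
          ByteArrayOutputStream buffer = new ByteArrayOutputStream();
          wrapped.write(new DataOutputStream(buffer));

          // 2. Encrypt the serialized bytes.
          Cipher cipher = Cipher.getInstance("AES");
          cipher.init(Cipher.ENCRYPT_MODE, key);
          byte[] encrypted = cipher.doFinal(buffer.toByteArray());

          // 3. Write the length first so readFields() knows how much to read back.
          out.writeInt(encrypted.length);
          out.write(encrypted);
        } catch (GeneralSecurityException e) {
          throw new IOException("Encryption failed", e);
        }
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        try {
          // 1. Read the ciphertext length written by write(), then the ciphertext.
          byte[] encrypted = new byte[in.readInt()];
          in.readFully(encrypted);

          // 2. Decrypt.
          Cipher cipher = Cipher.getInstance("AES");
          cipher.init(Cipher.DECRYPT_MODE, key);
          byte[] plain = cipher.doFinal(encrypted);

          // 3. Hand the plaintext to the wrapped Writable's own readFields().
          wrapped.readFields(new DataInputStream(new ByteArrayInputStream(plain)));
        } catch (GeneralSecurityException e) {
          throw new IOException("Decryption failed", e);
        }
      }

      public T get() {
        return wrapped;
      }
    }

You would then use EncryptedWritableWrapper as the key/value class of your
SequenceFile and wrap or unwrap the real Writable around it in your mapper
and reducer.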

 

 

 



RE: Question related to Decompressor interface

Posted by java8964 java8964 <ja...@hotmail.com>.
Hi, Dave:
Thanks for your reply. I am not sure how the EncryptedWritable would work, can you share more ideas about it?
For example, if I have a text file as my source raw file. Now I need to store it in HDFS. If I use any encryption to encrypt the whole file, then there is no good InputFormat or RecordReader to process it, unless the whole file is decrypted first at runtime and then processed using TextInputFormat, right?
What you suggest is: when I encrypt the file, store it as a SequenceFile, using anything I want as the key, then encrypt each line (record) and store it as the value, and put both (key, value) pairs into the sequence file, is that right?
Then at runtime, each value can be decrypted from the sequence file and is ready for the next step, handled by the EncryptedWritable class. Is my understanding correct?
In this case, of course, I don't need to worry about splits any more, as each record is encrypted/decrypted separately.
I think it is a valid option, but the problem is that the data has to be encrypted by this EncryptedWritable class. What I was thinking about is allowing the data source to encrypt its data any way it wants, as long as it is supported by the Java security package, and then only providing the private key to the runtime to decrypt it.
Yong
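
As a rough illustration of that per-record approach (the paths, the
line-number key, and the all-zero AES key below are made up for the example;
real code would need proper key management and a cipher mode with an IV),
the writing side might look something like this:

    import java.io.BufferedReader;
    import java.io.FileReader;

    import javax.crypto.Cipher;
    import javax.crypto.spec.SecretKeySpec;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;

    // Writes each line of a local log file as an encrypted record into a
    // SequenceFile: key = line number, value = AES-encrypted line bytes.
    public class EncryptToSequenceFile {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Placeholder key; a real job would load it from a key store.
        SecretKeySpec key = new SecretKeySpec(new byte[16], "AES");
        Cipher cipher = Cipher.getInstance("AES");
        cipher.init(Cipher.ENCRYPT_MODE, key);

        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path("/data/foo.log.seq"),
            LongWritable.class, BytesWritable.class);

        BufferedReader reader = new BufferedReader(new FileReader("foo.log"));
        String line;
        long lineNo = 0;
        while ((line = reader.readLine()) != null) {
          byte[] encrypted = cipher.doFinal(line.getBytes("UTF-8"));
          writer.append(new LongWritable(lineNo++), new BytesWritable(encrypted));
        }
        reader.close();
        writer.close();
      }
    }

A map task would then read (LongWritable, BytesWritable) records from the
SequenceFile, which stays splittable, and decrypt each value with the same
key.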


RE: Question related to Decompressor interface

Posted by David Parks <da...@yahoo.com>.
I can't answer your question about the Decompressor interface, but I have a
query for you.

 

Why not just create an EncryptedWritable object? Encrypt/decrypt the bytes
on the read/write method, that should be darn near trivial. Then stick with
good ol' SequenceFile, which, as you note, is splittable. Otherwise you'd
have to deal with making the output splittable, and given encrypted data,
the only solution that I see is basically rolling your own SequenceFile with
encrypted innards. 

 

Come to think of it, a simple, standardized EncryptedWritable object out of
the box with Hadoop would be great. Or perhaps better yet, an
EncryptedWritableWrapper<T extends Writable> so we can convert any existing
Writable into an encrypted form.
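
A rough sketch of what such an EncryptedWritableWrapper might look like
(illustrative only; this class does not ship with Hadoop, and the cipher
setup and key handling below are placeholder assumptions):

  import java.io.ByteArrayInputStream;
  import java.io.ByteArrayOutputStream;
  import java.io.DataInput;
  import java.io.DataInputStream;
  import java.io.DataOutput;
  import java.io.DataOutputStream;
  import java.io.IOException;
  import java.security.GeneralSecurityException;
  import javax.crypto.Cipher;
  import javax.crypto.spec.SecretKeySpec;
  import org.apache.hadoop.io.Writable;

  // Hypothetical wrapper: serializes the wrapped Writable to a buffer,
  // encrypts the buffer in write(), and reverses the process in readFields().
  // "AES" with the provider's default mode/padding is used purely for
  // illustration; real key management and IV handling are omitted.
  // (A real implementation would also need a no-arg constructor and a way
  // to configure the key, since Hadoop instantiates Writables reflectively.)
  public class EncryptedWritableWrapper<T extends Writable> implements Writable {

    private final T wrapped;
    private final SecretKeySpec key;

    public EncryptedWritableWrapper(T wrapped, byte[] rawKey) {
      this.wrapped = wrapped;
      this.key = new SecretKeySpec(rawKey, "AES");
    }

    public T get() {
      return wrapped;
    }

    @Override
    public void write(DataOutput out) throws IOException {
      try {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        wrapped.write(new DataOutputStream(buf));
        Cipher cipher = Cipher.getInstance("AES");
        cipher.init(Cipher.ENCRYPT_MODE, key);
        byte[] encrypted = cipher.doFinal(buf.toByteArray());
        out.writeInt(encrypted.length);   // length prefix for readFields()
        out.write(encrypted);
      } catch (GeneralSecurityException e) {
        throw new IOException("encryption failed", e);
      }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
      try {
        byte[] encrypted = new byte[in.readInt()];
        in.readFully(encrypted);
        Cipher cipher = Cipher.getInstance("AES");
        cipher.init(Cipher.DECRYPT_MODE, key);
        byte[] plain = cipher.doFinal(encrypted);
        wrapped.readFields(new DataInputStream(new ByteArrayInputStream(plain)));
      } catch (GeneralSecurityException e) {
        throw new IOException("decryption failed", e);
      }
    }
  }

Because each record is encrypted on its own, the surrounding SequenceFile
machinery keeps handling sync markers and splits as usual; only the record
payloads become opaque.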

 

Dave

 

 

From: java8964 java8964 [mailto:java8964@hotmail.com] 
Sent: Sunday, February 10, 2013 3:50 AM
To: user@hadoop.apache.org
Subject: Question related to Decompressor interface

 

HI, 

 

Currently I am researching about options of encrypting the data in the
MapReduce, as we plan to use the Amazon EMR or EC2 services for our data.

 

I am thinking that the compression codec is good place to integrate with the
encryption logic, and I found out there are some people having the same idea
as mine.

 

I google around and found out this code:

 

https://github.com/geisbruch/HadoopCryptoCompressor/

 

It doesn't seem maintained any more, but it gave me a starting point. I
download the source code, and try to do some tests with it.

 

It doesn't work out of box. There are some bugs I have to fix to make it
work. I believe it contains 'AES' as an example algorithm.

 

But right now, I faced a problem when I tried to use it in my testing
MapReduer program. Here is the stack trace I got:

 

2013-02-08 23:16:47,038 INFO
org.apache.hadoop.io.compress.crypto.CryptoBasicDecompressor: buf length =
512, and offset = 0, length = -132967308

java.lang.IndexOutOfBoundsException

    at java.nio.ByteBuffer.wrap(ByteBuffer.java:352)

    at
org.apache.hadoop.io.compress.crypto.CryptoBasicDecompressor.setInput(Crypto
BasicDecompressor.java:100)

    at
org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecomp
ressorStream.java:97)

    at
org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.jav
a:83)

    at java.io.InputStream.read(InputStream.java:82)

    at
org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:209)

    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173)

    at
org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineReco
rdReader.java:114)

    at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTas
k.java:458)

    at
org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.
java:76)

    at
org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(Wrapp
edMapper.java:85)

    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)

    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:645)

    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)

    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)

    at java.security.AccessController.doPrivileged(Native Method)

    at javax.security.auth.Subject.doAs(Subject.java:396)

    at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.ja
va:1332)

    at org.apache.hadoop.mapred.Child.main(Child.java:262)

 

I know the error is thrown out of this custom CryptoBasicDecompressor class,
but I really have questions related to the interface it implemented:
Decompressor.

 

There is limited document about this interface, for example, when and how
the method setInput() will be invoked. If I want to write my own
Decompressor, what do these methods mean in the interface?

In the above case, I enable some debug information, you can see that in this
case, the byte[] array passed to setInput method, only have 512 as the
length, but the 3rd parameter of length passed in is a negative number:
-132967308. That caused the IndexOutOfBoundsException. If I check the
GzipDecompressor class of this method in the hadoop, the code will also
throw IndexOutoutBoundsException in this case, so this is a RuntimeException
case. Why it happened in my test case?

 

Here is my test case:

 

I have a simpel log text file about 700k. I encrypted it with above code
using 'AES'. I can encrypted and decrypted to get my original content. The
file name is foo.log.crypto, this file extension is registered to invoke
this CryptoBasicDecompressor in my testing hadoop using CDH4.1.2 release
(hadoop 2.0). Everything works as I expected. The CryptoBasicDecompressor is
invoked when the input file is foo.log.crypto, as you can see in the above
stack trace. But I don't know why the 3rd parameter (length) in setInput()
is a negative number at runtime.

 

In additional to it, I also have further questions related to use
Compressor/Decompressor to handle the encrypting/decrypting file. Ideally, I
wonder if the encrypting/decrypting can support file splits. This maybe
depends the algorithm we are using, is that right? If so, what kind of
algorithm can do that? I am not sure if it likes the compressor cases, most
of them do not support file split. If so, it maybe not good for my
requirements.

 

If we have a 1G file, encrypted in the Amazone S3, after it copied to the
HDFS of Amazon EMR, can each block of the date be decrypted independently by
each mapper, then passed to the underline RecorderReader to be processed
totally concurrently? Does any one do this before? If so, what encryption
algorithm does support it? Any idea?

 

Thanks

 

Yong


Re: Question related to Decompressor interface

Posted by George Datskos <ge...@jp.fujitsu.com>.
Hello

> Can someone share some idea what the Hadoop source code of class 
> org.apache.hadoop.io.compress.BlockDecompressorStream, method 
> rawReadInt() is trying to do here?

The BlockDecompressorStream class is used for block-based decompression 
(e.g. snappy).  Each chunk has a header indicating how many bytes it is. 
That header is obtained by the rawReadInt method so it is expected to 
return a non-negative value (since you can't have a negative length).
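
A minimal sketch of the length-prefixed framing being described (the class
and method names below are illustrative, not the actual Hadoop internals):

  import java.io.DataInputStream;
  import java.io.IOException;

  public class ChunkHeaderCheck {

    // Each chunk is preceded by a 4-byte big-endian length header; this is
    // the value BlockDecompressorStream.rawReadInt() reconstructs byte by byte.
    static int readChunkLength(DataInputStream in) throws IOException {
      int len = in.readInt();   // same as (b1 << 24) + (b2 << 16) + (b3 << 8) + b4
      if (len < 0) {
        // A negative value means these bytes were never a length header,
        // e.g. a stream that was not written with the matching block codec.
        throw new IOException("unexpected chunk length: " + len);
      }
      return len;
    }
  }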


George


RE: Question related to Decompressor interface

Posted by java8964 java8964 <ja...@hotmail.com>.
Can someone share some idea about what the method rawReadInt() in the Hadoop class org.apache.hadoop.io.compress.BlockDecompressorStream is trying to do here?
There is a comment in the code that this method shouldn't return a negative number, but my test file contains the following bytes in the input stream: 248, 19, 20, 116, which correspond to b1, b2, b3 and b4.
After the 4 bytes are read from the input stream, the return value will be a negative number, as
(b1 << 24) = -134217728
(b2 << 16) = 1245184
(b3 <<  8) = 5120
(b4 <<  0) = 116
I am not sure what the logic of this method is trying to do here; can anyone share some idea about it?
Thanks








  private int rawReadInt() throws IOException {
    int b1 = in.read();
    int b2 = in.read();
    int b3 = in.read();
    int b4 = in.read();
    if ((b1 | b2 | b3 | b4) < 0)
      throw new EOFException();
    return ((b1 << 24) + (b2 << 16) + (b3 << 8) + (b4 << 0));
  }
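
For what it's worth, a small standalone check (not part of the Hadoop source,
just reproducing the value from the bytes quoted above) shows how those four
bytes turn into the negative length seen in the log:

  import java.io.ByteArrayInputStream;
  import java.io.DataInputStream;
  import java.io.IOException;

  public class RawReadIntDemo {
    public static void main(String[] args) throws IOException {
      // The four bytes observed at the start of the chunk in the test file.
      byte[] observed = {(byte) 248, (byte) 19, (byte) 20, (byte) 116};
      DataInputStream in =
          new DataInputStream(new ByteArrayInputStream(observed));

      int b1 = in.read();   // 248
      int b2 = in.read();   // 19
      int b3 = in.read();   // 20
      int b4 = in.read();   // 116

      // Same composition as rawReadInt(): because b1 is >= 128, the sign bit
      // of the resulting int is set and the "length" comes out negative.
      int length = (b1 << 24) + (b2 << 16) + (b3 << 8) + (b4 << 0);
      System.out.println(length);   // prints -132967308
    }
  }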
From: java8964@hotmail.com
To: user@hadoop.apache.org
Subject: Question related to Decompressor interface
Date: Sat, 9 Feb 2013 15:49:31 -0500





HI, 
Currently I am researching about options of encrypting the data in the MapReduce, as we plan to use the Amazon EMR or EC2 services for our data.
I am thinking that the compression codec is good place to integrate with the encryption logic, and I found out there are some people having the same idea as mine.
I google around and found out this code:
https://github.com/geisbruch/HadoopCryptoCompressor/
It doesn't seem maintained any more, but it gave me a starting point. I download the source code, and try to do some tests with it.
It doesn't work out of box. There are some bugs I have to fix to make it work. I believe it contains 'AES' as an example algorithm.
But right now, I faced a problem when I tried to use it in my testing MapReduer program. Here is the stack trace I got:
2013-02-08 23:16:47,038 INFO org.apache.hadoop.io.compress.crypto.CryptoBasicDecompressor: buf length = 512, and offset = 0, length = -132967308java.lang.IndexOutOfBoundsException    at java.nio.ByteBuffer.wrap(ByteBuffer.java:352)    at org.apache.hadoop.io.compress.crypto.CryptoBasicDecompressor.setInput(CryptoBasicDecompressor.java:100)    at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:97)    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:83)    at java.io.InputStream.read(InputStream.java:82)    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:209)    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173)    at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:114)    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:458)    at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:76)    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:85)    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:139)    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:645)    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)    at org.apache.hadoop.mapred.Child$4.run(Child.java:268)    at java.security.AccessController.doPrivileged(Native Method)    at javax.security.auth.Subject.doAs(Subject.java:396)    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)    at org.apache.hadoop.mapred.Child.main(Child.java:262)
I know the error is thrown out of this custom CryptoBasicDecompressor class, but I really have questions related to the interface it implemented: Decompressor.
There is limited document about this interface, for example, when and how the method setInput() will be invoked. If I want to write my own Decompressor, what do these methods mean in the interface?In the above case, I enable some debug information, you can see that in this case, the byte[] array passed to setInput method, only have 512 as the length, but the 3rd parameter of length passed in is a negative number: -132967308. That caused the IndexOutOfBoundsException. If I check the GzipDecompressor class of this method in the hadoop, the code will also throw IndexOutoutBoundsException in this case, so this is a RuntimeException case. Why it happened in my test case?
Here is my test case:
I have a simpel log text file about 700k. I encrypted it with above code using 'AES'. I can encrypted and decrypted to get my original content. The file name is foo.log.crypto, this file extension is registered to invoke this CryptoBasicDecompressor in my testing hadoop using CDH4.1.2 release (hadoop 2.0). Everything works as I expected. The CryptoBasicDecompressor is invoked when the input file is foo.log.crypto, as you can see in the above stack trace. But I don't know why the 3rd parameter (length) in setInput() is a negative number at runtime.
In additional to it, I also have further questions related to use Compressor/Decompressor to handle the encrypting/decrypting file. Ideally, I wonder if the encrypting/decrypting can support file splits. This maybe depends the algorithm we are using, is that right? If so, what kind of algorithm can do that? I am not sure if it likes the compressor cases, most of them do not support file split. If so, it maybe not good for my requirements.
If we have a 1G file, encrypted in the Amazone S3, after it copied to the HDFS of Amazon EMR, can each block of the date be decrypted independently by each mapper, then passed to the underline RecorderReader to be processed totally concurrently? Does any one do this before? If so, what encryption algorithm does support it? Any idea?
Thanks
Yong
