Posted to common-user@hadoop.apache.org by Haijun Cao <ha...@kindsight.net> on 2008/06/04 23:59:26 UTC

compressed/encrypted file

If a file is compressed and encrypted, then is it still possible to split it and run mappers in parallel? 

Do people compress their files stored in hadoop? If yes, how do you go about processing them in parallel?

Thanks
Haijun

Re: compressed/encrypted file

Posted by Ted Dunning <te...@gmail.com>.
Yes, that is what I meant.

Not particularly good, but possibly the best we can do with hadoop (for a
while).  If hadoop handles the ticket for us in a secure way, then I would
feel better.
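
To make the scheme being debated concrete, here is a minimal sketch of the pattern: the submitter puts a short-lived credential into the job conf, and every task reads it back in configure(). The class and the property name "my.keyring.auth" are made up purely for illustration; this is the general idea under discussion, not code from an actual Hadoop API or from the input format described below.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class KeyringMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private String auth; // short-lived credential; hypothetical property name

      @Override
      public void configure(JobConf conf) {
        // The submitter calls conf.set("my.keyring.auth", token) at submission
        // time; each task reads it back here.  The job conf is readable by
        // other jobs on the node (Allen's objection below), which is why the
        // token has to expire quickly.
        auth = conf.get("my.keyring.auth");
      }

      public void map(LongWritable key, Text value,
          OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
        if (auth == null) {
          throw new IOException("no credential found in the job conf");
        }
        // ... use the credential to open the keyring, decrypt, and process ...
      }
    }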

On Thu, Jun 5, 2008 at 3:40 PM, Haijun Cao <ha...@kindsight.net> wrote:

>
>
> "If and when something like kerberos user authentication exists, then
> kerberos
> tickets may be the reasonable alternative for opening the keyring."
>
> Ted,
>
> Do you mean instead of "insert an auth key to the job conf", we can insert
> the ticket to the job conf? even though the job conf itself can be
> compromised, since the ticket is short lived, other people can't use the
> ticket to decrypt the file later. Is my understanding right?
>
>
>
> Thanks
> Haijun
>
> -----Original Message-----
> From: Ted Dunning [mailto:ted.dunning@gmail.com]
> Sent: Thursday, June 05, 2008 11:58 AM
> To: core-user@hadoop.apache.org
> Subject: Re: compressed/encrypted file
>
> Security and hadoop are not particularly compatible concepts.  Things may
> improve when user authentication exists.  The lack of security on job confs
> is the major motivation for making sure the auth is time limited.  If and
> when something like kerberos user authentication exists, then kerberos
> tickets may be the reasonable alternative for opening the keyring.
>
> Can you suggest an alternative way to communicate a secret to hadoop tasks
> short of embedding it into source code?
>
> On Thu, Jun 5, 2008 at 11:46 AM, Allen Wittenauer <aw...@yahoo-inc.com>
> wrote:
>
> > On 6/5/08 11:38 AM, "Ted Dunning" <te...@gmail.com> wrote:
> > > We use encryption on log files using standard AES.  I wrote an input
> > format
> > > to deal with it.
> > >
> > > Key distribution should be done better than we do it.  My preference
> > would
> > > be to insert an auth key into the job conf which is then used by the
> > input
> > > to open a well known keyring via an API that prevents auths from
> > surviving
> > > for long term.
> >
> >    This sounds like it opens the door for key stealing in a
> > multi-user/static job tracker system, since the job conf is readable by
> all
> > jobs running on the same machine.
> >
> >
>
>
> --
> ted
>



-- 
ted

RE: compressed/encrypted file

Posted by Haijun Cao <ha...@kindsight.net>.
HOD may be too heavyweight for us, with a small cluster and a small number of users.

At the Hadoop Summit I heard that Kerberos authentication is in the pipeline; is there a place I can check on its progress?
It seems that the authentication/authorization work is approached from the file system's perspective, not from the perspective of file encryption and key distribution among map/reduce tasks.


"If and when something like kerberos user authentication exists, then kerberos
tickets may be the reasonable alternative for opening the keyring."

Ted, 

Do you mean that, instead of "inserting an auth key into the job conf", we could insert the ticket into the job conf? Even though the job conf itself can be compromised, since the ticket is short-lived, other people can't use the ticket to decrypt the file later. Is my understanding right?



Thanks
Haijun

-----Original Message-----
From: Ted Dunning [mailto:ted.dunning@gmail.com] 
Sent: Thursday, June 05, 2008 11:58 AM
To: core-user@hadoop.apache.org
Subject: Re: compressed/encrypted file

Security and hadoop are not particularly compatible concepts.  Things may
improve when user authentication exists.  The lack of security on job confs
is the major motivation for making sure the auth is time limited.  If and
when something like kerberos user authentication exists, then kerberos
tickets may be the reasonable alternative for opening the keyring.

Can you suggest an alternative way to communicate a secret to hadoop tasks
short of embedding it into source code?

On Thu, Jun 5, 2008 at 11:46 AM, Allen Wittenauer <aw...@yahoo-inc.com> wrote:

> On 6/5/08 11:38 AM, "Ted Dunning" <te...@gmail.com> wrote:
> > We use encryption on log files using standard AES.  I wrote an input
> format
> > to deal with it.
> >
> > Key distribution should be done better than we do it.  My preference
> would
> > be to insert an auth key into the job conf which is then used by the
> input
> > to open a well known keyring via an API that prevents auths from
> surviving
> > for long term.
>
>    This sounds like it opens the door for key stealing in a
> multi-user/static job tracker system, since the job conf is readable by all
> jobs running on the same machine.
>
>


-- 
ted

Re: compressed/encrypted file

Posted by Allen Wittenauer <aw...@yahoo-inc.com>.
On 6/5/08 11:57 AM, "Ted Dunning" <te...@gmail.com> wrote:
> Can you suggest an alternative way to communicate a secret to hadoop tasks
> short of embedding it into source code?

    This is one of the reasons why we use HOD: the job isolation it provides
helps prevent data leaks from one job to the next.


Re: compressed/encrypted file

Posted by Ted Dunning <te...@gmail.com>.
Security and hadoop are not particularly compatible concepts.  Things may
improve when user authentication exists.  The lack of security on job confs
is the major motivation for making sure the auth is time limited.  If and
when something like kerberos user authentication exists, then kerberos
tickets may be the reasonable alternative for opening the keyring.

Can you suggest an alternative way to communicate a secret to hadoop tasks
short of embedding it into source code?

On Thu, Jun 5, 2008 at 11:46 AM, Allen Wittenauer <aw...@yahoo-inc.com> wrote:

> On 6/5/08 11:38 AM, "Ted Dunning" <te...@gmail.com> wrote:
> > We use encryption on log files using standard AES.  I wrote an input
> format
> > to deal with it.
> >
> > Key distribution should be done better than we do it.  My preference
> would
> > be to insert an auth key into the job conf which is then used by the
> input
> > to open a well known keyring via an API that prevents auths from
> surviving
> > for long term.
>
>    This sounds like it opens the door for key stealing in a
> multi-user/static job tracker system, since the job conf is readable by all
> jobs running on the same machine.
>
>


-- 
ted

Re: compressed/encrypted file

Posted by Allen Wittenauer <aw...@yahoo-inc.com>.
On 6/5/08 11:38 AM, "Ted Dunning" <te...@gmail.com> wrote:
> We use encryption on log files using standard AES.  I wrote an input format
> to deal with it.
> 
> Key distribution should be done better than we do it.  My preference would
> be to insert an auth key into the job conf which is then used by the input
> to open a well known keyring via an API that prevents auths from surviving
> for long term.

   This sounds like it opens the door for key stealing in a
multi-user/static job tracker system, since the job conf is readable by all
jobs running on the same machine.


Re: compressed/encrypted file

Posted by Ted Dunning <te...@gmail.com>.
We use encryption on log files using standard AES.  I wrote an input format
to deal with it.

Key distribution should be done better than we do it.  My preference would
be to insert an auth key into the job conf, which is then used by the input
format to open a well-known keyring via an API that prevents auths from
surviving long term.
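
The input format itself isn't shown in the thread, but the decryption half of such a format is easy to sketch: wrap the stream the record reader opens in a javax.crypto CipherInputStream. The class below is purely illustrative (not the code described above), and how the key and IV actually reach the task is exactly the keyring/job conf question being discussed.

    import java.io.InputStream;
    import javax.crypto.Cipher;
    import javax.crypto.CipherInputStream;
    import javax.crypto.spec.IvParameterSpec;
    import javax.crypto.spec.SecretKeySpec;

    public class AesStreams {

      /**
       * Wraps a raw stream (e.g. the FSDataInputStream a record reader opens)
       * so that every byte read through it is decrypted on the fly.  AES/CBC
       * with PKCS5 padding is assumed here; a real deployment must match
       * whatever mode and padding the files were encrypted with.
       */
      public static InputStream decrypting(InputStream raw, byte[] key, byte[] iv)
          throws java.security.GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.DECRYPT_MODE,
                    new SecretKeySpec(key, "AES"),
                    new IvParameterSpec(iv));
        return new CipherInputStream(raw, cipher);
      }
    }

Note that a whole-file cipher stream like this is no more splittable than gzip: each encrypted file still goes to a single mapper.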

On Thu, Jun 5, 2008 at 11:02 AM, Haijun Cao <ha...@kindsight.net> wrote:

> Arun/John, Thanks for the update.
>
> For security reasons, we also need to encrypt the file, there is no
> support for encryption currently, so we will have to roll our own.
> Again, I'd like to know if anybody here do encryption, if yes, what
> algorithm and how key/password distribution is handled.
>
> Thanks
> Haijun
>
> -----Original Message-----
> From: John Heidemann [mailto:johnh@isi.edu]
> Sent: Thursday, June 05, 2008 9:15 AM
> To: core-user@hadoop.apache.org
> Subject: Re: compressed/encrypted file
>
> On Wed, 04 Jun 2008 15:52:55 PDT, Arun C Murthy wrote:
> >Haijun,
> >
> >On Jun 4, 2008, at 3:45 PM, Haijun Cao wrote:
> >
> >>
> >> Mile, Thanks.
> >>
> >> "If your inputs to maps are compressed, then you don't get any
> >> automatic
> >> assignment of mappers to your data:  each gzipped file gets assigned
> a
> >> mapper." <--- this is the case I am talking about.
> >>
> >
> >With the current compression codecs available in Hadoop (zlib/gzip/
> >lzo) it is not possible to split up a compressed file and then process
> >it in a parallel manner. However once we get bzip2 to work we  could
> >split up the files as you are describing...
>
>
> We are actually working on a bzip2 codec, hopefully with split support,
> so hopefully something will be here by the end of summer.
>
>   -John Heidemann
>



-- 
ted

RE: compressed/encrypted file

Posted by Haijun Cao <ha...@kindsight.net>.
Arun/John, thanks for the update.

For security reasons, we also need to encrypt the files. There is no
support for encryption currently, so we will have to roll our own.
Again, I'd like to know if anybody here does encryption; if so, what
algorithm do you use, and how is key/password distribution handled?

Thanks
Haijun 

-----Original Message-----
From: John Heidemann [mailto:johnh@isi.edu] 
Sent: Thursday, June 05, 2008 9:15 AM
To: core-user@hadoop.apache.org
Subject: Re: compressed/encrypted file

On Wed, 04 Jun 2008 15:52:55 PDT, Arun C Murthy wrote: 
>Haijun,
>
>On Jun 4, 2008, at 3:45 PM, Haijun Cao wrote:
>
>>
>> Mile, Thanks.
>>
>> "If your inputs to maps are compressed, then you don't get any
>> automatic
>> assignment of mappers to your data:  each gzipped file gets assigned
a
>> mapper." <--- this is the case I am talking about.
>>
>
>With the current compression codecs available in Hadoop (zlib/gzip/
>lzo) it is not possible to split up a compressed file and then process
>it in a parallel manner. However once we get bzip2 to work we  could
>split up the files as you are describing...


We are actually working on a bzip2 codec, hopefully with split support,
so hopefully something will be here by the end of summer.

   -John Heidemann

Re: compressed/encrypted file

Posted by Stuart Sierra <ma...@stuartsierra.com>.
On Wed, Jun 4, 2008 at 6:52 PM, Arun C Murthy <ar...@yahoo-inc.com> wrote:
> With the current compression codecs available in Hadoop (zlib/gzip/lzo) it
> is not possible to split up a compressed file and then process it in a
> parallel manner. However once we get bzip2 to work we could split up the
> files as you are describing...

If it helps, on *nix you can split a compressed text file like this:
    gunzip -c original.txt.gz | split -a 5 -d -C 16777216 - output.txt.

Replace 16777216 (16 MB) with the maximum number of bytes you want per
split.  This is guaranteed to split only on line breaks.  You get
files named output.txt.00000, output.txt.00001, and so on.

-Stuart

Re: compressed/encrypted file

Posted by John Heidemann <jo...@isi.edu>.
On Wed, 04 Jun 2008 15:52:55 PDT, Arun C Murthy wrote: 
>Haijun,
>
>On Jun 4, 2008, at 3:45 PM, Haijun Cao wrote:
>
>>
>> Mile, Thanks.
>>
>> "If your inputs to maps are compressed, then you don't get any
>> automatic
>> assignment of mappers to your data:  each gzipped file gets assigned a
>> mapper." <--- this is the case I am talking about.
>>
>
>With the current compression codecs available in Hadoop (zlib/gzip/
>lzo) it is not possible to split up a compressed file and then process
>it in a parallel manner. However once we get bzip2 to work we  could
>split up the files as you are describing...


We are actually working on a bzip2 codec, hopefully with split support,
so something should be here by the end of the summer.

   -John Heidemann

Re: compressed/encrypted file

Posted by Arun C Murthy <ar...@yahoo-inc.com>.
Haijun,

On Jun 4, 2008, at 3:45 PM, Haijun Cao wrote:

>
> Mile, Thanks.
>
> "If your inputs to maps are compressed, then you don't get any  
> automatic
> assignment of mappers to your data:  each gzipped file gets assigned a
> mapper." <--- this is the case I am talking about.
>

With the current compression codecs available in Hadoop (zlib/gzip/lzo)
it is not possible to split up a compressed file and then process it in
parallel. However, once we get bzip2 working, we could split up the
files as you are describing...

Arun
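
For reference, the non-splittability Arun describes comes from the input format checking whether a compression codec claims a file before splitting it. Here is a minimal sketch of that check, assuming the CompressionCodecFactory API of roughly that era; the helper class and method name are made up for illustration.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.mapred.JobConf;

    public class SplitCheck {
      // True if no compression codec claims the file, i.e. the input format is
      // free to carve it into multiple splits; false for .gz, .deflate, etc.
      public static boolean looksSplittable(JobConf conf, Path file) {
        CompressionCodecFactory codecs = new CompressionCodecFactory(conf);
        CompressionCodec codec = codecs.getCodec(file); // matched by file suffix
        return codec == null;
      }
    }

If a codec matches, the whole file becomes one split and therefore one mapper, which is the behaviour Miles describes elsewhere in the thread.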

> Haijun
>
>
> -----Original Message-----
> From: milesosb@gmail.com [mailto:milesosb@gmail.com] On Behalf Of  
> Miles
> Osborne
> Sent: Wednesday, June 04, 2008 3:07 PM
> To: core-user@hadoop.apache.org
> Subject: Re: compressed/encrypted file
>
> You can compress / decompress at many points:
>
> --prior to mapping
>
> --after mapping
>
> --after reducing
>
> (I've been experimenting with all these options; we have been crawling
> blogs
> every day since Feb and we store on DFS compressed sets of posts)
>
> If your inputs to maps are compressed, then you don't get any  
> automatic
> assignment of mappers to your data:  each gzipped file gets assigned a
> mapper.
>
> But otherwise, it is all pretty transparent.
>
> Miles
>
> 2008/6/4 Haijun Cao <ha...@kindsight.net>:
>
>>
>> If a file is compressed and encrypted, then is it still possible to
> split
>> it and run mappers in parallel?
>>
>> Do people compress their files stored in hadoop? If yes, how do  
>> you go
>> about processing them in parallel?
>>
>> Thanks
>> Haijun
>>
>
>
>
> -- 
> The University of Edinburgh is a charitable body, registered in
> Scotland,
> with registration number SC005336.


RE: compressed/encrypted file

Posted by Haijun Cao <ha...@kindsight.net>.
Miles, thanks.

"If your inputs to maps are compressed, then you don't get any automatic
assignment of mappers to your data:  each gzipped file gets assigned a
mapper." <--- this is the case I am talking about.

Haijun


-----Original Message-----
From: milesosb@gmail.com [mailto:milesosb@gmail.com] On Behalf Of Miles
Osborne
Sent: Wednesday, June 04, 2008 3:07 PM
To: core-user@hadoop.apache.org
Subject: Re: compressed/encrypted file

You can compress / decompress at many points:

--prior to mapping

--after mapping

--after reducing

(I've been experimenting with all these options; we have been crawling
blogs
every day since Feb and we store on DFS compressed sets of posts)

If your inputs to maps are compressed, then you don't get any automatic
assignment of mappers to your data:  each gzipped file gets assigned a
mapper.

But otherwise, it is all pretty transparent.

Miles

2008/6/4 Haijun Cao <ha...@kindsight.net>:

>
> If a file is compressed and encrypted, then is it still possible to
split
> it and run mappers in parallel?
>
> Do people compress their files stored in hadoop? If yes, how do you go
> about processing them in parallel?
>
> Thanks
> Haijun
>



-- 
The University of Edinburgh is a charitable body, registered in
Scotland,
with registration number SC005336.

Re: compressed/encrypted file

Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
You can compress / decompress at many points:

--prior to mapping

--after mapping

--after reducing

(I've been experimenting with all these options; we have been crawling blogs
every day since February, and we store compressed sets of posts on DFS.)

If your inputs to maps are compressed, then you don't get any automatic
assignment of mappers to your data:  each gzipped file gets assigned a
mapper.

But otherwise, it is all pretty transparent.

Miles
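
For anyone wiring this up, here is a minimal sketch of how those three points can be configured, assuming the JobConf API of roughly that era; the helper class is made up for illustration, and GzipCodec can be swapped for LzoCodec or another codec to taste.

    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class CompressionSettings {
      public static void configure(JobConf conf) {
        // 1. Prior to mapping: nothing to set -- TextInputFormat picks a codec
        //    by file suffix (.gz etc.) and decompresses transparently, at the
        //    cost of one mapper per compressed file.

        // 2. After mapping: compress the intermediate map output that is
        //    shuffled to the reducers.
        conf.setCompressMapOutput(true);
        conf.setMapOutputCompressorClass(GzipCodec.class);

        // 3. After reducing: compress the final job output written to DFS.
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);
      }
    }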

2008/6/4 Haijun Cao <ha...@kindsight.net>:

>
> If a file is compressed and encrypted, then is it still possible to split
> it and run mappers in parallel?
>
> Do people compress their files stored in hadoop? If yes, how do you go
> about processing them in parallel?
>
> Thanks
> Haijun
>



-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.