Posted to hdfs-dev@hadoop.apache.org by "Chen, Haifeng" <ha...@intel.com> on 2016/02/02 06:29:25 UTC

RE: Hadoop encryption module as Apache Chimera incubator project

Thanks to all the folks who provided feedback and participated in the discussions.

@Owen, do you still have any concerns about going forward in the direction of Apache Commons (or another option, such as a TLP)?

Thanks,
Haifeng

-----Original Message-----
From: Chen, Haifeng [mailto:haifeng.chen@intel.com] 
Sent: Saturday, January 30, 2016 10:52 AM
To: hdfs-dev@hadoop.apache.org
Subject: RE: Hadoop encryption module as Apache Chimera incubator project

>> I believe encryption is becoming a core part of Hadoop. I think that 
>> moving core components out of Hadoop is bad from a project management perspective.

> Although it's certainly true that encryption capabilities (in HDFS, YARN, etc.) are becoming core to Hadoop, I don't think that should really influence whether or not the non-Hadoop-specific encryption routines should be part of the Hadoop code base, or part of the code base of another project that Hadoop depends on. If Chimera had existed as a library hosted at ASF when HDFS encryption was first developed, HDFS probably would have just added that as a dependency and been done with it. I don't think we would've copy/pasted the code for Chimera into the Hadoop code base.

Agree with ATM. I also want to add a clarification. I agree that encryption capabilities are becoming core to Hadoop, but this effort is about putting common, shared encryption routines, such as the crypto stream implementations, into a scope where they can be widely shared across the Apache ecosystem. It doesn't move Hadoop encryption out of Hadoop (that is not possible).
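
For illustration, here is a minimal sketch of the kind of crypto stream routine in question, written against plain JDK classes (this is not Chimera's actual API; a library like Chimera would back such a stream with an openssl-based JNI cipher instead of the default JCE provider):

    import java.io.InputStream;
    import javax.crypto.Cipher;
    import javax.crypto.CipherInputStream;
    import javax.crypto.spec.IvParameterSpec;
    import javax.crypto.spec.SecretKeySpec;

    public class CryptoStreams {
        // Wrap any InputStream with AES/CTR decryption. The point of a
        // shared library is that HDFS, Spark, etc. could all reuse one
        // such implementation rather than each carrying its own copy.
        public static InputStream decrypting(InputStream in, byte[] key, byte[] iv)
                throws Exception {
            Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE,
                    new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            return new CipherInputStream(in, cipher);
        }
    }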

I agree that making it a separately and independently released project within Hadoop would go a step beyond the existing approach and solve some issues (such as the libhadoop.so problem). Frankly speaking, though, I don't think it is the best option we can try. I also expect that an independently released project within Hadoop core would complicate Hadoop's existing release model.

Thanks,
Haifeng

-----Original Message-----
From: Aaron T. Myers [mailto:atm@cloudera.com]
Sent: Friday, January 29, 2016 9:51 AM
To: hdfs-dev@hadoop.apache.org
Subject: Re: Hadoop encryption module as Apache Chimera incubator project

On Wed, Jan 27, 2016 at 11:31 AM, Owen O'Malley <om...@apache.org> wrote:

> I believe encryption is becoming a core part of Hadoop. I think that 
> moving core components out of Hadoop is bad from a project management perspective.
>

Although it's certainly true that encryption capabilities (in HDFS, YARN,
etc.) are becoming core to Hadoop, I don't think that should really influence whether or not the non-Hadoop-specific encryption routines should be part of the Hadoop code base, or part of the code base of another project that Hadoop depends on. If Chimera had existed as a library hosted at ASF when HDFS encryption was first developed, HDFS probably would have just added that as a dependency and been done with it. I don't think we would've copy/pasted the code for Chimera into the Hadoop code base.


> To put it another way, a bug in the encryption routines will likely 
> become a security problem that security@hadoop needs to hear about.
>
> I don't think adding a separate project in the middle of that
> communication chain is a good idea. The same applies to data corruption
> problems, and so on...
>

Isn't the same true of all the libraries that Hadoop currently depends upon? If the commons-httpclient library (or commons-codec, or commons-io, or guava, or...) has a security vulnerability, we need to know about it so that we can update our dependency to a fixed version. This case doesn't seem materially different than that.


>
>
> > It may be good to keep it at a generalized place (as in the discussion,
> > we thought that place could be Apache Commons).
>
>
> Apache Commons is a collection of *Java* projects, so Chimera as a 
> JNI-based library isn't a natural fit.
>

Could very well be that Apache Commons's charter would preclude Chimera.
You probably know better than I do about that.


> Furthermore, Apache Commons doesn't
> have its own security list so problems will go to the generic 
> security@apache.org.
>

That seems easy enough to remedy, if they wanted to, and besides I'm not sure why that would influence this discussion. In my experience projects that don't have a separate security@project.a.o mailing list tend to just handle security issues on their private@project.a.o mailing list, which seems fine to me.


>
> Why do you think that Apache Commons is a better home than Hadoop?
>

I'm certainly not at all wedded to Apache Commons, that just seemed like a natural place to put it to me. Could be that a brand new TLP might make more sense.

I *do* think that if other non-Hadoop projects want to make use of Chimera, which as I understand it is the goal that started this thread, then Chimera should exist outside of Hadoop so that:

a) Projects that have nothing to do with Hadoop can just depend directly on Chimera, which has nothing Hadoop-specific in there.

b) The Hadoop project doesn't have to export/maintain/concern itself with yet another publicly-consumed interface.

c) Chimera can have its own (presumably much faster) release cadence completely separate from Hadoop.

--
Aaron T. Myers
Software Engineer, Cloudera

RE: Hadoop encryption module as Apache Chimera incubator project

Posted by "Zheng, Kai" <ka...@intel.com>.
Encryption and security are surely a good starting point as the current focus. Considering other things like compression would help determine how to envision, position, and lay out the new project (on the Hadoop side, in Apache Commons, or as a new TLP) containing the candidate modules. But yes, at the beginning, only encryption.

Regards,
Kai

-----Original Message-----
From: Chen, Haifeng [mailto:haifeng.chen@intel.com] 
Sent: Thursday, February 04, 2016 10:30 AM
To: hdfs-dev@hadoop.apache.org
Subject: RE: Hadoop encryption module as Apache Chimera incubator project

>> Let's do one step at a time. There is a clear need for common encryption, and let's focus on making that happen.
Strongly agree.

-----Original Message-----
From: Reynold Xin [mailto:rxin@databricks.com]
Sent: Thursday, February 4, 2016 8:50 AM
To: hdfs-dev@hadoop.apache.org
Subject: Re: Hadoop encryption module as Apache Chimera incubator project

Let's do one step at a time. There is a clear need for common encryption, and let's focus on making that happen.

On Wed, Feb 3, 2016 at 4:48 PM, Zheng, Kai <ka...@intel.com> wrote:

> I thought this discussion was going to switch to common-dev@ now?
>
> >> Would it make sense to also package some of the compression
> >> libraries, and maybe some of the text processing from MapReduce?
> >> Evolving some of this code to a common library with few/no
> >> dependencies would be generally useful. As a subproject, it could
> >> have a broader scope that could evolve into a viable TLP.
>
> Sounds like a great idea to make the potential TLP more compelling! It
> could be organized like Apache Commons: security, compression, and other
> common text-related things could live in independent modules. Perhaps
> Hadoop conf could also be considered. These modules could rely on a
> common utility module. It could still be Hadoop-backed, and eventually
> we would have a good place for some Hadoop common code to move into,
> benefiting and impacting an even broader scope than Hadoop itself.
>
> Regards,
> Kai
>
> -----Original Message-----
> From: Chris Douglas [mailto:cdouglas@apache.org]
> Sent: Thursday, February 04, 2016 7:26 AM
> To: hdfs-dev@hadoop.apache.org
> Subject: Re: Hadoop encryption module as Apache Chimera incubator 
> project
>
> I went through the repository, and now understand the reasoning that 
> would locate this code in Apache Commons. This isn't proposing to 
> extract much of the implementation and it takes none of the 
> integration. It's limited to interfaces to crypto libraries and 
> streams/configuration. It might be a reasonable fit for commons-codec, 
> but that's a pretty sparse library and driving the release cadence 
> might be more complicated. It'd be worth discussing on their lists (please also CC common-dev@).
>
> Chimera would be a boutique TLP, unless we wanted to draw out more of 
> the integration and tooling. Is that a goal you're interested in pursuing?
> There's a tension between keeping this focused and including enough 
> functionality to make it viable as an independent component. By way of 
> example, Hadoop's common project requires too many dependencies and 
> carries too much historical baggage for other projects to rely on.
> I agree with Colin/Steve: we don't want this to grow into another 
> guava-like dependency that creates more work in conflicts than it 
> saves in implementation...
>
> Would it make sense to also package some of the compression libraries, 
> and maybe some of the text processing from MapReduce? Evolving some of 
> this code to a common library with few/no dependencies would be 
> generally useful. As a subproject, it could have a broader scope that 
> could evolve into a viable TLP. If the encryption libraries are the 
> only ones you're interested in pulling out, then Apache Commons does 
> seem like a better target than a separate project. -C
>
>
> On Wed, Feb 3, 2016 at 1:49 AM, Chris Douglas <cd...@apache.org> wrote:
> > On Wed, Feb 3, 2016 at 12:48 AM, Gangumalla, Uma 
> > <um...@intel.com> wrote:
> >>>From the standpoint of a shared, fundamental piece of code like this,
> >>>I do think Apache Commons might be the best direction to try as a
> >>>first effort. In that direction, we would still need to work with the
> >>>Apache Commons community on buying in and accepting the proposal.
> >> Makes sense.
> >
> > Makes sense how?
> >
> >> For this we should define independent release cycles for the
> >> project, and it would just be placed under the Hadoop tree if we all
> >> conclude with this option at the end.
> >
> > Yes.
> >
> >> [Chris]
> >>>If Chimera is not successful as an independent project or stalls, 
> >>>Hadoop and/or Spark and/or $project will have to reabsorb it as 
> >>>maintainers.
> >>>
> >> I am not so strong on this point. If we assume the project would be
> >> unsuccessful, it can be unsuccessful (less maintained) even under
> >> Hadoop. But if other projects depend on this piece, they would get
> >> less support. Of course, right now we feel this piece of code is very
> >> important, and we feel (expect) it can be successful as an independent
> >> project, irrespective of whether it is a separate project outside
> >> Hadoop or inside.
> >> So I feel this point would not really influence how we judge the
> >> discussion.
> >
> > Sure; code can idle anywhere, but that wasn't the point I was after.
> > You propose to extract code from Hadoop, but if Chimera fails then 
> > what recourse do we have among the other projects taking a 
> > dependency on it? Splitting off another project is feasible, but 
> > Chimera should be sustainable before this PMC can divest itself of 
> > responsibility for security libraries. That's a pretty low bar.
> >
> > Bundling the library with the jar is helpful; I've used that before.
> > It should prefer (updated) libraries from the environment, if 
> > configured. Otherwise it's a pain (or impossible) for ops to patch 
> > security bugs. -C
> >
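
For illustration, a rough sketch of that "environment first, bundled copy as fallback" load order (the library name and resource layout here are hypothetical, not what Chimera actually ships):

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public final class NativeLoader {
        // Prefer a library installed on the machine, so ops can patch
        // security bugs in it; only fall back to the copy in the jar.
        static void load() throws IOException {
            try {
                System.loadLibrary("chimera");  // from java.library.path
                return;
            } catch (UnsatisfiedLinkError e) {
                // not installed locally; fall through to the bundled copy
            }
            String key = System.getProperty("os.name").replace(' ', '_')
                    + "-" + System.getProperty("os.arch");
            String resource = "/native/" + key + "/libchimera.so";
            try (InputStream in = NativeLoader.class.getResourceAsStream(resource)) {
                if (in == null) {
                    throw new UnsatisfiedLinkError("no bundled library: " + resource);
                }
                Path tmp = Files.createTempFile("libchimera", ".so");
                Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
                System.load(tmp.toAbsolutePath().toString());
            }
        }
    }
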
> >>>-----Original Message-----
> >>>From: Colin P. McCabe [mailto:cmccabe@apache.org]
> >>>Sent: Wednesday, February 3, 2016 4:56 AM
> >>>To: hdfs-dev@hadoop.apache.org
> >>>Subject: Re: Hadoop encryption module as Apache Chimera incubator 
> >>>project
> >>>
> >>>It's great to see interest in improving this functionality.  I 
> >>>think Chimera could be successful as an Apache project.  I don't 
> >>>have a strong opinion one way or the other as to whether it belongs 
> >>>as part of Hadoop or separate.
> >>>
> >>>I do think there will be some challenges splitting this
> >>>functionality out into a separate jar, because of the way our
> >>>CLASSPATH works right now.
> >>>For example, let's say that Hadoop depends on Chimera 1.2 and Spark 
> >>>depends on Chimera 1.1.  Now Spark jobs have two different versions 
> >>>fighting it out on the classpath, similar to the situation with 
> >>>Guava and other libraries.  Perhaps if Chimera adopts a policy of 
> >>>strong backwards compatibility, we can just always use the latest 
> >>>jar, but it still seems likely that there will be problems.  There 
> >>>are various classpath isolation ideas that could help here, but 
> >>>they are big projects in their own right and we don't have a clear 
> >>>timeline for them.  If this does end up being a separate jar, we 
> >>>may need to shade it to avoid all these issues.
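
A quick way to see which copy "won" when two versions are on the classpath (the class name below is hypothetical):

    public class WhichJar {
        public static void main(String[] args) throws Exception {
            // Prints the jar a class was actually loaded from; with Chimera
            // 1.1 and 1.2 both present, only one location ever shows up.
            Class<?> c = Class.forName("com.intel.chimera.cipher.Cipher");
            System.out.println(
                    c.getProtectionDomain().getCodeSource().getLocation());
        }
    }
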
> >>>
> >>>Bundling the JNI glue code in the jar itself is an interesting 
> >>>idea, which we have talked about before for libhadoop.so.  It 
> >>>doesn't really have anything to do with the question of TLP vs.
> >>>non-TLP, of course.
> >>>We could do that refactoring in Hadoop itself.  The really 
> >>>complicated part of bundling JNI code in a jar is that you need to 
> >>>create jars for every cross product of (JVM version, openssl
> >>>version, operating system).
> >>>For example, you have the RHEL6 build for openJDK7 using openssl 1.0.1e.
> >>>If you change any one thing -- say, change openJDK7 to Oracle JDK8 --
> >>>then you might need to rebuild.  And certainly using Ubuntu would 
> >>>be a rebuild.  And so forth.  This kind of clashes with Maven's 
> >>>philosophy of pulling prebuilt jars from the internet.
> >>>
> >>>Kai Zheng's question about whether we would bundle openSSL's 
> >>>libraries is a good one.  Given the high rate of new 
> >>>vulnerabilities discovered in that library, it seems like bundling 
> >>>would require Hadoop users and vendors to update very frequently, 
> >>>much more frequently than Hadoop is traditionally updated.  So 
> >>>probably we would not choose to bundle openssl.
> >>>
> >>>best,
> >>>Colin
> >>>
> >>>On Tue, Feb 2, 2016 at 12:29 AM, Chris Douglas 
> >>><cd...@apache.org>
> >>>wrote:
> >>>> As a subproject of Hadoop, Chimera could maintain its own cadence.
> >>>> There's also no reason why it should maintain dependencies on 
> >>>> other parts of Hadoop, if those are separable. How is this 
> >>>> solution inadequate?
> >>>>
> >>>> If Chimera is not successful as an independent project or stalls, 
> >>>> Hadoop and/or Spark and/or $project will have to reabsorb it as 
> >>>> maintainers. Projects have high mortality in early life, and a 
> >>>> fight over inheritance/maintenance is something we'd like to avoid.
> >>>> If, on the other hand, it develops enough of a community where it 
> >>>> is obviously viable, then we can (and should) break it out as a 
> >>>> TLP (as we have before). If other Apache projects take a 
> >>>> dependency on Chimera, we're open to adding them to security@hadoop.
> >>>>
> >>>> Unlike Yetus, which was largely rewritten right before it was 
> >>>> made into a TLP, security in Hadoop has a complicated pedigree.
> >>>> If Chimera eventually becomes a TLP, it seems fair to include 
> >>>> those who work on it while it is a subproject. Declared upfront, 
> >>>> that criterion is fairer than any post hoc justification, and 
> >>>> will lead to a more accurate account of its community than a 
> >>>> subset of the Hadoop PMC/committers that volunteer. -C

RE: Hadoop encryption module as Apache Chimera incubator project

Posted by "Chen, Haifeng" <ha...@intel.com>.
>> Let's do one step at a time. There is a clear need for common encryption, and let's focus on making that happen.
Strongly agree.


Re: Hadoop encryption module as Apache Chimera incubator project

Posted by Reynold Xin <rx...@databricks.com>.
Let's do one step at a time. There is a clear need for common encryption,
and let's focus on making that happen.


RE: Hadoop encryption module as Apache Chimera incubator project

Posted by "Zheng, Kai" <ka...@intel.com>.
I thought this discussion was going to switch to common-dev@ now?

>> Would it make sense to also package some of the compression libraries, and maybe some of the text processing from MapReduce? Evolving some of this code to a common library with few/no dependencies would be generally useful. As a subproject, it could have a broader scope that could evolve into a viable TLP.

Sounds like a great idea to make the potential TLP more compelling! It could be organized like Apache Commons: security, compression, and other common text-related things could live in independent modules. Perhaps Hadoop conf could also be considered. These modules could rely on a common utility module. It could still be Hadoop-backed, and eventually we would have a good place for some Hadoop common code to move into, benefiting and impacting an even broader scope than Hadoop itself.

Regards,
Kai

-----Original Message-----
From: Chris Douglas [mailto:cdouglas@apache.org] 
Sent: Thursday, February 04, 2016 7:26 AM
To: hdfs-dev@hadoop.apache.org
Subject: Re: Hadoop encryption module as Apache Chimera incubator project

I went through the repository, and now understand the reasoning that would locate this code in Apache Commons. This isn't proposing to extract much of the implementation and it takes none of the integration. It's limited to interfaces to crypto libraries and streams/configuration. It might be a reasonable fit for commons-codec, but that's a pretty sparse library and driving the release cadence might be more complicated. It'd be worth discussing on their lists (please also CC common-dev@).

Chimera would be a boutique TLP, unless we wanted to draw out more of the integration and tooling. Is that a goal you're interested in pursuing? There's a tension between keeping this focused and including enough functionality to make it viable as an independent component. By way of example, Hadoop's common project requires too many dependencies and carries too much historical baggage for other projects to rely on.
I agree with Colin/Steve: we don't want this to grow into another guava-like dependency that creates more work in conflicts than it saves in implementation...

Would it make sense to also package some of the compression libraries, and maybe some of the text processing from MapReduce? Evolving some of this code to a common library with few/no dependencies would be generally useful. As a subproject, it could have a broader scope that could evolve into a viable TLP. If the encryption libraries are the only ones you're interested in pulling out, then Apache Commons does seem like a better target than a separate project. -C


On Wed, Feb 3, 2016 at 1:49 AM, Chris Douglas <cd...@apache.org> wrote:
> On Wed, Feb 3, 2016 at 12:48 AM, Gangumalla, Uma 
> <um...@intel.com> wrote:
>>>From the standpoint of a shared, fundamental piece of code like this, 
>>>I do think Apache Commons might be the best direction we can try as 
>>>the first effort. In this direction, we still need to work with the 
>>>Apache Commons community to get buy-in and acceptance of the proposal.
>> Make sense.
>
> Makes sense how?
>
>> For this we should define independent release cycles for this 
>> project, and it would just be placed under the Hadoop tree if we all 
>> conclude on this option at the end.
>
> Yes.
>
>> [Chris]
>>>If Chimera is not successful as an independent project or stalls, 
>>>Hadoop and/or Spark and/or $project will have to reabsorb it as 
>>>maintainers.
>>>
>> I am not so strong on this point. If we assume the project would be 
>> unsuccessful, it could be just as unsuccessful (less maintained) even under Hadoop.
>> But if other projects depend on this piece, then they would get 
>> less support. Of course, right now we feel this piece of code is very 
>> important, and we feel (expect) it can be successful as an independent 
>> project, irrespective of whether it lives as a separate project outside Hadoop or inside.
>> So, I feel this point should not really decide the discussion.
>
> Sure; code can idle anywhere, but that wasn't the point I was after.
> You propose to extract code from Hadoop, but if Chimera fails then 
> what recourse do we have among the other projects taking a dependency 
> on it? Splitting off another project is feasible, but Chimera should 
> be sustainable before this PMC can divest itself of responsibility for 
> security libraries. That's a pretty low bar.
>
> Bundling the library with the jar is helpful; I've used that before.
> It should prefer (updated) libraries from the environment, if 
> configured. Otherwise it's a pain (or impossible) for ops to patch 
> security bugs. -C
>
>>>-----Original Message-----
>>>From: Colin P. McCabe [mailto:cmccabe@apache.org]
>>>Sent: Wednesday, February 3, 2016 4:56 AM
>>>To: hdfs-dev@hadoop.apache.org
>>>Subject: Re: Hadoop encryption module as Apache Chimera incubator 
>>>project
>>>
>>>It's great to see interest in improving this functionality.  I think 
>>>Chimera could be successful as an Apache project.  I don't have a 
>>>strong opinion one way or the other as to whether it belongs as part 
>>>of Hadoop or separate.
>>>
>>>I do think there will be some challenges splitting this functionality 
>>>out into a separate jar, because of the way our CLASSPATH works right now.
>>>For example, let's say that Hadoop depends on Chimera 1.2 and Spark 
>>>depends on Chimera 1.1.  Now Spark jobs have two different versions 
>>>fighting it out on the classpath, similar to the situation with Guava 
>>>and other libraries.  Perhaps if Chimera adopts a policy of strong 
>>>backwards compatibility, we can just always use the latest jar, but 
>>>it still seems likely that there will be problems.  There are various 
>>>classpath isolation ideas that could help here, but they are big 
>>>projects in their own right and we don't have a clear timeline for 
>>>them.  If this does end up being a separate jar, we may need to shade 
>>>it to avoid all these issues.
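
The shading Colin mentions is mechanical with the maven-shade-plugin: a consumer relocates Chimera's classes into a private namespace so two versions can coexist on one classpath. A minimal sketch, assuming a hypothetical org.apache.chimera package (all coordinates here are illustrative):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals><goal>shade</goal></goals>
          <configuration>
            <relocations>
              <!-- rewrite Chimera classes into the consumer's own namespace -->
              <relocation>
                <pattern>org.apache.chimera</pattern>
                <shadedPattern>org.apache.hadoop.shaded.chimera</shadedPattern>
              </relocation>
            </relocations>
          </configuration>
        </execution>
      </executions>
    </plugin>

One caveat: relocation rewrites only Java class references, so any JNI loading logic that derives resource paths from package names would need matching care.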
>>>
>>>Bundling the JNI glue code in the jar itself is an interesting idea, 
>>>which we have talked about before for libhadoop.so.  It doesn't 
>>>really have anything to do with the question of TLP vs. non-TLP, of course.
>>>We could do that refactoring in Hadoop itself.  The really 
>>>complicated part of bundling JNI code in a jar is that you need to 
>>>create jars for every cross product of (JVM version, openssl version, operating system).
>>>For example, you have the RHEL6 build for openJDK7 using openssl 1.0.1e.
>>>If you change any one thing-- say, change openJDK7 to Oracle JDK8, 
>>>then you might need to rebuild.  And certainly using Ubuntu would be 
>>>a rebuild.  And so forth.  This kind of clashes with Maven's 
>>>philosophy of pulling prebuilt jars from the internet.
>>>
>>>Kai Zheng's question about whether we would bundle openSSL's 
>>>libraries is a good one.  Given the high rate of new vulnerabilities 
>>>discovered in that library, it seems like bundling would require 
>>>Hadoop users and vendors to update very frequently, much more 
>>>frequently than Hadoop is traditionally updated.  So probably we would not choose to bundle openssl.
>>>
>>>best,
>>>Colin
>>>
>>>On Tue, Feb 2, 2016 at 12:29 AM, Chris Douglas <cd...@apache.org>
>>>wrote:
>>>> As a subproject of Hadoop, Chimera could maintain its own cadence.
>>>> There's also no reason why it should maintain dependencies on other 
>>>> parts of Hadoop, if those are separable. How is this solution 
>>>> inadequate?
>>>>
>>>> If Chimera is not successful as an independent project or stalls, 
>>>> Hadoop and/or Spark and/or $project will have to reabsorb it as 
>>>> maintainers. Projects have high mortality in early life, and a 
>>>> fight over inheritance/maintenance is something we'd like to avoid. 
>>>> If, on the other hand, it develops enough of a community where it 
>>>> is obviously viable, then we can (and should) break it out as a TLP 
>>>> (as we have before). If other Apache projects take a dependency on 
>>>> Chimera, we're open to adding them to security@hadoop.
>>>>
>>>> Unlike Yetus, which was largely rewritten right before it was made 
>>>> into a TLP, security in Hadoop has a complicated pedigree. If 
>>>> Chimera eventually becomes a TLP, it seems fair to include those 
>>>> who work on it while it is a subproject. Declared upfront, that 
>>>> criterion is fairer than any post hoc justification, and will lead 
>>>> to a more accurate account of its community than a subset of the 
>>>> Hadoop PMC/committers that volunteer. -C
>>>>
>>>>
>>>> On Mon, Feb 1, 2016 at 9:29 PM, Chen, Haifeng 
>>>><ha...@intel.com>
>>>>wrote:
>>>>> Thanks to all folks providing feedbacks and participating the 
>>>>>discussions.
>>>>>
>>>>> @Owen, do you still have any concerns on going forward in the 
>>>>>direction of Apache Commons (or other options, TLP)?
>>>>>
>>>>> Thanks,
>>>>> Haifeng
>>>>>
>>>>> -----Original Message-----
>>>>> From: Chen, Haifeng [mailto:haifeng.chen@intel.com]
>>>>> Sent: Saturday, January 30, 2016 10:52 AM
>>>>> To: hdfs-dev@hadoop.apache.org
>>>>> Subject: RE: Hadoop encryption module as Apache Chimera incubator 
>>>>> project
>>>>>
>>>>>>> I believe encryption is becoming a core part of Hadoop. I think  
>>>>>>>that moving core components out of Hadoop is bad from a project 
>>>>>>>management perspective.
>>>>>
>>>>>> Although it's certainly true that encryption capabilities (in 
>>>>>>HDFS, YARN, etc.) are becoming core to Hadoop, I don't think that 
>>>>>>should really influence whether or not the non-Hadoop-specific 
>>>>>>encryption routines should be part of the Hadoop code base, or 
>>>>>>part of the code base of another project that Hadoop depends on. 
>>>>>>If Chimera had existed as a library hosted at ASF when HDFS 
>>>>>>encryption was first developed, HDFS probably would have just 
>>>>>>added that as a dependency and been done with it. I don't think we 
>>>>>>would've copy/pasted the code for Chimera into the Hadoop code base.
>>>>>
>>>>> Agree with ATM. I want to also make an additional clarification. I 
>>>>>agree that the encryption capabilities are becoming core to Hadoop.
>>>>>While this effort is to put common and shared encryption routines 
>>>>>such as crypto stream implementations into a scope which can be 
>>>>>widely shared across the Apache ecosystem. This doesn't move Hadoop 
>>>>>encryption out of Hadoop (that is not possible).
>>>>>
>>>>> Agree if we make it a separate and independent releases project in 
>>>>>Hadoop takes a step further than the existing approach and solve 
>>>>>some issues (such as libhadoop.so problem). Frankly speaking, I 
>>>>>think it is not the best option we can try. I also expect that an 
>>>>>independent release project within Hadoop core will also complicate 
>>>>>the existing release ideology of Hadoop release.
>>>>>
>>>>> Thanks,
>>>>> Haifeng
>>>>>
>>>>> -----Original Message-----
>>>>> From: Aaron T. Myers [mailto:atm@cloudera.com]
>>>>> Sent: Friday, January 29, 2016 9:51 AM
>>>>> To: hdfs-dev@hadoop.apache.org
>>>>> Subject: Re: Hadoop encryption module as Apache Chimera incubator 
>>>>> project
>>>>>
>>>>> On Wed, Jan 27, 2016 at 11:31 AM, Owen O'Malley 
>>>>><om...@apache.org>
>>>>>wrote:
>>>>>
>>>>>> I believe encryption is becoming a core part of Hadoop. I think 
>>>>>>that  moving core components out of Hadoop is bad from a project 
>>>>>>management perspective.
>>>>>>
>>>>>
>>>>> Although it's certainly true that encryption capabilities (in 
>>>>>HDFS,  YARN,
>>>>> etc.) are becoming core to Hadoop, I don't think that should 
>>>>>really influence whether or not the non-Hadoop-specific encryption 
>>>>>routines should be part of the Hadoop code base, or part of the 
>>>>>code base of another project that Hadoop depends on. If Chimera had 
>>>>>existed as a library hosted at ASF when HDFS encryption was first 
>>>>>developed, HDFS probably would have just added that as a dependency 
>>>>>and been done with it. I don't think we would've copy/pasted the 
>>>>>code for Chimera into the Hadoop code base.
>>>>>
>>>>>
>>>>>> To put it another way, a bug in the encryption routines will 
>>>>>> likely become a security problem that security@hadoop needs to hear about.
>>>>>>
>>>>> I don't think
>>>>>> adding a separate project in the middle of that communication 
>>>>>>chain  is a good idea. The same applies to data corruption 
>>>>>>problems, and so on...
>>>>>>
>>>>>
>>>>> Isn't the same true of all the libraries that Hadoop currently 
>>>>>depends upon? If the commons-httpclient library (or commons-codec, 
>>>>>or commons-io, or guava, or...) has a security vulnerability, we 
>>>>>need to know about it so that we can update our dependency to a fixed version.
>>>>>This case doesn't seem materially different than that.
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> > It may be good to keep at generalized place(As in the 
>>>>>> > discussion, we thought that place could be Apache Commons).
>>>>>>
>>>>>>
>>>>>> Apache Commons is a collection of *Java* projects, so Chimera as 
>>>>>> a JNI-based library isn't a natural fit.
>>>>>>
>>>>>
>>>>> Could very well be that Apache Commons's charter would preclude 
>>>>>Chimera.
>>>>> You probably know better than I do about that.
>>>>>
>>>>>
>>>>>> Furthermore, Apache Commons doesn't have its own security list so 
>>>>>> problems will go to the generic security@apache.org.
>>>>>>
>>>>>
>>>>> That seems easy enough to remedy, if they wanted to, and besides I'm
>>>>>not sure why that would influence this discussion. In my experience
>>>>>projects that don't have a separate security@project.a.o mailing list
>>>>>tend to just handle security issues on their private@project.a.o
>>>>>mailing list, which seems fine to me.
>>>>>
>>>>>
>>>>>>
>>>>>> Why do you think that Apache Commons is a better home than Hadoop?
>>>>>>
>>>>>
>>>>> I'm certainly not at all wedded to Apache Commons, that just seemed
>>>>>like a natural place to put it to me. Could be that a brand new TLP
>>>>>might make more sense.
>>>>>
>>>>> I *do* think that if other non-Hadoop projects want to make use of
>>>>>Chimera, which as I understand it is the goal which started this
>>>>>thread, then Chimera should exist outside of Hadoop so that:
>>>>>
>>>>> a) Projects that have nothing to do with Hadoop can just depend
>>>>>directly on Chimera, which has nothing Hadoop-specific in there.
>>>>>
>>>>> b) The Hadoop project doesn't have to export/maintain/concern itself
>>>>>with yet another publicly-consumed interface.
>>>>>
>>>>> c) Chimera can have its own (presumably much faster) release cadence
>>>>>completely separate from Hadoop.
>>>>>
>>>>> --
>>>>> Aaron T. Myers
>>>>> Software Engineer, Cloudera
>>

Re: Hadoop encryption module as Apache Chimera incubator project

Posted by "Gangumalla, Uma" <um...@intel.com>.
Thanks Haifeng. I was just waiting to see if there were any more comments. If
there are no further objections, I will initiate a discussion thread in Apache
Commons in a day's time and will also cc hadoop common.

Regards,
Uma

On 2/11/16, 6:13 PM, "Chen, Haifeng" <ha...@intel.com> wrote:

>Thanks to all the folks participating in this discussion and providing
>valuable suggestions and options.
>
>I suggest we take this forward and make a proposal to the Apache Commons
>community.
>
>Thanks,
>Haifeng
>
>-----Original Message-----
>From: Chen, Haifeng [mailto:haifeng.chen@intel.com]
>Sent: Friday, February 5, 2016 10:06 AM
>To: hdfs-dev@hadoop.apache.org; common-dev@hadoop.apache.org
>Subject: RE: Hadoop encryption module as Apache Chimera incubator project
>
>> [Chris] Yes, but even if the artifact is widely consumed, as a TLP it
>>would need to sustain a community. If the scope is too narrow, then it
>>will quickly fall into maintenance mode, its contributors will move on,
>>and it will retire to the attic. Alone, I doubt its viability as a TLP.
>>So as a first option, donating only this code to Apache Commons would
>>accomplish some immediate goals in a sustainable forum.
>Totally agree. As a TLP it needs a well-defined scope and roadmap to sustain
>a development community.
>
>Thanks,
>Haifeng
>
>-----Original Message-----
>From: Chris Douglas [mailto:cdouglas@apache.org]
>Sent: Friday, February 5, 2016 6:28 AM
>To: common-dev@hadoop.apache.org
>Cc: hdfs-dev@hadoop.apache.org
>Subject: Re: Hadoop encryption module as Apache Chimera incubator project
>
>On Thu, Feb 4, 2016 at 12:06 PM, Gangumalla, Uma
><um...@intel.com> wrote:
>
>> [UMA] OK, great. You are right. I have cc'ed hadoop common. (You
>> mean to cc Apache Commons as well?)
>
>I meant, if you start a discussion with Apache Commons, please CC
>common-dev@hadoop to coordinate.
>
>> [UMA] Right now the encryption libraries are the only ones we plan 
>> to pull out, and we see a lot of interest from other projects like 
>> Spark in using them. I see some challenges in bringing a lot of other 
>> common code into this project: it would all have different 
>> requirements, and maybe different expected timelines for release, etc.
>> Some projects may want to use only the encryption interfaces, not all of it.
>> As these are completely independent pieces of code, it may be better 
>> to scope this out clearly.
>
>Yes, but even if the artifact is widely consumed, as a TLP it would need
>to sustain a community. If the scope is too narrow, then it will quickly
>fall into maintenance mode, its contributors will move on, and it will
>retire to the attic. Alone, I doubt its viability as a TLP. So as a first
>option, donating only this code to Apache Commons would accomplish some
>immediate goals in a sustainable forum.
>
>APR has a similar scope. As a second option, that may also be a
>reasonable home, particularly if some of the native bits could integrate
>with APR.
>
>If the scope is broader, the effort could sustain prolonged development.
>The current code is developing a strategy for packing native libraries on
>multiple platforms, a capability that, say, the native compression codecs
>(AFAIK) still lack. While java.nio is improving, many projects would
>benefit from a better, native interface to the filesystem (e.g.,
>NativeIO). We could avoid duplicating effort and collaborate on a common
>library.
>
>As a third option, Hadoop already implements some useful native
>libraries, which is why a subproject might be a sound course. That would
>enable the subproject to coordinate with Hadoop on migrating its native
>functionality to a separable, reusable component, then move to a TLP when
>we can rely on it exclusively (if it has a well-defined, independent
>community). It could control its release cadence and limit its
>dependencies.
>
>Finally, this is beside the point if nobody is interested in doing the
>work on such a project. It's rude to pull code out of Hadoop and donate
>it to another project so Spark can avoid a dependency, but this instance
>seems reasonable to me. -C
>
>[1] https://apr.apache.org/
>
>> On 2/3/16, 6:46 PM, "Chen, Haifeng" <ha...@intel.com> wrote:
>>
>>>Thanks Chris.
>>>
>>>>> I went through the repository, and now understand the reasoning
>>>>>that would locate this code in Apache Commons. This isn't proposing
>>>>>to extract much of the implementation and it takes none of the
>>>>>integration. It's limited to interfaces to crypto libraries and
>>>>>streams/configuration.
>>>Exactly.
>>>
>>>>> Chimera would be a boutique TLP, unless we wanted to draw out more
>>>>>of the integration and tooling. Is that a goal you're interested in
>>>>>pursuing? There's a tension between keeping this focused and
>>>>>including enough functionality to make it viable as an independent
>>>>>component.
>>>The goal of Chimera is to provide useful, common, and optimized
>>>cryptographic functionality. I would prefer that it stay focused on
>>>this clear scope. Requirements from multiple domains would add more
>>>challenges and uncertainty about where and how it should go, and thus
>>>more risk of stalling.
>>>
>>>>> If the encryption libraries are the only ones you're interested in
>>>>>pulling out, then Apache Commons does seem like a better target than
>>>>>a separate project.
>>>Yes. As mentioned above, the library will be positioned as a
>>>cryptographic library.
>>>
>>>
>>>Thanks,
>>>
>>>-----Original Message-----
>>>From: Chris Douglas [mailto:cdouglas@apache.org]
>>>Sent: Thursday, February 4, 2016 7:26 AM
>>>To: hdfs-dev@hadoop.apache.org
>>>Subject: Re: Hadoop encryption module as Apache Chimera incubator
>>>project
>>>
>>>I went through the repository, and now understand the reasoning that
>>>would locate this code in Apache Commons. This isn't proposing to
>>>extract much of the implementation and it takes none of the
>>>integration. It's limited to interfaces to crypto libraries and
>>>streams/configuration. It might be a reasonable fit for commons-codec,
>>>but that's a pretty sparse library and driving the release cadence
>>>might be more complicated. It'd be worth discussing on their lists
>>>(please also CC common-dev@).
>>>
>>>Chimera would be a boutique TLP, unless we wanted to draw out more of
>>>the integration and tooling. Is that a goal you're interested in
>>>pursuing?
>>>There's a tension between keeping this focused and including enough
>>>functionality to make it viable as an independent component. By way of
>>>example, Hadoop's common project requires too many dependencies and
>>>carries too much historical baggage for other projects to rely on.
>>>I agree with Colin/Steve: we don't want this to grow into another
>>>guava-like dependency that creates more work in conflicts than it
>>>saves in implementation...
>>>
>>>Would it make sense to also package some of the compression libraries,
>>>and maybe some of the text processing from MapReduce? Evolving some of
>>>this code to a common library with few/no dependencies would be
>>>generally useful. As a subproject, it could have a broader scope that
>>>could evolve into a viable TLP. If the encryption libraries are the
>>>only ones you're interested in pulling out, then Apache Commons does
>>>seem like a better target than a separate project. -C
>>>
>>>
>>>On Wed, Feb 3, 2016 at 1:49 AM, Chris Douglas <cd...@apache.org>
>>>wrote:
>>>> On Wed, Feb 3, 2016 at 12:48 AM, Gangumalla, Uma
>>>> <um...@intel.com> wrote:
>>>>>>From the standpoint of a shared, fundamental piece of code like
>>>>>>this, I do think Apache Commons might be the best direction we can
>>>>>>try as the first effort. In this direction, we still need to work
>>>>>>with the Apache Commons community to get buy-in and acceptance of
>>>>>>the proposal.
>>>>> Make sense.
>>>>
>>>> Makes sense how?
>>>>
>>>>> For this we should define independent release cycles for this
>>>>> project, and it would just be placed under the Hadoop tree if we
>>>>> all conclude on this option at the end.
>>>>
>>>> Yes.
>>>>
>>>>> [Chris]
>>>>>>If Chimera is not successful as an independent project or stalls,
>>>>>>Hadoop and/or Spark and/or $project will have to reabsorb it as
>>>>>>maintainers.
>>>>>>
>>>>> I am not so strong on this point. If we assume the project would be
>>>>>unsuccessful, it could be just as unsuccessful (less maintained) even
>>>>>under Hadoop.
>>>>> But if other projects depend on this piece, then they would get
>>>>>less support. Of course, right now we feel this piece of code is very
>>>>>important, and we feel (expect) it can be successful as an independent
>>>>>project, irrespective of whether it lives as a separate project outside
>>>>>Hadoop or inside.
>>>>> So, I feel this point should not really decide the discussion.
>>>>
>>>> Sure; code can idle anywhere, but that wasn't the point I was after.
>>>> You propose to extract code from Hadoop, but if Chimera fails then
>>>> what recourse do we have among the other projects taking a
>>>> dependency on it? Splitting off another project is feasible, but
>>>> Chimera should be sustainable before this PMC can divest itself of
>>>> responsibility for security libraries. That's a pretty low bar.
>>>>
>>>> Bundling the library with the jar is helpful; I've used that before.
>>>> It should prefer (updated) libraries from the environment, if
>>>> configured. Otherwise it's a pain (or impossible) for ops to patch
>>>> security bugs. -C
>>>>
>>>>>>-----Original Message-----
>>>>>>From: Colin P. McCabe [mailto:cmccabe@apache.org]
>>>>>>Sent: Wednesday, February 3, 2016 4:56 AM
>>>>>>To: hdfs-dev@hadoop.apache.org
>>>>>>Subject: Re: Hadoop encryption module as Apache Chimera incubator
>>>>>>project
>>>>>>
>>>>>>It's great to see interest in improving this functionality.  I
>>>>>>think Chimera could be successful as an Apache project.  I don't
>>>>>>have a strong opinion one way or the other as to whether it belongs
>>>>>>as part of Hadoop or separate.
>>>>>>
>>>>>>I do think there will be some challenges splitting this
>>>>>>functionality out into a separate jar, because of the way our
>>>>>>CLASSPATH works right now.
>>>>>>For example, let's say that Hadoop depends on Chimera 1.2 and Spark
>>>>>>depends on Chimera 1.1.  Now Spark jobs have two different versions
>>>>>>fighting it out on the classpath, similar to the situation with
>>>>>>Guava and other libraries.  Perhaps if Chimera adopts a policy of
>>>>>>strong backwards compatibility, we can just always use the latest
>>>>>>jar, but it still seems likely that there will be problems.  There
>>>>>>are various classpath isolation ideas that could help here, but
>>>>>>they are big projects in their own right and we don't have a clear
>>>>>>timeline for them.  If this does end up being a separate jar, we
>>>>>>may need to shade it to avoid all these issues.
>>>>>>
>>>>>>Bundling the JNI glue code in the jar itself is an interesting
>>>>>>idea, which we have talked about before for libhadoop.so.  It
>>>>>>doesn't really have anything to do with the question of TLP vs.
>>>>>>non-TLP, of course.
>>>>>>We could do that refactoring in Hadoop itself.  The really
>>>>>>complicated part of bundling JNI code in a jar is that you need to
>>>>>>create jars for every cross product of (JVM version, openssl
>>>>>>version, operating system).
>>>>>>For example, you have the RHEL6 build for openJDK7 using openssl
>>>>>>1.0.1e.
>>>>>>If you change any one thing-- say, change openJDK7 to Oracle JDK8,
>>>>>>then you might need to rebuild.  And certainly using Ubuntu would
>>>>>>be a rebuild.  And so forth.  This kind of clashes with Maven's
>>>>>>philosophy of pulling prebuilt jars from the internet.
>>>>>>
>>>>>>Kai Zheng's question about whether we would bundle openSSL's
>>>>>>libraries is a good one.  Given the high rate of new
>>>>>>vulnerabilities discovered in that library, it seems like bundling
>>>>>>would require Hadoop users and vendors to update very frequently,
>>>>>>much more frequently than Hadoop is traditionally updated.  So
>>>>>>probably we would not choose to bundle openssl.
>>>>>>
>>>>>>best,
>>>>>>Colin
>>>>>>
>>>>>>On Tue, Feb 2, 2016 at 12:29 AM, Chris Douglas
>>>>>><cd...@apache.org>
>>>>>>wrote:
>>>>>>> As a subproject of Hadoop, Chimera could maintain its own cadence.
>>>>>>> There's also no reason why it should maintain dependencies on
>>>>>>> other parts of Hadoop, if those are separable. How is this
>>>>>>> solution inadequate?
>>>>>>>
>>>>>>> If Chimera is not successful as an independent project or stalls,
>>>>>>> Hadoop and/or Spark and/or $project will have to reabsorb it as
>>>>>>> maintainers. Projects have high mortality in early life, and a
>>>>>>> fight over inheritance/maintenance is something we'd like to avoid.
>>>>>>> If, on the other hand, it develops enough of a community where it
>>>>>>> is obviously viable, then we can (and should) break it out as a
>>>>>>> TLP (as we have before). If other Apache projects take a
>>>>>>> dependency on Chimera, we're open to adding them to
>>>>>>>security@hadoop.
>>>>>>>
>>>>>>> Unlike Yetus, which was largely rewritten right before it was
>>>>>>> made into a TLP, security in Hadoop has a complicated pedigree.
>>>>>>> If Chimera eventually becomes a TLP, it seems fair to include
>>>>>>> those who work on it while it is a subproject. Declared upfront,
>>>>>>> that criterion is fairer than any post hoc justification, and
>>>>>>> will lead to a more accurate account of its community than a
>>>>>>> subset of the Hadoop PMC/committers that volunteer. -C
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Feb 1, 2016 at 9:29 PM, Chen, Haifeng
>>>>>>><ha...@intel.com>
>>>>>>>wrote:
>>>>>>>> Thanks to all folks providing feedbacks and participating the
>>>>>>>>discussions.
>>>>>>>>
>>>>>>>> @Owen, do you still have any concerns on going forward in the
>>>>>>>>direction of Apache Commons (or other options, TLP)?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Haifeng
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Chen, Haifeng [mailto:haifeng.chen@intel.com]
>>>>>>>> Sent: Saturday, January 30, 2016 10:52 AM
>>>>>>>> To: hdfs-dev@hadoop.apache.org
>>>>>>>> Subject: RE: Hadoop encryption module as Apache Chimera
>>>>>>>> incubator project
>>>>>>>>
>>>>>>>>>> I believe encryption is becoming a core part of Hadoop. I
>>>>>>>>>>think that moving core components out of Hadoop is bad from a
>>>>>>>>>>project management perspective.
>>>>>>>>
>>>>>>>>> Although it's certainly true that encryption capabilities (in
>>>>>>>>>HDFS, YARN, etc.) are becoming core to Hadoop, I don't think
>>>>>>>>>that should really influence whether or not the
>>>>>>>>>non-Hadoop-specific encryption routines should be part of the
>>>>>>>>>Hadoop code base, or part of the code base of another project
>>>>>>>>>that Hadoop depends on.
>>>>>>>>>If Chimera had existed as a library hosted at ASF when HDFS
>>>>>>>>>encryption was first developed, HDFS probably would have just
>>>>>>>>>added that as a dependency and been done with it. I don't think
>>>>>>>>>we would've copy/pasted the code for Chimera into the Hadoop code
>>>>>>>>>base.
>>>>>>>>
>>>>>>>> Agree with ATM. I want to also make an additional clarification.
>>>>>>>>I agree that the encryption capabilities are becoming core to
>>>>>>>>Hadoop.
>>>>>>>>While this effort is to put common and shared encryption routines
>>>>>>>>such as crypto stream implementations into a scope which can be
>>>>>>>>widely shared across the Apache ecosystem. This doesn't move
>>>>>>>>Hadoop encryption out of Hadoop (that is not possible).
>>>>>>>>
>>>>>>>> Agree if we make it a separate and independent releases project
>>>>>>>>in Hadoop takes a step further than the existing approach and
>>>>>>>>solve some issues (such as libhadoop.so problem). Frankly
>>>>>>>>speaking, I think it is not the best option we can try. I also
>>>>>>>>expect that an independent release project within Hadoop core
>>>>>>>>will also complicate the existing release ideology of Hadoop
>>>>>>>>release.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Haifeng
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Aaron T. Myers [mailto:atm@cloudera.com]
>>>>>>>> Sent: Friday, January 29, 2016 9:51 AM
>>>>>>>> To: hdfs-dev@hadoop.apache.org
>>>>>>>> Subject: Re: Hadoop encryption module as Apache Chimera
>>>>>>>> incubator project
>>>>>>>>
>>>>>>>> On Wed, Jan 27, 2016 at 11:31 AM, Owen O'Malley
>>>>>>>><om...@apache.org>
>>>>>>>>wrote:
>>>>>>>>
>>>>>>>>> I believe encryption is becoming a core part of Hadoop. I think
>>>>>>>>>that  moving core components out of Hadoop is bad from a project
>>>>>>>>>management perspective.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Although it's certainly true that encryption capabilities (in
>>>>>>>>HDFS,  YARN,
>>>>>>>> etc.) are becoming core to Hadoop, I don't think that should
>>>>>>>>really influence whether or not the non-Hadoop-specific
>>>>>>>>encryption routines should be part of the Hadoop code base, or
>>>>>>>>part of the code base of another project that Hadoop depends on.
>>>>>>>>If Chimera had existed as a library hosted at ASF when HDFS
>>>>>>>>encryption was first developed, HDFS probably would have just
>>>>>>>>added that as a dependency and been done with it. I don't think
>>>>>>>>we would've copy/pasted the code for Chimera into the Hadoop code
>>>>>>>>base.
>>>>>>>>
>>>>>>>>
>>>>>>>>> To put it another way, a bug in the encryption routines will
>>>>>>>>>likely become a security problem that security@hadoop needs to
>>>>>>>>>hear about.
>>>>>>>>>
>>>>>>>> I don't think
>>>>>>>>> adding a separate project in the middle of that communication
>>>>>>>>>chain  is a good idea. The same applies to data corruption
>>>>>>>>>problems, and so on...
>>>>>>>>>
>>>>>>>>
>>>>>>>> Isn't the same true of all the libraries that Hadoop currently
>>>>>>>>depends upon? If the commons-httpclient library (or
>>>>>>>>commons-codec, or commons-io, or guava, or...) has a security
>>>>>>>>vulnerability, we need to know about it so that we can update our
>>>>>>>>dependency to a fixed version.
>>>>>>>>This case doesn't seem materially different than that.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> > It may be good to keep at generalized place(As in the
>>>>>>>>> > discussion, we thought that place could be Apache Commons).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Apache Commons is a collection of *Java* projects, so Chimera
>>>>>>>>> as a JNI-based library isn't a natural fit.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Could very well be that Apache Commons's charter would preclude
>>>>>>>>Chimera.
>>>>>>>> You probably know better than I do about that.
>>>>>>>>
>>>>>>>>
>>>>>>>>> Furthermore, Apache Commons doesn't have its own security list
>>>>>>>>> so problems will go to the generic security@apache.org.
>>>>>>>>>
>>>>>>>>
>>>>>>>> That seems easy enough to remedy, if they wanted to, and besides
>>>>>>>>I'm not sure why that would influence this discussion. In my
>>>>>>>>experience projects that don't have a separate
>>>>>>>>security@project.a.o mailing list tend to just handle security
>>>>>>>>issues on their private@project.a.o mailing list, which seems fine
>>>>>>>>to me.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Why do you think that Apache Commons is a better home than
>>>>>>>>>Hadoop?
>>>>>>>>>
>>>>>>>>
>>>>>>>> I'm certainly not at all wedded to Apache Commons, that just
>>>>>>>>seemed like a natural place to put it to me. Could be that a
>>>>>>>>brand new TLP might make more sense.
>>>>>>>>
>>>>>>>> I *do* think that if other non-Hadoop projects want to make use
>>>>>>>>of Chimera, which as I understand it is the goal which started
>>>>>>>>this thread, then Chimera should exist outside of Hadoop so that:
>>>>>>>>
>>>>>>>> a) Projects that have nothing to do with Hadoop can just depend
>>>>>>>>directly on Chimera, which has nothing Hadoop-specific in there.
>>>>>>>>
>>>>>>>> b) The Hadoop project doesn't have to export/maintain/concern
>>>>>>>>itself with yet another publicly-consumed interface.
>>>>>>>>
>>>>>>>> c) Chimera can have its own (presumably much faster) release
>>>>>>>>cadence completely separate from Hadoop.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Aaron T. Myers
>>>>>>>> Software Engineer, Cloudera
>>>>>
>>


RE: Hadoop encryption module as Apache Chimera incubator project

Posted by "Chen, Haifeng" <ha...@intel.com>.
Thanks to all the folks participating in this discussion and providing valuable suggestions and options.

I suggest we take this forward and make a proposal to the Apache Commons community. 

Thanks,
Haifeng

-----Original Message-----
From: Chen, Haifeng [mailto:haifeng.chen@intel.com] 
Sent: Friday, February 5, 2016 10:06 AM
To: hdfs-dev@hadoop.apache.org; common-dev@hadoop.apache.org
Subject: RE: Hadoop encryption module as Apache Chimera incubator project

> [Chris] Yes, but even if the artifact is widely consumed, as a TLP it would need to sustain a community. If the scope is too narrow, then it will quickly fall into maintenance mode, its contributors will move on, and it will retire to the attic. Alone, I doubt its viability as a TLP. So as a first option, donating only this code to Apache Commons would accomplish some immediate goals in a sustainable forum.
Totally agree. As a TLP it needs a well-defined scope and roadmap to sustain a development community. 

Thanks,
Haifeng

-----Original Message-----
From: Chris Douglas [mailto:cdouglas@apache.org]
Sent: Friday, February 5, 2016 6:28 AM
To: common-dev@hadoop.apache.org
Cc: hdfs-dev@hadoop.apache.org
Subject: Re: Hadoop encryption module as Apache Chimera incubator project

On Thu, Feb 4, 2016 at 12:06 PM, Gangumalla, Uma <um...@intel.com> wrote:

> [UMA] OK, great. You are right. I have cc'ed hadoop common. (You 
> mean to cc Apache Commons as well?)

I meant, if you start a discussion with Apache Commons, please CC common-dev@hadoop to coordinate.

> [UMA] Right now the encryption libraries are the only ones we plan 
> to pull out, and we see a lot of interest from other projects like 
> Spark in using them. I see some challenges in bringing a lot of other 
> common code into this project: it would all have different 
> requirements, and maybe different expected timelines for release, etc.
> Some projects may want to use only the encryption interfaces, not all of it.
> As these are completely independent pieces of code, it may be better 
> to scope this out clearly.

Yes, but even if the artifact is widely consumed, as a TLP it would need to sustain a community. If the scope is too narrow, then it will quickly fall into maintenance mode, its contributors will move on, and it will retire to the attic. Alone, I doubt its viability as a TLP. So as a first option, donating only this code to Apache Commons would accomplish some immediate goals in a sustainable forum.

APR has a similar scope. As a second option, that may also be a reasonable home, particularly if some of the native bits could integrate with APR.

If the scope is broader, the effort could sustain prolonged development. The current code is developing a strategy for packing native libraries on multiple platforms, a capability that, say, the native compression codecs (AFAIK) still lack. While java.nio is improving, many projects would benefit from a better, native interface to the filesystem (e.g., NativeIO). We could avoid duplicating effort and collaborate on a common library.
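
The usual shape of that packing strategy, combined with the prefer-the-environment loading Chris asked for earlier in the thread, looks roughly like the sketch below. This is a hedged illustration, not Chimera's actual loader; the library name "chimera" and the /native/<os>-<arch> resource layout are assumptions:

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public final class NativeLoaderSketch {
      private NativeLoaderSketch() {}

      // Prefer a system-installed copy so ops can patch security bugs,
      // then fall back to the platform-specific copy bundled in the jar.
      public static void load() {
        try {
          System.loadLibrary("chimera"); // honors java.library.path
          return;
        } catch (UnsatisfiedLinkError ignored) {
          // no system copy found; fall through to the bundled one
        }
        String os = System.getProperty("os.name").toLowerCase().replace(' ', '-');
        String arch = System.getProperty("os.arch");
        String resource = "/native/" + os + "-" + arch + "/libchimera.so";
        try (InputStream in = NativeLoaderSketch.class.getResourceAsStream(resource)) {
          if (in == null) {
            throw new UnsatisfiedLinkError("no bundled native library at " + resource);
          }
          Path tmp = Files.createTempFile("libchimera", ".so");
          Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
          tmp.toFile().deleteOnExit();
          System.load(tmp.toAbsolutePath().toString());
        } catch (IOException e) {
          throw new UnsatisfiedLinkError("failed to extract bundled library: " + e);
        }
      }
    }

This is also why the (JVM version, openssl version, operating system) cross product Colin describes matters: every <os>-<arch> directory bundled into the jar has to be built and tested somewhere.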

As a third option, Hadoop already implements some useful native libraries, which is why a subproject might be a sound course. That would enable the subproject to coordinate with Hadoop on migrating its native functionality to a separable, reusable component, then move to a TLP when we can rely on it exclusively (if it has a well-defined, independent community). It could control its release cadence and limit its dependencies.

Finally, this is beside the point if nobody is interested in doing the work on such a project. It's rude to pull code out of Hadoop and donate it to another project so Spark can avoid a dependency, but this instance seems reasonable to me. -C

[1] https://apr.apache.org/

> On 2/3/16, 6:46 PM, "Chen, Haifeng" <ha...@intel.com> wrote:
>
>>Thanks Chris.
>>
>>>> I went through the repository, and now understand the reasoning 
>>>>that would locate this code in Apache Commons. This isn't proposing 
>>>>to extract much of the implementation and it takes none of the 
>>>>integration. It's limited to interfaces to crypto libraries and 
>>>>streams/configuration.
>>Exactly.
>>
>>>> Chimera would be a boutique TLP, unless we wanted to draw out more 
>>>>of the integration and tooling. Is that a goal you're interested in 
>>>>pursuing? There's a tension between keeping this focused and 
>>>>including enough functionality to make it viable as an independent component.
>>The goal of Chimera is to provide useful, common, and optimized 
>>cryptographic functionality. I would prefer that it stay focused on 
>>this clear scope. Requirements from multiple domains would add more 
>>challenges and uncertainty about where and how it should go, and thus 
>>more risk of stalling.
>>
>>>> If the encryption libraries are the only ones you're interested in 
>>>>pulling out, then Apache Commons does seem like a better target than 
>>>>a separate project.
>>Yes. As mentioned above, the library will be positioned as a 
>>cryptographic library.
>>
>>
>>Thanks,
>>
>>-----Original Message-----
>>From: Chris Douglas [mailto:cdouglas@apache.org]
>>Sent: Thursday, February 4, 2016 7:26 AM
>>To: hdfs-dev@hadoop.apache.org
>>Subject: Re: Hadoop encryption module as Apache Chimera incubator 
>>project
>>
>>I went through the repository, and now understand the reasoning that 
>>would locate this code in Apache Commons. This isn't proposing to 
>>extract much of the implementation and it takes none of the 
>>integration. It's limited to interfaces to crypto libraries and 
>>streams/configuration. It might be a reasonable fit for commons-codec, 
>>but that's a pretty sparse library and driving the release cadence 
>>might be more complicated. It'd be worth discussing on their lists (please also CC common-dev@).
>>
>>Chimera would be a boutique TLP, unless we wanted to draw out more of 
>>the integration and tooling. Is that a goal you're interested in pursuing?
>>There's a tension between keeping this focused and including enough 
>>functionality to make it viable as an independent component. By way of 
>>example, Hadoop's common project requires too many dependencies and 
>>carries too much historical baggage for other projects to rely on.
>>I agree with Colin/Steve: we don't want this to grow into another 
>>guava-like dependency that creates more work in conflicts than it 
>>saves in implementation...
>>
>>Would it make sense to also package some of the compression libraries, 
>>and maybe some of the text processing from MapReduce? Evolving some of 
>>this code to a common library with few/no dependencies would be 
>>generally useful. As a subproject, it could have a broader scope that 
>>could evolve into a viable TLP. If the encryption libraries are the 
>>only ones you're interested in pulling out, then Apache Commons does 
>>seem like a better target than a separate project. -C
>>
>>
>>On Wed, Feb 3, 2016 at 1:49 AM, Chris Douglas <cd...@apache.org> wrote:
>>> On Wed, Feb 3, 2016 at 12:48 AM, Gangumalla, Uma 
>>> <um...@intel.com> wrote:
>>>>>From the standpoint of a shared, fundamental piece of code like 
>>>>>this, I do think Apache Commons might be the best direction to 
>>>>>try as a first effort. In this direction, we still need to work 
>>>>>with the Apache Commons community on buying in and accepting the proposal.
>>>> Make sense.
>>>
>>> Makes sense how?
>>>
>>>> For this we should define independent release cycles for this 
>>>> project, and it would just be placed under the Hadoop tree if we 
>>>> all conclude with this option in the end.
>>>
>>> Yes.
>>>
>>>> [Chris]
>>>>>If Chimera is not successful as an independent project or stalls, 
>>>>>Hadoop and/or Spark and/or $project will have to reabsorb it as 
>>>>>maintainers.
>>>>>
>>>> I am not so strong on this point. If we assume the project would 
>>>>be unsuccessful, it can be unsuccessful (less maintained) even 
>>>>under Hadoop.
>>>> But if other projects depend on this piece, then they would get 
>>>>less support either way. Of course, right now we feel this piece 
>>>>of code is very important, and we feel (expect) it can be 
>>>>successful as an independent project, irrespective of whether it 
>>>>is a separate project outside Hadoop or inside.
>>>> So, I feel this point should not really influence the judgment in 
>>>>this discussion.
>>>
>>> Sure; code can idle anywhere, but that wasn't the point I was after.
>>> You propose to extract code from Hadoop, but if Chimera fails then 
>>> what recourse do we have among the other projects taking a 
>>> dependency on it? Splitting off another project is feasible, but 
>>> Chimera should be sustainable before this PMC can divest itself of 
>>> responsibility for security libraries. That's a pretty low bar.
>>>
>>> Bundling the library with the jar is helpful; I've used that before.
>>> It should prefer (updated) libraries from the environment, if 
>>> configured. Otherwise it's a pain (or impossible) for ops to patch 
>>> security bugs. -C
>>>
>>>>>-----Original Message-----
>>>>>From: Colin P. McCabe [mailto:cmccabe@apache.org]
>>>>>Sent: Wednesday, February 3, 2016 4:56 AM
>>>>>To: hdfs-dev@hadoop.apache.org
>>>>>Subject: Re: Hadoop encryption module as Apache Chimera incubator 
>>>>>project
>>>>>
>>>>>It's great to see interest in improving this functionality.  I 
>>>>>think Chimera could be successful as an Apache project.  I don't 
>>>>>have a strong opinion one way or the other as to whether it belongs 
>>>>>as part of Hadoop or separate.
>>>>>
>>>>>I do think there will be some challenges splitting this 
>>>>>functionality out into a separate jar, because of the way our 
>>>>>CLASSPATH works right now.
>>>>>For example, let's say that Hadoop depends on Chimera 1.2 and Spark 
>>>>>depends on Chimera 1.1.  Now Spark jobs have two different versions 
>>>>>fighting it out on the classpath, similar to the situation with 
>>>>>Guava and other libraries.  Perhaps if Chimera adopts a policy of 
>>>>>strong backwards compatibility, we can just always use the latest 
>>>>>jar, but it still seems likely that there will be problems.  There 
>>>>>are various classpath isolation ideas that could help here, but 
>>>>>they are big projects in their own right and we don't have a clear 
>>>>>timeline for them.  If this does end up being a separate jar, we 
>>>>>may need to shade it to avoid all these issues.
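
A sketch of the shading Colin describes, as a maven-shade-plugin configuration that relocates a hypothetical Chimera package so a consumer's own copy cannot clash on the classpath. The package names and shaded namespace are assumptions for illustration, not real coordinates.

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>2.4.3</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <relocations>
              <!-- Rewrite the (hypothetical) Chimera package into a
                   private namespace inside the shaded jar. -->
              <relocation>
                <pattern>org.apache.chimera</pattern>
                <shadedPattern>org.apache.hadoop.shaded.chimera</shadedPattern>
              </relocation>
            </relocations>
          </configuration>
        </execution>
      </executions>
    </plugin>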
>>>>>
>>>>>Bundling the JNI glue code in the jar itself is an interesting 
>>>>>idea, which we have talked about before for libhadoop.so.  It 
>>>>>doesn't really have anything to do with the question of TLP vs.
>>>>>non-TLP, of course.
>>>>>We could do that refactoring in Hadoop itself.  The really 
>>>>>complicated part of bundling JNI code in a jar is that you need to 
>>>>>create jars for every cross product of (JVM version, openssl 
>>>>>version, operating system).
>>>>>For example, you have the RHEL6 build for openJDK7 using openssl 
>>>>>1.0.1e.
>>>>>If you change any one thing-- say, change openJDK7 to Oracle JDK8, 
>>>>>then you might need to rebuild.  And certainly using Ubuntu would 
>>>>>be a rebuild.  And so forth.  This kind of clashes with Maven's 
>>>>>philosophy of pulling prebuilt jars from the internet.
>>>>>
>>>>>Kai Zheng's question about whether we would bundle openSSL's 
>>>>>libraries is a good one.  Given the high rate of new 
>>>>>vulnerabilities discovered in that library, it seems like bundling 
>>>>>would require Hadoop users and vendors to update very frequently, 
>>>>>much more frequently than Hadoop is traditionally updated.  So 
>>>>>probably we would not choose to bundle openssl.
>>>>>
>>>>>best,
>>>>>Colin
>>>>>
>>>>>On Tue, Feb 2, 2016 at 12:29 AM, Chris Douglas 
>>>>><cd...@apache.org>
>>>>>wrote:
>>>>>> As a subproject of Hadoop, Chimera could maintain its own cadence.
>>>>>> There's also no reason why it should maintain dependencies on 
>>>>>> other parts of Hadoop, if those are separable. How is this 
>>>>>> solution inadequate?
>>>>>>
>>>>>> If Chimera is not successful as an independent project or stalls, 
>>>>>> Hadoop and/or Spark and/or $project will have to reabsorb it as 
>>>>>> maintainers. Projects have high mortality in early life, and a 
>>>>>> fight over inheritance/maintenance is something we'd like to avoid.
>>>>>> If, on the other hand, it develops enough of a community where it 
>>>>>> is obviously viable, then we can (and should) break it out as a 
>>>>>> TLP (as we have before). If other Apache projects take a 
>>>>>> dependency on Chimera, we're open to adding them to security@hadoop.
>>>>>>
>>>>>> Unlike Yetus, which was largely rewritten right before it was 
>>>>>> made into a TLP, security in Hadoop has a complicated pedigree.
>>>>>> If Chimera eventually becomes a TLP, it seems fair to include 
>>>>>> those who work on it while it is a subproject. Declared upfront, 
>>>>>> that criterion is fairer than any post hoc justification, and 
>>>>>> will lead to a more accurate account of its community than a 
>>>>>> subset of the Hadoop PMC/committers that volunteer. -C
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 1, 2016 at 9:29 PM, Chen, Haifeng 
>>>>>><ha...@intel.com>
>>>>>>wrote:
>>>>>>> Thanks to all folks providing feedbacks and participating the 
>>>>>>>discussions.
>>>>>>>
>>>>>>> @Owen, do you still have any concerns on going forward in the 
>>>>>>>direction of Apache Commons (or other options, TLP)?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Haifeng
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Chen, Haifeng [mailto:haifeng.chen@intel.com]
>>>>>>> Sent: Saturday, January 30, 2016 10:52 AM
>>>>>>> To: hdfs-dev@hadoop.apache.org
>>>>>>> Subject: RE: Hadoop encryption module as Apache Chimera 
>>>>>>> incubator project
>>>>>>>
>>>>>>>>> I believe encryption is becoming a core part of Hadoop. I 
>>>>>>>>>think that moving core components out of Hadoop is bad from a 
>>>>>>>>>project management perspective.
>>>>>>>
>>>>>>>> Although it's certainly true that encryption capabilities (in 
>>>>>>>>HDFS, YARN, etc.) are becoming core to Hadoop, I don't think 
>>>>>>>>that should really influence whether or not the 
>>>>>>>>non-Hadoop-specific encryption routines should be part of the 
>>>>>>>>Hadoop code base, or part of the code base of another project that Hadoop depends on.
>>>>>>>>If Chimera had existed as a library hosted at ASF when HDFS 
>>>>>>>>encryption was first developed, HDFS probably would have just 
>>>>>>>>added that as a dependency and been done with it. I don't think 
>>>>>>>>we would've copy/pasted the code for Chimera into the Hadoop code base.
>>>>>>>
>>>>>>> Agree with ATM. I want to also make an additional clarification. 
>>>>>>>I agree that the encryption capabilities are becoming core to Hadoop.
>>>>>>>While this effort is to put common and shared encryption routines 
>>>>>>>such as crypto stream implementations into a scope which can be 
>>>>>>>widely shared across the Apache ecosystem. This doesn't move 
>>>>>>>Hadoop encryption out of Hadoop (that is not possible).
>>>>>>>
>>>>>>> Agree if we make it a separate and independent releases project 
>>>>>>>in Hadoop takes a step further than the existing approach and 
>>>>>>>solve some issues (such as libhadoop.so problem). Frankly 
>>>>>>>speaking, I think it is not the best option we can try. I also 
>>>>>>>expect that an independent release project within Hadoop core 
>>>>>>>will also complicate the existing release ideology of Hadoop release.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Haifeng
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Aaron T. Myers [mailto:atm@cloudera.com]
>>>>>>> Sent: Friday, January 29, 2016 9:51 AM
>>>>>>> To: hdfs-dev@hadoop.apache.org
>>>>>>> Subject: Re: Hadoop encryption module as Apache Chimera 
>>>>>>> incubator project
>>>>>>>
>>>>>>> On Wed, Jan 27, 2016 at 11:31 AM, Owen O'Malley 
>>>>>>><om...@apache.org>
>>>>>>>wrote:
>>>>>>>
>>>>>>>> I believe encryption is becoming a core part of Hadoop. I think 
>>>>>>>>that  moving core components out of Hadoop is bad from a project 
>>>>>>>>management perspective.
>>>>>>>>
>>>>>>>
>>>>>>> Although it's certainly true that encryption capabilities (in 
>>>>>>>HDFS,  YARN,
>>>>>>> etc.) are becoming core to Hadoop, I don't think that should 
>>>>>>>really influence whether or not the non-Hadoop-specific 
>>>>>>>encryption routines should be part of the Hadoop code base, or 
>>>>>>>part of the code base of another project that Hadoop depends on.
>>>>>>>If Chimera had existed as a library hosted at ASF when HDFS 
>>>>>>>encryption was first developed, HDFS probably would have just 
>>>>>>>added that as a dependency and been done with it. I don't think 
>>>>>>>we would've copy/pasted the code for Chimera into the Hadoop code base.
>>>>>>>
>>>>>>>
>>>>>>>> To put it another way, a bug in the encryption routines will 
>>>>>>>>likely become a security problem that security@hadoop needs to 
>>>>>>>>hear about.
>>>>>>>>
>>>>>>> I don't think
>>>>>>>> adding a separate project in the middle of that communication 
>>>>>>>>chain  is a good idea. The same applies to data corruption 
>>>>>>>>problems, and so on...
>>>>>>>>
>>>>>>>
>>>>>>> Isn't the same true of all the libraries that Hadoop currently 
>>>>>>>depends upon? If the commons-httpclient library (or 
>>>>>>>commons-codec, or commons-io, or guava, or...) has a security 
>>>>>>>vulnerability, we need to know about it so that we can update our 
>>>>>>>dependency to a fixed version.
>>>>>>>This case doesn't seem materially different than that.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> > It may be good to keep at generalized place(As in the 
>>>>>>>> > discussion, we thought that place could be Apache Commons).
>>>>>>>>
>>>>>>>>
>>>>>>>> Apache Commons is a collection of *Java* projects, so Chimera 
>>>>>>>> as a JNI-based library isn't a natural fit.
>>>>>>>>
>>>>>>>
>>>>>>> Could very well be that Apache Commons's charter would preclude 
>>>>>>>Chimera.
>>>>>>> You probably know better than I do about that.
>>>>>>>
>>>>>>>
>>>>>>>> Furthermore, Apache Commons doesn't have its own security list 
>>>>>>>> so problems will go to the generic security@apache.org.
>>>>>>>>
>>>>>>>
>>>>>>> That seems easy enough to remedy, if they wanted to, and besides 
>>>>>>>I'm not sure why that would influence this discussion. In my 
>>>>>>>experience projects that don't have a separate 
>>>>>>>security@project.a.o mailing list tend to just handle security 
>>>>>>>issues on their private@project.a.o mailing list, which seems fine to me.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Why do you think that Apache Commons is a better home than Hadoop?
>>>>>>>>
>>>>>>>
>>>>>>> I'm certainly not at all wedded to Apache Commons, that just 
>>>>>>>seemed like a natural place to put it to me. Could be that a 
>>>>>>>brand new TLP might make more sense.
>>>>>>>
>>>>>>> I *do* think that if other non-Hadoop projects want to make use 
>>>>>>>of Chimera, which as I understand it is the goal which started 
>>>>>>>this thread, then Chimera should exist outside of Hadoop so that:
>>>>>>>
>>>>>>> a) Projects that have nothing to do with Hadoop can just depend 
>>>>>>>directly on Chimera, which has nothing Hadoop-specific in there.
>>>>>>>
>>>>>>> b) The Hadoop project doesn't have to export/maintain/concern 
>>>>>>>itself with yet another publicly-consumed interface.
>>>>>>>
>>>>>>> c) Chimera can have its own (presumably much faster) release 
>>>>>>>cadence completely separate from Hadoop.
>>>>>>>
>>>>>>> --
>>>>>>> Aaron T. Myers
>>>>>>> Software Engineer, Cloudera
>>>>
>

RE: Hadoop encryption module as Apache Chimera incubator project

Posted by "Chen, Haifeng" <ha...@intel.com>.
Thanks to all the folks participating in this discussion and providing valuable suggestions and options.

I suggest we take this forward and make a proposal to the Apache Commons community.

Thanks,
Haifeng

-----Original Message-----
From: Chen, Haifeng [mailto:haifeng.chen@intel.com] 
Sent: Friday, February 5, 2016 10:06 AM
To: hdfs-dev@hadoop.apache.org; common-dev@hadoop.apache.org
Subject: RE: Hadoop encryption module as Apache Chimera incubator project

> [Chris] Yes, but even if the artifact is widely consumed, as a TLP it would need to sustain a community. If the scope is too narrow, then it will quickly fall into maintenance mode, its contributors will move on, and it will retire to the attic. Alone, I doubt its viability as a TLP. So as a first option, donating only this code to Apache Commons would accomplish some immediate goals in a sustainable forum.
Totally agree. As a TLP it needs a well-defined scope and roadmap to sustain a development community. 

Thanks,
Haifeng

-----Original Message-----
From: Chris Douglas [mailto:cdouglas@apache.org]
Sent: Friday, February 5, 2016 6:28 AM
To: common-dev@hadoop.apache.org
Cc: hdfs-dev@hadoop.apache.org
Subject: Re: Hadoop encryption module as Apache Chimera incubator project

On Thu, Feb 4, 2016 at 12:06 PM, Gangumalla, Uma <um...@intel.com> wrote:

> [UMA] Ok. Great. You are right. I have cc'ed to hadoop common. (You 
> mean to cc Apache commons as well?)

I meant, if you start a discussion with Apache Commons, please CC common-dev@hadoop to coordinate.

> [UMA] Right now the encryption libraries are the only ones we plan to 
> include, and we see a lot of interest from other projects like Spark 
> in using them. I see some challenges when we bring a lot of other 
> common code into this project: it would all have different 
> requirements, and perhaps different expected timelines for releases.
> Some projects may want to use only the encryption interfaces, not all of it.
> As these are completely independent pieces of code, it may be better 
> to scope the project out clearly.

Yes, but even if the artifact is widely consumed, as a TLP it would need to sustain a community. If the scope is too narrow, then it will quickly fall into maintenance mode, its contributors will move on, and it will retire to the attic. Alone, I doubt its viability as a TLP. So as a first option, donating only this code to Apache Commons would accomplish some immediate goals in a sustainable forum.

APR [1] has a similar scope. As a second option, that may also be a reasonable home, particularly if some of the native bits could integrate with APR.

If the scope is broader, the effort could sustain prolonged development. The current code is developing a strategy for packing native libraries on multiple platforms, a capability that, say, the native compression codecs (AFAIK) still lack. While java.nio is improving, many projects would benefit from a better, native interface to the filesystem (e.g., NativeIO). We could avoid duplicating effort and collaborate on a common library.

As a third option, Hadoop already implements some useful native libraries, which is why a subproject might be a sound course. That would enable the subproject to coordinate with Hadoop on migrating its native functionality to a separable, reusable component, then move to a TLP when we can rely on it exclusively (if it has a well-defined, independent community). It could control its release cadence and limit its dependencies.

Finally, this is beside the point if nobody is interested in doing the work on such a project. It's rude to pull code out of Hadoop and donate it to another project so Spark can avoid a dependency, but this instance seems reasonable to me. -C

[1] https://apr.apache.org/

> On 2/3/16, 6:46 PM, "Chen, Haifeng" <ha...@intel.com> wrote:
>
>>Thanks Chris.
>>
>>>> I went through the repository, and now understand the reasoning 
>>>>that would locate this code in Apache Commons. This isn't proposing 
>>>>to extract much of the implementation and it takes none of the 
>>>>integration. It's limited to interfaces to crypto libraries and 
>>>>streams/configuration.
>>Exactly.
>>
>>>> Chimera would be a boutique TLP, unless we wanted to draw out more 
>>>>of the integration and tooling. Is that a goal you're interested in 
>>>>pursuing? There's a tension between keeping this focused and 
>>>>including enough functionality to make it viable as an independent component.
>>The goal of Chimera was to provide useful, common, and optimized 
>>cryptographic functionality. I would prefer that it stay focused on 
>>this clear scope. Requirements from multiple domains would bring more 
>>challenges and uncertainty about where and how it should go, and thus 
>>more risk of stalling.
>>
>>>> If the encryption libraries are the only ones you're interested in 
>>>>pulling out, then Apache Commons does seem like a better target than 
>>>>a separate project.
>>Yes. As mentioned above, the library will be positioned as a 
>>cryptographic library.
>>
>>
>>Thanks,
>>
>>-----Original Message-----
>>From: Chris Douglas [mailto:cdouglas@apache.org]
>>Sent: Thursday, February 4, 2016 7:26 AM
>>To: hdfs-dev@hadoop.apache.org
>>Subject: Re: Hadoop encryption module as Apache Chimera incubator 
>>project
>>
>>I went through the repository, and now understand the reasoning that 
>>would locate this code in Apache Commons. This isn't proposing to 
>>extract much of the implementation and it takes none of the 
>>integration. It's limited to interfaces to crypto libraries and 
>>streams/configuration. It might be a reasonable fit for commons-codec, 
>>but that's a pretty sparse library and driving the release cadence 
>>might be more complicated. It'd be worth discussing on their lists (please also CC common-dev@).
>>
>>Chimera would be a boutique TLP, unless we wanted to draw out more of 
>>the integration and tooling. Is that a goal you're interested in pursuing?
>>There's a tension between keeping this focused and including enough 
>>functionality to make it viable as an independent component. By way of 
>>example, Hadoop's common project requires too many dependencies and 
>>carries too much historical baggage for other projects to rely on.
>>I agree with Colin/Steve: we don't want this to grow into another 
>>guava-like dependency that creates more work in conflicts than it 
>>saves in implementation...
>>
>>Would it make sense to also package some of the compression libraries, 
>>and maybe some of the text processing from MapReduce? Evolving some of 
>>this code to a common library with few/no dependencies would be 
>>generally useful. As a subproject, it could have a broader scope that 
>>could evolve into a viable TLP. If the encryption libraries are the 
>>only ones you're interested in pulling out, then Apache Commons does 
>>seem like a better target than a separate project. -C
>>
>>
>>On Wed, Feb 3, 2016 at 1:49 AM, Chris Douglas <cd...@apache.org> wrote:
>>> On Wed, Feb 3, 2016 at 12:48 AM, Gangumalla, Uma 
>>> <um...@intel.com> wrote:
>>>>>From the standpoint of a shared, fundamental piece of code like 
>>>>>this, I do think Apache Commons might be the best direction to 
>>>>>try as a first effort. In this direction, we still need to work 
>>>>>with the Apache Commons community on buying in and accepting the proposal.
>>>> Make sense.
>>>
>>> Makes sense how?
>>>
>>>> For this we should define independent release cycles for this 
>>>> project, and it would just be placed under the Hadoop tree if we 
>>>> all conclude with this option in the end.
>>>
>>> Yes.
>>>
>>>> [Chris]
>>>>>If Chimera is not successful as an independent project or stalls, 
>>>>>Hadoop and/or Spark and/or $project will have to reabsorb it as 
>>>>>maintainers.
>>>>>
>>>> I am not so strong on this point. If we assume the project would 
>>>>be unsuccessful, it can be unsuccessful (less maintained) even 
>>>>under Hadoop.
>>>> But if other projects depend on this piece, then they would get 
>>>>less support either way. Of course, right now we feel this piece 
>>>>of code is very important, and we feel (expect) it can be 
>>>>successful as an independent project, irrespective of whether it 
>>>>is a separate project outside Hadoop or inside.
>>>> So, I feel this point should not really influence the judgment in 
>>>>this discussion.
>>>
>>> Sure; code can idle anywhere, but that wasn't the point I was after.
>>> You propose to extract code from Hadoop, but if Chimera fails then 
>>> what recourse do we have among the other projects taking a 
>>> dependency on it? Splitting off another project is feasible, but 
>>> Chimera should be sustainable before this PMC can divest itself of 
>>> responsibility for security libraries. That's a pretty low bar.
>>>
>>> Bundling the library with the jar is helpful; I've used that before.
>>> It should prefer (updated) libraries from the environment, if 
>>> configured. Otherwise it's a pain (or impossible) for ops to patch 
>>> security bugs. -C
>>>
>>>>>-----Original Message-----
>>>>>From: Colin P. McCabe [mailto:cmccabe@apache.org]
>>>>>Sent: Wednesday, February 3, 2016 4:56 AM
>>>>>To: hdfs-dev@hadoop.apache.org
>>>>>Subject: Re: Hadoop encryption module as Apache Chimera incubator 
>>>>>project
>>>>>
>>>>>It's great to see interest in improving this functionality.  I 
>>>>>think Chimera could be successful as an Apache project.  I don't 
>>>>>have a strong opinion one way or the other as to whether it belongs 
>>>>>as part of Hadoop or separate.
>>>>>
>>>>>I do think there will be some challenges splitting this 
>>>>>functionality out into a separate jar, because of the way our 
>>>>>CLASSPATH works right now.
>>>>>For example, let's say that Hadoop depends on Chimera 1.2 and Spark 
>>>>>depends on Chimera 1.1.  Now Spark jobs have two different versions 
>>>>>fighting it out on the classpath, similar to the situation with 
>>>>>Guava and other libraries.  Perhaps if Chimera adopts a policy of 
>>>>>strong backwards compatibility, we can just always use the latest 
>>>>>jar, but it still seems likely that there will be problems.  There 
>>>>>are various classpath isolation ideas that could help here, but 
>>>>>they are big projects in their own right and we don't have a clear 
>>>>>timeline for them.  If this does end up being a separate jar, we 
>>>>>may need to shade it to avoid all these issues.
>>>>>
>>>>>Bundling the JNI glue code in the jar itself is an interesting 
>>>>>idea, which we have talked about before for libhadoop.so.  It 
>>>>>doesn't really have anything to do with the question of TLP vs.
>>>>>non-TLP, of course.
>>>>>We could do that refactoring in Hadoop itself.  The really 
>>>>>complicated part of bundling JNI code in a jar is that you need to 
>>>>>create jars for every cross product of (JVM version, openssl 
>>>>>version, operating system).
>>>>>For example, you have the RHEL6 build for openJDK7 using openssl 
>>>>>1.0.1e.
>>>>>If you change any one thing-- say, change openJDK7 to Oracle JDK8, 
>>>>>then you might need to rebuild.  And certainly using Ubuntu would 
>>>>>be a rebuild.  And so forth.  This kind of clashes with Maven's 
>>>>>philosophy of pulling prebuilt jars from the internet.
>>>>>
>>>>>Kai Zheng's question about whether we would bundle openSSL's 
>>>>>libraries is a good one.  Given the high rate of new 
>>>>>vulnerabilities discovered in that library, it seems like bundling 
>>>>>would require Hadoop users and vendors to update very frequently, 
>>>>>much more frequently than Hadoop is traditionally updated.  So 
>>>>>probably we would not choose to bundle openssl.
>>>>>
>>>>>best,
>>>>>Colin
>>>>>
>>>>>On Tue, Feb 2, 2016 at 12:29 AM, Chris Douglas 
>>>>><cd...@apache.org>
>>>>>wrote:
>>>>>> As a subproject of Hadoop, Chimera could maintain its own cadence.
>>>>>> There's also no reason why it should maintain dependencies on 
>>>>>> other parts of Hadoop, if those are separable. How is this 
>>>>>> solution inadequate?
>>>>>>
>>>>>> If Chimera is not successful as an independent project or stalls, 
>>>>>> Hadoop and/or Spark and/or $project will have to reabsorb it as 
>>>>>> maintainers. Projects have high mortality in early life, and a 
>>>>>> fight over inheritance/maintenance is something we'd like to avoid.
>>>>>> If, on the other hand, it develops enough of a community where it 
>>>>>> is obviously viable, then we can (and should) break it out as a 
>>>>>> TLP (as we have before). If other Apache projects take a 
>>>>>> dependency on Chimera, we're open to adding them to security@hadoop.
>>>>>>
>>>>>> Unlike Yetus, which was largely rewritten right before it was 
>>>>>> made into a TLP, security in Hadoop has a complicated pedigree.
>>>>>> If Chimera eventually becomes a TLP, it seems fair to include 
>>>>>> those who work on it while it is a subproject. Declared upfront, 
>>>>>> that criterion is fairer than any post hoc justification, and 
>>>>>> will lead to a more accurate account of its community than a 
>>>>>> subset of the Hadoop PMC/committers that volunteer. -C
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 1, 2016 at 9:29 PM, Chen, Haifeng 
>>>>>><ha...@intel.com>
>>>>>>wrote:
>>>>>>> Thanks to all folks providing feedbacks and participating the 
>>>>>>>discussions.
>>>>>>>
>>>>>>> @Owen, do you still have any concerns on going forward in the 
>>>>>>>direction of Apache Commons (or other options, TLP)?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Haifeng
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Chen, Haifeng [mailto:haifeng.chen@intel.com]
>>>>>>> Sent: Saturday, January 30, 2016 10:52 AM
>>>>>>> To: hdfs-dev@hadoop.apache.org
>>>>>>> Subject: RE: Hadoop encryption module as Apache Chimera 
>>>>>>> incubator project
>>>>>>>
>>>>>>>>> I believe encryption is becoming a core part of Hadoop. I 
>>>>>>>>>think that moving core components out of Hadoop is bad from a 
>>>>>>>>>project management perspective.
>>>>>>>
>>>>>>>> Although it's certainly true that encryption capabilities (in 
>>>>>>>>HDFS, YARN, etc.) are becoming core to Hadoop, I don't think 
>>>>>>>>that should really influence whether or not the 
>>>>>>>>non-Hadoop-specific encryption routines should be part of the 
>>>>>>>>Hadoop code base, or part of the code base of another project that Hadoop depends on.
>>>>>>>>If Chimera had existed as a library hosted at ASF when HDFS 
>>>>>>>>encryption was first developed, HDFS probably would have just 
>>>>>>>>added that as a dependency and been done with it. I don't think 
>>>>>>>>we would've copy/pasted the code for Chimera into the Hadoop code base.
>>>>>>>
>>>>>>> Agree with ATM. I want to also make an additional clarification. 
>>>>>>>I agree that the encryption capabilities are becoming core to Hadoop.
>>>>>>>While this effort is to put common and shared encryption routines 
>>>>>>>such as crypto stream implementations into a scope which can be 
>>>>>>>widely shared across the Apache ecosystem. This doesn't move 
>>>>>>>Hadoop encryption out of Hadoop (that is not possible).
>>>>>>>
>>>>>>> Agree if we make it a separate and independent releases project 
>>>>>>>in Hadoop takes a step further than the existing approach and 
>>>>>>>solve some issues (such as libhadoop.so problem). Frankly 
>>>>>>>speaking, I think it is not the best option we can try. I also 
>>>>>>>expect that an independent release project within Hadoop core 
>>>>>>>will also complicate the existing release ideology of Hadoop release.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Haifeng
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Aaron T. Myers [mailto:atm@cloudera.com]
>>>>>>> Sent: Friday, January 29, 2016 9:51 AM
>>>>>>> To: hdfs-dev@hadoop.apache.org
>>>>>>> Subject: Re: Hadoop encryption module as Apache Chimera 
>>>>>>> incubator project
>>>>>>>
>>>>>>> On Wed, Jan 27, 2016 at 11:31 AM, Owen O'Malley 
>>>>>>><om...@apache.org>
>>>>>>>wrote:
>>>>>>>
>>>>>>>> I believe encryption is becoming a core part of Hadoop. I think 
>>>>>>>>that  moving core components out of Hadoop is bad from a project 
>>>>>>>>management perspective.
>>>>>>>>
>>>>>>>
>>>>>>> Although it's certainly true that encryption capabilities (in 
>>>>>>>HDFS,  YARN,
>>>>>>> etc.) are becoming core to Hadoop, I don't think that should 
>>>>>>>really influence whether or not the non-Hadoop-specific 
>>>>>>>encryption routines should be part of the Hadoop code base, or 
>>>>>>>part of the code base of another project that Hadoop depends on.
>>>>>>>If Chimera had existed as a library hosted at ASF when HDFS 
>>>>>>>encryption was first developed, HDFS probably would have just 
>>>>>>>added that as a dependency and been done with it. I don't think 
>>>>>>>we would've copy/pasted the code for Chimera into the Hadoop code base.
>>>>>>>
>>>>>>>
>>>>>>>> To put it another way, a bug in the encryption routines will 
>>>>>>>>likely become a security problem that security@hadoop needs to 
>>>>>>>>hear about.
>>>>>>>>
>>>>>>> I don't think
>>>>>>>> adding a separate project in the middle of that communication 
>>>>>>>>chain  is a good idea. The same applies to data corruption 
>>>>>>>>problems, and so on...
>>>>>>>>
>>>>>>>
>>>>>>> Isn't the same true of all the libraries that Hadoop currently 
>>>>>>>depends upon? If the commons-httpclient library (or 
>>>>>>>commons-codec, or commons-io, or guava, or...) has a security 
>>>>>>>vulnerability, we need to know about it so that we can update our 
>>>>>>>dependency to a fixed version.
>>>>>>>This case doesn't seem materially different than that.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> > It may be good to keep at generalized place(As in the 
>>>>>>>> > discussion, we thought that place could be Apache Commons).
>>>>>>>>
>>>>>>>>
>>>>>>>> Apache Commons is a collection of *Java* projects, so Chimera 
>>>>>>>> as a JNI-based library isn't a natural fit.
>>>>>>>>
>>>>>>>
>>>>>>> Could very well be that Apache Commons's charter would preclude 
>>>>>>>Chimera.
>>>>>>> You probably know better than I do about that.
>>>>>>>
>>>>>>>
>>>>>>>> Furthermore, Apache Commons doesn't have its own security list 
>>>>>>>> so problems will go to the generic security@apache.org.
>>>>>>>>
>>>>>>>
>>>>>>> That seems easy enough to remedy, if they wanted to, and besides 
>>>>>>>I'm not sure why that would influence this discussion. In my 
>>>>>>>experience projects that don't have a separate 
>>>>>>>security@project.a.o mailing list tend to just handle security 
>>>>>>>issues on their private@project.a.o mailing list, which seems fine to me.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Why do you think that Apache Commons is a better home than Hadoop?
>>>>>>>>
>>>>>>>
>>>>>>> I'm certainly not at all wedded to Apache Commons, that just 
>>>>>>>seemed like a natural place to put it to me. Could be that a 
>>>>>>>brand new TLP might make more sense.
>>>>>>>
>>>>>>> I *do* think that if other non-Hadoop projects want to make use 
>>>>>>>of Chimera, which as I understand it is the goal which started 
>>>>>>>this thread, then Chimera should exist outside of Hadoop so that:
>>>>>>>
>>>>>>> a) Projects that have nothing to do with Hadoop can just depend 
>>>>>>>directly on Chimera, which has nothing Hadoop-specific in there.
>>>>>>>
>>>>>>> b) The Hadoop project doesn't have to export/maintain/concern 
>>>>>>>itself with yet another publicly-consumed interface.
>>>>>>>
>>>>>>> c) Chimera can have its own (presumably much faster) release 
>>>>>>>cadence completely separate from Hadoop.
>>>>>>>
>>>>>>> --
>>>>>>> Aaron T. Myers
>>>>>>> Software Engineer, Cloudera
>>>>
>

RE: Hadoop encryption module as Apache Chimera incubator project

Posted by "Chen, Haifeng" <ha...@intel.com>.
> [Chris] Yes, but even if the artifact is widely consumed, as a TLP it would need to sustain a community. If the scope is too narrow, then it will quickly fall into maintenance mode, its contributors will move on, and it will retire to the attic. Alone, I doubt its viability as a TLP. So as a first option, donating only this code to Apache Commons would accomplish some immediate goals in a sustainable forum.
Totally agree. As a TLP it needs a well-defined scope and roadmap to sustain a development community. 

Thanks,
Haifeng

-----Original Message-----
From: Chris Douglas [mailto:cdouglas@apache.org] 
Sent: Friday, February 5, 2016 6:28 AM
To: common-dev@hadoop.apache.org
Cc: hdfs-dev@hadoop.apache.org
Subject: Re: Hadoop encryption module as Apache Chimera incubator project

On Thu, Feb 4, 2016 at 12:06 PM, Gangumalla, Uma <um...@intel.com> wrote:

> [UMA] Ok. Great. You are right. I have cc'ed to hadoop common. (You 
> mean to cc Apache commons as well?)

I meant, if you start a discussion with Apache Commons, please CC common-dev@hadoop to coordinate.

> [UMA] Right now the encryption libraries are the only ones we plan to 
> include, and we see a lot of interest from other projects like Spark 
> in using them. I see some challenges when we bring a lot of other 
> common code into this project: it would all have different 
> requirements, and perhaps different expected timelines for releases.
> Some projects may want to use only the encryption interfaces, not all of it.
> As these are completely independent pieces of code, it may be better 
> to scope the project out clearly.

Yes, but even if the artifact is widely consumed, as a TLP it would need to sustain a community. If the scope is too narrow, then it will quickly fall into maintenance mode, its contributors will move on, and it will retire to the attic. Alone, I doubt its viability as a TLP. So as a first option, donating only this code to Apache Commons would accomplish some immediate goals in a sustainable forum.

APR [1] has a similar scope. As a second option, that may also be a reasonable home, particularly if some of the native bits could integrate with APR.

If the scope is broader, the effort could sustain prolonged development. The current code is developing a strategy for packing native libraries on multiple platforms, a capability that, say, the native compression codecs (AFAIK) still lack. While java.nio is improving, many projects would benefit from a better, native interface to the filesystem (e.g., NativeIO). We could avoid duplicating effort and collaborate on a common library.

As a third option, Hadoop already implements some useful native libraries, which is why a subproject might be a sound course. That would enable the subproject to coordinate with Hadoop on migrating its native functionality to a separable, reusable component, then move to a TLP when we can rely on it exclusively (if it has a well-defined, independent community). It could control its release cadence and limit its dependencies.

Finally, this is beside the point if nobody is interested in doing the work on such a project. It's rude to pull code out of Hadoop and donate it to another project so Spark can avoid a dependency, but this instance seems reasonable to me. -C

[1] https://apr.apache.org/

> On 2/3/16, 6:46 PM, "Chen, Haifeng" <ha...@intel.com> wrote:
>
>>Thanks Chris.
>>
>>>> I went through the repository, and now understand the reasoning 
>>>>that would locate this code in Apache Commons. This isn't proposing 
>>>>to extract much of the implementation and it takes none of the 
>>>>integration. It's limited to interfaces to crypto libraries and 
>>>>streams/configuration.
>>Exactly.
>>
>>>> Chimera would be a boutique TLP, unless we wanted to draw out more 
>>>>of the integration and tooling. Is that a goal you're interested in 
>>>>pursuing? There's a tension between keeping this focused and 
>>>>including enough functionality to make it viable as an independent component.
>>The goal of Chimera was to provide useful, common, and optimized 
>>cryptographic functionality. I would prefer that it stay focused on 
>>this clear scope. Requirements from multiple domains would bring more 
>>challenges and uncertainty about where and how it should go, and thus 
>>more risk of stalling.
>>
>>>> If the encryption libraries are the only ones you're interested in 
>>>>pulling out, then Apache Commons does seem like a better target than 
>>>>a separate project.
>>Yes. As mentioned above, the library will be positioned as a 
>>cryptographic library.
>>
>>
>>Thanks,
>>
>>-----Original Message-----
>>From: Chris Douglas [mailto:cdouglas@apache.org]
>>Sent: Thursday, February 4, 2016 7:26 AM
>>To: hdfs-dev@hadoop.apache.org
>>Subject: Re: Hadoop encryption module as Apache Chimera incubator 
>>project
>>
>>I went through the repository, and now understand the reasoning that 
>>would locate this code in Apache Commons. This isn't proposing to 
>>extract much of the implementation and it takes none of the 
>>integration. It's limited to interfaces to crypto libraries and 
>>streams/configuration. It might be a reasonable fit for commons-codec, 
>>but that's a pretty sparse library and driving the release cadence 
>>might be more complicated. It'd be worth discussing on their lists (please also CC common-dev@).
>>
>>Chimera would be a boutique TLP, unless we wanted to draw out more of 
>>the integration and tooling. Is that a goal you're interested in pursuing?
>>There's a tension between keeping this focused and including enough 
>>functionality to make it viable as an independent component. By way of 
>>example, Hadoop's common project requires too many dependencies and 
>>carries too much historical baggage for other projects to rely on.
>>I agree with Colin/Steve: we don't want this to grow into another 
>>guava-like dependency that creates more work in conflicts than it 
>>saves in implementation...
>>
>>Would it make sense to also package some of the compression libraries, 
>>and maybe some of the text processing from MapReduce? Evolving some of 
>>this code to a common library with few/no dependencies would be 
>>generally useful. As a subproject, it could have a broader scope that 
>>could evolve into a viable TLP. If the encryption libraries are the 
>>only ones you're interested in pulling out, then Apache Commons does 
>>seem like a better target than a separate project. -C
>>
>>
>>On Wed, Feb 3, 2016 at 1:49 AM, Chris Douglas <cd...@apache.org> wrote:
>>> On Wed, Feb 3, 2016 at 12:48 AM, Gangumalla, Uma 
>>> <um...@intel.com> wrote:
>>>>>From the standpoint of a shared, fundamental piece of code like 
>>>>>this, I do think Apache Commons might be the best direction to 
>>>>>try as a first effort. In this direction, we still need to work 
>>>>>with the Apache Commons community on buying in and accepting the proposal.
>>>> Make sense.
>>>
>>> Makes sense how?
>>>
>>>> For this we should define independent release cycles for this 
>>>> project, and it would just be placed under the Hadoop tree if we 
>>>> all conclude with this option in the end.
>>>
>>> Yes.
>>>
>>>> [Chris]
>>>>>If Chimera is not successful as an independent project or stalls, 
>>>>>Hadoop and/or Spark and/or $project will have to reabsorb it as 
>>>>>maintainers.
>>>>>
>>>> I am not so strong on this point. If we assume the project would 
>>>>be unsuccessful, it can be unsuccessful (less maintained) even 
>>>>under Hadoop.
>>>> But if other projects depend on this piece, then they would get 
>>>>less support either way. Of course, right now we feel this piece 
>>>>of code is very important, and we feel (expect) it can be 
>>>>successful as an independent project, irrespective of whether it 
>>>>is a separate project outside Hadoop or inside.
>>>> So, I feel this point should not really influence the judgment in 
>>>>this discussion.
>>>
>>> Sure; code can idle anywhere, but that wasn't the point I was after.
>>> You propose to extract code from Hadoop, but if Chimera fails then 
>>> what recourse do we have among the other projects taking a 
>>> dependency on it? Splitting off another project is feasible, but 
>>> Chimera should be sustainable before this PMC can divest itself of 
>>> responsibility for security libraries. That's a pretty low bar.
>>>
>>> Bundling the library with the jar is helpful; I've used that before.
>>> It should prefer (updated) libraries from the environment, if 
>>> configured. Otherwise it's a pain (or impossible) for ops to patch 
>>> security bugs. -C
>>>
>>>>>-----Original Message-----
>>>>>From: Colin P. McCabe [mailto:cmccabe@apache.org]
>>>>>Sent: Wednesday, February 3, 2016 4:56 AM
>>>>>To: hdfs-dev@hadoop.apache.org
>>>>>Subject: Re: Hadoop encryption module as Apache Chimera incubator 
>>>>>project
>>>>>
>>>>>It's great to see interest in improving this functionality.  I 
>>>>>think Chimera could be successful as an Apache project.  I don't 
>>>>>have a strong opinion one way or the other as to whether it belongs 
>>>>>as part of Hadoop or separate.
>>>>>
>>>>>I do think there will be some challenges splitting this 
>>>>>functionality out into a separate jar, because of the way our 
>>>>>CLASSPATH works right now.
>>>>>For example, let's say that Hadoop depends on Chimera 1.2 and Spark 
>>>>>depends on Chimera 1.1.  Now Spark jobs have two different versions 
>>>>>fighting it out on the classpath, similar to the situation with 
>>>>>Guava and other libraries.  Perhaps if Chimera adopts a policy of 
>>>>>strong backwards compatibility, we can just always use the latest 
>>>>>jar, but it still seems likely that there will be problems.  There 
>>>>>are various classpath isolation ideas that could help here, but 
>>>>>they are big projects in their own right and we don't have a clear 
>>>>>timeline for them.  If this does end up being a separate jar, we 
>>>>>may need to shade it to avoid all these issues.
>>>>>
>>>>>Bundling the JNI glue code in the jar itself is an interesting 
>>>>>idea, which we have talked about before for libhadoop.so.  It 
>>>>>doesn't really have anything to do with the question of TLP vs. 
>>>>>non-TLP, of course.
>>>>>We could do that refactoring in Hadoop itself.  The really 
>>>>>complicated part of bundling JNI code in a jar is that you need to 
>>>>>create jars for every cross product of (JVM version, openssl 
>>>>>version, operating system).
>>>>>For example, you have the RHEL6 build for openJDK7 using openssl 
>>>>>1.0.1e.
>>>>>If you change any one thing-- say, change openJDK7 to Oracle JDK8, 
>>>>>then you might need to rebuild.  And certainly using Ubuntu would 
>>>>>be a rebuild.  And so forth.  This kind of clashes with Maven's 
>>>>>philosophy of pulling prebuilt jars from the internet.
>>>>>
>>>>>Kai Zheng's question about whether we would bundle openSSL's 
>>>>>libraries is a good one.  Given the high rate of new 
>>>>>vulnerabilities discovered in that library, it seems like bundling 
>>>>>would require Hadoop users and vendors to update very frequently, 
>>>>>much more frequently than Hadoop is traditionally updated.  So 
>>>>>probably we would not choose to bundle openssl.
>>>>>
>>>>>best,
>>>>>Colin
>>>>>
>>>>>On Tue, Feb 2, 2016 at 12:29 AM, Chris Douglas 
>>>>><cd...@apache.org>
>>>>>wrote:
>>>>>> As a subproject of Hadoop, Chimera could maintain its own cadence.
>>>>>> There's also no reason why it should maintain dependencies on 
>>>>>> other parts of Hadoop, if those are separable. How is this 
>>>>>> solution inadequate?
>>>>>>
>>>>>> If Chimera is not successful as an independent project or stalls, 
>>>>>> Hadoop and/or Spark and/or $project will have to reabsorb it as 
>>>>>> maintainers. Projects have high mortality in early life, and a 
>>>>>> fight over inheritance/maintenance is something we'd like to avoid.
>>>>>> If, on the other hand, it develops enough of a community where it 
>>>>>> is obviously viable, then we can (and should) break it out as a 
>>>>>> TLP (as we have before). If other Apache projects take a 
>>>>>> dependency on Chimera, we're open to adding them to security@hadoop.
>>>>>>
>>>>>> Unlike Yetus, which was largely rewritten right before it was 
>>>>>> made into a TLP, security in Hadoop has a complicated pedigree. 
>>>>>> If Chimera eventually becomes a TLP, it seems fair to include 
>>>>>> those who work on it while it is a subproject. Declared upfront, 
>>>>>> that criterion is fairer than any post hoc justification, and 
>>>>>> will lead to a more accurate account of its community than a 
>>>>>> subset of the Hadoop PMC/committers that volunteer. -C
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 1, 2016 at 9:29 PM, Chen, Haifeng 
>>>>>><ha...@intel.com>
>>>>>>wrote:
>>>>>>> Thanks to all folks providing feedbacks and participating the 
>>>>>>>discussions.
>>>>>>>
>>>>>>> @Owen, do you still have any concerns on going forward in the 
>>>>>>>direction of Apache Commons (or other options, TLP)?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Haifeng
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Chen, Haifeng [mailto:haifeng.chen@intel.com]
>>>>>>> Sent: Saturday, January 30, 2016 10:52 AM
>>>>>>> To: hdfs-dev@hadoop.apache.org
>>>>>>> Subject: RE: Hadoop encryption module as Apache Chimera 
>>>>>>> incubator project
>>>>>>>
>>>>>>>>> I believe encryption is becoming a core part of Hadoop. I 
>>>>>>>>>think that moving core components out of Hadoop is bad from a 
>>>>>>>>>project management perspective.
>>>>>>>
>>>>>>>> Although it's certainly true that encryption capabilities (in 
>>>>>>>>HDFS, YARN, etc.) are becoming core to Hadoop, I don't think 
>>>>>>>>that should really influence whether or not the 
>>>>>>>>non-Hadoop-specific encryption routines should be part of the 
>>>>>>>>Hadoop code base, or part of the code base of another project that Hadoop depends on.
>>>>>>>>If Chimera had existed as a library hosted at ASF when HDFS 
>>>>>>>>encryption was first developed, HDFS probably would have just 
>>>>>>>>added that as a dependency and been done with it. I don't think 
>>>>>>>>we would've copy/pasted the code for Chimera into the Hadoop code base.
>>>>>>>
>>>>>>> Agree with ATM. I want to also make an additional clarification. 
>>>>>>>I agree that the encryption capabilities are becoming core to Hadoop.
>>>>>>>While this effort is to put common and shared encryption routines 
>>>>>>>such as crypto stream implementations into a scope which can be 
>>>>>>>widely shared across the Apache ecosystem. This doesn't move 
>>>>>>>Hadoop encryption out of Hadoop (that is not possible).
>>>>>>>
>>>>>>> Agree if we make it a separate and independent releases project 
>>>>>>>in Hadoop takes a step further than the existing approach and 
>>>>>>>solve some issues (such as libhadoop.so problem). Frankly 
>>>>>>>speaking, I think it is not the best option we can try. I also 
>>>>>>>expect that an independent release project within Hadoop core 
>>>>>>>will also complicate the existing release ideology of Hadoop release.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Haifeng
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Aaron T. Myers [mailto:atm@cloudera.com]
>>>>>>> Sent: Friday, January 29, 2016 9:51 AM
>>>>>>> To: hdfs-dev@hadoop.apache.org
>>>>>>> Subject: Re: Hadoop encryption module as Apache Chimera 
>>>>>>> incubator project
>>>>>>>
>>>>>>> On Wed, Jan 27, 2016 at 11:31 AM, Owen O'Malley 
>>>>>>><om...@apache.org>
>>>>>>>wrote:
>>>>>>>
>>>>>>>> I believe encryption is becoming a core part of Hadoop. I think 
>>>>>>>>that  moving core components out of Hadoop is bad from a project 
>>>>>>>>management perspective.
>>>>>>>>
>>>>>>>
>>>>>>> Although it's certainly true that encryption capabilities (in 
>>>>>>>HDFS,  YARN,
>>>>>>> etc.) are becoming core to Hadoop, I don't think that should 
>>>>>>>really influence whether or not the non-Hadoop-specific 
>>>>>>>encryption routines should be part of the Hadoop code base, or 
>>>>>>>part of the code base of another project that Hadoop depends on. 
>>>>>>>If Chimera had existed as a library hosted at ASF when HDFS 
>>>>>>>encryption was first developed, HDFS probably would have just 
>>>>>>>added that as a dependency and been done with it. I don't think 
>>>>>>>we would've copy/pasted the code for Chimera into the Hadoop code base.
>>>>>>>
>>>>>>>
>>>>>>>> To put it another way, a bug in the encryption routines will  
>>>>>>>>likely become a security problem that security@hadoop needs to 
>>>>>>>>hear about.
>>>>>>>>
>>>>>>> I don't think
>>>>>>>> adding a separate project in the middle of that communication 
>>>>>>>>chain  is a good idea. The same applies to data corruption 
>>>>>>>>problems, and so on...
>>>>>>>>
>>>>>>>
>>>>>>> Isn't the same true of all the libraries that Hadoop currently 
>>>>>>>depends upon? If the commons-httpclient library (or 
>>>>>>>commons-codec, or commons-io, or guava, or...) has a security 
>>>>>>>vulnerability, we need to know about it so that we can update our 
>>>>>>>dependency to a fixed version.
>>>>>>>This case doesn't seem materially different than that.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> > It may be good to keep at generalized place(As in the 
>>>>>>>> > discussion, we thought that place could be Apache Commons).
>>>>>>>>
>>>>>>>>
>>>>>>>> Apache Commons is a collection of *Java* projects, so Chimera 
>>>>>>>> as a JNI-based library isn't a natural fit.
>>>>>>>>
>>>>>>>
>>>>>>> Could very well be that Apache Commons's charter would preclude 
>>>>>>>Chimera.
>>>>>>> You probably know better than I do about that.
>>>>>>>
>>>>>>>
>>>>>>>> Furthermore, Apache Commons doesn't have its own security list 
>>>>>>>> so problems will go to the generic security@apache.org.
>>>>>>>>
>>>>>>>
>>>>>>> That seems easy enough to remedy, if they wanted to, and besides 
>>>>>>>I'm not sure why that would influence this discussion. In my 
>>>>>>>experience projects that don't have a separate 
>>>>>>>security@project.a.o mailing list tend to just handle security 
>>>>>>>issues on their private@project.a.o mailing list, which seems fine to me.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Why do you think that Apache Commons is a better home than Hadoop?
>>>>>>>>
>>>>>>>
>>>>>>> I'm certainly not at all wedded to Apache Commons, that just 
>>>>>>>seemed like a natural place to put it to me. Could be that a 
>>>>>>>brand new TLP might make more sense.
>>>>>>>
>>>>>>> I *do* think that if other non-Hadoop projects want to make use 
>>>>>>>of Chimera, which as I understand it is the goal which started 
>>>>>>>this thread, then Chimera should exist outside of Hadoop so that:
>>>>>>>
>>>>>>> a) Projects that have nothing to do with Hadoop can just depend 
>>>>>>>directly on Chimera, which has nothing Hadoop-specific in there.
>>>>>>>
>>>>>>> b) The Hadoop project doesn't have to export/maintain/concern 
>>>>>>>itself with yet another publicly-consumed interface.
>>>>>>>
>>>>>>> c) Chimera can have its own (presumably much faster) release 
>>>>>>>cadence completely separate from Hadoop.
>>>>>>>
>>>>>>> --
>>>>>>> Aaron T. Myers
>>>>>>> Software Engineer, Cloudera
>>>>
>

RE: Hadoop encryption module as Apache Chimera incubator project

Posted by "Chen, Haifeng" <ha...@intel.com>.
> [Chirs] Yes, but even if the artifact is widely consumed, as a TLP it would need to sustain a community. If the scope is too narrow, then it will quickly fall into maintenance mode, its contributors will move on, and it will retire to the attic. Alone, I doubt its viability as a TLP. So as a first option, donating only this code to Apache Commons would accomplish some immediate goals in a sustainable forum.
Totally agree. As a TLP it needs nice scope and roadmap to sustain a development community. 

Thanks,
Haifeng

-----Original Message-----
From: Chris Douglas [mailto:cdouglas@apache.org] 
Sent: Friday, February 5, 2016 6:28 AM
To: common-dev@hadoop.apache.org
Cc: hdfs-dev@hadoop.apache.org
Subject: Re: Hadoop encryption module as Apache Chimera incubator project

On Thu, Feb 4, 2016 at 12:06 PM, Gangumalla, Uma <um...@intel.com> wrote:

> [UMA] Ok. Great. You are right. I have cc¹ed to hadoop common. (You 
> mean to cc Apache commons as well?)

I meant, if you start a discussion with Apache Commons, please CC common-dev@hadoop to coordinate.

> [UMA] Right now we plan to have encryption libraries are the only 
> one¹s we planned and as we see lot of interest from other projects 
> like spark to use them. I see some challenges when we bring lot of 
> code(other common
> codes) into this project is that, they all would have different 
> requirements and may be different expected timelines for release etc. 
> Some projects may just wanted to use encryption interfaces alone but not all.
> As they are completely independent codes, may be better to scope out 
> clearly.

Yes, but even if the artifact is widely consumed, as a TLP it would need to sustain a community. If the scope is too narrow, then it will quickly fall into maintenance mode, its contributors will move on, and it will retire to the attic. Alone, I doubt its viability as a TLP. So as a first option, donating only this code to Apache Commons would accomplish some immediate goals in a sustainable forum.

APR has a similar scope. As a second option, that may also be a reasonable home, particularly if some of the native bits could integrate with APR.

If the scope is broader, the effort could sustain prolonged development. The current code is developing a strategy for packing native libraries on multiple platforms, a capability that, say, the native compression codecs (AFAIK) still lack. While java.nio is improving, many projects would benefit from a better, native interface to the filesystem (e.g., NativeIO). We could avoid duplicating effort and collaborate on a common library.

As a third option, Hadoop already implements some useful native libraries, which is why a subproject might be a sound course. That would enable the subproject to coordinate with Hadoop on migrating its native functionality to a separable, reusable component, then move to a TLP when we can rely on it exclusively (if it has a well-defined, independent community). It could control its release cadence and limit its dependencies.

Finally, this is beside the point if nobody is interested in doing the work on such a project. It's rude to pull code out of Hadoop and donate it to another project so Spark can avoid a dependency, but this instance seems reasonable to me. -C

[1] https://apr.apache.org/

> On 2/3/16, 6:46 PM, "Chen, Haifeng" <ha...@intel.com> wrote:
>
>>Thanks Chris.
>>
>>>> I went through the repository, and now understand the reasoning 
>>>>that would locate this code in Apache Commons. This isn't proposing 
>>>>to extract much of the implementation and it takes none of the 
>>>>integration. It's limited to interfaces to crypto libraries and 
>>>>streams/configuration.
>>Exactly.
>>
>>>> Chimera would be a boutique TLP, unless we wanted to draw out more 
>>>>of the integration and tooling. Is that a goal you're interested in 
>>>>pursuing? There's a tension between keeping this focused and 
>>>>including enough functionality to make it viable as an independent component.
>>The Chimera goal is to provide useful, common and optimized 
>>cryptographic functionality. I would prefer that it stay focused on 
>>this clear scope. Requirements from multiple domains would add more 
>>challenges and uncertainty about where and how it should go, and thus 
>>more risk of stalling.
>>
>>>> If the encryption libraries are the only ones you're interested in 
>>>>pulling out, then Apache Commons does seem like a better target than 
>>>>a separate project.
>>Yes. As mentioned above, the library will be positioned as a 
>>cryptographic library.
>>
>>
>>Thanks,
>>
>>-----Original Message-----
>>From: Chris Douglas [mailto:cdouglas@apache.org]
>>Sent: Thursday, February 4, 2016 7:26 AM
>>To: hdfs-dev@hadoop.apache.org
>>Subject: Re: Hadoop encryption module as Apache Chimera incubator 
>>project
>>
>>I went through the repository, and now understand the reasoning that 
>>would locate this code in Apache Commons. This isn't proposing to 
>>extract much of the implementation and it takes none of the 
>>integration. It's limited to interfaces to crypto libraries and 
>>streams/configuration. It might be a reasonable fit for commons-codec, 
>>but that's a pretty sparse library and driving the release cadence 
>>might be more complicated. It'd be worth discussing on their lists (please also CC common-dev@).
>>
>>Chimera would be a boutique TLP, unless we wanted to draw out more of 
>>the integration and tooling. Is that a goal you're interested in pursuing?
>>There's a tension between keeping this focused and including enough 
>>functionality to make it viable as an independent component. By way of 
>>example, Hadoop's common project requires too many dependencies and 
>>carries too much historical baggage for other projects to rely on.
>>I agree with Colin/Steve: we don't want this to grow into another 
>>guava-like dependency that creates more work in conflicts than it 
>>saves in implementation...
>>
>>Would it make sense to also package some of the compression libraries, 
>>and maybe some of the text processing from MapReduce? Evolving some of 
>>this code to a common library with few/no dependencies would be 
>>generally useful. As a subproject, it could have a broader scope that 
>>could evolve into a viable TLP. If the encryption libraries are the 
>>only ones you're interested in pulling out, then Apache Commons does 
>>seem like a better target than a separate project. -C
>>
>>
>>On Wed, Feb 3, 2016 at 1:49 AM, Chris Douglas <cd...@apache.org> wrote:
>>> On Wed, Feb 3, 2016 at 12:48 AM, Gangumalla, Uma 
>>> <um...@intel.com> wrote:
>>>>>Standing on the point of a shared fundamental piece of code like 
>>>>>this, I do think Apache Commons might be the best direction to try 
>>>>>as a first effort. In this direction, we still need to work with 
>>>>>the Apache Commons community on buying in and accepting the proposal.
>>>> Makes sense.
>>>
>>> Makes sense how?
>>>
>>>> For this we should define independent release cycles for this 
>>>> project, and it would just be placed under the Hadoop tree if we 
>>>> all conclude with this option at the end.
>>>
>>> Yes.
>>>
>>>> [Chris]
>>>>>If Chimera is not successful as an independent project or stalls, 
>>>>>Hadoop and/or Spark and/or $project will have to reabsorb it as 
>>>>>maintainers.
>>>>>
>>>> I am not so strong on this point. If we assume the project would be 
>>>> unsuccessful, it can be unsuccessful (less maintained) even under 
>>>> Hadoop. And if other projects depend on this piece, they would then 
>>>> get less support. Of course, right now we feel this piece of code 
>>>> is very important, and we expect it can be successful as an 
>>>> independent project, whether it lives outside Hadoop or inside. 
>>>> So I feel this point should not really influence how we judge the 
>>>> discussion.
>>>
>>> Sure; code can idle anywhere, but that wasn't the point I was after.
>>> You propose to extract code from Hadoop, but if Chimera fails then 
>>> what recourse do we have among the other projects taking a 
>>> dependency on it? Splitting off another project is feasible, but 
>>> Chimera should be sustainable before this PMC can divest itself of 
>>> responsibility for security libraries. That's a pretty low bar.
>>>
>>> Bundling the library with the jar is helpful; I've used that before.
>>> It should prefer (updated) libraries from the environment, if 
>>> configured. Otherwise it's a pain (or impossible) for ops to patch 
>>> security bugs. -C
>>>
>>>>>-----Original Message-----
>>>>>From: Colin P. McCabe [mailto:cmccabe@apache.org]
>>>>>Sent: Wednesday, February 3, 2016 4:56 AM
>>>>>To: hdfs-dev@hadoop.apache.org
>>>>>Subject: Re: Hadoop encryption module as Apache Chimera incubator 
>>>>>project
>>>>>
>>>>>It's great to see interest in improving this functionality.  I 
>>>>>think Chimera could be successful as an Apache project.  I don't 
>>>>>have a strong opinion one way or the other as to whether it belongs 
>>>>>as part of Hadoop or separate.
>>>>>
>>>>>I do think there will be some challenges splitting this 
>>>>>functionality out into a separate jar, because of the way our 
>>>>>CLASSPATH works right now.
>>>>>For example, let's say that Hadoop depends on Chimera 1.2 and Spark 
>>>>>depends on Chimera 1.1.  Now Spark jobs have two different versions 
>>>>>fighting it out on the classpath, similar to the situation with 
>>>>>Guava and other libraries.  Perhaps if Chimera adopts a policy of 
>>>>>strong backwards compatibility, we can just always use the latest 
>>>>>jar, but it still seems likely that there will be problems.  There 
>>>>>are various classpath isolation ideas that could help here, but 
>>>>>they are big projects in their own right and we don't have a clear 
>>>>>timeline for them.  If this does end up being a separate jar, we 
>>>>>may need to shade it to avoid all these issues.
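
To see Colin's classpath concern in practice, a generic JDK-only diagnostic (nothing Chimera-specific) can report which jar a class was actually loaded from:

    // Generic diagnostic: print which jar a class was loaded from. Handy
    // when two versions of a library are fighting it out on the classpath.
    public class WhichJar {
        public static void main(String[] args) throws ClassNotFoundException {
            Class<?> c = Class.forName(args[0]); // e.g. a Guava class name
            // note: getCodeSource() can be null for JDK bootstrap classes
            System.out.println(c.getName() + " loaded from "
                    + c.getProtectionDomain().getCodeSource().getLocation());
        }
    }
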
>>>>>
>>>>>Bundling the JNI glue code in the jar itself is an interesting 
>>>>>idea, which we have talked about before for libhadoop.so.  It 
>>>>>doesn't really have anything to do with the question of TLP vs. 
>>>>>non-TLP, of course.
>>>>>We could do that refactoring in Hadoop itself.  The really 
>>>>>complicated part of bundling JNI code in a jar is that you need to 
>>>>>create jars for every cross product of (JVM version, openssl 
>>>>>version, operating system).
>>>>>For example, you have the RHEL6 build for openJDK7 using openssl 
>>>>>1.0.1e.
>>>>>If you change any one thing-- say, change openJDK7 to Oracle JDK8, 
>>>>>then you might need to rebuild.  And certainly using Ubuntu would 
>>>>>be a rebuild.  And so forth.  This kind of clashes with Maven's 
>>>>>philosophy of pulling prebuilt jars from the internet.
>>>>>
>>>>>Kai Zheng's question about whether we would bundle openSSL's 
>>>>>libraries is a good one.  Given the high rate of new 
>>>>>vulnerabilities discovered in that library, it seems like bundling 
>>>>>would require Hadoop users and vendors to update very frequently, 
>>>>>much more frequently than Hadoop is traditionally updated.  So 
>>>>>probably we would not choose to bundle openssl.
>>>>>
>>>>>best,
>>>>>Colin
>>>>>
>>>>>On Tue, Feb 2, 2016 at 12:29 AM, Chris Douglas 
>>>>><cd...@apache.org>
>>>>>wrote:
>>>>>> As a subproject of Hadoop, Chimera could maintain its own cadence.
>>>>>> There's also no reason why it should maintain dependencies on 
>>>>>> other parts of Hadoop, if those are separable. How is this 
>>>>>> solution inadequate?
>>>>>>
>>>>>> If Chimera is not successful as an independent project or stalls, 
>>>>>> Hadoop and/or Spark and/or $project will have to reabsorb it as 
>>>>>> maintainers. Projects have high mortality in early life, and a 
>>>>>> fight over inheritance/maintenance is something we'd like to avoid.
>>>>>> If, on the other hand, it develops enough of a community where it 
>>>>>> is obviously viable, then we can (and should) break it out as a 
>>>>>> TLP (as we have before). If other Apache projects take a 
>>>>>> dependency on Chimera, we're open to adding them to security@hadoop.
>>>>>>
>>>>>> Unlike Yetus, which was largely rewritten right before it was 
>>>>>> made into a TLP, security in Hadoop has a complicated pedigree. 
>>>>>> If Chimera eventually becomes a TLP, it seems fair to include 
>>>>>> those who work on it while it is a subproject. Declared upfront, 
>>>>>> that criterion is fairer than any post hoc justification, and 
>>>>>> will lead to a more accurate account of its community than a 
>>>>>> subset of the Hadoop PMC/committers that volunteer. -C
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 1, 2016 at 9:29 PM, Chen, Haifeng 
>>>>>><ha...@intel.com>
>>>>>>wrote:
>>>>>>> Thanks to all folks providing feedbacks and participating the 
>>>>>>>discussions.
>>>>>>>
>>>>>>> @Owen, do you still have any concerns on going forward in the 
>>>>>>>direction of Apache Commons (or other options, TLP)?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Haifeng
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Chen, Haifeng [mailto:haifeng.chen@intel.com]
>>>>>>> Sent: Saturday, January 30, 2016 10:52 AM
>>>>>>> To: hdfs-dev@hadoop.apache.org
>>>>>>> Subject: RE: Hadoop encryption module as Apache Chimera 
>>>>>>> incubator project
>>>>>>>
>>>>>>>>> I believe encryption is becoming a core part of Hadoop. I 
>>>>>>>>>think that moving core components out of Hadoop is bad from a 
>>>>>>>>>project management perspective.
>>>>>>>
>>>>>>>> Although it's certainly true that encryption capabilities (in 
>>>>>>>>HDFS, YARN, etc.) are becoming core to Hadoop, I don't think 
>>>>>>>>that should really influence whether or not the 
>>>>>>>>non-Hadoop-specific encryption routines should be part of the 
>>>>>>>>Hadoop code base, or part of the code base of another project that Hadoop depends on.
>>>>>>>>If Chimera had existed as a library hosted at ASF when HDFS 
>>>>>>>>encryption was first developed, HDFS probably would have just 
>>>>>>>>added that as a dependency and been done with it. I don't think 
>>>>>>>>we would've copy/pasted the code for Chimera into the Hadoop code base.
>>>>>>>
>>>>>>> Agree with ATM. I want to also make an additional clarification. 
>>>>>>>I agree that the encryption capabilities are becoming core to Hadoop.
>>>>>>>While this effort is to put common and shared encryption routines 
>>>>>>>such as crypto stream implementations into a scope which can be 
>>>>>>>widely shared across the Apache ecosystem. This doesn't move 
>>>>>>>Hadoop encryption out of Hadoop (that is not possible).
>>>>>>>
>>>>>>> Agree if we make it a separate and independent releases project 
>>>>>>>in Hadoop takes a step further than the existing approach and 
>>>>>>>solve some issues (such as libhadoop.so problem). Frankly 
>>>>>>>speaking, I think it is not the best option we can try. I also 
>>>>>>>expect that an independent release project within Hadoop core 
>>>>>>>will also complicate the existing release ideology of Hadoop release.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Haifeng
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Aaron T. Myers [mailto:atm@cloudera.com]
>>>>>>> Sent: Friday, January 29, 2016 9:51 AM
>>>>>>> To: hdfs-dev@hadoop.apache.org
>>>>>>> Subject: Re: Hadoop encryption module as Apache Chimera 
>>>>>>> incubator project
>>>>>>>
>>>>>>> On Wed, Jan 27, 2016 at 11:31 AM, Owen O'Malley 
>>>>>>><om...@apache.org>
>>>>>>>wrote:
>>>>>>>
>>>>>>>> I believe encryption is becoming a core part of Hadoop. I think 
>>>>>>>>that  moving core components out of Hadoop is bad from a project 
>>>>>>>>management perspective.
>>>>>>>>
>>>>>>>
>>>>>>> Although it's certainly true that encryption capabilities (in 
>>>>>>>HDFS,  YARN,
>>>>>>> etc.) are becoming core to Hadoop, I don't think that should 
>>>>>>>really influence whether or not the non-Hadoop-specific 
>>>>>>>encryption routines should be part of the Hadoop code base, or 
>>>>>>>part of the code base of another project that Hadoop depends on. 
>>>>>>>If Chimera had existed as a library hosted at ASF when HDFS 
>>>>>>>encryption was first developed, HDFS probably would have just 
>>>>>>>added that as a dependency and been done with it. I don't think 
>>>>>>>we would've copy/pasted the code for Chimera into the Hadoop code base.
>>>>>>>
>>>>>>>
>>>>>>>> To put it another way, a bug in the encryption routines will  
>>>>>>>>likely become a security problem that security@hadoop needs to 
>>>>>>>>hear about.
>>>>>>>>
>>>>>>> I don't think
>>>>>>>> adding a separate project in the middle of that communication 
>>>>>>>>chain  is a good idea. The same applies to data corruption 
>>>>>>>>problems, and so on...
>>>>>>>>
>>>>>>>
>>>>>>> Isn't the same true of all the libraries that Hadoop currently 
>>>>>>>depends upon? If the commons-httpclient library (or 
>>>>>>>commons-codec, or commons-io, or guava, or...) has a security 
>>>>>>>vulnerability, we need to know about it so that we can update our 
>>>>>>>dependency to a fixed version.
>>>>>>>This case doesn't seem materially different than that.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> > It may be good to keep at generalized place(As in the 
>>>>>>>> > discussion, we thought that place could be Apache Commons).
>>>>>>>>
>>>>>>>>
>>>>>>>> Apache Commons is a collection of *Java* projects, so Chimera 
>>>>>>>> as a JNI-based library isn't a natural fit.
>>>>>>>>
>>>>>>>
>>>>>>> Could very well be that Apache Commons's charter would preclude 
>>>>>>>Chimera.
>>>>>>> You probably know better than I do about that.
>>>>>>>
>>>>>>>
>>>>>>>> Furthermore, Apache Commons doesn't have its own security list 
>>>>>>>> so problems will go to the generic security@apache.org.
>>>>>>>>
>>>>>>>
>>>>>>> That seems easy enough to remedy, if they wanted to, and besides 
>>>>>>>I'm not sure why that would influence this discussion. In my 
>>>>>>>experience projects that don't have a separate 
>>>>>>>security@project.a.o mailing list tend to just handle security 
>>>>>>>issues on their private@project.a.o mailing list, which seems fine to me.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Why do you think that Apache Commons is a better home than Hadoop?
>>>>>>>>
>>>>>>>
>>>>>>> I'm certainly not at all wedded to Apache Commons, that just 
>>>>>>>seemed like a natural place to put it to me. Could be that a 
>>>>>>>brand new TLP might make more sense.
>>>>>>>
>>>>>>> I *do* think that if other non-Hadoop projects want to make use 
>>>>>>>of Chimera, which as I understand it is the goal which started 
>>>>>>>this thread, then Chimera should exist outside of Hadoop so that:
>>>>>>>
>>>>>>> a) Projects that have nothing to do with Hadoop can just depend 
>>>>>>>directly on Chimera, which has nothing Hadoop-specific in there.
>>>>>>>
>>>>>>> b) The Hadoop project doesn't have to export/maintain/concern 
>>>>>>>itself with yet another publicly-consumed interface.
>>>>>>>
>>>>>>> c) Chimera can have its own (presumably much faster) release 
>>>>>>>cadence completely separate from Hadoop.
>>>>>>>
>>>>>>> --
>>>>>>> Aaron T. Myers
>>>>>>> Software Engineer, Cloudera
>>>>
>

Re: Hadoop encryption module as Apache Chimera incubator project

Posted by Chris Douglas <cd...@apache.org>.
On Thu, Feb 4, 2016 at 12:06 PM, Gangumalla, Uma
<um...@intel.com> wrote:

> [UMA] Ok. Great. You are right. I have cc'ed hadoop common. (You mean
> to cc Apache Commons as well?)

I meant, if you start a discussion with Apache Commons, please CC
common-dev@hadoop to coordinate.

> [UMA] Right now the encryption libraries are the only ones we have
> planned, and we see a lot of interest from other projects like Spark in
> using them. One challenge I see in bringing a lot of other common code
> into this project is that it would all have different requirements and
> possibly different expected release timelines. Some projects may want
> to use only the encryption interfaces, not everything. As these are
> completely independent pieces of code, it may be better to scope the
> project out clearly.

Yes, but even if the artifact is widely consumed, as a TLP it would
need to sustain a community. If the scope is too narrow, then it will
quickly fall into maintenance mode, its contributors will move on, and
it will retire to the attic. Alone, I doubt its viability as a TLP. So
as a first option, donating only this code to Apache Commons would
accomplish some immediate goals in a sustainable forum.

APR [1] has a similar scope. As a second option, that may also be a
reasonable home, particularly if some of the native bits could
integrate with APR.

If the scope is broader, the effort could sustain prolonged
development. The current code is developing a strategy for packing
native libraries on multiple platforms, a capability that, say, the
native compression codecs (AFAIK) still lack. While java.nio is
improving, many projects would benefit from a better, native interface
to the filesystem (e.g., NativeIO). We could avoid duplicating effort
and collaborate on a common library.

As a third option, Hadoop already implements some useful native
libraries, which is why a subproject might be a sound course. That
would enable the subproject to coordinate with Hadoop on migrating its
native functionality to a separable, reusable component, then move to
a TLP when we can rely on it exclusively (if it has a well-defined,
independent community). It could control its release cadence and limit
its dependencies.

Finally, this is beside the point if nobody is interested in doing the
work on such a project. It's rude to pull code out of Hadoop and
donate it to another project so Spark can avoid a dependency, but this
instance seems reasonable to me. -C

[1] https://apr.apache.org/

> On 2/3/16, 6:46 PM, "Chen, Haifeng" <ha...@intel.com> wrote:
>
>>Thanks Chris.
>>
>>>> I went through the repository, and now understand the reasoning that
>>>>would locate this code in Apache Commons. This isn't proposing to
>>>>extract much of the implementation and it takes none of the
>>>>integration. It's limited to interfaces to crypto libraries and
>>>>streams/configuration.
>>Exactly.
>>
>>>> Chimera would be a boutique TLP, unless we wanted to draw out more of
>>>>the integration and tooling. Is that a goal you're interested in
>>>>pursuing? There's a tension between keeping this focused and including
>>>>enough functionality to make it viable as an independent component.
>>The Chimera goal is to provide useful, common and optimized
>>cryptographic functionality. I would prefer that it stay focused on
>>this clear scope. Requirements from multiple domains would add more
>>challenges and uncertainty about where and how it should go, and thus
>>more risk of stalling.
>>
>>>> If the encryption libraries are the only ones you're interested in
>>>>pulling out, then Apache Commons does seem like a better target than a
>>>>separate project.
>>Yes. As mentioned above, the library will be positioned as a
>>cryptographic library.
>>
>>
>>Thanks,
>>
>>-----Original Message-----
>>From: Chris Douglas [mailto:cdouglas@apache.org]
>>Sent: Thursday, February 4, 2016 7:26 AM
>>To: hdfs-dev@hadoop.apache.org
>>Subject: Re: Hadoop encryption module as Apache Chimera incubator project
>>
>>I went through the repository, and now understand the reasoning that
>>would locate this code in Apache Commons. This isn't proposing to extract
>>much of the implementation and it takes none of the integration. It's
>>limited to interfaces to crypto libraries and streams/configuration. It
>>might be a reasonable fit for commons-codec, but that's a pretty sparse
>>library and driving the release cadence might be more complicated. It'd
>>be worth discussing on their lists (please also CC common-dev@).
>>
>>Chimera would be a boutique TLP, unless we wanted to draw out more of the
>>integration and tooling. Is that a goal you're interested in pursuing?
>>There's a tension between keeping this focused and including enough
>>functionality to make it viable as an independent component. By way of
>>example, Hadoop's common project requires too many dependencies and
>>carries too much historical baggage for other projects to rely on.
>>I agree with Colin/Steve: we don't want this to grow into another
>>guava-like dependency that creates more work in conflicts than it saves
>>in implementation...
>>
>>Would it make sense to also package some of the compression libraries,
>>and maybe some of the text processing from MapReduce? Evolving some of
>>this code to a common library with few/no dependencies would be generally
>>useful. As a subproject, it could have a broader scope that could evolve
>>into a viable TLP. If the encryption libraries are the only ones you're
>>interested in pulling out, then Apache Commons does seem like a better
>>target than a separate project. -C
>>
>>
>>On Wed, Feb 3, 2016 at 1:49 AM, Chris Douglas <cd...@apache.org> wrote:
>>> On Wed, Feb 3, 2016 at 12:48 AM, Gangumalla, Uma
>>> <um...@intel.com> wrote:
>>>>>Standing on the point of a shared fundamental piece of code like
>>>>>this, I do think Apache Commons might be the best direction to try
>>>>>as a first effort. In this direction, we still need to work with the
>>>>>Apache Commons community on buying in and accepting the proposal.
>>>> Makes sense.
>>>
>>> Makes sense how?
>>>
>>>> For this we should define independent release cycles for this
>>>> project, and it would just be placed under the Hadoop tree if we all
>>>> conclude with this option at the end.
>>>
>>> Yes.
>>>
>>>> [Chris]
>>>>>If Chimera is not successful as an independent project or stalls,
>>>>>Hadoop and/or Spark and/or $project will have to reabsorb it as
>>>>>maintainers.
>>>>>
>>>> I am not so strong on this point. If we assume the project would be
>>>> unsuccessful, it can be unsuccessful (less maintained) even under
>>>> Hadoop. And if other projects depend on this piece, they would then
>>>> get less support. Of course, right now we feel this piece of code is
>>>> very important, and we expect it can be successful as an independent
>>>> project, whether it lives outside Hadoop or inside. So I feel this
>>>> point should not really influence how we judge the discussion.
>>>
>>> Sure; code can idle anywhere, but that wasn't the point I was after.
>>> You propose to extract code from Hadoop, but if Chimera fails then
>>> what recourse do we have among the other projects taking a dependency
>>> on it? Splitting off another project is feasible, but Chimera should
>>> be sustainable before this PMC can divest itself of responsibility for
>>> security libraries. That's a pretty low bar.
>>>
>>> Bundling the library with the jar is helpful; I've used that before.
>>> It should prefer (updated) libraries from the environment, if
>>> configured. Otherwise it's a pain (or impossible) for ops to patch
>>> security bugs. -C
>>>
>>>>>-----Original Message-----
>>>>>From: Colin P. McCabe [mailto:cmccabe@apache.org]
>>>>>Sent: Wednesday, February 3, 2016 4:56 AM
>>>>>To: hdfs-dev@hadoop.apache.org
>>>>>Subject: Re: Hadoop encryption module as Apache Chimera incubator
>>>>>project
>>>>>
>>>>>It's great to see interest in improving this functionality.  I think
>>>>>Chimera could be successful as an Apache project.  I don't have a
>>>>>strong opinion one way or the other as to whether it belongs as part
>>>>>of Hadoop or separate.
>>>>>
>>>>>I do think there will be some challenges splitting this functionality
>>>>>out into a separate jar, because of the way our CLASSPATH works right
>>>>>now.
>>>>>For example, let's say that Hadoop depends on Chimera 1.2 and Spark
>>>>>depends on Chimera 1.1.  Now Spark jobs have two different versions
>>>>>fighting it out on the classpath, similar to the situation with Guava
>>>>>and other libraries.  Perhaps if Chimera adopts a policy of strong
>>>>>backwards compatibility, we can just always use the latest jar, but
>>>>>it still seems likely that there will be problems.  There are various
>>>>>classpath isolation ideas that could help here, but they are big
>>>>>projects in their own right and we don't have a clear timeline for
>>>>>them.  If this does end up being a separate jar, we may need to shade
>>>>>it to avoid all these issues.
>>>>>
>>>>>Bundling the JNI glue code in the jar itself is an interesting idea,
>>>>>which we have talked about before for libhadoop.so.  It doesn't
>>>>>really have anything to do with the question of TLP vs. non-TLP, of
>>>>>course.
>>>>>We could do that refactoring in Hadoop itself.  The really
>>>>>complicated part of bundling JNI code in a jar is that you need to
>>>>>create jars for every cross product of (JVM version, openssl version,
>>>>>operating system).
>>>>>For example, you have the RHEL6 build for openJDK7 using openssl
>>>>>1.0.1e.
>>>>>If you change any one thing-- say, change openJDK7 to Oracle JDK8,
>>>>>then you might need to rebuild.  And certainly using Ubuntu would be
>>>>>a rebuild.  And so forth.  This kind of clashes with Maven's
>>>>>philosophy of pulling prebuilt jars from the internet.
>>>>>
>>>>>Kai Zheng's question about whether we would bundle openSSL's
>>>>>libraries is a good one.  Given the high rate of new vulnerabilities
>>>>>discovered in that library, it seems like bundling would require
>>>>>Hadoop users and vendors to update very frequently, much more
>>>>>frequently than Hadoop is traditionally updated.  So probably we would
>>>>>not choose to bundle openssl.
>>>>>
>>>>>best,
>>>>>Colin
>>>>>
>>>>>On Tue, Feb 2, 2016 at 12:29 AM, Chris Douglas <cd...@apache.org>
>>>>>wrote:
>>>>>> As a subproject of Hadoop, Chimera could maintain its own cadence.
>>>>>> There's also no reason why it should maintain dependencies on other
>>>>>> parts of Hadoop, if those are separable. How is this solution
>>>>>> inadequate?
>>>>>>
>>>>>> If Chimera is not successful as an independent project or stalls,
>>>>>> Hadoop and/or Spark and/or $project will have to reabsorb it as
>>>>>> maintainers. Projects have high mortality in early life, and a
>>>>>> fight over inheritance/maintenance is something we'd like to avoid.
>>>>>> If, on the other hand, it develops enough of a community where it
>>>>>> is obviously viable, then we can (and should) break it out as a TLP
>>>>>> (as we have before). If other Apache projects take a dependency on
>>>>>> Chimera, we're open to adding them to security@hadoop.
>>>>>>
>>>>>> Unlike Yetus, which was largely rewritten right before it was made
>>>>>> into a TLP, security in Hadoop has a complicated pedigree. If
>>>>>> Chimera eventually becomes a TLP, it seems fair to include those
>>>>>> who work on it while it is a subproject. Declared upfront, that
>>>>>> criterion is fairer than any post hoc justification, and will lead
>>>>>> to a more accurate account of its community than a subset of the
>>>>>> Hadoop PMC/committers that volunteer. -C
>>>>>>
>>>>>>
>>>>>> On Mon, Feb 1, 2016 at 9:29 PM, Chen, Haifeng
>>>>>><ha...@intel.com>
>>>>>>wrote:
>>>>>>> Thanks to all folks providing feedbacks and participating the
>>>>>>>discussions.
>>>>>>>
>>>>>>> @Owen, do you still have any concerns on going forward in the
>>>>>>>direction of Apache Commons (or other options, TLP)?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Haifeng
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Chen, Haifeng [mailto:haifeng.chen@intel.com]
>>>>>>> Sent: Saturday, January 30, 2016 10:52 AM
>>>>>>> To: hdfs-dev@hadoop.apache.org
>>>>>>> Subject: RE: Hadoop encryption module as Apache Chimera incubator
>>>>>>> project
>>>>>>>
>>>>>>>>> I believe encryption is becoming a core part of Hadoop. I think
>>>>>>>>>that moving core components out of Hadoop is bad from a project
>>>>>>>>>management perspective.
>>>>>>>
>>>>>>>> Although it's certainly true that encryption capabilities (in
>>>>>>>>HDFS, YARN, etc.) are becoming core to Hadoop, I don't think that
>>>>>>>>should really influence whether or not the non-Hadoop-specific
>>>>>>>>encryption routines should be part of the Hadoop code base, or
>>>>>>>>part of the code base of another project that Hadoop depends on.
>>>>>>>>If Chimera had existed as a library hosted at ASF when HDFS
>>>>>>>>encryption was first developed, HDFS probably would have just
>>>>>>>>added that as a dependency and been done with it. I don't think we
>>>>>>>>would've copy/pasted the code for Chimera into the Hadoop code base.
>>>>>>>
>>>>>>> Agree with ATM. I want to also make an additional clarification. I
>>>>>>>agree that the encryption capabilities are becoming core to Hadoop.
>>>>>>>While this effort is to put common and shared encryption routines
>>>>>>>such as crypto stream implementations into a scope which can be
>>>>>>>widely shared across the Apache ecosystem. This doesn't move Hadoop
>>>>>>>encryption out of Hadoop (that is not possible).
>>>>>>>
>>>>>>> Agree if we make it a separate and independent releases project in
>>>>>>>Hadoop takes a step further than the existing approach and solve
>>>>>>>some issues (such as libhadoop.so problem). Frankly speaking, I
>>>>>>>think it is not the best option we can try. I also expect that an
>>>>>>>independent release project within Hadoop core will also complicate
>>>>>>>the existing release ideology of Hadoop release.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Haifeng
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Aaron T. Myers [mailto:atm@cloudera.com]
>>>>>>> Sent: Friday, January 29, 2016 9:51 AM
>>>>>>> To: hdfs-dev@hadoop.apache.org
>>>>>>> Subject: Re: Hadoop encryption module as Apache Chimera incubator
>>>>>>> project
>>>>>>>
>>>>>>> On Wed, Jan 27, 2016 at 11:31 AM, Owen O'Malley
>>>>>>><om...@apache.org>
>>>>>>>wrote:
>>>>>>>
>>>>>>>> I believe encryption is becoming a core part of Hadoop. I think
>>>>>>>>that  moving core components out of Hadoop is bad from a project
>>>>>>>>management perspective.
>>>>>>>>
>>>>>>>
>>>>>>> Although it's certainly true that encryption capabilities (in
>>>>>>>HDFS,  YARN,
>>>>>>> etc.) are becoming core to Hadoop, I don't think that should
>>>>>>>really influence whether or not the non-Hadoop-specific encryption
>>>>>>>routines should be part of the Hadoop code base, or part of the
>>>>>>>code base of another project that Hadoop depends on. If Chimera had
>>>>>>>existed as a library hosted at ASF when HDFS encryption was first
>>>>>>>developed, HDFS probably would have just added that as a dependency
>>>>>>>and been done with it. I don't think we would've copy/pasted the
>>>>>>>code for Chimera into the Hadoop code base.
>>>>>>>
>>>>>>>
>>>>>>>> To put it another way, a bug in the encryption routines will
>>>>>>>> likely become a security problem that security@hadoop needs to
>>>>>>>>hear about.
>>>>>>>>
>>>>>>> I don't think
>>>>>>>> adding a separate project in the middle of that communication
>>>>>>>>chain  is a good idea. The same applies to data corruption
>>>>>>>>problems, and so on...
>>>>>>>>
>>>>>>>
>>>>>>> Isn't the same true of all the libraries that Hadoop currently
>>>>>>>depends upon? If the commons-httpclient library (or commons-codec,
>>>>>>>or commons-io, or guava, or...) has a security vulnerability, we
>>>>>>>need to know about it so that we can update our dependency to a
>>>>>>>fixed version.
>>>>>>>This case doesn't seem materially different than that.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> > It may be good to keep at generalized place(As in the
>>>>>>>> > discussion, we thought that place could be Apache Commons).
>>>>>>>>
>>>>>>>>
>>>>>>>> Apache Commons is a collection of *Java* projects, so Chimera as
>>>>>>>> a JNI-based library isn't a natural fit.
>>>>>>>>
>>>>>>>
>>>>>>> Could very well be that Apache Commons's charter would preclude
>>>>>>>Chimera.
>>>>>>> You probably know better than I do about that.
>>>>>>>
>>>>>>>
>>>>>>>> Furthermore, Apache Commons doesn't have its own security list so
>>>>>>>> problems will go to the generic security@apache.org.
>>>>>>>>
>>>>>>>
>>>>>>> That seems easy enough to remedy, if they wanted to, and besides I'm
>>>>>>>not sure why that would influence this discussion. In my experience
>>>>>>>projects that don't have a separate security@project.a.o mailing list
>>>>>>>tend to just handle security issues on their private@project.a.o
>>>>>>>mailing list, which seems fine to me.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Why do you think that Apache Commons is a better home than Hadoop?
>>>>>>>>
>>>>>>>
>>>>>>> I'm certainly not at all wedded to Apache Commons, that just seemed
>>>>>>>like a natural place to put it to me. Could be that a brand new TLP
>>>>>>>might make more sense.
>>>>>>>
>>>>>>> I *do* think that if other non-Hadoop projects want to make use of
>>>>>>>Chimera, which as I understand it is the goal which started this
>>>>>>>thread, then Chimera should exist outside of Hadoop so that:
>>>>>>>
>>>>>>> a) Projects that have nothing to do with Hadoop can just depend
>>>>>>>directly on Chimera, which has nothing Hadoop-specific in there.
>>>>>>>
>>>>>>> b) The Hadoop project doesn't have to export/maintain/concern itself
>>>>>>>with yet another publicly-consumed interface.
>>>>>>>
>>>>>>> c) Chimera can have its own (presumably much faster) release cadence
>>>>>>>completely separate from Hadoop.
>>>>>>>
>>>>>>> --
>>>>>>> Aaron T. Myers
>>>>>>> Software Engineer, Cloudera
>>>>
>

Re: Hadoop encryption module as Apache Chimera incubator project

Posted by "Gangumalla, Uma" <um...@intel.com>.
Thanks all for the opinions.

Chris wrote:
I went through the repository, and now understand the reasoning that would
locate this code in Apache Commons. This isn't proposing to extract much
of the implementation and it takes none of the integration. It's limited
to interfaces to crypto libraries and streams/configuration. It might be a
reasonable fit for commons-codec, but that's a pretty sparse library and
driving the release cadence might be more complicated. It'd be worth
discussing on their lists (please also CC common-dev@).

[UMA] Ok. Great. You are right. I have cc'ed hadoop common. (You mean
to cc Apache Commons as well?)

Chris wrote:
Would it make sense to also package some of the compression libraries, and
maybe some of the text processing from MapReduce? Evolving some of this
code to a common library with few/no dependencies would be generally
useful. As a subproject, it could have a broader scope that could evolve
into a viable TLP. If the encryption libraries are the only ones you're
interested in pulling out, then Apache Commons does seem like a better
target than a separate project. -C

[UMA] Right now the encryption libraries are the only ones we have
planned, and we see a lot of interest from other projects like Spark in
using them. One challenge I see in bringing a lot of other common code
into this project is that it would all have different requirements and
possibly different expected release timelines. Some projects may want to
use only the encryption interfaces, not everything. As these are
completely independent pieces of code, it may be better to scope the
project out clearly.
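
To make the scope concrete, here is a minimal sketch of the kind of crypto stream interfaces being discussed, written against the stock JDK (javax.crypto); Chimera's own stream classes and openssl-backed ciphers may differ:

    // Self-contained JDK-only crypto stream round trip with AES/CTR.
    import javax.crypto.Cipher;
    import javax.crypto.CipherInputStream;
    import javax.crypto.CipherOutputStream;
    import javax.crypto.spec.IvParameterSpec;
    import javax.crypto.spec.SecretKeySpec;
    import java.io.*;
    import java.nio.charset.StandardCharsets;

    public class CryptoStreamSketch {
        public static void main(String[] args) throws Exception {
            byte[] key = new byte[16];  // all-zero demo key; never do this for real
            byte[] iv  = new byte[16];  // all-zero demo IV
            SecretKeySpec keySpec = new SecretKeySpec(key, "AES");

            // encrypt a short message through a stream
            Cipher enc = Cipher.getInstance("AES/CTR/NoPadding");
            enc.init(Cipher.ENCRYPT_MODE, keySpec, new IvParameterSpec(iv));
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (OutputStream out = new CipherOutputStream(sink, enc)) {
                out.write("hello, encrypted world".getBytes(StandardCharsets.UTF_8));
            }

            // read it back through a decrypting stream
            Cipher dec = Cipher.getInstance("AES/CTR/NoPadding");
            dec.init(Cipher.DECRYPT_MODE, keySpec, new IvParameterSpec(iv));
            try (InputStream in = new CipherInputStream(
                    new ByteArrayInputStream(sink.toByteArray()), dec)) {
                System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
            }
        }
    }

Running the sketch prints the original plaintext, round-tripping through encrypt and decrypt.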

Chris wrote:
Bundling the library with the jar is helpful; I've used that before.
It should prefer (updated) libraries from the environment, if
configured. Otherwise it's a pain (or impossible) for ops to patch
security bugs.
[UMA] Agreed.
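
A minimal sketch of that load order, assuming a hypothetical "chimera" native library and loader class (names are illustrative, not Chimera's actual API): try the environment first, and only fall back to the copy bundled in the jar:

    // Hypothetical loader: prefer a library the operator can patch on
    // java.library.path; fall back to the copy bundled inside the jar.
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public final class NativeLibLoader {
        public static void load() {
            try {
                System.loadLibrary("chimera");  // 1) environment wins if present
                return;
            } catch (UnsatisfiedLinkError e) {
                // fall through to the bundled copy
            }
            try (InputStream in =
                     NativeLibLoader.class.getResourceAsStream("/native/libchimera.so")) {
                if (in == null) throw new UnsatisfiedLinkError("no bundled libchimera.so");
                Path tmp = Files.createTempFile("libchimera", ".so");
                tmp.toFile().deleteOnExit();
                Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
                System.load(tmp.toAbsolutePath().toString());  // 2) last resort
            } catch (IOException e) {
                throw new UnsatisfiedLinkError("could not extract bundled library: " + e);
            }
        }
    }

With this ordering, ops can drop a patched library onto java.library.path without waiting for a new jar.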


Kai wrote:
The encryption or security piece is surely a good starting point as the
current focus. Considering other things like compression would help
determine how to envision, position, and lay out the new project, whether
on the Hadoop side, in an Apache Commons project, or as a new TLP
containing the candidate modules. Yes, at the beginning, only the
encryption piece.

[UMA] Yeah, right. Focusing on encryption at this stage is the right
thing. But if we consider the encryption piece alone, then going to
Apache Commons is one proposal.

Regards,
Uma


On 2/3/16, 6:46 PM, "Chen, Haifeng" <ha...@intel.com> wrote:

>Thanks Chris.
>
>>> I went through the repository, and now understand the reasoning that
>>>would locate this code in Apache Commons. This isn't proposing to
>>>extract much of the implementation and it takes none of the
>>>integration. It's limited to interfaces to crypto libraries and
>>>streams/configuration.
>Exactly. 
>
>>> Chimera would be a boutique TLP, unless we wanted to draw out more of
>>>the integration and tooling. Is that a goal you're interested in
>>>pursuing? There's a tension between keeping this focused and including
>>>enough functionality to make it viable as an independent component.
>The Chimera goal is to provide useful, common and optimized
>cryptographic functionality. I would prefer that it stay focused on
>this clear scope. Requirements from multiple domains would add more
>challenges and uncertainty about where and how it should go, and thus
>more risk of stalling.
>
>>> If the encryption libraries are the only ones you're interested in
>>>pulling out, then Apache Commons does seem like a better target than a
>>>separate project.
>Yes. As mentioned above, the library will be positioned as a
>cryptographic library.
>
>
>Thanks,
>
>-----Original Message-----
>From: Chris Douglas [mailto:cdouglas@apache.org]
>Sent: Thursday, February 4, 2016 7:26 AM
>To: hdfs-dev@hadoop.apache.org
>Subject: Re: Hadoop encryption module as Apache Chimera incubator project
>
>I went through the repository, and now understand the reasoning that
>would locate this code in Apache Commons. This isn't proposing to extract
>much of the implementation and it takes none of the integration. It's
>limited to interfaces to crypto libraries and streams/configuration. It
>might be a reasonable fit for commons-codec, but that's a pretty sparse
>library and driving the release cadence might be more complicated. It'd
>be worth discussing on their lists (please also CC common-dev@).
>
>Chimera would be a boutique TLP, unless we wanted to draw out more of the
>integration and tooling. Is that a goal you're interested in pursuing?
>There's a tension between keeping this focused and including enough
>functionality to make it viable as an independent component. By way of
>example, Hadoop's common project requires too many dependencies and
>carries too much historical baggage for other projects to rely on.
>I agree with Colin/Steve: we don't want this to grow into another
>guava-like dependency that creates more work in conflicts than it saves
>in implementation...
>
>Would it make sense to also package some of the compression libraries,
>and maybe some of the text processing from MapReduce? Evolving some of
>this code to a common library with few/no dependencies would be generally
>useful. As a subproject, it could have a broader scope that could evolve
>into a viable TLP. If the encryption libraries are the only ones you're
>interested in pulling out, then Apache Commons does seem like a better
>target than a separate project. -C
>
>
>On Wed, Feb 3, 2016 at 1:49 AM, Chris Douglas <cd...@apache.org> wrote:
>> On Wed, Feb 3, 2016 at 12:48 AM, Gangumalla, Uma
>> <um...@intel.com> wrote:
>>>>Standing in the point of shared fundamental piece of code like this,
>>>>I do think Apache Commons might be the best direction which we can
>>>>try as the first effort. In this direction, we still need to work
>>>>with Apache Common community for buying in and accepting the proposal.
>>> Make sense.
>>
>> Makes sense how?
>>
>>> For this we should define the independent release cycles for this
>>> project and it would just place under Hadoop tree if we all conclude
>>> with this option at the end.
>>
>> Yes.
>>
>>> [Chris]
>>>>If Chimera is not successful as an independent project or stalls,
>>>>Hadoop and/or Spark and/or $project will have to reabsorb it as
>>>>maintainers.
>>>>
>>> I am not so strong on this point. If we assume project would be
>>> unsuccessful, it can be unsuccessful(less maintained) even under
>>>hadoop.
>>> But if other projects depending on this piece then they would get
>>> less support. Of course right now we feel this piece of code is very
>>> important and we feel(expect) it can be successful as independent
>>> project, irrespective of whether it as separate project outside hadoop
>>>or inside.
>>> So, I feel this point would not really influence to judge the
>>>discussion.
>>
>> Sure; code can idle anywhere, but that wasn't the point I was after.
>> You propose to extract code from Hadoop, but if Chimera fails then
>> what recourse do we have among the other projects taking a dependency
>> on it? Splitting off another project is feasible, but Chimera should
>> be sustainable before this PMC can divest itself of responsibility for
>> security libraries. That's a pretty low bar.
>>
>> Bundling the library with the jar is helpful; I've used that before.
>> It should prefer (updated) libraries from the environment, if
>> configured. Otherwise it's a pain (or impossible) for ops to patch
>> security bugs. -C
>>
>>>>-----Original Message-----
>>>>From: Colin P. McCabe [mailto:cmccabe@apache.org]
>>>>Sent: Wednesday, February 3, 2016 4:56 AM
>>>>To: hdfs-dev@hadoop.apache.org
>>>>Subject: Re: Hadoop encryption module as Apache Chimera incubator
>>>>project
>>>>
>>>>It's great to see interest in improving this functionality.  I think
>>>>Chimera could be successful as an Apache project.  I don't have a
>>>>strong opinion one way or the other as to whether it belongs as part
>>>>of Hadoop or separate.
>>>>
>>>>I do think there will be some challenges splitting this functionality
>>>>out into a separate jar, because of the way our CLASSPATH works right
>>>>now.
>>>>For example, let's say that Hadoop depends on Chimera 1.2 and Spark
>>>>depends on Chimera 1.1.  Now Spark jobs have two different versions
>>>>fighting it out on the classpath, similar to the situation with Guava
>>>>and other libraries.  Perhaps if Chimera adopts a policy of strong
>>>>backwards compatibility, we can just always use the latest jar, but
>>>>it still seems likely that there will be problems.  There are various
>>>>classpath isolation ideas that could help here, but they are big
>>>>projects in their own right and we don't have a clear timeline for
>>>>them.  If this does end up being a separate jar, we may need to shade
>>>>it to avoid all these issues.
>>>>
>>>>Bundling the JNI glue code in the jar itself is an interesting idea,
>>>>which we have talked about before for libhadoop.so.  It doesn't
>>>>really have anything to do with the question of TLP vs. non-TLP, of
>>>>course.
>>>>We could do that refactoring in Hadoop itself.  The really
>>>>complicated part of bundling JNI code in a jar is that you need to
>>>>create jars for every cross product of (JVM version, openssl version,
>>>>operating system).
>>>>For example, you have the RHEL6 build for openJDK7 using openssl
>>>>1.0.1e.
>>>>If you change any one thing-- say, change openJDK7 to Oracle JDK8,
>>>>then you might need to rebuild.  And certainly using Ubuntu would be
>>>>a rebuild.  And so forth.  This kind of clashes with Maven's
>>>>philosophy of pulling prebuilt jars from the internet.
>>>>
>>>>Kai Zheng's question about whether we would bundle openSSL's
>>>>libraries is a good one.  Given the high rate of new vulnerabilities
>>>>discovered in that library, it seems like bundling would require
>>>>Hadoop users and vendors to update very frequently, much more
>>>>frequently than Hadoop is traditionally updated.  So probably we would
>>>>not choose to bundle openssl.
>>>>
>>>>best,
>>>>Colin
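
An aside on the classpath point above: if shading is the remedy, Hadoop
would relocate its bundled copy to a private package, say
org.apache.hadoop.shaded.chimera (a hypothetical relocation), so it
cannot collide with Spark's copy. When two unshaded versions do fight it
out, it is at least easy to check which one won; a small diagnostic,
using a hypothetical class name:

    public final class WhichJar {
        public static void main(String[] args) throws ClassNotFoundException {
            // Substitute any class suspected of being on the classpath twice.
            Class<?> c = Class.forName("org.apache.chimera.CryptoCipher");
            System.out.println(
                c.getProtectionDomain().getCodeSource().getLocation());
        }
    }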
>>>>
>>>>On Tue, Feb 2, 2016 at 12:29 AM, Chris Douglas <cd...@apache.org>
>>>>wrote:
>>>>> As a subproject of Hadoop, Chimera could maintain its own cadence.
>>>>> There's also no reason why it should maintain dependencies on other
>>>>> parts of Hadoop, if those are separable. How is this solution
>>>>> inadequate?
>>>>>
>>>>> If Chimera is not successful as an independent project or stalls,
>>>>> Hadoop and/or Spark and/or $project will have to reabsorb it as
>>>>> maintainers. Projects have high mortality in early life, and a
>>>>> fight over inheritance/maintenance is something we'd like to avoid.
>>>>> If, on the other hand, it develops enough of a community where it
>>>>> is obviously viable, then we can (and should) break it out as a TLP
>>>>> (as we have before). If other Apache projects take a dependency on
>>>>> Chimera, we're open to adding them to security@hadoop.
>>>>>
>>>>> Unlike Yetus, which was largely rewritten right before it was made
>>>>> into a TLP, security in Hadoop has a complicated pedigree. If
>>>>> Chimera eventually becomes a TLP, it seems fair to include those
>>>>> who work on it while it is a subproject. Declared upfront, that
>>>>> criterion is fairer than any post hoc justification, and will lead
>>>>> to a more accurate account of its community than a subset of the
>>>>> Hadoop PMC/committers that volunteer. -C
>>>>>
>>>>>
>>>>> On Mon, Feb 1, 2016 at 9:29 PM, Chen, Haifeng
>>>>><ha...@intel.com>
>>>>>wrote:
>>>>>> Thanks to all folks providing feedbacks and participating the
>>>>>>discussions.
>>>>>>
>>>>>> @Owen, do you still have any concerns on going forward in the
>>>>>>direction of Apache Commons (or other options, TLP)?
>>>>>>
>>>>>> Thanks,
>>>>>> Haifeng
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Chen, Haifeng [mailto:haifeng.chen@intel.com]
>>>>>> Sent: Saturday, January 30, 2016 10:52 AM
>>>>>> To: hdfs-dev@hadoop.apache.org
>>>>>> Subject: RE: Hadoop encryption module as Apache Chimera incubator
>>>>>> project
>>>>>>
>>>>>>>> I believe encryption is becoming a core part of Hadoop. I think
>>>>>>>>that moving core components out of Hadoop is bad from a project
>>>>>>>>management perspective.
>>>>>>
>>>>>>> Although it's certainly true that encryption capabilities (in
>>>>>>>HDFS, YARN, etc.) are becoming core to Hadoop, I don't think that
>>>>>>>should really influence whether or not the non-Hadoop-specific
>>>>>>>encryption routines should be part of the Hadoop code base, or
>>>>>>>part of the code base of another project that Hadoop depends on.
>>>>>>>If Chimera had existed as a library hosted at ASF when HDFS
>>>>>>>encryption was first developed, HDFS probably would have just
>>>>>>>added that as a dependency and been done with it. I don't think we
>>>>>>>would've copy/pasted the code for Chimera into the Hadoop code base.
>>>>>>
>>>>>> Agree with ATM. I want to also make an additional clarification. I
>>>>>>agree that the encryption capabilities are becoming core to Hadoop.
>>>>>>While this effort is to put common and shared encryption routines
>>>>>>such as crypto stream implementations into a scope which can be
>>>>>>widely shared across the Apache ecosystem. This doesn't move Hadoop
>>>>>>encryption out of Hadoop (that is not possible).
>>>>>>
>>>>>> Agree if we make it a separate and independent releases project in
>>>>>>Hadoop takes a step further than the existing approach and solve
>>>>>>some issues (such as libhadoop.so problem). Frankly speaking, I
>>>>>>think it is not the best option we can try. I also expect that an
>>>>>>independent release project within Hadoop core will also complicate
>>>>>>the existing release ideology of Hadoop release.
>>>>>>
>>>>>> Thanks,
>>>>>> Haifeng
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Aaron T. Myers [mailto:atm@cloudera.com]
>>>>>> Sent: Friday, January 29, 2016 9:51 AM
>>>>>> To: hdfs-dev@hadoop.apache.org
>>>>>> Subject: Re: Hadoop encryption module as Apache Chimera incubator
>>>>>> project
>>>>>>
>>>>>> On Wed, Jan 27, 2016 at 11:31 AM, Owen O'Malley
>>>>>><om...@apache.org>
>>>>>>wrote:
>>>>>>
>>>>>>> I believe encryption is becoming a core part of Hadoop. I think
>>>>>>>that  moving core components out of Hadoop is bad from a project
>>>>>>>management perspective.
>>>>>>>
>>>>>>
>>>>>> Although it's certainly true that encryption capabilities (in
>>>>>>HDFS,  YARN,
>>>>>> etc.) are becoming core to Hadoop, I don't think that should
>>>>>>really influence whether or not the non-Hadoop-specific encryption
>>>>>>routines should be part of the Hadoop code base, or part of the
>>>>>>code base of another project that Hadoop depends on. If Chimera had
>>>>>>existed as a library hosted at ASF when HDFS encryption was first
>>>>>>developed, HDFS probably would have just added that as a dependency
>>>>>>and been done with it. I don't think we would've copy/pasted the
>>>>>>code for Chimera into the Hadoop code base.
>>>>>>
>>>>>>
>>>>>>> To put it another way, a bug in the encryption routines will
>>>>>>> likely become a security problem that security@hadoop needs to
>>>>>>>hear about.
>>>>>>>
>>>>>> I don't think
>>>>>>> adding a separate project in the middle of that communication
>>>>>>>chain  is a good idea. The same applies to data corruption
>>>>>>>problems, and so on...
>>>>>>>
>>>>>>
>>>>>> Isn't the same true of all the libraries that Hadoop currently
>>>>>>depends upon? If the commons-httpclient library (or commons-codec,
>>>>>>or commons-io, or guava, or...) has a security vulnerability, we
>>>>>>need to know about it so that we can update our dependency to a
>>>>>>fixed version.
>>>>>>This case doesn't seem materially different than that.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> > It may be good to keep at generalized place(As in the
>>>>>>> > discussion, we thought that place could be Apache Commons).
>>>>>>>
>>>>>>>
>>>>>>> Apache Commons is a collection of *Java* projects, so Chimera as
>>>>>>> a JNI-based library isn't a natural fit.
>>>>>>>
>>>>>>
>>>>>> Could very well be that Apache Commons's charter would preclude
>>>>>>Chimera.
>>>>>> You probably know better than I do about that.
>>>>>>
>>>>>>
>>>>>>> Furthermore, Apache Commons doesn't have its own security list so
>>>>>>> problems will go to the generic security@apache.org.
>>>>>>>
>>>>>>
>>>>>> That seems easy enough to remedy, if they wanted to, and besides I'm
>>>>>>not sure why that would influence this discussion. In my experience
>>>>>>projects that don't have a separate security@project.a.o mailing list
>>>>>>tend to just handle security issues on their private@project.a.o
>>>>>>mailing list, which seems fine to me.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Why do you think that Apache Commons is a better home than Hadoop?
>>>>>>>
>>>>>>
>>>>>> I'm certainly not at all wedded to Apache Commons, that just seemed
>>>>>>like a natural place to put it to me. Could be that a brand new TLP
>>>>>>might make more sense.
>>>>>>
>>>>>> I *do* think that if other non-Hadoop projects want to make use of
>>>>>>Chimera, which as I understand it is the goal which started this
>>>>>>thread, then Chimera should exist outside of Hadoop so that:
>>>>>>
>>>>>> a) Projects that have nothing to do with Hadoop can just depend
>>>>>>directly on Chimera, which has nothing Hadoop-specific in there.
>>>>>>
>>>>>> b) The Hadoop project doesn't have to export/maintain/concern itself
>>>>>>with yet another publicly-consumed interface.
>>>>>>
>>>>>> c) Chimera can have its own (presumably much faster) release cadence
>>>>>>completely separate from Hadoop.
>>>>>>
>>>>>> --
>>>>>> Aaron T. Myers
>>>>>> Software Engineer, Cloudera
>>>


RE: Hadoop encryption module as Apache Chimera incubator project

Posted by "Chen, Haifeng" <ha...@intel.com>.
Thanks Chris.

>> I went through the repository, and now understand the reasoning that would locate this code in Apache Commons. This isn't proposing to extract much of the implementation and it takes none of the integration. It's limited to interfaces to crypto libraries and streams/configuration.
Exactly. 

>> Chimera would be a boutique TLP, unless we wanted to draw out more of the integration and tooling. Is that a goal you're interested in pursuing? There's a tension between keeping this focused and including enough functionality to make it viable as an independent component.
The goal of Chimera was to provide useful, common, and optimized cryptographic functionality. I would prefer that it stay focused on this clear scope. Requirements from multiple domains would add challenges and uncertainty about where and how the project should go, and thus more risk of stalling.

>> If the encryption libraries are the only ones you're interested in pulling out, then Apache Commons does seem like a better target than a separate project.
Yes. As mentioned above, the library will be positioned as a cryptographic library.
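
To make that scope concrete: the heart of it is cipher-stream wrapping. A
minimal JCE-based sketch of AES/CTR decryption over a stream (illustrative
only, not Chimera's actual API; an openssl-backed native cipher would
replace the JCE one for performance):

    import java.io.InputStream;
    import java.security.GeneralSecurityException;
    import javax.crypto.Cipher;
    import javax.crypto.CipherInputStream;
    import javax.crypto.spec.IvParameterSpec;
    import javax.crypto.spec.SecretKeySpec;

    public final class CtrStreams {
        // Wraps a ciphertext stream for AES/CTR decryption; key and iv
        // are 16 bytes each for AES-128.
        public static InputStream decrypting(InputStream ciphertext,
                byte[] key, byte[] iv) throws GeneralSecurityException {
            Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
                    new IvParameterSpec(iv));
            return new CipherInputStream(ciphertext, cipher);
        }
    }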


Thanks,

-----Original Message-----
From: Chris Douglas [mailto:cdouglas@apache.org] 
Sent: Thursday, February 4, 2016 7:26 AM
To: hdfs-dev@hadoop.apache.org
Subject: Re: Hadoop encryption module as Apache Chimera incubator project

I went through the repository, and now understand the reasoning that would locate this code in Apache Commons. This isn't proposing to extract much of the implementation and it takes none of the integration. It's limited to interfaces to crypto libraries and streams/configuration. It might be a reasonable fit for commons-codec, but that's a pretty sparse library and driving the release cadence might be more complicated. It'd be worth discussing on their lists (please also CC common-dev@).

Chimera would be a boutique TLP, unless we wanted to draw out more of the integration and tooling. Is that a goal you're interested in pursuing? There's a tension between keeping this focused and including enough functionality to make it viable as an independent component. By way of example, Hadoop's common project requires too many dependencies and carries too much historical baggage for other projects to rely on.
I agree with Colin/Steve: we don't want this to grow into another guava-like dependency that creates more work in conflicts than it saves in implementation...

Would it make sense to also package some of the compression libraries, and maybe some of the text processing from MapReduce? Evolving some of this code to a common library with few/no dependencies would be generally useful. As a subproject, it could have a broader scope that could evolve into a viable TLP. If the encryption libraries are the only ones you're interested in pulling out, then Apache Commons does seem like a better target than a separate project. -C


On Wed, Feb 3, 2016 at 1:49 AM, Chris Douglas <cd...@apache.org> wrote:
> On Wed, Feb 3, 2016 at 12:48 AM, Gangumalla, Uma 
> <um...@intel.com> wrote:
>>>Standing in the point of shared fundamental piece of code like this, 
>>>I do think Apache Commons might be the best direction which we can 
>>>try as the first effort. In this direction, we still need to work 
>>>with Apache Common community for buying in and accepting the proposal.
>> Make sense.
>
> Makes sense how?
>
>> For this we should define the independent release cycles for this 
>> project and it would just place under Hadoop tree if we all conclude 
>> with this option at the end.
>
> Yes.
>
>> [Chris]
>>>If Chimera is not successful as an independent project or stalls, 
>>>Hadoop and/or Spark and/or $project will have to reabsorb it as 
>>>maintainers.
>>>
>> I am not so strong on this point. If we assume project would be 
>> unsuccessful, it can be unsuccessful(less maintained) even under hadoop.
>> But if other projects depending on this piece then they would get 
>> less support. Of course right now we feel this piece of code is very 
>> important and we feel(expect) it can be successful as independent 
>> project, irrespective of whether it as separate project outside hadoop or inside.
>> So, I feel this point would not really influence to judge the discussion.
>
> Sure; code can idle anywhere, but that wasn't the point I was after.
> You propose to extract code from Hadoop, but if Chimera fails then 
> what recourse do we have among the other projects taking a dependency 
> on it? Splitting off another project is feasible, but Chimera should 
> be sustainable before this PMC can divest itself of responsibility for 
> security libraries. That's a pretty low bar.
>
> Bundling the library with the jar is helpful; I've used that before.
> It should prefer (updated) libraries from the environment, if 
> configured. Otherwise it's a pain (or impossible) for ops to patch 
> security bugs. -C
>
>>>-----Original Message-----
>>>From: Colin P. McCabe [mailto:cmccabe@apache.org]
>>>Sent: Wednesday, February 3, 2016 4:56 AM
>>>To: hdfs-dev@hadoop.apache.org
>>>Subject: Re: Hadoop encryption module as Apache Chimera incubator 
>>>project
>>>
>>>It's great to see interest in improving this functionality.  I think 
>>>Chimera could be successful as an Apache project.  I don't have a 
>>>strong opinion one way or the other as to whether it belongs as part 
>>>of Hadoop or separate.
>>>
>>>I do think there will be some challenges splitting this functionality 
>>>out into a separate jar, because of the way our CLASSPATH works right now.
>>>For example, let's say that Hadoop depends on Chimera 1.2 and Spark 
>>>depends on Chimera 1.1.  Now Spark jobs have two different versions 
>>>fighting it out on the classpath, similar to the situation with Guava 
>>>and other libraries.  Perhaps if Chimera adopts a policy of strong 
>>>backwards compatibility, we can just always use the latest jar, but 
>>>it still seems likely that there will be problems.  There are various 
>>>classpath isolation ideas that could help here, but they are big 
>>>projects in their own right and we don't have a clear timeline for 
>>>them.  If this does end up being a separate jar, we may need to shade 
>>>it to avoid all these issues.
>>>
>>>Bundling the JNI glue code in the jar itself is an interesting idea, 
>>>which we have talked about before for libhadoop.so.  It doesn't 
>>>really have anything to do with the question of TLP vs. non-TLP, of course.
>>>We could do that refactoring in Hadoop itself.  The really 
>>>complicated part of bundling JNI code in a jar is that you need to 
>>>create jars for every cross product of (JVM version, openssl version, operating system).
>>>For example, you have the RHEL6 build for openJDK7 using openssl 1.0.1e.
>>>If you change any one thing-- say, change openJDK7 to Oracle JDK8, 
>>>then you might need to rebuild.  And certainly using Ubuntu would be 
>>>a rebuild.  And so forth.  This kind of clashes with Maven's 
>>>philosophy of pulling prebuilt jars from the internet.
>>>
>>>Kai Zheng's question about whether we would bundle openSSL's 
>>>libraries is a good one.  Given the high rate of new vulnerabilities 
>>>discovered in that library, it seems like bundling would require 
>>>Hadoop users and vendors to update very frequently, much more 
>>>frequently than Hadoop is traditionally updated.  So probably we would not choose to bundle openssl.
>>>
>>>best,
>>>Colin
>>>
>>>On Tue, Feb 2, 2016 at 12:29 AM, Chris Douglas <cd...@apache.org>
>>>wrote:
>>>> As a subproject of Hadoop, Chimera could maintain its own cadence.
>>>> There's also no reason why it should maintain dependencies on other 
>>>> parts of Hadoop, if those are separable. How is this solution 
>>>> inadequate?
>>>>
>>>> If Chimera is not successful as an independent project or stalls, 
>>>> Hadoop and/or Spark and/or $project will have to reabsorb it as 
>>>> maintainers. Projects have high mortality in early life, and a 
>>>> fight over inheritance/maintenance is something we'd like to avoid. 
>>>> If, on the other hand, it develops enough of a community where it 
>>>> is obviously viable, then we can (and should) break it out as a TLP 
>>>> (as we have before). If other Apache projects take a dependency on 
>>>> Chimera, we're open to adding them to security@hadoop.
>>>>
>>>> Unlike Yetus, which was largely rewritten right before it was made 
>>>> into a TLP, security in Hadoop has a complicated pedigree. If 
>>>> Chimera eventually becomes a TLP, it seems fair to include those 
>>>> who work on it while it is a subproject. Declared upfront, that 
>>>> criterion is fairer than any post hoc justification, and will lead 
>>>> to a more accurate account of its community than a subset of the 
>>>> Hadoop PMC/committers that volunteer. -C
>>>>
>>>>
>>>> On Mon, Feb 1, 2016 at 9:29 PM, Chen, Haifeng 
>>>><ha...@intel.com>
>>>>wrote:
>>>>> Thanks to all folks providing feedbacks and participating the 
>>>>>discussions.
>>>>>
>>>>> @Owen, do you still have any concerns on going forward in the 
>>>>>direction of Apache Commons (or other options, TLP)?
>>>>>
>>>>> Thanks,
>>>>> Haifeng
>>>>>
>>>>> -----Original Message-----
>>>>> From: Chen, Haifeng [mailto:haifeng.chen@intel.com]
>>>>> Sent: Saturday, January 30, 2016 10:52 AM
>>>>> To: hdfs-dev@hadoop.apache.org
>>>>> Subject: RE: Hadoop encryption module as Apache Chimera incubator 
>>>>> project
>>>>>
>>>>>>> I believe encryption is becoming a core part of Hadoop. I think  
>>>>>>>that moving core components out of Hadoop is bad from a project 
>>>>>>>management perspective.
>>>>>
>>>>>> Although it's certainly true that encryption capabilities (in 
>>>>>>HDFS, YARN, etc.) are becoming core to Hadoop, I don't think that 
>>>>>>should really influence whether or not the non-Hadoop-specific 
>>>>>>encryption routines should be part of the Hadoop code base, or 
>>>>>>part of the code base of another project that Hadoop depends on. 
>>>>>>If Chimera had existed as a library hosted at ASF when HDFS 
>>>>>>encryption was first developed, HDFS probably would have just 
>>>>>>added that as a dependency and been done with it. I don't think we 
>>>>>>would've copy/pasted the code for Chimera into the Hadoop code base.
>>>>>
>>>>> Agree with ATM. I want to also make an additional clarification. I 
>>>>>agree that the encryption capabilities are becoming core to Hadoop.
>>>>>While this effort is to put common and shared encryption routines 
>>>>>such as crypto stream implementations into a scope which can be 
>>>>>widely shared across the Apache ecosystem. This doesn't move Hadoop 
>>>>>encryption out of Hadoop (that is not possible).
>>>>>
>>>>> Agree if we make it a separate and independent releases project in 
>>>>>Hadoop takes a step further than the existing approach and solve 
>>>>>some issues (such as libhadoop.so problem). Frankly speaking, I 
>>>>>think it is not the best option we can try. I also expect that an 
>>>>>independent release project within Hadoop core will also complicate 
>>>>>the existing release ideology of Hadoop release.
>>>>>
>>>>> Thanks,
>>>>> Haifeng
>>>>>
>>>>> -----Original Message-----
>>>>> From: Aaron T. Myers [mailto:atm@cloudera.com]
>>>>> Sent: Friday, January 29, 2016 9:51 AM
>>>>> To: hdfs-dev@hadoop.apache.org
>>>>> Subject: Re: Hadoop encryption module as Apache Chimera incubator 
>>>>> project
>>>>>
>>>>> On Wed, Jan 27, 2016 at 11:31 AM, Owen O'Malley 
>>>>><om...@apache.org>
>>>>>wrote:
>>>>>
>>>>>> I believe encryption is becoming a core part of Hadoop. I think 
>>>>>>that  moving core components out of Hadoop is bad from a project 
>>>>>>management perspective.
>>>>>>
>>>>>
>>>>> Although it's certainly true that encryption capabilities (in 
>>>>>HDFS,  YARN,
>>>>> etc.) are becoming core to Hadoop, I don't think that should 
>>>>>really influence whether or not the non-Hadoop-specific encryption 
>>>>>routines should be part of the Hadoop code base, or part of the 
>>>>>code base of another project that Hadoop depends on. If Chimera had 
>>>>>existed as a library hosted at ASF when HDFS encryption was first 
>>>>>developed, HDFS probably would have just added that as a dependency 
>>>>>and been done with it. I don't think we would've copy/pasted the 
>>>>>code for Chimera into the Hadoop code base.
>>>>>
>>>>>
>>>>>> To put it another way, a bug in the encryption routines will 
>>>>>> likely become a security problem that security@hadoop needs to hear about.
>>>>>>
>>>>> I don't think
>>>>>> adding a separate project in the middle of that communication 
>>>>>>chain  is a good idea. The same applies to data corruption 
>>>>>>problems, and so on...
>>>>>>
>>>>>
>>>>> Isn't the same true of all the libraries that Hadoop currently 
>>>>>depends upon? If the commons-httpclient library (or commons-codec, 
>>>>>or commons-io, or guava, or...) has a security vulnerability, we 
>>>>>need to know about it so that we can update our dependency to a fixed version.
>>>>>This case doesn't seem materially different than that.
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> > It may be good to keep at generalized place(As in the 
>>>>>> > discussion, we thought that place could be Apache Commons).
>>>>>>
>>>>>>
>>>>>> Apache Commons is a collection of *Java* projects, so Chimera as 
>>>>>> a JNI-based library isn't a natural fit.
>>>>>>
>>>>>
>>>>> Could very well be that Apache Commons's charter would preclude 
>>>>>Chimera.
>>>>> You probably know better than I do about that.
>>>>>
>>>>>
>>>>>> Furthermore, Apache Commons doesn't have its own security list so 
>>>>>> problems will go to the generic security@apache.org.
>>>>>>
>>>>>
>>>>> That seems easy enough to remedy, if they wanted to, and besides I'm
>>>>>not sure why that would influence this discussion. In my experience
>>>>>projects that don't have a separate security@project.a.o mailing list
>>>>>tend to just handle security issues on their private@project.a.o
>>>>>mailing list, which seems fine to me.
>>>>>
>>>>>
>>>>>>
>>>>>> Why do you think that Apache Commons is a better home than Hadoop?
>>>>>>
>>>>>
>>>>> I'm certainly not at all wedded to Apache Commons, that just seemed
>>>>>like a natural place to put it to me. Could be that a brand new TLP
>>>>>might make more sense.
>>>>>
>>>>> I *do* think that if other non-Hadoop projects want to make use of
>>>>>Chimera, which as I understand it is the goal which started this
>>>>>thread, then Chimera should exist outside of Hadoop so that:
>>>>>
>>>>> a) Projects that have nothing to do with Hadoop can just depend
>>>>>directly on Chimera, which has nothing Hadoop-specific in there.
>>>>>
>>>>> b) The Hadoop project doesn't have to export/maintain/concern itself
>>>>>with yet another publicly-consumed interface.
>>>>>
>>>>> c) Chimera can have its own (presumably much faster) release cadence
>>>>>completely separate from Hadoop.
>>>>>
>>>>> --
>>>>> Aaron T. Myers
>>>>> Software Engineer, Cloudera
>>

Re: Hadoop encryption module as Apache Chimera incubator project

Posted by Chris Douglas <cd...@apache.org>.
I went through the repository, and now understand the reasoning that
would locate this code in Apache Commons. This isn't proposing to
extract much of the implementation and it takes none of the
integration. It's limited to interfaces to crypto libraries and
streams/configuration. It might be a reasonable fit for commons-codec,
but that's a pretty sparse library and driving the release cadence
might be more complicated. It'd be worth discussing on their lists
(please also CC common-dev@).

Chimera would be a boutique TLP, unless we wanted to draw out more of
the integration and tooling. Is that a goal you're interested in
pursuing? There's a tension between keeping this focused and including
enough functionality to make it viable as an independent component. By
way of example, Hadoop's common project requires too many dependencies
and carries too much historical baggage for other projects to rely on.
I agree with Colin/Steve: we don't want this to grow into another
guava-like dependency that creates more work in conflicts than it
saves in implementation...

Would it make sense to also package some of the compression libraries,
and maybe some of the text processing from MapReduce? Evolving some of
this code to a common library with few/no dependencies would be
generally useful. As a subproject, it could have a broader scope that
could evolve into a viable TLP. If the encryption libraries are the
only ones you're interested in pulling out, then Apache Commons does
seem like a better target than a separate project. -C


On Wed, Feb 3, 2016 at 1:49 AM, Chris Douglas <cd...@apache.org> wrote:
> On Wed, Feb 3, 2016 at 12:48 AM, Gangumalla, Uma
> <um...@intel.com> wrote:
>>>Standing in the point of shared fundamental piece of code like this, I do
>>>think Apache Commons might be the best direction which we can try as the
>>>first effort. In this direction, we still need to work with Apache Common
>>>community for buying in and accepting the proposal.
>> Make sense.
>
> Makes sense how?
>
>> For this we should define the independent release cycles for this project
>> and it would just place under Hadoop tree if we all conclude with this
>> option at the end.
>
> Yes.
>
>> [Chris]
>>>If Chimera is not successful as an independent project or stalls,
>>>Hadoop and/or Spark and/or $project will have to reabsorb it as
>>>maintainers.
>>>
>> I am not so strong on this point. If we assume project would be
>> unsuccessful, it can be unsuccessful(less maintained) even under hadoop.
>> But if other projects depending on this piece then they would get less
>> support. Of course right now we feel this piece of code is very important
>> and we feel(expect) it can be successful as independent project,
>> irrespective of whether it as separate project outside hadoop or inside.
>> So, I feel this point would not really influence to judge the discussion.
>
> Sure; code can idle anywhere, but that wasn't the point I was after.
> You propose to extract code from Hadoop, but if Chimera fails then
> what recourse do we have among the other projects taking a dependency
> on it? Splitting off another project is feasible, but Chimera should
> be sustainable before this PMC can divest itself of responsibility for
> security libraries. That's a pretty low bar.
>
> Bundling the library with the jar is helpful; I've used that before.
> It should prefer (updated) libraries from the environment, if
> configured. Otherwise it's a pain (or impossible) for ops to patch
> security bugs. -C
>
>>>-----Original Message-----
>>>From: Colin P. McCabe [mailto:cmccabe@apache.org]
>>>Sent: Wednesday, February 3, 2016 4:56 AM
>>>To: hdfs-dev@hadoop.apache.org
>>>Subject: Re: Hadoop encryption module as Apache Chimera incubator project
>>>
>>>It's great to see interest in improving this functionality.  I think
>>>Chimera could be successful as an Apache project.  I don't have a strong
>>>opinion one way or the other as to whether it belongs as part of Hadoop
>>>or separate.
>>>
>>>I do think there will be some challenges splitting this functionality out
>>>into a separate jar, because of the way our CLASSPATH works right now.
>>>For example, let's say that Hadoop depends on Chimera 1.2 and Spark
>>>depends on Chimera 1.1.  Now Spark jobs have two different versions
>>>fighting it out on the classpath, similar to the situation with Guava and
>>>other libraries.  Perhaps if Chimera adopts a policy of strong backwards
>>>compatibility, we can just always use the latest jar, but it still seems
>>>likely that there will be problems.  There are various classpath
>>>isolation ideas that could help here, but they are big projects in their
>>>own right and we don't have a clear timeline for them.  If this does end
>>>up being a separate jar, we may need to shade it to avoid all these
>>>issues.
>>>
>>>Bundling the JNI glue code in the jar itself is an interesting idea,
>>>which we have talked about before for libhadoop.so.  It doesn't really
>>>have anything to do with the question of TLP vs. non-TLP, of course.
>>>We could do that refactoring in Hadoop itself.  The really complicated
>>>part of bundling JNI code in a jar is that you need to create jars for
>>>every cross product of (JVM version, openssl version, operating system).
>>>For example, you have the RHEL6 build for openJDK7 using openssl 1.0.1e.
>>>If you change any one thing-- say, change openJDK7 to Oracle JDK8, then
>>>you might need to rebuild.  And certainly using Ubuntu would be a
>>>rebuild.  And so forth.  This kind of clashes with Maven's philosophy of
>>>pulling prebuilt jars from the internet.
>>>
>>>Kai Zheng's question about whether we would bundle openSSL's libraries is
>>>a good one.  Given the high rate of new vulnerabilities discovered in
>>>that library, it seems like bundling would require Hadoop users and
>>>vendors to update very frequently, much more frequently than Hadoop is
>>>traditionally updated.  So probably we would not choose to bundle openssl.
>>>
>>>best,
>>>Colin
>>>
>>>On Tue, Feb 2, 2016 at 12:29 AM, Chris Douglas <cd...@apache.org>
>>>wrote:
>>>> As a subproject of Hadoop, Chimera could maintain its own cadence.
>>>> There's also no reason why it should maintain dependencies on other
>>>> parts of Hadoop, if those are separable. How is this solution
>>>> inadequate?
>>>>
>>>> If Chimera is not successful as an independent project or stalls,
>>>> Hadoop and/or Spark and/or $project will have to reabsorb it as
>>>> maintainers. Projects have high mortality in early life, and a fight
>>>> over inheritance/maintenance is something we'd like to avoid. If, on
>>>> the other hand, it develops enough of a community where it is
>>>> obviously viable, then we can (and should) break it out as a TLP (as
>>>> we have before). If other Apache projects take a dependency on
>>>> Chimera, we're open to adding them to security@hadoop.
>>>>
>>>> Unlike Yetus, which was largely rewritten right before it was made
>>>> into a TLP, security in Hadoop has a complicated pedigree. If Chimera
>>>> eventually becomes a TLP, it seems fair to include those who work on
>>>> it while it is a subproject. Declared upfront, that criterion is
>>>> fairer than any post hoc justification, and will lead to a more
>>>> accurate account of its community than a subset of the Hadoop
>>>> PMC/committers that volunteer. -C
>>>>
>>>>
>>>> On Mon, Feb 1, 2016 at 9:29 PM, Chen, Haifeng <ha...@intel.com>
>>>>wrote:
>>>>> Thanks to all folks providing feedbacks and participating the
>>>>>discussions.

Re: Hadoop encryption module as Apache Chimera incubator project

Posted by Chris Douglas <cd...@apache.org>.
On Wed, Feb 3, 2016 at 12:48 AM, Gangumalla, Uma
<um...@intel.com> wrote:
>>From the standpoint of a shared, fundamental piece of code like this, I do
>>think Apache Commons might be the best direction to try as a first effort.
>>In that direction, we still need to work with the Apache Commons community
>>on buying into and accepting the proposal.
> Make sense.

Makes sense how?

> For this we should define independent release cycles for this project,
> and it would just be placed under the Hadoop tree if we all conclude with
> this option in the end.

Yes.

> [Chris]
>>If Chimera is not successful as an independent project or stalls,
>>Hadoop and/or Spark and/or $project will have to reabsorb it as
>>maintainers.
>>
> I am not so sure on this point. If we assume the project would be
> unsuccessful, it can be unsuccessful (less maintained) even under Hadoop.
> And if other projects depend on this piece, they would get less support.
> Of course, right now we feel this piece of code is very important, and we
> expect it can be successful as an independent project, irrespective of
> whether it sits outside Hadoop or inside. So I feel this point should not
> really decide the discussion.

Sure; code can idle anywhere, but that wasn't the point I was after.
You propose to extract code from Hadoop, but if Chimera fails then
what recourse do we have among the other projects taking a dependency
on it? Splitting off another project is feasible, but Chimera should
be sustainable before this PMC can divest itself of responsibility for
security libraries. That's a pretty low bar.

Bundling the library with the jar is helpful; I've used that before.
It should prefer (updated) libraries from the environment, if
configured. Otherwise it's a pain (or impossible) for ops to patch
security bugs.
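
Something like this loader shape, say (a sketch only; the resource layout
and the names below are hypothetical, not an existing Chimera API):

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    final class NativeCodeLoader {
      static void load() {
        try {
          // Prefer the (ops-patchable) copy on java.library.path.
          System.loadLibrary("chimera");
          return;
        } catch (UnsatisfiedLinkError e) {
          // Fall through to the copy bundled inside the jar.
        }
        String resource = "/native/" + System.getProperty("os.name").toLowerCase()
            + "-" + System.getProperty("os.arch") + "/libchimera.so";
        try (InputStream in = NativeCodeLoader.class.getResourceAsStream(resource)) {
          if (in == null) {
            throw new UnsatisfiedLinkError("no bundled native library: " + resource);
          }
          Path tmp = Files.createTempFile("libchimera", ".so");
          tmp.toFile().deleteOnExit();
          Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
          System.load(tmp.toAbsolutePath().toString());
        } catch (IOException ioe) {
          throw new UnsatisfiedLinkError("failed to extract " + resource + ": " + ioe);
        }
      }
    }

-C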


Re: Hadoop encryption module as Apache Chimera incubator project

Posted by "Gangumalla, Uma" <um...@intel.com>.
Thanks, guys, for the opinions. Below are my responses to some of the
questions and thoughts.

On 2/3/16, 12:07 AM, "Chen, Haifeng" <ha...@intel.com> wrote:

>Thanks Chris and Colin for your opinions.
>
>>> [Chris] If Chimera is not successful as an independent project or
>>>stalls, Hadoop and/or Spark and/or $project will have to reabsorb it as
>>>maintainers. 
>I understand the concern. One point to consider: Chimera is dedicated to a
>specific domain, optimized cryptography, much as Apache Commons Logging is
>dedicated to logging. It is not as fast-moving as other Apache projects.
>Of course, as to whether it should be part of Hadoop or separate, both
>ways have uncertainties. I am not strongly opposed to either one.
>
>From the standpoint of a shared, fundamental piece of code like this, I do
>think Apache Commons might be the best direction to try as a first effort.
>In that direction, we still need to work with the Apache Commons community
>on buying into and accepting the proposal.
Make sense.
>
>On the other hand, for the direction of a sub-project within Hadoop, I am
>uncertain about where the sub-project would be located and how it would
>manage its own cadence within Hadoop. Hadoop has modules like Hadoop
>Common, Hadoop HDFS, Hadoop YARN, and Hadoop MapReduce, and these modules
>share the same release cycle and are released together. Am I right?
For this we should define independent release cycles for this project,
and it would just be placed under the Hadoop tree if we all conclude with
this option in the end.
>
>>> [Colin] I do think there will be some challenges splitting this
>>>functionality out into a separate jar, because of the way our CLASSPATH
>>>works right now.
>Yes, these challenges are common for shared libraries in Java. Just as you
>mentioned, maintaining API compatibility and using classpath isolation are
>two practical approaches.
>
>>> [Colin] The really complicated part of bundling JNI code in a jar is
>>>that you need to create jars for every cross product.
>Building does get complex across platforms. But it might not be as complex
>as described where the native code is concerned. First, building with JDK7
>or JDK8 is a common consideration for all Java libraries, I think; it is
>not specific to building the JNI code (correct me if I am wrong). Second,
>it is still possible to isolate the native build so that you don't have to
>build different versions for Ubuntu and RHEL. Third, if we link
>dynamically against openssl and the openssl API used by the library does
>not change across versions, we don't have to build different versions for
>it.
>
>So the build matrix might be Linux32, Linux64, Windows32, Windows64,
>Mac...
>
>>>[Colin] So probably we would not choose to bundle openssl.
>Agree. Bundling openssl is not a good idea, considering the need to
>upgrade for vulnerabilities.
Agreed too.

[Chris]
>If Chimera is not successful as an independent project or stalls,
>Hadoop and/or Spark and/or $project will have to reabsorb it as
>maintainers.
>
I am not so sure on this point. If we assume the project would be
unsuccessful, it can be unsuccessful (less maintained) even under Hadoop.
And if other projects depend on this piece, they would get less support.
Of course, right now we feel this piece of code is very important, and we
expect it can be successful as an independent project, irrespective of
whether it sits outside Hadoop or inside. So I feel this point should not
really decide the discussion.


RE: Hadoop encryption module as Apache Chimera incubator project

Posted by "Chen, Haifeng" <ha...@intel.com>.
Thanks, Allen, for your questions.

>> Why is this discussion taking place on hdfs-dev when other parts of Hadoop use encryption code?  That alone makes me uncomfortable moving this code outside of the Hadoop umbrella.  
Let me explain it this way. Hadoop Common implements the crypto code, and HDFS (encryption at rest and encrypted data transfer) is the major (maybe the only) consumer of this piece of code for now. In fact, the crypto code in Hadoop Common was contributed as part of the HDFS encryption-at-rest effort. I think this is why the original discussion happened on hdfs-dev.

Regards,
Haifeng


Re: Hadoop encryption module as Apache Chimera incubator project

Posted by Allen Wittenauer <aw...@altiscale.com>.

	Why is this discussion taking place on hdfs-dev when other parts of Hadoop use encryption code?  That alone makes me uncomfortable moving this code outside of the Hadoop umbrella.  

	Also, since Yetus was invoked, it’s worthwhile pointing out that we had _many_ projects, inside and outside the ASF, ask us about using the revamped toolset, many of whom were already using an older version of test-patch. A wide audience was pretty much built-in on day one.

Re: Hadoop encryption module as Apache Chimera incubator project

Posted by Steve Loughran <st...@hortonworks.com>.
> On 3 Feb 2016, at 08:07, Chen, Haifeng <ha...@intel.com> wrote:
> 
>>> [Colin] I do think there will be some challenges splitting this functionality out into a separate jar, because of the way our CLASSPATH works right now.
> Yes, these challenges are common for shared libraries in Java. Just as you mentioned, maintaining API compatibility and using classpath isolation are two practical approaches.

You can't isolate JNI. This is why NMs are moving to a separate JVM for plugins - the Spark 1.6 shuffle is incompatible with Hadoop (currently: jackson).
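
A minimal sketch of that failure mode (the class and library names here are
hypothetical; assume org.example.NativeLoader calls
System.loadLibrary("chimera") in a static initializer):

    import java.net.URL;
    import java.net.URLClassLoader;

    public class JniIsolationDemo {
      public static void main(String[] args) throws Exception {
        URL[] cp = { new URL("file:chimera.jar") };
        // Two sibling classloaders, each with its own copy of the same class.
        ClassLoader a = new URLClassLoader(cp, null);
        ClassLoader b = new URLClassLoader(cp, null);
        Class.forName("org.example.NativeLoader", true, a); // binds the .so to loader a
        // A native library can be bound to only one classloader at a time,
        // so the second load fails with
        //   java.lang.UnsatisfiedLinkError: Native Library ... already loaded
        //   in another classloader
        Class.forName("org.example.NativeLoader", true, b);
      }
    }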

RE: Hadoop encryption module as Apache Chimera incubator project

Posted by "Chen, Haifeng" <ha...@intel.com>.
Thanks Chris and Colin for your opinions.

>> [Chris] If Chimera is not successful as an independent project or stalls, Hadoop and/or Spark and/or $project will have to reabsorb it as maintainers. 
I understand the concern. One point to consider: Chimera is dedicated to a specific domain, optimized cryptography, much as Apache Commons Logging is dedicated to logging. It is not as fast-moving as other Apache projects.
Of course, as to whether it should be part of Hadoop or separate, both ways have uncertainties. I am not strongly opposed to either one.

From the standpoint of a shared, fundamental piece of code like this, I do think Apache Commons might be the best direction to try as a first effort. In that direction, we still need to work with the Apache Commons community on buying into and accepting the proposal.

On the other hand, for the direction of a sub-project within Hadoop, I am uncertain about where the sub-project would be located and how it would manage its own cadence within Hadoop. Hadoop has modules like Hadoop Common, Hadoop HDFS, Hadoop YARN, and Hadoop MapReduce, and these modules share the same release cycle and are released together. Am I right?

>> [Colin] I do think there will be some challenges splitting this functionality out into a separate jar, because of the way our CLASSPATH works right now.
Yes, these challenges are common for shared libraries in Java. Just as you mentioned, maintaining API compatibility and using classpath isolation are two practical approaches.

>> [Colin] The really complicated part of bundling JNI code in a jar is that you need to create jars for every cross product.
Building does get complex across platforms. But it might not be as complex as described where the native code is concerned. First, building with JDK7 or JDK8 is a common consideration for all Java libraries, I think; it is not specific to building the JNI code (correct me if I am wrong). Second, it is still possible to isolate the native build so that you don't have to build different versions for Ubuntu and RHEL. Third, if we link dynamically against openssl and the openssl API used by the library does not change across versions, we don't have to build different versions for it.

So the build matrix might be Linux32, Linux64, Windows32, Windows64, Mac...

>>[Colin] So probably we would not choose to bundle openssl.
Agree. Bundling openssl is not a good idea, considering the need to upgrade for vulnerabilities.


Regards,
Haifeng


Re: Hadoop encryption module as Apache Chimera incubator project

Posted by "Colin P. McCabe" <cm...@apache.org>.
It's great to see interest in improving this functionality.  I think
Chimera could be successful as an Apache project.  I don't have a
strong opinion one way or the other as to whether it belongs as part
of Hadoop or separate.

I do think there will be some challenges splitting this functionality
out into a separate jar, because of the way our CLASSPATH works right
now.  For example, let's say that Hadoop depends on Chimera 1.2 and
Spark depends on Chimera 1.1.  Now Spark jobs have two different
versions fighting it out on the classpath, similar to the situation
with Guava and other libraries.  Perhaps if Chimera adopts a policy of
strong backwards compatibility, we can just always use the latest jar,
but it still seems likely that there will be problems.  There are
various classpath isolation ideas that could help here, but they are
big projects in their own right and we don't have a clear timeline for
them.  If this does end up being a separate jar, we may need to shade
it to avoid all these issues.
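
For what it's worth, a quick way to see which copy of a class wins on a
given classpath is a probe like this (a sketch; com.intel.chimera.CryptoCodec
is just a placeholder for whatever entry point Chimera ends up exposing):

    public class WhichJar {
      public static void main(String[] args) throws Exception {
        Class<?> c = Class.forName("com.intel.chimera.CryptoCodec");
        // The jar (or directory) the class was actually loaded from.
        System.out.println(c.getProtectionDomain().getCodeSource().getLocation());
        // The Implementation-Version from that jar's manifest, if present.
        System.out.println(c.getPackage().getImplementationVersion());
      }
    }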

Bundling the JNI glue code in the jar itself is an interesting idea,
which we have talked about before for libhadoop.so.  It doesn't really
have anything to do with the question of TLP vs. non-TLP, of course.
We could do that refactoring in Hadoop itself.  The really complicated
part of bundling JNI code in a jar is that you need to create jars for
every cross product of (JVM version, openssl version, operating
system).  For example, you have the RHEL6 build for openJDK7 using
openssl 1.0.1e.  If you change any one thing-- say, change openJDK7 to
Oracle JDK8, then you might need to rebuild.  And certainly using
Ubuntu would be a rebuild.  And so forth.  This kind of clashes with
Maven's philosophy of pulling prebuilt jars from the internet.
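
To make the cross product concrete, the selection key for a prebuilt native
artifact would look something like this (the naming scheme below is made up
for illustration):

    public class NativeArtifactKey {
      public static void main(String[] args) {
        String os = System.getProperty("os.name").toLowerCase();        // e.g. "linux"
        String arch = System.getProperty("os.arch");                    // e.g. "amd64"
        String jvm = System.getProperty("java.specification.version");  // e.g. "1.7"
        // The openssl version is invisible to Java system properties; it would
        // have to come from build metadata or a native probe.
        String openssl = "1.0.1e"; // placeholder
        System.out.println("chimera-native-" + os + "-" + arch
            + "-jdk" + jvm + "-openssl" + openssl + ".jar");
      }
    }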

Kai Zheng's question about whether we would bundle openSSL's libraries
is a good one.  Given the high rate of new vulnerabilities discovered
in that library, it seems like bundling would require Hadoop users and
vendors to update very frequently, much more frequently than Hadoop is
traditionally updated.  So probably we would not choose to bundle
openssl.

best,
Colin


Re: Hadoop encryption module as Apache Chimera incubator project

Posted by Chris Douglas <cd...@apache.org>.
As a subproject of Hadoop, Chimera could maintain its own cadence.
There's also no reason why it should maintain dependencies on other
parts of Hadoop, if those are separable. How is this solution
inadequate?

If Chimera is not successful as an independent project or stalls,
Hadoop and/or Spark and/or $project will have to reabsorb it as
maintainers. Projects have high mortality in early life, and a fight
over inheritance/maintenance is something we'd like to avoid. If, on
the other hand, it develops enough of a community where it is
obviously viable, then we can (and should) break it out as a TLP (as
we have before). If other Apache projects take a dependency on
Chimera, we're open to adding them to security@hadoop.

Unlike Yetus, which was largely rewritten right before it was made
into a TLP, security in Hadoop has a complicated pedigree. If Chimera
eventually becomes a TLP, it seems fair to include those who work on
it while it is a subproject. Declared upfront, that criterion is
fairer than any post hoc justification, and will lead to a more
accurate account of its community than a subset of the Hadoop
PMC/committers that volunteer. -C

