Posted to dev@hbase.apache.org by "Dey, Avik" <av...@intel.com> on 2013/02/26 00:46:45 UTC

ANNOUNCEMENT: Project Rhino: Enhanced Data Protection for the Apache Hadoop Ecosystem

Project Rhino

As the Apache Hadoop ecosystem extends into new markets and sees new use cases with security and compliance challenges, the benefits of processing sensitive and legally protected data with Hadoop must be coupled with protection for private information that limits the impact on performance. Project Rhino<https://github.com/intel-hadoop/project-rhino/> is our open source effort to enhance the existing data protection capabilities of the Hadoop ecosystem to address these challenges, and to contribute that code back to Apache.

The core of the Apache Hadoop ecosystem as it is commonly understood is:

- Core: A set of shared libraries
- HDFS: The Hadoop filesystem
- MapReduce: Parallel computation framework
- ZooKeeper: Configuration management and coordination
- HBase: Column-oriented database on HDFS
- Hive: Data warehouse on HDFS with SQL-like access
- Pig: Higher-level programming language for Hadoop computations
- Oozie: Orchestration and workflow management
- Mahout: A library of machine learning and data mining algorithms
- Flume: Collection and import of log and event data
- Sqoop: Imports data from relational databases

These components are all separate projects, and therefore cross-cutting concerns like authN, authZ, a consistent security policy framework, a consistent authorization model, and audit coverage are only loosely coordinated. Some security features our customers expect, such as encryption, are simply missing. Our aim is to take a full-stack view and work with the individual projects toward consistent concepts and capabilities, filling gaps as we go.

Our initial goals are:

1) Framework support for encryption and key management

There is currently no framework support for encryption or key management. We will add this support to Hadoop Core and integrate it across the ecosystem.
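
To give a sense of the shape this work might take, below is a minimal sketch of a codec-style encryption interface, modeled on Hadoop's existing CompressionCodec pattern. Every name here (CryptoCodec, CryptoContext, the stream methods) is an illustrative assumption of ours; the actual interfaces are being worked out under HADOOP-9331 and its sub-tasks.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    // Hypothetical interface, by analogy with Hadoop's CompressionCodec.
    public interface CryptoCodec {

      // Wrap a raw stream so bytes are encrypted as they are written.
      OutputStream createEncryptionStream(OutputStream out, CryptoContext context)
          throws IOException;

      // Wrap a raw stream so bytes are decrypted as they are read.
      InputStream createDecryptionStream(InputStream in, CryptoContext context)
          throws IOException;

      // Carries key material and cipher parameters. How the key bytes are
      // resolved against an external key store is the key management
      // layer's concern, not the codec's.
      class CryptoContext {
        private final byte[] key; // e.g. a 128-bit AES key
        private final byte[] iv;  // initialization vector

        public CryptoContext(byte[] key, byte[] iv) {
          this.key = key.clone();
          this.iv = iv.clone();
        }

        public byte[] getKey() { return key.clone(); }
        public byte[] getIv()  { return iv.clone(); }
      }
    }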

2) A common authorization framework for the Hadoop ecosystem

Each component currently has its own authorization engine. We will abstract the common functions into a reusable authorization framework with a consistent interface. Where appropriate, we will either modify an existing engine to work within this framework or plug in a common default engine. This also requires normalizing how security policy is expressed and applied by each component. Core, HDFS, ZooKeeper, and HBase currently support simple access control lists (ACLs) composed of users and groups, which we see as a good starting point. Where necessary we will modify these components so they each offer equivalent functionality, and build support into the others.
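
As a rough illustration of the kind of interface we have in mind -- the names AuthorizationEngine and SimpleAclEngine below are ours for this sketch, not a committed API -- a pluggable engine with a user/group ACL default might look like:

    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public interface AuthorizationEngine {

      enum Action { READ, WRITE, ADMIN }

      // Decide whether the user (with resolved group memberships) may
      // perform the action on the named resource.
      boolean authorize(String user, Set<String> groups,
                        String resource, Action action);

      // A simple default engine backed by per-resource, per-action ACLs
      // of users and groups -- the model Core, HDFS, ZooKeeper, and
      // HBase already share.
      class SimpleAclEngine implements AuthorizationEngine {
        private final Map<String, Set<String>> acls =
            new HashMap<String, Set<String>>();

        private static String key(String resource, Action action) {
          return resource + "#" + action;
        }

        public void allow(String resource, Action action, String principal) {
          String k = key(resource, action);
          Set<String> allowed = acls.get(k);
          if (allowed == null) {
            allowed = new HashSet<String>();
            acls.put(k, allowed);
          }
          allowed.add(principal);
        }

        @Override
        public boolean authorize(String user, Set<String> groups,
                                 String resource, Action action) {
          Set<String> allowed = acls.get(key(resource, action));
          if (allowed == null) {
            return false;
          }
          if (allowed.contains(user)) {
            return true;
          }
          for (String group : groups) {
            if (allowed.contains(group)) {
              return true;
            }
          }
          return false;
        }
      }
    }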

3) Token-based authentication and single sign-on

Core, HDFS, ZooKeeper, and HBase currently support Kerberos authentication at the RPC layer, via SASL. However, this does not provide valuable attributes such as group membership, classification level, organizational identity, or support for user-defined attributes. Hadoop components must interrogate external resources to discover these attributes, which is problematic at scale. There is also no consistent delegation model: HDFS has a simple delegation capability, and only Oozie can take limited advantage of it. We will implement a common token-based authentication framework to decouple internal user and service authentication from the external mechanisms used to support it (such as Kerberos).
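
To make the decoupling concrete, the sketch below shows one possible shape for such a token: identity plus attributes, signed by an issuer that performed the actual Kerberos (or other) authentication. Every name and field here is an assumption for illustration, not a design commitment.

    import java.util.Collections;
    import java.util.Map;

    // Hypothetical token: components verify the issuer's signature and
    // read attributes locally instead of querying an external directory
    // on every request.
    public final class IdentityToken {
      private final String principal;   // authenticated user
      private final String issuer;      // token service that authenticated
                                        // the user, e.g. against Kerberos
      private final long expiryMillis;  // epoch time after which invalid
      private final Map<String, String> attributes; // e.g. "group" -> "analysts"
      private final byte[] signature;   // issuer's signature over the fields above

      public IdentityToken(String principal, String issuer, long expiryMillis,
                           Map<String, String> attributes, byte[] signature) {
        this.principal = principal;
        this.issuer = issuer;
        this.expiryMillis = expiryMillis;
        this.attributes = Collections.unmodifiableMap(attributes);
        this.signature = signature.clone();
      }

      public String getPrincipal() { return principal; }
      public String getIssuer() { return issuer; }
      public byte[] getSignature() { return signature.clone(); }

      public boolean isExpired(long nowMillis) {
        return nowMillis > expiryMillis;
      }

      public String getAttribute(String name) {
        return attributes.get(name);
      }
    }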

4) Extend HBase support for ACLs to the cell level

Currently HBase supports setting access controls at the table or column family level. However, many use cases would benefit from the additional ability to do this on a per-cell basis; in fact, for many users dealing with sensitive information, this capability is crucial.
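
To illustrate the idea (the actual API is exactly what HBASE-6222 will settle), a per-cell ACL might be attached at write time roughly as follows. Put, Permission, and Bytes are existing HBase classes; the setACL call is hypothetical:

    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.security.access.Permission;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CellAclSketch {
      public static Put sensitivePut() {
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("pii"), Bytes.toBytes("ssn"),
                Bytes.toBytes("123-45-6789"));
        // Hypothetical call: restrict this one cell to a single reader,
        // tightening whatever the table- or column-family-level ACLs allow.
        put.setACL("alice", new Permission(Permission.Action.READ));
        return put;
      }
    }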

5) Improve audit logging

Audit messages from the various Hadoop components do not use a unified format, or even consistent formatting. This makes analyzing logs to verify compliance or take corrective action difficult. We will build a common audit logging facility as part of the common authorization framework work. We will also build a set of common audit log processing tools for transforming logs into different industry-standard formats, supporting compliance verification, and triggering responses to policy violations.
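
As a sketch of what a normalized audit record could carry -- the field names here are our assumptions, chosen to cover the information most industry formats expect -- consider:

    // Illustrative only: one normalized event that a common audit
    // facility might emit and that downstream tools could transform
    // into industry-standard formats.
    public final class AuditEvent {
      private final long timestampMillis; // when the access occurred
      private final String component;     // e.g. "hdfs", "hbase"
      private final String user;          // authenticated principal
      private final String action;        // e.g. "READ", "DELETE"
      private final String resource;      // e.g. "/data/pii/customers"
      private final boolean allowed;      // the authorization decision

      public AuditEvent(long timestampMillis, String component, String user,
                        String action, String resource, boolean allowed) {
        this.timestampMillis = timestampMillis;
        this.component = component;
        this.user = user;
        this.action = action;
        this.resource = resource;
        this.allowed = allowed;
      }

      // One line per event in a stable key=value layout, easy to parse
      // and re-emit in other formats.
      @Override
      public String toString() {
        return String.format(
            "ts=%d component=%s user=%s action=%s resource=%s allowed=%b",
            timestampMillis, component, user, action, resource, allowed);
      }
    }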

Current JIRAs:

As part of this ongoing effort we are contributing our work to date against the JIRAs listed below. As you may appreciate, the goals for Project Rhino cover a number of different Apache projects; the scope of work is significant and likely to grow as we get additional community input. We also appreciate that others in the Apache community may already be working on some of this, or may be interested in contributing to it. If so, we look forward to partnering with you in Apache to accelerate this effort so the community can see the benefits of our collective efforts sooner. You can also find a more detailed version of this announcement at Project Rhino<https://github.com/intel-hadoop/project-rhino/>.

Please feel free to reach out to us by commenting on the JIRAs below:

HBASE-6222: Add per-KeyValue Security<https://issues.apache.org/jira/browse/hbase-6222>

HADOOP-9331: Hadoop crypto codec framework and crypto codec implementations<https://issues.apache.org/jira/browse/hadoop-9331> and related sub-tasks

MAPREDUCE-5025: Key Distribution and Management for supporting crypto codec in Map Reduce<https://issues.apache.org/jira/browse/mapreduce-5025> and related JIRAs

HBASE-7544: Transparent table/CF encryption<https://issues.apache.org/jira/browse/hbase-7544>


Re: ANNOUNCEMENT: Project Rhino: Enhanced Data Protection for the Apache Hadoop Ecosystem

Posted by Avik Dey <av...@gmail.com>.
[thanks, appreciate your doing that; the announcement itself was
cross-posted as outreach]

Thanks Cos.

As I see the work currently, I believe most, if not all, of these will be
work against JIRAs in the individual projects, similar to the JIRAs posted
here: https://github.com/intel-hadoop/project-rhino. If we get to a point
where some of the future work needs a home outside of the individual
projects, happy to incubate that work in Apache.

~avik



On Mon, Feb 25, 2013 at 4:18 PM, Konstantin Boudnik <co...@apache.org> wrote:

> [yanking away most of the cross-posts...]
>
> An interesting cross-component project, Avik. Any plans to incubate it in
> Apache?
>
> Cos
>
> On Mon, Feb 25, 2013 at 11:46PM, Dey, Avik wrote:
> > [full announcement snipped]

Re: ANNOUNCEMENT: Project Rhino: Enhanced Data Protection for the Apache Hadoop Ecosystem

Posted by Konstantin Boudnik <co...@apache.org>.
[yanking away most of the cross-posts...]

An interesting cross-component project, Avik. Any plans to incubate it in Apache?

Cos

On Mon, Feb 25, 2013 at 11:46PM, Dey, Avik wrote:
> [full announcement snipped]
