Posted to general@incubator.apache.org by Billie J Rinaldi <bi...@ugov.gov> on 2011/09/02 17:45:15 UTC

[PROPOSAL] Accumulo for the Apache Incubator

Greetings,

I would like to propose Accumulo to be an Apache Incubator project.  Accumulo is a distributed key/value store that provides expressive cell-level access labels and a server-side programming mechanism that can modify key/value pairs at various points in the data management process.  It is based on Google's BigTable design and runs over Apache Hadoop and Zookeeper.

Here is a link to the proposal in the Incubator wiki:
http://wiki.apache.org/incubator/AccumuloProposal

I've also pasted the initial contents below.

Thanks,
Billie Rinaldi


= Accumulo Proposal =

== Abstract ==
Accumulo is a distributed key/value store that provides expressive, cell-level access labels.

== Proposal ==
Accumulo is a sorted, distributed key/value store based on Google's BigTable design.  It is built on top of Apache Hadoop, Zookeeper, and Thrift.  It features a few novel improvements on the BigTable design in the form of cell-level access labels and a server-side programming mechanism that can modify key/value pairs at various points in the data management process.

== Background ==
Google published the design of BigTable in 2006.  Several other open source projects have implemented aspects of this design including HBase, CloudStore, and Cassandra.  Accumulo began its development in 2008.

== Rationale ==
There is a need for a flexible, high-performance distributed key/value store that provides expressive, fine-grained access labels.  The communities we expect to be most interested in such a project are government, health care, and other industries where privacy is a concern.  We have made much progress in developing this project over the past three years and believe both the project and the interested communities would benefit from this work being openly available and openly developed.

== Current Status ==

=== Meritocracy ===
We intend to strongly encourage the community to help with and contribute to the code.  We will actively seek potential committers and help them become familiar with the codebase.

=== Community ===
A strong government community has developed around Accumulo and training classes have been ongoing for about a year.  Hundreds of developers use Accumulo.

=== Core Developers ===
The developers are mainly employed by the National Security Agency, but we anticipate interest developing among other companies.

=== Alignment ===
Accumulo is built on top of Hadoop, Zookeeper, and Thrift.  It builds with Maven.  Due to the strong relationship with these Apache projects, the incubator is a good match for Accumulo.

== Known Risks ==
=== Orphaned Products ===
There is only a small risk of being orphaned.  The community is committed to improving the codebase of the project because it fulfills needs not addressed by any other software.

=== Inexperience with Open Source ===
The codebase has been treated internally as an open source project since its beginning, and the initial Apache committers have been involved with the code for multiple years.  While our experience with public open source is limited, we do not anticipate difficulty in operating under Apache's development process.

=== Homogeneous Developers ===
The committers have multiple employers and it is expected that committers from different companies will be recruited.

=== Reliance on Salaried Developers ===
The initial committers are all paid by their employers to work on Accumulo and we expect such employment to continue.  Some of the initial committers would continue as volunteers even if no longer employed to do so.

=== Relationships with Other Apache Products ===
Accumulo uses Hadoop, Zookeeper, Thrift, Maven, log4j, commons-lang, -net, -io, -jci, -collections, -configuration, -logging, and -codec.

=== Relationship to HBase ===
Accumulo and HBase are both based on the design of Google's BigTable, so there is a danger that potential users will have difficulty distinguishing the two or that they will not see an incentive in adopting Accumulo.  There are a few key areas in which Accumulo differs from HBase.  Some of the desired features of Accumulo could be incorporated into HBase; however, the most important of these may be unlikely to be adopted (see cell-level access labels and iterators below).  It is possible that the codebases will ultimately converge, but the number of differences at the current time warrants a separate project for Accumulo.

==== Access Labels ====
Accumulo has an additional portion of its key that sorts after the column qualifier and before the timestamp.  It is called column visibility and enables expressive cell-level access control.  Authorizations are passed with each query to control what data is returned to the user.  The column visibilities are boolean AND and OR combinations of arbitrary strings (such as "(A&B)|C") and authorizations are sets of strings (such as {C,D}).
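
As a rough illustration of how such a label could be checked against a set of authorizations, here is a small Python sketch.  It is not Accumulo's implementation: the names `tokenize` and `is_visible` are hypothetical, labels are assumed to be alphanumeric tokens, an empty label is assumed to be visible to everyone, and mixing `&` and `|` without parentheses is rejected in this sketch simply to avoid choosing a precedence.

```python
import re

# Hypothetical helper, not part of Accumulo: a token is a label string,
# '&', '|', or a parenthesis.
_TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(expression):
    """Split a visibility expression such as "(A&B)|C" into tokens."""
    expression = expression.strip()
    tokens, pos = [], 0
    while pos < len(expression):
        match = _TOKEN.match(expression, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(expression, authorizations):
    """Return True if the authorizations satisfy the visibility expression."""
    tokens = tokenize(expression)
    if not tokens:
        return True  # assumption: an empty label hides nothing
    pos = 0

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if token in ("&", "|", ")"):
            raise ValueError(f"unexpected token {token!r}")
        return token in authorizations

    def parse_expr():
        nonlocal pos
        value = parse_term()
        seen = None  # the single operator allowed at this nesting level
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            operator = tokens[pos]
            if seen is not None and operator != seen:
                raise ValueError("mixing & and | requires parentheses")
            seen = operator
            pos += 1
            right = parse_term()
            value = (value and right) if operator == "&" else (value or right)
        return value

    result = parse_expr()
    if pos != len(tokens):
        raise ValueError("trailing tokens in expression")
    return result
```

With the examples from the paragraph above, `is_visible("(A&B)|C", {"C", "D"})` is true because `C` alone satisfies the label, while a user holding only `{"A", "D"}` would not see the cell.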

==== Iterators ====
Accumulo has a novel server-side programming mechanism that can modify the data written to disk or returned to the user.  This mechanism can be configured for any of the scopes where data is read from or written to disk.  It can be used to perform joins on data within a single tablet.
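
One way to picture this mechanism is as a stack of transformations applied to a sorted stream of key/value pairs.  The Python sketch below is an illustration only and does not use Accumulo's API: the function names `drop_older_than`, `sum_by_key`, and `scan`, the `(key, timestamp, value)` tuple layout, and the sample data are all hypothetical.

```python
from itertools import groupby


def drop_older_than(entries, min_timestamp):
    """Filter step: pass through only entries at or after min_timestamp."""
    for key, timestamp, value in entries:
        if timestamp >= min_timestamp:
            yield key, timestamp, value


def sum_by_key(entries):
    """Combine step: collapse adjacent entries sharing a key into one sum.

    Relies on the input being sorted by key, so equal keys are adjacent.
    """
    for key, group in groupby(entries, key=lambda entry: entry[0]):
        group = list(group)
        newest = max(timestamp for _, timestamp, _ in group)
        yield key, newest, sum(value for _, _, value in group)


def scan(entries, min_timestamp):
    """Stack the two steps over a sorted stream and collect the result."""
    return list(sum_by_key(drop_older_than(entries, min_timestamp)))


# Hypothetical sorted data: (key, timestamp, value)
SAMPLE = [
    ("row1:count", 5, 2),
    ("row1:count", 3, 4),
    ("row1:count", 1, 10),
    ("row2:count", 4, 7),
]
```

Because each step consumes and produces a sorted stream, the same stack could in principle be applied when data is read for a query or when it is rewritten to disk, which is the kind of scope choice the paragraph above describes.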

==== Flexibility ====
HBase requires the user to specify the set of column families to be used up front.  Accumulo places no restrictions on the column families.  Also, each column family in HBase is stored separately on disk.  Accumulo allows column families to be grouped together on disk, as does BigTable.  This enables users to configure how their data is stored, potentially providing improvements in compression and lookup speeds.  It gives Accumulo a row/column hybrid nature, while HBase is currently column-oriented.
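
The grouping of column families on disk can be sketched as a simple partitioning step.  This Python example is an assumption-laden illustration, not Accumulo's storage code: the function `partition_by_group`, the `"default"` group name, and the sample families are hypothetical.

```python
from collections import defaultdict


def partition_by_group(entries, locality_groups, default_group="default"):
    """Assign each (row, family, value) entry to the group of its family.

    locality_groups maps a group name to the column families stored
    together; families not listed fall into default_group.
    """
    family_to_group = {
        family: group
        for group, families in locality_groups.items()
        for family in families
    }
    stored = defaultdict(list)
    for row, family, value in entries:
        group = family_to_group.get(family, default_group)
        stored[group].append((row, family, value))
    return dict(stored)


# Hypothetical configuration: two families stored together, the rest apart.
GROUPS = {"metadata": ["name", "owner"]}
ENTRIES = [
    ("row1", "content", "..."),
    ("row1", "name", "report"),
    ("row1", "owner", "alice"),
    ("row2", "name", "notes"),
]
```

A scan that needs only `name` and `owner` would then read the `metadata` group and skip the bulkier `content` data, while no family has to be declared before it is first written.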

==== Testing ====
Accumulo has testing frameworks that have resulted in its achieving a high level of correctness and performance.  We have observed that under some configurations and conditions Accumulo will outperform HBase and provide greater data integrity.

==== Logging ====
HBase uses a write-ahead log on the Hadoop Distributed File System.  Accumulo has its own logging service that does not depend on communication with the HDFS NameNode.

==== Storage ====
Accumulo has a relative key file format that improves compression.
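
The general idea of relative key encoding can be sketched as follows.  This is an assumption about the technique in general, not a description of Accumulo's actual file format: the functions `encode_relative` and `decode_relative`, the four-field key layout, and the use of `None` as the "same as previous key" marker are hypothetical, and key fields are assumed to be strings.

```python
def encode_relative(keys):
    """Replace each field that repeats the previous key's field with None.

    Each key is a (row, family, qualifier, visibility) tuple of strings.
    """
    encoded, previous = [], None
    for key in keys:
        if previous is None:
            encoded.append(tuple(key))  # first key is stored in full
        else:
            encoded.append(tuple(
                None if field == prev_field else field
                for field, prev_field in zip(key, previous)
            ))
        previous = key
    return encoded


def decode_relative(encoded):
    """Rebuild full keys by copying omitted fields from the previous key."""
    keys, previous = [], None
    for entry in encoded:
        if previous is None:
            key = tuple(entry)
        else:
            key = tuple(
                prev_field if field is None else field
                for field, prev_field in zip(entry, previous)
            )
        keys.append(key)
        previous = key
    return keys


# Hypothetical sorted keys: (row, family, qualifier, visibility)
KEYS = [
    ("row1", "cf", "a", "A&B"),
    ("row1", "cf", "b", "A&B"),
    ("row2", "cf", "b", "C"),
]
```

Since keys are stored in sorted order, neighboring keys tend to share a row and column family, so many fields can be omitted before any general-purpose compression is applied.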

==== Areas in which HBase features improvements over Accumulo ====
In-memory tables, upserts, coprocessors, and connections to other projects such as Cascading and Pig.

=== Expectations ===
There is a risk that Accumulo will be criticized for not providing adequate security.  The access labels in Accumulo do not in themselves provide a complete security solution, but are a mechanism for labeling each piece of data with the authorizations that are necessary to see it.

=== Apache Brand ===
Our interest in releasing this code as an Apache incubator project is due to its strong relationship with other Apache projects, namely Hadoop, Zookeeper, and HBase.

== Documentation ==
There is not currently documentation about Accumulo on the web, but a fair amount of documentation and training materials exists and will be provided on the Accumulo wiki at apache.org.  Also, a paper discussing YCSB results for Accumulo will be presented at the 2011 Symposium on Cloud Computing.

== Initial Source ==
Accumulo has been in development since spring 2008.  Hundreds of developers use it, and tens of developers have contributed to it.  The core codebase consists of 200,000 lines of code (mainly Java) and hundreds of pages of documentation.  There are also a few projects built on top of Accumulo that may be added to its contrib in the future.  These include support for Hive, Matlab, YCSB, and graph processing.

== Source and Intellectual Property Submission Plan ==
Accumulo core code, examples, documentation, and training materials will be submitted by the National Security Agency.

We will also be soliciting contributions of further plugins from MIT Lincoln Labs, Carnegie Mellon University, and others.

Accumulo has been developed by a mix of government employees and private companies under government contract.  Material developed by government employees is in the public domain, and no U.S. copyright exists in works of the federal government.  For the contractor-developed material in the initial submission, the U.S. Government has sufficient authority per the ICLA from the copyright owner to contribute the Accumulo code to the incubator.

There has been some discussion regarding accepting contributions from US Government sources on [https://issues.apache.org/jira/browse/LEGAL-93 LEGAL-93]. We propose that the NSA sign an ICLA/CCLA if that document could be slightly modified to explicitly address copyright in works of government employees. Specifically, we propose that the definition of “You” be modified to include “the copyright owner, the owner of a Contribution not subject to copyright, or legal entity authorized by the copyright owner that is making this Agreement.” In addition, we propose that section 2, the copyright license grant, be modified to add, after “You hereby grant”, either “to the extent authorized by law” or “to the extent copyright exists in the Contribution.”  These changes will permit work developed by US Government employees to be included.

One proposed solution is to form a Cooperative Research and Development Agreement (CRADA) between the Apache Software Foundation and the US Government, but this will not solve the underlying problem that U.S. law does not grant copyright to works of government employees.  At this time a CRADA is not necessary; should one be determined to be necessary, we would like to work through that process during the incubation phase of Accumulo rather than before acceptance, because entering into such an agreement may take time.

== External Dependencies ==
jetty (Apache and EPL), jline (BSD), jfreechart (LGPL), jcommon (LGPL), slf4j (MIT), junit (CPL)

== Cryptography ==
none

== Required Resources ==
 * Mailing Lists
   * accumulo-private
   * accumulo-dev
   * accumulo-commits
   * accumulo-user

 * Subversion Directory
   * https://svn.apache.org/repos/asf/incubator/accumulo

 * Issue Tracking
   * JIRA Accumulo (ACCUMULO)

 * Continuous Integration
   * Jenkins builds on https://builds.apache.org/

 * Web
   * http://incubator.apache.org/accumulo/
   * wiki at http://wiki.apache.org or http://cwiki.apache.org

== Initial Committers ==
 * Aaron Cordova (aaron at cordovas dot org)
 * Adam Fuchs (adam.p.fuchs at ugov dot gov)
 * Eric Newton (ecn at swcomplete dot com)
 * Billie Rinaldi (billie.j.rinaldi at ugov dot gov)
 * Keith Turner (keith.turner at ptech-llc dot com)
 * John Vines (john.w.vines at ugov dot gov)
 * Chris Waring (christopher.a.waring at ugov dot gov)

== Affiliations ==
 * Aaron Cordova, The Interllective
 * Adam Fuchs, National Security Agency
 * Eric Newton, SW Complete Incorporated
 * Billie Rinaldi, National Security Agency
 * Keith Turner, Peterson Technology LLC
 * John Vines, National Security Agency
 * Chris Waring, National Security Agency

== Sponsors ==
 * Champion: Doug Cutting
 * Nominated Mentors: Benson Margulies, ?, ?
 * Sponsoring Entity: Apache Incubator

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
For additional commands, e-mail: general-help@incubator.apache.org

Re: [PROPOSAL] Accumulo for the Apache Incubator

Posted by Patrick Hunt <ph...@apache.org>.

Seems similar, see the proposal, there are a few sections that call
out the differences. (search for "hbase")

On Fri, Sep 2, 2011 at 9:45 AM, Mahadev Konar <ma...@hortonworks.com> wrote:
> Nice!
> Is this related to HBase? Or similar to it?
>
> mahadev
>
> On Fri, Sep 2, 2011 at 9:27 AM, Patrick Hunt <ph...@apache.org> wrote:
>> FYI, another project using ZK -- woot!!! (note that they have their
>> own WAL - perhaps a good application for BookKeeper?)
>>
>> ---------- Forwarded message ----------
>> From: Billie J Rinaldi <bi...@ugov.gov>
>> Date: Fri, Sep 2, 2011 at 8:45 AM
>> Subject: [PROPOSAL] Accumulo for the Apache Incubator
>> To: general@incubator.apache.org
>

Re: [PROPOSAL] Accumulo for the Apache Incubator

Posted by Mahadev Konar <ma...@hortonworks.com>.

Nice!
Is this related to HBase? Or similar to it?

mahadev

On Fri, Sep 2, 2011 at 9:27 AM, Patrick Hunt <ph...@apache.org> wrote:
> FYI, another project using ZK -- woot!!! (note that they have their
> own WAL - perhaps a good application for BookKeeper?)
>
> ---------- Forwarded message ----------
> From: Billie J Rinaldi <bi...@ugov.gov>
> Date: Fri, Sep 2, 2011 at 8:45 AM
> Subject: [PROPOSAL] Accumulo for the Apache Incubator
> To: general@incubator.apache.org
>
>
> Greetings,
>
> I would like to propose Accumulo to be an Apache Incubator project.
> Accumulo is a distributed key/value store that provides expressive
> cell-level access labels and a server-side programming mechanism that
> can modify key/value pairs at various points in the data management
> process.  It is based on Google's BigTable design and runs over Apache
> Hadoop and Zookeeper.
>
> Here is a link to the proposal in the Incubator wiki:
> http://wiki.apache.org/incubator/AccumuloProposal
>
> I've also pasted the initial contents below.
>
> Thanks,
> Billie Rinaldi
>
>
> = Accumulo Proposal =
>
> == Abstract ==
> Accumulo is a distributed key/value store that provides expressive,
> cell-level access labels.
>
> == Proposal ==
> Accumulo is a sorted, distributed key/value store based on Google's
> BigTable design.  It is built on top of Apache Hadoop, Zookeeper, and
> Thrift.  It features a few novel improvements on the BigTable design
> in the form of cell-level access labels and a server-side programming
> mechanism that can modify key/value pairs at various points in the
> data management process.
>
> == Background ==
> Google published the design of BigTable in 2006.  Several other open
> source projects have implemented aspects of this design including
> HBase, CloudStore, and Cassandra.  Accumulo began its development in
> 2008.
>
> == Rationale ==
> There is a need for a flexible, high performance distributed key/value
> store that provides expressive, fine-grained access labels.  The
> communities we expect to be most interested in such a project are
> government, health care, and other industries where privacy is a
> concern.  We have made much progress in developing this project over
> the past 3 years and believe both the project and the interested
> communities would benefit from this work being openly available and
> having open development.
>
> == Current Status ==
>
> === Meritocracy ===
> We intend to strongly encourage the community to help with and
> contribute to the code.  We will actively seek potential committers
> and help them become familiar with the codebase.
>
> === Community ===
> A strong government community has developed around Accumulo and
> training classes have been ongoing for about a year.  Hundreds of
> developers use Accumulo.
>
> === Core Developers ===
> The developers are mainly employed by the National Security Agency,
> but we anticipate interest developing among other companies.
>
> === Alignment ===
> Accumulo is built on top of Hadoop, Zookeeper, and Thrift.  It builds
> with Maven.  Due to the strong relationship with these Apache
> projects, the incubator is a good match for Accumulo.
>
> == Known Risks ==
> === Orphaned Products ===
> There is only a small risk of being orphaned.  The community is
> committed to improving the codebase of the project due to its
> fulfilling needs not addressed by any other software.
>
> === Inexperience with Open Source ===
> The codebase has been treated internally as an open source project
> since its beginning, and the initial Apache committers have been
> involved with the code for multiple years.  While our experience with
> public open source is limited, we do not anticipate difficulty in
> operating under Apache's development process.
>
> === Homogeneous Developers ===
> The committers have multiple employers and it is expected that
> committers from different companies will be recruited.
>
> === Reliance on Salaried Developers ===
> The initial committers are all paid by their employers to work on
> Accumulo and we expect such employment to continue.  Some of the
> initial committers would continue as volunteers even if no longer
> employed to do so.
>
> === Relationships with Other Apache Products ===
> Accumulo uses Hadoop, Zookeeper, Thrift, Maven, log4j, commons-lang,
> -net, -io, -jci, -collections, -configuration, -logging, and -codec.
>
> === Relationship to HBase ===
> Accumulo and HBase are both based on the design of Google's BigTable,
> so there is a danger that potential users will have difficulty
> distinguishing the two or that they will not see an incentive in
> adopting Accumulo.  There are a few key areas in which Accumulo
> differs from HBase.  Some of the desired features of Accumulo could be
> incorporated into HBase, however the most important of these may be
> unlikely to be adopted (see cell-level access labels and iterators
> below).  It is a possibility that the codebases will ultimately
> converge, but the number of differences at the current time warrants a
> separate project for Accumulo.
>
> ==== Access Labels ====
> Accumulo has an additional portion of its key that sorts after the
> column qualifier and before the timestamp.  It is called column
> visibility and enables expressive cell-level access control.
> Authorizations are passed with each query to control what data is
> returned to the user.  The column visibilities are boolean AND and OR
> combinations of arbitrary strings (such as "(A&B)|C") and
> authorizations are sets of strings (such as {C,D}).
>
> ==== Iterators ====
> Accumulo has a novel server-side programming mechanism that can modify
> the data written to disk or returned to the user.  This mechanism can
> be configured for any of the scopes where data is read from or written
> to disk.  It can be used to perform joins on data within a single
> tablet.
>
> ==== Flexibility ====
> HBase requires the user to specify the set of column families to be
> used up front.  Accumulo places no restrictions on the column
> families.  Also, each column family in HBase is stored separately on
> disk.  Accumulo allows column families to be grouped together on disk,
> as does BigTable.  This enables users to configure how their data is
> stored, potentially providing improvements in compression and lookup
> speeds.  It gives Accumulo a row/column hybrid nature, while HBase is
> currently column-oriented.
>
> ==== Testing ====
> Accumulo has testing frameworks that have resulted in its achieving a
> high level of correctness and performance.  We have observed that
> under some configurations and conditions Accumulo will outperform
> HBase and provide greater data integrity.
>
> ==== Logging ====
> HBase uses a write-ahead log on the Hadoop Distributed File System.
> Accumulo has its own logging service that does not depend on
> communication with the HDFS NameNode.
>
> ==== Storage ====
> Accumulo has a relative key file format that improves compression.
>
> ==== Areas in which HBase features improvements over Accumulo ====
> in memory tables, upserts, coprocessors, connections to other projects
> such as Cascading and Pig
>
> === Expectations ===
> There is a risk that Accumulo will be criticized for not providing
> adequate security.  The access labels in Accumulo do not in themselves
> provide a complete security solution, but are a mechanism for labeling
> each piece of data with the authorizations that are necessary to see
> it.
>
> === Apache Brand ===
> Our interest in releasing this code as an Apache incubator project is
> due to its strong relationship with other Apache projects, i.e.
> Hadoop, Zookeeper, and HBase.
>
> == Documentation ==
> There is not currently documentation about Accumulo on the web, but a
> fair amount of documentation and training materials exists and will be
> provided on the Accumulo wiki at apache.org.  Also, a paper discussing
> YCSB results for Accumulo will be presented at the 2011 Symposium on
> Cloud Computing.
>
> == Initial Source ==
> Accumulo has been in development since spring 2008.  There are
> hundreds of developers using it and tens of developers have
> contributed to it.  The core codebase consists of 200,000 lines of
> code (mainly Java) and 100s of pages of documentation.  There are also
> a few projects built on top of Accumulo that may be added to its
> contrib in the future.  These include support for Hive, Matlab, YCSB,
> and graph processing.
>
> == Source and Intellectual Property Submission Plan ==
> Accumulo core code, examples, documentation, and training materials will
> be submitted by the National Security Agency.
>
> We will also be soliciting contributions of further plugins from MIT
> Lincoln Labs, Carnegie Mellon University, and others.
>
> Accumulo has been developed by a mix of government employees and
> private companies under government contract.  Material developed by
> government employees is in the public domain and no U.S. copyright
> exists in works of the federal government.  For the contractor
> developed material in the initial submission, the U.S. Government has
> sufficient authority per the ICLA from the copyright owner to
> contribute the Accumulo code to the incubator.
>
> There has been some discussion regarding accepting contributions from
> US Government sources on
> [https://issues.apache.org/jira/browse/LEGAL-93 LEGAL-93]. We propose
> that the NSA will sign an ICLA/CCLA if that document could be slightly
> modified to explicitly address copyright in works of government
> employees. Specifically, we propose that the definition of “You” be
> modified to include “the copyright owner, the owner of a Contribution
> not subject to copyright, or legal entity authorized by the copyright
> owner that is making this Agreement.” In addition, we propose that
> section 2, the copyright license grant, be modified to add, after the
> words “You hereby grant”, either “to the extent authorized by law” or
> “to the extent copyright exists in the Contribution.”  These changes
> will permit work developed by US Government employees to be included.
>
> One proposed solution is to form a Collaborative Research and
> Development Agreement (CRADA) between the Apache Software Foundation
> and the US Government, but this will not solve the underlying problem
> that U.S. law does not grant copyright to works of government
> employees.  At this time a CRADA is not necessary, but should one be
> determined to be necessary, we would like to work through that process
> during the incubation phase of Accumulo rather than before acceptance,
> as entering into such an agreement may take time.
>
> == External Dependencies ==
> jetty (Apache and EPL), jline (BSD), jfreechart (LGPL), jcommon
> (LGPL), slf4j (MIT), junit (CPL)
>
> == Cryptography ==
> none
>
> == Required Resources ==
>  * Mailing Lists
>   * accumulo-private
>   * accumulo-dev
>   * accumulo-commits
>   * accumulo-user
>
>  * Subversion Directory
>   * https://svn.apache.org/repos/asf/incubator/accumulo
>
>  * Issue Tracking
>   * JIRA Accumulo (ACCUMULO)
>
>  * Continuous Integration
>   * Jenkins builds on https://builds.apache.org/
>
>  * Web
>   * http://incubator.apache.org/accumulo/
>   * wiki at http://wiki.apache.org or http://cwiki.apache.org
>
> == Initial Committers ==
>  * Aaron Cordova (aaron at cordovas dot org)
>  * Adam Fuchs (adam.p.fuchs at ugov dot gov)
>  * Eric Newton (ecn at swcomplete dot com)
>  * Billie Rinaldi (billie.j.rinaldi at ugov dot gov)
>  * Keith Turner (keith.turner at ptech-llc dot com)
>  * John Vines (john.w.vines at ugov dot gov)
>  * Chris Waring (christopher.a.waring at ugov dot gov)
>
> == Affiliations ==
>  * Aaron Cordova, The Interllective
>  * Adam Fuchs, National Security Agency
>  * Eric Newton, SW Complete Incorporated
>  * Billie Rinaldi, National Security Agency
>  * Keith Turner, Peterson Technology LLC
>  * John Vines, National Security Agency
>  * Chris Waring, National Security Agency
>
> == Sponsors ==
>  * Champion: Doug Cutting
>  * Nominated Mentors: Benson Margulies, ?, ?
>  * Sponsoring Entity: Apache Incubator
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscribe@incubator.apache.org
> For additional commands, e-mail: general-help@incubator.apache.org
>

Fwd: [PROPOSAL] Accumulo for the Apache Incubator

Posted by Patrick Hunt <ph...@apache.org>.

FYI, another project using ZK -- woot!!! (note that they have their
own WAL - perhaps a good application for BookKeeper?)

---------- Forwarded message ----------
From: Billie J Rinaldi <bi...@ugov.gov>
Date: Fri, Sep 2, 2011 at 8:45 AM
Subject: [PROPOSAL] Accumulo for the Apache Incubator
To: general@incubator.apache.org

Greetings,

I would like to propose Accumulo to be an Apache Incubator project.
Accumulo is a distributed key/value store that provides expressive
cell-level access labels and a server-side programming mechanism that
can modify key/value pairs at various points in the data management
process.  It is based on Google's BigTable design and runs over Apache
Hadoop and Zookeeper.

Here is a link to the proposal in the Incubator wiki:
http://wiki.apache.org/incubator/AccumuloProposal

I've also pasted the initial contents below.

Thanks,
Billie Rinaldi

= Accumulo Proposal =

== Abstract ==
Accumulo is a distributed key/value store that provides expressive,
cell-level access labels.

== Proposal ==
Accumulo is a sorted, distributed key/value store based on Google's
BigTable design.  It is built on top of Apache Hadoop, Zookeeper, and
Thrift.  It features a few novel improvements on the BigTable design
in the form of cell-level access labels and a server-side programming
mechanism that can modify key/value pairs at various points in the
data management process.

== Background ==
Google published the design of BigTable in 2006.  Several other open
source projects have implemented aspects of this design including
HBase, CloudStore, and Cassandra.  Accumulo began its development in
2008.

== Rationale ==
There is a need for a flexible, high performance distributed key/value
store that provides expressive, fine-grained access labels.  The
communities we expect to be most interested in such a project are
government, health care, and other industries where privacy is a
concern.  We have made much progress in developing this project over
the past 3 years and believe both the project and the interested
communities would benefit from this work being openly available and
having open development.

== Current Status ==

=== Meritocracy ===
We intend to strongly encourage the community to help with and
contribute to the code.  We will actively seek potential committers
and help them become familiar with the codebase.

=== Community ===
A strong government community has developed around Accumulo and
training classes have been ongoing for about a year.  Hundreds of
developers use Accumulo.

=== Core Developers ===
The developers are mainly employed by the National Security Agency,
but we anticipate interest developing among other companies.

=== Alignment ===
Accumulo is built on top of Hadoop, Zookeeper, and Thrift.  It builds
with Maven.  Due to the strong relationship with these Apache
projects, the incubator is a good match for Accumulo.

== Known Risks ==
=== Orphaned Products ===
There is only a small risk of being orphaned.  The community is
committed to improving the codebase of the project due to its
fulfilling needs not addressed by any other software.

=== Inexperience with Open Source ===
The codebase has been treated internally as an open source project
since its beginning, and the initial Apache committers have been
involved with the code for multiple years.  While our experience with
public open source is limited, we do not anticipate difficulty in
operating under Apache's development process.

=== Homogeneous Developers ===
The committers have multiple employers and it is expected that
committers from different companies will be recruited.

=== Reliance on Salaried Developers ===
The initial committers are all paid by their employers to work on
Accumulo and we expect such employment to continue.  Some of the
initial committers would continue as volunteers even if no longer
employed to do so.

=== Relationships with Other Apache Products ===
Accumulo uses Hadoop, Zookeeper, Thrift, Maven, log4j, commons-lang,
-net, -io, -jci, -collections, -configuration, -logging, and -codec.

=== Relationship to HBase ===
Accumulo and HBase are both based on the design of Google's BigTable,
so there is a danger that potential users will have difficulty
distinguishing the two or that they will not see an incentive in
adopting Accumulo.  There are a few key areas in which Accumulo
differs from HBase.  Some of the desired features of Accumulo could be
incorporated into HBase; however, the most important of these may be
unlikely to be adopted (see cell-level access labels and iterators
below).  It is a possibility that the codebases will ultimately
converge, but the number of differences at the current time warrants a
separate project for Accumulo.

==== Access Labels ====
Accumulo has an additional portion of its key that sorts after the
column qualifier and before the timestamp.  It is called column
visibility and enables expressive cell-level access control.
Authorizations are passed with each query to control what data is
returned to the user.  The column visibilities are boolean AND and OR
combinations of arbitrary strings (such as "(A&B)|C") and
authorizations are sets of strings (such as {C,D}).
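
To make the access-label example concrete, here is a minimal sketch of how a visibility expression such as "(A&B)|C" could be checked against an authorization set such as {C,D}. This is not Accumulo's implementation: the function name `visible` is hypothetical, and the sketch assumes that labels contain no operator characters and that `&` binds more tightly than `|` when parentheses are omitted.

```python
def visible(expression, authorizations):
    """Return True if the authorization set satisfies the visibility expression.

    Hypothetical helper for illustration only; not part of any Accumulo API.
    Grammar assumed here: or := and ('|' and)*, and := term ('&' term)*,
    term := label | '(' or ')'.
    """
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(expression) and expression[pos] == "|":
            pos += 1
            rhs = parse_and()  # always parse the right side, then combine
            result = result or rhs
        return result

    def parse_and():
        nonlocal pos
        result = parse_term()
        while pos < len(expression) and expression[pos] == "&":
            pos += 1
            rhs = parse_term()
            result = result and rhs
        return result

    def parse_term():
        nonlocal pos
        if pos < len(expression) and expression[pos] == "(":
            pos += 1
            result = parse_or()
            if pos >= len(expression) or expression[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        start = pos
        while pos < len(expression) and expression[pos] not in "&|()":
            pos += 1
        if start == pos:
            raise ValueError("expected a label at position %d" % start)
        # A bare label is satisfied when the caller holds that authorization.
        return expression[start:pos] in authorizations

    result = parse_or()
    if pos != len(expression):
        raise ValueError("unexpected character at position %d" % pos)
    return result
```

With the values from the paragraph above, `visible("(A&B)|C", {"C", "D"})` is true because the holder has C, while a holder of only {A} would not see the cell.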

==== Iterators ====
Accumulo has a novel server-side programming mechanism that can modify
the data written to disk or returned to the user.  This mechanism can
be configured for any of the scopes where data is read from or written
to disk.  It can be used to perform joins on data within a single
tablet.

==== Flexibility ====
HBase requires the user to specify the set of column families to be
used up front.  Accumulo places no restrictions on the column
families.  Also, each column family in HBase is stored separately on
disk.  Accumulo allows column families to be grouped together on disk,
as does BigTable.  This enables users to configure how their data is
stored, potentially providing improvements in compression and lookup
speeds.  It gives Accumulo a row/column hybrid nature, while HBase is
currently column-oriented.

==== Testing ====
Accumulo has testing frameworks that have resulted in its achieving a
high level of correctness and performance.  We have observed that
under some configurations and conditions Accumulo will outperform
HBase and provide greater data integrity.

==== Logging ====
HBase uses a write-ahead log on the Hadoop Distributed File System.
Accumulo has its own logging service that does not depend on
communication with the HDFS NameNode.

==== Storage ====
Accumulo has a relative key file format that improves compression.
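
The proposal does not describe the file format itself, so the following is only a generic sketch of the idea behind relative key encoding: in a sorted file, each key can be stored as the number of leading characters it shares with the previous key plus the remaining suffix. The function names and the "row:family:qualifier" key layout are assumptions made for illustration.

```python
import os


def encode_relative(sorted_keys):
    """Encode each key as (shared_prefix_length, suffix) relative to the previous key."""
    encoded = []
    previous = ""
    for key in sorted_keys:
        # commonprefix compares character by character, which is what we want here.
        shared = len(os.path.commonprefix([previous, key]))
        encoded.append((shared, key[shared:]))
        previous = key
    return encoded


def decode_relative(encoded):
    """Rebuild the full keys from (shared_prefix_length, suffix) pairs."""
    keys = []
    previous = ""
    for shared, suffix in encoded:
        key = previous[:shared] + suffix
        keys.append(key)
        previous = key
    return keys


keys = ["row1:cf:a", "row1:cf:b", "row1:cg:a", "row2:cf:a"]
encoded = encode_relative(keys)
```

Because neighbouring keys in a sorted store tend to share long prefixes, most entries shrink to a small integer and a short suffix, which is where the compression benefit would come from.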

==== Areas in which HBase features improvements over Accumulo ====
in-memory tables, upserts, coprocessors, and connections to other
projects such as Cascading and Pig

=== Expectations ===
There is a risk that Accumulo will be criticized for not providing
adequate security.  The access labels in Accumulo do not in themselves
provide a complete security solution, but are a mechanism for labeling
each piece of data with the authorizations that are necessary to see
it.

=== Apache Brand ===
Our interest in releasing this code as an Apache incubator project is
due to its strong relationship with other Apache projects, i.e.
Hadoop, Zookeeper, and HBase.

== Documentation ==
There is not currently documentation about Accumulo on the web, but a
fair amount of documentation and training materials exists and will be
provided on the Accumulo wiki at apache.org.  Also, a paper discussing
YCSB results for Accumulo will be presented at the 2011 Symposium on
Cloud Computing.

== Initial Source ==
Accumulo has been in development since spring 2008.  There are
hundreds of developers using it and tens of developers have
contributed to it.  The core codebase consists of 200,000 lines of
code (mainly Java) and 100s of pages of documentation.  There are also
a few projects built on top of Accumulo that may be added to its
contrib in the future.  These include support for Hive, Matlab, YCSB,
and graph processing.

== Source and Intellectual Property Submission Plan ==
Accumulo core code, examples, documentation, and training materials will
be submitted by the National Security Agency.

We will also be soliciting contributions of further plugins from MIT
Lincoln Labs, Carnegie Mellon University, and others.

Accumulo has been developed by a mix of government employees and
private companies under government contract.  Material developed by
government employees is in the public domain and no U.S. copyright
exists in works of the federal government.  For the contractor
developed material in the initial submission, the U.S. Government has
sufficient authority per the ICLA from the copyright owner to
contribute the Accumulo code to the incubator.

There has been some discussion regarding accepting contributions from
US Government sources on
[https://issues.apache.org/jira/browse/LEGAL-93 LEGAL-93]. We propose
that the NSA will sign an ICLA/CCLA if that document could be slightly
modified to explicitly address copyright in works of government
employees. Specifically, we propose that the definition of “You” be
modified to include “the copyright owner, the owner of a Contribution
not subject to copyright, or legal entity authorized by the copyright
owner that is making this Agreement.” In addition, we propose that
section 2, the copyright license grant, be modified to add, after the
words “You hereby grant”, either “to the extent authorized by law” or
“to the extent copyright exists in the Contribution.”  These changes
will permit work developed by US Government employees to be included.

One proposed solution is to form a Collaborative Research and
Development Agreement (CRADA) between the Apache Software Foundation
and the US Government, but this will not solve the underlying problem
that U.S. law does not grant copyright to works of government
employees.  At this time a CRADA is not necessary, but should one be
determined to be necessary, we would like to work through that process
during the incubation phase of Accumulo rather than before acceptance,
as entering into such an agreement may take time.

== External Dependencies ==
jetty (Apache and EPL), jline (BSD), jfreechart (LGPL), jcommon
(LGPL), slf4j (MIT), junit (CPL)

== Cryptography ==
none

== Required Resources ==
 * Mailing Lists
  * accumulo-private
  * accumulo-dev
  * accumulo-commits
  * accumulo-user

 * Subversion Directory
  * https://svn.apache.org/repos/asf/incubator/accumulo

 * Issue Tracking
  * JIRA Accumulo (ACCUMULO)

 * Continuous Integration
  * Jenkins builds on https://builds.apache.org/

 * Web
  * http://incubator.apache.org/accumulo/
  * wiki at http://wiki.apache.org or http://cwiki.apache.org

== Initial Committers ==
 * Aaron Cordova (aaron at cordovas dot org)
 * Adam Fuchs (adam.p.fuchs at ugov dot gov)
 * Eric Newton (ecn at swcomplete dot com)
 * Billie Rinaldi (billie.j.rinaldi at ugov dot gov)
 * Keith Turner (keith.turner at ptech-llc dot com)
 * John Vines (john.w.vines at ugov dot gov)
 * Chris Waring (christopher.a.waring at ugov dot gov)

== Affiliations ==
 * Aaron Cordova, The Interllective
 * Adam Fuchs, National Security Agency
 * Eric Newton, SW Complete Incorporated
 * Billie Rinaldi, National Security Agency
 * Keith Turner, Peterson Technology LLC
 * John Vines, National Security Agency
 * Chris Waring, National Security Agency

== Sponsors ==
 * Champion: Doug Cutting
 * Nominated Mentors: Benson Margulies, ?, ?
 * Sponsoring Entity: Apache Incubator


Re: [PROPOSAL] Accumulo for the Apache Incubator

Posted by Bernd Fondermann <be...@googlemail.com>.

On Sun, Sep 4, 2011 at 18:16, Greg Stein <gs...@gmail.com> wrote:
> On Sep 4, 2011 3:41 AM, "Bernd Fondermann" <be...@googlemail.com>
> wrote:
>>...
>>
>> So, you are saying more than 10% of the non-generated code base (and you
>> are not counting lib-style uses/JARs here, right?) is derived from other
>> Apache code? That seems to be unusual. Just curious, could you elaborate a
>> bit about why you did that and what kind of code that is? Thank you.
>
> You make it sound like deriving from our code base is a bad thing, and
> should be justified. I don't get it. That is what we *want* people to do.

Of course, many do so. Especially in closed source projects we will
never know about.

>
> What is your concern here?

The concern would be when people would take code and re-incubate it
"at large scale", whatever that means.

But Billie's reply below is showing that they improved Hadoop code
(like I hoped) and are willing to contribute back. (If the code grant
is going through at all, it sounds like a little bit more complicated
than usual.) Hadoop can only benefit from that.

Also, I don't share the concerns discussed over at hbase-dev. How
large the overlap between HBase and Accumulo really is can still be
determined in Incubation. Whether or not they will become two
different projects or one is something that would be decided later in
Incubation.

  Bernd


Re: [PROPOSAL] Accumulo for the Apache Incubator

Posted by Greg Stein <gs...@gmail.com>.

On Sep 4, 2011 3:41 AM, "Bernd Fondermann" <be...@googlemail.com>
wrote:
>...
>
> So, you are saying more than 10% of the non-generated code base (and you
> are not counting lib-style uses/JARs here, right?) is derived from other
> Apache code? That seems to be unusual. Just curious, could you elaborate a
> bit about why you did that and what kind of code that is? Thank you.

You make it sound like deriving from our code base is a bad thing, and
should be justified. I don't get it. That is what we *want* people to do.

What is your concern here?

Cheers,
-g

Re: [PROPOSAL] Accumulo for the Apache Incubator

Posted by Mohammad Nour El-Din <no...@gmail.com>.

+1 on the proposal


On Sun, Sep 4, 2011 at 9:41 AM, Bernd Fondermann
<be...@googlemail.com> wrote:
> On Saturday, September 3, 2011, Adam P Fuchs <ad...@ugov.gov> wrote:
>> Hi Bernd,
>>
>> The latest stable release of Accumulo contains roughly 200,000 lines of
> code, of which about 85,000 are machine generated thrift code. Of the
> remaining code, about 15,000 lines are derived from other Apache projects,
> and about 1,500 of those are derived from HBase code. The code derived from
> HBase comprises a query caching layer (block cache, index cache, multi-level
> LRU logic, etc.).
>
> So, you are saying more than 10% of the non-generated code base (and you are
> not counting lib-style uses/JARs here, right?) is derived from other Apache
> code? That seems to be unusual. Just curious, could you elaborate a bit
> about why you did that and what kind of code that is? Thank you.
>
>  Bernd
>



-- 
Thanks
- Mohammad Nour
----
"Life is like riding a bicycle. To keep your balance you must keep moving"
- Albert Einstein


Re: [PROPOSAL] Accumulo for the Apache Incubator

Posted by Todd Lipcon <to...@cloudera.com>.

On Tue, Sep 6, 2011 at 8:09 AM, Steve Loughran <st...@apache.org> wrote:
>> 1300 lines: heavily modified versions of Hadoop BloomFilters
>
> -any plan to contribute back to hadoop-core, or are they too incompatible
> now?
>
>
>> 419 lines: modified Hadoop TeraSortIngest to sort data using Accumulo
>> 325 lines: our Value is an immutable version of Hadoop BytesWritable
>
> -any plan to contribute back to hadoop-core?
>...
> I understand why you've forked off your own versions of some of the Hadoop
> and HBase core -it is not only your right, it gets the changes in on your
> schedule. I have been known to do this myself.
>

Without derailing this thread too much, just to put things in
perspective: HBase has a fork of Hadoop's IPC. This makes up about
4000 lines of HBase's code. It's not a big deal. That's why we like
the Apache license. Good engineers should always be evaluating the
tradeoffs between staying with mainline and having to maintain a fork
of a particular piece of code. Sometimes the latter makes sense, even
within two closely-related projects.

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera


Re: [PROPOSAL] Accumulo for the Apache Incubator

Posted by Adam P Fuchs <ad...@ugov.gov>.

Hey Steve,

We would like to be able to contribute back where appropriate. We think that our BloomFilter improvements and some of our MapFile improvements are generally useful, and those should be pretty natural contributions back to Hadoop. Other modifications may not be so obviously generally useful, such as hard-coded optimizations for Accumulo. However, it is certainly our goal to reduce unnecessary code forks.

The classloader project was a challenge, and it took us several attempts to get it right. It sure is cool now that it works. We still have a number of tickets on our todo list in this area, like more convenient distribution mechanisms for user-defined functions (i.e. Iterators or Coprocessors) across a Hadoop cluster.

Thanks for the pointers to BigTop and MR-279. Those certainly look promising for better integration with the Apache brand. I'm looking forward to lots of great contributions from the community to the roadmap as Accumulo moves into incubation.

Cheers,
Adam

----- Original Message -----
From: Steve Loughran <st...@apache.org>
To: general@incubator.apache.org
Sent: Tue, 06 Sep 2011 15:09:44 -0000
Subject: Re: [PROPOSAL] Accumulo for the Apache Incubator

On 04/09/11 17:39, Billie J Rinaldi wrote:
> Bernd,
>
> We would divide the derived code into two categories: that which we modified only slightly (for example to allow us to extend it) and that which we modified heavily.  Now that we are able to interact openly, we hope to supply much of that back to the original projects.  There is a detailed overview below.  We identified these by searching for "copyright" in our code.  The total count came to just over 14,000 lines.  We use "heavily" as a qualitative assessment of how much we modified, but we could certainly come up with quantitative assessments.
>
> 5400 lines: slightly modified versions of Hadoop BCFile and related classes
>              (our current file format extends BCFile)
> 4300 lines: heavily modified versions of MapFile and SequenceFile
>              (no longer our default file format, but still included for backward compatibility)

Internal compatibility or external? If internal only I'd keep that out 
of the public codebase.

> 2000 lines: heavily modified versions of HBase BlockCache and related files
>              (Adam didn't count the tests when he said 1500 lines)

+1 for more tests.

> 1300 lines: heavily modified versions of Hadoop BloomFilters

-any plan to contribute back to hadoop-core, or are they too 
incompatible now?

> 419 lines: modified Hadoop TeraSortIngest to sort data using Accumulo
> 325 lines: our Value is an immutable version of Hadoop BytesWritable

-any plan to contribute back to hadoop-core?

> 142 lines: modified ClassLoader based on commons-jci ReloadingClassLoader

classloaders scare me. If we had an ASF-certified-classloader-hacker 
proposal where only approved people could write CLs for ASF code I'd be 
+1 for it, even though I'd fail the test myself.

I understand why you've forked off your own versions of some of the 
Hadoop and HBase core -it is not only your right, it gets the changes in 
on your schedule. I have been known to do this myself.

Ideally those things have to get back to a (future) version of Hadoop, 
which people like Doug and Owen can help with. Having forked code in the 
ASF codebase is something to avoid. Again, I speak from experience.

I think the proposal ought to consider how they fit in with BigTop too, 
so it can be part of the full apache hadoop stack deploy/test process.

I also think that the roadmap for the system may want to think about 
MR-279 integration; would that architecture be a better way to run 
Accumulo code within a Hadoop cluster?

-Steve

(BTW: I'm not going to volunteer as a mentor/committer, my focus is on 
getting back into Hadoop core coding without distractions)


Re: [PROPOSAL] Accumulo for the Apache Incubator

Posted by Steve Loughran <st...@apache.org>.

On 04/09/11 17:39, Billie J Rinaldi wrote:
> Bernd,
>
> We would divide the derived code into two categories: that which we modified only slightly (for example to allow us to extend it) and that which we modified heavily.  Now that we are able to interact openly, we hope to supply much of that back to the original projects.  There is a detailed overview below.  We identified these by searching for "copyright" in our code.  The total count came to just over 14,000 lines.  We use "heavily" as a qualitative assessment of how much we modified, but we could certainly come up with quantitative assessments.
>
> 5400 lines: slightly modified versions of Hadoop BCFile and related classes
>              (our current file format extends BCFile)
> 4300 lines: heavily modified versions of MapFile and SequenceFile
>              (no longer our default file format, but still included for backward compatibility)

Internal compatibility or external? If internal only I'd keep that out 
of the public codebase.

> 2000 lines: heavily modified versions of HBase BlockCache and related files
>              (Adam didn't count the tests when he said 1500 lines)

+1 for more tests.

> 1300 lines: heavily modified versions of Hadoop BloomFilters

-any plan to contribute back to hadoop-core, or are they too 
incompatible now?


> 419 lines: modified Hadoop TeraSortIngest to sort data using Accumulo
> 325 lines: our Value is an immutable version of Hadoop BytesWritable

-any plan to contribute back to hadoop-core?

> 142 lines: modified ClassLoader based on commons-jci ReloadingClassLoader

classloaders scare me. If we had an ASF-certified-classloader-hacker 
proposal where only approved people could write CLs for ASF code I'd be 
+1 for it, even though I'd fail the test myself.

I understand why you've forked off your own versions of some of the 
Hadoop and HBase core -it is not only your right, it gets the changes in 
on your schedule. I have been known to do this myself.


Ideally those things have to get back to a (future) version of Hadoop, 
which people like Doug and Owen can help with. Having forked code in the 
ASF codebase is something to avoid. Again, I speak from experience.

I think the proposal ought to consider how they fit in with BigTop too, 
so it can be part of the full apache hadoop stack deploy/test process.

I also think that the roadmap for the system may want to think about 
MR-279 integration; would that architecture be a better way to run 
Accumulo code within a Hadoop cluster?

-Steve

(BTW: I'm not going to volunteer as a mentor/committer, my focus is on 
getting back into Hadoop core coding without distractions)


Re: [PROPOSAL] Accumulo for the Apache Incubator

Posted by Billie J Rinaldi <bi...@ugov.gov>.

Bernd,

We would divide the derived code into two categories: that which we modified only slightly (for example to allow us to extend it) and that which we modified heavily.  Now that we are able to interact openly, we hope to supply much of that back to the original projects.  There is a detailed overview below.  We identified these by searching for "copyright" in our code.  The total count came to just over 14,000 lines.  We use "heavily" as a qualitative assessment of how much we modified, but we could certainly come up with quantitative assessments.

5400 lines: slightly modified versions of Hadoop BCFile and related classes
            (our current file format extends BCFile)
4300 lines: heavily modified versions of MapFile and SequenceFile
            (no longer our default file format, but still included for backward compatibility)
2000 lines: heavily modified versions of HBase BlockCache and related files
            (Adam didn't count the tests when he said 1500 lines)
1300 lines: heavily modified versions of Hadoop BloomFilters
419 lines: modified Hadoop TeraSortIngest to sort data using Accumulo
325 lines: our Value is an immutable version of Hadoop BytesWritable
142 lines: modified ClassLoader based on commons-jci ReloadingClassLoader

Billie

----- Original Message -----
From: "Bernd Fondermann" <be...@googlemail.com>
To: general@incubator.apache.org
Sent: Sunday, September 4, 2011 3:41:09 AM
Subject: Re: [PROPOSAL] Accumulo for the Apache Incubator

On Saturday, September 3, 2011, Adam P Fuchs <ad...@ugov.gov> wrote:
> Hi Bernd,
>
> The latest stable release of Accumulo contains roughly 200,000 lines of
> code, of which about 85,000 are machine generated thrift code. Of the
> remaining code, about 15,000 lines are derived from other Apache projects,
> and about 1,500 of those are derived from HBase code. The code derived from
> HBase comprises a query caching layer (block cache, index cache, multi-level
> LRU logic, etc.).

So, you are saying more than 10% of the non-generated code base (and you are
not counting lib-style uses/JARs here, right?) is derived from other Apache
code? That seems to be unusual. Just curious, could you elaborate a bit
about why you did that and what kind of code that is? Thank you.

 Bernd


Re: [PROPOSAL] Accumulo for the Apache Incubator

Posted by Bernd Fondermann <be...@googlemail.com>.

On Saturday, September 3, 2011, Adam P Fuchs <ad...@ugov.gov> wrote:
> Hi Bernd,
>
> The latest stable release of Accumulo contains roughly 200,000 lines of
> code, of which about 85,000 are machine generated thrift code. Of the
> remaining code, about 15,000 lines are derived from other Apache projects,
> and about 1,500 of those are derived from HBase code. The code derived from
> HBase comprises a query caching layer (block cache, index cache, multi-level
> LRU logic, etc.).

So, you are saying more than 10% of the non-generated code base (and you are
not counting lib-style uses/JARs here, right?) is derived from other Apache
code? That seems to be unusual. Just curious, could you elaborate a bit
about why you did that and what kind of code that is? Thank you.

 Bernd

Re: [PROPOSAL] Accumulo for the Apache Incubator

Posted by Adam P Fuchs <ad...@ugov.gov>.

Hi Bernd,

The latest stable release of Accumulo contains roughly 200,000 lines of code, of which about 85,000 are machine generated thrift code. Of the remaining code, about 15,000 lines are derived from other Apache projects, and about 1,500 of those are derived from HBase code. The code derived from HBase comprises a query caching layer (block cache, index cache, multi-level LRU logic, etc.).
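
As a rough check on those proportions, the arithmetic can be sketched as below. The variable names are illustrative only, and the inputs are just the approximate figures quoted above, so the percentages are estimates rather than measured values.

```python
# Approximate figures quoted in this message.
total_lines = 200_000        # latest stable release
generated_thrift = 85_000    # machine-generated Thrift code
derived_apache = 15_000      # derived from other Apache projects
derived_hbase = 1_500        # portion of the above derived from HBase

# Code that is not machine generated.
hand_written = total_lines - generated_thrift


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


apache_share = share(derived_apache, hand_written)
hbase_share = share(derived_hbase, hand_written)
print(hand_written, apache_share, hbase_share)
```

On these numbers, roughly 13% of the non-generated code is derived from other Apache projects, and a little over 1% from HBase.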

More broadly, the two systems share common design elements, while many of their advanced features are complementary. For example, the iterator framework in Accumulo and the coprocessor framework in HBase are distinct mechanisms for server-side execution of user-defined functions that can be used to encode different types of applications. The iterator framework provides a unique capability to encode functions (e.g. filtering and aggregation) within the compaction steps that happen in the background of the tablet server/region server, but iterators cannot be used for inter-process communication as easily as coprocessors without introducing the possibility of deadlock.
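
To illustrate the idea of running filtering and aggregation inside a compaction, here is a minimal sketch in Python. It is not the Accumulo iterator API (Accumulo itself is written in Java); the names filter_iter, summing_iter and compact, the "tmp:" key prefix, and the sample data are all hypothetical. It assumes each input file is already sorted by key.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def filter_iter(source, predicate):
    """Pass through only the (key, value) pairs the predicate accepts."""
    return (kv for kv in source if predicate(kv))


def summing_iter(source):
    """Collapse consecutive pairs that share a key into one summed pair.

    Relies on the source being sorted by key, as a merged stream is.
    """
    for key, group in groupby(source, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


def compact(sorted_files, iterator_stack):
    """Merge sorted inputs, then pass the stream through each iterator in order."""
    stream = heapq.merge(*sorted_files, key=itemgetter(0))
    for wrap in iterator_stack:
        stream = wrap(stream)
    return list(stream)


# Two hypothetical sorted files holding (key, count) pairs.
file_a = [("apple", 1), ("cherry", 5), ("tmp:x", 9)]
file_b = [("apple", 2), ("banana", 4)]

stack = [
    # Drop keys carrying the hypothetical "tmp:" prefix.
    lambda s: filter_iter(s, lambda kv: not kv[0].startswith("tmp:")),
    # Sum the values of entries that share a key.
    summing_iter,
]
result = compact([file_a, file_b], stack)
print(result)
```

Each stage only wraps the sorted stream produced by the stage before it, so the work happens while the data is being rewritten rather than in a separate pass.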

In addition to the complementary features, many of the low-level designs of the two projects, while supporting similar functionality, differ in various dimensions of performance. Some examples of this are the way we implement column family partitioning/locality groups, our file selection algorithms for compactions, tablet/region metadata handling, RPC libraries, user-level security, testing suites (which could also be considered complementary), administrative tools, methods of dealing with the Java garbage collector, server-side threading models, client code threading models, file compression, Key classes, and write-ahead logs.

Going forward, both projects are going to be able to adapt complementary aspects of the other (we're already doing this with the query cache, and we are investigating adapting coprocessors from HBase). We see having two systems that are so similar in core functionality but differ in implementation as a great opportunity for empirical exploration of the design space that will benefit both projects. I think that having both projects hosted in Apache gives us more incentive and opportunity to improve API compatibility between the two. If/when we find that the design space exploration has settled, I expect that this will also be the best avenue towards merging the two projects if that becomes the desired goal.

Cheers,
Adam



----- Original Message -----
From: Bernd Fondermann <be...@googlemail.com>
To: general@incubator.apache.org
Sent: Sat, 03 Sep 2011 11:17:10 -0000
Subject: Re: [PROPOSAL] Accumulo for the Apache Incubator

On Friday, September 2, 2011, Billie J Rinaldi <bi...@ugov.gov> wrote:
> Greetings,
>
> I would like to propose Accumulo to be an Apache Incubator project.
> Accumulo is a distributed key/value store that provides expressive
> cell-level access labels and a server-side programming mechanism that can
> modify key/value pairs at various points in the data management process.  It
> is based on Google's BigTable design and runs over Apache Hadoop and
> Zookeeper.

What is the project's relation to HBase? In particular, how much code - if any -
in the Accumulo code base is directly taken from HBase?

Thanks,

 Bernd





Re: [PROPOSAL] Accumulo for the Apache Incubator

Posted by Bernd Fondermann <be...@googlemail.com>.

On Friday, September 2, 2011, Billie J Rinaldi <bi...@ugov.gov> wrote:
> Greetings,
>
> I would like to propose Accumulo to be an Apache Incubator project.
> Accumulo is a distributed key/value store that provides expressive
> cell-level access labels and a server-side programming mechanism that can
> modify key/value pairs at various points in the data management process.  It
> is based on Google's BigTable design and runs over Apache Hadoop and
> Zookeeper.

What is the project's relation to HBase? In particular, how much code - if any -
in the Accumulo code base is directly taken from HBase?

Thanks,

 Bernd



Re: [PROPOSAL] Accumulo for the Apache Incubator

Posted by Owen O'Malley <om...@apache.org>.

On Fri, Sep 2, 2011 at 3:22 PM, Adam P Fuchs <ad...@ugov.gov> wrote:

The project looks interesting.

> I believe the answer is yes regarding the code grant, and I am currently confirming that with our lawyers. We'll get you an official answer early next week.

Great. I know that the US government has its own rules for such
things. I took part in the meetings that created the NASA Open Source
Agreement. (e.g., the lawyers wouldn't let us call it an open source
"license"...) Let us know how it goes.

> The LGPL dependencies are not core to Accumulo, and we're working on substituting other packages. We would have no problem doing this before the initial commit if necessary.

It needs to be cleaned up before release, but the original commit is fine.

-- Owen


Re: [PROPOSAL] Accumulo for the Apache Incubator

Posted by Adam P Fuchs <ad...@ugov.gov>.

Owen,

I believe the answer is yes regarding the code grant, and I am currently confirming that with our lawyers. We'll get you an official answer early next week.

The LGPL dependencies are not core to Accumulo, and we're working on substituting other packages. We would have no problem doing this before the initial commit if necessary. 

Cheers,
Adam

----- Original Message -----
From: Owen O'Malley <om...@apache.org>
To: general@incubator.apache.org
Sent: Fri, 02 Sep 2011 18:36:11 -0000
Subject: Re: [PROPOSAL] Accumulo for the Apache Incubator

Is the NSA going to file a code grant for the project? How deeply
embedded are the LGPL dependencies? Are they optional components or
mandatory?

Thanks,
   Owen


Re: [PROPOSAL] Accumulo for the Apache Incubator

Posted by Owen O'Malley <om...@apache.org>.

Is the NSA going to file a code grant for the project? How deeply
embedded are the LGPL dependencies? Are they optional components or
mandatory?

Thanks,
   Owen


Re: [PROPOSAL] Accumulo for the Apache Incubator

Posted by Benson Margulies <bi...@gmail.com>.

No votes yet, please, except as an informal expression of (un)enthusiasm.

Owen, you raise two questions.

On the subject of grants, please read the IP description in the
proposal again. You can't 'grant' rights to something that neither you
nor anyone else owns. The proposal offers both a preferred alternative
and a backstop.

On the subject of LGPL, I'll leave it to the authors to answer.


On Fri, Sep 2, 2011 at 5:17 PM, Todd Lipcon <to...@cloudera.com> wrote:
> Non-binding +1. Regarding Owen's concern over licenses, if I recall
> correctly, those concerns would block graduation from the incubator,
> but not acceptance to it.
>
> I am also interested in being added as a committer to this proposal.
> As an HBase committer (but not speaking for the project as a whole) I
> think having cross-pollination between the codebases will be
> beneficial to everyone, so I'd like to be involved.
>
> Thanks
> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera

Re: [PROPOSAL] Accumulo for the Apache Incubator

Posted by Todd Lipcon <to...@cloudera.com>.

Non-binding +1. Regarding Owen's concern over licenses, if I recall
correctly, those concerns would block graduation from the incubator,
but not acceptance to it.

I am also interested in being added as a committer to this proposal.
As an HBase committer (but not speaking for the project as a whole) I
think having cross-pollination between the codebases will be
beneficial to everyone, so I'd like to be involved.

Thanks
-Todd

-- 
Todd Lipcon
Software Engineer, Cloudera

Re: [PROPOSAL] Accumulo for the Apache Incubator

Posted by Adam Fuchs <ad...@ugov.gov>.

Hi Owen,

I believe the answer is yes regarding the code grant, and I am currently
confirming that with our lawyers.

The LGPL dependencies are not core to Accumulo, and we're working on
substituting other packages. We would have no problem doing this before the
initial commit if necessary.

Cheers,
Adam
On Sep 2, 2011 11:36 AM, "Owen O'Malley" <om...@apache.org> wrote:
> Is the NSA going to file a code grant for the project? How deeply
> embedded are the LGPL dependencies? Are they optional components or
> mandatory?
>
> Thanks,
> Owen