Posted to cvs@incubator.apache.org by Apache Wiki <wi...@apache.org> on 2017/08/02 16:37:47 UTC

[Incubator Wiki] Update of "DRATProposal" by ChrisMattmann

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Incubator Wiki" for change notification.

The "DRATProposal" page has been changed by ChrisMattmann:
https://wiki.apache.org/incubator/DRATProposal?action=diff&rev1=14&rev2=15

- ## page was copied from AsterixDBProposal
- = Apache AsterixDB Proposal =
+ = Apache DRAT Proposal =

== Abstract ==

- Apache AsterixDB is a scalable big data management system (BDMS) that
- provides storage, management, and query capabilities for large
- collections of semi-structured data.
+ Apache Distributed Release Audit Tool (DRAT) is a distributed,
+ parallelized (Map Reduce) wrapper around Apache™ RAT that allows it to
+ complete on large code repositories of multiple file types where Apache™ RAT hangs forever.

== Proposal ==

+ Apache DRAT is a distributed, parallelized (Map Reduce) wrapper around Apache™ RAT (Release Audit Tool). RAT is used to check for proper licensing in software projects. However, RAT takes a prohibitively long time to analyze large code repositories, since it can only run on one JVM. Furthermore, RAT isn't customizable by file type or file size and provides no incremental output. This wrapper dramatically speeds up the process by leveraging Apache™ OODT to parallelize the following components and run them as a workflow:
- AsterixDB is a big data management system (BDMS) that makes it
- well-suited to needs such as web data warehousing and social data
- storage and analysis. Feature-wise, AsterixDB has:

+ * Apache™ Solr based exploration of a CM repository (e.g., Git or SVN) and classification of that repository based on MIME type using Apache™ Tika.
+ * A MIME partitioner that uses Apache™ Tika to automatically deduce and classify by file type and then partition Apache™ RAT jobs into sets of 100 files per type (configurable) -- the M/R "partitioner"
+ * A throttle wrapper that runs Apache™ RAT on each MIME-targeted set of files -- the M/R "mapper"
+ * A reducer to "combine" the produced RAT logs into a global RAT report that can be used for stats generation -- the M/R "reducer"
- * A NoSQL style data model (ADM) based on extending JSON with object database concepts.
- * An expressive and declarative query language (AQL) for querying semi-structured data.
- * A runtime query execution engine, Hyracks, for partitioned-parallel execution of query plans.
- * Partitioned LSM-based data storage and indexing for efficient ingestion of newly arriving data.
- * Support for querying and indexing external data (e.g., in HDFS) as well as data stored within AsterixDB.
- * A rich set of primitive data types, including support for spatial, temporal, and textual data.
- * Indexing options that include B+ trees, R trees, and inverted keyword index support.
- * Basic transactional (concurrency and recovery) capabilities akin to those of a NoSQL store.
-
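
The partitioner, mapper, and reducer roles listed above can be sketched in a few lines of Python. This is only an illustration of the data flow, not DRAT's actual implementation: the function names are hypothetical, `mimetypes.guess_type` stands in for Apache™ Tika detection, and the `has_header` callback stands in for a real Apache™ RAT run over one batch.

```python
from collections import Counter, defaultdict
import mimetypes

BATCH_SIZE = 100  # "sets of 100 files per type (configurable)"


def guess_mime(path):
    """Stand-in for Tika: guess a MIME type from the file name alone."""
    mime, _ = mimetypes.guess_type(path)
    return mime or "application/octet-stream"


def partition_by_mime(paths, detect=guess_mime, batch_size=BATCH_SIZE):
    """Partitioner: group paths by MIME type, then split each group into batches."""
    groups = defaultdict(list)
    for path in paths:
        groups[detect(path)].append(path)
    batches = []
    for mime, members in sorted(groups.items()):
        for start in range(0, len(members), batch_size):
            batches.append((mime, members[start:start + batch_size]))
    return batches


def audit_batch(mime, paths, has_header):
    """Mapper (hypothetical): stand-in for one RAT job over one batch.

    has_header is an assumed callback that says whether a file carries a
    license header; a real mapper would invoke RAT instead.
    """
    counts = Counter()
    for path in paths:
        counts["licensed" if has_header(path) else "unlicensed"] += 1
    return {"mime": mime, "counts": counts}


def combine_logs(logs):
    """Reducer: merge the per-batch logs into one report keyed by MIME type."""
    report = defaultdict(Counter)
    for log in logs:
        report[log["mime"]].update(log["counts"])
    return {mime: dict(counts) for mime, counts in report.items()}
```

Because each batch is audited independently, the mapper calls could be handed to separate workers, and each per-batch log is available as soon as its batch finishes rather than only at the end of the whole run.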

== Background and Rationale ==

+ The Release Audit Tool (RAT) was developed as part of the Apache Software Foundation (ASF) project Apache Creadur, in response to demand from the ASF and its hundreds of projects for a release auditing capability that could be integrated into projects. The primary function of RAT is automated code auditing and open-source license analysis, focusing on headers. RAT is a natural language processing tool written in Java so that it runs easily on any platform, and it audits code written in many source languages (e.g., C, C++, Java, Python). RAT can also be used to add license headers to code that lacks them.
- In the world of relational databases, the need to tackle data volumes
- that exceed the capabilities of a single server led to the
- development of “shared-nothing” parallel database systems several
- decades ago. These systems spread data over a cluster based on a
- partitioning strategy, such as hash partitioning, and queries are
- processed by employing partitioned-parallel divide-and-conquer
- techniques. Since these systems are fronted by a high-level,
- declarative language (SQL), their users are shielded from the
- complexities of parallel programming. Parallel database systems have
- been an extremely successful application of parallel computing, and
- quite a number of commercial products exist today.

+ In the summer of 2013, our team ran Apache RAT on source code produced by the Defense Advanced Research Projects Agency (DARPA) XDATA national initiative, whose inception coincided with the 2012 U.S. Presidential Initiative in Big Data. XDATA brought together 24 performers across academia, private industry, and the government to construct analytics, visualizations, and open source software mash-ups that were transitioned into government projects and to the defense sector. XDATA produced a large Git repository consisting of ~50,000 files and tens of millions of lines of code. DARPA XDATA was launched to build a useful infrastructure for many government agencies and is ultimately an effort to avoid the traditional government-contractor software pipeline, in which additional contracts are required to reuse and unlock software previously funded by the government in other programs.
+ All XDATA software is open source and is ingested into DARPA’s Open Catalog [6], which points to outputs of the program, including its source code and metrics on the repository. Because of this, one of the core products of XDATA is the internal Git repository. Since XDATA brought together open source software from multiple performers, understanding the licenses that the source code used, along with their compatibilities and differences, was extremely important. Because the repository was so large, our strategy was to develop an automated process using Apache RAT.
+ We ran RAT on a 24-core, 48 GB RAM Linux machine at the National Aeronautics and Space Administration (NASA)’s Jet Propulsion Laboratory (JPL) to produce a license evaluation of the XDATA Git repository and to provide recommendations on how the open source software products can be combined to adhere to the XDATA open source policy encouraging permissive licenses. Against our expectations, however, RAT failed to audit XDATA’s large Git repository quickly and successfully. Moreover, RAT provided no incremental output, producing only a final report once a task completed. RAT’s crawler did not automatically distinguish between binary file types and other file types. RAT appeared to perform better when similar files were collected together (e.g., all JavaScript, all C++, all Java) and RAT jobs were then run individually per file type on smaller increments of files (e.g., 100 Java files at a time).
+ The lessons learned navigating these issues motivated us to create “DRAT”, which stands for "Distributed Release Audit Tool". DRAT directly overcomes RAT's limitations and brings code auditing and open source license analysis into the realm of Big Data using scalable open source Apache technologies. DRAT is already being applied and transitioned into government agencies. DRAT currently exists on GitHub under the ALv2.
- In the distributed systems world, the Web brought a need to index and
- query its huge content. SQL and relational databases were not the
- answer, though shared-nothing clusters again emerged as the hardware
- platform of choice. Google developed the Google File System (GFS) and
- !MapReduce programming model to allow programmers to store and process
- Big Data by writing a few user-defined functions. The !MapReduce
- framework applies these functions in parallel to data instances in
- distributed files (map) and to sorted groups of instances sharing a
- common key (reduce) -- not unlike the partitioned parallelism in
- parallel database systems. Apache's Hadoop !MapReduce platform is the
- most prominent implementation of this paradigm for the rest of the
- Big Data community. On top of Hadoop and HDFS sit declarative
- languages like Pig and Hive that each compile down to Hadoop
- !MapReduce jobs.
-
- The big Web companies were also challenged by extreme user bases
- (100s of millions of users) and needed fast simple lookups and
- updates to very large keyed data sets like user profiles. SQL
- databases were deemed either too expensive or not scalable, so the
- “NoSQL movement” was born. The ASF now has HBase and Cassandra, two
- popular key-value stores, in this space. MongoDB and Apache CouchDB are
- other open source alternatives (document stores).
-
- It is evident from the rapidly growing popularity of "NoSQL" stores,
- as well as the strong demand for Big Data analytics engines today,
- that there is a strong (and growing!) need to store, process, *and*
- query large volumes of semi-structured data in many application
- areas. Until very recently, developers have had to ''choose'' between
- using big data analytics engines like Apache Hive or Apache Spark,
- which can do complex query processing and analysis over HDFS-resident
- files, and flexible but low-function data stores like MongoDB or
- Apache HBase. (The Apache Phoenix project,
- http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
- aims to bridge between these choices.)
-
- AsterixDB is a highly scalable data management system that can store,
- index, and manage semi-structured data, e.g., much like MongoDB, but
- it also supports a full-power query language with the expressiveness
- of SQL (and more). Unlike analytics engines like Hive or Spark, it
- stores and manages data, so AsterixDB can exploit its knowledge of
- data partitioning and the availability of indexes to avoid always
- scanning data set(s) to process queries. Somewhat surprisingly, there
- is no open source parallel database system (relational or otherwise)
- available to developers today -- AsterixDB aims to fill this need.
- Since Apache is where the majority of the today's most important Big
- Data technologies live, the ASF seems like the obvious home for a
- system like AsterixDB.

== Current Status ==

+ TBD
- The current version of AsterixDB was co-developed by a team of
- faculty, staff, and students at UC Irvine and UC Riverside. The
- project was initiated as a large NSF-sponsored project in 2009, the
- goal of which was to combine the best ideas from the parallel
- database world, the then new Hadoop world, and the semi-structured
- (e.g., XML/JSON) data world in order to create a next-generation
- BDMS. A first informal open source release was made four years later,
- in June of 2013, under the Apache Software License 2.0.

== Meritocracy ==

+ Current developers for the project are all ASF members, experienced with ASF processes and procedures. We know how to grow an Apache community and to develop a meritocratic free and open source project.
+
- The current developers are familiar with meritocratic open source
- development at Apache. Apache was chosen specifically because we want
- to encourage this style of development for the project.

== Community ==

+ TBD
- While AsterixDB started as a university project it has developed into
- a community. A number of the initial committers started contributing
- in academia and continue to actively participate and contribute after
- graduation. And we seek to further develop developer and user
- communities. One way to broaden the community that is ongoing is
- through academic collaborations (currently with IIT Mumbai in India
- and TU Berlin in Germany). During incubation we will also explicitly
- seek increased industrial participation.
-
- Some indicators of the effort's development community and history can
- be
- found at:
- https://www.openhub.net/p/asterixdb/contributors?query=&sort=commits_12_mo,
- https://www.openhub.net/p/hyracks/contributors?query=&sort=commits_12_mo

== Core Developers ==

+ Mention JPL folks
+ Tyler is at Google
+ Karanjeet formerly of USC + JPL and now Apple
- The core developers of the project are diverse, although initially UC
- Irvine heavy (roughly 50%) due to the project's origins at UCI. The
- other 50% are from other academic institutions (UC Riverside and the
- Hebrew University in Jerusalem) and companies (Couchbase, IBM, KACST
- Saudi Arabia, Oracle, Saudi Aramco, X15 Software).

== Alignment ==
