Posted to dev@hive.apache.org by Owen O'Malley <om...@apache.org> on 2015/03/20 19:37:28 UTC
Fwd: Create ORC project

---------- Forwarded message ----------
From: Owen O'Malley <om...@apache.org>
Date: Fri, Mar 20, 2015 at 11:36 AM
Subject: Create ORC project
To: board@apache.org


Board,
  We'd like to create a separate ORC project from the code base that is
currently in Hive.

*Apache ORC for Apache Top Level Project*

*Abstract*

ORC is a fast columnar file format for Apache Hadoop style workloads that
supports columnar projection and pushing filters down in the reader. Both
features can dramatically reduce the number of bytes that need to be read,
decompressed, and deserialized to answer a query.

ORC has been developed and released by Apache Hive, but other projects
wanting to use ORC do not want to depend on Hive’s large jar and dependency
tree. Additionally, a C++ ORC reader and writer are being developed and
would benefit from being released as the same project. The Hive community
believes that further development on ORC can be done better as a separate
project as discussed on the Hive email lists (here
<http://search-hadoop.com/m/8er9LOj9O1/separate+orc&subj=ORC+separate+project>
and here <http://search-hadoop.com/m/8er9WxfWw&subj=Re+Native+ORC>).

*Proposal*

Although the ORC (Optimized Row Columnar) file format was originally
developed in Apache Hive, several forces are encouraging it to move to a
separate project. First, projects both inside and outside of Apache
wish to support it, but do not want to depend on Hive and its large list of
dependencies. Additionally, the Hive community, as a Java project, is not
interested in incorporating the new C++ implementation of ORC into their
code base. Developing and releasing the Java and C++ ORC readers and
writers in the same project will allow them to stay synchronized with each
other and give users a single place to direct questions and file issues.
Moving out of Hive will also allow ORC to support other languages in the
future (Go, etc.), release on a faster release cycle than Hive, and develop
an independent community.

The traditional path at Apache would have been to create an incubator
project, but the code is already being released by Apache and most of the
developers are familiar with Apache rules and guidelines. In particular,
the proposed PMC has 3 Apache members and incubator PMC members from three
companies. They will provide oversight and guidance for the developers that
are less experienced in the Apache Way. Therefore, the ORC project would
like to propose becoming a Top Level Project at Apache.

*Overview of ORC*

Although Hive's RCFile was the standard format for storing tabular data in
Hadoop for several years, it has limitations because it treats each column
as a binary blob without semantics. In 2013, Hive added a new file format
named Optimized Row Columnar (ORC) file that uses and retains the type
information from the table definition. ORC uses type-specific readers and
writers that provide lightweight compression techniques such as dictionary
encoding, bit packing, delta encoding, and run length encoding -- resulting
in dramatically smaller files. Additionally, ORC can apply generic
compression using zlib or Snappy on top of the lightweight compression for
even smaller files. However, storage savings are only part of the gain. ORC
supports projection, which selects subsets of the columns for reading, so
that queries reading only one column read only the required bytes.
Furthermore, ORC files include lightweight indexes that record the
minimum and maximum values for each column in each set of 10,000 rows and
in the entire file. ORC files also have optional bloom filters that provide
fine-grained details of the values in each set of 10,000 rows. Using pushdown
filters from Hive, the file reader can skip entire sets of rows that cannot
match the query.
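
As a rough illustration of how per-group minimum and maximum values let a
reader skip rows, here is a minimal Python sketch. It is not ORC's actual
implementation or API: the names RowGroupStats, build_index, groups_to_read,
and scan are hypothetical, the column is held in memory as a plain list of
integers, and the only filter supported is an inclusive range. Only the
group size of 10,000 rows comes from the description above.

```python
from dataclasses import dataclass
from typing import List

ROWS_PER_GROUP = 10_000  # group size taken from the description above


@dataclass
class RowGroupStats:
    """Min/max summary of one column over one group of rows (hypothetical)."""
    start_row: int
    minimum: int
    maximum: int


def build_index(values: List[int],
                group_size: int = ROWS_PER_GROUP) -> List[RowGroupStats]:
    """Record the minimum and maximum of each consecutive group of rows."""
    index = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        index.append(RowGroupStats(start, min(group), max(group)))
    return index


def groups_to_read(index: List[RowGroupStats], low: int, high: int) -> List[int]:
    """Return the start rows of groups whose [min, max] overlaps [low, high].

    A group whose range does not overlap the filter cannot contain a
    matching row, so it is skipped without being read.
    """
    return [g.start_row for g in index
            if g.maximum >= low and g.minimum <= high]


def scan(values: List[int], index: List[RowGroupStats], low: int, high: int,
         group_size: int = ROWS_PER_GROUP) -> List[int]:
    """Read only the surviving groups, then apply the filter row by row."""
    matches = []
    for start in groups_to_read(index, low, high):
        for v in values[start:start + group_size]:
            if low <= v <= high:
                matches.append(v)
    return matches


# Sorted data makes the effect easy to see: 50,000 rows in 5 groups.
column = list(range(50_000))
index = build_index(column)
selected = groups_to_read(index, 12_000, 12_500)
```

With this sorted example column, the filter 12,000 to 12,500 overlaps only
the group starting at row 10,000, so the other four groups are never
scanned. On unsorted data the ranges overlap more and fewer groups can be
skipped, which is where the optional bloom filters mentioned above can help.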

*Current Status*

*Meritocracy*

ORC has been developed as part of Apache Hive and thus has been operating
as a meritocracy. Many of the developers of ORC are Hive PMC members or
committers. The ORC project plans to continue adding new PMC members and committers
as the project continues to develop.

*Community*

ORC’s development team seeks to foster the development and user
communities. We feel that becoming a separate project will improve both
communities by making them smaller and more focused than Hive’s, and will
bring tighter integration with various Apache projects that either don’t
want to or can’t accept the large list of dependencies from Hive.

*Core Developers*

ORC is being primarily developed by HP, Hortonworks, and Microsoft.
Facebook was instrumental in the early development and is an active user.

*Alignment*

The ASF is a natural host for ORC given that it is already the home of
Hadoop, Pig, Hive, and other emerging distributed computing software
projects. ORC was designed to offer improved storage capability for Hadoop
clusters and query speed on Hive and Pig.

*Known Risks*

*Orphaned Products*

The core developers of the ORC team are actively working on the project and
plan to continue. There is very little risk of ORC getting orphaned since
many large companies are storing their production data in ORC format. For
example, Facebook is using ORC to store tens of petabytes of their
production data (blog
<https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/>
).

*Inexperience with Open Source*

The proposed PMC has extensive experience with Apache projects and includes
3 Apache members and Incubator PMC members. The ORC PMC and the more
experienced committers will be responsible for training the committers that
are less familiar with the Apache Way.

*Homogeneous Developers*

The developers include employees of Facebook, HP, Hortonworks, and
Microsoft, as well as an independent contributor. Apache projects encourage
an open and diverse meritocratic community, and the ORC team is very
motivated to increase the size and diversity of the development team.

*Reliance on Salaried Developers*

Most of the work on ORC has been done by salaried developers, but the hope is
that by making ORC a separate project, it will be more approachable for new
developers including non-salaried developers.

*Relationships with Other Apache Products*

ORC has a strong relationship and integration with Apache Hadoop, Hive, and
Pig. Being independent of Hive will allow other projects to depend on ORC
directly without incurring the cost of depending on the large list of Hive
dependencies.

ORC would like to encourage integration with additional Apache projects:

   - Apache Bigtop
   - Apache Drill
   - Apache Flink
   - Apache Flume
   - Apache Spark

ORC does compete with Parquet, which is also a columnar format that was
released after ORC, and to a lesser extent Avro and Thrift, which are
row-major serialization formats. Apache, as a foundation, does not pick
particular projects among competitors, but rather acts as a support system
for each project’s community.

*An Excessive Fascination with the Apache Brand*
ORC wants to become an Apache project in order to help efforts to diversify
the committer-base, and not to capitalize on the Apache brand. The ORC
project is in production use already inside many large companies and is
already being released by Apache Hive. As such, the ORC project is not
seeking to use the Apache brand as a marketing tool.

*Documentation*
The primary documentation about ORC is located on the Apache Hive wiki
<https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC>.

There have also been presentations on ORC:


● Introduction to ORC files 2012
<http://www.slideshare.net/oom65/orc-fileintro>
● Berlin Buzzwords 2013 <http://www.slideshare.net/oom65/orc-files>
● Hadoop Summit 2013
<http://www.slideshare.net/Hadoop_Summit/hanson-o-malleypandeyjune27425pmroom212?related=1>

*Initial Source*
ORC has been under development as part of Hive since late 2012. The
original inclusion into Hive was via HIVE-3874
<https://issues.apache.org/jira/browse/HIVE-3874>. There are several
implementations that read or write the ORC format:


● Hive reader and writer in Hive subversion
<http://svn.apache.org/repos/asf/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/>
● Presto integrated reader in Presto github
<https://github.com/facebook/presto/tree/master/presto-orc>
● C++ reader and writer in github <https://github.com/hortonworks/orc>

The intent is to pull both the Hive Java reader and writer and the C++
reader and writer into the Apache ORC project.

*External Dependencies*
ORC has the following external dependencies.


● Build tools

   - Apache Maven
   - Gmock
   - JUnit

● Apache

   - Log4j
   - Hadoop

● Non-Apache

   - JDK 1.6+
   - Protobuf
   - Snappy
   - zlib

*Cryptography*
ORC does not currently support encryption, but will eventually support
column encryption.

*Required Resources*
*Mailing Lists*


● private@orc for private PMC discussions (with moderated subscriptions)
● dev@orc
● user@orc
● commits@orc

*Version Control*
Git is the preferred source control system.

*Issue Tracking*
ORC will need a JIRA instance.

*Other Resources*
The existing code already has unit tests so we will make use of existing
Apache continuous testing infrastructure. The resulting load should not be
very large.

*Initial PMC*


● Chris Douglas <cdouglas at apache.org> (Apache member, Incubator & Hadoop
PMC)
● Alan Gates <gates at apache.org> (Apache member, Incubator, Hive & Pig
PMC)
● Prasanth Jayachandran <prasanthj at apache dot org> (Hive PMC)
● Lefty Leverenz <leftyleverenz at gmail dot com> (Hive PMC)
● Owen O’Malley <omalley at apache dot org> (Apache member, Incubator,
Hadoop, & Hive PMC)

We’d like to propose Owen O’Malley as the initial VP for the ORC project.

*Initial Committers*


● Thanh Do <thdo at microsoft dot com>
● Gunther Hagleitner <gunther at apache dot org> (Hive PMC)
● Pavan Lanka <pavibhai at gmail dot com>
● Aliaksei Sandryhaila <aliaksei.sandryhaila at hp dot com>
● Sergey Shelukhin <sershe at apache dot org> (Hive PMC)
● Dain Sundstrom <dain at fb dot com>
● Gopal Vijayaraghavan <gopalv at apache dot org> (Hive committer)
● Stephen Walkauskas <stephen.walkauskas at hp dot com>
● Kevin Wilfong <kevinwilfong at fb dot com> (Hive PMC)
● Jing Xu <jing.xu2 at hp dot com>
● Xuefu Zhang <xzhang at cloudera dot com>

*Affiliations*
The initial PMC members are employed by Doc of the Bay, Hortonworks, and
Microsoft. The initial committers are employed by Cloudera, Facebook, HP,
Hortonworks, and Microsoft, and include one independent contributor.