Posted to user@avro.apache.org by Enis Soztutar <en...@gmail.com> on 2010/07/07 19:00:01 UTC

Introducing Gora, an ORM for NoSQL data stores

Hi all,

First off, sorry for cross-posting, but I think the announcement is relevant
to the Hadoop, HBase and Avro communities alike.

We would like to introduce a new project called Gora. Gora is a Java ORM
layer for column stores, SQL databases, key-value stores and document
databases. The design goal is to have a common API to access and manage
multiple data stores. Gora differs from other Java ORM frameworks in that
it gives special focus to column-oriented databases, like Apache HBase and
Apache Cassandra.

Gora is in no way a replacement for Hibernate, DataNucleus or [insert the
name of your favorite ORM project]. But we think we differ from traditional
ORM frameworks in the following ways:
 - Gora is specifically designed with NoSQL data stores in mind. For
example, the API is based on <key, value> pairs, rather than just beans.
Also, we believe that the ORM layer should be tuned for batch operations
(for example, with first-class support for object re-use).
 - Gora uses Avro to generate data beans from Avro schemas. Moreover, most
of the serialization is delegated to Avro. For example, a map is
serialized to a single field (unless configured otherwise) using Avro
serialization.
 - Gora provides first-class support for Hadoop MapReduce. DataStore
implementations are responsible for partitioning the data (which is then
converted to Hadoop Splits), and all the locality information is again
obtained from the data store. Developing MapReduce jobs with Gora is really
easy.
 - The long-term goal for Gora is to be an intermediate data format for
popular big-data and search-related projects. In the medium term, we plan to
support Cassandra, Cascading, Pig and Solr. Think of the possibilities when
you can use the same data structures to persist objects to HBase, SQL and
Solr, and use Pig or Cascading in jobs to mine the data stored in
HBase/Cassandra/SQL/etc.
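
To make the <key, value> point above a bit more concrete, here is a minimal
sketch of what a key-value oriented store API could look like. This is not
Gora's actual API: the names KeyValueStore, InMemoryStore and KeyValueSketch
are made up for illustration, and the in-memory map only stands in for a
real back-end such as HBase or SQL.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical interface (not Gora's real API): objects are addressed by key,
// the way rows are addressed in a column store or key-value store.
interface KeyValueStore<K, V> {
    void put(K key, V value);
    V get(K key);           // returns null when the key is absent
    boolean delete(K key);  // returns true when something was removed
}

// Hypothetical in-memory back-end, used here only so the sketch is runnable.
// A sorted map loosely mimics the sorted row keys of a column store.
class InMemoryStore<K extends Comparable<K>, V> implements KeyValueStore<K, V> {
    private final Map<K, V> rows = new TreeMap<>();

    @Override
    public void put(K key, V value) {
        rows.put(key, value);
    }

    @Override
    public V get(K key) {
        return rows.get(key);
    }

    @Override
    public boolean delete(K key) {
        return rows.remove(key) != null;
    }
}

class KeyValueSketch {
    public static void main(String[] args) {
        KeyValueStore<String, String> store = new InMemoryStore<>();
        store.put("http://example.org/", "Example page");
        System.out.println(store.get("http://example.org/"));
        System.out.println(store.delete("http://example.org/"));
        System.out.println(store.get("http://example.org/"));
    }
}
```

The idea is that application code depends only on the interface, so the
back-end can be swapped without touching the callers.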

Gora works as follows. You define the data structures for your domain using
regular Avro JSON schemas. Then, instead of compiling the Avro files with
Avro's compiler, you compile them with GoraCompiler. The generated classes
keep track of persistence information along with the data. Then, for each
data back-end, you define a mapping file that maps class fields to the
data-store-specific schema. For example, HBase mapping files define the
column families and the mappings from fields to columns or column
families, whereas SQL mapping files define mappings to table fields.
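
As a rough illustration of the two ideas above (beans that track their own
persistence state, and a per-back-end mapping from fields to columns), here
is a hand-written sketch. It is not the output of GoraCompiler and does not
use Gora's mapping file format: PageBean, MappingSketch, the field names and
the "family:qualifier" column names are all assumptions made for the example.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a generated bean. Each setter marks its field as
// dirty, so a store can later write only the fields that actually changed.
class PageBean {
    static final int URL = 0;
    static final int TITLE = 1;
    static final int FIELD_COUNT = 2;

    private final BitSet dirty = new BitSet(FIELD_COUNT);
    private String url;
    private String title;

    void setUrl(String url) { this.url = url; dirty.set(URL); }
    void setTitle(String title) { this.title = title; dirty.set(TITLE); }
    String getUrl() { return url; }
    String getTitle() { return title; }

    boolean isDirty(int field) { return dirty.get(field); }
    void clearDirty() { dirty.clear(); }
}

class MappingSketch {
    // Hypothetical field-to-column mapping for a column store, written as
    // "family:qualifier". A real mapping would be read from a mapping file.
    static Map<String, String> columnMapping() {
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("url", "meta:url");
        mapping.put("title", "content:title");
        return mapping;
    }

    // Collects column -> value pairs for the dirty fields only.
    static Map<String, String> pendingWrites(PageBean bean) {
        Map<String, String> mapping = columnMapping();
        Map<String, String> writes = new LinkedHashMap<>();
        if (bean.isDirty(PageBean.URL)) {
            writes.put(mapping.get("url"), bean.getUrl());
        }
        if (bean.isDirty(PageBean.TITLE)) {
            writes.put(mapping.get("title"), bean.getTitle());
        }
        return writes;
    }

    public static void main(String[] args) {
        PageBean page = new PageBean();
        page.setTitle("Hello");
        // Only the title was set, so only its column is scheduled for writing.
        System.out.println(pendingWrites(page));
    }
}
```

Keeping the mapping outside the bean is what would let the same class be
stored in different back-ends with different schemas.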

Gora started in NutchBase (http://github.com/dogacan/nutchbase), a branch
of Apache Nutch (http://nutch.apache.org/) which is being used as the basis
for what will become Nutch 2.0. For the second version of the popular open
source web search project, an abstraction layer was needed so that the core
data structures for Nutch would no longer be kept as flat files on Hadoop.
We wanted to be able to use popular NoSQL databases (HBase, Cassandra,
Hypertable, etc.), optionally flat files, and SQL databases (especially
embedded zero-conf SQL databases). So Gora as a project was born.

Gora is now at a pre-alpha stage, with a public release planned before the
end of the year. Documentation is also very sparse at this point. However,
the code is already used in NutchBase and will be used in Nutch 2.0. We
currently support HBase, plain Avro data files and SQL. Cassandra support is
coming soon. Of course, the current set of developers is very small, and we
need your help in achieving these goals. So feel free to contribute in any
way you see fit. We believe in the Apache way of development and in fact,
one of the possible paths for Gora is to be accepted as a subproject of
Incubator or Hadoop (we welcome any feedback on this).

Lastly, you can find the project at http://github.com/enis/gora/. Some
example code is at
http://github.com/enis/gora/tree/master/gora-core/src/examples/ and
http://github.com/dogacan/nutchbase.

Feel free to use this list or gora-dev@googlegroups.com for further
discussion.

Thanks,
Enis Söztutar

tl;dr Gora is an ORM layer with a specific focus on NoSQL data stores. It
has HBase, SQL, Avro and MapReduce support.