Posted to user@pig.apache.org by "krokodil@gmail.com" <kr...@gmail.com> on 2010/05/13 20:52:33 UTC

[ANNOUNCE] hamake-2.0b

More than a year after the previous release, I am proud to announce
a new version of HAMAKE. Based on our experience using it, we rewrote
it in Java and added support for Amazon EMR. We also streamlined the
XML syntax and updated and improved the documentation. Please visit
http://code.google.com/p/hamake/ to learn more and to download the new
version.

Brief description:

Most non-trivial data processing scenarios with Hadoop typically
require more than one MapReduce job. Usually such processing is
data-driven, with the data funneled through a sequence of jobs. The
processing model could be presented in terms of dataflow programming.
It could be expressed as a directed graph, with datasets as nodes.
Each edge indicates a dependency between two or more datasets and is
associated with a processing instruction (Hadoop MapReduce job, PIG
Latin script or an external command), which produces one dataset from
the others. Using fuzzy timestamps as a way to detect when a dataset
needs to be updated, we can calculate a sequence in which the tasks
need to be executed to bring all datasets up to date. Jobs for
updating independent datasets could be executed concurrently, taking
advantage of your Hadoop cluster's full capacity. The dependency graph
may even contain cycles, leading to dependency loops which could be
resolved using dataset versioning.
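
To make the idea above concrete, here is a minimal sketch in Python of
timestamp-driven planning over a dataset dependency graph. It is an
illustration only, not HAMAKE's actual implementation: the rule table,
the dataset names, the modification times, and the `slack` parameter
(one possible reading of "fuzzy" timestamps) are all assumptions made
up for this example. It also does not handle cycles or dataset
versioning.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical rules: output dataset -> (input datasets, producing job).
RULES = {
    "sessions": (["raw_logs"], "MapReduce job: sessionize"),
    "users": (["raw_users"], "Pig Latin script: clean users"),
    "report": (["sessions", "users"], "MapReduce job: join and aggregate"),
}

# Hypothetical modification times, in seconds, for each dataset.
MTIMES = {"raw_logs": 100, "raw_users": 50,
          "sessions": 90, "users": 60, "report": 95}


def is_stale(output, inputs, mtimes, slack=0):
    """True if the output is missing or older than any input.

    `slack` is an assumed tolerance: an input must be newer than the
    output by more than `slack` seconds to count as a change.
    """
    out_time = mtimes.get(output)
    if out_time is None:
        return True
    return any(mtimes[i] > out_time + slack for i in inputs)


def plan(rules, mtimes, slack=0):
    """Return the datasets to rebuild, in a valid execution order."""
    graph = {out: set(inputs) for out, (inputs, _) in rules.items()}
    rebuilt = set()
    order = []
    # static_order() yields every dataset after all of its inputs.
    for node in TopologicalSorter(graph).static_order():
        if node not in rules:
            continue  # a source dataset; no job produces it
        inputs, _ = rules[node]
        # Rebuild if an input is about to change or is already newer.
        if any(i in rebuilt for i in inputs) or is_stale(node, inputs, mtimes, slack):
            rebuilt.add(node)
            order.append(node)
    return order


if __name__ == "__main__":
    for dataset in plan(RULES, MTIMES):
        print(dataset, "<-", RULES[dataset][1])
```

In this sketch "users" is left alone because it is newer than its
input, while "report" is rebuilt only because "sessions" is. Datasets
with no path between them (here "sessions" and "users") are the ones
that could be updated concurrently; graphlib's prepare(), get_ready()
and done() methods can be used to walk the graph in such batches.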

These ideas inspired the creation of the HAMAKE utility. We tried to
emphasize data and to let developers express their goals in terms of
dataflow (versus workflow). The data dependency graph is expressed
using just two dataflow instructions, fold and foreach, providing a
clear processing model, similar to MapReduce, but on a dataset level.
Another design goal was to create a simple-to-use utility that
developers can start using right away, without complex installation
or extensive learning.
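
As a rough analogy for the two instructions, the Python sketch below
treats foreach as "run a job on each input item independently" and
fold as "combine many inputs into one output". This is an assumed
reading of the description above, not HAMAKE's XML syntax or API; the
function names mirror the instruction names, and the sample data is
invented for illustration.

```python
def foreach(inputs, job):
    """Apply `job` to every input separately: one output per input."""
    return {name: job(data) for name, data in inputs.items()}


def fold(inputs, job):
    """Apply `job` once to all inputs together: one combined output."""
    return job(list(inputs.values()))


# Hypothetical per-day log datasets, held in memory for the example.
daily_logs = {
    "2010-05-11": ["a", "b"],
    "2010-05-12": ["c"],
}

# foreach: count the records of each day independently.
daily_counts = foreach(daily_logs, len)

# fold: combine the per-day counts into a single total.
total = fold(daily_counts, sum)
```

The parallel with MapReduce is that foreach behaves like a map over
datasets and fold like a reduce, only the units are whole datasets
rather than individual records.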

Key Features

* Lightweight utility - no need for complex installation
* Based on the dataflow programming model
* Easy to learn
* Supports Amazon Elastic MapReduce
* Runs MapReduce jobs as well as PIG Latin scripts

Sincerely,
Vadim Zaliva