Posted to dev@pig.apache.org by Olga Natkovich <ol...@yahoo-inc.com> on 2007/11/08 00:25:02 UTC

User requirements for adding hadoop streaming to Pig

Hi,
 
We got some requirements from a Hadoop Streaming user that would like to
see streaming available in Pig. Here are his requirements:
 
================================================================
 
Features that will help to make Pig the standard, preferred way to
use Hadoop.

We need to look at 3 sets of features:

-- immediate added value -- things that become easy (e.g., being able
to calculate the total deposit amount)

-- cannot do without -- things users already depend on (e.g., having to
do OCR in order to get the deposit amount at all)

-- real added value -- things that are very hard or impossible to do
otherwise

The "immediate added value" is

-- to be able to fully specify a hadoop-based (probably consisting of
multiple steps) in a single script, without writing many redundant and
cryptic character sequences.

-- the data processing algorithms themselves are expressed in any
language of the suer choice, without any adaptation to Hadoop.
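
For illustration only (no such syntax exists in Pig yet, and all file
and command names here are made up), a multi-step job whose processing
steps are plain Unix commands might look something like this:

    A = LOAD '/data/scans/checks';
    B = STREAM A THROUGH `perl extract_amount.pl`;  -- step 1: extraction in any language
    C = GROUP B BY $0;                              -- group by account
    D = FOREACH C GENERATE group, SUM(B.$1);        -- step 2: total deposit amount
    STORE D INTO '/data/work/deposit_totals';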

The "cannot do without" features the things that Hadoop Streaming users
are already using -- either provided by the current infrastructure, or
by tools and hacks they have put together themselves.

1.

-- a simple streaming program does not use any extra concepts

-- specifying a sequence of steps in one script (mostly there)

-- input and output data set (directory) names should not need to be
hard-coded in the Pig script. It should be possible to combine
configuration parameters to define the input and working directory names
in DFS (see the sketch after this list)

-- error checking -- stop the execution if a step failed (this may be
non-trivial in the case of streaming, as a streaming step may have its
own way to indicate a failure)

-- meaningful but overridable defaults (a lot of specifics -- needs a
separate discussion)
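
As a sketch of what parameterized names could look like (the -param
style of invocation is hypothetical here, as are all the names):

    -- invoked as: pig -param root=/data -param date=20071107 deposits.pig
    A = LOAD '$root/input/$date';
    B = STREAM A THROUGH `deposit_filter.pl`;
    STORE B INTO '$root/work/$date/filtered';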

2. "stderr"

-- available before the job (the step) has completed

-- configurable -- by default only some standard environment summary
(command, input name, available disk space, start & end time, number of
processed records) plus all of the user's stderr goes there

-- available in DFS (the name has a useful default -- e.g. based on the
name of the output)

-- (advanced) deliver the error messages -- e.g. syntax errors -- to the
client
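
One hypothetical way to expose this is an option on the command
definition, e.g. (keyword and defaults made up):

    DEFINE filter `deposit_filter.pl`
        SHIP('deposit_filter.pl')
        STDERR('/logs/$date/filter' LIMIT 100);  -- keep first 100 stderr lines per task in DFS
    B = STREAM A THROUGH filter;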

3. Map / reduce command

-- may be a command line (not just "executable")

-- command line parameters may be calculated within the Pig script; in
particular, they may come from the command line that invoked the Pig
script

-- the command may have multiple levels of quotes and other special
characters

-- the step can be defined by an existing Java class, rather than by a
Unix command
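
A sketch covering these points -- a full command line with arguments,
nested quoting, and a parameter passed through from the Pig invocation
(all names hypothetical):

    -- invoked as: pig -param dpi=300 ocr.pig
    DEFINE ocr `ocr.py --lang "en US" --dpi $dpi` SHIP('ocr.py');
    B = STREAM A THROUGH ocr;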

4. input

-- allow the input to be passed through without any transformation

-- allow any existing InputFormat class to be used as the input
transformation

-- the program may require taking its input from a named file, rather
than stdin (see the sketch below)

-- secondary sort
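
For example, the input side might look like this (the loader class and
file names are made up):

    A = LOAD '/data/scans' USING SequenceFileLoader();  -- hypothetical InputFormat-backed loader
    DEFINE extract `extract.pl page.dat`
        INPUT('page.dat' USING PigStorage('\t'));       -- feed a named file instead of stdin
    B = STREAM A THROUGH extract;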

5. output

-- the program may write its main output to a named "file", rather than
to standard output

-- secondary output

-- sorted output in a single file
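
A corresponding sketch for the output side (again, all names are
hypothetical):

    DEFINE summarize `summarize.pl main.out debug.out`
        OUTPUT('main.out' USING PigStorage('\t'),  -- main output read back from a named file
               'debug.out');                       -- secondary output
    B = STREAM A THROUGH summarize;
    C = ORDER B BY $0;                             -- one reducer would give a single sorted file
    STORE C INTO '/data/work/summary';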

6. files

-- the files that should be shipped together with the executables for a
given step can be specified inside the Pig script
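
E.g., the command definition could list the helper files to ship
alongside the executable (hypothetical syntax):

    DEFINE ocr `ocr.py` SHIP('ocr.py', 'dictionary.txt');  -- copied to every task
    B = STREAM A THROUGH ocr;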

REAL ADDED VALUE FEATURES

7. Efficient joins -- for all important variations of join
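
E.g., a plain equi-join is straightforward to express; the variations
(outer joins, joins against a small in-memory table, etc.) are what need
efficient support (relation and field names made up):

    C = JOIN deposits BY acct, accounts BY acct;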

8. Metadata about files (schema)
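
E.g., a schema could be declared once at load time rather than
re-derived in every script (names illustrative):

    A = LOAD '/data/deposits' AS (acct, amount, date);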

9. Support for counters