Posted to mapreduce-dev@hadoop.apache.org by Alex Xia <xx...@attinteractive.com> on 2011/09/14 01:15:13 UTC

Proposal

I am not sure this is the right place to post this. Please leave your comments or let me know
where I should post.

Thanks

Alex

$1 Purpose
    This document proposes some ideas for building a new framework on top of MapReduce.

$2 Issues addressed
    For ETL and analytic purposes, there are some issues in the current development process. I will try to
address the following issues:
    a) Programming MapReduce directly is not very productive, and should not be necessary for business analysts.
    b) There is no job management system. We can use Oozie, but Oozie is not a good job management system.

$3 Proposal Highlights
    I propose a new framework interface and summarize its highlights as follows:
    a) The framework exposes a shell scripting interface, and all programming is done in script.
    b) All data in the interface are objects. Each object exposes its key and value through methods.
    c) The interface exposes SQL-like statements to manipulate objects read from input and written to output.
    d) The interface provides job management functionality.

$4 Possible syntax
    The proposed interface is built on MapReduce. The data in the system can be partitioned into small data sets
and processed as streams of objects. For example, to process a web log, we can partition the traffic by
visitor ID and process the data in parallel.
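    As a rough sketch of this partitioning step (the comma-delimited record layout with the visitor ID
in the first field is an assumption for illustration, not part of the proposal), grouping log lines by
visitor ID could look like:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionExample {
    // Group web-log lines by visitor ID so each partition can be processed in
    // parallel. The first comma-delimited field is assumed to hold the visitor ID.
    static Map<String, List<String>> partitionByVisitor(List<String> logLines) {
        Map<String, List<String>> partitions = new HashMap<>();
        for (String line : logLines) {
            String visitorId = line.split(",")[0];
            partitions.computeIfAbsent(visitorId, k -> new ArrayList<>()).add(line);
        }
        return partitions;
    }
}
```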

a) script
    The interface will have all the usual scripting functionality, such as control flow and loops. It may be
built on "BeanShell".

b) data manipulation
    The interface reads a stream of objects from the low-level framework, manipulates the objects one by one,
and writes a stream of objects back to the low-level framework. It may also partition data into small sets
and group data by key.

    The possible syntax looks like this; there are examples in the following section.
        select {}
        partition by {}
        group key {}
        group value {}

c) job control
    The interface will have job control functionality, such as submitJob, runJob, listJob, and deleteJob.
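    As a hypothetical in-memory sketch of the job-control calls named above (submitJob, listJob,
deleteJob) — the behavior against a real cluster is not specified in this proposal, so this only
illustrates the shape of the API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the proposed job-control interface. Job definitions
// are plain strings and statuses are tracked in memory; a real implementation
// would hand the job off to the cluster.
public class JobManager {
    private final Map<Integer, String> status = new HashMap<>();
    private int nextId = 1;

    public int submitJob(String jobDefinition) {
        int id = nextId++;
        status.put(id, "Running");
        return id;
    }

    public List<Integer> listJob() {
        return new ArrayList<>(status.keySet());
    }

    public void deleteJob(int id) {
        status.remove(id);
    }
}
```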

$5 Examples
    I use the word-count example from MapReduce. This example counts word occurrences in a document.

        setInput(inDir)
        setOutput(outDir)
        select {
            array = SYS.read().split(' ');
            foreach i in array
            {
                obj = new Object();
                obj.setKey(i);
                obj.setValue(1);
                SYS.write(obj);
            }
        }
        // partition by {}
        group key {
            SYS.getObject().getKey();
        }
        group value {
            array = SYS.read();
            obj = array[0];
            obj.setValue(array.size());
            SYS.write(obj);
        };
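    For comparison, the same counting logic in plain Java (a sketch of the logic only, not the proposed
interface) is:

```java
import java.util.HashMap;
import java.util.Map;

public class WordCount {
    // Split on spaces and count occurrences, mirroring the select and
    // "group value" phases of the script above.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.split(" ")) {
            if (word.isEmpty()) continue;   // skip runs of spaces
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }
}
```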

    This example submits two jobs and waits for them.

        define job1 { select ... };    // use the select statement above
        define job2 { select ... };
        id1 = submitJob(job1);
        id2 = submitJob(job2);

        jobList.add(id1);
        jobList.add(id2);

        waitFor(jobList);

        if (id1.status() == "Complete")
            println("complete");
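    A rough analogue of this submit-then-wait pattern using the standard java.util.concurrent API
(an illustration of the pattern only, not the proposed implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class JobWait {
    // submitJob maps to ExecutorService.submit; waitFor maps to calling
    // Future.get on each returned handle.
    static List<String> submitAndWait(List<Callable<String>> jobs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<Future<String>> handles = new ArrayList<>();
        for (Callable<String> job : jobs) {
            handles.add(pool.submit(job));
        }
        List<String> results = new ArrayList<>();
        for (Future<String> handle : handles) {
            results.add(handle.get());   // blocks until this job completes
        }
        pool.shutdown();
        return results;
    }
}
```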