Posted to dev@hive.apache.org by "Carl Steinbach (JIRA)" <ji...@apache.org> on 2010/01/27 05:58:34 UTC

[jira] Created: (HIVE-1107) Generic parallel execution framework for Hive (and Pig, and ...)

Generic parallel execution framework for Hive (and Pig, and ...)
----------------------------------------------------------------

                 Key: HIVE-1107
                 URL: https://issues.apache.org/jira/browse/HIVE-1107
             Project: Hadoop Hive
          Issue Type: New Feature
          Components: Query Processor
            Reporter: Carl Steinbach


Pig and Hive each have their own libraries for handling plan execution. As we prepare to invest more time in improving Hive's plan execution mechanism, we should also start to consider ways of building a generic plan execution mechanism that can support the needs of Hive and Pig, as well as other Hadoop data flow programming environments.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1107) Generic parallel execution framework for Hive (and Pig, and ...)

Posted by "Russell Jurney (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888870#action_12888870 ] 

Russell Jurney commented on HIVE-1107:
--------------------------------------

At Jeff's suggestion, my comments on this ticket for Hive and Pig follow.

Oozie has been suggested as a solution to this ticket, but in my opinion it is far too complex to be appropriate for Pig or Hive.  A scheduler should not be more complex than the language it schedules, and Oozie is more complex than Pig and Hive put together.  Compare their manuals, both in terms of length and readability.  Furthermore, Oozie is (nearly?) Turing-complete XML, not an easily human-readable script, and scheduling one job takes far too much of it.

Pig and Hive aim to deliver simplicity and accessibility.  In time Oozie may mature, but it is not there yet.  The features are present, but the open source interface is extremely raw.  The only simple interface to Oozie is a proprietary GUI.  Perhaps the next major release will be an improvement.

A tight binding between these projects would cause LinkedIn problems, as we use Azkaban to schedule Pig jobs.  Scheduling a job in Azkaban consists of creating a zip file of your job's content, inserting a very brief config (typically 3-6 lines), and issuing a one-line command.  The web interface to Azkaban is free.  This makes it a more appropriate choice for this ticket than Oozie, but making Azkaban tightly bound to Pig would be a terrible idea too.

We should be very careful about adding enterprise baggage to these tools that is simply not needed by the vast majority of users.  Convention over configuration is at the core of Pig and Hive.  Let's not spoil that.


[jira] Commented: (HIVE-1107) Generic parallel execution framework for Hive (and Pig, and ...)

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12805351#action_12805351 ] 

Zheng Shao commented on HIVE-1107:
----------------------------------

Hadoop has the JobControl classes, which can be generalized to support our needs.

The current major limitations of JobControl are:
1. No way to add jobs that are non-MapReduce. Hive has a lot of other jobs as well, including MoveTask, etc.
2. No way to serialize the jobs and resume the progress at a later time.
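
To make the two limitations concrete, here is a minimal Python sketch of a controller that lifts both of them. It is not Hadoop's JobControl API; the names `Task`, `Controller`, `checkpoint`, and `resume` are hypothetical and exist only for this illustration. A task wraps any callable (so a file move is treated the same as a MapReduce job), and the set of completed task names can be serialized and used to resume later.

```python
import json


class Task:
    """A unit of work. `action` is any callable: it could submit a
    MapReduce job or perform a local step such as moving a file."""

    def __init__(self, name, action, depends_on=()):
        self.name = name
        self.action = action
        self.depends_on = tuple(depends_on)


class Controller:
    """Hypothetical controller; not part of Hadoop."""

    def __init__(self, tasks, completed=()):
        self.tasks = {t.name: t for t in tasks}
        self.completed = set(completed)

    def run(self):
        # Keep sweeping over the tasks until a full pass makes no progress.
        progressed = True
        while progressed:
            progressed = False
            for task in self.tasks.values():
                if task.name in self.completed:
                    continue
                if all(dep in self.completed for dep in task.depends_on):
                    task.action()  # an exception here aborts the run
                    self.completed.add(task.name)
                    progressed = True

    def checkpoint(self):
        # Serialize progress as a JSON list of completed task names.
        return json.dumps(sorted(self.completed))

    @classmethod
    def resume(cls, tasks, checkpoint):
        # Rebuild a controller that skips tasks recorded in the checkpoint.
        return cls(tasks, completed=json.loads(checkpoint))
```

If a task fails partway through, `checkpoint()` still records the tasks that finished, and `Controller.resume(tasks, saved)` reruns only the remainder. A real implementation would also need to persist the task definitions themselves, which this sketch leaves out.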



[jira] Commented: (HIVE-1107) Generic parallel execution framework for Hive (and Pig, and ...)

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889067#action_12889067 ] 

Carl Steinbach commented on HIVE-1107:
--------------------------------------

> The only simple interface to Oozie is a proprietary GUI.

Which Oozie GUI are you talking about? Can you provide a link? I'd really like to check this out.



[jira] Commented: (HIVE-1107) Generic parallel execution framework for Hive (and Pig, and ...)

Posted by "Jeff Hammerbacher (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12888886#action_12888886 ] 

Jeff Hammerbacher commented on HIVE-1107:
-----------------------------------------

Russell,

Let's not focus too hard on the name of the particular workflow execution engine.

The idea here is that a program of some sort (Hive query or set of Pig statements) must be processed and a physical plan of MapReduce operators produced. Once you have a DAG of operators to carry out, you need:

1) A way to serialize and exchange this DAG (e.g. Avro, JSON, XML)
2) A service to execute the DAG and ensure it runs to completion

Of course, things aren't this simple; for example, we need a consistent way to handle side data generated by an operator.
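
As a rough illustration of the two requirements listed above, the following Python sketch serializes a small operator DAG to JSON and then executes it in dependency order. The plan layout (`op`, `inputs`) and the function names `serialize` and `execute` are assumptions made for this example; they do not correspond to any format used by Hive, Pig, or Oozie. Retries, monitoring, and side data are deliberately left out.

```python
import json
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# A hypothetical physical plan: each operator lists the operators it reads from.
PLAN = {
    "scan_users": {"op": "map", "inputs": []},
    "scan_clicks": {"op": "map", "inputs": []},
    "join": {"op": "reduce", "inputs": ["scan_users", "scan_clicks"]},
    "aggregate": {"op": "reduce", "inputs": ["join"]},
}


def serialize(plan):
    """Requirement 1: turn the DAG into an exchangeable string (JSON here)."""
    return json.dumps(plan, sort_keys=True)


def execute(serialized, run_operator):
    """Requirement 2: run every operator after all of its inputs.

    `run_operator(name, node)` is supplied by the caller. Returns the
    order in which operators were run. A cyclic plan raises
    graphlib.CycleError before any operator is run.
    """
    plan = json.loads(serialized)
    graph = {name: set(node["inputs"]) for name, node in plan.items()}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        run_operator(name, plan[name])
    return order
```

Because the executor only sees the serialized plan, any front end that can emit the same structure could share it, which is the reuse argument being made here.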

The goal of this proposal was to encourage Hive and Pig to target the same plan serialization format so that a single plan execution engine could be used. That way, work that is done on monitoring, capturing metadata from, and ensuring the reliability of multi-stage DAGs of MapReduce can be reused rather than reimplemented in each system.

Some arguments against this idea: component modularity can introduce inefficiencies, may make the overall system feel more complex, and does not deliver user-visible features despite the large effort required for implementation.

I believe the convergence of Pig and Hive on this front would be beneficial to the larger Hadoop community, but it's a large undertaking, and each organization has its own goals for its infrastructure.

Later,
Jeff
