Posted to dev@hive.apache.org by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org> on 2009/05/12 17:21:45 UTC

[jira] Created: (HIVE-480) allow option to retry map-reduce tasks

allow option to retry map-reduce tasks
--------------------------------------

                 Key: HIVE-480
                 URL: https://issues.apache.org/jira/browse/HIVE-480
             Project: Hadoop Hive
          Issue Type: New Feature
          Components: Query Processor
            Reporter: Joydeep Sen Sarma


for long running queries with multiple map-reduce jobs - this should help in dealing with any transient cluster failures without having to re-run all the tasks.

ideally - the entire plan can be serialized out and the actual process of executing the workflow can be left to a pluggable workflow execution engine (since this is a problem that has been solved many times already).
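
a minimal sketch of the serialization half of that idea - the class and method names below are made up for illustration, not actual Hive code:

    // illustrative only: persist the whole compiled plan before running
    // anything, so a pluggable engine could later re-read it and resume
    // from the last completed task instead of re-running everything
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    public class PlanSerializer {
      public static void persist(Serializable plan, String planFile) throws IOException {
        ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(planFile));
        try {
          out.writeObject(plan); // the engine re-reads this to skip finished tasks
        } finally {
          out.close();
        }
      }
    }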

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-480) allow option to retry map-reduce tasks

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718623#action_12718623 ] 

Zheng Shao commented on HIVE-480:
---------------------------------

As a side note, the conf in hadoop is "mapred.max.tracker.failures", which controls the maximum number of permitted failures for each task.
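
For illustration, the knob can be set per job through the old mapred API - the value 4 here is arbitrary:

    // illustrative: setting the conf mentioned above on a JobConf
    import org.apache.hadoop.mapred.JobConf;

    public class RetryConfExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setInt("mapred.max.tracker.failures", 4); // example value only
      }
    }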



[jira] Commented: (HIVE-480) allow option to retry map-reduce tasks

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718911#action_12718911 ] 

Joydeep Sen Sarma commented on HIVE-480:
----------------------------------------

one concern i have is that if the cluster goes down temporarily - then the retries will fail promptly and this fix would serve no purpose.

on the other hand - if the failure is due to genuine problems with the job (like problems in user scripts or bad input etc.) - then we will retry unnecessarily and cause excess load.

we need to think about how to distinguish these cases. in some cases (interactive cli session) - it may be better to leave the decision to the user (give a prompt and ask the user whether they want to retry the job).

ideally - we should be able to do something like this for a non-interactive session as well - but that seems much more complicated (suspending and resuming a query given a queryid)
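
a rough sketch of what such a prompt could look like in the cli driver - not actual Hive CLI code:

    // sketch of an interactive retry prompt for the cli case
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    public class RetryPrompt {
      // returns true if the user wants to re-run the failed stage
      static boolean askUserToRetry() throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        System.out.print("stage failed - retry? (y/n) ");
        String answer = in.readLine();
        return answer != null && answer.trim().equalsIgnoreCase("y");
      }
    }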



[jira] Commented: (HIVE-480) allow option to retry map-reduce tasks

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718622#action_12718622 ] 

Zheng Shao commented on HIVE-480:
---------------------------------

I will add a hive option to control the number of retries.



[jira] Commented: (HIVE-480) allow option to retry map-reduce tasks

Posted by "Prasad Chakka (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718928#action_12718928 ] 

Prasad Chakka commented on HIVE-480:
------------------------------------

if all queries go through the Hive Server, it can figure out when to start queuing queries to be executed when the cluster comes back up.

can we integrate the code that Pete wrote to figure out whether the cluster is up or not into the Hive CLI, so that we can display a neat message to the user that the cluster is not available?
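
For reference, the old mapred API seems to expose enough for a basic liveness probe - a sketch only, with error handling kept minimal:

    // sketch: probe the JobTracker before submitting, so the CLI can print
    // a clean "cluster not available" message instead of a stack trace
    import org.apache.hadoop.mapred.ClusterStatus;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ClusterProbe {
      static boolean clusterIsUp(JobConf conf) {
        try {
          JobClient jc = new JobClient(conf);
          ClusterStatus status = jc.getClusterStatus();
          return status.getTaskTrackers() > 0; // no live trackers ~ cluster down
        } catch (Exception e) {
          return false; // could not reach the JobTracker at all
        }
      }
    }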



[jira] Commented: (HIVE-480) allow option to retry map-reduce tasks

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718990#action_12718990 ] 

Joydeep Sen Sarma commented on HIVE-480:
----------------------------------------

unfortunately the hive server route would almost mean that this jira is dead. even FB hasn't standardized on using it, and other installs may never use it.

is there something short term we can do? for example - taking the current patch and adding a user prompt (except when prompting is disabled or for '-f' execution) would provide a short term solution that may help some subset of users.

another practical solution could be to try and distinguish between communication failures to the JT (which scream for sleep/retry) and failures of the job due to task failure (which means we shouldn't retry automatically). is it not possible to make this distinction at all? (if not - perhaps we can do something on the hadoop side to enable this.)
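
roughly the shape i mean - a sketch only; the mapping of exception types to causes below is an assumption, not verified hadoop behavior:

    // sketch: treat failure to *talk* to the JobTracker as retryable, and a
    // job that ran and failed as non-retryable. the exception taxonomy here
    // is a guess - hadoop may surface both cases as plain IOExceptions.
    import java.io.IOException;
    import java.net.ConnectException;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class FailureClassifier {
      static boolean runOnce(JobConf job) throws IOException {
        try {
          RunningJob rj = JobClient.runJob(job); // blocks until completion
          return rj.isSuccessful();
        } catch (ConnectException e) {
          return retryLater(job, e); // could not reach the JT: sleep/retry
        } catch (IOException e) {
          throw e; // assume the job itself failed (bad input/script): no retry
        }
      }
      static boolean retryLater(JobConf job, Exception cause) {
        return false; // placeholder: caller sleeps and retries
      }
    }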



[jira] Updated: (HIVE-480) allow option to retry map-reduce tasks

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-480:
----------------------------

    Attachment: HIVE-480.1.patch

This patch adds an additional config, "hive.exec.retries.max" (default: 1), to HiveConf and hive-default.xml.
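
A sketch of how a driver loop might consume the new setting - the class and runStage() below are illustrative stand-ins, not code from the patch:

    // sketch only: re-run a failed stage up to hive.exec.retries.max times;
    // runStage() stands in for whatever actually launches the MR job
    import org.apache.hadoop.conf.Configuration;

    public class RetryLoop {
      static void runWithRetries(Configuration conf) throws Exception {
        int maxRetries = Math.max(1, conf.getInt("hive.exec.retries.max", 1));
        Exception last = null;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
          try {
            runStage(); // hypothetical: submit the job and wait
            return;
          } catch (Exception e) {
            last = e; // with the default of 1 the old behavior is preserved
          }
        }
        throw last;
      }
      static void runStage() throws Exception { /* placeholder */ }
    }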



[jira] Commented: (HIVE-480) allow option to retry map-reduce tasks

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718736#action_12718736 ] 

Namit Jain commented on HIVE-480:
---------------------------------

The changes look good - I had a question about the usage. When will the default be greater than 1?
If a long job gets retried after running for 5 hours, it may really increase the load on the cluster.
So, if a cluster is unhealthy for some random reason, retries may inflict further pain on it.

Though this is moot while max retries is 1, since the current behavior is preserved.




[jira] Commented: (HIVE-480) allow option to retry map-reduce tasks

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718884#action_12718884 ] 

Namit Jain commented on HIVE-480:
---------------------------------

a lot of tests failed - can you fix and resubmit the patch?



[jira] Commented: (HIVE-480) allow option to retry map-reduce tasks

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718944#action_12718944 ] 

Namit Jain commented on HIVE-480:
---------------------------------

We should wait for this till we have a Hive Server.

If we have a Hive Server, then we can keep a cache (query -> job file (which contains all map-reduce tasks) + base dependencies with latest timestamps); we could use jdbm for that.

The metastore needs to keep track of the latest modification time of a base object (table/partition), if it does not do so already.

Then we don't need retries - the results will automatically get shared even across multiple users.
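
A sketch of the cache shape - in-memory here, with jdbm imagined only as the persistent backing store; all names below are illustrative:

    // sketch: key a compiled plan by the query text plus the latest
    // modification times of the base objects it reads, so an entry goes
    // stale automatically when any input changes
    import java.util.HashMap;
    import java.util.Map;
    import java.util.SortedMap;

    public class PlanCache {
      private final Map<String, String> cache = new HashMap<String, String>();

      static String key(String query, SortedMap<String, Long> inputModTimes) {
        // inputModTimes: base object (table/partition) -> last-modified time,
        // which the metastore would have to supply
        return query + "|" + inputModTimes.toString();
      }

      String lookupJobFile(String query, SortedMap<String, Long> inputModTimes) {
        return cache.get(key(query, inputModTimes));
      }

      void store(String query, SortedMap<String, Long> inputModTimes, String jobFile) {
        cache.put(key(query, inputModTimes), jobFile);
      }
    }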

