Posted to mapreduce-issues@hadoop.apache.org by "Ahmed Radwan (JIRA)" <ji...@apache.org> on 2012/05/24 00:50:41 UTC

[jira] [Created] (MAPREDUCE-4284) Allow setting yarn.nodemanager.delete.debug-delay-sec on a per-job basis

Ahmed Radwan created MAPREDUCE-4284:
---------------------------------------

             Summary: Allow setting yarn.nodemanager.delete.debug-delay-sec on a per-job basis
                 Key: MAPREDUCE-4284
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4284
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mrv2
            Reporter: Ahmed Radwan
            Assignee: Ahmed Radwan


The yarn.nodemanager.delete.debug-delay-sec property is helpful in debugging jobs (inspecting container logs/local dirs after the job finishes). Currently it is a nodemanager property, and changing it requires restarting the nodemanager. In a production cluster this can be a real problem. It would be better to allow setting this property on a per-job basis, without requiring a restart of the nodemanagers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAPREDUCE-4284) Allow setting yarn.nodemanager.delete.debug-delay-sec on a per-job basis

Posted by "Ahmed Radwan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283821#comment-13283821 ] 

Ahmed Radwan commented on MAPREDUCE-4284:
-----------------------------------------

Thanks Arun, 

Let me add more details. I think it's not just the tasklogs, and this is why this property exists. We have seen cases where inspecting the contents of the containers' localized file directories and log directories was extremely useful in troubleshooting problems (e.g. AM failure to start issues).

I think easily controlling this property is equally important in production clusters. Consider the following scenario:

* A job failing on a production cluster.
* Tasklogs are not showing much, and it is required to inspect the containers' files for any clues.
* It is now required to change this configuration property (e.g. set it to 1 day) and restart every NM in the cluster (see how expensive this is).
* The problem for this job is solved, but now these directories are kept for every submitted job, which is an unneeded and expensive storage problem. To solve that, we need to change back the property and restart NMs on all nodes again.

Also, thinking about this issue more: YARN is a general framework, and applications other than MapReduce need to be considered, along with their ability to hint to YARN to keep these files. So we can't generalize assumptions about information available through specific application services (e.g. MapReduce JobHistoryServer). I think the new proposed property above can be generalized across applications (or the Application interface could be extended).

bq. Your proposal doesn't work because the NodeManager doesn't load jobConf of the container... this would require changes to ContainerManager protocol.

Yes, I only wrote how the new delay will be calculated, but how this new jobConf property is communicated to the DeletionService will require more changes as you highlighted. The question here is whether the added benefit outweighs the effort of these extra changes. Thoughts?
                

        

[jira] [Commented] (MAPREDUCE-4284) Allow setting yarn.nodemanager.delete.debug-delay-sec on a per-job basis

Posted by "Ahmed Radwan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282888#comment-13282888 ] 

Ahmed Radwan commented on MAPREDUCE-4284:
-----------------------------------------

Arun & Tucu,

What about the following approach:

1- We'll add a new boolean property (e.g. mapreduce.job.debug.delete) whose default value is true. This property can be set by the user when submitting the job.

2- Then, in the NM DeletionService, the effective delay before deleting will be:

{code}
// Delete immediately unless the job opts out; only then apply the NM-wide debug delay.
int effDelay = jobConf.getBoolean("mapreduce.job.debug.delete", true) ? 0 : conf.getInt(YarnConfiguration.DEBUG_NM_DELETE_DELAY_SEC, 0);
{code}

This will guarantee that the value set in the NM property yarn.nodemanager.delete.debug-delay-sec will only apply to jobs where the user explicitly set mapreduce.job.debug.delete to false.
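
To make the rule concrete, here is a minimal sketch of the calculation that runs without a cluster. It is an illustration only: plain maps stand in for jobConf and the NM configuration, the class and method names are hypothetical, and mapreduce.job.debug.delete is the property proposed in this comment, not an existing one. The sketch also leaves open how the NM would obtain the job's setting.

```java
import java.util.Map;

// Hypothetical stand-in for the proposed calculation; plain maps replace the
// Hadoop Configuration objects so the sketch has no Hadoop dependency.
final class EffectiveDelaySketch {
    // Proposed per-job flag (an assumption of this proposal, not an existing property).
    static final String JOB_DEBUG_DELETE = "mapreduce.job.debug.delete";
    // Existing NM-wide property discussed in this issue.
    static final String NM_DEBUG_DELAY_SEC = "yarn.nodemanager.delete.debug-delay-sec";

    static int effectiveDelaySec(Map<String, String> jobConf, Map<String, String> nmConf) {
        // The flag defaults to true, i.e. delete as soon as the job finishes.
        boolean deleteImmediately =
            Boolean.parseBoolean(jobConf.getOrDefault(JOB_DEBUG_DELETE, "true"));
        // The NM-wide delay defaults to 0 seconds.
        int nmDelaySec = Integer.parseInt(nmConf.getOrDefault(NM_DEBUG_DELAY_SEC, "0"));
        return deleteImmediately ? 0 : nmDelaySec;
    }

    public static void main(String[] args) {
        Map<String, String> nmConf = Map.of(NM_DEBUG_DELAY_SEC, "86400");
        // Job that does not set the flag: files are deleted immediately.
        System.out.println(effectiveDelaySec(Map.of(), nmConf)); // prints 0
        // Job that opts out of immediate deletion: the NM-wide delay applies.
        System.out.println(effectiveDelaySec(Map.of(JOB_DEBUG_DELETE, "false"), nmConf)); // prints 86400
    }
}
```

With this rule a job can only opt in to the delay the administrator already configured; it cannot choose a longer one.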

This approach will address Arun's concerns about misusing the property and filling the NM disks, and at the same time prevent flooding the nodes with files regardless of whether the user submitting the job cares or not (as Tucu highlighted).

What do you think?

                

        

[jira] [Commented] (MAPREDUCE-4284) Allow setting yarn.nodemanager.delete.debug-delay-sec on a per-job basis

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283340#comment-13283340 ] 

Arun C Murthy commented on MAPREDUCE-4284:
------------------------------------------

Overall, I'm still unclear about why we need this when we already have tasklogs in HDFS and they are accessible via JobHistory etc. YARN/MapReduce really needs to stay out of managing local-storage for logs.
                

        

[jira] [Resolved] (MAPREDUCE-4284) Allow setting yarn.nodemanager.delete.debug-delay-sec on a per-job basis

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy resolved MAPREDUCE-4284.
--------------------------------------

    Resolution: Invalid
    

        

[jira] [Commented] (MAPREDUCE-4284) Allow setting yarn.nodemanager.delete.debug-delay-sec on a per-job basis

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282097#comment-13282097 ] 

Arun C Murthy commented on MAPREDUCE-4284:
------------------------------------------

Doesn't make sense. This opens up all sorts of attack vectors and leads to cluster instability if you allow jobs to affect NMs. -1.
                

        

[jira] [Commented] (MAPREDUCE-4284) Allow setting yarn.nodemanager.delete.debug-delay-sec on a per-job basis

Posted by "Ahmed Radwan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282103#comment-13282103 ] 

Ahmed Radwan commented on MAPREDUCE-4284:
-----------------------------------------

Arun, the idea here is to allow the option of inspecting container logs/local dirs without the need to change a NM property which will require restarting all NMs in the whole cluster (which doesn't seem feasible on a real cluster). Also it is not useful to generalize the behavior for all jobs when the requirement is to inspect a single failing job for example.

I am contrasting this behavior with the older behavior we had with the keep.failed.task.files property, which was per-job. What do you think? What types of attacks are you worried about?
                

        

[jira] [Commented] (MAPREDUCE-4284) Allow setting yarn.nodemanager.delete.debug-delay-sec on a per-job basis

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13283335#comment-13283335 ] 

Arun C Murthy commented on MAPREDUCE-4284:
------------------------------------------

@Tucu:

bq. Still, I would say this is a property to be used in development clusters.

If this is only needed for development clusters, then we just use the global setting and make it very high (e.g. 3 days).

bq.  Or, in order to make it more production friendly there should be a MAX_TIME_TO_KEEP_FILES property in the NM and jobs can set any value up to that time.

Then you pretty much have to have a limit on file-sizes, number of files etc. which leads exactly to MAPREDUCE-1100, something which we've been trying to avoid by durably storing logs in HDFS and not on the NM local disk.

----

To recap, if this is just for debugging, we can set the global limit very high and not bother with per-job limits.

IAC, we have all task logs on HDFS - so I really don't see the need to reinvent MAPREDUCE-1100.

----

@Ahmed - Your proposal doesn't work because the NodeManager doesn't load jobConf of the container... this would require changes to ContainerManager protocol.

                

        

[jira] [Commented] (MAPREDUCE-4284) Allow setting yarn.nodemanager.delete.debug-delay-sec on a per-job basis

Posted by "Arun C Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282143#comment-13282143 ] 

Arun C Murthy commented on MAPREDUCE-4284:
------------------------------------------

You easily run into situations where many (if not all) users set this limit very high and start taking up local disk on NMs... then you end up with situations which look like MAPREDUCE-1100. This is precisely the reason we have long-term storage for userlogs on HDFS now, and not on NM local disks.
                

        

[jira] [Commented] (MAPREDUCE-4284) Allow setting yarn.nodemanager.delete.debug-delay-sec on a per-job basis

Posted by "Ahmed Radwan (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAPREDUCE-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13282115#comment-13282115 ] 

Ahmed Radwan commented on MAPREDUCE-4284:
-----------------------------------------

Just to elaborate more: The default for this property is 0, so these container dirs are deleted immediately when the job finishes. In a test cluster we can set the property to a relatively high value to be able to inspect container logs/local dirs. But how can we do that in a production cluster? The problem is that any change in this property will affect all jobs, and the change will require restarting all the NodeManagers in the whole cluster. Both consequences are bad, since keeping all dirs for all jobs is expensive from a storage perspective and restarting the NMs is expensive from an operations perspective.

So one possible solution is to make the scope of this property per-job (or add another per-job property), so that the user can set this value to give a hint to the NM to keep the dirs for this individual job. We can still keep a NodeManager property to override or cap the delay time.

Arun, what do you think?
                

        

[jira] [Reopened] (MAPREDUCE-4284) Allow setting yarn.nodemanager.delete.debug-delay-sec on a per-job basis

Posted by "Alejandro Abdelnur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAPREDUCE-4284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Abdelnur reopened MAPREDUCE-4284:
-------------------------------------------


@Arun, I think there is value in having this per job because the current global setting would flood the nodes with files regardless of whether the user submitting the job cares or not. Still, I would say this is a property to be used in development clusters. In production clusters it can be fixed to ZERO by marking it final in the node site.xml. Or, in order to make it more production friendly, there should be a MAX_TIME_TO_KEEP_FILES property in the NM and jobs can set any value up to that time. Thoughts?
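
As a rough illustration of the capping idea, the sketch below clamps a per-job request to an NM-side maximum. Everything in it is an assumption made for illustration: the class, the method, and the parameter names are hypothetical, and MAX_TIME_TO_KEEP_FILES is only the placeholder name used in this comment, not an existing property.

```java
// Hypothetical sketch of capping a job's requested retention time at an
// NM-side maximum; none of these names exist in Hadoop.
final class RetentionCapSketch {
    // requestedByJobSec: the delay a job asks for (hypothetical per-job value).
    // nmMaxKeepSec: the NM-side cap, i.e. the MAX_TIME_TO_KEEP_FILES placeholder.
    static long effectiveKeepSec(long requestedByJobSec, long nmMaxKeepSec) {
        // Treat negative values as 0 so a bad setting can never extend retention.
        long requested = Math.max(0L, requestedByJobSec);
        long cap = Math.max(0L, nmMaxKeepSec);
        // The job gets what it asked for, but never more than the NM allows.
        return Math.min(requested, cap);
    }

    public static void main(String[] args) {
        System.out.println(effectiveKeepSec(3600L, 86400L));   // prints 3600
        System.out.println(effectiveKeepSec(259200L, 86400L)); // prints 86400
        // A cap of 0 models a production cluster that disables retention entirely.
        System.out.println(effectiveKeepSec(3600L, 0L));       // prints 0
    }
}
```

A cap of zero behaves like the "fixed to ZERO" production setting, so both options mentioned above fit the same rule.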

                