Posted to common-dev@hadoop.apache.org by "Amar Kamat (JIRA)" <ji...@apache.org> on 2008/11/17 08:48:46 UTC

[jira] Created: (HADOOP-4670) Improve the way job history files are managed

Improve the way job history files are managed
---------------------------------------------

                 Key: HADOOP-4670
                 URL: https://issues.apache.org/jira/browse/HADOOP-4670
             Project: Hadoop Core
          Issue Type: Improvement
          Components: mapred
    Affects Versions: 0.20.0
            Reporter: Amar Kamat
             Fix For: 0.20.0


Today all the job history files are dumped in one _job-history_ folder. This can cause problems when the history folder needs to be searched (job recovery, etc.). It would be nice to group all the jobs under a _user_ folder, so that all the jobs for user _amar_ go in _history-folder/amar/_. Jobs can be categorized by various features such as _jobid, date, jobname_, etc., but using _username_ will make the search much more efficient and will not result in a namespace explosion.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (HADOOP-4670) Improve the way job history files are managed

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HADOOP-4670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat reassigned HADOOP-4670:
----------------------------------

    Assignee: Amar Kamat


[jira] Commented: (HADOOP-4670) Improve the way job history files are managed

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678270#action_12678270 ] 

Amar Kamat commented on HADOOP-4670:
------------------------------------

I had an offline discussion with Devaraj, Hemanth and Sharad. It seems the following structure should solve this issue:
# old history files : path-to-job-history/
# history files for jobtracker on host hostname: path-to-job-history/hostname
# history files for user username using jobtracker running on hostname: path-to-job-history/hostname/username
# job history file format : <start-time>_<jobid>_<jobname>
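
As a rough illustration of the layout listed above, the following Python sketch assembles a history file path from its parts. The function names, the example root directory, and the choice of epoch milliseconds for <start-time> are assumptions made for illustration; the thread does not specify them, and this is not the JobHistory implementation.

```python
import posixpath


def history_file_name(start_time_ms: int, job_id: str, job_name: str) -> str:
    """Build <start-time>_<jobid>_<jobname> (start time assumed to be epoch ms)."""
    return f"{start_time_ms}_{job_id}_{job_name}"


def history_file_path(root: str, hostname: str, username: str,
                      start_time_ms: int, job_id: str, job_name: str) -> str:
    """Build path-to-job-history/hostname/username/<start-time>_<jobid>_<jobname>."""
    file_name = history_file_name(start_time_ms, job_id, job_name)
    return posixpath.join(root, hostname, username, file_name)
```

Note that the sketch does not sanitize job names; a job name containing "/" would need escaping before it could be used as a file name.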

Structuring it further by year, month and day might prove useful, but for now it looks like a premature step. If needed we can add it later. So users who submit jobs at a very high rate will be affected more than users who submit jobs less frequently. Searching will be easier per user.

Future steps:
1) Add date-level info in structuring, or at least in display
2) Add indexing info for faster access/display
3) Provide various views like recent ones, sort by day/week/month/year, jobname (sorting and structuring) etc.
4) Secure access
5) Faster access and analysis (involves changes/tweaks to JobHistory and parsing).

Thoughts?


[jira] Commented: (HADOOP-4670) Improve the way job history files are managed

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648565#action_12648565 ] 

Amar Kamat commented on HADOOP-4670:
------------------------------------

Doug, the search is w.r.t. job recovery. The search we do there is: given a _jobtracker-hostname, job-id, username and job-name_, find the job-history file. The way we do it now is:
- construct a regex using _jobtracker-hostname, job-id, username and job-name_
- construct a path filter that accepts files that match the pattern and reject otherwise
- use the dfs listing api to find out files matching the pattern

This is a costly operation because all the files are scanned linearly. Over time the history folder can grow large, increasing the search time, and every user is hit by this. With the above-mentioned optimization we can reduce the search time for most users.
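
To make the cost difference concrete, here is a small Python sketch of the two searches. It is only a model: plain lists stand in for the DFS listing, and the flat file name layout `<hostname>_<jobid>_<username>_<jobname>` and all function names are hypothetical, chosen for illustration rather than taken from the JobHistory code.

```python
import re


def name_pattern(hostname, job_id, username, job_name):
    """Regex for the hypothetical name <hostname>_<jobid>_<username>_<jobname>."""
    parts = (hostname, job_id, username, job_name)
    return re.compile("_".join(re.escape(p) for p in parts))


def search_flat(listing, hostname, job_id, username, job_name):
    """Single folder: every file name in the listing is tested against the pattern."""
    pattern = name_pattern(hostname, job_id, username, job_name)
    return [name for name in listing if pattern.fullmatch(name)]


def search_per_user(listing_by_user, hostname, job_id, username, job_name):
    """Per-user folders: only the names under the given user's folder are tested."""
    pattern = name_pattern(hostname, job_id, username, job_name)
    user_listing = listing_by_user.get(username, [])
    return [name for name in user_listing if pattern.fullmatch(name)]
```

Both functions return the same matches; the difference is that the second one scans only one user's files, so a user with few jobs is no longer slowed down by everyone else's history.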


[jira] Commented: (HADOOP-4670) Improve the way job history files are managed

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648831#action_12648831 ] 

Doug Cutting commented on HADOOP-4670:
--------------------------------------

> But then I thought it's better to fix/add that when needed.

That's fine with me!


[jira] Commented: (HADOOP-4670) Improve the way job history files are managed

Posted by "Nick Rettinghouse (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678361#action_12678361 ] 

Nick Rettinghouse commented on HADOOP-4670:
-------------------------------------------

We have an extremely high job rate. Sorting by YYYY/MM/DD/HH would be a great help. (We could live with YYYY/MM/DD.)


[jira] Commented: (HADOOP-4670) Improve the way job history files are managed

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678502#action_12678502 ] 

dhruba borthakur commented on HADOOP-4670:
------------------------------------------

The most common case is when a user is looking for the logs of a job that he had submitted earlier. So, your proposal looks good to me. +1

On a general note, it appears that what we are trying to do is to index the metadata of completed jobs for efficient retrieval. Is there any way that Apache Derby (http://db.apache.org/derby/) might help in this regard?
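
To illustrate what indexing job metadata in an embedded database could look like, here is a minimal Python sketch. It uses the standard library's sqlite3 module as a stand-in for Apache Derby, and the table name, column names, and function names are all assumptions made for this example; nothing here reflects an existing Hadoop schema.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) a hypothetical job-history index; SQLite stands in for Derby."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS job_history (
               job_id       TEXT PRIMARY KEY,
               username     TEXT NOT NULL,
               job_name     TEXT NOT NULL,
               start_time   INTEGER NOT NULL,
               history_path TEXT NOT NULL)"""
    )
    # Index that supports the common "my recent jobs" lookup.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_user_time "
        "ON job_history (username, start_time)"
    )
    return conn


def record_job(conn, job_id, username, job_name, start_time, history_path):
    """Insert one completed job's metadata; commits on success."""
    with conn:
        conn.execute(
            "INSERT INTO job_history VALUES (?, ?, ?, ?, ?)",
            (job_id, username, job_name, start_time, history_path),
        )


def jobs_for_user(conn, username):
    """Return a user's history file paths, most recent first."""
    rows = conn.execute(
        "SELECT history_path FROM job_history "
        "WHERE username = ? ORDER BY start_time DESC",
        (username,),
    )
    return [row[0] for row in rows]
```

With such an index the directory layout matters less for lookups, since the path of each history file is stored alongside the fields people search on.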




[jira] Commented: (HADOOP-4670) Improve the way job history files are managed

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648205#action_12648205 ] 

Doug Cutting commented on HADOOP-4670:
--------------------------------------

> using username will make the search much more efficient 

Why is that?  Are most operations that touch the job history user-specific?  I would have guessed that most were rather time-specific, that the most frequent operation would be to browse through the job history by time.  Is that not the case?


[jira] Commented: (HADOOP-4670) Improve the way job history files are managed

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648652#action_12648652 ] 

Doug Cutting commented on HADOOP-4670:
--------------------------------------

> The type of search we do there is given a jobtracker-hostname, job-id, username and job-name [...]

Thanks for the explanation.  In that case, a directory per username probably does make sense.  Really big directories are generally cumbersome, so you might still slice things by date too, either above or below the username, so that even a user who has run lots of jobs won't cause, e.g., huge RPCs for listings.
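
The effect of date slicing on listing size can be sketched in a few lines of Python. The per-day granularity, the function names, and the `date_first` switch (date above or below the username) are assumptions for illustration only; start times are expected to be timezone-aware datetimes.

```python
from collections import Counter
from datetime import datetime, timezone


def directory_for(username, start_time, date_first=False):
    """Directory for a job, sliced by UTC day either above or below the username."""
    day = start_time.astimezone(timezone.utc).strftime("%Y/%m/%d")
    return f"{day}/{username}" if date_first else f"{username}/{day}"


def largest_listing(jobs, date_first=False):
    """Largest number of history files that would land in any single directory.

    `jobs` is an iterable of (username, start_time) pairs.
    """
    counts = Counter(directory_for(user, started, date_first) for user, started in jobs)
    return max(counts.values(), default=0)
```

Either ordering gives leaf directories of the same size; the choice mainly affects whether browsing by user or browsing by date needs fewer listings.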


[jira] Commented: (HADOOP-4670) Improve the way job history files are managed

Posted by "Amar Kamat (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648690#action_12648690 ] 

Amar Kamat commented on HADOOP-4670:
------------------------------------

I too thought of having a second-level categorization based on date/time. But then I thought it's better to fix/add that when needed. For now, fixing the search by _user_ should help us overcome the issue. Let me know if we should also categorize by date in this issue.


[jira] Commented: (HADOOP-4670) Improve the way job history files are managed

Posted by "Tim Williamson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HADOOP-4670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12679964#action_12679964 ] 

Tim Williamson commented on HADOOP-4670:
----------------------------------------

It would be nice if whatever scheme is adopted ensured some upper bound on the number of logs in any single directory.  The YYYY/MM/DD/HH scheme would do that in practice.  And there's no reason it couldn't be:
  user/YYYY/MM/DD/HH
which would have the best of both worlds.
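
A minimal Python sketch of the user/YYYY/MM/DD/HH scheme follows. The function name, the use of epoch milliseconds for the job start time, and the choice of UTC are assumptions for illustration.

```python
from datetime import datetime, timezone


def hourly_history_dir(username: str, start_time_ms: int) -> str:
    """Return user/YYYY/MM/DD/HH for a job start time given in epoch milliseconds (UTC)."""
    started = datetime.fromtimestamp(start_time_ms / 1000, tz=timezone.utc)
    return f"{username}/{started.strftime('%Y/%m/%d/%H')}"
```

For instance, a job submitted by _amar_ at 2008-11-17 08:48:46 UTC would be filed under amar/2008/11/17/08, so any single directory holds at most one user's jobs for one hour.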
