Posted to dev@pig.apache.org by "Richard Ding (JIRA)" <ji...@apache.org> on 2010/07/07 20:14:50 UTC
[jira] Created: (PIG-1483) Add HadoopJobHistoryLoader to the piggybank
Add HadoopJobHistoryLoader to the piggybank
-------------------------------------------
Key: PIG-1483
URL: https://issues.apache.org/jira/browse/PIG-1483
Project: Pig
Issue Type: New Feature
Reporter: Richard Ding
Assignee: Richard Ding
Fix For: 0.8.0
PIG-1333 added many script-related entries to the MR job xml file and thus it's now possible to use Pig for querying Hadoop job history/xml files to get script-level usage statistics. What we need is a Pig loader that can parse these files and generate corresponding data objects.
The goal of this jira is to create a HadoopJobHistoryLoader in piggybank.
Here is an example that shows the intended usage:
*Find all the jobs grouped by script and user:*
{code}
a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) j#'USER' as user, (Chararray) j#'JOBID' as job;
c = filter b by not (id is null);
d = group c by (id, user);
e = foreach d generate flatten(group), c.job;
dump e;
{code}
A couple more examples:
*Find scripts that use only the default parallelism:*
{code}
a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) r#'NUMBER_REDUCES' as reduces;
c = group b by (id, user, script_name) parallel 10;
d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces;
e = filter d by max_reduces == 1;
dump e;
{code}
*Find the running time of each script (in seconds):*
{code}
a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as end;
c = group b by (id, user, script_name);
d = foreach c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start))/1000;
dump d;
{code}
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
Posted by "Richard Ding (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Richard Ding updated PIG-1483:
------------------------------
Summary: [piggybank] Add HadoopJobHistoryLoader to the piggybank (was: Add HadoopJobHistoryLoader to the piggybank)
[jira] Commented: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
Posted by "Richard Ding (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886502#action_12886502 ]
Richard Ding commented on PIG-1483:
-----------------------------------
Usage:
{code}
register piggybank.jar;
A = load '<directory or file>' using org.apache.pig.piggybank.storage.HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
{code}
where j is a map with the following entries:
{code}
JOBID, JOBNAME, CLUSTER, QUEUE_NAME, STATUS, PIG_VERSION, HADOOP_VERSION, USER, USER_GROUP, HOST_DIR,
JOBCONF, PIG_SCRIPT_ID, PIG_SCRIPT,
TOTAL_LAUNCHED_MAPS, TOTAL_MAPS, FINISHED_MAPS, FAILED_MAPS, RACK_LOCAL_MAPS, DATA_LOCAL_MAPS,
TOTAL_LAUNCHED_REDUCES, TOTAL_REDUCES, FINISHED_REDUCES, FAILED_REDUCES,
SUBMIT_TIME, LAUNCH_TIME, FINISH_TIME,
MAP_INPUT_RECORDS, MAP_OUTPUT_RECORDS, MAP_OUTPUT_BYTES,
COMBINE_INPUT_RECORDS, COMBINE_OUTPUT_RECORDS, SPILLED_RECORDS,
REDUCE_SHUFFLE_BYTES, REDUCE_INPUT_GROUPS, REDUCE_INPUT_RECORDS, REDUCE_OUTPUT_RECORDS,
HDFS_BYTES_READ, HDFS_BYTES_WRITTEN, FILE_BYTES_READ, FILE_BYTES_WRITTEN
{code}
m is a map with the following entries:
{code}
MAX_MAP_INPUT_ROWS, MIN_MAP_INPUT_ROWS, MAX_MAP_TIME, MIN_MAP_TIME, AVG_MAP_TIME, NUMBER_MAPS
{code}
r is a map with the following entries:
{code}
AVG_REDUCE_TIME, MAX_REDUCE_TIME, NUMBER_REDUCES, MIN_REDUCE_TIME, MIN_REDUCE_INPUT_ROWS, MAX_REDUCE_INPUT_ROWS
{code}
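As a rough sketch of how the m map could be queried, the script below flags jobs whose slowest map task ran much longer than the average map task. It only uses keys listed above. The input path is the one from the examples in the issue description, the 3x threshold is an arbitrary choice for illustration, and it assumes MAX_MAP_TIME and AVG_MAP_TIME are in the same unit and can be cast to long:
{code}
register piggybank.jar;
a = load '/mapred/history/done' using org.apache.pig.piggybank.storage.HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = foreach a generate (chararray) j#'JOBID' as job, (chararray) j#'STATUS' as status, (long) m#'NUMBER_MAPS' as maps, (long) m#'MAX_MAP_TIME' as max_map_time, (long) m#'AVG_MAP_TIME' as avg_map_time;
-- keep only jobs where both timings are present and the average is positive
c = filter b by max_map_time is not null and avg_map_time is not null and avg_map_time > 0;
-- assumption: "skewed" means the slowest map took more than 3x the average
d = filter c by max_map_time > 3 * avg_map_time;
dump d;
{code}
The same pattern should apply to the r map by swapping in MAX_REDUCE_TIME and AVG_REDUCE_TIME.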
[jira] Commented: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903491#action_12903491 ]
Olga Natkovich commented on PIG-1483:
-------------------------------------
+1, please commit.
[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
Posted by "Richard Ding (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Richard Ding updated PIG-1483:
------------------------------
Attachment: PIG-1483_1.patch
New patch adding unit test.
[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
Posted by "Richard Ding (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Richard Ding updated PIG-1483:
------------------------------
Attachment: PIG-1483.patch
This is the initial patch with a few caveats:
# Each mapper processes only one job history file. This loader will create as many map tasks as the number of files to process.
# It uses _org.apache.hadoop.mapred.DefaultJobHistoryParser_ to parse the job history files. This parser isn't production ready.
[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
Posted by "Richard Ding (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Richard Ding updated PIG-1483:
------------------------------
Attachment: PIG-1483.patch
[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
Posted by "Richard Ding (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Richard Ding updated PIG-1483:
------------------------------
Attachment: (was: PIG-1483.patch)
[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
Posted by "Richard Ding (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Richard Ding updated PIG-1483:
------------------------------
Status: Patch Available (was: Open)
[jira] Commented: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
Posted by "Richard Ding (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886537#action_12886537 ]
Richard Ding commented on PIG-1483:
-----------------------------------
Add these additional entries to the first map:
{code}
PIG_JOB_FEATURE, PIG_JOB_ALIAS, PIG_JOB_PARENTS
{code}
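A minimal sketch of how these entries might be used, counting jobs per PIG_JOB_FEATURE value. The input path is taken from the examples in the issue description. The format of the PIG_JOB_FEATURE value is not described in this thread, so the script treats it as an opaque string:
{code}
register piggybank.jar;
a = load '/mapred/history/done' using org.apache.pig.piggybank.storage.HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
b = foreach a generate (chararray) j#'PIG_SCRIPT_ID' as id, (chararray) j#'JOBID' as job, (chararray) j#'PIG_JOB_FEATURE' as feature;
-- jobs not launched by Pig are expected to have no PIG_JOB_FEATURE entry
c = filter b by feature is not null;
d = group c by feature;
-- number of jobs seen for each distinct feature string
e = foreach d generate group as feature, COUNT(c) as jobs;
dump e;
{code}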
[jira] Commented: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
Posted by "Richard Ding (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904453#action_12904453 ]
Richard Ding commented on PIG-1483:
-----------------------------------
Patch committed to trunk.
[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
Posted by "Richard Ding (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Richard Ding updated PIG-1483:
------------------------------
Status: Resolved (was: Patch Available)
Hadoop Flags: [Reviewed]
Resolution: Fixed