You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by "Olga Natkovich (JIRA)" <ji...@apache.org> on 2007/11/02 03:51:50 UTC

[jira] Created: (PIG-12) Please add timestamps to pig map/reduce progress messages

Please add timestamps to pig map/reduce progress messages
---------------------------------------------------------

                 Key: PIG-12
                 URL: https://issues.apache.org/jira/browse/PIG-12
             Project: Pig
          Issue Type: Improvement
            Reporter: Olga Natkovich
            Priority: Minor


>From one of the users: 
------------------------------
I'm spending a lot of time trying to optimize my pig queries for short
run-times.  This process would be much easier if, in the progress output
from pig (currently on stdout, but hopefully soon moving to  
stderr?!), the
initiation and completion of each map/reduce job could be  
timestamped.  Pig
already spits out messages of the form "----- MapReduce Job -----",  
"Input:
...", "Combine: ...", etc; could you just add a "Timestamp: ..."
field as well?	Or ideally, both "Starting timestamp: ..." and	
"Finishing
timestamp ...".

Additional comments from another user:
------------------------------------------------------

I'm adding my vote for this as well.

I'd like to know timestamp and "running time" in seconds or D;H:M:S:

Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done


Starting and stopping timestamps in the log would also be valuable.


Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
queuing times can be seconds to hours and completely mess up any notion of job execution time.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates resolved PIG-12.
---------------------------

    Resolution: Fixed

Patch provided by Patrick checked in.  Thanks Patrick.

> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>         Attachments: timestamps.diff
>
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich updated PIG-12:
------------------------------

    Component/s: impl
       Priority: Major  (was: Minor)

> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "David Ciemiewicz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546811 ] 

David Ciemiewicz commented on PIG-12:
-------------------------------------

Two comments -

Timestamps are really useful when you need to correlate events in the execution with external events.

But they are still a pain in the *** when it comes to trying to profile the code.  One of my goals was to have 3 running times (time for current task/job, overall running time, overall time of queue time + running time).  I really didn't want to have write a parsing script.  In fact, because the time stamps don't seem to have Unix ctime timestamps (in seconds), it makes the job harder because I have to convert the timestamps into time values and then do the computations.  It's not that difficult, but I just want simplicity.

The prototype script (see below) annotates the stream with running times which is much more valuable for profiling tasks.

Just adding the log4j timestamps helps.

However, if we go into "optional" mode, at the end of the day, I forgot to set the flag to get timestamps, I'm no better off than I was before.

So, here's what I would propose:

1) Always turn on timestamp logging.
2) If someone doesn't want it, give them the option to turn off timestamps.
3) In the pig command line, include the code to compute the running times, based on the time stamps.

If there were one scenario where I might turn off timestamps by default, it would be when people are using the Pig grunt shell interactively.  But as soon as we are in batch mode, add the timestamps.

Here's the running times example.  Somehow, someway, it should just be added to the pig execution environment, rather than making me jump through the machinations of doing it myself.

The example running time output is first, followed by the Perl script:

Connecting to HOD... : 0:00:00 0:00:00
Connected to HOD : 0:00:14 0:00:14
HDFS Web UI: mithril-nn1.inktomisearch.com:50070
JobTracker Web UI: kry101624:59588
HadoopConf: /export/crawlspace/kryptonite/hod/tmp/hod-8168-tmp/hadoop-site.xml
HDFS: mithril-nn1.inktomisearch.com:8020
JobTracker: kry101624:55365
...
Pig progress = 87% : 0:06:08 0:06:08 0:06:22
Pig progress = 95% : 0:06:11 0:06:11 0:06:25
Pig progress = 99% : 0:06:16 0:06:16 0:06:30
Pig progress = 100% : 0:06:18 0:06:18 0:06:32
J20071001: (datestamp, prev_query_string, curr_query_string )
L20071001: (group: (K20071001::datestamp, prev_query_string, K20071001::curr_query_string ), K20071001: (datestamp, prev_query_string, curr_query_string ) )
M20071001: (datestamp, prev_query_string, curr_query_string, freq )
N20071001: (datestamp, prev_query_string, curr_query_string, freq )

----- MapReduce Job ----- : 0:00:00 0:06:19 0:06:33
Input: [news/san-diego/tmp/20071001:BinStorage()]
Map: [[*]->[FILTER BY ([PROJECT $1] MATCHES '.*san diego.*')]]
Group: [GENERATE {[*],[*]}]
Combine: null
Reduce: GENERATE {[PROJECT $0]->[PROJECT $0],[PROJECT $0]->[PROJECT $1],[PROJECT $0]->[PROJECT $2],[COUNT(GENERATE {[PROJECT $1]})]}
Output: /tmp/temp657966079/tmp568530370:com.yahoo.pig.builtin.BinStorage

Just pipe the output of pig into this script (timestamps.pl):

            pig -cluster mithril -latest my.pig | ./timestamp.pl
 

#! /usr/bin/perl
use POSIX;

$| = 1;  # No buffering of output

$hodqueuestarttime = time();
$jobstarttime = 0;
$mapreducestarttime = 0;


line:
while (<>)
{
        s/\r*\n*$//;

        if (/^Connecting to HOD.../) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }

        if (/^HDFS Web UI/) {
                print "Connected to HOD : ", timestamps(), "\n";
                $jobstarttime = time();
                print $_, "\n";
                next line; 
        }

        if (/^----- Map/) {
                $mapreducestarttime = time();
                print $_, " : ", timestamps(), "\n";
                next line; 
        }

        if (/^Pig progress/) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }
        print $_, "\n";
}

sub timestamps()
{
    if ($jobstarttime == 0 and $mapreducestarttime == 0)
    {
        return (
                timestamp($hodqueuestarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    elsif ($mapreducestarttime == 0)
    {
        return (
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    else
    {
        return (
                timestamp($mapreducestarttime)." ".
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
}

sub timestamp()
{
        $starttime = shift;
        $now = time();
        $interval = $now - $starttime;
        $hours = int($interval / (60 * 60));
        $interval = $interval - $hours * (60 * 60);
        $mins = int($interval / 60);
        $interval = $interval - $mins * 60;
        $secs = $interval;
        $tsf = sprintf "%0d:%02d:%02d", $hours, $mins, $secs;
        return $tsf;
}



> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546411 ] 

phunt edited comment on PIG-12 at 11/28/07 1:51 PM:
-----------------------------------------------------------

I have an alternative suggestion that I'd like to get your feedback on.

Both hadoop and pig make use of log4j for all logging. Why write special purpose code when we can just have the users specify something like this:
----- 
<pre>
 bash-3.00$ cat log4j.properties
 log4j.rootLogger=INFO, A1
 log4j.appender.A1=org.apache.log4j.ConsoleAppender
 log4j.appender.A1.layout=org.apache.log4j.PatternLayout

 # Print the date in ISO 8601 format
 log4j.appender.A1.layout.ConversionPattern=%d [%t] %-5p %c - %m%n
</pre>
------
in their classpath for pig. The result is something like:

2007-11-28 21:23:25,632 [main] DEBUG org.apache.hadoop.ipc.Client - IPC Client connection to mithril-nn1.inktomisearch.com/68.180.187.193:8020 sending #0


I did notice an issue in Pig - Pig has hardcoded a configuration for it's logger. If we replace lines 159-164 in Main.java with something like:

BasicConfigurator.configure();

from http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/BasicConfigurator.html

this should provide the users with timestamps as well as a whole lot more flexibility than if we hardcoded something ourselves.

If you agree then we are basically done (you just need to apply this change I suggested and test it out).


      was (Author: phunt):
    I have an alternative suggestion that I'd like to get your feedback on.

Both hadoop and pig make use of log4j for all logging. Why write special purpose code when we can just have the users specify something like this:
----- 
 bash-3.00$ cat log4j.properties
 log4j.rootLogger=INFO, A1
 log4j.appender.A1=org.apache.log4j.ConsoleAppender
 log4j.appender.A1.layout=org.apache.log4j.PatternLayout

 # Print the date in ISO 8601 format
 log4j.appender.A1.layout.ConversionPattern=%d [%t] %-5p %c - %m%n
------
in their classpath for pig. The result is something like:

2007-11-28 21:23:25,632 [main] DEBUG org.apache.hadoop.ipc.Client - IPC Client connection to mithril-nn1.inktomisearch.com/68.180.187.193:8020 sending #0


I did notice an issue in Pig - Pig has hardcoded a configuration for it's logger. If we replace lines 159-164 in Main.java with something like:

BasicConfigurator.configure();

from http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/BasicConfigurator.html

this should provide the users with timestamps as well as a whole lot more flexibility than if we hardcoded something ourselves.

If you agree then we are basically done (you just need to apply this change I suggested and test it out).

  
> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548356 ] 

Patrick Hunt commented on PIG-12:
---------------------------------

Pig could provide a set of "default" log config files - say:

1) a log file specing the "true" defaults - run pig with no options and you get X logging, this could include timestamps

2) a second log file that is similar to what we have now - no timestamps

we could then add a command line option to switch btw the two logging files, or optionally to select a user specified log file that could be as complex as the user likes (for example it could calculate delta btw log messages rather than timestamps).

We still need to decide what to do wrt "hadoop logging" - we probably want to have a root logger that is the same for the pig application log messages (pig code, hadoop code, etc....).

If this works for ppl we would remove almost all of the pig specific log4j configuration that we have in main and pigcontext and just move it into the log configuration file(s)

> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "David Ciemiewicz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546811 ] 

ciemo edited comment on PIG-12 at 11/29/07 9:43 AM:
---------------------------------------------------------------

Two comments -

Timestamps are really useful when you need to correlate events in the execution with external events.

But they are still a pain in the *** when it comes to trying to profile the code.  One of my goals was to have 3 running times (time for current task/job, overall running time, overall time of queue time + running time).  I really didn't want to have write a parsing script.  In fact, because the time stamps don't seem to have Unix ctime timestamps (in seconds), it makes the job harder because I have to convert the timestamps into time values and then do the computations.  It's not that difficult, but I just want simplicity.

The prototype script (see below) annotates the stream with running times which is much more valuable for profiling tasks.

Just adding the log4j timestamps helps.

However, if we go into "optional" mode, at the end of the day, I forgot to set the flag to get timestamps, I'm no better off than I was before.

So, here's what I would propose:

1) Always turn on timestamp logging.
2) If someone doesn't want it, give them the option to turn off timestamps.
3) In the pig command line, include the code to compute the running times, based on the time stamps.

If there were one scenario where I might turn off timestamps by default, it would be when people are using the Pig grunt shell interactively.  But as soon as we are in batch mode, add the timestamps.

Here's the running times example.  Somehow, someway, it should just be added to the pig execution environment, rather than making me jump through the machinations of doing it myself.

The example running time output is first, followed by the Perl script:

{noformat}
Connecting to HOD... : 0:00:00 0:00:00
Connected to HOD : 0:00:14 0:00:14
HDFS Web UI: mithril-nn1.inktomisearch.com:50070
JobTracker Web UI: kry101624:59588
HadoopConf: /export/crawlspace/kryptonite/hod/tmp/hod-8168-tmp/hadoop-site.xml
HDFS: mithril-nn1.inktomisearch.com:8020
JobTracker: kry101624:55365
...
Pig progress = 87% : 0:06:08 0:06:08 0:06:22
Pig progress = 95% : 0:06:11 0:06:11 0:06:25
Pig progress = 99% : 0:06:16 0:06:16 0:06:30
Pig progress = 100% : 0:06:18 0:06:18 0:06:32
J20071001: (datestamp, prev_query_string, curr_query_string )
L20071001: (group: (K20071001::datestamp, prev_query_string, K20071001::curr_query_string ), 
K20071001: (datestamp, prev_query_string, curr_query_string ) )
M20071001: (datestamp, prev_query_string, curr_query_string, freq )
N20071001: (datestamp, prev_query_string, curr_query_string, freq )

----- MapReduce Job ----- : 0:00:00 0:06:19 0:06:33
Input: [news/san-diego/tmp/20071001:BinStorage()]
Map: [[*]->[FILTER BY ([PROJECT $1] MATCHES '.*san diego.*')]]
Group: [GENERATE {[*],[*]}]
Combine: null
Reduce: GENERATE {[PROJECT $0]->[PROJECT $0],[PROJECT $0]->[PROJECT $1],[PROJECT $0]->[PROJECT $2],[COUNT(GENERATE {[PROJECT $1]})]}
Output: /tmp/temp657966079/tmp568530370:com.yahoo.pig.builtin.BinStorage
{noformat}

Just pipe the output of pig into this script (timestamps.pl):

            pig -cluster mithril -latest my.pig | ./timestamp.pl
 
{code:none}
#! /usr/bin/perl
use POSIX;

$| = 1;  # No buffering of output

$hodqueuestarttime = time();
$jobstarttime = 0;
$mapreducestarttime = 0;

line:
while (<>)
{
        s/\r*\n*$//;

        if (/^Connecting to HOD.../) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }

        if (/^HDFS Web UI/) {
                print "Connected to HOD : ", timestamps(), "\n";
                $jobstarttime = time();
                print $_, "\n";
                next line; 
        }

        if (/^----- Map/) {
                $mapreducestarttime = time();
                print $_, " : ", timestamps(), "\n";
                next line; 
        }

        if (/^Pig progress/) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }
        print $_, "\n";
}

sub timestamps()
{
    if ($jobstarttime == 0 and $mapreducestarttime == 0)
    {
        return (
                timestamp($hodqueuestarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    elsif ($mapreducestarttime == 0)
    {
        return (
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    else
    {
        return (
                timestamp($mapreducestarttime)." ".
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
}

sub timestamp()
{
        $starttime = shift;
        $now = time();
        $interval = $now - $starttime;
        $hours = int($interval / (60 * 60));
        $interval = $interval - $hours * (60 * 60);
        $mins = int($interval / 60);
        $interval = $interval - $mins * 60;
        $secs = $interval;
        $tsf = sprintf "%0d:%02d:%02d", $hours, $mins, $secs;
        return $tsf;
}
{code}

      was (Author: ciemo):
    Two comments -

Timestamps are really useful when you need to correlate events in the execution with external events.

But they are still a pain in the *** when it comes to trying to profile the code.  One of my goals was to have 3 running times (time for current task/job, overall running time, overall time of queue time + running time).  I really didn't want to have write a parsing script.  In fact, because the time stamps don't seem to have Unix ctime timestamps (in seconds), it makes the job harder because I have to convert the timestamps into time values and then do the computations.  It's not that difficult, but I just want simplicity.

The prototype script (see below) annotates the stream with running times which is much more valuable for profiling tasks.

Just adding the log4j timestamps helps.

However, if we go into "optional" mode, at the end of the day, I forgot to set the flag to get timestamps, I'm no better off than I was before.

So, here's what I would propose:

1) Always turn on timestamp logging.
2) If someone doesn't want it, give them the option to turn off timestamps.
3) In the pig command line, include the code to compute the running times, based on the time stamps.

If there were one scenario where I might turn off timestamps by default, it would be when people are using the Pig grunt shell interactively.  But as soon as we are in batch mode, add the timestamps.

Here's the running times example.  Somehow, someway, it should just be added to the pig execution environment, rather than making me jump through the machinations of doing it myself.

The example running time output is first, followed by the Perl script:

{noformat}
Connecting to HOD... : 0:00:00 0:00:00
Connected to HOD : 0:00:14 0:00:14
HDFS Web UI: mithril-nn1.inktomisearch.com:50070
JobTracker Web UI: kry101624:59588
HadoopConf: /export/crawlspace/kryptonite/hod/tmp/hod-8168-tmp/hadoop-site.xml
HDFS: mithril-nn1.inktomisearch.com:8020
JobTracker: kry101624:55365
...
Pig progress = 87% : 0:06:08 0:06:08 0:06:22
Pig progress = 95% : 0:06:11 0:06:11 0:06:25
Pig progress = 99% : 0:06:16 0:06:16 0:06:30
Pig progress = 100% : 0:06:18 0:06:18 0:06:32
J20071001: (datestamp, prev_query_string, curr_query_string )
L20071001: (group: (K20071001::datestamp, prev_query_string, K20071001::curr_query_string ), K20071001: (datestamp, prev_query_string, curr_query_string ) )
M20071001: (datestamp, prev_query_string, curr_query_string, freq )
N20071001: (datestamp, prev_query_string, curr_query_string, freq )

----- MapReduce Job ----- : 0:00:00 0:06:19 0:06:33
Input: [news/san-diego/tmp/20071001:BinStorage()]
Map: [[*]->[FILTER BY ([PROJECT $1] MATCHES '.*san diego.*')]]
Group: [GENERATE {[*],[*]}]
Combine: null
Reduce: GENERATE {[PROJECT $0]->[PROJECT $0],[PROJECT $0]->[PROJECT $1],[PROJECT $0]->[PROJECT $2],[COUNT(GENERATE {[PROJECT $1]})]}
Output: /tmp/temp657966079/tmp568530370:com.yahoo.pig.builtin.BinStorage
{noformat}

Just pipe the output of pig into this script (timestamps.pl):

            pig -cluster mithril -latest my.pig | ./timestamp.pl
 
{code:perl}
#! /usr/bin/perl
use POSIX;

$| = 1;  # No buffering of output

$hodqueuestarttime = time();
$jobstarttime = 0;
$mapreducestarttime = 0;

line:
while (<>)
{
        s/\r*\n*$//;

        if (/^Connecting to HOD.../) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }

        if (/^HDFS Web UI/) {
                print "Connected to HOD : ", timestamps(), "\n";
                $jobstarttime = time();
                print $_, "\n";
                next line; 
        }

        if (/^----- Map/) {
                $mapreducestarttime = time();
                print $_, " : ", timestamps(), "\n";
                next line; 
        }

        if (/^Pig progress/) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }
        print $_, "\n";
}

sub timestamps()
{
    if ($jobstarttime == 0 and $mapreducestarttime == 0)
    {
        return (
                timestamp($hodqueuestarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    elsif ($mapreducestarttime == 0)
    {
        return (
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    else
    {
        return (
                timestamp($mapreducestarttime)." ".
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
}

sub timestamp()
{
        $starttime = shift;
        $now = time();
        $interval = $now - $starttime;
        $hours = int($interval / (60 * 60));
        $interval = $interval - $hours * (60 * 60);
        $mins = int($interval / 60);
        $interval = $interval - $mins * 60;
        $secs = $interval;
        $tsf = sprintf "%0d:%02d:%02d", $hours, $mins, $secs;
        return $tsf;
}
{code}
  
> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "David Ciemiewicz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546811 ] 

ciemo edited comment on PIG-12 at 11/29/07 9:31 AM:
---------------------------------------------------------------

Two comments -

Timestamps are really useful when you need to correlate events in the execution with external events.

But they are still a pain in the *** when it comes to trying to profile the code.  One of my goals was to have 3 running times (time for current task/job, overall running time, overall time of queue time + running time).  I really didn't want to have write a parsing script.  In fact, because the time stamps don't seem to have Unix ctime timestamps (in seconds), it makes the job harder because I have to convert the timestamps into time values and then do the computations.  It's not that difficult, but I just want simplicity.

The prototype script (see below) annotates the stream with running times which is much more valuable for profiling tasks.

Just adding the log4j timestamps helps.

However, if we go into "optional" mode, at the end of the day, I forgot to set the flag to get timestamps, I'm no better off than I was before.

So, here's what I would propose:

1) Always turn on timestamp logging.
2) If someone doesn't want it, give them the option to turn off timestamps.
3) In the pig command line, include the code to compute the running times, based on the time stamps.

If there were one scenario where I might turn off timestamps by default, it would be when people are using the Pig grunt shell interactively.  But as soon as we are in batch mode, add the timestamps.

Here's the running times example.  Somehow, someway, it should just be added to the pig execution environment, rather than making me jump through the machinations of doing it myself.

The example running time output is first, followed by the Perl script:

<pre>
Connecting to HOD... : 0:00:00 0:00:00
Connected to HOD : 0:00:14 0:00:14
HDFS Web UI: mithril-nn1.inktomisearch.com:50070
JobTracker Web UI: kry101624:59588
HadoopConf: /export/crawlspace/kryptonite/hod/tmp/hod-8168-tmp/hadoop-site.xml
HDFS: mithril-nn1.inktomisearch.com:8020
JobTracker: kry101624:55365
...
Pig progress = 87% : 0:06:08 0:06:08 0:06:22
Pig progress = 95% : 0:06:11 0:06:11 0:06:25
Pig progress = 99% : 0:06:16 0:06:16 0:06:30
Pig progress = 100% : 0:06:18 0:06:18 0:06:32
J20071001: (datestamp, prev_query_string, curr_query_string )
L20071001: (group: (K20071001::datestamp, prev_query_string, K20071001::curr_query_string ), K20071001: (datestamp, prev_query_string, curr_query_string ) )
M20071001: (datestamp, prev_query_string, curr_query_string, freq )
N20071001: (datestamp, prev_query_string, curr_query_string, freq )

----- MapReduce Job ----- : 0:00:00 0:06:19 0:06:33
Input: [news/san-diego/tmp/20071001:BinStorage()]
Map: [[*]->[FILTER BY ([PROJECT $1] MATCHES '.*san diego.*')]]
Group: [GENERATE {[*],[*]}]
Combine: null
Reduce: GENERATE {[PROJECT $0]->[PROJECT $0],[PROJECT $0]->[PROJECT $1],[PROJECT $0]->[PROJECT $2],[COUNT(GENERATE {[PROJECT $1]})]}
Output: /tmp/temp657966079/tmp568530370:com.yahoo.pig.builtin.BinStorage
</pre>

Just pipe the output of pig into this script (timestamps.pl):

            pig -cluster mithril -latest my.pig | ./timestamp.pl
 
<pre>
#! /usr/bin/perl
use POSIX;

$| = 1;  # No buffering of output

$hodqueuestarttime = time();
$jobstarttime = 0;
$mapreducestarttime = 0;

line:
while (<>)
{
        s/\r*\n*$//;

        if (/^Connecting to HOD.../) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }

        if (/^HDFS Web UI/) {
                print "Connected to HOD : ", timestamps(), "\n";
                $jobstarttime = time();
                print $_, "\n";
                next line; 
        }

        if (/^----- Map/) {
                $mapreducestarttime = time();
                print $_, " : ", timestamps(), "\n";
                next line; 
        }

        if (/^Pig progress/) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }
        print $_, "\n";
}

sub timestamps()
{
    if ($jobstarttime == 0 and $mapreducestarttime == 0)
    {
        return (
                timestamp($hodqueuestarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    elsif ($mapreducestarttime == 0)
    {
        return (
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    else
    {
        return (
                timestamp($mapreducestarttime)." ".
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
}

sub timestamp()
{
        $starttime = shift;
        $now = time();
        $interval = $now - $starttime;
        $hours = int($interval / (60 * 60));
        $interval = $interval - $hours * (60 * 60);
        $mins = int($interval / 60);
        $interval = $interval - $mins * 60;
        $secs = $interval;
        $tsf = sprintf "%0d:%02d:%02d", $hours, $mins, $secs;
        return $tsf;
}
</pre>


      was (Author: ciemo):
    Two comments -

Timestamps are really useful when you need to correlate events in the execution with external events.

But they are still a pain in the *** when it comes to trying to profile the code.  One of my goals was to have 3 running times (time for current task/job, overall running time, overall time of queue time + running time).  I really didn't want to have write a parsing script.  In fact, because the time stamps don't seem to have Unix ctime timestamps (in seconds), it makes the job harder because I have to convert the timestamps into time values and then do the computations.  It's not that difficult, but I just want simplicity.

The prototype script (see below) annotates the stream with running times which is much more valuable for profiling tasks.

Just adding the log4j timestamps helps.

However, if we go into "optional" mode, at the end of the day, I forgot to set the flag to get timestamps, I'm no better off than I was before.

So, here's what I would propose:

1) Always turn on timestamp logging.
2) If someone doesn't want it, give them the option to turn off timestamps.
3) In the pig command line, include the code to compute the running times, based on the time stamps.

If there were one scenario where I might turn off timestamps by default, it would be when people are using the Pig grunt shell interactively.  But as soon as we are in batch mode, add the timestamps.

Here's the running times example.  Somehow, someway, it should just be added to the pig execution environment, rather than making me jump through the machinations of doing it myself.

The example running time output is first, followed by the Perl script:

Connecting to HOD... : 0:00:00 0:00:00
Connected to HOD : 0:00:14 0:00:14
HDFS Web UI: mithril-nn1.inktomisearch.com:50070
JobTracker Web UI: kry101624:59588
HadoopConf: /export/crawlspace/kryptonite/hod/tmp/hod-8168-tmp/hadoop-site.xml
HDFS: mithril-nn1.inktomisearch.com:8020
JobTracker: kry101624:55365
...
Pig progress = 87% : 0:06:08 0:06:08 0:06:22
Pig progress = 95% : 0:06:11 0:06:11 0:06:25
Pig progress = 99% : 0:06:16 0:06:16 0:06:30
Pig progress = 100% : 0:06:18 0:06:18 0:06:32
J20071001: (datestamp, prev_query_string, curr_query_string )
L20071001: (group: (K20071001::datestamp, prev_query_string, K20071001::curr_query_string ), K20071001: (datestamp, prev_query_string, curr_query_string ) )
M20071001: (datestamp, prev_query_string, curr_query_string, freq )
N20071001: (datestamp, prev_query_string, curr_query_string, freq )

----- MapReduce Job ----- : 0:00:00 0:06:19 0:06:33
Input: [news/san-diego/tmp/20071001:BinStorage()]
Map: [[*]->[FILTER BY ([PROJECT $1] MATCHES '.*san diego.*')]]
Group: [GENERATE {[*],[*]}]
Combine: null
Reduce: GENERATE {[PROJECT $0]->[PROJECT $0],[PROJECT $0]->[PROJECT $1],[PROJECT $0]->[PROJECT $2],[COUNT(GENERATE {[PROJECT $1]})]}
Output: /tmp/temp657966079/tmp568530370:com.yahoo.pig.builtin.BinStorage

Just pipe the output of pig into this script (timestamps.pl):

            pig -cluster mithril -latest my.pig | ./timestamp.pl
 

#! /usr/bin/perl
use POSIX;

$| = 1;  # No buffering of output

$hodqueuestarttime = time();
$jobstarttime = 0;
$mapreducestarttime = 0;


line:
while (<>)
{
        s/\r*\n*$//;

        if (/^Connecting to HOD.../) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }

        if (/^HDFS Web UI/) {
                print "Connected to HOD : ", timestamps(), "\n";
                $jobstarttime = time();
                print $_, "\n";
                next line; 
        }

        if (/^----- Map/) {
                $mapreducestarttime = time();
                print $_, " : ", timestamps(), "\n";
                next line; 
        }

        if (/^Pig progress/) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }
        print $_, "\n";
}

sub timestamps()
{
    if ($jobstarttime == 0 and $mapreducestarttime == 0)
    {
        return (
                timestamp($hodqueuestarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    elsif ($mapreducestarttime == 0)
    {
        return (
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    else
    {
        return (
                timestamp($mapreducestarttime)." ".
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
}

sub timestamp()
{
        $starttime = shift;
        $now = time();
        $interval = $now - $starttime;
        $hours = int($interval / (60 * 60));
        $interval = $interval - $hours * (60 * 60);
        $mins = int($interval / 60);
        $interval = $interval - $mins * 60;
        $secs = $interval;
        $tsf = sprintf "%0d:%02d:%02d", $hours, $mins, $secs;
        return $tsf;
}


  
> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548359 ] 

Patrick Hunt commented on PIG-12:
---------------------------------

I found this to be a good reference if you want more background on log4j config:

http://logging.apache.org/log4j/1.2/manual.html

in particular see the sections:

"Configuration" and "Default Initialization Procedure"

we would probably use a PropertyConfigurator with the command line arg (or default if not provided) to switch btw the log config files I mentioned in the prev comment.


> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546471 ] 

Patrick Hunt commented on PIG-12:
---------------------------------

IMO most ppl won't care for the timestamp and it will just clutter the console. My recommendation is to:

1) document how to enable timestamps somewhere (pig twiki), users can just copy/paste what we give them, or they can dig into log4j if they want to get more complicated (their own appenders/formatters that can also calculate time intervals)
2) fix the issue that I documented - fix Pig Main.java so that log4j config is not hardcoded
3) communicate out to users


> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546412 ] 

Patrick Hunt commented on PIG-12:
---------------------------------

somehow the formatting of my previous comment got messed up - I think ppl can understand it though, let me know otw.

> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546448 ] 

Olga Natkovich commented on PIG-12:
-----------------------------------

I don't think we need an option - we should just always do it.

> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546411 ] 

phunt edited comment on PIG-12 at 11/28/07 1:51 PM:
-----------------------------------------------------------

I have an alternative suggestion that I'd like to get your feedback on.

Both hadoop and pig make use of log4j for all logging. Why write special purpose code when we can just have the users specify something like this:
----- 
 bash-3.00$ cat log4j.properties
 log4j.rootLogger=INFO, A1
 log4j.appender.A1=org.apache.log4j.ConsoleAppender
 log4j.appender.A1.layout=org.apache.log4j.PatternLayout

 # Print the date in ISO 8601 format
 log4j.appender.A1.layout.ConversionPattern=%d [%t] %-5p %c - %m%n
------
in their classpath for pig. The result is something like:

2007-11-28 21:23:25,632 [main] DEBUG org.apache.hadoop.ipc.Client - IPC Client connection to mithril-nn1.inktomisearch.com/68.180.187.193:8020 sending #0


I did notice an issue in Pig - Pig has hardcoded a configuration for it's logger. If we replace lines 159-164 in Main.java with something like:

BasicConfigurator.configure();

from http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/BasicConfigurator.html

this should provide the users with timestamps as well as a whole lot more flexibility than if we hardcoded something ourselves.

If you agree then we are basically done (you just need to apply this change I suggested and test it out).


      was (Author: phunt):
    I have an alternative suggestion that I'd like to get your feedback on.

Both hadoop and pig make use of log4j for all logging. Why write special purpose code when we can just have the users specify something like this:
-----
bash-3.00$ cat log4j.properties
log4j.rootLogger=INFO, A1
log4j.appender.A1=org.apache.log4j.ConsoleAppender
log4j.appender.A1.layout=org.apache.log4j.PatternLayout

# Print the date in ISO 8601 format
log4j.appender.A1.layout.ConversionPattern=%d [%t] %-5p %c - %m%n
------
in their classpath for pig. The result is something like:

2007-11-28 21:23:25,632 [main] DEBUG org.apache.hadoop.ipc.Client - IPC Client connection to mithril-nn1.inktomisearch.com/68.180.187.193:8020 sending #0


I did notice an issue in Pig - Pig has hardcoded a configuration for it's logger. If we replace lines 159-164 in Main.java with something like:

BasicConfigurator.configure();

from http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/BasicConfigurator.html

this should provide the users with timestamps as well as a whole lot more flexibility than if we hardcoded something ourselves.

If you agree then we are basically done (you just need to apply this change I suggested and test it out).

  
> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12539731 ] 

Olga Natkovich commented on PIG-12:
-----------------------------------

Knowing when individual maps start and stop would require more work but what we can do easily as the short term measure is to annotate each progress message with a timestamp indicated above. We can also print the timestamp at the very beginning of the job and at the very end. The relevant code is in src/org/apache/pig/impl/mapreduceExec> vi MapReduceLauncher.java.

> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546439 ] 

Alan Gates commented on PIG-12:
-------------------------------

I'm all for using log4j for this.  That's exactly what it's designed to do.

Should we also support a -timestamp (or something) option so that users can get timestamps w/o needing the semi-cryptic log4j properties file?  If the user specified this then we would set the PatternLayout to match the one you suggest in the example.

> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "David Ciemiewicz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546811 ] 

ciemo edited comment on PIG-12 at 11/29/07 9:38 AM:
---------------------------------------------------------------

Two comments -

Timestamps are really useful when you need to correlate events in the execution with external events.

But they are still a pain in the *** when it comes to trying to profile the code.  One of my goals was to have 3 running times (time for current task/job, overall running time, overall time of queue time + running time).  I really didn't want to have write a parsing script.  In fact, because the time stamps don't seem to have Unix ctime timestamps (in seconds), it makes the job harder because I have to convert the timestamps into time values and then do the computations.  It's not that difficult, but I just want simplicity.

The prototype script (see below) annotates the stream with running times which is much more valuable for profiling tasks.

Just adding the log4j timestamps helps.

However, if we go into "optional" mode, at the end of the day, I forgot to set the flag to get timestamps, I'm no better off than I was before.

So, here's what I would propose:

1) Always turn on timestamp logging.
2) If someone doesn't want it, give them the option to turn off timestamps.
3) In the pig command line, include the code to compute the running times, based on the time stamps.

If there were one scenario where I might turn off timestamps by default, it would be when people are using the Pig grunt shell interactively.  But as soon as we are in batch mode, add the timestamps.

Here's the running times example.  Somehow, someway, it should just be added to the pig execution environment, rather than making me jump through the machinations of doing it myself.

The example running time output is first, followed by the Perl script:

<pre>
Connecting to HOD... : 0:00:00 0:00:00
Connected to HOD : 0:00:14 0:00:14
HDFS Web UI: mithril-nn1.inktomisearch.com:50070
JobTracker Web UI: kry101624:59588
HadoopConf: /export/crawlspace/kryptonite/hod/tmp/hod-8168-tmp/hadoop-site.xml
HDFS: mithril-nn1.inktomisearch.com:8020
JobTracker: kry101624:55365
...
Pig progress = 87% : 0:06:08 0:06:08 0:06:22
Pig progress = 95% : 0:06:11 0:06:11 0:06:25
Pig progress = 99% : 0:06:16 0:06:16 0:06:30
Pig progress = 100% : 0:06:18 0:06:18 0:06:32
J20071001: (datestamp, prev_query_string, curr_query_string )
L20071001: (group: (K20071001::datestamp, prev_query_string, K20071001::curr_query_string ), K20071001: (datestamp, prev_query_string, curr_query_string ) )
M20071001: (datestamp, prev_query_string, curr_query_string, freq )
N20071001: (datestamp, prev_query_string, curr_query_string, freq )

----- MapReduce Job ----- : 0:00:00 0:06:19 0:06:33
Input: [news/san-diego/tmp/20071001:BinStorage()]
Map: [[*]->[FILTER BY ([PROJECT $1] MATCHES '.*san diego.*')]]
Group: [GENERATE {[*],[*]}]
Combine: null
Reduce: GENERATE {[PROJECT $0]->[PROJECT $0],[PROJECT $0]->[PROJECT $1],[PROJECT $0]->[PROJECT $2],[COUNT(GENERATE {[PROJECT $1]})]}
Output: /tmp/temp657966079/tmp568530370:com.yahoo.pig.builtin.BinStorage
</pre>

Just pipe the output of pig into this script (timestamps.pl):

            pig -cluster mithril -latest my.pig | ./timestamp.pl
 
{noformat}
#! /usr/bin/perl
use POSIX;

$| = 1;  # No buffering of output

$hodqueuestarttime = time();
$jobstarttime = 0;
$mapreducestarttime = 0;

line:
while (<>)
{
        s/\r*\n*$//;

        if (/^Connecting to HOD.../) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }

        if (/^HDFS Web UI/) {
                print "Connected to HOD : ", timestamps(), "\n";
                $jobstarttime = time();
                print $_, "\n";
                next line; 
        }

        if (/^----- Map/) {
                $mapreducestarttime = time();
                print $_, " : ", timestamps(), "\n";
                next line; 
        }

        if (/^Pig progress/) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }
        print $_, "\n";
}

sub timestamps()
{
    if ($jobstarttime == 0 and $mapreducestarttime == 0)
    {
        return (
                timestamp($hodqueuestarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    elsif ($mapreducestarttime == 0)
    {
        return (
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    else
    {
        return (
                timestamp($mapreducestarttime)." ".
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
}

sub timestamp()
{
        $starttime = shift;
        $now = time();
        $interval = $now - $starttime;
        $hours = int($interval / (60 * 60));
        $interval = $interval - $hours * (60 * 60);
        $mins = int($interval / 60);
        $interval = $interval - $mins * 60;
        $secs = $interval;
        $tsf = sprintf "%0d:%02d:%02d", $hours, $mins, $secs;
        return $tsf;
}
{noformat}

      was (Author: ciemo):
    Two comments -

Timestamps are really useful when you need to correlate events in the execution with external events.

But they are still a pain in the *** when it comes to trying to profile the code.  One of my goals was to have 3 running times (time for current task/job, overall running time, overall time of queue time + running time).  I really didn't want to have write a parsing script.  In fact, because the time stamps don't seem to have Unix ctime timestamps (in seconds), it makes the job harder because I have to convert the timestamps into time values and then do the computations.  It's not that difficult, but I just want simplicity.

The prototype script (see below) annotates the stream with running times which is much more valuable for profiling tasks.

Just adding the log4j timestamps helps.

However, if we go into "optional" mode, at the end of the day, I forgot to set the flag to get timestamps, I'm no better off than I was before.

So, here's what I would propose:

1) Always turn on timestamp logging.
2) If someone doesn't want it, give them the option to turn off timestamps.
3) In the pig command line, include the code to compute the running times, based on the time stamps.

If there were one scenario where I might turn off timestamps by default, it would be when people are using the Pig grunt shell interactively.  But as soon as we are in batch mode, add the timestamps.

Here's the running times example.  Somehow, someway, it should just be added to the pig execution environment, rather than making me jump through the machinations of doing it myself.

The example running time output is first, followed by the Perl script:

<pre>
Connecting to HOD... : 0:00:00 0:00:00
Connected to HOD : 0:00:14 0:00:14
HDFS Web UI: mithril-nn1.inktomisearch.com:50070
JobTracker Web UI: kry101624:59588
HadoopConf: /export/crawlspace/kryptonite/hod/tmp/hod-8168-tmp/hadoop-site.xml
HDFS: mithril-nn1.inktomisearch.com:8020
JobTracker: kry101624:55365
...
Pig progress = 87% : 0:06:08 0:06:08 0:06:22
Pig progress = 95% : 0:06:11 0:06:11 0:06:25
Pig progress = 99% : 0:06:16 0:06:16 0:06:30
Pig progress = 100% : 0:06:18 0:06:18 0:06:32
J20071001: (datestamp, prev_query_string, curr_query_string )
L20071001: (group: (K20071001::datestamp, prev_query_string, K20071001::curr_query_string ), K20071001: (datestamp, prev_query_string, curr_query_string ) )
M20071001: (datestamp, prev_query_string, curr_query_string, freq )
N20071001: (datestamp, prev_query_string, curr_query_string, freq )

----- MapReduce Job ----- : 0:00:00 0:06:19 0:06:33
Input: [news/san-diego/tmp/20071001:BinStorage()]
Map: [[*]->[FILTER BY ([PROJECT $1] MATCHES '.*san diego.*')]]
Group: [GENERATE {[*],[*]}]
Combine: null
Reduce: GENERATE {[PROJECT $0]->[PROJECT $0],[PROJECT $0]->[PROJECT $1],[PROJECT $0]->[PROJECT $2],[COUNT(GENERATE {[PROJECT $1]})]}
Output: /tmp/temp657966079/tmp568530370:com.yahoo.pig.builtin.BinStorage
</pre>

Just pipe the output of pig into this script (timestamps.pl):

            pig -cluster mithril -latest my.pig | ./timestamp.pl
 
{norformat}
#! /usr/bin/perl
use POSIX;

$| = 1;  # No buffering of output

$hodqueuestarttime = time();
$jobstarttime = 0;
$mapreducestarttime = 0;

line:
while (<>)
{
        s/\r*\n*$//;

        if (/^Connecting to HOD.../) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }

        if (/^HDFS Web UI/) {
                print "Connected to HOD : ", timestamps(), "\n";
                $jobstarttime = time();
                print $_, "\n";
                next line; 
        }

        if (/^----- Map/) {
                $mapreducestarttime = time();
                print $_, " : ", timestamps(), "\n";
                next line; 
        }

        if (/^Pig progress/) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }
        print $_, "\n";
}

sub timestamps()
{
    if ($jobstarttime == 0 and $mapreducestarttime == 0)
    {
        return (
                timestamp($hodqueuestarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    elsif ($mapreducestarttime == 0)
    {
        return (
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    else
    {
        return (
                timestamp($mapreducestarttime)." ".
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
}

sub timestamp()
{
        $starttime = shift;
        $now = time();
        $interval = $now - $starttime;
        $hours = int($interval / (60 * 60));
        $interval = $interval - $hours * (60 * 60);
        $mins = int($interval / 60);
        $interval = $interval - $mins * 60;
        $secs = $interval;
        $tsf = sprintf "%0d:%02d:%02d", $hours, $mins, $secs;
        return $tsf;
}
{noformat}
  
> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "David Ciemiewicz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546811 ] 

ciemo edited comment on PIG-12 at 11/29/07 9:34 AM:
---------------------------------------------------------------

Two comments -

Timestamps are really useful when you need to correlate events in the execution with external events.

But they are still a pain in the *** when it comes to trying to profile the code.  One of my goals was to have 3 running times (time for current task/job, overall running time, overall time of queue time + running time).  I really didn't want to have write a parsing script.  In fact, because the time stamps don't seem to have Unix ctime timestamps (in seconds), it makes the job harder because I have to convert the timestamps into time values and then do the computations.  It's not that difficult, but I just want simplicity.

The prototype script (see below) annotates the stream with running times which is much more valuable for profiling tasks.

Just adding the log4j timestamps helps.

However, if we go into "optional" mode, at the end of the day, I forgot to set the flag to get timestamps, I'm no better off than I was before.

So, here's what I would propose:

1) Always turn on timestamp logging.
2) If someone doesn't want it, give them the option to turn off timestamps.
3) In the pig command line, include the code to compute the running times, based on the time stamps.

If there were one scenario where I might turn off timestamps by default, it would be when people are using the Pig grunt shell interactively.  But as soon as we are in batch mode, add the timestamps.

Here's the running times example.  Somehow, someway, it should just be added to the pig execution environment, rather than making me jump through the machinations of doing it myself.

The example running time output is first, followed by the Perl script:

<pre>
Connecting to HOD... : 0:00:00 0:00:00
Connected to HOD : 0:00:14 0:00:14
HDFS Web UI: mithril-nn1.inktomisearch.com:50070
JobTracker Web UI: kry101624:59588
HadoopConf: /export/crawlspace/kryptonite/hod/tmp/hod-8168-tmp/hadoop-site.xml
HDFS: mithril-nn1.inktomisearch.com:8020
JobTracker: kry101624:55365
...
Pig progress = 87% : 0:06:08 0:06:08 0:06:22
Pig progress = 95% : 0:06:11 0:06:11 0:06:25
Pig progress = 99% : 0:06:16 0:06:16 0:06:30
Pig progress = 100% : 0:06:18 0:06:18 0:06:32
J20071001: (datestamp, prev_query_string, curr_query_string )
L20071001: (group: (K20071001::datestamp, prev_query_string, K20071001::curr_query_string ), K20071001: (datestamp, prev_query_string, curr_query_string ) )
M20071001: (datestamp, prev_query_string, curr_query_string, freq )
N20071001: (datestamp, prev_query_string, curr_query_string, freq )

----- MapReduce Job ----- : 0:00:00 0:06:19 0:06:33
Input: [news/san-diego/tmp/20071001:BinStorage()]
Map: [[*]->[FILTER BY ([PROJECT $1] MATCHES '.*san diego.*')]]
Group: [GENERATE {[*],[*]}]
Combine: null
Reduce: GENERATE {[PROJECT $0]->[PROJECT $0],[PROJECT $0]->[PROJECT $1],[PROJECT $0]->[PROJECT $2],[COUNT(GENERATE {[PROJECT $1]})]}
Output: /tmp/temp657966079/tmp568530370:com.yahoo.pig.builtin.BinStorage
</pre>

Just pipe the output of pig into this script (timestamps.pl):

            pig -cluster mithril -latest my.pig | ./timestamp.pl
 
{norformat}
#! /usr/bin/perl
use POSIX;

$| = 1;  # No buffering of output

$hodqueuestarttime = time();
$jobstarttime = 0;
$mapreducestarttime = 0;

line:
while (<>)
{
        s/\r*\n*$//;

        if (/^Connecting to HOD.../) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }

        if (/^HDFS Web UI/) {
                print "Connected to HOD : ", timestamps(), "\n";
                $jobstarttime = time();
                print $_, "\n";
                next line; 
        }

        if (/^----- Map/) {
                $mapreducestarttime = time();
                print $_, " : ", timestamps(), "\n";
                next line; 
        }

        if (/^Pig progress/) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }
        print $_, "\n";
}

sub timestamps()
{
    if ($jobstarttime == 0 and $mapreducestarttime == 0)
    {
        return (
                timestamp($hodqueuestarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    elsif ($mapreducestarttime == 0)
    {
        return (
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    else
    {
        return (
                timestamp($mapreducestarttime)." ".
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
}

sub timestamp()
{
        $starttime = shift;
        $now = time();
        $interval = $now - $starttime;
        $hours = int($interval / (60 * 60));
        $interval = $interval - $hours * (60 * 60);
        $mins = int($interval / 60);
        $interval = $interval - $mins * 60;
        $secs = $interval;
        $tsf = sprintf "%0d:%02d:%02d", $hours, $mins, $secs;
        return $tsf;
}
{noformat}

      was (Author: ciemo):
    Two comments -

Timestamps are really useful when you need to correlate events in the execution with external events.

But they are still a pain in the *** when it comes to trying to profile the code.  One of my goals was to have 3 running times (time for current task/job, overall running time, overall time of queue time + running time).  I really didn't want to have write a parsing script.  In fact, because the time stamps don't seem to have Unix ctime timestamps (in seconds), it makes the job harder because I have to convert the timestamps into time values and then do the computations.  It's not that difficult, but I just want simplicity.

The prototype script (see below) annotates the stream with running times which is much more valuable for profiling tasks.

Just adding the log4j timestamps helps.

However, if we go into "optional" mode, at the end of the day, I forgot to set the flag to get timestamps, I'm no better off than I was before.

So, here's what I would propose:

1) Always turn on timestamp logging.
2) If someone doesn't want it, give them the option to turn off timestamps.
3) In the pig command line, include the code to compute the running times, based on the time stamps.

If there were one scenario where I might turn off timestamps by default, it would be when people are using the Pig grunt shell interactively.  But as soon as we are in batch mode, add the timestamps.

Here's the running times example.  Somehow, someway, it should just be added to the pig execution environment, rather than making me jump through the machinations of doing it myself.

The example running time output is first, followed by the Perl script:

<pre>
Connecting to HOD... : 0:00:00 0:00:00
Connected to HOD : 0:00:14 0:00:14
HDFS Web UI: mithril-nn1.inktomisearch.com:50070
JobTracker Web UI: kry101624:59588
HadoopConf: /export/crawlspace/kryptonite/hod/tmp/hod-8168-tmp/hadoop-site.xml
HDFS: mithril-nn1.inktomisearch.com:8020
JobTracker: kry101624:55365
...
Pig progress = 87% : 0:06:08 0:06:08 0:06:22
Pig progress = 95% : 0:06:11 0:06:11 0:06:25
Pig progress = 99% : 0:06:16 0:06:16 0:06:30
Pig progress = 100% : 0:06:18 0:06:18 0:06:32
J20071001: (datestamp, prev_query_string, curr_query_string )
L20071001: (group: (K20071001::datestamp, prev_query_string, K20071001::curr_query_string ), K20071001: (datestamp, prev_query_string, curr_query_string ) )
M20071001: (datestamp, prev_query_string, curr_query_string, freq )
N20071001: (datestamp, prev_query_string, curr_query_string, freq )

----- MapReduce Job ----- : 0:00:00 0:06:19 0:06:33
Input: [news/san-diego/tmp/20071001:BinStorage()]
Map: [[*]->[FILTER BY ([PROJECT $1] MATCHES '.*san diego.*')]]
Group: [GENERATE {[*],[*]}]
Combine: null
Reduce: GENERATE {[PROJECT $0]->[PROJECT $0],[PROJECT $0]->[PROJECT $1],[PROJECT $0]->[PROJECT $2],[COUNT(GENERATE {[PROJECT $1]})]}
Output: /tmp/temp657966079/tmp568530370:com.yahoo.pig.builtin.BinStorage
</pre>

Just pipe the output of pig into this script (timestamps.pl):

            pig -cluster mithril -latest my.pig | ./timestamp.pl
 
<pre>
#! /usr/bin/perl
use POSIX;

$| = 1;  # No buffering of output

$hodqueuestarttime = time();
$jobstarttime = 0;
$mapreducestarttime = 0;

line:
while (<>)
{
        s/\r*\n*$//;

        if (/^Connecting to HOD.../) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }

        if (/^HDFS Web UI/) {
                print "Connected to HOD : ", timestamps(), "\n";
                $jobstarttime = time();
                print $_, "\n";
                next line; 
        }

        if (/^----- Map/) {
                $mapreducestarttime = time();
                print $_, " : ", timestamps(), "\n";
                next line; 
        }

        if (/^Pig progress/) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }
        print $_, "\n";
}

sub timestamps()
{
    if ($jobstarttime == 0 and $mapreducestarttime == 0)
    {
        return (
                timestamp($hodqueuestarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    elsif ($mapreducestarttime == 0)
    {
        return (
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    else
    {
        return (
                timestamp($mapreducestarttime)." ".
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
}

sub timestamp()
{
        $starttime = shift;
        $now = time();
        $interval = $now - $starttime;
        $hours = int($interval / (60 * 60));
        $interval = $interval - $hours * (60 * 60);
        $mins = int($interval / 60);
        $interval = $interval - $mins * 60;
        $secs = $interval;
        $tsf = sprintf "%0d:%02d:%02d", $hours, $mins, $secs;
        return $tsf;
}
</pre>

  
> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates updated PIG-12:
--------------------------

    Attachment: timestamps.diff

Patch uploaded on behalf of Patrick Hunt.  Comments from his email:

Here is the patch, some things to note:

1) -b/--brief gives brief logging - no timestamps
2) -4/--log4jconf allows user to specify properties conf file which will be "the" log4j configuration (overrides anything we might do)

3) I tried to keep the semantics of -v and -d the same, see changes to Main.java. Main diff is that it applies to the root (everyone; pig, hadoop, etc...), rather than just pig as it did previously. You should verify bw compatibility (if you care about such things).

4) some of the code is using system.out.println (like POMapreduce.java). As a result, obviously, these messages won't have timestamp. You may/maynot want to clean this up (35 matches in src hierarchy)

Patrick


> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>         Attachments: timestamps.diff
>
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546411 ] 

phunt edited comment on PIG-12 at 11/28/07 3:07 PM:
-----------------------------------------------------------

I have an alternative suggestion that I'd like to get your feedback on.

Both hadoop and pig make use of log4j for all logging. Why write special purpose code when we can just have the users specify something like this:
----- 
{noformat}
 bash-3.00$ cat log4j.properties
 log4j.rootLogger=INFO, A1
 log4j.appender.A1=org.apache.log4j.ConsoleAppender
 log4j.appender.A1.layout=org.apache.log4j.PatternLayout

 # Print the date in ISO 8601 format
 log4j.appender.A1.layout.ConversionPattern=%d [%t] %-5p %c - %m%n
{noformat}
------
in their classpath for pig. The result is something like:

2007-11-28 21:23:25,632 [main] DEBUG org.apache.hadoop.ipc.Client - IPC Client connection to host:port sending #0


I did notice an issue in Pig - Pig has hardcoded a configuration for it's logger. If we replace lines 159-164 in Main.java with something like:

BasicConfigurator.configure();

from http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/BasicConfigurator.html

this should provide the users with timestamps as well as a whole lot more flexibility than if we hardcoded something ourselves.

If you agree then we are basically done (you just need to apply this change I suggested and test it out).


      was (Author: phunt):
    I have an alternative suggestion that I'd like to get your feedback on.

Both hadoop and pig make use of log4j for all logging. Why write special purpose code when we can just have the users specify something like this:
----- 
{noformat}
 bash-3.00$ cat log4j.properties
 log4j.rootLogger=INFO, A1
 log4j.appender.A1=org.apache.log4j.ConsoleAppender
 log4j.appender.A1.layout=org.apache.log4j.PatternLayout

 # Print the date in ISO 8601 format
 log4j.appender.A1.layout.ConversionPattern=%d [%t] %-5p %c - %m%n
{noformat}
------
in their classpath for pig. The result is something like:

2007-11-28 21:23:25,632 [main] DEBUG org.apache.hadoop.ipc.Client - IPC Client connection to mithril-nn1.inktomisearch.com/68.180.187.193:8020 sending #0


I did notice an issue in Pig - Pig has hardcoded a configuration for it's logger. If we replace lines 159-164 in Main.java with something like:

BasicConfigurator.configure();

from http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/BasicConfigurator.html

this should provide the users with timestamps as well as a whole lot more flexibility than if we hardcoded something ourselves.

If you agree then we are basically done (you just need to apply this change I suggested and test it out).

  
> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548360 ] 

Alan Gates commented on PIG-12:
-------------------------------

That looks good.

> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "Olga Natkovich (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12547247 ] 

Olga Natkovich commented on PIG-12:
-----------------------------------

I am fine with the proposal.

Patrick, lets add the timestamps unconditionally and, if we have time left, we can make changes to pig.pl to do running times; if not, will do it later.

> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Patrick Hunt updated PIG-12:
----------------------------

    Comment: was deleted

> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546411 ] 

phunt edited comment on PIG-12 at 11/28/07 1:52 PM:
-----------------------------------------------------------

I have an alternative suggestion that I'd like to get your feedback on.

Both hadoop and pig make use of log4j for all logging. Why write special purpose code when we can just have the users specify something like this:
----- 
{noformat}
 bash-3.00$ cat log4j.properties
 log4j.rootLogger=INFO, A1
 log4j.appender.A1=org.apache.log4j.ConsoleAppender
 log4j.appender.A1.layout=org.apache.log4j.PatternLayout

 # Print the date in ISO 8601 format
 log4j.appender.A1.layout.ConversionPattern=%d [%t] %-5p %c - %m%n
{noformat}
------
in their classpath for pig. The result is something like:

2007-11-28 21:23:25,632 [main] DEBUG org.apache.hadoop.ipc.Client - IPC Client connection to mithril-nn1.inktomisearch.com/68.180.187.193:8020 sending #0


I did notice an issue in Pig - Pig has hardcoded a configuration for it's logger. If we replace lines 159-164 in Main.java with something like:

BasicConfigurator.configure();

from http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/BasicConfigurator.html

this should provide the users with timestamps as well as a whole lot more flexibility than if we hardcoded something ourselves.

If you agree then we are basically done (you just need to apply this change I suggested and test it out).


      was (Author: phunt):
    I have an alternative suggestion that I'd like to get your feedback on.

Both hadoop and pig make use of log4j for all logging. Why write special purpose code when we can just have the users specify something like this:
----- 
<pre>
 bash-3.00$ cat log4j.properties
 log4j.rootLogger=INFO, A1
 log4j.appender.A1=org.apache.log4j.ConsoleAppender
 log4j.appender.A1.layout=org.apache.log4j.PatternLayout

 # Print the date in ISO 8601 format
 log4j.appender.A1.layout.ConversionPattern=%d [%t] %-5p %c - %m%n
</pre>
------
in their classpath for pig. The result is something like:

2007-11-28 21:23:25,632 [main] DEBUG org.apache.hadoop.ipc.Client - IPC Client connection to mithril-nn1.inktomisearch.com/68.180.187.193:8020 sending #0


I did notice an issue in Pig - Pig has hardcoded a configuration for it's logger. If we replace lines 159-164 in Main.java with something like:

BasicConfigurator.configure();

from http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/BasicConfigurator.html

this should provide the users with timestamps as well as a whole lot more flexibility than if we hardcoded something ourselves.

If you agree then we are basically done (you just need to apply this change I suggested and test it out).

  
> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546411 ] 

Patrick Hunt commented on PIG-12:
---------------------------------

I have an alternative suggestion that I'd like to get your feedback on.

Both hadoop and pig make use of log4j for all logging. Why write special purpose code when we can just have the users specify something like this:
-----
bash-3.00$ cat log4j.properties
log4j.rootLogger=INFO, A1
log4j.appender.A1=org.apache.log4j.ConsoleAppender
log4j.appender.A1.layout=org.apache.log4j.PatternLayout

# Print the date in ISO 8601 format
log4j.appender.A1.layout.ConversionPattern=%d [%t] %-5p %c - %m%n
------
in their classpath for pig. The result is something like:

2007-11-28 21:23:25,632 [main] DEBUG org.apache.hadoop.ipc.Client - IPC Client connection to mithril-nn1.inktomisearch.com/68.180.187.193:8020 sending #0


I did notice an issue in Pig - Pig has hardcoded a configuration for it's logger. If we replace lines 159-164 in Main.java with something like:

BasicConfigurator.configure();

from http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/BasicConfigurator.html

this should provide the users with timestamps as well as a whole lot more flexibility than if we hardcoded something ourselves.

If you agree then we are basically done (you just need to apply this change I suggested and test it out).


> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "Patrick Hunt (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548034 ] 

Patrick Hunt commented on PIG-12:
---------------------------------

I looked into the latest comments and I don't think you want to do this, however:

Adding timestamps to pig output is simple, change line 160 to this:

        String logfmt = "%d [%t] %-5p %c - %m%n";
        ConsoleAppender screen = new ConsoleAppender(new PatternLayout(logfmt));

all PIG (notice PIG) output will now have timestamps. HOWEVER all hadoop output will have the std output format (no timestamp). I don't think this is what you/we want. This is happening because pig is using it's own appender, rather than the root appender (See PigContext.java, line 113). Hadoop seems to be using the root.

log4j architecture is complex and rooted around extreme flexibility. Typically you configure things like the timestamp through configuration, not programmatically. The issue I see is that we have no easy way to configure the root logger "in the code", rather we are at the mercy of the user/hadoop configuration. We can't assume a particular root logging scheme in our code.

I don't think you want to do this. I'm no expert at log4j but my understanding is that what you are suggesting won't work unless we make certain assumptions which we then bake into the code. Primarily we would need to override the root logger with our own log configuration (down at the root, currently we only do this at org.apache.pig level). This may be a viable option, you would need to think about the effects...


> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "Alan Gates (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12548058 ] 

Alan Gates commented on PIG-12:
-------------------------------

A couple of things:

1) When I put in the log4j stuff, I had pig use its own appender rather than the root one so that we could set up our own appenders, log levels, etc. independent of hadoop.  Maybe that isn't the right choice.  Maybe we should assume that users will want the same from pig and hadoop log4j.  I'm open to modifying this if we don't think the initial approach is the best one.

2) I agree that we don't want hadoop and pig to be outputting log records that are formatted differently.  My concern is that if we control format through configuration, it is difficult for a user to turn the timestamp on or off.  (Not difficult in the sense that they can't do it, as they can and it is totally flexible.  But difficult in the sense that how many users know how to manage log4j and tweak their configuration to change the timestamp format?)  If we do it through the code we can provide the user command line options to control it.



> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "David Ciemiewicz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546811 ] 

ciemo edited comment on PIG-12 at 11/29/07 9:40 AM:
---------------------------------------------------------------

Two comments -

Timestamps are really useful when you need to correlate events in the execution with external events.

But they are still a pain in the *** when it comes to trying to profile the code.  One of my goals was to have 3 running times (time for current task/job, overall running time, overall time of queue time + running time).  I really didn't want to have write a parsing script.  In fact, because the time stamps don't seem to have Unix ctime timestamps (in seconds), it makes the job harder because I have to convert the timestamps into time values and then do the computations.  It's not that difficult, but I just want simplicity.

The prototype script (see below) annotates the stream with running times which is much more valuable for profiling tasks.

Just adding the log4j timestamps helps.

However, if we go into "optional" mode, at the end of the day, I forgot to set the flag to get timestamps, I'm no better off than I was before.

So, here's what I would propose:

1) Always turn on timestamp logging.
2) If someone doesn't want it, give them the option to turn off timestamps.
3) In the pig command line, include the code to compute the running times, based on the time stamps.

If there were one scenario where I might turn off timestamps by default, it would be when people are using the Pig grunt shell interactively.  But as soon as we are in batch mode, add the timestamps.

Here's the running times example.  Somehow, someway, it should just be added to the pig execution environment, rather than making me jump through the machinations of doing it myself.

The example running time output is first, followed by the Perl script:

{noformat}
Connecting to HOD... : 0:00:00 0:00:00
Connected to HOD : 0:00:14 0:00:14
HDFS Web UI: mithril-nn1.inktomisearch.com:50070
JobTracker Web UI: kry101624:59588
HadoopConf: /export/crawlspace/kryptonite/hod/tmp/hod-8168-tmp/hadoop-site.xml
HDFS: mithril-nn1.inktomisearch.com:8020
JobTracker: kry101624:55365
...
Pig progress = 87% : 0:06:08 0:06:08 0:06:22
Pig progress = 95% : 0:06:11 0:06:11 0:06:25
Pig progress = 99% : 0:06:16 0:06:16 0:06:30
Pig progress = 100% : 0:06:18 0:06:18 0:06:32
J20071001: (datestamp, prev_query_string, curr_query_string )
L20071001: (group: (K20071001::datestamp, prev_query_string, K20071001::curr_query_string ), K20071001: (datestamp, prev_query_string, curr_query_string ) )
M20071001: (datestamp, prev_query_string, curr_query_string, freq )
N20071001: (datestamp, prev_query_string, curr_query_string, freq )

----- MapReduce Job ----- : 0:00:00 0:06:19 0:06:33
Input: [news/san-diego/tmp/20071001:BinStorage()]
Map: [[*]->[FILTER BY ([PROJECT $1] MATCHES '.*san diego.*')]]
Group: [GENERATE {[*],[*]}]
Combine: null
Reduce: GENERATE {[PROJECT $0]->[PROJECT $0],[PROJECT $0]->[PROJECT $1],[PROJECT $0]->[PROJECT $2],[COUNT(GENERATE {[PROJECT $1]})]}
Output: /tmp/temp657966079/tmp568530370:com.yahoo.pig.builtin.BinStorage
{noformat}

Just pipe the output of pig into this script (timestamps.pl):

            pig -cluster mithril -latest my.pig | ./timestamp.pl
 
{code:perl}
#! /usr/bin/perl
use POSIX;

$| = 1;  # No buffering of output

$hodqueuestarttime = time();
$jobstarttime = 0;
$mapreducestarttime = 0;

line:
while (<>)
{
        s/\r*\n*$//;

        if (/^Connecting to HOD.../) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }

        if (/^HDFS Web UI/) {
                print "Connected to HOD : ", timestamps(), "\n";
                $jobstarttime = time();
                print $_, "\n";
                next line; 
        }

        if (/^----- Map/) {
                $mapreducestarttime = time();
                print $_, " : ", timestamps(), "\n";
                next line; 
        }

        if (/^Pig progress/) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }
        print $_, "\n";
}

sub timestamps()
{
    if ($jobstarttime == 0 and $mapreducestarttime == 0)
    {
        return (
                timestamp($hodqueuestarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    elsif ($mapreducestarttime == 0)
    {
        return (
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    else
    {
        return (
                timestamp($mapreducestarttime)." ".
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
}

sub timestamp()
{
        $starttime = shift;
        $now = time();
        $interval = $now - $starttime;
        $hours = int($interval / (60 * 60));
        $interval = $interval - $hours * (60 * 60);
        $mins = int($interval / 60);
        $interval = $interval - $mins * 60;
        $secs = $interval;
        $tsf = sprintf "%0d:%02d:%02d", $hours, $mins, $secs;
        return $tsf;
}
{code}

      was (Author: ciemo):
    Two comments -

Timestamps are really useful when you need to correlate events in the execution with external events.

But they are still a pain in the *** when it comes to trying to profile the code.  One of my goals was to have 3 running times (time for current task/job, overall running time, overall time of queue time + running time).  I really didn't want to have write a parsing script.  In fact, because the time stamps don't seem to have Unix ctime timestamps (in seconds), it makes the job harder because I have to convert the timestamps into time values and then do the computations.  It's not that difficult, but I just want simplicity.

The prototype script (see below) annotates the stream with running times which is much more valuable for profiling tasks.

Just adding the log4j timestamps helps.

However, if we go into "optional" mode, at the end of the day, I forgot to set the flag to get timestamps, I'm no better off than I was before.

So, here's what I would propose:

1) Always turn on timestamp logging.
2) If someone doesn't want it, give them the option to turn off timestamps.
3) In the pig command line, include the code to compute the running times, based on the time stamps.

If there were one scenario where I might turn off timestamps by default, it would be when people are using the Pig grunt shell interactively.  But as soon as we are in batch mode, add the timestamps.

Here's the running times example.  Somehow, someway, it should just be added to the pig execution environment, rather than making me jump through the machinations of doing it myself.

The example running time output is first, followed by the Perl script:

<pre>
Connecting to HOD... : 0:00:00 0:00:00
Connected to HOD : 0:00:14 0:00:14
HDFS Web UI: mithril-nn1.inktomisearch.com:50070
JobTracker Web UI: kry101624:59588
HadoopConf: /export/crawlspace/kryptonite/hod/tmp/hod-8168-tmp/hadoop-site.xml
HDFS: mithril-nn1.inktomisearch.com:8020
JobTracker: kry101624:55365
...
Pig progress = 87% : 0:06:08 0:06:08 0:06:22
Pig progress = 95% : 0:06:11 0:06:11 0:06:25
Pig progress = 99% : 0:06:16 0:06:16 0:06:30
Pig progress = 100% : 0:06:18 0:06:18 0:06:32
J20071001: (datestamp, prev_query_string, curr_query_string )
L20071001: (group: (K20071001::datestamp, prev_query_string, K20071001::curr_query_string ), K20071001: (datestamp, prev_query_string, curr_query_string ) )
M20071001: (datestamp, prev_query_string, curr_query_string, freq )
N20071001: (datestamp, prev_query_string, curr_query_string, freq )

----- MapReduce Job ----- : 0:00:00 0:06:19 0:06:33
Input: [news/san-diego/tmp/20071001:BinStorage()]
Map: [[*]->[FILTER BY ([PROJECT $1] MATCHES '.*san diego.*')]]
Group: [GENERATE {[*],[*]}]
Combine: null
Reduce: GENERATE {[PROJECT $0]->[PROJECT $0],[PROJECT $0]->[PROJECT $1],[PROJECT $0]->[PROJECT $2],[COUNT(GENERATE {[PROJECT $1]})]}
Output: /tmp/temp657966079/tmp568530370:com.yahoo.pig.builtin.BinStorage
</pre>

Just pipe the output of pig into this script (timestamps.pl):

            pig -cluster mithril -latest my.pig | ./timestamp.pl
 
{noformat}
#! /usr/bin/perl
use POSIX;

$| = 1;  # No buffering of output

$hodqueuestarttime = time();
$jobstarttime = 0;
$mapreducestarttime = 0;

line:
while (<>)
{
        s/\r*\n*$//;

        if (/^Connecting to HOD.../) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }

        if (/^HDFS Web UI/) {
                print "Connected to HOD : ", timestamps(), "\n";
                $jobstarttime = time();
                print $_, "\n";
                next line; 
        }

        if (/^----- Map/) {
                $mapreducestarttime = time();
                print $_, " : ", timestamps(), "\n";
                next line; 
        }

        if (/^Pig progress/) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }
        print $_, "\n";
}

sub timestamps()
{
    if ($jobstarttime == 0 and $mapreducestarttime == 0)
    {
        return (
                timestamp($hodqueuestarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    elsif ($mapreducestarttime == 0)
    {
        return (
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    else
    {
        return (
                timestamp($mapreducestarttime)." ".
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
}

sub timestamp()
{
        $starttime = shift;
        $now = time();
        $interval = $now - $starttime;
        $hours = int($interval / (60 * 60));
        $interval = $interval - $hours * (60 * 60);
        $mins = int($interval / 60);
        $interval = $interval - $mins * 60;
        $secs = $interval;
        $tsf = sprintf "%0d:%02d:%02d", $hours, $mins, $secs;
        return $tsf;
}
{noformat}
  
> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (PIG-12) Please add timestamps to pig map/reduce progress messages

Posted by "David Ciemiewicz (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PIG-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546811 ] 

ciemo edited comment on PIG-12 at 11/29/07 9:49 AM:
---------------------------------------------------------------

Two comments -

Timestamps are really useful when you need to correlate events in the execution with external events.

But they are still a pain in the *** when it comes to trying to profile the code.  One of my goals was to have 3 running times (time for current task/job, overall running time, overall time of queue time + running time).  I really didn't want to have write a parsing script.  In fact, because the time stamps don't seem to have Unix ctime timestamps (in seconds), it makes the job harder because I have to convert the timestamps into time values and then do the computations.  It's not that difficult, but I just want simplicity.

The prototype script (see below) annotates the stream with running times which is much more valuable for profiling tasks.

Just adding the log4j timestamps helps.

However, if we go into "optional" mode, at the end of the day, I forgot to set the flag to get timestamps, I'm no better off than I was before.

So, here's what I would propose:

1) Always turn on timestamp logging.
2) If someone doesn't want it, give them the option to turn off timestamps.
3) In the pig command line, include the code to compute the running times, based on the time stamps.

If there were one scenario where I might turn off timestamps by default, it would be when people are using the Pig grunt shell interactively.  But as soon as we are in batch mode, add the timestamps.

Here's the running times example.  Somehow, someway, it should just be added to the pig execution environment, rather than making me jump through the machinations of doing it myself.

The example running time output is first, followed by the Perl script:

{noformat}
Connecting to HOD... : 0:00:00 0:00:00
Connected to HOD : 0:00:14 0:00:14
HDFS Web UI: mithril-nn1.inktomisearch.com:50070
JobTracker Web UI: kry101624:59588
HadoopConf: /export/crawlspace/kryptonite/hod/tmp/hod-8168-tmp/hadoop-site.xml
HDFS: mithril-nn1.inktomisearch.com:8020
JobTracker: kry101624:55365
...
Pig progress = 87% : 0:06:08 0:06:08 0:06:22
Pig progress = 95% : 0:06:11 0:06:11 0:06:25
Pig progress = 99% : 0:06:16 0:06:16 0:06:30
Pig progress = 100% : 0:06:18 0:06:18 0:06:32
J20071001: (datestamp, prev_query_string, curr_query_string )
L20071001: (group: (K20071001::datestamp, prev_query_string, K20071001::curr_query_string ), 
K20071001: (datestamp, prev_query_string, curr_query_string ) )
M20071001: (datestamp, prev_query_string, curr_query_string, freq )
N20071001: (datestamp, prev_query_string, curr_query_string, freq )

----- MapReduce Job ----- : 0:00:00 0:06:19 0:06:33
Input: [news/san-diego/tmp/20071001:BinStorage()]
Map: [[*]->[FILTER BY ([PROJECT $1] MATCHES '.*san diego.*')]]
Group: [GENERATE {[*],[*]}]
Combine: null
Reduce: GENERATE {[PROJECT $0]->[PROJECT $0],[PROJECT $0]->[PROJECT $1],
[PROJECT $0]->[PROJECT $2],[COUNT(GENERATE {[PROJECT $1]})]}
Output: /tmp/temp657966079/tmp568530370:com.yahoo.pig.builtin.BinStorage
{noformat}

Just pipe the output of pig into this script (timestamps.pl):

            pig -cluster mithril -latest my.pig | ./timestamp.pl
 
{code:none}
#! /usr/bin/perl
use POSIX;

$| = 1;  # No buffering of output

$hodqueuestarttime = time();
$jobstarttime = 0;
$mapreducestarttime = 0;

line:
while (<>)
{
        s/\r*\n*$//;

        if (/^Connecting to HOD.../) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }

        if (/^HDFS Web UI/) {
                print "Connected to HOD : ", timestamps(), "\n";
                $jobstarttime = time();
                print $_, "\n";
                next line; 
        }

        if (/^----- Map/) {
                $mapreducestarttime = time();
                print $_, " : ", timestamps(), "\n";
                next line;
        }

        if (/^Pig progress/) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }
        print $_, "\n";
}

sub timestamps()
{
    if ($jobstarttime == 0 and $mapreducestarttime == 0)
    {
        return (
                timestamp($hodqueuestarttime)." ".
                timestamp($hodqueuestarttime)
        );
    }
    elsif ($mapreducestarttime == 0)
    {
        return (
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
        );
    }
    else
    {
        return (
                timestamp($mapreducestarttime)." ".
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
        );
    }
}

sub timestamp()
{
        $starttime = shift;
        $now = time();
        $interval = $now - $starttime;
        $hours = int($interval / (60 * 60));
        $interval = $interval - $hours * (60 * 60);
        $mins = int($interval / 60);
        $interval = $interval - $mins * 60;
        $secs = $interval;
        $tsf = sprintf "%0d:%02d:%02d", $hours, $mins, $secs;
        return $tsf;
}
{code}

      was (Author: ciemo):
    Two comments -

Timestamps are really useful when you need to correlate events in the execution with external events.

But they are still a pain in the *** when it comes to trying to profile the code.  One of my goals was to have 3 running times (time for current task/job, overall running time, overall time of queue time + running time).  I really didn't want to have write a parsing script.  In fact, because the time stamps don't seem to have Unix ctime timestamps (in seconds), it makes the job harder because I have to convert the timestamps into time values and then do the computations.  It's not that difficult, but I just want simplicity.

The prototype script (see below) annotates the stream with running times which is much more valuable for profiling tasks.

Just adding the log4j timestamps helps.

However, if we go into "optional" mode, at the end of the day, I forgot to set the flag to get timestamps, I'm no better off than I was before.

So, here's what I would propose:

1) Always turn on timestamp logging.
2) If someone doesn't want it, give them the option to turn off timestamps.
3) In the pig command line, include the code to compute the running times, based on the time stamps.

If there were one scenario where I might turn off timestamps by default, it would be when people are using the Pig grunt shell interactively.  But as soon as we are in batch mode, add the timestamps.

Here's the running times example.  Somehow, someway, it should just be added to the pig execution environment, rather than making me jump through the machinations of doing it myself.

The example running time output is first, followed by the Perl script:

{noformat}
Connecting to HOD... : 0:00:00 0:00:00
Connected to HOD : 0:00:14 0:00:14
HDFS Web UI: mithril-nn1.inktomisearch.com:50070
JobTracker Web UI: kry101624:59588
HadoopConf: /export/crawlspace/kryptonite/hod/tmp/hod-8168-tmp/hadoop-site.xml
HDFS: mithril-nn1.inktomisearch.com:8020
JobTracker: kry101624:55365
...
Pig progress = 87% : 0:06:08 0:06:08 0:06:22
Pig progress = 95% : 0:06:11 0:06:11 0:06:25
Pig progress = 99% : 0:06:16 0:06:16 0:06:30
Pig progress = 100% : 0:06:18 0:06:18 0:06:32
J20071001: (datestamp, prev_query_string, curr_query_string )
L20071001: (group: (K20071001::datestamp, prev_query_string, K20071001::curr_query_string ), 
K20071001: (datestamp, prev_query_string, curr_query_string ) )
M20071001: (datestamp, prev_query_string, curr_query_string, freq )
N20071001: (datestamp, prev_query_string, curr_query_string, freq )

----- MapReduce Job ----- : 0:00:00 0:06:19 0:06:33
Input: [news/san-diego/tmp/20071001:BinStorage()]
Map: [[*]->[FILTER BY ([PROJECT $1] MATCHES '.*san diego.*')]]
Group: [GENERATE {[*],[*]}]
Combine: null
Reduce: GENERATE {[PROJECT $0]->[PROJECT $0],[PROJECT $0]->[PROJECT $1],[PROJECT $0]->[PROJECT $2],[COUNT(GENERATE {[PROJECT $1]})]}
Output: /tmp/temp657966079/tmp568530370:com.yahoo.pig.builtin.BinStorage
{noformat}

Just pipe the output of pig into this script (timestamps.pl):

            pig -cluster mithril -latest my.pig | ./timestamp.pl
 
{code:none}
#! /usr/bin/perl
use POSIX;

$| = 1;  # No buffering of output

$hodqueuestarttime = time();
$jobstarttime = 0;
$mapreducestarttime = 0;

line:
while (<>)
{
        s/\r*\n*$//;

        if (/^Connecting to HOD.../) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }

        if (/^HDFS Web UI/) {
                print "Connected to HOD : ", timestamps(), "\n";
                $jobstarttime = time();
                print $_, "\n";
                next line; 
        }

        if (/^----- Map/) {
                $mapreducestarttime = time();
                print $_, " : ", timestamps(), "\n";
                next line; 
        }

        if (/^Pig progress/) {
                print $_, " : ", timestamps(), "\n";
                next line;
        }
        print $_, "\n";
}

sub timestamps()
{
    if ($jobstarttime == 0 and $mapreducestarttime == 0)
    {
        return (
                timestamp($hodqueuestarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    elsif ($mapreducestarttime == 0)
    {
        return (
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
    else
    {
        return (
                timestamp($mapreducestarttime)." ".
                timestamp($jobstarttime)." ".
                timestamp($hodqueuestarttime)
                );
    }
}

sub timestamp()
{
        $starttime = shift;
        $now = time();
        $interval = $now - $starttime;
        $hours = int($interval / (60 * 60));
        $interval = $interval - $hours * (60 * 60);
        $mins = int($interval / 60);
        $interval = $interval - $mins * 60;
        $secs = $interval;
        $tsf = sprintf "%0d:%02d:%02d", $hours, $mins, $secs;
        return $tsf;
}
{code}
  
> Please add timestamps to pig map/reduce progress messages
> ---------------------------------------------------------
>
>                 Key: PIG-12
>                 URL: https://issues.apache.org/jira/browse/PIG-12
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>            Reporter: Olga Natkovich
>
> From one of the users: 
> ------------------------------
> I'm spending a lot of time trying to optimize my pig queries for short
> run-times.  This process would be much easier if, in the progress output
> from pig (currently on stdout, but hopefully soon moving to  
> stderr?!), the
> initiation and completion of each map/reduce job could be  
> timestamped.  Pig
> already spits out messages of the form "----- MapReduce Job -----",  
> "Input:
> ...", "Combine: ...", etc; could you just add a "Timestamp: ..."
> field as well?	Or ideally, both "Starting timestamp: ..." and	
> "Finishing
> timestamp ...".
> Additional comments from another user:
> ------------------------------------------------------
> I'm adding my vote for this as well.
> I'd like to know timestamp and "running time" in seconds or D;H:M:S:
> Thu Oct 25 10:06:01 GMT 2007 (0:00:12:56): 56% done
> Starting and stopping timestamps in the log would also be valuable.
> Unforutately, there's no "workaround" such as putting a date command before and after the pig command in logging --
> queuing times can be seconds to hours and completely mess up any notion of job execution time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.