Posted to user@pig.apache.org by Michael Harris <Mi...@Telespree.com> on 2008/04/02 19:47:13 UTC

MapReduceLauncher static fields

Hello,

I have written a Pig application that runs a fixed set of queries
on demand through a web interface. I am trying to get the progress of
the queries from the PigServer, but I have noticed that the progress
data all comes from static fields in the MapReduceLauncher. Clearly
my webapp must be able to handle multiple concurrent Pig queries (and be
thread-safe), and I would like to report the progress of each individual
query (job set) to the end user. Do these static fields indicate that I
would get the progress of multiple concurrent queries initiated by
different PigServer instances? Or would I get the overall progress of
the MapReduceLauncher for all queries currently being executed?

Thanks,
Michael

Re: MapReduceLauncher static fields

Posted by Benjamin Reed <br...@yahoo-inc.com>.

Your approach is one way of doing it; perhaps it's the best way. Another 
potential way is to pass a progress object to the store method.

The code that displays progress to the user is in
MapReduceLauncher.launchPig. It's not perfect, but it gives a reasonable
estimate even when a set of queries needs multiple jobs to complete.
Perhaps you could incorporate that into your getProgress().
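
One way to picture the progress-object idea is the sketch below. It is only an illustration: ProgressListener, QueryProgress, and the store(...) stand-in are hypothetical names invented here, not part of Pig's API, and the loop merely simulates work.

```java
// Hypothetical sketch: none of these types exist in Pig.
interface ProgressListener {
    void update(double fraction);
}

// One instance per query, owned by the caller (for example, one per web request).
final class QueryProgress implements ProgressListener {
    private volatile double fraction; // volatile so a polling thread sees fresh values

    @Override
    public void update(double value) {
        // Clamp to [0, 1] so callers never see out-of-range progress.
        fraction = Math.max(0.0, Math.min(1.0, value));
    }

    double get() {
        return fraction;
    }
}

final class StoreSketch {
    // Stand-in for a store call that accepts a caller-supplied progress object.
    static void store(String alias, int steps, ProgressListener listener) {
        for (int i = 1; i <= steps; i++) {
            // Real work would happen here; the sketch only reports progress.
            listener.update((double) i / steps);
        }
    }

    public static void main(String[] args) {
        QueryProgress progress = new QueryProgress();
        store("results", 4, progress);
        System.out.println(progress.get()); // prints 1.0
    }
}
```

Because each caller creates its own QueryProgress, a web request could poll its own object without touching any shared static state.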

thanx
ben

On Friday 04 April 2008 12:08:52 Michael Harris wrote:
> Ben,
>
> Thanks for getting back to me. Ideally the stats would be attached to a
> dump/store command. I tried to hack together a solution by making those
> fields non-static, making MapReduceLauncher serializable, adding a
> getProgress method to MapReduceLauncher instances, and modifying the
> update points to use particular instances of MapReduceLauncher rather
> than the static calls they were making before. Then I modified PigServer
> to have a method getProgress(String alias):
>
> 	public double getProgress(String id) {
> 		ExecutionEngine ee = pigContext.getExecutionEngine();
> 		if (ee instanceof HExecutionEngine) {
> 			HExecutionEngine he = (HExecutionEngine) ee;
> 			LogicalPlan lp = aliases.get(id);
> 			// Look up the physical operator compiled for this alias.
> 			POMapreduce mapRed = (POMapreduce) he.getPhysicalOpTable().get(
> 					he.getPhysicalKey(lp.getRoot()));
> 			return mapRed.getMapReduceLauncher().getProgress();
> 		}
> 		// Progress is unavailable for other execution engines.
> 		return -1;
> 	}
>
> I have only spent a few hours with the Pig code, so I'm not sure this is
> even correct, but it seems to work really well except when a set of
> queries needs a set of jobs to complete: the results are totally
> inaccurate until it gets to the final job. It's not really a big deal;
> it's just an internal tool, and my users can live with no status updates,
> but I thought it would be a nice touch. I have looked at the roadmap for
> Pig and see that querying for progress is on there. I just wanted to make
> sure you guys think of my scenario (a thread-safe, end-user-facing
> application) when you add it :)
>
> -Michael

RE: MapReduceLauncher static fields

Posted by Michael Harris <Mi...@Telespree.com>.

Ben,

Thanks for getting back to me. Ideally the stats would be attached to a
dump/store command. I tried to hack together a solution by making those
fields non-static, making MapReduceLauncher serializable, adding a
getProgress method to MapReduceLauncher instances, and modifying the
update points to use particular instances of MapReduceLauncher rather
than the static calls they were making before. Then I modified PigServer
to have a method getProgress(String alias):

	public double getProgress(String id) {
		ExecutionEngine ee = pigContext.getExecutionEngine();
		if (ee instanceof HExecutionEngine) {
			HExecutionEngine he = (HExecutionEngine) ee;
			LogicalPlan lp = aliases.get(id);
			// Look up the physical operator compiled for this alias.
			POMapreduce mapRed = (POMapreduce) he.getPhysicalOpTable().get(
					he.getPhysicalKey(lp.getRoot()));
			return mapRed.getMapReduceLauncher().getProgress();
		}
		// Progress is unavailable for other execution engines.
		return -1;
	}

I have only spent a few hours with the Pig code, so I'm not sure this is
even correct, but it seems to work really well except when a set of
queries needs a set of jobs to complete: the results are totally
inaccurate until it gets to the final job. It's not really a big deal;
it's just an internal tool, and my users can live with no status updates,
but I thought it would be a nice touch. I have looked at the roadmap for
Pig and see that querying for progress is on there. I just wanted to make
sure you guys think of my scenario (a thread-safe, end-user-facing
application) when you add it :)
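
If the inaccuracy comes from reporting only the currently running job, one possible fix is to fold in the jobs that have already finished. The sketch below assumes that the total number of jobs is known up front and that every job counts equally; both are assumptions made here, and JobSetProgress is a hypothetical helper, not Pig code. Pig's own launchPig may weight jobs differently.

```java
// Hypothetical helper; not taken from Pig's MapReduceLauncher.
final class JobSetProgress {
    /**
     * Overall progress in [0, 1] for a query that runs totalJobs jobs in
     * sequence, or -1 when the number of jobs is unknown.
     */
    static double overall(int totalJobs, int completedJobs, double currentJobProgress) {
        if (totalJobs <= 0) {
            return -1; // same "unknown" convention as the getProgress method above
        }
        if (completedJobs >= totalJobs) {
            return 1.0;
        }
        // Clamp the running job's progress so bad input cannot skew the result.
        double current = Math.max(0.0, Math.min(1.0, currentJobProgress));
        return (Math.max(0, completedJobs) + current) / totalJobs;
    }

    public static void main(String[] args) {
        // Second of four jobs is half done: (1 + 0.5) / 4
        System.out.println(overall(4, 1, 0.5)); // prints 0.375
    }
}
```

With equal weights the reported value never moves backwards when one job ends and the next begins, which is the symptom described above.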

-Michael

-----Original Message-----
From: Benjamin Reed [mailto:breed@yahoo-inc.com] 
Sent: Friday, April 04, 2008 11:29 AM
To: pig-user@incubator.apache.org
Subject: Re: MapReduceLauncher static fields

The statistics are not updated in a thread-safe way. They are global
statistics, so they span all jobs, and since the updates aren't
thread-safe the numbers may be wrong. Apart from the numbers, I think the
rest should be thread-safe, assuming that the underlying Hadoop code is
thread-safe, which it looks to be.

I would think that for your application the stats should really be
attached to an object that represents the store or dump call, right? (Or
at least be accessible through that object.)

ben

Re: MapReduceLauncher static fields

Posted by Benjamin Reed <br...@yahoo-inc.com>.

The statistics are not updated in a thread-safe way. They are global
statistics, so they span all jobs, and since the updates aren't
thread-safe the numbers may be wrong. Apart from the numbers, I think the
rest should be thread-safe, assuming that the underlying Hadoop code is
thread-safe, which it looks to be.

I would think that for your application the stats should really be
attached to an object that represents the store or dump call, right? (Or
at least be accessible through that object.)
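
To illustrate attaching the stats to a per-store object rather than to static fields, here is a minimal sketch. StoreStats is a hypothetical class invented for this example, not part of Pig. It keeps its counter in an AtomicLong so that concurrent updates are not lost, and each query gets its own instance so that concurrent queries do not mix their numbers.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical per-store statistics object; not part of Pig.
final class StoreStats {
    private final AtomicLong tasksDone = new AtomicLong();
    private final long tasksTotal;

    StoreStats(long tasksTotal) {
        this.tasksTotal = tasksTotal;
    }

    // Safe to call from many threads; increments are atomic.
    void taskFinished() {
        tasksDone.incrementAndGet();
    }

    // Fraction of tasks finished, or -1 when the total is unknown.
    double progress() {
        return tasksTotal <= 0 ? -1 : (double) tasksDone.get() / tasksTotal;
    }
}

final class StoreStatsDemo {
    public static void main(String[] args) throws InterruptedException {
        StoreStats a = new StoreStats(1000);
        StoreStats b = new StoreStats(4000);
        // Two "queries" update their own stats concurrently.
        Thread ta = new Thread(() -> { for (int i = 0; i < 500; i++) a.taskFinished(); });
        Thread tb = new Thread(() -> { for (int i = 0; i < 1000; i++) b.taskFinished(); });
        ta.start();
        tb.start();
        ta.join();
        tb.join();
        System.out.println(a.progress()); // prints 0.5
        System.out.println(b.progress()); // prints 0.25
    }
}
```

Had both threads written to one shared static counter, neither query's progress could be recovered from the total.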

ben
