Posted to user@spark.apache.org by Philip Ogren <ph...@oracle.com> on 2014/04/01 17:43:41 UTC

Re: Is there a way to get the current progress of the job?

Hi DB,

Just wondering if you ever got an answer to your question about 
monitoring progress - either offline or through your own investigation.  
Any findings would be appreciated.

Thanks,
Philip

On 01/30/2014 10:32 PM, DB Tsai wrote:
> Hi guys,
>
> When we're running a very long job, we would like to show users the 
> current progress of the map and reduce jobs. After looking at the API 
> documentation, I don't find anything for this. However, in the Spark UI, 
> I could see the progress of the tasks. Is there anything I missed?
>
> Thanks.
>
> Sincerely,
>
> DB Tsai
> Machine Learning Engineer
> Alpine Data Labs
> --------------------------------------
> Web: http://alpinenow.com/


Re: Is there a way to get the current progress of the job?

Posted by Philip Ogren <ph...@oracle.com>.
This is great news, thanks for the update!  I will either wait for the 
1.0 release or test it ahead of time from git rather than trying to 
pull it out of JobLogger or create my own SparkListener.


On 04/02/2014 06:48 PM, Andrew Or wrote:
> Hi Philip,
>
> In the upcoming release of Spark 1.0 there will be a feature that 
> provides for exactly what you describe: capturing the information 
> displayed on the UI in JSON. More details will be provided in the 
> documentation, but for now, anything before 0.9.1 can only go through 
> JobLogger.scala, which outputs information in a somewhat arbitrary 
> format and will be deprecated soon. If you find this feature useful, 
> you can test it out by building the master branch of Spark yourself, 
> following the instructions in https://github.com/apache/spark/pull/42.
>
> Andrew
>
>
> On Wed, Apr 2, 2014 at 3:39 PM, Philip Ogren <philip.ogren@oracle.com> wrote:
>
>     What I'd like is a way to capture the information provided on the
>     stages page (i.e. cluster:4040/stages via IndexPage).  Looking
>     through the Spark code, it doesn't seem like it is possible to
>     directly query for specific facts such as how many tasks have
>     succeeded or how many total tasks there are for a given active
>     stage.  Instead, it looks like all the data for the page is
>     generated at once using information from the JobProgressListener.
>     It doesn't seem like I have any way to programmatically access
>     this information myself.  I can't even instantiate my own
>     JobProgressListener because it is spark package private.  I could
>     implement my own SparkListener and gather up the information myself.
>     It feels a bit awkward since classes like Task and TaskInfo are
>     also spark package private.  It does seem possible to gather up
>     what I need, but it seems like this sort of information should just
>     be available without implementing a custom SparkListener (or
>     worse, screen scraping the HTML generated by StageTable!)
>
>     I was hoping that I would find the answer in MetricsServlet which
>     is turned on by default.  It seems that when I visit
>     http://cluster:4040/metrics/json/ I should be able to get
>     everything I want but I don't see the basic stage/task progress
>     information I would expect.  Are there special metrics properties
>     that I should set to get this info?  I think this would be the
>     best solution - just give it the right URL and parse the resulting
>     JSON - but I can't seem to figure out how to do this or if it is
>     possible.
>
>     Any advice is appreciated.
>
>     Thanks,
>     Philip
>
>
>
>     On 04/01/2014 09:43 AM, Philip Ogren wrote:
>
>         Hi DB,
>
>         Just wondering if you ever got an answer to your question
>         about monitoring progress - either offline or through your own
>         investigation.  Any findings would be appreciated.
>
>         Thanks,
>         Philip
>
>         On 01/30/2014 10:32 PM, DB Tsai wrote:
>
>             Hi guys,
>
>             When we're running a very long job, we would like to show
>             users the current progress of the map and reduce jobs. After
>             looking at the API documentation, I don't find anything for
>             this. However, in the Spark UI, I could see the progress of
>             the tasks. Is there anything I missed?
>
>             Thanks.
>
>             Sincerely,
>
>             DB Tsai
>             Machine Learning Engineer
>             Alpine Data Labs
>             --------------------------------------
>             Web: http://alpinenow.com/
>
>
>
>


Re: Is there a way to get the current progress of the job?

Posted by Andrew Or <an...@databricks.com>.
Hi Philip,

In the upcoming release of Spark 1.0 there will be a feature that provides
for exactly what you describe: capturing the information displayed on the
UI in JSON. More details will be provided in the documentation, but for
now, anything before 0.9.1 can only go through JobLogger.scala, which
outputs information in a somewhat arbitrary format and will be deprecated
soon. If you find this feature useful, you can test it out by building the
master branch of Spark yourself, following the instructions in
https://github.com/apache/spark/pull/42.

Andrew


On Wed, Apr 2, 2014 at 3:39 PM, Philip Ogren <ph...@oracle.com> wrote:

> What I'd like is a way to capture the information provided on the stages
> page (i.e. cluster:4040/stages via IndexPage).  Looking through the Spark
> code, it doesn't seem like it is possible to directly query for specific
> facts such as how many tasks have succeeded or how many total tasks there
> are for a given active stage.  Instead, it looks like all the data for the
> page is generated at once using information from the JobProgressListener.
> It doesn't seem like I have any way to programmatically access this
> information myself.  I can't even instantiate my own JobProgressListener
> because it is spark package private.  I could implement my own SparkListener
> and gather up the information myself.  It feels a bit awkward since classes
> like Task and TaskInfo are also spark package private.  It does seem
> possible to gather up what I need, but it seems like this sort of
> information should just be available without implementing a custom
> SparkListener (or worse, screen scraping the HTML generated by StageTable!)
>
> I was hoping that I would find the answer in MetricsServlet which is
> turned on by default.  It seems that when I visit
> http://cluster:4040/metrics/json/ I should be able to get everything I
> want but I don't see the basic stage/task progress information I would
> expect.  Are there special metrics properties that I should set to get this
> info?  I think this would be the best solution - just give it the right URL
> and parse the resulting JSON - but I can't seem to figure out how to do
> this or if it is possible.
>
> Any advice is appreciated.
>
> Thanks,
> Philip
>
>
>
> On 04/01/2014 09:43 AM, Philip Ogren wrote:
>
>> Hi DB,
>>
>> Just wondering if you ever got an answer to your question about
>> monitoring progress - either offline or through your own investigation.
>>  Any findings would be appreciated.
>>
>> Thanks,
>> Philip
>>
>> On 01/30/2014 10:32 PM, DB Tsai wrote:
>>
>>> Hi guys,
>>>
>>> When we're running a very long job, we would like to show users the
>>> current progress of the map and reduce jobs. After looking at the API
>>> documentation, I don't find anything for this. However, in the Spark UI,
>>> I could see the progress of the tasks. Is there anything I missed?
>>>
>>> Thanks.
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> Machine Learning Engineer
>>> Alpine Data Labs
>>> --------------------------------------
>>> Web: http://alpinenow.com/
>>>
>>
>>
>

Re: Is there a way to get the current progress of the job?

Posted by Mark Hamstra <ma...@clearstorydata.com>.
https://issues.apache.org/jira/browse/SPARK-1081?jql=project%20%3D%20SPARK%20AND%20text%20~%20Annotate


On Thu, Apr 3, 2014 at 9:24 AM, Philip Ogren <ph...@oracle.com> wrote:

>  I can appreciate the reluctance to expose something like the
> JobProgressListener as a public interface.  It's exactly the sort of thing
> that you want to deprecate as soon as something better comes along and can
> be a real pain when trying to maintain the level of backwards
> compatibility that we all expect from commercial grade software.  Instead
> of simply marking it private and therefore unavailable to Spark developers,
> it might be worth incorporating something like a @Beta annotation
> <http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/annotations/Beta.html>
> which you could sprinkle liberally throughout Spark that communicates "hey,
> use this if you want, because it's here now" and "don't come crying if we
> rip it out or change it later."  This might be better than simply marking
> so many useful functions/classes as private.  I bet such an annotation
> could generate a compile warning/error for those who don't want to risk
> using them.
>
>
> On 04/02/2014 06:40 PM, Patrick Wendell wrote:
>
> Hey Philip,
>
>  Right now there is no mechanism for this. You have to go in through the
> low level listener interface.
>
>  We could consider exposing the JobProgressListener directly - I think
> it's been factored nicely so it's fairly decoupled from the UI. The concern
> is this is a semi-internal piece of functionality and something we might,
> e.g. want to change the API of over time.
>
>  - Patrick
>
>
> On Wed, Apr 2, 2014 at 3:39 PM, Philip Ogren <ph...@oracle.com> wrote:
>
>> What I'd like is a way to capture the information provided on the stages
>> page (i.e. cluster:4040/stages via IndexPage).  Looking through the Spark
>> code, it doesn't seem like it is possible to directly query for specific
>> facts such as how many tasks have succeeded or how many total tasks there
>> are for a given active stage.  Instead, it looks like all the data for the
>> page is generated at once using information from the JobProgressListener.
>> It doesn't seem like I have any way to programmatically access this
>> information myself.  I can't even instantiate my own JobProgressListener
>> because it is spark package private.  I could implement my own SparkListener
>> and gather up the information myself.  It feels a bit awkward since classes
>> like Task and TaskInfo are also spark package private.  It does seem
>> possible to gather up what I need, but it seems like this sort of
>> information should just be available without implementing a custom
>> SparkListener (or worse, screen scraping the HTML generated by StageTable!)
>>
>> I was hoping that I would find the answer in MetricsServlet which is
>> turned on by default.  It seems that when I visit
>> http://cluster:4040/metrics/json/ I should be able to get everything I
>> want but I don't see the basic stage/task progress information I would
>> expect.  Are there special metrics properties that I should set to get this
>> info?  I think this would be the best solution - just give it the right URL
>> and parse the resulting JSON - but I can't seem to figure out how to do
>> this or if it is possible.
>>
>> Any advice is appreciated.
>>
>> Thanks,
>> Philip
>>
>>
>>
>> On 04/01/2014 09:43 AM, Philip Ogren wrote:
>>
>>> Hi DB,
>>>
>>> Just wondering if you ever got an answer to your question about
>>> monitoring progress - either offline or through your own investigation.
>>>  Any findings would be appreciated.
>>>
>>> Thanks,
>>> Philip
>>>
>>> On 01/30/2014 10:32 PM, DB Tsai wrote:
>>>
>>>> Hi guys,
>>>>
>>>> When we're running a very long job, we would like to show users the
>>>> current progress of the map and reduce jobs. After looking at the API
>>>> documentation, I don't find anything for this. However, in the Spark UI,
>>>> I could see the progress of the tasks. Is there anything I missed?
>>>>
>>>> Thanks.
>>>>
>>>> Sincerely,
>>>>
>>>> DB Tsai
>>>> Machine Learning Engineer
>>>> Alpine Data Labs
>>>> --------------------------------------
>>>> Web: http://alpinenow.com/
>>>>
>>>
>>>
>>
>
>

Re: Is there a way to get the current progress of the job?

Posted by Philip Ogren <ph...@oracle.com>.
I can appreciate the reluctance to expose something like the 
JobProgressListener as a public interface.  It's exactly the sort of 
thing that you want to deprecate as soon as something better comes along 
and can be a real pain when trying to maintain the level of backwards 
compatibility that we all expect from commercial grade software.  
Instead of simply marking it private and therefore unavailable to Spark 
developers, it might be worth incorporating something like a @Beta 
annotation 
<http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/annotations/Beta.html> 
which you could sprinkle liberally throughout Spark that communicates 
"hey, use this if you want, because it's here now" and "don't come crying 
if we rip it out or change it later."  This might be better than simply 
marking so many useful functions/classes as private.  I bet such an 
annotation could generate a compile warning/error for those who don't 
want to risk using them.
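[Editor's note: as a sketch of the idea, such a marker could be as small as the Scala below. The annotation name mirrors Guava's @Beta, but the `StageProgress` class and everything else here are hypothetical illustrations, not Spark APIs; actually turning usage into a compile warning would additionally require a compiler plugin or lint rule.]

```scala
import scala.annotation.StaticAnnotation

// Hypothetical marker: the annotated API may change or be removed
// without a deprecation cycle. On its own this is documentation only;
// emitting a compile-time warning would need a compiler plugin.
class Beta extends StaticAnnotation

// Hypothetical example of a semi-internal progress API exposed under the marker.
@Beta
class StageProgress(val stageId: Int, val succeeded: Int, val total: Int) {
  // Fraction of tasks finished for the stage, in [0, 1].
  def fraction: Double =
    if (total == 0) 0.0 else succeeded.toDouble / total
}

object BetaDemo extends App {
  val p = new StageProgress(1, 3, 10)
  println(f"stage ${p.stageId}: ${p.fraction * 100}%.0f%% complete")
}
```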


On 04/02/2014 06:40 PM, Patrick Wendell wrote:
> Hey Philip,
>
> Right now there is no mechanism for this. You have to go in through 
> the low level listener interface.
>
> We could consider exposing the JobProgressListener directly - I think 
> it's been factored nicely so it's fairly decoupled from the UI. The 
> concern is this is a semi-internal piece of functionality and 
> something we might, e.g. want to change the API of over time.
>
> - Patrick
>
>
> On Wed, Apr 2, 2014 at 3:39 PM, Philip Ogren <philip.ogren@oracle.com> wrote:
>
>     What I'd like is a way to capture the information provided on the
>     stages page (i.e. cluster:4040/stages via IndexPage).  Looking
>     through the Spark code, it doesn't seem like it is possible to
>     directly query for specific facts such as how many tasks have
>     succeeded or how many total tasks there are for a given active
>     stage.  Instead, it looks like all the data for the page is
>     generated at once using information from the JobProgressListener.
>     It doesn't seem like I have any way to programmatically access
>     this information myself.  I can't even instantiate my own
>     JobProgressListener because it is spark package private.  I could
>     implement my own SparkListener and gather up the information myself.
>     It feels a bit awkward since classes like Task and TaskInfo are
>     also spark package private.  It does seem possible to gather up
>     what I need, but it seems like this sort of information should just
>     be available without implementing a custom SparkListener (or
>     worse, screen scraping the HTML generated by StageTable!)
>
>     I was hoping that I would find the answer in MetricsServlet which
>     is turned on by default.  It seems that when I visit
>     http://cluster:4040/metrics/json/ I should be able to get
>     everything I want but I don't see the basic stage/task progress
>     information I would expect.  Are there special metrics properties
>     that I should set to get this info?  I think this would be the
>     best solution - just give it the right URL and parse the resulting
>     JSON - but I can't seem to figure out how to do this or if it is
>     possible.
>
>     Any advice is appreciated.
>
>     Thanks,
>     Philip
>
>
>
>     On 04/01/2014 09:43 AM, Philip Ogren wrote:
>
>         Hi DB,
>
>         Just wondering if you ever got an answer to your question
>         about monitoring progress - either offline or through your own
>         investigation.  Any findings would be appreciated.
>
>         Thanks,
>         Philip
>
>         On 01/30/2014 10:32 PM, DB Tsai wrote:
>
>             Hi guys,
>
>             When we're running a very long job, we would like to show
>             users the current progress of the map and reduce jobs. After
>             looking at the API documentation, I don't find anything for
>             this. However, in the Spark UI, I could see the progress of
>             the tasks. Is there anything I missed?
>
>             Thanks.
>
>             Sincerely,
>
>             DB Tsai
>             Machine Learning Engineer
>             Alpine Data Labs
>             --------------------------------------
>             Web: http://alpinenow.com/
>
>
>
>


Re: Is there a way to get the current progress of the job?

Posted by Patrick Wendell <pw...@gmail.com>.
Hey Philip,

Right now there is no mechanism for this. You have to go in through the low
level listener interface.

We could consider exposing the JobProgressListener directly - I think it's
been factored nicely so it's fairly decoupled from the UI. The concern is
this is a semi-internal piece of functionality and something we might, e.g.
want to change the API of over time.

- Patrick


On Wed, Apr 2, 2014 at 3:39 PM, Philip Ogren <ph...@oracle.com> wrote:

> What I'd like is a way to capture the information provided on the stages
> page (i.e. cluster:4040/stages via IndexPage).  Looking through the Spark
> code, it doesn't seem like it is possible to directly query for specific
> facts such as how many tasks have succeeded or how many total tasks there
> are for a given active stage.  Instead, it looks like all the data for the
> page is generated at once using information from the JobProgressListener.
> It doesn't seem like I have any way to programmatically access this
> information myself.  I can't even instantiate my own JobProgressListener
> because it is spark package private.  I could implement my own SparkListener
> and gather up the information myself.  It feels a bit awkward since classes
> like Task and TaskInfo are also spark package private.  It does seem
> possible to gather up what I need, but it seems like this sort of
> information should just be available without implementing a custom
> SparkListener (or worse, screen scraping the HTML generated by StageTable!)
>
> I was hoping that I would find the answer in MetricsServlet which is
> turned on by default.  It seems that when I visit
> http://cluster:4040/metrics/json/ I should be able to get everything I
> want but I don't see the basic stage/task progress information I would
> expect.  Are there special metrics properties that I should set to get this
> info?  I think this would be the best solution - just give it the right URL
> and parse the resulting JSON - but I can't seem to figure out how to do
> this or if it is possible.
>
> Any advice is appreciated.
>
> Thanks,
> Philip
>
>
>
> On 04/01/2014 09:43 AM, Philip Ogren wrote:
>
>> Hi DB,
>>
>> Just wondering if you ever got an answer to your question about
>> monitoring progress - either offline or through your own investigation.
>>  Any findings would be appreciated.
>>
>> Thanks,
>> Philip
>>
>> On 01/30/2014 10:32 PM, DB Tsai wrote:
>>
>>> Hi guys,
>>>
>>> When we're running a very long job, we would like to show users the
>>> current progress of the map and reduce jobs. After looking at the API
>>> documentation, I don't find anything for this. However, in the Spark UI,
>>> I could see the progress of the tasks. Is there anything I missed?
>>>
>>> Thanks.
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> Machine Learning Engineer
>>> Alpine Data Labs
>>> --------------------------------------
>>> Web: http://alpinenow.com/
>>>
>>
>>
>

Re: Is there a way to get the current progress of the job?

Posted by Philip Ogren <ph...@oracle.com>.
What I'd like is a way to capture the information provided on the stages 
page (i.e. cluster:4040/stages via IndexPage).  Looking through the 
Spark code, it doesn't seem like it is possible to directly query for 
specific facts such as how many tasks have succeeded or how many total 
tasks there are for a given active stage.  Instead, it looks like all 
the data for the page is generated at once using information from the 
JobProgressListener. It doesn't seem like I have any way to 
programmatically access this information myself.  I can't even 
instantiate my own JobProgressListener because it is spark package 
private.  I could implement my own SparkListener and gather up the 
information myself.  It feels a bit awkward since classes like Task and 
TaskInfo are also spark package private.  It does seem possible to 
gather up what I need, but it seems like this sort of information should 
just be available without implementing a custom SparkListener (or 
worse, screen scraping the HTML generated by StageTable!)

I was hoping that I would find the answer in MetricsServlet which is 
turned on by default.  It seems that when I visit 
http://cluster:4040/metrics/json/ I should be able to get everything I 
want but I don't see the basic stage/task progress information I would 
expect.  Are there special metrics properties that I should set to get 
this info?  I think this would be the best solution - just give it the 
right URL and parse the resulting JSON - but I can't seem to figure out 
how to do this or if it is possible.
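[Editor's note: polling that endpoint could look like the Scala sketch below, using only the standard library. The URL and port are the defaults discussed in this thread; as the thread establishes, the stage/task progress fields Philip wants are not present in the response before Spark 1.0.]

```scala
import scala.io.Source
import scala.util.Try

object MetricsPoller {
  // Fetch the raw JSON from the MetricsServlet, or None if the URL is
  // malformed or the UI is unreachable (e.g. the application finished).
  def fetch(url: String): Option[String] =
    Try(Source.fromURL(url).mkString).toOption

  def main(args: Array[String]): Unit =
    fetch("http://cluster:4040/metrics/json/") match {
      case Some(json) => println(json.take(200))
      case None       => println("metrics endpoint not reachable")
    }
}
```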

Any advice is appreciated.

Thanks,
Philip


On 04/01/2014 09:43 AM, Philip Ogren wrote:
> Hi DB,
>
> Just wondering if you ever got an answer to your question about 
> monitoring progress - either offline or through your own 
> investigation.  Any findings would be appreciated.
>
> Thanks,
> Philip
>
> On 01/30/2014 10:32 PM, DB Tsai wrote:
>> Hi guys,
>>
>> When we're running a very long job, we would like to show users the 
>> current progress of the map and reduce jobs. After looking at the API 
>> documentation, I don't find anything for this. However, in the Spark UI, 
>> I could see the progress of the tasks. Is there anything I missed?
>>
>> Thanks.
>>
>> Sincerely,
>>
>> DB Tsai
>> Machine Learning Engineer
>> Alpine Data Labs
>> --------------------------------------
>> Web: http://alpinenow.com/
>


Re: Is there a way to get the current progress of the job?

Posted by Mayur Rustagi <ma...@gmail.com>.
You can get detailed information regarding each stage through the Spark
listener interface. Multiple jobs may be compressed into a single stage, so
job-wise information would be the same as what Spark itself tracks.
Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>
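[Editor's note: the listener approach Mayur describes might look like the sketch below, written against the Spark 1.0-era scheduler API. Event and field names changed between releases, so treat it as illustrative; `ProgressListener` is a hypothetical name, and it needs Spark on the classpath.]

```scala
import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted, SparkListenerTaskEnd}

// Tracks per-stage task completion. Register it with
// sc.addSparkListener(new ProgressListener) and poll progress(stageId).
class ProgressListener extends SparkListener {
  private val totalTasks = mutable.Map.empty[Int, Int] // stageId -> task count
  private val doneTasks  = mutable.Map.empty[Int, Int] // stageId -> finished

  override def onStageSubmitted(e: SparkListenerStageSubmitted): Unit =
    synchronized { totalTasks(e.stageInfo.stageId) = e.stageInfo.numTasks }

  override def onTaskEnd(e: SparkListenerTaskEnd): Unit =
    synchronized { doneTasks(e.stageId) = doneTasks.getOrElse(e.stageId, 0) + 1 }

  // Fraction of tasks finished for a stage, in [0, 1].
  def progress(stageId: Int): Double = synchronized {
    val total = totalTasks.getOrElse(stageId, 0)
    if (total == 0) 0.0 else doneTasks.getOrElse(stageId, 0).toDouble / total
  }
}
```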



On Tue, Apr 1, 2014 at 11:18 AM, Kevin Markey <ke...@oracle.com> wrote:

>  The discussion there hits on the distinction between jobs and stages.  When
> looking at one application, there are hundreds of stages, sometimes
> thousands; it depends on the data and the task.  And the UI seems to track
> stages, so one could independently track them for such a job.  But what
> if -- as occurs in another application -- there are only one or two stages,
> but lots of data passing through those one or two stages?
>
> Kevin Markey
>
>
>
> On 04/01/2014 09:55 AM, Mark Hamstra wrote:
>
> Some related discussion: https://github.com/apache/spark/pull/246
>
>
> On Tue, Apr 1, 2014 at 8:43 AM, Philip Ogren <ph...@oracle.com> wrote:
>
>> Hi DB,
>>
>> Just wondering if you ever got an answer to your question about
>> monitoring progress - either offline or through your own investigation.
>>  Any findings would be appreciated.
>>
>> Thanks,
>> Philip
>>
>>
>> On 01/30/2014 10:32 PM, DB Tsai wrote:
>>
>>> Hi guys,
>>>
>>> When we're running a very long job, we would like to show users the
>>> current progress of the map and reduce jobs. After looking at the API
>>> documentation, I don't find anything for this. However, in the Spark UI,
>>> I could see the progress of the tasks. Is there anything I missed?
>>>
>>> Thanks.
>>>
>>> Sincerely,
>>>
>>> DB Tsai
>>> Machine Learning Engineer
>>> Alpine Data Labs
>>> --------------------------------------
>>> Web: http://alpinenow.com/
>>>
>>
>>
>
>

Re: Is there a way to get the current progress of the job?

Posted by Mark Hamstra <ma...@clearstorydata.com>.
Some related discussion: https://github.com/apache/spark/pull/246


On Tue, Apr 1, 2014 at 8:43 AM, Philip Ogren <ph...@oracle.com> wrote:

> Hi DB,
>
> Just wondering if you ever got an answer to your question about monitoring
> progress - either offline or through your own investigation.  Any findings
> would be appreciated.
>
> Thanks,
> Philip
>
>
> On 01/30/2014 10:32 PM, DB Tsai wrote:
>
>> Hi guys,
>>
>> When we're running a very long job, we would like to show users the
>> current progress of the map and reduce jobs. After looking at the API
>> documentation, I don't find anything for this. However, in the Spark UI,
>> I could see the progress of the tasks. Is there anything I missed?
>>
>> Thanks.
>>
>> Sincerely,
>>
>> DB Tsai
>> Machine Learning Engineer
>> Alpine Data Labs
>> --------------------------------------
>> Web: http://alpinenow.com/
>>
>
>