Posted to user@spark.apache.org by Philip Ogren <ph...@oracle.com> on 2014/04/01 17:43:41 UTC
Re: Is there a way to get the current progress of the job?
Hi DB,
Just wondering if you ever got an answer to your question about
monitoring progress - either offline or through your own investigation.
Any findings would be appreciated.
Thanks,
Philip
On 01/30/2014 10:32 PM, DB Tsai wrote:
> Hi guys,
>
> When we're running a very long job, we would like to show users the
> current progress of the map and reduce jobs. After looking at the API
> documentation, I didn't find anything for this. However, in the Spark
> UI I can see the progress of the tasks. Is there anything I missed?
>
> Thanks.
>
> Sincerely,
>
> DB Tsai
> Machine Learning Engineer
> Alpine Data Labs
> --------------------------------------
> Web: http://alpinenow.com/
Re: Is there a way to get the current progress of the job?
Posted by Philip Ogren <ph...@oracle.com>.
This is great news, thanks for the update! I will either wait for the
1.0 release or test it ahead of time from git, rather than trying
to pull the information out of JobLogger or create my own SparkListener.
On 04/02/2014 06:48 PM, Andrew Or wrote:
> Hi Philip,
>
> In the upcoming release of Spark 1.0 there will be a feature that
> provides for exactly what you describe: capturing the information
> displayed on the UI in JSON. More details will be provided in the
> documentation, but for now, anything before 0.9.1 can only go through
> JobLogger.scala, which outputs information in a somewhat arbitrary
> format and will be deprecated soon. If you find this feature useful,
> you can test it out by building the master branch of Spark yourself,
> following the instructions in https://github.com/apache/spark/pull/42.
>
> Andrew
Re: Is there a way to get the current progress of the job?
Posted by Andrew Or <an...@databricks.com>.
Hi Philip,
In the upcoming release of Spark 1.0 there will be a feature that provides
for exactly what you describe: capturing the information displayed on the
UI in JSON. More details will be provided in the documentation, but for
now, anything before 0.9.1 can only go through JobLogger.scala, which
outputs information in a somewhat arbitrary format and will be deprecated
soon. If you find this feature useful, you can test it out by building the
master branch of Spark yourself, following the instructions in
https://github.com/apache/spark/pull/42.
Andrew
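[For later readers: the feature Andrew describes shipped in Spark 1.0 as JSON event logging, where each scheduler event is written as one JSON line to an event log file. A minimal sketch of enabling it, assuming the 1.0-era property names (which postdate this thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: property names are assumed from the Spark 1.0 release.
// With event logging on, listener events are persisted as JSON lines
// that your own tooling (or the history server) can parse for progress.
val conf = new SparkConf()
  .setAppName("progress-demo")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "/tmp/spark-events")
val sc = new SparkContext(conf)
```
]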
On Wed, Apr 2, 2014 at 3:39 PM, Philip Ogren <ph...@oracle.com>wrote:
> What I'd like is a way to capture the information provided on the stages
> page (i.e. cluster:4040/stages via IndexPage). Looking through the Spark
> code, it doesn't seem like it is possible to directly query for specific
> facts such as how many tasks have succeeded or how many total tasks there
> are for a given active stage. Instead, it looks like all the data for the
> page is generated at once using information from the JobProgressListener.
> It doesn't seem like I have any way to programmatically access this
> information myself. I can't even instantiate my own JobProgressListener
> because it is spark package private. I could implement my SparkListener
> and gather up the information myself. It feels a bit awkward since classes
> like Task and TaskInfo are also spark package private. It does seem
> possible to gather up what I need, but it seems like this sort of
> information should just be available without implementing a custom
> SparkListener (or, worse, screen-scraping the HTML generated by StageTable!)
>
> I was hoping that I would find the answer in MetricsServlet which is
> turned on by default. It seems that when I visit
> http://cluster:4040/metrics/json/ I should be able to get everything I
> want but I don't see the basic stage/task progress information I would
> expect. Are there special metrics properties that I should set to get this
> info? I think this would be the best solution - just give it the right URL
> and parse the resulting JSON - but I can't seem to figure out how to do
> this or if it is possible.
>
> Any advice is appreciated.
>
> Thanks,
> Philip
Re: Is there a way to get the current progress of the job?
Posted by Mark Hamstra <ma...@clearstorydata.com>.
https://issues.apache.org/jira/browse/SPARK-1081?jql=project%20%3D%20SPARK%20AND%20text%20~%20Annotate
On Thu, Apr 3, 2014 at 9:24 AM, Philip Ogren <ph...@oracle.com>wrote:
> I can appreciate the reluctance to expose something like the
> JobProgressListener as a public interface. It's exactly the sort of thing
> that you want to deprecate as soon as something better comes along and can
> be a real pain when trying to maintain the level of backwards
> compatibility that we all expect from commercial grade software. Instead
> of simply marking it private and therefore unavailable to Spark developers,
> it might be worth incorporating something like a @Beta annotation
> <http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/annotations/Beta.html>
> which you could sprinkle liberally throughout Spark, communicating "hey,
> use this if you want to, because it's here now" and "don't come crying if we
> rip it out or change it later." This might be better than simply marking
> so many useful functions/classes as private. I bet such an annotation
> could generate a compile warning/error for those who don't want to risk
> using it.
Re: Is there a way to get the current progress of the job?
Posted by Philip Ogren <ph...@oracle.com>.
I can appreciate the reluctance to expose something like the
JobProgressListener as a public interface. It's exactly the sort of
thing that you want to deprecate as soon as something better comes
along, and it can be a real pain when trying to maintain the level of
backwards compatibility that we all expect from commercial-grade
software. Instead of simply marking it private and therefore
unavailable to Spark developers, it might be worth incorporating
something like a @Beta annotation
<http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/annotations/Beta.html>
which you could sprinkle liberally throughout Spark, communicating
"hey, use this if you want to, because it's here now" and "don't come
crying if we rip it out or change it later." This might be better than
simply marking so many useful functions/classes as private. I bet such
an annotation could generate a compile warning/error for those who
don't want to risk using it.
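[The suggestion above can be sketched in a few lines of Scala. This is hypothetical, not part of Spark at the time; Spark 1.0 did later introduce annotations in this spirit, such as @DeveloperApi and @Experimental:

```scala
import scala.annotation.StaticAnnotation

// Hypothetical sketch of the suggested marker: an API tagged @Beta is
// usable now but may change or disappear in any release. Producing an
// actual compile-time warning/error would additionally require a
// compiler plugin or lint rule that checks for this annotation.
class Beta extends StaticAnnotation

// Usage sketch:
// @Beta
// class JobProgressListener(sc: SparkContext) { ... }
```
]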
On 04/02/2014 06:40 PM, Patrick Wendell wrote:
> Hey Phillip,
>
> Right now there is no mechanism for this. You have to go in through
> the low level listener interface.
>
> We could consider exposing the JobProgressListener directly - I think
> it's been factored nicely so it's fairly decoupled from the UI. The
> concern is that this is a semi-internal piece of functionality whose
> API we might want to change over time.
>
> - Patrick
Re: Is there a way to get the current progress of the job?
Posted by Patrick Wendell <pw...@gmail.com>.
Hey Philip,
Right now there is no mechanism for this. You have to go in through the low
level listener interface.
We could consider exposing the JobProgressListener directly - I think it's
been factored nicely, so it's fairly decoupled from the UI. The concern is
that this is a semi-internal piece of functionality whose API we might want
to change over time.
- Patrick
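[For concreteness, going through the low-level listener interface that Patrick mentions looks roughly like this. A sketch only: the event and field names are assumptions from the 0.9/1.0-era API and changed across Spark versions, so treat every identifier here as unverified:

```scala
import org.apache.spark.scheduler.{SparkListener,
  SparkListenerStageSubmitted, SparkListenerTaskEnd}

// Sketch: tally tasks as the scheduler reports them, then expose a
// progress fraction. The stageInfo.numTasks field name in particular
// is an assumption that varies by Spark version.
class ProgressListener extends SparkListener {
  @volatile private var totalTasks = 0
  @volatile private var finishedTasks = 0

  override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted) {
    totalTasks += stageSubmitted.stageInfo.numTasks
  }

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
    finishedTasks += 1
  }

  // Fraction of known tasks that have finished, in [0, 1].
  def progress: Double =
    if (totalTasks == 0) 0.0 else finishedTasks.toDouble / totalTasks
}

// Registration on an existing SparkContext sc:
// sc.addSparkListener(new ProgressListener)
```
]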
On Wed, Apr 2, 2014 at 3:39 PM, Philip Ogren <ph...@oracle.com>wrote:
> What I'd like is a way to capture the information provided on the stages
> page (i.e. cluster:4040/stages via IndexPage). Looking through the Spark
> code, it doesn't seem like it is possible to directly query for specific
> facts such as how many tasks have succeeded or how many total tasks there
> are for a given active stage. Instead, it looks like all the data for the
> page is generated at once using information from the JobProgressListener.
> It doesn't seem like I have any way to programmatically access this
> information myself. I can't even instantiate my own JobProgressListener
> because it is spark package private. I could implement my SparkListener
> and gather up the information myself. It feels a bit awkward since classes
> like Task and TaskInfo are also spark package private. It does seem
> possible to gather up what I need, but it seems like this sort of
> information should just be available without implementing a custom
> SparkListener (or, worse, screen-scraping the HTML generated by StageTable!)
>
> I was hoping that I would find the answer in MetricsServlet which is
> turned on by default. It seems that when I visit
> http://cluster:4040/metrics/json/ I should be able to get everything I
> want but I don't see the basic stage/task progress information I would
> expect. Are there special metrics properties that I should set to get this
> info? I think this would be the best solution - just give it the right URL
> and parse the resulting JSON - but I can't seem to figure out how to do
> this or if it is possible.
>
> Any advice is appreciated.
>
> Thanks,
> Philip
Re: Is there a way to get the current progress of the job?
Posted by Philip Ogren <ph...@oracle.com>.
What I'd like is a way to capture the information provided on the stages
page (i.e. cluster:4040/stages via IndexPage). Looking through the
Spark code, it doesn't seem like it is possible to directly query for
specific facts such as how many tasks have succeeded or how many total
tasks there are for a given active stage. Instead, it looks like all
the data for the page is generated at once using information from the
JobProgressListener. It doesn't seem like I have any way to
programmatically access this information myself. I can't even
instantiate my own JobProgressListener because it is spark package
private. I could implement my SparkListener and gather up the
information myself. It feels a bit awkward since classes like Task and
TaskInfo are also spark package private. It does seem possible to
gather up what I need, but it seems like this sort of information should
just be available without implementing a custom SparkListener (or,
worse, screen-scraping the HTML generated by StageTable!)
I was hoping that I would find the answer in MetricsServlet which is
turned on by default. It seems that when I visit
http://cluster:4040/metrics/json/ I should be able to get everything I
want but I don't see the basic stage/task progress information I would
expect. Are there special metrics properties that I should set to get
this info? I think this would be the best solution - just give it the
right URL and parse the resulting JSON - but I can't seem to figure out
how to do this or if it is possible.
Any advice is appreciated.
Thanks,
Philip
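[For anyone poking at the MetricsServlet described above, dumping its output is at least easy; this just fetches the URL quoted in the message, and, as noted, there is no guarantee that stage/task progress appears in the payload:

```scala
import scala.io.Source

// Sketch: print whatever the driver's MetricsServlet currently exposes.
// Replace "cluster" with your driver host, per the URL above.
val json = Source.fromURL("http://cluster:4040/metrics/json/").mkString
println(json)
```
]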
On 04/01/2014 09:43 AM, Philip Ogren wrote:
> Hi DB,
>
> Just wondering if you ever got an answer to your question about
> monitoring progress - either offline or through your own
> investigation. Any findings would be appreciated.
>
> Thanks,
> Philip
Re: Is there a way to get the current progress of the job?
Posted by Mayur Rustagi <ma...@gmail.com>.
You can get detailed information regarding each stage through the Spark
listener interface. Note that multiple operations may be compressed into a
single stage, so per-job information would be no more granular than what
Spark itself tracks.
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>
On Tue, Apr 1, 2014 at 11:18 AM, Kevin Markey <ke...@oracle.com>wrote:
> The discussion there hits on the distinction of jobs and stages. When
> looking at one application, there are hundreds of stages, sometimes
> thousands. Depends on the data and the task. And the UI seems to track
> stages. And one could independently track them for such a job. But what
> if -- as occurs in another application -- there's only one or two stages,
> but lots of data passing through those 1 or 2 stages?
>
> Kevin Markey
Re: Is there a way to get the current progress of the job?
Posted by Mark Hamstra <ma...@clearstorydata.com>.
Some related discussion: https://github.com/apache/spark/pull/246
On Tue, Apr 1, 2014 at 8:43 AM, Philip Ogren <ph...@oracle.com>wrote:
> Hi DB,
>
> Just wondering if you ever got an answer to your question about monitoring
> progress - either offline or through your own investigation. Any findings
> would be appreciated.
>
> Thanks,
> Philip
>