Posted to user@spark.apache.org by "Wang, Ningjun (LNG-NPV)" <ni...@lexisnexis.com> on 2016/02/23 19:53:11 UTC

How to get progress information of an RDD operation

How can I get progress information of an RDD operation? For example:

val lines = sc.textFile("c:/temp/input.txt")  // an RDD of millions of lines
lines.foreach(line => {
        handleLine(line)
    })

The input.txt contains millions of lines. The entire operation takes 6 hours. I want to print out how many lines have been processed every minute so the user knows the progress. How can I do that?

One way I am thinking of is to use an accumulator, e.g.



val lines = sc.textFile("c:/temp/input.txt")
val acCount = sc.accumulator(0L)   // counts lines handled across all tasks
lines.foreach(line => {
        handleLine(line)
        acCount += 1
})


However, how can I print out acCount every minute?
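
Something along these lines is what I have in mind: a second driver-side thread that prints acCount.value once a minute while the foreach action runs. This is only a rough, untested sketch (it assumes the job is launched from the driver, e.g. in spark-shell or a main method, and reuses the lines, acCount and handleLine from the snippet above; the driver only sees accumulator updates as tasks finish, so the printed number lags behind the lines actually handled):

// Rough sketch (untested): report progress from a second driver-side thread,
// using the lines and acCount defined above.
val reporter = new Thread(new Runnable {
  override def run(): Unit = {
    while (true) {
      // Accumulator updates are merged in as tasks complete, so this is approximate.
      println(s"Lines processed so far (approx.): ${acCount.value}")
      Thread.sleep(60 * 1000)   // once a minute
    }
  }
})
reporter.setDaemon(true)   // the thread goes away when the application exits
reporter.start()

// The action blocks this thread until the whole RDD has been processed.
lines.foreach(line => {
  handleLine(line)
  acCount += 1
})

println(s"Done. Total lines: ${acCount.value}")

But I am not sure how accurate acCount.value is while tasks are still running, or whether there is a better way.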



Ningjun


RE: How to get progress information of an RDD operation

Posted by "Wang, Ningjun (LNG-NPV)" <ni...@lexisnexis.com>.
Yes, I am looking for a programmatic way of tracking progress. SparkListener.scala does not track at the RDD item level, so it will not tell how many items have been processed.

I wonder, is there any way to track the accumulator value, as it reflects the correct number of items processed so far?
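
For what it is worth, task-level (rather than item-level) progress does seem to be reachable programmatically through the status API. A rough, untested sketch, assuming Spark 1.2+ (where SparkContext exposes statusTracker) and run from a separate driver thread while the action is executing:

// Polls the status API once a minute; reports completed tasks per active stage.
// This is coarser than a per-line count, but needs no changes to the job itself.
val tracker = sc.statusTracker
val monitor = new Thread(new Runnable {
  override def run(): Unit = {
    while (true) {
      for (stageId <- tracker.getActiveStageIds();
           info    <- tracker.getStageInfo(stageId)) {
        println(s"Stage $stageId: ${info.numCompletedTasks()} of ${info.numTasks()} tasks done")
      }
      Thread.sleep(60 * 1000)
    }
  }
})
monitor.setDaemon(true)
monitor.start()

But that still does not give the number of items processed, which is what I need.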

Ningjun

From: Ted Yu [mailto:yuzhihong@gmail.com]
Sent: Tuesday, February 23, 2016 2:30 PM
To: Kevin Mellott
Cc: Wang, Ningjun (LNG-NPV); user@spark.apache.org
Subject: Re: How to get progress information of an RDD operation

I think Ningjun was looking for a programmatic way of tracking progress.

I took a look at:
./core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala

but there don't seem to be fine-grained events directly reflecting what Ningjun is looking for.

On Tue, Feb 23, 2016 at 11:24 AM, Kevin Mellott <ke...@gmail.com> wrote:
Have you considered using the Spark Web UI to view progress on your job? It does a very good job of showing the progress of the overall job, and also lets you drill into the individual tasks and server activity.

On Tue, Feb 23, 2016 at 12:53 PM, Wang, Ningjun (LNG-NPV) <ni...@lexisnexis.com> wrote:
How can I get progress information of an RDD operation? For example:

val lines = sc.textFile("c:/temp/input.txt")  // an RDD of millions of lines
lines.foreach(line => {
        handleLine(line)
    })
The input.txt contains millions of lines. The entire operation takes 6 hours. I want to print out how many lines have been processed every minute so the user knows the progress. How can I do that?

One way I am thinking of is to use an accumulator, e.g.



val lines = sc.textFile("c:/temp/input.txt")
val acCount = sc.accumulator(0L)
lines.foreach(line => {
        handleLine(line)
        acCount += 1
})

However, how can I print out acCount every minute?


Ningjun




Re: How to get progress information of an RDD operation

Posted by Ted Yu <yu...@gmail.com>.
I think Ningjun was looking for a programmatic way of tracking progress.

I took a look at:
./core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala

but there don't seem to be fine-grained events directly reflecting
what Ningjun is looking for.
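
The closest handle seems to be task granularity. For example, a listener could sum the per-task input record counts as tasks finish; the total then only advances when a task completes, not per line. A rough, untested sketch (names are illustrative; it assumes the Spark 1.x metrics API, where taskMetrics.inputMetrics is an Option):

import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Sums recordsRead from the input metrics of every finished task.
class RecordCountListener extends SparkListener {
  val recordsRead = new AtomicLong(0L)

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    for (metrics <- Option(taskEnd.taskMetrics);   // metrics can be null for failed tasks
         input   <- metrics.inputMetrics) {
      recordsRead.addAndGet(input.recordsRead)
    }
  }
}

val listener = new RecordCountListener
sc.addSparkListener(listener)
// Then, e.g. from a timer thread: println(s"Records read so far: ${listener.recordsRead.get}")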

On Tue, Feb 23, 2016 at 11:24 AM, Kevin Mellott <ke...@gmail.com>
wrote:

> Have you considered using the Spark Web UI to view progress on your job?
> It does a very good job of showing the progress of the overall job, and
> also lets you drill into the individual tasks and server activity.
>
> On Tue, Feb 23, 2016 at 12:53 PM, Wang, Ningjun (LNG-NPV) <
> ningjun.wang@lexisnexis.com> wrote:
>
>> How can I get progress information of an RDD operation? For example:
>>
>>
>>
>> val lines = sc.textFile("c:/temp/input.txt")  // an RDD of millions of lines
>> lines.foreach(line => {
>>         handleLine(line)
>>     })
>>
>> The input.txt contains millions of lines. The entire operation takes 6
>> hours. I want to print out how many lines have been processed every minute
>> so the user knows the progress. How can I do that?
>>
>>
>>
>> One way I am thinking of is to use an accumulator, e.g.
>>
>>
>>
>>
>>
>> val lines = sc.textFile("c:/temp/input.txt")
>> val acCount = sc.accumulator(0L)
>> lines.foreach(line => {
>>         handleLine(line)
>>         acCount += 1
>> })
>>
>> However, how can I print out acCount every minute?
>>
>>
>>
>>
>>
>> Ningjun
>>
>>
>>
>
>

Re: How to get progress information of an RDD operation

Posted by Kevin Mellott <ke...@gmail.com>.
Have you considered using the Spark Web UI to view progress on your job? It
does a very good job of showing the progress of the overall job, and also
lets you drill into the individual tasks and server activity.

On Tue, Feb 23, 2016 at 12:53 PM, Wang, Ningjun (LNG-NPV) <
ningjun.wang@lexisnexis.com> wrote:

> How can I get progress information of an RDD operation? For example:
>
>
>
> val lines = sc.textFile("c:/temp/input.txt")  // an RDD of millions of lines
> lines.foreach(line => {
>         handleLine(line)
>     })
>
> The input.txt contains millions of lines. The entire operation takes 6
> hours. I want to print out how many lines have been processed every minute
> so the user knows the progress. How can I do that?
>
>
>
> One way I am thinking of is to use an accumulator, e.g.
>
>
>
>
>
> val lines = sc.textFile("c:/temp/input.txt")
> val acCount = sc.accumulator(0L)
> lines.foreach(line => {
>         handleLine(line)
>         acCount += 1
> })
>
> However, how can I print out acCount every minute?
>
>
>
>
>
> Ningjun
>
>
>