Posted to user@spark.apache.org by Chris Fregly <ch...@fregly.com> on 2014/09/05 21:42:07 UTC

Re: Shared variable in Spark Streaming

good question, soumitra.  it's a bit confusing.

to break TD's code down a bit:

dstream.count() is a transformation operation (returns a new DStream).  it
executes lazily, runs on the cluster against the underlying RDDs that come
through in each batch, and returns a new DStream whose RDD for each batch
contains a single element: the number of records in that batch's RDD.
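
for example, a minimal sketch (lines and ssc are stand-ins, not from the thread):

// assumes lines is an existing DStream[String], e.g. from ssc.socketTextStream(...)
val perBatchCounts = lines.count()  // transformation: DStream[Long] with one element per batch
perBatchCounts.print()              // output operation: triggers execution, prints each batch's count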

dstream.foreachRDD() is an output/action operation (returns something other
than a DStream - nothing in this case).  it triggers the lazy execution
above, pulls the per-batch count back to the driver via rdd.first(), and
increments globalCount locally in the driver.
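
to make that concrete, here is TD's snippet spelled out with comments:

var globalCount = 0L  // plain variable in the driver JVM, invisible to the workers

dstream.count().foreachRDD { rdd =>
  // this closure runs in the driver once per batch interval;
  // rdd.first() is an action that fetches the batch's single count element
  globalCount += rdd.first()
}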

per your specific question, RDD.count() is different in that it's an
output/action operation: it materializes the RDD on the cluster and returns
the number of elements to the driver as a plain Long.  confusing, indeed.
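
for contrast, a minimal RDD sketch (assuming sc is an existing SparkContext):

val rdd = sc.parallelize(1 to 100)
// count() is an action: it runs a job on the cluster and returns
// the result to the driver as a plain Long
val n: Long = rdd.count()  // n == 100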

accumulators are updated in parallel on the worker nodes across the cluster
and are read locally in the driver.
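
a minimal sketch of that pattern (spark 1.x accumulator API; the names are
illustrative, not from the thread):

// created in the driver; tasks on the workers can only add to it
val eventCount = ssc.sparkContext.accumulator(0L)

dstream.foreachRDD { rdd =>
  rdd.foreach(_ => eventCount += 1L)  // updated in parallel on the workers
  // .value can only be read here in the driver (foreachRDD runs its closure in the driver)
  println("events so far: " + eventCount.value)
}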




On Fri, Aug 8, 2014 at 7:36 AM, Soumitra Kumar <ku...@gmail.com>
wrote:

> I want to keep track of the events processed in a batch.
>
> How come 'globalCount' works for a DStream? I think a similar construct
> won't work for an RDD; that's why there are accumulators.
>
>
> On Fri, Aug 8, 2014 at 12:52 AM, Tathagata Das <tathagata.das1565@gmail.com> wrote:
>
>> Do you mean that you want a continuously updated count as more
>> events/records are received in the DStream (remember, DStream is a
>> continuous stream of data)? Assuming that is what you want, you can use a
>> global counter
>>
>> var globalCount = 0L
>>
>> dstream.count().foreachRDD(rdd => { globalCount += rdd.first() } )
>>
>> This globalCount variable will reside in the driver and will keep being
>> updated after every batch.
>>
>> TD
>>
>>
>> On Thu, Aug 7, 2014 at 10:16 PM, Soumitra Kumar <kumar.soumitra@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I want to count the number of elements in the DStream, like RDD.count().
>>> Since there is no such method in DStream, I thought of using
>>> DStream.count and an accumulator.
>>>
>>> How do I use DStream.count() to count the number of elements in a DStream?
>>>
>>> How do I create a shared variable in Spark Streaming?
>>>
>>> -Soumitra.
>>>
>>
>>
>

Re: Shared variable in Spark Streaming

Posted by Tathagata Das <ta...@gmail.com>.
Good explanation, Chris :)


