Posted to dev@spark.apache.org by anshu shukla <an...@gmail.com> on 2015/06/18 20:24:53 UTC
Latency between the RDD in Streaming
Is there any fixed way to find among RDD in stream processing systems,
in the distributed set-up?
--
Thanks & Regards,
Anshu Shukla
Re: Latency between the RDD in Streaming
Posted by anshu shukla <an...@gmail.com>.
How can I find out how long a particular RDD remained in the
pipeline?
On Fri, Jun 19, 2015 at 7:59 AM, Tathagata Das <td...@databricks.com> wrote:
> Why do you need to uniquely identify the message? All you need is the time
> when the message was inserted by the receiver, and when it is processed,
> isn't it?
--
Thanks & Regards,
Anshu Shukla
Re: Latency between the RDD in Streaming
Posted by Tathagata Das <td...@databricks.com>.
Why do you need to uniquely identify the message? All you need is the time
when the message was inserted by the receiver, and when it is processed,
isn't it?
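The point above, that only the receive time and the processing time are needed, can be sketched without any message id. This is a minimal illustration in plain Python rather than Spark code: the helper names `stamp` and `latency_ms` are hypothetical, and it assumes the machine that stamps the record and the machine that reads the stamp have reasonably synchronized clocks.

```python
import time


def stamp(record, now_ms=None):
    """Pair a record with its ingest time in epoch milliseconds.

    Hypothetical helper: a custom receiver would do this when it stores a record.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return (now_ms, record)


def latency_ms(stamped, now_ms=None):
    """Return how long a stamped record has been in the pipeline, in ms.

    Hypothetical helper: call it at the final step, once the result is available.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    ingest_ms, _record = stamped
    return now_ms - ingest_ms


# Fixed clock values are passed in here only to keep the example deterministic.
stamped = stamp("some record", now_ms=1_000)
print(latency_ms(stamped, now_ms=1_450))  # 450
```

Because the timestamp travels inside the record itself, every record carries its own start time and nothing has to be matched up by id at the sink.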
On Thu, Jun 18, 2015 at 2:28 PM, anshu shukla <an...@gmail.com>
wrote:
> Thanks a lot, but I have already tried the second way. The problem with
> that is how to identify a particular RDD from source to sink (as we can
> do by passing a msg id in Storm). For that I just updated the RDD and added
> a msgID (as a static variable), but while dumping them to a file some of the
> tuples of the RDD failed or went missing (approx. 3000, and the data rate is
> approx. 1500 tuples/sec).
Re: Latency between the RDD in Streaming
Posted by anshu shukla <an...@gmail.com>.
Thanks a lot, but I have already tried the second way. The problem with
that is how to identify a particular RDD from source to sink (as we can
do by passing a msg id in Storm). For that I just updated the RDD and added
a msgID (as a static variable), but while dumping them to a file some of the
tuples of the RDD failed or went missing (approx. 3000, and the data rate is
approx. 1500 tuples/sec).
On Fri, Jun 19, 2015 at 2:50 AM, Tathagata Das <td...@databricks.com> wrote:
> A couple of ways.
>
> 1. Easy but approximate way: Find the scheduling delay and processing time
> using the StreamingListener interface, and then calculate "end-to-end delay =
> 0.5 * batch interval + scheduling delay + processing time". The 0.5 * batch
> interval term is the approximate average batching delay across all the
> records in the batch.
>
> 2. Hard but precise way: You could build a custom receiver that embeds the
> current timestamp in the records, and then compare it with the timestamp
> at the final step of the records. Assuming the executor and driver clocks
> are reasonably in sync, this will measure the latency between the time the
> record is received by the system and the time its result is available.
--
Thanks & Regards,
Anshu Shukla
Re: Latency between the RDD in Streaming
Posted by Tathagata Das <td...@databricks.com>.
A couple of ways.
1. Easy but approximate way: Find the scheduling delay and processing time
using the StreamingListener interface, and then calculate "end-to-end delay =
0.5 * batch interval + scheduling delay + processing time". The 0.5 * batch
interval term is the approximate average batching delay across all the
records in the batch.
2. Hard but precise way: You could build a custom receiver that embeds the
current timestamp in the records, and then compare it with the timestamp
at the final step of the records. Assuming the executor and driver clocks
are reasonably in sync, this will measure the latency between the time the
record is received by the system and the time its result is available.
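As a worked instance of the approximate formula in the first approach, here is a minimal sketch in plain Python. The function name and the sample numbers are hypothetical. In a real job the scheduling delay and processing time would come from the per-batch information reported to a StreamingListener, which is not modeled here.

```python
def approx_end_to_end_delay_ms(batch_interval_ms, scheduling_delay_ms, processing_time_ms):
    """Approximate end-to-end delay of one batch, in milliseconds.

    Implements: 0.5 * batch interval + scheduling delay + processing time.
    """
    # On average a record arrives halfway through the batch interval,
    # so the mean batching delay is half the interval.
    batching_delay_ms = 0.5 * batch_interval_ms
    return batching_delay_ms + scheduling_delay_ms + processing_time_ms


# Hypothetical numbers: 2 s batches, 150 ms scheduling delay, 800 ms processing time.
print(approx_end_to_end_delay_ms(2000, 150, 800))  # 1950.0
```

The result is an average over the records of a batch, not a per-record measurement. A record that arrives at the very start of the interval waits close to the full interval, and one that arrives at the end waits almost nothing.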
On Thu, Jun 18, 2015 at 2:12 PM, anshu shukla <an...@gmail.com>
wrote:
> Sorry, I missed the word LATENCY. For a large streaming query, how do I
> find the time taken by a particular RDD to travel from the initial
> DStream to the final DStream?
> Help, please!
Re: Latency between the RDD in Streaming
Posted by anshu shukla <an...@gmail.com>.
Sorry, I missed the word LATENCY. For a large streaming query, how do I
find the time taken by a particular RDD to travel from the initial
DStream to the final DStream?
Help, please!
On Fri, Jun 19, 2015 at 12:40 AM, Tathagata Das <td...@databricks.com> wrote:
> It's not clear what you are asking. Find "what" among RDD?
--
Thanks & Regards,
Anshu Shukla
Re: Latency between the RDD in Streaming
Posted by Tathagata Das <td...@databricks.com>.
It's not clear what you are asking. Find "what" among RDD?
On Thu, Jun 18, 2015 at 11:24 AM, anshu shukla <an...@gmail.com>
wrote:
> Is there any fixed way to find among RDD in stream processing systems,
> in the distributed set-up?