Posted to user@spark.apache.org by Esa Heikkinen <es...@student.tut.fi> on 2016/05/09 09:36:10 UTC

Re: Spark support for Complex Event Processing (CEP)

Sorry for the delayed answer. Yes, this is not pure "CEP", but it is 
quite close to it, or to many similar "functionalities".

My case is not so easy, because I don't want to compare against the 
original time schedule of the route.

I want to compare how closely the (ITS) system has estimated the 
arrival time at the bus stop.

That means I have to read more logs (LOG C) and do a little 
calculation to determine the estimated arrival time.

Then it is checked how much difference (error) there is between the 
bus's real arrival time and the system's estimated arrival time.

In practice the situation can be a little more complex.
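
As a sketch of this calculation (assuming Spark DataFrames; the 
column names are hypothetical, and actualArrivals would be derived 
from LOG A plus the bus stop area metadata):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// estimates (LOG C): timestamp, bus, stop, estimated_time
// actualArrivals (from LOG A + stop areas): bus, stop, actual_time
def arrivalError(estimates: DataFrame, actualArrivals: DataFrame): DataFrame = {
  // Keep only the last estimate the system sent per (bus, stop) pair.
  val lastEstimate = estimates
    .groupBy("bus", "stop")
    .agg(max("timestamp").as("timestamp"))
    .join(estimates, Seq("bus", "stop", "timestamp"))

  // Error in seconds between the real and the estimated arrival time.
  lastEstimate
    .join(actualArrivals, Seq("bus", "stop"))
    .withColumn("error_s",
      unix_timestamp(col("actual_time")) - unix_timestamp(col("estimated_time")))
}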

---
Esa Heikkinen

On 29.4.2016, 19:38, Michael Segel wrote:
> If you're getting the logs, then it really isn't CEP unless you 
> consider the event to be the log from the bus.
> This doesn't sound like there is a time constraint.
>
> Your bus schedule is fairly fixed and changes infrequently.
> Your bus stops are relatively fixed points. (Within a couple of meters)
>
> So then you're taking bus A, which is scheduled to drive route 123, 
> and you want to compare its nearest location to the bus stop at time 
> T and see how close it is to the scheduled route.
>
>
> Or am I missing something?
>
> -Mike
>
>> On Apr 29, 2016, at 3:54 AM, Esa Heikkinen 
>> <esa.heikkinen@student.tut.fi> wrote:
>>
>>
>> Hi
>>
>> I'll try to explain my case.
>>
>> The situation with my logs and solution is not so simple. There are 
>> many types of logs, and they come from many sources.
>> They are in CSV format, and a header line includes the names of the 
>> columns.
>>
>> This is a simplified description of the input logs:
>>
>> LOG A's: bus coordinate logs (every bus has its own log):
>> - timestamp
>> - bus number
>> - coordinates
>>
>> LOG B: bus login/logout (to/from line) message log:
>> - timestamp
>> - bus number
>> - line number
>>
>> LOG C: log from the central computers:
>> - timestamp
>> - bus number
>> - bus stop number
>> - estimated arrival time to bus stop
>>
>> LOG A files are updated every 30 seconds (I also have another system 
>> with a 1-second interval). LOG B is updated when a bus starts from 
>> the terminal bus stop and when it arrives at the final bus stop of a 
>> line. LOG C is updated when the central computer sends a new arrival 
>> time estimate for a bus stop.
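>>
>> For example, with Spark's CSV reader (a sketch; the paths are 
>> hypothetical), these header lines map directly to column names:
>>
>> import org.apache.spark.sql.SparkSession
>>
>> val spark = SparkSession.builder().appName("RtpisLogs").getOrCreate()
>>
>> // Each log is CSV with a header line, so the column names can be
>> // taken straight from the files.
>> def readLog(path: String) =
>>   spark.read.option("header", "true").option("inferSchema", "true").csv(path)
>>
>> val logA = readLog("logs/logA/*.csv") // per-bus coordinate logs
>> val logB = readLog("logs/logB.csv")   // login/logout messages
>> val logC = readLog("logs/logC.csv")   // arrival time estimates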
>>
>> I also need metadata for the logs (and the analyzer), for example 
>> the coordinates of the bus stop areas.
>>
>> The main purpose of the analysis is to check the accuracy (error) of 
>> the estimated arrival times at the bus stops.
>>
>> Because there are many buses and lines, it is too time-consuming to 
>> check all of them, so I check only specific lines with specific bus 
>> stops. Many buses (logged in to lines) come to one bus stop, and I 
>> am interested in only a certain bus.
>>
>> To do that, I have to read the logs partly out of time order 
>> ("upstream"), in the following sequence (see the sketch after this 
>> list):
>> 1. From LOG C, search for the bus number.
>> 2. From LOG A, search for when the bus left the terminal bus stop.
>> 3. From LOG B, search for when the bus sent a login to the line.
>> 4. From LOG A, search for when the bus entered the bus stop.
>> 5. From LOG C, search for the last estimated arrival time at the bus 
>> stop and calculate the error between the real and estimated values.
>>
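>> A minimal sketch of one such "upstream" lookup step (plain Scala; 
>> the names are illustrative, and each log is assumed to be parsed 
>> into time-ordered records):
>>
>> case class Rec(ts: Long, bus: String, fields: Map[String, String])
>>
>> // Find the latest record for `bus` within [from, to], even when the
>> // window starts earlier than the point the analysis has reached.
>> def searchBack(log: Seq[Rec], bus: String, from: Long, to: Long): Option[Rec] =
>>   log.filter(r => r.bus == bus && r.ts >= from && r.ts <= to).lastOption
>>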
>> In my understanding, (almost) all log file analyzers read all data 
>> (lines) from log files in time order. I need only a specific part of 
>> the log (lines). To achieve that, my solution is to read the logs in 
>> an arbitrary order (within a given time window).
>>
>> I know this solution is not suitable for all cases (for example very 
>> fast analysis or very big data). It is suitable for very complex 
>> (targeted) analysis. It can be too slow and memory-consuming, but 
>> well-done pre-processing of the log data can help a lot.
>>
>> ---
>> Esa Heikkinen
>>
>> On 28.4.2016, 14:44, Michael Segel wrote:
>>> I don't.
>>>
>>> I believe that there have been a couple of hack-a-thons, like one 
>>> done in Chicago a few years back, using public transportation data.
>>>
>>> The first question is what sort of data do you get from the city?
>>>
>>> I mean it could be as simple as time_stamp, bus_id, route and GPS 
>>> (x,y). Or they could provide more information, like last stop, 
>>> distance to next stop, average current velocity…
>>>
>>> Then there is the frequency of the updates. Every second? Every 3 
>>> seconds? 5 or 6 seconds…
>>>
>>> This will determine how much work you have to do.
>>>
>>> Maybe they provide the routes of the buses via a different API call, 
>>> since they are relatively static.
>>>
>>> This will drive your solution more than the underlying technology.
>>>
>>> Oh, and while I focused on buses, there are also other modes of 
>>> public transportation, like light rail, trains, etc…
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>>
>>>> On Apr 28, 2016, at 4:10 AM, Esa Heikkinen 
>>>> <esa.heikkinen@student.tut.fi> wrote:
>>>>
>>>>
>>>> Do you know any good examples of how to use Spark Streaming for 
>>>> tracking public transportation systems?
>>>>
>>>> Or an example with Storm or some other tool?
>>>>
>>>> Regards
>>>> Esa Heikkinen
>>>>
>>>> On 28.4.2016, 3:16, Michael Segel wrote:
>>>>> Uhm…
>>>>> I think you need to clarify a couple of things…
>>>>>
>>>>> First there is this thing called analog signal processing… Is 
>>>>> that continuous enough for you?
>>>>>
>>>>> But more to the point, Spark Streaming does micro-batching, so if 
>>>>> you're processing a continuous stream of tick data, you will have 
>>>>> more than 50K ticks per second while the markets are open and 
>>>>> trading. Even at 50K a second, that would mean 1 tick every 
>>>>> 0.02 ms, or 50 ticks a ms.
>>>>>
>>>>> And you don't want to wait until you have a batch to start 
>>>>> processing, but you want to process when the data hits the queue 
>>>>> and pull it from the queue as quickly as possible.
>>>>>
>>>>> Spark Streaming will be able to pull batches in as little as 
>>>>> 500 ms. So if you pull a batch at t0 and immediately have a tick 
>>>>> in your queue, you won't process that data until t0+500ms. And at 
>>>>> 50K ticks a second, said batch would contain 25,000 entries.
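>>>>>
>>>>> As a minimal sketch (DStream API; the source and names are 
>>>>> illustrative), the batch interval is fixed when the 
>>>>> StreamingContext is created, and 500 ms is the practical floor:
>>>>>
>>>>> import org.apache.spark.SparkConf
>>>>> import org.apache.spark.streaming.{Milliseconds, StreamingContext}
>>>>>
>>>>> val conf = new SparkConf().setAppName("TickIngest").setMaster("local[2]")
>>>>> // Everything arriving inside a 500 ms window waits for that
>>>>> // window to close before any processing starts.
>>>>> val ssc = new StreamingContext(conf, Milliseconds(500))
>>>>> val ticks = ssc.socketTextStream("localhost", 9999) // hypothetical feed
>>>>> ticks.count().print()
>>>>> ssc.start()
>>>>> ssc.awaitTermination()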
>>>>>
>>>>> Depending on what you are doing… that 500 ms delay can be enough 
>>>>> to be fatal to your trading process.
>>>>>
>>>>> If you don't like stock data, there are other examples, mainly 
>>>>> pulling data from real-time embedded systems.
>>>>>
>>>>>
>>>>> If you go back and read what I said: if the interval in your data 
>>>>> flow is >> 500 ms (much slower), and/or the time to process is >> 
>>>>> 500 ms (much longer), you could use Spark Streaming. If not… and 
>>>>> there are applications which require that type of speed… then you 
>>>>> shouldn't use Spark Streaming.
>>>>>
>>>>> If you do have that constraint, then you can look at systems like 
>>>>> Storm/Flink/Samza/whatever, where you have a continuous queue and 
>>>>> listener and no micro-batch delays.
>>>>> Then for each bolt (Storm) you can have a Spark context for 
>>>>> processing the data (depending on what sort of processing you 
>>>>> want to do).
>>>>>
>>>>> To put this in perspective… if you're using Spark Streaming / 
>>>>> Akka / Storm / etc. to handle real-time requests from the web, 
>>>>> 500 ms of added delay can be a long time.
>>>>>
>>>>> Choose the right tool.
>>>>>
>>>>> For the OP's problem: sure, tracking public transportation could 
>>>>> be done using Spark Streaming. It could also be done using half a 
>>>>> dozen other tools, because the interval between data points is 
>>>>> much longer than 500 ms.
>>>>>
>>>>> HTH
>>>>>
>>>>>
>>>>>> On Apr 27, 2016, at 4:34 PM, Mich Talebzadeh 
>>>>>> <mich.talebzadeh@gmail.com> wrote:
>>>>>>
>>>>>> A couple of things.
>>>>>>
>>>>>> There is no such thing as Continuous Data Streaming, just as 
>>>>>> there is no such thing as Continuous Availability.
>>>>>>
>>>>>> There are such things as Discrete Data Streaming and High 
>>>>>> Availability, and they reduce the finite unavailability to a 
>>>>>> minimum. In terms of business needs, 5 SIGMA is good enough and 
>>>>>> acceptable. Even candles set to a predefined time interval, say 
>>>>>> 2, 4 or 15 seconds, overlap. No FX-savvy trader makes a sell or 
>>>>>> buy decision on the basis of a 2-second candlestick.
>>>>>>
>>>>>> The calculation itself is subject to finite measurement error, 
>>>>>> as defined by its Confidence Level (CL) using a standard 
>>>>>> deviation function.
>>>>>>
>>>>>> OK, so far I have never noticed a tool that requires that level 
>>>>>> of granularity. That stuff from Flink etc. is of little 
>>>>>> practical value and does not make commercial sense.
>>>>>>
>>>>>> Now, with regard to your needs, Spark micro-batching is 
>>>>>> perfectly adequate.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>> LinkedIn 
>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>>> On 27 April 2016 at 22:10, Esa Heikkinen 
>>>>>> <es...@student.tut.fi> wrote:
>>>>>>
>>>>>>
>>>>>>     Hi
>>>>>>
>>>>>>     Thanks for the answer.
>>>>>>
>>>>>>     I have developed a log file analyzer for an RTPIS (Real Time
>>>>>>     Passenger Information System), where buses drive lines and
>>>>>>     the system tries to estimate the arrival times at the bus
>>>>>>     stops. There are many different log files (and events), and
>>>>>>     the analysis situation can be very complex. Spatial data can
>>>>>>     also be included in the log data.
>>>>>>
>>>>>>     The analyzer also has a query (or analysis) language, which
>>>>>>     describes an expected behavior. This can be a requirement of
>>>>>>     the system. The analyzer can also be thought of as a test
>>>>>>     oracle.
>>>>>>
>>>>>>     I have published a paper (at the SPLST'15 conference) about
>>>>>>     my analyzer and its language. The paper is maybe too
>>>>>>     technical, but it can be found here:
>>>>>>     http://ceur-ws.org/Vol-1525/paper-19.pdf
>>>>>>
>>>>>>     I do not know yet where it belongs. I think it can be some
>>>>>>     kind of "CEP with delays", or do you know better?
>>>>>>     My analyzer can also do somewhat more complex and
>>>>>>     time-consuming analyses. There is no need for real time.
>>>>>>
>>>>>>     And is it possible to do "CEP with delays" reasonably with
>>>>>>     some existing analyzer (for example Spark)?
>>>>>>
>>>>>>     Regards
>>>>>>     PhD student at Tampere University of Technology, Finland,
>>>>>>     www.tut.fi
>>>>>>     Esa Heikkinen
>>>>>>
>>>>>>     On 27.4.2016, 15:51, Michael Segel wrote:
>>>>>>>     Spark and CEP? It depends…
>>>>>>>
>>>>>>>     OK, I know that's not the answer you want to hear, but it's
>>>>>>>     a bit more complicated…
>>>>>>>
>>>>>>>     If you consider Spark Streaming, you have some issues.
>>>>>>>     Spark Streaming isn't a Real Time solution because it is a
>>>>>>>     micro-batch solution. The smallest window is 500 ms. This
>>>>>>>     means it could work if your compute time is >> 500 ms
>>>>>>>     and/or your event flow interval is >> 500 ms
>>>>>>>     (e.g. 'real time' image processing on a system that is
>>>>>>>     capturing 60 FPS, because the processing time is >> 500 ms).
>>>>>>>
>>>>>>>     So Spark Streaming wouldn't be the best solution…
>>>>>>>
>>>>>>>     However, you can combine Spark with other technologies like
>>>>>>>     Storm, Akka, etc., where you have continuous streaming.
>>>>>>>     So you could instantiate a Spark context per worker in Storm…
>>>>>>>
>>>>>>>     I think if there are no class collisions between Akka and
>>>>>>>     Spark, you could use Akka, which may have a better potential
>>>>>>>     for communication between workers.
>>>>>>>     So here you can handle CEP events.
>>>>>>>
>>>>>>>     HTH
>>>>>>>
>>>>>>>>     On Apr 27, 2016, at 7:03 AM, Mich Talebzadeh
>>>>>>>>     <mi...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>     please see my other reply
>>>>>>>>
>>>>>>>>     Dr Mich Talebzadeh
>>>>>>>>
>>>>>>>>     LinkedIn
>>>>>>>>     https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>>>
>>>>>>>>     http://talebzadehmich.wordpress.com
>>>>>>>>
>>>>>>>>
>>>>>>>>     On 27 April 2016 at 10:40, Esa Heikkinen
>>>>>>>>     <es...@student.tut.fi> wrote:
>>>>>>>>
>>>>>>>>         Hi
>>>>>>>>
>>>>>>>>         I have followed with interest the discussion about CEP
>>>>>>>>         and Spark. It is quite close to my research, which is
>>>>>>>>         complex analysis of log files and "history" data (not
>>>>>>>>         actually of real-time streams).
>>>>>>>>
>>>>>>>>         I have few questions:
>>>>>>>>
>>>>>>>>         1) Is CEP only for (real time) stream data and not for
>>>>>>>>         "history" data?
>>>>>>>>
>>>>>>>>         2) Is it possible to search "backward" (upstream) with
>>>>>>>>         CEP, given a time window whose start time is earlier
>>>>>>>>         than the current stream time?
>>>>>>>>
>>>>>>>>         3) Do you know any good tools or software for "CEP"
>>>>>>>>         use on log data?
>>>>>>>>
>>>>>>>>         4) Do you know any good (scientific) papers I should
>>>>>>>>         read about CEP?
>>>>>>>>
>>>>>>>>
>>>>>>>>         Regards
>>>>>>>>         PhD student at Tampere University of Technology,
>>>>>>>>         Finland, www.tut.fi
>>>>>>>>         Esa Heikkinen
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>     The opinions expressed here are mine, while they may reflect
>>>>>>>     a cognitive thought, that is purely accidental.
>>>>>>>     Use at your own risk.
>>>>>>>     Michael Segel
>>>>>>>     michael_segel (AT) hotmail.com
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> The opinions expressed here are mine, while they may reflect a 
>>>>> cognitive thought, that is purely accidental.
>>>>> Use at your own risk.
>>>>> Michael Segel
>>>>> michael_segel (AT) hotmail.com
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>