Posted to user@spark.apache.org by Eko Susilo <ek...@gmail.com> on 2014/09/30 11:52:17 UTC

Spark Streaming for time consuming job

Hi All,

I have a problem that i would like to consult about spark streaming.

I have a Spark Streaming application that parses a file (which keeps growing
over time). The file contains several columns of numbers, and the parsing is
divided into windows (one per minute). Each column represents a different
entity, while each row within a column belongs to that same entity (for
example, the first column holds temperature readings, the second column
humidity, etc., and each row is one value of that attribute). I use a
PairDStream for each column.
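
Roughly, the setup looks like this (a simplified sketch rather than my exact
code: the file path, the separator, and the "temperature" key are made up, and
the other columns are built the same way):

SparkConf conf = new SparkConf().setAppName("ColumnParser");
// small batch interval; the window below groups the batches per minute
JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(10 * 1000));
JavaDStream<String> lines = jssc.textFileStream("/data/readings"); // hypothetical path

// first column -> temperature values, keyed by the attribute name
JavaPairDStream<String, Long> temperature = lines.mapToPair(
        new PairFunction<String, String, Long>() {
            @Override
            public Tuple2<String, Long> call(String line) throws Exception {
                String[] columns = line.split("\\s+");
                return new Tuple2<String, Long>("temperature", Long.parseLong(columns[0]));
            }
        }).window(new Duration(60 * 1000)); // the one-minute window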

Afterwards, I need to run a time-consuming algorithm (outlier detection; for
now I use a box-plot algorithm) on each RDD of each PairDStream.

To run the outlier detection, I am currently thinking of calling collect on
each PairDStream from foreachRDD, taking the resulting list of items, and
passing each list to a thread. Each thread runs the outlier detection
algorithm and processes the result.

I run the outlier detection in a separate thread so as not to put too much
burden on the Spark Streaming task; the sketch below shows what I mean. So, I
would like to ask: does this model carry any risk? Or does the framework
provide an alternative, so that I don't have to run a separate thread for this?
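
Concretely, the model is roughly this (again only a sketch: the pool size and
the detect() call are placeholders for my real code, and temperature is the
windowed stream from the sketch above):

// a fixed pool so the slow detection work does not block the streaming batches
final ExecutorService pool = Executors.newFixedThreadPool(4);

temperature.foreachRDD(new Function2<JavaPairRDD<String, Long>, Time, Void>() {
    @Override
    public Void call(JavaPairRDD<String, Long> rdd, Time time) throws Exception {
        // collect this window's items to the driver...
        final List<Tuple2<String, Long>> items = rdd.collect();
        // ...and hand them to a background thread for the box-plot check
        pool.submit(new Runnable() {
            @Override
            public void run() {
                outlierDetector.detect(items); // hypothetical detector call
            }
        });
        return null;
    }
});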

Thank you for your attention.



-- 
Best Regards,
Eko Susilo

Re: Spark Streaming for time consuming job

Posted by Eko Susilo <ek...@gmail.com>.
Hi Mayur,

Thanks for your suggestion.

In fact, that's what I was thinking about: to pass that data along and return
only the percentage of outliers in a particular window.

I also have some doubts about how I would implement the outlier detection on
the RDD as you suggested.

From what I understand, those RDDs are distributed among the Spark workers, so
I imagine I would do something like the following (code_905 is a PairDStream):

code_905.foreachRDD(new Function2<JavaPairRDD<String, Long>, Time, Void>() {
    @Override
    public Void call(JavaPairRDD<String, Long> pair, Time time) throws Exception {
        if (pair.count() > 0) {
            // pair.foreach would run on the workers, so adding to a
            // driver-side list from inside it would never populate the list
            // on the driver; bring the values back explicitly instead.
            List<Long> values = pair.values().collect();
            List<Double> data = new LinkedList<Double>();
            for (Long value : values) {
                double doubleValue = value.doubleValue();
                // register data from this window to be checked
                data.add(doubleValue);
                // register the data to the outlier detector
                outlierDetector.addData(doubleValue);
            }
            // get the percentage of outliers for this window
            double percentage = outlierDetector.getOutlierPercentageFromThisData(data);
        }
        return null;
    }
});

The variable outlierDetector is declared as a static field on the class. The
call outlierDetector.addData is needed because I would like to run the outlier
detection using the data obtained from previous window(s) as well.

My concern with writing the outlier detection in Spark is that it would slow
Spark Streaming down, since the detection involves sorting the data and
calculating some statistics, and I would need to run many instances of it
(each instance handling a different set of data). So, what do you think of
this model?
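
If I went the RDD route instead, then inside the same foreachRDD I imagine the
box-plot fences for one window could be computed roughly like this (only a
sketch, assuming the window holds enough points; 1.5 is the usual IQR fence
factor), though the sort is exactly the cost I am worried about:

// the window's values as a plain JavaRDD<Double>, cached because it is used twice
JavaRDD<Double> values = pair.map(new Function<Tuple2<String, Long>, Double>() {
    @Override
    public Double call(Tuple2<String, Long> t) throws Exception {
        return t._2.doubleValue();
    }
}).cache();
final long n = values.count();

// sort the values, then attach a rank to each one
JavaPairRDD<Double, Long> ranked = values
        .mapToPair(new PairFunction<Double, Double, Integer>() {
            @Override
            public Tuple2<Double, Integer> call(Double v) throws Exception {
                return new Tuple2<Double, Integer>(v, 1);
            }
        })
        .sortByKey(true)
        .keys()
        .zipWithIndex();

// pull back just the two quartiles, not the data itself
final long q1Rank = n / 4;
final long q3Rank = (3 * n) / 4;
List<Double> qs = ranked.filter(new Function<Tuple2<Double, Long>, Boolean>() {
    @Override
    public Boolean call(Tuple2<Double, Long> t) throws Exception {
        return t._2 == q1Rank || t._2 == q3Rank;
    }
}).keys().collect();
final double q1 = Math.min(qs.get(0), qs.get(1));
final double q3 = Math.max(qs.get(0), qs.get(1));
final double low = q1 - 1.5 * (q3 - q1);
final double high = q3 + 1.5 * (q3 - q1);

// count the outliers on the cluster; only the count reaches the driver
long outliers = values.filter(new Function<Double, Boolean>() {
    @Override
    public Boolean call(Double v) throws Exception {
        return v < low || v > high;
    }
}).count();
double percentage = 100.0 * outliers / n;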








-- 
Best Regards,
Eko Susilo

Re: Spark Streaming for time consuming job

Posted by Mayur Rustagi <ma...@gmail.com>.
Calling collect on anything is almost always a bad idea. The only exception is
if you are looking to pass that data on to some other system and never see it
again :).
I would say you need to implement the outlier detection on the RDD and process
it in Spark itself rather than calling collect on it.
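
For instance, something along these lines keeps everything inside Spark (just
a sketch of the idea; note it uses a simple mean/stddev cutoff for
illustration in place of your box plot):

code_905.foreachRDD(new Function2<JavaPairRDD<String, Long>, Time, Void>() {
    @Override
    public Void call(JavaPairRDD<String, Long> rdd, Time time) throws Exception {
        if (rdd.count() == 0) return null; // skip empty windows

        // per-window statistics, computed on the cluster
        JavaDoubleRDD values = rdd.mapToDouble(new DoubleFunction<Tuple2<String, Long>>() {
            @Override
            public double call(Tuple2<String, Long> t) throws Exception {
                return t._2.doubleValue();
            }
        });
        StatCounter stats = values.stats(); // mean/stdev in one distributed pass
        final double low = stats.mean() - 3 * stats.stdev();
        final double high = stats.mean() + 3 * stats.stdev();

        // only the outlier count ever reaches the driver
        long outliers = values.filter(new Function<Double, Boolean>() {
            @Override
            public Boolean call(Double v) throws Exception {
                return v < low || v > high;
            }
        }).count();
        System.out.println(time + ": " + (100.0 * outliers / stats.count()) + "% outliers");
        return null;
    }
});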

Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>

