Posted to user@spark.apache.org by Shrikar archak <sh...@gmail.com> on 2014/06/20 20:16:54 UTC

Possible approaches for adding extra metadata (Spark Streaming)?

Hi All,

I am curious which of the following two approaches is better for doing
analytics with Spark Streaming. Let's say we want to add some metadata,
such as sentiment or tags, to the stream being processed, and then run
analytics on that added metadata.

1) Make an HTTP call and add the extra information to the stream while it
is being processed, in the updateByKeyAndWindow operations. Is that OK?

2) Add the sentiment/tags beforehand, and then stream the enriched data
through DStreams.

Thanks,
Shrikar

Re: Possible approaches for adding extra metadata (Spark Streaming)?

Posted by Shrikar archak <sh...@gmail.com>.
Thanks, Mayur and TD, for your input.

~Shrikar


On Fri, Jun 20, 2014 at 1:20 PM, Tathagata Das <ta...@gmail.com>
wrote:

> If the metadata is directly related to each individual record, then it
> can be done either way. Since I am not sure how easy or hard it will be
> for you to add tags before putting the data into Spark Streaming, it's
> hard to recommend one method over the other.
>
> However, if the metadata is related to each key (the key on which you are
> calling updateStateByKey) and not to every record, then it may be more
> efficient to maintain that per-key metadata in updateStateByKey's state
> object.
>
> Regarding HTTP calls, I would be a bit cautious about performance. Doing
> an HTTP call for every record is going to be quite expensive and will
> reduce throughput significantly. If possible, cache values as much as you
> can to amortize the cost of the HTTP calls.
>
> TD

Re: Possible approaches for adding extra metadata (Spark Streaming)?

Posted by Tathagata Das <ta...@gmail.com>.
If the metadata is directly related to each individual record, then it can
be done either way. Since I am not sure how easy or hard it will be for
you to add tags before putting the data into Spark Streaming, it's hard to
recommend one method over the other.

However, if the metadata is related to each key (the key on which you are
calling updateStateByKey) and not to every record, then it may be more
efficient to maintain that per-key metadata in updateStateByKey's state
object.
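
As a rough illustration of keeping per-key metadata in the state object,
here is a minimal sketch of an update function in the shape that PySpark's
updateStateByKey expects (a list of new values plus the previous state).
The record layout (dicts with an optional "tags" list) and the state layout
(a dict with "count" and "tags") are assumptions made up for this example,
not something from the thread.

```python
def update_state(new_values, state):
    """Merge this batch's records for one key into that key's state.

    new_values: list of records seen for the key in the current batch
                (assumed here to be dicts with an optional "tags" list).
    state:      previous state for the key, or None the first time the
                key is seen.
    """
    if state is None:
        # Hypothetical state layout: a running count plus the key's tags.
        state = {"count": 0, "tags": []}

    # Build a new set so the previous state object is not mutated.
    tags = set(state["tags"])
    for record in new_values:
        tags.update(record.get("tags", []))

    return {
        "count": state["count"] + len(new_values),
        "tags": sorted(tags),
    }
```

With a DStream of (key, record) pairs, here called `pairs` purely for
illustration, this would be used as `pairs.updateStateByKey(update_state)`.
Stateful operations like this need a checkpoint directory set on the
streaming context.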

Regarding HTTP calls, I would be a bit cautious about performance. Doing
an HTTP call for every record is going to be quite expensive and will
reduce throughput significantly. If possible, cache values as much as you
can to amortize the cost of the HTTP calls.
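
One possible way to amortize the lookups is sketched below: wrap the
function that performs the remote call in an in-memory LRU cache and enrich
records one partition at a time. The names `make_cached_lookup`,
`enrich_partition` and `fetch` are hypothetical; `fetch` stands in for
whatever function performs one HTTP request per key. How the cache is
shipped to and shared among Spark workers is not shown here.

```python
from functools import lru_cache


def make_cached_lookup(fetch, maxsize=10000):
    """Wrap `fetch` (assumed: one remote call per key) in an LRU cache,
    so repeated keys are served from memory instead of the network."""
    return lru_cache(maxsize=maxsize)(fetch)


def enrich_partition(records, lookup):
    """Attach metadata to each (key, value) record in an iterator.

    Written as a generator over an iterator of records so that it could
    be passed to an operation such as mapPartitions.
    """
    for key, value in records:
        yield key, value, lookup(key)
```

With this shape, a key that appears many times triggers only one call to
`fetch` for as long as its entry stays in the cache. The cache size is a
trade-off between memory use and hit rate.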

TD

Re: Possible approaches for adding extra metadata (Spark Streaming)?

Posted by Mayur Rustagi <ma...@gmail.com>.
You can apply transformations to the RDDs inside DStreams using transform
or any of a number of other operations.
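
A minimal sketch of what such a transformation could look like in PySpark
follows. The keyword-based `tag_record` tagger and the function names are
made up for illustration; a real pipeline would plug in its own sentiment
or tagging logic.

```python
def tag_record(text):
    """Hypothetical tagger: tag a line by simple keyword matching."""
    lowered = text.lower()
    tags = [word for word in ("spark", "streaming") if word in lowered]
    return (text, tags)


def add_tags(rdd):
    """RDD-to-RDD function of the kind DStream.transform accepts.

    It only relies on the RDD having a map method, and returns a new RDD
    of (text, tags) pairs.
    """
    return rdd.map(tag_record)
```

Given a DStream of strings, here called `lines` for illustration, this
would be applied as `lines.transform(add_tags)`.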

Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>


