Posted to user@spark.apache.org by sequoiadb <ma...@sequoiadb.com> on 2015/08/10 12:24:00 UTC

question about spark streaming

hi guys,

i have a question about spark streaming.
There’s an application that keeps sending transaction records into Spark Streaming at about 50k TPS.
Each record represents a sale and includes customer id / product id / time / price columns.

The application is required to monitor the price change of each product. For example, if the price of a product increases by 10% within 3 minutes, it should send an alert to the end user.

The batch interval is required to be 1 second, and the window is somewhere between 180 and 300 seconds.

The issue is that I have to compare the price of each transaction (there are about 10k different products in total) against the lowest/highest price of the same product over the past 180 seconds.

That means every second I have to loop through 50k transactions and compare each one against the prices of the same product over the whole 180 seconds. So it seems I have to partition the calculation by product id, so that each worker only processes a certain subset of products.

For example, if I can make sure records with the same product id always go to the same worker, no data needs to be shuffled between workers for each comparison. Otherwise, if each transaction has to be compared against RDDs spread across multiple workers, I guess it may not be fast enough to meet the requirement.

Does anyone know how to pin each transaction record to a worker node based on its product id, in order to avoid a massive shuffle operation?

If I simply make the product id the key and the price the value, reduceByKeyAndWindow may cause a massive shuffle and slow down the overall throughput. Am I correct?
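
To make it concrete, this is roughly what I had in mind (just a sketch, not tested; the socket source and field layout are only placeholders for our real feed):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("price-alerts")
val ssc  = new StreamingContext(conf, Seconds(1))          // 1-second batch interval

// Lines of "customerId,productId,timestampMs,price"; the real source is whatever
// feeds the 50k tps stream, socketTextStream is only for the sketch.
val records = ssc.socketTextStream("localhost", 9999).map { line =>
  val Array(cust, prod, ts, price) = line.split(",")
  (cust, prod, ts.toLong, price.toDouble)
}

// Key by product id, carrying (price, price) as the initial (min, max) pair.
val byProduct = records.map { case (_, productId, _, price) => (productId, (price, price)) }

// (min, max) price per product over a 180-second window, sliding every second.
val minMax = byProduct.reduceByKeyAndWindow(
  (a: (Double, Double), b: (Double, Double)) => (math.min(a._1, b._1), math.max(a._2, b._2)),
  Seconds(180), Seconds(1))

// Alert whenever the window's max is 10% or more above its min.
minMax.filter { case (_, (lo, hi)) => hi >= lo * 1.1 }
      .foreachRDD(_.foreach { case (productId, (lo, hi)) =>
        println(s"ALERT: $productId went from $lo to $hi within the window")
      })

ssc.start()
ssc.awaitTermination()

My worry is that the reduceByKeyAndWindow call above reshuffles by product id on every batch.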

Thanks

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: question about spark streaming

Posted by Dean Wampler <de...@gmail.com>.
Have a look at the various versions of
PairDStreamFunctions.updateStateByKey (
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions).
It supports updating running state in memory. (You can persist the state to
a database/files periodically if you want.) Use an in-memory data structure
like a hash map with SKU-price key-values, and update the map as needed on
each batch. One of the versions of this function lets you specify a
partitioner if you still need to shard keys.
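
For example, something along these lines (a rough, untested sketch; the
socket source, checkpoint path, and partitioner choice are stand-ins for
whatever you actually use):

import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("price-alerts"), Seconds(1))
ssc.checkpoint("/tmp/price-alerts-checkpoint")   // updateStateByKey requires checkpointing

// productId -> (timestampMs, price); the real source is whatever feeds the 50k tps.
val prices = ssc.socketTextStream("localhost", 9999).map { line =>
  val Array(_, prod, ts, price) = line.split(",")
  (prod, (ts.toLong, price.toDouble))
}

val windowMs = 180 * 1000L

// State per product: the (timestamp, price) pairs seen in the last 180 seconds.
def updatePrices(newValues: Seq[(Long, Double)],
                 state: Option[Seq[(Long, Double)]]): Option[Seq[(Long, Double)]] = {
  val now  = System.currentTimeMillis()
  val kept = (state.getOrElse(Seq.empty) ++ newValues).filter(now - _._1 <= windowMs)
  if (kept.isEmpty) None else Some(kept)   // drop the key when nothing is left
}

// The overload that takes a Partitioner keeps a given product on the same partition
// across batches, so the state itself never has to be reshuffled.
val stateByProduct = prices.updateStateByKey(
  updatePrices _, new HashPartitioner(ssc.sparkContext.defaultParallelism))

// Alert when the max within the retained window is 10% or more above the min.
stateByProduct.foreachRDD { rdd =>
  rdd.foreach { case (productId, history) =>
    val ps = history.map(_._2)
    if (ps.max >= ps.min * 1.1)
      println(s"ALERT: $productId moved from ${ps.min} to ${ps.max}")
  }
}

ssc.start()
ssc.awaitTermination()

(Since min/max has no inverse function, the more efficient invertible form of
reduceByKeyAndWindow wouldn't apply here anyway, which is another reason to
keep the state yourself.)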

Also, I would be flexible about the 1 second batch interval. Is that really
a mandatory requirement for this problem?

HTH,
dean


Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com
