Posted to user@spark.apache.org by Tom Davis <ma...@gmail.com> on 2016/09/14 18:44:43 UTC

Streaming - lookup against reference data

Hi all,

Interested in patterns people use in the wild for lookups against reference
data sets from a Spark Streaming job. The reference dataset will be updated
during the life of the job (although being 30 minutes out of date wouldn't
be an issue, for example).

So far I have come up with a few options, all of which have advantages and
disadvantages:

1. For small reference datasets, distribute the data as an in-memory Map()
from the driver, refreshing it inside the foreachRDD() loop.

Obviously the limitation here is size.
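
Something like this is what I have in mind (untested sketch; ssc and stream
stand for the StreamingContext and input DStream, and loadReferenceMap() is a
placeholder for however the data actually gets loaded):

    // Untested sketch: re-broadcast a small reference map every ~30 minutes.
    // loadReferenceMap() is a placeholder for the actual load logic.
    var refMap = ssc.sparkContext.broadcast(loadReferenceMap())
    var lastRefresh = System.currentTimeMillis()

    stream.foreachRDD { rdd =>
      if (System.currentTimeMillis() - lastRefresh > 30 * 60 * 1000L) {
        refMap.unpersist()  // drop the stale copy from the executors
        refMap = rdd.sparkContext.broadcast(loadReferenceMap())
        lastRefresh = System.currentTimeMillis()
      }
      val ref = refMap  // local val so the closure captures a stable handle
      rdd.foreach(x => println(x + " -> " + ref.value.get(x)))
    }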

2. Run a Redis (or similar) cache on each worker node and perform lookups
against it.

There's some complexity to managing this, probably outside of the Spark job.
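
The lookup itself would be something like this (untested; assumes the Jedis
client and a Redis instance listening locally on each worker, per the option
above):

    // Untested sketch with the Jedis client: one connection per partition
    // (never one per record), against the node-local Redis.
    import redis.clients.jedis.Jedis

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { iter =>
        val jedis = new Jedis("localhost", 6379)
        try iter.foreach(k => println(k + " -> " + jedis.get(k)))
        finally jedis.close()
      }
    }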

3. Load the reference data into an RDD, again inside the foreachRDD() loop
on the driver. Perform a join of the reference and stream batch RDDs.
Perhaps keep the reference RDD in memory.

I suspect that this will scale, but I also suspect the join will shuffle a
lot of data across the network on every batch, which will slow things down.
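
Pre-partitioning both sides with the same partitioner should at least keep
the cached reference RDD from being reshuffled on each batch. Roughly
(untested; the path, the parsing and the keying of the stream are all
placeholders):

    // Untested sketch: cache the reference RDD with a known partitioner and
    // join each batch against it. Path and format are placeholders.
    import org.apache.spark.HashPartitioner

    val partitioner = new HashPartitioner(ssc.sparkContext.defaultParallelism)
    val refRDD = ssc.sparkContext.textFile("hdfs:///ref/data")
      .map { line => val Array(k, v) = line.split(","); (k, v) }
      .partitionBy(partitioner)
      .cache()

    stream.map(k => (k, k))  // assuming the stream records are the lookup keys
      .transform(batch => batch.partitionBy(partitioner).join(refRDD))
      .print()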

4. Similar to the Redis option, but use HBase. It scales well and makes the
data available to other services, but each lookup is a call out over the
network, albeit within the cluster.
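
Per batch it would look something like this (untested sketch with the plain
HBase client; the table name "ref" and the batch size are placeholders):

    // Untested sketch: batched HBase gets per partition to amortise the
    // network round trips to the region servers.
    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
    import org.apache.hadoop.hbase.util.Bytes
    import scala.collection.JavaConverters._

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { iter =>
        val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
        val table = conn.getTable(TableName.valueOf("ref"))
        try {
          iter.grouped(100).foreach { keys =>
            val gets = keys.map(k => new Get(Bytes.toBytes(k))).asJava
            table.get(gets).filterNot(_.isEmpty)
              .foreach(r => println(Bytes.toString(r.value())))
          }
        } finally {
          table.close()
          conn.close()
        }
      }
    }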

I guess there's no solution that fits all cases, but I'm interested in other
people's experience and whether I've missed anything obvious.

Thanks,

Tom

Re: Streaming - lookup against reference data

Posted by Tom Davis <ma...@gmail.com>.
Thanks Jörn - sounds like there's nothing obvious I'm missing, which is
encouraging.

I've not used Redis, but it does seem that for most of my current and
likely future use-cases it would be the best fit (nice compromise of scale
and easy setup / access).

Thanks,

Tom


Re: Streaming - lookup against reference data

Posted by Jörn Franke <jo...@gmail.com>.
Hmm, is it just a lookup and the values are small? I do not think that in this case Redis needs to be installed on each worker node. Redis has a rather efficient protocol, so one or a few dedicated Redis nodes will probably more than fit your purpose. Just try to reuse connections and do not establish a new one for each lookup from the same node.
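
For example, with a small pool created once per executor JVM and reused
across batches (untested sketch, assuming the Jedis client; host and port are
placeholders):

    // Untested sketch: one JedisPool per executor JVM, shared across batches.
    import redis.clients.jedis.JedisPool

    object RedisConn {
      lazy val pool = new JedisPool("redis-host", 6379)  // placeholder host
    }

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { iter =>
        val jedis = RedisConn.pool.getResource
        try iter.foreach(k => println(jedis.get(k)))
        finally jedis.close()  // returns the connection to the pool
      }
    }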

Additionally, Redis has a lot of interesting data structures, such as HyperLogLogs.

HBase - here you can design where each part of the reference data set is stored and partition in Spark accordingly. It depends on the data and can be tricky.

About the other options I am a bit skeptical - especially since you need to include updated data, which might have side effects.

Nevertheless, you have mentioned all the possible options. I guess for a true evaluation you have to check your use case, the envisioned future architecture for other use cases, the required performance, maintainability, etc.
