Posted to users@kafka.apache.org by Kanagha <er...@gmail.com> on 2016/06/12 23:26:03 UTC

Storing results of a stream

Hi,

I'm building a topology where I am connecting twitter with another
application (ex: appA)

appA also maintains a graph model (similar to
Facebook/Twitter) in which a user can have followers and follow other users.

Ex: UserA follows UserB, UserC, UserD.

And UserB/C/D can each have any number of followers.
This information is currently stored in an Oracle table.

I am retrieving the corresponding Twitter ids for users B, C, and D and
fetching the latest n tweets posted by them.

1) I have a Kafka spout that streams the tweets for a specific set
of userIds.
2) In another bolt, I join the Kafka spout's output with the records in the
Oracle table, so that each tweet is joined with all the users who
follow the user who posted that tweet.
3) After the join, I'll use a RollingCountBolt to capture the
latest n tweets posted by all the users that a given user follows.
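To make the duplication concrete, the fan-out in step 2 can be sketched in
plain Python (this is an illustration, not Storm bolt code; the
follower_graph data and function names are hypothetical):

```python
# Illustrative sketch (not Storm code): step 2's join fans each tweet
# out to every follower of its author, so one tweet body is emitted
# once per follower.

# Hypothetical stand-in for the Oracle table: author -> list of followers.
follower_graph = {
    "userB": ["userA", "userE"],  # userA and userE follow userB
}

def join_with_followers(tweet):
    """Emit one (follower, tweet) tuple per follower of the tweet's author."""
    author = tweet["author"]
    return [(follower, tweet) for follower in follower_graph.get(author, [])]

tweet = {"id": "t1", "author": "userB", "text": "hello"}
tuples = join_with_followers(tweet)
# One input tweet has become len(follower_graph["userB"]) output tuples,
# each carrying the full tweet body -- the duplication described below.
```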


My question is: what is the best way to store the results of the
RollingCountBolt while avoiding duplication?
I could use a Redis instance to hold this information.
But if a userA is followed by 100 users, a tweet posted by userA would
be stored 100 times.

To avoid the duplication, I could store only tweetIds in the output field of
the RollingCountBolt and keep the tweet bodies in a separate table. But since
tweets are continuously streaming in, each record must be stored with an
expiration period (similar to a cache).
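A minimal in-memory sketch of that normalized layout, assuming tweet bodies
are stored once per tweetId with a TTL and per-user timelines hold only ids
(in Redis the expiry would be done server-side with EXPIRE/SETEX; the class
and field names here are illustrative):

```python
import time

class ExpiringTweetStore:
    """Sketch of the normalized scheme: tweet bodies are stored once,
    keyed by tweetId with a TTL; per-user timelines hold only tweetIds.
    In Redis, EXPIRE/SETEX would handle the eviction instead."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.tweets = {}      # tweetId -> (expiry timestamp, tweet body)
        self.timelines = {}   # userId  -> list of tweetIds (oldest first)

    def put(self, user_id, tweet_id, body, now=None):
        now = time.time() if now is None else now
        # Body is stored once, however many timelines reference the id.
        self.tweets[tweet_id] = (now + self.ttl, body)
        self.timelines.setdefault(user_id, []).append(tweet_id)

    def get_timeline(self, user_id, now=None):
        now = time.time() if now is None else now
        ids = self.timelines.get(user_id, [])
        # Resolve ids to bodies, silently dropping expired entries.
        return [self.tweets[i][1] for i in ids
                if i in self.tweets and self.tweets[i][0] > now]
```

With this layout, a tweet followed by 100 users costs 100 small id entries
plus one body, and expired bodies simply stop resolving.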

How are such scenarios usually dealt with in a streaming
application? Suggestions would be helpful.

Thanks
Kanagha