Posted to user@cassandra.apache.org by Allen He <al...@gmail.com> on 2010/04/15 09:56:38 UTC

If a user has millions of followers, are there millions of iterations? (ref Twissandra)

Hello folks,

When Twissandra <http://twissandra.com/> (a Twitter clone example for
Cassandra) posts a tweet, it iterates over all of the followers to insert the
tweet_id into their timelines (see the highlighted loop at the end of save_tweet):

def save_tweet(tweet_id, user_id, tweet):
    """
    Saves the tweet record.
    """
    # Generate a timestamp, and put it in the tweet record
    raw_ts = int(time.time() * 1e6)
    tweet['_ts'] = raw_ts
    ts = _long(raw_ts)
    encoded = dict(((k, json.dumps(v)) for k, v in tweet.iteritems()))
    # Insert the tweet, then into the user's timeline, then into the public one
    TWEET.insert(str(tweet_id), encoded)
    USERLINE.insert(str(user_id), {ts: str(tweet_id)})
    USERLINE.insert(PUBLIC_USERLINE_KEY, {ts: str(tweet_id)})
    # Get the user's followers, and insert the tweet into all of their streams
    follower_ids = [user_id] + get_follower_ids(user_id)
    for follower_id in follower_ids:
        TIMELINE.insert(str(follower_id), {ts: str(tweet_id)})
My question is: if a user has millions of followers, are there millions of
iterations?

Sorry for my English :)

Thanks!

Re: If a user has millions of followers, are there millions of iterations? (ref Twissandra)

Posted by gabriele renzi <rf...@gmail.com>.
On Thu, Apr 15, 2010 at 9:56 AM, Allen He <al...@gmail.com> wrote:
> Hello folks,
>
> When Twissandra (a Twitter clone example for Cassandra) posts a tweet, it
> iterates over all of the followers to insert the tweet_id into their timelines
> (see the highlighted loop at the end of save_tweet):


>     for follower_id in follower_ids:
>         TIMELINE.insert(str(follower_id), {ts: str(tweet_id)})
>
>
>
> My question is: if a user has millions of followers, are there millions of
> iterations?

I never looked at the Twissandra code, but it looks like that. It is
probably a trade-off: either you store each tweet only in its author's
line and, when a user wants to read their timeline, fetch the tweets of
everyone they follow (so putting the burden on read time), or you do it
like this and put the burden on the write. Since writes are cheap in
Cassandra and reads are more frequent, this seems to make sense.
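
Just to illustrate, a rough sketch of the read-time alternative (fan-out on
read) could look like the following. This is not Twissandra code: it assumes
the same pycassa-style USERLINE/TWEET column families as the snippet above,
plus a hypothetical get_following_ids() helper (the mirror image of
get_follower_ids()):

def read_timeline(user_id, limit=40):
    """Sketch only: merge timelines at read time instead of fanning out on write."""
    author_ids = [user_id] + get_following_ids(user_id)  # hypothetical helper
    # One row per followed user is read here, so the burden lands on the read path.
    rows = USERLINE.multiget([str(a) for a in author_ids],
                             column_count=limit,
                             column_reversed=True)
    # Columns are {packed_timestamp: tweet_id}; merge them and keep the newest.
    entries = sorted(((ts, tid) for cols in rows.itervalues()
                      for ts, tid in cols.iteritems()),
                     reverse=True)[:limit]
    return [TWEET.get(tweet_id) for _, tweet_id in entries]

With write-time fan-out, by contrast, reading a timeline is a single row read
on TIMELINE, which is why paying the per-follower cost at write time is
usually the better deal when reads dominate.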


PS
  I think it should use batch_mutate anyway, so that only one operation
is sent over the network.
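
For what it is worth, a minimal sketch of that batching, assuming TIMELINE is
a pycassa ColumnFamily and that the pycassa version in use provides the
batch() mutator (otherwise the raw Thrift batch_mutate call can be built by
hand):

# Queue the per-follower mutations and flush them in chunks, so they travel
# to Cassandra in a few batch_mutate calls instead of one insert per follower.
with TIMELINE.batch(queue_size=200) as b:
    for follower_id in follower_ids:
        b.insert(str(follower_id), {ts: str(tweet_id)})
# Leaving the `with` block sends whatever is still queued.

The application still loops over every follower, but the network cost drops
from millions of round trips to a handful.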