Posted to dev@esme.apache.org by David Pollak <fe...@gmail.com> on 2009/11/30 21:40:21 UTC

Statefulness and algorithms for social networks/graphs

Folks,

Over the last 6 or so months, we've had a bunch of discussions on the list
about statefulness, REST, and ESME's overall design.  I want to walk through
the design choices I've made for ESME and why "stateless" and other such
designs fail and are dead wrong for a social networking system.

There is no such thing as stateless.  Every web site has state.  The state
may change frequently or may change infrequently.  A web site made up of
static files has its state based on those static pages.  When those pages
are changed, the state changes.  State is kept somewhere for all web sites.

Some web sites will present a different state depending on who is accessing
the site.  This can be as simple as serving different pages depending on the
IP address or language preference expressed in the HTTP headers.  This is
sessionful.  The content is calculated based on the request.  This may be
more sophisticated in terms of authenticating the HTTP request and
presenting content based on the authentication.

A session for sessionful content may be short-lived (the length of the
request) or it may be longer lived (typically this is done with an initial
authentication phase resulting in a shared secret [JSESSIONID] that is
presented as an authentication proxy in subsequent requests.)

But no matter the authentication mechanism or the session lifespan, there
must exist a mechanism for translating the HTTP request into the content
presented for the session.

Far and away the most common way of persisting and calculating state is in a
relational database (RDBMS).  RDBMSs are awesome creatures.  They sit on top
of some excellent and well understood mathematics: set theory.  They have
well known and well understood concurrency mechanisms: transactions.  They
have been designed, built, tested, and optimized over the last generation.
RDBMSs offer a simple set of commands (SELECT, DELETE, INSERT, UPDATE) as
well as generally human-understandable semantics: people understand that
RDBMSs are sets of things and that there are simple ways to ask about these
sets.  RDBMSs have evolved alongside ERP systems to meet the needs of those
systems.

However, there are well-known things that RDBMSs don't do well, and tree
structures are one of them (yeah, Oracle and others have extensions for tree
walks, but they're not part of the SQL spec and their performance is not
always comparable to other models: a tree-walk in an RDBMS costs O(log n)
per node where a tree walk in an OO system costs O(1) per node).  Social
networks/social graphs are another place where RDBMSs do not excel.

Let's dive down into this.

A naive implementation of a social messaging site uses something like these
tables:

   - Users(id, name, password)
   - Friends(owner, friend)
   - Messages(id, poster, content, date)

So, if we wanted to calculate the timeline for a given user at a given
instant, the query would look like:

SELECT messages.* FROM messages, friends WHERE friends.owner = current_user
AND messages.poster = friends.friend ORDER BY messages.date DESC LIMIT 20

Assuming we've got indexes on friends.owner, messages.poster and
messages.date, the query still costs O(n log n), where n is the aggregate
number of messages posted.  That cost is non-trivial, and if you follow
someone who has posted 20,000 messages (yeah Anne, I'm talkin' to you), the
n log n term really starts to hurt.

Basically, each time a client asks for the latest timeline, you've got an
O(n log n) operation to determine state.  This doesn't scale.

The first obvious response to the issue is caching (capturing the state
beyond the duration of a short-lived session).  I'm going to skip caching
for a moment and do a more sophisticated implementation of timelines so we
can get better performance.

Let's create a mailbox table.  Each time someone publishes a message, a
reference to that message will be put in a Mailbox(owner, message, date)
table and we'll create an index on the table: (owner, date DESC)

This changes the query to:

SELECT messages.* FROM messages, mailbox WHERE mailbox.owner = current_user
AND messages.id = mailbox.message ORDER BY mailbox.date DESC LIMIT 20

Depending on your RDBMS, you will wind up with an O(log n) operation.  You
find the newest mailbox entry for the user (O(log n)) and walk the index
until you've found 20 entries.  (I'm putting aside the fact that fetching
the 20 messages themselves is another 20 O(log n) lookups, because 20 is a
small number and the messages will likely be in the database's cache... this
operation is going to be fast.)

I'm going to sidetrack for a moment.  I had the pleasure of talking over a
few beers at a baseball game with one of the senior engineers at Facebook.
We were talking about Facebook's scaling success.  His comment was that it
was successful but very expensive.  If there were more than 3% cache misses
from MySQL queries, the system would back up.  If they got more than 2%
cache misses from the memcached tier in front of their MySQL servers, the
system would back up.  So, basically, Facebook has 195% of their data in RAM.

The net is that O(log n) is only going to work if you've got your entire
index in the cache of your RDBMS.  Even a dozen disk reads will turn a 10ms
query into a 250ms query, and if you've got 1,000 users asking for a status
update, you'll wind up with disk thrashing and ultimately you will not be
able to satisfy all of those requests.

Let's make our discussion more concrete.  I'm assuming that an ESME instance
will support 25,000 users.  On average, a user will follow 100 people (100x
fan-out of messages).  Users will post one message every 30 minutes (48
messages a day).  The day lasts 10 hours (this is a reasonable approximation
for peakiness... basically, you're compressing 48 message sends into a
10-hour period).  There are 300 days in a year.  These numbers are averages:
there will be some folks who are above average in terms of fan-out (the CEO
will have a 25,000x fan-out) and some folks who are above average in number
of messages per day (yeah Anne, I'm lookin' at you... you too Dick.)

So, that means that each year, there will be 36,000M (36B) mailbox entries.
If each entry costs us 16 bytes of RAM for index purposes, that means we're
at 576 billion bytes (roughly 576GB) of index.  There's no way that amount
of index will fit in RAM.  Even if the average drops to one message per day,
you're still looking at on the order of 10GB of index.  Alternatively, you
could purge messages after 3 weeks or limit timelines to a certain number of
messages.  That's not unreasonable, but it's also adding a constraint to the
system to deal with limitations of the RDBMS.  There are other alternatives.
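
For anyone who wants to check the arithmetic, here it is as a small Scala
sketch (the constants are just the assumptions stated above; the object name
is mine):

    // Back-of-the-envelope capacity math for the scenario above.
    object CapacitySketch extends App {
      val users              = 25000L
      val followersPerUser   = 100L   // average fan-out per message
      val messagesPerDay     = 48L
      val daysPerYear        = 300L
      val bytesPerIndexEntry = 16L

      val messagesPerYear = users * messagesPerDay * daysPerYear    // 360M
      val mailboxEntries  = messagesPerYear * followersPerUser      // 36B
      val indexBytes      = mailboxEntries * bytesPerIndexEntry     // 576 billion bytes

      println(f"messages/year:   $messagesPerYear%,d")
      println(f"mailbox entries: $mailboxEntries%,d")
      println(f"index size (GB): ${indexBytes / 1000000000L}%,d")
    }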

Let me talk memcached for a minute.  In my opinion, memcached means that you
have a failed design.  Memcached means abandoning all the awesome things
that you get with an RDBMS: a mathematical model, a
concurrency/transactional model, durability guarantees, etc.

But we could move our state from the calculate-on-demand model of the RDBMS
to a calculate-once-and-cache model using memcached.  This means that you
only take the nasty hits when the cache is not valid.  Putting aside the
cost of cache invalidation (I haven't covered the costs of updates in this
discussion because there's no need to go there... the implementation
failures can be demonstrated with just reads), if you have a simple cache
invalidation scheme, most of the cache entries will not survive for 15
minutes (I can go through the math, but I'm going to leave this one to the
reader).  You risk cache stampedes (more than one process rebuilding the
same cache entry).  Basically, the naive memcached implementation buys you a
little bit of headroom over the naive (non-mailbox) approach.  In order to
get more than a 5x or so improvement (something that will serve a few
thousand rather than a few hundred users), you need to manipulate the cache
entries directly, inserting/deleting individual messages.
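
To make that last point concrete, here is a minimal sketch of
invalidate-and-rebuild versus in-place maintenance of a cached timeline.
This is not ESME code: the in-process map stands in for memcached and all of
the names are illustrative.

    // Minimal sketch of incremental cache maintenance for per-user timelines.
    import scala.collection.concurrent.TrieMap

    case class CachedMsg(poster: String, text: String, date: Long)

    object TimelineCache {
      // userId -> newest-first timeline; a stand-in for memcached.
      private val cache      = TrieMap.empty[String, List[CachedMsg]]
      private val maxEntries = 1000

      // Naive scheme: drop the entry and force a full (expensive) rebuild
      // on the next read -- and risk a stampede of concurrent rebuilders.
      def invalidate(userId: String): Unit = cache.remove(userId)

      // Incremental scheme: prepend the new message to each follower's
      // cached timeline.  Each update is O(1); nobody pays for a rebuild.
      // (A real memcached version would need CAS to make this safe.)
      def deliver(followers: Seq[String], msg: CachedMsg): Unit =
        followers.foreach { id =>
          cache.update(id, (msg :: cache.getOrElse(id, Nil)).take(maxEntries))
        }

      def timeline(userId: String, count: Int): List[CachedMsg] =
        cache.getOrElse(userId, Nil).take(count)
    }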

That last point in fact leads us in the direction of a better answer.

But first, let me state that I have proven that an RDBMS cannot be the sole
locus of state for a social messaging site that services more than a few
hundred users.  Period.  We must move state somewhere else and manage the
cached state manually rather than with queries and indexes.  Second, I have
not discussed short-lived vs. long-lived sessions yet.  I will get to that,
but first, let's walk through a design that gives us a concurrency model as
well as the performance we want.

Imagine a model where you interact with a User with a limited set of
(asynchronous) messages:

   - add/remove friend
   - add message to timeline
   - post message (the user has created a message and it needs to be
   processed)
   - get current timeline (with offsets and number of entries)

These are the basic messages needed to implement a social messaging site.
If we guarantee that a User will only process one message at a time, we have
a concurrency model.  It's simple, and simple is good.  We have not defined
how/where Users store their state (it could be on a filesystem, in an RDBMS,
in a NoSQL store, who knows).  But we can say that adding a message to a
timeline is an O(1) operation (prepending to the head of a singly linked
list).  Each User can have a caching policy (and that caching policy could
be dynamic based on the access characteristics for the User).  The sender of
a message doesn't block on the processing of the message (although the get
current timeline message will have an asynchronous response that the sender
will likely block on).
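
To show the shape of this, here is a minimal Scala sketch of a User that
speaks roughly this protocol.  It is not ESME's implementation (ESME builds
on Lift's actors); the names, the extra AddFollower message used for
fan-out, and the single-threaded-executor trick are all illustrative
assumptions.

    // Minimal sketch of the per-User message protocol.  A single-threaded
    // executor per User guarantees that only one message is processed at a
    // time -- that is the entire concurrency model.
    import java.util.concurrent.Executors

    final case class Message(poster: String, text: String, date: Long)

    sealed trait UserMsg
    final case class AddFriend(friend: UserActor)     extends UserMsg
    final case class RemoveFriend(friend: UserActor)  extends UserMsg
    final case class AddFollower(follower: UserActor) extends UserMsg // helper for fan-out, not in the list above
    final case class AddToTimeline(msg: Message)      extends UserMsg
    final case class Post(text: String)               extends UserMsg
    final case class GetTimeline(offset: Int, count: Int,
                                 reply: List[Message] => Unit) extends UserMsg

    final class UserActor(val name: String) {
      private val mailbox = Executors.newSingleThreadExecutor()
      private var following: Set[UserActor] = Set()
      private var followers: Set[UserActor] = Set()
      private var timeline:  List[Message]  = Nil   // newest first

      // Asynchronous send: the caller never blocks on processing.
      def !(m: UserMsg): Unit = mailbox.execute(() => handle(m))

      private def handle(m: UserMsg): Unit = m match {
        case AddFriend(f)       => following += f; f ! AddFollower(this)
        case RemoveFriend(f)    => following -= f
        case AddFollower(f)     => followers += f
        case AddToTimeline(msg) => timeline = msg :: timeline         // O(1) prepend
        case Post(text) =>
          val msg = Message(name, text, System.currentTimeMillis)
          timeline = msg :: timeline
          followers.foreach(_ ! AddToTimeline(msg))                   // asynchronous fan-out
        case GetTimeline(offset, count, reply) =>
          reply(timeline.slice(offset, offset + count))               // asynchronous response
      }
    }

The asynchronous AddFriend/AddFollower hand-off in the sketch is also
exactly where the loss of transactions discussed below comes from: the
follower registration can arrive after the friend's next Post.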

We have changed our abstraction from one where all data (tables and indexes)
are created equal to one where certain data structures are more prominent
(User and Message) than others (mailbox, friends).

We have lost something: transactions.  In this model, if I add Dick as a
friend, I am not guaranteed that I will receive Dick's next update... it may
take time for the messages to propagate to Dick's User and his Message may
be sent before the "add friend" message gets to him.  In the case of a
financial transaction, this would be fatal.  In the case of social
networking, this is a perfectly reasonable trade-off.

So far, we have not talked about long-lived sessions and how they are
valuable in such a model... and in particular in ESME.

If we add one more message to our User, some of the reasons for long-lived
sessions should become obvious: update me on timeline change.  If you can
register with the User for changes to the timeline, it means that we don't
have to keep asking "are we there yet?"  When a state change happens, it's
instantly propagated out to the listeners.  The alternative is for the
listeners to ask "are we there yet?" over and over.  The cost of asking "are
we there yet?" is non-trivial, as anyone who has traveled with 5 year olds
can attest.  Additionally, sometimes, when one is having a conversation,
it's nice to get an immediate response rather than waiting out some polling
period.  Also, with a listener model, the User does not need to store the
date of each message (to answer "give me new messages since xxx"), and that
cuts cache storage costs by 50% (a big number across 25,000 users).
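
Push versus poll, in miniature (a toy illustration of the listener idea, not
ESME's actual listener API; all names are made up):

    // A listener registered once gets timeline changes pushed to it
    // immediately, with no "are we there yet?" polling loop.
    final case class Note(text: String)

    final class TimelineFeed {
      private var listeners: List[Note => Unit] = Nil

      def addListener(l: Note => Unit): Unit = listeners ::= l   // long-lived registration

      def publish(n: Note): Unit = listeners.foreach(_(n))       // push on state change
    }

    object PushDemo extends App {
      val feed = new TimelineFeed
      feed.addListener(n => println(s"pushed immediately: ${n.text}"))
      feed.publish(Note("no polling required"))
    }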

So, having a long-lived session has some performance benefits over a
short-lived session and polling, but this is only part of the story.

One of the ways that RDBMSs get performance (and the way products like
Oracle distinguish themselves from the likes of MySQL) is the ability to
cache optimized query plans, cache the right data, and invalidate the right
caches at the right time.  The same requirements are going to come up in
ESME.

When I designed ESME, I changed the model from a Skittr model (1M users on a
single box) to a more enterprise-friendly model.  The key difference is that
I added the "actions" feature, where each User gets to see each message
processed in the system, analyze that message for content/context, and
perform certain actions based on that analysis.  Things like "add all
messages containing 'catfood' to my timeline", "forward all messages
containing 'ESME' to my followers", "make an HTTP post of all messages from
my boss to a paging service", or "block 50% of the messages from Joe
Blabbermouth".  Actions are cool, but they are costly.  Every message must
be compared to every action definition in the system, and that is expensive.
If each user has an average of 10 actions, each message sent has to be
compared against 250,000 actions, and if we have a peak of 5 messages per
hour per person, that's roughly 31B comparisons per hour at peak time, or
about 9M action comparisons per second.  That's load.
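
The arithmetic behind those numbers, using the same assumptions as before,
as a small Scala sketch:

    // Action-matching load at peak, from the assumptions above.
    object ActionLoadSketch extends App {
      val users               = 25000L
      val actionsPerUser      = 10L
      val peakMsgsPerUserHour = 5L

      val totalActions       = users * actionsPerUser          // 250,000 actions to check
      val messagesPerHour    = users * peakMsgsPerUserHour     // 125,000 messages/hour at peak
      val comparisonsPerHour = messagesPerHour * totalActions  // 31.25B comparisons/hour
      val comparisonsPerSec  = comparisonsPerHour / 3600       // ~8.7M comparisons/second

      println(f"comparisons/hour at peak:   $comparisonsPerHour%,d")
      println(f"comparisons/second at peak: $comparisonsPerSec%,d")
    }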

During peak load, we will need to prioritize which Users are processing
messages/actions such that the system retains responsiveness and can drain
the load.  Put another way, knowing which Users have associated long-lived
sessions allows us to prioritize the message processing for those Users.  We
allow more threads to drain the message queues for those Users while
providing fewer threads for session-less Users.  Yeah, we could prioritize
on other heuristics, but long-lived session is dead simple and will cost us
5K bytes per logged in user.  Not a huge cost and lots of benefit.
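
A toy illustration of that prioritization (the pool sizes and names here are
arbitrary assumptions, not ESME's scheduler):

    // Users with a live long-lived session drain their mailboxes on a
    // larger thread pool; session-less Users share a smaller one.
    import java.util.concurrent.Executors

    object Prioritizer {
      private val sessionedPool   = Executors.newFixedThreadPool(8)  // more threads
      private val sessionlessPool = Executors.newFixedThreadPool(2)  // fewer threads

      def process(hasLiveSession: Boolean)(work: => Unit): Unit = {
        val pool = if (hasLiveSession) sessionedPool else sessionlessPool
        pool.execute(() => work)
      }
    }

Draining a given User's message queue would then be submitted as
Prioritizer.process(hasLiveSession = true) { ... }.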

So, between the fact that long polling on a long-lived session is more
efficient than repeated polling on short-lived sessions, and the upcoming
need for message prioritization, long-lived sessions are the right design
choice.

Also, I hope that the above discussion makes it clear why I am insistent on
message-oriented APIs rather than document/REST oriented APIs.  ESME's
design is not traditional and there are fewer tools helping us get the
implementation right.  On the other hand, implementing ESME on top of a
relational/REST model cannot be done.  Let's keep our design consistent from
the APIs back.

Thanks,

David

-- 
Lift, the simply functional web framework http://liftweb.net
Beginning Scala http://www.apress.com/book/view/1430219890
Follow me: http://twitter.com/dpp
Surf the harmonics

Re: Statefulness and algorithms for social networks/graphs

Posted by Markus Kohler <ma...@gmail.com>.
Hi David,
Thanks a lot!
This makes a lot of sense to me.


Regards,
Markus

"The best way to predict the future is to invent it" -- Alan Kay


On Mon, Nov 30, 2009 at 11:24 PM, David Pollak <
feeder.of.the.bears@gmail.com> wrote:

> [snip]

Re: Statefulness and algorithms for social networks/graphs

Posted by David Pollak <fe...@gmail.com>.
On Mon, Nov 30, 2009 at 2:00 PM, Markus Kohler <ma...@gmail.com>wrote:

>
> > So, that means that each year, there will be 36,000M (36B) mailbox
> entries.
> >
>
>
> I don't understand why we would need to store all entries in a cache,
> instead of only keeping the last n entries for each user based on some
> heuristics such as the last 3 days or something. I would somehow expect
> that
> the probability that a user wants to see a message is exponentially
> decreasing with the messages age. For example that someone wants to see  a
> message that is the 1000 newest message in his timeline is probably almost
> zero.
>

Some people mine their timelines for information.  I agree that some aging
policy is necessary as 36B entries will consume a lot of storage in RAM or
on disk, but the last 1,000 is likely too few based on what I have seen of
actual user behavior.

In terms of an aging policy in an RDBMS, the cost of aging out old entries
is likely to be an index scan or something on that order (a DELETE FROM
mailbox WHERE date < xxx, or a user-by-user delete of everything past the
newest 1,000 mailbox entries).


>
> > During peak load, we will need to prioritize which Users are processing
> > messages/actions such that the system retains responsiveness and can
> drain
> > the load.  Put another way, knowing which Users have associated
> long-lived
> > sessions allows us to prioritize the message processing for those Users.
> >  We
> > allow more threads to drain the message queues for those Users while
> > providing fewer threads for session-less Users.  Yeah, we could
> prioritize
> > on other heuristics, but long-lived session is dead simple and will cost
> us
> > 5K bytes per logged in user.  Not a huge cost and lots of benefit.
> >
> >
> I have no issue with some session state and 5K is really low, and therefore
> this is not an issue.  I don't get why it has to be in the session's state
> because you could as well use the information that a user is online as a
> guidance, even if the state would be stored somewhere out of the session.
> Wouldn't make a difference I guess and storing it in the session looks
> natural.
>

The state itself is not in the session.  The session is the indicator that
the user is online.  The session contains a listener that is attached to the
User.  The only real state that resides in the session is the state
necessary to batch up any messages that the User has forwarded to the
listener in between the HTTP polling requests.  If there is an HTML front
end, state about that front end will reside in the session as well, but
that's a different issue.


>
>
> > So, between the existing long-lived session long polling is more
> efficient
> > than shortlived session repeated polling and the upcoming need for
> message
> > prioritization indicate that long-lived sessions are the right design
> > choice.
> >
> > Also, I hope that the above discussion makes it clear why I am insistent
> on
> > message-oriented APIs rather than document/REST oriented APIs.  ESME's
> > design is not traditional and there are fewer tools helping us get the
> > implementation right.  On the other hand, implementing ESME on top of a
> > relational/REST model cannot be done.  Let's keep our design consistent
> > from
> > the APIs back.
> >
> >
> I'm really not religious about REST, but I would somehow assume that in an
> Enterprise context it could be an requirement to send a link to someone
> else
> pointing to a specific potentially old message in a certain Pool.



Yes.  That's perfectly reasonable.  That message is like a static file on
disk.  Once it's written, it remains unchanged until it's deleted.  This is
an ideal application of a REST-style approach.  That's why I've advocated
for a "message based" approach first, but a REST/static approach when the
message based approach doesn't make sense.  What I am opposed to is a "try
to make everything fit the REST model" approach to API design.


> That
> sounds to me like a requirement for some kind of REST API.
> Would it be costly in your model to get the message nr. X  (+ n  older
> messages) in a users timeline?.
>

A message will exist outside of a timeline.  There exists a cache of
recently accessed messages.  Sometimes there will be a historic message that
is referenced and that will be materialized from backing store and rendered.
 It will likely fall out of cache if it's historical and not accessed again.

Thanks,

David


>
> Regards,
> Markus
>
>
>
> > Thanks,
> >
> > David
> >
> > --
> > Lift, the simply functional web framework http://liftweb.net
> > Beginning Scala http://www.apress.com/book/view/1430219890
> > Follow me: http://twitter.com/dpp
> > Surf the harmonics
> >
>



-- 
Lift, the simply functional web framework http://liftweb.net
Beginning Scala http://www.apress.com/book/view/1430219890
Follow me: http://twitter.com/dpp
Surf the harmonics

Re: Statefulness and algorithms for social networks/graphs

Posted by Markus Kohler <ma...@gmail.com>.
Hi David,
Great explanation!

I made a few comments below.

Markus

"The best way to predict the future is to invent it" -- Alan Kay


On Mon, Nov 30, 2009 at 9:40 PM, David Pollak <feeder.of.the.bears@gmail.com
> wrote:

> [snip]
>
> Some web sites will present a different state depending on who is accessing
> the site.  This can be as simple as serving different pages depending on
> the
> IP address or language preference expressed in the HTTP headers.  This is
> sessionful.  The content is calculated based on the request.  This may be
> more sophisticated in terms of authenticating the HTTP request and
> presenting content based on the authentication.
>
>
I think what you call "sessionful" is what most people would describe as
stateful.



> A session for sessionful content may be short-lived (the length of the
> request) or it may be longer lived (typically this is done with an initial
> authentication phase resulting in a shared secret [JSESSIONID] that is
> presented as an authentication proxy in subsequent requests.)
>
> But no matter the authentication mechanism or the session lifespan, there
> must exist a mechanism for translating the HTTP request into the content
> presented for the session.
>
> [snip]
>

Yes, I fully agree that an RDBMS is not ideal for implementing a
Twitter-like messaging service.  The issue is that a graph needs to be
stored, and RDBMSs are already not very good at storing trees (as you said).
If one looks at the various methods for storing trees in an RDBMS (for
example http://dev.mysql.com/tech-resources/articles/hierarchical-data.html),
one will find that all those methods have their drawbacks and will most
likely cause performance issues with large graphs.


> [snip]
>
> Let's make our discussion more concrete.  I'm assuming that an ESME
> instance
> will support 25,000 users.  On average, a user will follow 100 people (100x
> fan-out of messages).  Users will post one message every 30 minutes (48
> messages a day).  The day lasts 10 hours (this is a reasonable
> approximation
> for peakiness... basically, you're compressing 48 message sends in to a 10
> hour period).  There are 300 days in a year.  These numbers are averages
> and
> there will be some folks who are above average in terms of fan out (the CEO
> will have a 25,000x fan out) and some folks are above average in number of
> messages per day (yeah Anne, I'm lookin' at you... you too Dick.)
>
> So, that means that each year, there will be 36,000M (36B) mailbox entries.
>


I don't understand why we would need to store all entries in a cache,
instead of only keeping the last n entries for each user based on some
heuristic such as the last 3 days.  I would expect the probability that a
user wants to see a message to decrease exponentially with the message's
age.  For example, the chance that someone wants to see the 1000th newest
message in his timeline is probably almost zero.


> [snip]
>
> Imagine a model where you interact with a User with a limited set of
> (asynchronous) messages:
>
>   - add/remove friend
>   - add message to timeline
>   - post message (the user has created a message and it needs to be
>   processed)
>   - get current timeline (with offsets and number of entries)
>
> These are the basic messages needed to implement a social messaging site.
> If we guaranty that a User will only process 1 message at a time, we have a
> concurrency model.  It's simple and simple is good.  We have not defined
> how/where Users store there state (it could be on a filesystem, in an
> RDBMS,
> in a NoSQL store, who knows).  But we can say that adding a message is an
> O(1) operation (prepending to the head of a singly linked list).  Each User
> can have a caching policy (and that caching policy could be dynamic based
> on
> the access characteristics for the User).  The sender of the message
> doesn't
> block on the processing of the message (although the get current timeline
> message will have an asynchronous response that the sender will likely
> block
> on).
>
I guess this is the same idea as the one I was talking about above.


> We have changed our abstraction from one where all data (tables and
> indexes)
> are created equal to one where certain data structures are more prominent
> (User and Message) than others (mailbox, friends).
>
> We have lost something: transactions.  In this model, if I add Dick as a
> friend, I am not guaranteed that I will receive Dick's next update... it
> may
> take time for the messages to propagate to Dick's User and his Message may
> be sent before the "add friend" message gets to him.  In the case of a
> financial transaction, this would be fatal.  In the case of social
> networking, this is a perfectly reasonable trade-off.
>
>
I agree it makes sense here to weaken consistency.


> [snip]
>
> During peak load, we will need to prioritize which Users are processing
> messages/actions such that the system retains responsiveness and can drain
> the load.  Put another way, knowing which Users have associated long-lived
> sessions allows us to prioritize the message processing for those Users.
>  We
> allow more threads to drain the message queues for those Users while
> providing fewer threads for session-less Users.  Yeah, we could prioritize
> on other heuristics, but long-lived session is dead simple and will cost us
> 5K bytes per logged in user.  Not a huge cost and lots of benefit.
>
>
I have no issue with some session state, and 5K is really low, so this is
not an issue.  I don't get why it has to be in the session's state, because
you could just as well use the information that a user is online as
guidance, even if the state were stored somewhere outside the session.  I
guess it wouldn't make a difference, and storing it in the session looks
natural.


> So, between the existing long-lived session long polling is more efficient
> than shortlived session repeated polling and the upcoming need for message
> prioritization indicate that long-lived sessions are the right design
> choice.
>
> Also, I hope that the above discussion makes it clear why I am insistent on
> message-oriented APIs rather than document/REST oriented APIs.  ESME's
> design is not traditional and there are fewer tools helping us get the
> implementation right.  On the other hand, implementing ESME on top of a
> relational/REST model cannot be done.  Let's keep our design consistent
> from
> the APIs back.
>
>
I'm really not religious about REST, but I would assume that in an
Enterprise context it could be a requirement to send someone else a link
pointing to a specific, potentially old message in a certain Pool.  That
sounds to me like a requirement for some kind of REST API.
Would it be costly in your model to get message nr. X (+ n older messages)
in a user's timeline?

Regards,
Markus



> Thanks,
>
> David
>
> --
> Lift, the simply functional web framework http://liftweb.net
> Beginning Scala http://www.apress.com/book/view/1430219890
> Follow me: http://twitter.com/dpp
> Surf the harmonics
>

Re: Statefulness and algorithms for social networks/graphs

Posted by Ethan Jewett <es...@gmail.com>.
Wow David, that's a fantastic explanation. Thank you for taking the
time to write it. Should help a lot in the future (especially when I
have to remind myself why I'm messing around with Actors and Futures
in the api2 endpoint :-).

Can we put this or an edited version on the wiki?

Ethan

On Mon, Nov 30, 2009 at 3:40 PM, David Pollak
<fe...@gmail.com> wrote:
> [snip]