Memcached at Facebook (2013)

Mon Mar 15 2021

tags: draft programming computer science self study notes public 6.824 MIT distributed systems

Introduction

I read the following the paper as part of my self-study of 6.824 with NUS SoC folks. It may or may not be part of my goal to finish a Stanford CS degree in a year.

This week I watched the 6.824 lecture video Lecture 16: Cache Consistency: Memcached at Facebook. and we discussed the paper [Nishtala et al. (2013). Scaling Memcache at Facebook](https://files.lieu.gg/docs/Nishtala et al. (2013). Scaling Memcache at Facebook..pdf).

Memcached is an in-memory hash table providing storage at low cost.

The big picture

The idea is to reduce the load on the database layer by caching reads. For FB's usecase they have orders of magnitude more reads than writes, and so caching can have significant advantages.

Additionally, it's OK that reads don't necessarily observe the latest writes. So consistency (the condition that all reads observe the latest write) can be relaxed. e.g. a write can be an edit to a post, or a new post. It's OK if you refresh the page a few seconds after a new post has been made and don't see anything. What we want to avoid is unbounded inconsistency: after some time period $t$ we want to guarantee that all users see the latest version of the data.

Second, read operations fetch data from a variety of sources, e.g. MySQL, HDFS, backend services: this points to a separate storage layer that can handle heterogeneous

Memcached provides set, get, and delete.

Memcache is a "demand-filled look aside cache." in contrast to an inline, or "look-through" cache.

(Look-aside means you're aware of the cache, you decide which data paths you want to cache. Look-through means you can't choose to cache or not.)

Get: retrieve from cache: if miss, get from DB. Set: update DB, then invalidate the cache.

Generic cache: use it as a general key-value store, storing pre-computed results from ML algorithms.

As is, memcached runs only on a single server: FB scaled it out to run on a distributed system.

Two key design goals:

any change must impact a user-facing or operational issue: optimisations not considered.
The probability of reading transient stale data as a parameter to be tuned, similar to responsiveness.

Web servers must communicate with many memcached servers, all web servers communicate with all memcached server in a short period.

This all-to-all communication pattern can cause incast congestion or allow a single server to become the bottleneck for many web servers. Data replication alleviates the single-server bottleneck but leads to significant memory inefficiencies in the common case.

Guarantees

Memcached at Facebook promises "best-effort" eventual consistency. Without the cold cluster warm up (more on this later), Memcached guarantees eventual consistency in the absence of network partitions, and loses consistency in the presence of network partitions.

Eventual consistency means that after some bounded time, all reads will eventually return the latest write.

Memcached does not guarantee eventual consistency due to the cold Cold cluster: The two-second holdoff rule fails eventual consistency and results instead in "best-effort" eventual consistency: if the DB hasn't deleted the cache entry in the warm cluster within 2 seconds, then the cold cluster will read a stale value from the warm cluster and that value will be indefinitely inconsistent.

It also doesn't guarantee eventual consistency in the presence of network partitions, because a put request can fail/be partitioned after updating the DB but before invalidating the cache. Then that value will again be indefinitely inconsistent.

Glossary

fan-out: one web server needs to talk to lots of memcache servers to get the data they need.
consistent hashing: reduces the need for rehashing when the number of buckets grows or shrinks
incast congestion: problem at the web server side. When all the responses from the memcache layer are coming back, they get stuck.
single server becoming the bottleneck: when you have a hot key.
load-link/store-conditional:
hold-off-time: Hold of period of two seconds. If a miss is detected in the cold-cluster C, the It can get the stale value, but it won't store the stale value in C.
system V shared memory: just generic shared memory

Three techniques to reduce latency

How to reduce latency? Parallelise and batch, connection coalescing, and sliding window. They build a DAG representing the dependencies, then batch them.

Each server has a library

Standalone proxy (web server) mcrouter: presents a memcached server interface and routes requests/replies to/from other servers.

--> why would u use a proxy.

Clients use UDP and TCP, favouring UDP for get requests to reduce latency and overhead.

(UDP use memcache to do amplification attack)

UDP might detect packets that are dropped or received out of order, we just call it a cache miss.

For set and delete, we need reliability. So we need TCP. Since TCP are stateful, each of these threads must have a separate connection. We use mcrouterto coalesce connections within a machine/server, so instead of in a per-thread basis it's per-machine basis.

Preventing incast congestion

Each server will send multiple requests to different memcache servers, and this may overload the server when the memcache responses all hit the server at the same time. To prevent this, we limit the number of open connections.

Strike a balance where you have too small window size, vs incast congestion with too large a window size.

Reducing Load

Leases address two problems: stale sets and thundering herds. A stale set

Thundering herd: a specific key undergoes heavy read and write activity, which repeatedly invalidates the recently set values, causing all of them to default to the more costly path.

What's a lease: when you as client 1 get a cache miss, the memcached server gives you a 64-bit token. This 64-bit token is a lease: it allows you and you only to

There must be a timeout for this lease Each token lasts only for 10 seconds: subseuquent

Requests for a key's value within 10 seconds of a token being issued results in o

So if the client crashes, it will issue a new token after ten seconds to the next client who sends in a read request.

... explain this section better;

Using lease to minimise application wait time.

We can further reduce this wait time (10s) by identifying situations in which returning slightly out-of-date data is acceptable. When a key is deleted we put it in a temporary store, and if clients are happy to receive stale data, we can give them this recently deleted data.

Memcache pools

There's different types of data: some data is frequently accessed but cache misses are inexpensive, and

Many high churn items will eventually evict low churn items, so we separate different pools.

Replication between pools

storing multiple copies of the same data; they only replicate when the data set is small, requests ask for many keys simultaneously, and the request rate is much higher. They can either do replication or do key-space partitioning. Given the fact that the difference in memcached overhead for retrievin With partitioning

Handling failures

Gutter servers are spun up when memcache servers fail

Clients don't invalidate keys in a gutter server

You hit the DB once and store it in the gutter server. Gutter server prevents lots of DB hits.

How do we deal with multiple regions?

Web server writes to the database directly.

Each webserver has one MCRouter, each CLUSTER also has multiple additional MCRouters

Since cross-cluster communication is expensive, we want to reduce this as far as possible. Hence the cluster-wide mcrouter.

Regional pool

We store cold keys in a shared memcached

Cold cluster warm up

We let cold clusters read from the warm cluster, but this introduces the possibility of inconsistency.

What's the difference between look-aside vs look-through cache?

Look-aside means you're aware of the cache, you decide which data paths you want to cache. Look-through means you can't choose to cache or not.

What Facebook needs

You don't need linearisability, but you cannot cache stale data indefinitely.

Also, user cannot see stale data if they update their own data.

Key concern: we want the user who writes the data not to see their stale data.

Also, we must never expose the database layer to the full brunt of the read requests.

READ

v = get(k). Client hashes the data into k.

# READ 
# where get_cache and set_cache are RPC calls to the Memcached layer
v = get_cache(k)
if v is None:
    v = db_fetch(k)
    set_cache(k, v)
    
    
# WRITE(k, v)
send (k, v) to DB
delete_cache(k)

When FE sends a write to the DB, it will send a DELETE to the rest of the replicated DB

Invalidate scheme vs update scheme?

What's the difference? Why does FB use invalidate instead of delete?

(37:00)

Partition vs replication?

(38:00)

Partition is RAM efficient but not good for hot keys. Clients need to talk to every partition. Replication: good if hot keys, few TCP connections (each client only needs to talk to one partition), but less total data stored.

Why do you need the delete done by the client?

The front-end needs to be the one to do it to ensure that writes by the user are not stale. (The user observes non-stale data).

Bibliography

MIT 6.824 Memcached lecture notes MIT 6.824 Memcached FAQ

Appendix

Why Memcached at FB is not consistent

First let's talk about the key invariant of any database. A database must give the following guarantees (invariants):

Once something is written, it must be readable;
All reads must observe the latest writes.

We say a system is consistent if for every valid concurrent history (a history of interleaving concurrent events that could happen in normal operation) there exists some rearrangement under the consistency rules that preserves the invariants of the database.

So let's consider a particular concurrent history that is problematic. Any write needs to do two separate steps: update the DB, then invalidate the cache entry. But suppose there's another read request that comes in before you can invalidate the cache entry. Then you'll end up serving a stale value to the read.

write1 db start
write1 db end
write1 cache start
write1 cache end
read start
read end (now the cache is filled up)
write2 db start
write2 db end
read start
read end
write2 cache start
write2 cache end

Here the last read does not observe the latest DB write because the second write has yet to delete the cached entry by the time the last read reads from the cache. This breaks the invariant.

Now the challenge is to rearrange this history by the rules allowed by various concurrency models

This history will result in a stale read, thus breaking the invariant. This is obviously not linearisable, since there's nothing to rearrange without changing the real-time ordering.

Might the system be serialisable? I think it is, pathologically. A serialisable database can always reorder all the reads in front and put all the writes after all the reads, so all reads return 0. This is an incredibly weak condition for a system that only allows reads and writes as atomic transactions! See Fauna blog post on why serialisability is useless:

But let's assume that we don't allow that. Let's use strict serialisability. But then of course it isn't strict serialisable for the same reason it isn't linearisable.