Next up in the toolbox series is an idea so good it deserves an entire article all to itself: consistent hashing.
Let’s say you’re a hot startup and your database is starting to slow down. You decide to cache some results so that you can render web pages more quickly. If you want your cache to use multiple servers (scale horizontally, in the biz), you’ll need some way of picking the right server for a particular key. If you only have 5 to 10 minutes allocated for this problem on your development schedule, you’ll end up using what is known as the naïve solution: put your N server IPs in an array and pick one using key % N.
I kid, I kid — I know you don’t have a development schedule. That’s OK. You’re a startup.
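For the record, here’s roughly what that naïve pick looks like in code. This is just a sketch with made-up IPs; any stable hash of the key will do:

```python
import hashlib

# Hypothetical cache server IPs.
servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

def naive_pick(key: str) -> str:
    # Hash the key to a stable integer and take it modulo N.
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return servers[h % len(servers)]
```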
Anyway, this ultra simple solution has some nice characteristics and may be the right thing to do. But your first major problem with it is that as soon as you add a server and change N, most of your cache will become invalid. Your databases will wail and gnash their teeth as practically everything has to be pulled out of the DB and stuck back into the cache. If you’ve got a popular site, what this really means is that someone is going to have to wait until 3am to add servers because that is the only time you can handle having a busted cache. Poor Asia and Europe — always getting screwed by late night server administration.
You’ll have a second problem if your cache is read-through or you have some sort of processing occurring alongside your cached data. What happens if one of your cache servers fails? Do you just fail the requests that should have used that server? Do you dynamically change N? In either case, I recommend you save the angriest twitters about your site being down. One day you’ll look back and laugh. One day.
As I said, though, that might be OK. You may be trying to crank this whole project out over the weekend and simply not have time for a better solution. That is how I wrote the caching layer for Audiogalaxy searches, and that turned out OK. The caching part, at least. But if I had known about consistent hashing at the time, I would have started with a simple version of it. It isn’t that much more complicated to implement and it gives you a lot of flexibility down the road.
The technical aspects of consistent hashing have been well explained in other places, and you’re crazy and negligent if you use this as your only reference. But, I’ll try to do my best. Consistent hashing is a technique that lets you smoothly handle these problems:
- Given a resource key and a list of servers, how do you find a primary, secondary, tertiary (and on down the line) server for the resource?
- If you have different size servers, how do you assign each of them an amount of work that corresponds to their capacity?
- How do you smoothly add capacity to the system without downtime? Specifically, this means solving two problems:
- How do you avoid dumping 1/N of the total load on a new server as soon as you turn it on?
- How do you avoid rehashing more existing keys than necessary?
In a nutshell, here is how it works. Imagine a 64-bit space. For bonus points, visualize it as a ring, or a clock face. Sure, this will make it more complicated when you try to explain it to your boss, but bear with me:

That part isn’t very complicated.
Now imagine hashing resources into points on the circle. They could be URLs, GUIDs, integer IDs, or any arbitrary sequence of bytes. Just run them through MD5 or SHA and shave off everything but 8 bytes (and if anyone tells you that you shouldn’t use MD5 for this because it isn’t secure, just nod and back away slowly. You have identified someone not worth arguing with). Now, take those freshly minted 64-bit numbers and stick them onto the circle:

Finally, imagine your servers. Imagine that you take your first server and create a string by appending the number 1 to its IP. Let’s call that string IP1-1. Next, imagine you have a second server that has twice as much memory as server 1. Start with server #2’s IP, and create 2 strings from it by appending 1 for the first one and 2 for the second one. Call those strings IP2-1 and IP2-2. Finally, imagine you have a third server that is exactly the same as your first server, and create the string IP3-1. Now, take all those strings, hash them into 64-bit numbers, and stick them on the circle with your resources:

Can you see where this is headed? You have just solved the problem of which server to use for resource A. You start where resource A is and head clockwise on the ring until you hit a server. If that server is down, you go to the next one, and so on and so forth. In practice, you’ll want to use more than 1 or 2 points for each server, but I’ll leave those details as an exercise for you, dear reader.
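If it helps to see the whole thing in code, here’s a rough sketch in Python. The class name, helper, IPs, and point counts are mine, invented just to mirror the example above; don’t treat it as production code:

```python
import hashlib
from bisect import bisect_right

def hash64(data: str) -> int:
    # MD5 (or SHA) with everything but the first 8 bytes shaved off:
    # a point on the 64-bit ring.
    return int.from_bytes(hashlib.md5(data.encode()).digest()[:8], "big")

class ConsistentHashRing:
    def __init__(self):
        self.points = []  # sorted list of (point, server IP) pairs

    def add_server(self, ip: str, num_points: int) -> None:
        # Hash "IP-1", "IP-2", ... onto the ring; bigger servers get more points.
        for i in range(1, num_points + 1):
            self.points.append((hash64(f"{ip}-{i}"), ip))
        self.points.sort()

    def find_server(self, resource_key: str) -> str:
        # Start at the resource's point and head clockwise until we hit a
        # server point, wrapping around the end of the sorted list.
        idx = bisect_right(self.points, (hash64(resource_key),))
        return self.points[idx % len(self.points)][1]

ring = ConsistentHashRing()
ring.add_server("10.0.0.1", 1)  # server #1
ring.add_server("10.0.0.2", 2)  # server #2: twice the memory, twice the points
ring.add_server("10.0.0.3", 1)  # server #3
print(ring.find_server("resource-A"))
```

Run it and you get one of the three IPs back; add a fourth server and only the keys that land just before its new points move anywhere.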
Now, allow me to use bullet points to explain how cool this is:
- Assuming you’ve used a lot more than 1 point per server, when one server goes down, every other server will get a share of the new load. In the case above, imagine what happens when server #2 goes down. Resource A shifts to server #1, and resource B shifts to server #3 (Note that this won’t help if all of your servers are already at 100% capacity. Call your VC and ask for more funding).
- You can tune the amount of load you send to each server based on that server’s capacity. Imagine this spatially – more points for a server means it covers more of the ring and is more likely to get more resources assigned to it.
You could have a process try to tune this load dynamically, but be aware that you’ll be stepping close to problems that control theory was built to solve. Control theory is more complicated than consistent hashing.
- If you store your server list in a database (2 columns: IP address and number of points), you can bring servers online slowly by gradually increasing the number of points they use. This is particularly important for services that are disk bound and need time for the kernel to fill up its caches. This is one way to deal with the datacenter variant of the Thundering Herd Problem.
Here I go again with the control theory — you could do this automatically. But adding capacity usually happens so rarely that just having somebody sitting there watching top and running SQL updates is probably fine. Of course, EC2 changes everything, so maybe you’ll be hitting the books after all.
- If you are really clever, when everything is running smoothly you can go ahead and pay the cost of storing items on both their primary and secondary cache servers. That way, when one server goes down, you’ve probably got a backup cache ready to go.
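To make that last bullet concrete, here’s one way the primary/secondary lookup might go, reusing hash64, bisect_right, and the ring sketch from above (again, my names, not an official API):

```python
def find_servers(ring, resource_key, count=2):
    # Walk clockwise from the resource's point, collecting distinct servers:
    # the first is the primary, the second is a natural home for a backup copy.
    start = bisect_right(ring.points, (hash64(resource_key),))
    found = []
    for i in range(len(ring.points)):
        server = ring.points[(start + i) % len(ring.points)][1]
        if server not in found:
            found.append(server)
        if len(found) == count:
            break
    return found

# Write to both; if the primary dies, the secondary cache is already warm.
primary, secondary = find_servers(ring, "resource-A")
```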
Pretty cool, eh?
I want to hammer on point #4 for a minute. If you are building a big system, you really need to consider what happens when machines fail. If the answer is “we crush the databases,” congratulations: you will get to observe a cascading failure. I love this stuff, so hearing about cascading failures makes me smile. But it won’t have the same effect on your users.
Finally, you may not know this, but you use consistent hashing every time you put something in your cart at Amazon.com. Their massively scalable data store, Dynamo, uses this technique. Or if you use Last.fm, you’ve used a great combination: consistent hashing + memcached. They were kind enough to release their changes, so if you are using memcached, you can just use their code without dealing with these messy details. But keep in mind that there are more applications to this idea than just simple caching. Consistent hashing is a powerful idea for anyone building services that have to scale across a group of computers.
Would sticky sessions enabled on a load balancer help with the caching issue “If you want your cache to use multiple servers (scale horizontally, in the biz), you’ll need some way of picking the right server for a particular key.”? Good article by the way…thanks.
Interesting and informative article, thanks!
Peter,
Sticking a user’s session to a server (web, application, or otherwise) isn’t going to help in determining which particular server within your cache cluster has the key you want.
Al.
Thanks for sharing important logic like this that most people don’t think of until they are in a big mess.
MD5 isn’t secure.
Thanks for the great article, Tom.
Your clear explanation and illustrations inspired me to write an open source implementation for PHP, as I couldn’t see anything decent around that fit the bill. I’ve put it on Google Code at http://code.google.com/p/flexihash/ in case anybody needs it.
Oh, there’s a Java version in there too, along with the libketama C library that is used for the PHP extension.
Hej Tom,
I got half excited about your article… it hits most of the squares… However, your circle shows the servers nicely equidistant around the circle, but if we leave their placement to the hash function then they could actually all land in a very acute part of the circle, so that the services or resources could mostly be directed unevenly to one server….
Would it not be best to divide the available resources equally in the pie? (One could take weighting into account here also.)
In the case that one server fails, instead of applying round-robin, would it not be best to select the best server, i.e. the server with the most currently available processing capability (obtained from server size and current load)?
Your comments would be well appreciated.
Slainte //Ross
@Ross — each server should be hashed into many points on the circle. That way probability takes care of spreading them out equally.
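A quick way to convince yourself of that is to measure how much of the ring each server ends up owning with one point each versus many. This reuses the hash64 helper and bisect_right from the sketches above, with made-up IPs:

```python
from collections import Counter

def ring_share(ips, points_per_server, samples=100_000):
    points = sorted((hash64(f"{ip}-{i}"), ip)
                    for ip in ips
                    for i in range(1, points_per_server + 1))
    owners = Counter()
    for n in range(samples):
        idx = bisect_right(points, (hash64(f"key-{n}"),)) % len(points)
        owners[points[idx][1]] += 1
    return {ip: round(count / samples, 3) for ip, count in owners.items()}

ips = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
print(ring_share(ips, 1))    # usually lopsided
print(ring_share(ips, 100))  # much closer to an even third each
```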
Good article, Tom.
For newbies, if you also covered active-standby in this article, the removal use case would make more sense. I mean, when one server has to be removed, what happens to the data it contained? Either (a) the standby server is assigned the same IP (say, using a virtual IP), or (b) a new server holding the original data of the failed node is added to the hashing and the failed node’s IP is removed.
Many thanks Tom
Very informative article. Nicely written. Cool stuff.
Thanks.