Implementing Scale Up for Web 2.0 Sites with Current Practices
This blog was recently featured on Slashdot over the Thanksgiving holiday in the US. It was the perfect storm: commercial news organizations were mostly dormant, creating a slow news day, and geeks like me were at home eager to get the latest technology scoop. What surprised me is how this relatively modest box, a Linode 540MB Xen Virtual Machine, withstood 100+ requests a second without even breaking a sweat. Furthermore, at that point I had performed only some of the tuning I detail below. After following the full guide, it scales to over 1100 requests a second!
I will detail how to tune your server for optimum capacity, or what I will call free scale up (as opposed to scale up by adding hardware or scale out - adding machines, database servers, application servers, load balancing - which may come in a future article depending on interest). Most of the ideas here are platform neutral - both OS and application server - assuming you are using a UNIX style OS.
The only tutorials I've found are dated and don't cover the latest practices like Varnish or Passenger, so read on for a fresh look.
The intended audience for this article is anyone running a web site. Running your own web server gives much greater flexibility in choice of development environment. A dedicated server and certain virtual private server providers give much more predictable performance and won't cancel your service on a whim. (I'm looking at you, MediaTemple... Google for horror stories.) A Linode VDS is much more flexible and very powerful for around the same cost.
Most people use Apache. According to Netcraft, over 50% of hosts were running it as of November. For good reason: Apache has proven stability, scalability, and security. Some folks are quick to rip out Apache when the real problem is poor configuration and tuning. I personally find it to be an excellent choice for most sites because of the aforementioned traits and first-rate extensions. With proper setup, you will likely max out your transfer or tax your application server before Apache ever becomes the bottleneck.
The key to tuning Apache is to minimize RAM usage, especially on a limited machine like my 512MB Linode. Memory swapping of applications to disk is almost entirely unacceptable on modern servers. Disk I/O is very expensive and the biggest bottleneck on modern computers, which is why swap is so unappealing.
Therefore, you need to:
- Limit overall Apache memory usage
- Minimize per thread/process memory usage
- Minimize disk I/O
Limit overall memory usage
Step one is very important. If your server begins swapping heavily, it can be very difficult to even log on and perform administration. You need to develop an idea of the RAM an average Apache process is using via top, ps, or another monitoring framework. Look at the RES (resident) column in top rather than VIRT; because shared libraries are shared between all processes, even RES somewhat overstates the cost of each additional process. Take the amount of available RAM and divide it by this per-process number. Available RAM should take into account RAM used by other processes, including your database, when under a comparable load. Set the MaxClients directive to a number close to the result, and tune accordingly with benchmarks (see Benchmarking section).
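The arithmetic is simply available RAM divided by the RAM one Apache child uses. Here is a minimal sketch in Python; the function name and all of the numbers are made up for illustration, so plug in what top reports on your own box.

```python
def estimate_max_clients(total_ram_mb, reserved_mb, per_process_mb):
    """Return how many Apache children fit in RAM without swapping.

    total_ram_mb   -- physical RAM on the machine
    reserved_mb    -- RAM used by everything else (database, OS, proxy) under load
    per_process_mb -- resident size of an average Apache child
    """
    if per_process_mb <= 0:
        raise ValueError("per_process_mb must be positive")
    available_mb = total_ram_mb - reserved_mb
    if available_mb <= 0:
        return 0
    # Round down: one child too many is what pushes the box into swap.
    return int(available_mb // per_process_mb)


# Hypothetical numbers: 540 MB box, 240 MB kept for the database and OS,
# 12 MB resident per Apache child.
print(estimate_max_clients(540, 240, 12))  # (540 - 240) // 12 = 25
```

Treat the result as a starting point for MaxClients, then benchmark and adjust.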
Minimize memory usage
Step two determines how many child processes you can handle. This is important because the more children you can run, the more in-flight requests you can serve and the lower the end-user latency. This is also a lot more environment dependent than step one.
A good way to reduce memory consumption is to unload unneeded modules. Most server operating systems default Apache with a wide range of modules that are probably not used on your site including several basic authentication methods. Using shared objects rather than static modules will help memory usage as well, and most distributions ship this way.
If you use an Apache module for your application server (mod_php, mod_perl, mod_python, Passenger aka mod_rails), each child process will consume the memory of that module regardless of whether or not it is serving a static asset (images, css, etc.) or an application page. Mitigate this by using a proxy (see next section) or moving application serving to its own processes via FastCGI (PHP, most others), AJP (Java, Python), WSGI (newer Python), proxy (Ruby, all).
I should take a moment to step back and hit on an important topic. Hard disks have improved very little in regard to performance in recent years. Disk I/O is an expensive task and therefore the primary bottleneck you wish to avoid.
When Apache logging is enabled, a write operation must occur for every hit. If possible, consider completely disabling access logging. You can outsource web statistics to Google Analytics. If you require logging, make sure HostnameLookups is disabled (network I/O is even more expensive than disk!) and batch look-ups on another machine or during idle periods with a log analyzer. As your setup grows (scale-out), log files will become more cumbersome and you will probably be logging to a database or a central server anyway. Varnish, a proxy/HTTP accelerator detailed below, has an optimized design for logging.
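To give an idea of what batching look-ups means, here is a rough Python sketch that resolves hostnames after the fact and only once per distinct IP. It assumes each log line starts with the client IP followed by a space (as in Apache's common log format); the function names are mine, and a real log analyzer will do far more than this.

```python
import socket


def reverse_lookup(ip):
    """Resolve an IP to a hostname, falling back to the IP on failure."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return ip


def resolve_log_lines(lines, resolver=reverse_lookup):
    """Replace the leading IP of each access-log line with its hostname.

    Each distinct IP is resolved only once, no matter how many hits it made.
    """
    cache = {}
    resolved = []
    for line in lines:
        ip, sep, rest = line.partition(" ")
        if ip not in cache:
            cache[ip] = resolver(ip)
        resolved.append(cache[ip] + sep + rest)
    return resolved
```

Run something like this from cron during a quiet period, or on another machine entirely, so the DNS round trips never sit in the request path.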
Apache has an integrated cache module that will keep frequently hit static assets in memory. For larger sites, forgo this and use a proxy which will be more flexible and allow easier scale-out.
Apache makes use of MPMs, or Multi-Processing Modules, for its core functionality. The default on UNIX is prefork, which dedicates a separate single-threaded process to each concurrent request. By switching to a threaded MPM such as worker or event, you can cut down overhead and memory use. Some modules do not play well with threading (PHP), so you should research before changing MPMs. prefork works well for one and two core servers.
Alternative Web Servers
Lighttpd is the leading alternative FOSS web server. Users include A-list web sites such as YouTube and Wikipedia. Benchmarks show impressive performance. Keep in mind that Apache is by no means slow or resource intensive, and links on that page show that it is faster on some workloads.
When making comparisons, keep in mind that by design you will probably be using a FastCGI application server and most of the optimizations above will hold true for Lighty.
For sites with long connection times (download servers, AJAX keep-alive) or static content servers, I would definitely lean toward it (scale-out).
Nginx has also been picking up steam (pun intended) and is being used by large sites like Wordpress.com. I would consider it in the same class as Lighty.
A reverse proxy is very useful for modern web serving. Even with just one server, a reverse proxy will keep common pages in memory - greatly reducing disk I/O. It will also keep static requests from using potentially heavy application server HTTPd processes. Proxies are often very fast at basic HTTP since they are not concerned with all the features of a web server. When it comes time to scale-out, the proxy can be moved to a separate server. Proxies can direct traffic to different backend servers. Proxies can even be placed in geographically dispersed areas (think CDNs: Akamai, Limelight - YouTube, Google). Logging, compression, and SSL can be offloaded to the proxy. In short, you want a proxy even on a single server (or at least mod_cache).
Varnish bills itself as an HTTP accelerator. It was written from the ground up to perform reverse proxying, and this it does well. The Varnish design philosophy is enlightened and leaves a lot of the work, like memory management, to modern advanced operating systems. Logging is performed in a separate process and is optimized. If you need an advanced proxy and accelerator, this is likely the way to go.
Squid has traditionally been used as the de facto FOSS forward and reverse proxy. Many large sites such as Wikipedia are extensive users.
Apache and Lighttpd
Both Apache and Lighttpd have modules that will allow them to cache and reverse proxy. For single server setups, it would probably be worth reusing the components of your web server (think: shared memory) if your application server is external. mod_proxy is very useful for forwarding Ruby requests to a Ruby web server like Mongrel or Thin.
The application server is where most of the magic happens in today's web 2.0 sites. Gone are the days of static HTML files. Most sites are now dynamically generated every visit, and customized per visitor. This is an order of magnitude more complex, and a lot of CPU time is spent on page generation. Therefore, tuning here is often one of the best things you can do to improve site scalability.
PHP is the most widely deployed language on the web. Many extremely popular applications are written in PHP, including: MediaWiki, Wordpress, Drupal, and phpBB.
By default, PHP compiles a script down into opcodes every time it is called. This translation is necessary because the Zend Engine executes opcodes, not source text. Repeating it on every call is unnecessary, though, since the source code will rarely change once deployed. Luckily, a cache can be added that will eliminate this step. The net performance gain can be a factor of 2 to 10, very impressive for a simple install!
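If the idea is fuzzy, here is an analogy in Python (not PHP, and not how APC is actually implemented): compile a script once, keep the compiled code object around, and only recompile when the source changes. The class name and the compile counter are my own inventions for the demonstration.

```python
class CompiledScriptCache:
    """Keep compiled code objects so each script is compiled only once."""

    def __init__(self):
        self._cache = {}        # script name -> (source, code object)
        self.compile_count = 0  # how many times we actually compiled

    def run(self, name, source):
        """Execute `source`, recompiling only if it changed since last time."""
        entry = self._cache.get(name)
        if entry is None or entry[0] != source:
            # Cache miss: pay the compilation cost and remember the result.
            code = compile(source, name, "exec")
            self.compile_count += 1
            self._cache[name] = (source, code)
        else:
            code = entry[1]
        namespace = {}
        exec(code, namespace)
        return namespace


cache = CompiledScriptCache()
for _ in range(1000):
    cache.run("index", "total = sum(range(10))")
print(cache.compile_count)  # 1 compile for 1000 runs
```

A real opcode cache keeps the compiled form in shared memory so every web server process benefits, which this toy version does not attempt.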
These days, you should choose APC - The Alternative PHP Cache. Once upon a time, there were several choices here. Turck MMCache was notably fast, beating even the commercial Zend Suite, but mysteriously died out (the original author is now a Zend employee. hmm.. coincidence??). Others have tried to revive it in the form of eAccelerator, but it is neither stable nor active. Any other arguments are moot since APC will be part of PHP6 core and has PHP's founder as a developer.
Just as with Apache, removing unused extensions in PHP will help reduce memory usage. These can be commented out in php.ini.
Rails has gained a lot of steam (okay I'm wearing that one out) and is a favorite among many Web 2.0 startups including Twitter.
A lot of Rails scalability problems are due to the underlying Ruby language. The garbage collector, threading, and memory allocator have been pinpointed as particularly bad. Work is underway to fix these in Ruby 1.9 (bytecode) and 2.0 (threading). In the meantime, consider Ruby Enterprise Edition in tandem with Passenger. Personally, I'd rather avoid Ruby and all you kool-aid drinkers (but I've done a large deployment of Passenger). Go Python :).
Python is just a plain good language. With that out of the way: CPython already compiles source to bytecode and caches it in .pyc files, so what it is supposed to be getting sooner or later is a JIT compiler. Psyco can yield an average 4x performance improvement and is available now. PyPy should be here sooner rather than later.
Due to the Java language design, code is JIT (Just-In-Time) compiled and you don't have the compilation problem that the dynamic languages above do.
Java web apps are immensely complex, and aside from the latest JDK (126.96.36.199), your container will play a big role in speed. Jetty and Tomcat are always good choices.
Databases and Database Caching
A large portion of modern web applications are database driven. To keep your site running, this point of contention must be addressed. MySQL is ubiquitous and known for its speed. PostgreSQL offers some advanced features and is known as the DBA's FOSS database. If you need extreme scalability, consider DB2 but prepare to pay dearly :-).
MySQL comes configured fairly well out of the box in most distributions. MySQL Performance Blog sums it up better than I can, so head that way for basic tuning info.
Probably one of the easiest things you can do is enable the integrated query cache. The good news is your application doesn't need to do anything to take advantage of this.
query_cache_size = 64M
For single server web workloads, this simple change can work miracles and prevent dreaded MySQL connection errors. This is especially true since web apps are primarily read oriented. The query cache isn't perfect in all situations, and in larger sites memcached is more appropriate but has its own disadvantages (see memcached section).
PostgreSQL should also be set up fairly well by your distribution. shared_buffers should probably be tuned, as well as max_connections. See the PostgreSQL wiki on tuning for a good overview.
There is nothing strictly akin to the MySQL query cache, for better or worse.
Applications and Application Caching
This is potentially the hardest step to implement, yet can also yield the greatest reward. Caching common database queries, objects, modules, or even writing static HTML versions of a page can cut server load to nothing. If you are using a common FOSS (free) or COTS (commercial) product, chances are the software already implements some of these options and they may just need to be activated or downloaded as an extension.
Keep in mind not all things are effectively cached, and you may need to perform a major rework to implement aggressive caching like this.
Generic Data Caching - memcached and APC
Many common applications contain backends for caching against memcached or APC. MediaWiki is a prime example, integrating nicely with either one. If you are writing your own apps, using a memory cache can greatly reduce dependency on the database.
Realizing that databases have a lot of constraints, the folks at LiveJournal.com wrote a generic caching framework called memcached. Many large sites, such as Facebook, Wikipedia, and Slashdot, use it.
The bad news is you have to port your application to store and check against memcached. Database queries are a prime target, but just about anything can be stored here.
It is also handy for scale-out because you can add dedicated cache servers.
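The porting work usually boils down to one pattern: check the cache first, fall back to the database on a miss, then store the result. Here is a minimal sketch in Python. DictCache is only an in-process stand-in I made up so the example runs without a memcached server; a real client would talk to memcached over the network, and cached_query and run_query are hypothetical names, not part of any library.

```python
import hashlib


class DictCache:
    """In-process stand-in with memcached-style get/set (illustration only)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)  # None signals a cache miss

    def set(self, key, value):
        self._data[key] = value


def cached_query(cache, run_query, sql):
    """Return rows for `sql`, hitting the database only on a cache miss."""
    # Hash the SQL so the key stays short and free of spaces.
    key = "sql:" + hashlib.sha256(sql.encode("utf-8")).hexdigest()
    rows = cache.get(key)
    if rows is None:
        rows = run_query(sql)  # the expensive part
        cache.set(key, rows)
    return rows
```

What this sketch leaves out is the hard part: expiring or deleting entries when the underlying rows change, which is why the port takes real effort.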
APC user cache
PHP APC users can manually store information in APC's shared memory. This is ideal for single server solutions. Take a look at this performance comparison vs memcached and files.
Although most pages are dynamically generated these days, a lot are needlessly so. For example, a content management system might include a header, content, comments and a footer. This output can be written out as a static HTML page whenever an author updates the article. The static page is then served until a user comments on the article, which triggers a cache invalidation, and the page is rendered and stored again. The output of generated menus, columns, and other objects can be stored in cache form as well.
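The render, serve, invalidate cycle described above can be sketched in a few lines of Python. Everything here is hypothetical (the class, the file naming, the render callback); a real CMS would hook invalidate() into its comment and edit handlers and let the web server serve the files directly.

```python
import os


class StaticPageCache:
    """Serve rendered articles from static HTML files until invalidated."""

    def __init__(self, directory, render):
        self.directory = directory  # where the static files live
        self.render = render        # callable: article_id -> HTML string

    def _path(self, article_id):
        return os.path.join(self.directory, "article-%d.html" % article_id)

    def get(self, article_id):
        """Return the page, rendering and storing it only if no file exists."""
        path = self._path(article_id)
        if not os.path.exists(path):
            html = self.render(article_id)
            with open(path, "w", encoding="utf-8") as handle:
                handle.write(html)
            return html
        with open(path, encoding="utf-8") as handle:
            return handle.read()

    def invalidate(self, article_id):
        """Drop the static copy, e.g. after a new comment or an author edit."""
        try:
            os.remove(self._path(article_id))
        except FileNotFoundError:
            pass  # nothing cached yet, nothing to do
```

The next request after invalidate() pays the full rendering cost once, and everyone after that gets the static file again.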
Wordpress Cache Plugins
Wordpress has a couple of plugins that are mandatory for large sites.
WP Super Cache will generate static HTML files of posts on your blog. They are automatically served via some mod_rewrite magic, and will expire and update automatically. This can effectively reduce load to almost nothing - it completely eliminates database access and PHP execution for cached pages.
WP Widget Cache is a nice addition that will cache output of widgets (sidebar elements such as menus) that don't commonly change.
It is important to benchmark your site after making changes to see if it meets performance expectations. ab is a common tool for this task.
The following will run 10 concurrent requests for 3000 total against localhost:
ab -c10 -n3000 http://localhost/
Be very careful when benchmarking a live site. You could effectively Denial of Service your server while it is processing all those requests.
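If you want to script a quick comparison before and after a change, the same idea (N total requests, C at a time, report requests per second) fits in a few lines of Python. This is a rough sketch and no substitute for ab; the benchmark function and its parameter names are my own, and fetch can be any callable that performs one request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def benchmark(fetch, total=3000, concurrency=10):
    """Call fetch() `total` times from `concurrency` threads.

    Returns the measured requests per second.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # Consume the iterator so every call finishes before the clock stops.
        list(pool.map(lambda _: fetch(), range(total)))
    elapsed = time.perf_counter() - start
    return total / elapsed if elapsed > 0 else float("inf")
```

For a real test, fetch could be something like lambda: urllib.request.urlopen("http://localhost/").read(), and the same warning applies: point it at a live site only if you are prepared for the consequences.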
What do you think?
I'd be happy to hear your stories from the trenches. Please share your tuning advice!