Caching in WordPress

(This introduction got a little longer than I thought it would be, so please bear with me in the next three paragraphs!)

I’m often surprised to find WordPress deployed in very large-scale sites; a good example would be, “Did you know that the New York Times runs their blogs on WordPress?!” I’m sure this comes as no surprise to the fellows at Automattic.

The New York Times, I’m sure, needs caching. And if your blog is overrun by Digg, Slashdot, or the next click conduit of the interwebs, then you’ll definitely need caching. The solutions are there; the most critically acclaimed (currently) is WP Super Cache, by Donncha O Caoimh. It works by saving off your pages as static HTML files and mod_rewriting requests to the cached version. Fairly simple concept, and spectacularly well implemented.

WP Super Cache has a deceptively short list of requirements: Apache, mod_rewrite, and write access to the wp-content folder. If you’re running a large blog, you’ll most likely have the resources to control the environment it runs in.

But if you’re running your blog on some shared hosting service for two dollars a month, it might not be on an Apache server. There may be no mod_rewrite, or .htaccess access. There might not even be write access available, especially if you installed WordPress by clicking a button. Having never been subjected to PHP’s safe mode, I don’t know the particulars of its restrictions, but that’s a fairly large segment of the web population as well.

As I was contemplating what I’d like to apply for in Google’s Summer of Code, I tripped over one of WordPress’ proposal ideas, namely, “Integrated Caching Solutions”. Having worked a bunch on TheDartmouth.com’s caching system, I thought that I could bring something to the table there.

The basic idea, as I see it, is to raise the amount of traffic a vanilla WordPress install can take. Thus, by definition, I’d be looking to write a core patch to the WordPress codebase, because, honestly, the caching plugin market is very well catered to already (see above). The core infrastructure could, however, definitely use some love.

Consider the most restrictive hosting scenario possible – no write access, no .htaccess, no nothing. But you have a database to run WordPress with. For caching to work for everyone, it will need to cache into the database.

Luckily, the most difficult work has been done with the brand-new 2.5 release – WP_Object_Cache exists and caches data within each request. Previous to 2.5, there was a file-based caching solution that took serialized objects and cached them on disk; this was ripped out of the core because, according to DD32, the code was not well-maintained, and hence there were more problems than benefits.

I think more performance can be squeezed out of WordPress with database-backed, object-based caching than without. It may not be a massive speedup – after all, you’ll still be spending a ton of time connecting to the database – but it would be faster than it is now. There must be the option of turning it off and falling back to the current per-request object caching, since it would be redundant to run a full-page caching solution (e.g. WP Super Cache) and a persistent, database-backed object cache as well.

I pitched this idea to the wp-hackers mailing list, and Mahmoud Al-Qudsi was kind enough to do a quick proof-of-concept for this idea. It didn’t work – taking the current memory-backed object cache as-is and moving it far, far away into the database makes page load time 4-5 times longer. However, I think implementing some more optimization would make this idea feasible – for example, pre-fetching a default set of objects from the cache at initialization, and saving all modified data to cache in one go, at the end of the request. Perhaps pages can have some hinting information built in about which objects to pre-fetch. Another idea is to have the caching engine automatically collect hinting data for each type of page. The first time each type of page is accessed it would be a little slow, but pretty fast from then on. The hints could then be stored in the cache itself (or somewhere else more permanent), and this solution would adapt to plugins’ data access automatically.
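
To make the pre-fetching and batching idea a bit more concrete, here is a minimal sketch. It is written in Python rather than PHP so that it doesn't pretend to be real WordPress code, and every name in it (DictStore, RequestCache, the "hints:" key prefix) is made up for illustration. The store is a plain dictionary standing in for the cache table, and it counts round trips so the effect of the hints is visible.

```python
class DictStore:
    """Hypothetical stand-in for the cache table; counts round trips."""

    def __init__(self):
        self.rows = {}
        self.round_trips = 0

    def get_many(self, keys):
        self.round_trips += 1
        return {k: self.rows[k] for k in keys if k in self.rows}

    def set_many(self, items):
        self.round_trips += 1
        self.rows.update(items)


class RequestCache:
    """Per-request cache: bulk read at start, one bulk write at the end."""

    def __init__(self, store, page_type):
        self.store = store
        self.hint_key = "hints:" + page_type
        # Query 1: which keys did this type of page need last time?
        self.hints = store.get_many([self.hint_key]).get(self.hint_key, [])
        # Query 2: pre-fetch all hinted keys in one go (skipped if no hints).
        self.local = store.get_many(self.hints) if self.hints else {}
        self.touched = set()  # keys used during this request
        self.dirty = {}       # values to save when the request ends

    def get(self, key, compute):
        self.touched.add(key)
        if key not in self.local:
            # Not pre-fetched: costs one extra round trip.
            found = self.store.get_many([key])
            if key in found:
                self.local[key] = found[key]
            else:
                self.local[key] = compute()
                self.dirty[key] = self.local[key]
        return self.local[key]

    def close(self):
        # Record the hints only if they changed, then save everything at once.
        touched = sorted(self.touched)
        if touched != self.hints:
            self.dirty[self.hint_key] = touched
        if self.dirty:
            self.store.set_many(self.dirty)
            self.dirty = {}
```

In this sketch, the first request for a page type that touches two objects costs four round trips (hint lookup, two misses, one save), and the second request costs two (hint lookup and pre-fetch). Whether that ratio survives contact with a real MySQL table is exactly what would need measuring.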

The first thing to do, in implementing core caching, is to figure out if this is feasible. It’s the only solution that will work on all installations of WordPress out-of-the-box. Mahmoud sent me his code (so nice of him to, thank you thank you), so that’s something to start from.

Now that we have persistent object caching, the next logical step would be caching at higher levels – namely, full-page caching. If shipped with the core, it should definitely be a plugin, and probably disabled by default; there are definite drawbacks to caching whole pages, namely that dynamic content (e.g., random ordering of links) is no longer dynamic. Of course, WordPress could simply decline to cache pages with dynamic content, but some users will have dynamic content on every page – say, an advertising system that displays a random ad per request. In this case, object-based caching will have to do, or perhaps, through the widget system, there could be a way of marking sections of pages as dynamic, and caching the page with placeholders for the sections to be replaced on each request. This might not be worth the overhead, though.
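
The "cache the page with sections to be replaced" idea might look something like the sketch below. It is Python, not WordPress code, and the placeholder syntax (an HTML comment like `<!--dynamic:random_ad-->`) and the function name are my own assumptions; the widget system has nothing like this today as far as I know.

```python
import re

# Hypothetical marker left in the cached HTML where a dynamic section goes.
PLACEHOLDER = re.compile(r"<!--dynamic:([a-z_]+)-->")


def render_cached_page(cached_html, renderers):
    """Fill each placeholder in a cached page with freshly rendered output.

    `renderers` maps a section name to a zero-argument function returning
    HTML. Sections without a renderer are rendered as an empty string.
    """
    def fill(match):
        renderer = renderers.get(match.group(1))
        return renderer() if renderer else ""

    return PLACEHOLDER.sub(fill, cached_html)
```

The static parts of the page are generated once and stored; only the functions behind the placeholders run on each request. The regex pass over the whole page is the overhead I'm worried about.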

Back to vanilla full-page caching: the obvious thing to do here is to integrate WP Super Cache into the WordPress core, if Donncha O Caoimh is OK with that, since there is no point in reinventing an already spectacularly done wheel. The big task will be to expand it to be able to cache to database with our fancy new framework from above. A simple check – whether we have .htaccess write access – will tell us whether to cache to database or file. It’ll also be worth investigating which pages can be full-page cached, and which can’t. WP Super Cache can tell us that, to some extent – but that could probably be defined better.

A persistent caching system, regardless of being stored in database or as files, needs to be quite a bit more sophisticated. Someone famous once said that (paraphrasing) the most difficult problem in computer science is cache invalidation. The cache will need to be kept both current, with the freshest content, and relevant, with the most-hit content, as the classic Digg/Slashdot effect often just hits one page over and over again. We’re also possibly working with a very restricted cache size (i.e. typical shared hosting database limits), so we’ll need to remove cache entries that may be current, but fairly old.

To accomplish a current and relevant cache, we’ll need to keep track of when a cache entry was last hit. In a database, that’s a fairly cheap INT field, and for a file it would obviously be the last accessed date. We should also keep a count of how many times a cache entry has been hit, just for statistical purposes (everyone loves big numbers attached to vaguely meaningful things).

The caching API will therefore need some equivalent of a ‘get_most_stale_entry’ method, which will get the cache entry that was last accessed the longest time ago. When the cache is full, a request to cache a new object will remove this oldest cache entry. This will keep the cache relevant; if a slashdot effect yesterday caused one cache entry to be hit 10,000 times, today’s digg effect on a different page will quickly knock the old cache entry off – since the torrent of new cache entries created by the dugg page will give the slashdotted page’s caches the oldest last accessed date.
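
Here is a small sketch of that last-accessed-date flushing, again in Python rather than PHP. The class name and the logical clock are assumptions of mine (in a database the clock would be a timestamp in that INT field); only the idea of evicting the entry with the oldest last-accessed date comes from the paragraphs above.

```python
import itertools


class LastAccessedCache:
    """Fixed-size cache that evicts the entry accessed longest ago."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        # Logical clock so the example is deterministic; a real table would
        # store a Unix timestamp instead.
        self._ticks = itertools.count(1)
        self.entries = {}  # key -> [value, last_accessed, hits]

    def get_most_stale_entry(self):
        # The key whose last access is furthest in the past.
        return min(self.entries, key=lambda k: self.entries[k][1])

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        entry[1] = next(self._ticks)  # refresh last-accessed
        entry[2] += 1                 # hit counter, purely for statistics
        return entry[0]

    def set(self, key, value):
        if key not in self.entries and len(self.entries) >= self.max_entries:
            del self.entries[self.get_most_stale_entry()]
        hits = self.entries[key][2] if key in self.entries else 0
        self.entries[key] = [value, next(self._ticks), hits]
```

Note that the hit counter plays no part in eviction: an entry with thousands of hits still goes first once everything else in the cache has been touched more recently.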

For keeping the cache current… it would seem that we wouldn’t need to do this at all, assuming that every possible way to modify the database first goes through the caching system. That is, assuming that the caching API has 100% coverage of data access ‘paths’.

If we didn’t have 100% coverage – which, even if we did, is perhaps not a good thing to assume – then we have a cache invalidation problem. Luckily, Andy Skelton has a novel idea of using versioned tags on cache entries. If I understood him correctly, the idea is to have an array, possibly stored in the cache itself, that is loaded on every request. This array would maintain the version status of, say, every WordPress table. Every time a table is changed, the version changes as well. Each cache entry’s key would be hashed together with the versions of the tables it uses, so when the version changes on a particular table, there is no cache invalidation needed – all the old entries with data from that table automatically become impossible to access, because nothing will ever compute their keys again. Another way to implement this would be through the group mechanism already in WP_Object_Cache.
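
As far as I understand the versioned-tag idea, it boils down to something like the sketch below. This is my own Python illustration, not Andy's implementation; the class and method names are invented, and a plain dictionary stands in for the cache storage.

```python
import hashlib


class VersionedTagCache:
    """Cache whose keys include the version of every table an entry uses."""

    def __init__(self):
        self.versions = {}  # table name -> version (the "version array")
        self.entries = {}   # hashed key -> value

    def _key(self, name, tables):
        # The key changes whenever any of the listed tables changes version.
        parts = [name]
        parts += ["%s=%d" % (t, self.versions.get(t, 0)) for t in sorted(tables)]
        return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

    def get(self, name, tables):
        return self.entries.get(self._key(name, tables))

    def set(self, name, tables, value):
        self.entries[self._key(name, tables)] = value

    def table_changed(self, table):
        # Called on every write to a table; nothing is deleted here.
        self.versions[table] = self.versions.get(table, 0) + 1
```

After `table_changed("comments")`, a lookup for an entry tagged with the comments table simply misses, while the old value still sits in storage under a key nobody will ask for again. Entries tagged only with other tables are unaffected.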

The above-described last-accessed-date flushing will eventually flush these dead entries out of the cache. We could maintain compatibility with pre-fetching common data objects, at the expense of having two queries at a request’s start instead of one: the first to get the version array, and the second to query for the common objects (since we need the version info to generate the correct hash).

This versioning information will go very well with full-page caching, and full-page caching will look a lot more like caching any serialized object. 

All this implementation of cache techniques can go into the core, and can be selectively enabled, as needed. This way, any caching plugin built on top of this new API can take advantage of versioned tags, last-accessed-date flushing, etc.

Here are some more implementation details that I’ve pondered on.

  • Modularity: This would essentially form the Caching API that additional caching plugins should use, most probably by forking the current WP_Object_Cache class and wp_cache_* functions, with additions for full HTML page caching, last-accessed-date flushing, versioned tags and so forth. Sketching out a good structure would take up a significant amount of time before implementation, because planning is good. It should also get some ‘stamp of approval’ on wp-hackers.
  • Administration Interface: It would be nice if there were some way of switching between caching plugins without having to edit files. Making a way for plugins to self-identify as a caching plugin so they will show up in the selection page would be good. I don’t want to duplicate too much of the available plugin functionality; just one new page to select the caching method, and let each plugin add its own pages for further configuration – just like WP Super Cache does now.
  • Storage/Table Type: There’ll need to be support for MyISAM, InnoDB, and MEMORY caching tables, unless it turns out that shared hosting providers have no way of disabling MEMORY tables in MySQL, in which case MEMORY alone might suffice. MEMORY tables will require some splitting mechanism to fit cached objects into their fixed-length rows. The old, file-based caching code could also be reintegrated, if only as a caching plugin.
  • Limited Database Size: The database cache will need to keep itself within a certain size, as many shared hosts have database size limits. Unless there’s a MySQL configuration parameter to check, I think the user will need to enter this. At any rate, the database cache shouldn’t be allowed to rampantly eat all the disk space (because it will). Maybe a default size of, say, 50 MB?
  • Hashing: There will need to be a way to differentiate, by hash, a ‘global’ object’s cache entry – e.g., alloptions – from a specific post’s cache entry. I suppose hashing the ID with the cache entry name will suffice. This may already be implemented in WP_Object_Cache with the $global_groups variable – I wasn’t particularly sure what that does right now.
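
To illustrate the limited-database-size point, here is a rough sketch of a cache table that keeps itself under a byte budget by deleting the stalest rows. It uses Python with an in-memory SQLite database purely as a stand-in for MySQL; the table layout, column names and class name are all assumptions of mine, and values are expected to be already-serialized bytes.

```python
import sqlite3


class SizeLimitedDbCache:
    """Database-backed cache that evicts stale rows to stay under max_bytes."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.tick = 0  # logical clock; a real table would store a timestamp
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE cache ("
            " cache_key TEXT PRIMARY KEY,"
            " value BLOB NOT NULL,"
            " size INTEGER NOT NULL,"
            " last_accessed INTEGER NOT NULL)"
        )

    def _now(self):
        self.tick += 1
        return self.tick

    def total_bytes(self):
        row = self.db.execute("SELECT COALESCE(SUM(size), 0) FROM cache").fetchone()
        return row[0]

    def set(self, key, value):
        self.db.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?, ?)",
            (key, value, len(value), self._now()),
        )
        # Over budget: drop the row with the oldest last-accessed value.
        while self.total_bytes() > self.max_bytes:
            self.db.execute(
                "DELETE FROM cache WHERE cache_key = "
                "(SELECT cache_key FROM cache ORDER BY last_accessed LIMIT 1)"
            )

    def get(self, key):
        row = self.db.execute(
            "SELECT value FROM cache WHERE cache_key = ?", (key,)
        ).fetchone()
        if row is None:
            return None
        self.db.execute(
            "UPDATE cache SET last_accessed = ? WHERE cache_key = ?",
            (self._now(), key),
        )
        return row[0]
```

The budget here only counts the size of the stored values, so the real table would take up somewhat more space than max_bytes once indexes and row overhead are included. A value larger than the whole budget ends up evicting itself.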

The best thing about all that’s been described so far is that it is compatible with any environment that will currently run a vanilla WordPress blog. Existing plugins – WP Super Cache, the old file-based caching code, memcached clusters – can all run on top of this API. The idea is that they’ll get more performance from this infrastructure as well, especially from the full-page caching.

Wow. What a lot of text. If you’ve gotten this far, I congratulate and sincerely thank you. Do leave a comment with your thoughts.


2 Responses to “Caching in WordPress”

  1. Andy Skelton Says:

    if a slashdot effect yesterday caused one cache entry to be hit 10,000 times, today’s digg effect on a different page will quickly knock the old cache entry off – since the torrent of new cache entries created by the dugg page will give the slashdotted page’s caches the oldest last accessed date

    For a Digg event there will be just one cached version of the page—generate once, serve many—not a torrent of new cache entries.

    Another way to implement this would be through the group mechanism already in WP_Object_Cache.

    I implemented it by passing arrays of tags to the group argument. We have some other logic piggy-backed on the group arg, so it seemed natural. The first array value is treated as a group. A private method constructs the cache array key out of the tags and their versions. Tags could be a separate argument but this was leaner—probably what you had in mind.

    The things you can do with tag one—the group—are limited to your imagination. Were you aware that $wp_object_cache->global_groups is unused? WordPress.com uses it to great effect. It just occurred to me that in core you could have several cache stores lumped into one and use the group to determine what went where.

    For example, if ( $group == ‘persistent’ ) { store in the database } else { store in memory }.

  2. neodude Says:

    For a Digg event there will be just one cached version of the page—generate once, serve many—not a torrent of new cache entries.

    I didn’t really explain myself very well here. With full-page caching, there’ll be only one version of the page, but still hundreds of dead versions from the “first comment!” commenting – since the torrent of digg pageviews will invalidate everything else out of the cache.

    Without full-page caching, it’s even gloomier; hopefully some global variables (alloptions?) will be fairly unchanged.

    Anywho – this all was a parenthetical to support the point that cache invalidation is probably best weighted entirely on how ‘hot’ each entry is (I had originally thought entries with more hits should stay for longer, but this example neatly invalidates (ha!) that idea).

    Were you aware that $wp_object_cache->global_groups is unused? WordPress.com uses it to great effect.

    I knew it didn’t feel particularly used. I suppose it only makes sense in a MU setting? I didn’t know what ‘global’ meant; now I do, I think.

    For example, if ( $group == ‘persistent’ ) { store in the database } else { store in memory }.

    Magic groups!

    It seems a little rearchitecting is in order; perhaps multiple backing storages can be implemented via configuration; there’ll need to be a WP_Cache_Master to control all the WP_Cache_Database, WP_Cache_Memcached, etc.

    Maybe make the Cache_Master a plugin, and have it take many subclasses of WP_Cache_Base as different backing stores. The _Master configuration could specify which groups get stored where?
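
    Very roughly, and in Python rather than PHP so it isn't mistaken for actual WordPress classes, the master-plus-backing-stores idea could be sketched like this. CacheBase, MemoryCache and CacheMaster are placeholder names echoing the ones above; none of them exist anywhere yet.

```python
class CacheBase:
    """Interface every backing store would implement (hypothetical)."""

    def get(self, key):
        raise NotImplementedError

    def set(self, key, value):
        raise NotImplementedError


class MemoryCache(CacheBase):
    """Simplest possible store; also used here to stand in for a database."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value


class CacheMaster:
    """Routes each cache group to the store configured for it."""

    def __init__(self, default_store, group_routes=None):
        self.default_store = default_store
        self.group_routes = dict(group_routes or {})

    def _store_for(self, group):
        return self.group_routes.get(group, self.default_store)

    def get(self, key, group="default"):
        return self._store_for(group).get((group, key))

    def set(self, key, value, group="default"):
        self._store_for(group).set((group, key), value)
```

    With a configuration like `CacheMaster(memory, {"persistent": database})`, anything cached in the ‘persistent’ group lands in the database store and everything else stays in memory, which is the magic-groups behaviour from Andy's example.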
