Load-balancing CAS, the options

Two basic strategies for load-balancing CAS servers, with failover, seem relatively obvious.

Neither requires any change to the protocol as perceived by the CAS client; however, both require modifications to the CAS server implementation, and both raise configuration issues.

Both approaches assume that the CAS authentication service, which is basically a Java servlet, is deployed to multiple servlet containers to which traditional web-server load-balancing and failover techniques can be applied.

Java servlet technology already provides some support for load-balancing, offering seamless session migration between servlet container instances. However, CAS requires application-wide server-side state to maintain its cache of Ticket-Granting Cookie identifiers (TGCs); this cache is a Map which must be updatable from any of the CAS web application instances taking part in the load-balancing.

Two obvious approaches to manage this update are outlined below. They trade off additional implementation burden against configuration and deployment issues.

JavaGroups

From an implementation point of view, the simplest approach is to adopt a JavaGroups drop-in replacement for java.util.Map, which automatically provides distributed update and failover.

This has a small impact on the existing CAS server code; however, the replication protocol used by JavaGroups requires careful configuration. In particular, attention should be paid to the replication protocol options selected to ensure that an attack against the CAS state-replication channel cannot be used to insert maliciously-crafted Ticket entries into the TGC->Ticket Map.

Update performance of the Map will suffer somewhat under this option; however, the JavaGroups protocol can be streamlined to optimise this operation.

The principal burden of this approach lies with the additional complexity of deploying and configuring the JavaGroups layer.
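
As a rough sketch of why the code impact can stay small: if the server reaches its TGC cache only through the java.util.Map interface, a replicated Map can be substituted when the cache is constructed. The names below (TgcCache, register, lookup, expire) are hypothetical and are not taken from the CAS source, and the ticket is simplified to a user principal string. No JavaGroups API is shown; a ConcurrentHashMap would stand in for the replicated Map on a single node.

```java
import java.util.Map;

// Hypothetical wrapper around the TGC -> ticket cache; not the CAS server's actual class.
class TgcCache {
    // TGC id -> user principal (a simplified stand-in for a full ticket object).
    private final Map<String, String> tickets;

    // The backing Map is injected, so a replicated Map can replace a local one
    // without touching the rest of the server code.
    TgcCache(Map<String, String> backingMap) {
        this.tickets = backingMap;
    }

    // Records a newly issued TGC.
    void register(String tgcId, String principal) {
        tickets.put(tgcId, principal);
    }

    // Returns the principal for a known TGC, or null if it is not cached.
    String lookup(String tgcId) {
        return tickets.get(tgcId);
    }

    // Removes a TGC, for example on logout or expiry.
    void expire(String tgcId) {
        tickets.remove(tgcId);
    }
}
```

Two TgcCache instances constructed over the same Map see each other's updates, which is the behaviour a distributed Map would have to provide across servlet containers.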

Cookies

The second option resembles the TCP "syncookie" tactic; that is, the TGC delivered to the client is no longer a truly random string; instead, it contains encrypted details of the ticket which can be used to validate it in the absence of a cache entry.

Here, the CAS server uses a randomly-generated, server-secret symmetric key. This key should be updated on a regular basis to mitigate the possibility of attacks against the cookie scheme; a lifetime of 8 hours (comparable to the normal TGC lifetime) should be appropriate. Each CAS server should know the set of secret keys (normally the current key and the previous one) that might have been used to encrypt any live TGC.

The synchronisation effort between two CAS server instances can be limited to the occasional distribution of a new, updated secret key. Clearly, this distribution mechanism should be protected. Since the time between updates is large, a simple approach would be to use an external key-generation utility that communicates through a properly-secured channel with each CAS server instance.
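
The current-plus-previous key set might be held along the following lines. This is only a sketch: the class name TgcKeyRing and its methods are hypothetical, AES with a 128-bit key is an assumption (no cipher is specified above), and the secured channel that delivers each new key is left out entirely.

```java
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;

// Hypothetical holder for the keys that may have encrypted a live TGC.
class TgcKeyRing {
    private SecretKey current;
    private SecretKey previous; // null until the first rotation

    TgcKeyRing(SecretKey initial) {
        this.current = initial;
    }

    // Generates a fresh random key; AES-128 is an assumption of this sketch.
    static SecretKey newKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            return generator.generateKey();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("AES is not available", e);
        }
    }

    // Installs a newly distributed key; the old current key remains
    // acceptable for one more rotation period, then is dropped.
    synchronized void rotate(SecretKey next) {
        previous = current;
        current = next;
    }

    // Key to use when issuing new TGCs.
    synchronized SecretKey current() {
        return current;
    }

    // Keys to try when decrypting an uncached TGC, newest first.
    synchronized List<SecretKey> candidates() {
        List<SecretKey> keys = new ArrayList<>();
        keys.add(current);
        if (previous != null) {
            keys.add(previous);
        }
        return keys;
    }
}
```

With an 8-hour rotation and an 8-hour TGC lifetime, any TGC that has not yet expired was encrypted under one of the two keys returned by candidates().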

From the client's point of view, the TGC and the CAS protocol remain the same. However, the implementation of the CAS server (in particular, the generation of TGC IDs) changes. When a new TGC is to be issued, its ID is computed as follows:

	TGC = Ek( { Texpiry, U, H( Texpiry, U, N ), N } )
	where
		Ek is encryption using the server's symmetric key
		Texpiry is the expiry timestamp
		U is the user principal this ticket identifies
		H is a secure hash - this is simply used as a checksum
		N is a one-off (securely) random nonce
		(other fields may be added as required)

The TGC is cached at the server as before.

On receipt of a TGC that requires confirmation, the CAS server first checks the TGC against its cache. If the TGC is not found, the candidate TGC is decrypted using each potential key (chosen from { current, previous } if the keys have a suitable lifetime as suggested above). Should the decrypted TGC prove to be valid, it is assumed to have originated at an alternative CAS server instance, and is inserted into the local cache.
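
The issue and confirm steps described above could be sketched as follows. Everything concrete here is an assumption rather than part of the scheme as stated: the class name TgcCodec, AES in GCM mode for Ek, SHA-256 for H, the field encoding, and a cache that maps the TGC straight to the user principal. GCM already authenticates the ciphertext, so the checksum H is strictly redundant in this sketch; it is kept only to mirror the formula. Cached entries are not re-checked for expiry here.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.security.*;
import java.util.*;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Hypothetical codec for self-describing TGCs; not part of the CAS server.
class TgcCodec {
    private static final SecureRandom RANDOM = new SecureRandom();
    private static final int IV_LEN = 12, NONCE_LEN = 16, HASH_LEN = 32;

    // H(Texpiry, U, N): SHA-256 over the three fields, used only as a checksum.
    static byte[] checksum(long expiry, String user, byte[] nonce) throws GeneralSecurityException {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        sha.update(Long.toString(expiry).getBytes(StandardCharsets.UTF_8));
        sha.update((byte) 0); // separator between fields
        sha.update(user.getBytes(StandardCharsets.UTF_8));
        sha.update((byte) 0);
        sha.update(nonce);
        return sha.digest();
    }

    // TGC = Ek({ Texpiry, U, H(Texpiry, U, N), N }), URL-safe Base64 encoded,
    // then cached locally as before.
    static String issue(SecretKey key, String user, long expiryMillis, Map<String, String> cache) {
        try {
            byte[] nonce = new byte[NONCE_LEN];
            RANDOM.nextBytes(nonce);
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeLong(expiryMillis);
            out.writeUTF(user);
            out.write(checksum(expiryMillis, user, nonce));
            out.write(nonce);
            out.flush();

            byte[] iv = new byte[IV_LEN];
            RANDOM.nextBytes(iv); // fresh IV per TGC, prepended to the ciphertext
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
            byte[] sealed = cipher.doFinal(buffer.toByteArray());

            byte[] token = new byte[IV_LEN + sealed.length];
            System.arraycopy(iv, 0, token, 0, IV_LEN);
            System.arraycopy(sealed, 0, token, IV_LEN, sealed.length);
            String tgc = Base64.getUrlEncoder().withoutPadding().encodeToString(token);
            cache.put(tgc, user);
            return tgc;
        } catch (GeneralSecurityException | IOException e) {
            throw new IllegalStateException("could not issue TGC", e);
        }
    }

    // Confirms a TGC: cache first, then each candidate key in turn.
    // Returns the user principal, or null if the TGC cannot be confirmed.
    static String confirm(String tgc, List<SecretKey> keys, Map<String, String> cache, long nowMillis) {
        String cached = cache.get(tgc);
        if (cached != null) {
            return cached;
        }
        for (SecretKey key : keys) {
            try {
                byte[] token = Base64.getUrlDecoder().decode(tgc);
                Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
                cipher.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, token, 0, IV_LEN));
                byte[] plain = cipher.doFinal(token, IV_LEN, token.length - IV_LEN);

                DataInputStream in = new DataInputStream(new ByteArrayInputStream(plain));
                long expiry = in.readLong();
                String user = in.readUTF();
                byte[] hash = new byte[HASH_LEN];
                in.readFully(hash);
                byte[] nonce = new byte[NONCE_LEN];
                in.readFully(nonce);

                if (expiry > nowMillis && MessageDigest.isEqual(hash, checksum(expiry, user, nonce))) {
                    cache.put(tgc, user); // assumed to have originated at another instance
                    return user;
                }
            } catch (GeneralSecurityException | IOException | RuntimeException e) {
                // Wrong key or malformed token: fall through and try the next key.
            }
        }
        return null;
    }
}
```

A TGC issued by one instance under the previous key is thus accepted by a second instance that holds { current, previous } and has an empty cache, while an expired TGC, or one encrypted under an unknown key, is rejected.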

The benefit of this approach is that the synchronisation between CAS instances is kept to a minimum; state is preserved in the face of a failover or server restart, en passant, during the process of confirming an uncached TGC.

The downsides are, firstly, that more invasive modification of the CAS server is required; secondly, that the implementation of the TGC cookification requires care to avoid introducing opportunities for attack; and thirdly, that an additional symmetric encryption or decryption is required whenever a TGC is either generated or checked for the first time at a particular non-origin CAS instance. The last issue may prove to be negligible: this strategy may well have less cost than a real-time state replication tactic.

NOTE

If you're looking at this page because the CAS wiki pointed you here, you should realise that the major problem with the syncookie approach is that state replication between CAS servers is still required, since all the one-shot cookies could otherwise be presented to each CAS server separately, permitting multiple authentications. So it's a nice idea, but it's a broken one, which is why there's no implementation here (and why this page hasn't changed in two years; I tend to doodle on the web).
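
A toy model makes the replay problem concrete. The class OneShotValidator below is hypothetical and stands in for a CAS instance that accepts any cryptographically valid one-shot ticket it has not already seen; the cryptographic check itself is assumed to succeed and is omitted.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of a CAS instance accepting self-validating one-shot tickets.
class OneShotValidator {
    // Tickets this instance knows to have been used already.
    private final Set<String> spent;

    OneShotValidator(Set<String> spent) {
        this.spent = spent;
    }

    // Convenience factory for a thread-safe "spent" set.
    static Set<String> newSpentSet() {
        return ConcurrentHashMap.newKeySet();
    }

    // Accepts a ticket only if it has not been recorded as spent.
    // Set.add returns false when the ticket was already present.
    boolean accept(String ticket) {
        return spent.add(ticket);
    }
}
```

Two validators with separate spent sets will each accept the same ticket once, so it authenticates twice; two validators sharing one set accept it only once. That shared set is exactly the replicated state the syncookie scheme was meant to avoid.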

Additionally, "load-balancing" of CAS servers seems to be more effort than it is worth: the load on a CAS server is pretty low. Having a seamless failover mechanism would be nice (which means state replication), but the best approach there is via explicit shared state, I think, such as the JavaGroups idea, rather than encrypted TGCs (and forced failure of one-shot tickets).