Running stateful applications on Azure?

Azure architectures are stateless by nature (or at least they should be, if they are to be truly scalable). Stateless means the application code does not rely on resources (such as memory or files on disk) of the machines running the application. From a code perspective, developers need to avoid ASP.NET session or application variables and should not write to the local file system.

The reason for this approach is simple. On Azure your application runs on multiple instances. These instances are all virtual machines running their own web server. All of these machines listen to requests coming in through a load balancer. The load balancer is always there to distribute the load equally over the available instances. This principle leads to ultimate scalability, one of the most important reasons why we choose Azure. The load balancer is there 'as a service', as the main asset of Azure PaaS.

The result is that a sequence of requests (for example, clicks on a button that reloads a page with new content) can be executed by different machines. Each new request can end up on a different machine than the previous one. This can happen even within a single browser session of a single user. The issue comes up when developers want to store data between requests. Let's say you have a wizard-style user interface where you keep the data for each step in memory and only want to persist it at the end of the steps. Developers will typically do this with ASP.NET session variables. Because (by default) these variables are only available on the machine executing the code, the data can end up scattered over multiple machines, with no single machine having access to all of it.
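To make the problem concrete, here is a minimal simulation in Python (not ASP.NET, just a sketch). The `Instance` class, the session id "user-42" and the strict round-robin balancer are assumptions made for illustration; a real load balancer may distribute requests differently.

```python
import itertools


class Instance:
    """One web server instance with its own in-process session memory."""

    def __init__(self, name):
        self.name = name
        self.sessions = {}  # session id -> data, visible to this instance only

    def handle(self, session_id, key, value):
        # Store the wizard step in the local (per-machine) session memory.
        self.sessions.setdefault(session_id, {})[key] = value


def run_wizard(instance_count, steps):
    """Send each wizard step through a simplified round-robin balancer."""
    instances = [Instance(f"instance-{i}") for i in range(instance_count)]
    balancer = itertools.cycle(instances)
    for key, value in steps:
        next(balancer).handle("user-42", key, value)
    return instances


instances = run_wizard(
    3, [("step1", "name"), ("step2", "address"), ("step3", "payment")]
)
for inst in instances:
    print(inst.name, inst.sessions.get("user-42", {}))
```

In this run every instance holds exactly one of the three wizard steps, so whichever machine handles the final "save" request cannot see the complete data.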

So the general advice is not to use ASP.NET session variables. Software architects should do a code review to check whether this advice is followed. That's fine for new 'greenfield' applications, where development starts after the decision to host the application on Azure. With this knowledge architects can write guidance to do this the right way from the start.

But what about migrating existing applications to Azure? What if ASP.NET session variables are used? I must say that, in my experience consulting on Azure migrations, more than 50% of the applications use ASP.NET sessions. Some use them extensively to store loads of functional data between page requests, others use them simply to store a user ID or some part of the user profile (like a language setting).

The use of ASP.NET session is not wrong in itself and works perfectly on an on-premises server where there is no load balancer. With traditional servers, where we tend to scale up (meaning adding memory and CPU power to the same machine to get more performance), this is not an issue at all. Note, however, that the traditional scale-up approach is less scalable, as there are always limits to the capacity of a single machine. With Azure PaaS applications we follow the scale-out principle, where machines are virtual and relatively small (single CPUs) but we add new instances (as identical copies) to get more performance. So with scale-out there is no limit (unless you reach the limits of the datacenter). I've seen deployments with a couple of hundred instances, all running in parallel behind the load balancer.

How can we deploy these (sometimes called 'legacy') applications to Azure PaaS if they rely on ASP.NET session variables? Deploying as such is not the concern here: they will deploy without problems, but the behavior will of course be broken.

There are options to work with ASP.NET session variables:

- Use the Azure caching technology to build up shared memory. Azure Cache supports dedicated machines that are used for memory only and share this memory with other machines. With a proper configuration all other machines see this as a kind of distributed memory available to the running application. We can configure the ASP.NET session engine to offload the session variables into the cache. The development work for this is low, as only configuration is needed. The code can still use the session variables directly. It's the configuration that tells the system to store this data into, and retrieve it from, the dedicated cache living outside the machine. Mind that a small latency is a potential drawback of this solution. In addition, the total operational cost will be higher as more machines are needed. There is also the option to share the available memory of the machines running your code, but even then you would lose some performance and potentially incur higher costs.

- Store the session data in some form of client-based property bag like cookies, hidden form fields or ASP.NET ViewState.
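The first option can be sketched as a small Python simulation. `SharedSessionStore` is a hypothetical stand-in for the distributed cache (it is not the Azure Cache API), and the round-robin balancer is again a simplification. The point is only that the handling code stays the same while the data lives outside the individual machines.

```python
import itertools


class SharedSessionStore:
    """Hypothetical stand-in for a distributed cache reachable by all instances."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        # Return the session dictionary, creating it on first use.
        return self._data.setdefault(session_id, {})


class Instance:
    """A web server instance that keeps no session data of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle(self, session_id, key, value):
        session = self.store.get(session_id)
        session[key] = value
        return dict(session)  # what this instance can see after the request


store = SharedSessionStore()
instances = [Instance(f"instance-{i}", store) for i in range(3)]
balancer = itertools.cycle(instances)  # simplified round-robin

seen = {}
for key, value in [("step1", "name"), ("step2", "address"), ("step3", "payment")]:
    seen = next(balancer).handle("user-42", key, value)

print(seen)  # the last instance sees all three wizard steps
```

Every request still lands on a different instance, but each one reads and writes the same store. In a real deployment each of those reads and writes is a network call, which is where the latency mentioned above comes from.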

Ways to avoid the load balancer?

If we could avoid the load balancer and have all requests from a browser session sent to one and the same machine, the memory could be used without the caching configuration. Of course, when the load balancer is not distributing the requests we lose, to some extent, the scalability services of Azure PaaS. How do we achieve this setup? Since SDK 1.7 the web role configuration schema supports the InstanceInputEndpoint element.

See: http://msdn.microsoft.com/en-us/library/windowsazure/gg557553.aspx#InstanceInputEndpoint

This element creates an endpoint for each specific instance by providing a range of public ports. The IP address (and hence the *.cloudapp.net address) remains the same. The Azure load balancer simply forwards requests made to that IP address on one of these ports to the corresponding instance. Example: defining a range of 1000 to 1004 would result in sending a request for *.cloudapp.net:1000 to machine instance 0, *.cloudapp.net:1001 to machine instance 1, *.cloudapp.net:1002 to machine instance 2 and so on.
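The port-to-instance mapping from the example can be written down in a few lines of Python. The function names and the host "myapp.cloudapp.net" are made up for illustration, and the sketch assumes, as in the example above, that the lowest port of the range belongs to instance 0.

```python
def instance_port(min_port, max_port, instance_index):
    """Return the public port that addresses the given instance."""
    port = min_port + instance_index
    if instance_index < 0 or port > max_port:
        raise ValueError(
            f"instance {instance_index} is outside the range {min_port}-{max_port}"
        )
    return port


def instance_url(host, min_port, max_port, instance_index):
    """Build the URL that reaches one specific instance directly."""
    return f"http://{host}:{instance_port(min_port, max_port, instance_index)}/"


# Range 1000 to 1004, as in the example above.
for index in range(5):
    print(index, instance_url("myapp.cloudapp.net", 1000, 1004, index))
```

Asking for an instance beyond the range (index 5 here) raises an error instead of silently producing a port that nothing listens on.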

Great, now we can address a single machine without the load balancer deciding which one.

But there are some caveats that need to be addressed.

- Of course you can't expect your end users to add the port number to the URL themselves.

- The length of the range for the InstanceInputEndpoint should be equal to the number of machines. Otherwise you could end up with either a port mapped to a machine that is not running or a machine that is not reachable through any port. When adding or removing instances we should also update this port range.

- As the requests are not load balanced, we should be careful about which port we use. If all clients access the application through the same port, one machine would do all the work (with potential performance problems) while the others do nothing (and we still pay for them).
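The second caveat is easy to check mechanically. Below is a minimal Python sketch; the function name is hypothetical, and it assumes ports are assigned to instances in order, starting at the lowest port.

```python
def compare_range_to_instances(min_port, max_port, instance_count):
    """Report mismatches between the port range and the running instances.

    Returns a tuple of two lists:
    - ports in the range that have no instance behind them
    - instance indexes that cannot be reached through any port
    """
    port_count = max_port - min_port + 1
    ports_without_instance = list(range(min_port + instance_count, max_port + 1))
    instances_without_port = list(range(port_count, instance_count))
    return ports_without_instance, instances_without_port


print(compare_range_to_instances(1000, 1004, 5))  # range and instances match
print(compare_range_to_instances(1000, 1004, 3))  # two ports lead nowhere
print(compare_range_to_instances(1000, 1004, 7))  # two instances unreachable
```

A check like this could run whenever the instance count changes, so that a mismatch is noticed before users are redirected to a dead port.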

What do we need to get this architected in a proper way?

1. A single default page that still runs on the known *.cloudapp.net URL the user will browse to. This page tells the browser, through a redirect (with some JavaScript), to go to the URL with the added port specification, so from then on the communication is sent to a single machine. A good candidate for this page is the login page, where the redirect is sent after successful authentication.

2. An algorithm to decide which machine to use for a user session. As said, this needs to be done carefully so that all machines are used equally. Possible algorithms are:

- Based on the username? Imagine having 26 machines and distributing the load based on the first letter of the username. The issue with this algorithm is that the distribution will not be equal, as more usernames will start with the letter M, for example, than with the letter Q. We really need an algorithm that distributes equally.

- Just random, pick a machine randomly. Why not?

- Sequential, one after the other and back to the first one. This is how the Azure load balancer does it. This solution needs some external counter to increment. You could place this in a file on Azure storage.

- Based on performance. Why don't we look at the performance of each machine before deciding which one to pick? We would need a kind of monitor to check the CPU utilization of each machine. We could make use of the Azure built-in performance metrics.

3. Some code in the application to find out the range you defined in the web role configuration. After all, the landing page needs to instruct the browser to redirect to one of the available ports.

4. A way to make sure the range of ports for the InstanceInputEndpoint is the same size as the number of machines running. Mind that machines can go down unexpectedly. Or, to put it differently: if we add or remove instances, the decision algorithm needs to know this.
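The selection algorithms from step 2 and the redirect from step 1 can be sketched in Python as follows. All names are hypothetical. Hashing the full username is my own variation on the first-letter idea, chosen because it tends to spread users more evenly. The in-memory counter only stands in for the external counter mentioned above, and the CPU figures are made-up inputs rather than real Azure metrics.

```python
import hashlib
import itertools
import random


def pick_by_hash(username, instance_count):
    """Map a username to an instance using a hash of the whole name."""
    digest = hashlib.sha256(username.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % instance_count


def pick_random(instance_count, rng=random):
    """Pick any instance with equal probability."""
    return rng.randrange(instance_count)


class RoundRobinPicker:
    """Sequential picking; a real setup needs a counter shared by all instances."""

    def __init__(self, instance_count):
        self._instance_count = instance_count
        self._counter = itertools.count()

    def pick(self):
        return next(self._counter) % self._instance_count


def pick_least_loaded(cpu_by_instance):
    """Pick the instance index with the lowest reported CPU utilization."""
    return min(cpu_by_instance, key=cpu_by_instance.get)


def redirect_target(host, min_port, instance_index):
    """URL the landing page would redirect the browser to."""
    return f"http://{host}:{min_port + instance_index}/"


picker = RoundRobinPicker(3)
print([picker.pick() for _ in range(4)])                # [0, 1, 2, 0]
print(pick_least_loaded({0: 80.0, 1: 15.5, 2: 40.0}))   # 1
print(redirect_target("myapp.cloudapp.net", 1000, pick_by_hash("alice", 5)))
```

Whichever strategy is used, the instance count passed in has to match the machines that are actually running, which is exactly the concern of step 4.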

OK, we now know what to do, we know the why and the value of the solution, we know what the impact is and what the attention points are, and we have some work defined.

I'll build a proof-of-concept for this, that's my job, that's what you expect from my services. When ready I’ll post this as a reference architecture.

Off to coding now...