 Thursday, March 19, 2009

As the press has been discussing over the past few days, Windows Azure had an outage over the weekend. I provided some comments to the press and was quoted here: http://www.crn.com/software/215900816

Only a few of my remarks made it into the article, so I thought I would provide a complete picture of my comments here. My message to the press and to the community is that I don't think Microsoft should be judged harshly on this incident. But because they are Microsoft and people are watching, they probably need to kick it up a notch and act like they are already a production service, to inspire trust. It is that simple. They didn't do anything wrong, but people will take notice, and therefore it is important. What is it they say? Dress for the job you want, not the job you have.

Here is my summary of the situation:

One question people might have concerns service levels and uptime guarantees. Right now, the Windows Azure platform is a CTP, not even a Beta product yet, so I don’t measure them against Salesforce.com or Amazon. I don't expect them to provide me with any service levels, since there is not yet a service level agreement in place for services rendered. I do expect them to try not to have outages, and I do expect them to respond to outages quickly, and I think they are doing this. Last weekend's problem arose from a routine update; recovery took longer than expected, but they acted immediately.

I think the only mistake Microsoft made was not notifying Azure account holders in advance that an update would be performed, in case something failed. This is an easy thing to do, and it would inspire confidence and trust.

Now, let’s put this into perspective. On the one hand, Microsoft is doing the community a great service by allowing access to Azure during this CTP phase of development. This also helps Microsoft immensely, as the community will provide feedback on the platform that will influence the features and functionality of the final release. This is a really good thing. On the other hand, because they are live, CTP or not, people will judge Microsoft now. I am hosting my own test projects with Azure, and I don’t like it when they are down, but I am a realist – I am using a CTP. I have no doubts about Microsoft’s ability to host a 24/7 operation and meet acceptable service levels when Azure is a production service. However, not everyone will see it this way.

This is a new space for Microsoft – hosting applications and services – and the community is watching closely. They may not be ready to produce an SLA, since this requires significant legal review as well, but they can at least give prior notice of planned downtime and its expected duration. If unexpected downtime occurs, they can immediately inform the community via emails to Azure account holders. In a word – “communicate”.

I expect that Microsoft will in future avoid downtime altogether by performing rolling updates of the system, so that our hosted applications are not affected as machines are updated. My understanding is that their data centers are being set up to handle this when Azure goes live. In the meantime, one recommendation Microsoft provides is that we deploy multiple instances of each role (Web Role, Worker Role) so that there is redundancy and less risk of impact if there is another incident.
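
To illustrate why multiple instances matter during a rolling update, here is a minimal Python sketch. It is a toy model under my own assumptions, not Azure's actual update mechanism: the function name `rolling_update` and the instance dictionaries are hypothetical. It updates one instance at a time and reports the fewest instances that were serving at any point.

```python
def rolling_update(instance_count: int, new_version: str) -> int:
    """Update instances one at a time.

    Returns the minimum number of instances still serving at any step.
    Hypothetical model: each instance is a dict with a version and a
    serving flag; this does not call any real Azure API.
    """
    instances = [{"version": "v1", "serving": True} for _ in range(instance_count)]
    min_serving = instance_count

    for inst in instances:
        inst["serving"] = False  # take one instance out of rotation
        serving_now = sum(1 for i in instances if i["serving"])
        min_serving = min(min_serving, serving_now)
        inst["version"] = new_version  # apply the update
        inst["serving"] = True  # return it to rotation

    # Every instance should be on the new version once the loop finishes.
    assert all(i["version"] == new_version for i in instances)
    return min_serving


if __name__ == "__main__":
    for count in (1, 2, 3):
        print(count, "instance(s): minimum serving during update =",
              rolling_update(count, "v2"))
```

In this model a single instance drops to zero serving instances while it is updated, whereas two or more instances always leave at least one available, which is the reasoning behind the multiple-instance recommendation.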

Microsoft will certainly be incorporating feedback from this incident and ongoing Azure CTP usage into their production release. To the community I say “it is a CTP, don’t worry, just give feedback on what you would have expected”. To Microsoft I say “try to put production-worthy notifications and rolling updates in place now, to avoid negative opinions formed on the basis of the CTP”.

3/19/2009 4:38 AM