Mar 25

Is performance an issue with SQL Server on Azure?

pasturecloudsI inherited the development of a web app that was meant to run on Azure, using SQLAzure as the data store. Right from my first contact with the app, it was clear that it had performance issues. When running from a local SQL Server performance could be deemed as endurable, but it was totally unacceptable on Azure, where it took 3 times longer.  A performance analysis pointed out database access as the main factor, but there were other things that could be improved upon as well.

The identification of the performance issues coincided with some changes in requirements, that led to the decision to take a totally different approach, regarding the way data the app needed was to be handled. This new approach was motivated mainly by the need to support the new requirements, but it was also meant to see if it could help with the performance issues. A model to represent the data was developed and implemented on a SQL Server database. This model was tested and improved upon, up to a point where performance, in a worst case scenario, was deemed acceptable, when implementing it on a local database server. The model requires some calculations to be performed, and this is done through a stored procedure. Each of the stored procedure queries was optimized for performance, considering the overall goal of minimizing the time needed to execute the stored procedure and not the individual queries themselves. This involved, for example, leaving out some indexes that could improve individual queries, but contributed to the degradation of the overall performance. As stated before, we arrived at a point where performance in any of our local SQL Server 2012 databases was deemed good enough.

Having solved the performance issue with local instances of SQL Server 2012, we set to see how SQL Azure, the intended data store, handled the new way to process the data. We rebuilt the entire database, populated it with the exact same data we have in our local testing scenario and tested the new stored procedure. Performance was much worse – the time needed to complete the stored procedure in the worst case scenario was almost 3 times the time it took to execute it in any of our local servers. Let me be clear here – we are not even talking about time needed to retrieve data from the database – it’s just the time needed to process records inside the database and storing the results in a database table.

Trying to determine if SQL Azure was to blame for this, we decided to compare the performance in different hosting scenarios: Azure SQL VM, a standard hosting offering running SQL Server 2008 Web edition and Amazon RDS. The testing started with Azure SQL VM. Using the Azure Virtual Machine preview, we created several instances of a VM running SQL Server 2012 SP1 Evaluation edition, on Windows Server 2008 R2 SP1. To rule out the virtual machine size as a factor, we had a VM of each size – small, medium, large and extra large (an extra large machine offers 8 computing cores and 14 GB of memory). On all these VMs, used with their pre-configured settings, performance was pretty consistent and didn’t really change much, when compared with SQL Azure. The execution time for the stored procedure was very similar in all VM sizes – and too high in all of them.

We then tried the traditional hosting solution, a Windows 2K3 server running SQL Server 2008 Web edition on two Xeon processors with 4 GB of RAM. Surprisingly, or maybe not, performance was pretty similar to the one in SQL Azure and Azure SQL VM. Almost believing that somehow, hosted solutions of SQL Server were not up to the task, we decided to try Amazon RDS. We had never tried Amazon’s cloud solution before, so we had to check the options offered and create a new account. There are multiple options regarding VM features and size and we decided to test a Large DB Instance – a VM running SQL Server 2008 R2 with 7.5 GB of memory and 4 ECUs (2 virtual cores with 2 ECUs each – 1 ECU, according to Amazon Web Services is equivalent to a 1.0-1.2 GHz 2007 Opteron or Xeon processor). Setting up the VM was as easy as on Azure and a few minutes afterwards, I was creating the database and uploading the test data. Once this was completed, the test stored procedure was executed multiple times … and execution times were less than half, on average, than those from Azure and almost as good as with our local SQL Server instances.

All this testing was both reassuring and worrying. For one, it’s clear that there are cloud based offerings that can provide performance that is pretty similar to what can be obtained from in house SQL Server solutions. For our case, however, it seems that Azure is not one of those. We still need to decide what we will do, but the cost / performance factor for Azure based SQL Server solutions is not looking too good. Right now SQL Azure is the only non preview offer and its cost is rather high, for databases larger than 100 MB. Azure SQL VM won’t be a reality until this summer, but while it may provide a more cost effective solution, it’s not clear that it can be a competitive offer, performance wise. Of course, we are considering a rather specific use case, but the overall performance before we changed our model was not that good either and this experience, while not definitive in anyway, does seem to raise the question – are Azure based SQL Server solutions good enough, performance wise? The answer is not clear right now, but it is a bit unsettling to find a competitor’s cloud offerings a better choice than Microsoft’s, on a Microsoft product as relevant as SQL Server.

Mar 22

Set thinking

It’s quite frequent to read about the efficiency of relational database systems when executing set based operations. Relational DBMSs are, indeed, efficient when executing set based operations especially when compared to cursor based operations. Of course, the debate can become quite heated, leading some people to excommunicate cursor based operations as the devil incarnate of inefficiency. As in everything, such views regarding cursors are extreme, as there are situations where cursors may be a valid choice. While the focus is usually on avoiding cursors, there are other ways in which we can take advantage of the efficiency of set operations. A situation I faced recently involved a stored procedure that was being executed multiple times, with a single changing value each time. The situation is illustrated by the stored procedure shown below (Initial Stored Proc). The stored proc performs a few queries similar to one included.
CREATE PROC myProc(@year int, @month int)
As

-- Temporary table to hold just the relevant cost centers

CREATE TABLE #CC (
	CCID int,
	 CCName varchar(1000)
	);

-- Add the cost centers to the temp table
INSERT INTO #CC
SELECT DISTINCT C.Id, C.Name
	FROM AccountsInCostCenters AI INNER JOIN CostCenters C ON AI.CostCenterId=C.Id
	 WHERE AI.Year=@year or AI.Year=@year-1

-- calculate the results and add them to the results table

INSERT INTO DemoViewFinalResults(GroupID,AccountLevel,RowType,Value,[Year],[Month],(...),CostCenterID,CostCenterName)
SELECT D.GroupID, D.GroupOrder,D.RowOrder,D.AccountID,A.[Description],D.AccountSign, D.AccountLevel,4,
(SELECT SUM(F2.Value) FROM DemoViewFinalResults F2 
						WHERE F2.AccountID IN (SELECT C.AccountID FROM CalcRowAssociatedAccountID C WHERE C.RefRowID=D.RowID)
						AND F2.GroupID = D.GroupID AND F2.Month=@month and F2.Year=@year AND F2.CostCenterID=C.CCID) As Value,
		@year,@month,NULL,D.OriginalRow, C.CCID,C.CCName
FROM DemoViewAccountsAssociatedRowModel D INNER JOIN Accounts A ON A.Id = D.AccountId, #CC C
WHERE D.RowType=2 AND D.IncludeAccount=1;

(...)
The stored procedure was being executed multiples times, one for each month needed, with a simple loop, as illustrated below.
DECLARE @month int
DECLARE @year int

SET @month = 1

WHILE @month < 8
BEGIN
	exec myProc @year,@month
END
Some investment had been done making sure each query running in the stored procedure was optimized. The queries are used to implement a datasheet recursive model, so it was rather hard to avoid correlated subqueries and the individual query performance was good enough. The problem was that, in some cases, we needed to execute the stored procedure for multiple months and the overall execution time for was unsatisfactory, so we need to do much better in such situations. The solution for the performance issue, was of course, in taking advantage of set operations. Instead of looping for each month, we could just execute the code once, if we had a way to have all the required year / month pairs in a table, just as we had done for the cost centers (you can, in line #30 see that the temporary table #CC is cross-joined into the query). The approach used in the final stored procedure is shown below. Lines #23-34 show the new temporary table and the code used to add all the needed records, which are the ones resulting from a cross-join between the “previous version” table and a new temporary table containing all the month / year pairs. 
CREATE PROC myProcForMultipleMonths(@year int, @startMonth int)
As

CREATE TABLE #Months_Years (
	[Month] int,
	[Year] int
	);

DECLARE @tMonth int,
		@tYear int

SET @tMonth = 1

WHILE @tMonth <= @startMonth 
BEGIN
	INSERT INTO #Months_Years
		SELECT @tMonth,@year

	SET @tMonth = @tMonth + 1
END

CREATE TABLE #CC (
	 CCID int,
	 CCName varchar(1000),
	 [Month] int,
	 [Year] int
	 );

INSERT INTO #CC
SELECT DISTINCT C.Id, C.Name, MY.[Month],MY.[Year]
	FROM AccountsInCostCenters AI INNER JOIN CostCenters C ON AI.CostCenterId=C.Id, #Months_Years MY
	 WHERE AI.Year=@year or AI.Year=@year-1

INSERT INTO DemoViewFinalResults 
	(GroupID,GroupOrder,RowOrder,AccountID,AccountDescription,AccountSign,AccountLevel,RowType,Value,[Year],[Month],Reference,OriginalRow,CostCenterID,CostCenterName)
	SELECT D.GroupID, D.GroupOrder,D.RowOrder,D.AccountID,A.[Description],D.AccountSign, D.AccountLevel,4,
	(SELECT SUM(IsNULL(F2.Value,0)) FROM DemoViewFinalResults F2 
						  WHERE F2.AccountID IN (SELECT C.AccountID FROM CalcRowAssociatedAccountID C WHERE C.RefRowID=D.RowID)
						  AND F2.GroupID = D.GroupID AND F2.Month=C.month and F2.Year=C.year AND F2.CostCenterID=C.CCID) As Value,
		   C.year,C.month,NULL,D.OriginalRow, C.CCID,C.CCName
	FROM DemoViewAccountsAssociatedRowModel D INNER JOIN Accounts A ON A.Id = D.AccountId, #CC C
	WHERE D.RowType=2 AND D.IncludeAccount=1;

(...)
This approach resulted in a very obvious performance improvement, since execution time was reduced to less than a third of the time needed to execute the first version of the procedure for the same number of months. The execution time is obviously reduced because operations that were being performed as many times as there were months, now are done just once, regardless of the number of months. The join including the year and month ensures that. This major improvement was essential to validate this approach as a feasible solution to to initial performance problem we set to solve. This overall result showed, yet again, the advantages of adopting a set based “thinking” and it made clear that we can avoid cursors and still fall into a similar performance trap.  The solution here  came from a temporary table, used in a cross-join. A simple solution with a remarkable impact on performance.