Mar 22

Set thinking

It’s quite common to read about the efficiency of relational database systems when executing set-based operations. Relational DBMSs are, indeed, efficient when executing set-based operations, especially when compared to cursor-based operations. Of course, the debate can become quite heated, leading some people to excommunicate cursor-based operations as the devil incarnate of inefficiency. As with everything, such views regarding cursors are extreme, as there are situations where cursors may be a valid choice. While the focus is usually on avoiding cursors, there are other ways in which we can take advantage of the efficiency of set operations. A situation I faced recently involved a stored procedure that was being executed multiple times, with a single value changing each time. The situation is illustrated by the stored procedure shown below (Initial Stored Proc), which performs a few queries similar to the one included.
CREATE PROC myProc(@year int, @month int)
AS

-- Temporary table to hold just the relevant cost centers
CREATE TABLE #CC (
	CCID int,
	CCName varchar(1000)
	);

-- Add the cost centers to the temp table
INSERT INTO #CC
SELECT DISTINCT C.Id, C.Name
	FROM AccountsInCostCenters AI INNER JOIN CostCenters C ON AI.CostCenterId=C.Id
	WHERE AI.Year=@year OR AI.Year=@year-1

-- Calculate the results and add them to the results table
INSERT INTO DemoViewFinalResults
	(GroupID,GroupOrder,RowOrder,AccountID,AccountDescription,AccountSign,AccountLevel,RowType,Value,[Year],[Month],Reference,OriginalRow,CostCenterID,CostCenterName)
SELECT D.GroupID, D.GroupOrder, D.RowOrder, D.AccountID, A.[Description], D.AccountSign, D.AccountLevel, 4,
	(SELECT SUM(F2.Value) FROM DemoViewFinalResults F2
						WHERE F2.AccountID IN (SELECT C2.AccountID FROM CalcRowAssociatedAccountID C2 WHERE C2.RefRowID=D.RowID)
						AND F2.GroupID = D.GroupID AND F2.[Month]=@month AND F2.[Year]=@year AND F2.CostCenterID=C.CCID) AS Value,
	@year, @month, NULL, D.OriginalRow, C.CCID, C.CCName
FROM DemoViewAccountsAssociatedRowModel D INNER JOIN Accounts A ON A.Id = D.AccountId
	CROSS JOIN #CC C
WHERE D.RowType=2 AND D.IncludeAccount=1;

(...)
The stored procedure was being executed multiple times, once for each month needed, using a simple loop, as illustrated below.
DECLARE @month int
DECLARE @year int

SET @year = 2016 -- example year, for illustration
SET @month = 1

WHILE @month < 8
BEGIN
	EXEC myProc @year, @month
	SET @month = @month + 1
END
Some effort had already been invested in making sure each query in the stored procedure was optimized. The queries implement a datasheet recursive model, so it was rather hard to avoid correlated subqueries, and the individual query performance was good enough. The problem was that, in some cases, we needed to execute the stored procedure for multiple months, and the overall execution time was unsatisfactory, so we needed to do much better in those situations.

The solution to the performance issue was, of course, to take advantage of set operations. Instead of looping over the months, we could execute the code just once, if we had a way to get all the required year / month pairs into a table, just as we had done for the cost centers (note how the temporary table #CC is cross-joined into the final query above). The approach used in the final stored procedure is shown below. The new #Months_Years temporary table, and the loop used to populate it, appear at the top of the procedure; the records added to #CC are now the ones resulting from a cross-join between the “previous version” table and the new temporary table containing all the month / year pairs.
CREATE PROC myProcForMultipleMonths(@year int, @startMonth int)
AS

-- Temporary table to hold all the needed month / year pairs
CREATE TABLE #Months_Years (
	[Month] int,
	[Year] int
	);

DECLARE @tMonth int

SET @tMonth = 1

-- Add one row per month, from 1 up to @startMonth
WHILE @tMonth <= @startMonth
BEGIN
	INSERT INTO #Months_Years
		SELECT @tMonth, @year

	SET @tMonth = @tMonth + 1
END

-- Cost centers, now cross-joined with the month / year pairs
CREATE TABLE #CC (
	CCID int,
	CCName varchar(1000),
	[Month] int,
	[Year] int
	);

INSERT INTO #CC
SELECT DISTINCT C.Id, C.Name, MY.[Month], MY.[Year]
	FROM AccountsInCostCenters AI INNER JOIN CostCenters C ON AI.CostCenterId=C.Id
		CROSS JOIN #Months_Years MY
	WHERE AI.Year=@year OR AI.Year=@year-1

-- Calculate the results for every cost center / month pair in one pass
INSERT INTO DemoViewFinalResults
	(GroupID,GroupOrder,RowOrder,AccountID,AccountDescription,AccountSign,AccountLevel,RowType,Value,[Year],[Month],Reference,OriginalRow,CostCenterID,CostCenterName)
	SELECT D.GroupID, D.GroupOrder, D.RowOrder, D.AccountID, A.[Description], D.AccountSign, D.AccountLevel, 4,
	(SELECT SUM(ISNULL(F2.Value,0)) FROM DemoViewFinalResults F2
						  WHERE F2.AccountID IN (SELECT C2.AccountID FROM CalcRowAssociatedAccountID C2 WHERE C2.RefRowID=D.RowID)
						  AND F2.GroupID = D.GroupID AND F2.[Month]=C.[Month] AND F2.[Year]=C.[Year] AND F2.CostCenterID=C.CCID) AS Value,
		   C.[Year], C.[Month], NULL, D.OriginalRow, C.CCID, C.CCName
	FROM DemoViewAccountsAssociatedRowModel D INNER JOIN Accounts A ON A.Id = D.AccountId
		CROSS JOIN #CC C
	WHERE D.RowType=2 AND D.IncludeAccount=1;

(...)
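With this version in place, the month-by-month loop shown earlier collapses into a single call (reusing the example year from above):

DECLARE @year int
SET @year = 2016 -- example year, for illustration
EXEC myProcForMultipleMonths @year, 7

As a side note, and in keeping with the set-based theme, the WHILE loop that populates #Months_Years could itself be replaced by a single set-based insert. A minimal sketch, using a values list as a numbers source:

INSERT INTO #Months_Years ([Month], [Year])
SELECT N.n, @year
	FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12)) AS N(n)
	WHERE N.n <= @startMonth;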
This approach resulted in a very obvious performance improvement: execution time was reduced to less than a third of the time needed to execute the first version of the procedure for the same number of months. The execution time drops because operations that were being performed as many times as there were months are now done just once, regardless of the number of months; the join including the year and month ensures that. This major improvement was essential to validate this approach as a feasible solution to the initial performance problem we set out to solve. The overall result showed, yet again, the advantages of adopting set-based “thinking”, and it made clear that we can avoid cursors and still fall into a similar performance trap. The solution here came from a temporary table, used in a cross-join. A simple solution with a remarkable impact on performance.
Mar 17

Always chasing the next (software) utopia

 

The techniques, methodologies and tools at the disposal of those who build software are always improving. New tools are made available, and new approaches are presented, and often promoted, as the next best thing. There is nothing wrong with new techniques, methodologies or tools, quite the contrary. We all benefit from such developments, if and when we can properly determine when to use them, weighing their advantages and disadvantages.

Most software systems need some kind of data store. Traditionally, this data store was a relational database, but the picture has changed a bit in recent years, with the rise of NoSQL databases. NoSQL databases gained rapid acceptance, especially in high-scalability scenarios, but they also seem a very interesting option when addressing another common problem faced by software developers – the object-relational impedance mismatch.

Accessing the data store, especially a relational data store, has always been a source of problems for developers. Before NoSQL became an option, Object Relational Mappers (ORMs), such as NHibernate or Entity Framework, were offered as possible solutions to the impedance mismatch problem, while offering other advantages. ORMs are still widely used, their use will surely keep rising, and some now support NoSQL databases as well. The overall goal of easing the burden on software developers when accessing data stores remains a valid one, of course.

Some time ago, ORMs started offering Code First Design. While before the ORM would deal with an existing database, Code First Design offered the possibility of leaving the details of choosing and implementing the database structure to the ORM. In theory, this was just another step to set developers free from the tyranny of the storage layer. The problem is that, when using relational databases, the mapping is basically a generic one, not taking into account the specifics of the relational database management system: the indexes required for the best possible performance are not created, and the data types chosen to implement some table fields are often poor choices. It’s common to find situations where basically all string properties are implemented as nvarchar(max) fields in a SQL Server database, which means these fields will be slightly more costly to access than length-limited nvarchar or varchar fields. The fact is that each relational database has specific characteristics that make generic design by a tool less adequate than a design that takes those characteristics into account. This starts with the database design (in Code First strategies) and continues with the usual issues associated with ORM tools when querying the database – that is, the non-optimal code generated for the queries.
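To make the nvarchar(max) point concrete, here is a minimal sketch of the kind of DDL a generic mapping tends to produce, versus a hand-tuned equivalent (the Customers tables and their columns are made up for illustration):

-- What a generic Code First mapping typically produces for string properties
CREATE TABLE Customers (
	Id int IDENTITY PRIMARY KEY,
	Name nvarchar(max) NULL,
	Email nvarchar(max) NULL
	);

-- nvarchar(max) columns cannot be index key columns, so this is not even possible:
-- CREATE INDEX IX_Customers_Email ON Customers(Email);

-- A hand-tuned design, with explicit lengths, which can be indexed
CREATE TABLE TunedCustomers (
	Id int IDENTITY PRIMARY KEY,
	Name nvarchar(200) NOT NULL,
	Email varchar(254) NOT NULL
	);

CREATE INDEX IX_TunedCustomers_Email ON TunedCustomers(Email);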

Of course, we can always argue that ORMs and Code First Design, like any other tool or strategy, need to be applied in circumstances where their disadvantages are less noticeable and their advantages recommend them. This is undoubtedly true. However, if we look at beginners’ books on topics such as .Net MVC, or at the tutorial material offered by Microsoft on its web development technologies, the examples are almost always given using Code First approaches. I think this conveys quite the wrong idea to programmers starting out on such technologies, and sometimes the consequences can be serious. I have seen it more than once, in real, production systems, and those consequences are not always easy to overcome, at least not without considerable effort.

If there were no performance penalties from Code First Design, or if we could establish some degree of control over the generated code, I confess I would be inclined to take advantage of keeping the data store completely transparent to the code. The reality is quite different, though. I understand all technologies need to start, evolve and mature, but I would also like to see less promotion of immature technologies, at least without a clear indication of their caveats. ORMs will improve, and so will Code First implementations. At this time, however, the scope of applications where either is adequate is more limited than we would be led to think, in my opinion.

Most systems have to be designed, implemented and deployed in scenarios with certain performance requirements and limited computational and financial resources. This is quite common in web applications, where I would argue most apps are not cloud scale and need not be. The systems we develop need to work within those limitations, and that can be somewhat incompatible with our utopias. Naturally, we always need utopias to chase, and they can be a driving force for evolution, but it should never be forgotten that we apply technologies to real systems, and the negative consequences of our own utopias should, preferably, be avoided.
