Joining to the Next Sequential RowBy Paul Alcon on 2 April 2008 | Tags: Queries , SELECT One of the more obscure requirements that a developer may find themselves facing is the need to compare a row with its immediate sibling. One such case is when a list of values needs to be processed to produce a moving average or to smooth a sequence of statistical numbers where their order is important. For example, values lying along a time line. The solution is actually quite simple, but not immediately obvious. ProblemTake a sequence of 8 integers 10, 25, 5, 15, 30, 9, 22, 29 in a single table thus: id stat period --- ------- -------- 1 10 1/1/2008 2 25 2/1/2008 3 5 3/1/2008 4 15 4/1/2008 5 30 5/1/2008 6 9 6/1/2008 7 22 7/1/2008 8 29 8/1/2008 We need to calculate the difference between each statistic and the next, then calculate the mean value of the 'gaps.' It is important that the figures are compared in the right order to get a correct result. Firstly lets create a table and fill it with our sample data. create table tbStats ( id int identity(1,1) primary key, stat int not null, period datetime not null ) GO insert into tbStats ( stat, period) select 10, convert(datetime, '20080101') union all select 25, convert(datetime, '20080102') union all select 5, convert(datetime, '20080103') union all select 15, convert(datetime, '20080104') union all select 30, convert(datetime, '20080105') union all select 9, convert(datetime, '20080106') union all select 22, convert(datetime, '20080107') union all select 29, convert(datetime, '20080108') GO It's important to note that the records have been inserted in date order. Therefore, we know that the id column is in the same order as the period column. This alignment is important for our first solution to work correctly. Solution 1We need to join each record with it's subsequent row. We can do that using the ever flexible joining syntax, thanks to the fact that we know the id field is an integer sequence with no gaps. In the case where this is not guaranteed in your target data, you can engineer it by transferring the required records into a temporary table or table variable first. select x.id xId, y.id yId, x.stat xStat, y.stat yStat from tbStats x left join tbStats y on x.id + 1 = y.id order by x.id xId yId xStat yStat ----------- ----------- ----------- ----------- 1 2 10 25 2 3 25 5 3 4 5 15 4 5 15 30 5 6 30 9 6 7 9 22 7 8 22 29 8 NULL 29 NULL (8 row(s) affected) By aliasing the table we can incorporate it into the SQL query twice, then join them together in a staggered fashion by adding 1 to the id of the first aliased table. The first record in the table has an id of 1. 1 + 1 = 2 so it will join on the row with id of 2 in the second aliased table. And so on. Now it's simply a case of subtracting one from the other. select x.id xId, y.id yId, x.stat xStat, y.stat yStat, abs(x.stat - y.stat) gap from tbStats x left join tbStats y on x.id + 1 = y.id order by x.id xId yId xStat yStat gap ----------- ----------- ----------- ----------- ----------- 1 2 10 25 15 2 3 25 5 20 3 4 5 15 10 4 5 15 30 15 5 6 30 9 21 6 7 9 22 13 7 8 22 29 7 8 NULL 29 NULL NULL (8 row(s) affected) We use the ABS function to ensure we always get positive integers as a result of the subtraction regardless of which side of the expression is the higher figure. Lastly we use the AVG function to get our final aggregated result. select avg(abs(x.stat - y.stat)) movingAvg from tbStats x left join tbStats y on x.id + 1 = y.id where y.stat is not null movingAvg ----------- 14 (1 row(s) affected) Solution 2What if the record's ID and period fields are not in the same order? Lets refill our example table with some unordered data. When I describe the data as unordered, I am referring to the fact that the id and period columns are not in alignment. Truncate table tbStats GO insert into tbStats ( stat, period) select 5, convert(datetime, '20080103') union all select 25, convert(datetime, '20080102') union all select 15, convert(datetime, '20080104') union all select 9, convert(datetime, '20080106') union all select 30, convert(datetime, '20080105') union all select 22, convert(datetime, '20080107') union all select 29, convert(datetime, '20080108') union all select 10, convert(datetime, '20080101') GO create index ix_tbStats_period on tbStats (period) GO In this situation another solution to staggering the rows is to create a user defined function to return the subsequent row. As this time we will be working on the period column, it is pertinent to create an index on that column. For the amount of data we are dealing with in our example it will be of no significance, but I would be negligent not to mention it here, if only in passing. CREATE FUNCTION dbo.Stagger ( @period DATETIME ) RETURNS INT AS BEGIN DECLARE @ResultVar INT SELECT TOP 1 @ResultVar = stat FROM tbStats WHERE period > @period ORDER BY period RETURN @ResultVar END GO By using the TOP keyword and an ORDER BY clause, we can select the next consecutive date that occurs in the table after the date passed into the function as a parameter. select avg(abs(x.stat - dbo.Stagger(x.period))) movingAvg from tbStats x movingAvg ----------- 14 (1 row(s) affected) Solution 3For those running SQL Server 2005 we have a third weapon in the armory for tackling this problem: APPLY. With APPLY there are two semantics that can be used to join a table valued function. CROSS and OUTER. OUTER will return all rows in the left hand table regardless of whether the table valued function returns rows or not (Similar to a left outer join) CROSS apply will only return left hand rows when there are also rows in the table valued function. We can use either in our solution. But, CROSS apply will have the neat outcome that the NULL data will be removed for the last record. We can make a new table valued function based on our previous one. CREATE FUNCTION dbo.StaggerTable ( @period DATETIME ) RETURNS TABLE AS RETURN SELECT TOP 1 id, stat FROM tbStats WHERE period > @period ORDER BY period GO And alter our previous statement to include it using CROSS APPLY, which I think you'll agree looks much cleaner. select avg(abs(x.stat - y.stat)) movingAvg from tbStats x cross apply dbo.StaggerTable(x.period) y movingAvg ----------- 14 (1 row(s) affected) Performance NotePlease check the performance carefully as your mileage can vary considerably with these kinds of query -- particularly for solution #2. Running a user-defined function for every row in a result set can impact performance. Especially if they query in the UDF is anything other than a very simple query. Personally I used solution one as it gave me best advantage of using indexes and I had the luxury of a contiguous sequential key. Of the UDF solutions number three using The CROSS APPLY method is preferred and has been faster based on limited testing. Small data sets are your friend here. ConclusionSo we have seen how SQL Server offers us plenty of flexibility in massaging our stored data into viable meaningful results. Through joining on more sophisticated semantics beyond merely matching values as they are and/or leveraging user defined functions.
|
- Advertisement - |