Joining to the Next Sequential Row

By Paul Alcon on 2 April 2008 | Tags: Queries , SELECT

One of the more obscure requirements that a developer may find themselves facing is the need to compare a row with its immediate sibling. One such case is when a list of values needs to be processed to produce a moving average or to smooth a sequence of statistical numbers where their order is important. For example, values lying along a time line. The solution is actually quite simple, but not immediately obvious.

Problem

Take a sequence of 8 integers 10, 25, 5, 15, 30, 9, 22, 29 in a single table thus:

id  stat    period
--- ------- --------
1   10      1/1/2008
2   25      2/1/2008
3    5      3/1/2008
4   15      4/1/2008
5   30      5/1/2008
6    9      6/1/2008
7   22      7/1/2008
8   29      8/1/2008

We need to calculate the difference between each statistic and the next, then calculate the mean value of the 'gaps.' It is important that the figures are compared in the right order to get a correct result. Firstly lets create a table and fill it with our sample data.

create table tbStats 
(
    id int identity(1,1) primary key,
    stat int not null,
    period datetime not null 
)
GO

insert into tbStats ( stat, period) 
select 10, convert(datetime, '20080101')
union all
select 25, convert(datetime, '20080102')
union all
select 5, convert(datetime, '20080103')
union all
select 15, convert(datetime, '20080104')
union all
select 30, convert(datetime, '20080105')
union all
select 9, convert(datetime, '20080106')
union all
select 22, convert(datetime, '20080107')
union all
select 29, convert(datetime, '20080108')
GO

It's important to note that the records have been inserted in date order. Therefore, we know that the id column is in the same order as the period column. This alignment is important for our first solution to work correctly.

Solution 1

We need to join each record with it's subsequent row. We can do that using the ever flexible joining syntax, thanks to the fact that we know the id field is an integer sequence with no gaps. In the case where this is not guaranteed in your target data, you can engineer it by transferring the required records into a temporary table or table variable first.

select
    x.id xId,
    y.id yId, 
    x.stat xStat,
    y.stat yStat 
from 
    tbStats x 
    left join tbStats y on x.id + 1 = y.id 
order by 
    x.id
    
xId           yId         xStat       yStat
----------- ----------- ----------- -----------
1              2             10          25
2              3             25           5
3              4              5          15
4              5             15          30
5              6             30           9
6              7              9          22
7              8             22          29
8           NULL             29        NULL

(8 row(s) affected)

By aliasing the table we can incorporate it into the SQL query twice, then join them together in a staggered fashion by adding 1 to the id of the first aliased table. The first record in the table has an id of 1. 1 + 1 = 2 so it will join on the row with id of 2 in the second aliased table. And so on.

Now it's simply a case of subtracting one from the other.

select
    x.id xId,
    y.id yId,
    x.stat xStat,
    y.stat yStat, 
    abs(x.stat - y.stat) gap 
from 
    tbStats x 
    left join tbStats y on x.id + 1 = y.id
order by 
    x.id
    
xId         yId         xStat       yStat       gap
----------- ----------- ----------- ----------- -----------
1           2           10          25          15
2           3           25          5           20
3           4           5           15          10
4           5           15          30          15
5           6           30          9           21
6           7           9           22          13
7           8           22          29          7
8           NULL        29          NULL        NULL

(8 row(s) affected)

We use the ABS function to ensure we always get positive integers as a result of the subtraction regardless of which side of the expression is the higher figure.

Lastly we use the AVG function to get our final aggregated result.

select
    avg(abs(x.stat - y.stat)) movingAvg 
from 
    tbStats x 
    left join tbStats y on x.id + 1 = y.id
where 
    y.stat is not null
    
movingAvg
-----------
14

(1 row(s) affected)

Solution 2

What if the record's ID and period fields are not in the same order? Lets refill our example table with some unordered data. When I describe the data as unordered, I am referring to the fact that the id and period columns are not in alignment.

Truncate table tbStats
GO

insert into tbStats ( stat, period) 
select 5, convert(datetime, '20080103')
union all
select 25, convert(datetime, '20080102')
union all
select 15, convert(datetime, '20080104')
union all
select 9, convert(datetime, '20080106')
union all
select 30, convert(datetime, '20080105')
union all
select 22, convert(datetime, '20080107')
union all
select 29, convert(datetime, '20080108')
union all
select 10, convert(datetime, '20080101')
GO

create index ix_tbStats_period on tbStats (period)
GO

In this situation another solution to staggering the rows is to create a user defined function to return the subsequent row. As this time we will be working on the period column, it is pertinent to create an index on that column. For the amount of data we are dealing with in our example it will be of no significance, but I would be negligent not to mention it here, if only in passing.

CREATE FUNCTION dbo.Stagger  
(
    @period DATETIME
)
RETURNS INT
AS
BEGIN
    DECLARE @ResultVar INT

    SELECT TOP 1 @ResultVar = stat 
    FROM tbStats 
    WHERE period > @period
    ORDER BY period

    RETURN @ResultVar

END
GO

By using the TOP keyword and an ORDER BY clause, we can select the next consecutive date that occurs in the table after the date passed into the function as a parameter.

select
    avg(abs(x.stat - dbo.Stagger(x.period))) movingAvg  
from 
    tbStats x 
    
movingAvg
-----------
14

(1 row(s) affected)

Solution 3

For those running SQL Server 2005 we have a third weapon in the armory for tackling this problem: APPLY. With APPLY there are two semantics that can be used to join a table valued function. CROSS and OUTER. OUTER will return all rows in the left hand table regardless of whether the table valued function returns rows or not (Similar to a left outer join) CROSS apply will only return left hand rows when there are also rows in the table valued function. We can use either in our solution. But, CROSS apply will have the neat outcome that the NULL data will be removed for the last record. We can make a new table valued function based on our previous one.

CREATE FUNCTION dbo.StaggerTable
(
    @period DATETIME 
)
  RETURNS TABLE
AS
RETURN
    SELECT TOP 1 id, stat 
    FROM tbStats 
    WHERE period > @period
    ORDER BY period
GO

And alter our previous statement to include it using CROSS APPLY, which I think you'll agree looks much cleaner.

select
    avg(abs(x.stat - y.stat)) movingAvg  
from 
    tbStats x 
    cross apply dbo.StaggerTable(x.period) y
    
    
movingAvg
-----------
14

(1 row(s) affected)

Performance Note

Please check the performance carefully as your mileage can vary considerably with these kinds of query -- particularly for solution #2. Running a user-defined function for every row in a result set can impact performance. Especially if they query in the UDF is anything other than a very simple query. Personally I used solution one as it gave me best advantage of using indexes and I had the luxury of a contiguous sequential key. Of the UDF solutions number three using The CROSS APPLY method is preferred and has been faster based on limited testing. Small data sets are your friend here.

Conclusion

So we have seen how SQL Server offers us plenty of flexibility in massaging our stored data into viable meaningful results. Through joining on more sophisticated semantics beyond merely matching values as they are and/or leveraging user defined functions.