mercury529
Starting Member
4 Posts

Posted - 2011-08-03 : 11:25:44
Hello,

I am rather new to SQL. I am familiar with some basic syntax, but I am not seeing an efficient query syntax for what I am trying to accomplish, so I thought I'd ask some of the experts for their opinion.

I have a database that serves as a warehouse for event information related to a program I have written. The database contains information for multiple customers. For the sake of simplicity, we will call this table 'Events'. For the purposes of this query, there are 2 fields in Events: 'Time', a datetime value indicating when the event occurred, and 'Customer'.

I am trying to design a stored procedure that accepts a starting datetime value (@startTime), a stopping datetime value (@stopTime), an interval of time specified in milliseconds (@intervalLength), and an event count threshold (@eventCount). The result of the query should be a list of every 'Customer' from 'Events' for which, between @startTime and @stopTime, there is an interval of time of length @intervalLength containing a number of events (rows) >= @eventCount. In plain English, I am trying to detect whether, between two dates, there was a high volume of events in a short period of time (where "high volume of events" is @eventCount and "short period of time" is @intervalLength).

Is anyone familiar with a way to accomplish this efficiently in SQL? I would also appreciate a nudge in the direction of any resources that might help me solve what appears on the surface to be a complicated query.

Thank you very much for any help you can provide.
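Before looking at the replies, the requirement can be pinned down as an executable restatement: for each customer, does any window [t, t + interval] inside the date range hold at least the threshold number of events? The sketch below is Python with invented names and integer-millisecond timestamps, not T-SQL; it is a brute-force reference, not an efficient solution.

```python
def customers_with_burst(events, start, stop, interval_ms, event_count):
    """events: list of (customer, time_ms) pairs.

    Returns the customers that have some window [t, t + interval_ms]
    containing >= event_count events, considering only events with
    start <= time_ms <= stop. All times are integer milliseconds."""
    by_customer = {}
    for cust, t in events:
        if start <= t <= stop:
            by_customer.setdefault(cust, []).append(t)

    hits = set()
    for cust, times in by_customer.items():
        times.sort()
        # A qualifying window, if one exists, can always be anchored
        # at some event time, so only those anchors need checking.
        for t0 in times:
            n = sum(1 for t in times if t0 <= t <= t0 + interval_ms)
            if n >= event_count:
                hits.add(cust)
                break
    return sorted(hits)
```

This is O(n^2) per customer and only serves to make the later, faster approaches easy to check against.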
visakh16
Very Important crosS Applying yaK Herder
52326 Posts

Posted - 2011-08-03 : 12:51:44
something like

    SELECT
    FROM (SELECT Customer,
                 DATEADD(ms, DATEDIFF(mi, 0, LogTime) / @intervalLength * @intervalLength, 0) AS SlotTime,
                 COUNT(Event) AS EvtCnt
          FROM Table
          WHERE Time BETWEEN @startTime AND @stopTime
          GROUP BY Customer, DATEADD(ms, DATEDIFF(mi, 0, LogTime) / @intervalLength * @intervalLength, 0)) t
    GROUP BY Customer
    HAVING SUM(CASE WHEN EvtCnt >= @eventCount THEN 1 ELSE 0 END) > 0

------------------------------------------------------------------------------------------------------
SQL Server MVP
http://visakhm.blogspot.com/
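The shape of the stub above: floor each event time to the start of a fixed-size slot, count events per (customer, slot), then keep customers with at least one slot at or above the threshold. The same shape can be sketched against SQLite, using integer epoch milliseconds so that `(t / L) * L` does the flooring that the DATEADD/DATEDIFF pair does in T-SQL (table and column names here are made up, not from the thread):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (customer TEXT, t INTEGER)")  # t = epoch ms
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("A", 0), ("A", 200), ("A", 4000), ("B", 0), ("B", 9000)])

interval_ms, threshold, start, stop = 1000, 2, 0, 10000
rows = conn.execute("""
    SELECT customer
    FROM (SELECT customer,
                 (t / ?) * ? AS slot,        -- integer division floors t to its slot start
                 COUNT(*) AS evt_cnt
          FROM events
          WHERE t BETWEEN ? AND ?
          GROUP BY customer, slot) AS s
    GROUP BY customer
    HAVING SUM(CASE WHEN evt_cnt >= ? THEN 1 ELSE 0 END) > 0
""", (interval_ms, interval_ms, start, stop, threshold)).fetchall()
# rows now holds the customers with at least one slot meeting the threshold
```

With the sample data, customer A has two events (0 ms and 200 ms) in the slot starting at 0, so only A qualifies.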
mercury529
Starting Member
4 Posts

Posted - 2011-08-03 : 15:15:22
Thank you very much for the reply, visakh16. I appreciate you taking the time to help me out. I had a few questions, if you can help clarify for me.

1.) You specify LogTime and Time. Were those meant to be the same and represent the 'Time' field of an event? I have treated it as such when I tested your code.

2.) Is Event supposed to correlate to a specific field in the Events table that I have listed? I have temporarily replaced it with * when I tested your code.

3.) I tried to run your code and I get a syntax error. I removed what appears to be the extraneous 't' after the parenthesis, but it still occurs. The error is "Error 156: Incorrect syntax near the keyword 'FROM'. Incorrect syntax near the keyword 'GROUP'." If I try running only the inner SELECT, it works properly.

4.) I am trying to wrap my head around what it is doing. The part that is confusing me at the moment is:

    DATEDIFF(mi, 0, LogTime) / @intervalLength * @intervalLength

What is this code calculating? It appears to be taking the number of minutes between a 0-value datetime and the time of my event. I think the intention after that is to divide by @intervalLength squared, although won't this divide by @intervalLength and then multiply by @intervalLength, effectively multiplying by 1, due to operator precedence?

5.) What does adding the value in 4) to a 0 time value do?

Again, thank you very much for your help.
visakh16
Very Important crosS Applying yaK Herder
52326 Posts

Posted - 2011-08-04 : 08:25:04
quote: Originally posted by mercury529

1.) You specify LogTime and Time. Were those meant to be the same and represent the 'Time' field of an event?

It was a typo. They are the same.

2.) Is Event supposed to correlate to a specific field in the Events table that I have listed?

Event represents any field that gives information on events for a customer.

3.) I tried to run your code and I get a syntax error. ... The error is "Error 156: Incorrect syntax near the keyword 'FROM'. Incorrect syntax near the keyword 'GROUP'."

I just gave a stub to work with. In reality you need to put actual columns there:

    SELECT Customer
    FROM (SELECT Customer,
                 DATEADD(ms, DATEDIFF(mi, 0, LogTime) / @intervalLength * @intervalLength, 0) AS SlotTime,
                 COUNT(Event) AS EvtCnt
          FROM Table
          WHERE LogTime BETWEEN @startTime AND @stopTime
          GROUP BY Customer, DATEADD(ms, DATEDIFF(mi, 0, LogTime) / @intervalLength * @intervalLength, 0)) t
    GROUP BY Customer
    HAVING SUM(CASE WHEN EvtCnt >= @eventCount THEN 1 ELSE 0 END) > 0

4.) What is this code calculating?

It's basically grouping your event times onto slots based on @intervalLength.

5.) What does adding the value in 4) to a 0 time value do?

You will get LogTime corrected to the nearest slot.

------------------------------------------------------------------------------------------------------
SQL Server MVP
http://visakhm.blogspot.com/
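On the precedence question in 4): division and multiplication associate left to right, and the division here is integer division, so `x / L * L` evaluates as `(x / L) * L`, which floors `x` down to the start of its slot rather than multiplying by 1 or squaring anything. A quick check in Python, with `//` making the integer division explicit:

```python
# T-SQL evaluates DATEDIFF(mi, 0, LogTime) / @intervalLength * @intervalLength
# left to right with integer division, i.e. (x / L) * L, which floors x
# to the nearest multiple of L: the start of x's slot.
L = 5  # slot length
for x, slot_start in [(0, 0), (4, 0), (5, 5), (12, 10), (19, 15)]:
    assert x // L * L == slot_start
# Step 5) then maps the slot back onto the time line: DATEADD(ms, slot, 0)
# reads "slot units since 1900-01-01" as a datetime marking the slot start.
```

As an aside, note the stub takes the difference in minutes (`mi`) but adds it back as milliseconds (`ms`); a real version would need both calls to use the same unit.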
mercury529
Starting Member
4 Posts

Posted - 2011-08-04 : 13:55:34
Thank you again for your help, visakh16. I certainly did a good job living up to the label "New to SQL Server Programming."

I did a little more research yesterday. I found that GROUP BY can essentially be used to accomplish a sort of "bucket sorting" or hashing. I believe this is the approach your query takes. For instance:

    SELECT Customer,
           DATEPART(yyyy, EventTime) AS EventYear,
           DATEPART(m, EventTime) AS EventMonth,
           DATEPART(d, EventTime) AS EventDay,
           DATEPART(hh, EventTime) AS EventHour,
           DATEPART(mi, EventTime) AS EventMinute,
           DATEPART(s, EventTime) AS EventSecond,
           COUNT(*) AS Events
    FROM Events
    WHERE ((@startTime IS NULL) OR (EventTime >= @startTime))
      AND ((@stopTime IS NULL) OR (EventTime <= @stopTime))
    GROUP BY Customer,
             DATEPART(yyyy, EventTime), DATEPART(m, EventTime), DATEPART(d, EventTime),
             DATEPART(hh, EventTime), DATEPART(mi, EventTime), DATEPART(s, EventTime)
    HAVING COUNT(*) > @eventCount

This query effectively puts each event into a bucket by second (perhaps there is a more efficient way to do it, especially the DATEPART portions). So I'd be able to see that 3 events happened during second X and 8 events happened during second X + 1.

It seems your intent is to accomplish the same with your code. I did have a few questions, if I can bother you with them.

1.) The documentation on DATEDIFF (http://msdn.microsoft.com/en-us/library/ms189794.aspx) says: "For millisecond, the maximum difference between startdate and enddate is 24 days, 20 hours, 31 minutes and 23.647 seconds. For second, the maximum difference is 68 years." Since we are using a comparison date of 0, I believe "the value of the missing date part is set to the default value: 1900-01-01". Since any recent date is more than 68 years after that, wouldn't this run into an issue?

2.) In other languages, "DATEDIFF(mi, 0, LogTime) / @intervalLength * @intervalLength" would typically resolve to DATEDIFF(mi, 0, LogTime) because of operator precedence rules. In SQL, do I need to wrap @intervalLength * @intervalLength in parentheses to square it?

3.) What was the purpose behind squaring @intervalLength (assuming that was the intent)?

4.) My one complaint with the GROUP BY bucket-sorting concept is that it is possible to miss intervals that fulfill the requirement, depending on how the buckets are defined. For instance, say you want to detect whether there exists a 1-second interval in which 2 events were generated. Let's say we GROUP BY based on the second, and event 1 occurs at 0.5 and event 2 at 1.25. The GROUP BY will fail to detect that there is a 1-second interval in which 2 events occurred.

I did some testing on your code, and it appears to suffer from this same weakness. Is there a way to do a SQL query that will be 100% accurate in determining the existence of such an interval?

Once again, thank you for all your help.
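The miss described in 4) is inherent to fixed buckets: a qualifying cluster can straddle a slot boundary. An exact test has to anchor the window at event times rather than at slot edges; on one customer's sorted times this is a linear two-pointer sweep. A sketch of the idea in Python (not T-SQL; the function name is invented):

```python
def has_burst(times, interval_ms, k):
    """True if some window [t, t + interval_ms] contains >= k of the
    given event times (one customer's events, in any order)."""
    times = sorted(times)
    lo = 0
    for hi in range(len(times)):
        # Advance lo until the window [times[lo], times[hi]] fits the interval.
        while times[hi] - times[lo] > interval_ms:
            lo += 1
        if hi - lo + 1 >= k:
            return True
    return False
```

The 0.5 s / 1.25 s example from the post is exactly the case fixed one-second buckets miss and this sweep catches: 1250 - 500 = 750 ms, inside a 1000 ms window.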
mercury529
Starting Member
4 Posts

Posted - 2011-08-06 : 14:38:32
The following code achieves my goal. However, it does so with terrible performance (presumably because it ends up doing what appears to be Events.TotalRows * Events.TotalRows comparisons). Is there a more efficient way to accomplish the results of this query? I am inclined to believe cursors might be the most efficient way to accomplish this, but I wanted to make sure there is not some more efficient non-cursor method I could be using. Thanks again for all the help.

    CREATE PROCEDURE dbo.GetCustomersWithHighDensityOfEvents
        @nEventsInInterval AS INTEGER,
        @nSecondsInInterval AS INTEGER
    AS
    BEGIN
        SET NOCOUNT ON;

        SELECT IDENTITY(Int, 1, 1) AS RowNumber, Events.EventTime, Events.Customer
        INTO #tableEventsWithRowNumber
        FROM dbo.Events
        ORDER BY Events.Customer, Events.EventTime

        SELECT DISTINCT tableLowerBound.Customer
        FROM (SELECT RowNumber, EventTime, Customer
              FROM #tableEventsWithRowNumber) tableLowerBound
        INNER JOIN (SELECT RowNumber, EventTime, Customer
                    FROM #tableEventsWithRowNumber) tableUpperBound
            ON tableLowerBound.RowNumber = tableUpperBound.RowNumber - (@nEventsInInterval - 1)
        WHERE DATEDIFF(millisecond, tableLowerBound.EventTime, tableUpperBound.EventTime) <= (@nSecondsInInterval * 1000)

        DROP TABLE #tableEventsWithRowNumber
    END
    GO
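The self-join above rests on a sound observation: number each customer's events in time order, and a qualifying window exists exactly when some event and the event @nEventsInInterval - 1 rows later are no more than the interval apart. The cost is in materializing and joining the numbered copies; the comparison itself is one linear pass over a sorted list. Illustrative Python, mirroring the row-number offset (not the procedure itself):

```python
def has_burst_by_offset(times, interval_ms, k):
    """Mirror of the row-number self-join: k events fit inside the
    interval iff some event and the event k-1 positions later in
    time order are at most interval_ms apart."""
    times = sorted(times)
    return any(times[i + k - 1] - times[i] <= interval_ms
               for i in range(len(times) - k + 1))
```

This gives the same answers as a two-pointer sweep; the SQL version additionally needs the customer column in the join condition so that consecutive row numbers are not compared across customers, which is the fix pointed out in the next reply.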
SwePeso
Patron Saint of Lost Yaks
30421 Posts

Posted - 2011-08-06 : 16:35:27
Well, you must add the Customer column to the join too; otherwise you will compare different customers against each other.

    ON tableLowerBound.RowNumber = tableUpperBound.RowNumber - (@nEventsInInterval - 1)
    AND tableLowerBound.Customer = tableUpperBound.Customer
    WHERE (DATEDIFF(millisecond, tableLowerBound.EventTime, tableUpperBound.EventTime) <= (@nSecondsInInterval * 1000))

N 56°04'39.26" E 12°55'05.63"
SwePeso
Patron Saint of Lost Yaks
30421 Posts

Posted - 2011-08-06 : 16:36:56
And creating a unique clustered index on #tableEventsWithRowNumber would help too!

    CREATE UNIQUE CLUSTERED INDEX ucx_swepeso ON #tableEventsWithRowNumber (Customer, RowNumber)

N 56°04'39.26" E 12°55'05.63"
SwePeso
Patron Saint of Lost Yaks
30421 Posts

Posted - 2011-08-06 : 16:40:25
I have taken the liberty to fix a few bugs of yours...

quote: Originally posted by mercury529

    CREATE PROCEDURE dbo.GetCustomersWithHighDensityOfEvents
    (
        @nEventsInInterval INT,
        @nSecondsInInterval INT,
        @StartTime DATETIME,
        @StopTime DATETIME
    )
    AS
    SET NOCOUNT ON

    SELECT IDENTITY(Int, 1, 1) AS RowNumber, EventTime, Customer
    INTO #tableEventsWithRowNumber
    FROM dbo.Events
    WHERE EventTime BETWEEN @StartTime AND @StopTime
    ORDER BY Customer, EventTime

    CREATE UNIQUE CLUSTERED INDEX ucx_swepeso ON #tableEventsWithRowNumber (Customer, RowNumber)

    SELECT DISTINCT tableLowerBound.Customer
    FROM (SELECT RowNumber, EventTime, Customer
          FROM #tableEventsWithRowNumber) AS tableLowerBound
    INNER JOIN (SELECT RowNumber, EventTime, Customer
                FROM #tableEventsWithRowNumber) AS tableUpperBound
        ON tableLowerBound.RowNumber = tableUpperBound.RowNumber - @nEventsInInterval + 1
       AND tableLowerBound.Customer = tableUpperBound.Customer
    WHERE DATEDIFF(millisecond, tableLowerBound.EventTime, tableUpperBound.EventTime) <= 1000 * @nSecondsInInterval

    DROP TABLE #tableEventsWithRowNumber
    GO

N 56°04'39.26" E 12°55'05.63"
SwePeso
Patron Saint of Lost Yaks
30421 Posts

Posted - 2011-08-06 : 16:51:48
Wouldn't this be enough?

    SELECT e.Customer
    FROM dbo.[Events] AS e
    CROSS APPLY (
        SELECT TOP(1) x.[Time]
        FROM dbo.[Events] AS x
        WHERE x.Customer = e.Customer
          AND x.[Time] >= e.[Time]
          AND x.[Time] <= DATEADD(MILLISECOND, @IntervalLength, e.[Time])
          AND x.[Time] <= @StopTime
    ) AS f
    WHERE e.[Time] BETWEEN @StartTime AND @StopTime
    GROUP BY e.Customer
    HAVING COUNT(*) >= @EventCount

N 56°04'39.26" E 12°55'05.63"
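A set-based variant in the same spirit that is exact: count, for every event, how many of the same customer's events fall in [t, t + interval], and keep customers where any such count reaches the threshold. Sketched below in SQLite with integer millisecond times (names invented; whether it beats the row-number join in practice depends on indexing and data shape):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (customer TEXT, t INTEGER)")  # t = epoch ms
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("A", 500), ("A", 1250), ("B", 0), ("B", 5000)])

interval_ms, threshold = 1000, 2
rows = conn.execute("""
    SELECT DISTINCT e.customer
    FROM events AS e
    JOIN events AS x
      ON x.customer = e.customer
     AND x.t BETWEEN e.t AND e.t + ?     -- events inside e's window
    GROUP BY e.customer, e.t             -- one group per candidate window
    HAVING COUNT(*) >= ?
""", (interval_ms, threshold)).fetchall()
```

Because the window is anchored at each event rather than at fixed slot edges, this catches clusters that straddle a slot boundary, such as events at 500 ms and 1250 ms with a 1000 ms interval.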
jcelko
Esteemed SQL Purist
547 Posts

Posted - 2011-08-06 : 18:38:15
Every SQL forum or newsgroup expects that you will post DDL, sample data, and clear specifications, which might include output. This is explained in the FAQ section. We cannot read minds; what would we need to know to do your job for you? Tables must have keys and should have DRI, constraints, and all the basic features of a schema. You should know and use the ISO-8601 standards for temporal data, avoid needless dialect, learn basic data modeling, and use the ISO-11179 standards for data element names. Please tell us if you can change the DDL.

>> I have a database that serves as a warehouse for event information related to a program I have written. <<

Watch out for the term "warehouse"; we SQL guys think of "Data Warehouse", which is not what you want.

>> The database contains information for multiple customers. For the sake of simplicity, we will call this table 'Events'. For the purposes of this query, there are 2 fields [sic] in Events, 'Time' - a DATETIME2(0) structure [sic: data type] indicating when the event occurred, and 'Customer' [sic: only one?]. <<

Columns are not anything like fields. Once you get that concept, the rest of RDBMS will be soooooo much easier for you. You will have moved from physical punch cards and mag tape to an abstraction. Tables have collective or plural names because they model sets of entities.

Next, time is not discrete; it is a continuum. We need a pair of (event_start_time, event_end_time) for a correct model of an event. That discussion is for another day. Let's use DATETIME2(0) to get you your milliseconds and assume that the event is modeled by its start time only.

Oh, stop using camelCase: it does not port to ISO standards, and your eye jumps to the first uppercase letter, then back to the front. It is a bitch to read for any length of time. The correct form is alphas, digits, and underscores only.

>> I am trying to design a stored procedure that accepts a starting DATETIME2(0) value (@in_start_time), a stopping DATETIME2(0) value (@in_stop_time), an interval of time specified in milliseconds (@in_event_duration), and an event count threshold (@in_event_cnt). <<

Clean up the data element names to follow ISO-11179 rules.

>> The results of the query should be a list of every 'Customers' from 'Events' for which between @in_start_time and @in_stop_time there is an interval of time of length @in_event_duration for which the number of events (rows in) is >= to @in_event_cnt. In plain English, I am trying to detect if between two dates, there was a high volume of events in a short period of time (where high volume of events is @in_event_cnt and short period of time is @in_event_duration). <<

Sample data? DDL? Here is my effort:

    CREATE TABLE Customers
    (cust_id CHAR(10) NOT NULL PRIMARY KEY,
     ..);

You were not clear on timing the events.

    CREATE TABLE Events
    (cust_id CHAR(10) NOT NULL
       REFERENCES Customers (cust_id)
       ON DELETE CASCADE,
     event_timestamp DATETIME2(4) DEFAULT CURRENT_TIMESTAMP NOT NULL,
     PRIMARY KEY (cust_id, event_timestamp));

>> Is anyone familiar with a way to accomplish this efficiently in SQL? I would also appreciate a nudge in the direction of any resources that might help me solve what appears on the surface to me to be a complicated query. <<

Temporal stuff is hard. Download the free Rick Snodgrass book from the University of Arizona website and read it... twice. It is still the best one on this topic.

Let's start off easy:

    CREATE PROCEDURE Event_Density_by_Customer
    (@in_start_time DATETIME2(3),
     @in_stop_time DATETIME2(3))
    AS
    SELECT cust_id, @in_start_time, @in_stop_time,
           (COUNT(event_timestamp) / DATEDIFF(MS, @in_start_time, @in_stop_time)) AS event_occurrence_avg
    FROM Events
    WHERE event_timestamp BETWEEN @in_start_time AND @in_stop_time
    GROUP BY cust_id;

This will give us a result like "Fred pinged FaceBook from '2011-08-12 00:00:00' to '2011-08-12 01:00:00' an average of once every 15 milliseconds", which is a good start on what you want.

Here is where it is going to be a mess. That "short period of time" has to be computed by adding @in_event_duration to ALL possible start times in the time frame we are considering. One day is how many milliseconds? Perhaps a better approach would be a table with the pause between the events as a column:

    CREATE TABLE Events
    (cust_id CHAR(10) NOT NULL
       REFERENCES Customers (cust_id)
       ON DELETE CASCADE,
     event_timestamp DATETIME2(4) DEFAULT CURRENT_TIMESTAMP NOT NULL,
     pause_duration INTEGER NOT NULL,
     PRIMARY KEY (cust_id, event_timestamp));

As data is inserted, we compute how long a pause there was between this new event and the prior event, in milliseconds. Does that work for you?

--CELKO--
Books in Celko Series for Morgan-Kaufmann Publishing:
Analytics and OLAP in SQL
Data and Databases: Concepts in Practice
Data, Measurements and Standards in SQL
SQL for Smarties
SQL Programming Style
SQL Puzzles and Answers
Thinking in Sets
Trees and Hierarchies in SQL
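The pause-column idea at the end can be made concrete: if each row stores the gap to the customer's previous event, then @in_event_cnt events fall inside @in_event_duration exactly when some run of @in_event_cnt - 1 consecutive gaps sums to at most the duration. A sketch under that assumption (Python; the function name is invented, and k >= 2 is assumed since a single event has no gap):

```python
def has_burst_from_pauses(pauses, k, duration_ms):
    """pauses[i] = ms between consecutive events of one customer, in
    time order. k events (k >= 2) fit inside duration_ms iff some
    k-1 consecutive pauses sum to at most duration_ms."""
    need = k - 1  # k events span k-1 consecutive gaps
    return any(sum(pauses[i:i + need]) <= duration_ms
               for i in range(len(pauses) - need + 1))
```

For events at 0, 500, and 1250 ms the pause list is [500, 750]: two events fit in a 1000 ms window (the 500 ms gap), but three do not, since 500 + 750 exceeds it.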