mercury529
Starting Member
4 Posts

Posted - 2011-08-03 : 11:25:44
Hello,

I am rather new to SQL. I am familiar with some basic syntax, but I am not seeing an efficient query syntax for what I am trying to accomplish, so I thought I'd ask some of the experts for their opinion.

I have a database that serves as a warehouse for event information related to a program I have written. The database contains information for multiple customers. For the sake of simplicity, we will call this table 'Events'. For the purposes of this query, there are 2 fields in Events: 'Time', a datetime value indicating when the event occurred, and 'Customer'.

I am trying to design a stored procedure that accepts a starting datetime value (@startTime), a stopping datetime value (@stopTime), an interval of time specified in milliseconds (@intervalLength), and an event count threshold (@eventCount). The result of the query should be a list of every 'Customer' from 'Events' for which, between @startTime and @stopTime, there is an interval of time of length @intervalLength containing a number of events (rows) >= @eventCount. In plain English, I am trying to detect whether, between two dates, there was a high volume of events in a short period of time (where "high volume of events" is @eventCount and "short period of time" is @intervalLength).

Is anyone familiar with a way to accomplish this efficiently in SQL? I would also appreciate a nudge in the direction of any resources that might help me solve what appears on the surface to be a complicated query.

Thank you very much for any help you can provide.
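Before looking at the replies, the requirement can be pinned down as an executable restatement: for each customer, does any window [t, t + interval] inside the date range hold at least the threshold number of events? The sketch below is Python with invented names and integer-millisecond timestamps, not T-SQL; it is a brute-force reference, not an efficient solution.

```python
def customers_with_burst(events, start, stop, interval_ms, event_count):
    """events: list of (customer, time_ms) pairs.

    Returns the customers that have some window [t, t + interval_ms]
    containing >= event_count events, considering only events with
    start <= time_ms <= stop. All times are integer milliseconds."""
    by_customer = {}
    for cust, t in events:
        if start <= t <= stop:
            by_customer.setdefault(cust, []).append(t)

    hits = set()
    for cust, times in by_customer.items():
        times.sort()
        # A qualifying window, if one exists, can always be anchored
        # at some event time, so only those anchors need checking.
        for t0 in times:
            n = sum(1 for t in times if t0 <= t <= t0 + interval_ms)
            if n >= event_count:
                hits.add(cust)
                break
    return sorted(hits)
```

This is O(n^2) per customer and only serves to make the later, faster approaches easy to check against.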
visakh16
Very Important crosS Applying yaK Herder
52326 Posts

Posted - 2011-08-03 : 12:51:44
something like

    SELECT
    FROM (SELECT Customer,
                 DATEADD(ms, DATEDIFF(mi, 0, LogTime) / @intervalLength * @intervalLength, 0) AS SlotTime,
                 COUNT(Event) AS EvtCnt
          FROM Table
          WHERE Time BETWEEN @startTime AND @stopTime
          GROUP BY Customer, DATEADD(ms, DATEDIFF(mi, 0, LogTime) / @intervalLength * @intervalLength, 0)) t
    GROUP BY Customer
    HAVING SUM(CASE WHEN EvtCnt >= @eventCount THEN 1 ELSE 0 END) > 0

------------------------------------------------------------------------------------------------------
SQL Server MVP
http://visakhm.blogspot.com/
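The shape of the stub above: floor each event time to the start of a fixed-size slot, count events per (customer, slot), then keep customers with at least one slot at or above the threshold. The same shape can be sketched against SQLite, using integer epoch milliseconds so that `(t / L) * L` does the flooring that the DATEADD/DATEDIFF pair does in T-SQL (table and column names here are made up, not from the thread):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (customer TEXT, t INTEGER)")  # t = epoch ms
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("A", 0), ("A", 200), ("A", 4000), ("B", 0), ("B", 9000)])

interval_ms, threshold, start, stop = 1000, 2, 0, 10000
rows = conn.execute("""
    SELECT customer
    FROM (SELECT customer,
                 (t / ?) * ? AS slot,        -- integer division floors t to its slot start
                 COUNT(*) AS evt_cnt
          FROM events
          WHERE t BETWEEN ? AND ?
          GROUP BY customer, slot) AS s
    GROUP BY customer
    HAVING SUM(CASE WHEN evt_cnt >= ? THEN 1 ELSE 0 END) > 0
""", (interval_ms, interval_ms, start, stop, threshold)).fetchall()
# rows now holds the customers with at least one slot meeting the threshold
```

With the sample data, customer A has two events (0 ms and 200 ms) in the slot starting at 0, so only A qualifies.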
mercury529
Starting Member
4 Posts

Posted - 2011-08-03 : 15:15:22
Thank you very much for the reply, visakh16. I appreciate you taking the time to help me out. I had a few questions, if you can help clarify for me.

1.) You specify LogTime and Time. Were those meant to be the same and represent the 'Time' field of an event? I have treated it as such when I tested your code.

2.) Is Event supposed to correlate to a specific field in the Events table that I have listed? I have temporarily replaced it with * when I tested your code.

3.) I tried to run your code and I get a syntax error. I removed what appears to be the extraneous 't' after the parenthesis, but it still occurs. The error is "Error 156: Incorrect syntax near the keyword 'FROM'. Incorrect syntax near the keyword 'GROUP'." If I try running only the inner SELECT, it works properly.

4.) I am trying to wrap my head around what it is doing. The part that is confusing me at the moment is:

    DATEDIFF(mi, 0, LogTime) / @intervalLength * @intervalLength

What is this code calculating? It appears to be taking the number of minutes between a 0-value datetime and the time of my event. I think the intention after that is to divide by @intervalLength squared, although won't this divide by @intervalLength and then multiply by @intervalLength, effectively multiplying by 1, due to operator precedence?

5.) What does adding the value in 4) to a 0 time value do?

Again, thank you very much for your help.
visakh16
Very Important crosS Applying yaK Herder
52326 Posts

Posted - 2011-08-04 : 08:25:04
quote: Originally posted by mercury529

1.) You specify LogTime and Time. Were those meant to be the same and represent the 'Time' field of an event?

It was a typo. They are the same.

2.) Is Event supposed to correlate to a specific field in the Events table that I have listed?

Event represents any field that gives information on events for a customer.

3.) I tried to run your code and I get a syntax error. ... The error is "Error 156: Incorrect syntax near the keyword 'FROM'. Incorrect syntax near the keyword 'GROUP'."

I just gave a stub to work with. In reality you need to put actual columns there:

    SELECT Customer
    FROM (SELECT Customer,
                 DATEADD(ms, DATEDIFF(mi, 0, LogTime) / @intervalLength * @intervalLength, 0) AS SlotTime,
                 COUNT(Event) AS EvtCnt
          FROM Table
          WHERE LogTime BETWEEN @startTime AND @stopTime
          GROUP BY Customer, DATEADD(ms, DATEDIFF(mi, 0, LogTime) / @intervalLength * @intervalLength, 0)) t
    GROUP BY Customer
    HAVING SUM(CASE WHEN EvtCnt >= @eventCount THEN 1 ELSE 0 END) > 0

4.) What is this code calculating?

It's basically grouping your event times onto slots based on @intervalLength.

5.) What does adding the value in 4) to a 0 time value do?

You will get LogTime corrected to the nearest slot.

------------------------------------------------------------------------------------------------------
SQL Server MVP
http://visakhm.blogspot.com/
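On the precedence question in 4): division and multiplication associate left to right, and the division here is integer division, so `x / L * L` evaluates as `(x / L) * L`, which floors `x` down to the start of its slot rather than multiplying by 1 or squaring anything. A quick check in Python, with `//` making the integer division explicit:

```python
# T-SQL evaluates DATEDIFF(mi, 0, LogTime) / @intervalLength * @intervalLength
# left to right with integer division, i.e. (x / L) * L, which floors x
# to the nearest multiple of L: the start of x's slot.
L = 5  # slot length
for x, slot_start in [(0, 0), (4, 0), (5, 5), (12, 10), (19, 15)]:
    assert x // L * L == slot_start
# Step 5) then maps the slot back onto the time line: DATEADD(ms, slot, 0)
# reads "slot units since 1900-01-01" as a datetime marking the slot start.
```

As an aside, note the stub takes the difference in minutes (`mi`) but adds it back as milliseconds (`ms`); a real version would need both calls to use the same unit.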
mercury529
Starting Member
4 Posts

Posted - 2011-08-04 : 13:55:34
Thank you again for your help, visakh16. I certainly did a good job living up to the label "New to SQL Server Programming."

I did a little more research yesterday. I found that GROUP BY can essentially be used to accomplish a sort of "bucket sorting" or hashing. I believe this is the approach your query takes. For instance:

    SELECT Customer,
           DATEPART(yyyy, EventTime) AS EventYear,
           DATEPART(m, EventTime) AS EventMonth,
           DATEPART(d, EventTime) AS EventDay,
           DATEPART(hh, EventTime) AS EventHour,
           DATEPART(mi, EventTime) AS EventMinute,
           DATEPART(s, EventTime) AS EventSecond,
           COUNT(*) AS Events
    FROM Events
    WHERE ((@startTime IS NULL) OR (EventTime >= @startTime))
      AND ((@stopTime IS NULL) OR (EventTime <= @stopTime))
    GROUP BY Customer,
             DATEPART(yyyy, EventTime), DATEPART(m, EventTime), DATEPART(d, EventTime),
             DATEPART(hh, EventTime), DATEPART(mi, EventTime), DATEPART(s, EventTime)
    HAVING COUNT(*) > @eventCount

This query effectively puts each event into a bucket by second (perhaps there is a more efficient way to do it, especially the DATEPART portions). So I'd be able to see that 3 events happened during second X and 8 events happened during second X + 1.

It seems your intent is to accomplish the same with your code. I did have a few questions, if I can bother you with them.

1.) The documentation on DATEDIFF (http://msdn.microsoft.com/en-us/library/ms189794.aspx) says: "For millisecond, the maximum difference between startdate and enddate is 24 days, 20 hours, 31 minutes and 23.647 seconds. For second, the maximum difference is 68 years." Since we are using a comparison date of 0, I believe "the value of the missing date part is set to the default value: 1900-01-01". Since any recent date is more than 68 years after that, wouldn't this run into an issue?

2.) In other languages, "DATEDIFF(mi, 0, LogTime) / @intervalLength * @intervalLength" would typically resolve to DATEDIFF(mi, 0, LogTime) because of operator precedence rules. In SQL, do I need to wrap @intervalLength * @intervalLength in parentheses to square it?

3.) What was the purpose behind squaring @intervalLength (assuming that was the intent)?

4.) My one complaint with the GROUP BY bucket-sorting concept is that it is possible to miss intervals that fulfill the requirement, depending on how the buckets are defined. For instance, say you want to detect whether there exists a 1-second interval in which 2 events were generated. Let's say we GROUP BY based on the second, and event 1 occurs at 0.5 and event 2 at 1.25. The GROUP BY will fail to detect that there is a 1-second interval in which 2 events occurred.

I did some testing on your code, and it appears to suffer from this same weakness. Is there a way to do a SQL query that will be 100% accurate in determining the existence of such an interval?

Once again, thank you for all your help.
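The miss described in 4) is inherent to fixed buckets: a qualifying cluster can straddle a slot boundary. An exact test has to anchor the window at event times rather than at slot edges; on one customer's sorted times this is a linear two-pointer sweep. A sketch of the idea in Python (not T-SQL; the function name is invented):

```python
def has_burst(times, interval_ms, k):
    """True if some window [t, t + interval_ms] contains >= k of the
    given event times (one customer's events, in any order)."""
    times = sorted(times)
    lo = 0
    for hi in range(len(times)):
        # Advance lo until the window [times[lo], times[hi]] fits the interval.
        while times[hi] - times[lo] > interval_ms:
            lo += 1
        if hi - lo + 1 >= k:
            return True
    return False
```

The 0.5 s / 1.25 s example from the post is exactly the case fixed one-second buckets miss and this sweep catches: 1250 - 500 = 750 ms, inside a 1000 ms window.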
mercury529
Starting Member
4 Posts

Posted - 2011-08-06 : 14:38:32
The following code achieves my goal. However, it does so with terrible performance (presumably because it ends up doing what appears to be Events.TotalRows * Events.TotalRows comparisons). Is there a more efficient way to accomplish the results of this query? I am inclined to believe cursors might be the most efficient way to accomplish this, but I wanted to make sure there is not some more efficient non-cursor method I could be using. Thanks again for all the help.

    CREATE PROCEDURE dbo.GetCustomersWithHighDensityOfEvents
        @nEventsInInterval AS INTEGER,
        @nSecondsInInterval AS INTEGER
    AS
    BEGIN
        SET NOCOUNT ON;

        SELECT IDENTITY(Int, 1, 1) AS RowNumber, Events.EventTime, Events.Customer
        INTO #tableEventsWithRowNumber
        FROM dbo.Events
        ORDER BY Events.Customer, Events.EventTime

        SELECT DISTINCT tableLowerBound.Customer
        FROM (SELECT RowNumber, EventTime, Customer
              FROM #tableEventsWithRowNumber) tableLowerBound
        INNER JOIN (SELECT RowNumber, EventTime, Customer
                    FROM #tableEventsWithRowNumber) tableUpperBound
            ON tableLowerBound.RowNumber = tableUpperBound.RowNumber - (@nEventsInInterval - 1)
        WHERE DATEDIFF(millisecond, tableLowerBound.EventTime, tableUpperBound.EventTime) <= (@nSecondsInInterval * 1000)

        DROP TABLE #tableEventsWithRowNumber
    END
    GO
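The self-join above rests on a sound observation: number each customer's events in time order, and a qualifying window exists exactly when some event and the event @nEventsInInterval - 1 rows later are no more than the interval apart. The cost is in materializing and joining the numbered copies; the comparison itself is one linear pass over a sorted list. Illustrative Python, mirroring the row-number offset (not the procedure itself):

```python
def has_burst_by_offset(times, interval_ms, k):
    """Mirror of the row-number self-join: k events fit inside the
    interval iff some event and the event k-1 positions later in
    time order are at most interval_ms apart."""
    times = sorted(times)
    return any(times[i + k - 1] - times[i] <= interval_ms
               for i in range(len(times) - k + 1))
```

This gives the same answers as a two-pointer sweep; the SQL version additionally needs the customer column in the join condition so that consecutive row numbers are not compared across customers, which is the fix pointed out in the next reply.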
SwePeso
Patron Saint of Lost Yaks
30421 Posts

Posted - 2011-08-06 : 16:35:27
Well, you must add the Customer column to the join too; otherwise you will compare different customers against each other.

    ON tableLowerBound.RowNumber = tableUpperBound.RowNumber - (@nEventsInInterval - 1)
    AND tableLowerBound.Customer = tableUpperBound.Customer
    WHERE (DATEDIFF(millisecond, tableLowerBound.EventTime, tableUpperBound.EventTime) <= (@nSecondsInInterval * 1000))

N 56°04'39.26" E 12°55'05.63"
SwePeso
Patron Saint of Lost Yaks
30421 Posts

Posted - 2011-08-06 : 16:36:56
And creating a unique clustered index on #tableEventsWithRowNumber would help too!

    CREATE UNIQUE CLUSTERED INDEX ucx_swepeso ON #tableEventsWithRowNumber (Customer, RowNumber)

N 56°04'39.26" E 12°55'05.63"
SwePeso
Patron Saint of Lost Yaks
30421 Posts

Posted - 2011-08-06 : 16:40:25
I have taken the liberty to fix a few bugs of yours...

quote: Originally posted by mercury529

    CREATE PROCEDURE dbo.GetCustomersWithHighDensityOfEvents
    (
        @nEventsInInterval INT,
        @nSecondsInInterval INT,
        @StartTime DATETIME,
        @StopTime DATETIME
    )
    AS
    SET NOCOUNT ON

    SELECT IDENTITY(Int, 1, 1) AS RowNumber, EventTime, Customer
    INTO #tableEventsWithRowNumber
    FROM dbo.Events
    WHERE EventTime BETWEEN @StartTime AND @StopTime
    ORDER BY Customer, EventTime

    CREATE UNIQUE CLUSTERED INDEX ucx_swepeso ON #tableEventsWithRowNumber (Customer, RowNumber)

    SELECT DISTINCT tableLowerBound.Customer
    FROM (SELECT RowNumber, EventTime, Customer
          FROM #tableEventsWithRowNumber) AS tableLowerBound
    INNER JOIN (SELECT RowNumber, EventTime, Customer
                FROM #tableEventsWithRowNumber) AS tableUpperBound
        ON tableLowerBound.RowNumber = tableUpperBound.RowNumber - @nEventsInInterval + 1
       AND tableLowerBound.Customer = tableUpperBound.Customer
    WHERE DATEDIFF(millisecond, tableLowerBound.EventTime, tableUpperBound.EventTime) <= 1000 * @nSecondsInInterval

    DROP TABLE #tableEventsWithRowNumber
    GO

N 56°04'39.26" E 12°55'05.63"
SwePeso
Patron Saint of Lost Yaks
30421 Posts

Posted - 2011-08-06 : 16:51:48
Wouldn't this be enough?

    SELECT e.Customer
    FROM dbo.[Events] AS e
    CROSS APPLY (
        SELECT TOP(1) x.[Time]
        FROM dbo.[Events] AS x
        WHERE x.Customer = e.Customer
          AND x.[Time] >= e.[Time]
          AND x.[Time] <= DATEADD(MILLISECOND, @IntervalLength, e.[Time])
          AND x.[Time] <= @StopTime
    ) AS f
    WHERE e.[Time] BETWEEN @StartTime AND @StopTime
    GROUP BY e.Customer
    HAVING COUNT(*) >= @EventCount

N 56°04'39.26" E 12°55'05.63"
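A set-based variant in the same spirit that is exact: count, for every event, how many of the same customer's events fall in [t, t + interval], and keep customers where any such count reaches the threshold. Sketched below in SQLite with integer millisecond times (names invented; whether it beats the row-number join in practice depends on indexing and data shape):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (customer TEXT, t INTEGER)")  # t = epoch ms
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("A", 500), ("A", 1250), ("B", 0), ("B", 5000)])

interval_ms, threshold = 1000, 2
rows = conn.execute("""
    SELECT DISTINCT e.customer
    FROM events AS e
    JOIN events AS x
      ON x.customer = e.customer
     AND x.t BETWEEN e.t AND e.t + ?     -- events inside e's window
    GROUP BY e.customer, e.t             -- one group per candidate window
    HAVING COUNT(*) >= ?
""", (interval_ms, threshold)).fetchall()
```

Because the window is anchored at each event rather than at fixed slot edges, this catches clusters that straddle a slot boundary, such as events at 500 ms and 1250 ms with a 1000 ms interval.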
jcelko
Esteemed SQL Purist
547 Posts

Posted - 2011-08-06 : 18:38:15
Every SQL forum or newsgroup expects that you will post DDL, sample data, and clear specifications, which might include output. This is explained in the FAQ section. We cannot read minds; what would we need to know to do your job for you? Tables must have keys and should have DRI, constraints, and all the basic features of a schema. You should know and use the ISO-8601 standards for temporal data, avoid needless dialect, learn basic data modeling, and use the ISO-11179 standards for data element names. Please tell us if you can change the DDL.

>> I have a database that serves as a warehouse for event information related to a program I have written. <<

Watch out for the term "warehouse"; we SQL guys think of "Data Warehouse", which is not what you want.

>> The database contains information for multiple customers. For the sake of simplicity, we will call this table 'Events'. For the purposes of this query, there are 2 fields [sic] in Events, 'Time' - a DATETIME2(0) structure [sic: data type] indicating when the event occurred, and 'Customer' [sic: only one?]. <<

Columns are not anything like fields. Once you get that concept, the rest of RDBMS will be soooooo much easier for you. You will have moved from physical punch cards and mag tape to an abstraction. Tables have collective or plural names because they model sets of entities.

Next, time is not discrete; it is a continuum. We need a pair of (event_start_time, event_end_time) for a correct model of an event. That discussion is for another day. Let's use DATETIME2(0) to get you your milliseconds and assume that the event is modeled by its start time only.

Oh, stop using camelCase: it does not port to ISO standards, and your eye jumps to the first uppercase letter, then back to the front. It is a bitch to read for any length of time. The correct form is alphas, digits, and underscores only.

>> I am trying to design a stored procedure that accepts a starting DATETIME2(0) value (@in_start_time), a stopping DATETIME2(0) value (@in_stop_time), an interval of time specified in milliseconds (@in_event_duration), and an event count threshold (@in_event_cnt). <<

Clean up the data element names to follow ISO-11179 rules.

>> The results of the query should be a list of every 'Customers' from 'Events' for which between @in_start_time and @in_stop_time there is an interval of time of length @in_event_duration for which the number of events (rows in) is >= to @in_event_cnt. In plain English, I am trying to detect if between two dates, there was a high volume of events in a short period of time (where high volume of events is @in_event_cnt and short period of time is @in_event_duration). <<

Sample data? DDL? Here is my effort:

    CREATE TABLE Customers
    (cust_id CHAR(10) NOT NULL PRIMARY KEY,
     ..);

You were not clear on timing the events.

    CREATE TABLE Events
    (cust_id CHAR(10) NOT NULL
       REFERENCES Customers (cust_id)
       ON DELETE CASCADE,
     event_timestamp DATETIME2(4) DEFAULT CURRENT_TIMESTAMP NOT NULL,
     PRIMARY KEY (cust_id, event_timestamp));

>> Is anyone familiar with a way to accomplish this efficiently in SQL? I would also appreciate a nudge in the direction of any resources that might help me solve what appears on the surface to me to be a complicated query. <<

Temporal stuff is hard. Download the free Rick Snodgrass book from the University of Arizona website and read it... twice. It is still the best one on this topic.

Let's start off easy:

    CREATE PROCEDURE Event_Density_by_Customer
    (@in_start_time DATETIME2(3),
     @in_stop_time DATETIME2(3))
    AS
    SELECT cust_id, @in_start_time, @in_stop_time,
           (COUNT(event_timestamp) / DATEDIFF(MS, @in_start_time, @in_stop_time)) AS event_occurrence_avg
    FROM Events
    WHERE event_timestamp BETWEEN @in_start_time AND @in_stop_time
    GROUP BY cust_id;

This will give us a result like "Fred pinged FaceBook from '2011-08-12 00:00:00' to '2011-08-12 01:00:00' an average of once every 15 milliseconds", which is a good start on what you want.

Here is where it is going to be a mess. That "short period of time" has to be computed by adding @in_event_duration to ALL possible start times in the time frame we are considering. One day is how many milliseconds? Perhaps a better approach would be a table with the pause between the events as a column:

    CREATE TABLE Events
    (cust_id CHAR(10) NOT NULL
       REFERENCES Customers (cust_id)
       ON DELETE CASCADE,
     event_timestamp DATETIME2(4) DEFAULT CURRENT_TIMESTAMP NOT NULL,
     pause_duration INTEGER NOT NULL,
     PRIMARY KEY (cust_id, event_timestamp));

As data is inserted, we compute how long a pause there was between this new event and the prior event, in milliseconds. Does that work for you?

--CELKO--
Books in Celko Series for Morgan-Kaufmann Publishing:
Analytics and OLAP in SQL
Data and Databases: Concepts in Practice
Data, Measurements and Standards in SQL
SQL for Smarties
SQL Programming Style
SQL Puzzles and Answers
Thinking in Sets
Trees and Hierarchies in SQL
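The pause-column idea at the end can be made concrete: if each row stores the gap to the customer's previous event, then @in_event_cnt events fall inside @in_event_duration exactly when some run of @in_event_cnt - 1 consecutive gaps sums to at most the duration. A sketch under that assumption (Python; the function name is invented, and k >= 2 is assumed since a single event has no gap):

```python
def has_burst_from_pauses(pauses, k, duration_ms):
    """pauses[i] = ms between consecutive events of one customer, in
    time order. k events (k >= 2) fit inside duration_ms iff some
    k-1 consecutive pauses sum to at most duration_ms."""
    need = k - 1  # k events span k-1 consecutive gaps
    return any(sum(pauses[i:i + need]) <= duration_ms
               for i in range(len(pauses) - need + 1))
```

For events at 0, 500, and 1250 ms the pause list is [500, 750]: two events fit in a 1000 ms window (the 500 ms gap), but three do not, since 500 + 750 exceeds it.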