Author |
Topic |
CleaningWoman
Starting Member
13 Posts |
Posted - 2015-01-08 : 10:53:45
|
Hi SQL-Expert,i hope you guys can hep me to find the fastest way possible.I have about 800 million rows of data, which are in need of a new flag.About once a week i get 10 million new rows.I do need to find out which machines were busy and set a flag [is_busy=1].Defintion for a machine to be busy is the following:- partitioned (grouped) by group_id, section_id with a group-specific [gid_busy_limit] (which is joined from elsewhere)
- a row is part of a timeinterval of 4 minutes with >= [gid_busy_limit] entries
Currently there are no primary keys, but if very helpful i could add an autoincrement.I am using sql-server 2014 (64)For experiments i prepared a sample #demo table:CREATE TABLE #demo( [group_id] [smallint] NULL, [gid_busy_limit] [int] NULL, [section_id] [nvarchar](8) NULL, [time_id] [datetime2](6) NULL, [is_busy] [smallint] NULL) ON [PRIMARY]GOINSERT #demo ([group_id], [gid_busy_limit], [section_id], [time_id], [is_busy]) VALUES (7, 4, N'00001047', CAST(N'2014-11-26 11:41:00.0000000' AS DateTime2), NULL),(7, 4, N'00001047', CAST(N'2014-11-26 11:44:00.0000000' AS DateTime2), NULL),(7, 4, N'00001047', CAST(N'2014-11-26 11:46:00.0000000' AS DateTime2), NULL),(7, 4, N'00001047', CAST(N'2014-11-26 11:47:00.0000000' AS DateTime2), NULL),(7, 4, N'00001047', CAST(N'2014-11-26 11:48:00.0000000' AS DateTime2), NULL),(7, 4, N'00001047', CAST(N'2014-11-26 11:50:00.0000000' AS DateTime2), NULL),(7, 4, N'00001047', CAST(N'2014-11-26 12:21:00.0000000' AS DateTime2), NULL),(7, 4, N'00001242', CAST(N'2014-11-26 10:25:00.0000000' AS DateTime2), NULL),(7, 4, N'00001242', CAST(N'2014-11-26 10:47:00.0000000' AS DateTime2), NULL),(7, 4, N'00001242', CAST(N'2014-11-26 10:48:00.0000000' AS DateTime2), NULL),(7, 4, N'00001242', CAST(N'2014-11-26 10:48:00.0000000' AS DateTime2), NULL),(7, 4, N'00001242', CAST(N'2014-11-26 10:50:00.0000000' AS DateTime2), NULL),(7, 4, N'00001242', CAST(N'2014-11-26 10:51:00.0000000' AS DateTime2), NULL),(7, 4, N'00001242', CAST(N'2014-11-26 10:54:00.0000000' AS DateTime2), NULL),(9, 3, N'00002004', CAST(N'2014-11-26 15:39:00.0000000' AS DateTime2), NULL),(9, 3, N'00002004', CAST(N'2014-11-26 15:39:00.0000000' AS DateTime2), NULL),(9, 3, N'00002004', CAST(N'2014-11-26 15:40:00.0000000' AS DateTime2), NULL),(9, 3, N'00002004', CAST(N'2014-11-26 15:40:00.0000000' AS DateTime2), NULL),(9, 3, N'00002004', CAST(N'2014-11-26 15:42:00.0000000' AS DateTime2), NULL),(9, 3, N'00002004', CAST(N'2014-11-26 15:44:00.0000000' AS DateTime2), NULL),(9, 3, N'00002010', CAST(N'2014-11-26 09:56:00.0000000' AS DateTime2), NULL),(9, 3, N'00002010', CAST(N'2014-11-26 09:57:00.0000000' AS DateTime2), NULL),(9, 3, N'00002010', CAST(N'2014-11-26 09:59:00.0000000' AS DateTime2), NULL)GOSelect * from #demogo If solved the above table would look like this (7, 4, N'00001047', CAST(N'2014-11-26 11:41:00.0000000' AS DateTime2), NULL),(7, 4, N'00001047', CAST(N'2014-11-26 11:44:00.0000000' AS DateTime2), NULL),(7, 4, N'00001047', CAST(N'2014-11-26 11:46:00.0000000' AS DateTime2), 1),(7, 4, N'00001047', CAST(N'2014-11-26 11:47:00.0000000' AS DateTime2), 1),(7, 4, N'00001047', CAST(N'2014-11-26 11:48:00.0000000' AS DateTime2), 1),(7, 4, N'00001047', CAST(N'2014-11-26 11:50:00.0000000' AS DateTime2), 1),(7, 4, N'00001047', CAST(N'2014-11-26 12:21:00.0000000' AS DateTime2), NULL),(7, 4, N'00001242', CAST(N'2014-11-26 10:25:00.0000000' AS DateTime2), NULL),(7, 4, N'00001242', CAST(N'2014-11-26 10:47:00.0000000' AS DateTime2), 1),(7, 4, N'00001242', CAST(N'2014-11-26 10:48:00.0000000' AS DateTime2), 1),(7, 4, N'00001242', CAST(N'2014-11-26 10:48:00.0000000' AS DateTime2), 1),(7, 4, N'00001242', CAST(N'2014-11-26 10:50:00.0000000' AS DateTime2), 1),(7, 4, N'00001242', CAST(N'2014-11-26 10:51:00.0000000' AS DateTime2), 1),(7, 4, N'00001242', CAST(N'2014-11-26 10:54:00.0000000' AS DateTime2), NULL),(9, 3, N'00002004', CAST(N'2014-11-26 15:39:00.0000000' AS DateTime2), 1),(9, 3, N'00002004', CAST(N'2014-11-26 15:39:00.0000000' AS DateTime2), 1),(9, 3, N'00002004', CAST(N'2014-11-26 15:40:00.0000000' AS DateTime2), 1),(9, 3, N'00002004', CAST(N'2014-11-26 15:40:00.0000000' AS DateTime2), 1),(9, 3, N'00002004', CAST(N'2014-11-26 15:42:00.0000000' AS DateTime2), 1),(9, 3, N'00002004', CAST(N'2014-11-26 15:44:00.0000000' AS DateTime2), NULL),(9, 3, N'00002010', CAST(N'2014-11-26 09:55:00.0000000' AS DateTime2), NULL),(9, 3, N'00002010', CAST(N'2014-11-26 09:57:00.0000000' AS DateTime2), NULL),(9, 3, N'00002010', CAST(N'2014-11-26 09:59:00.0000000' AS DateTime2), NULL) I hope i could describe the problem well enough.Any help or suggestions are much appreciated.Thank you in advance |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2015-01-08 : 12:32:07
|
You can use the LAG function to solve this really quickly. Microsoft SQL Server MVP, MCT, MCSE, MCSA, MCP, MCITP, MCTS, MCDBA |
|
|
CleaningWoman
Starting Member
13 Posts |
Posted - 2015-01-09 : 05:40:54
|
quote: Originally posted by SwePeso You can use the LAG function to solve this really quickly.
Hi SwePeso,many thanks for your suggestion. I recognized lag/lead from MDX and was surprised, I did not realise, that it is now supported by SQL 2014. Fine. I also think that this function looks promising. But only for the first round.I do have the following problems:1.) I can calculate from one row the lag([gid_busy_limit]) of some (int)time_id i create, and so i will find that this row is part of a bunch of rows that are busy.But how do i calculate the rows in between the start and end row of this bunch of rows?2.) the gid_busy_limit currently goes as high as 10 so i do not think it is an option to put every possible combination of lag(1-9) into extra columns and calculate every option.2.a) Adding to this i would have to probably use lag(1-9) and lead(1-9) in combination to find the inbetween rows.Do you have more insights to this, as to how to implement this ?Thank you, again. |
|
|
gbritton
Master Smack Fu Yak Hacker
2780 Posts |
Posted - 2015-01-09 : 06:53:40
|
if the group of rose forms a partition, then you can use the first value and last value functions to access the first and last rows |
|
|
CleaningWoman
Starting Member
13 Posts |
Posted - 2015-01-12 : 05:37:24
|
quote: Originally posted by SwePeso You can use the LAG function to solve this really quickly.
I did try to solve this with lag() and saved one join, but i am not very satisfied with the code. I would be pleased if you have even more ideas to improve the code?The time '2010-01-01 00:00:01' is arbitrary. All timecodes are currently later than 2011-01-01Particularly pittyful is that i have overlapping timespans which are updated multiple times.Here is what i did: (it already shows a better result than i did writing the example down ;-)select [group_id] ,[section_id] ,dateadd(ss,lagx,'2010-01-01 00:00:01') as starttime_id ,[time_id]into #hitlistfrom ( SELECT [group_id] ,[gid_busy_limit] ,[section_id] ,[time_id] , datediff(ss, '2010-01-01 00:00:01', time_id) as sec2now , LAG(datediff(ss, '2010-01-01 00:00:01', time_id),gid_busy_limit-1) over (partition by group_id, section_id order by time_id) as lagx FROM #demo) as awhere sec2now - lagx < = 240goselect group_id, section_id, starttime_id ,time_id from #hitlist goupdate aset a.is_busy = 1from #demo as aleft join #hitlist as bon a.group_id=b.group_id and a.section_id = b.section_idwhere a.time_id>=starttime_id and a.time_id <=b.time_idgoselect * from #demogo Sorry, if it becomes obvious that i am not the most experienced sql-developer. |
|
|
CleaningWoman
Starting Member
13 Posts |
Posted - 2015-01-12 : 05:41:22
|
quote: Originally posted by gbritton if the group of rose forms a partition, then you can use the first value and last value functions to access the first and last rows
Hi gbritton, thank you for your input. I do not understand how to use first/last value to my advantage. Can you give me more insight to the way you think this might help? |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2015-01-12 : 07:03:09
|
[code]CREATE TABLE #demo( [group_id] [smallint] NULL, [gid_busy_limit] [int] NULL, [section_id] [nvarchar](8) NULL, [time_id] [datetime2](6) NULL, [is_busy] [smallint] NULL)GOINSERT #demo ([group_id], [gid_busy_limit], [section_id], [time_id], [is_busy]) VALUES (7, 4, N'00001047', CAST(N'2014-11-26 11:41:00.0000000' AS DateTime2), NULL),(7, 4, N'00001047', CAST(N'2014-11-26 11:44:00.0000000' AS DateTime2), NULL),(7, 4, N'00001047', CAST(N'2014-11-26 11:46:00.0000000' AS DateTime2), NULL),(7, 4, N'00001047', CAST(N'2014-11-26 11:47:00.0000000' AS DateTime2), NULL),(7, 4, N'00001047', CAST(N'2014-11-26 11:48:00.0000000' AS DateTime2), NULL),(7, 4, N'00001047', CAST(N'2014-11-26 11:50:00.0000000' AS DateTime2), NULL),(7, 4, N'00001047', CAST(N'2014-11-26 12:21:00.0000000' AS DateTime2), NULL),(7, 4, N'00001242', CAST(N'2014-11-26 10:25:00.0000000' AS DateTime2), NULL),(7, 4, N'00001242', CAST(N'2014-11-26 10:47:00.0000000' AS DateTime2), NULL),(7, 4, N'00001242', CAST(N'2014-11-26 10:48:00.0000000' AS DateTime2), NULL),(7, 4, N'00001242', CAST(N'2014-11-26 10:48:00.0000000' AS DateTime2), NULL),(7, 4, N'00001242', CAST(N'2014-11-26 10:50:00.0000000' AS DateTime2), NULL),(7, 4, N'00001242', CAST(N'2014-11-26 10:51:00.0000000' AS DateTime2), NULL),(7, 4, N'00001242', CAST(N'2014-11-26 10:54:00.0000000' AS DateTime2), NULL),(9, 3, N'00002004', CAST(N'2014-11-26 15:39:00.0000000' AS DateTime2), NULL),(9, 3, N'00002004', CAST(N'2014-11-26 15:39:00.0000000' AS DateTime2), NULL),(9, 3, N'00002004', CAST(N'2014-11-26 15:40:00.0000000' AS DateTime2), NULL),(9, 3, N'00002004', CAST(N'2014-11-26 15:40:00.0000000' AS DateTime2), NULL),(9, 3, N'00002004', CAST(N'2014-11-26 15:42:00.0000000' AS DateTime2), NULL),(9, 3, N'00002004', CAST(N'2014-11-26 15:44:00.0000000' AS DateTime2), NULL),(9, 3, N'00002010', CAST(N'2014-11-26 09:56:00.0000000' AS DateTime2), NULL),(9, 3, N'00002010', CAST(N'2014-11-26 09:57:00.0000000' AS DateTime2), NULL),(9, 3, N'00002010', CAST(N'2014-11-26 09:59:00.0000000' AS DateTime2), NULL)GO-- SwePesoWITH cteSourceAS ( SELECT Group_ID, Section_ID, DATEADD(SECOND, GID_Busy_Limit, Time_ID) AS ReservedTime, Time_ID, Is_Busy FROM #Demo), cteTargetAS ( SELECT Is_Busy, CASE WHEN LAG(ReservedTime, 1, '99991231') OVER (PARTITION BY Group_ID, Section_ID ORDER BY Time_ID) < Time_ID THEN 1 ELSE 0 END AS Flag FROM cteSource)UPDATE cteTargetSET Is_Busy = Flag;UPDATE dSET d.Is_Busy = w.FlagFROM #Demo AS dINNER JOIN ( SELECT Group_ID, Section_ID, Time_ID, MAX(Is_Busy) AS Flag FROM #demo GROUP BY Group_ID, Section_ID, Time_ID HAVING COUNT(*) >= 2 ) AS w ON w.group_id = d.group_id AND w.section_id = d.section_id AND w.time_id = d.time_id;SELECT *FROM #Demo;DROP TABLE #demo[/code] Microsoft SQL Server MVP, MCT, MCSE, MCSA, MCP, MCITP, MCTS, MCDBA |
|
|
CleaningWoman
Starting Member
13 Posts |
Posted - 2015-01-13 : 05:17:11
|
quote: Originally posted by SwePesoWITH cteSource
Thank you. I dont get it to work.This query has wrong results. Some wright ones are missing and some wrong ones are flagged.The DATEADD(SECOND, GID_Busy_Limit, Time_ID) AS ReservedTime, would (in this example) add 3 or 4 seconds to the time_id but GID_Busy_Limit stands for the number of machines in the timespan of 240 seconds.I could not find any reference to the 240 seconds nor to the number of rows contained in the timespan in the query.I am also unsure about the use of CTE with the update. Would i produce update is_busy=0 and extra performance loss or wont this happen?Did i misunderstand this or is something missing? |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2015-01-13 : 17:18:24
|
Change SECOND to MINUTE in the DATEADD function. And then make the calculation work according to your expectations. Microsoft SQL Server MVP, MCT, MCSE, MCSA, MCP, MCITP, MCTS, MCDBA |
|
|
ScottPletcher
Aged Yak Warrior
550 Posts |
Posted - 2015-01-14 : 13:13:55
|
The clustered index for the table should be on:( Group_ID, Section_ID, Time_ID )NOT on identity. |
|
|
CleaningWoman
Starting Member
13 Posts |
Posted - 2015-01-19 : 14:07:07
|
quote: Originally posted by SwePeso Change SECOND to MINUTE in the DATEADD function. And then make the calculation work according to your expectations.
Yes, i tried several alternatives of your code, but the main problem always stays.The rows in between the first and the last of a sequence wont be found with those routines.So far i have only a brute-force routine which takes ages to update. If you find a way to enhance your routine to find the in-between rows - i am still open for ideas.Anyway, thank you for trying. |
|
|
CleaningWoman
Starting Member
13 Posts |
Posted - 2015-01-19 : 19:45:16
|
What do you think of the following working soulution:* I do first set is_busy wherever the timespan is valid * is_busy startingamount is gid_busy_limit* Then i loop through the table counting is_busy down to 1* doubles are foundThe first tests were quite fast and the results correct !;with tbl as(SELECT [section_id] ,[is_busy] ,[group_id] ,[time_id] ,gid_busy_limit ,DATEADD(SECOND,240,[time_id]) as endtime ,lead(time_id,gid_busy_limit-1) OVER (PARTITION BY group_id,section_id ORDER BY time_id) as endentry FROM #demo) update tbl set is_busy=gid_busy_limitwhere endentry <= endtimegoselect 42WHILE (@@ROWCOUNT > 0)BEGINwith tbl as(SELECT [is_busy] ,lag(is_busy,1) OVER (PARTITION BY group_id, section_id ORDER BY time_id, is_busy desc) as prevBusy FROM #demo) Update tbl set is_busy = prevBusy -1where prevBusy -1 > isnull(is_busy,0) and prevBusy > 1 and isnull(is_busy,0)=0ENDSELECT * from #demogo |
|
|
|
|
|