Author |
Topic |
Kristen
Test
22859 Posts |
Posted - 2013-11-19 : 08:33:14
|
Last time it happened the contractor was summarily dismissed. A client of mine spent a fortune working out a disaster recovery plan ... contractors are well aware of the policy and ... guess what ... it's happened again.

The contractor had a large data import to do. It runs via some crappy API, so records are imported one-by-one and it generates a lot of log traffic. So they set the DB to SIMPLE and then forgot? to change it back again. It's been like that for weeks. It only came to light because someone set it back to FULL a few days ago, and last night an overnight task generated 70GB of transaction log which filled the disk, and an ADMIN said that they had changed the DB to SIMPLE to recover that situation (let's gloss over whether breaking the recovery chain was a good idea <sigh>).

The DB is 4GB and has (right now) a 15GB log file. It's a document management system, so there should not be any massive data updates. 70GB of TLog usage for overnight tasks seems crazy to me (updates that don't check if the row is already set to that value spring to mind ...), and checking the TLog backup files since FULL was reinstated, it's generating 10GB of TLog backups every 10 minutes!!

What actions would you suggest? I was planning to:

- Review the SQL Server Error Log for when the RECOVERY MODEL was changed
- Query MSDB for the size/frequency of TLog backups, to see what maximum size I get and whether that indicates a sensible maximum size for the TLog file (see the sketch below)

What is the best thing to put in place, going forwards, to monitor this so I get an alert if another contractor commits Contract Suicide?

Thanks.
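For the MSDB check, something along these lines shows the log-backup volume per day - a minimal sketch, where 'DocManagementDB' is a placeholder name and the 14-day window is arbitrary:

    SELECT  CONVERT(date, bs.backup_start_date)   AS BackupDate,
            COUNT(*)                              AS LogBackupCount,
            SUM(bs.backup_size) / 1048576.0       AS TotalMB,      -- bytes to MB
            MAX(bs.backup_size) / 1048576.0       AS LargestMB
    FROM    msdb.dbo.backupset AS bs
    WHERE   bs.database_name = N'DocManagementDB'  -- placeholder name
    AND     bs.type = 'L'                          -- 'L' = transaction log backup
    AND     bs.backup_start_date >= DATEADD(day, -14, GETDATE())
    GROUP BY CONVERT(date, bs.backup_start_date)
    ORDER BY BackupDate;

The peak figures (plus some headroom) are a reasonable starting point for a sensible maximum log file size.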
|
Kristen
Test
22859 Posts |
Posted - 2013-11-19 : 09:20:40
|
Checked the SQL Error Log, filtered on "option RECOVERY", and it appears that the database was set to FULL on 13-Nov and before that ... nothing. The Error Log goes back to 20-Apr. Lots of "Log backup failed" messages on that database, presumably because it was in SIMPLE mode.

Checked MSDB for the backup history. No log backups on that DB between 09-Mar and 13-Nov. Scary :(
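For anyone doing the same check, the error log can be searched from T-SQL as well as from the SSMS Log File Viewer. A minimal sketch, assuming the usual parameters of the undocumented xp_readerrorlog (log number, log type, search string) - being undocumented, the parameter list is an assumption and may vary by version:

    EXEC master.dbo.xp_readerrorlog 0, 1, N'option RECOVERY';   -- 0 = current SQL Server error log
    EXEC master.dbo.xp_readerrorlog 1, 1, N'option RECOVERY';   -- 1 = most recent archived log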
|
|
jeffw8713
Aged Yak Warrior
819 Posts |
Posted - 2013-11-19 : 14:16:43
|
If you have notifications set up for your transaction log backup jobs, a failure should trigger an immediate review of the job and why it failed. At that point you would know the recovery model had been changed and could take corrective action.

You could prevent it from happening by mirroring the database, but that is probably overkill...
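If the jobs don't currently notify anyone, the wiring is small - a minimal sketch, assuming Database Mail is already configured and using placeholder operator and job names:

    -- Create an operator to receive failure emails (name and address are placeholders)
    EXEC msdb.dbo.sp_add_operator
            @name          = N'DBA Team',
            @email_address = N'dba-team@example.com';

    -- Email the operator whenever the log backup job fails (2 = notify on failure)
    EXEC msdb.dbo.sp_update_job
            @job_name                   = N'LogBackup - DocManagementDB',  -- placeholder job name
            @notify_level_email         = 2,
            @notify_email_operator_name = N'DBA Team';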
|
|
robvolk
Most Valuable Yak
15732 Posts |
Posted - 2013-11-19 : 20:28:42
|
Assuming you can't just DENY ALTER DATABASE to the login being used, I'd suggest either Policy Management, or a server trigger that fires on ALTER_DATABASE (which should allow you to roll back the change and capture the details of who did it). DB Mirroring is a nice thought if you have an extra server and the license allows you to do it, but it would exact a performance hit.

I vaguely remember a "sneaky trick" post somewhere describing a replication procedure that would mark the database as replicated even if it had no publications. That would also prevent you from changing to simple recovery.
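A minimal sketch of the server trigger idea, assuming you want to block (not just log) a switch to SIMPLE; the trigger name is a placeholder, and it is worth testing that the rollback behaves for the specific ALTER DATABASE options you care about:

    CREATE TRIGGER trg_BlockSimpleRecovery
    ON ALL SERVER
    FOR ALTER_DATABASE
    AS
    BEGIN
        DECLARE @ed    xml,
                @cmd   nvarchar(max),
                @db    sysname,
                @login sysname;

        SET @ed    = EVENTDATA();
        SET @cmd   = @ed.value('(/EVENT_INSTANCE/TSQLCommand/CommandText)[1]', 'nvarchar(max)');
        SET @db    = @ed.value('(/EVENT_INSTANCE/DatabaseName)[1]', 'nvarchar(128)');
        SET @login = @ed.value('(/EVENT_INSTANCE/LoginName)[1]', 'nvarchar(128)');

        IF @cmd LIKE N'%RECOVERY%SIMPLE%'
        BEGIN
            ROLLBACK;   -- undo the ALTER DATABASE
            RAISERROR(N'Recovery model change to SIMPLE on [%s] blocked (attempted by %s).', 16, 1, @db, @login);
        END
    END;

Swap the ROLLBACK for an INSERT into an audit table if you only want to log who changed what, rather than prevent it.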
|
|
Kristen
Test
22859 Posts |
Posted - 2013-11-20 : 06:30:23
|
Thanks guys. I'm only called in to troubleshoot, and I suspect the problem is that there isn't anyone responsible for this mess.

I don't understand why the IT guys don't have a Disk Full alert on the server (every server in the company, in fact ...), as that would have brought the problem to light, let alone why alerts on SQL jobs are not being acted upon ...

There is further mess ... some DBs are in SIMPLE mode and I'm not comfortable that they should be. There are some small DBs that just handle communication with the main application DB, which is fair enough, but I need to ask the question of how a restore would work if the peripheral DBs were restored to "last night" and the main DB to a point-in-time.

Also, what happens when the disk is full? In the recent instance, disk-full prevented extension of the main DB's log, which was 15GB (on a 4GB data file, which seems a little large); it was set to 10% expansion (even the SQL Error Log recorded a message that it thought that decision was questionable!! - probably just the out-of-the-box setting <sigh>), and there wasn't room on the disk.

But ... there was still room on the disk for backups of the small databases, and they didn't need any data/log file extension, so they continued to run. I've looked at the source of a few Sprocs that do intra-DB processing and there isn't a BEGIN TRANSACTION / COMMIT in sight, so gawd knows what the knock-on effect of just "some" of the DBs continuing to stay upright has been.

And then, scary scary, looking at the code for the overnight jobs it's "Put all rows to be processed into a table" followed by "Get one and do lots of stuff to it. Repeat". Looks to me like the "stuff" could easily be changed to be SET based.

You won't be surprised to hear that there are NOLOCK statements all over the code <sigh>. The couple I looked at would fail catastrophically if a row was processed twice, or one was missed, which is bound to happen sooner or later (and be completely unreproducible, of course).

Yesterday there was the emergency fix of setting the DB to SIMPLE to throw away an oversize log file, so the log backup figures are incomplete, but even so the TLog backups total 140GB ... on a 4GB data file. This isn't a real-time logging application, it's a document management system with only 15 users ... they just cannot be generating enough work for that huge level of transactions (most of them are attorneys, so presumably scratching their heads pondering for most of the day!!)

I should start a blog ... you couldn't make this stuff up! Ho! Hum! Just got to think of the fees at emergency callout rates, I suppose ...

Meeting tomorrow with the Supplier and Client. I already had on my agenda a question as to why the backup failure went unnoticed (for 8 months!), thanks, and I like the idea of at least logging ALTER DATABASE with a trigger. We have that in all our systems just as belt & braces.

I also wonder if I need to get them to throw away their SQL Maintenance Plans in favour of something we have written. They have specific databases added to a Plan for Log Backups, which seems like an accident waiting to happen - a DB gets added and is not in step with the Maintenance Plan. (Our code backs up a database if it finds a new one, and does TLog backups on any that are set to Full Recovery. We have a "Control" table to tell it to do something out of the ordinary, but by default it does what seems to us to be the most sensible Plan, so a lot more fail-safe than the bog-standard SQL Maintenance Plans.)

There is also the "Rebuild all indexes on all tables regardless" part of the SQL Maintenance Plan that must generate horrific logs - ours just does fragmented tables. Not sure I want to be the one who is responsible for housekeeping ... but maybe they aren't safe with any of their other suppliers doing it and ... like I said ... the fees are nice to have :)
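As a going-forwards check to catch this sooner, something like the following flags any FULL-recovery database whose log hasn't been backed up recently - a minimal sketch; the one-hour threshold is arbitrary and 'model' is excluded because it defaults to FULL but normally has no log backups:

    SELECT  d.name,
            d.recovery_model_desc,
            MAX(bs.backup_finish_date) AS LastLogBackup
    FROM    sys.databases AS d
    LEFT JOIN msdb.dbo.backupset AS bs
            ON  bs.database_name = d.name
            AND bs.type = 'L'                  -- log backups only
    WHERE   d.recovery_model_desc = 'FULL'
    AND     d.name <> 'model'
    GROUP BY d.name, d.recovery_model_desc
    HAVING  MAX(bs.backup_finish_date) IS NULL
         OR MAX(bs.backup_finish_date) < DATEADD(hour, -1, GETDATE());

Any rows returned mean either the recovery model was changed back to FULL without restarting log backups, or the log backup job has stopped running.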
|
|
jeffw8713
Aged Yak Warrior
819 Posts |
Posted - 2013-11-20 : 14:59:18
|
Well Kristen, you have a lot to look at with this system. I will just make a few points...

Maintenance Plans - you could replace the tasks in the maintenance plans with Execute SQL Tasks and call out to your code. This keeps the plans they expect in place, but ensures that your code is making them safer. I know of several vendors that deliver maintenance as either maintenance plans or agent jobs - that are not at all safe. For those systems the vendors expect their stuff to be there, but I cannot allow their processes. A good example is where they back up to devices - and initialize those devices every night. When the full backup runs, they initialize the backup device - and when it completes, they initialize the log backup device and run a log backup. Effectively destroying the backup chain and any possibility of recovery if something happens to the system...

As for the size of the log, I would not be surprised at all - especially with a document management system - and one with lawyers involved. Each document is going to be edited multiple times every day. Every edit is going to generate a log record, and if they are storing the documents as blobs - with the full document being replaced with the updated version... well, you can see how the log will grow. Think of a 1MB document: edited 30 times in a day, it is going to generate at least 60MB of log space - and still only take up 1MB in the data file.

I also have an application that has NOLOCK all over the place. And this is not a small system - several thousand procedures and more than 2.2TB of used space. Asked the vendor to switch to one of the snapshot isolations - and they freaked... Imagine having to update 1000+ procedures to remove all the NOLOCKs so you can implement snapshot isolation and you can understand where they are coming from.

About those other databases, my guess would be they are reference databases. I have several systems where there are reference databases which are used either to populate local tables - or referenced directly. Since these databases are not updated directly by the users - and will only be updated at most once a day, through an automated process - there is no reason to set them to the full recovery model.

On smaller systems (less than 100GB usually) - where I have a large maintenance window - I will rebuild all indexes fully once a week. If the system is only used during business hours I see nothing wrong with rebuilding everything. When I have a system with disk limitations, I will move to an approach of only rebuilding those indexes that need to be rebuilt. On larger systems, or systems with small maintenance windows - definitely only rebuild what needs to be rebuilt.

And finally, having just gone through this here - when the teams responsible for responding to alerts get close to 1000 alerts every day, because they haven't been following up on them and fixing them, it is easy to see how some will be missed. Or how they can just get used to the alerts and ignore them - or, worse, build a rule to delete them automatically.

Good luck...
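For the "only rebuild what needs to be rebuilt" approach, the usual starting point is sys.dm_db_index_physical_stats - a minimal sketch, where the 30% / 1000-page thresholds are common rules of thumb rather than requirements:

    SELECT  OBJECT_NAME(ips.object_id)         AS TableName,
            i.name                             AS IndexName,
            ips.avg_fragmentation_in_percent,
            ips.page_count
    FROM    sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
    JOIN    sys.indexes AS i
            ON  i.object_id = ips.object_id
            AND i.index_id  = ips.index_id
    WHERE   ips.avg_fragmentation_in_percent > 30
    AND     ips.page_count > 1000
    AND     i.name IS NOT NULL                 -- skip heaps
    ORDER BY ips.avg_fragmentation_in_percent DESC;

Each row returned is then a candidate for ALTER INDEX ... REBUILD (or REORGANIZE for the moderately fragmented ones), which keeps the log traffic proportional to what actually needs doing.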
|
|
denis_the_thief
Aged Yak Warrior
596 Posts |
Posted - 2013-11-20 : 15:42:46
|
My points would be:

- is there not a means through permissions to prevent this?
- are you sure it was the contractor/twit that did this, or maybe it was someone else on your team?
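On the permissions point, there is a server-level permission that covers this - a minimal sketch, with a placeholder login name, and with the caveat that a login mapped to db_owner (or sysadmin) in the database can still run ALTER DATABASE through other permission paths, so this only helps if the contractor connects with a lower-privileged login:

    USE master;
    DENY ALTER ANY DATABASE TO [ContractorLogin];   -- placeholder login name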
|
|
Kristen
Test
22859 Posts |
Posted - 2013-11-21 : 17:39:12
|
quote: Originally posted by jeffw8713

Some useful pointers there, thanks.

quote: As for the size of the log, I would not be surprised at all ... Think of a 1MB document, edited 30 times in a day is going to generate at least 60MB of log space - and still only take up 1MB in the data file.

Sadly the DOC itself is not in the DB; the only thing in the DB is a reference to it. (I'm not actually sure that it stores the revision history, although it might. The lawyers would have kittens if it did that long term, because they certainly would not want any internal comments in early revisions to come out in "discovery"!)

After today's meeting it turns out that the DB contains the equivalent of the Public Folder structure in Exchange. This is constructed to have every Matter Case No from their Case Management system, plus all the sub-cases, and whether they are Active / Inactive, in a tree structure by Attorney and so on; so adding a new Case will create a suitable matching folder in Public Folders for people to be able to put Emails/DOCs into.

Apparently building this causes about 40GB of transaction logs (the whole DB is 4GB, remember ...). They throw away the lot, using some API in Autonomy - so one-by-one I expect - and then recreate them, again via API and presumably one-by-one. At that point I expect it is 1% different to what was there before. Their DBA said he had checked that this was the case by deliberately running the process, manually, immediately after the daily scheduled task finished, and found that it did indeed generate another 40GB of logs - so that 2nd iteration would have had ZERO work to do IMHO, as the attorneys were all tucked up in bed at the time, so no chance of any changes to Cases or the tree folder structure. But it did the whole throw-away and re-create anyway.

I didn't get any warm fuzzy feeling when I suggested that they build a temporary table of the desired tree structure and then call the Autonomy API just for the ones that were different ... bit surprised that there is no Bulk Change in the Autonomy API [i.e. one that would work set-based].

quote: Imagine having to update 1000+ procedures to remove all NOLOCKS so you can implement snapshot isolation and you can understand where they are coming from.

When I explain to DEVs that NOLOCK will give them some records twice, some records not at all, and errors on some (KEY was in the index, but the record was deleted by the time the fetch was done), and then I ask the MD at the head of the table if anyone is making critical business decisions based on the results ... I find that the DEVs tend to quickly come around to the idea that maybe they can alter 10,000 SProcs ... and in pretty-damn-quick order too!

There is some risk in just switching on Read Committed Snapshot (assuming a full QA test cycle cannot be justified), in that the snapshot is as-of the start of the query - that might give an unexpected result half way through an Sproc that modifies the underlying data, I suppose? But having said that, I don't think we've ever had a reported occurrence of that actually happening in any of our APPs (and for sure we MUST be doing that type of action in more than one client's code!)
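For reference, switching it on is a one-liner per database - a minimal sketch, with a placeholder database name:

    ALTER DATABASE [DocManagementDB]        -- placeholder name
        SET READ_COMMITTED_SNAPSHOT ON;     -- needs exclusive access to the database;
                                            -- kick out open connections first, or it will wait

Row versioning then adds a 14-byte tag to modified rows and some tempdb load, so it isn't entirely free, but it avoids touching the 1000+ procedures.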
quote: About those other databases, my guess would be they are reference databases.

As you are rightly guessing, they have no user access. However, my worry is that they are doing:

    SELECT @CutoffDate = LastKnowDate
    FROM ConfigStuff

    INSERT INTO #TEMP
    SELECT *
    FROM Database-A.dbo.SomeTable
    WHERE UpdateDate > @CutoffDate

    ... process #TEMP ...

    SELECT @NewCutoffDate = MAX(UpdateDate)
    FROM #TEMP

    UPDATE U
    SET LastKnowDate = @NewCutoffDate
    FROM ConfigStuff AS U

such that recovery of Database-A (to a different point-in-time) would kibosh everything. But maybe not, and the child DBs would be resilient to that sort of activity. I'm ready to eat my hat if I find that that is the case, though!

quote: when the teams responsible for responding to alerts get close to 1000 alerts every day because they haven't been following up on them and fixing them, it is easy to see how some will be missed. Or, how they can just get used to the alerts and ignore them - or worse, build a rule to delete them automatically.

Yeah, that's a big issue I think. I've known DBAs that get "I'm all right, no errors" emails from 100's of subsystems, and they are supposed to be able to realise that there is one missing. I have no idea what happens when the Email just fails for some reason - the Spam Filter has decided it is getting a lot of emails, all the same, from that Sender, for example! - and on that day the Email has reported a critical error ... for the first time ever!!

quote: Good luck...

I'll keep you all posted, I'm sure you could do with the entertainment!
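On the folder-rebuild suggestion above: the set-based diff itself is cheap even if the API calls stay one-by-one - a minimal sketch, where #DesiredFolders / #ExistingFolders and their columns are entirely hypothetical names for illustration:

    -- Folders to create (in the desired tree but not the current one)
    SELECT FolderPath, CaseNo, IsActive
    FROM   #DesiredFolders
    EXCEPT
    SELECT FolderPath, CaseNo, IsActive
    FROM   #ExistingFolders;

    -- Folders to remove or deactivate (currently there but no longer wanted)
    SELECT FolderPath, CaseNo, IsActive
    FROM   #ExistingFolders
    EXCEPT
    SELECT FolderPath, CaseNo, IsActive
    FROM   #DesiredFolders;

Feeding only those two result sets to the API would shrink the nightly churn - and the 40GB of log - to something proportional to what actually changed.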
|
|
Kristen
Test
22859 Posts |
Posted - 2013-11-21 : 17:48:37
|
quote: Originally posted by denis_the_thief

- is there not a means through permissions to prevent this?

Yes, could be. Never really thought that I needed to safeguard that before (if I *do* safeguard that, what other, similar, commands have I got to "catch" too? :( )

By the by, I have been thinking that our ADMIN Database (which is the APP that does all the Backups and Housekeeping for our clients) should keep a record of key settings on the DB (SIMPLE/FULL, READ ONLY etc.) and moan if any of them change - until someone changes the "expected values" in our DB. We could use that to bring to light quite a number of changes in a DB, I expect.
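The current values are all in sys.databases, so the "moan if it changed" check can be a straight comparison against an expected-values table - a minimal sketch, where ADMIN.dbo.ExpectedDBSettings is a hypothetical control table with one row per database:

    SELECT  d.name,
            d.recovery_model_desc,
            d.is_read_only,
            e.ExpectedRecoveryModel,
            e.ExpectedReadOnly
    FROM    sys.databases AS d
    JOIN    ADMIN.dbo.ExpectedDBSettings AS e     -- hypothetical control table
            ON e.DatabaseName = d.name
    WHERE   d.recovery_model_desc <> e.ExpectedRecoveryModel
       OR   d.is_read_only        <> e.ExpectedReadOnly;

Any rows returned are settings that have drifted from what was signed off, and can go straight into the alert email.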
quote: - are you sure it was the contractor/twit that did this or maybe it was someone else on your team?

I wish it was - they'd hold their hands up and say "mea culpa". Sadly I'm only the hot-shot consultant brought in to advise on how they should go forwards more intelligently; my guys weren't involved in any of the code or procedures.

Most probably the 3rd party people reset to SIMPLE for a massive data import and then forgot, or didn't have the skill/knowledge to know, to put it back again.

Their predecessor did exactly the same thing; the client only found out when they had a server crash, at which point they, sadly, had data loss. Result = 3rd party fired and all suppliers etc. made aware that this must not happen, etc. etc.

This time it has come to light by chance, and with no data loss, so now we are trying to bolt the stable door more rigorously.

The client doesn't have an in-house DBA (the company is too small), and has relied on an outsourced IT provider, who probably doesn't have enough knowledge to realise the need for alerts on failed SQL jobs etc. (and the 3rd party certainly never put any Email-To stuff on any of the jobs that they created ...)

So I think the issue is that the responsibility has fallen between several stools and no one was "responsible" for the system; the client certainly has no one skilled enough in-house.
|
|