Jaypoc
Yak Posting Veteran
58 Posts |
Posted - 2015-04-14 : 17:40:57
I'm looking for suggestions or documented strategies for using MSSQL to import, process, and store large quantities of data. Here's my scenario:

- I'm loading 10 million rows (provided by an outside source) every 10 minutes via BCP.
- I'm aggregating this data by grouping it into 20,000 units.
- I need to store 2 days' worth of this aggregated data.

The issue I've run into so far is that this fills up a 2TB drive in about a week. I've tried following the BCP import with a DELETE FROM table WHERE data_date < [2 days ago], but that didn't work, as the drive still filled up.

I'm considering running a full truncate every 10 minutes after I aggregate the data, and storing the 20,000 aggregated rows instead of the 2 days of raw data. Would this solve the problem of space filling up? Would it be better to drop and recreate the table after each import? Any suggestions or white papers on similar problems/solutions would be welcome.
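For concreteness, a minimal sketch of the load-then-aggregate step described above; the table and column names (dbo.ImportData, dbo.AggregatedData, unit_id, data_date, value) are hypothetical, since the post doesn't give the schema:

-- Assumed: dbo.ImportData(unit_id, data_date, value) holds the raw BCP load,
-- and dbo.AggregatedData holds one summary row per unit per load.
INSERT INTO dbo.AggregatedData (unit_id, load_time, total_value, row_count)
SELECT unit_id,
       MIN(data_date),
       SUM(value),
       COUNT(*)
FROM dbo.ImportData
WHERE data_date >= DATEADD(MINUTE, -10, GETDATE())   -- rows from the latest load
GROUP BY unit_id;                                     -- ~20,000 aggregated rows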
tkizer
Almighty SQL Goddess
38200 Posts |
Posted - 2015-04-14 : 17:48:19
Is it the mdf/ndf files or the ldf that is growing? I'm suspicious of a recovery model/backup log issue.

How wide is the table? What is the average row size?

Tara Kizer
SQL Server MVP since 2007
http://weblogs.sqlteam.com/tarad/
Kristen
Test
22859 Posts |
Posted - 2015-04-14 : 20:45:56
10M rows every 10 minutes for 2 days is 2.8 billion rows.

2TB / 2.8 billion rows = 763 bytes per row, assuming nothing else on the drive!

Doesn't sound like a huge amount allowing for some indexes etc.
Jaypoc
Yak Posting Veteran
58 Posts |
Posted - 2015-04-15 : 08:15:05
That should be once an hour, not once every 10 minutes. I confused two projects I'm working on when I posted this. Sorry about that!

As for the recovery model, I am using simple recovery, and there is one index on the column that I'm grouping the aggregated data by. I never actually wrote the procedure to pull out the aggregated data yet, though. I had gotten the import and delete to work, but it was chewing up additional space despite deleting the older records.
tkizer
Almighty SQL Goddess
38200 Posts |
Posted - 2015-04-15 : 11:56:17
Do the delete in a while loop with batches (say 10,000 rows deleted at a time). Loop until done.

Tara Kizer
SQL Server MVP since 2007
http://weblogs.sqlteam.com/tarad/
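A minimal sketch of the batched delete described above, assuming a raw table dbo.ImportData with a datetime column data_date (names carried over from the earlier illustrative sketch):

DECLARE @rows INT = 1;

WHILE @rows > 0
BEGIN
    DELETE TOP (10000)
    FROM dbo.ImportData
    WHERE data_date < DATEADD(DAY, -2, GETDATE());

    SET @rows = @@ROWCOUNT;   -- stop once no qualifying rows remain
END;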
Jaypoc
Yak Posting Veteran
58 Posts |
Posted - 2015-04-15 : 13:05:42
The delete works, but it doesn't free the disk space. Would deleting smaller batches solve this?
tkizer
Almighty SQL Goddess
38200 Posts |
Posted - 2015-04-15 : 13:14:21
quote: Originally posted by Jaypoc
The delete works, but it doesn't free the disk space. Would deleting smaller batches solve this?

Yes. It'll mean the tlog won't grow, assuming you are using either the SIMPLE recovery model or the FULL recovery model with LOG backups.

Tara Kizer
SQL Server MVP since 2007
http://weblogs.sqlteam.com/tarad/
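A quick way to confirm which recovery model a database is actually using (the database name is a placeholder):

SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = N'YourDatabase';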
Jaypoc
Yak Posting Veteran
58 Posts |
Posted - 2015-04-15 : 14:09:08
Forgive me, but I'm not following how that will help. The issue I have isn't that the database is being blocked (unless something is occurring that I'm not familiar with). I can load the data once an hour for two days with no problems, but after that, I delete an hour every time I load another hour, and the size of the MDF continues to grow as if I never deleted any of the records at all, eventually filling the drive. Additionally, the database is only used to generate reporting, so it is not accessed by anything other than my own scripts, which sequentially load data, process it out, and then delete data older than 48 hours.
tkizer
Almighty SQL Goddess
38200 Posts |
Posted - 2015-04-15 : 14:14:48
The deletes in batches help with the ldf growth, not the mdf.

For the mdf, view the free space INSIDE the file. If there's plenty, you can do a one-time shrink to reclaim it, since it sounds like you'll be managing the volume without a huge increase. Also, you probably need to rebuild your indexes due to severe fragmentation, but do this after the shrink, since shrinking causes massive fragmentation.

Tara Kizer
SQL Server MVP since 2007
http://weblogs.sqlteam.com/tarad/
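One way to see the free space inside each database file, as a rough sketch (run it in the database in question):

SELECT name                                                        AS file_name,
       size / 128                                                  AS size_mb,
       CAST(FILEPROPERTY(name, 'SpaceUsed') AS int) / 128          AS used_mb,
       (size - CAST(FILEPROPERTY(name, 'SpaceUsed') AS int)) / 128 AS free_mb
FROM sys.database_files;   -- size and SpaceUsed are reported in 8 KB pages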
Jaypoc
Yak Posting Veteran
58 Posts |
Posted - 2015-04-15 : 14:38:41
The volume shouldn't be increasing at all, as I delete 9 million rows (from 2 days ago) and then insert another 9 million rows, and I do this hourly. Everything I've read claims you should not shrink the DB, but is this a case where I should shrink it as part of the hourly routine, or maybe just as a daily maintenance routine?
tkizer
Almighty SQL Goddess
38200 Posts |
Posted - 2015-04-15 : 14:42:13
Not an hourly shrink. A one-time shrink, but you need to check the free space inside the mdf file to know if a shrink should happen. You can right-click on the database and select the disk usage report to view it easily. If you have a large percentage of free space, then you might want to consider a one-time shrink. If the free space is less than 25% though, I would not do it due to the issues shrinks cause.

Also, fragmentation needs to be checked. Delete/insert will cause heavy fragmentation.

Tara Kizer
SQL Server MVP since 2007
http://weblogs.sqlteam.com/tarad/
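A sketch of checking fragmentation with the standard DMV for the current database (the 30% threshold is just a commonly used rule of thumb for rebuilds):

SELECT OBJECT_NAME(ips.object_id)        AS table_name,
       i.name                            AS index_name,
       ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id
 AND i.index_id  = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 30;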
Jaypoc
Yak Posting Veteran
58 Posts |
Posted - 2015-04-15 : 19:57:45
When you say a one-time shrink, how often should this be checked? I had reduced the amount of data I was keeping to only 1 day's worth, and in about 3-4 days it completely filled up all the available space on the drive. This is a process that needs to run indefinitely, every hour, importing new data while removing the old data. Is there a way to automate the process so it doesn't need to be checked manually too frequently? Is there a better tool/product offered by MS to handle this kind of heavy rotational data?
tkizer
Almighty SQL Goddess
38200 Posts |
Posted - 2015-04-15 : 20:12:13
Have you checked the free space inside the file?

Regarding another tool/product, you can look into partitioning. You can quickly get rid of data with partitioning.

Tara Kizer
SQL Server MVP since 2007
http://weblogs.sqlteam.com/tarad/
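A rough sketch of the sliding-window pattern partitioning enables; all object names and boundary values are hypothetical, and a real setup needs one boundary per retained hour plus a scheduled job to split/merge boundaries as time moves on:

-- Partition the raw table by hour on data_date.
CREATE PARTITION FUNCTION pf_hourly (datetime2)
    AS RANGE RIGHT FOR VALUES ('2015-04-15T00:00', '2015-04-15T01:00', '2015-04-15T02:00');

CREATE PARTITION SCHEME ps_hourly
    AS PARTITION pf_hourly ALL TO ([PRIMARY]);

CREATE TABLE dbo.ImportData
(
    unit_id   int       NOT NULL,
    data_date datetime2 NOT NULL,
    value     float     NOT NULL
) ON ps_hourly (data_date);

-- An empty staging table with the same structure, on the same filegroup.
CREATE TABLE dbo.ImportData_stage
(
    unit_id   int       NOT NULL,
    data_date datetime2 NOT NULL,
    value     float     NOT NULL
) ON [PRIMARY];

-- Removing a stale hour becomes a metadata operation instead of a row-by-row delete:
ALTER TABLE dbo.ImportData SWITCH PARTITION 1 TO dbo.ImportData_stage;
TRUNCATE TABLE dbo.ImportData_stage;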
Kristen
Test
22859 Posts |
Posted - 2015-04-17 : 06:54:59
quote: Originally posted by Jaypoc
When you say one time shrink, how often should this be checked?

If you change your design from storing "several days' data" to "one or two days", SQL will NOT shrink the file by itself. SQL manages the free space WITHIN the file, and because repeated shrink/grow would fragment the file, SQL hangs onto it instead. Thus you may, now, have lots of free space in the file. You can do a one-time shrink of the file to bring its size down to something more appropriate.

Yes, you should not do this often, and yes, you should not "schedule" it - the advice you have about fragmentation is correct. If you are bothered about that, then you can reorganise the file (during scheduled maintenance) so that it is contiguous on the disk. We do do that occasionally (e.g. if we have scheduled downtime for some other reason).

If there is no free space in your data file (and you have not already shrunk it), AND given that you have changed your design to store much less data than before, then I would guess that your delete is not working, or is not deleting everything that it could / that you think it is deleting. Check what the MIN and MAX date in the table is, or do a SELECT COUNT(*) of the number of rows to see if it is "reasonable".

Similar thing with the Log/LDF file. If you have done a single huge DELETE (in the past), that will have created a single [huge] transaction in the Log/LDF file. SQL will have extended the Log/LDF file to be as big as was necessary to store the whole of the delete (it does this, even though you are using the SIMPLE recovery model, in case of a server crash or power cut, so it can roll back the half-completed delete when the server then restarts). When the transaction finishes, the used part of the Log will be marked as "available" again, so the next one will reuse the same space (because you are using the SIMPLE recovery model; if you were using the FULL recovery model, then that space would ONLY be flagged as available WHEN you backed up the transaction log).

If you NOW change the delete to be done in smaller batches, then the space used in the Log will be smaller, and after EACH/EVERY batch the log space will be marked as available again and reused (e.g. for the next batch), so the process will need MUCH less log space. You could, therefore, do a ONE TIME shrink of the log file to set it to a smaller size and reclaim the space that SQL has reserved in the past (for the biggest delete you have ever done to date). SQL will automatically grow the files if they are too small, but as you rightly say, repeated shrink/re-grow will fragment the files. ONE TIME is fine though, in particular after a significant change of design (as you have done) or after an accidental one-off massive transaction that went haywire.

The other issue you will have is that your index(es) is probably growing at only one end. So when you delete stale data, that marks large parts of the index as "available", but you then add new rows at the far end [of the index], so that new blocks have to be added to the index instead of reusing the pages lower down the index. You can either rebuild the index (which will reclaim the space), or you could drop the index (before your import) and then recreate a nice, shiny, brand new index after the import. It CAN be more efficient to Drop, Import, Recreate, but not always. However, it is best NOT to do that with the CLUSTERED index.

I think the first thing you need to do is find out what the free space is INSIDE the Data/MDF file AND the Log/LDF file (although the space in the Log file won't tell you very much, as it will currently be empty, but we could perhaps take a guess at whether its size is "excessively large" or not).
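A quick sanity check along the lines suggested above, using the same illustrative table/column names as the earlier sketches:

SELECT MIN(data_date) AS oldest_row,
       MAX(data_date) AS newest_row,
       COUNT(*)       AS total_rows
FROM dbo.ImportData;   -- should span no more than ~48 hours if the delete is keeping up

-- And, if a file does turn out to hold a lot of free space, a one-time shrink
-- (logical file name and target size in MB are placeholders):
-- DBCC SHRINKFILE (N'YourDb_log', 1024);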
Jaypoc
Yak Posting Veteran
58 Posts |
Posted - 2015-04-17 : 15:10:03
Thank you. This explains the deletion scenario better. Someone had suggested truncating the table and rolling the contents between two tables instead of just deleting from and reusing the same table. Does truncating have the same effect, or does it clear all the space in the table?
Kristen
Test
22859 Posts |
Posted - 2015-04-18 : 02:24:38
TRUNCATE empties the whole table. You can't use it on a table that has Foreign Keys (probably not an issue for your situation). TRUNCATE just logs the instruction to "truncate the table", whereas DELETE logs all the records that it deletes.

You could, indeed, have two tables IMPORT1 and IMPORT2 (etc.) and use IMPORT1 today and IMPORT2 tomorrow, and then TRUNCATE and reuse IMPORT1 the next day.

I presume your reports are on all the data you have stored? But if they are on various "parts" of the data (e.g. by date/time range), you could use one table per date/time interval and then use a horizontal view (a VIEW with a UNION on multiple tables), such that SQL will be smart enough to only query the tables that contain data for the date range of the query. Might be overkill for what you are doing, but if you query only part of the data each time, it might make a significant performance difference.

TRUNCATE is much the same as DROP / CREATE table - except that you don't have to also recreate indexes (which requires "knowing" what indexes exist, and modifying the code if a new index needs to be added, etc.). However, TRUNCATE will still give you an index that is out of shape for the records that you then import, so you still have the "blank pages in index" issue to consider (assuming that you have an index where everything is added "all at one end" - that would include things like Client/Invoice indexes - the Client Codes would be nicely spread over the index, but the Invoice Numbers would be higher than any seen previously, so would be "all at one end" for each individual client and thus might cause a lot of new page inserts in the index). DROP / CREATE (instead of TRUNCATE or DELETE) would remove any index blank-page issue (as would rebuilding the indexes periodically, or just DROP/reCREATE of the INDEX).

So you could do:

DROP INDEX
TRUNCATE TABLE
IMPORT DATA
reCREATE INDEX

and you could have multiple tables, one per day or one per import batch.
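Spelled out as a sketch, with the same illustrative object names as before (the bcp command line is an assumption about how the feed arrives):

-- Hourly rotation along the lines outlined above:
DROP INDEX IX_ImportData_unit ON dbo.ImportData;

TRUNCATE TABLE dbo.ImportData;

-- Reload the raw feed (run from a command prompt or Agent job step), e.g.:
--   bcp YourDb.dbo.ImportData in C:\feeds\latest.dat -c -T -S YourServer

CREATE INDEX IX_ImportData_unit ON dbo.ImportData (unit_id);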