tim999
Starting Member
4 Posts
Posted - 2007-10-31 : 06:12:08
Hi everyone,

I've been working on a large data extract and was wondering if anyone could take a look at the following code and offer any reasons why it takes so long compared to when I had it separated out into steps with GO statements in a scheduled job. It needs to:

1) delete all data in the local tables
2) import all data from the external source back into the local tables
3) rebuild the indexes on the local tables

Apparently GO statements are not supposed to be used, so I've gone for the transaction approach (which makes sense too: either it all happens or none of it happens):

BEGIN TRANSACTION

TRUNCATE TABLE arista_caclient
TRUNCATE TABLE arista_camatgrp
TRUNCATE TABLE arista_camatter
TRUNCATE TABLE arista_cabilhis
TRUNCATE TABLE arista_cablaloc
TRUNCATE TABLE arista_cafintrn
TRUNCATE TABLE arista_cadescrp
TRUNCATE TABLE arista_catimtrn
TRUNCATE TABLE arista_cafeextn
TRUNCATE TABLE arista_fmsaddr
TRUNCATE TABLE arista_cabilloi
TRUNCATE TABLE arista_caferate

INSERT INTO arista_caclient SELECT * FROM OPENQUERY(arista_ODBClink, 'SELECT * FROM caclient WHERE cl_datopn>=''01/01/1900'' OR cl_datopn is null')
INSERT INTO arista_camatgrp SELECT * FROM OPENQUERY(arista_ODBClink, 'SELECT * FROM camatgrp WHERE (mg_datcls>=''01/01/1900'' OR mg_datcls is null) AND (mg_datopn>=''01/01/1900'' OR mg_datopn is null)')
INSERT INTO arista_camatter SELECT * FROM OPENQUERY(arista_ODBClink, 'SELECT * FROM camatter WHERE (mt_estcmp>=''01/01/1900'' OR mt_estcmp is null)')
INSERT INTO arista_cabilhis SELECT * FROM OPENQUERY(arista_ODBClink, 'SELECT * FROM cabilhis WHERE (bh_bildat>=''01/01/1900'' OR bh_bildat is null) AND (bh_laspay>=''01/01/1900'' OR bh_laspay is null) AND (bh_rundat>=''01/01/1900'' OR bh_rundat is null) AND (bh_remdat>=''01/01/1900'' OR bh_remdat is null)')
INSERT INTO arista_cablaloc SELECT * FROM OPENQUERY(arista_ODBClink, 'SELECT * FROM cablaloc')
INSERT INTO arista_cafintrn SELECT * FROM OPENQUERY(arista_ODBClink, 'SELECT * FROM cafintrn WHERE (tr_trdate>=''01/01/1900'' OR tr_trdate is null)')
INSERT INTO arista_cadescrp SELECT * FROM OPENQUERY(arista_ODBClink, 'SELECT * FROM cadescrp')
INSERT INTO arista_catimtrn SELECT * FROM OPENQUERY(arista_ODBClink, 'SELECT * FROM catimtrn WHERE (tt_trndat>=''01/01/1900'' OR tt_trndat is null)')
INSERT INTO arista_cafeextn SELECT * FROM OPENQUERY(arista_ODBClink, 'SELECT * FROM cafeextn')
INSERT INTO arista_fmsaddr SELECT * FROM OPENQUERY(arista_ODBClink, 'SELECT * FROM fmsaddr')
INSERT INTO arista_cabilloi SELECT * FROM OPENQUERY(arista_ODBClink, 'SELECT * FROM cabilloi')
INSERT INTO arista_caferate SELECT * FROM OPENQUERY(arista_ODBClink, 'SELECT * FROM caferate')

DBCC DBREINDEX (arista_caclient)
DBCC DBREINDEX (arista_camatgrp)
DBCC DBREINDEX (arista_camatter)
DBCC DBREINDEX (arista_cablaloc)
DBCC DBREINDEX (arista_catimtrn)
DBCC DBREINDEX (arista_cafintrn)
DBCC DBREINDEX (arista_cadescrp)
DBCC DBREINDEX (arista_cafeextn)
DBCC DBREINDEX (arista_fmsaddr)
DBCC DBREINDEX (arista_cabilloi)
DBCC DBREINDEX (arista_caferate)

IF @@ERROR <> 0
    ROLLBACK TRANSACTION
COMMIT TRANSACTION

Ideally I would like to insert a line into a separate table if it succeeds, or a line into another table if it fails. Would it be best to place that INSERT command before the ROLLBACK and COMMIT statements? And as a whole, is the statement structured correctly?

Many thanks!
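To illustrate the logging part of the question, with a made-up extract_log table (name and columns are purely for illustration), something along these lines:

IF @@ERROR <> 0
BEGIN
    ROLLBACK TRANSACTION
    -- the failure row goes after the rollback so that it isn't undone along with everything else
    INSERT INTO extract_log (run_date, outcome) VALUES (GETDATE(), 'failed')
END
ELSE
BEGIN
    INSERT INTO extract_log (run_date, outcome) VALUES (GETDATE(), 'succeeded')
    COMMIT TRANSACTION
END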
AndrewMurphy
Master Smack Fu Yak Hacker
2916 Posts
Posted - 2007-10-31 : 07:25:15
1. Note that the @@ERROR variable only reports on the LAST SQL statement executed. For proper error handling you will need an @@ERROR test after EVERY (important) SQL statement. @@ERROR may also NOT report any error for the DBCC operations (it might only work for true T-SQL commands).

2. The "SELECT * FROM xxx" coding technique can be bettered by going for "SELECT x1, x2, x3, etc FROM xxx"... because if table/view xxx is ever amended (extra column, dropped column, re-ordered column, etc.) then this coding will fail. Being specific with the columns you need is "recommended practice"... not that I adhere to it myself 100% of the time, but at least I'm aware of the downsides.

3. How large is "large"? You may run into issues with your use of the TEMPDB database. Will this data extract happen often (during busy periods)? TEMPDB will experience locking issues with "SELECT * INTO"...
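For illustration, a minimal sketch of that per-statement test (the GOTO label and the @err variable are just one way of arranging it, not the only way):

DECLARE @err int

BEGIN TRANSACTION

TRUNCATE TABLE arista_caclient
SET @err = @@ERROR
IF @err <> 0 GOTO failure

INSERT INTO arista_caclient
SELECT * FROM OPENQUERY(arista_ODBClink, 'SELECT * FROM caclient WHERE cl_datopn>=''01/01/1900'' OR cl_datopn is null')
SET @err = @@ERROR
IF @err <> 0 GOTO failure

-- ... repeat the SET @err / IF test after every remaining statement ...

COMMIT TRANSACTION
RETURN

failure:
ROLLBACK TRANSACTION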
tim999
Starting Member
4 Posts
Posted - 2007-10-31 : 08:50:05
1. OK, that makes sense. I guess I need to make sure this bit is done correctly after each insert. I'm not too worried about the indexes, as it's the accuracy of the data which is most important. I need to ensure there is an 'all or nothing' approach here, so these error statements will need to be added. Would a stored procedure be better than simply entering T-SQL into the SQL Agent step?

2. The underlying database is on Unix and is very old, with no major changes to the core tables in many, many years. As there are so many fields, if I ensure that the above error coding is correct then I will get alerted as soon as a problem occurs. Many thanks for pointing this out though.

3. The data is only used in views, which in turn are viewed via ASP pages on the intranet. I've ensured that the statement

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED

is used in all ASP queries, which should prevent any locking. The update occurs at 6am and involves millions of transactions on some of the larger tables. It takes around 45 minutes to complete the procedure.

A huge thank you for this advice - is there anything else I could do to improve?
AndrewMurphy
Master Smack Fu Yak Hacker
2916 Posts
Posted - 2007-10-31 : 10:32:03
Sometimes... it works better to drop all indices, load the data, then add back the indices. BUT only volume testing would prove this approach.

You may also want to (re)view your transaction logging approach. Would it be faster to re-load the data than to log the data and rewind? Note that the TRUNCATE command is a (partially) UNLOGGED transactional event, so I'm not sure if there's any point in it being inside your BEGIN/COMMIT.

You may also wish to investigate using the BCP utility... for data loads it SHOULD be faster than plain vanilla "insert into"s... not sure if it works with OPENQUERY.

Qualifying tables with the "owner" is also good practice (from a security + speed viewpoint)... but probably not a performance win in this instance.

What's the perceived value of the DBREINDEX if the data is "fresh reloaded" each day?
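For example, a sketch of the drop-then-recreate pattern for one table (the index name and column are placeholders, since I don't know your real index definitions):

-- drop the index before the load (placeholder name/column)
DROP INDEX arista_caclient.ix_caclient_datopn

-- ... truncate and re-load arista_caclient as before ...

-- recreate the index once the data is in
CREATE INDEX ix_caclient_datopn ON arista_caclient (cl_datopn)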
nr
SQLTeam MVY
12543 Posts
Posted - 2007-10-31 : 10:47:35
If it's a large transfer you don't want to be dealing with transactions.

bcp the data out into a file using native format
create a new table
bulk insert the data in
create the indexes
drop the production table
rename the new table

That should be the fastest method. It would be better to do it into a new database if your system allows this - then you can flip the databases and carry on using the old one if there is a problem. You should find this orders of magnitude faster than what you are trying.

==========================================
Cursors are useful if you don't know sql.
DTS can be used in a similar way.
Beer is not cold and it isn't fizzy.
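As a rough sketch of those steps for one table (the staging table name, file path and index are placeholders, and this assumes the source can actually be reached with bcp, which may not be possible over an ODBC-only link):

-- export in native format (run from the command line, against a SQL Server source)
bcp SourceDb.dbo.caclient out C:\extract\caclient.dat -n -S sourceserver -T

-- load into a freshly created, index-free copy of the table
BULK INSERT arista_caclient_new
FROM 'C:\extract\caclient.dat'
WITH (DATAFILETYPE = 'native', TABLOCK)

-- build the indexes on the new table, then swap it in
CREATE INDEX ix_caclient_datopn ON arista_caclient_new (cl_datopn)
DROP TABLE arista_caclient
EXEC sp_rename 'arista_caclient_new', 'arista_caclient'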
tim999
Starting Member
4 Posts
Posted - 2007-10-31 : 10:53:41
quote: Sometimes... it works better to drop all indices, load the data, then add back the indices. BUT only volume testing would prove this approach.

I did try this approach initially, but as my scheduled job was bombing out randomly and there are lots of indexes, I thought this was more reliable with regard to coding errors and would allow me to add other indexes without needing to revisit the code.

quote: You may also want to (re)view your transaction logging approach. Would it be faster to re-load the data than to log the data and rewind? Note that the TRUNCATE command is a (partially) UNLOGGED transactional event, so I'm not sure if there's any point in it being inside your BEGIN/COMMIT.

I've set the logging for this database to the bulk-logged recovery model. The reason the truncate command is in there is because I either want all the data dropped and refreshed, or nothing done at all. Does this mean that using truncate in this respect won't help me?

quote: You may also wish to investigate using the BCP utility... for data loads it SHOULD be faster than plain vanilla "insert into"s... not sure if it works with OPENQUERY.

Thanks for this - I will take a look, although I'm fairly sure I needed to use the OPENQUERY syntax to get this to work.

quote: Qualifying tables with the "owner" is also good practice (from a security + speed viewpoint)... but probably not a performance win in this instance.

Will take a look at this. It's something I'd probably do usually, but in this case I'm connecting to a SCO Unix database, so the dated ODBC driver doesn't seem to like anything in any of the tutorials I've been reading.

quote: What's the perceived value of the DBREINDEX if the data is "fresh reloaded" each day?

I assumed that by doing such a large bulk insert the indexes would need rebuilding - is this not the case? No other data is inserted during the day - it's literally once in the morning. Do indexes rebuild themselves automatically following a fresh import?

Cheers for all your help and suggestions - it is much appreciated!
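(For reference, switching and checking that recovery model looks roughly like this; the database name below is just a placeholder.)

-- bulk operations are minimally logged under the bulk-logged model
ALTER DATABASE arista_extract SET RECOVERY BULK_LOGGED

-- confirm the current setting
SELECT DATABASEPROPERTYEX('arista_extract', 'Recovery')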
AndrewMurphy
Master Smack Fu Yak Hacker
2916 Posts
Posted - 2007-11-01 : 04:56:48
Three other factors in the "speed" calculation to remember...

a) the quality of the ODBC driver across to your other server/datasource
b) the quality of your other server... can it pipe the data out fast enough for you?
c) the quality of the pipe between the 2 machines... what's the traffic speed?

To date we've concentrated on ensuring the "data load" process is optimal... what about the underlying equipment - maybe it's the problem?
Kristen
Test
22859 Posts
Posted - 2007-11-01 : 06:43:40
My $0.02 worth ...

Don't switch to BCP unless the run-time is unacceptable. IMHO broken flat files are far harder to debug than SQL pulling from another source.

Drop the indexes and re-create them afterwards. You are currently doing all the index work twice - once at each row insert, and then you are throwing all that away and making a new index with DBREINDEX.

"... to add other indexes without needing to revisit the code"

True, but you could add alterations to indexes in, say, an SProc that is called at the end of this job. The code is centralised, and will automatically be triggered on the next run. If you think there is a risk that someone will just right-click-create an index, then put a step in the SProc that first checks the indexes that are present and complains if there are any it isn't aware of.

If you can put an ORDER BY in the OPENQUERY so that rows arrive in Clustered Index order [i.e. of the receiving database], that will ensure that the table is built efficiently.

I would do this differently, but based on what you have (and assuming sufficient disk space) I would:

1) Import into new, temporary tables (with only the Clustered Index)
2) Drop the existing tables
3) Rename the tables
4) Create the secondary indexes

A failure during (1) will mean that you still have yesterday's data.

My preference would be: use OPENQUERY to get the data into a #TEMP table, then selectively update rows that are different in the main tables (and insert new ones, delete ones no longer represented). This depends a little bit on how much of the data changes every day. Assuming it's a few percent, this should be less log-intensive.

"SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED"

I strongly recommend that you only do this if all actions on this database are read-only from the ASP application. If there are any writes to the database then this will cause rare errors which will be impossible to reproduce and thus very hard to debug.

"WHERE cl_datopn>=''01/01/1900'' OR cl_datopn is null"

Are there some dates earlier than 1900? Or is this "all" the data?

"The "SELECT * from xxx" coding technique ..."

Personally I'm relaxed about SELECT * in this scenario. The Optimiser has only got to parse it once a day, and it has the nice side effect of causing the whole thing to go T.U. if any new columns get added.

Kristen
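A sketch of that selective-update idea for one table, assuming (purely for illustration) that cl_code is the key of caclient; only the delete and insert halves are shown, and changed rows would need a column-by-column comparison as well:

-- pull the current source data into a temp table
SELECT * INTO #caclient
FROM OPENQUERY(arista_ODBClink, 'SELECT * FROM caclient')

-- remove local rows that no longer exist at the source (cl_code is an assumed key)
DELETE FROM arista_caclient
WHERE cl_code NOT IN (SELECT cl_code FROM #caclient)

-- add rows that are new at the source
INSERT INTO arista_caclient
SELECT t.*
FROM #caclient t
WHERE t.cl_code NOT IN (SELECT cl_code FROM arista_caclient)

-- (an UPDATE joining #caclient to arista_caclient on cl_code would handle the changed rows)

DROP TABLE #caclient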
pootle_flump
1064 Posts
Posted - 2007-11-01 : 06:55:25
Kristen - I like the idea of importing to a table, dropping the main one and renaming.

quote: Originally posted by Kristen
If you can put an ORDER BY in the OPENQUERY so that rows arrive in Clustered Index order [i.e. of the receiving database], that will ensure that the table is built efficiently.

We covered this on dbforums recently. According to MS, the ORDER BY clause in an insert has no effect at all on the physical order of data insertion. I thought it did (and still think it should) but apparently it doesn't. It does affect the logical order though (so you can use it to get an identity column to "rank" your data as you like).

OP - if you do use BCP and the order of the data in the files matches the clustered index, then you can specify this and it will speed up the inserts. If your clustered index is monotonically increasing then it should speed things up further.

http://www.sqlskills.com/blogs/kimberly/2007/02/02/TheClusteredIndexDebateContinues.aspx
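For what it's worth, a hedged example of that hint, assuming a native-format file already sorted on a hypothetical clustered key called cl_code:

BULK INSERT arista_caclient
FROM 'C:\extract\caclient.dat'
WITH (DATAFILETYPE = 'native', ORDER (cl_code ASC), TABLOCK)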
Kristen
Test
22859 Posts
Posted - 2007-11-01 : 07:41:58
"Kristen - I like the idea of importing to a table, dropping the main one and rename."Probably better you thank Nigel instead then ... he suggested it first!However, I would make the indexes AFTER dropping the original table, Nigel had them before, because I think that there will be naming conflicts otherwise. However, that leaves the table with no indexes for a bit, which will obviously be bad for performance, so if it is possible to have same-index-name on different tables I will have to eat Humble Pie (again!)"According to MS the order by clause in an insert has no affect at all on the physical order of data insertion."Hmmm ... pity, because it would prevent index page splits. There is a HINT for Bulk Insert that helps with that, but I can;t see how to take advantage of that here.Absolutely NO sense having an ORDER BY in the OPENQUERY if it can't be used, because it is likely to make the Remote server have to do more work ..."If your clustered index is monotonically increasing then it should increase things further."I'm obviously being thick ... Why's that then? If its pre-sorted in Clustered Index (and the appropriate hint given) then my hope would be that there will be no split-index-pagesKristen |
tim999
Starting Member
4 Posts
Posted - 2007-11-05 : 05:22:47
quote: Drop the indexes and re-create them afterwards. You are currently doing all the index work twice - once at each row insert, and then you are throwing all that away and making a new index with DBREINDEX. "... to add other indexes without needing to revisit the code"

Can I just clarify something - as there are no changes/additions/deletions made to the data following the bulk import in the morning, why do I need to reindex at all if it happens automatically at each insert? If it happens after each row insert, and no further rows are inserted during the day, would that negate having to rebuild / drop and recreate the indexes? This is a fundamental concept that I'm not too sure on and just need clarifying.

Thanks again for all the discussion and feedback - an interesting read and a few things I can build upon.
nr
SQLTeam MVY
12543 Posts
Posted - 2007-11-05 : 06:04:01
When you add rows the indexes are updated. This is done in a way that minimises the time for each insert, which means page splits and fragmentation. Also, the statistics will not be updated until a threshold is reached, and only if automatic updating is turned on. Reindexing cures all that.

Dropping the indexes first means it doesn't have to maintain them and doesn't have to recreate the statistics. It also means (and this is the big win) that the bulk insert will do a fast load and log the extents rather than each row, if the database isn't using the full recovery model (and a few other things).

==========================================
Cursors are useful if you don't know sql.
DTS can be used in a similar way.
Beer is not cold and it isn't fizzy.
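To see the effects being described, a before-and-after check along these lines (SQL Server 2000 era syntax, and the table name is just an example) shows the fragmentation and lets you refresh the statistics explicitly:

-- scan density and logical fragmentation for one table
DBCC SHOWCONTIG ('arista_caclient')

-- refresh the statistics after the load if auto-update hasn't kicked in yet
UPDATE STATISTICS arista_caclient WITH FULLSCAN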
Kristen
Test
22859 Posts
Posted - 2007-11-05 : 06:33:39
"If it happens after each row insert ..."it does, and that's so that ordinary applications can do an "add a record" type process, and the indexes will be up to date.But its horribly inefficient for a bulk import, as Nigel has described, so you would be better off to have no indexes in place during the import, and then (re-)create them, as the create process can produce optimal indexes, and create them more quickly that the collective time taken by the one-by-one insert method.Kristen |