Author |
Topic |
malachi151
Posting Yak Master
152 Posts |
Posted - 2012-02-16 : 06:54:18
|
So I'm looking into the best way to implement replication from our production environment to our internal data warehouse.I've never setup replication before, but for the life of me, having looked into this,the first question that jumps out is, why can't I just create a Job that runs T-SQL MERGE statements using linked servers?I'd want to start with a snapshot of course, but after that we are looking for a nightly update of the warehouse with the transactional data. Also, and this may be a factor as well, the production server is SQL 2005, but the data warehouse server is SQL 2008 R2. |
|
robvolk
Most Valuable Yak
15732 Posts |
Posted - 2012-02-16 : 07:56:29
|
quote: why can't I just create a Job that runs T-SQL MERGE statements using linked servers?
You can. Be sure to watch the glaciers as they fly by while it runs. The last things you want to do over linked servers are JOINs, MERGE, or UNION/INTERSECT/EXCEPT. The underlying architecture of linked servers will convert these operations to cursors of the worst kind. You would quite literally be better off copying all the production data over to the data warehouse server and processing the MERGE locally.The huge benefit of replication for your scenario is that it will only copy changes. MERGE would have to evaluate all rows in both sets.If you don't want to deal with replication and its limitations, you can log ship your production database to your data warehouse server in standby mode, and then perform a MERGE. It's still likely that replication will be more efficient overall, but it's another option. |
|
|
malachi151
Posting Yak Master
152 Posts |
Posted - 2012-02-16 : 08:03:08
|
quote: Originally posted by robvolk
quote: why can't I just create a Job that runs T-SQL MERGE statements using linked servers?
You can. Be sure to watch the glaciers as they fly by while it runs. The last things you want to do over linked servers are JOINs, MERGE, or UNION/INTERSECT/EXCEPT. The underlying architecture of linked servers will convert these operations to cursors of the worst kind. You would quite literally be better off copying all the production data over to the data warehouse server and processing the MERGE locally.The huge benefit of replication for your scenario is that it will only copy changes. MERGE would have to evaluate all rows in both sets.If you don't want to deal with replication and its limitations, you can log ship your production database to your data warehouse server in standby mode, and then perform a MERGE. It's still likely that replication will be more efficient overall, but it's another option.
Thanks, this certainly helps get me pointed in the right direction.So essentially, MERGE should only be used between tables on the same instance, got it... |
|
|
robvolk
Most Valuable Yak
15732 Posts |
Posted - 2012-02-16 : 08:11:55
|
I should clarify something. MERGE can be efficient if you have a datetime or identity column that can be used in the WHERE clause to limit the rows being processed. You'd have to track the last time the warehouse was updated. If you already have a trigger that updates a ModifiedDate column or similar, that would be the best option. But if you don't then it's not worth adding just to perform a MERGE. |
|
|
malachi151
Posting Yak Master
152 Posts |
Posted - 2012-02-16 : 10:52:34
|
quote: Originally posted by robvolk I should clarify something. MERGE can be efficient if you have a datetime or identity column that can be used in the WHERE clause to limit the rows being processed. You'd have to track the last time the warehouse was updated. If you already have a trigger that updates a ModifiedDate column or similar, that would be the best option. But if you don't then it's not worth adding just to perform a MERGE.
The publishing database is basically insert only, no updates, and the tables all have a sequential primary key, so a MERGE would be pretty simple. All it would have to do is INERT records WHEN NOT MATCHED BY TARGET, that's pretty much it... |
|
|
russell
Pyro-ma-ni-yak
5072 Posts |
Posted - 2012-02-16 : 11:11:25
|
Replication was invented for this and is going to almost certainly be the most efficient way. |
|
|
malachi151
Posting Yak Master
152 Posts |
Posted - 2012-02-16 : 11:42:21
|
quote: Originally posted by russell Replication was invented for this and is going to almost certainly be the most efficient way.
Yeah, the only complicating factor here is that we have multiple separate databases with identical schema's (each client has their own database), and there is no unique id in the tables across databases, i.e. there are overlapping primary keys.We want to pull the data from each client into a single data warehouse, but in order to do that we will have to append a client id column to the data. I figured this would be easy using T-SQL MERGE, but I'm not sure if this will complicate things using Replication.I know you can replicate from multiple sources into a single source, but how does that work if there are overlapping primary keys? |
|
|
russell
Pyro-ma-ni-yak
5072 Posts |
Posted - 2012-02-16 : 14:03:15
|
You would need to manage the primary keys so they don't collide. Or add a column to each table indicating which server "owns" the data, to "uniqueify" it.But for a data warehouse, I'd say neither replication nor merge are the right solution. Instead, why not a nightly ETL? |
|
|
malachi151
Posting Yak Master
152 Posts |
Posted - 2012-02-16 : 16:30:55
|
quote: Originally posted by russell You would need to manage the primary keys so they don't collide. Or add a column to each table indicating which server "owns" the data, to "uniqueify" it.But for a data warehouse, I'd say neither replication nor merge are the right solution. Instead, why not a nightly ETL?
Looks like things are turning in the SSIS direction. I've only used SSIS for manually run ETL processes, running them through the BIDS, but I assume running them from SQL Agent isn't a big problem... |
|
|
russell
Pyro-ma-ni-yak
5072 Posts |
Posted - 2012-02-17 : 07:15:07
|
Not a big problem at all. In fact, that's the normal (meaning most common) way they are run. |
|
|
|