MEOB
Starting Member
5 Posts
Posted - 2009-02-17 : 08:53:51
Hello everyone,

I have inherited a transactional replication setup which is run to provide a failover solution for a live application. The application performs a data load every morning, and I have been asked to suspend replication while the load completes so that any problems with the data load will not be replicated to the failover database, giving us an end-of-day failover position.

I have accomplished this by stopping the Distribution Agent before the load starts and restarting it once the load is complete; however, this creates about 40 minutes of latency while the distributed transactions created during the data load are applied to the subscriber. To avoid this I have tried creating a new snapshot and reinitialising the subscription before starting the Distribution Agent, but the result is the same: the distributed transactions are applied to the subscriber before the snapshot is delivered.

I know that the initial snapshot (upon creating the replication setup) takes about 6 minutes to be delivered, as does reinitialising the subscriber from a new snapshot when there are no transactions waiting to be applied.

So my question is this: is there any way to completely stop replication while I perform the data loads and then restart replication from that point only (with a new snapshot), without completely removing replication? I need this to be a scripted process that I can run from a SQL job.

I am not a DBA, so please forgive me if I am missing the obvious or have my facts wrong.

Thanks in advance.

MEOB
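The stop/start part of this process can be scripted against the replication agent jobs in msdb. A minimal sketch, assuming a push subscription whose Distribution Agent job follows the usual Server-Database-Publication-Subscriber naming (the job name here is a placeholder):

USE msdb
GO
-- Before the load starts, stop the Distribution Agent job:
EXEC dbo.sp_stop_job @job_name = N'MYSERVER-MyDatabase-MyPublication-MYSUBSCRIBER-1'
GO
-- After the load completes successfully, restart it:
EXEC dbo.sp_start_job @job_name = N'MYSERVER-MyDatabase-MyPublication-MYSUBSCRIBER-1'
GO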
sodeep
Master Smack Fu Yak Hacker
7174 Posts
Posted - 2009-02-17 : 09:07:59
quote: Originally posted by MEOB ... Is there any way to completely stop replication while I perform the data loads and then restart replication from that point only (with a new snapshot), without completely removing replication? ...
Well, what you are doing doesn't quite make sense to me. First of all, why are you stopping transactional replication before the loads and starting it later? Transactional replication works on a per-transaction basis and maintains the ACID rules. Also, tons of undistributed commands pile up waiting to be applied to the subscribers, so there will be a performance lag. A new snapshot will also replicate whatever you chose for articles. Maybe you can elaborate on the business requirements so we can help you better?
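The size of that backlog can be measured at the distributor. A minimal sketch using sp_replmonitorsubscriptionpendingcmds (available from SQL Server 2005 SP1; all names are placeholders):

USE distribution
GO
EXEC dbo.sp_replmonitorsubscriptionpendingcmds
    @publisher         = N'MYPUBLISHER',
    @publisher_db      = N'MyDatabase',
    @publication       = N'MyPublication',
    @subscriber        = N'MYSUBSCRIBER',
    @subscriber_db     = N'MyDatabase',
    @subscription_type = 0   -- 0 = push subscription
GO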
MEOB
Starting Member
5 Posts
Posted - 2009-02-17 : 09:24:54
Thanks for the quick response.

The reason for stopping replication is that it is possible for the loads to run and effectively trash the data. This can be detected by the load process, but only after the damage is done, necessitating a reload with correct data. The customer requires a failover position that is immediately available in this scenario, using the pre-load data.

Once the load completes without errors we know we are in the clear and can continue replication, but at this point there are, as you say, tons of undistributed commands piled up to be applied to the subscriber, and it takes about 40 minutes to catch up. I am looking for a solution that allows me to disregard the undistributed transactions and start replicating again from this point.

I know that if I drop replication, run the data loads, then re-create replication, I can be synchronised within 5-6 minutes. I am looking for a scriptable way of achieving the same result without actually dropping replication, as there are security implications with scripting replication creation.

The design decisions were made before I came on board, so I have to work with what I have been given. That said, if I can present a convincing argument for changing the approach then I can probably get this done, so I would welcome any alternatives. As I said before, I am not a DBA, so forgive me if I am overlooking the obvious.
sodeep
Master Smack Fu Yak Hacker
7174 Posts
Posted - 2009-02-17 : 10:54:49
quote: Originally posted by MEOB ... I am looking for a solution that allows me to disregard the undistributed transactions and start replicating again from this point. ...

You have to generate a new snapshot for it.
You could also use a database backup/restore at end of day, or a log shipping approach, if you don't want it transaction-based.
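A minimal sketch of the end-of-day backup/restore idea (paths and names are placeholders; restoring over the failover copy needs exclusive access to that database):

-- On the publisher, after a clean load:
BACKUP DATABASE MyDatabase
    TO DISK = N'\\FAILOVERSERVER\Backups\MyDatabase.bak'
    WITH INIT
GO
-- On the failover server:
RESTORE DATABASE MyDatabase
    FROM DISK = N'\\FAILOVERSERVER\Backups\MyDatabase.bak'
    WITH REPLACE, RECOVERY
GO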
maspbury
Starting Member
7 Posts
Posted - 2009-02-17 : 10:55:24
We're working on a solution for nearly this exact scenario at this very moment. In fact, I had to check with my co-worker to make sure that he wasn't the one who posted this question.

We have a few publications in our database and only one of them is difficult to recover from data integrity issues. This publication has only one article. We're planning on moving that article into its own filegroup and backing up the filegroup before the load. If we have data integrity issues, we can recover that filegroup back to a point in time before the data integrity issues occurred. We'll then take a new snapshot, apply it to the subscribers, and reload the data after eliminating the root cause of the data integrity issues.

We have discussed altering the ETL process so that the transactional replication 'waits' for the results of the load and doesn't push the data out to the subscribers until it's free from issues. But we haven't figured that out yet, nor am I sure that we will, since our recovery process at the publisher will require that we push out a new snapshot to the subscribers.

My only source for this has been BOL. There are some good articles on recovery of databases involved in replication.

Michelle Aspbury
Senior Database Engineer
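A sketch of the pre-load filegroup backup described above, assuming the article's table has already been moved to a filegroup called LoadFG (names and paths are placeholders):

-- Before each load:
BACKUP DATABASE MyDatabase
    FILEGROUP = N'LoadFG'
    TO DISK = N'D:\Backups\MyDatabase_LoadFG.bak'
    WITH INIT
GO
-- Plus a log backup to 'cover' the filegroup backup:
BACKUP LOG MyDatabase
    TO DISK = N'D:\Backups\MyDatabase_PreLoad.trn'
GO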
maspbury
Starting Member
7 Posts
Posted - 2009-02-17 : 15:12:22
This recovery path is NOT working. I'm not sure what I'm missing.

1. Full backup (think 'last night's backup').
2. ETL process runs every two hours, backing up the filegroup and then the transaction log before loading data. (May be irrelevant: scheduled log backups run every 15 minutes. Transactional replication is scheduled as well.)
3. Data issues; we want to roll back to the backups taken prior to the bad ETL process (from step 2).
4. After restoring the filegroup and log backup from step 2, I still have redo_target_lsn > redo_start_lsn. Filegroup state_desc = RESTORING.

Michelle
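What the redo_target_lsn > redo_start_lsn state is saying is that the restored filegroup is behind the rest of the database and must be rolled forward through every log backup taken since the filegroup backup before SQL Server will bring it online. A sketch of the sequence it expects (names and paths are placeholders), with the important caveat that rolling forward replays the bad ETL as well, so this cannot rewind one filegroup to an earlier point than the rest of the database:

RESTORE DATABASE MyDatabase
    FILEGROUP = N'LoadFG'
    FROM DISK = N'D:\Backups\MyDatabase_LoadFG.bak'
    WITH NORECOVERY
GO
-- Every log backup since the filegroup backup, in order:
RESTORE LOG MyDatabase
    FROM DISK = N'D:\Backups\MyDatabase_Log_0800.trn'
    WITH NORECOVERY
GO
RESTORE LOG MyDatabase
    FROM DISK = N'D:\Backups\MyDatabase_Log_1000.trn'
    WITH RECOVERY
GO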
tkizer
Almighty SQL Goddess
38200 Posts
maspbury
Starting Member
7 Posts
Posted - 2009-02-17 : 15:32:26
Thanks Tara,

I'm thinking that must be the case. I've taken this database out of the 'regular' log backup job, so that the only log backups I have are the ones that I'm deliberately taking.

This is not in production. We're still trying to walk through this before isolating the table in a separate filegroup to ease recovery. We're starting to think that it might be easier if we just put the table in its own database!

Michelle
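One way to verify that only the deliberate backups remain in play is to inspect the backup history in msdb; in an unbroken log chain, each log backup's first_lsn matches the previous one's last_lsn (the database name is a placeholder):

SELECT backup_start_date,
       type,        -- D = full, I = differential, L = log, F = file/filegroup
       first_lsn,
       last_lsn
FROM msdb.dbo.backupset
WHERE database_name = N'MyDatabase'
ORDER BY backup_start_date
GO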
MEOB
Starting Member
5 Posts
Posted - 2009-02-19 : 04:18:04
Hello,

Thanks for all the feedback. Michelle, I hope you resolve your issue.

I think the issue we are having is actually one of latency caused by the data load process. I tried running replication throughout the data load process this morning, and I am still seeing around 40 minutes of latency. The Publisher to Distributor history is full of messages like this:

The Log Reader Agent is scanning the transaction log for commands to be replicated. Approximately 36500000 log records have been scanned in pass #3, 8006921 of which were marked for replication, elapsed time 1416802930 (ms).

The numbers start low and grow to the above point. The loads run for 20 minutes; this message starts 2 minutes into the load process and doesn't finish until about 30 minutes after the load has finished. It then takes another 10 to 15 minutes for the latency to disappear. The data load process is a bit of a black box to me at this point.

Can you tell me what is happening here based on this message? Why aren't these transactions being replicated straight away? Are they uncommitted? Or are we just over-stressing the replication setup with our loads? Why does it take longer to replicate these changes than it does to make them?

I need to eliminate this latency. If I can do that, then I can accomplish what I need by just stopping the Distribution Agent while the loads run and restarting it on successful completion.
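On SQL Server 2005 and later, tracer tokens give a per-leg latency measurement (publisher to distributor, distributor to subscriber) that can confirm where the 40 minutes is going. A minimal sketch, run at the publisher in the published database (the publication name is a placeholder):

DECLARE @token int
EXEC sys.sp_posttracertoken
    @publication     = N'MyPublication',
    @tracer_token_id = @token OUTPUT
-- Once the agents have carried the token through, show the timings:
EXEC sys.sp_helptracertokenhistory
    @publication = N'MyPublication',
    @tracer_id   = @token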
tkizer
Almighty SQL Goddess
38200 Posts
sodeep
Master Smack Fu Yak Hacker
7174 Posts
Posted - 2009-02-19 : 11:48:55
How does your load process work? Is it a single transaction or done in batches? Have you tracked the network transfer rate with Performance Monitor?
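The batching question matters because the Log Reader only harvests a transaction once it has committed; one huge DELETE or UPDATE holds everything back until it finishes. A sketch of a batched alternative (the table name and batch size are illustrative):

DECLARE @rows int
SET @rows = 1
WHILE @rows > 0
BEGIN
    -- Each batch commits separately, so replication can drain as it goes:
    DELETE TOP (10000) FROM dbo.StockListing
    SET @rows = @@ROWCOUNT
END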
maspbury
Starting Member
7 Posts
Posted - 2009-02-19 : 14:47:44
Forget about replication for a moment... I just don't think that the recovery path will work. We're using SQL LiteSpeed, but I tried the following with both LiteSpeed and native commands. The messages returned were slightly different, but the results were essentially the same.

I think that I am trying to recover the database to a transactionally inconsistent state (by recovering just the filegroup backup, which includes just one table). Basically, SQL Server won't bring that filegroup online until I've restored all of the transaction logs (including the one that has the 'bad updates' in it). You can't restore to a point in time or a MARKED transaction without starting with a restore of a full backup.

The table that we've isolated to a filegroup has no foreign keys back to any other tables. I don't know how object-level recovery deals with this 'transactionally inconsistent state' (we don't have that version of LiteSpeed anyway).

Here are my steps:

1. Take full backup (think 'last night's backup').
2. Back up filegroup (pre-ETL).
3. Back up log (pre-ETL; some BOL article(s) seemed to indicate that you should have a log backup that 'covers' the filegroup backup).
4. BEGIN TRANSACTION WITH MARK, update some records, COMMIT TRANSACTION (think 'ETL').
5. Someone reports 'bad data'; we validate and decide to roll back to just before the ETL process.
6. Back up log (post-ETL; some BOL article(s) seemed to indicate that you should back up the log before you restore a filegroup).
7. Restore filegroup WITH RECOVERY (SQL Server leaves the filegroup in 'restoring').
8. Back up log (some BOL article(s) seemed to indicate that you should back up the log AFTER a filegroup restore).
9. Restore pre-ETL log WITH RECOVERY (SQL Server still leaves the filegroup in 'restoring').
10. Restore post-ETL log with STOPATMARK and RECOVERY (SQL Server won't let you restore to a point in time or a named transaction unless you start with a restore of a full backup).
11. Restore post-ETL log with RECOVERY gives us our bad data back (but the filegroup is now available).

Am I missing something here? It seems like filegroup backup/restore is more for bringing a database online in pieces, so that you can get critical data online asap, but ultimately you have to recover using a transaction log backup to make everything consistent. And if you want to stop at a point in time, you have to start by recovering from a full backup.

This table is already in its own publication (transactional replication), so we would just drop the subscriptions and publication before the recovery and recreate them with a new snapshot after the recovery.

Michelle
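For reference, the sequence SQL Server will accept for a marked-transaction recovery starts from the full backup, not the filegroup backup (names, paths, and the mark name are placeholders):

RESTORE DATABASE MyDatabase
    FROM DISK = N'D:\Backups\MyDatabase_Full.bak'
    WITH NORECOVERY, REPLACE
GO
RESTORE LOG MyDatabase
    FROM DISK = N'D:\Backups\MyDatabase_PreETL.trn'
    WITH NORECOVERY
GO
-- Stop just before the marked ETL transaction:
RESTORE LOG MyDatabase
    FROM DISK = N'D:\Backups\MyDatabase_PostETL.trn'
    WITH STOPBEFOREMARK = N'EtlLoad', RECOVERY
GO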
maspbury
Starting Member
7 Posts
Posted - 2009-02-19 : 14:59:27
MEOB,

Check the schedules for your Log Reader and Distribution Agents. It's possible that your Log Reader is running continuously but your Distribution Agent is only running every 30 minutes or something (causing the transactions to build up at the distributor). There are also settings for the agents to process transactions in batches of different sizes. I wouldn't alter the default agent profile, but you might try creating a 'custom' profile and changing some of these settings.

Also, if your Log Reader Agent is set to start when SQL Server starts (and run continuously), be careful, because if it fails to start, the transactions will just sit in your transaction log. And if it stops for some reason (or if someone stops it), it won't restart. We usually schedule ours to run at some appropriate frequency, so that if it stops, or doesn't start when SQL Server starts up, it won't be an issue because it will try to start again in a few minutes.

It does NOT sound as if you have 'sync with backup' set on the publisher (based on the messages that you're seeing). But if this is on at the publisher, the transactions will not be replicated until the publisher's log has been backed up; Replication Monitor reports 'waiting for log backup' or something similar.

Books Online has a separate section for SQL Server 2005 Replication.

quote: Originally posted by MEOB ... I think the issue we are having is actually one of latency caused by the data load process. ...

Michelle
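The 'sync with backup' setting can be checked and changed as follows (the database name is a placeholder):

-- Returns 1 if 'sync with backup' is on:
SELECT DATABASEPROPERTYEX(N'MyDatabase', 'IsSyncWithBackup')
GO
-- Enable or disable it at the publisher:
EXEC sys.sp_replicationdboption
    @dbname  = N'MyDatabase',
    @optname = N'sync with backup',
    @value   = N'false'
GO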
maspbury
Starting Member
7 Posts
Posted - 2009-02-19 : 15:18:18
MEOB,

Sorry for steering your post in a different direction.

Is this data ONLY updated by this load process? If your subscribers can support snapshot replication (basically, the tables will be dropped and recreated when the snapshot is applied), perhaps you should change your replication process to use snapshot replication instead of transactional replication?

If you need to keep transactional replication, reinitializing the subscribers before starting the load and then taking a snapshot after the load should eliminate the need to replicate all of the transactions from the load:

USE [MyDatabase]
GO
EXEC sp_reinitsubscription @publication = N'MyPublication', @subscriber = 'all'
GO

USE [MyDatabase]
GO
EXEC sp_startpublication_snapshot @publication = N'MyPublication'

I haven't tried this, so I don't know for sure that it will work. It seems like it makes sense, but I could be wrong. :)

You can use sp_dropsubscription and sp_droppublication to drop replication and then sp_addpublication and sp_addsubscription to rebuild it. I think you can generate a script of your existing publications/subscriptions if it's complicated. So you could schedule dropping/recreating replication as a job.

quote: Originally posted by MEOB ... Is there any way to completely stop replication while I perform the data loads and then restart replication from that point only (with a new snapshot), without completely removing replication? ...

Michelle
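A heavily abbreviated sketch of that drop/recreate route (untested here; all names are placeholders, and a real script would carry many more sp_addpublication/sp_addarticle parameters, which is why generating the script from the existing publication is the safer starting point):

USE MyDatabase
GO
EXEC sp_dropsubscription
    @publication = N'MyPublication',
    @article     = N'all',
    @subscriber  = N'MYSUBSCRIBER'
EXEC sp_droppublication @publication = N'MyPublication'
GO
-- ... run the data load ...
-- Then recreate, ideally from a script generated off the live setup:
EXEC sp_addpublication
    @publication = N'MyPublication',
    @repl_freq   = N'continuous',
    @status      = N'active'
-- followed by the sp_addarticle, sp_addsubscription and
-- sp_addpushsubscription_agent calls for each article and subscriber.
GO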
MEOB
Starting Member
5 Posts
Posted - 2009-02-19 : 19:58:14
We are running transactional replication with the Log Reader Agent and Distribution Agent running all the time, except when the load process is running.

I have tried several things. When I came on board, the process was to stop the agents, complete the loads, start the agents, generate a new snapshot for the subscriber, and reinitialise the subscriber. With this we saw approximately 40 minutes of latency. I initially changed the process to generate a snapshot and reinitialise the subscriber before restarting the Distribution Agent; then I tried the same thing with the Log Reader Agent never stopping throughout the load process; then I dropped the snapshot and reinitialisation since they didn't seem to be making a difference; and finally I let replication run continuously through the load process. In every case I see approximately 40 minutes of latency.

To explain the load process a little more: we are essentially rebuilding the entire database every morning. Most data is deleted and then new data is inserted. We are looking at a stock listing running into the region of 8,000,000 stock records. Not only this, but once the data is loaded we calculate suggested order amounts for the several hundred stores who will be using the system, as well as auto-allocations of stock, etc.

This leads me to conclude that the loads themselves are causing the problem, and in talking with the developers today I have confirmed that the loads call stored procedures that can take up to 15 minutes to complete. I haven't analysed these yet, but I suspect that they are very large single UPDATEs, INSERTs or DELETEs that take between 10 and 15 minutes each to complete. So from this I think that, short of redesigning the load process (not an option), I must create replication deletion and creation scripts to run each day around the load process, unless someone can tell me how to essentially 'reset' replication without having to delete it.

We can't use snapshot replication, as the customer specifically requires real-time replication of the users' activity throughout the day.

Would the idea of reinitializing the subscriber before the load work here? This isn't something I have tried. Would the idea be to stop the Distribution Agent, reinitialize the subscription, complete the load, create the snapshot and restart the agent? From what I have seen of replication, the Log Reader would still be pushing out all the changes from the publisher to the distribution database, and the Distribution Agent would push these through to the subscriber as soon as it came back on, but I will try this if anyone thinks it might work.

Sorry for the long post. I appreciate any help or thoughts on the best way to resolve this.
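For what it's worth, the sequence described in the last paragraph would look something like this as a scripted job (untested, as noted in the thread; the job name and object names are placeholders):

USE MyDatabase
GO
-- 1. Stop the Distribution Agent job:
EXEC msdb.dbo.sp_stop_job @job_name = N'MYSERVER-MyDatabase-MyPublication-MYSUBSCRIBER-1'
-- 2. Mark the subscription for reinitialisation and invalidate the old snapshot:
EXEC sp_reinitsubscription
    @publication         = N'MyPublication',
    @subscriber          = N'all',
    @invalidate_snapshot = 1
-- 3. ... run the data load ...
-- 4. Generate a fresh snapshot:
EXEC sp_startpublication_snapshot @publication = N'MyPublication'
-- 5. Restart the Distribution Agent job:
EXEC msdb.dbo.sp_start_job @job_name = N'MYSERVER-MyDatabase-MyPublication-MYSUBSCRIBER-1'
GO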
MEOB
Starting Member
5 Posts
Posted - 2009-02-24 : 10:07:05
Any suggestions?
maspbury
Starting Member
7 Posts
Posted - 2009-02-24 : 15:14:23
quote: Originally posted by MEOB Any suggestions?
If you dropped the subscription/publication before the load, ran the load, and added the publication/subscription with a new snapshot, you definitely wouldn't get the transactions from the load accumulating and being pushed to the distribution database and then to the subscribers. I know that you can script a lot of this (I'm not sure if you can script all of it). I DON'T know for sure whether reinitializing the subscriber before the load would prevent the transactions from being distributed while it waits for a new snapshot, but I would certainly be interested in finding out. :)

Once we decided in our scenario that the best option for recovery, in the event of bad data being loaded into one table, was to isolate that table in a separate database, someone else took this over from me. I'm not sure if we're just going to change our replication strategy to use snapshot replication, or if we might need to do something like reinitializing the subscribers to prevent them from getting bad data. If so, I'll post back with what we ended up doing for our solution.

Michelle