lazerath
Constraint Violating Yak Guru
343 Posts
Posted - 2007-10-15 : 17:39:52
Interested in feedback from the SQL grand wizards (and would-be wizards) that haunt these forums.

Let's say you need to constantly stream data into an OLTP system. We are talking multiple-level hierarchies totaling upwards of 300 MB a day, spread out not unlike a typical human sleep cycle (lower volume during off-peak hours, but still a 24/7 requirement). All data originates from virtual machines running proprietary algorithms. The VM/data capture infrastructure needs to be massively scalable, meaning that incoming data is going to become more and more frequent and involve many different flat record formats.

The data has tremendous value when viewed both historically and in real time (95% of real-time access will be read-only). The database infrastructure is in its infancy now, and I'm trying to develop a growth plan that can meet the needs of the business as the data requirements grow. I have no doubt that the system will need to work with multiple terabytes of data within a year.

The current database environment is a single server: a Dell PowerEdge 2950 (Intel Quad Core 5355, 16 GB RAM, 2 x 73 GB 15K RPM SAS) with an attached Dell PowerVault MD1000 (15 x 300 GB 10K RPM SAS in RAID 5+0 [2x7] with hot spare), running Windows Server 2003 64-bit and SQL Server 2005 x64 Standard, licensed for 1 CPU.

I am interested in answering the following questions:

- Based on the scaling requirements of the data capture and subsequent ETL, what transmission method would you find most favorable? For instance, we are weighing direct database writes via stored procedures from all VM systems versus establishing processes to collect, aggregate, and stream CSVs into a specialized ETL environment running SSIS packages that load the data and then call stored procedures to scrub and prepare it for production import. The data will require scrub routines that need access to current production data, so distributing the core data structures to multiple ETL processing systems would be expensive and undesirable. (A sketch of the direct-write option appears after this post.)

- Cost is very important to the overall solution design. In terms of database infrastructure, how would you maximize business value while keeping cost as low as possible? For instance, do you see more value in an active/active cluster (2 x CPU licenses) where one node acts as ETL and the other as OLTP, or would you favor replication of production data from ETL to OLTP (or vice versa)? With the second scenario, am I mistaken in thinking we could get away with a Server/CAL licensing model for the ETL server?

- Are there any third-party tools I should research that would greatly aid me here?

I appreciate all feedback, criticism, and thoughts.

Best Regards,
Shane
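For concreteness, here is a minimal sketch of the direct-write option being weighed, with all object and parameter names invented for illustration: each VM would call a thin stored procedure that lands the raw record in a staging table, leaving scrubbing to later set-based batches.

-- Hypothetical staging table and procedure; not from the actual system.
CREATE TABLE dbo.CaptureStaging
(
    CaptureId   bigint IDENTITY(1,1) PRIMARY KEY,
    VMId        int           NOT NULL,
    CapturedAt  datetime      NOT NULL,
    RecordType  varchar(20)   NOT NULL,
    Payload     varchar(4000) NOT NULL
);
GO

CREATE PROCEDURE dbo.usp_InsertCaptureRow
    @VMId       int,
    @CapturedAt datetime,
    @RecordType varchar(20),
    @Payload    varchar(4000)
AS
BEGIN
    SET NOCOUNT ON;
    -- Land the raw record only; scrub/transform runs later in set-based batches.
    INSERT INTO dbo.CaptureStaging (VMId, CapturedAt, RecordType, Payload)
    VALUES (@VMId, @CapturedAt, @RecordType, @Payload);
END
GO

The CSV/SSIS alternative trades these per-row round trips for periodic bulk loads, which is where bcp and BULK INSERT come up later in the thread.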
lazerath
Constraint Violating Yak Guru
343 Posts
Posted - 2007-10-16 : 11:58:09
No comments? Can anyone point me to a better place to ask my questions?
russell
Pyro-ma-ni-yak
5072 Posts
Posted - 2007-10-16 : 19:09:12
Well, your last post isn't going to be the fastest way to get help... I could point you to another site I like, but that isn't the point here, is it? Knowing very little about your application, here are my immediate thoughts.

It all depends on the nature of the ETLs and inserts. I import almost 10 GB a night, with tons of scrubbing, on a similar hardware setup, but the app is not 24/7 and it fits nicely into my maintenance window. I also import a few hundred megs throughout the day during heavy operating hours. I experience failure of the ETLs maybe once every 2 months due to deadlocking (it used to be worse, lol).

If the requirement is to insert 300 MB a day into the OLTP tables but the processing is spread throughout the day, then this doesn't seem too bad from a performance standpoint.

No one knows enough about your app to say for sure, but I'm thinking you may want a separate database on the target server as the target of the ETLs (I'd almost certainly use SSIS in your case). If your OLTP system is very active, it is going to be critical to ensure that the import database files (mdf, ldf) are on their own spindles; otherwise, get ready for some I/O issues.

BUT... if the ETLs are intensive, then they need to be pointed at a separate (linked) server from which you can bulk insert the results periodically (either via BULK INSERT or SSIS). I can't think of a faster way to max out as many CPUs as you want than to start doing large table scans. I had someone drop the clustered index on a table with roughly 12 million records... the table was used in joins several times a second and maxed out all 32 schedulers on the system (16 dual-core Xeon processors) immediately. So if your ETLs are going to be CPU intensive, you can't allow them to run throughout the day on the same system as a busy OLTP system.

So... in short, the very generic recommendations are:
- make accommodations for I/O pressure
- be aware of CPU and memory pressure implications
- test, test, test

Also, make sure the inserts are failsafe -- you can't drop the data if there is a deadlock or a bad record gets through the scrubbing process.

Can't be much more specific than that at this point. Hope it helps a little.
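As a rough sketch of the "separate import database on its own spindles" suggestion, assuming hypothetical drive letters and sizes:

-- Hypothetical drive letters and sizes; the point is keeping import I/O off the OLTP spindles.
CREATE DATABASE ImportStaging
ON PRIMARY
(
    NAME = N'ImportStaging_Data',
    FILENAME = N'E:\SQLData\ImportStaging.mdf',   -- dedicated data spindles
    SIZE = 10GB,
    FILEGROWTH = 1GB
)
LOG ON
(
    NAME = N'ImportStaging_Log',
    FILENAME = N'F:\SQLLogs\ImportStaging.ldf',   -- dedicated log spindles
    SIZE = 2GB,
    FILEGROWTH = 512MB
);
GO
-- Staging data can be reloaded from source files, so simple recovery keeps the log small during bulk loads.
ALTER DATABASE ImportStaging SET RECOVERY SIMPLE;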
lazerath
Constraint Violating Yak Guru
343 Posts
Posted - 2007-10-17 : 12:26:34
Russell... my thanks for the feedback. My apologies as well, regarding my last post. It was no slight on SQLTeam (I have the utmost respect for everyone here)... I actually didn't know if the Administration forum, specifically, was the best place for this.

To provide a little background: my organization has complete control over all aspects of this application, from data capture and collection to storage and presentation. I've worked in an environment before where the data capture occurred in a third-party system and was transmitted on a schedule for import. Here, I have the flexibility to design a system architecture unconstrained by third-party requirements.

A coworker questioned whether or not CSV was the leanest transmission method for the data. From his perspective, he couldn't imagine Google using CSV to load data from their army of web harvesters (ignore Google's open source stance and vast resources for a second). He proposed a solution involving direct connections to the database via stored procedures.

I have my own opinions on the matter, but I wanted to get some confirmation and/or alternate perspectives from the community. You guys rock and I have great respect for you. It is quite possible that someone here will have thought of something I haven't already.

To address your post directly, my architecture already includes an ETL database that holds the data as it is loaded, scrubbed, and transformed for bulk import. At my last job, I had the advantage of creating and refining an enterprise-grade ETL pattern, and I am fully confident in deploying it in this environment. I just wanted to touch base with the community to make sure I wasn't missing anything obvious.

BTW, feel free to check out my homepage to get a feel for the application.
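As one illustration of the load-scrub-import pattern described above (the tables, schemas, and validation rule are all hypothetical), the scrub step can stay set-based: flag rows that fail validation against production reference data, then move only the clean rows.

-- Hypothetical objects; a simplified set-based scrub-and-import pass.
BEGIN TRAN;

-- Flag staging rows that reference an unknown VM (this is the step that needs production data).
UPDATE s
SET    s.IsRejected   = 1,
       s.RejectReason = 'Unknown VMId'
FROM   etl.CaptureStaging AS s
LEFT JOIN prod.VirtualMachine AS vm
       ON vm.VMId = s.VMId
WHERE  vm.VMId IS NULL
  AND  s.IsRejected = 0;

-- Move the clean rows into production in one pass.
INSERT INTO prod.CaptureFact (VMId, CapturedAt, RecordType, Payload)
SELECT s.VMId, s.CapturedAt, s.RecordType, s.Payload
FROM   etl.CaptureStaging AS s
WHERE  s.IsRejected = 0;

-- Clear only what was imported; rejected rows stay behind for inspection, so nothing is silently dropped.
DELETE FROM etl.CaptureStaging
WHERE  IsRejected = 0;

COMMIT TRAN;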
lazerath
Constraint Violating Yak Guru
343 Posts
Posted - 2007-10-17 : 13:15:41
My ultimate vision at this point is to have an ETL "Master" cluster (active/passive) with full production data that acts as the reference point for several ETL "Worker" servers running local copies of SQL 2K5 and SSIS. The ETL Master would be linked to each worker and assist in complex scrub procedures, ultimately receiving the updates after processing has completed. This is assuming that the ETL Master will not be able to handle the load by itself.

Of course, a separate OLTP cluster (active/passive) would be required to service the web farms. To support that, transactional replication would be enabled between the ETL Master and the OLTP servers. Depending on load and performance, I may also need to designate another subscriber of the ETL Master expressly for reporting.

From a cost perspective, I believe that only the web-facing servers would require the expensive per-CPU licensing model, and the others could get by with Server/CAL licensing. Additionally, if memory serves me, I am not forced to purchase licenses for the passive nodes of a cluster. If that is the case, it reduces the overall cost significantly and achieves maximum value for the business.

Although it is not so important at this time, eventually I want to load the data as fast as possible. Literally, I will try to squeeze out every millisecond between the data being captured and subsequently displayed on the front end. In order to do that, I might need to carve out a subset of the data and house it in yet another server system, but what I want to avoid is a complete rewrite of our loading systems.

I know it is difficult for anyone to respond without a full understanding of the business and requirements, so I really do appreciate the time you have spent to review this and/or provide comments.
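For what it's worth, a minimal sketch of how a worker might be wired to the ETL "Master" for scrub lookups, with all server, database, and object names invented:

-- Register the master as a linked server on a worker (names are hypothetical).
EXEC sp_addlinkedserver
     @server     = N'ETLMASTER',
     @srvproduct = N'',
     @provider   = N'SQLNCLI',
     @datasrc    = N'etlmaster01';

-- Pass through the caller's Windows credentials.
EXEC sp_addlinkedsrvlogin
     @rmtsrvname = N'ETLMASTER',
     @useself    = N'True';

-- Scrub routines on the worker can then reference master data via four-part names.
SELECT TOP 10 vm.VMId, vm.VMName
FROM ETLMASTER.ProductionDB.prod.VirtualMachine AS vm;

Four-part-name joins can pull a lot of data across the wire, so shipping finished results back to the master in periodic bulk transfers is usually the cheaper direction, which lines up with the earlier advice in this thread.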
russell
Pyro-ma-ni-yak
5072 Posts
Posted - 2007-10-18 : 10:56:19
Your memory serves you well on the licensing. You are correct: you don't need to license the passive node.

The fastest way to move data is going to be bcp or SSIS. bcp and BULK INSERT work extremely fast with CSV; of course, they are troublesome when the formatting gets goofy (like a comma, tab character, or newline that slips through). It would be worth benchmarking that approach against SPs that access the data directly (my guess is direct access via SPs is probably faster). One advantage of bcp/CSV may be fewer locking/blocking issues.
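A hedged example of the bulk-load path described above; the file path, table name, and delimiters are assumptions, and the ERRORFILE option is one way to keep a goofy row from failing the whole batch:

-- Hypothetical file and staging table; the table's column order must match the CSV, or a format file is needed.
BULK INSERT ImportStaging.dbo.CaptureRaw
FROM 'E:\Incoming\vm_capture.csv'
WITH
(
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2,      -- skip a header row, if the feed includes one
    TABLOCK,                  -- table-level lock for a faster load; fine on a dedicated staging table
    MAXERRORS       = 50,
    ERRORFILE       = 'E:\Incoming\vm_capture.err'  -- malformed rows land here instead of aborting the load
);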