Reader Challenge #1 Solutions (Part II)
By Bill Graziano on 4 June 2001 | Tags: Reader Challenges

This article is the second summary of responses to the Reader Challenge. I'm covering solutions by Michael Price (again), Ilya Zaltsman, John Nowak and Stephen Morris. These are the fastest and cleanest solutions. I'll also cover the first entry that returned the actual values. Thanks again for your entries; I'm busy looking for another query for the next Reader Challenge.
It might be helpful to review the original challenge and the first correct response I received. Numerous people submitted solutions that were pretty similar to what I'd already published. They are (in order of submission) sriram raghunathan, JP, Marc Yabroff, Jon Scott-Sheldon, Allison Anderson, Balachandar Ganesan, Justin Bigelow (who got me started on this), Eugene Okhrits and John Henry. I also had late entries from sqlnr, Pedro Cardoso and Tim Graham that weren't radically different from the other entries. Merkin also submitted a partial solution, but he's written an article for us so he wasn't "officially" eligible.
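If you want to run the solutions below yourself, all of them work against the challenge's Samples table. The original challenge post has the full setup; as a stand-in, here's a minimal sketch of the table the way the submitted queries appear to expect it (the single float column Ind_Ret is inferred from the queries themselves, so treat the exact schema as an assumption):

-- Minimal stand-in for the challenge table (schema inferred from the
-- submitted queries, not copied from the original challenge post).
CREATE TABLE Samples (
    Ind_Ret float NOT NULL   -- the individual sample return values
)
-- Load the sample data from the original challenge before running the
-- solutions below; they flag rows more than three standard deviations
-- above or below the mean of Ind_Ret.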
A Clean Solution

This is a pretty subjective category. I tried to identify a solution that was easy for a novice SQL developer to read and understand. I think in the future I'll present this one first and work my way up to the more complex solutions. I thought the "cleanest" solution came from Ilya Zaltsman. He writes:

Hey, Graz! First of all, thanks for the site -- to me it is the best source of SQL information anywhere!!! I thought I'd try my hand at answering the reader challenge: Statistics... The subject I've been trying to forget (especially since I blew $350 in a casino last night)...

I can't comment on Ilya's gambling skills but his SQL skills seem to be pretty good. Here's the query he submitted:

DECLARE @Mean float
DECLARE @StDev float
DECLARE @LowerBound float
DECLARE @UpperBound float

-- Get Mean and StDev
SELECT @Mean = AVG(Ind_Ret), @StDev = STDEV(Ind_Ret)
FROM Samples

-- Get Upper and Lower bounds of the range
SELECT @LowerBound = @Mean - (@StDev * 3),
       @UpperBound = @Mean + (@StDev * 3)

-- Count the outliers
SELECT (SELECT COUNT(Ind_Ret) FROM Samples WHERE Ind_Ret < @LowerBound) AS Outliers_Below,
       (SELECT COUNT(Ind_Ret) FROM Samples WHERE Ind_Ret > @UpperBound) AS Outliers_Above

I think this lays out the steps the query has to go through pretty thoroughly. Ilya also gets bonus points for using comments. This query generates three table scans, which isn't all that bad. I've got solutions with two table scans, but I've also got solutions with five.

Returning the Values

John Nowak sent in the first solution that returned the actual values in a single SELECT statement. Here's his solution:

SELECT so.ind_ret AS OverYakLimit,
       su.ind_ret AS UnderYakLimit
FROM samples s
LEFT JOIN samples so
  ON s.ind_ret = so.ind_ret
 AND so.ind_ret > (SELECT ISNULL(AVG(Ind_Ret) + STDEV(Ind_Ret) * 3, 0) FROM Samples)
LEFT JOIN samples su
  ON s.ind_ret = su.ind_ret
 AND su.ind_ret < (SELECT ISNULL(AVG(Ind_Ret) - STDEV(Ind_Ret) * 3, 0) FROM Samples)
WHERE (ISNULL(so.ind_ret, 0) + ISNULL(su.ind_ret, 0)) > 0
ORDER BY so.ind_ret, su.ind_ret

It generates five table scans but does return all the values. Plus he gets bonus points for a few Yak references.

Two Table Scans

I don't think this query can be written with a single table scan. You need one full scan to determine the standard deviation and the average. You need a second scan to identify records that are more than three standard deviations from the average. I got two solutions with two table scans, from Michael Price and Ilya Zaltsman. Michael was first and writes (in his second email):

Well, when driving home last night I was thinking about this some more, and came up with a qry that only uses 2 table scans . . . Now maybe I can get this out of my head <g>. I don't think a single pass on the table is possible, but I look forward to the other users' responses to see if they came up with a way.

And here's the solution he came up with:

SELECT SUM(CASE WHEN Ind_Ret > NPLimitTop THEN 1 ELSE 0 END) AS AboveNPLimit,
       SUM(CASE WHEN Ind_Ret < NPLimitBottom THEN 1 ELSE 0 END) AS BelowNPLimit
FROM Samples,
     (SELECT AVG(Ind_Ret) + (STDEV(Ind_Ret) * 3) AS NPLimitTop,
             AVG(Ind_Ret) - (STDEV(Ind_Ret) * 3) AS NPLimitBottom
      FROM Samples) RangeData

You'll notice his tricky use of a cartesian join to pull NPLimitTop and NPLimitBottom into his CASE statements. This is very similar to what he did in the first article I wrote. His solution uses only two table scans, which is as fast as I think it can be done.
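As an aside, the scan counts quoted throughout this article are easy to check yourself. The article doesn't show exactly how they were measured, but SET STATISTICS IO reports a scan count and read counts for each table a query touches, so something like this (against the Samples table sketched above) is one way to reproduce the numbers:

-- One way to see how many scans a solution performs (a sketch, not
-- necessarily how the counts in this article were gathered).
SET STATISTICS IO ON

-- Run any of the submitted solutions here. The Messages output includes a
-- line such as "Table 'Samples'. Scan count 2, logical reads ..." for each
-- table the query touches.
SELECT AVG(Ind_Ret) AS Mean, STDEV(Ind_Ret) AS StDev
FROM Samples

SET STATISTICS IO OFF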
Ilya's solution with two table scans was different, so I thought I'd publish it too:

SELECT SUM(Below) AS Outliers_Below,   -- Count outliers
       SUM(Above) AS Outliers_Above
FROM (SELECT -- Break up high-end and low-end outliers into separate columns
             CASE WHEN a.Ind_Ret > b.UpperBound THEN 1 ELSE 0 END AS Above,
             CASE WHEN a.Ind_Ret < b.LowerBound THEN 1 ELSE 0 END AS Below
      FROM Samples AS a,
           -- Obtain range boundaries
           (SELECT AVG(Ind_Ret) + (3 * STDEV(Ind_Ret)) AS UpperBound,
                   AVG(Ind_Ret) - (3 * STDEV(Ind_Ret)) AS LowerBound
            FROM Samples) AS b
      -- Eliminate non-outlier values
      WHERE a.Ind_Ret > b.UpperBound
         OR a.Ind_Ret < b.LowerBound) AS Yaks

An Index?

Stephen Morris submitted an interesting solution: he built an index on the table. Stephen writes:

hi Graz
got sucked in to your Reader Challenge. Added an index to get rid of the table scans & wrote a single select solution - see below
got to get back to work

Here's his statement:

CREATE CLUSTERED INDEX IX_Samples ON Samples (Ind_Ret) ON [PRIMARY]

SELECT CASE WHEN ind_ret - average > 0 THEN 'Above' ELSE 'Below' END AS Direction,
       ISNULL(CAST(ind_ret AS varchar(20)), 'TOTAL') AS Ind_ret,
       COUNT(*) AS CountofSamples
FROM SAMPLES
CROSS JOIN (SELECT AVG(s.ind_ret) AS average,
                   STDEV(s.Ind_Ret) AS stddev
            FROM SAMPLES s) AS AggFnResults
WHERE ABS(ind_ret - average) > stddev * 3
GROUP BY CASE WHEN ind_ret - average > 0 THEN 'Above' ELSE 'Below' END,
         ind_ret WITH ROLLUP
ORDER BY 1, 2 DESC

His query generated only two table scans before adding the index, so it was already pretty efficient. He does have some extra GROUP BY work that generates a little additional overhead though. After adding the index, the table scans were converted to clustered index scans. They seemed to take the same amount of processing time as the table scans did, which is what I'd expected. Obviously in a real environment there would have been multiple tests stored in the same table, and indexes would make a huge difference. Of course, in a real environment the proper indexes would have been built from the start and all the queries would have used them. So in this case the index didn't really help or hurt this solution much, but it was interesting to see its effect on the query.

Thanks to everyone that entered. I think I'll do another one of these in a few weeks; I've got a couple of queries that would make good candidates. Thanks for "playing" and I hope you all give it a try next time.