cjp
Yak Posting Veteran
69 Posts |
Posted - 2010-12-02 : 07:29:05
I am using SQL Server 2005 SP3. I've read a variety of articles on indexing (including the essential reading recommended on this forum), and this has led me to experiment with different settings to see what helps (and what doesn't). A while back (probably here), I came across some code designed to test whether one's indexes are adequate. This is what it looks like:

Use <dbname>
exec sp_helpindex [dbo.<tablename>]
Select * From <tablename> Where <set conditions>

Run this a few times using different conditions in the Where clause; then, following each instance of this:

Select * From sys.dm_db_missing_index_details

Trying different searches, I have found that the correct index is always being picked up without hints and that my indexes materialise the data very quickly; however, there is a fairly consistent recommendation: the missing index output keeps suggesting that I incorporate every field in the table into the include line for whichever index has been used.

I am wondering about this for one reason: following some on-line advice, I have re-jigged my indexes so that every field I use frequently in joins or Where clauses is in a one-field index: lots of small indexes rather than fewer, longer (compound) indexes. I am finding this very helpful but am wondering whether it makes sense to have a fully populated include line for every non-clustered index: (a) this looks like a lot of duplication, and (b) all the information is held at leaf level in the clustered index, so the optimiser hasn't got far to go to select the correct rows - that is, my queries will always invoke the field that defines the clustered index key, so the optimiser will always know where the relevant leaf-level information is to be found.

I would welcome some guidance here - I am quite happy to program in as many include lines as there are non-clustered indexes, but I don't want to do this if it is a waste of time and resources.

Thanks.

Chris
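[Editor's note: the raw details view above can be joined to the missing-index group-stats DMV so that suggestions are ranked by estimated impact rather than read in isolation. A sketch, using only the standard SQL Server 2005+ DMV columns:]

```sql
-- Rank missing-index suggestions by estimated benefit.
SELECT
    mid.statement          AS table_name,
    mid.equality_columns,      -- columns wanted for = predicates
    mid.inequality_columns,    -- columns wanted for range predicates
    mid.included_columns,      -- columns it would put on the INCLUDE list
    migs.user_seeks,
    migs.avg_total_user_cost,
    migs.avg_user_impact       -- estimated % cost reduction if the index existed
FROM sys.dm_db_missing_index_details AS mid
JOIN sys.dm_db_missing_index_groups AS mig
    ON mig.index_handle = mid.index_handle
JOIN sys.dm_db_missing_index_group_stats AS migs
    ON migs.group_handle = mig.index_group_handle
ORDER BY migs.avg_user_impact * migs.user_seeks DESC;
```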
russell
Pyro-ma-ni-yak
5072 Posts |
Posted - 2010-12-02 : 07:52:24
An index with every field included is basically a copy of the entire table -- not very useful. Include only the most frequently selected columns.

Suppose we search customers by zip code, but usually want to return their address too. This example is straight from BOL:

CREATE NONCLUSTERED INDEX IX_Address_PostalCode
ON Person.Address (PostalCode)
INCLUDE (AddressLine1, AddressLine2, City, StateProvinceID);
GO
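[Editor's note: a query like the following (against the AdventureWorks sample schema the BOL example uses; the postal code value is purely illustrative) can then be answered entirely from that non-clustered index, with no key lookup back into the clustered index:]

```sql
-- Covered by IX_Address_PostalCode: the key column (PostalCode) satisfies
-- the WHERE clause and the INCLUDE columns satisfy the SELECT list, so the
-- plan is a single Index Seek with no Key Lookup.
SELECT AddressLine1, AddressLine2, City, StateProvinceID, PostalCode
FROM Person.Address
WHERE PostalCode = N'98052';
```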
 |
dataguru1971
Master Smack Fu Yak Hacker
1464 Posts |
Posted - 2010-12-02 : 07:55:33
It is also likely recommending you include every column because of the Select * instead of Select <column names>.

Poor planning on your part does not constitute an emergency on my part.
 |
webfred
Master Smack Fu Yak Hacker
8781 Posts |
Posted - 2010-12-02 : 08:00:56
quote: Originally posted by dataguru1971
It is also likely recommending you include every column because of the Select * instead of Select <column names>.

Good point!!!!

No, you're never too old to Yak'n'Roll if you're too young to die.
 |
cjp
Yak Posting Veteran
69 Posts |
Posted - 2010-12-03 : 04:22:00
Thanks for your thoughts - very helpful. The point about Select * is well made and I will keep it in mind.

Chris
 |
GilaMonster
Master Smack Fu Yak Hacker
4507 Posts |
Posted - 2010-12-03 : 04:38:31
The missing index DMVs are a good place to start investigating indexes, not a good place to end.

--
Gail Shaw
SQL Server MVP
 |
cjp
Yak Posting Veteran
69 Posts |
Posted - 2010-12-03 : 05:20:15
Thanks for the advice re the index DMVs. Arising from this, I have spent some more time investigating the include option and now have a better understanding of what it is (and is not) designed to achieve. I had not previously registered (though I must have read it) that the include option enables actual data to be stored at leaf level in the non-clustered index: bit by bit, indexes are beginning to make sense.

Chris
 |
Sachin.Nand
2937 Posts |
Posted - 2010-12-03 : 06:43:58
quote: Originally posted by cjp
I am wondering about this for one reason: following some on-line advice, I have re-jigged my indexes so that every field I use frequently in joins or Where clauses is in a one-field index: lots of small indexes rather than fewer, longer (compound) indexes. I am finding this very helpful but am wondering whether it makes sense to have a fully populated include line for every non-clustered index: (a) this looks like a lot of duplication, and (b) all the information is held at leaf level in the clustered index, so the optimiser hasn't got far to go to select the correct rows - that is, my queries will always invoke the field that defines the clustered index key, so the optimiser will always know where the relevant leaf-level information is to be found.

I personally think it is a bad idea to have a lot of small indexes. Just imagine a scenario where you have three indexes, on column1, column2 and column3 respectively, and the values in each index have a high density. Say column1 matches 10,000 rows, column2 matches 8,000 rows and column3 matches 5,000 rows, but the combination of all three returns only 100 matching records. The optimizer in that case is going to use something called index intersection, or go for a straightforward scan, which will lead to high reads. But if you have a composite index on the three columns, for the above scenario the optimizer will go for a single index seek, resulting in few page reads.

PBUH
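[Editor's note: Sachin's scenario can be sketched concretely. Table and column names below are hypothetical; the point is the shape of the two indexing strategies for the same three-predicate query:]

```sql
-- Strategy 1: three narrow indexes. For the query below, the optimiser
-- must either intersect the three row-sets (index intersection) or fall
-- back to a scan - high reads either way.
CREATE NONCLUSTERED INDEX IX_t_col1 ON dbo.SomeTable (column1);
CREATE NONCLUSTERED INDEX IX_t_col2 ON dbo.SomeTable (column2);
CREATE NONCLUSTERED INDEX IX_t_col3 ON dbo.SomeTable (column3);

-- Strategy 2: one composite index. The same query becomes a single
-- seek on a narrow key range - few page reads.
CREATE NONCLUSTERED INDEX IX_t_col123
    ON dbo.SomeTable (column1, column2, column3);

-- The query in question, filtering on all three columns:
SELECT column1, column2, column3
FROM dbo.SomeTable
WHERE column1 = @v1 AND column2 = @v2 AND column3 = @v3;
```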
 |
cjp
Yak Posting Veteran
69 Posts |
Posted - 2010-12-03 : 07:26:18
Thanks - this makes sense (and is what I used to do) but I have recently found something different in practice. I have a large table containing 700 million rows of data relating to medical tests - a table that has always been very slow to yield answers in multi-table queries. These tests can be indexed using either a code for each test or a code for each diagnosis that flows from a test: initially, I put both [testcode] and [dxcode] into the same compound index but found that queries on [dxcode] took an age and were using an index scan.

Having read Gail Shaw's articles on how to build indexes, I investigated the issue of index length and came across a recommendation in an SQL newsletter to rejig this kind of index so that it was two indexes rather than one - on the grounds that the query optimiser was likely just to read the leftmost part of the index and then, if the required value was not present, resort to a scan. When I dropped [dxcode] and remade it in its own index, the optimiser chose an index seek and the execution time for a 3-table query dropped from 15 minutes to 30 seconds.

This is why I have begun to experiment with multiple short indexes - I should admit, however, that it is early days for this investigation and I am open to other opinions, since I am very much feeling my way through this.

Chris
 |
Sachin.Nand
2937 Posts |
Posted - 2010-12-03 : 08:09:58
It depends on the predicates you are using in the query. Were you using only [dxcode] as a predicate, without an individual index on [dxcode]?

PBUH
 |
Kristen
Test
22859 Posts |
Posted - 2010-12-03 : 09:28:34
"I put both [testcode] and [dxcode] into the same compound index but found that queries on [dxcode] took an age and were using an index scan."

Yes, that's right.

A query on [dxcode] will not use index([testcode], [dxcode]) - well, it "might", but usually not. A query on [testcode] will normally use index([testcode], [dxcode]) or index([testcode]).

If you want a lookup on [dxcode] (and nothing else) then your index needs to start with that column (or ONLY contain that column).

If you routinely want to test on values for BOTH [testcode] and [dxcode] then having a single index with BOTH columns will be best. In those circumstances put first the column that is most commonly used WITHOUT the other one, but also take into account the "selectivity" of the first column.

If [dxcode] contains only unique values, or values that rarely repeat, but [testcode] does not, then putting [dxcode] first is more likely to cause the index to be used by SQL. For a facetious example: if [testcode] has only the values A, B or C, then in an index of 700 million rows SQL will use it only very rarely.

These are all huge generalisations, so don't rely on them! - but they may be some help in filling in some gaps.
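[Editor's note: a minimal sketch of the leftmost-prefix rule described above, using hypothetical table, index and code values:]

```sql
-- A composite index is seekable only for predicates on a left-based
-- subset of its key columns.
CREATE NONCLUSTERED INDEX IX_Tests_testcode_dxcode
    ON dbo.Tests (testcode, dxcode);

-- Seek: filters on the leading key column.
SELECT testcode, dxcode FROM dbo.Tests
WHERE testcode = 'BMI';

-- Seek: filters on a left-based subset (both key columns).
SELECT testcode, dxcode FROM dbo.Tests
WHERE testcode = 'BMI' AND dxcode = 'E66';

-- Scan (at best): filters only on the second key column, so this index
-- cannot be seeked; an index leading on dxcode would be needed.
SELECT testcode, dxcode FROM dbo.Tests
WHERE dxcode = 'E66';
```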
 |
cjp
Yak Posting Veteran
69 Posts |
Posted - 2010-12-06 : 04:12:44
Thanks for the ideas. The two fields I have used as examples both contain many different possible unique lookup values: there are hundreds of codes for different medical tests and >100,000 unique lookup values for the diagnosis (dx) codes. Dx codes include thousands of entries that relate, for example, to consultation issues - eg turned up late for an appointment - hence the size of the lookup. Certain testcodes are likely to occur very frequently (eg body mass index, smoking, blood pressure). Any given dxcode is likely to be relatively infrequent in this table.

Depending on the kind of study I am trying to populate with data (for statistical analysis), I might need to focus on testcode or on dxcode - occasionally on both. Given the range of unique values in each field, and the fact that the optimiser would not use dxcode when it was the second element in a compound index, I had no choice but to try it on its own - and this worked well, as indicated previously.

By the way, this is not a disguised normalisation issue: I have a separate table of diagnosis outcomes, but it just happens that every test code is required by the data protocol to be associated with a diagnosis/consultation code, so I can't really split up the tests table and deal with the dxcodes separately.

Kristen's remarks reflect closely the actual position in which I find myself. As Gail advises, I strongly prefer to have fewer, longer indexes (and doing it this way has been my practice - until I found out that the optimiser wasn't necessarily using the index in the way I anticipated). I will address her reading list today, but I always remember - maybe it was she who said it - that, at some point, indexing is as much an art as a science!

Thanks again for giving me so much help.

Chris
 |
Kristen
Test
22859 Posts |
Posted - 2010-12-06 : 05:40:31
You probably know all this, but in case not: you can do

SELECT COUNT(DISTINCT dxcode), COUNT(*)
FROM MyTable
WHERE dxcode IS NOT NULL

to see the ratio of different, "distinct", values to overall rows. The closer the first number comes to the second, the more "selective" the index is, and the more likely SQL will be to use it.

Having said that, if you have some values of [dxcode] that are rare, then the index will be highly selective for those values, and SQL will use the index when you query for those values. (SQL stores "statistics" on the distribution of values in the index, and uses that to decide whether the index is likely to be efficient, or not.)
 |
cjp
Yak Posting Veteran
69 Posts |
Posted - 2010-12-07 : 03:43:22
These are very interesting ideas. I started to run the procedure Kristen suggested but stopped after a while because I knew it would take for ever. As it happens, I know the bottom-line statistics anyway: there are about 500 distinct test codes across 650,000,000 records... and perhaps 10,000 dx codes for the same number of records. I decided to run dbcc show_statistics instead, since this is very fast. This confirmed high density for the dxcode field in this particular table (much lower density in the medical table, which is the main home of diagnosis and consultation records).

I wasn't terribly happy with the statistics because I felt that the sample size was too small, so I reran the dbcc procedure using a 25% sample size and ended up with numbers I can believe - still high density for dxcode. The upshot is that testcode and dxcode do need to be in separate indexes, I think - though I am considering building a compound index and then testing the effect of using a query hint to force the optimiser to use the index (the dxcode field will start at string position 11 in the index).

Conceivably, using query hints to direct the optimiser to use a compound index could be a way forward if best practice is to have fewer, longer indexes, as Gail suggests. However, I am conscious that only the leftmost field value will generate a single-field histogram - so the optimiser could choose a sub-optimal plan where a query depends on the second field in the index and the first field is not invoked.

I will keep exploring.

Chris
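[Editor's note: the sequence described above would look something like this. Object and statistics names are hypothetical; note that DBCC SHOW_STATISTICS only reads existing statistics, so the resampling step is done with UPDATE STATISTICS:]

```sql
-- Resample the statistics on the index at 25 percent of the rows.
UPDATE STATISTICS dbo.Tests IX_Tests_dxcode WITH SAMPLE 25 PERCENT;

-- DENSITY_VECTOR shows "all density" (1 / number of distinct values):
-- a high density means many duplicates, i.e. low selectivity.
DBCC SHOW_STATISTICS ('dbo.Tests', IX_Tests_dxcode) WITH DENSITY_VECTOR;

-- HISTOGRAM shows the value-distribution steps, which are built only
-- on the leading column of the index.
DBCC SHOW_STATISTICS ('dbo.Tests', IX_Tests_dxcode) WITH HISTOGRAM;
```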
 |
GilaMonster
Master Smack Fu Yak Hacker
4507 Posts |
Posted - 2010-12-07 : 04:40:27
quote: Originally posted by cjp
However, I am conscious that only the leftmost field value will generate a single-field histogram - so the optimiser could choose a sub-optimal plan where a query depends on the second field in the index and the first field is not invoked.

If a query references only the second column of an index, the best you can get is an index scan; that's not seekable. An index is only seekable if the query filters on a left-based subset of the index columns. That's something I refer to repeatedly in the articles I listed for you.

Stay away from index hints unless you are totally, absolutely, 100% certain that you know better than the optimiser for all possible predicate values.

--
Gail Shaw
SQL Server MVP
 |
cjp
Yak Posting Veteran
69 Posts |
Posted - 2010-12-07 : 07:04:13
Thank you - I agree: the reason I went down the two-index route was that I was getting an index scan when invoking the second field in the composite index: I do apply your teaching...

As a matter of interest, I have been reading a variety of papers on the issue of many small indexes versus fewer larger indexes. The following is relevant:

http://blogs.msdn.com/b/windchillonsql/archive/2010/06/18/sql-server-index-intersection-union.aspx

The author seems to be an enthusiast for multiple short indexes and adds:

'Avoid covering indexes except as an outlier solution until you can find a more enterprise solution'

I lack the knowledge to evaluate what he says, but it is grist to the mill of this topic.

Chris
 |
Kristen
Test
22859 Posts |
Posted - 2010-12-07 : 07:36:30
Is there a reason not to have two compound indexes:

([testcode], [dxcode])
and
([dxcode], [testcode])

Both will work where only the first column is used in the query (but less efficiently than a single-column index).

As the application is largely read-only, would it be that "bad" to also have single-column indexes on [dxcode] and [testcode], I wonder? (Disk space aside ...)

And if the application is "essentially read-only" then I would be inclined to update the statistics, once, with FULL SCAN so there are no poorly represented values.

And make sure the index FILL FACTOR is 100% and that nothing changes that (such as Wizard Maintenance Plans).
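[Editor's note: for a read-mostly table, the maintenance suggested above might look like this. Object names are hypothetical; WITH FULLSCAN reads every row rather than sampling, and FILLFACTOR = 100 packs leaf pages completely, which is safe when there are few or no inserts to make room for:]

```sql
-- Build the statistics once from every row, not a sample.
UPDATE STATISTICS dbo.Tests WITH FULLSCAN;

-- Rebuild the index with completely full leaf pages: maximum page
-- density for a table that is rarely modified.
ALTER INDEX IX_Tests_testcode_dxcode ON dbo.Tests
    REBUILD WITH (FILLFACTOR = 100);
```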
 |
GilaMonster
Master Smack Fu Yak Hacker
4507 Posts |
Posted - 2010-12-07 : 08:02:23
quote: Originally posted by cjp
http://blogs.msdn.com/b/windchillonsql/archive/2010/06/18/sql-server-index-intersection-union.aspx
The author seems to be an enthusiast for multiple short indexes and adds: 'Avoid covering indexes except as an outlier solution until you can find a more enterprise solution'

Allow me to provide a counterpoint:

http://sqlinthewild.co.za/index.php/2010/09/14/one-wide-index-or-multiple-narrow-indexes/

I've found a lot (and I mean a LOT) of really bad advice on that windchill blog. I suspect that the author is not a SQL expert. I've posted a couple of mildly scathing comments there correcting blatant errors.

Covering indexes are widely accepted as the BEST way to get a query fast. The downside is that you can't afford (space, insert impact, maintenance impact, etc) to cover every query. So cover the important ones and then see how the rest perform.

--
Gail Shaw
SQL Server MVP
 |
GilaMonster
Master Smack Fu Yak Hacker
4507 Posts |
Posted - 2010-12-07 : 08:05:34
quote: Originally posted by Kristen
Is there a reason not to have two compound indexes: ([testcode], [dxcode]) and ([dxcode], [testcode])

What I would typically do in that case is something like the following:

([testcode], [dxcode])
and
([dxcode])

Queries that filter on both can use the first. Queries that filter on just testcode use the first. Queries that just filter on dxcode use the second.

quote: As the application is largely read-only would it be that "bad" to also have single-column indexes on [dxcode] and [testcode] I wonder?

Bad, no. Useful, almost certainly not. SQL would use them. Whether there would be the slightest benefit is a whole other matter.

quote: (Disk space aside ...)

Backup space, backup time, maintenance time.

--
Gail Shaw
SQL Server MVP
 |
Kristen
Test
22859 Posts |
Posted - 2010-12-07 : 10:32:23
"Queries that filter on both can use the first. Queries that filter on just testcode use the first. Queries that just filter on dxcode use the second."

Indeed ... "it all depends".

If the query is "[testcode] BETWEEN A and B" (or "[testcode] = A") then a ([testcode], [dxcode]) index is efficient.

([testcode], [dxcode]) is also efficient for "[testcode] BETWEEN [... narrow range ...] AND [dxcode] BETWEEN [... any range ...]".

But also having a ([dxcode], [testcode]) index would be good for "[testcode] BETWEEN [... wide range ...] AND [dxcode] BETWEEN [... narrow range ...]" (which a ([dxcode]) index would also solve, but less efficiently, I think).

I think, given that this database is "almost static", that it may well be worthwhile creating these additional indexes - i.e. treating it more like a data warehouse. Give performance at query time higher priority than space (within reason!).

"Backup space, backup time, maintenance time."

Yup, that too! But putting any intermediate data in a separate database would allow the primary database to be READ-ONLY and only backed up as-and-when it was changed - or even "never", if it can be regenerated from some import data that is itself backed up / regenerate-able.
 |