
 Using the include option in indexes


cjp
Yak Posting Veteran

69 Posts

Posted - 2010-12-02 : 07:29:05
I am using SQLS 2005 sp 3.

I've read a variety of articles on indexing (including the essential reading recommended on this forum) and this has led me to experiment with different settings to see what helps (and what doesn't). A while back (probably here), I came across some code designed to test whether one's indexes are adequate. This is what it looks like:

Use <dbname>
exec sp_helpindex [dbo.<tablename>]

Select * From <tablename> Where <set conditions>

Run this a few times using different conditions in the Where clause; then, following each instance of this:

Select * From sys.dm_db_missing_index_details
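A fuller version of that last query joins the related missing-index DMVs so you can see each suggestion alongside how often it would have been used. This is only a sketch (the weighting in the ORDER BY is one common convention, not gospel):

```sql
-- Join the missing-index DMVs: details of each suggested index,
-- plus usage stats showing how often the optimiser wanted it.
SELECT  d.statement AS table_name,
        d.equality_columns,
        d.inequality_columns,
        d.included_columns,
        s.user_seeks,
        s.avg_user_impact
FROM    sys.dm_db_missing_index_details     AS d
JOIN    sys.dm_db_missing_index_groups      AS g ON g.index_handle = d.index_handle
JOIN    sys.dm_db_missing_index_group_stats AS s ON s.group_handle = g.index_group_handle
ORDER BY s.user_seeks * s.avg_user_impact DESC;
```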

Trying different searches, I have found that the correct index is always being picked up without hints and that my indexes materialise the data very quickly; however, there is a fairly consistent recommendation: the missing index output keeps suggesting that I incorporate every field in the table into the include line for whichever index has been used.

I am wondering about this for one reason: following some on-line advice, I have re-jigged my indexes so that every field I use frequently in joins or Where clauses is in a one-field index: lots of small indexes rather than fewer longer (compound) indexes. I am finding this very helpful but am wondering whether it makes sense to have a fully populated include line for every non-clustered index: (a) this looks like a lot of duplication and (b) all the information is held at leaf level in the clustered index, so the optimiser hasn't got far to go to select the correct rows - that is, my queries will always invoke the field that defines the clustered index key so the optimiser will always know where the relevant leaf-level information is to be found.

I would welcome some guidance here - I am quite happy to program in as many include lines as there are non-clustered indexes but I don't want to do this if it is a waste of time and resources.

Thanks.

Chris

russell
Pyro-ma-ni-yak

5072 Posts

Posted - 2010-12-02 : 07:52:24
An index with every field included is basically a copy of the entire table -- not very useful.

Include only the most frequently selected columns.

Suppose we search customers by zip code, but usually want to return their address too. This example is straight from Books Online (BOL):
CREATE NONCLUSTERED INDEX IX_Address_PostalCode
ON Person.Address (PostalCode)
INCLUDE (AddressLine1, AddressLine2, City, StateProvinceID);
GO



dataguru1971
Master Smack Fu Yak Hacker

1464 Posts

Posted - 2010-12-02 : 07:55:33
It is also likely recommending you include every column because of the Select * instead of Select <column names>.



Poor planning on your part does not constitute an emergency on my part.


webfred
Master Smack Fu Yak Hacker

8781 Posts

Posted - 2010-12-02 : 08:00:56
quote:
Originally posted by dataguru1971

It is also likely recommending you include every column because of the Select * instead of Select <column names>



Poor planning on your part does not constitute an emergency on my part.




Good point!!!!


No, you're never too old to Yak'n'Roll if you're too young to die.

cjp
Yak Posting Veteran

69 Posts

Posted - 2010-12-03 : 04:22:00
Thanks for your thoughts - very helpful. The point about Select * is well made and I will keep it in mind.

Chris

GilaMonster
Master Smack Fu Yak Hacker

4507 Posts

Posted - 2010-12-03 : 04:38:31
The missing index dmvs are a good place to start investigating indexes, not a good place to end.

--
Gail Shaw
SQL Server MVP

cjp
Yak Posting Veteran

69 Posts

Posted - 2010-12-03 : 05:20:15
Thanks for the advice re the index dmvs. Arising from this, I have spent some more time investigating the include option and now have a better understanding of what this is (and is not) designed to achieve. I had not previously registered (though I must have read it) that the include option enables actual data to be stored at leaf level in the non-clustered index: bit by bit, indexes are beginning to make sense.

Chris

Sachin.Nand

2937 Posts

Posted - 2010-12-03 : 06:43:58
quote:

I am wondering about this for one reason: following some on-line advice, I have re-jigged my indexes so that every field I use frequently in joins or Where clauses is in a one-field index:
lots of small indexes rather than fewer longer (compound) indexes. I am finding this very helpful but am wondering whether it makes sense to have a fully populated include line for every non-clustered index: (a) this looks like a lot of duplication and (b) all the information is held at
leaf level in the clustered index, so the optimiser hasn't got far to go to select the correct rows - that is, my queries will always invoke the field that defines the clustered index key so the optimiser will always know where the relevant leaf-level information is to be found.'




I personally think it is a bad idea to have a lot of small indexes. Imagine a scenario where you have three indexes, on column1, column2 and column3 respectively, and the values in each have high density. Say column1 matches 10,000 rows, column2 matches 8,000 rows and column3 matches 5,000 rows, but the combination of all three returns only 100 matching records. The optimizer in this case is going to use something called index intersection, or go for a straightforward scan, which will lead to high reads. But if you have a composite index on the three columns, for the above scenario the optimizer will go for a single index seek and result in few page reads.
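The two alternatives in that scenario would look something like this (the table and column names are illustrative, not from the original post):

```sql
-- Three narrow indexes: the optimiser may intersect them, or fall back to a scan.
CREATE NONCLUSTERED INDEX IX_MyTable_col1 ON dbo.MyTable (column1);
CREATE NONCLUSTERED INDEX IX_MyTable_col2 ON dbo.MyTable (column2);
CREATE NONCLUSTERED INDEX IX_MyTable_col3 ON dbo.MyTable (column3);

-- One composite index: a single seek can satisfy all three predicates at once.
CREATE NONCLUSTERED INDEX IX_MyTable_col123
    ON dbo.MyTable (column1, column2, column3);

-- The query the composite index serves with one seek:
SELECT column1, column2, column3
FROM   dbo.MyTable
WHERE  column1 = @v1 AND column2 = @v2 AND column3 = @v3;
```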


PBUH


cjp
Yak Posting Veteran

69 Posts

Posted - 2010-12-03 : 07:26:18
Thanks - this makes sense (and is what I used to do) but I have recently found something different in practice. I have a large table containing 700 million rows of data relating to medical tests - a table that has always been very slow to yield answers in multi-table queries. These tests can be indexed using either a code for each test or a code for each diagnosis that flows from a test: initially, I put both [testcode] and [dxcode] into the same compound index but found that queries on [dxcode] took an age and were using an index scan.

Having read Gail Shaw's articles on how to build indexes, I investigated the issue of index length and came across a recommendation in an SQL newsletter to rejig this kind of index so that it was two indexes rather than one - on the grounds that the query optimiser was likely just to read the leftmost part of the index and then, if the required value was not present, resort to a scan. When I dropped [dxcode] and remade it in its own index, the optimiser chose an index seek and the execution time for a 3-table query dropped from 15 minutes to 30 seconds.

This is why I have begun to experiment with multiple short indexes - I should admit, however, that it is early days for this investigation and I am open to other opinions since I am very much feeling my way through this.

Chris

Sachin.Nand

2937 Posts

Posted - 2010-12-03 : 08:09:58
It depends on the predicates you are using in the query.
Were you using only [dxcode] as a predicate, without an individual index on [dxcode]?

PBUH


GilaMonster
Master Smack Fu Yak Hacker

4507 Posts

Posted - 2010-12-03 : 09:08:53
Some reading for you:
http://www.sqlservercentral.com/articles/Indexing/68439/
http://sqlinthewild.co.za/index.php/2009/01/19/index-columns-selectivity-and-equality-predicates/
http://sqlinthewild.co.za/index.php/2009/02/06/index-columns-selectivity-and-inequality-predicates/
http://sqlinthewild.co.za/index.php/2010/09/14/one-wide-index-or-multiple-narrow-indexes/

I'm very much in favour of fewer wider indexes than lots of narrow ones.

--
Gail Shaw
SQL Server MVP

Kristen
Test

22859 Posts

Posted - 2010-12-03 : 09:28:34
"I put both [testcode] and [dxcode] into the same compound index but found that queries on [dxcode] took an age and were using an index scan."

yes, that's right.

Query on [dxcode] will not use index([testcode], [dxcode]) - well, it "Might", but usually not.

Query on [testcode] will normally use index([testcode], [dxcode]) or index([testcode])

If you want a lookup on [dxcode] (and nothing else) then your index needs to start with that column (or ONLY contain that column)

If you routinely want to test on values for BOTH [testcode] and [dxcode] then having a single index with BOTH columns will be best. In those circumstances put first the column that is used most commonly WITHOUT the second one, but also take into account the "selectivity" of the first column.

If [dxcode] contains only unique values, or values that are rarely repeated, but [testcode] does not, then putting [dxcode] first is more likely to cause the index to be used by SQL. For a facetious example: if [testcode] has only the values A, B or C, then SQL will only very rarely use an index of 700 million rows that leads with it.

These are all huge generalisations, so don't rely on them! But they may be of some help in filling in some gaps.

cjp
Yak Posting Veteran

69 Posts

Posted - 2010-12-06 : 04:12:44
Thanks for the ideas. The two fields I have used as examples both contain many different possible unique lookup values: there are hundreds of codes for different medical tests and >100,000 unique lookup values for the diagnosis (dx) codes: dx codes include thousands of entries that relate, for example, to consultation issues - eg turned up late for an appointment - hence the size of the lookup. Certain testcodes are likely to occur very frequently (eg body mass index, smoking, blood pressure). Any given dxcode is likely to be relatively infrequent in this table.

Depending on the kind of study I am trying to populate with data (for statistical analysis), I might need to focus on testcode or on dxcode - occasionally on both. Given the range of unique values in each field, and the fact that the optimiser would not use dxcode when it was the second element in a compound index, I had no choice but to try it on its own - and this worked well - as indicated previously.

By the way, this is not a disguised normalisation issue: I have a separate table of diagnosis outcomes but it just happens that every test code is required by the data protocol to be associated with a diagnosis/consultation code, so I can't really split up the tests table and deal with the dxcodes separately.

Kristen's remarks reflect closely the actual position in which I find myself.

As Gail advises, I strongly prefer to have fewer longer indexes (and doing it this way has been my practice - until I found out that the optimiser wasn't necessarily using the index in the way I anticipated). I will address her reading list today but I always remember - maybe it was she who said it - that, at some point, indexing is as much an art as a science!

Thanks again for giving me so much help.

Chris

Kristen
Test

22859 Posts

Posted - 2010-12-06 : 05:40:31
You probably know all this, but in case not: you can do

SELECT COUNT(DISTINCT dxcode), COUNT(*) FROM MyTable WHERE dxcode IS NOT NULL

to see the ratio of different, "distinct", values to overall rows. The closer the first number comes to the second the more "selective" the index is, and the more likely SQL will be to use it.

Having said that, if you have some values of [dxcode] that are rare, then the index will be highly selective for those values, and SQL will use the index when you query for those values (SQL stores "Statistics" on the distribution of values in the index, and uses that to decide whether the index is likely to be efficient, or not)

cjp
Yak Posting Veteran

69 Posts

Posted - 2010-12-07 : 03:43:22
These are very interesting ideas: I started to run the procedure Kristen suggested but stopped after a while because I knew it would take for ever. As it happens, I know the bottom line statistics anyway: there are about 500 distinct test codes/650,000,000 records... and perhaps 10,000 dx codes for the same number of records. I decided to run dbcc show_statistics instead since this is very fast. This confirmed high density for the dxcode field in this particular table (much lower density in the medical table, which is the main home of diagnosis and consultation records).

I wasn't terribly happy with the statistics because I felt that the sample size was too small, so I reran the dbcc procedure using a 25% sample size and ended up with numbers I can believe - still high density for dxcode. The upshot is that testcode and dxcode do need to be in separate indexes, I think - though I am considering building a compound index and then testing the effect of using a query hint to force the optimiser to use the index (the dxcode field will start at string position 11 in the index).
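For reference, resampling the statistics and re-reading them along those lines would look something like this (the table and statistics names are placeholders, not from the actual database):

```sql
-- Rebuild the statistics from a 25% sample, so rare values are better represented,
-- then inspect the density vector and histogram that result.
UPDATE STATISTICS dbo.Tests IX_Tests_dxcode WITH SAMPLE 25 PERCENT;
DBCC SHOW_STATISTICS ('dbo.Tests', 'IX_Tests_dxcode');
```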

Conceivably, using query hints to direct the optimiser to use a compound index could be a way forward if best practice is to have fewer longer indexes, as Gail suggests. However, I am conscious that only the leftmost field value will generate a single-field histogram - so the optimiser could choose a sub-optimal plan where a query depends on the second field in the index and the first field is not invoked.

I will keep exploring.

Chris


GilaMonster
Master Smack Fu Yak Hacker

4507 Posts

Posted - 2010-12-07 : 04:40:27
quote:
Originally posted by cjp

However, I am conscious that only the leftmost field value will generate a single-field histogram - so the optimiser could choose a sub-optimal plan where a query depends on the second field in the index and the first field is not invoked.


If a query references only the second column of an index, the best you can get is an index scan; that index is not seekable. An index is only seekable if the query filters on a left-based subset of the index columns. That's something I refer to repeatedly in the articles I listed for you.

Stay away from index hints unless you are totally, absolutely, 100% certain that you know better than the optimiser for all possible predicate values.

--
Gail Shaw
SQL Server MVP

cjp
Yak Posting Veteran

69 Posts

Posted - 2010-12-07 : 07:04:13
Thank you - I agree: the reason I went down the two-index route was that I was getting an index scan when invoking the second field in the composite index: I do apply your teaching...

As a matter of interest, I have been reading a variety of papers on the issue of many small indexes or fewer larger indexes. The following is relevant:

http://blogs.msdn.com/b/windchillonsql/archive/2010/06/18/sql-server-index-intersection-union.aspx

The author seems to be an enthusiast for multiple short indexes and adds:

'Avoid covering indexes except as an outlier solution until you can find a more enterprise solution'

I lack the knowledge to evaluate what he says but it is grist to the mill of this topic.

Chris

Kristen
Test

22859 Posts

Posted - 2010-12-07 : 07:36:30
Is there a reason not to have two compound indexes:

([testcode], [dxcode])
and
([dxcode], [testcode])

Both will work where only the first column is used in the query (but less efficiently than a single-column index).

As the application is largely read-only, would it be that "bad" to also have single-column indexes on [dxcode] and [testcode], I wonder?

(Disk space aside ...)

And if the application is "essentially read-only" then I would be inclined to update the statistics, once, WITH FULLSCAN so there are no poorly represented values.

And make sure the index FILLFACTOR is 100% and that nothing changes it (such as Wizard Maintenance Plans)
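Those two suggestions, sketched as T-SQL (the table name is a placeholder):

```sql
-- Sample every row when building the statistics, so even rare values
-- are represented in the histogram.
UPDATE STATISTICS dbo.Tests WITH FULLSCAN;

-- Rebuild all indexes with a 100% fill factor: pack the leaf pages full,
-- which is fine because the data is essentially read-only.
ALTER INDEX ALL ON dbo.Tests REBUILD WITH (FILLFACTOR = 100);
```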

GilaMonster
Master Smack Fu Yak Hacker

4507 Posts

Posted - 2010-12-07 : 08:02:23
quote:
Originally posted by cjp

http://blogs.msdn.com/b/windchillonsql/archive/2010/06/18/sql-server-index-intersection-union.aspx

The author seems to be an enthusiast for multiple short indexes and adds:

'Avoid covering indexes except as an outlier solution until you can find a more enterprise solution'


Allow me to provide a counterpoint
http://sqlinthewild.co.za/index.php/2010/09/14/one-wide-index-or-multiple-narrow-indexes/

I've found a lot (and I mean a LOT) of really bad advice on that windchill blog. I suspect that the author is not a SQL expert. I've posted a couple of mildly scathing comments there correcting blatant errors.

Covering indexes are widely accepted as the BEST way to get a query fast. The downside is that you can't afford (space, insert impact, maintenance impact, etc) to cover every query. So cover the important ones and then see how the rest perform.

--
Gail Shaw
SQL Server MVP

GilaMonster
Master Smack Fu Yak Hacker

4507 Posts

Posted - 2010-12-07 : 08:05:34
quote:
Originally posted by Kristen

Is there a reason not to have two compound indexes:

([testcode], [dxcode])
and
([dxcode], [testcode])


What I would typically do in that case is something like the following
([testcode], [dxcode])
and
([dxcode])

Queries that filter on both can use the first. Queries that filter on just testcode use the first. Queries that just filter on dxcode use the second.
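That pairing as DDL (table and index names are placeholders):

```sql
-- Serves filters on both columns, and filters on testcode alone.
CREATE NONCLUSTERED INDEX IX_Tests_testcode_dxcode
    ON dbo.Tests (testcode, dxcode);

-- Serves filters on dxcode alone.
CREATE NONCLUSTERED INDEX IX_Tests_dxcode
    ON dbo.Tests (dxcode);
```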

quote:
As the application is largely read-only would it be that "bad" to also have single-column indexes on [dxcode] and [testcode] I wonder?


Bad, no. Useful, almost certainly not. SQL would use them. Whether there would be the slightest benefit is a whole other matter.

quote:
(Disk space aside ...)

Backup space, backup time, maintenance time.


--
Gail Shaw
SQL Server MVP

Kristen
Test

22859 Posts

Posted - 2010-12-07 : 10:32:23
"Queries that filter on both can use the first. Queries that filter on just testcode use the first. Queries that just filter on dxcode use the second."

Indeed ... "it all depends"

If the query is "[testcode] BETWEEN A AND B" (or "[testcode] = A") then the ([testcode], [dxcode]) index is efficient

([testcode], [dxcode]) is also efficient for "[testcode] BETWEEN [... narrow range ...] AND [dxcode] BETWEEN [... any range ...]"

But having a ([dxcode], [testcode]) index as well would be good for "[testcode] BETWEEN [... wide range ...] AND [dxcode] BETWEEN [... narrow range ...]" (which a ([dxcode]) index would also solve, but less efficiently I think).

I think, given that this database is "almost static", that it may well be worthwhile creating these additional indexes - i.e. treating it more like a data warehouse. Give query-time performance higher priority than space (within reason!).

"Backup space, backup time, maintenance time."

Yup, that too! But putting any intermediate data in a separate database would allow the primary database to be READ-ONLY and only backed up as and when it changed - or even "never", if it can be regenerated from some import data that is itself backed up / regenerate-able.