Performance of composite keys...

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2008 Forums

Analysis Server and Reporting Services (2008)

Performance of composite keys...

Author

Topic

malachi151
Posting Yak Master

152 Posts

Posted - 2012-01-20 : 10:38:47

I'm having some issues with composite keys and am not sure exactly how they work or exactly what the best practices are that I should be using here.

I've got a table that uses a natural PK, with a composite key that is comprised of 4 columns.

I've structured the key so that the most variable column is first, and generally the data becomes less variable with each subsequent column.

So, for example, its something like:

policy id, policy date, file id, customer id.

The data for this may look like the following:


policy_id | policy_date | file_id | customer_id
1         | 1/1/2000    | 1       | ABC
1         | 2/1/2001    | 1       | ABC
2         | 3/1/2005    | 1       | ABC
3         | 4/1/2007    | 1       | ABC
3         | 4/1/2008    | 1       | ABC
3         | 4/1/2009    | 1       | ABC
4         | 1/1/2010    | 1       | ABC
1         | 3/1/2005    | 2       | XYZ*
2         | 8/1/2008    | 2       | XYZ*
3         | 2/1/2003    | 2       | XYZ*
4         | 6/1/2001    | 2       | XYZ*
1         | 1/1/2000    | 3       | ABC
1         | 1/1/2001    | 3       | ABC
2         | 3/1/2005    | 3       | ABC
2         | 3/1/2006    | 3       | ABC
3         | 4/1/2007    | 3       | ABC
4         | 1/1/2010    | 3       | ABC

So the issue is, if my WHERE clause is something like WHERE customer_id = 'XYZ*' then performance is pretty slow on a table with 250K rows and about 30 different "file_ids" across about 15 different "customer_ids" (not the real column names).

I see that whenever my WHERE clause references policy_id an Index Seek is performed, but if I reference just one of the "inner" columns in the index without referencing the first column in the index then an Index Scan is performed.

Given that there is a lot of code that uses all different combinations of the indexed columns, what is really the best approach here?

There are queries that join on "policy_id" and "policy_date", other that group by "customer_id" or "file_id", other that join on file_id or cusotmer_id. File_id is used frequently in the WHERE clause, etc.

Of course I showed the data up top how it logically makes sense, but based on the PK its really ordered as follows:


policy_id | policy_date | file_id | customer_id
1         | 1/1/2000    | 1       | ABC
1         | 2/1/2001    | 1       | ABC
1         | 3/1/2005    | 2       | XYZ*
1         | 1/1/2000    | 3       | ABC
1         | 1/1/2001    | 3       | ABC
2         | 3/1/2005    | 1       | ABC
2         | 8/1/2008    | 2       | XYZ*
2         | 3/1/2005    | 3       | ABC
2         | 3/1/2006    | 3       | ABC
3         | 4/1/2007    | 1       | ABC
3         | 4/1/2008    | 1       | ABC
3         | 4/1/2009    | 1       | ABC
3         | 2/1/2003    | 2       | XYZ*
3         | 4/1/2007    | 3       | ABC
4         | 1/1/2010    | 1       | ABC
4         | 6/1/2001    | 2       | XYZ*
4         | 1/1/2010    | 3       | ABC

Which means that any filter based on file_id going to have to move around a lot to find all the rows, because the file_id is broken up.

But I also put a non-clustered index on "file_id", hoping that would help, but it doesn't see to be used, query plans still Scan the PK if my WHERE clause is just WHERE file_id = 1.

So anyway, any suggestions here on how to improve the indexing in this scenario?

Kristen
Test

22859 Posts

Posted - 2012-01-20 : 10:55:27

I'm not sure that selectivity of the first key column [for a clustered index] is important (whereas it normally would be for a non-clustered index). For the clustered index you want to be able to find a specific row easily (i.e. when you supply all 4 key values), and also ranges of rows. If your most common range-query is based on customer_id range, or customer_id + file_id range, or customer_id + file_id + policy_date range then I would reverse the order of the keys. (I suspect that that will be the case from your description of the lack of selectively)

If you then want specific queries on column-combinations that do NOT include the customer_id (or whatever your first key-column is) then create secondary indexes for those. An index on "policy_id, policy_date" will likely be used both queries on "policy_id, policy_date" and also those on "policy_id" or ranges of "policy_id"

You can add some INCLUDE columns to try to "cover" the other columns used in such queries (e.g. columns in the SELECT list but not in the WHERE / JOINs), but there is a law of diminishing returns - putting every column in the INCLUDE list for every index is going to mean massive indexes and very little extra gain!!!

Note that indexes have to be updated, as well as the core record, when a column changes that is used in an index, so I try to only reference columns in my indexes that change relatively infrequently; but it depends on the circumstances: if you SELECT something 1,000 times more than you UPDATE it then the "cost" of having to update the index when the column changes is probably worthwhile.

Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2012-01-20 : 11:19:58

And remember that you don't *have* to have your primary key CLUSTERED. If you want the natural key then you could make it NONCLUSTERED and then create a CLUSTERED index on the fields that you really want to search on.

Charlie
===============================================================
Msg 3903, Level 16, State 1, Line 1736
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources