Non Clustered PKey / Separate Clustered Index

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

General SQL Server Forums

Database Design and Application Architecture

Non Clustered PKey / Separate Clustered Index

Author

Topic

Kristen
Test

22859 Posts

Posted - 2015-04-23 : 12:38:20

I could do with feedback on whether folk perceive any issues with this approach.

We have some tables which have multi-part keys. This is particularly the case for Child Tables. e.g. Customer Table and then an Address Table where we have one/many Addresses for a single Customer.

The address table might have CustomerID and AddressItemNo columns.

I would create a clustered index for CustomerID, AddressItemNo

In the past I would just create a Primary Key, with clustered index, for those two columns.

Increasingly we find it easier to reference records by a single-part key (typically IDENTITY). This makes life easier for mechanically generating CRUD SProcs, Forms for Web pages, Links to Audit Trail Changes-Tables, synchronisation with read-only copies ... all sorts of stuff.

So my inclination is to have a PKey of the Identity ID, using a non-clustered index, and a Clustered-index of the CustomerID,AddressItemNo combination (which will favour joins and the like).

I could, of course, just have a non-clustered index on the Identity ID column, but all the stuff that mechanically generates stuff is "easier" to build if it just hooks onto the Primary Key. Hence I'd like to use the Primary Key as a single-column unique reference to a record.

(In the case of the Customer Table then the PKey would have a clustered index).

Anything anyone can spot as to whether this might be a bad idea? or if you just hate it?!! Or anything else that occurs to you?

Thanks

robvolk
Most Valuable Yak

15732 Posts

Posted - 2015-04-23 : 23:24:19

Only downside I see is at the physical storage level, your clustered key is twice as wide as the identity column, and will be copied into every non-clustered index. You'll also get more page splits.

Is CustomerID,AddressItemNo unique? And non-nullable? Maybe just make that a clustered primary key and forego the identity completely?

Kristen
Test

22859 Posts

Posted - 2015-04-24 : 03:48:07

quote:
Originally posted by robvolk

Is CustomerID,AddressItemNo unique?

Yes, it is what would be most commonly considered for the Primary Key.

If I make that the Primary Key won't that be used in all the secondary indexes thus making them "wider"? - i.e. whether it is the Primary Key or not (i.e. I would always use it as the Clustered Index as it is the best candidate).

So my question is whether there is any different between:

CLUSTERED PRIMARY KEY (CustomerID,AddressItemNo)
UNIQUE NONCLUSTERED INDEX (AddressIdentity)

and

NONCLUSTERED PRIMARY KEY (AddressIdentity)
UNIQUE CLUSTERED INDEX (CustomerID,AddressItemNo)

If there is no difference it is much easier for all my ancillary code to just latch the Primary Key.

Otherwise I will need to build some additional Meta Data to indicate which index to use. I could have my stuff detect the first available Unique, Single-column, Index - but if there were two of those, or another was added later, there would be risk that one part of the system had been built using one index and a part that was built later chose a different, more recently added, index.

(The "stuff" I'm referring to is code to mechanically generate SQL etc. Currently it uses the Primary Key, but if the Primary Key is multi-column it needs hand-editing because I haven't had the time to change it all to dynamically handle multiple columns. So a quick-fix for me, going forwards, would be to ensure that [wherever possible

] Primary Key is a one-column index)

robvolk
Most Valuable Yak

15732 Posts

Posted - 2015-04-24 : 07:11:55

My point wasn't really which key to make primary, but which one to cluster on. Non-clustered indexes will include the clustering key if the table is clustered, or the row identifier (RID) consisting of FileID:PageID:SlotID for a heap:

https://msdn.microsoft.com/en-us/library/ms190457.aspx

Generally speaking, you'd want a narrow clustering key so that it minimizes storage overhead in the non-clustered indexes. An additional advantage that identity brings is that it increases in value and tends to avoid page splits...new rows are appended to the end of an existing page or allocate a new, adjacent page. With today's storage this probably doesn't matter that much anymore, especially if you're using SSDs.

So my suggestion is basically, cluster on the identity column, regardless of whether it's primary key or not. If you can't or don't want to cluster on that, I don't see any benefit to having it at all.

Kristen
Test

22859 Posts

Posted - 2015-04-24 : 12:28:42

Thanks Rob.

So I am saying that I currently have some multi-part keys, such as ParentID,ChildItemNo on a child table and I am thinking of ADDING an IDENTITY for occasions where I want to reference the Child Record directly and have the simplicity of the Single Part Key.

You are saying "Always use a single part key, preferably something short like an IDENTITY".

I hadn't thought of it like that!

I suppose it is because we have tended to use ItemNo on Child tables (rather than Identity) such that ItemNo was only unique to a specific ParentID. Its too long ago to remember why we did it like that! but my guess is that we thought it nicely End User Friendly:

Customer 123456 Address ONE or Address TWO rather than Address 567890 !!

So if I just change from ChildItemNo to ChildIdentity I will have solved my problem.

Just wanted to check that that is the precise point you are making?

tkizer
Almighty SQL Goddess

38200 Posts

Posted - 2015-04-24 : 12:59:46

We have a logging database that logs nearly everything that's done on our web sites, web services, etc. As we enter peak season, we can't keep up with the logging INSERTs because of the clustered identity. It becomes a hotspot with extreme amounts of inserts. There are articles to workaround it, but it's something to keep in mind.

Tara Kizer
SQL Server MVP since 2007
http://weblogs.sqlteam.com/tarad/

robvolk
Most Valuable Yak

15732 Posts

Posted - 2015-04-24 : 20:19:23

quote:
You are saying "Always use a single part key, preferably something short like an IDENTITY".

Not exactly. For a CLUSTERING KEY, yes, that's generally good advice, but not always. (See Tara's reply)

Note the difference between CLUSTERING and PRIMARY. One is physical (clustering), the other is logical (primary). One of them identifies your rows uniquely (phone number), the other lays out your data on your storage media (last name, first name). (It's a helpful mental exercise to consider the phone directory with these factors reversed)

Sorry for being pedantic. What I'm hoping you avoid is blindly adding identity as PRIMARY KEY because 1) it defaults to clustered and 2) that clustering makes it convenient for physical storage efficiency. You can always make your identity clustered and unique, but not primary key, purely for storage reasons, and never join it to anything else. That might sound useless or contradictory, but hopefully it gets you thinking about what the keys are for and that they function the way they're intended.

This is a tricky topic for me to discuss, I've been seriously beaten up over it before. I'm not sure I'm clearly explaining what I mean.

quote:
Originally posted by tkizer

We have a logging database that logs nearly everything that's done on our web sites, web services, etc. As we enter peak season, we can't keep up with the logging INSERTs because of the clustered identity. It becomes a hotspot with extreme amounts of inserts. There are articles to workaround it, but it's something to keep in mind.

Exactly. There's tradeoffs for any key choice you might make, which is the nature of design. One article in particular:

http://kejser.org/boosting-insert-speed-by-generating-scalable-keys/

And if you want to read up on some other interesting factors and alternatives:

http://kejser.org/table-pattern-rotating-log-ring-buffer/
http://kejser.org/implementing-message-queues-in-relational-databases/
http://kejser.org/good-keys-what-are-they-like/
http://kejser.org/exploring-hash-functions-in-sql-server/

Please note that many of these options are for extremely high performance in specialized scenarios, and NOT general purpose solutions. If you dig around his blog or his SQLBits presentations he'll describe instances where clustered uniqueidentifier offers the best performance.

Kristen
Test

22859 Posts

Posted - 2015-04-25 : 04:21:17

quote:
Originally posted by robvolk

[quote]This is a tricky topic for me to discuss, I've been seriously beaten up over it before. I'm not sure I'm clearly explaining what I mean.

Exactly why I'm asking here as I am looking for any flaws before changing decades of "we always do it this way"

. All opinions respected.

"What I'm hoping you avoid is blindly adding identity as PRIMARY KEY because 1) it defaults to clustered"

Yup, pls assume I'm definitely on-side with that. I understand the difference, and we do NOT automatically CLUSTER the PKey (90% of the time YES, but we do consider the other 10%)

The numb of what I'm debating with myself is:

If I think that, say, ClientID,AddressItemNo is the best candidate for the Clustered Index in the past I would (99%) also make that the PKey. If I also had an IDENTITY column in that table I would create a UNIQUE NONCLUSTERED INDEX.

Mostly, back then, we did not have an additional IDENTITY column in the table, we just used the multi-part PKey of ClientID,AddressItemNo whenever we needed to retrieve a specific record. The new change of strategy is to always have an IDENTITY, specifically so that any record can be identified, uniquely, by a single-part key.

I am now changing my view around to any table where a multi-part key is the best candidate for a Clustered Index then that multi-part key will be created as a UNIQUE CLUSTERED INDEX (which is not also the PKey) and the IDENTITY column will be used as the PKey (NonClustered)

My reason for doing this is solely because all my other Gear (mechanically generated Web Forms, Data Synchronisation, Rollout of data-snippets from DEV to PROD and One-Project to Another-Project) can then just look for the PKey to get a single, one-part, unique key. I don't have to provide any "hint" as to which Index my Gear has to use (on each and every table), the database already has an Attribute for "Primary Key" so I can ride on that.

A separate issue which then arises is "What is the best Clustered Index" for any table.

Where the clustered index needs to be different that's fine, but I will still only ever have a one-part key as the PKey. In some tables that will be the obvious candidate for the Clustered Index (Client Table: ClientID) in other tables it won't be (Address table)

Tara's point about Identity Hot-Spots is interesting. We don't have anything like the volume of data that she does, but we don't have anything like the heavy-iron hardware either!

We have a LOG table for each SProc execution. Each SProc (execution) concatenates its parameter list into a string and INSERTS that into the Log Table. On completion it UPDATES that record with the Elapsed time. We peak at around 500 inserts/second. The table has PKey=IDENTITY and it seems to perform OK. The biggest problem we had (which Spirit1 solved for me, many years ago) was that the SProc that inserted the row in the Log Table used a Function to decide what UserID to store. That Function (as it turned out) recompiled every time and blocked as a consequence, at the time it killed us until we discovered what the problem was.

My colleagues thought I was daft putting a Logging INSERT in every SProc execution ... but it performs just fine (the daily DELETE of 10M rows is a bit more of a challenge!!) and is a really boon when we have to diagnose some problem as all our users are out on WWW so absolutely no chance of wandering round to their desk and saying "What exactly did you do?"

gbritton
Master Smack Fu Yak Hacker

2780 Posts

Posted - 2015-04-25 : 10:08:13

complicated topic, to be sure. Consider:

1. your CI columns are included in every NCI index entry. IF your CI is wide, your NCIs will be bigger than with an IDENT-based CI
2. If you have FKs refering to your table, they will point to the PK or another Unique Key of your reference table, so you ideally want that key to be narrow

1 & 2 together tend to lean towards using an IDENT PK, CLUSTERED.

However, if you have few (or small, fltered) NCIs and/or few FKs pointing to your table, you can choose a CI based on business keys (unique, hopefully) and an IDENT PK for reference, as suggested. however, know that now your PK entries will include the CI columns. If they are wide, your PK will be bigger with more B-Tree levels.

other important thing for CIs: How are inserts done? if not in CI order, consider an IDENT or perhaps another unique, narrow-ish column (datetime?) that reflects the insert order. this to avoid page splits and fragmentation

Kristen
Test

22859 Posts

Posted - 2015-04-26 : 05:41:09

If I take a specific example what is your view on the best Clustered Index (and would you possible use a different non-clustered PKey?)

Table: Order Header

This one seems easy to me. The user needs an "Order Number". I doubt very much that this will be provided by the End User, so the system allocating next-available-number is fine.

So Order Header ID will be IDENTITY and that will be the Clustered Primary Key.

Table: Order Items

In the old days we would have used OrderHeaderID and OrderItemNo. OrderItemNo would have been some 1,2,3 sequence PER OrderHeader. We would have created a Clustered Primary Key on OrderHeaderID, OrderItemNo

In our OLTP App the addition of new orders would be less frequent than SELECTing them, and when orders were selected it would most commonly be "All Order Items for a specific Order Header", so I had always assumed that a clustered index on OrderHeaderID, OrderItemNo was the best choice (maybe not though??)

Now we have added an IDENTITY column for Order Item ID in the Order Item Table. We only use this when we refer to a single specific row in Order Items table. Its "more convenient" that referring to the row by OrderHeaderID, OrderItemNo (although both are unique)

We declare OrderItemNo as INT but, in practice it could be smallint or even maybe tinyint. Dunno if that is worth considering .. if I leave it as INT then no different in key size whether I use

OrderHeaderID, OrderItemNo
or
OrderHeaderID, OrderItemID

Following on from that I would now change Primary Key on Order Item table to be NONClusteed on OrderItemID, and have Clustered Index on OrderHeaderID, OrderItemNo

I could consider doing-away with OrderItemNo and making the Clustered Index OrderHeaderID, OrderItemID.

OrderItemNo is a User-Thing ("Can we ship Item 4 on order 123456 yet"), so is basically a Natural Key and, as such, perhaps should NOT be part of the Primary Key nor Clustered Index - although in practice it never changes. But we do have to "allocate it" (SELECT MAX(OrderItemNo)+1 FROM OrderItems WHERE OrderHeaderID = 123456)
and it would be cheaper to allocate an IDENTITY instead (and given that IDENTITY is always increasing I suppose it would sequence Order Items in "created order" which is exactly what we are doing with OrderItemNo).

So my questions are:

Is OrderHeaderID, OrderItemNo (or OrderHeaderID, OrderItemID - given they are both INT,INT) the best clustered index for Order Item table?

If you were building an Order Item table Would you use OrderItemNo (i.e. 1,2,3,... per Order Header) or OrderItemID (IDENTITY) ?

Thanks

gbritton
Master Smack Fu Yak Hacker

2780 Posts

Posted - 2015-04-26 : 08:19:22

How are rows inserted? In what order?

Kristen
Test

22859 Posts

Posted - 2015-04-26 : 09:13:32

Ascending ID for Order Header.

Then Ascending Order (ID and ItemNo) for Order Item.

Order Items will be added [concurrently] for multiple Order Headers

e.g.

Create Order Header 10000
Add Item 1 to Order 10000

Create Order Header 10001
Add Item 1 to Order 10001

Add Item 2 to Order 10000
Add Item 3 to Order 10000

Add Item 2 to Order 10001

Add Item 4 to Order 10000

It occurs to me that if I were to use IDENTITY instead of Item No for Order Item table then I would only be adding Increase Numbers of ItemID

That won't help if the Clustered Index is OrderHeaderID,OrderItemID - but it would if it was just OrderItemID

Kristen
Test

22859 Posts

Posted - 2015-04-26 : 09:14:35

P.S. RATE of Order Item insert is probably "modest" or even "low" (relatively to everything else going on in the database), so a page split is probably not a significant issue (other than index fragmentation, but we reorganise indexes regularly)

gbritton
Master Smack Fu Yak Hacker

2780 Posts

Posted - 2015-04-26 : 15:49:46

So, either way you have only 8-bytes (two ints) for your CI entries. Not too bad. Any NCIs or FKs using referring to the table? If not (or if only one or two NCIs) I'd stick with OrderID, OrderItem. For FKs, you said you have an additional IDENT column right? Put a UNIQUE index on that and use it as the FK reference column. Or course, the unique index entries will include the CI columns, but if you use Page compression you can mitigate the space usage somewhat.

Kristen
Test

22859 Posts

Posted - 2015-04-27 : 07:36:47

Ah, I think I see where you guys are coming from.

At the level I have described it it makes no difference.

It is sensible to use a Clustered Index of CustomerID, OrderIemNo. If I want the Primary Key to be OrderID (Non Clustered) that's fine too.

Where the crux will come is if I add a child table to Order Item. Lets' say a table called Shipments

Shipments are against an Order and also an Order Item.

I could have a Clustered Index of OrderID, OrderItemNo, ShipmentNo

Or, better, would be a Clustered Index of OrderID, ShipmentNo


SELECT Col1, Col2, ...
FROM OrderHeader AS H
     JOIN OrderItem AS I
         ON I.OrderID = H.OrderID
     LEFT OUTER JOIN Shipment AS S
         ON S.OrderItemID = I.OrderItemID

rather than


     LEFT OUTER JOIN Shipment AS S
         ON S.OrderID = H.OrderID
        AND S.OrderItemNo = I.OrderItemNo

We definitely have, a few, 3, 4 and 5 part keys of children-of-children-of... which linger back to the old days. Every time I have to construct some code to reference them its a PITA ... so I would much prefer something slimmer.

Presumably in Shipments I should NOT store the OrderID ... because I can derive OrderID via OrderItemID, if I need it. (1st normalisation).

I say "presumably" because: are there any good reasons to consider duplicate-storing OrderID in Shipments table? I have been inclined to do that, on occasions, because it makes it easier to "sort stuff out" when just looking at the raw data in the Shipment table. Its a lot of extra data to store for the odd occasion when "sorting stuff out" is needed. (There's the risk that the Order ID changes and I would then need to cascade that change, but in practice that is never going to happen, so its only "for convenience".

Do all of you NEVER redundantly-store data like that?

My current work-around for that is to always make a "Lookup View" for each table. I use a naming convention to make it easy to "guess" the VIEW name, something like this:


CREATE VIEW Shipment_VIEW
AS
SELECT S.ShipmentID as V_ShipmentID,
       I.OrderID as V_Shipment_OrderID,
       H.CustomerID AS V_Shipment_CustomerID
FROM Shipment AS S
     JOIN OrderItem AS I
         ON I.OrderItemID = S.OrderItemID
     JOIN OrderHeader AS D
         ON H.OrderID = I.OrderID

so that I can (on any table) do:


SELECT *
FROM Shipment
     JOIN Shipment_VIEW
         ON V_ShipmentID = ShipmentID
WHERE ShipmentID = 12345

and not have to worry about making all the JOINs to associated (i.e. Parent) tables on every query I write

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources