String-based nested sets - SQL Server Forums

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

General SQL Server Forums

New to SQL Server Programming

String-based nested sets

Author

Topic

waveform
Yak Posting Veteran

93 Posts

Posted - 2011-08-16 : 06:36:25

I've been reading up on hierarchies (nested sets) in SQL 2000, as we're still using that here. It seems that "string-based" nested sets would work well and is fairly easy to understand.

One article suggested using a varchar type as the "lineage" string (he calls it a SetString), with each character code being a "sequence number", and each character position representing the depth. See here: http://www.yafla.com/papers/sqlhierarchies/sqlhierarchies2.htm

It's an old article, so I'm wondering if this is still a good solution for handling hierarchies in SQL 2000, or if there is a better way?

Also, is there perchance a pre-written set of procedures I can use, for the standard operations like insert node, delete node, move node, etc?

sunitabeck
Master Smack Fu Yak Hacker

5155 Posts

Posted - 2011-08-16 : 07:17:27

I am not necessarily recommending that you run out and buy this book, and I have not read it, but I have heard good things about Joe Celko's book on trees and hierarchies: http://www.amazon.com/Hierarchies-Smarties-Kaufmann-Management-Systems/dp/1558609202/ref=ntt_at_ep_dpt_4

Seventhnight
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2011-08-16 : 09:56:53

There are alot of old topics on heirarchies here on SQLTeam...

http://www.sqlteam.com/article/more-trees-hierarchies-in-sql
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=2828

Corey

I Has Returned!!

waveform
Yak Posting Veteran

93 Posts

Posted - 2011-08-16 : 10:16:12

Thanks Seventhnight, yes I was reading those. As I said, they're rather old, so my specific question was whether the "lineage string" approach was still a good one. I'd just like an opinion on that, as I'd like to use it but won't if warned off it in favour of a better approach.

Seventhnight
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2011-08-16 : 10:45:30

For SQL 2000, I have NOT seen anything to indicate that a 'lineage string' approach is a bad approach... so if you want to go that route, I'd say go for it.

I do have some examples of the various operations, but I'll have to dig them up when I get home.

Corey

I Has Returned!!

waveform
Yak Posting Veteran

93 Posts

Posted - 2011-08-16 : 11:04:48

Thanks Corey, our DBA has left (probably sick of SQL 2000) and I have to fill in, so just felt like I needed some reassurance. :) I'll try the string technique in that yafla article and see how it goes. And if you happen to have some pre-written operations to support it, that would be awesome, thank you.

Seventhnight
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2011-08-16 : 11:12:16

I'm not sure I would use that particular article... the concept is a bit limiting.

You have to know alot about levels and you can only have 26 direct reports (unless you branch out from letters or go case sensitive)...

I generally used a list of the ids:

0001;
0001;0002;
0001;0002;0004;
0001;0003;
0001;0003;0005;

etc...

You can tell what level something is, but you can also easily determine who the 'bosses' are with a single join condition.

Corey

I Has Returned!!

waveform
Yak Posting Veteran

93 Posts

Posted - 2011-08-16 : 11:24:43

Thanks Corey. I can see there might be a balance to be struck. I was thinking of using "hex" but extending to the full alphabet, which only uses 2 chars but allows for up to 1296 siblings, and 100 deep if I use a varchar(200) with no separators. That would be enough for this app. Are the separators necessary?

Other than that, is the SQL in the article ok? It's saying the only columns I need to maintain it are parent ID, lineage string and depth. Is that design ok?

Seventhnight
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2011-08-16 : 11:34:15

Seperators are necessary if you aren't using hints.

For example, these are hints (a given position doesn't directly identify an entity, the 'level' is important):
A
AA
AAA
AAB
AB
ABA
ABB
B

For example, using actual ids:

0001;
0001;0002;
0001;0002;0004;
0001;0003;
0001;0003;0005;

For children of id#3, you would do something like this: ';'+path like ';'+right('0000'+convert(varchar,id/* 3 */),4)+';'

In general the concept is fine, if the limitations are acceptable.

I prefer the actual Id method because it is a bit more readable, and you don't have to keep up with a 'depth'. That way, there is less to screw up.

Corey

I Has Returned!!

robvolk
Most Valuable Yak

15732 Posts

Posted - 2011-08-16 : 12:20:13

quote:
Other than that, is the SQL in the article ok? It's saying the only columns I need to maintain it are parent ID, lineage string and depth. Is that design ok?

If you're talking about this one: http://www.sqlteam.com/article/more-trees-hierarchies-in-sql

I'd say it's a reasonable approach for SQL 2000 and earlier, but with 2005 and Common Table Expressions the need for storing the lineage is reduced, since it can be recursed fairly easily. And SQL 2008 introduced hierarchyid that offers a compact binary form of lineage/enumerated path.

quote:
I was thinking of using "hex" but extending to the full alphabet, which only uses 2 chars but allows for up to 1296 siblings, and 100 deep if I use a varchar(200) with no separators. That would be enough for this app. Are the separators necessary?

If you're going that route, you could used fixed-length int values and append them in a varbinary column:

CREATE TABLE Tree2 (
Node SMALLINT NOT NULL IDENTITY(100, 1),
ParentNode SMALLINT, 
EmployeeID INT NOT NULL,  
Depth AS LEN(Lineage)/DATALENGTH(Node),
Lineage VARBINARY(255) NULL )

INSERT Tree2(EmployeeID) SELECT employeeid FROM employees

UPDATE T SET T.ParentNode=P.Node
FROM Tree2 T 
INNER JOIN Employees E ON T.EmployeeID=E.EmployeeID
INNER JOIN Employees B ON E.BossID=B.EmployeeID
INNER JOIN Tree2 P ON B.EmployeeID=P.EmployeeID

UPDATE Tree2 SET Lineage=0x WHERE ParentNode IS NULL

UPDATE T SET T.Lineage = P.Lineage + CAST(T.ParentNode AS BINARY(2))
FROM Tree2 AS T 
INNER JOIN Tree2 AS P ON (T.ParentNode=P.Node) 
WHERE P.Depth>=0 
AND P.Lineage IS NOT NULL 
AND T.Depth IS NULL

WHILE EXISTS (SELECT * FROM Tree2 WHERE DEPTH IS NULL) 
UPDATE T SET T.Lineage = P.Lineage + CAST(T.ParentNode AS BINARY(2))
FROM Tree2 AS T 
INNER JOIN Tree2 AS P ON (T.ParentNode=P.Node) 
WHERE P.Depth>=0 
AND P.Lineage IS NOT NULL 
AND T.Depth IS NULL

SELECT SPACE(T.Depth*2) + E.Name AS Name
FROM Employees E 
INNER JOIN Tree2 T ON E.EmployeeID=T.EmployeeID
ORDER BY T.Lineage + CAST(T.Node as binary(2))

The advantage is Depth becomes calculated based on the length of the Lineage column. I chose smallint because you can pack 65536 values into 2 bytes, which sounds like enough for your purposes. If you think you can get by with 256 values then you can go with tinyint (1 byte) and adjust the binary casts accordingly.

The binary lineage also closely resembles the way hierarchyid is implemented in SQL 2008, but it's not truly compatible. The downside is that some of the functions, like finding parents, children, etc., are harder because of the binary conversion. In that instance I agree with Corey, using actual IDs is better.

waveform
Yak Posting Veteran

93 Posts

Posted - 2011-08-16 : 12:31:18

Ah I see, the query is ok if you're comparing from the start or end of the string, but if it's somewhere in the middle, you need the separators to match where a number begins. Some things become obvious once they're pointed out. :)

(oops, posted this the exact time you repled again, so yet to peruse that code, but thank you muchly!)
No, I meant the yafla article:
http://www.yafla.com/papers/sqlhierarchies/sqlhierarchies2.htm

The article also suggested using Unicode. An nvarchar column, one character per level, each character having a value from 0 to (whatever Unicode goes up to). Unicode uses anything up to 6 bytes per character I believe, but assumedly SQL Server handles that behind the scenes? If so, isn't that the best option in terms of space efficiency and ease of use? The column would be unreadable, but 1 character per level is easy to code for.

robvolk
Most Valuable Yak

15732 Posts

Posted - 2011-08-16 : 12:34:29

Unicode is handled in SQL Server via nchar and nvarchar, which use 2 bytes per character only. You definitely don't want or need to use Unicode for this.

The binary solution I posted is probably the most space efficient you can get in SQL 2000, as it eliminates delimiters and minimizes the length of the data types used.

waveform
Yak Posting Veteran

93 Posts

Posted - 2011-08-16 : 13:34:41

Wow.. I can code well enough but SQL does my head in. :)
If I understand correctly:
1. Separate table used for tree info.
2. Smallint for Node = 2 bytes per level.
3. Node auto-numbers from 100.
4. Depth is automatically set - great idea!
5. Populate tree table with pointers to employee records.
6. Update ParentNode to the Node value of the record pointing to that person's boss.
7. Set Lineage to 0x where nodes have no parent.
8. Update immediate children of the top (parent-less) nodes.
9. While unassigned Tree records exist, do the same as above.

Questions.. :)
1. Is using a separate Tree table the preferred method?
2. Why does the Node start numbering from 100?
3. Does setting Lineage to 0x make Depth = 0 or 1?
4. Is the UPDATE statement before the WHILE necessary? It seems to do the same thing as what happens inside the WHILE.
5. A *2*-byte binary value is added to Lineage.. but if varbinary is a set of 1-byte values, how does that work? Don't we have the same problem using LIKE without the separators?
6. Smallint goes from -32768 to 32767, so does making it IDENTITY make it unsigned?
7. Why not also store the node's own Node value on the end of Lineage, then you won't need + CAST(T.Node as binary(2)) in queries?

Apologies if I'm being daft. :)

robvolk
Most Valuable Yak

15732 Posts

Posted - 2011-08-16 : 14:54:36

1. It's not strictly necessary, but it's a good idea since it's a fairly narrow table with fixed-size columns, except for the lineage.
2. That's just from the original article, you can choose any seed and increment.
3. It makes Depth=0 (LEN(0x) is zero, so is DATALENGTH)
4. Nope, I copied it from the article without reading the whole thing.
5. I'm using smallint as the data type for nodes, which is 2 bytes, therefore each node added to lineage has to be 2 bytes as well. You'd have to extract 2 bytes to find a specific node. LIKE wouldn't be used to extract a node from this kind of lineage value.
6. Nope, it's still a signed integer, but all values store in 2 bytes.

Keep in mind that node IDs and Employee IDs are completely separate (that's another advantage of using a separate Tree table) so you don't have to use the same data types to model a hierarchy that you use to store the data being modeled that way. Also remember that you don't have to insert all the child nodes into the Tree table, just a node for anyone who has children. Therefore, as I said earlier, if you can model your parent nodes with 256 (or fewer) distinct values, then tinyint could be used and therefore use only 1 byte. You'd have to tweak your queries a bit since you'd only store lineage for parent nodes, but that in turn saves even more space.

waveform
Yak Posting Veteran

93 Posts

Posted - 2011-08-17 : 03:24:26

Many thanks for all your help Rob. I'll go and have a play with a couple of test tables, and hopefully sort out the SQL for adding, deleting nodes, etc.

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources