Linux-Egypt Forums - لينكس-مصر - Database

View Full Version : Database - Storing a Tree Structure

shafaki

What is the best way to store a tree structure in a database?

The database will store data about animals in the animal kingdom.

The animals will be classified such that each animal belong to a specific GENUS,
each genus will belong to a specific FAMILY,
each family will belong to a specific ORDER,
each order will belong to a specific CLASS,
and finally each class will belong to a specific PHYLUM.

Read the other way around, each phylum has many classes, and each class has many orders, ... and so on till each genus has many animals.

At the end, by knowing to which genus an animal belongs, one can know to which family, order, class and phylum it belongs.

What's the best way to store such kind of data in a database for easy retrieval.

I hope a database guru would answer this one.

Ashraf Al Shafaki

SYStems

Okay, I am not a db guru .. far from it
but i love databases ... so anyway

you already solved , i really dont know why
you think you have a problem!!!

the first thing you should do now is find a good

entity relation ship diagrammer...

then... draw your entities
phylum | class | order | etc...

then draw the relation between entities
there is different quinds of relations

one to one
one to many
many to many

now each one of those entities will have
attributes

entity Animal for example might have the following attributes
Name | Weight | Color | .... | GenusFK

those are the columns of your animal table
now each row should have a PK (primary key)
a primary key is an attribute a column that have a unique value in each row and always have a values (not null)

Now Genus itself is an entity
with different attribute
GenusPK | Name| Attr1 | Attr2 | Attr 3 | ....

One of those would be it's primary key ...

Now you said that each animal belong to one genus , but each genus have many animal
this is a clear one to many relationship

so, we will take the genus primary key, and add it the the animal table as a column , now we call this column a foreign key (cause it contains the keys of a foreign table )

Now you know your entities and the relations ship
they are all one to many!!!

so what am i missing, what is the problem
usually the problems are
1) you can't decide what are the entities of you db ( you already know that)

2) what are the relationship between the entities ( you know that too)

3) should this (xyz) be an attribute in a table a seperate entity ? (ask you self do i want to store info about it, if yes make it an entity)

using an ER-diagrammer make things a lot easier you should find one and use it

mohamed

Can you give us more info on what you intend to do ??

MadFarmAnimalz

Originally posted by mohamed
Can you give us more info on what you intend to do ??

Yea, like is this a homework assignment?

uniball

Originally posted by MadFarmAnimalz

Yea, like is this a homework assignment?
I don't know
depending on the situation, what needs to be done, The programming language that's gona be used to interact with the database, .......
I may choose one of the following:
1 - Use pure SQL database
2 - XML though it'll have a large overhead on the harddrive
3 - Hybrid of Text files, and an SQL database.
4 - A file system database "files, folders, symlinks"
5 - Berkely Database.
6 - Any combination of the above

What are you trying to do ?

Until i know, I suggest using an access database :-) <joke!>

shafaki

THANKS
First of all, I would like to thank everyone who had cared to give an answer. I was amazed by the rapid response, its richness and the number of answers. Special thanks to SYStems who happened to give the first answer and tried to give a detailed explanation and to uniball for giving interesting alternatives for doing the whole thing.

FUNWORK
I would first like to clear things out by saying that no this is no homework (so I am in no need of academic feedback), nor is it a real world thing either. Let's say it's for fun, or just experimenting some things out.

LANGUAGE
I am playing with a web site that would list information about animals. I am using PHP as the server-side scripting language, and MySQL as the database.

OLD DESIGN
I had two different design for the database with each having its benefits and drawbacks.

Design I (3 tables)
1] animals: ID, name, description, genusID
2] types: ID, name, metatypeID, parentID
3] metatype: ID, name

The metatype table will contain only 5 rows. The data in such rows will be:
1, phylum
2, class
3, order
4, family
5, genus

The types table will contain names of families, orders, classes ... etc. To know what type each row is, its metatypeID attribute would be set to point to the correct ID number in the metatype table (i.e. 1 if it was a phylum, 2 if it was a class... 5 if it was a genus).

Finally, each type will point to another type (above it in level) to which it belongs. This is done using the parentID field. Each genus will point to a family, each family will point to an order, ... each class will point to a phylum. Only a phylum will not point to a type because there is no type higher than it, so it's parentID will be set to 0.

Good
The nice thing with this design is that it conserves space. It's highly compact. It can also be 'easily' used to know the complete chain (or tree) of a certain animal by chaining through it's genus and from it to its family and so on till it reaches up to its phylum. This process though would take some coding (in PHP) and would also waste processor time (slow).

Bad
Till this point I was happy with my first design but then the real problem came. What if I want to do things the other way around. That is know the animals that belong to a specific phylum for instance. In this case I will have to look for every class that belongs to this phylum, then look for every order that belongs to each of those classes, then look for every family .... etc. till I reach the animals that are under that specific phulum in the animal kingdom tree. Quite a bit of exhausting code (in PHP), not to mention preocessor time and database performance. I am sure this is not the standard way to do it and sure there is a better way that people use.

So that was the end of my first design.

Design II (6 tables)
The second design is pretty simple. Actually, it's what might first come to mind.
1] animals: ID, name, description, phylumID, classID, orderID, familyID, genusID
2] genus: ID, name
3] family: ID, name
4] order: ID, name
5] class: ID, name
6] phylum: ID, name

SKIP (do not read this part, read from RESTART below)
Oh wait a minute. you've made me think of a new design right now that might fix the problem. Let's add some attributes to tables 2 to 5.

2] genus: ID, name, familyID
3] family: ID, name, orderID
4] order: ID, name, classID
5] class: ID, name, phylumID

If I want to check an animal's tree I'll get it instantly from the animal table. If I want to do it in reverse, that is seeing which animals belong to a specific order for instance, then now I can... Oh no, I fell in the same problem.

RESTART - PROBLEM STATEMENT
So now, here is the problem. It is easy to design a database that would just hold the data. The problem is: "How to design the database such that retrieval of a branch of the tree to which a node belongs, and retrieval of nodes that decend from a specific branch can be easy." This is the problem.

REQUEST
1- If you can think of a design for a database that would hold a tree (at the end of which are nodes, in my case the nodes are the animals) then please let me know. Taken into consideration of course that easy retrieval can be done using simple (as possible) code that is efficient and preferably standard (used commonly by people to solve this kind of problem (like design patterns)).

OR

2- An alternative solution that does not rely (at least entirely as uniball suggested) on Relational Databases.

I need your HELP, and would be specifically grateful for either a standard solution that is commonly used, or a creative solution that works.

Ashraf Al Shafaki

uniball

<quote>
Design II (6 tables)
The second design is pretty simple. Actually, it's what might first come to
mind.
1] animals: ID, name, description, phylumID, classID, orderID, familyID, genusID
2] genus: ID, name
3] family: ID, name
4] order: ID, name
5] class: ID, name
6] phylum: ID, name

SKIP (do not read this part, read from RESTART below)
Oh wait a minute. you've made me think of a new design right now that might fix
the problem. Let's add some attributes to tables 2 to 5.

2] genus: ID, name, familyID
3] family: ID, name, orderID
4] order: ID, name, classID
5] class: ID, name, phylumID

If I want to check an animal's tree I'll get it instantly from the animal
table. If I want to do it in reverse, that is seeing which animals belong to a
specific order for instance, then now I can... Oh no, I fell in the same problem.
</quote>

SELECT *
FROM animals
WHERE orderID=22;

What's the problem here ? Nothing, but the database'll be large, We can also
eliminate te orderID, *ID and replace them with the actual names.

SELECT *
FROM animals
WHERE phylum="sarcomastigophura" AND subphylum="opalinata";

I think that's a good design.

We can create an index for the animal to speed things up, since the animal'll
be the most requested field i think.

Still i don't like it.

===================================================================
What if we implemented it in flat files ?
for examplae we have a phylum: sarcomastiguphora
with 3 sub phylums: mastiguphora, opalinata, sarcodina.

we'll have a 2 files:
sarcomastiguphora.children
Sub Phylum # 1st line saying what's the next recoed.
mastiguphora
opalinata
sarcodina

sarcomastigophora.parents
Sub Kingdom # the 1st line is a line describing the parent id.
unicellular # The actual parent.

...........................
I don't like this way.
=======================================================================
XML:
I'm not sure about what i'm saying here regarding scientific facts!
<id="phylum" name="sarcomastiguphora">
<id="subphylum" name="mastiguphora">
<id="animal" name="pramisium">
<id="subphylum" name="opalinata">
<id="subphylum" name="sarcodina">
...........................
Overload on the harddisk
======================================================================
I'm just thinking in loud "voice!"
I'll think of better solutions but i have to get some sleep :-)

A crazy idea!
put all in a textfile, and use grep ;)

alaa

ok first lets admit, the relational database model is not suitable for such a task.
me an uniball are involved in a project with very similar needs like yours, and since its a real life problem we have to use MySQL

it could be done but it is never going to be very efficient.

here is what we did

now a flat file solution is not a very good idea, but using hundreds of flat files may not be a very bad idea
you see your problm could be solved if the you could walk around the tree in two directions (you want the parents to point to the children and the children to point to the parents).

so here is a suggestion, the Animal or species file would contain at the first line the filename of the genus
the genus file will contain as its first line the filename of the family etc.

now you have a tree where children point to their parents, lets do the reverse relation.

after putting whatever info in each file (try to make the info fixed length fields for faster access) you will list in each genus all the animals that belong to it each in a line of its own.

now since this solution is a bit messy, with hundreds of files all around, you'd better have a way to regenerate these files in case of emergencies.

cool thats what we call the hybrid model, in the hybrid model you have a database, the database has a table for animals, a table for genus etc etc etc.
the database will have all the data and the child to parent connections
meaning the animal table will have whatever info you need and the genus ID or name.
this way your files will not need to have the data only the connections, which means they'll be fixed length records by default (make all filenames of the same length).

now if anything happens you can regenerate the files, easily but in a very resource and time consuming method, normaly you'll have all the connections and relations in the files and all the data in the database.

you should also cache whatever answers you get (for instance if you want to know the Number of animals in the family foo, calculate this data then cache it somewhere.
the nice thing about this model is that you can use the GNU textutils for most of your data mining

for instance knowing the number of animals in a genus is as easy as
wc -l foo.genus
knowing the number of animals in a family becomes
-------------------------
c=0; for i in in `cat foo.family`;
do
d=`wc -l $i`
c=$(($c+$d))
done;
echo $c
--------------------------
put this in a script name it count_family and preferable implement it with a faster language than bash (you don't have to use perl or python for that, even bc or awk would be enough).
now to count number of animals in an order just do the same thing replacing the family file with the order file and wc -l with count_family, in fact you could write a generic script that counts counts anything in anything, you would write
count animal family
and it would count the animals in family, its very easy to do (I'll leave it as an exercise).

but all of this is fine and nice if you have to use MySQL, if you don't the the optimal solutions is
when you want to store a tree use a tree.

there is a tree data structure specialy made fr storing trees in files called the B Tree of the Balanced Tree, or even better there is the B+ Tree which is more compact and reduces disk access, if you're realy doing this to learn then you should learn about Tree data structures and B+ Tree files, however PHP is not a suitable language for data structures and true programming, use Python, Perl, C++, C or lisp for this kind of thing.

cheers,
Alaa

alaa

tried to find resources on B Trees this looks very promising http://www.bluerwhite.org/btree

this is short and to the point http://www.semaphorecorp.com/btp/algo.html

enjoy,
Alaa

mohamed

Mostly LDAP will be of benefit for you.
But it is not easy to work with.

Try openldap.org

Regards
Mohamed Eldesoky

shafaki

Thanks uniball for quoting my Design II. It seems I did reach a solution there but I was blinded to see it after burning out my brain in deep thinking.

Thanks alaa for suggesting B+ trees. I think I should learn something about them. You said I should use Perl (among others) instead of PHP. What's the advantage of Perl over PHP?

Yesterday, after posting here, I tried to search for an answer using google. I came accross a concept I never heard of before. It is called "Nested Sets". It is a way of storing tree structures in databases. I was unable to understand it though, despite seeing it in several sites.

At night, I played a bit with paper and pen, and started to understand it and discover how it works. But I think it does have it's own drawbacks (when updating the tree by adding new nodes). But the nested sets is a VERY efficient way when searching the tree!

A
B C D
E F G H I J

Imagine you have the above tree. The root node is A. It has 3 children B, C and D. B has one child E. C has 2 children F and G. Finally, D has 3 children H, I and J.

Now imagine a worm (like I read on one site on the net, which what led me later to understand when I tried to do it with pen and paper). imagine a worm moving anti-clockwise in this tree drawn above and starting from the left side of it's root, that is starting from the left of letter A. Each time this worm moves to a node it increments it's counter and writes this number on the side of the node that it visited.

In the above tree, the worm will first visit the left side of A and number it 1. Then the left side of B and number it 2. Then left side of E and number it 3. In order to continue it's movement along the tree, the worm will have to walk along the right side of E which it will number 4. Then the right side of B which it will number 5. Then it will march to the left side of C and number it 6. Then down to the left side of F and number it 7, then the right side of F and number it 8. Then the left side of E and number it 9, then the right side of E and number it 10. Then the right side of C and number it 11.

It will continue doing the same thing with the rest of the nodes. Namely, D left 12, H left 13, H right, 14, I left 15, I right 16, J left 17, K right 18, D right 19, A right 20. Then the worm upon reaching the root once again will stop.

If we store thid data in a table, it will look like this:

node, left, right
A, 1, 20
B, 2, 5
C, 6, 11
D, 12, 19
E, 3, 4
F, 7, 8
G, 9, 10
H, 13, 14
I, 15, 16
J, 17, 18

To know all the children of A, just pick those having numbers between 1 and 20! It's that simple. To know all children of D, just pick all those with number between 12 and 19! This is amazing.

(You will always find also that for any given node: right - left < 0)

If the difference between left and right is ony 1, then it is a bottom node (a leaf, that is has no nodes below it, childless).

Also it's easy to know all parent of a node. For instance, to know the parents of E, search for those with left smaller than 3 and right larger than 4. It will return B (2, 5) and A (1, 20), which are indeed the parents of B.

Those two methods of searching children and parents I discovered after storing the data in a nested set as I red yesterday. But perhaps they have other methods, additional methods or better ways.

What I was stuck with was what if I wanted to update the table by adding a, say, a sister node to E? This will complicate things, because I'll have to make the worm move on again and update all the other records. I did not look to see if they have a simpler solution for this one.

At least if not a lot of updating is needed, searching will be very fast. (Yet searching for ALL children, or all parents. This does not take into account searching for children under a specific level only. Which could be more complicated).

Hope someone gives me feedback on what he think with the nested sets model that I've just leared about yesterday.

alaa

congrats nested lists are a typical way of implementing tree data structures.

it usualy goes like this, you have a node object with a member list of nodes, each having its own member list of nodes etc??

but this still isn't a flat table compatible thing, variable length records in databases are horible to handle, so again I insist that the solution is by using a proper programming language that has good facilities for data abstraction and data structures.

now what you described is a tree, a B Tree is just a way to balance the tree around the file in a way that aids in fast access and easy data insertion.

B+ Trees use indexes and hashing to speed things up a bit (its not that simple I know).

as to why Perl, I don't really like Perl in particular but it has B+ Tree libraries and good string manipulation routines, not to mention regexp which ease searching alot.

why not PHP!! where to start, PHP is the BASIC of the 21 century, PHP is evil, it is an ASP clone, it stands for Pretty Home Page, it suck big time.

on a more serious note, PHP suffers from many problems, I don't really consider it a programming language, each part of PHP has been designed with a different style that it doesn't feel like a coherent language at all, the fact that it is embeded inside HTML is a big problem in itself, it makes it very hard to seperate interface code from logic code, its a pain to debug, and there is no way you can prototype stuff by accessing them directly.
there is also the problem of mixing data types together (an indexed array is the same as an associative array, and arrays could have multiple indexes for the same thing uuuuugh).
the real problem though is its horible typing engines, there is no type safety at all in PHP (IT NEEDS A === OPERATOR) and references are very very very messy, I hardly ever see anyone doing data structures or abstract algorythms in PHP (why would they, its there to generate a pretty homepage after all).

what you are doing here is a typical programming problem that needs all the power of data abstraction, you want to use a DB fine, learning the Perl/Python/C/C++ API is not going to be harder than the PHP one, and it'll most definatly fit better with the language (since there will be a language to fit with aslan).

can't describe how angry I am that I have to use PHP, and what is it with Data Bases everyone, not everything should be done with a database, they are hardly that efficient when it comes to truely dynamic things, or to small amounts of data, or to data with complex relation ships.

you know relational databases and the EVIL ER databases where created for management type people could find a tool to play with.

cheers,
Alaa