
Relational Table DBMS

Recommended background: UTF-8 encoding; two's complement integers

This article will not be discussing:

  • Data encryption
  • Data replication / distribution
  • Query languages
  • Sorting
  • Caching
  • Data conflict resolution
  • Optimizing disk reads and writes

These are all just extra layers of processing on top of what is discussed here, and each can be covered separately.

Tables are essentially just a list of tuples (also known as records, or rows) that are stored in a file. However, tables appear to be 2D, so how does one store them in a linear file?
First of all, we need to break down a table into its components.


Attributes

These are the individual fields of a table. Each attribute has a fixed size, which makes our job of converting the dataset into a file much easier.
Since we know the length of each attribute, we can literally just store them one after another, and then read back just the bytes that belong to each attribute as they are needed.

Below is an example of finding the slice points for a specific attribute within a row.

let start = 0; // Start of the target attribute, in bytes
let end = 0;   // End of the target attribute, in bytes

for (const attr of attributes) {
  end += attr.size;

  if (attr.name === field) {
    break; // Found it: the attribute occupies bytes [start, end) of the row
  }

  start = end;
}
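
For example, assuming the row's bytes have already been read into a Node.js Buffer (rowBuffer here is a hypothetical name), the attribute can then be sliced out and decoded:

const attributeBytes = rowBuffer.subarray(start, end); // bytes [start, end) of the row
const value = attributeBytes.toString('utf-8');        // e.g. decode a string attribute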


Data Types

Some common table data types and their sizes:

  • Float: A signed number which can include non-whole numbers. Made up of 32 bits.
  • Double: A signed number which can include non-whole numbers, as well as numbers impractical to store as ints (e.g. 6×10^56). Made up of 64 bits.
  • String: An array of characters.
  • Int: An integer value encoded using two's complement.
  • Int8: An integer made up of 8 bits.
  • Int16: An integer made up of 16 bits.
  • Int32: An integer made up of 32 bits.
  • Int64: An integer made up of 64 bits.
  • UInt: A positive-only integer value.
  • UInt8: An unsigned integer made up of 8 bits.
  • UInt16: An unsigned integer made up of 16 bits.
  • UInt32: An unsigned integer made up of 32 bits.
  • UInt64: An unsigned integer made up of 64 bits.
  • Boolean: A true/false value which takes up a whole byte, since the byte is the smallest addressable unit of storage.
  • Date: The method of storing dates varies from system to system. You could encode dates as 32-bit UNIX timestamps, or store them as ISO 8601 strings.
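
Putting sizes to attributes, a table's schema might be declared like so (a sketch; the field names and sizes are illustrative). This is the attributes array that the code snippets above and below iterate over:

const attributes = [
  { name: 'id',     size: 4  }, // UInt32
  { name: 'name',   size: 32 }, // fixed-length UTF-8 string
  { name: 'active', size: 1  }, // Boolean
];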


Tuples

Since we now know how to read attributes from a tuple, we can use a similar approach to read a tuple from a table.
First of all, we need to find the size of a tuple, which is the sum of its attributes' sizes:

let tupleSize = 0;
for (const attr of attributes) {
  tupleSize += attr.size;
}

Now we can read a tuple in exactly the same way: tuple i occupies bytes [i * tupleSize, (i + 1) * tupleSize) of the table file.
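
As a minimal sketch, here is how tuple i could be read straight out of the table file using Node.js (the file name clients.table is illustrative):

const fs = require('fs');

function readTuple(fd, i, tupleSize) {
  const buf = Buffer.alloc(tupleSize);
  // Tuple i occupies bytes [i * tupleSize, (i + 1) * tupleSize) of the file
  fs.readSync(fd, buf, 0, tupleSize, i * tupleSize);
  return buf;
}

const fd = fs.openSync('clients.table', 'r');
const row = readTuple(fd, 2, tupleSize); // read the third tuple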


Primary and Composite Keys

Keys are a way of uniquely identifying individual rows.
A primary key is a single attribute that is unique for every row within the table (e.g. an ID).
A composite key is when multiple fields in combination are used to uniquely identify a row (e.g. first name & last name).
Both of these key systems can be implemented in multiple ways.

Depth Key

This is a simple approach where the system reads just the key value of each row until it finds a match with the request, then it returns the whole row.
Thus, the time taken to find a given tuple increases linearly with the size of the table.
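
A minimal sketch of that scan, reusing readTuple from above, and assuming the key is the first attribute of each tuple, keySize bytes long, and passed in as a Buffer:

function findByKey(fd, key, tupleSize, keySize) {
  const keyBuf = Buffer.alloc(keySize);

  for (let i = 0; ; i++) {
    // Read only the key bytes of tuple i, not the whole row
    const bytesRead = fs.readSync(fd, keyBuf, 0, keySize, i * tupleSize);
    if (bytesRead < keySize) return null; // end of table: no match found

    if (keyBuf.equals(key)) {
      return readTuple(fd, i, tupleSize); // match: return the whole row
    }
  }
}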

Key Hashing

The concept of Key Hashing is to convert a given value into a unique number (e.g. James => 1; Bob => 2);
this means that you can hash a given key to get the index of the row belonging to that key.
Thus, this method can vastly improve the time it takes to get data from the table, since it does not need to scan each tuple to see if its key matches.
However, no hashing algorithm is perfect, and there is a trade-off of space for speed, because you will end up with empty indexes between the data you have stored.

For example, if you hash 'Bob' => 3 and 'Sally' => 54, and they are the only people in your database, you will still need to store indexes 1-2 & 4-53 (empty) to ensure that the hashing algorithm points to the correct tuple. Since you always use the hashing function to get the index and then just read off that index, the time taken to read a row is a constant, making this optimal for large databases where size limitations are less of an issue.
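
As an illustration, here is a toy hash function mapping a key string to a row index, reusing readTuple and fd from the tuple-reading sketch (real systems use far stronger hash functions and must also handle collisions, which this sketch ignores):

function hashKey(key, tableCapacity) {
  let h = 0;
  for (const ch of key) {
    // Simple polynomial rolling hash, kept within the table's capacity
    h = (h * 31 + ch.charCodeAt(0)) % tableCapacity;
  }
  return h;
}

// The row for a key now lives at a fixed, computable index
const bobsRow = readTuple(fd, hashKey('Bob', 1000), tupleSize);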


Referencing

This allows data to be connected almost directly, allowing more complex data to be stored.
For instance, instead of storing a client's address with every online order they make, you can just store the client ID as a reference, then read the address from the client table.

Key Referencing

Uses the systems previously described to find the target row.

If it is a composite key, then you will need to store all of the key values for the reference to work.
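
A sketch of the idea: the order record stores the client's key value, and resolving the reference simply re-runs a key lookup on the clients table, e.g. through the hash index above (clientsFd here is a hypothetical file handle):

// The order stores the client's key value, not its position
const order = { product: 'Lamp', clientKey: 'Bob' };

// Resolve the reference by re-finding the row, e.g. via the hash index
const client = readTuple(clientsFd, hashKey(order.clientKey, 1000), tupleSize);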

Exact Referencing

Instead of needing to re-find the target row every time, the reference is the exact index of the row. Thus, no extra processing is required to find the target tuple; it can simply be read directly.
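
Continuing the sketch above, an exact reference stores the row index itself, so resolution is a single offset calculation:

// The order stores the client's row index directly
const order = { product: 'Lamp', clientRow: 3 };

// No lookup needed: jump straight to byte offset clientRow * tupleSize
const client = readTuple(clientsFd, order.clientRow, tupleSize);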


Compacting a Table

This task removes any empty rows from the database, one after another, to decrease the size of the table. This operation, while space effective, takes time, and while it is running no data can reliably be read from or written to the table for fear of corruption. Thus, it is a highly expensive processing task.
Also, if you are using Key Hashing or Exact Referencing for performance speedups, compacting will break the entire system: references will read incorrect tuples, and may even attempt to read outside the table itself.
Thus, it has become a rather outdated practice, only really used for archives or small-scale personal databases.
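
For completeness, a sketch of the operation on an in-memory copy of the table, assuming an all-zero tuple marks an empty row (that convention is an assumption, not a standard):

function compactTable(table, tupleSize) {
  const empty = Buffer.alloc(tupleSize); // all-zero tuple = deleted row
  let writeAt = 0;

  for (let readAt = 0; readAt < table.length; readAt += tupleSize) {
    const tuple = table.subarray(readAt, readAt + tupleSize);
    if (!tuple.equals(empty)) {
      tuple.copy(table, writeAt); // shift the surviving tuple down over the gap
      writeAt += tupleSize;
    }
  }

  return table.subarray(0, writeAt); // the compacted table
}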

Footnote

Since entire tables can be bigger than the RAM of a single computer, modern DBMSs never read the whole table at once. Instead, they read it as a stream of data.
For instance, you pipe all of the data from your read stream into an input buffer; once you have enough data to make up a tuple, you remove that tuple's worth of data from the front of the buffer, run your operation on it, and pipe the result to another file.
Thus, you can work with massive datasets and have massive query results. When your operation is done, you can pipe that entire output file to wherever the result needs to go.
It is also important to note that this output cache file is not necessary; for instance, you could instead write directly to a table, or pipe your results back to whatever made the request in the first place.
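
A minimal sketch of that pipeline using Node.js streams, assuming a fixed tupleSize as computed earlier (the file names and the matchesQuery filter are hypothetical):

const fs = require('fs');

let pending = Buffer.alloc(0); // the input buffer

const input = fs.createReadStream('orders.table');
const output = fs.createWriteStream('result.table'); // the output cache file

input.on('data', (chunk) => {
  pending = Buffer.concat([pending, chunk]);

  // Process every complete tuple currently sitting in the buffer
  while (pending.length >= tupleSize) {
    const tuple = pending.subarray(0, tupleSize);
    pending = pending.subarray(tupleSize);

    if (matchesQuery(tuple)) {          // hypothetical per-tuple operation
      output.write(Buffer.from(tuple)); // copy, since subarray shares memory
    }
  }
});

input.on('end', () => output.end());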

