Database Management System

A DBMS is software that allows applications to store and analyze information in a database. A general-purpose DBMS is designed to allow the definition, creation, querying, update, and administration of databases in accordance with some data model.

Data Model: A data model is a collection of concepts for describing the data in database. Some examples include:

Relational (most common)
NoSQL (key/value, document, graph)
Array / Matrix / Vector (for machine learning)

Schema: Schema is a description of a particular collection of data based on a data model. This defines the structure of data for a data model. Without a schema, you will have random bits with no meaning.

Common Data Models

Relational (Most DBMSs)
Key/Value (NoSQL)
Graph (NoSQL)
Document/XML/Object (NoSQL)
Wide-Column/Column-family (NoSQL)
Array / Matrix / Vector (Machine Learning)
Hierarchical (Obsolete/Legacy/Rare)
Network (Obsolete/Legacy/Rare)
Multi-Value (Obsolete/Legacy/Rare)

Relational Model

The relational model defines a database abstraction based on relations to avoid maintenance overhead. It has three key ideas:

Store database in simple data structures (relations)
Physical storage left up to the DBMS implementation
Access data through a high-level language, where the DBMS figures out best execution strategy

The relational data model defines three concepts:

Structure: The definition of relations and their contents independent of their physical representation
Integrity: Ensure the database’s contents satisfy certain constraints
Manipulation: Programming interface for accessing and modifying a database’s contents

One of the ideas that the relational model provides is data independence. If we isolate the user / application from the low-level data representation, the user only has to worry about the high-level application logic. It also allows the DBMS to optimize the data layout according to the operating environment, database contents, and workload. When these factors change, the DBMS can respond and re-optimize the database.

Relation: A relation is an unordered set that contains the relationship of attributes that represent entities. Since the relationships are unordered, the DBMS can store them in any way it wants, allowing for optimization. It is possible to have repeated / duplicated elements in a relation.

Tuple: A tuple is a set of attribute values (also known as its domain) in the relation. In the past, values had to be atomic or scalar, but now values can also be lists or nested data structures. Every attribute can be a special value, NULL, which means for a given tuple the attribute is undefined.

Primary Key: A relation’s primary key uniquely identifies a single tuple in a table. Some DBMSs automatically create an internal primary key if you do not define one. A lot of DBMSs have support for autogenerated keys (so an application does not have to manually increment the keys), but a primary key is still required for some DBMSs.

Foreign Key: A foreign key specifies that an attribute from one relation maps to a tuple in another relation. Generally, the foreign key will point / be equal to a primary key in another table.

Constraint: A constraint is a user-defined condition that must hold for any instance of the database. Unique key and referential (foreign key) constraints are the most common.

Data Manipulation Languages (DMLs)

Procedural: The query specifies the (high-level) strategy the DBMS should use to find the desired result based on sets / bags. For example, use a for loop to scan all records and count how many records are there to retrieve the number of records in the table.
Non-Procedural (Declarative): The query specifies only what data is wanted and not how to find it. For example, we can use SQL SELECT COUNT(*) FROM artist to count how many records are there in the table.

Relational Algebra

Relational Algebra: Relational Algebra is a set of fundamental operations to retrieve and manipulate tuples in a relation. Each operator takes in one or more relations as inputs, and outputs a new relation. To write queries we can ”chain” these operators together (often in a tree or directed acyclic graph) to create more complex operations.

Selection

Select takes in a relation and outputs a subset of the tuples from that relation that satisfy a selection predicate. The predicate acts as a filter, and we can combine multiple predicates using conjunctions and disjunctions

Syntax: $\sigma_{\text {predicate }}(R)$ .

Example: $\sigma_{\mathrm{a}\_\mathrm{id}={ }^{\prime} a2^{\prime}}(R)$

SQL: SELECT * FROM R WHERE a_id = 'a2'

Projection

Projection takes in a relation and outputs a relation with tuples that contain only specified attributes. You can rearrange the ordering of the attributes in the input relation as well as manipulate the values.

Syntax: $\pi_{\mathrm{A} 1, \mathrm{~A} 2, \ldots, \mathrm{An}}(R)$

Example: $\pi_{b\_id-100, aid }\left(\sigma_{\mathrm{a} \text { \_id }=\text { 'a2' }}(R)\right)$

SQL: SELECT b_id-100, a_id FROM R WHERE a_id = 'a2'

Union

Union takes in two relations and outputs a relation that contains all tuples that appear in at least one of the input relations. Note: The two input relations have to have the exact same attributes.

Syntax: $(R \cup S)$ .

SQL: (SELECT * FROM R) UNION ALL (SELECT * FROM S)

UNION - removes duplicates.

UNION ALL - keeps the duplicates.

Intersection

Intersection takes in two relations and outputs a relation that contains all tuples that appear in both of the input relations. Note: The two input relations have to have the exact same attributes.

Syntax: $(R \cap S)$ .

SQL: (SELECT * FROM R) INTERSECT (SELECT * FROM S)

Difference

Difference takes in two relations and outputs a relation that contains all tuples that appear in the first relation but not the second relation. Note: The two input relations have to have the exact same attributes.

Syntax: $(R-S)$ .

SQL: (SELECT * FROM R) EXCEPT (SELECT * FROM S)

Product

Product takes in two relations and outputs a relation that contains all possible combinations for tuples from the input relations.

Syntax: $(R \times S)$ .

SQL: (SELECT * FROM R) CROSS JOIN (SELECT * FROM S), or simply SELECT * FROM R, S

Join

Join takes in two relations and outputs a relation that contains all the tuples that are a combination of two tuples where for each attribute that the two relations share, the values for that attribute of both tuples is the same.

Syntax: $(R \bowtie S)$ .

SQL: SELECT * FROM R JOIN S USING (ATTRIBUTE1, ATTRIBUTE2...)