7. Data Structures and Algorithm, Database System, and Operating System (ACtE07)
7.1 Introduction to Data Structure, List, Linked Lists and Trees
Data Types, Data Structures, and Abstract Data Types
A Data Type defines the kind of values a variable can hold and the operations that can be performed on it (e.g., integer, float, string). A Data Structure is a particular way of organizing and storing data in a computer so that it can be accessed and modified efficiently. An Abstract Data Type (ADT) is a mathematical model for data types, defining the logical behavior of a data type from the user's perspective, independent of its implementation.
- Data Type: Focuses on the type of data and allowed operations.
- Data Structure: Focuses on the physical organization and storage of data.
- ADT: Focuses on the logical properties and operations, abstracting away implementation details.
Time and Space Analysis of Algorithms
Algorithm analysis involves determining the resources (time and space) required by an algorithm. Big O notation (O) describes the upper bound of an algorithm's running time or space requirements, representing the worst-case scenario. Omega notation (Ω) describes the lower bound, representing the best-case scenario. Theta notation (Θ) describes both the upper and lower bounds, representing the average-case or tight bound.
Example: Linear Search
To find an element in an unsorted array of n elements:
- Best Case (Ω(1)): The element is found at the first position.
- Worst Case (O(n)): The element is at the last position or not present, requiring
ncomparisons. - Average Case (Θ(n)): On average,
n/2comparisons are needed.
Linear Data Structures: Stack and Queue
Stack (LIFO - Last In, First Out): A linear data structure that follows a particular order in which operations are performed. Items are added and removed from the same end, called the "top."
- Operations:
Push(add item),Pop(remove item),Peek(view top item),IsEmpty. - Implementation:
- Array-based: Fixed size, simple to implement. Potential for stack overflow.
- Linked List-based: Dynamic size, more flexible.
Queue (FIFO - First In, First Out): A linear data structure that follows a particular order in which operations are performed. Items are added at one end (rear/enqueue) and removed from the other end (front/dequeue).
- Operations:
Enqueue(add item),Dequeue(remove item),Front(view front item),IsEmpty. - Implementation:
- Array-based: Fixed size, can suffer from inefficient space usage (elements shift).
- Circular Queue: An improved array-based implementation where the last element is connected to the first, allowing efficient reuse of space.
- Linked List-based: Dynamic size, efficient for insertions and deletions.
Stack Applications
Infix to Postfix Conversion: Converts an arithmetic expression from infix notation (operators between operands) to postfix notation (operators after operands).
Algorithm:
- Initialize an empty stack and an empty postfix expression string.
- Scan the infix expression from left to right.
- If an operand, append to postfix string.
- If an opening parenthesis '(', push onto stack.
- If a closing parenthesis ')', pop operators from stack and append to postfix until '(' is encountered. Pop and discard '('.
- If an operator, pop operators from stack (and append to postfix) that have higher or equal precedence than the current operator, then push the current operator onto the stack.
- After scanning, pop any remaining operators from the stack and append to postfix.
Example: Infix (A+B)*C
| Char | Stack | Postfix |
|---|---|---|
| ( | ( | |
| A | ( | A |
| + | ( + | A |
| B | ( + | AB |
| ) | AB+ | |
| * | * | AB+ |
| C | * | AB+C |
| End | AB+C* |
Evaluation of Postfix Expression:
Algorithm:
- Initialize an empty stack.
- Scan the postfix expression from left to right.
- If an operand, push it onto the stack.
- If an operator, pop two operands from the stack, perform the operation, and push the result back onto the stack.
- The final result is the only element remaining in the stack.
Example: Postfix 2 3 + 4 *
| Char | Stack | Operation |
|---|---|---|
| 2 | 2 | |
| 3 | 2, 3 | |
| + | 5 | 3+2=5 |
| 4 | 5, 4 | |
| * | 20 | 4*5=20 |
| End | 20 | Result: 20 |
Array Implementation of Lists
An array can represent a list where elements are stored in contiguous memory locations. Accessing elements by index is O(1). Insertion and deletion in the middle require shifting elements, leading to O(n) complexity. Stacks and Queues can be implemented using arrays, acting as specialized lists.
- Static List: Fixed size, determined at compile time (e.g., C-style arrays).
- Dynamic List: Can grow or shrink in size during runtime (e.g., C++
std::vector, JavaArrayList). Often implemented using dynamic arrays that resize when capacity is exceeded.
Dynamic Implementation of Linked List
A Linked List is a linear collection of data elements, called nodes, where each node points to the next node in the sequence. Nodes are not stored in contiguous memory locations.
- Types of Linked List:
- Singly Linked List: Each node contains data and a pointer to the next node. Traversal is only forward.
- Doubly Linked List: Each node contains data, a pointer to the next node, and a pointer to the previous node. Allows bidirectional traversal.
- Circular Linked List: The last node points back to the first node, forming a circle. Can be singly or doubly linked.
- Basic Operations on Linked List (Singly Linked List Example):
- Creation:
CREATE_NODE(data): node = new Node() node.data = data node.next = NULL return node - Insertion (Beginning):
INSERT_AT_BEGINNING(head, data): new_node = CREATE_NODE(data) new_node.next = head head = new_node return head - Insertion (End):
INSERT_AT_END(head, data): new_node = CREATE_NODE(data) IF head IS NULL THEN head = new_node RETURN head current = head WHILE current.next IS NOT NULL DO current = current.next current.next = new_node RETURN head - Insertion (Middle - after a specific node):
INSERT_AFTER_NODE(prev_node, data): IF prev_node IS NULL THEN PRINT "Previous node cannot be NULL" RETURN new_node = CREATE_NODE(data) new_node.next = prev_node.next prev_node.next = new_node - Deletion (by value):
DELETE_NODE(head, key): current = head prev = NULL IF current IS NOT NULL AND current.data == key THEN // Head node head = current.next DELETE current RETURN head WHILE current IS NOT NULL AND current.data != key DO prev = current current = current.next IF current IS NULL THEN // Key not found RETURN head prev.next = current.next DELETE current RETURN head
- Creation:
Doubly Linked Lists and its Applications: Offer O(1) insertion/deletion if the node reference is available, and O(n) for searching. Applications include:
- Implementing Undo/Redo functionality in editors.
- Browser history (back/forward buttons).
- Implementing LRU (Least Recently Used) cache.
Concept of Tree
A Tree is a non-linear data structure representing hierarchical data, consisting of nodes connected by edges.
- Root: The topmost node of the tree.
- Node: An entity in a tree that stores data and may have children.
- Edge: A connection between two nodes.
- Leaf Node: A node with no children.
- Parent: A node that has one or more child nodes.
- Child: A node directly connected to another node when moving away from the root.
- Degree: The number of children a node has. For a tree, it's the maximum degree of any node.
- Height: The length of the longest path from the root to a leaf node. (Number of edges)
- Level: The level of the root node is 0 (or 1). The level of any other node is its parent's level + 1.
- Depth: The length of the path from the root to a specific node. (Same as level for a node).
Operations in Binary Tree
A Binary Tree is a tree data structure where each node has at most two children, referred to as the left child and the right child.
- Insertion: New nodes are typically inserted as leaf nodes, maintaining the binary tree property (e.g., in a Binary Search Tree, smaller values go left, larger go right).
- Deletion: More complex, involving handling cases like deleting a leaf node, a node with one child, or a node with two children (often replaced by its in-order successor or predecessor).
- Searching: Start from the root and traverse down, comparing the target value with the current node's value. In a BST, if target < current, go left; if target > current, go right.
Tree Traversals
Visiting each node in a tree exactly once.
- Pre-order Traversal (Root, Left, Right):
PREORDER(node): IF node IS NOT NULL THEN PRINT node.data PREORDER(node.left) PREORDER(node.right) - In-order Traversal (Left, Root, Right):
INORDER(node): IF node IS NOT NULL THEN INORDER(node.left) PRINT node.data INORDER(node.right) - Post-order Traversal (Left, Right, Root):
POSTORDER(node): IF node IS NOT NULL THEN POSTORDER(node.left) POSTORDER(node.right) PRINT node.data
Example Tree:
4
/ \
2 5
/ \
1 3
- Pre-order: 4, 2, 1, 3, 5
- In-order: 1, 2, 3, 4, 5 (Sorted order for a BST)
- Post-order: 1, 3, 2, 5, 4
Height, Level and Depth of a Tree
As defined above, Height is the longest path from root to leaf. Level is the distance from the root (root at level 0 or 1). Depth of a node is the length of the path from the root to that node.
AVL Balanced Trees
An AVL Tree is a self-balancing Binary Search Tree where the difference between the heights of the left and right subtrees (balance factor) of any node is at most 1. If this property is violated, the tree performs rotations to rebalance itself.
- Rotations:
- LL Rotation (Left-Left): Occurs when a node is inserted into the left subtree of the left child of an unbalanced node. A single right rotation is performed.
- RR Rotation (Right-Right): Occurs when a node is inserted into the right subtree of the right child of an unbalanced node. A single left rotation is performed.
- LR Rotation (Left-Right): Occurs when a node is inserted into the right subtree of the left child of an unbalanced node. A left rotation is performed on the child, followed by a right rotation on the parent.
- RL Rotation (Right-Left): Occurs when a node is inserted into the left subtree of the right child of an unbalanced node. A right rotation is performed on the child, followed by a left rotation on the parent.
7.2 Sorting, Searching, and Graphs
Types of Sorting
- Internal Sorting: All data to be sorted fits into the main memory.
- External Sorting: Data is too large to fit into main memory and resides on disk, requiring external storage access during sorting.
Sorting Algorithms
Each algorithm below aims to arrange elements in a specific order (ascending or descending).
- Insertion Sort:
Algorithm: Iterates through the list, picking an element and inserting it into its correct position among the elements already sorted. Like sorting a hand of playing cards.
INSERTION_SORT(arr, n): FOR i FROM 1 TO n-1 DO key = arr[i] j = i - 1 WHILE j >= 0 AND arr[j] > key DO arr[j+1] = arr[j] j = j - 1 arr[j+1] = keyTime Complexity:
O(n²)in worst and average cases.O(n)in best case (already sorted). - Selection Sort:
Algorithm: Finds the minimum element from the unsorted part and puts it at the beginning of the sorted part.
SELECTION_SORT(arr, n): FOR i FROM 0 TO n-2 DO min_idx = i FOR j FROM i+1 TO n-1 DO IF arr[j] < arr[min_idx] THEN min_idx = j SWAP(arr[min_idx], arr[i])Time Complexity:
O(n²)in all cases. - Exchange Sort / Bubble Sort:
Algorithm: Repeatedly steps through the list, compares adjacent elements and swaps them if they are in the wrong order. Passes through the list until no swaps are needed.
BUBBLE_SORT(arr, n): FOR i FROM 0 TO n-2 DO swapped = FALSE FOR j FROM 0 TO n-i-2 DO IF arr[j] > arr[j+1] THEN SWAP(arr[j], arr[j+1]) swapped = TRUE IF NOT swapped THEN BREAK // Optimization for already sorted arraysTime Complexity:
O(n²)in worst and average cases.O(n)in best case. - Merge Sort:
Algorithm: A divide-and-conquer algorithm. It divides the unsorted list into n sublists, each containing one element, then repeatedly merges sublists to produce new sorted sublists until there is only one sorted list remaining.
MERGE_SORT(arr, left, right): IF left < right THEN mid = (left + right) / 2 MERGE_SORT(arr, left, mid) MERGE_SORT(arr, mid+1, right) MERGE(arr, left, mid, right) // Merges two sorted subarraysTime Complexity:
O(n log n)in all cases. - Radix Sort:
Algorithm: A non-comparative integer sorting algorithm that sorts data with integer keys by grouping keys by individual digits which share the same significant position and value.
RADIX_SORT(arr, n): max_val = GET_MAX(arr, n) // Find maximum number to determine number of digits exp = 1 WHILE max_val / exp > 0 DO COUNT_SORT(arr, n, exp) // Use Counting Sort for current digit position exp = exp * 10Time Complexity:
O(d * (n + b)), wheredis the number of digits,nis the number of elements, andbis the base of the number system (e.g., 10 for decimal). - Shell Sort:
Algorithm: An in-place comparison sort. It is an extension of insertion sort that allows the exchange of items that are far apart. It starts by sorting pairs of elements far apart, then progressively reducing the gap between elements to be compared.
SHELL_SORT(arr, n): gap = n / 2 WHILE gap > 0 DO FOR i FROM gap TO n-1 DO temp = arr[i] j = i WHILE j >= gap AND arr[j - gap] > temp DO arr[j] = arr[j - gap] j = j - gap arr[j] = temp gap = gap / 2Time Complexity: Depends on the gap sequence, typically ranges from
O(n log² n)toO(n^1.5)in worst case. - Heap Sort as Priority Queue:
Algorithm: Uses a binary heap data structure. It first builds a max-heap from the input data. Then, it repeatedly extracts the maximum element from the heap (which is the root), swaps it with the last element, reduces the heap size, and calls
heapifyon the new root.HEAPIFY(arr, n, i): // To maintain max-heap property largest = i left = 2*i + 1 right = 2*i + 2 IF left < n AND arr[left] > arr[largest] THEN largest = left IF right < n AND arr[right] > arr[largest] THEN largest = right IF largest != i THEN SWAP(arr[i], arr[largest]) HEAPIFY(arr, n, largest) HEAP_SORT(arr, n): FOR i FROM n/2 - 1 DOWNTO 0 DO // Build max-heap HEAPIFY(arr, n, i) FOR i FROM n-1 DOWNTO 0 DO // Extract elements one by one SWAP(arr[0], arr[i]) HEAPIFY(arr, i, 0) // Call heapify on reduced heapTime Complexity: Building heap is
O(n). Sorting isO(n log n).
Big O Notation and Efficiency of Sorting
A comparison of common sorting algorithms:
| Algorithm | Best Case | Average Case | Worst Case | Space Complexity |
|---|---|---|---|---|
| Bubble Sort | O(n) | O(n²) | O(n²) | O(1) |
| Insertion Sort | O(n) | O(n²) | O(n²) | O(1) |
| Selection Sort | O(n²) | O(n²) | O(n²) | O(1) |
| Merge Sort | O(n log n) | O(n log n) | O(n log n) | O(n) |
| Heap Sort | O(n log n) | O(n log n) | O(n log n) | O(1) |
| Radix Sort | O(d*(n+b)) | O(d*(n+b)) | O(d*(n+b)) | O(n+b) |
| Shell Sort | O(n log n) | O(n log² n) | O(n^1.5) | O(1) |
Search Techniques
- Sequential Search (Linear Search):
Algorithm: Checks each element in the list sequentially until a match is found or the end of the list is reached.
SEQUENTIAL_SEARCH(arr, n, target): FOR i FROM 0 TO n-1 DO IF arr[i] == target THEN RETURN i // Element found at index i RETURN -1 // Element not foundTime Complexity:
O(n). - Binary Search:
Algorithm: Efficiently finds an element in a sorted list. It repeatedly divides the search interval in half. If the value of the search key is less than the item in the middle of the interval, narrow the interval to the lower half. Otherwise, narrow it to the upper half.
BINARY_SEARCH(arr, low, high, target): WHILE low <= high DO mid = low + (high - low) / 2 IF arr[mid] == target THEN RETURN mid ELSE IF arr[mid] < target THEN low = mid + 1 ELSE high = mid - 1 RETURN -1Example: Search for
23in[2, 5, 8, 12, 16, 23, 38, 56, 72, 91]- low=0, high=9, mid=4 (arr[4]=16). 16 < 23, so low = 4+1=5.
- low=5, high=9, mid=7 (arr[7]=56). 56 > 23, so high = 7-1=6.
- low=5, high=6, mid=5 (arr[5]=23). 23 == 23, return 5.
Time Complexity:
O(log n). - Tree Search (BST Search): Searching in a Binary Search Tree (BST) involves comparing the target value with the current node. If the target is smaller, move to the left child; if larger, move to the right child. This continues until the value is found or a null pointer is encountered. Time Complexity:
O(h)wherehis the height of the tree (O(log n)for balanced,O(n)for skewed). - General Search Tree: A tree where values in subtrees are ordered relative to the root, but not necessarily binary (e.g., B-trees).
Hashing
Hashing is the process of mapping data of arbitrary size to a fixed-size value, typically an integer, using a Hash Function. The output is called a hash value or hash code.
- Hash Functions:
- Division Method:
h(k) = k mod m, wherekis the key, andmis the size of the hash table (often a prime number). - Multiplication Method:
h(k) = floor(m * (k * A mod 1)), whereAis a constant (0 < A < 1).
- Division Method:
- Hash Tables: A data structure that implements an associative array (map) abstract data type, mapping keys to values. It uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.
- Collision Resolution: When two different keys hash to the same index.
- Open Addressing: All elements are stored in the hash table itself.
- Linear Probing: If a collision occurs at index
i, search for the next empty slot at(i+1) mod m,(i+2) mod m, etc. - Quadratic Probing: If a collision occurs at index
i, search for the next empty slot at(i+1²) mod m,(i+2²) mod m, etc.
- Linear Probing: If a collision occurs at index
- Chaining: Each slot in the hash table is a pointer to a linked list (or other data structure) of all the keys that hash to that slot.
- Open Addressing: All elements are stored in the hash table itself.
Undirected and Directed Graphs
A Graph is a non-linear data structure consisting of a finite set of vertices (or nodes) and a set of edges connecting pairs of vertices.
- Undirected Graph: Edges have no direction; if an edge connects A and B, it means A is connected to B and B is connected to A.
- Directed Graph (Digraph): Edges have a direction; an edge from A to B means a connection from A to B, but not necessarily from B to A.
- Terminology: Vertex, Edge, Weight (cost associated with an edge), Path (sequence of vertices connected by edges), Cycle (a path that starts and ends at the same vertex), Degree (number of edges incident to a vertex).
Representation of Graph
- Adjacency Matrix: A 2D array
Adj[V][V]whereVis the number of vertices.Adj[i][j] = 1if there is an edge from vertexito vertexj, otherwise0. For weighted graphs, store weights instead of 1.Pros: Quick check for existence of an edge (O(1)).
Cons: High space complexity
O(V²), even for sparse graphs. - Adjacency List: An array of linked lists where the
i-th element of the array contains a linked list of all vertices adjacent to vertexi.Pros: Space efficient for sparse graphs
O(V+E), whereEis the number of edges.Cons: Checking for an edge can be
O(degree(V)).
Transitive Closure of Graph (Warshall's Algorithm)
The Transitive Closure of a directed graph G is a graph G* with the same vertices as G, such that there is an edge from vertex i to vertex j in G* if and only if there is a path from i to j in G.
Warshall's Algorithm: Computes the transitive closure using dynamic programming. It iteratively updates a connectivity matrix R.
WARSHALL(Adj_Matrix R, n):
// Initialize R with adjacency matrix (R[i][j] = 1 if edge, else 0)
FOR k FROM 0 TO n-1 DO
FOR i FROM 0 TO n-1 DO
FOR j FROM 0 TO n-1 DO
R[i][j] = R[i][j] OR (R[i][k] AND R[k][j])
RETURN R
Example:
Graph: A->B, B->C, C->D, A->C Initial R: A B C D A 0 1 1 0 B 0 0 1 0 C 0 0 0 1 D 0 0 0 0 After k=0 (A): (no new paths) After k=1 (B): R[A][C] becomes (R[A][B] AND R[B][C]) -> (1 AND 1) = 1. (A->B->C) A B C D A 0 1 1 0 B 0 0 1 0 C 0 0 0 1 D 0 0 0 0 After k=2 (C): R[A][D] becomes (R[A][C] AND R[C][D]) -> (1 AND 1) = 1. (A->C->D) A B C D A 0 1 1 1 B 0 0 1 1 C 0 0 0 1 D 0 0 0 0
The final matrix shows all reachability.
Depth First Traversal and Breadth First Traversal of Graph
- Depth First Search (DFS): Explores as far as possible along each branch before backtracking. Uses a stack (implicitly or explicitly).
DFS(graph, start_node): visited = SET() stack = STACK() stack.push(start_node) WHILE stack IS NOT EMPTY DO node = stack.pop() IF node NOT IN visited THEN ADD node TO visited PRINT node FOR EACH neighbor OF node DO IF neighbor NOT IN visited THEN stack.push(neighbor) - Breadth First Search (BFS): Explores all the neighbor nodes at the present depth before moving on to the nodes at the next depth level. Uses a queue.
BFS(graph, start_node): visited = SET() queue = QUEUE() visited.add(start_node) queue.enqueue(start_node) WHILE queue IS NOT EMPTY DO node = queue.dequeue() PRINT node FOR EACH neighbor OF node DO IF neighbor NOT IN visited THEN visited.add(neighbor) queue.enqueue(neighbor)
Topological Sorting
A linear ordering of vertices in a directed acyclic graph (DAG) such that for every directed edge u -> v, vertex u comes before v in the ordering.
- Depth First Approach: Perform DFS. When a node has no unvisited neighbors (or all neighbors are in recursion stack), add it to the front of a list (or push onto a stack and then reverse).
- Breadth First Approach (Kahn's Algorithm): Find all nodes with an in-degree of 0. Add them to a queue. While the queue is not empty, dequeue a node, add it to the sorted list, and decrement the in-degree of its neighbors. If a neighbor's in-degree becomes 0, enqueue it.
Minimum Spanning Trees (MST)
A Minimum Spanning Tree (MST) of an undirected, weighted graph is a subset of the edges that connects all the vertices together, without any cycles and with the minimum possible total edge weight.
- Prim's Algorithm (Step-by-step):
- Start with an arbitrary vertex as the initial MST.
- Repeatedly add the edge with the smallest weight that connects a vertex in the MST to a vertex not yet in the MST.
- Continue until all vertices are included in the MST.
- Kruskal's Algorithm (Step-by-step):
- Sort all edges in non-decreasing order of their weights.
- Initialize an empty MST.
- Iterate through the sorted edges: if adding an edge does not form a cycle with the edges already in the MST, add it.
- Repeat until
V-1edges are added (whereVis the number of vertices). Use a Disjoint Set Union (DSU) data structure to efficiently detect cycles.
Shortest-Path Algorithm: Dijkstra's Algorithm
Finds the shortest paths from a single source vertex to all other vertices in a graph with non-negative edge weights.
Dijkstra's Algorithm (Step-by-step):
- Initialize distances: Set distance of source vertex to 0, and all other vertices to infinity.
- Create a set
S(for visited vertices) and a min-priority queueQ(for unvisited vertices). Add all vertices toQ. - While
Qis not empty:- Extract the vertex
uwith the smallest distance fromQ. - Add
utoS. - For each neighbor
vofu:- If
vis not inSanddist[u] + weight(u,v) < dist[v], then updatedist[v] = dist[u] + weight(u,v)and updatevinQ.
- If
- Extract the vertex
Example:
Graph: A --(1)--> B --(3)--> D
| /|\
(4) (2) (1)
| / | \
C --(5)--> E
Source: A
Distances: A=0, B=inf, C=inf, D=inf, E=inf
1. Extract A (dist=0). Neighbors: B, C
Update B: dist[B] = min(inf, 0+1) = 1
Update C: dist[C] = min(inf, 0+4) = 4
Q: {B:1, C:4, D:inf, E:inf}
2. Extract B (dist=1). Neighbors: A (visited), D, E
Update D: dist[D] = min(inf, 1+3) = 4
Update E: dist[E] = min(inf, 1+2) = 3
Q: {C:4, D:4, E:3}
3. Extract E (dist=3). Neighbors: B (visited)
Q: {C:4, D:4}
4. Extract C (dist=4). Neighbors: A (visited), E (visited)
Q: {D:4}
5. Extract D (dist=4). Neighbors: B (visited)
Q: {}
Final Shortest Paths from A: A=0, B=1, C=4, D=4, E=3
7.3 Introduction to Data Models, Normalization, and SQL
Data Abstraction and Data Independence
- Data Abstraction: The process of hiding complex implementation details of data storage and retrieval, exposing only the essential features to the user.
- Physical Level: Describes how data is actually stored on disk.
- Logical Level: Describes what data is stored and the relationships among data (schema).
- View Level: Describes only part of the database relevant to a particular user or application.
- Data Independence: The ability to modify the schema at one level without affecting the schema at a higher level.
- Physical Data Independence: Changes in the physical schema (e.g., storage devices, file organization) do not require changes in the logical schema.
- Logical Data Independence: Changes in the logical schema (e.g., adding/removing attributes) do not require changes in the view level or application programs.
Schema and Instances
- Schema: The overall design or structure of the database. It describes the logical view of the entire database. It is defined during database design and is relatively stable.
- Instance: The actual data stored in the database at a particular moment in time. It changes frequently as data is inserted, deleted, or updated.
E-R Model (Entity-Relationship Model)
A high-level conceptual data model that describes the data requirements and the relationships between data.
- Entity: A real-world object that is distinguishable from other objects (e.g., Student, Course).
- Attribute: A property or characteristic of an entity (e.g., Student ID, Student Name).
- Relationship: An association between two or more entities (e.g., Student ENROLLS in Course).
- Cardinality: Specifies the number of instances of one entity that can be associated with instances of another entity (1:1, 1:N, N:M).
Strong and Weak Entity Sets
- Strong Entity Set: An entity set that has its own primary key and can exist independently.
- Weak Entity Set: An entity set that does not have a primary key of its own and depends on a strong entity set (identifying owner) for its existence and identification. It has a partial key (discriminator) that, combined with the primary key of its owner, forms its full primary key. (e.g., 'Dependent' entity depends on 'Employee').
Attributes and Keys
- Attributes:
- Simple vs. Composite: Simple (atomic), Composite (divisible, e.g., Address into Street, City).
- Single-valued vs. Multi-valued: Single (one value, e.g., Age), Multi-valued (multiple values, e.g., Phone Numbers).
- Stored vs. Derived: Stored (directly stored, e.g., DateOfBirth), Derived (calculated from other attributes, e.g., Age from DateOfBirth).
- Keys:
- Super Key: A set of one or more attributes that, taken collectively, can uniquely identify a tuple (row) in a relation.
- Candidate Key: A minimal super key; no proper subset of its attributes can uniquely identify tuples.
- Primary Key: A candidate key chosen by the database designer to uniquely identify tuples in a relation.
- Foreign Key: An attribute or set of attributes in one relation that refers to the primary key of another relation, establishing a link between them.
E-R Diagram Notation and Examples
- Rectangles: Entities
- Diamonds: Relationships
- Ellipses: Attributes (underlined for primary keys)
- Double Rectangles: Weak Entities
- Double Diamonds: Identifying Relationships for Weak Entities
- Double Ellipses: Multi-valued Attributes
- Dashed Ellipses: Derived Attributes
- Lines: Link attributes to entities, and entities to relationships.
- Cardinality Ratios: Notations like 1, N, M, * on lines connecting entities to relationships.
Example: A Student (ID, Name) enrolls in Courses (CourseID, Title).
+----------+ +----------+ +---------+
| Student |-------| ENROLLS |-------| Course |
+----------+ +----------+ +---------+
(ID, Name) (N:M) (CourseID, Title)
Normalization
The process of organizing the columns (attributes) and tables (relations) of a relational database to minimize data redundancy and improve data integrity.
- Functional Dependency (FD): An attribute B is functionally dependent on an attribute A if for every valid instance of the relation, that value of A uniquely determines the value of B (A → B).
- Full Functional Dependency: An attribute B is fully functionally dependent on a composite key A if B is functionally dependent on A, but not on any proper subset of A.
- Partial Functional Dependency: An attribute B is partially functionally dependent on a composite key A if B is functionally dependent on A, and also on a proper subset of A.
- Transitive Functional Dependency: An attribute C is transitively dependent on A if A → B and B → C (and B is not part of the primary key).
- 1NF (First Normal Form):
- Definition: A relation is in 1NF if every attribute contains only atomic (indivisible) values, and there are no repeating groups of columns.
- Example: Table
(StudentID, StudentName, Course1, Course2)is not 1NF due to repeating courses. It should be(StudentID, StudentName, Course).
- 2NF (Second Normal Form):
- Definition: A relation is in 2NF if it is in 1NF and all non-key attributes are fully functionally dependent on the primary key. (Eliminates partial dependencies).
- Example: Table
(StudentID, CourseID, StudentName, CourseTitle). If(StudentID, CourseID)is PK, andStudentID -> StudentName(partial dependency), then it's not 2NF.Step-by-step: Decompose into
(StudentID, StudentName)and(StudentID, CourseID, CourseTitle).
- 3NF (Third Normal Form):
- Definition: A relation is in 3NF if it is in 2NF and there are no transitive functional dependencies of a non-key attribute on the primary key.
- Example: Table
(StudentID, StudentName, CourseID, CourseTitle, InstructorName, InstructorDept). IfCourseID -> InstructorNameandInstructorName -> InstructorDept, thenStudentID -> InstructorDeptis transitive.Step-by-step: Decompose into
(StudentID, StudentName, CourseID),(CourseID, CourseTitle, InstructorName), and(InstructorName, InstructorDept).
- BCNF (Boyce-Codd Normal Form):
- Definition: A relation is in BCNF if it is in 3NF and for every non-trivial functional dependency
A -> B, A is a superkey. (Stricter than 3NF, handles cases with multiple candidate keys where a non-key attribute determines part of a candidate key). - Example: A table
(Student, Subject, Instructor)where(Student, Subject)is PK,Instructor -> Subject, andSubject -> Instructor. This is 3NF but not BCNF if there are multiple instructors for a subject.
- Definition: A relation is in BCNF if it is in 3NF and for every non-trivial functional dependency
Integrity Constraints and Domain Constraints
- Integrity Constraints: Rules that ensure the quality and consistency of data in a database.
- Entity Integrity: Primary key cannot be NULL.
- Referential Integrity: Foreign key values must either match a primary key value in the referenced relation or be NULL.
- Domain Constraints: Specify that the value of an attribute must come from a predefined set of values (its domain). (e.g., Age must be > 0, Gender must be 'M' or 'F').
Relations (Joined, Derived)
- Joined Relations: The result of combining two or more relations based on a common attribute (e.g., using SQL JOIN).
- Derived Relations: Relations that are not explicitly stored in the database but can be computed or derived from other stored relations (e.g., a VIEW in SQL).
Queries under DDL and DML
SQL (Structured Query Language): A standard language for managing and manipulating relational databases.
- DDL (Data Definition Language): Used to define and manage database structure.
CREATE TABLE Students (ID INT PRIMARY KEY, Name VARCHAR(50));ALTER TABLE Students ADD COLUMN Age INT;DROP TABLE Students;
- DML (Data Manipulation Language): Used to manage data within the database.
SELECT Name, Age FROM Students WHERE Age > 20;INSERT INTO Students (ID, Name, Age) VALUES (1, 'Alice', 25);UPDATE Students SET Age = 26 WHERE ID = 1;DELETE FROM Students WHERE Age < 20;
Views, Assertions and Triggering
- Views: A virtual table based on the result-set of an SQL query. It does not store data itself but provides a way to present data from one or more tables in a customized way.
- Assertions: General constraints that involve multiple tables or attributes, specified using
CREATE ASSERTION. They are checked whenever any data involved in the assertion is modified. - Triggers: Stored procedures that automatically execute (fire) in response to certain events (
INSERT,UPDATE,DELETE) on a particular table or view.
Relational Algebra
A procedural query language that takes instances of relations as input and yields instances of relations as output.
- Select (σ): Filters tuples (rows) based on a condition.
σcondition(R) - Project (π): Selects attributes (columns).
πA1, A2(R) - Join (⋈): Combines tuples from two relations based on a common attribute.
R ⋈condition S - Union (∪): Combines all tuples from two relations (must be union-compatible).
R ∪ S - Intersection (∩): Returns tuples common to both relations (must be union-compatible).
R ∩ S - Difference (-): Returns tuples in the first relation but not in the second (must be union-compatible).
R - S - Cartesian Product (×): Combines every tuple of the first relation with every tuple of the second relation.
R × S
Query Cost Estimation and Query Optimization
- Query Cost Estimation: Predicting the resources (CPU, I/O) required to execute a query, based on statistics about data, access methods, and available algorithms.
- Query Optimization: The process of selecting the most efficient execution plan for a query from a set of alternatives, aiming to minimize resource usage (cost).
- Query Decomposition: Breaking down a complex query into smaller, manageable sub-queries that can be processed more efficiently.
7.4 Transaction Processing, Concurrency Control and Crash Recovery
ACID Properties
A set of properties that guarantee that database transactions are processed reliably.
- Atomicity: A transaction is treated as a single, indivisible unit of work. Either all its operations are completed successfully, or none of them are. If any part fails, the entire transaction is rolled back.
Example: A bank transfer involves debiting one account and crediting another. If debit succeeds but credit fails, the entire transaction is undone.
- Consistency: A transaction brings the database from one valid state to another. It must preserve all defined integrity constraints (e.g., referential integrity, unique keys).
Example: If a transaction transfers money, the total sum of money in all accounts should remain the same before and after the transaction.
- Isolation: Concurrent transactions execute independently without interfering with each other. The intermediate state of a transaction is not visible to other transactions until it is committed.
Example: If two people try to withdraw from the same account simultaneously, one transaction must complete before the other sees the updated balance.
- Durability: Once a transaction is committed, its changes are permanent and survive any subsequent system failures (e.g., power loss, crash).
Example: After a successful withdrawal, the new balance is permanently recorded, even if the system crashes immediately after the transaction.
Concurrent Executions
Running multiple transactions simultaneously can lead to problems:
- Dirty Read (Uncommitted Read): A transaction reads data written by another uncommitted transaction. If the second transaction rolls back, the first transaction has read "dirty" (invalid) data.
- Lost Update: Two concurrent transactions read the same data, modify it, and write it back. One update overwrites the other, leading to a loss of one of the modifications.
- Unrepeatable Read: A transaction reads the same data twice and gets different values because another committed transaction modified the data between the two reads.
Serializability Concept
A schedule (interleaving of operations from multiple transactions) is serializable if its effect is equivalent to some serial execution of the same transactions. This ensures correctness in concurrent execution.
- Conflict Serializability: A schedule is conflict serializable if it can be transformed into a serial schedule by swapping non-conflicting operations. Two operations conflict if they are on the same data item and at least one is a write operation.
- View Serializability: A schedule is view serializable if its final state is the same as some serial schedule, and each transaction reads the same initial values and writes the same final values as in that serial schedule. (More general but harder to test than conflict serializability).
Lock based Protocols
Mechanisms used to control concurrent access to data, typically involving locking data items before accessing them.
- Shared/Exclusive Lock:
- Shared (S) Lock: Allows multiple transactions to read a data item concurrently.
- Exclusive (X) Lock: Grants exclusive access, allowing only one transaction to read and write a data item.
- Two-Phase Locking (2PL): A protocol that ensures serializability by requiring transactions to acquire all locks before releasing any.
- Growing Phase: Transaction can acquire locks but cannot release any.
- Shrinking Phase: Transaction can release locks but cannot acquire any new ones.