7. Data Structures and Algorithm, Database System, and Operating System (ACtE07)

7.1 Introduction to Data Structure, List, Linked Lists and Trees

Data Types, Data Structures, and Abstract Data Types

A Data Type defines the kind of values a variable can hold and the operations that can be performed on it (e.g., integer, float, string). A Data Structure is a particular way of organizing and storing data in a computer so that it can be accessed and modified efficiently. An Abstract Data Type (ADT) is a mathematical model for data types, defining the logical behavior of a data type from the user's perspective, independent of its implementation.

Data Type: Focuses on the type of data and allowed operations.
Data Structure: Focuses on the physical organization and storage of data.
ADT: Focuses on the logical properties and operations, abstracting away implementation details.

Time and Space Analysis of Algorithms

Algorithm analysis involves determining the resources (time and space) required by an algorithm. Big O notation (O) describes the upper bound of an algorithm's running time or space requirements, representing the worst-case scenario. Omega notation (Ω) describes the lower bound, representing the best-case scenario. Theta notation (Θ) describes both the upper and lower bounds, representing the average-case or tight bound.

Example: Linear Search

To find an element in an unsorted array of n elements:

Best Case (Ω(1)): The element is found at the first position.
Worst Case (O(n)): The element is at the last position or not present, requiring n comparisons.
Average Case (Θ(n)): On average, n/2 comparisons are needed.

Linear Data Structures: Stack and Queue

Stack (LIFO - Last In, First Out): A linear data structure that follows a particular order in which operations are performed. Items are added and removed from the same end, called the "top."

Operations: Push (add item), Pop (remove item), Peek (view top item), IsEmpty.
Implementation:
- Array-based: Fixed size, simple to implement. Potential for stack overflow.
- Linked List-based: Dynamic size, more flexible.

Queue (FIFO - First In, First Out): A linear data structure that follows a particular order in which operations are performed. Items are added at one end (rear/enqueue) and removed from the other end (front/dequeue).

Operations: Enqueue (add item), Dequeue (remove item), Front (view front item), IsEmpty.
Implementation:
- Array-based: Fixed size, can suffer from inefficient space usage (elements shift).
- Circular Queue: An improved array-based implementation where the last element is connected to the first, allowing efficient reuse of space.
- Linked List-based: Dynamic size, efficient for insertions and deletions.

Stack Applications

Infix to Postfix Conversion: Converts an arithmetic expression from infix notation (operators between operands) to postfix notation (operators after operands).

Algorithm:

Initialize an empty stack and an empty postfix expression string.
Scan the infix expression from left to right.
If an operand, append to postfix string.
If an opening parenthesis '(', push onto stack.
If a closing parenthesis ')', pop operators from stack and append to postfix until '(' is encountered. Pop and discard '('.
If an operator, pop operators from stack (and append to postfix) that have higher or equal precedence than the current operator, then push the current operator onto the stack.
After scanning, pop any remaining operators from the stack and append to postfix.

Example: Infix (A+B)*C

Char	Stack	Postfix
(	(
A	(	A
+	( +	A
B	( +	AB
)		AB+
*	*	AB+
C	*	AB+C
End		AB+C*

Evaluation of Postfix Expression:

Algorithm:

Initialize an empty stack.
Scan the postfix expression from left to right.
If an operand, push it onto the stack.
If an operator, pop two operands from the stack, perform the operation, and push the result back onto the stack.
The final result is the only element remaining in the stack.

Example: Postfix 2 3 + 4 *

Char	Stack	Operation
2	2
3	2, 3
+	5	3+2=5
4	5, 4
*	20	4*5=20
End	20	Result: 20

Array Implementation of Lists

An array can represent a list where elements are stored in contiguous memory locations. Accessing elements by index is O(1). Insertion and deletion in the middle require shifting elements, leading to O(n) complexity. Stacks and Queues can be implemented using arrays, acting as specialized lists.

Static List: Fixed size, determined at compile time (e.g., C-style arrays).
Dynamic List: Can grow or shrink in size during runtime (e.g., C++ std::vector, Java ArrayList). Often implemented using dynamic arrays that resize when capacity is exceeded.

Dynamic Implementation of Linked List

A Linked List is a linear collection of data elements, called nodes, where each node points to the next node in the sequence. Nodes are not stored in contiguous memory locations.

Types of Linked List:
- Singly Linked List: Each node contains data and a pointer to the next node. Traversal is only forward.
- Doubly Linked List: Each node contains data, a pointer to the next node, and a pointer to the previous node. Allows bidirectional traversal.
- Circular Linked List: The last node points back to the first node, forming a circle. Can be singly or doubly linked.

Basic Operations on Linked List (Singly Linked List Example):

Creation:


CREATE_NODE(data):
  node = new Node()
  node.data = data
  node.next = NULL
  return node

Insertion (Beginning):


INSERT_AT_BEGINNING(head, data):
  new_node = CREATE_NODE(data)
  new_node.next = head
  head = new_node
  return head

Insertion (End):


INSERT_AT_END(head, data):
  new_node = CREATE_NODE(data)
  IF head IS NULL THEN
    head = new_node
    RETURN head
  current = head
  WHILE current.next IS NOT NULL DO
    current = current.next
  current.next = new_node
  RETURN head

Insertion (Middle - after a specific node):


INSERT_AFTER_NODE(prev_node, data):
  IF prev_node IS NULL THEN
    PRINT "Previous node cannot be NULL"
    RETURN
  new_node = CREATE_NODE(data)
  new_node.next = prev_node.next
  prev_node.next = new_node

Deletion (by value):


DELETE_NODE(head, key):
  current = head
  prev = NULL
  IF current IS NOT NULL AND current.data == key THEN // Head node
    head = current.next
    DELETE current
    RETURN head
  WHILE current IS NOT NULL AND current.data != key DO
    prev = current
    current = current.next
  IF current IS NULL THEN // Key not found
    RETURN head
  prev.next = current.next
  DELETE current
  RETURN head

Doubly Linked Lists and its Applications: Offer O(1) insertion/deletion if the node reference is available, and O(n) for searching. Applications include:

Implementing Undo/Redo functionality in editors.
Browser history (back/forward buttons).
Implementing LRU (Least Recently Used) cache.

Concept of Tree

A Tree is a non-linear data structure representing hierarchical data, consisting of nodes connected by edges.

Root: The topmost node of the tree.
Node: An entity in a tree that stores data and may have children.
Edge: A connection between two nodes.
Leaf Node: A node with no children.
Parent: A node that has one or more child nodes.
Child: A node directly connected to another node when moving away from the root.
Degree: The number of children a node has. For a tree, it's the maximum degree of any node.
Height: The length of the longest path from the root to a leaf node. (Number of edges)
Level: The level of the root node is 0 (or 1). The level of any other node is its parent's level + 1.
Depth: The length of the path from the root to a specific node. (Same as level for a node).

Operations in Binary Tree

A Binary Tree is a tree data structure where each node has at most two children, referred to as the left child and the right child.

Insertion: New nodes are typically inserted as leaf nodes, maintaining the binary tree property (e.g., in a Binary Search Tree, smaller values go left, larger go right).
Deletion: More complex, involving handling cases like deleting a leaf node, a node with one child, or a node with two children (often replaced by its in-order successor or predecessor).
Searching: Start from the root and traverse down, comparing the target value with the current node's value. In a BST, if target < current, go left; if target > current, go right.

Tree Traversals

Visiting each node in a tree exactly once.

Pre-order Traversal (Root, Left, Right):


PREORDER(node):
  IF node IS NOT NULL THEN
    PRINT node.data
    PREORDER(node.left)
    PREORDER(node.right)

In-order Traversal (Left, Root, Right):


INORDER(node):
  IF node IS NOT NULL THEN
    INORDER(node.left)
    PRINT node.data
    INORDER(node.right)

Post-order Traversal (Left, Right, Root):


POSTORDER(node):
  IF node IS NOT NULL THEN
    POSTORDER(node.left)
    POSTORDER(node.right)
    PRINT node.data

Example Tree:

Pre-order: 4, 2, 1, 3, 5
In-order: 1, 2, 3, 4, 5 (Sorted order for a BST)
Post-order: 1, 3, 2, 5, 4

Height, Level and Depth of a Tree

As defined above, Height is the longest path from root to leaf. Level is the distance from the root (root at level 0 or 1). Depth of a node is the length of the path from the root to that node.

AVL Balanced Trees

An AVL Tree is a self-balancing Binary Search Tree where the difference between the heights of the left and right subtrees (balance factor) of any node is at most 1. If this property is violated, the tree performs rotations to rebalance itself.

Rotations:
- LL Rotation (Left-Left): Occurs when a node is inserted into the left subtree of the left child of an unbalanced node. A single right rotation is performed.
- RR Rotation (Right-Right): Occurs when a node is inserted into the right subtree of the right child of an unbalanced node. A single left rotation is performed.
- LR Rotation (Left-Right): Occurs when a node is inserted into the right subtree of the left child of an unbalanced node. A left rotation is performed on the child, followed by a right rotation on the parent.
- RL Rotation (Right-Left): Occurs when a node is inserted into the left subtree of the right child of an unbalanced node. A right rotation is performed on the child, followed by a left rotation on the parent.

7.2 Sorting, Searching, and Graphs

Types of Sorting

Internal Sorting: All data to be sorted fits into the main memory.
External Sorting: Data is too large to fit into main memory and resides on disk, requiring external storage access during sorting.

Sorting Algorithms

Each algorithm below aims to arrange elements in a specific order (ascending or descending).

Insertion Sort:
Algorithm: Iterates through the list, picking an element and inserting it into its correct position among the elements already sorted. Like sorting a hand of playing cards.
```
INSERTION_SORT(arr, n):
  FOR i FROM 1 TO n-1 DO
    key = arr[i]
    j = i - 1
    WHILE j >= 0 AND arr[j] > key DO
      arr[j+1] = arr[j]
      j = j - 1
    arr[j+1] = key
        
```
Time Complexity: O(n²) in worst and average cases. O(n) in best case (already sorted).

Selection Sort:

Algorithm: Finds the minimum element from the unsorted part and puts it at the beginning of the sorted part.


SELECTION_SORT(arr, n):
  FOR i FROM 0 TO n-2 DO
    min_idx = i
    FOR j FROM i+1 TO n-1 DO
      IF arr[j] < arr[min_idx] THEN
        min_idx = j
    SWAP(arr[min_idx], arr[i])

Time Complexity: O(n²) in all cases.

Exchange Sort / Bubble Sort:

Algorithm: Repeatedly steps through the list, compares adjacent elements and swaps them if they are in the wrong order. Passes through the list until no swaps are needed.


BUBBLE_SORT(arr, n):
  FOR i FROM 0 TO n-2 DO
    swapped = FALSE
    FOR j FROM 0 TO n-i-2 DO
      IF arr[j] > arr[j+1] THEN
        SWAP(arr[j], arr[j+1])
        swapped = TRUE
    IF NOT swapped THEN BREAK // Optimization for already sorted arrays

Time Complexity: O(n²) in worst and average cases. O(n) in best case.

Merge Sort:
Algorithm: A divide-and-conquer algorithm. It divides the unsorted list into n sublists, each containing one element, then repeatedly merges sublists to produce new sorted sublists until there is only one sorted list remaining.
```
MERGE_SORT(arr, left, right):
  IF left < right THEN
    mid = (left + right) / 2
    MERGE_SORT(arr, left, mid)
    MERGE_SORT(arr, mid+1, right)
    MERGE(arr, left, mid, right) // Merges two sorted subarrays
        
```
Time Complexity: O(n log n) in all cases.
Radix Sort:
Algorithm: A non-comparative integer sorting algorithm that sorts data with integer keys by grouping keys by individual digits which share the same significant position and value.
```
RADIX_SORT(arr, n):
  max_val = GET_MAX(arr, n) // Find maximum number to determine number of digits
  exp = 1
  WHILE max_val / exp > 0 DO
    COUNT_SORT(arr, n, exp) // Use Counting Sort for current digit position
    exp = exp * 10
        
```
Time Complexity: O(d * (n + b)), where d is the number of digits, n is the number of elements, and b is the base of the number system (e.g., 10 for decimal).
Shell Sort:
Algorithm: An in-place comparison sort. It is an extension of insertion sort that allows the exchange of items that are far apart. It starts by sorting pairs of elements far apart, then progressively reducing the gap between elements to be compared.
```
SHELL_SORT(arr, n):
  gap = n / 2
  WHILE gap > 0 DO
    FOR i FROM gap TO n-1 DO
      temp = arr[i]
      j = i
      WHILE j >= gap AND arr[j - gap] > temp DO
        arr[j] = arr[j - gap]
        j = j - gap
      arr[j] = temp
    gap = gap / 2
        
```
Time Complexity: Depends on the gap sequence, typically ranges from O(n log² n) to O(n^1.5) in worst case.

Heap Sort as Priority Queue:

Algorithm: Uses a binary heap data structure. It first builds a max-heap from the input data. Then, it repeatedly extracts the maximum element from the heap (which is the root), swaps it with the last element, reduces the heap size, and calls heapify on the new root.


HEAPIFY(arr, n, i): // To maintain max-heap property
  largest = i
  left = 2*i + 1
  right = 2*i + 2
  IF left < n AND arr[left] > arr[largest] THEN
    largest = left
  IF right < n AND arr[right] > arr[largest] THEN
    largest = right
  IF largest != i THEN
    SWAP(arr[i], arr[largest])
    HEAPIFY(arr, n, largest)

HEAP_SORT(arr, n):
  FOR i FROM n/2 - 1 DOWNTO 0 DO // Build max-heap
    HEAPIFY(arr, n, i)
  FOR i FROM n-1 DOWNTO 0 DO // Extract elements one by one
    SWAP(arr[0], arr[i])
    HEAPIFY(arr, i, 0) // Call heapify on reduced heap

Time Complexity: Building heap is O(n). Sorting is O(n log n).

Big O Notation and Efficiency of Sorting

A comparison of common sorting algorithms:

Algorithm	Best Case	Average Case	Worst Case	Space Complexity
Bubble Sort	O(n)	O(n²)	O(n²)	O(1)
Insertion Sort	O(n)	O(n²)	O(n²)	O(1)
Selection Sort	O(n²)	O(n²)	O(n²)	O(1)
Merge Sort	O(n log n)	O(n log n)	O(n log n)	O(n)
Heap Sort	O(n log n)	O(n log n)	O(n log n)	O(1)
Radix Sort	O(d*(n+b))	O(d*(n+b))	O(d*(n+b))	O(n+b)
Shell Sort	O(n log n)	O(n log² n)	O(n^1.5)	O(1)

Search Techniques

Sequential Search (Linear Search):

Algorithm: Checks each element in the list sequentially until a match is found or the end of the list is reached.


SEQUENTIAL_SEARCH(arr, n, target):
  FOR i FROM 0 TO n-1 DO
    IF arr[i] == target THEN
      RETURN i // Element found at index i
  RETURN -1 // Element not found

Time Complexity: O(n).

Binary Search:
Algorithm: Efficiently finds an element in a sorted list. It repeatedly divides the search interval in half. If the value of the search key is less than the item in the middle of the interval, narrow the interval to the lower half. Otherwise, narrow it to the upper half.
```
BINARY_SEARCH(arr, low, high, target):
  WHILE low <= high DO
    mid = low + (high - low) / 2
    IF arr[mid] == target THEN
      RETURN mid
    ELSE IF arr[mid] < target THEN
      low = mid + 1
    ELSE
      high = mid - 1
  RETURN -1
        
```
Example: Search for 23 in [2, 5, 8, 12, 16, 23, 38, 56, 72, 91]
1. low=0, high=9, mid=4 (arr[4]=16). 16 < 23, so low = 4+1=5.
2. low=5, high=9, mid=7 (arr[7]=56). 56 > 23, so high = 7-1=6.
3. low=5, high=6, mid=5 (arr[5]=23). 23 == 23, return 5.
Time Complexity: O(log n).
Tree Search (BST Search): Searching in a Binary Search Tree (BST) involves comparing the target value with the current node. If the target is smaller, move to the left child; if larger, move to the right child. This continues until the value is found or a null pointer is encountered. Time Complexity: O(h) where h is the height of the tree (O(log n) for balanced, O(n) for skewed).
General Search Tree: A tree where values in subtrees are ordered relative to the root, but not necessarily binary (e.g., B-trees).

Hashing

Hashing is the process of mapping data of arbitrary size to a fixed-size value, typically an integer, using a Hash Function. The output is called a hash value or hash code.

Hash Functions:
- Division Method: h(k) = k mod m, where k is the key, and m is the size of the hash table (often a prime number).
- Multiplication Method: h(k) = floor(m * (k * A mod 1)), where A is a constant (0 < A < 1).
Hash Tables: A data structure that implements an associative array (map) abstract data type, mapping keys to values. It uses a hash function to compute an index into an array of buckets or slots, from which the desired value can be found.
Collision Resolution: When two different keys hash to the same index.
- Open Addressing: All elements are stored in the hash table itself.
  - Linear Probing: If a collision occurs at index i, search for the next empty slot at (i+1) mod m, (i+2) mod m, etc.
  - Quadratic Probing: If a collision occurs at index i, search for the next empty slot at (i+1²) mod m, (i+2²) mod m, etc.
- Chaining: Each slot in the hash table is a pointer to a linked list (or other data structure) of all the keys that hash to that slot.

Undirected and Directed Graphs

A Graph is a non-linear data structure consisting of a finite set of vertices (or nodes) and a set of edges connecting pairs of vertices.

Undirected Graph: Edges have no direction; if an edge connects A and B, it means A is connected to B and B is connected to A.
Directed Graph (Digraph): Edges have a direction; an edge from A to B means a connection from A to B, but not necessarily from B to A.
Terminology: Vertex, Edge, Weight (cost associated with an edge), Path (sequence of vertices connected by edges), Cycle (a path that starts and ends at the same vertex), Degree (number of edges incident to a vertex).

Representation of Graph

Adjacency Matrix: A 2D array Adj[V][V] where V is the number of vertices. Adj[i][j] = 1 if there is an edge from vertex i to vertex j, otherwise 0. For weighted graphs, store weights instead of 1.
Pros: Quick check for existence of an edge (O(1)).

Cons: High space complexity O(V²), even for sparse graphs.
Adjacency List: An array of linked lists where the i-th element of the array contains a linked list of all vertices adjacent to vertex i.
Pros: Space efficient for sparse graphs O(V+E), where E is the number of edges.

Cons: Checking for an edge can be O(degree(V)).

Transitive Closure of Graph (Warshall's Algorithm)

The Transitive Closure of a directed graph G is a graph G* with the same vertices as G, such that there is an edge from vertex i to vertex j in G* if and only if there is a path from i to j in G.

Warshall's Algorithm: Computes the transitive closure using dynamic programming. It iteratively updates a connectivity matrix R.


WARSHALL(Adj_Matrix R, n):
  // Initialize R with adjacency matrix (R[i][j] = 1 if edge, else 0)
  FOR k FROM 0 TO n-1 DO
    FOR i FROM 0 TO n-1 DO
      FOR j FROM 0 TO n-1 DO
        R[i][j] = R[i][j] OR (R[i][k] AND R[k][j])
  RETURN R

Example:

Graph: A->B, B->C, C->D, A->C
Initial R:
  A B C D
A 0 1 1 0
B 0 0 1 0
C 0 0 0 1
D 0 0 0 0

After k=0 (A): (no new paths)
After k=1 (B): R[A][C] becomes (R[A][B] AND R[B][C]) -> (1 AND 1) = 1. (A->B->C)
  A B C D
A 0 1 1 0
B 0 0 1 0
C 0 0 0 1
D 0 0 0 0

After k=2 (C): R[A][D] becomes (R[A][C] AND R[C][D]) -> (1 AND 1) = 1. (A->C->D)
  A B C D
A 0 1 1 1
B 0 0 1 1
C 0 0 0 1
D 0 0 0 0

The final matrix shows all reachability.

Depth First Traversal and Breadth First Traversal of Graph

Depth First Search (DFS): Explores as far as possible along each branch before backtracking. Uses a stack (implicitly or explicitly).


DFS(graph, start_node):
  visited = SET()
  stack = STACK()
  stack.push(start_node)
  WHILE stack IS NOT EMPTY DO
    node = stack.pop()
    IF node NOT IN visited THEN
      ADD node TO visited
      PRINT node
      FOR EACH neighbor OF node DO
        IF neighbor NOT IN visited THEN
          stack.push(neighbor)

Breadth First Search (BFS): Explores all the neighbor nodes at the present depth before moving on to the nodes at the next depth level. Uses a queue.


BFS(graph, start_node):
  visited = SET()
  queue = QUEUE()
  visited.add(start_node)
  queue.enqueue(start_node)
  WHILE queue IS NOT EMPTY DO
    node = queue.dequeue()
    PRINT node
    FOR EACH neighbor OF node DO
      IF neighbor NOT IN visited THEN
        visited.add(neighbor)
        queue.enqueue(neighbor)

Topological Sorting

A linear ordering of vertices in a directed acyclic graph (DAG) such that for every directed edge u -> v, vertex u comes before v in the ordering.

Depth First Approach: Perform DFS. When a node has no unvisited neighbors (or all neighbors are in recursion stack), add it to the front of a list (or push onto a stack and then reverse).
Breadth First Approach (Kahn's Algorithm): Find all nodes with an in-degree of 0. Add them to a queue. While the queue is not empty, dequeue a node, add it to the sorted list, and decrement the in-degree of its neighbors. If a neighbor's in-degree becomes 0, enqueue it.

Minimum Spanning Trees (MST)

A Minimum Spanning Tree (MST) of an undirected, weighted graph is a subset of the edges that connects all the vertices together, without any cycles and with the minimum possible total edge weight.

Prim's Algorithm (Step-by-step):
1. Start with an arbitrary vertex as the initial MST.
2. Repeatedly add the edge with the smallest weight that connects a vertex in the MST to a vertex not yet in the MST.
3. Continue until all vertices are included in the MST.
Kruskal's Algorithm (Step-by-step):
1. Sort all edges in non-decreasing order of their weights.
2. Initialize an empty MST.
3. Iterate through the sorted edges: if adding an edge does not form a cycle with the edges already in the MST, add it.
4. Repeat until V-1 edges are added (where V is the number of vertices). Use a Disjoint Set Union (DSU) data structure to efficiently detect cycles.

Shortest-Path Algorithm: Dijkstra's Algorithm

Finds the shortest paths from a single source vertex to all other vertices in a graph with non-negative edge weights.

Dijkstra's Algorithm (Step-by-step):

Initialize distances: Set distance of source vertex to 0, and all other vertices to infinity.
Create a set S (for visited vertices) and a min-priority queue Q (for unvisited vertices). Add all vertices to Q.
While Q is not empty:
1. Extract the vertex u with the smallest distance from Q.
2. Add u to S.
3. For each neighbor v of u:
  - If v is not in S and dist[u] + weight(u,v) < dist[v], then update dist[v] = dist[u] + weight(u,v) and update v in Q.

Example:

Graph: A --(1)--> B --(3)--> D
       |         /|\
      (4)       (2) (1)
       |       / | \
       C --(5)--> E
Source: A

Distances: A=0, B=inf, C=inf, D=inf, E=inf

1. Extract A (dist=0). Neighbors: B, C
   Update B: dist[B] = min(inf, 0+1) = 1
   Update C: dist[C] = min(inf, 0+4) = 4
   Q: {B:1, C:4, D:inf, E:inf}

2. Extract B (dist=1). Neighbors: A (visited), D, E
   Update D: dist[D] = min(inf, 1+3) = 4
   Update E: dist[E] = min(inf, 1+2) = 3
   Q: {C:4, D:4, E:3}

3. Extract E (dist=3). Neighbors: B (visited)
   Q: {C:4, D:4}

4. Extract C (dist=4). Neighbors: A (visited), E (visited)
   Q: {D:4}

5. Extract D (dist=4). Neighbors: B (visited)
   Q: {}

Final Shortest Paths from A: A=0, B=1, C=4, D=4, E=3

7.3 Introduction to Data Models, Normalization, and SQL

Data Abstraction and Data Independence

Data Abstraction: The process of hiding complex implementation details of data storage and retrieval, exposing only the essential features to the user.
- Physical Level: Describes how data is actually stored on disk.
- Logical Level: Describes what data is stored and the relationships among data (schema).
- View Level: Describes only part of the database relevant to a particular user or application.
Data Independence: The ability to modify the schema at one level without affecting the schema at a higher level.
- Physical Data Independence: Changes in the physical schema (e.g., storage devices, file organization) do not require changes in the logical schema.
- Logical Data Independence: Changes in the logical schema (e.g., adding/removing attributes) do not require changes in the view level or application programs.

Schema and Instances

Schema: The overall design or structure of the database. It describes the logical view of the entire database. It is defined during database design and is relatively stable.
Instance: The actual data stored in the database at a particular moment in time. It changes frequently as data is inserted, deleted, or updated.

E-R Model (Entity-Relationship Model)

A high-level conceptual data model that describes the data requirements and the relationships between data.

Entity: A real-world object that is distinguishable from other objects (e.g., Student, Course).
Attribute: A property or characteristic of an entity (e.g., Student ID, Student Name).
Relationship: An association between two or more entities (e.g., Student ENROLLS in Course).
Cardinality: Specifies the number of instances of one entity that can be associated with instances of another entity (1:1, 1:N, N:M).

Strong and Weak Entity Sets

Strong Entity Set: An entity set that has its own primary key and can exist independently.
Weak Entity Set: An entity set that does not have a primary key of its own and depends on a strong entity set (identifying owner) for its existence and identification. It has a partial key (discriminator) that, combined with the primary key of its owner, forms its full primary key. (e.g., 'Dependent' entity depends on 'Employee').

Attributes and Keys

Attributes:
- Simple vs. Composite: Simple (atomic), Composite (divisible, e.g., Address into Street, City).
- Single-valued vs. Multi-valued: Single (one value, e.g., Age), Multi-valued (multiple values, e.g., Phone Numbers).
- Stored vs. Derived: Stored (directly stored, e.g., DateOfBirth), Derived (calculated from other attributes, e.g., Age from DateOfBirth).
Keys:
- Super Key: A set of one or more attributes that, taken collectively, can uniquely identify a tuple (row) in a relation.
- Candidate Key: A minimal super key; no proper subset of its attributes can uniquely identify tuples.
- Primary Key: A candidate key chosen by the database designer to uniquely identify tuples in a relation.
- Foreign Key: An attribute or set of attributes in one relation that refers to the primary key of another relation, establishing a link between them.

E-R Diagram Notation and Examples

Rectangles: Entities
Diamonds: Relationships
Ellipses: Attributes (underlined for primary keys)
Double Rectangles: Weak Entities
Double Diamonds: Identifying Relationships for Weak Entities
Double Ellipses: Multi-valued Attributes
Dashed Ellipses: Derived Attributes
Lines: Link attributes to entities, and entities to relationships.
Cardinality Ratios: Notations like 1, N, M, * on lines connecting entities to relationships.

Example: A Student (ID, Name) enrolls in Courses (CourseID, Title).


+----------+       +----------+       +---------+
|  Student |-------| ENROLLS  |-------| Course  |
+----------+       +----------+       +---------+
  (ID, Name)           (N:M)         (CourseID, Title)

Normalization

The process of organizing the columns (attributes) and tables (relations) of a relational database to minimize data redundancy and improve data integrity.

Functional Dependency (FD): An attribute B is functionally dependent on an attribute A if for every valid instance of the relation, that value of A uniquely determines the value of B (A → B).
- Full Functional Dependency: An attribute B is fully functionally dependent on a composite key A if B is functionally dependent on A, but not on any proper subset of A.
- Partial Functional Dependency: An attribute B is partially functionally dependent on a composite key A if B is functionally dependent on A, and also on a proper subset of A.
- Transitive Functional Dependency: An attribute C is transitively dependent on A if A → B and B → C (and B is not part of the primary key).
1NF (First Normal Form):
- Definition: A relation is in 1NF if every attribute contains only atomic (indivisible) values, and there are no repeating groups of columns.
- Example: Table (StudentID, StudentName, Course1, Course2) is not 1NF due to repeating courses. It should be (StudentID, StudentName, Course).
2NF (Second Normal Form):
- Definition: A relation is in 2NF if it is in 1NF and all non-key attributes are fully functionally dependent on the primary key. (Eliminates partial dependencies).
- Example: Table (StudentID, CourseID, StudentName, CourseTitle). If (StudentID, CourseID) is PK, and StudentID -> StudentName (partial dependency), then it's not 2NF.
  Step-by-step: Decompose into (StudentID, StudentName) and (StudentID, CourseID, CourseTitle).
3NF (Third Normal Form):
- Definition: A relation is in 3NF if it is in 2NF and there are no transitive functional dependencies of a non-key attribute on the primary key.
- Example: Table (StudentID, StudentName, CourseID, CourseTitle, InstructorName, InstructorDept). If CourseID -> InstructorName and InstructorName -> InstructorDept, then StudentID -> InstructorDept is transitive.
  Step-by-step: Decompose into (StudentID, StudentName, CourseID), (CourseID, CourseTitle, InstructorName), and (InstructorName, InstructorDept).
BCNF (Boyce-Codd Normal Form):
- Definition: A relation is in BCNF if it is in 3NF and for every non-trivial functional dependency A -> B, A is a superkey. (Stricter than 3NF, handles cases with multiple candidate keys where a non-key attribute determines part of a candidate key).
- Example: A table (Student, Subject, Instructor) where (Student, Subject) is PK, Instructor -> Subject, and Subject -> Instructor. This is 3NF but not BCNF if there are multiple instructors for a subject.

Integrity Constraints and Domain Constraints

Integrity Constraints: Rules that ensure the quality and consistency of data in a database.
- Entity Integrity: Primary key cannot be NULL.
- Referential Integrity: Foreign key values must either match a primary key value in the referenced relation or be NULL.
Domain Constraints: Specify that the value of an attribute must come from a predefined set of values (its domain). (e.g., Age must be > 0, Gender must be 'M' or 'F').

Relations (Joined, Derived)

Joined Relations: The result of combining two or more relations based on a common attribute (e.g., using SQL JOIN).
Derived Relations: Relations that are not explicitly stored in the database but can be computed or derived from other stored relations (e.g., a VIEW in SQL).

Queries under DDL and DML

SQL (Structured Query Language): A standard language for managing and manipulating relational databases.

DDL (Data Definition Language): Used to define and manage database structure.
- CREATE TABLE Students (ID INT PRIMARY KEY, Name VARCHAR(50));
- ALTER TABLE Students ADD COLUMN Age INT;
- DROP TABLE Students;
DML (Data Manipulation Language): Used to manage data within the database.
- SELECT Name, Age FROM Students WHERE Age > 20;
- INSERT INTO Students (ID, Name, Age) VALUES (1, 'Alice', 25);
- UPDATE Students SET Age = 26 WHERE ID = 1;
- DELETE FROM Students WHERE Age < 20;

Views, Assertions and Triggering

Views: A virtual table based on the result-set of an SQL query. It does not store data itself but provides a way to present data from one or more tables in a customized way.
Assertions: General constraints that involve multiple tables or attributes, specified using CREATE ASSERTION. They are checked whenever any data involved in the assertion is modified.
Triggers: Stored procedures that automatically execute (fire) in response to certain events (INSERT, UPDATE, DELETE) on a particular table or view.

Relational Algebra

A procedural query language that takes instances of relations as input and yields instances of relations as output.

Select (σ): Filters tuples (rows) based on a condition. σ_condition(R)
Project (π): Selects attributes (columns). π_{A1, A2}(R)
Join (⋈): Combines tuples from two relations based on a common attribute. R ⋈_condition S
Union (∪): Combines all tuples from two relations (must be union-compatible). R ∪ S
Intersection (∩): Returns tuples common to both relations (must be union-compatible). R ∩ S
Difference (-): Returns tuples in the first relation but not in the second (must be union-compatible). R - S
Cartesian Product (×): Combines every tuple of the first relation with every tuple of the second relation. R × S

Query Cost Estimation and Query Optimization

Query Cost Estimation: Predicting the resources (CPU, I/O) required to execute a query, based on statistics about data, access methods, and available algorithms.
Query Optimization: The process of selecting the most efficient execution plan for a query from a set of alternatives, aiming to minimize resource usage (cost).
Query Decomposition: Breaking down a complex query into smaller, manageable sub-queries that can be processed more efficiently.

7.4 Transaction Processing, Concurrency Control and Crash Recovery

ACID Properties

A set of properties that guarantee that database transactions are processed reliably.

Atomicity: A transaction is treated as a single, indivisible unit of work. Either all its operations are completed successfully, or none of them are. If any part fails, the entire transaction is rolled back.
Example: A bank transfer involves debiting one account and crediting another. If debit succeeds but credit fails, the entire transaction is undone.
Consistency: A transaction brings the database from one valid state to another. It must preserve all defined integrity constraints (e.g., referential integrity, unique keys).
Example: If a transaction transfers money, the total sum of money in all accounts should remain the same before and after the transaction.
Isolation: Concurrent transactions execute independently without interfering with each other. The intermediate state of a transaction is not visible to other transactions until it is committed.
Example: If two people try to withdraw from the same account simultaneously, one transaction must complete before the other sees the updated balance.
Durability: Once a transaction is committed, its changes are permanent and survive any subsequent system failures (e.g., power loss, crash).
Example: After a successful withdrawal, the new balance is permanently recorded, even if the system crashes immediately after the transaction.

Concurrent Executions

Running multiple transactions simultaneously can lead to problems:

Dirty Read (Uncommitted Read): A transaction reads data written by another uncommitted transaction. If the second transaction rolls back, the first transaction has read "dirty" (invalid) data.
Lost Update: Two concurrent transactions read the same data, modify it, and write it back. One update overwrites the other, leading to a loss of one of the modifications.
Unrepeatable Read: A transaction reads the same data twice and gets different values because another committed transaction modified the data between the two reads.

Serializability Concept

A schedule (interleaving of operations from multiple transactions) is serializable if its effect is equivalent to some serial execution of the same transactions. This ensures correctness in concurrent execution.

Conflict Serializability: A schedule is conflict serializable if it can be transformed into a serial schedule by swapping non-conflicting operations. Two operations conflict if they are on the same data item and at least one is a write operation.
View Serializability: A schedule is view serializable if its final state is the same as some serial schedule, and each transaction reads the same initial values and writes the same final values as in that serial schedule. (More general but harder to test than conflict serializability).

Lock based Protocols

Mechanisms used to control concurrent access to data, typically involving locking data items before accessing them.

Shared/Exclusive Lock:
- Shared (S) Lock: Allows multiple transactions to read a data item concurrently.
- Exclusive (X) Lock: Grants exclusive access, allowing only one transaction to read and write a data item.
Two-Phase Locking (2PL): A protocol that ensures serializability by requiring transactions to acquire all locks before releasing any.
- Growing Phase: Transaction can acquire locks but cannot release any.
- Shrinking Phase: Transaction can release locks but cannot acquire any new ones.