Phase 5: Monitoring & Maintenance

This phase is the core of day-to-day operations. System Administrators must proactively monitor the environment for availability, performance bottlenecks, and security threats. When things inevitably break, they must rapidly investigate the root cause and execute repairs to restore normal operations.

System Administrator: Phase 5 Guide

J.63SAM00.019.1

Monitoring System Availability

Detailed Explanation: Availability monitoring (often called "Uptime Monitoring") answers a simple question: "Is the server online and responding to requests?" This is usually the first alert an administrator receives when an outage occurs. It relies on external checks such as ICMP echo requests (ping) or HTTP status codes.

Code Snippet: Basic HTTP Availability Watchdog Script

While enterprise teams use tools like Prometheus or Nagios, understanding the core concept is vital. This script checks if a website returns a "200 OK" status; if not, it triggers an alert.

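#!/bin/bash
# availability_check.sh
TARGET_URL="https://www.company.com"
ADMIN_EMAIL="admin@company.com"

# Fetch only the HTTP status code; give up after 10 seconds so the check cannot hang.
# curl reports 000 when no HTTP response arrives (DNS failure, timeout, refused connection).
STATUS_CODE=$(curl -o /dev/null -s -w "%{http_code}" --max-time 10 "$TARGET_URL")

# Compare as a string so an empty or unexpected value still counts as a failure
if [ "$STATUS_CODE" != "200" ]; then
    echo "ALERT: $TARGET_URL is DOWN! Status Code: $STATUS_CODE" | mail -s "CRITICAL: Website Down" "$ADMIN_EMAIL"
    # Alternatively, trigger a webhook to Slack/Microsoft Teams here
else
    echo "SUCCESS: $TARGET_URL is UP (Status: $STATUS_CODE)"
fi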
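
The explanation above also names ICMP as an external check. A minimal sketch of a ping-based check follows, assuming the Linux iputils version of ping (where -c sets the number of requests and -W the reply timeout in seconds). The function names host_is_up and report_host are illustrative and not part of any standard tool.

```shell
#!/bin/bash
# ping_check.sh (hypothetical file name): ICMP availability check sketch

# Return 0 if the host answers at least one echo request.
# $1 = host name or IP address, $2 = number of requests (default 3)
host_is_up() {
    local host="$1" count="${2:-3}"
    ping -c "$count" -W 2 "$host" > /dev/null 2>&1
}

# Print a one-line status message for the given host.
report_host() {
    local host="$1"
    if host_is_up "$host"; then
        echo "SUCCESS: $host responds to ping"
    else
        echo "ALERT: $host does not respond to ping"
    fi
}

# Example call:
#   report_host "www.company.com"
```

Many hosts and firewalls drop ICMP traffic, so a failed ping alone does not prove that a service is down. It is usually paired with an HTTP check like the one above.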
J.63SAM00.020.1

Monitoring System Performance

Detailed Explanation: A server can be "online" but performing so poorly that it's practically useless to users. Performance monitoring involves tracking internal metrics: CPU load, RAM utilization, Disk I/O, and Network throughput over time to identify trends and bottlenecks.

Figure: Visualizing Performance Spikes

[Chart: CPU usage and memory usage plotted on a 0% to 100% scale, with an annotation reading "CPU Bottleneck Detected".]

Command Line Tools: Administrators frequently use real-time tools such as top, htop, or iostat to view performance data interactively on the server console.
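
For trend tracking, the same metrics can be sampled non-interactively. The sketch below assumes a Linux host that exposes /proc/loadavg and /proc/meminfo (the MemAvailable field needs kernel 3.14 or later). The function names are illustrative, and each accepts an optional file path so it can be tried against saved copies of those files.

```shell
#!/bin/bash
# perf_snapshot.sh (hypothetical file name): one-shot performance readings

# Print the 1-minute load average (first field of /proc/loadavg).
load_1min() {
    local file="${1:-/proc/loadavg}"
    awk '{ print $1 }' "$file"
}

# Print memory in use as a whole-number percentage,
# computed from the MemTotal and MemAvailable fields.
mem_used_percent() {
    local file="${1:-/proc/meminfo}"
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }' "$file"
}

# Print the usage percentage of the root filesystem without the % sign.
root_disk_percent() {
    df -P / | awk 'NR == 2 { sub("%", "", $5); print $5 }'
}

# Example call:
#   echo "load=$(load_1min) mem=$(mem_used_percent)% disk=$(root_disk_percent)%"
```

Running the example line from cron and appending the output to a file is one simple way to build a history, though dedicated tools such as Prometheus handle this more robustly.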
J.63SAM00.021.2

Monitoring System Security

Detailed Explanation: Security monitoring is the active observation of logs to detect unauthorized access, privilege escalation, or malware. It relies heavily on reviewing system authentication logs and application error logs. Centralized logging (sending all logs to a SIEM such as Splunk or Elastic Security) is an enterprise standard.

Code Snippet: Real-Time Security Log Monitoring (journalctl)

A system administrator can use the systemd journal to watch continuously for failed SSH login attempts in real time.

# Follow the SSH service's journal entries in real time (-f)
# and keep only the lines that report a failed password.
# The unit is named 'ssh' on Debian/Ubuntu and 'sshd' on RHEL-family systems.
sudo journalctl -u ssh -f | grep "Failed password"

# Example Output showing a brute-force attempt:
# May 15 10:22:11 server01 sshd[1234]: Failed password for root from 192.168.1.50 port 55432 ssh2
# May 15 10:22:14 server01 sshd[1236]: Failed password for invalid user admin from 192.168.1.50 port 55444 ssh2
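
Watching lines scroll past makes it hard to judge scale. As a rough sketch, the same log lines can be summarized per source address. The function name count_failed_by_ip is illustrative, and it assumes the sshd message format shown in the example output, where the address follows the word "from".

```shell
#!/bin/bash
# Read sshd log lines on stdin and print "COUNT IP" per source address,
# highest count first.
count_failed_by_ip() {
    awk '/Failed password/ {
             ip = ""
             # Take the field after the last "from" so unusual user names do not confuse the match
             for (i = 1; i < NF; i++) if ($i == "from") ip = $(i + 1)
             if (ip != "") print ip
         }' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'
}

# Example call (unit name 'ssh' as in the snippet above):
#   sudo journalctl -u ssh --since "1 hour ago" --no-pager | count_failed_by_ip
```

An address with a large count in a short window is a candidate for blocking, for example with a tool such as fail2ban.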
J.63SAM00.022.1

Investigating System Faults

Detailed Explanation: When a monitor triggers an alert, the administrator must perform Root Cause Analysis (RCA). This requires methodical troubleshooting rather than blind guessing. One common approach is to work up the OSI model, from Layer 1 (physical) to Layer 7 (application), confirming that each layer works before moving to the next.

Figure: Troubleshooting Workflow

1. Identify Alert (Web service down)
2. Check Logs (/var/log/nginx/error.log)
3. Isolate Issue (Nginx can't reach DB)
4. Test Hypothesis (Can I ping the DB?)
5. Determine Fix (DB is out of memory)
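
Step 4 of the workflow can be sketched as a small script that tests the path to the database from the bottom layer up and stops at the first failure. This is an illustration under stated assumptions: the function names are hypothetical, the host and port are placeholders supplied by the caller, and it assumes an nc (netcat) build that supports the -z and -w options.

```shell
#!/bin/bash
# Run one diagnostic command and print PASS or FAIL with a label.
# Usage: check_step "label" command [args...]
check_step() {
    local label="$1"
    shift
    if "$@" > /dev/null 2>&1; then
        echo "PASS: $label"
    else
        echo "FAIL: $label"
        return 1
    fi
}

# Test the network path to a database host, lowest layer first.
# $1 = database host, $2 = database TCP port
diagnose_db_path() {
    local db_host="$1" db_port="$2"
    check_step "ping $db_host" ping -c 1 -W 2 "$db_host" || return 1
    check_step "TCP port $db_port on $db_host" nc -z -w 2 "$db_host" "$db_port" || return 1
    echo "Network path looks healthy; inspect the database service itself"
}

# Example call (placeholder host and port):
#   diagnose_db_path db01 5432
```

If both network checks pass, the fault most likely sits in the database process rather than the network, which matches step 5 of the workflow, where the database turns out to be out of memory.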
J.63SAM00.023.1

Repairing System Faults

Detailed Explanation: Once the root cause is isolated, the administrator must repair it securely. Repairs range from restarting a hung service to rolling back a bad configuration file or resizing a full disk partition. A critical rule of repair is to ensure the fix is permanent, not just a temporary patch.

Code Snippet: Diagnosing and Repairing a Failed Service

If a service crashes (e.g., Apache), the admin uses systemctl to identify why it failed, fixes the underlying issue (e.g., a syntax error in the config file), and then restarts the service.

# 1. Check the status to see why it failed
sudo systemctl status apache2

# 2. Check the configuration file for syntax errors before restarting
sudo apache2ctl configtest

# Output might say: "Syntax error on line 45 of /etc/apache2/apache2.conf"
# Admin fixes the typo in the file via 'nano' or 'vim'

# 3. Restart the service to apply the repair
sudo systemctl restart apache2

# 4. Verify the service is active and running again
sudo systemctl is-active apache2
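
Rolling back a bad configuration file, mentioned in the explanation above, is only possible if a copy was saved before the edit. A minimal sketch follows, assuming the functions run with enough privileges to read and write the file. The names backup_config and rollback_config are illustrative.

```shell
#!/bin/bash
# Copy a config file to a timestamped backup beside it and print the backup path.
backup_config() {
    local file="$1"
    local backup="${file}.bak.$(date +%Y%m%d%H%M%S)"
    # -p preserves the mode, ownership, and timestamps of the original
    cp -p "$file" "$backup" && echo "$backup"
}

# Restore a config file from a backup created by backup_config.
rollback_config() {
    local file="$1" backup="$2"
    cp -p "$backup" "$file"
}

# Example sequence (run as root):
#   backup=$(backup_config /etc/apache2/apache2.conf)
#   ... edit the file ...
#   apache2ctl configtest || rollback_config /etc/apache2/apache2.conf "$backup"
```

Teams that keep configuration in version control or a configuration management tool get the same rollback ability with a full change history, which is generally preferable to ad hoc copies.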