This phase is the core of day-to-day operations. System Administrators must proactively monitor the environment for availability, performance bottlenecks, and security threats. When things inevitably break, they must rapidly investigate the root cause and execute repairs to restore normal operations.
Detailed Explanation: Availability monitoring (often called "Uptime Monitoring") answers a simple question: "Is the server online and responding to requests?" This is usually the first alert an administrator receives when an outage occurs. It relies on external checks like ICMP (Ping) or HTTP status codes.
While enterprise teams use tools like Prometheus or Nagios, understanding the core concept is vital. This script checks if a website returns a "200 OK" status; if not, it triggers an alert.
#!/bin/bash
# availability_check.sh
TARGET_URL="https://www.company.com"
ADMIN_EMAIL="admin@company.com"
# Fetch only the HTTP status code
STATUS_CODE=$(curl -o /dev/null -s -w "%{http_code}" --max-time 10 "$TARGET_URL")
if [ "$STATUS_CODE" -ne 200 ]; then
echo "ALERT: $TARGET_URL is DOWN! Status Code: $STATUS_CODE" | mail -s "CRITICAL: Website Down" "$ADMIN_EMAIL"
# Alternatively, trigger a webhook to Slack/Microsoft Teams here
else
echo "SUCCESS: $TARGET_URL is UP (Status: $STATUS_CODE)"
fi
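A single run of this script only reports the status at that moment; to turn it into real monitoring, it must run on a schedule. A minimal sketch as a crontab entry (the install path, five-minute interval, and log file are illustrative assumptions):

```shell
# crontab -e: run the availability check every 5 minutes,
# appending both stdout and stderr to a log file
*/5 * * * * /usr/local/bin/availability_check.sh >> /var/log/availability.log 2>&1
```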
Detailed Explanation: A server can be "online" but performing so poorly that it's practically useless to users. Performance monitoring involves tracking internal metrics: CPU load, RAM utilization, Disk I/O, and Network throughput over time to identify trends and bottlenecks.
Administrators can use top, htop, or iostat to view performance data interactively, directly on the server console.
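Interactive tools only show the current moment, however; identifying trends requires recording metrics over time. A minimal sketch that prints one timestamped snapshot of load average and memory usage per run (the cron schedule and log path in the comment are illustrative assumptions):

```shell
#!/bin/sh
# perf_snapshot.sh - print one timestamped line of performance metrics.
# Schedule via cron and redirect to a file to build a history, e.g.:
#   */5 * * * * /usr/local/bin/perf_snapshot.sh >> /var/log/perf_history.log

# 1-minute load average: first field of /proc/loadavg
LOAD=$(cut -d ' ' -f1 /proc/loadavg)

# Used/total memory in MiB: fields 3 and 2 of the 'Mem:' line from free
MEM=$(free -m | awk '/^Mem:/ {print $3 "/" $2}')

echo "$(date '+%Y-%m-%d %H:%M:%S') load=$LOAD mem_used/total_MiB=$MEM"
```

Each run appends one line, so a week of history can be graphed or grepped to correlate slowdowns with specific times of day.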
Detailed Explanation: Security monitoring is the active observation of logs to detect unauthorized access, privilege escalation, or malware. It heavily relies on reviewing system authentication logs and application error logs. Centralized logging (sending all logs to a SIEM like Splunk or Elastic Security) is an enterprise standard.
A system administrator can use the systemd journal to continuously watch for failed SSH login attempts in real-time.
# Watch the authentication logs in real-time (-f)
# and filter for failed password attempts
# (the unit is named 'ssh' on Debian/Ubuntu, 'sshd' on RHEL-family systems)
sudo journalctl -u ssh -f | grep "Failed password"

# Example output showing a brute-force attempt:
# May 15 10:22:11 server01 sshd[1234]: Failed password for root from 192.168.1.50 port 55432 ssh2
# May 15 10:22:14 server01 sshd[1236]: Failed password for invalid user admin from 192.168.1.50 port 55444 ssh2
Detailed Explanation: When a monitor triggers an alert, the administrator must perform Root Cause Analysis (RCA). This requires methodical troubleshooting rather than blind guessing. The OSI model (checking Layer 1 physical up to Layer 7 application) is a common methodology.
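The bottom-up approach can be sketched as a sequence of checks that stops at the first failing layer, which immediately narrows the root cause. A hedged sketch (the host and port in the usage comment are hypothetical):

```shell
#!/bin/sh
# layered_check.sh - bottom-up troubleshooting sketch for a web service.
# Usage: check_layers <host> <tcp-port>
check_layers() {
    host="$1"; port="$2"

    # Layers 1-3: basic reachability (one ICMP echo, 2-second timeout)
    if ! ping -c 1 -W 2 "$host" > /dev/null 2>&1; then
        echo "FAIL (network): $host does not answer ping"; return 1
    fi

    # Layer 4: is the TCP port accepting connections? (-z: connect only)
    if ! nc -z -w 2 "$host" "$port"; then
        echo "FAIL (transport): port $port closed on $host"; return 1
    fi

    # Layer 7: does the application return a sane HTTP status?
    status=$(curl -o /dev/null -s -w "%{http_code}" "http://$host:$port/")
    echo "OK up to application layer, HTTP status: $status"
}

# Example (hypothetical host): check_layers web01.example.com 80
```

If ping succeeds but the port check fails, the problem is likely a dead service or a firewall rule, not the network; each passing layer eliminates a whole class of causes.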
Detailed Explanation: Once the root cause is isolated, the administrator must repair it securely. Repairs can range from restarting a hung service to rolling back a bad configuration file or resizing a full disk partition. A critical rule of repair is to ensure the fix is permanent, not just a temporary patch.
If a service crashes (e.g., Apache), the admin uses systemctl to identify why it failed and then restart it after fixing the underlying issue (e.g., a syntax error in the config file).
# 1. Check the status to see why it failed
sudo systemctl status apache2

# 2. Check the configuration file for syntax errors before restarting
sudo apache2ctl configtest
# Output might say: "Syntax error on line 45 of /etc/apache2/apache2.conf"
# Admin fixes the typo in the file via 'nano' or 'vim'

# 3. Restart the service to apply the repair
sudo systemctl restart apache2

# 4. Verify the service is active and running again
sudo systemctl is-active apache2
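The "permanent, not temporary" rule is worth illustrating with a common case: a disk that fills up with logs. Deleting old logs only buys time; the durable repair is capping future growth. A hedged sketch using systemd-journald (the 500M limit is an illustrative assumption):

```shell
# Temporary patch: reclaim space now by vacuuming old journal entries
sudo journalctl --vacuum-size=500M

# Permanent repair: cap journal growth in /etc/systemd/journald.conf
#   [Journal]
#   SystemMaxUse=500M
# then apply the new limit:
sudo systemctl restart systemd-journald

# Verify the journal's current disk usage
journalctl --disk-usage
```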