Server Maintenance Procedures

Overview

Regular maintenance is essential for keeping our infrastructure secure, stable, and performant. This guide outlines routine maintenance tasks and procedures.

Maintenance Schedule

Daily Tasks

Automated (verify running):

  • Backups completed successfully
  • Monitoring checks all passing
  • No critical alerts

Manual check (5 minutes):

  • Check support tickets for issues
  • Review monitoring dashboard
  • Check for any failed services
  • Verify email queue is processing

Weekly Tasks (30-60 minutes)

Every Monday morning:

  • Review backup logs (all successful?)
  • Check disk space on all servers
  • Review resource usage trends
  • Check for security updates
  • Review error logs for patterns
  • Verify no SSL certificates expire within the next 30 days
  • Check for suspended/overdue accounts
  • Review bandwidth usage
  • Check spam blacklist status

Monthly Tasks (2-3 hours)

First weekend of each month:

  • Apply non-security updates (during maintenance window)
  • Test backup restoration (sample account)
  • Review and clean old backups
  • Audit user accounts (remove inactive)
  • Check for orphaned files/databases
  • Review server performance metrics
  • Update documentation (if anything changed)
  • Review and optimize databases
  • Check for unused resources
  • Audit firewall rules
  • Review monitoring alerts (any false positives?)
  • Team meeting: discuss any issues or improvements

Quarterly Tasks (Half day)

Every 3 months:

  • Full security audit
  • Test disaster recovery procedures
  • Review and update emergency contacts
  • Audit all passwords/credentials
  • Review capacity planning (need more resources?)
  • Update internal documentation
  • Review and update client contracts if needed
  • Check for software EOL (end of life) notices
  • Full backup integrity verification

Annual Tasks (Full day)

Once per year:

  • Complete infrastructure review
  • Evaluate new technologies/improvements
  • Review and renew licenses
  • Update disaster recovery plan
  • Review insurance and liability coverage
  • Security penetration testing (if budget allows)
  • Full documentation audit and update
  • Review and update all procedures
  • Team training on new features/systems

Routine Procedures

Backup Verification

Daily Check:

  1. Log in to the backup management dashboard
  2. Verify all scheduled backups completed
  3. Check backup sizes (significant changes?)
  4. Review any backup failures
  5. Investigate and resolve any issues
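
Step 3 (checking for significant size changes) can be partly scripted. The sketch below flags a backup whose size differs from the previous run by more than a chosen percentage. The function name backup_size_changed and the 20% default are illustrative assumptions, not part of the backup dashboard. Sizes are whole numbers in any consistent unit, for example the first column of du -sk.

```shell
# Hypothetical helper: exit 0 when the current backup size differs from the
# previous one by more than LIMIT percent (default 20, an assumed value).
backup_size_changed() {
    prev=$1
    curr=$2
    limit=${3:-20}

    # Without a usable baseline, report a change so a human takes a look.
    [ "$prev" -gt 0 ] || return 0

    if [ "$curr" -ge "$prev" ]; then
        delta=$((curr - prev))
    else
        delta=$((prev - curr))
    fi

    # Integer percentage change relative to the previous size.
    [ $((delta * 100 / prev)) -gt "$limit" ]
}
```

Example use: backup_size_changed 52000 31000 && echo "Backup size changed sharply, investigate"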

Monthly Restoration Test:

  1. Select random client account
  2. Restore to test environment
  3. Verify all files present
  4. Test database restoration
  5. Verify email restoration (if applicable)
  6. Document test results
  7. Delete test restoration
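
For step 3 (verifying all files are present), one possible check is to compare the file list of the live account against the restored copy. This is a minimal sketch: list_missing_files is a hypothetical name, and it compares paths only, not file contents, so it complements rather than replaces a manual spot check.

```shell
# Hypothetical helper: print files that exist under SRC but are absent
# under RESTORED. Paths are compared relative to each directory.
list_missing_files() {
    src=$1
    restored=$2
    tmp=$(mktemp -d) || return 1

    # Build sorted relative file lists for both trees.
    (cd "$src" && find . -type f | sort) > "$tmp/src"
    (cd "$restored" && find . -type f | sort) > "$tmp/restored"

    # comm -23 keeps lines that appear only in the first list.
    comm -23 "$tmp/src" "$tmp/restored"
    rm -rf "$tmp"
}
```

An empty result means every file path in the source also exists in the restored copy.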

Security Updates

Critical Security Updates (Within 24-48 hours):

  1. Review security bulletin
  2. Assess impact and urgency
  3. Schedule emergency maintenance if critical
  4. Notify clients if downtime required
  5. Take pre-update backup
  6. Apply update
  7. Test services
  8. Monitor for issues
  9. Document update

Regular Updates (Monthly maintenance window):

  1. Review available updates
  2. Note any that require reboot
  3. Schedule maintenance window
  4. Notify clients 48-72 hours in advance
  5. Take backups before starting
  6. Apply updates
  7. Reboot if required
  8. Verify all services running
  9. Monitor for 24 hours post-update
  10. Document updates applied
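
Reviewing available updates can start with a scripted check. The sketch below assumes a RHEL-family host where yum (or dnf behind the yum alias) is the package manager, which this guide implies through paths such as /var/log/secure but does not state. On those systems check-update exits with 100 when updates are available and 0 when there are none. The function name security_updates_pending is hypothetical.

```shell
# Hypothetical helper: exit 0 if security updates are pending, 1 if none,
# 2 if the package manager itself reported an error.
security_updates_pending() {
    status=0
    yum -q check-update --security >/dev/null 2>&1 || status=$?

    case $status in
        100) return 0 ;;   # updates available
        0)   return 1 ;;   # nothing to apply
        *)   echo "check-update failed with status $status" >&2
             return 2 ;;
    esac
}
```

Example use: security_updates_pending && echo "Security updates available, review the bulletin"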

Log Review

Weekly Log Analysis:

Check these logs:

DirectAdmin/Web Server:

# Apache errors
tail -100 /var/log/httpd/error_log | grep -i "critical\|error"

# PHP errors
tail -100 /var/log/php/error.log

# DirectAdmin errors
tail -100 /var/log/directadmin/error.log

Email Server (SmarterMail):

# Check for delivery issues
tail -100 /opt/SmarterMail/Logs/smtp.log | grep -i "error\|failed"

# Check spam filter
tail -100 /opt/SmarterMail/Logs/spool.log | grep -i "error"

System Logs:

# System messages
tail -100 /var/log/messages | grep -i "error\|fail\|critical"

# Authentication logs
tail -100 /var/log/secure | grep -i "failed"

What to look for:

  • Repeated errors (indicates systemic issue)
  • Failed login attempts (security concern)
  • Disk space warnings
  • Service crashes
  • Database errors
  • Permission issues
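
To make failed login attempts easier to judge, the entries can be grouped by source address. The following is a rough sketch that assumes the usual sshd "Failed password ... from IP" message format and handles IPv4 addresses only. The function name failed_logins_by_ip is hypothetical.

```shell
# Hypothetical helper: read auth log lines on stdin and print
# "COUNT IP" for each IPv4 source of failed password attempts,
# highest count first.
failed_logins_by_ip() {
    grep -i "failed password" \
        | grep -oE 'from [0-9]{1,3}(\.[0-9]{1,3}){3}' \
        | awk '{ count[$2]++ } END { for (ip in count) print count[ip], ip }' \
        | sort -rn
}
```

Example use: failed_logins_by_ip < /var/log/secure | head -10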

Disk Space Management

When disk usage > 85%:

  1. Identify what's using space:
du -sh /* | sort -h
du -sh /home/* | sort -h
  2. Common culprits:

    • Old backups
    • Log files
    • Client uploads/files
    • Database dumps
    • Email storage
  3. Clean up:

    • Rotate old logs: logrotate -f /etc/logrotate.conf
    • Delete old backups beyond retention
    • Contact clients with excessive usage
    • Clear stale temp files: find /tmp -xdev -type f -mtime +7 -delete (avoid rm -rf /tmp/*, which can delete sockets and session files that running services still need)
    • Empty trash folders
  4. Prevent recurrence:

    • Adjust backup retention if needed
    • Implement quotas
    • Set up alerts at 80%
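
The two thresholds in this section (alert at 80%, clean up above 85%) can be applied to df output with a short filter. This is a sketch only: check_disk_usage is a hypothetical name, and it assumes df -P style columns with mount points that contain no spaces.

```shell
# Hypothetical helper: read `df -P` output on stdin and report filesystems
# at or above the warning level (default 80) or above the critical level
# (default 85).
check_disk_usage() {
    awk -v warn="${1:-80}" -v crit="${2:-85}" '
        NR == 1 { next }                   # skip the header row
        {
            use = $5
            sub(/%/, "", use)              # "91%" becomes 91
            use += 0
            if (use > crit)        print "CRITICAL " $6 " " use "%"
            else if (use >= warn)  print "WARNING " $6 " " use "%"
        }'
}
```

Example use: df -P | check_disk_usage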

Database Optimization

Monthly MySQL/MariaDB Optimization:

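# Log in to MySQL
mysql -u root -p

# Show database sizes (run at the mysql> prompt)
SELECT
table_schema AS "Database",
ROUND(SUM(data_length + index_length) / 1024 / 1024, 2) AS "Size (MB)"
FROM information_schema.TABLES
GROUP BY table_schema;

# Exit the mysql client, then run the commands below from the shell

# Optimize tables
mysqlcheck -o --all-databases -u root -p

# Repair if needed (only for engines that support REPAIR TABLE, such as MyISAM; InnoDB does not)
mysqlcheck -r --all-databases -u root -p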

After optimization:

  • Monitor database performance
  • Check query execution times
  • Verify applications working correctly

Service Health Checks

Daily quick check:

# Check all services running
systemctl status httpd
systemctl status mysqld    # the unit may be named mariadb on MariaDB hosts
systemctl status directadmin
systemctl status sshd

# Or use:
systemctl list-units --state=failed
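
For a quicker pass/fail answer than reading each status page, the same services can be checked in a loop. A minimal sketch, using systemctl is-active --quiet; the function name report_inactive_services is hypothetical, and the service list should match what each host actually runs.

```shell
# Hypothetical helper: print every listed service that is not active and
# exit non-zero if at least one is down.
report_inactive_services() {
    failed=0
    for svc in "$@"; do
        if ! systemctl is-active --quiet "$svc"; then
            echo "NOT RUNNING: $svc"
            failed=1
        fi
    done
    return "$failed"
}
```

Example use: report_inactive_services httpd mysqld directadmin sshd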

Web Hosting:

  • Test website loading
  • Check DirectAdmin is accessible
  • Verify FTP connection
  • Test email sending/receiving

Email Service:

  • Log in to webmail
  • Send test email
  • Check mail queue
  • Verify spam filtering

Bot Hosting:

  • Log in to Pterodactyl
  • Check active servers
  • Verify resource allocation
  • Test server starts

SSL Certificate Monitoring

Check certificate expiry:

# Check SSL cert expiration
echo | openssl s_client -servername domain.com -connect domain.com:443 2>/dev/null | openssl x509 -noout -dates

Automated renewal check:

  • Verify Let's Encrypt auto-renewal working
  • Check renewal logs
  • Test manual renewal if issues

Certificates expiring in under 30 days:

  1. Verify auto-renewal configured
  2. Manually renew if needed
  3. Test certificate after renewal
  4. Update clients if manual action needed
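
The 30-day rule can be checked numerically instead of reading the dates by eye. The sketch below turns the certificate's notAfter date into a day count. Both function names are hypothetical, and days_until relies on the -d option of GNU date, so it assumes a Linux host.

```shell
# Hypothetical helper: print the notAfter date of the certificate that
# HOST serves on port 443.
cert_not_after() {
    echo | openssl s_client -servername "$1" -connect "$1:443" 2>/dev/null \
        | openssl x509 -noout -enddate | cut -d= -f2
}

# Hypothetical helper: whole days from NOW_EPOCH (default: current time)
# until an openssl-style date such as "Jan 31 00:00:00 2025 GMT".
# A negative result means the date has already passed.
days_until() {
    expiry_epoch=$(date -u -d "$1" +%s) || return 1
    now_epoch=${2:-$(date -u +%s)}
    echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Example:
#   days=$(days_until "$(cert_not_after domain.com)")
#   [ "$days" -lt 30 ] && echo "domain.com expires in $days days"
```

Looping this over the hosted domains gives a list for the weekly certificate check.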

Maintenance Window Procedures

Pre-Maintenance

48-72 hours before:

  1. Schedule window

    • Low-traffic time (e.g., 2-4 AM)
    • Outside business hours
    • Avoid weekends if possible (unless emergency)
  2. Client notification

Subject: Scheduled Maintenance - [Date/Time]

Dear Valued Clients,

We will be performing scheduled maintenance on [Date] from [Start Time] to [End Time] GMT.

Expected impact:
- [Service] may experience brief interruptions
- [Estimated downtime: X minutes]

We apologize for any inconvenience and appreciate your understanding.

If you have any concerns, please contact support.

Best regards,
Alfie Web Solutions Team
  3. Preparation
    • Review maintenance steps
    • Take full backups
    • Prepare rollback plan
    • Test in staging (if applicable)
    • Have emergency contacts ready
    • Clear schedule for monitoring
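
To keep client notices consistent, the notification template above can be filled in from a small function. This is only a sketch; the function name maintenance_notice and its argument order (date, start time, end time, service, downtime in minutes) are assumptions made for illustration.

```shell
# Hypothetical helper: print the scheduled-maintenance notice.
# Arguments: DATE START END SERVICE DOWNTIME_MINUTES
maintenance_notice() {
    cat <<EOF
Subject: Scheduled Maintenance - $1 $2

Dear Valued Clients,

We will be performing scheduled maintenance on $1 from $2 to $3 GMT.

Expected impact:
- $4 may experience brief interruptions
- Estimated downtime: $5 minutes

We apologize for any inconvenience and appreciate your understanding.

If you have any concerns, please contact support.

Best regards,
Alfie Web Solutions Team
EOF
}
```

Example use: maintenance_notice 2025-03-02 02:00 04:00 "Web hosting" 15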

During Maintenance

  1. Start maintenance

    • Post status update (if status page exists)
    • Begin work according to plan
    • Document steps taken
  2. Apply changes

    • Follow procedures carefully
    • Take notes of any issues
    • Test each change before proceeding
  3. Monitor

    • Watch for errors
    • Check service status
    • Verify connectivity
  4. Rollback if needed

    • If major issues, rollback immediately
    • Restore from pre-maintenance backup
    • Reschedule maintenance

Post-Maintenance

  1. Verification (immediate)

    • All services running
    • Test functionality
    • Check for errors
    • Verify client access
  2. Monitoring (24 hours)

    • Watch error logs
    • Monitor performance
    • Check for client reports
    • Respond to any issues quickly
  3. Communication

Subject: Maintenance Complete

Dear Clients,

Our scheduled maintenance has been completed successfully. All services are now operational.

Thank you for your patience.

If you experience any issues, please contact support.

Best regards,
Alfie Web Solutions Team
  4. Documentation
    • Update change log
    • Document any issues encountered
    • Note lessons learned
    • Update procedures if needed

Emergency Procedures

Service Outage

If service goes down unexpectedly:

  1. Immediate Response (Within 5 minutes)

    • Confirm outage (not local issue)
    • Check server status
    • Review recent changes
    • Start investigating cause
  2. Communication (Within 15 minutes)

    • Post status update
    • Notify clients via email
    • Set expectations on resolution time
  3. Resolution

    • Identify root cause
    • Implement fix
    • Test thoroughly
    • Monitor closely
  4. Post-Incident

    • Post-mortem analysis
    • Document incident
    • Identify prevention measures
    • Update procedures
    • Client apology/compensation if warranted
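
Confirming the outage is faster with a scripted HTTP check, ideally run from a machine outside the affected network so a local issue is ruled out. A minimal sketch follows; site_is_up is a hypothetical name, and the 10-second timeout is an assumed value.

```shell
# Hypothetical helper: exit 0 if URL answers with a 2xx or 3xx status
# within 10 seconds, non-zero otherwise.
site_is_up() {
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1") || return 1
    case $code in
        2??|3??) return 0 ;;
        *)       return 1 ;;
    esac
}
```

Example use: site_is_up https://domain.com || echo "domain.com is not responding"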

Server Compromise

If security breach suspected:

  1. Immediate Actions

    • Isolate affected server (disconnect if necessary)
    • Preserve logs and evidence
    • Change all passwords
    • Notify Tom/Alfie immediately
  2. Investigation

    • Identify breach method
    • Assess scope of compromise
    • Check for backdoors
    • Review access logs
  3. Remediation

    • Remove malware/backdoors
    • Patch vulnerabilities
    • Restore from clean backup if needed
    • Strengthen security
  4. Client Notification

    • Inform affected clients
    • Provide guidance
    • Offer assistance
    • Document incident
  5. Prevention

    • Security audit
    • Implement additional protections
    • Update policies
    • Team training

Performance Optimization

Web Server Optimization

If websites loading slowly:

  1. Identify bottleneck

    • CPU usage high?
    • RAM exhausted?
    • Disk I/O slow?
    • Network latency?
  2. Common optimizations

    • Enable caching (OPcache for PHP)
    • Optimize Apache/Nginx config
    • Review .htaccess rules
    • Optimize databases
    • Implement CDN if needed
  3. Client-side

    • Image optimization
    • Minify CSS/JS
    • Enable compression
    • Review plugins/themes (if WordPress)
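
For the first bottleneck question (CPU usage), a common rule of thumb is to compare the 1-minute load average with the number of CPU cores. The sketch below applies that heuristic. load_is_high is a hypothetical name, and a load above the core count is a hint to investigate, not proof of a CPU bottleneck, since Linux load also counts processes waiting on disk I/O.

```shell
# Hypothetical helper: exit 0 when the 1-minute load average exceeds the
# number of CPU cores. Both values can be passed in; otherwise they are
# read from /proc/loadavg and nproc.
load_is_high() {
    load=${1:-$(cut -d ' ' -f 1 /proc/loadavg)}
    cores=${2:-$(nproc)}
    awk -v l="$load" -v c="$cores" 'BEGIN { exit !(l + 0 > c + 0) }'
}
```

Example use: load_is_high && echo "Load exceeds core count, check top and iostat"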

Database Performance

If database slow:

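# Check for long-running queries (run from the shell)
mysqladmin -u root -p processlist

# Enable slow query log (run at the mysql> prompt; logs queries slower than 2 seconds)
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 2;

# Review slow queries (run from the shell; adjust the path to match slow_query_log_file)
pt-query-digest /var/log/mysql/slow.log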

Optimization steps:

  • Add indexes to frequently queried columns
  • Optimize table structure
  • Increase MySQL buffer pool
  • Tune MySQL configuration

Email Performance

If email slow/delayed:

  1. Check mail queue:
# In SmarterMail admin panel
# Navigate to: Manage → Spool → View Active Messages
# Or check via command line:
find /opt/SmarterMail/Spool -type f | wc -l
  2. Common issues:

    • Queue backlog (processing)
    • Greylist delays (normal)
    • External spam filtering
    • Slow recipient servers
  3. Actions:

    • Flush queue if stuck
    • Adjust rate limits if needed
    • Check for blacklisting
    • Monitor delivery times
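
The spool file count from step 1 can be turned into a simple backlog check. This is a sketch: spool_backlog is a hypothetical name, and the default limit of 500 files is an assumed value that should be tuned to normal traffic on the mail server.

```shell
# Hypothetical helper: print the number of files under DIR and exit
# non-zero when it exceeds LIMIT (default 500, an assumed value).
spool_backlog() {
    dir=$1
    limit=${2:-500}

    # tr strips any padding that some wc implementations add.
    count=$(find "$dir" -type f | wc -l | tr -d ' ')
    echo "$count"
    [ "$count" -le "$limit" ]
}
```

Example use: spool_backlog /opt/SmarterMail/Spool >/dev/null || echo "Mail queue backlog, check the spool"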

Tools & Resources

Monitoring Tools

  • Uptime monitoring: [Tool name]
  • Resource monitoring: top, htop, netdata
  • Log analysis: grep, awk, logwatch
  • Network: ping, traceroute, mtr
  • SSL: openssl, SSL Labs

Useful Commands Reference

System Status:

# Load average
uptime

# Memory
free -h

# Disk
df -h

# Network
netstat -tuln
ss -tuln

# Processes (top CPU consumers first)
ps aux --sort=-%cpu | head -20

Service Management:

# Status
systemctl status [service]

# Start/Stop/Restart
systemctl start [service]
systemctl stop [service]
systemctl restart [service]

# Enable/Disable auto-start
systemctl enable [service]
systemctl disable [service]

DirectAdmin:

# User list
cd /usr/local/directadmin/scripts
./listusers.sh

# Suspended users
./listsuspended.sh

# Disk usage
./quota.sh

Best Practices

  1. Always backup first - Before any changes
  2. Test in staging - When possible
  3. Document everything - Changes, issues, solutions
  4. Schedule wisely - Low-traffic times
  5. Communicate proactively - Keep clients informed
  6. Monitor closely - After changes
  7. Have rollback plan - Know how to undo
  8. Learn from incidents - Post-mortems
  9. Keep updated - Security patches promptly
  10. Stay organized - Follow procedures

Maintenance Checklist Template

Pre-Maintenance:

  • Scope of work defined
  • Client notification sent
  • Backups taken
  • Rollback plan ready
  • Team available
  • Emergency contacts ready

During Maintenance:

  • Status posted
  • Changes applied
  • Each step tested
  • Issues documented
  • Time tracking

Post-Maintenance:

  • Services verified
  • Client notification sent
  • 24-hour monitoring
  • Documentation updated
  • Lessons learned noted

Keep this document updated with any procedure changes.

Last updated: [Date]