Atlanta Unstable host

Incident Report for RedHelm Cloud

Postmortem

Summary/Problem Description
An internal host failure occurred within the ATL cluster, which hosts customer Virtual Machines (VMs). As a result, the host lost connectivity to customer networks. The failure affected the host's management connectivity and its carrier signal (physical link); the resident VMs kept running but were isolated from the network.

Immediate Remediation
Upon discovering the host was in a "fenced" state and unresponsive to management commands, the following actions were taken:

Host Reboot: A hard reboot was initiated to clear the unresponsive state and restore management access.

Vendor Escalation: Hardware vendor support was contacted immediately, and formal cases were opened for deep-log analysis.

Resolution
Once the host was rebooted, network connectivity was successfully restored. To ensure service stability, all resident VMs were migrated to healthy hosts within the cluster. Following the migration and verification of network paths, all services returned to normal operational status. The host has since remained stable in production with no further incidents.

Analysis Findings/Review
The investigation traced the outage to a link failure at Layer 1 (Physical) or Layer 2 (Data Link), as indicated by the "Failed Criteria 128" error.

The Culprit: The "Criteria 128" error serves as the definitive indicator that the host lost its carrier signal (physical link).

Fencing Mechanism: The host entered a "fenced" state, which is a protective High Availability (HA) measure. In this state, the physical server is isolated from the cluster and does not respond to the cluster heartbeat, while its VMs continue to run. Fencing is a critical fail-safe designed to prevent "split-brain" scenarios and data corruption, and it allows other cluster nodes to safely take over the VM workloads.
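The fencing behavior described above can be illustrated with a small toy model. This is a sketch only, not the HA implementation used in the ATL cluster: the `Host` class, the 30-second timeout, the host and VM names, and the "fewest VMs" placement rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    last_heartbeat: float          # seconds, on some shared clock (assumed)
    fenced: bool = False
    vms: list = field(default_factory=list)


def fence_stale_hosts(hosts, now, timeout=30.0):
    """Mark hosts as fenced when their last heartbeat is older than `timeout`.

    Returns the names of hosts fenced by this call.
    """
    newly_fenced = []
    for host in hosts:
        if not host.fenced and now - host.last_heartbeat > timeout:
            host.fenced = True
            newly_fenced.append(host.name)
    return newly_fenced


def fail_over(hosts):
    """Move VMs off fenced hosts onto the healthy host with the fewest VMs.

    Only fenced hosts give up their VMs, so a workload is never owned by
    two hosts at once (the "split-brain" case fencing is meant to prevent).
    """
    healthy = [h for h in hosts if not h.fenced]
    placements = {}
    if not healthy:
        return placements
    for host in hosts:
        if host.fenced:
            while host.vms:
                vm = host.vms.pop(0)
                target = min(healthy, key=lambda h: len(h.vms))
                target.vms.append(vm)
                placements[vm] = target.name
    return placements


# Hypothetical three-host cluster; host-a has missed its heartbeats.
cluster = [
    Host("host-a", last_heartbeat=0.0, vms=["vm1", "vm2"]),
    Host("host-b", last_heartbeat=95.0),
    Host("host-c", last_heartbeat=98.0, vms=["vm3"]),
]
fenced_now = fence_stale_hosts(cluster, now=100.0)
placements = fail_over(cluster)
```

In this model a host is fenced first and only then loses its workloads, which mirrors the ordering described above: isolation comes before takeover.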

Status: The reboot successfully reset the host connectivity.

Recommendations
To prevent a recurrence of this driver-level link failure, the following actions are scheduled:

Firmware/Driver Alignment: A maintenance window will be scheduled to update the NIC firmware and drivers to the latest vendor-certified versions.

Proactive Monitoring: Enhanced logging has been enabled on the ATL cluster to capture any intermittent "Criteria 128" warnings before a total link loss occurs.
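As a rough illustration of this kind of monitoring, the sketch below counts "Criteria 128" messages per NIC and flags any NIC that reaches a threshold. The log layout, the `nic=` field, the sample lines, and the threshold of 3 are all hypothetical; the actual log format and alerting tooling on the ATL cluster are not described in this report.

```python
import re
from collections import Counter

# Assumed wording of the message; adjust to match the real log output.
CRITERIA_128 = re.compile(r"failed\s+criteria:?\s*128", re.IGNORECASE)
# Assumed key=value field naming the affected NIC.
NIC_FIELD = re.compile(r"\bnic=(\S+)")


def count_criteria_128(lines):
    """Count 'Failed Criteria 128' messages, keyed by NIC name.

    Lines without a nic= field are counted under "unknown".
    """
    counts = Counter()
    for line in lines:
        if CRITERIA_128.search(line):
            match = NIC_FIELD.search(line)
            counts[match.group(1) if match else "unknown"] += 1
    return counts


def nics_to_alert(counts, threshold=3):
    """Return the sorted NIC names whose count meets or exceeds the threshold."""
    return sorted(nic for nic, n in counts.items() if n >= threshold)


# Made-up sample lines for illustration only; not taken from the incident.
SAMPLE_LINES = [
    "08:00:01 nic=nic0 link warning: Failed Criteria 128",
    "08:00:05 nic=nic1 link up",
    "08:00:09 nic=nic0 link warning: failed criteria: 128",
    "08:00:12 nic=nic0 link warning: Failed Criteria 128",
    "08:00:20 link warning: Failed Criteria 128",
]

counts = count_criteria_128(SAMPLE_LINES)
alerts = nics_to_alert(counts)
```

Counting per NIC rather than per host is a design choice in this sketch: repeated warnings on a single interface are more likely to point at a failing link than the same number of warnings spread across several interfaces.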

Posted Jan 21, 2026 - 13:42 EST

Resolved

The environment has been stable. To ensure continued reliability, we have proactively sidelined the server involved until the root cause is determined. An RCA will be available after the investigation is complete.

Posted Jan 12, 2026 - 17:08 EST

Identified

This morning, a virtualization host in our Atlanta facility became unstable. This caused a temporary disruption for some systems, but we can confirm that all affected VMs have been successfully rebooted and are now fully responsive.

The environment is being monitored for any further issues. We are working directly with our vendors to determine the root cause and will provide updates if any further action is required.

Posted Jan 12, 2026 - 12:01 EST

This incident affected: ATL01 (Atlanta, GA) (Virtual Infrastructure).