Summary Problem Description
An internal host failure occurred within the ATL cluster, which is responsible for hosting customer Virtual Machines (VMs). The failure resulted in the host losing network connectivity to customer networks. Notably, this issue specifically impacted the host management and carrier signal while the resident VMs remained technically operational but isolated.
Immediate Remediation
Upon discovering the host was in a "fenced" state and unresponsive to management commands, the following actions were taken:
Host Reboot: A hard reboot was initiated to clear the unresponsive state and restore management access.
Vendor Escalation: Hardware vendor support was contacted immediately, and formal cases were opened for deep-log analysis.
Resolution
Once the host was rebooted, network connectivity was successfully restored. To ensure service stability, all resident VMs were migrated to healthy hosts within the cluster. Following the migration and verification of network paths, all services returned to normal operational status. The host has since remained stable in production with no further incidents.
Analysis Findings/Review
The investigation confirmed that the root cause was a Layer 1 (Physical) or Layer 2 (Data Link) failure, identified by the "Failed Criteria 128" error.
The Culprit: The "Criteria 128" error serves as the definitive indicator that the host lost its carrier signal (physical link).
Fencing Mechanism: The host entered a "fenced" state—a protective High Availability (HA) measure. In this state, a physical server becomes unresponsive to the cluster heartbeat while VMs continue to run. Fencing is a critical fail-safe designed to prevent "split-brain" scenarios and data corruption, eventually allowing other cluster nodes to safely take over the VM workloads.
Status: The reboot successfully reset the host connectivity.
Recommendations
To prevent a recurrence of this driver-level link failure, the following actions are scheduled:
Firmware/Driver Alignment: We will be scheduling a maintenance window to update the NIC firmware and drivers to the latest vendor-certified versions.
Proactive Monitoring: Enhanced logging has been enabled on the ATL cluster to capture any intermittent "Criteria 128" warnings before a total link loss occurs.