CHPC - Several nodes on notchpeak lost power – Incident details

Several nodes on notchpeak lost power

Resolved
Partial outage 5 %
Started 17 days ago
Lasted about 20 hours

Affected

General Environment (GE)

Partial outage from 1:52 AM to 10:03 PM

HPC clusters

Partial outage from 1:52 AM to 10:03 PM

Updates
  • Resolved

    Nodes are online. CHPC staff have drained a limited set of nodes (preventing new jobs from starting but not affecting currently running jobs) to rebalance power.

  • Monitoring

    CHPC staff on-site at the data center brought most systems back online. Staff will rebalance affected systems among power distribution units to prevent similar issues in the future. One notchpeak node remains offline while staff work on power distribution.

  • Investigating

    Several nodes on the notchpeak cluster lost power on the afternoon of March 5. This incident affected notch366, notch452, notch472, notch473, notch474, notch475, notch476, notch477, notch478, notch479, notch480, notch481, notch482, notch483, notch484, notch485, notch486, notch487, notch488, notch489, notch490, notch491, notch492, notch493, notch494, notch495, notch496, notch497, notch498, notch499, notch500, notch501, notchpeak32, notchpeak33, and notchpeak34.
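    For readers who want to check a job's node list against the affected nodes, the list above can be collapsed into a compact bracketed form. The following is a minimal sketch, not CHPC tooling: the function name `compress_hostnames` is hypothetical, and the bracketed output only resembles the hostlist notation some schedulers use. It assumes every name is a letter prefix followed by a number with no leading zeros.

```python
import re
from itertools import groupby


def compress_hostnames(names):
    """Collapse names such as notch472..notch501 into 'notch[472-501]'.

    Hypothetical helper for illustration only. Assumes each name is a
    letter prefix followed by digits; leading zeros are not preserved.
    """
    by_prefix = {}
    for name in names:
        match = re.fullmatch(r"([A-Za-z]+)(\d+)", name)
        if match is None:
            raise ValueError(f"unexpected host name: {name!r}")
        by_prefix.setdefault(match.group(1), set()).add(int(match.group(2)))

    parts = []
    for prefix, numbers in by_prefix.items():
        ordered = sorted(numbers)
        ranges = []
        # Consecutive integers share the same (value - index) key,
        # so groupby splits the sorted list into contiguous runs.
        for _, group in groupby(enumerate(ordered), key=lambda p: p[1] - p[0]):
            run = [value for _, value in group]
            ranges.append(str(run[0]) if len(run) == 1 else f"{run[0]}-{run[-1]}")
        parts.append(f"{prefix}[{','.join(ranges)}]")
    return ",".join(parts)


# The affected nodes exactly as listed in the incident update.
affected = (
    ["notch366", "notch452"]
    + [f"notch{n}" for n in range(472, 502)]
    + ["notchpeak32", "notchpeak33", "notchpeak34"]
)
summary = compress_hostnames(affected)
print(summary)
```

    The sketch groups names by prefix first, so the notch and notchpeak nodes are summarized separately.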