CHPC - Many systems on the granite cluster went offline (lost power) at approximately 4:45 p.m. – Incident details

Resolved
Major outage
Started 11 days ago
Lasted about 15 hours

Affected

General Environment (GE)

Major outage from 10:52 PM to 2:11 PM, Partial outage from 10:52 PM to 2:11 PM

HPC clusters

Major outage from 10:52 PM to 2:11 PM

Computational servers, independent of clusters

Partial outage from 10:52 PM to 2:11 PM

Updates
  • Resolved

    Systems on the granite cluster returned to service shortly after the outage yesterday afternoon. CHPC staff have since reconnected a switch so that it draws power from two separate power distribution units, to prevent similar incidents in the future.

  • Monitoring

    CHPC and DDC staff have identified the issue and restored power to the affected systems. Systems on the granite cluster should be back online or in the process of coming back online. CHPC staff will continue to monitor the cluster.

  • Investigating

    CHPC is investigating an issue affecting many of the systems on the granite cluster. Initial reports suggest a problem with power distribution to login nodes, networking infrastructure, and core services.