CHPC - Some jobs are failing to start, yielding errors in sbatch or srun – Incident details

Some jobs are failing to start, yielding errors in sbatch or srun

Resolved
Major outage
Started 20 days ago
Lasted about 7 hours

Affected

General Environment (GE)

Major outage from 7:35 PM to 2:18 AM, Partial outage from 10:25 PM to 2:18 AM

HPC clusters

Major outage from 7:35 PM to 2:18 AM

Open OnDemand

Major outage from 7:35 PM to 10:25 PM, Partial outage from 10:25 PM to 2:18 AM

Computational servers, independent of clusters

Major outage from 7:35 PM to 2:18 AM

Storage systems

Major outage from 7:35 PM to 10:25 PM, Partial outage from 10:25 PM to 2:18 AM

Updates
  • Resolved
    This incident has been resolved.
  • Update

    The issues we are seeing stem from an inability to write to the "sys" branches, where applications, SLURM information, and other data are written. We are still investigating the cause.

  • Investigating

    CHPC system administrators are aware of issues users have reported when starting jobs and are currently investigating.