<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>CHPC Status - Incident history</title>
    <link>https://uofu-chpc.instatus.com</link>
    <description>CHPC</description>
    <pubDate>Fri, 3 Apr 2026 14:20:00 +0000</pubDate>
    
<item>
  <title>Issues with logins to beehive (Windows server in General Environment)</title>
  <description>
    Type: Incident
    Duration: 2 hours and 30 minutes

    Affected Components: Windows servers
    Apr 3, 14:20:00 GMT+0 - Investigating - Users have reported issues logging in to the beehive server.
    Apr 3, 16:50:00 GMT+0 - Resolved - Users report that logins to beehive are working again. The cause of the issue was very high memory consumption that exhausted the system's memory and the space available for the pagefile (swap).
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 2 hours and 30 minutes</p>
    <p><strong>Affected Components:</strong> Windows servers</p>
    <p><small>Apr <var data-var='date'>3</var>, <var data-var='time'>14:20:00</var> GMT+0</small><br><strong>Investigating</strong> -
  Users have reported issues logging in to the beehive server.</p>
<p><small>Apr <var data-var='date'>3</var>, <var data-var='time'>16:50:00</var> GMT+0</small><br><strong>Resolved</strong> -
  Users report that logins to beehive are working again. The cause of the issue was very high memory consumption that exhausted the system's memory and the space available for the pagefile (swap).</p>
]]>
  </content:encoded>
  <pubDate>Fri, 3 Apr 2026 14:20:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/incident/cmnj0wb6s0eihbhe5sa5b6z9h</link>
  <guid>https://uofu-chpc.instatus.com/incident/cmnj0wb6s0eihbhe5sa5b6z9h</guid>
</item>

<item>
  <title>Slurm not responding in Protected Environment (Redwood)</title>
  <description>
    Type: Incident
    Duration: 54 minutes

    Affected Components: HPC clusters, Open OnDemand
    Mar 18, 13:49:48 GMT+0 - Identified - The CHPC has become aware that Slurm is not responding on Redwood, and we are working to fix this.
    Mar 18, 14:43:20 GMT+0 - Resolved - This incident has been resolved.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 54 minutes</p>
    <p><strong>Affected Components:</strong> HPC clusters, Open OnDemand</p>
    <p><small>Mar <var data-var='date'>18</var>, <var data-var='time'>13:49:48</var> GMT+0</small><br><strong>Identified</strong> -
  The CHPC has become aware that Slurm is not responding on Redwood, and we are working to fix this.</p>
<p><small>Mar <var data-var='date'>18</var>, <var data-var='time'>14:43:20</var> GMT+0</small><br><strong>Resolved</strong> -
  This incident has been resolved.</p>
]]>
  </content:encoded>
  <pubDate>Wed, 18 Mar 2026 13:49:48 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/incident/cmmw3lo3a09uk558gs6vyx63m</link>
  <guid>https://uofu-chpc.instatus.com/incident/cmmw3lo3a09uk558gs6vyx63m</guid>
</item>

<item>
  <title>Sites hosted on home.chpc.utah.edu are not reachable</title>
  <description>
    Type: Incident
    Duration: 1 hour and 14 minutes

    Affected Components: Group- and project-specific websites (hosted on VMs)
    Mar 16, 17:30:00 GMT+0 - Investigating - Websites hosted on home.chpc.utah.edu, including user- and group-specific sites, are not reachable. The CHPC is aware of this issue and investigating the cause.
    Mar 16, 18:43:59 GMT+0 - Resolved - The issue with home.chpc.utah.edu has been resolved and user- and group-specific pages are being served again. If you continue to encounter issues, please contact the CHPC at helpdesk@chpc.utah.edu.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 1 hour and 14 minutes</p>
    <p><strong>Affected Components:</strong> Group- and project-specific websites (hosted on VMs)</p>
    <p><small>Mar <var data-var='date'>16</var>, <var data-var='time'>17:30:00</var> GMT+0</small><br><strong>Investigating</strong> -
  Websites hosted on home.chpc.utah.edu, including user- and group-specific sites, are not reachable. The CHPC is aware of this issue and investigating the cause.</p>
<p><small>Mar <var data-var='date'>16</var>, <var data-var='time'>18:43:59</var> GMT+0</small><br><strong>Resolved</strong> -
  The issue with <a href="http://home.chpc.utah.edu">home.chpc.utah.edu</a> has been resolved and user- and group-specific pages are being served again. If you continue to encounter issues, please contact the CHPC at helpdesk@chpc.utah.edu.</p>
]]>
  </content:encoded>
  <pubDate>Mon, 16 Mar 2026 17:30:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/incident/cmmtile5g002h37xbiznci6yf</link>
  <guid>https://uofu-chpc.instatus.com/incident/cmmtile5g002h37xbiznci6yf</guid>
</item>

<item>
  <title>Filesystem issue requiring journal replay; some General Environment group spaces inaccessible</title>
  <description>
    Type: Incident
    Duration: 8 minutes

    Affected Components: Storage systems
    Mar 13, 16:44:00 GMT+0 - Investigating - Excessive usage on a General Environment filesystem required it to be taken offline temporarily.
    Mar 13, 16:52:00 GMT+0 - Resolved - CHPC staff have brought the filesystem back online.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 8 minutes</p>
    <p><strong>Affected Components:</strong> Storage systems</p>
    <p><small>Mar <var data-var='date'>13</var>, <var data-var='time'>16:44:00</var> GMT+0</small><br><strong>Investigating</strong> -
  Excessive usage on a General Environment filesystem required it to be taken offline temporarily.</p>
<p><small>Mar <var data-var='date'>13</var>, <var data-var='time'>16:52:00</var> GMT+0</small><br><strong>Resolved</strong> -
  CHPC staff have brought the filesystem back online.</p>
]]>
  </content:encoded>
  <pubDate>Fri, 13 Mar 2026 16:44:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/incident/cmmp507ng0017wfhgfqm305ke</link>
  <guid>https://uofu-chpc.instatus.com/incident/cmmp507ng0017wfhgfqm305ke</guid>
</item>

<item>
  <title>Many systems on the granite cluster went offline (lost power) at approximately 4:45 p.m.</title>
  <description>
    Type: Incident
    Duration: 15 hours and 18 minutes

    Affected Components: HPC clusters, Computational servers, independent of clusters, General Environment (GE)
    Mar 12, 14:11:16 GMT+0 - Resolved - Systems on granite returned to service shortly after the outage yesterday afternoon. CHPC staff have moved a switch to two separate power distribution units to prevent similar incidents in the future.
    Mar 11, 22:52:48 GMT+0 - Investigating - The CHPC is investigating an issue with many of the systems on the granite cluster. Initial reports suggest there may be an issue with power distribution to login nodes, networking infrastructure, and core services.
    Mar 11, 23:00:17 GMT+0 - Monitoring - CHPC and DDC staff have identified the issue and restored power to systems. Systems on the granite cluster should be back online or in the process of coming back online. CHPC staff will continue to monitor the cluster.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 15 hours and 18 minutes</p>
    <p><strong>Affected Components:</strong> HPC clusters, Computational servers, independent of clusters, General Environment (GE)</p>
    <p><small>Mar <var data-var='date'>12</var>, <var data-var='time'>14:11:16</var> GMT+0</small><br><strong>Resolved</strong> -
  Systems on granite returned to service shortly after the outage yesterday afternoon. CHPC staff have moved a switch to two separate power distribution units to prevent similar incidents in the future.</p>
<p><small>Mar <var data-var='date'>11</var>, <var data-var='time'>22:52:48</var> GMT+0</small><br><strong>Investigating</strong> -
  The CHPC is investigating an issue with many of the systems on the granite cluster. Initial reports suggest there may be an issue with power distribution to login nodes, networking infrastructure, and core services.</p>
<p><small>Mar <var data-var='date'>11</var>, <var data-var='time'>23:00:17</var> GMT+0</small><br><strong>Monitoring</strong> -
  CHPC and DDC staff have identified the issue and restored power to systems. Systems on the granite cluster should be back online or in the process of coming back online. CHPC staff will continue to monitor the cluster.</p>
]]>
  </content:encoded>
  <pubDate>Wed, 11 Mar 2026 22:52:48 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/incident/cmmmmx0gh0049mr5c56koxxlp</link>
  <guid>https://uofu-chpc.instatus.com/incident/cmmmmx0gh0049mr5c56koxxlp</guid>
</item>

<item>
  <title>Several nodes on notchpeak lost power</title>
  <description>
    Type: Incident
    Duration: 20 hours and 12 minutes

    Affected Components: HPC clusters
    Mar 6, 22:03:39 GMT+0 - Resolved - Nodes are online. CHPC staff have drained a limited set of nodes (preventing new jobs from starting but not affecting currently running jobs) to rebalance power.
    Mar 6, 03:20:00 GMT+0 - Monitoring - CHPC staff on-site at the data center brought most systems back online. Staff will rebalance affected systems among power distribution units to prevent similar issues in the future. One notchpeak node remains offline while staff work on power distribution.
    Mar 6, 01:52:00 GMT+0 - Investigating - Several nodes on the notchpeak cluster lost power on the afternoon of March 5. This incident affected notch366, notch452, notch472, notch473, notch474, notch475, notch476, notch477, notch478, notch479, notch480, notch481, notch482, notch483, notch484, notch485, notch486, notch487, notch488, notch489, notch490, notch491, notch492, notch493, notch494, notch495, notch496, notch497, notch498, notch499, notch500, notch501, notchpeak32, notchpeak33, and notchpeak34.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 20 hours and 12 minutes</p>
    <p><strong>Affected Components:</strong> HPC clusters</p>
    <p><small>Mar <var data-var='date'>6</var>, <var data-var='time'>22:03:39</var> GMT+0</small><br><strong>Resolved</strong> -
  Nodes are online. CHPC staff have drained a limited set of nodes (preventing new jobs from starting but not affecting currently running jobs) to rebalance power.</p>
<p><small>Mar <var data-var='date'>6</var>, <var data-var='time'>03:20:00</var> GMT+0</small><br><strong>Monitoring</strong> -
  CHPC staff on-site at the data center brought most systems back online. Staff will rebalance affected systems among power distribution units to prevent similar issues in the future. One notchpeak node remains offline while staff work on power distribution.</p>
<p><small>Mar <var data-var='date'>6</var>, <var data-var='time'>01:52:00</var> GMT+0</small><br><strong>Investigating</strong> -
  Several nodes on the notchpeak cluster lost power on the afternoon of March 5. This incident affected notch366, notch452, notch472, notch473, notch474, notch475, notch476, notch477, notch478, notch479, notch480, notch481, notch482, notch483, notch484, notch485, notch486, notch487, notch488, notch489, notch490, notch491, notch492, notch493, notch494, notch495, notch496, notch497, notch498, notch499, notch500, notch501, notchpeak32, notchpeak33, and notchpeak34.</p>
]]>
  </content:encoded>
  <pubDate>Fri, 6 Mar 2026 01:52:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/incident/cmmf2f0cr00qq11ala35n3rmx</link>
  <guid>https://uofu-chpc.instatus.com/incident/cmmf2f0cr00qq11ala35n3rmx</guid>
</item>

<item>
  <title>Several notchpeak nodes lost power</title>
  <description>
    Type: Incident
    Duration: 12 minutes

    Affected Components: HPC clusters
    Mar 5, 20:20:00 GMT+0 - Investigating - CHPC staff received alerts that several notchpeak nodes lost power. The outage is related to power infrastructure serving the rack. Staff are on-site and investigating.
    Mar 5, 20:32:00 GMT+0 - Resolved - On-site staff restored affected power infrastructure to service. Affected nodes should return to service shortly.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 12 minutes</p>
    <p><strong>Affected Components:</strong> HPC clusters</p>
    <p><small>Mar <var data-var='date'>5</var>, <var data-var='time'>20:20:00</var> GMT+0</small><br><strong>Investigating</strong> -
  CHPC staff received alerts that several notchpeak nodes lost power. The outage is related to power infrastructure serving the rack. Staff are on-site and investigating.</p>
<p><small>Mar <var data-var='date'>5</var>, <var data-var='time'>20:32:00</var> GMT+0</small><br><strong>Resolved</strong> -
  On-site staff restored affected power infrastructure to service. Affected nodes should return to service shortly.</p>
]]>
  </content:encoded>
  <pubDate>Thu, 5 Mar 2026 20:20:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/incident/cmmdxzyq417vwu1ratu2vwz8z</link>
  <guid>https://uofu-chpc.instatus.com/incident/cmmdxzyq417vwu1ratu2vwz8z</guid>
</item>

<item>
  <title>Issues loading Open OnDemand in the General Environment</title>
  <description>
    Type: Incident
    Duration: 1 hour and 54 minutes

    Affected Components: Open OnDemand
    Mar 4, 14:10:00 GMT+0 - Investigating - CHPC staff are aware of an issue with access to Open OnDemand, ondemand.chpc.utah.edu, in the General Environment. We are investigating the cause of the issue. At this time, we believe the issue is related to a system that lost ethernet connectivity at 7:10 a.m. We will provide updates on this issue as we learn more.
    Mar 4, 16:04:27 GMT+0 - Resolved - CHPC staff have restored to service a storage system that lost ethernet connectivity. Open OnDemand is responsive again. Thank you for your patience. If you continue to encounter issues, please contact us at helpdesk@chpc.utah.edu.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 1 hour and 54 minutes</p>
    <p><strong>Affected Components:</strong> Open OnDemand</p>
    <p><small>Mar <var data-var='date'>4</var>, <var data-var='time'>14:10:00</var> GMT+0</small><br><strong>Investigating</strong> -
  CHPC staff are aware of an issue with access to Open OnDemand, <a href="http://ondemand.chpc.utah.edu">ondemand.chpc.utah.edu</a>, in the General Environment. We are investigating the cause of the issue. At this time, we believe the issue is related to a system that lost ethernet connectivity at 7:10 a.m. We will provide updates on this issue as we learn more.</p>
<p><small>Mar <var data-var='date'>4</var>, <var data-var='time'>16:04:27</var> GMT+0</small><br><strong>Resolved</strong> -
  CHPC staff have restored to service a storage system that lost ethernet connectivity. Open OnDemand is responsive again. Thank you for your patience. If you continue to encounter issues, please contact us at helpdesk@chpc.utah.edu.</p>
]]>
  </content:encoded>
  <pubDate>Wed, 4 Mar 2026 14:10:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/incident/cmmc7y0ht0r63u1rawn2cc7hc</link>
  <guid>https://uofu-chpc.instatus.com/incident/cmmc7y0ht0r63u1rawn2cc7hc</guid>
</item>

<item>
  <title>Issues with Protected Environment resources</title>
  <description>
    Type: Incident
    Duration: 6 hours and 41 minutes

    Affected Components: HPC clusters, Open OnDemand, Windows servers, Virtual machines (VMs), Computational servers, independent of clusters, Storage systems, Data Transfer Nodes (DTNs), Protected Environment (PE)
    Feb 27, 14:30:00 GMT+0 - Investigating - The CHPC is aware of issues with the Protected Environment, leading to degraded performance. Staff are working to identify and correct the issue. We will provide updates as we learn more.
    Feb 27, 17:22:51 GMT+0 - Investigating - CHPC staff are continuing to investigate the issue. Based on user reports, this incident's impact is being updated to an outage rather than degraded performance.
    Feb 27, 19:07:24 GMT+0 - Monitoring - Issues in the Protected Environment are attributable to packet loss. CHPC staff have stopped replication between the General Environment VAST and Protected Environment VAST, which significantly reduced the packet loss. Services appear to be responsive again. Logins and services in the PE should begin working. CHPC staff will continue to monitor the situation.
    Feb 27, 19:59:41 GMT+0 - Monitoring - While we have identified the cause of the problems, and most services in the PE are available, individual systems may still have lingering problems due to the previous loss of network connectivity to the file systems. If your particular service has issues, please contact helpdesk@chpc.utah.edu.
    Feb 27, 21:10:46 GMT+0 - Resolved - CHPC staff have determined that the issue with the Protected Environment has been resolved. Systems have remained accessible since the update earlier this afternoon. If you continue to encounter issues with a resource in the Protected Environment, please contact the CHPC at helpdesk@chpc.utah.edu. Thank you for your patience.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 6 hours and 41 minutes</p>
    <p><strong>Affected Components:</strong> HPC clusters, Open OnDemand, Windows servers, Virtual machines (VMs), Computational servers, independent of clusters, Storage systems, Data Transfer Nodes (DTNs), Protected Environment (PE)</p>
    <p><small>Feb <var data-var='date'>27</var>, <var data-var='time'>14:30:00</var> GMT+0</small><br><strong>Investigating</strong> -
  The CHPC is aware of issues with the Protected Environment, leading to degraded performance. Staff are working to identify and correct the issue. We will provide updates as we learn more.</p>
<p><small>Feb <var data-var='date'>27</var>, <var data-var='time'>17:22:51</var> GMT+0</small><br><strong>Investigating</strong> -
  CHPC staff are continuing to investigate the issue. Based on user reports, this incident's impact is being updated to an outage rather than degraded performance.</p>
<p><small>Feb <var data-var='date'>27</var>, <var data-var='time'>19:07:24</var> GMT+0</small><br><strong>Monitoring</strong> -
  Issues in the Protected Environment are attributable to packet loss. CHPC staff have stopped replication between the General Environment VAST and Protected Environment VAST, which significantly reduced the packet loss. Services appear to be responsive again. Logins and services in the PE should begin working. CHPC staff will continue to monitor the situation.</p>
<p><small>Feb <var data-var='date'>27</var>, <var data-var='time'>19:59:41</var> GMT+0</small><br><strong>Monitoring</strong> -
  While we have identified the cause of the problems, and most services in the PE are available, individual systems may still have lingering problems due to the previous loss of network connectivity to the file systems. If your particular service has issues, please contact helpdesk@chpc.utah.edu.</p>
<p><small>Feb <var data-var='date'>27</var>, <var data-var='time'>21:10:46</var> GMT+0</small><br><strong>Resolved</strong> -
  CHPC staff have determined that the issue with the Protected Environment has been resolved. Systems have remained accessible since the update earlier this afternoon. If you continue to encounter issues with a resource in the Protected Environment, please contact the CHPC at helpdesk@chpc.utah.edu. Thank you for your patience.</p>
]]>
  </content:encoded>
  <pubDate>Fri, 27 Feb 2026 14:30:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/incident/cmm50sszv00gq8aqit9d9g7wu</link>
  <guid>https://uofu-chpc.instatus.com/incident/cmm50sszv00gq8aqit9d9g7wu</guid>
</item>

<item>
  <title>Issues with web server, affecting access to some CHPC-managed websites</title>
  <description>
    Type: Incident
    Duration: 1 hour and 45 minutes

    Affected Components: Group- and project-specific websites (hosted on VMs)
    Feb 17, 14:15:00 GMT+0 - Investigating - We are currently investigating this incident.
    Feb 17, 16:00:00 GMT+0 - Resolved - CHPC system administrators restarted a service on the web server, which resolved the issue.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 1 hour and 45 minutes</p>
    <p><strong>Affected Components:</strong> Group- and project-specific websites (hosted on VMs)</p>
    <p><small>Feb <var data-var='date'>17</var>, <var data-var='time'>14:15:00</var> GMT+0</small><br><strong>Investigating</strong> -
  We are currently investigating this incident.</p>
<p><small>Feb <var data-var='date'>17</var>, <var data-var='time'>16:00:00</var> GMT+0</small><br><strong>Resolved</strong> -
  CHPC system administrators restarted a service on the web server, which resolved the issue.</p>
]]>
  </content:encoded>
  <pubDate>Tue, 17 Feb 2026 14:15:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/incident/cmlqso5nb04udiog973a54ova</link>
  <guid>https://uofu-chpc.instatus.com/incident/cmlqso5nb04udiog973a54ova</guid>
</item>

<item>
  <title>Issues with web server, affecting access to some CHPC-managed websites</title>
  <description>
    Type: Incident
    Duration: 5 hours and 30 minutes

    Affected Components: Group- and project-specific websites (hosted on VMs)
    Feb 16, 14:15:00 GMT+0 - Investigating - We are currently investigating this incident.
    Feb 16, 19:45:00 GMT+0 - Resolved - CHPC system administrators restarted a service on the web server, which resolved the issue.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 5 hours and 30 minutes</p>
    <p><strong>Affected Components:</strong> Group- and project-specific websites (hosted on VMs)</p>
    <p><small>Feb <var data-var='date'>16</var>, <var data-var='time'>14:15:00</var> GMT+0</small><br><strong>Investigating</strong> -
  We are currently investigating this incident.</p>
<p><small>Feb <var data-var='date'>16</var>, <var data-var='time'>19:45:00</var> GMT+0</small><br><strong>Resolved</strong> -
  CHPC system administrators restarted a service on the web server, which resolved the issue.</p>
]]>
  </content:encoded>
  <pubDate>Mon, 16 Feb 2026 14:15:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/incident/cmlqsml7308yllqgi6poj6yts</link>
  <guid>https://uofu-chpc.instatus.com/incident/cmlqsml7308yllqgi6poj6yts</guid>
</item>

<item>
  <title>Issues with web server, affecting access to some CHPC-managed websites</title>
  <description>
    Type: Incident
    Duration: 2 hours and 45 minutes

    Affected Components: Group- and project-specific websites (hosted on VMs)
    Feb 14, 14:15:00 GMT+0 - Investigating - We are currently investigating this incident.
    Feb 14, 17:00:00 GMT+0 - Resolved - CHPC system administrators restarted a service on the web server, which resolved the issue.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 2 hours and 45 minutes</p>
    <p><strong>Affected Components:</strong> Group- and project-specific websites (hosted on VMs)</p>
    <p><small>Feb <var data-var='date'>14</var>, <var data-var='time'>14:15:00</var> GMT+0</small><br><strong>Investigating</strong> -
  We are currently investigating this incident.</p>
<p><small>Feb <var data-var='date'>14</var>, <var data-var='time'>17:00:00</var> GMT+0</small><br><strong>Resolved</strong> -
  CHPC system administrators restarted a service on the web server, which resolved the issue.</p>
]]>
  </content:encoded>
  <pubDate>Sat, 14 Feb 2026 14:15:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/incident/cmlqsl7gh04h5k5o6syuxlp8u</link>
  <guid>https://uofu-chpc.instatus.com/incident/cmlqsl7gh04h5k5o6syuxlp8u</guid>
</item>

<item>
  <title>Significant delays when logging in to Protected Environment (identified approximately 2:30 p.m. on Friday, February 13)</title>
  <description>
    Type: Incident
    Duration: 54 minutes

    Affected Components: HPC clusters, Open OnDemand, Windows servers, Virtual machines (VMs), Computational servers, independent of clusters, Storage systems, Data Transfer Nodes (DTNs), Protected Environment (PE)
    Feb 13, 21:35:42 GMT+0 - Investigating - The CHPC is aware of significant delays encountered by users when logging in to systems in the Protected Environment, including SSH logins and Open OnDemand. We are investigating the issue.
    Feb 13, 22:30:00 GMT+0 - Resolved - CHPC network and system administrators have identified and fixed the issue. The problem is hardware-related; staff have implemented a workaround and will engage the hardware vendor.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 54 minutes</p>
    <p><strong>Affected Components:</strong> HPC clusters, Open OnDemand, Windows servers, Virtual machines (VMs), Computational servers, independent of clusters, Storage systems, Data Transfer Nodes (DTNs), Protected Environment (PE)</p>
    <p><small>Feb <var data-var='date'>13</var>, <var data-var='time'>21:35:42</var> GMT+0</small><br><strong>Investigating</strong> -
  The CHPC is aware of significant delays encountered by users when logging in to systems in the Protected Environment, including SSH logins and Open OnDemand. We are investigating the issue.</p>
<p><small>Feb <var data-var='date'>13</var>, <var data-var='time'>22:30:00</var> GMT+0</small><br><strong>Resolved</strong> -
  CHPC network and system administrators have identified and fixed the issue. The problem is hardware-related; staff have implemented a workaround and will engage the hardware vendor.</p>
]]>
  </content:encoded>
  <pubDate>Fri, 13 Feb 2026 21:35:42 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/incident/cmllepplq01tq13pc24f31tde</link>
  <guid>https://uofu-chpc.instatus.com/incident/cmllepplq01tq13pc24f31tde</guid>
</item>

<item>
  <title>Issues with database server, resulting in errors like &quot;Error establishing a database connection&quot; when connecting to some CHPC-hosted websites</title>
  <description>
    Type: Incident
    Duration: 53 minutes

    Affected Components: Group- and project-specific websites (hosted on VMs)
    Feb 6, 20:30:00 GMT+0 - Investigating - We are currently investigating this incident.
    Feb 6, 21:23:29 GMT+0 - Resolved - CHPC systems staff have restored the database server. Websites dependent on this system should now be accessible again.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 53 minutes</p>
    <p><strong>Affected Components:</strong> Group- and project-specific websites (hosted on VMs)</p>
    <p><small>Feb <var data-var='date'>6</var>, <var data-var='time'>20:30:00</var> GMT+0</small><br><strong>Investigating</strong> -
  We are currently investigating this incident.</p>
<p><small>Feb <var data-var='date'>6</var>, <var data-var='time'>21:23:29</var> GMT+0</small><br><strong>Resolved</strong> -
  CHPC systems staff have restored the database server. Websites dependent on this system should now be accessible again.</p>
]]>
  </content:encoded>
  <pubDate>Fri, 6 Feb 2026 20:30:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/incident/cmlbdl1eh00hhdh9ipbhqlcst</link>
  <guid>https://uofu-chpc.instatus.com/incident/cmlbdl1eh00hhdh9ipbhqlcst</guid>
</item>

<item>
  <title>Delays logging in and loading modules in the Protected Environment</title>
  <description>
    Type: Incident
    Duration: 6 days, 16 hours and 57 minutes

    Affected Components: HPC clusters, Open OnDemand, Windows servers, Virtual machines (VMs), Computational servers, independent of clusters, Storage systems, Data Transfer Nodes (DTNs)
    Feb 5, 23:52:14 GMT+0 - Investigating - Users are reporting delays logging in and loading modules in the Protected Environment. CHPC staff are aware of the issue and investigating.
    Feb 12, 16:48:59 GMT+0 - Resolved - We have monitored the state of the Protected Environment for several days without observing delays logging in and loading modules. If we observe similar issues in the future, we will provide an update in the form of another incident. Thank you for your patience.
    Feb 7, 04:55:47 GMT+0 - Monitoring - We have made some changes and are monitoring to see whether the slowdowns reappear.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 6 days, 16 hours and 57 minutes</p>
    <p><strong>Affected Components:</strong> HPC clusters, Open OnDemand, Windows servers, Virtual machines (VMs), Computational servers, independent of clusters, Storage systems, Data Transfer Nodes (DTNs)</p>
    <p><small>Feb <var data-var='date'>5</var>, <var data-var='time'>23:52:14</var> GMT+0</small><br><strong>Investigating</strong> -
  Users are reporting delays logging in and loading modules in the Protected Environment. CHPC staff are aware of the issue and investigating.</p>
<p><small>Feb <var data-var='date'>12</var>, <var data-var='time'>16:48:59</var> GMT+0</small><br><strong>Resolved</strong> -
  We have monitored the state of the Protected Environment for several days without observing delays logging in and loading modules. If we observe similar issues in the future, we will provide an update in the form of another incident. Thank you for your patience.</p>
<p><small>Feb <var data-var='date'>7</var>, <var data-var='time'>04:55:47</var> GMT+0</small><br><strong>Monitoring</strong> -
  We have made some changes and are monitoring to see whether the slowdowns reappear.</p>
]]>
  </content:encoded>
  <pubDate>Thu, 5 Feb 2026 23:52:14 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/incident/cmla42h9k01gimbag2cg5yu9l</link>
  <guid>https://uofu-chpc.instatus.com/incident/cmla42h9k01gimbag2cg5yu9l</guid>
</item>

<item>
  <title>Delays logging in and loading modules in the Protected Environment</title>
  <description>
    Type: Incident
    Duration: 2 hours

    Affected Components: HPC clusters, Open OnDemand, Windows servers, Virtual machines (VMs), Computational servers, independent of clusters, Storage systems, Data Transfer Nodes (DTNs)
    Feb 5, 20:00:00 GMT+0 - Investigating - Users have reported delays logging in and loading modules on systems in the Protected Environment.
    Feb 5, 22:00:00 GMT+0 - Resolved - This issue is now resolved. Delays in logging in and loading modules, which both require reading a significant number of files, were attributable to high pressure on a filesystem in the Protected Environment.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 2 hours</p>
    <p><strong>Affected Components:</strong> HPC clusters, Open OnDemand, Windows servers, Virtual machines (VMs), Computational servers, independent of clusters, Storage systems, Data Transfer Nodes (DTNs)</p>
    <p><small>Feb <var data-var='date'>5</var>, <var data-var='time'>20:00:00</var> GMT+0</small><br><strong>Investigating</strong> -
  Users have reported delays logging in and loading modules on systems in the Protected Environment.</p>
<p><small>Feb <var data-var='date'>5</var>, <var data-var='time'>22:00:00</var> GMT+0</small><br><strong>Resolved</strong> -
  This issue is now resolved. Delays in logging in and loading modules, which both require reading a significant number of files, were attributable to high pressure on a filesystem in the Protected Environment.</p>
]]>
  </content:encoded>
  <pubDate>Thu, 5 Feb 2026 20:00:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/incident/cmla2djti00yujt8ykp37k9jx</link>
  <guid>https://uofu-chpc.instatus.com/incident/cmla2djti00yujt8ykp37k9jx</guid>
</item>

<item>
  <title>Issues with Open OnDemand jobs not starting</title>
  <description>
    Type: Incident
    Duration: 11 hours and 38 minutes

    Affected Components: Open OnDemand
    Feb 3, 06:30:00 GMT+0 - Investigating - The CHPC is aware of issues with Open OnDemand (OOD) jobs remaining in a "Starting" state without providing an option to connect. Users have reported issues with VS Code Server, RStudio Server, and ParaView, and this issue likely affects other OOD applications. This was first observed late in the evening on February 2 and persists into the morning of February 3.
    Feb 3, 18:07:32 GMT+0 - Resolved - CHPC staff have resolved the issue with Open OnDemand jobs. The cause was many jobs with high input and output load. This affected the time required to load the environment; many jobs were taking a significant amount of time to start, and many jobs started with incomplete environments, causing immediate job failures.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 11 hours and 38 minutes</p>
    <p><strong>Affected Components:</strong> Open OnDemand</p>
    <p><small>Feb <var data-var='date'>3</var>, <var data-var='time'>06:30:00</var> GMT+0</small><br><strong>Investigating</strong> -
  The CHPC is aware of issues with Open OnDemand (OOD) jobs remaining in a "Starting" state without providing an option to connect. Users have reported issues with VS Code Server, RStudio Server, and ParaView, and this issue likely affects other OOD applications. This was first observed late in the evening on February 2 and persists into the morning of February 3.</p>
<p><small>Feb <var data-var='date'>3</var>, <var data-var='time'>18:07:32</var> GMT+0</small><br><strong>Resolved</strong> -
  CHPC staff have resolved the issue with Open OnDemand jobs. The cause was many jobs with high input and output load. This affected the time required to load the environment; many jobs were taking a significant amount of time to start, and many jobs started with incomplete environments, causing immediate job failures.</p>
]]>
  </content:encoded>
  <pubDate>Tue, 3 Feb 2026 06:30:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/incident/cml6usvbx06kpjr0ds45e8tmn</link>
  <guid>https://uofu-chpc.instatus.com/incident/cml6usvbx06kpjr0ds45e8tmn</guid>
</item>

<item>
  <title>Issue with access to Linux nodes in Citadel environment</title>
  <description>
    Type: Incident
    Duration: 4 hours and 50 minutes

    Affected Components: HPC clusters
    Feb 2, 15:00:26 GMT+0 - Investigating - Linux nodes in the Citadel environment, including the login node, are refusing connections. System administrators are aware of the issue and working on a resolution. The Windows host remains accessible.
    Feb 2, 19:50:00 GMT+0 - Resolved - The issue with logins to the Citadel environment has been resolved. The problem was caused by an update to Duo packages on Linux systems. CHPC staff have configured systems in Citadel to use a specific source for Duo updates to prevent this issue in the future.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 4 hours and 50 minutes</p>
    <p><strong>Affected Components:</strong> HPC clusters</p>
    <p><small>Feb <var data-var='date'>2</var>, <var data-var='time'>15:00:26</var> GMT+0</small><br><strong>Investigating</strong> -
  Linux nodes in the Citadel environment, including the login node, are refusing connections. System administrators are aware of the issue and working on a resolution. The Windows host remains accessible.</p>
<p><small>Feb <var data-var='date'>2</var>, <var data-var='time'>19:50:00</var> GMT+0</small><br><strong>Resolved</strong> -
  The issue with logins to the Citadel environment has been resolved. The problem was caused by an update to Duo packages on Linux systems. CHPC staff have configured systems in Citadel to use a specific source for Duo updates to prevent this issue in the future.</p>
]]>
  </content:encoded>
  <pubDate>Mon, 2 Feb 2026 15:00:26 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/incident/cml5dmxx70352l1wotsm3nc1i</link>
  <guid>https://uofu-chpc.instatus.com/incident/cml5dmxx70352l1wotsm3nc1i</guid>
</item>

<item>
  <title>Issue with Slurm data acquisition, affecting the output of &quot;mychpc batch&quot; and available options for job submission in Open OnDemand</title>
  <description>
    Type: Incident
    Duration: 10 minutes

    Affected Components: HPC clusters, Open OnDemand
    Jan 27, 15:54:06 GMT+0 - Investigating - The CHPC is aware of an issue with the API that provides Slurm information to the "mychpc batch" command and Open OnDemand job submission options. Developers are working to resolve the issue.
    Jan 27, 16:03:57 GMT+0 - Resolved - Developers have corrected the issue. The "mychpc batch" command and Open OnDemand job submission options are now working as expected.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 10 minutes</p>
    <p><strong>Affected Components:</strong> HPC clusters, Open OnDemand</p>
    <p><small>Jan <var data-var='date'>27</var>, <var data-var='time'>15:54:06</var> GMT+0</small><br><strong>Investigating</strong> -
  The CHPC is aware of an issue with the API that provides Slurm information to the "mychpc batch" command and Open OnDemand job submission options. Developers are working to resolve the issue.</p>
<p><small>Jan <var data-var='date'>27</var>, <var data-var='time'>16:03:57</var> GMT+0</small><br><strong>Resolved</strong> -
  Developers have corrected the issue. The "mychpc batch" command and Open OnDemand job submission options are now working as expected.</p>
]]>
  </content:encoded>
  <pubDate>Tue, 27 Jan 2026 15:54:06 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/incident/cmkws0xfl002q14ca11ia592m</link>
  <guid>https://uofu-chpc.instatus.com/incident/cmkws0xfl002q14ca11ia592m</guid>
</item>

<item>
  <title>Issue with file server, affecting file access</title>
  <description>
    Type: Incident
    Duration: 9 hours and 20 minutes

    Affected Components: Storage systems
    Jan 25, 05:30:00 GMT+0 - Investigating - A file server stopped responding at approximately 10:30 p.m. on Saturday, January 24. This affected access to some group spaces, including spaces used to serve some websites.
    Jan 25, 14:50:00 GMT+0 - Resolved - CHPC storage administrators restored the file server. Services are back online.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 9 hours and 20 minutes</p>
    <p><strong>Affected Components:</strong> Storage systems</p>
    <p><small>Jan <var data-var='date'>25</var>, <var data-var='time'>05:30:00</var> GMT+0</small><br><strong>Investigating</strong> -
  A file server stopped responding at approximately 10:30 p.m. on Saturday, January 24. This affected access to some group spaces, including spaces used to serve some websites.</p>
<p><small>Jan <var data-var='date'>25</var>, <var data-var='time'>14:50:00</var> GMT+0</small><br><strong>Resolved</strong> -
  CHPC storage administrators restored the file server. Services are back online.</p>
]]>
  </content:encoded>
  <pubDate>Sun, 25 Jan 2026 05:30:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/incident/cmkvghuug0hexz20r7aju8mck</link>
  <guid>https://uofu-chpc.instatus.com/incident/cmkvghuug0hexz20r7aju8mck</guid>
</item>

<item>
  <title>Scheduled maintenance on power infrastructure: General Environment clusters (granite, notchpeak, kingspeak, lonepeak) offline late January 21 to early January 22 (overnight)</title>
  <description>
    Type: Maintenance
    Duration: 15 hours and 7 minutes

    Affected Components: HPC clusters
    Jan 22, 03:00:01 GMT+0 - Identified - Maintenance is now in progress.
    Jan 22, 03:00:00 GMT+0 - Identified - Scheduled maintenance on power infrastructure at the Downtown Data Center will require all General Environment clusters (granite, notchpeak, kingspeak, and lonepeak) and standalone biochemistry and cryo-EM servers to be taken offline from the evening of January 21 to the morning of January 22.

Critical services, storage systems, virtual machines (VMs), and the redwood cluster in the Protected Environment are on generator power and should not be affected by this outage.
    Jan 22, 18:07:01 GMT+0 - Completed - The power outage is complete and systems will be made available to users shortly. Notably, the change to shared and exclusive partitions has been implemented in the General Environment (the change will follow in the Protected Environment later today). Please see https://chpc.utah.edu/newsletter_updates/read.php?year=2026&amp;month=01&amp;article=9d2b0 for more information about this change.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 15 hours and 7 minutes</p>
    <p><strong>Affected Components:</strong> HPC clusters</p>
    <p><small>Jan <var data-var='date'>22</var>, <var data-var='time'>03:00:01</var> GMT+0</small><br><strong>Identified</strong> -
  Maintenance is now in progress.</p>
<p><small>Jan <var data-var='date'>22</var>, <var data-var='time'>03:00:00</var> GMT+0</small><br><strong>Identified</strong> -
  Scheduled maintenance on power infrastructure at the Downtown Data Center will require all General Environment clusters (granite, notchpeak, kingspeak, and lonepeak) and standalone biochemistry and cryo-EM servers to be taken offline from the evening of January 21 to the morning of January 22.

Critical services, storage systems, virtual machines (VMs), and the redwood cluster in the Protected Environment are on generator power and should not be affected by this outage.</p>
<p><small>Jan <var data-var='date'>22</var>, <var data-var='time'>18:07:01</var> GMT+0</small><br><strong>Completed</strong> -
  The power outage is complete and systems will be made available to users shortly. Notably, the change to shared and exclusive partitions has been implemented in the General Environment (the change will follow in the Protected Environment later today). Please see <a href="https://chpc.utah.edu/newsletter_updates/read.php?year=2026&amp;month=01&amp;article=9d2b0">https://chpc.utah.edu/newsletter_updates/read.php?year=2026&amp;month=01&amp;article=9d2b0</a> for more information about this change.</p>
]]>
  </content:encoded>
  <pubDate>Thu, 22 Jan 2026 03:00:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/maintenance/cmk7by35e09um9vouwtqv7e3f</link>
  <guid>https://uofu-chpc.instatus.com/maintenance/cmk7by35e09um9vouwtqv7e3f</guid>
</item>

<item>
  <title>Issues with Protected Environment resources ongoing</title>
  <description>
    Type: Incident
    Duration: 4 days, 18 hours and 13 minutes

    Affected Components: HPC clusters
    Jan 16, 05:00:00 GMT+0 - Investigating - There are continuing issues with the Protected Environment (PE). The issues began on Thursday, January 15. System and network administrators are working to resolve issues and bring the PE back to a fully functional state. The issue appears to be related to a network interface card on a critical storage system.
    Jan 20, 22:40:06 GMT+0 - Investigating - System and network administrators believe they have identified an issue with a network interface card on storage infrastructure. They are in contact with technical support staff from the vendor.
    Jan 20, 23:13:25 GMT+0 - Resolved - Systems in the Protected Environment (PE) are available to users and issues should now be resolved.

The environment is operating with a single switch and three of four CNodes in its VAST storage system. (These should not affect the functionality of the PE, though they will need to be addressed at a later date.) All virtual machines are back online.

If you encounter any issues with systems in the PE, please let us know. We are grateful for your patience and support as we worked to identify and address issues.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 4 days, 18 hours and 13 minutes</p>
    <p><strong>Affected Components:</strong> HPC clusters</p>
    <p><small>Jan <var data-var='date'>16</var>, <var data-var='time'>05:00:00</var> GMT+0</small><br><strong>Investigating</strong> -
  There are continuing issues with the Protected Environment (PE). The issues began on Thursday, January 15. System and network administrators are working to resolve issues and bring the PE back to a fully functional state. The issue appears to be related to a network interface card on a critical storage system.</p>
<p><small>Jan <var data-var='date'>20</var>, <var data-var='time'>22:40:06</var> GMT+0</small><br><strong>Investigating</strong> -
  System and network administrators believe they have identified an issue with a network interface card on storage infrastructure. They are in contact with technical support staff from the vendor.</p>
<p><small>Jan <var data-var='date'>20</var>, <var data-var='time'>23:13:25</var> GMT+0</small><br><strong>Resolved</strong> -
  Systems in the Protected Environment (PE) are available to users and issues should now be resolved.

The environment is operating with a single switch and three of four CNodes in its VAST storage system. (These should not affect the functionality of the PE, though they will need to be addressed at a later date.) All virtual machines are back online.

If you encounter any issues with systems in the PE, please let us know. We are grateful for your patience and support as we worked to identify and address issues.</p>
]]>
  </content:encoded>
  <pubDate>Fri, 16 Jan 2026 05:00:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/incident/cmkn67bu200njlh6z4ypi9txn</link>
  <guid>https://uofu-chpc.instatus.com/incident/cmkn67bu200njlh6z4ypi9txn</guid>
</item>

<item>
  <title>Issues with Visual Studio Code remote connections to General Environment resources, with workaround</title>
  <description>
    Type: Incident
    Duration: 11 days and 40 minutes

    Affected Components: HPC clusters
    Jan 15, 17:00:00 GMT+0 - Identified - Several users have reported issues using Visual Studio Code to connect to resources in the General Environment. The CHPC is aware of this issue, which is related to the technology used for home directories. In the interim, if using VS Code with remote connection features is important for your workflows, we recommend implementing the workaround described in https://chpc.utah.edu/newsletter_updates/read.php?year=2026&amp;month=01&amp;article=154e0. We apologize for the inconvenience.
    Jan 26, 17:39:47 GMT+0 - Resolved - Visual Studio Code, in addition to other software that relies on rename operations, is now working following identification and resolution of an issue by the storage team.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 11 days and 40 minutes</p>
    <p><strong>Affected Components:</strong> HPC clusters</p>
    <p><small>Jan <var data-var='date'>15</var>, <var data-var='time'>17:00:00</var> GMT+0</small><br><strong>Identified</strong> -
  Several users have reported issues using Visual Studio Code to connect to resources in the General Environment. The CHPC is aware of this issue, which is related to the technology used for home directories. In the interim, if using VS Code with remote connection features is important for your workflows, we recommend implementing the workaround described in <a href="https://chpc.utah.edu/newsletter_updates/read.php?year=2026&amp;month=01&amp;article=154e0">https://chpc.utah.edu/newsletter_updates/read.php?year=2026&amp;month=01&amp;article=154e0</a>. We apologize for the inconvenience.</p>
<p><small>Jan <var data-var='date'>26</var>, <var data-var='time'>17:39:47</var> GMT+0</small><br><strong>Resolved</strong> -
  Visual Studio Code, in addition to other software that relies on rename operations, is now working following identification and resolution of an issue by the storage team.</p>
]]>
  </content:encoded>
  <pubDate>Thu, 15 Jan 2026 17:00:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/incident/cmko6eyi7003dibxmfnsktkzt</link>
  <guid>https://uofu-chpc.instatus.com/incident/cmko6eyi7003dibxmfnsktkzt</guid>
</item>

<item>
  <title>Protected Environment resources offline for troubleshooting of network issue (updated Friday, January 16)</title>
  <description>
    Type: Maintenance
    Duration: 5 days, 19 hours and 34 minutes

    Affected Components: HPC clusters
    Jan 15, 03:00:01 GMT+0 - Identified - Maintenance is now in progress.
    Jan 15, 03:00:00 GMT+0 - Identified - Scheduled maintenance on power infrastructure at the Downtown Data Center will require all General Environment clusters (granite, notchpeak, kingspeak, and lonepeak) and standalone biochemistry and cryo-EM servers to be taken offline from the evening of January 14 to the morning of January 15.

Critical services, storage systems, virtual machines (VMs), and the redwood cluster in the Protected Environment are on generator power and should not be affected by this outage.
    Jan 16, 01:34:55 GMT+0 - Identified - Protected Environment network issues persist. System and network administrators at the CHPC have engaged vendors of networking hardware and are on-site at the data center to address the problem as quickly as possible.
    Jan 16, 16:45:32 GMT+0 - Identified - CHPC system and network administrators are taking systems in the Protected Environment offline to expedite triage and troubleshooting of networking issues.
    Jan 16, 23:58:48 GMT+0 - Identified - Network administrators are continuing work to diagnose packet loss issues by bringing devices and interfaces online individually. Network issues persist.
    Jan 17, 06:43:05 GMT+0 - Identified - Network and system administrators at the CHPC have isolated and triaged network issues and brought most systems back online. The Protected Environment is now functional and accessible to users. Some components are in a degraded or diminished state as a result of reconfigurations to bring the PE back online quickly and will require further troubleshooting next week to return to a normal, fully operational state. We appreciate your patience with this issue.
    Jan 20, 22:33:59 GMT+0 - Completed - Moving to an incident. Issues are ongoing.
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 5 days, 19 hours and 34 minutes</p>
    <p><strong>Affected Components:</strong> HPC clusters</p>
    <p><small>Jan <var data-var='date'>15</var>, <var data-var='time'>03:00:01</var> GMT+0</small><br><strong>Identified</strong> -
  Maintenance is now in progress.</p>
<p><small>Jan <var data-var='date'>15</var>, <var data-var='time'>03:00:00</var> GMT+0</small><br><strong>Identified</strong> -
  Scheduled maintenance on power infrastructure at the Downtown Data Center will require all General Environment clusters (granite, notchpeak, kingspeak, and lonepeak) and standalone biochemistry and cryo-EM servers to be taken offline from the evening of January 14 to the morning of January 15.

Critical services, storage systems, virtual machines (VMs), and the redwood cluster in the Protected Environment are on generator power and should not be affected by this outage.</p>
<p><small>Jan <var data-var='date'>16</var>, <var data-var='time'>01:34:55</var> GMT+0</small><br><strong>Identified</strong> -
  Protected Environment network issues persist. System and network administrators at the CHPC have engaged vendors of networking hardware and are on-site at the data center to address the problem as quickly as possible.</p>
<p><small>Jan <var data-var='date'>16</var>, <var data-var='time'>16:45:32</var> GMT+0</small><br><strong>Identified</strong> -
  CHPC system and network administrators are taking systems in the Protected Environment offline to expedite triage and troubleshooting of networking issues.</p>
<p><small>Jan <var data-var='date'>16</var>, <var data-var='time'>23:58:48</var> GMT+0</small><br><strong>Identified</strong> -
  Network administrators are continuing work to diagnose packet loss issues by bringing devices and interfaces online individually. Network issues persist.</p>
<p><small>Jan <var data-var='date'>17</var>, <var data-var='time'>06:43:05</var> GMT+0</small><br><strong>Identified</strong> -
  Network and system administrators at the CHPC have isolated and triaged network issues and brought most systems back online. The Protected Environment is now functional and accessible to users. Some components are in a degraded or diminished state as a result of reconfigurations to bring the PE back online quickly and will require further troubleshooting next week to return to a normal, fully operational state. We appreciate your patience with this issue.</p>
<p><small>Jan <var data-var='date'>20</var>, <var data-var='time'>22:33:59</var> GMT+0</small><br><strong>Completed</strong> -
  Moving to an incident. Issues are ongoing.</p>
]]>
  </content:encoded>
  <pubDate>Thu, 15 Jan 2026 03:00:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/maintenance/cmk7bv7v709cfg8e8i3mhiwbt</link>
  <guid>https://uofu-chpc.instatus.com/maintenance/cmk7bv7v709cfg8e8i3mhiwbt</guid>
</item>

<item>
  <title>Issue with NFS root server in Protected Environment, affecting availability of redwood cluster</title>
  <description>
    Type: Incident
    Duration: 51 minutes

    Affected Components: HPC clusters
    Jan 8, 18:40:00 GMT+0 - Investigating - At approximately 11:40 a.m., an NFS root server in the Protected Environment went offline. This unexpected outage impacts the redwood cluster. CHPC staff are aware of this issue and are working to resolve it as quickly as possible. Jan 8, 19:31:03 GMT+0 - Resolved - CHPC system administrators have returned the NFS root server to service. The redwood cluster is operational again. Thank you for your patience. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 51 minutes</p>
    <p><strong>Affected Components:</strong> HPC clusters</p>
    &lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 8&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;18:40:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  At approximately 11:40 a.m., an NFS root server in the Protected Environment went offline. This unexpected outage impacts the redwood cluster. CHPC staff are aware of this issue and are working to resolve it as quickly as possible.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Jan &lt;var data-var=&#039;date&#039;&gt; 8&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;19:31:03&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  CHPC system administrators have returned the NFS root server to service. The redwood cluster is operational again. Thank you for your patience.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Thu, 8 Jan 2026 18:40:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/incident/cmk5u0w6w00wca9a5rc3rkc61</link>
  <guid>https://uofu-chpc.instatus.com/incident/cmk5u0w6w00wca9a5rc3rkc61</guid>
</item>

<item>
  <title>Scheduled maintenance (downtime) affecting all CHPC resources</title>
  <description>
    Type: Maintenance
    Duration: 4 days and 6 hours

    Affected Components: Open OnDemand, Portal, Homepage and documentation, HPC clusters, Applications and infrastructure
    Dec 16, 14:30:01 GMT+0 - Identified - Maintenance is now in progress Dec 17, 16:14:51 GMT+0 - Identified - As of approximately 10:45 p.m. on December 16, storage systems, the CHPC website, and standalone servers (such as biochemistry cryo-EM servers) are functional. Network issues on the morning of December 17, however, have affected access to CHPC resources. CHPC staff are working to resolve issues.

Remaining services, including virtual machines (VMs) and HPC clusters, are still under maintenance. We anticipate having systems functional today (December 17). Dec 17, 16:43:23 GMT+0 - Identified - Networking (DNS) issues have been resolved. The CHPC website and portal are now accessible. HPC systems and virtual machines (VMs) remain under maintenance. Dec 18, 00:42:19 GMT+0 - Identified - We have released the HPC clusters, and jobs should now start running. All services should now be functional, apart from the MySQL database server and associated websites. Work on this will continue, with an expected release sometime tomorrow (December 18). Thank you for your patience.

If you notice issues with anything other than MySQL and websites, please let us know at [helpdesk@chpc.utah.edu](mailto:helpdesk@chpc.utah.edu). Dec 19, 02:30:00 GMT+0 - Identified - The community MySQL server is still being migrated to a new server. We also continue to troubleshoot issues with the older VM farm, which still runs a few VMs; these should be available but may be less responsive. We will provide another update on our progress tomorrow. Thank you for your understanding. Dec 16, 14:30:00 GMT+0 - Identified - A downtime on December 16 and 17 will affect all CHPC resources, including clusters (redwood, granite, notchpeak, kingspeak, and lonepeak), standalone servers, virtual machines (VMs), and storage. This downtime is necessary to replace core network infrastructure. Reservations on the computing clusters have been put in place to ensure jobs will complete prior to this downtime. Systems will be brought up on December 17 as the network replacement and any related system administration work is completed. An announcement will be made as resources are made available. Dec 20, 20:30:00 GMT+0 - Completed - The scheduled maintenance is now complete. If you encounter any issues, please contact the CHPC at helpdesk@chpc.utah.edu. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 4 days and 6 hours</p>
    <p><strong>Affected Components:</strong> Open OnDemand, Portal, Homepage and documentation, HPC clusters, Applications and infrastructure</p>
    &lt;p&gt;&lt;small&gt;Dec &lt;var data-var=&#039;date&#039;&gt; 16&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;14:30:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Maintenance is now in progress.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Dec &lt;var data-var=&#039;date&#039;&gt; 17&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;16:14:51&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  As of approximately 10:45 p.m. on December 16, storage systems, the CHPC website, and standalone servers (such as biochemistry cryo-EM servers) are functional. Network issues on the morning of December 17, however, have affected access to CHPC resources. CHPC staff are working to resolve issues.

Remaining services, including virtual machines (VMs) and HPC clusters, are still under maintenance. We anticipate having systems functional today (December 17).&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Dec &lt;var data-var=&#039;date&#039;&gt; 17&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;16:43:23&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Networking (DNS) issues have been resolved. The CHPC website and portal are now accessible. HPC systems and virtual machines (VMs) remain under maintenance.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Dec &lt;var data-var=&#039;date&#039;&gt; 18&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;00:42:19&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  We have released the HPC clusters, and jobs should now start running. All services should now be functional, apart from the MySQL database server and associated websites. Work on this will continue, with an expected release sometime tomorrow (December 18). Thank you for your patience.

If you notice issues with anything other than MySQL and websites, please let us know at [helpdesk@chpc.utah.edu](mailto:helpdesk@chpc.utah.edu).&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Dec &lt;var data-var=&#039;date&#039;&gt; 19&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;02:30:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  The community MySQL server is still being migrated to a new server. We also continue to troubleshoot issues with the older VM farm, which still runs a few VMs; these should be available but may be less responsive. We will provide another update on our progress tomorrow. Thank you for your understanding.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Dec &lt;var data-var=&#039;date&#039;&gt; 16&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;14:30:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  A downtime on December 16 and 17 will affect all CHPC resources, including clusters (redwood, granite, notchpeak, kingspeak, and lonepeak), standalone servers, virtual machines (VMs), and storage. This downtime is necessary to replace core network infrastructure. Reservations on the computing clusters have been put in place to ensure jobs will complete prior to this downtime. Systems will be brought up on December 17 as the network replacement and any related system administration work is completed. An announcement will be made as resources are made available.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Dec &lt;var data-var=&#039;date&#039;&gt; 20&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;20:30:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  The scheduled maintenance is now complete. If you encounter any issues, please contact the CHPC at helpdesk@chpc.utah.edu.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Tue, 16 Dec 2025 14:30:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/maintenance/cmi3oi0iz001213v9bmfl1y5s</link>
  <guid>https://uofu-chpc.instatus.com/maintenance/cmi3oi0iz001213v9bmfl1y5s</guid>
</item>

<item>
  <title>Scheduled maintenance affecting University of Utah systems including websites, email, VPN logins, and some authentication services</title>
  <description>
    Type: Maintenance
    Duration: 4 hours

    Affected Components: Homepage and documentation
    Dec 14, 07:30:00 GMT+0 - Identified - Scheduled maintenance on University Information Technology systems may affect access to CHPC systems, including the CHPC website. Logins and VPN connections may also be affected. See &lt;https://uofu.status.io/&gt; for further details. Dec 14, 07:30:01 GMT+0 - Identified - Maintenance is now in progress Dec 14, 11:30:00 GMT+0 - Completed - Maintenance has completed successfully 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 4 hours</p>
    <p><strong>Affected Components:</strong> Homepage and documentation</p>
    &lt;p&gt;&lt;small&gt;Dec &lt;var data-var=&#039;date&#039;&gt; 14&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;07:30:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Scheduled maintenance on University Information Technology systems may affect access to CHPC systems, including the CHPC website. Logins and VPN connections may also be affected. See &lt;https://uofu.status.io/&gt; for further details.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Dec &lt;var data-var=&#039;date&#039;&gt; 14&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;07:30:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Maintenance is now in progress.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Dec &lt;var data-var=&#039;date&#039;&gt; 14&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;11:30:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Maintenance has completed successfully.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Sun, 14 Dec 2025 07:30:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/maintenance/cmj3ix71w0a2k12dn56ahok51</link>
  <guid>https://uofu-chpc.instatus.com/maintenance/cmj3ix71w0a2k12dn56ahok51</guid>
</item>

<item>
  <title>Issue with virtual machines, also affecting the Slurm scheduler in the General Environment</title>
  <description>
    Type: Incident
    Duration: 19 hours and 1 minute

    Affected Components: Open OnDemand, Portal, HPC clusters
    Dec 11, 00:54:00 GMT+0 - Investigating - We are currently investigating this incident. There is an issue with virtual machines in the General Environment. This has taken many CHPC services offline, including the Slurm scheduler and Open OnDemand. CHPC staff are aware of the issue and working on a resolution. Dec 11, 19:55:28 GMT+0 - Resolved - This incident has been resolved. Systems should now be available or become available shortly. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 19 hours and 1 minute</p>
    <p><strong>Affected Components:</strong> Open OnDemand, Portal, HPC clusters</p>
    &lt;p&gt;&lt;small&gt;Dec &lt;var data-var=&#039;date&#039;&gt; 11&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;00:54:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  We are currently investigating this incident. There is an issue with virtual machines in the General Environment. This has taken many CHPC services offline, including the Slurm scheduler and Open OnDemand. CHPC staff are aware of the issue and working on a resolution.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Dec &lt;var data-var=&#039;date&#039;&gt; 11&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;19:55:28&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  This incident has been resolved. Systems should now be available or become available shortly.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Thu, 11 Dec 2025 00:54:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/incident/cmj1kip5c00ajpvsp81wtixxb</link>
  <guid>https://uofu-chpc.instatus.com/incident/cmj1kip5c00ajpvsp81wtixxb</guid>
</item>

<item>
  <title>Scheduled maintenance on portal.chpc.utah.edu</title>
  <description>
    Type: Maintenance
    Duration: 15 minutes

    Affected Components: Portal
    Dec 10, 04:30:00 GMT+0 - Identified - [portal.chpc.utah.edu](http://portal.chpc.utah.edu) will be inaccessible briefly for scheduled maintenance at 9:30 p.m. MST on Tuesday, December 9. Dec 10, 04:30:01 GMT+0 - Identified - Maintenance is now in progress Dec 10, 04:45:00 GMT+0 - Completed - Maintenance has completed successfully 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 15 minutes</p>
    <p><strong>Affected Components:</strong> Portal</p>
    &lt;p&gt;&lt;small&gt;Dec &lt;var data-var=&#039;date&#039;&gt; 10&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;04:30:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  [portal.chpc.utah.edu](http://portal.chpc.utah.edu) will be inaccessible briefly for scheduled maintenance at 9:30 p.m. MST on Tuesday, December 9.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Dec &lt;var data-var=&#039;date&#039;&gt; 10&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;04:30:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Maintenance is now in progress.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Dec &lt;var data-var=&#039;date&#039;&gt; 10&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;04:45:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Maintenance has completed successfully.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Wed, 10 Dec 2025 04:30:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/maintenance/cmiz2eczl045i9s0s01yhuwxf</link>
  <guid>https://uofu-chpc.instatus.com/maintenance/cmiz2eczl045i9s0s01yhuwxf</guid>
</item>

<item>
  <title>Network issues affecting CHPC website and other services</title>
  <description>
    Type: Incident
    Duration: 15 minutes

    Affected Components: Homepage and documentation
    Nov 14, 19:51:48 GMT+0 - Investigating - We are currently investigating this incident. Nov 14, 20:06:56 GMT+0 - Resolved - This incident has been resolved. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Incident</p>
    <p><strong>Duration:</strong> 15 minutes</p>
    <p><strong>Affected Components:</strong> Homepage and documentation</p>
    &lt;p&gt;&lt;small&gt;Nov &lt;var data-var=&#039;date&#039;&gt; 14&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;19:51:48&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Investigating&lt;/strong&gt; -
  We are currently investigating this incident.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Nov &lt;var data-var=&#039;date&#039;&gt; 14&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;20:06:56&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Resolved&lt;/strong&gt; -
  This incident has been resolved.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Fri, 14 Nov 2025 19:51:48 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/incident/cmhz9xktp02vuaoigasol40q0</link>
  <guid>https://uofu-chpc.instatus.com/incident/cmhz9xktp02vuaoigasol40q0</guid>
</item>

<item>
  <title>Slurm upgrade on CHPC clusters (granite, notchpeak, kingspeak, lonepeak, and redwood)</title>
  <description>
    Type: Maintenance
    Duration: 9 hours

    Affected Components: HPC clusters, Open OnDemand
    May 13, 14:00:01 GMT+0 - Identified - Maintenance is now in progress May 13, 23:00:00 GMT+0 - Completed - Maintenance has completed successfully May 13, 14:00:00 GMT+0 - Identified - This upgrade will cause disruptions to job submission. Existing jobs should run without issue or interruption, but there will be a period of time during which job submission will not work. Please see &lt;https://chpc.utah.edu/news/read.php?source=recent&amp;article=20250501%5Fslurm%5Fupgrade&gt; for more information. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 9 hours</p>
    <p><strong>Affected Components:</strong> HPC clusters, Open OnDemand</p>
    &lt;p&gt;&lt;small&gt;May &lt;var data-var=&#039;date&#039;&gt; 13&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;14:00:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Maintenance is now in progress.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;May &lt;var data-var=&#039;date&#039;&gt; 13&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;23:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Maintenance has completed successfully.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;May &lt;var data-var=&#039;date&#039;&gt; 13&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;14:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  This upgrade will cause disruptions to job submission. Existing jobs should run without issue or interruption, but there will be a period of time during which job submission will not work. Please see &lt;https://chpc.utah.edu/news/read.php?source=recent&amp;article=20250501%5Fslurm%5Fupgrade&gt; for more information.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Tue, 13 May 2025 14:00:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/maintenance/cmafj9ixo001lmz46rv45k3pc</link>
  <guid>https://uofu-chpc.instatus.com/maintenance/cmafj9ixo001lmz46rv45k3pc</guid>
</item>

<item>
  <title>Open OnDemand upgrade</title>
  <description>
    Type: Maintenance
    Duration: 10 minutes

    Affected Components: Open OnDemand
    Apr 20, 03:00:01 GMT+0 - Identified - Maintenance is now in progress Apr 20, 03:00:00 GMT+0 - Identified - Please see &lt;https://chpc.utah.edu/news/read.php?source=recent&amp;article=20250419%5Fondemand%5Fupgrade&gt; for more information. Apr 20, 03:10:00 GMT+0 - Completed - Maintenance has completed successfully 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 10 minutes</p>
    <p><strong>Affected Components:</strong> Open OnDemand</p>
    &lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 20&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;03:00:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Maintenance is now in progress.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 20&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;03:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Please see &lt;https://chpc.utah.edu/news/read.php?source=recent&amp;article=20250419%5Fondemand%5Fupgrade&gt; for more information.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Apr &lt;var data-var=&#039;date&#039;&gt; 20&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;03:10:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Maintenance has completed successfully.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Sun, 20 Apr 2025 03:00:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/maintenance/cm9nbo7pr005xsk1yd0p5djes</link>
  <guid>https://uofu-chpc.instatus.com/maintenance/cm9nbo7pr005xsk1yd0p5djes</guid>
</item>

<item>
  <title>Datacenter cooling upgrade</title>
  <description>
    Type: Maintenance
    Duration: 1 day and 12 hours

    Affected Components: HPC clusters
    Mar 25, 12:00:01 GMT+0 - Identified - Maintenance is now in progress Mar 27, 00:00:00 GMT+0 - Completed - Maintenance has completed successfully Mar 25, 12:00:00 GMT+0 - Identified - This is a planned downtime at the Downtown Data Center to perform electrical work for the new cooling system, and **will impact the general environment clusters lonepeak, kingspeak, notchpeak, and granite. Parts of the protected environment, including the redwood cluster, may be shut down for system administration work as well.** The electrical work by the DDC staff is expected to take the entire day on the 25th, and systems will be brought up the following day after the power outage and any system administration work are complete. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 1 day and 12 hours</p>
    <p><strong>Affected Components:</strong> HPC clusters</p>
    &lt;p&gt;&lt;small&gt;Mar &lt;var data-var=&#039;date&#039;&gt; 25&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;12:00:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Maintenance is now in progress.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Mar &lt;var data-var=&#039;date&#039;&gt; 27&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;00:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Maintenance has completed successfully.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Mar &lt;var data-var=&#039;date&#039;&gt; 25&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;12:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  This is a planned downtime at the Downtown Data Center to perform electrical work for the new cooling system, and **will impact the general environment clusters lonepeak, kingspeak, notchpeak, and granite. Parts of the protected environment, including the redwood cluster, may be shut down for system administration work as well.** The electrical work by the DDC staff is expected to take the entire day on the 25th, and systems will be brought up the following day after the power outage and any system administration work are complete.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Tue, 25 Mar 2025 12:00:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/maintenance/cm84slz9l00fd1pkvs0yxfper</link>
  <guid>https://uofu-chpc.instatus.com/maintenance/cm84slz9l00fd1pkvs0yxfper</guid>
</item>

<item>
  <title>Datacenter power maintenance</title>
  <description>
    Type: Maintenance
    Duration: 8 hours

    Affected Components: HPC clusters
    Mar 16, 22:00:00 GMT+0 - Completed - Maintenance has completed successfully Mar 16, 14:00:01 GMT+0 - Identified - Maintenance is now in progress Mar 16, 14:00:00 GMT+0 - Identified - Rocky Mountain Power has just notified us of a planned power outage to the Downtown Data Center and vicinity this Sunday, March 16, at 9 a.m. to address some urgent electrical changes for the city. The CHPC systems with power backed up by generators (the Protected Environment, network, data storage, and virtual machines) will not be affected, but the systems with battery backup only will need to be shut down. **Therefore, we will shut down the general environment clusters lonepeak, kingspeak, notchpeak, and granite at 8 a.m. on Sunday, March 16**. We estimate these servers will be down for several hours. 
  </description>
  <content:encoded>
    <![CDATA[<p><strong>Type:</strong> Maintenance</p>
    <p><strong>Duration:</strong> 8 hours</p>
    <p><strong>Affected Components:</strong> HPC clusters</p>
    &lt;p&gt;&lt;small&gt;Mar &lt;var data-var=&#039;date&#039;&gt; 16&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;22:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Completed&lt;/strong&gt; -
  Maintenance has completed successfully.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Mar &lt;var data-var=&#039;date&#039;&gt; 16&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;14:00:01&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Maintenance is now in progress.&lt;/p&gt;
&lt;p&gt;&lt;small&gt;Mar &lt;var data-var=&#039;date&#039;&gt; 16&lt;/var&gt;, &lt;var data-var=&#039;time&#039;&gt;14:00:00&lt;/var&gt; GMT+0&lt;/small&gt;&lt;br&gt;&lt;strong&gt;Identified&lt;/strong&gt; -
  Rocky Mountain Power has just notified us of a planned power outage to the Downtown Data Center and vicinity this Sunday, March 16, at 9 a.m. to address some urgent electrical changes for the city. The CHPC systems with power backed up by generators (the Protected Environment, network, data storage, and virtual machines) will not be affected, but the systems with battery backup only will need to be shut down. **Therefore, we will shut down the general environment clusters lonepeak, kingspeak, notchpeak, and granite at 8 a.m. on Sunday, March 16**. We estimate these servers will be down for several hours.&lt;/p&gt;
]]>
  </content:encoded>
  <pubDate>Sun, 16 Mar 2025 14:00:00 +0000</pubDate>
  <link>https://uofu-chpc.instatus.com/maintenance/cm84sjb4200ff8o0uaje1lf0f</link>
  <guid>https://uofu-chpc.instatus.com/maintenance/cm84sjb4200ff8o0uaje1lf0f</guid>
</item>

  </channel>
  </rss>