The Livermore Computing Center recognizes that our users' work can be seriously impacted by the unavailability of systems and services. Unfortunately, some amount of interruption in service is unavoidable. The Center strives to minimize user disruption caused by scheduled interruptions for repairs, updates, installation of new equipment, troubleshooting, or preventive maintenance. We therefore observe the following guidelines in scheduling these events:
- We will provide a minimum of 24 hours' notice before an event takes place, when possible.
- Notification of events will be displayed in MOTDs, news postings, e-mails to status lists, and on the Machine Status page.
- Events with the greatest impact on the most users will be displayed more prominently and publicized through technical bulletins, messages, and calls to Computer Coordinators.
- We will try to overlap events when they do not interfere with each other. For example, storage downtime may be scheduled at the same time as the weekly IBM preventive maintenance.
- When possible, we will schedule events in a way that is more convenient for users, such as before 8 a.m. or during lunch time.
- We avoid scheduling events on Fridays or Mondays (or logical Fridays or Mondays due to holidays) when users may be preparing for weekend runs or examining results. We also avoid Friday changes since unexpected consequences could disturb weekend runs and key staff may be unavailable to deal with the situation.
- Rolling TOSS updates may occur on a Monday in order to finish updating a cluster in a timely manner. These updates do not kill jobs, because they are applied between jobs. The main impact is a reboot of the login nodes at approximately 8 a.m.
- We will make other modifications on a Friday or Monday if there is a high probability of correcting an existing emergency situation or of preventing such a situation from occurring during the weekend or holiday.
- We provide redundancy and fail-over service where technology allows. When fail-over occurs, users may notice the failure only minimally or not at all.
Please note that sometimes an urgent problem or the threat of a failure requires that we take action quickly. This may preclude the scheduling options and advance notice described above.
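The scheduling guidelines above can be sketched as a simple check on a proposed event. This is only an illustration, not a tool the Center uses: the function names, the 12:00 to 13:00 lunch window, and the reading of a "logical" Friday or Monday as any working day adjacent to a non-working day are all assumptions made for the example.

```python
from datetime import date, datetime, time, timedelta

MIN_NOTICE = timedelta(hours=24)    # guideline: at least 24 hours' notice
EARLY_CUTOFF = time(8, 0)           # guideline: "before 8 a.m."
LUNCH = (time(12, 0), time(13, 0))  # assumed lunch window; not stated in the guidelines


def is_workday(day, holidays):
    """A weekday (Mon-Fri) that is not in the holiday set."""
    return day.weekday() < 5 and day not in holidays


def is_logical_friday_or_monday(day, holidays):
    """Assumed reading: a working day next to a non-working day."""
    if not is_workday(day, holidays):
        return False
    one_day = timedelta(days=1)
    return (not is_workday(day + one_day, holidays)
            or not is_workday(day - one_day, holidays))


def check_event(start, announced, holidays=frozenset()):
    """Return the guidelines a proposed event would not meet (hypothetical helper)."""
    warnings = []
    if start - announced < MIN_NOTICE:
        warnings.append("less than 24 hours' notice")
    if is_logical_friday_or_monday(start.date(), holidays):
        warnings.append("falls on a Friday or Monday (or logical equivalent)")
    t = start.time()
    if not (t < EARLY_CUTOFF or LUNCH[0] <= t < LUNCH[1]):
        warnings.append("not before 8 a.m. or during lunch time")
    return warnings


if __name__ == "__main__":
    # A Wednesday 7 a.m. event announced two days ahead.
    print(check_event(datetime(2024, 3, 6, 7, 0), datetime(2024, 3, 4, 9, 0)))
```

An empty list means the proposed event meets all three guidelines checked here. Urgent work would bypass such a check entirely, as the note above explains.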
Frequently Asked Questions
Why is there so much going on with LC systems and services?
We have a large number of systems. Besides the Advanced Technology Systems, Linux clusters, and storage, there are many machines supporting services for those systems. These include NFS servers, LCRM and Moab control hosts, and LDAP servers. Each system has numerous hardware and software components. Many downtimes are part of our continuous effort to integrate new equipment to increase capacity and improve performance.
All hardware is subject to failure. All software is subject to bugs. We work to mitigate these sad but true facts by keeping up with preventive maintenance, applying software patches and updates, and responding proactively to problems. Some of our scheduled downtimes are part of the effort to prevent unscheduled downtimes, with their accompanying potential loss of work or data.
Our systems and services are complex and closely entwined. An interruption to a central service, such as NFS-provided home directories or the Kerberos authentication service, may have far-reaching effects.
Why is it going on while I'm trying to work?
Some disruptive events are performed off-hours, in the early morning, or occasionally on weekends. However, much work needs to be done when LC, network, and vendor staff are available to provide expertise and to help deal with any problems that arise. When full local and vendor support is available, we are better able to keep downtime to a minimum. We try to compromise by scheduling events in the early morning and at lunch time.