Go Home. We’ve Got This!

You wish you could go home at five o’clock each day, but students expect to have access to your services 24/7. Increasingly, student services have been moved online so our computers can support the services while you take a break. We host more than 70 customer-facing custom and vendor-supported applications on our servers.

Staff and students alike expect these services to always be available when they need them. It may be a freshman racing against the clock to meet a midnight deadline for submitting an application for membership in a Freshman Leadership Organization (FLO) or it may be a Critical Incident Response Team (CIRT) staff member who needs to update case notes for a student in crisis at three o’clock in the morning. This level of service requires us to maintain dependable systems.

Contrary to popular opinion, computer systems don’t “just work.” Careful system design and monitoring are critical to achieving such reliability. “Service Availability” is a key metric that we track.

Our services are available 99.97% of the time. That’s all but 3 hours a year, and most of those 3 hours are during scheduled maintenance windows on the first and third Sundays of each month. Most of the service disruptions our customers experience are for reasons beyond our control such as power, network or CAS outages. We couldn’t control the winter storm in February, but our services remained available throughout that week to those who had Internet access. For comparison, Google Services and YouTube reportedly had three major outages in 2020 for a total of about 9 hours.

How do we achieve this level of reliability? Two key strategies are system redundancy and automated monitoring.

System Redundancy

Remember the days of your desktop computer dying from a failed power supply or a damaged hard disk? The levels of redundancy for our servers now go far beyond dual power supplies, network connections, and storage disks.

Our services run on more than 200 virtual servers that can be moved on the fly between different sets of hardware without disrupting the services you depend on. It’s like your cell phone changing which tower it is connected to without disrupting your call as you travel down the highway. This lets us repair or replace hardware without you noticing.

Our most critical virtual services also have redundancy. When information is saved in a database, it actually gets written to more than one copy of the database. If the primary copy fails, the system can instantly elevate another copy to be the primary with no downtime.

Automated Monitoring

When one of these systems does go down, the problem gets detected automatically by our monitoring service. We don’t just check if a service is up or down. We are constantly checking a variety of things such as if disks are getting full, if memory is nearly used up, and if pages are loading too slowly. We even monitor the temperature in one of our server rooms. Checking for “close to failing” conditions lets us intervene in time to prevent actual service failures. About 3600 distinct things are checked at least every five minutes.

At any time day or night when service quality falls outside acceptable boundaries, notification is automatically sent to appropriate team members by text or email. It is also displayed on a service status screen within our office suite so everyone can have awareness. If the issue is not resolved within the designated time period, the situation is automatically escalated with additional notifications to other team members.

With such aggressive monitoring, it’s not unusual for us to detect and resolve problems before our customers notice. On occasion we have detected and informed the Division of IT of campus service disruptions before they were reported to us.

We are always improving. In the rare case when a bad condition does get past us, we typically add a test to our monitoring service so we can be alerted if it recurs.

Understanding the level of criticality of each service we offer allows us to design the right solution to meet your needs. Most of the year no one cares if the voting system for student body elections is working, but for two days in October and February the story is very different.

Bruce Brown in the Leadership & Service Center of Student Activities advises multiple high-profile student organizations that rely on our services. He knows our service quality doesn’t happen by accident. “Whenever I’ve had the opportunity to partner with Technology Services – Student Affairs for student organizations or programs, it’s been the definition of a true partnership in that they seek to actively learn and anticipate challenges and solutions for which we didn’t yet know we needed an answer. I’ve been thankful for the work they continue to do.”

If a critical service does somehow fail after hours, please call our Service Desk at 979-862-7990 and leave a voice message. Just like the messages sent from our monitoring service, we also monitor voice messages you leave. If it is critical, we will work immediately to restore the service, so you can go back to enjoying being home.

Ensuring value for our customers. That’s what we’re about. Ask how we can help you!

For more information about our custom application supporting CIRT, visit https://doit.tamu.edu/serving-students-in-crisis/