January 13, 2003
The following describes in a short and long version the events surrounding the failure and replacement of the storage system supporting Exchange and Outlook. Please feel free to call me at 710-8788 if I can be of any assistance regarding this matter. Thanks for your patience during this time.
The Short Story:
The Dell storage system that started failing on New Year's Eve and failed completely again at 4:30 pm on Monday, January 6th has been replaced with a new storage system by EMC2 that is several orders of magnitude better. All information and email, except for email that arrived from January 2nd until January 6th, has been fully restored. For four-fifths of the faculty and staff, mail from the evening of January 2nd until Monday, January 6th was recovered from the damaged disks and will be placed in your mailboxes in the next few days. For approximately one-fifth of the faculty/staff and all students, the email from Thursday, January 2nd until Monday, January 6th could not be recovered by us from the damaged disks. The damaged disk drives will be sent to disk recovery specialists, and we will be able to restore whatever email they can recover in about 2-3 weeks.
The ITS server staff worked as a team 24 hours a day for over 10 days to bring this new storage system online. During this time they rebuilt and restored one storage system, only to have it fail within hours, and then installed and restored a second data storage system. Essentially six months of work was done in a matter of 10 days. I had the opportunity to work closely with the entire group over these days, and I can assure you that they are the finest group of information technology professionals in the country; and that their dedication to Baylor was greater during this period than their concern for themselves or their families. To them and their spouses and children, we all owe a tremendous debt of gratitude for the personal sacrifices made.
I also want to thank members of the Dell support team. Not only did Dell provide us with 6 of their top 12 specialists including an outstanding Exchange specialist from Microsoft, but they ensured that the equipment we needed was in Waco within hours. The support that Dell gave Baylor was truly exceptional during this very difficult period.
Finally, I want to thank everyone who kept us in their thoughts and prayers during these trying days. Your support and cheering from the sidelines were discernible and helped us to go that extra mile. Your patience and understanding were truly remarkable and a testament to Baylor's Christian identity. If you feel inclined, please drop these folks an email of thanks.
ITS Support Staff
- Bob Hartland, Director of IT Servers and Networks
- Tommy Roberson, Manager of IT Security and Server Operations
- Jon Allen
- Stuart Madsen
- Rob Branham
- Mike Hutcheson
- Darren Jones
- Pay Hynan (from ECS)
- Ray Nazzario
- Cash Coleman
The Long Technical Story (for those who are interested):
Storage systems or Storage Area Networks (SANs) are large arrays of disk drives connected by fiber, allowing very high speed communication among the disks as well as a group of terminals (nodes) connected to them. To ensure reliability, SANs have dual channels holding redundant sets of components so that if one component fails the other component takes over without interruption. The hard drives are formatted so information is not lost if a hard drive fails. Failed hard drives can be replaced without stopping the SAN, and information is automatically restored to the new disk drive. SANs are engineered to be very high speed, fault tolerant storage systems providing very high availability and data integrity. Pictures of the SANs discussed can be found on the ITS web site at http://www.baylor.edu/ITS.
One of the high speed nodes on a SAN is a tape backup system. At Baylor, the tape backup system makes a full copy of all email data to tape on Saturday evening and Wednesday evening. An incremental backup to tape is made all other evenings that contains anything that has changed since a full backup.
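The backup schedule described above can be sketched in a few lines of Python. This is only an illustration and not the configuration of the actual tape system: the function names are hypothetical, and it assumes, as the paragraph states, that each incremental backup holds everything changed since the most recent full backup, so a restore needs the last full tape plus the latest incremental.

```python
from datetime import date, timedelta

# Full backups run Wednesday and Saturday evenings (Monday is weekday 0).
FULL_BACKUP_WEEKDAYS = {2, 5}


def backup_type(day: date) -> str:
    """Return which kind of backup runs on the evening of `day`."""
    return "full" if day.weekday() in FULL_BACKUP_WEEKDAYS else "incremental"


def tapes_needed(last_good_backup: date) -> list[tuple[date, str]]:
    """List the tapes needed to restore to the evening of `last_good_backup`.

    Assumption: an incremental contains all changes since the most recent
    full backup, so at most one incremental tape is required.
    """
    day = last_good_backup
    # Walk backwards until we reach an evening on which a full backup ran.
    while backup_type(day) != "full":
        day -= timedelta(days=1)
    tapes = [(day, "full")]
    if day != last_good_backup:
        tapes.append((last_good_backup, "incremental"))
    return tapes


if __name__ == "__main__":
    # The last good backup in this account was made Thursday, January 2, 2003.
    for tape_day, kind in tapes_needed(date(2003, 1, 2)):
        print(tape_day.isoformat(), kind)
```

Under these assumptions, a restore to Thursday, January 2 draws on the full backup from Wednesday, January 1 and the incremental from January 2.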
Baylor has two SANs. One SAN stores data for Exchange/Outlook. We have two Exchange servers for faculty/staff and two Exchange servers for students. All of these email servers utilize this SAN for storage. This configuration ensures high availability because one Exchange server can take over if the other Exchange server fails. For faculty and staff there are three storage groups within the SAN, and if a catastrophic failure occurs it is necessary to restore one-third, two-thirds, or all of the data from tape, depending on whether 1, 2, or all 3 storage groups were damaged.
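The relationship between damaged storage groups and the amount of data to restore is simple arithmetic, shown here as a minimal Python sketch. The function name is hypothetical, and the sketch assumes the three faculty/staff storage groups hold roughly equal amounts of data, which the thirds quoted above imply but the actual sizes may not match exactly.

```python
from fractions import Fraction

# Number of faculty/staff storage groups on the Exchange SAN.
FACULTY_STAFF_STORAGE_GROUPS = 3


def fraction_to_restore(damaged_groups: int) -> Fraction:
    """Share of faculty/staff data that must come back from tape.

    Assumption: all storage groups are about the same size.
    """
    if not 0 <= damaged_groups <= FACULTY_STAFF_STORAGE_GROUPS:
        raise ValueError("damaged_groups must be between 0 and 3")
    return Fraction(damaged_groups, FACULTY_STAFF_STORAGE_GROUPS)


if __name__ == "__main__":
    for damaged in range(FACULTY_STAFF_STORAGE_GROUPS + 1):
        print(damaged, "damaged ->", fraction_to_restore(damaged), "of the data")
```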
Baylor bought two 660 SANs made by Dell during 2001. The 660s never achieved a truly stable state. They experienced a number of minor failures and in August of 2002 were down for repair for three days. The SAN used for Exchange failed again in October of 2002, and this time it lost a significant amount of data. Human error prevented a full restore from tape, and we lost 4 days of information. After the October failure, Baylor ordered a replacement SAN from Dell made by EMC2. This SAN was shipped from EMC2 factories on January 2, 2003.
On December 31st, 2002, a controller on the 660 SAN that provides data to Exchange/Outlook failed. The SAN was restored after several hours. On Thursday, January 2nd, precautionary measures were implemented. First, an incremental backup of faculty/staff data was made to tape. Second, in an effort to keep this SAN operational until the new SAN could be installed, a Dell engineer replaced the failed controller at 9:00 pm. Finally, all incoming mail was routed to a special computer where it would be stored pending the replacement of the controller. During the controller replacement, a pin broke on the connector that joins the controller to the backplane. Dell immediately dispatched another controller and backplane to Waco. After the new parts were installed Friday morning the SAN started, but it immediately failed again. This time the failure damaged two storage groups. A second engineer who had arrived from Dell discovered several other components that showed intermittent failures which might have caused the problem. He worked on the SAN until Saturday evening, essentially rebuilding the 660.
Now the damaged storage groups had to be restored from tape. The tape restore process suffered several setbacks, but completed successfully for one of the two damaged storage groups early Monday morning. On Monday morning, January 6, two-thirds of the faculty had access to Exchange/Outlook. By 2:30 pm Monday afternoon, all 3 storage groups were restored and the email that had queued since Thursday evening was released. A full backup to tape was begun just after all queued email was delivered. Long before this backup completed, the SAN once again failed completely! This was 4:30 pm on Monday, January 6. At this point in time, our last good backup was from the previous Thursday.
More components were replaced and a repair process was begun to recover the lost data. After working all night the Dell engineer declared at 4:00 am Tuesday morning that all 3 storage groups and all student storage groups had been destroyed and could not be recovered except from tape. The backup from Thursday the 2nd was our only recourse.
All efforts to resurrect and use the 660 SAN used for Exchange were now abandoned. Fortunately, the new 600 that had been ordered in November and shipped on January 2 was on its way to Waco! While all of this was going on, the second 660 SAN, which supported our website and various other applications, had experienced some problems but was continuing to function. Baylor no longer had any confidence in this unit, however, and was committed to replacing that system as well. Unfortunately, we had nothing on order to replace it.
The new SAN
The new SAN was to arrive in Waco at 2:00 pm on Tuesday January 7. The morning was spent planning for the new system and designing ways to communicate the disaster to others. Getting the word out without email was challenging. The Baylor PR office was very helpful in putting together a statement, a press release, and updating the Baylor website. Campus mail was used to deliver the message to faculty and staff, and flyers were delivered to students. A never-before-tried broadcast voicemail was used and ironically caused the voicemail system to become overloaded and slow to respond for several hours. After several attempts, a method was found to send an automated reply to most senders of email to Baylor to let them know that the system was down and would not be available until Monday January 13th.
Also during the morning, the Dell engineers made one more run at resurrecting the 660 while waiting for the 600 to arrive. That attempt proved successful despite 4 subsequent failures. Having the 660 running allowed us to copy any accessible information and allowed us to recover the email sent from January 2nd until January 6th for approximately four-fifths of the faculty. This email will appear in mailboxes during the next several days. For approximately one-fifth of the faculty/staff and all students, the email from Thursday, January 2nd until Monday, January 6th could not be read from the 660. The damaged disk drives will be sent to disk recovery specialists, and we will be able to restore whatever email they can recover in about 2-3 weeks.
The 600 SAN arrived on time. A conference room was converted to a war room where the new SAN was assembled and team members met to plan the recovery. The new SAN was rolled into the server room Tuesday evening. On Wednesday the disks were formatted--a long process for 4 terabytes of drive space. By 5 pm Wednesday the data recovery process could begin.
Two teams were created. One team carved out the individual storage units on the new SAN that would hold the Exchange data recovered from tape. The storage units on the new 600 have to appear identical to the storage units on the old 660. The creation of identically matching storage units would take until Thursday.
A second team carved out individual storage units on the new SAN and placed whatever pieces of the storage groups could be recovered from the old SAN into these storage units. Each of these storage units was connected to a computer, so restoration of the damaged storage groups, a process that could take days, ran in parallel on multiple computers and finished by Thursday evening. Remarkably, data from 2 of the 3 storage groups was recovered through this process. Unfortunately, the disks were too badly damaged to recover one of the storage groups (one-third of the faculty), and none of the student storage groups was recoverable. This meant that email for one-third of the faculty and all the students sent from Thursday, January 2nd at 9 pm through Monday at 4:30 pm was lost unless a disk recovery company could restore it.
While all of this was underway, Baylor and Dell made arrangements for another SAN to be sent to Baylor on an emergency basis. This additional SAN was pulled out of a laboratory and sent to Baylor on Thursday evening. By Saturday evening this 2nd SAN was installed and in place to start converting the Baylor website and several other servers.
Recovery from Tape
Restoration of the Exchange data from tape began Thursday morning and we had high hopes that the SAN would be restored by Friday morning. The new SAN allowed a tape restore that had previously lasted 14 hours to be completed in 2 hours.
Unfortunately, by late Thursday evening the tape recovery had failed. Several large file copies had also failed. The Dell engineers were unsure of the problem. Calls were made to EMC2, Dell, Brocade (the makers of a fiber switch) and others. It was determined that the Host Bus Adapters (HBAs) were the most likely culprit. HBAs are the cards in nodes (computers) that allow for high speed optical transfer of data. New HBAs were brought to Waco Friday morning and installation of the new HBAs commenced.
Since I was unsure that the new HBAs would fix the new SAN, we made arrangements for a backup plan. If the tape restore did not work after the HBAs were replaced, Dell would send in a large RAID array to replace the 600. Rebuilding a 3rd storage system was a terribly depressing thought for all of us, and Friday evening, while the tape was restoring, was a very stressful time.
The new HBAs fixed the problems. Since the replacement of the HBAs, no errors or problems have been reported with the SAN. Large copies, tape backups, and tape restores have all worked flawlessly. The recovery of information was completed around 2:00 am Saturday morning. I awoke at 4:00 am on Saturday to get the good news that Exchange/Outlook was fully operational and started doing my email. At 5:45 am the SAN went down!
Second Recovery from Tape
Most of the team had left to go home at 2:00 am Saturday morning thinking that all was right with the world. I was the only fresh soul among them. I went to the server room to find that one of the Dell engineers had accidentally caused the crash by pulling fan units to simulate problems. He had left the fan units out of the SAN for more than two minutes, and this had caused the SAN to shut down, just as it was designed to do. I was overcome with relief that the new SAN had not really failed. Nevertheless, the Exchange server had all its database stores mounted when the SAN shut down, and one of the storage groups was damaged and would have to be restored.
The SAN worked fine, but now we had to do yet another restore. Fortunately, during the week we had found a way to back up the incoming mail queues. Saturday morning we made backups of all good storage groups and then spent Saturday afternoon restoring the damaged storage group and rerunning the incoming mail queue. When this finished, backups were made of all storage groups.
Finally, at 8:00 pm Saturday evening, the Exchange/Outlook system at Baylor was back online with a new SAN.
Converting off the 2nd SAN
The 2nd SAN sent in by Dell/EMC2 was up and configured by 5:00 pm Saturday. At this time, Baylor began copying data off of a number of servers that utilized the second 660 SAN for storage. This included the Baylor website, a major database server, and various administrative systems. Advice from Dell and EMC2 allowed us to find ways to copy most of the website to the new storage without taking the site down. Sometime after 10:00 pm, the main Baylor website was shut down to allow us to copy the remaining portions of the website and all of the databases in question. Before 2:00 am on Sunday morning, January 12, the website and database applications were back online, running on the emergency replacement SAN.
We are no longer dependent on the 660 SANs.