E-mail Outage
Jan. 13, 2003
The following describes in a short and long version the events surrounding the failure and replacement of the storage system supporting Exchange and Outlook. Please feel free to call me at 710-8788 if I can be of any assistance regarding this matter. Thanks for your patience during this time.
The Short Story:
The Dell storage system that started failing on New Year's Eve and failed completely again at 4:30 pm on Monday, January 6th has been replaced with a new storage system by EMC2 that is several orders of magnitude better. All information and email, except for email that arrived from January 2nd until January 6th, has been fully restored. For 4/5ths of the faculty and staff, mail from the evening of January 2nd until Monday, January 6th was recovered from the damaged disks and will be placed in your mailboxes in the next few days. For approximately 1/5th of the faculty/staff and all students, the email from Thursday, January 2nd until Monday, January 6th could not be recovered by us from the damaged disks. The damaged disk drives will be sent to disk recovery specialists, and we expect to restore whatever email they can recover in about 2-3 weeks.
The ITS server staff worked as a team 24 hours a day for over 10 days to bring this new storage system online. During this time they rebuilt and restored one storage system, only to have it fail within hours, and then installed and restored a second data storage system. Essentially six months of work was done in a matter of 10 days. I had the opportunity to work closely with the entire group over these days, and I can assure you that they are the finest group of information technology professionals in the country, and that their dedication to Baylor was greater during this period than their concern for themselves or their families. To them and their spouses and children, we all owe a tremendous debt of gratitude for the personal sacrifices made.
I also want to thank members of the Dell support team. Not only did Dell provide us with 6 of their top 12 specialists including an outstanding Exchange specialist from Microsoft, but they ensured that the equipment we needed was in Waco within hours. The support that Dell gave Baylor was truly exceptional during this very difficult period.
Finally, I want to thank everyone who kept us in their thoughts and prayers during these trying days. Your support and cheering from the sidelines were discernible and helped us to go that extra mile. Your patience and understanding were truly remarkable and a testament to Baylor's Christian identity. If you feel inclined, please drop these folks an email of thanks.
ITS Support Staff
The Long Technical Story (for those who are interested):
One of the high-speed nodes on a SAN (storage area network) is a tape backup system. At Baylor, the tape backup system makes a full copy of all email data to tape on Saturday evening and Wednesday evening. On all other evenings, an incremental backup to tape captures anything that has changed since the last full backup.
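The schedule described above can be sketched in a few lines of Python. This is only an illustration of the full/incremental rotation, not the configuration of the actual backup software; the names `backup_type` and `last_full_backup` are hypothetical.

```python
from datetime import date, timedelta

# Full backups run on Wednesday and Saturday evenings.
# date.weekday() numbers the days Monday=0 ... Sunday=6.
FULL_BACKUP_WEEKDAYS = {2, 5}  # Wednesday, Saturday


def backup_type(day: date) -> str:
    """Return which kind of backup runs on the evening of `day`."""
    return "full" if day.weekday() in FULL_BACKUP_WEEKDAYS else "incremental"


def last_full_backup(day: date) -> date:
    """Return the most recent full-backup date on or before `day`."""
    current = day
    while backup_type(current) != "full":
        current -= timedelta(days=1)
    return current
```

For example, `backup_type(date(2003, 1, 2))` gives `"incremental"` for Thursday, January 2nd, and `last_full_backup(date(2003, 1, 2))` gives Wednesday, January 1st.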
Baylor has two SANs. One SAN stores data for Exchange/Outlook. We have two Exchange servers for faculty/staff and two Exchange servers for students. All of these email servers utilize this SAN for storage. This configuration ensures high availability because one Exchange server can take over if the other Exchange server fails. For faculty and staff there are three storage groups within the SAN, and if a catastrophic failure occurs it is necessary to restore 1/3, 2/3 or all of the data from tape, depending on whether 1, 2 or all 3 storage groups were damaged.
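The restore arithmetic above can be written out as a small Python sketch. It assumes the three faculty/staff storage groups hold roughly equal shares of the data, which the 1/3 and 2/3 figures imply but which is not stated exactly; the function name `fraction_to_restore` is hypothetical.

```python
from fractions import Fraction

FACULTY_STAFF_STORAGE_GROUPS = 3  # from the description above


def fraction_to_restore(damaged: int,
                        total: int = FACULTY_STAFF_STORAGE_GROUPS) -> Fraction:
    """Share of faculty/staff data to restore from tape.

    Assumes every storage group holds an equal share of the data.
    """
    if not 0 <= damaged <= total:
        raise ValueError("damaged must be between 0 and total")
    return Fraction(damaged, total)
```

With two damaged storage groups, as happened on Friday, January 3rd, `fraction_to_restore(2)` evaluates to `Fraction(2, 3)`.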
On December 31st, 2002, a controller on the 660 SAN that provides data to Exchange/Outlook failed. The SAN was restored after several hours. On Thursday, January 2nd, precautionary measures were implemented. First, an incremental backup of faculty/staff data was made to tape. Second, in an effort to keep this SAN operational until the new SAN could be installed, a Dell engineer replaced the failed controller at 9:00 pm. Finally, all incoming mail was routed to a special computer where it would be stored pending the replacement of the controller. During the controller replacement, a pin on the connector between this controller and the backplane broke. Dell immediately dispatched another controller and backplane to Waco. After the new parts were installed Friday morning, the SAN started but immediately failed again. This time the failure damaged two storage groups. A second engineer who had arrived from Dell discovered several other components that showed intermittent failures which might have caused the problem. He worked on the SAN until Saturday evening, essentially rebuilding the 660.
Now the damaged storage groups had to be restored from tape. The tape restore process suffered several setbacks, but completed successfully for one of the two damaged storage groups early Monday morning. On Monday morning, January 6, 2/3 of the faculty had access to Exchange/Outlook. By 2:30 pm on Monday, January 6, all 3 storage groups were restored and the email that had queued since Thursday evening was released. A full backup to tape was begun just after all queued email was delivered. Long before this backup completed, the SAN once again failed completely! This was 4:30 pm on Monday, January 6. At this point, our last good backup was from the previous Thursday.
More components were replaced and a repair process was begun to recover the lost data. After working all night the Dell engineer declared at 4:00 am Tuesday morning that all 3 storage groups and all student storage groups had been destroyed and could not be recovered except from tape. The backup from Thursday the 2nd was our only recourse.
All efforts to resurrect and use the 660 SAN used for Exchange were now abandoned. Fortunately, the new 600 that had been ordered in November and shipped on January 2 was on its way to Waco! While all of this was going on, the 2nd 660 SAN unit, which supported our website and various other applications, had experienced some problems but was continuing to function. Baylor no longer had any confidence in this unit and was committed to replacing it as well. However, we had nothing on order to replace it.
The New SAN
Also during the morning, the Dell engineers made one more run at resurrecting the 660 while waiting for the 600 to arrive. That attempt proved successful despite 4 subsequent failures. Having the 660 running allowed us to copy any accessible information and to recover the email sent from January 2nd until January 6th for approximately 4/5ths of the faculty. This email will appear in mailboxes during the next several days. For approximately 1/5th of the faculty/staff and all students, the email from Thursday, January 2nd until Monday, January 6th could not be read from the 660. The damaged disk drives will be sent to disk recovery specialists, and we expect to restore whatever email they can recover in about 2-3 weeks.
The 600 SAN arrived on time. A conference room was converted to a war room where the new SAN was assembled and team members met to plan the recovery. The new SAN was rolled into the server room Tuesday evening. On Wednesday the disks were formatted--a long process for 4 terabytes of drive space. By 5 pm Wednesday the data recovery process could begin.
Two teams were created. One team carved out the individual storage units on the new SAN that would hold the Exchange data recovered from tape. The storage units on the new 600 have to appear identical to the storage units on the old 660. The creation of identically matching storage units would take until Thursday.
A second team carved out individual storage units on the new SAN and placed whatever pieces of the storage groups could be recovered from the old SAN into these storage units. Each of these storage units was connected to a computer, so restoration of the damaged storage groups, a process that could take days, ran in parallel on multiple computers and finished by Thursday evening. Remarkably, data from 2 of the 3 storage groups was recovered through this process. Unfortunately, the disks were too badly damaged to recover 1 of the storage groups (1/3 of the faculty), and none of the student storage groups was recoverable. This meant that email for 1/3 of the faculty and all the students sent from Thursday, January 2nd at 9 pm through Monday at 4:30 pm was lost unless a disk recovery company could restore it.
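For a sense of scale, the length of that window can be worked out from the two timestamps given above. This is a minimal Python sketch that takes the stated times at face value; the variable names are illustrative only.

```python
from datetime import datetime

# Times taken from the account above: controller replacement began
# Thursday, January 2nd at 9 pm; the final failure was Monday,
# January 6th at 4:30 pm.
window_start = datetime(2003, 1, 2, 21, 0)
window_end = datetime(2003, 1, 6, 16, 30)

# Length of the period whose email was at risk.
window = window_end - window_start
hours_at_risk = window.total_seconds() / 3600

print(f"Email at risk spans {window.days} days and "
      f"{window.seconds / 3600:.1f} hours ({hours_at_risk:.1f} hours total)")
```

The window comes to 91.5 hours, a little under four days of mail.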
While all of this was underway, Baylor and Dell made arrangements for another SAN to be sent to Baylor on an emergency basis. This additional SAN was pulled out of a laboratory and sent to Baylor on Thursday evening. By Saturday evening this 2nd SAN was installed and in place to start converting the Baylor website and several other servers.
Recovery from Tape
Unfortunately, by late Thursday evening the tape recovery had failed. Several large file copies had also failed. The Dell engineers were unsure of the problem. Calls were made to EMC2, Dell, Brocade (the makers of a fiber switch) and others. It was determined that the Host Bus Adapters (HBAs) were the most likely culprit. HBAs are the cards in nodes (computers) that allow for high-speed optical transfer of data. New HBAs were brought to Waco Friday morning and installation of the new HBAs commenced.
Since I was unsure that the new HBAs would fix the new SAN, we made arrangements for a backup plan. If the tape restore did not work after the HBAs were replaced, Dell would send in a large RAID array to replace the 600. Rebuilding a 3rd storage system was a terribly depressing thought for all of us, and Friday evening, while the tape was restoring, was a very stressful time.
The new HBAs fixed the problems. Since the replacement of the HBAs, no errors or problems have been reported with the SAN. Large copies, tape backups, and tape restores have all worked flawlessly. The recovery of information was completed around 2:00 am Saturday morning. I awoke at 4:00 am on Saturday to get the good news that Exchange/Outlook was fully operational and started doing my email. At 5:45 am the SAN went down!
Second Recovery from Tape
The SAN worked fine, but now we had to do yet another restore. Fortunately, during the week we had found a way to back up the incoming mail queues. Saturday morning we made backups of all good storage groups and then spent Saturday afternoon restoring the damaged storage group and rerunning the incoming mail queue. When this finished, backups were made of all storage groups.
Finally, at 8:00 pm Saturday evening, the Exchange/Outlook system at Baylor was back online with a new SAN.
Converting off the 2nd SAN
We are no longer dependent on the 660 SANs.