A leap of faith…

Less than three weeks after getting all the Equallogic arrays up to the latest firmware, while making the most of some power failure downtime, along comes another firmware update!

The Equallogic arrays have two controllers each with redundant network connections for SAN traffic.  One controller is the active controller and the other is the passive standby.

During updates both controllers are flashed with the new firmware.  The standby controller is restarted, which does not affect traffic but brings it online with the new firmware.  The active controller is then restarted, which causes the standby controller to take over as the active controller (are you following this???), and the old active controller then restarts with the new firmware in the standby role.
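For anyone who did not follow that, here is a minimal sketch of the sequence as a toy Python model.  It is not Equallogic's actual update logic, just the role swap as described above; the controller names CM0 and CM1 and the firmware labels are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Controller:
    name: str      # hypothetical label, e.g. "CM0"
    role: str      # "active" or "standby"
    firmware: str


def rolling_update(controllers, new_fw):
    """Walk a two-controller array through the update and return a step log."""
    log = []
    active = next(c for c in controllers if c.role == "active")
    standby = next(c for c in controllers if c.role == "standby")

    # Step 1: the standby restarts first; traffic stays on the active controller.
    standby.firmware = new_fw
    log.append(f"{standby.name} restarted as standby on {new_fw}")

    # Step 2: the active controller restarts, so the standby takes over.
    standby.role, active.role = "active", "standby"
    log.append(f"{standby.name} took over as active")

    # Step 3: the old active comes back on the new firmware in the standby role.
    active.firmware = new_fw
    log.append(f"{active.name} restarted as standby on {new_fw}")
    return log


controllers = [Controller("CM0", "active", "old"), Controller("CM1", "standby", "old")]
for step in rolling_update(controllers, "new"):
    print(step)
```

The only point where connected systems should notice anything is step 2, which is why the failover time matters so much.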

So, the theory is that the time taken to fail over the controller is less than the iSCSI timeout of connected systems, so there is no interruption to disk I/O… With ~50 systems running in the environment, including a bunch of GroupWise post offices, it takes some faith that this will in fact work as described.
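The theory boils down to a simple comparison, sketched below in Python.  The host names, timeout values and the 20 second failover time are all hypothetical; real timeouts depend on how each initiator is configured.

```python
def hosts_at_risk(failover_seconds, host_timeouts):
    """Return hosts whose iSCSI timeout is not longer than the failover time."""
    return sorted(
        host for host, timeout in host_timeouts.items()
        if timeout <= failover_seconds
    )


# Hypothetical per-host iSCSI timeouts in seconds.
timeouts = {"vmhost1": 60, "adm4po": 60, "legacy-box": 15}
print(hosts_at_risk(20, timeouts))
```

Any host that shows up in that list would be one to move or reconfigure before trusting the update.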

When I introduced the new PS6010X 10 GbE array I put it in a separate group and moved a few low impact systems to it as an initial test.  I then moved the new ADM4PO post office server to the new group to flex it a little more.

With fingers crossed I performed the firmware upgrade on the new array.  There are a few heart stopping moments as connections are lost and reconnected and various components start generating alerts and flashing red warnings!  The whole process took a few minutes but worked flawlessly.  My own mailbox is on the ADM4PO and I was able to keep working throughout.  I also monitored all 8 VMs running from the array through vCenter and not one alert was generated.

The question now is – Do I have the nerve to perform the same upgrade on the older arrays?

Every cloud… or, Making the most of a power failure!

Sunday morning – power goes out.  Fried squirrel for lunch and a couple hours of cool down time for the servers.

On the up side I managed to:

  • Reconfigure the physical SAN links
  • Update the 1 GbE SAN switch stack firmware
  • Update the Equallogic array firmware

All these things would have required down time or much juggling of virtual resources.

Nice one 🙂

Virtualization…

So here’s what virtualization looks like…

From this...

... to this!

I keep walking by the stacks of 40+ old servers that have been virtualized and decided today was the day to take a picture for posterity, or at least for as long as this blog exists.

Smith Spam – Email delivery from Publishing Concepts (PCI)

Had a report this morning of very slow delivery of a test email from PCI.  PCI is an external mass-mailing service which is used by several departments for sending messages to all students.  It is also the mechanism for sending the twice weekly eDigest.

A quick analysis of email handling by the McAfee appliances over the last seven days revealed that of the ~1 million messages accepted for further scanning (~500k from other sources were blocked outright), ~350k were passed on to MessageScreen for more inspection.

How many messages arrived from PCI?  ~35k – or around 10% of the email passed on to MessageScreen!

In order to expedite delivery of these emails I created a new policy on the McAfee appliances.  99.9% of the PCI email originated from the same IP address so the policy triggers on this.  Messages from this IP are now not scanned in any way, shape or form and are directly relayed to GWSMTP2 and thence to users' mailboxes.
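The decision the policy makes can be sketched in a few lines of Python.  This is not the McAfee policy syntax, only the logic; the address 192.0.2.10 is a documentation placeholder standing in for PCI's real sending IP, and the "bypass" and "scan" labels are my own.

```python
import ipaddress

# Placeholder from the documentation range, NOT the real PCI address.
TRUSTED_SENDERS = {ipaddress.ip_address("192.0.2.10")}


def route_message(source_ip):
    """Return "bypass" for trusted sender IPs and "scan" for everything else."""
    if ipaddress.ip_address(source_ip) in TRUSTED_SENDERS:
        return "bypass"   # relay straight to GWSMTP2, no scanning
    return "scan"         # normal path through the appliance checks


print(route_message("192.0.2.10"))
print(route_message("198.51.100.7"))
```

Matching on a single IP keeps the exception as narrow as possible, but it does mean anything else sent from that address skips scanning too.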

GWSMTP2 is a GroupWise GWIA running on SUSE Linux and is also the GWIA used for emergency email delivery.  It does no standard email processing but is a failover outbound gateway should GWSMTP1 fail.

I have asked for some test messages to be generated to check the new delivery mechanism but the big test will be the next mass email campaign.

ADM3PO Restart required

Had a couple of Big Brother alerts overnight for lost connectivity to GWADM3PO (physical server) and also several reports of client issues from people arriving at work.

Couldn’t get a remote console so went to the physical console which showed multiple losses of time synchronization.

Restarted the server in order to give it a clean start.

Failing forwards to Gmail

While investigating the bounce message issue I noticed the McAfee appliances reporting a lot of failed deliveries to Gmail accounts.  I would hazard a guess that these are all accounts where a GroupWise rule is forwarding all email.

The rejection message is in the format:

Bounced: 74.125.91.27, 550 5.7.1 Unauthenticated email is not accepted from this domain. a6si10739687qck.146

I sent a message, set up forwarding of all email and then set up delegation of all email to my Gmail account and could not reproduce the problem.  In each test all mail was successfully delivered to my Gmail account.

A quick Google of the issue scored a few hits which indicate that this may be a new global issue, possibly in light of some undocumented change in email handling at Gmail.  Several posts in forums say the problem started in mid January.  Perhaps I should use something other than Google to research the problem!

This issue is taking a back seat to the more pressing NDR bounce issue for now, perhaps they are linked somehow?

Bouncy! Bouncy! Email delivery problems

Had a weird situation develop yesterday (of course I had the day off) where some, but not all, inbound email was being bounced with an error 442 No delivery mechanism available message.  A quick check showed a lot of queued messages at the McAfee appliances from no_reply@smith.edu and the MessageScreen appliances being swamped by these message delivery attempts.

no_reply@smith.edu is the sender name used by the McAfee appliances when sending an NDR (Non Delivery Report) message.  The problem was that these messages were then being bounced back from the likes of Facebook, Gmail, etc., and delivery was then attempted on to MessageScreen, where a rule is set up to drop them all on the grounds that “no_reply” meant we did not want to get a reply to this message.
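To make the loop easier to picture, here is a rough Python sketch of what happens to one NDR.  The function names are invented, and I am assuming the MessageScreen rule matches on the no_reply recipient address; the real rule may be written differently.

```python
def messagescreen_action(recipient):
    """Drop anything addressed to the no_reply mailbox, deliver the rest."""
    local_part = recipient.split("@", 1)[0].lower()
    return "drop" if local_part == "no_reply" else "deliver"


def handle_bounce(ndr_sender, ndr_enabled=True):
    """Follow one NDR that the remote site bounces back to its sender."""
    if not ndr_enabled:
        return "no NDR sent, nothing comes back"
    # The remote site returns the NDR to the address it came from,
    # so it lands on MessageScreen addressed to that sender.
    return messagescreen_action(ndr_sender)


print(handle_bounce("no_reply@smith.edu"))
print(handle_bounce("no_reply@smith.edu", ndr_enabled=False))
```

Every returned NDR still costs a delivery attempt before it is dropped, which is how a surge of them can swamp the appliances.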

I could not discover the cause of this sudden surge of NDRs, so to relieve the pressure I turned off NDR generation and flushed out all the queued invalid bounce messages.

Looks like I will be doing a lot of digging for answers today!

SAN Disk Failure and Replacement

The Administrative SAN generated a disk failure notification on 2/5/2011 which was delivered to my cell phone. A replacement disk was ordered this morning. This arrived this afternoon and has been installed.

Anubis Backup Media Agent Issues

Last Saturday at 12:31:37 AM, Anubis re-booted for reasons unknown. This also occurred two weeks ago on the same day and time. Once the system comes back up, the backup network mode has changed, preventing backups from running without user intervention. This issue is being researched.

Prospero Server successful migration

On Monday January 31st the Prospero server was successfully migrated from a Windows 2003 server to a new Windows 2008 R2 server.  In the process the Panopto software was upgraded to the latest version.

This server is used by ET to store and stream video recordings of classes.  The physical server was migrated to a virtual machine last year but Panopto announced that they were discontinuing support for Windows 2003 server installations.

Now, migrating a Windows based software package with a 500 GB data drive to new physical hardware would be a pain in the proverbial.  After a little preparation and proving the procedure with test VMs the actual migration went like clockwork and was completed in 30 minutes.

All of the data to be migrated was placed on the 500GB virtual data disk.  The old server was then given a new name and IP address and shut down.  The data disk was removed from the old VM.

The new server was then given the prospero name and IP address and restarted.  The data disk was moved using vCenter and attached to the new server (took a few seconds to complete!)
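The swap can be modelled in a few lines of Python.  This is a toy model, not vCenter automation: the VM class, the parked name and the 192.0.2.x addresses are all made up, and the real steps were done by hand in vCenter.

```python
class VM:
    def __init__(self, name, ip, disks=None):
        self.name = name
        self.ip = ip
        self.disks = list(disks or [])
        self.powered_on = True


def migrate(old, new, data_disk, parked_name, parked_ip):
    """Move the service identity and the data disk from the old VM to the new one."""
    service_name, service_ip = old.name, old.ip

    # Old server gets a new name and IP, is shut down, and gives up the data disk.
    old.name, old.ip = parked_name, parked_ip
    old.powered_on = False
    old.disks.remove(data_disk)

    # New server takes over the name and IP, then the data disk is attached.
    new.name, new.ip = service_name, service_ip
    new.disks.append(data_disk)
    return new


old = VM("prospero", "192.0.2.20", ["os-2003", "data-500gb"])
new = VM("prospero-new", "192.0.2.21", ["os-2008r2"])
migrate(old, new, "data-500gb", "prospero-old", "192.0.2.22")
print(new.name, new.ip, new.disks)
```

The data never gets copied; only the pointer to the virtual disk changes hands, which is why the attach takes seconds rather than hours.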

All that was left was to install the new Panopto software and everything was running smoothly!

Have I ever mentioned how much I love VMware?