Monday, May 02, 2011

ESXi 4.1 Update 1 travail - lessons learned.

I’ve been biding my time over the last few months to migrate to ESXi.  Knowing that ESX4.1 is the last edition of the “full fat” VMware, I knew my next move would have to be to ESXi.  So rather than make it a bigger job whenever (cough) 5.0 is launched, I thought I’d change over the long weekend, when I knew clients would be closed.
 
It was entertaining.
 
Building a boot and install USB stick rather than using a DVD burned with an ISO image was an important part of the test.
This is going to come in useful next month as I have some client work then, where the dirty nature of the computer room (a breeze block room in the corner of the warehouse) means that DVD drives become unusable within a few months – I dread to think (and am not responsible for!) the state of the servers and SAN…  So anyway I want to be able to boot and install from USB if necessary.
http://blog.vmpros.nl/2010/09/03/vmware-how-to-create-a-bootable-esxi-usb-stick/ didn’t really work for me, but http://www.ivobeerens.nl/?p=699 proved to be a good source of a procedure on how to do this.  However there are some caveats to the process:
  • Syslinux 4.0.4 (the latest) does not work (or at least did not for me) – stick to 4.0.3!!
  • When modifying the contents of the stick remember to do everything!
  • Whilst the storage in my instance is software iSCSI, IT IS IMMENSELY PRUDENT TO DISCONNECT STORAGE.  As this install process initialises some storage, you do not want to accidentally wipe a LUN.  My recommendation is always to build ESX(i) hosts disconnected from storage.  It prevents an easily avoidable mistake.  Likewise I avoid “Boot from SAN” setups.
  • Make sure you follow all the steps. I managed to miss 1 or 2 a few times before I got it right.
  • Don’t forget that the KS.CFG is YOUR INSTALL SCRIPT.  It’s easy to forget this and take the content and run with it.  If you do, you’ll get an ESX box with 192.168.1.10 as its IP, VMware01 as the root password, and ESXi-01.beerens.local as its full name connected to a domain “beerens.local”.  I could be wrong, but I think this is unlikely to work in your world :-)
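
As a guard against that last caveat, here is a minimal Python sketch that scans the text of a KS.CFG for the sample values mentioned above. The function name and the two-line example fragment are my own illustration (the fragment is not a complete or verified install script); only the three sample values come from the procedure itself.

```python
# Values from the sample KS.CFG that must not survive into your own script.
SAMPLE_VALUES = ("192.168.1.10", "VMware01", "beerens.local")


def leftover_sample_values(ks_text):
    """Return (line_number, value) pairs where a sample value still appears."""
    hits = []
    for number, line in enumerate(ks_text.splitlines(), start=1):
        for value in SAMPLE_VALUES:
            # Compare case-insensitively so "VMWARE01" is caught as well.
            if value.lower() in line.lower():
                hits.append((number, value))
    return hits


# Hypothetical fragment for illustration only, not a full KS.CFG.
example = """rootpw VMware01
network --ip=192.168.1.10 --hostname=ESXi-01.beerens.local
"""

for number, value in leftover_sample_values(example):
    print(f"line {number}: still contains sample value {value}")
```

To check a real stick, read the file with `open(path).read()` and pass the text in; an empty result means none of the three sample values are left.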
 
So once the stick is done:
  • Check for any Anti-Affinity rules in DRS.  You may want to weaken them so that your VM’s have maximum mobility around the farm during the change.
  • Move any non-running servers off local storage (if there is any) to SAN or other shared storage – cut and paste or storage migrate.  If you storage migrate you can change the host as well to unregister them from the server.
  • Storage migrate all running VM’s on local storage off the server to shared storage (no downtime here).
  • Put the ESX host in maintenance mode (and take the option to migrate all paused and stopped machines off the host).  All running guests will migrate off.
This will leave you with a host doing no work, and having no VM’s stored in its local storage.
 
Now, this next part is optional, but I highly recommend it.
  • Document the server setup – including network settings, iSCSI paths, vSwitch names and configs.  In fact everything you can!!!  If you are licensed for it, then consider Host Profiles as a means to that end.
  • Disconnect all external storage connections, and verify this by checking via vCentre.
 
 
Now you can start: insert the USB stick, boot the server, select boot from USB if required, and watch it install.  If you have boot from USB as the default, then at the end of the install you should remove the stick before the server boots again.
Your KS.CFG will do the initial configuration and you have a new ESXi server.
 
This is where some of my fun started.  Now please bear with me – some of this was done late at night over a bank holiday, so I did not do my usual thorough investigation.  I do not have answers to all the questions, just a list of issues encountered and some observations.
 
  1. vCentre
I thought my vCentre was up to date.  I was lazy, it was not.  I discovered on adding the new host to my network that there were some management issues from VC to ESX.  So I needed to upgrade vCentre.  I also discovered that some VM’s would not start when running on the new host – it seems they were mostly VM Version 4; but also (to make things harder) VMtools needs to be updated too!
 
  2. vCentre upgrade ISO
This is a 2.2GB download.  You do not want to do this on a 512Kb ADSL connection.  I hoiked out my 3G MiFi unit and downloaded it over the air to the laptop instead, which gave me a tenfold speed-up.  Fortunately I had 3.5GB left on the monthly allowance, so all was well.
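
To put rough numbers on that, here is a back-of-envelope sketch in Python. It assumes the ADSL line runs at 512 kilobits per second, that the 3G link is exactly ten times faster, that a gigabyte is 10^9 bytes, and it ignores protocol overhead, so treat the result as an estimate only. The function name is mine.

```python
def download_hours(size_gb, rate_kbit_per_s):
    """Hours needed to move size_gb (decimal gigabytes) at rate_kbit_per_s."""
    size_kbit = size_gb * 8 * 1000 * 1000  # gigabytes -> kilobits
    return size_kbit / rate_kbit_per_s / 3600  # seconds -> hours


adsl = download_hours(2.2, 512)       # assumed ADSL rate
mifi = download_hours(2.2, 512 * 10)  # assumed tenfold improvement
print(f"ADSL: {adsl:.1f} hours, 3G: {mifi * 60:.0f} minutes")
```

Under those assumptions the ADSL line needs roughly nine and a half hours, against a little under an hour over 3G.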
 
  3. vCentre Upgrade action
Sadly this is a lengthy process, but by using full documentation from the installation (you do have this don’t you?) I was able to breeze through the dialog boxes and get everything up to date except Update Manager.  For some reason that part of the ISO is corrupt.  I am downloading it again as I type.
For prudence I snapshotted the VM that is the VC before starting.  At times later on, I would be tempted to restore to this, put ESX4.1 back on the host and give up.
Oh, and don’t forget to take the in-place upgrade option – if you go for a new database your whole farm is screwed! (No, I didn’t.)
 
  4. vCentre Client upgrade
On starting the vCentre Client, the new VC edition wants an upgrade before I can connect to it.  This install fails…
Now this was fun… My main management server (physical still – for good historical reasons) is where I do most of the work.  However, it is now 6 years old and has had a large number of VMware components go through it.  Unfortunately… some old MST file was hanging around and the VI Client upgrade failed.  By now it was late at night, so after a quick burst of investigation I decided on a more radical approach.  I stopped all VMware services, hacked out all the VMware stuff from the registry, killed the VMware folders in Program Files, and rebooted the machine.  This did not completely fix the install, and I found a few more VMware folders in the Documents and Settings tree; they went too.
 
  5. DNS and AD failure
Yes, you read that right.  When this box came back DNS was down, and AD was not working as a consequence.  Fearing I’d ripped out something I hadn’t meant to, I was tempted to hit the backup tape (you do take backups don’t you?) but waited a bit…
This being more a test lab than a production network, the primary physical box on which I was working is the original DC of the network.  The other DC’s are virtual, and it turned out that neither had started properly when I had restarted the ESX hosts a bit earlier.  We had had a power cut earlier in the day, and whilst the kit had all stayed up, it seemed (only with hindsight) that, despite my having UPS’s all round, a slight barf on one UPS had impacted a network switch, and the virtual world was not talking to the physical world properly.  Taking the IT Crowd “Turn it off and on again” philosophy to its logical limit… I shut down all the VM guests (you do have a PowerShell script for this don’t you?!) and shut down the hosts.  I then power cycled the switches and waited for them to come back.  I then booted the ESX boxes and the physical server, and all was well.  A quick check round logs and events confirmed this was the case.
I’m not going to try to work out why, as this was now 1am…
 
  6. vCentre client now installed properly and I can connect to vCentre Server again.
A quick bit of configuration of vSwitches, and all seemed to be well except…
 
  7. iSCSI connections
One of the iSCSI connections relies on decent security from the SAN side – and with the new ESXi installation the IQN’s on the software iSCSI had changed, so the SAN had to be told it was allowed to connect!  A quick fix there, and the new ESXi box can see all storage, and works a treat.
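
If you documented the old setup, spotting this is a simple comparison. Below is a minimal Python sketch that lists the initiator names a rebuilt host presents which the SAN has not yet been told about. The function name and both IQN strings are made up for illustration; take the real values from your own notes and from the host's storage adapter details.

```python
def initiators_to_allow(host_iqns, san_allowed_iqns):
    """Return host IQNs that are missing from the SAN's allow list."""
    # Normalise so stray whitespace or capitalisation in hand-typed notes
    # does not cause a false mismatch.
    allowed = {iqn.strip().lower() for iqn in san_allowed_iqns}
    presented = (iqn.strip().lower() for iqn in host_iqns)
    return sorted(iqn for iqn in presented if iqn not in allowed)


# Hypothetical values: the SAN still lists the old ESX initiator name,
# while the rebuilt ESXi host presents a new one.
san_allowed = ["iqn.1998-01.com.vmware:esx01-1a2b3c4d"]
new_host = ["iqn.1998-01.com.vmware:esxi01-5e6f7a8b"]

for iqn in initiators_to_allow(new_host, san_allowed):
    print(f"SAN needs to allow: {iqn}")
```

An empty result means every initiator the host presents is already allowed on the SAN side.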
 
  8. Finally all was well
 
  9. So I just need that good ISO for the Update Manager installation so that I can now manage updates across the VM’s (VM Version and VMTools for now).
 
 
Observations?
  • Well, you can see from the above that Douglas Adams was right when he wrote “Don’t Panic” – I could have given up and used the backups, snapshots and original ESX4.1 that I had to go back to square one.
  • Document your setup, NOW.  You never know when it might come in useful.
  • In ESXi the Service Console no longer exists – look for the Management Network in your ESXi networking setup
  • IQN’s can change
  • Check your VM version – some of your older VM’s may be 4 instead of 7.  In my experience, a VM version 4 had some issues starting and seeing network hardware on a new host.
  • Anti-affinity – keep an eye on it, and restore it when done
  • If you use ESXTOP on ESX, don’t forget – without the Service Console, you won’t get it on the host in the same way; resxtop from the vSphere CLI or vMA is the remote alternative
  • iLO – if you have it, make sure you know the password – it saves a lot of hassle connecting to the host
  • Lastly NEVER FORGET you can point the VI Client directly at the host to work things.  If the VC goes down, it means you can still start and stop guests, enter/exit maintenance mode, and reboot or shut down an ESX box.  This can be your friend.  A lot.
 
 
Oh, and very lastly – if you finish work at nearly 3am after some problems like this, then the early morning Radio 4 news on the day Osama Bin Laden is killed makes for a pretty good wakeup call.
 

1 comment:

P Bryant said...

Happy to report a revised download of the ISO fixed the Update Manager revision to Update 1 and all is well there too.