Chinwag Reloaded with Chad Sakac (@sakacc)

21 Nov

It’s back! I’m pleased to announce that I’m getting back into the Disc Jockey chair – and Chinwag is back! With a bang. I’m so stoked to say my first chinwaggie is Chad Sakac of EMC. This is the first time I’ve had him on the show. And it’s all part of my “Rolling Thunder” Chinwag Relaunch – I will be interviewing the TOP people in our world for the next couple of weeks. In the medium term I will be reaching out to folks in the community who I know online or meet at VMUGs. So this new version of the podcast will not be a vRockstar only thing. It’s good to be back!

I’ve been building up a library of Chinwags with folks since mid-October – this one was recorded just before VMworld EU 2014.



Chad Sakac blogs as well, and we’ve known each other for a while – we first met at VMworld in Las Vegas. That was back when I was getting into SRM, and EMC and NetApp both helped me out with access to storage for the SRM 4.0 book. In this chinwag Chad gives his take on:

  • Converged Vs Hyper-Converged
  • EMC’s plans for EVO:RAIL
  • EMC XtremIO upgrading

Click here to listen to the Podcast on this site…. or alternatively watch the youtube video below:


Posted in Chinwag


EVO:RAIL – Anatomy of a JSON Configuration File

10 Nov

One common question that comes up around EVO:RAIL is where all those pre-defined settings in the appliance UI come from, where they are stored and how you can modify them. In case you haven’t seen the EVO:RAIL UI (and if you haven’t, what stone have you been living under! :-) ), the “Configuration” front-end and engine of EVO:RAIL is a service (vmware-marvin) that runs inside the vCenter Server Appliance, which initially comes up on node01 of the first appliance itself:

Screen Shot 2014-10-24 at 13.30.01

As you can see from the single page there are actually quite a number of settings gathered before the configuration is initialized, built, configured and finalized. These include:

  • ESXi host naming convention and vCenter FQDN
  • vSphere ESXi host networking (IP Ranges/Subnet Mask/Default Gateway/VLAN IDs) for vMotion, Virtual SAN and ESXi hosts
  • Virtual Machine Network (VLAN ID and associated network names)
  • Passwords (for ESXi and vCenter) and optional Active Directory configuration
  • Global Settings such as: Time Zone, NTP, DNS, Syslog/Log Insight IP, Proxy Server settings including hostname, port, username and password


Where do these settings come from? Well, they are held in a JSON file on the vCenter Server Appliance that runs on the EVO:RAIL in the following directory:

/usr/lib/vmware-marvin/marvind/webapps/ROOT/WEB-INF/classes/

The file is called default-config-static.json

There is a sample JSON file in our User Guide appendix – for ease of use I’ve reproduced it at the end of this blog post.

The reason the file is named “static” is that it uses a static IP configuration for all components that require an IP address. When you order an EVO:RAIL, you could supply the VLAN/IP/hostname data to a Qualified EVO:RAIL Partner (QEP) for customization at the factory. Then, when it arrives onsite, you would just plumb it into the network and click the “Just Go!” button. Personally, I always check “Customize Me!” unless I know I’m just repeating the same process over and over again for validating purposes. For example I’m working a lot in the HOL where I repeatedly build the same EVO:RAIL appliance for testing purposes.

Screen Shot 2014-10-24 at 13.44.35

Incidentally, if you do click “Just Go!”, then all you will be prompted for is the ESXi host and vCenter Server passwords. From here you can change the default password that is held in the JSON file to something that suits your needs.

Screen Shot 2014-10-24 at 13.48.41

So what if the settings in the JSON file don’t suit your needs, and you want to change them without endlessly retyping? Perhaps you would like to send an EVO:RAIL out to a remote location, and send the person who’s plugging it in for the first time, settings that are valid for their location. Simple – send them a custom JSON file with the correct values…

When confronted by the EVO:RAIL Welcome screen they would use “Customize Me!” and then click the link to “UPLOAD CONFIGURATION FILE”:

Screen Shot 2014-10-24 at 13.52.24

EVO:RAIL will then inspect and validate the JSON file to check it for errors – so if there are bum entries in it you will receive a warning. (Note that EVO:RAIL does not validate the syntax of the JSON file, so syntax errors will prevent the file from being read.)

For instance, in the example below I deliberately created a JSON file with too few IP addresses in the Virtual SAN pool, and you can also see that a bum IP range from the JSON file has been flagged:

Screen Shot 2014-10-24 at 14.02.50

Screen Shot 2014-10-24 at 14.03.36

Note: 10.20.x.y and 10.10.x.y are not invalid IP ranges.
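On the syntax point above: since a syntax error stops the file being read at all, it can be worth checking that a custom file at least parses as JSON before sending it to a remote site. Here is a minimal sketch in Python using only the standard library. The key names in the sample strings are made up for illustration and are not the real EVO:RAIL schema.

```python
import json


def check_json_syntax(text):
    """Return (True, None) if text parses as JSON, else (False, message)."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # lineno/colno point at where the parser gave up
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"
    return True, None


# Hypothetical snippets; "example" and "pool" are NOT real EVO:RAIL keys.
good = '{"example": {"pool": ["10.10.0.101", "10.10.0.116"]}}'
bad = '{"example": {"pool": ["10.10.0.101", "10.10.0.116"],}}'  # trailing comma

print(check_json_syntax(good))
print(check_json_syntax(bad))
```

This only catches syntax problems. Whether the values themselves make sense is still down to the EVO:RAIL validation and to whoever wrote the file.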

If no errors are found in the JSON file, then it should pass with flying colors – and lead you to the Build Appliance button.

Screen Shot 2014-10-24 at 14.07.22

One thing you’ll notice about my ranges is that they begin x.y.z.101 to x.y.z.116 to include 16 IP addresses even though I only need 4 IP addresses for the 4 servers in my EVO:RAIL appliance. That’s deliberate on my part – so if I ever need to add an additional EVO:RAIL appliance, it can simply be detected on the network and added. To add an appliance all I need to supply is the ESXi and vCenter passwords in the tiny workflow. By creating a pool of IP addresses in the JSON file, it is very easy to add more EVO:RAIL appliances. When a new appliance becomes present on the network, the EVO:RAIL Management UI will discover it, and allow the administrator to add it to the existing cluster.


And if there are free IP addresses in the pool, all that’s required is the passwords. These are indicated by the green ticks in the Add New EVO:RAIL appliance UI.


If the network pool of IP addresses is not big enough or is depleted – you get blue (i) information badges, and you’ll need to specify additional bundles of IP ranges to add the appliance. This is mildly taxing, and I think it’s just neater to specify the full range of IPs (16) for each type of pool upfront. It’s just a slicker experience.

Screen Shot 2014-10-24 at 16.31.29

If the pool is depleted of IP addresses because the range was set too small it’s not the end of the world – it just means you need to fill in more IP data before you can add the EVO:RAIL appliance. In the example above, where the IP pool was kept deliberately small (just 4 IPs), there’s a blue (i) info alert to indicate the operator needs to add more IP addresses before continuing.
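The pool arithmetic above is easy to sanity-check before a range goes into a JSON file. The sketch below counts the addresses in a range and works out how many four-node appliances it would cover. The 10.10.0.x addresses and the function names are illustrative only.

```python
import ipaddress

NODES_PER_APPLIANCE = 4  # four servers per EVO:RAIL appliance, as above


def pool_size(first, last):
    """Number of addresses in an inclusive first..last IPv4 range."""
    start = ipaddress.IPv4Address(first)
    end = ipaddress.IPv4Address(last)
    if end < start:
        raise ValueError("range ends before it starts")
    return int(end) - int(start) + 1


def appliances_supported(first, last):
    """How many whole appliances the pool can supply with one IP per node."""
    return pool_size(first, last) // NODES_PER_APPLIANCE


print(pool_size("10.10.0.101", "10.10.0.116"))             # 16
print(appliances_supported("10.10.0.101", "10.10.0.116"))  # 4
print(appliances_supported("10.10.0.101", "10.10.0.104"))  # 1
```

With a pool of just 4 addresses there is nothing left over for a second appliance, which is the situation that produces the blue (i) badges.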


  • You can use JSON files to automate and customize the build out of EVO:RAIL without the need to manually type IP configurations (not that it takes much time to do that!)
  • The EVO:RAIL Configuration engine validates your IP settings
  • It’s nice to have a pool of IP addresses large enough to make adding a second appliance a trivial process.
  • However, there are limits – an incorrect IP address or bad VLAN value could still pass the validation test, whether it was typed in by the operator or by the person who created the JSON file. After all, EVO:RAIL has no idea if you have mistyped the default gateway IP address…
  • Finally, it is possible to leave the password field in the JSON file blank – which means no password is ever stored in clear-text and the person doing the configuration would have to type in a valid password.


Click here to see a sample JSON file:



Posted in EVO:RAIL


EVO:RAIL – Getting the pre-RTFM in place…

04 Nov

One of the things we are really proud of within the EVO:RAIL team is how easy, simple and quick it is to set up the first EVO:RAIL and then, using our auto-discovery method, scale out the environment by quickly adding subsequent EVO:RAIL appliances. As ever the experience is highly dependent on the network pre-requisites being in place, and so as long as you RTFM, you should be good-to-go.

In case you don’t know, EVO:RAIL implements a version of “Zero Network Configuration” which is a widely recognized RFC standard. VMware’s implementation of this is called VMware Loudmouth (it was originally to be called LoudSpeaker) and it’s a service that runs on both EVO:RAIL’s built-in vCenter Server Appliance and on the 4-server nodes that run VMware ESXi in the 2U enclosure. Below you can see the Loudmouth status running on the first ESXi server node, and on the vCenter Server Appliance:

Screen Shot 2014-10-23 at 12.18.52

The User Guide for EVO:RAIL outlines the network requirements for the appliance, but it’s worth restating them here – and also flagging up the consequences if you don’t meet these pre-requisites. Firstly, EVO:RAIL requires a physical switch (which isn’t part of the offering, but most of the Qualified EVO:RAIL Partners (QEPs) would be happy to sell you one if you don’t have one already!) that supports 10Gbps networking. Both RJ-45 and SFP+ connections are supported, and some QEPs have two EVO:RAIL offerings to support these, although some just support one type at present.

IPv6 support is required on all the ports on the switch, and you need at least 8 ports free to plug in the 2x10Gbps NICs (vmnic0/1) that each of the four servers in the EVO:RAIL uses. You may actually want to have a separate switch for the single 1Gbps NIC that each server node has for BMC/ILO/DRAC style access. These could be plugged into the 10Gbps switch if you have spare ports, or you might prefer to plug them into a spare 1Gbps switch if you have ports free. After all those 10Gbps ports are precious… The User Guide has a good diagram that illustrates this – and of course, it is supported to have vmnic0 plugged into one switch, and vmnic1 plugged into another – to provide switch redundancy.

Screen Shot 2014-10-23 at 11.55.03
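The port count is simple arithmetic, but it is worth redoing whenever you plan for more than one appliance. A small sketch, using the per-node figures given above (the function and constant names are my own):

```python
NODES_PER_APPLIANCE = 4     # four server nodes in the 2U enclosure
TEN_GIG_NICS_PER_NODE = 2   # vmnic0 and vmnic1
BMC_NICS_PER_NODE = 1       # the single 1Gbps BMC/ILO/DRAC style port


def switch_ports_needed(appliances=1):
    """Ports needed, split by speed, for a given number of appliances."""
    nodes = appliances * NODES_PER_APPLIANCE
    return {
        "10Gbps": nodes * TEN_GIG_NICS_PER_NODE,
        "1Gbps": nodes * BMC_NICS_PER_NODE,
    }


print(switch_ports_needed(1))  # {'10Gbps': 8, '1Gbps': 4}
print(switch_ports_needed(2))  # {'10Gbps': 16, '1Gbps': 8}
```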

Whilst EVO:RAIL doesn’t require VLANs as such, I think it is the case that the vast majority of people will want to use them. Personally I would strongly recommend that if the switch is brand new or hasn’t been used by vSphere before you ensure that the VLAN configuration is in place BEFORE you even rack-and-stack the EVO:RAIL. This will lead to a more out-of-the-box experience.

In total you will need at least 3 VLANs to cover the vMotion, Virtual SAN and Virtual Machine traffic. Of course, you can have as many VLANs to separate your various VMs as you need. By default management traffic is not tagged by the EVO:RAIL appliance, and will reside on the default VLAN of the physical switch (this is normally VLAN0 or VLAN1 depending on the vendor). You’ll notice that in the EVO:RAIL UI there isn’t an option to set VLAN Tagging for the VMware ESXi hosts:

Screen Shot 2014-10-23 at 12.30.55

The EVO:RAIL could have set a VLAN ID, but of course doing that would have disconnected you from the appliance, and potentially it would mean a re-IP of the local management machine, and even a re-patch to the right VLAN. So by default the EVO:RAIL uses the default network of VLAN0/1 on the switch. If you want/need to change the management network – you could use the Web Client or vSphere Desktop Client to enable tagging for the Management VMkernel portgroup and the management virtual machine network. For me this is similar to the days when I used the “Ultimate Deployment Appliance” to install ESXi across the network using a PXE situation. In that setup I needed to use a native VLAN, as you don’t get VLAN tagging support in a PXE boot. For me it was easiest to use VLAN0/1 to do the install, and make my ESXi use that as the management network, rather than go through a reconfiguration afterwards.

Of course you do get the option to set the VLAN Tagging ID for vMotion, Virtual SAN and of course, for your VM Networks:

Screen Shot 2014-10-23 at 12.31.54 Screen Shot 2014-10-23 at 12.32.53 Screen Shot 2014-10-23 at 12.34.12

Now, here’s the important part – both the Virtual SAN and Management VLAN will need multicast enabled for both IPv4 and IPv6. This has always been a requirement for Virtual SAN, and the EVO:RAIL requires it for the auto-discovery process to work correctly. You could enable multicast for IPv4 and IPv6 on ALL the ports of the physical switch (this would be desirable if the customer isn’t using VLANs at all), or alternatively enable it just on the two networks that require it (Management and Virtual SAN). If you are using multiple switches, multicast traffic for IPv4 and IPv6 must be able to pass between the switches over the ISL (inter-switch link).

To allow multicast traffic to pass through, you have two options for either all EVO:RAIL ports on your TOR switch or for the Virtual SAN and management VLANs (if you have VLANs configured):

1) Enable IGMP Snooping on your TOR switches AND enable IGMP Querier. By default, most switches enable IGMP Snooping, but disable IGMP Querier


2) Disable IGMP Snooping on your switches. This option may lead to additional multicast traffic on your network.

In my experience folks often over-react to the idea of multicast. I’m not sure why that is, given that this traffic is used to transmit a relatively small amount of network traffic in the form of metadata. Perhaps old-timers like me are thinking back to the days when technologies like Symantec Ghost or multi-cast video had the potential to flatten a network with excessive multicast traffic, and bring it to its knees. But we are not talking about that sort of volume of traffic in the case of Virtual SAN or EVO:RAIL. In case you don’t know (and I didn’t before joining the EVO:RAIL team) IGMP Snooping software examines IGMP protocol messages within a VLAN to discover which interfaces are connected to hosts or other devices interested in receiving this traffic. Using the interface information, IGMP Snooping can reduce bandwidth consumption in a multi-access LAN environment to avoid flooding an entire VLAN. IGMP Snooping tracks ports that are attached to multicast-capable routers to help manage IGMP membership report forwarding. It also responds to topology change notifications. For IPv6, MLD (Multicast Listener Discovery) is essentially the same as IGMP (Internet Group Management Protocol) in IPv4.

Note: Text in italics are direct quotes from the EVO:RAIL user guide!

So, the 60 million dollar question: what happens if you choose to totally ignore these recommendations and pre-requisites? Well firstly, it strikes me as obvious that configuring EVO:RAIL to speak to VLANs that don’t even exist is going to result in all manner of problems. You really have two options – either do the VLAN work on the physical switch first (my personal recommendation) or do not use VLANs at all and have a flat network. What’s not a good idea is to set up EVO:RAIL without any VLAN configuration, and then reconfigure the physical switch to use them after the fact. That would mean vSphere networking on each of the physical hosts would need reconfiguring to be aware of the VLAN ID for tagging purposes on the portgroup.

The above seems pretty straightforward to me – and I’m really sorry if it sounds like teaching Grandma how to suck eggs. The multi-cast requirements are less obvious as most folks will be as new to EVO:RAIL as I am (this is my 9th week!). What happens if the multicast requirements aren’t met on either the Management or the Virtual SAN networks? Firstly on the EVO:RAIL Management network, if multicast is not enabled, then the servers that make up the EVO:RAIL will not be discovered and you will see an error message at around 2% of the configuration process.

As for Virtual SAN the EVO:RAIL configuration will continue and will complete, but you will find Virtual SAN will show that individual disk groups have been created for each appliance node, rather than one single disk group containing all the storage (HDD/SSD) for all the nodes in the cluster. It’s perhaps best to use the vSphere Web Client which is available from the EVO:RAIL Management UI to inspect the status of Virtual SAN configuration, and from there you can:

  1. Navigate to the Home tab, select Hosts & Clusters
  2. Expand the Marvin-Datacenter and Marvin-Virtual-SAN-Cluster to show you now have four hosts provided by the EVO:RAIL Appliance
  3. With the Marvin-Virtual-SAN-Cluster selected, click the Manage Tab, and under the Settings column, select General under Virtual SAN.
  4. Confirm that all 4 hosts are marked as “Eligible” and the network status is “Normal“. If multicast has not been enabled you would see the status “Misconfiguration Detected”



Note: This screen grab is taken from the new version of the HOL I’ve been working on.

A more realistic view from an actual EVO:RAIL appliance would look like this:


If you’re seeing “Misconfiguration Detected” most likely you will find that the root of the problem is multicast related. If you are troubleshooting multicast issues with Virtual SAN, a good place to start is this blog post on the vSphere blog written by my colleague, Joe Cook:


  • Get your switch configured first with the right settings. This is true of almost any hyper-converged environment.
  • If you get an error at around 2-3% of the EVO:RAIL configuration process – check your multicast settings for the management network (VLAN0 or VLAN1 depending on vendor). Then click try again.
  • After the build of the first EVO:RAIL appliance, check your Virtual SAN settings to make sure there are no unpleasant messages and that all the server nodes are in the same disk group.


Posted in EVO:RAIL


Back To Basics: Viewing and Modifying vSphere HA Settings

03 Nov

 All the settings for vSphere HA can be found under the properties of the cluster >> Manage and the Edit button.

Screen Shot 2014-09-17 at 14.50.46.png

Host Monitoring

The Host Monitoring portion of the Edit Cluster Settings options controls whether vSphere HA is turned on or off. As indicated earlier, it is possible to turn off Host Monitoring. This controls whether the vSphere hosts share the network heartbeat that is used to calculate if a host is alive or dead. This check can be temporarily turned off if you know that some network maintenance (such as a physical switch or router upgrade) is likely to cause the network to be down for a period of time.

The virtual machine options control what happens by default if there is a failure or isolation event. Two settings are available here, VM Restart Priority and Host Isolation Response. Restart Priority allows for four options – disabled, low, medium, high. By default medium is selected, and all VMs have the same restart priority of medium. It’s then possible under the VM Overrides options to add individual VMs, and indicate that some VMs have a low or high priority, so that they are started after or before any VMs with a medium priority. Alternatively, VMs can be excluded from the restart process altogether by using the VM Overrides to set them to disabled. This can be useful if you have non-critical VMs that do not need to be available, and it frees up resources for the more critical VMs.

The Isolation Response controls what happens if a host becomes disconnected from the cluster due to some network outage or configuration error. In this case the isolated host may well be able to communicate with the router, but not the other hosts. Alternatively, using what are called “datastore heartbeats”, vSphere can work out that the host may be disconnected from the network, but still connected to shared cluster resources. In such a case the host could still be running, and the VMs are unaffected. In this case the default policy would be to “Leave Powered On”. The alternative is to assume a failure has occurred and either power off and restart, or shut down the guest operating system and restart, on the remaining hosts.

Screen Shot 2014-09-17 at 14.52.04.png

Admission Control Policy

Screen Shot 2014-09-17 at 15.05.33.png

The Admission Control Policy controls how resources are reserved in the cluster – and whether VMs are powered on or not based on the resources left in the cluster. One policy allows for the definition of capacity by Static Number of Hosts. The spinner allows the SysAdmin to indicate how many hosts they feel they could comfortably lose but still maintain a good quality of service. This spinner can now be taken as high as 31. This is because the maximum number of vSphere hosts in a cluster is 32, which would allow logically for 31 failures, leaving just one host left over. As you might imagine it’s highly unlikely that one remaining node could take over from the loss of 31 servers. However, it’s more reasonable to suggest that in a 32 node cluster that is 50% loaded, a much higher number of physical servers could fail than the default of just 1.

By default the “slot” size is calculated based on the maximum reservations used for CPU/Memory. A reservation is expressed on a VM or resource pool as a guarantee of the given resources – rather than them being allocated on an on-demand basis. The idea of basing the “slot” size on these values is to try to guarantee that VMs are able to have their reservations allocated during power on. In some cases this dynamically calculated “slot” size isn’t appropriate for customers – as it can be skewed by a mix of very large and very small VMs. That can result in either very large slot sizes, which quickly reduce the number of VMs that can be powered on, or very small slot sizes, which are quickly consumed by a series of very large VMs. For this reason it is possible to modify the static number of hosts policy by specifying a Fixed Slot Size, expressed as CPU in MHz and Memory in MB. Additionally, the Calculate button can be used to see which VMs could potentially require more than one slot. This can be used to verify if the fixed slot size is appropriately set. Once calculated, the View link will show a list of VMs requiring more than one slot.
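To make the slot idea concrete, here is a deliberately simplified sketch. It assumes the slot is just the largest CPU and memory reservation among the VMs, and that under a fixed slot size a VM needs enough slots to cover the larger of its two reservations. It ignores details of the real vSphere HA calculation, such as memory overhead, and the VM names and figures are invented.

```python
import math


def dynamic_slot_size(vms):
    """Largest CPU (MHz) and memory (MB) reservation across the VMs."""
    return (
        max(vm["cpu_mhz"] for vm in vms),
        max(vm["mem_mb"] for vm in vms),
    )


def slots_needed(vm, slot_cpu_mhz, slot_mem_mb):
    """Slots one VM consumes under a fixed slot size (always at least one)."""
    return max(
        1,
        math.ceil(vm["cpu_mhz"] / slot_cpu_mhz),
        math.ceil(vm["mem_mb"] / slot_mem_mb),
    )


# Invented reservations: two small VMs and one very large one.
vms = [
    {"name": "small-1", "cpu_mhz": 500, "mem_mb": 1024},
    {"name": "small-2", "cpu_mhz": 500, "mem_mb": 2048},
    {"name": "large-1", "cpu_mhz": 4000, "mem_mb": 16384},
]

# One large VM drags the dynamic slot size up for everyone.
print(dynamic_slot_size(vms))  # (4000, 16384)

# With a fixed slot of 1000MHz / 4096MB, list the VMs needing several slots.
fixed = (1000, 4096)
multi = [vm["name"] for vm in vms if slots_needed(vm, *fixed) > 1]
print(multi)  # ['large-1']
```

The second print is roughly what the View link is for: it tells you which VMs spill over a single slot, so you can judge whether the fixed size is sensible.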

As an alternative to using a static number of hosts together with a slot size, vSphere provides the option to manage admission control by reserving a percentage of cluster resources. As you might gather, this involves reserving an amount of CPU or Memory as a proportion of the overall amount provided by the cluster. This can have a very similar effect to using a static number of hosts. For instance on a three node cluster, if 33% was reserved for failover, this would be similar to (but not the same as) indicating +1 redundancy. This method dispenses with slot sizes altogether, and has proved to be a popular way of configuring vSphere clustering.
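The 33% figure is just one host out of three. A rough sketch of the same sum for other cluster sizes follows. It assumes all hosts are the same size, which is my simplification, and the function name is made up.

```python
def percent_for_host_failures(hosts, failures=1):
    """Percentage of cluster capacity equal to `failures` identical hosts."""
    if not 0 < failures < hosts:
        raise ValueError("failures must be at least 1 and less than hosts")
    return 100.0 * failures / hosts


print(round(percent_for_host_failures(3), 1))  # 33.3
print(percent_for_host_failures(32))           # 3.125
print(percent_for_host_failures(32, 16))       # 50.0
```

As noted above, reserving this percentage is similar to, but not the same as, tolerating that number of host failures.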

SysAdmins are able to configure Dedicated Failover Hosts – in this case the specified hosts do not take a VM load, and are held in reserve ready for a vSphere host failure. Whilst this guarantees that the resources will be available, many customers find this an expensive option and would prefer to allow their hosts to take some kind of load, but manage the overall load with a reservation of resources.

Finally, Admission Control can be turned off by using Do not reserve capacity. This keeps vSphere HA running but doesn’t impose any restrictions on whether a VM can be failed over or powered on manually. Occasionally.

VM Monitoring

VM Monitoring is a sub-component of VMware HA, and is an optional feature. It can be used to inspect the state of virtual machines, and based on the result reboot them if they appear to have become unresponsive. By default VM Monitoring is disabled, and some customers prefer this because they are anxious about vSphere ‘getting it wrong’ and unnecessarily rebooting VMs. This is because VM Monitoring inspects the VMware Tools “Heartbeat Service” and uses a successive lack of responses to determine if a VM is stalled or not. Significant work has been undertaken by VMware to lessen this concern – so in conjunction with the heartbeat, VM Monitoring now inspects IO activity. The assumption is that if the heartbeat returns no answer AND no disk IO activity is taking place, there’s a good likelihood that the VM has halted with either a kernel panic in Linux or a Blue Screen of Death (BSOD) in Windows.

When VM Monitoring is enabled it comes with two options – VM Monitoring Only and VM and Application Monitoring. The first monitors the VM heartbeat and restarts the VM if no response is given within a specific time. VM and Application Monitoring checks for heartbeat signals from both VMware Tools and Applications/Services running within the guest operating system. This is called VMware AppHA; it requires a virtual appliance to be configured, leverages VMware’s Hyperic software inside the guest operating system, and offers support for a range of applications/services running in Enterprise environments. For simplicity we will cover “VM Monitoring” here, and cover VMware AppHA separately.

Screen Shot 2014-10-22 at 16.33.36.png

Monitoring Sensitivity comes with two options: a Preset level of sensitivity, chosen with a slider, or a Custom setting. The High preset counts a failure as no response within 30secs, and limits resets of the VM to three in any 60min period. As you move the slider bar from right to left VM Monitoring becomes increasingly conservative, and restarts are less likely to occur:

High (2) – No response in 30sec, 3 failures per 60min

Medium (1) – No response in 60sec, 3 failures in 24hrs

Low (0)- No response in 2mins, 3 failures in 7days

If these settings are too aggressive or too conservative – then the Custom setting allows the administrator control over these tolerances.
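The three presets can be written down as a small table. The sketch below is only a toy model of the decision, not how vSphere HA implements it: it reads the “3 failures” figure as a cap on resets within the period, which is my assumption, and the function and field names are invented.

```python
# Preset figures taken from the list above; the field names are made up.
PRESETS = {
    "high":   {"failure_interval_s": 30,  "window_s": 60 * 60,          "max_in_window": 3},
    "medium": {"failure_interval_s": 60,  "window_s": 24 * 60 * 60,     "max_in_window": 3},
    "low":    {"failure_interval_s": 120, "window_s": 7 * 24 * 60 * 60, "max_in_window": 3},
}


def should_reset(level, silent_for_s, previous_resets, now_s):
    """Toy decision: reset only if the VM has been silent long enough
    and the cap on resets within the window has not been reached."""
    preset = PRESETS[level]
    if silent_for_s < preset["failure_interval_s"]:
        return False
    recent = [t for t in previous_resets if now_s - t < preset["window_s"]]
    return len(recent) < preset["max_in_window"]


# 45 seconds of silence trips High but not Medium.
print(should_reset("high", 45, [], 0))                 # True
print(should_reset("medium", 45, [], 0))               # False
# Three resets in the last hour: High holds off until they age out.
print(should_reset("high", 45, [100, 200, 300], 400))  # False
print(should_reset("high", 45, [100, 200, 300], 4000)) # True
```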

Datastore Heartbeating

Alongside using network heartbeats to evaluate the availability of a vSphere host, vSphere HA can also use storage heartbeats to validate connectivity. The assumption is that if both network and storage heartbeats are unavailable, then it’s highly likely the host has suffered a catastrophic failure. This can be regarded as raising the conditions required to initiate the restart of the VMs on another host, and as another method of reducing the occurrence of split-brain. Datastore Heartbeating requires two or more datastores accessible to all the hosts in the cluster, and is a mandatory feature. Therefore if the hosts do not share datastores, or if HA is enabled before the storage configuration has completed, this is likely to generate a warning on the vSphere hosts.

Screen Shot 2014-10-22 at 17.00.28.png

By default vSphere HA will Automatically select datastores accessible from the host. In some cases this selection may not reflect your preference. For instance, if you work in a volatile lab environment where storage is temporarily mounted, VMs created and then destroyed, and the storage then unmounted, you may prefer to instead use datastores which you know will always be mounted to the cluster. For this reason it’s possible to Use datastores from the specified list or else Use datastores from the specified list and complement automatically if needed. This last option feels like a good compromise: it gives you control, whilst at the same time protecting the environment from situations where a datastore may be reconfigured or become unavailable.

In this case the policy was changed to ensure that the “Software” and “Current-Templates” datastore locations were selected as the datastore heartbeat preference.

Screen Shot 2014-10-22 at 17.14.19.png

Advanced Options

Advanced Options allows the administrator to supplement the vSphere HA configuration with additional parameters that control its functionality. A complete list of these options is available in VMware KB Article 2033250. Typically, these settings are modified in environments that present unique requirements or demands, generally around networking. These settings were recently updated with the March 2014 release of VMware Virtual SAN. vSphere HA can now leverage aspects of the Virtual SAN networking requirements as part of its internal logic.


Posted in BackToBasics


Back To Basics: Testing vSphere HA

31 Oct

There are a number of different ways to test if vSphere HA is working correctly. By far the most effective and realistic is to induce a failure of a physical vSphere host by powering it off. This can be done physically with the power button or by using the BMC/DRAC/ILO card. This test would require some powered on VMs. Powering off a vSphere host does not register immediately in the vCenter/Web Client UI, as the management system makes a number of re-tries to connect to the vSphere host in the event of a temporary network outage. So for tests you may wish to carry out a ping -t of the vSphere host that will be brought down and of a number of the VMs that are currently located on the host.

You can find out the IP address of a given VM by viewing its “Summary” page

Screen Shot 2014-09-17 at 12.57.01.png

In the example below, a ping -t was made of the vSphere host and the VM. Using the HP ILO interface, the host was forcibly and unceremoniously powered off. The older vSphere Client makes a better job of refreshing the management view to indicate the state of the vSphere host. You may need to refresh the Web Client in order to see these events.

It took about 60 seconds to generate a red alarm on the host, indicating there may be an issue. It was a further 80 seconds before the state of the vSphere host turned to “Not Responding”. This can also be an indication of some network disconnect caused by a fault in the network. It was at 90 seconds when the VMs that were running on the vSphere host were unregistered from it in vCenter, and instead registered to the other hosts in the cluster. Using a ping -t on a Windows 2012 R2 instance, it was 180 seconds before the operating system inside the VM began to respond to pings. In some cases you might prefer to use the command “ESXTOP” running on the hosts that remain in the cluster to watch the process of registering and powering on in a more real-time fashion.

Screen Shot 2014-09-17 at 13.13.45.png
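If you capture the ping results with timestamps, rather than watching them scroll past, the outage is easy to measure afterwards. A minimal sketch, assuming you already have a list of (seconds, reachable) samples in time order. The sample data is invented and the function name is my own.

```python
def outages(samples):
    """Turn time-ordered (seconds, reachable) samples into a list of
    (went_down_at, came_back_at) pairs; came_back_at is None if still down."""
    result = []
    down_since = None
    for t, reachable in samples:
        if not reachable and down_since is None:
            down_since = t                   # first missed ping of an outage
        elif reachable and down_since is not None:
            result.append((down_since, t))   # first reply after the outage
            down_since = None
    if down_since is not None:
        result.append((down_since, None))
    return result


# Invented samples for one VM during a host power-off test.
samples = [(0, True), (5, True), (10, False), (60, False), (180, False), (190, True)]

for went_down, came_back in outages(samples):
    print(went_down, came_back, came_back - went_down)  # 10 190 180
```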

The after effects of a vSphere HA event are very much dependent on other clustering settings. For instance if DRS is enabled in a fully-automated mode, when the lost vSphere host is returned to the cluster, then VMs would be automagically vMotion’d to the host. If DRS is not enabled in this fashion the VMs remain running on the remaining hosts, until such time as the SysAdmin moves them manually or accepts a recommendation for them to be moved.

If the vSphere Web Client is refreshed then status information will display as can be seen below:

Screen Shot 2014-09-17 at 13.45.34.png

The health status of the cluster can be viewed from the Monitor tab in the vSphere Web Client. This can be used to monitor the availability of the cluster, whether network isolation has taken place, and also the total number of “slots” available in the cluster.

Screen Shot 2014-09-17 at 13.52.29.png


Posted in BackToBasics


FUD Wars…

30 Oct

This week I caught up on a new(ish) podcast which is done through the medium of Google+ Hangouts. I enjoyed it immensely, and that was in no small measure down to it containing many people who I know from the community (Josh Atwell, Amy Lewis et al).

I had hoped to tune in live but I was elsewhere. I forget what I was doing at the time, but most likely I was at choir or at rehearsals for a local play I’m in at the end of the week (not a starring role, just a walk on part – I play the 6th soldier who slopes on embarrassed at the back!).

This week the guys talked about FUD, and the various backbiting and unpleasantness that’s circulating online – often generated by folks in the pay of a vendor. This is a bit of a bubble I guess – with a lot of people like me active on social media. So, a bit like any bubble – The Beltway, Westminster and so on – it might only be of interest to folks working in that field. There are a couple of choice examples of where things have “turned personal”, and drifted into elements of “mud slinging”. My heart always sinks when I see this – it reminds me of the trolls on Facebook and Twitter who deliberately go out of their way to be unpleasant or cruel to someone. It is a very public display of the worst aspects of humanity. I don’t really like to be reminded of how horrible humans can be to one another. I see it enough already on the evening news, without wanting to witness it amongst my peers. And no, I’m not going to do a link-o-rama to those posts. Why feed the trolls any more than they feed themselves? Right? It’s tempting to repost, but it feels like the social-media equivalent of slowing down on the highway to gawp at a car crash.

Anyway, I wanted to add my own pennyworth to the debate generally.

In many ways I feel a bit of an interloper. I spent the 90s as an employee of a UK-based training company. The 00s were spent as an independent freelance trainer (with some minor consultancy gigs). The late 00s saw me be a tech journalist at TechTarget for a 2-year stint. And then for the 1st time in my life I joined a vendor – VMware. I ended up in the competition team at VMware, before moving to the EVO:RAIL team about 7 weeks ago. Incidentally, there isn’t much relationship between the two roles. It’s helpful to have links back to the Competition Team, but this is a net-new role for me in a kind of Tech Marketing position. So the reason I feel a bit of an interloper is that for the vast majority of my career I’ve not been on the vendor side.

What I think is interesting about this is that I would have thought a majority of the people doing my sort of role are former customers, or worked out of the partner/channel. For various reasons these folks’ personal star took off, and they were picked up by a vendor because they came with a ready-made audience. A couple of years ago, an impression was built up that all you needed was a blog and a couple of K of Twitter followers – and you could line yourself up a cushy job with a big company or start-up. I think that’s a mistaken perception. It took me a good 18 months to find a role with a vendor – and I wasn’t just looking at VMware at the time. My reputation in the community was a door opener – nothing more. If you have a social-media/blogging reputation it’s only going to carry you so far up the corporate ladder until you meet more and more people who go “RTFM, who?”. My point here is a simple one – I’m personally uncomfortable with the vRockstar title. It can distort people’s sense of reality and perspective, and can make them jumped-up “Don’t you know who I am?” types. As ever, I’m going off topic.

I think the reason so many of us (who work in this aspect of the industry – outbound, social, tech-marketing types) feel uncomfortable with recent developments is that many of us aren’t corporate types who went straight from college into some Uber-Corporate machine, and worked our way up the ladder. So fanboyism, FUD and aggressively competitive activity are something most of us feel is a bit icky. Nonetheless, customers and audiences will always want us to compare our tech with the alternatives. That’s understandable because customers want to know the differences – and there is precious little truly independent analysis (by which I mean non-vendor-affiliated, and free from personal ‘axe-to-grind’ bias). Independents are often attacked by vendors for their lack of hands-on knowledge – whilst those same vendors do nothing to help them get their hands on the products.

So the question is how do you write about your technology as compared to a competitor’s technology – whilst avoiding FUD or being accused of FUD? The answer is with great difficulty, because whatever you do, you can be accused of FUD or fanboyism by others. I think there’s only one way to do it. If you can – get hands-on. There were many times in my previous role where I felt the urge to write something about a competitive product. Most of my content was made for internal consumption only, but there were times I wished I could just blog about my findings. In the end I didn’t. Why? Well, because my blog content is usually known for being practical and hands-on. Generally, I think I’ve built my reputation on helping people – heck, I even got comments from Microsoft customers for helping them with their SCVMM deployments. I’m not sure if that’s an outcome that my employer was expecting. But in a way it made me smile. Despite being critical of aspects of Microsoft’s technology, I wound up writing stuff that helped one of its customers have a better experience. In the great yin-yang, instant-karma measure of life, that seems more valuable to me than putting the boot into Microsoft.

For me the FUD debate really boils down to you as a person. Do you have ethics? Are you a nice person? Can you engage with people who disagree with you, with decorum and politeness? If you can, then you should be able to talk about the advantages of your technology over another without it becoming a slanging match. If not, you will find yourself descending into personal attacks and defamation. That behaviour will not, and does not, enhance your reputation and standing within the community. It’s a huge turn-off. Why would you want to turn off the audience who listens to you?


Posted in Announcements


EVO:RAIL VDI Scalability Reference

28 Oct

This blog post is really a short advertisement for someone else’s blog. When I was last at the Bristol (South-West) VMUG in the UK, my former co-author on the EUC book (Barry Coombs) asked me a very pertinent question. Being EUC focused, Barry was keen to see whitepapers and performance analysis that could be used to demonstrate the scalability claims made for EVO:RAIL. Of course, Barry is specifically focused in this case on Horizon View as an example. But the demand is one that I would expect to see across the board for general server consolidation, virtual desktops and specific application types. Just to give you an idea of the publicly stated scalability numbers, this chart is a handy reminder:

Screen Shot 2014-10-23 at 15.35.53

At the time I pointed out that there is plenty of Virtual SAN performance data in the public domain. A really good example is the recent study that was undertaken to benchmark Microsoft Exchange mailboxes on Virtual SAN as well as posts about performance for Microsoft SQL. I must say that both Wade Holmes and Rawlinson Rivera are doing some stellar work in this area.

Great though that is, Barry made what I think is an important point. EVO:RAIL represents quite a prescriptive deployment of Virtual SAN with respect to the hardware used, and the amounts and proportions of HDD to SSD. From his perspective as an architect he needs to be able to point to and justify the selection of any given platform. He needs to be able to judge how much performance a given deployment will deliver per appliance – and then demonstrate that the system will deliver that performance. It’s worth stating what those HDD/SSD amounts/proportions are again, just in case you aren’t familiar.

Each server in the EVO:RAIL 2U enclosure (4 servers per enclosure) has 192GB RAM allocated to it, two pCPUs with 6 cores each, and 1 SSD (400GB) for read/write cache in the Virtual SAN together with 3 SAS HDDs. For the WHOLE appliance that works out at 14.4TB of raw HDD capacity and 1.6TB of SSD. It’s important to remember that Virtual SAN “Failures to Tolerate” is set to 1 by default – this means that for every VM created, a copy is created elsewhere in the cluster. The result is that 14.4TB of raw storage becomes about 6.5TB usable. If you look at these numbers you will see that around 10% of the storage available is SSD based, which largely reflects the best practices surrounding Virtual SAN implementations.
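The arithmetic behind those figures can be sketched in a few lines of Python. This is only a back-of-envelope check using the numbers quoted above; the size of an individual HDD is not stated in this post, so it is inferred here from the 14.4TB total. Mirroring alone (Failures to Tolerate = 1) halves the raw capacity to 7.2TB, and the roughly 6.5TB usable figure sits below that, presumably because of further overheads that are not broken down here.

```python
# Figures quoted in the post for one EVO:RAIL appliance.
NODES_PER_APPLIANCE = 4
HDDS_PER_NODE = 3
SSD_GB_PER_NODE = 400
RAW_HDD_GB = 14_400        # 14.4TB of raw HDD capacity per appliance
FAILURES_TO_TOLERATE = 1   # Virtual SAN default mentioned in the post

# Inferred, not stated in the post: the capacity of a single HDD.
hdd_gb_each = RAW_HDD_GB // (NODES_PER_APPLIANCE * HDDS_PER_NODE)

# One SSD per node gives the appliance-wide cache tier.
ssd_gb_total = NODES_PER_APPLIANCE * SSD_GB_PER_NODE

# Share of all storage in the appliance (HDD + SSD) that is SSD.
ssd_share = ssd_gb_total / (RAW_HDD_GB + ssd_gb_total)

# With FTT=1 each object is stored twice, so mirroring alone halves raw capacity.
copies = FAILURES_TO_TOLERATE + 1
mirrored_gb = RAW_HDD_GB // copies

print(f"HDD size (inferred): {hdd_gb_each} GB")
print(f"SSD total:           {ssd_gb_total} GB")
print(f"SSD share:           {ssd_share:.0%}")
print(f"After mirroring:     {mirrored_gb} GB (upper bound before other overheads)")
```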

So it’s with great pleasure I can say that the EUC team has been doing some independent validation of the EVO:RAIL platform specifically for Horizon View. The main takeaway – our initial claim of 250 virtual desktop VMs – holds true so long as the appliance isn’t housing other VMs at the same time. Basically, the EUC team tested a configuration where the appliance is dedicated to just running the virtual desktops, and the “virtual desktop infrastructure” components (the Connection server/Security server) are running elsewhere. The other configuration they tested was a more “VDI-in-a-box” configuration where both the virtual desktops AND the Horizon View server services were contained in the same appliance. As you might suspect, the number of supported virtual desktops then comes down to about 200. Remember, however, that additional EVO:RAIL appliances can be added to go beyond this per-appliance figure and support up to 1000 virtual desktop instances. As the chart above indicates, the assumption is that all the virtual desktops are the same and are configured with 2 vCPUs, 2GB RAM and a 30GB virtual disk.
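As a rough sizing aid, the per-appliance figures above can be turned into a count of appliances. This is just a sketch: the function name is made up for illustration, the 200 figure is the approximate one quoted above, and it assumes every desktop matches the 2 vCPU / 2GB / 30GB profile.

```python
import math

DESKTOPS_DEDICATED = 250   # per appliance, View infrastructure hosted elsewhere
DESKTOPS_ALL_IN_ONE = 200  # per appliance, "VDI-in-a-box" (approximate)

def appliances_needed(desktops: int, per_appliance: int) -> int:
    """Round up, because part of an appliance cannot be deployed."""
    if desktops < 0 or per_appliance <= 0:
        raise ValueError("desktops must be >= 0 and per_appliance > 0")
    return math.ceil(desktops / per_appliance)

print(appliances_needed(600, DESKTOPS_DEDICATED))
print(appliances_needed(1000, DESKTOPS_DEDICATED))
```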

One question that occasionally comes up is the ratio of VMs per node. Sometimes people think that the ratio is a bit small. But it’s important to remember that we need to factor resource management into any cluster – and work on the assumption of what resources would be available if a server is in maintenance mode for patch management OR if you actually have a physically failed server. As ever it’s best to err on the side of caution, and build an infrastructure that accommodates at least N+1 availability – rather than being over-optimistic by assuming all things run all the time without a problem…

For further information about the VMware EUC Team’s work with EVO:RAIL follow this link:
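To see why the per-node ratio looks conservative, it helps to work out the load each node carries when one node is out of service. A minimal sketch, assuming desktops spread evenly across whatever nodes remain (the function name is hypothetical):

```python
def vms_per_node(total_vms: int, nodes: int, nodes_unavailable: int = 0) -> float:
    """Average VMs each remaining node carries when some nodes are out of service."""
    remaining = nodes - nodes_unavailable
    if remaining <= 0:
        raise ValueError("no nodes left to run VMs")
    return total_vms / remaining

# One appliance: 4 nodes, 250 desktops.
print(vms_per_node(250, 4))                       # all nodes healthy
print(vms_per_node(250, 4, nodes_unavailable=1))  # one node failed or in maintenance mode
```

With all four nodes up, each carries 62.5 desktops on average; lose one and the survivors must each absorb roughly 83, which is the number the sizing really has to accommodate.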


Posted in EVO:RAIL


Back To Basics: Introduction: vSphere High Availability

27 Oct

HA Overview

The primary role of High Availability (HA) in a vSphere environment is to restart VMs if a vSphere host experiences a catastrophic failure. This could be caused by any number of issues, such as a power outage or the failure of multiple hardware components, such that operation of the VM is impacted. VMware HA is one of a number of “clustering” technologies – including Distributed Resource Scheduler (DRS) and Distributed Power Management (DPM) – that gather the individual resources of physical servers and represent them as a logical pool of resources that can be used to run virtual machines. Once the clustering technologies are enabled, administrators are liberated from the constraints of the physical world, and the focus is less on the capabilities of an individual physical server and more on the capacity and utilization of the cluster. HA is not the only availability technology available – once it is enabled, administrators have the option to enable “Fault Tolerance” (FT) on selected VMs that benefit from its features. In order for FT to be enabled, HA must be enabled too.

In recent versions of HA, more focus has been placed on the availability of the VM generally – and so it is now possible to inspect the state of the VM itself, and to restart it, based on monitoring services within the guest operating system. The assumption is that if the core VMware services that run inside the guest operating system have stopped, this is likely to be a good indication that the VM has a serious issue and end-users have already been disconnected.

In terms of configuration, VMware HA shares many of the same prerequisites as VMware vMotion, such as shared storage, access to consistently named networks and so on. Because the VM is restarted rather than live-migrated, there is no specific requirement for matching CPUs, although in reality the CPUs often match anyway because of vMotion and DRS.

Under the covers vSphere HA has a Master/Slave model where the first vSphere host to join the cluster becomes the “master”. If the master becomes unavailable an election process is used to select a new master. In a simple configuration vSphere HA uses the concept of the “slot” to calculate the free resources available for new VMs to be created and join the cluster. The “slot” is calculated by working out the VM’s size in terms of memory and CPU resources. When all the slots have been used, no more VMs can be powered on. The concept is used to stop a cluster becoming over-saturated with VMs, and to stop the failure of one or more hosts from degrading overall performance by allowing too many VMs to run on too few servers.
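The slot idea is easier to follow with numbers. What follows is a deliberately simplified model, not the actual HA algorithm: it sizes a slot from the largest CPU and memory demand among the VMs, then counts how many such slots fit on a host. The class and function names are made up for illustration, and my understanding is that the real calculation works from reservations plus overhead rather than configured size.

```python
from dataclasses import dataclass

@dataclass
class VM:
    cpu_mhz: int  # CPU demand used for slot sizing (illustrative)
    mem_mb: int   # memory demand used for slot sizing (illustrative)

@dataclass
class Host:
    cpu_mhz: int  # total CPU capacity
    mem_mb: int   # total memory capacity

def slot_size(vms: list[VM]) -> tuple[int, int]:
    """A slot must be big enough for the most demanding VM on each axis."""
    return max(vm.cpu_mhz for vm in vms), max(vm.mem_mb for vm in vms)

def slots_on_host(host: Host, slot: tuple[int, int]) -> int:
    """Whichever resource runs out first limits the slot count."""
    slot_cpu, slot_mem = slot
    return min(host.cpu_mhz // slot_cpu, host.mem_mb // slot_mem)

vms = [VM(500, 1024), VM(1000, 2048), VM(250, 4096)]
slot = slot_size(vms)
host = Host(cpu_mhz=24_000, mem_mb=196_608)
print(slot, slots_on_host(host, slot))
```

Notice how a single large VM inflates the slot for everyone, which is one reason a cluster can report fewer slots than you might expect.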

HA and Resource Management

If you lose a vSphere host, the cluster simultaneously loses that host’s contribution of CPU/Memory resources and, in the case of Virtual SAN, its contribution of storage as well. For this reason planning needs to be conducted to work out what “reserve” of resources the cluster will need to accommodate failures. In more classical designs this can be expressed as N+1 or N+2 redundancy, where the number of hosts required to deliver acceptable performance is N, and then we factor in additional hosts for either maintenance windows or failures. Related to this is the concept of “Admission Control”, which is the logic that either allows or denies power-on events. As you might gather, it makes no sense in a 32-node cluster to attempt to power on VMs when only one vSphere host is running. Admission control stops failures generating more failures and decreasing the performance of the cluster through cascading failures affecting the whole cluster. For instance, if redundancy was set at +2, VMware HA would allow two vSphere hosts to fail, and would restart VMs on the remaining nodes in the cluster. However, if a third vSphere host failed, the setting of +2 would stop VMs being restarted on the remaining hosts.

VMware HA has a number of ways of expressing this reservation of resources for failover. It is possible to use classical +1, +2, and so on redundancy to indicate the tolerated loss of vSphere hosts and the resources they provide. Additionally, it’s possible to break free from the constraints of the physical world and express this reservation as a percentage of CPU/Memory resources to be reserved for the failover process. Finally, it’s possible to indicate a dedicated host that is used for failover, in a classical active/standby approach.
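A host-count style of admission control can be sketched as a worst-case check: pretend the tolerated number of hosts has already failed, and only allow a power-on if the VM would still fit on what is left. This is an illustration under stated assumptions rather than the product’s actual logic; the function name is hypothetical, capacity is counted in slots, and the hosts assumed to fail are the ones with the most slots.

```python
def can_power_on(slots_per_host: list[int], used_slots: int,
                 host_failures_tolerated: int) -> bool:
    """Allow a power-on only if it still fits after the worst-case host losses."""
    # Worst case: the hosts contributing the most slots are the ones that fail.
    keep = max(len(slots_per_host) - host_failures_tolerated, 0)
    surviving_slots = sum(sorted(slots_per_host)[:keep])
    return used_slots + 1 <= surviving_slots

cluster = [24, 24, 24, 24]  # four hosts, 24 slots each

print(can_power_on(cluster, used_slots=71, host_failures_tolerated=1))
print(can_power_on(cluster, used_slots=72, host_failures_tolerated=1))
```

With +1 redundancy the four-host cluster is treated as if it only had 72 slots, so the 73rd VM is refused even though 96 slots physically exist.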
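The percentage-based approach can be sketched the same way. Here the check is whether, after admitting the new VM, the unreserved share of both CPU and memory would still be at least the configured percentage. Again this is a simplified illustration with made-up function names, not the exact formula HA uses.

```python
def has_headroom(total: int, reserved: int, request: int, reserve_pct: int) -> bool:
    """True if (total - reserved - request) is still at least reserve_pct% of total."""
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return (total - reserved - request) * 100 >= reserve_pct * total

def admit(total_cpu_mhz: int, reserved_cpu_mhz: int, vm_cpu_mhz: int,
          total_mem_mb: int, reserved_mem_mb: int, vm_mem_mb: int,
          reserve_pct: int) -> bool:
    """Both CPU and memory must keep the failover reserve intact."""
    return (has_headroom(total_cpu_mhz, reserved_cpu_mhz, vm_cpu_mhz, reserve_pct)
            and has_headroom(total_mem_mb, reserved_mem_mb, vm_mem_mb, reserve_pct))

# Cluster totals and current reservations are illustrative numbers only.
small_vm_ok = admit(96_000, 60_000, 2_000, 786_432, 500_000, 8_192, reserve_pct=25)
large_vm_ok = admit(96_000, 60_000, 2_000, 786_432, 500_000, 100_000, reserve_pct=25)
print(small_vm_ok, large_vm_ok)
```

The second VM is refused on memory alone: admitting it would leave less than 25% of cluster memory unreserved, even though CPU headroom is fine.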

Split-Brain and Isolation

Split-brain and isolation are terms that both relate to how clustering systems work out that a failure has occurred. For example, a host could be unreachable merely because the network used to communicate from host to host in the cluster has a failure – typically this is the “Management” network, whose address resolves to the vSphere host’s FQDN. For this reason it’s really a requirement of HA that the network has maximum redundancy to prevent split-brain from occurring – a situation where the clustering system loses integrity and it becomes impossible to decide which systems are running acceptably or not. There are a couple of different ways of ensuring this, which were covered earlier in the networking segments. For example, a Standard Switch could be configured with two vmnics, and those vmnics (0 and 1) could be patched into different physical switches. This would ensure that false failovers don’t occur simply because of a switch failure or network card failure. As with all redundancy, an ounce of prevention is worth a pound of cure – and it’s best to configure an HA cluster with maximum network redundancy to stop unwanted failovers occurring due to simple network outages.

With that said, HA does come with “isolation” settings which allow you to control what happens should network isolation take place. The HA agent checks external network devices such as routers to work out whether a failure has taken place or merely network isolation has occurred. VMware HA also checks to see if access to external storage is still valid. Using these checks the HA agent can correctly work out whether a failure or network isolation has taken place. Finally, VMware HA has per-VM settings that control what happens should network isolation take place. By default network isolation is treated as if the host has physically stopped functioning – and VMs are restarted. However, using per-VM controls it’s possible to override this behaviour if necessary. For the most part many customers don’t worry about these settings, as they have delivered plenty of network redundancy to the physical host.
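The way those checks combine can be pictured as a small decision table. This is a toy model for illustration only; the state names, function name and inputs are assumptions, and the real HA agent is considerably more involved. The idea is simply that any sign of life on the network means the host is connected, storage access without network means isolation, and no sign of life anywhere means the host has failed.

```python
from enum import Enum

class HostState(Enum):
    CONNECTED = "connected"
    ISOLATED = "isolated"
    FAILED = "failed"

def classify_host(peer_heartbeat: bool, router_reachable: bool,
                  storage_reachable: bool) -> HostState:
    """Toy decision table combining the network and storage checks."""
    if peer_heartbeat or router_reachable:
        # Some network path still works, so this is not isolation.
        return HostState.CONNECTED
    if storage_reachable:
        # No network at all, but the host can still reach storage: it is alive but cut off.
        return HostState.ISOLATED
    # Nothing responds on any channel: treat the host as down.
    return HostState.FAILED

print(classify_host(peer_heartbeat=False, router_reachable=False, storage_reachable=True))
```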

Managing VM High Availability

Creating a vSphere HA Cluster

Enabling VMware HA starts with creating a “cluster” in the datacenter that contains the vSphere hosts.

1. Right-click the Datacenter, and select New Cluster

2. In the name field type the name of the cluster. The name can reflect the purpose of the cluster, for instance a cluster for virtual desktops. Increasingly, sysadmins prefer to classify their clusters by their relative capabilities such as Gold, Silver, Bronze and so on. Additionally, clusters can be created with the sole purpose of running the vSphere infrastructure – companies often refer to these as “Management Clusters”. Those with experience generally turn on all the core vSphere clustering features including DRS and EVC.

3. Enable the option Turn On next to vSphere HA

Screen Shot 2014-09-16 at 11.23.20.png

Note: This dialog box only shows a subset of the options available once the cluster has been created. For instance, the full cluster settings allow for adjustments associated with the “slot” size of a VM, as well as the optional Active/Passive or Active/Standby configuration.

The option to Enable host monitoring is used to allow vSphere hosts to check each other’s state. This checks to see if a vSphere host is down or isolated from the network. The option can be temporarily turned off if it’s felt that network maintenance may contribute to false and unwanted failovers.

Enable Admission Control can be set to use a simple count of vSphere hosts to achieve +1, +2 redundancy. Incidentally, this spinner can currently only be increased to a maximum of 31. Alternatively, the administrator can switch admission control to use a percentage of CPU/Memory to represent the reserve of resources held back to accommodate failover. Finally, Admission Control can be turned off entirely. This will allow failovers to carry on even when there are insufficient resources to power on the VM and achieve acceptable performance. This isn’t really recommended, but may be required in circumstances where a business-critical application must be available, even if it offers degraded performance. In this situation the business is prepared to accept degraded service levels rather than no service at all. In an ideal world, there should be plenty of resources to accommodate the loss of physical servers.

VM Monitoring can be used to track the state of VMs. It can be turned on at the entire cluster level with certain VMs excluded as needed, or alternatively it can be enabled on a per-VM basis.

Adding Multiple vSphere hosts to a HA Enabled Cluster

Once the cluster has been created, vSphere hosts can be added by using drag-and-drop. However, you may find it easier to use “Add Host” for new hosts that need to be joined to the cluster, or “Move Hosts” for vSphere hosts that have already been added to vCenter.

Screen Shot 2014-09-16 at 12.09.28.png

If the Move Hosts option is used then multiple vSphere hosts can be added to the cluster at once. During this time the HA agent is installed and enabled on each host – this can take some time.

Screen Shot 2014-09-16 at 12.12.52.png

Once the cluster has been created the Summary screen will show basic details such as:

  • Number of vSphere hosts
  • Total/Used CPU/Memory
  • Simple HA Configuration
  • Cluster consumers (Resource Pools, vApps and VMs)

Screen Shot 2014-09-17 at 12.35.34.png


Posted in BackToBasics


I’m on telly…

24 Oct

Well, actually I was on real telly just a couple of weeks ago in the BBC-TV programme called “Marvellous”. If you squint, and look at the back row of the choir, you might see me opening my big fat gob (nothing changes there, Mike, I hear you all say!). Last week I had a more close-up opportunity when I was interviewed on VMworld TV by the one, the only and the legendary Eric Sloof. Here Eric quizzed me about my move into the EVO:RAIL team, The EVO:RAIL Challenge and my previous life as a freelance VMware Certified Instructor (VCI). Enjoy!


Posted in EVO:RAIL


My VMUG – November

24 Oct

Screen Shot 2014-10-24 at 10.05.41

I’m attending three VMUGs this November and at each one I’ll be squawking about EVO:RAIL. I’m hoping to be able to pull together a “VMUG” version of the EVO:RAIL content, one that dispenses with the corporate deck, and helps me put across my own viewpoint. That’s very much dependent on what time I have over the coming weeks. I’m super busy at the moment finishing up a new version of the EVO:RAIL HOL, as well as some internal work I have to do to help our partners and our own staff get up to speed.

Here’s my itinerary:

UK National VMUG User Conference
Tuesday 18 November 2014
National Motorcycle Museum
Coventry Road Bickenhill
Solihull, West Midlands B92 0EJ
Agenda & Registration

Once again this event will have a vCurry night with a vQuiz. I’m pleased to say my wife, Carmel, will be at the vCurry night too!

21st VMUGBE+ Meeting (Antwerp)
Friday 21st November 2014
Filip Williotstraat 9
2600 Berchem
Agenda & Registration

Again, Carmel will be joining me on the trip – although she will be discovering the delights of old Antwerp. After the VMUG is done she and I will be spending the weekend in Bruges. A place we always wanted to visit – and we hope to get to the Menin Gate to pay our respects.

And finally, I will be crossing the border to Scotland for the Edinburgh VMUG too!

Scotland User Group (Edinburgh)
Thursday 27th November 2014
The Carlton Hotel
19 North Bridge
Old Town
City of Edinburgh EH1 1SD
Agenda (TBA) and Registration


Posted in VMUG