Foreman ESXi Installation Never Completes

In our environment we want to use Foreman to build our OCP hardware. It was working well except that the installation would repeatedly PXE boot to reinstall because Foreman would not know that the PXE portion was complete.

The standard notification code below was failing for some reason.

%post --interpreter=busybox --ignorefailure=true 
wget -O /dev/null <%= foreman_url %>

After running the unattended install in a Fusion VM so I could have console access, I reviewed the install log and found a misleading message from wget about the URL being bad. After a little digging I determined that the problem was that there was no DNS resolution in the ESXi installer environment.

Per the documentation this is expected.
Deploying ESXi 5.x using the Scripted Install feature (2004582)

Note: When ESXi 5.x is being installed using a kickstart file via PXE with DHCP enabled, the DNS settings are not available and are not saved in the configuration file.

Following what I have seen in previous installation scripts, I implemented a quick fix in the Foreman OS template that adds DNS settings to the installer environment so the callback works and the build proceeds as expected.

%post --interpreter=busybox --ignorefailure=true 
# Add temporary DNS resolution so the foreman call works
echo "nameserver <%= @host.subnet.dns_primary %>" >> /etc/resolv.conf
echo "nameserver <%= @host.subnet.dns_secondary %>" >> /etc/resolv.conf
wget -O /dev/null <%= foreman_url %>
echo "Done with Foreman call"

Problems with intuitive availability calculations

I often see or hear comments from people whose common-sense approach to availability design includes eliminating a single point of failure. This is a great goal when required, but care must be taken that the "common sense" approach actually achieves what is required. I have found that using a fault tree approach to evaluate design decisions can be insightful.

The last instance of this I ran across was a book that, while otherwise great, gave the advice of separating the database from the application function onto separate servers because the single server was a single point of failure. This advice works when the components are unrelated, but not at all when they are dependent.

Here is an example analysis of an application on hardware that is 99% available.

If we follow the advice of separating concerns to eliminate the single point of failure we end up with this picture.

Unfortunately, given the hardware availability, we have reduced the availability of the service. With a diagram it becomes apparent that the entire service is now at risk of failing when either of the two servers fails. The same incorrect thought process also leads to the idea that consolidating applications onto fewer virtualized servers always lowers availability.
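The fault-tree arithmetic behind that picture can be sketched in a few lines. This is a minimal illustration, assuming independent failures and the 99% per-server figure used above; the function name is mine, not from any library:

```python
def serial_availability(*components):
    """Dependent (serial) components: every one must be up for the service to be up,
    so the availabilities multiply."""
    result = 1.0
    for a in components:
        result *= a
    return result

# App and DB together on one 99% server:
single_server = 0.99

# App and DB split onto two separate 99% servers (still dependent on each other):
split_servers = serial_availability(0.99, 0.99)

print(f"{single_server:.4f}")   # 0.9900
print(f"{split_servers:.4f}")   # 0.9801 -- splitting made availability worse
```

Multiplying two numbers below 1 always yields a smaller number, which is why separating dependent components can only reduce availability.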

None of this is to say that consolidation always provides higher availability. Here is an example of using a software stack that allows multiple servers to serve the same purpose. The analysis is not exact, as it leaves out the increased likelihood of software failure or human error, but in the simple case of hardware availability you can see a definite improvement.
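The redundant case follows the complementary rule: the service fails only if every server fails. A minimal sketch, again assuming independent failures and 99% per server, with a function name of my own choosing:

```python
def parallel_availability(*components):
    """Redundant (parallel) components: the service is down only when
    every component is down, so the unavailabilities multiply."""
    all_down = 1.0
    for a in components:
        all_down *= (1.0 - a)
    return 1.0 - all_down

# Two interchangeable 99% servers behind a stack that can use either:
two_redundant = parallel_availability(0.99, 0.99)

print(f"{two_redundant:.4f}")   # 0.9999 -- better than a single 99% server
```

Each added redundant server multiplies the remaining downtime probability by another 0.01, which is where the "more nines" intuition comes from.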