System Administration

Red Hat CloudForms/ManageIQ – Examples

For me, one of the easiest ways to start learning something is by example. Even if the example does not solve my exact problem, I can use it to figure out ways of extracting the information I need and patterns for implementing new functionality. As a response to a forum question regarding provisioning approvals, Kevin Morey pointed out a GitHub repository that I was unaware of (https://github.com/ramrexx/CloudFormsPOC). This repo is a great source of example code to get you started exploring more advanced usage of CloudForms.

I would suggest that reading the docs and playing with the code enough to feel comfortable with it are critically important if you plan to use CloudForms or ManageIQ, but this repository is a great reference for seeing one way to solve a given problem.

Foreman ESXi Installation Never Completes

In our environment we want to use Foreman to build our OCP hardware. It was working well except that the hosts would repeatedly PXE boot and reinstall because Foreman never learned that the PXE portion of the build was complete.

The standard notification code below was failing for some reason.

%post --interpreter=busybox --ignorefailure=true 
wget -O /dev/null <%= foreman_url %>

After running the unattended install in a Fusion VM so I could have console access, I reviewed the install log and found a misleading message from wget about the URL being bad. After a little digging I determined that the problem was that DNS was not available in the ESXi installer environment.

Per the documentation this is expected. From the VMware KB article Deploying ESXi 5.x using the Scripted Install feature (2004582):

Note: When ESXi 5.x is being installed using a kickstart file via PXE with DHCP enabled; the DNS settings are not available and are not saved in the configuration file.

Following what I have seen in previous installation scripts, I implemented a quick fix in the Foreman OS template to add DNS settings to the installer environment so that the call works and the build proceeds as expected.

%post --interpreter=busybox --ignorefailure=true 
# Add temporary DNS resolution so the foreman call works
echo "nameserver <%= @host.subnet.dns_primary %>" >> /etc/resolv.conf
echo "nameserver <%= @host.subnet.dns_secondary %>" >> /etc/resolv.conf
wget -O /dev/null <%= foreman_url %>
echo "Done with Foreman call"

OpenZIS installation guide

I have spent a day or so trying to get the OpenZIS project up and running on a CentOS machine. A lack of updated instructions and an older code base made this process harder than I expected.

While I am still investigating if OpenZIS will meet my needs, I wanted to publish what I did as a reference to others who might try to install it. The doc is in my forked version of OpenZIS. Once I have tested things more, I will likely submit a pull request to the author.

https://github.com/ewannema/OpenZIS/blob/master/INSTALL_CENTOS_6.md

Problems with intuitive availability calculations

I often see and hear comments from people whose common sense approach to availability design includes eliminating a single point of failure. This is a great goal when required, but care must be taken that the “common sense” approach actually achieves what is required. I have found that using a fault tree approach to evaluating design decisions can be insightful.

The most recent instance of this was a book that, while otherwise great, gave the advice of moving the database off of the application server because keeping them together was a single point of failure. This advice works when the components are independent, but not at all when one depends on the other.

Here is an example analysis of an application on hardware that is 99% available.

If we follow the advice of separating concerns to eliminate the single point of failure we end up with this picture.

Unfortunately, based on the hardware availability, we have reduced the availability of the service. With a diagram it becomes apparent that the entire service is now at risk of failing when either of the two servers fails. This incorrect line of thinking also leads to the idea that consolidating an application onto fewer virtualized servers always lowers availability.

None of this is to say that consolidation always provides higher availability. Here is an example of using a software stack that allows multiple servers to serve the same purpose. The analysis is not exact, as it leaves out the increased likelihood of software failure or human error, but in the simple case of hardware availability you can see a definite improvement.
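
To make the numbers concrete, here is a quick PowerShell sketch of both cases using the 99% hardware figure from above (the figures are illustrative only, not measurements from a real environment):

# Two dependent servers (app on one, database on the other): both must be up
$hw = 0.99
$serial = $hw * $hw
"Two dependent servers: {0:P2}" -f $serial      # 98.01% - lower than a single 99% server

# Two redundant servers doing the same job: only one needs to be up
$parallel = 1 - ((1 - $hw) * (1 - $hw))
"Two redundant servers: {0:P2}" -f $parallel    # 99.99% - higher than a single 99% server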

Quick Memory Allocation for Limit / Reservation Testing in VMware

I was testing the impact and behavior of memory limits and reservations, along with the balloon driver, and needed a quick way to allocate memory from a user program.

PowerShell to the rescue. This isn’t a good method for exact memory allocation, but you can consume MBs to GBs of memory pretty quickly.

# Allocate memory by creating a large string. .NET strings use two bytes
# per character, so a string of N characters consumes roughly 2*N bytes.
# The expression below builds a 64M-character string, or roughly 128 MB.
# Make sure to assign the result to a variable, otherwise the memory will
# be reclaimed by the .NET garbage collector.
$a = "a" * (256MB / 4)

Quick and dirty PowerPath/VE output parser

Here is a quick and dirty PowerShell script to parse the output of the rpowermt command used to manage PowerPath/VE on ESX (for EMC arrays). I wrote the code quickly to solve a particular problem, but if I get requests I could extend it to parse the rest of the output in a more robust fashion.

Why do this? I am in the middle of migrations between frames and needed an easy way to determine what had to be moved when the storage admin said to move all LUNs from CLARiiON XXXX to the new LUNs on YYYY – and make sure to move LUN 41 to LUN 88.

By parsing the output of rpowermt I could more easily determine the ID of the LUNs they indicated and map that to the extent properties of my datastores to determine what needed to move.

Two-step usage:

  1. rpowermt host=esx001 display dev=all > esx001_lundata.txt
  2. $lunInfo = .\SimplePowerVeParse esx001_lundata.txt

Alternate ways of accomplishing the same thing:

  1. The Storage Viewer plugin from EMC would be helpful in determining this from the GUI, but for political reasons it is not yet deployed in production.
  2. I originally wanted to try talking to the CIM provider for PowerPath, but quickly found that the Get-WSManInstance cmdlet’s authentication assumptions are very Windows-centric.

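The full script was behind the read-more break in the original post. As a rough illustration of the approach, here is a minimal sketch that pulls the pseudo device name, logical device ID, and LUN number out of the saved rpowermt output. The field labels it looks for (Pseudo name=, Logical device ID=, and a trailing [LUN nn]) are assumptions based on typical powermt output, not taken from the original script, so adjust them for your version.

# SimplePowerVeParse.ps1 (sketch) - parse saved "rpowermt display dev=all" output
# Field labels below are assumptions based on typical powermt output.
param([string]$Path)

$devices = @()
$current = $null

foreach ($line in Get-Content $Path) {
    if ($line -match '^Pseudo name=(\S+)') {
        # Start of a new device block
        if ($current) { $devices += $current }
        $current = [PSCustomObject]@{ PseudoName = $Matches[1]; LogicalDeviceId = $null; Lun = $null }
    }
    elseif ($current -and $line -match '^Logical device ID=(\S+)(?:\s+\[LUN\s+(\d+)\])?') {
        $current.LogicalDeviceId = $Matches[1]
        if ($Matches[2]) { $current.Lun = [int]$Matches[2] }
    }
}
if ($current) { $devices += $current }
$devices

With the objects this returns, the logical device IDs can be matched against the canonical names of the datastore extents to work out which datastores sit on which LUNs.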

Recovering from lockdown mode, a corrupt vCenter, and no DCUI

One of the worst-case scenarios when securing an ESXi host is disabling the DCUI, enabling lockdown mode, and then losing vCenter for some reason. If your vCenter database is corrupt, you have lost the ability to manage the host. The official answer at this point is to rebuild the host. While I hope you have an automated build process that would make this easy, there is at least one other option to recover your system.

DISCLAIMER: This is not supported or endorsed by VMware. The steps below assume that you have experience with Linux system administration. The official solution is to rebuild your host.

A quick refresher on lockdown mode: when you enable lockdown mode, the system removes the permissions for all of the standard users except the vpxuser account, which is what vCenter uses to manage the system.

Here are the steps to be able to manage your system again:

  1. Shut the host down. Yes, that means a hard crash for the host and any running VMs.
  2. Reset the password for the vpxuser account to a known value. Here is an article from Bernhard Bock on doing it for root. The details in the instructions might vary slightly for your environment, but they should be enough to get someone experienced with *nix pointed in the right direction. Use this process to reset the vpxuser account instead of the root account.
  3. Add the host into your vCenter inventory using the vpxuser account.
  4. The following steps may not be necessary, but if you are going to run with lockdown mode disabled from now on, I would do them just in case the system does not clean up everything properly when the host is added (a PowerCLI sketch of steps 3 and 4 follows this list).
    1. Enable lockdown mode
    2. Disable lockdown mode
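
A PowerCLI sketch of steps 3 and 4 (the vCenter name, host name, datacenter, and password below are placeholders, not values from the original post):

# Step 3: add the host back to vCenter using the vpxuser account
Connect-VIServer vcenter01
Add-VMHost -Name esx001.example.com -Location (Get-Datacenter "DC01") `
    -User vpxuser -Password "TheKnownPassword" -Force

# Step 4: toggle lockdown mode so the host cleans up its permission state
$hostView = Get-VMHost esx001.example.com | Get-View
$hostView.EnterLockdownMode()
$hostView.ExitLockdownMode()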

If you see issues with this process or have other ideas on how to recover the host in this situation please add a comment or send me an email so I can update the post.

VMware Update Manager: Different non-critical host updates for Nexus

I have 5 different vCenter servers installed over various timeframes, and I am seeing different counts for the Non-Critical Host Updates across them. After some digging it appears that this is due to the Nexus 1000v updates.

We have not deployed the Nexus to production yet, so this was a little confusing for me. It appears that when Update Manager is installed it includes the Nexus 1000v patches available at the time, but newer ones are not part of the ongoing downloads because we have none running in the environment and Update Manager was not configured to download them.

For consistency across my environments I have enabled the custom patch download source for Cisco. The URL ends in csco-main-index.xml. This does not change functionality for our deployment, but it quiets the gnawing thought that we are applying different patches to our environments.