October 29, 2020

How to Solve Critical Windows Services Restart During Puppet Agent Upgrades


In this blog we’ll share the journey we went on to solve a not-so-easy customer problem: critical Windows services restarting during Puppet agent upgrades.


The Problem: Upgrading the Puppet Agent Triggers Windows Critical Services Restart

As with most software fixes, it starts with a customer ticket. In this case, "PA-3175: component DHCP Server service restarts after an upgrade". This troubleshooting journey goes from analyzing Windows installer logs, to using undocumented Windows API calls, and back to the Windows installer logs, until we eventually found a solution.

On Puppet Enterprise (PE) deployments, each Windows agent runs a service, pxp-agent, that allows various actions to be executed on remote nodes. The pxp-agent is registered as a Windows service with the help of NSSM, a service helper that lets you run standard executables and scripts as Windows services.


NSSM allows users to easily register/unregister an application as a Windows service and customize many related options. Because NSSM also registers itself as an Event Log message source, upgrading the Puppet agent (which contains pxp-agent and NSSM) while Event Viewer is open triggers a restart of critical Windows services (DHCP Client/Server), leading to unresponsive remote hosts.

While you may not have Event Viewer open very often, the same issue occurs when a log collector service is running, which makes the problem far more likely to occur in practice.

Digging Around

The problem varies between Windows versions, but generally it looks like this:

[Image: the issue as it appears in the Windows UI]

The reason behind this is that the puppet-agent package upgrades the nssm.exe file that is currently loaded in the Event Log service process image:

[Image: nssm.exe loaded in the Event Log service process image]

This leads to an Event Log service restart that also restarts other services registered as Event Log message sources, including critical services like DHCP server, DHCP client, etc.

Fixing NSSM

The critical Windows services restart is caused by the following chain of events:

nssm upgrade -> Event Log service restart -> critical Windows services restart

Our first step was therefore to stop registering nssm.exe as an Event Log message source, so that NSSM upgrades would no longer trigger a restart of the Event Log service.

This solved the issue for newer deployments of the Puppet agent, but not existing ones. When nssm.exe was already a registered Event Log message source, the problem persisted. See our initial NSSM correction.
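To check whether an existing host still has nssm.exe registered as a message source, one option is to scan the Event Log source registrations. The following is a minimal, read-only sketch in Python, not code from the Puppet fix. It assumes the standard Windows location for source registrations (HKLM\SYSTEM\CurrentControlSet\Services\EventLog\<log>\<source>, with the message file stored in the EventMessageFile value) and that the source lives in the Application log. The function names are hypothetical.

```python
"""List Event Log sources whose message file is nssm.exe (read-only sketch)."""
import ntpath
import sys

# Standard location of Event Log source registrations on Windows.
EVENTLOG_ROOT = r"SYSTEM\CurrentControlSet\Services\EventLog"


def sources_using(sources, exe_name="nssm.exe"):
    """Return the sorted source names whose EventMessageFile points at exe_name.

    `sources` maps a source name to its EventMessageFile string, which may
    hold several paths separated by semicolons.
    """
    wanted = exe_name.lower()
    hits = []
    for name, message_file in sources.items():
        paths = [p.strip().strip('"') for p in message_file.split(";")]
        if any(ntpath.basename(p).lower() == wanted for p in paths):
            hits.append(name)
    return sorted(hits)


def read_sources(log="Application"):
    """Read {source: EventMessageFile} for one log. Windows only."""
    import winreg  # imported lazily so the module still loads on other systems

    found = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, EVENTLOG_ROOT + "\\" + log) as log_key:
        subkey_count = winreg.QueryInfoKey(log_key)[0]
        for index in range(subkey_count):
            name = winreg.EnumKey(log_key, index)
            with winreg.OpenKey(log_key, name) as source_key:
                try:
                    value, _type = winreg.QueryValueEx(source_key, "EventMessageFile")
                except FileNotFoundError:
                    continue  # this source has no message file registered
                found[name] = str(value)
    return found


if __name__ == "__main__" and sys.platform == "win32":
    print(sources_using(read_sources()))
```

An empty list means no source in the Application log points at nssm.exe. Any name that is returned identifies a registration that can cause the Event Log service to load the file.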

Open Source and Undocumented Windows API

Another solution we investigated was to detect whether nssm.exe was already loaded in the EventLog.exe process image and, if possible, unload it, the way ProcessHacker does. It was fun reading the ProcessHacker source code and experimenting with undocumented Windows APIs.

While getting close to a stable version, we realized that we could not guarantee the atomicity of the check/install steps — nssm.exe could still load into EventLog.exe between our check and the installation. We archived this solution and continued our journey.

Back to the Installer

Stepping back to get a bird’s-eye view of the problem, we realized that renaming the nssm.exe file to nssm-pxp-agent.exe resolves the conflict, because the new package no longer replaces nssm.exe. Still, even with this change, package validation failed.

It took us some time to realize that the failure did not come from validating the new package. It came from validating the previous package's uninstaller, which tries to remove the nssm.exe that is still loaded in EventLog.exe.

To trick the uninstaller, we tested two approaches: renaming the file, and removing the file reference from the HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\UserData\S-1-5-18\Components registry key. Both get past the previous package's uninstaller validation and let the upgrade complete without restarting any critical service. Of these two similar solutions, we implemented the registry change because it is more stable if an install or uninstall fails.
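To illustrate what "file reference" means here, the sketch below lists the values under the Components key whose data points at nssm.exe. It is a read-only Python illustration, not the actual fix (which runs inside the installer) and it deliberately deletes nothing. It assumes the usual Windows Installer layout, where each subkey of Components is one component and each value holds that component's key path for one product. The function names are hypothetical.

```python
"""Find Windows Installer component entries that reference nssm.exe (read-only)."""
import ntpath
import sys

COMPONENTS_KEY = (
    r"SOFTWARE\Microsoft\Windows\CurrentVersion\Installer"
    r"\UserData\S-1-5-18\Components"
)


def find_references(components, exe_name="nssm.exe"):
    """Return sorted (component, value_name, path) tuples that point at exe_name.

    `components` maps each component subkey name to a dict of its registry
    values ({value_name: data}). Non-string data is ignored.
    """
    wanted = exe_name.lower()
    hits = []
    for component, values in components.items():
        for value_name, data in values.items():
            if isinstance(data, str) and ntpath.basename(data).lower() == wanted:
                hits.append((component, value_name, data))
    return sorted(hits)


def read_components():
    """Load every subkey of COMPONENTS_KEY into a dict. Windows only."""
    import winreg  # imported lazily so the module still loads on other systems

    # Ask for the 64-bit registry view even from a 32-bit interpreter.
    access = winreg.KEY_READ | winreg.KEY_WOW64_64KEY
    components = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, COMPONENTS_KEY, 0, access) as root:
        for i in range(winreg.QueryInfoKey(root)[0]):
            name = winreg.EnumKey(root, i)
            with winreg.OpenKey(root, name, 0, access) as sub:
                values = {}
                for j in range(winreg.QueryInfoKey(sub)[1]):
                    value_name, data, _type = winreg.EnumValue(sub, j)
                    values[value_name] = data
                components[name] = values
    return components


if __name__ == "__main__" and sys.platform == "win32":
    for component, value_name, path in find_references(read_components()):
        print(component, value_name, path)
```

Running a listing like this before and after an upgrade is one way to confirm that the old reference is gone. Matching on the file name alone can also report nssm.exe copies installed by unrelated products, so the paths should be reviewed before anything is changed.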

As we were already using the WiX Toolset to create the installer package, we implemented the registry update using the WixQuietExecCmdLine custom action, scheduled right before the RemoveExistingProducts step. See our [Puppet Agent correction](https://github.com/puppetlabs/puppet-agent/pull/1912/files).


The Solution: Renaming and Removing nssm.exe References

After a bumpy ride, the solution to our customer problem was to:

  1. No longer register nssm.exe as an Event Log message source, because that registration is what causes critical Windows services to restart during upgrades.
  2. If packages with nssm.exe registered as an Event Log message source have already been shipped:
    1. Rename nssm.exe in the newer packages to a name specific to the application. For example, nssm-myapp.exe.
    2. Remove any nssm.exe references from the HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\UserData\S-1-5-18\Components registry key during the new package installation.

Further Reading

Puppet for Windows
