In this blog we’ll share the journey we went on to solve a not-so-easy customer problem: a critical Windows services restart during Puppet agent upgrades.
As with most software fixes, it started with a customer ticket: the DHCP Server service restarts after an upgrade. The troubleshooting journey went from analyzing Windows Installer logs, to using undocumented Windows API calls, and back to the Windows Installer logs, until we eventually found a solution.
On Puppet Enterprise (PE) deployments, each Windows agent runs a service to allow execution of various actions on remote nodes — the pxp-agent. The pxp-agent is registered as a Windows service with the help of NSSM.
NSSM allows the user — in this case, us — to easily register/unregister an application as a Windows service and customize many related options. Because NSSM also registers itself as an Event Log message source, upgrading the Puppet agent (which contains pxp-agent and NSSM) while Event Viewer is open triggers a restart of critical Windows services (DHCP Client/Server), leading to unresponsive remote hosts.
While you may not have Event Viewer open very often, the same issue occurs when a log collector service is running, which makes the problem more likely to occur in practice.
The exact symptoms vary between Windows versions, but the cause is the same: the puppet-agent package upgrades the nssm.exe file while it is loaded in the Event Log service process image.
This leads to an Event Log service restart that also restarts other services registered as Event Log message sources, including critical services like DHCP server, DHCP client, etc.
The critical Windows services restart is caused by the following chain of events:
nssm upgrade -> Event Log service restart -> critical Windows services restart
Our first step was therefore to stop registering nssm.exe as an Event Log message source, so that NSSM upgrades would no longer trigger a restart of the Event Log service.
This solved the issue for newer deployments of the Puppet agent, but not existing ones. When nssm.exe was already a registered Event Log message source, the problem persisted. See our initial NSSM correction.
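To see whether an existing host still has nssm.exe registered as a message source, the Event Log source registrations can be inspected. The following is a minimal Python sketch, not part of the Puppet fix. It relies on the standard Windows layout, where each source is a subkey of `HKLM\SYSTEM\CurrentControlSet\Services\EventLog\<log>` with an `EventMessageFile` value. The choice of the `Application` log and the helper names `sources_using` and `read_message_files` are assumptions made for illustration.

```python
"""Sketch: list Event Log sources whose message file is nssm.exe."""
import ntpath
import sys

# Standard location of Event Log source registrations; each log
# (Application, System, ...) has one subkey per registered source.
EVENTLOG_KEY = r"SYSTEM\CurrentControlSet\Services\EventLog"


def sources_using(message_files, exe_name="nssm.exe"):
    """Return the source names whose EventMessageFile points at exe_name.

    message_files maps a source name to its EventMessageFile string,
    which may hold several paths separated by semicolons.
    """
    wanted = exe_name.lower()
    hits = []
    for source, value in message_files.items():
        paths = [p.strip().strip('"') for p in value.split(";")]
        if any(ntpath.basename(p).lower() == wanted for p in paths):
            hits.append(source)
    return sorted(hits)


def read_message_files(log_name="Application"):
    """Read {source: EventMessageFile} from the registry (Windows only)."""
    import winreg  # imported lazily so the module loads on other platforms

    result = {}
    log_path = EVENTLOG_KEY + "\\" + log_name
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, log_path) as log_key:
        subkey_count = winreg.QueryInfoKey(log_key)[0]
        for index in range(subkey_count):
            source = winreg.EnumKey(log_key, index)
            try:
                with winreg.OpenKey(log_key, source) as source_key:
                    value, _type = winreg.QueryValueEx(
                        source_key, "EventMessageFile"
                    )
            except OSError:
                continue  # no message file registered, or not readable
            result[source] = str(value)
    return result


if __name__ == "__main__" and sys.platform == "win32":
    # Read-only report of sources that still reference nssm.exe.
    print(sources_using(read_message_files()))
```

The matching logic is kept separate from the registry access so it can be exercised with plain dictionaries on any platform. The script only reads the registry and changes nothing.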
Another solution we investigated was to detect whether nssm.exe was already loaded in the EventLog.exe process image and, if possible, unload it by doing what ProcessHacker does. It was a fun time reading the ProcessHacker source code and experimenting with undocumented Windows APIs.
While getting close to a stable version, we realized that we could not guarantee the atomicity of the check/install steps — nssm.exe could still load into EventLog.exe between our check and the installation. We archived this solution and continued our journey.
Stepping back to get a bird’s-eye view of the problem, we realized that renaming the nssm.exe file to nssm-pxp-agent.exe avoids the conflict, because the new package no longer replaces nssm.exe. Still, even with this change, the package validation failed.
It took us some time to realize that it was not the new package's validation that failed. It was the validation run by the previous package's uninstaller, which would still try to remove the nssm.exe that was loaded in EventLog.exe.
To trick the uninstaller, we tested two approaches: renaming the file, and removing the file reference from the HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\UserData\S-1-5-18\Components registry key. Both get past the previous package's uninstaller validation and let the upgrade complete without any critical service restarting. Between the two similar solutions, we implemented the registry change because it is more stable in case of install/uninstall failures.
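The registry lookup involved can be sketched in Python. This is an illustrative, read-only sketch and not the code shipped in the installer. It assumes the usual Windows Installer layout, in which each subkey of `Components` represents one component and each of its values holds that component's key path for a product. The helper names `components_referencing` and `read_components` are hypothetical.

```python
"""Sketch: find Windows Installer component entries that point at nssm.exe."""
import ntpath
import sys

# Registry key named in the text, relative to HKEY_LOCAL_MACHINE.
COMPONENTS_KEY = (
    r"SOFTWARE\Microsoft\Windows\CurrentVersion\Installer"
    r"\UserData\S-1-5-18\Components"
)


def components_referencing(components, exe_name="nssm.exe"):
    """Return (component, value_name) pairs whose data is a path to exe_name.

    components maps each Components subkey name to a dict of its values
    ({value name: value data}). Non-string data is ignored.
    """
    wanted = exe_name.lower()
    matches = []
    for component, values in components.items():
        for value_name, data in values.items():
            if isinstance(data, str) and ntpath.basename(data).lower() == wanted:
                matches.append((component, value_name))
    return sorted(matches)


def read_components():
    """Load every Components subkey and its values (Windows only)."""
    import winreg  # imported lazily so the module loads on other platforms

    snapshot = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, COMPONENTS_KEY) as root:
        for i in range(winreg.QueryInfoKey(root)[0]):
            name = winreg.EnumKey(root, i)
            with winreg.OpenKey(root, name) as sub:
                value_count = winreg.QueryInfoKey(sub)[1]
                values = {}
                for j in range(value_count):
                    value_name, data, _type = winreg.EnumValue(sub, j)
                    values[value_name] = data
                snapshot[name] = values
    return snapshot


if __name__ == "__main__" and sys.platform == "win32":
    # Report only; nothing is modified or deleted.
    for component, value_name in components_referencing(read_components()):
        print(component, value_name)
```

The sketch stops at reporting the matching entries on purpose. Removing them is what changes the uninstaller's behavior, and that step belongs inside the installer itself, where it runs at a controlled point in the upgrade.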
As we were already using the WiX Toolset to create the installer package, we implemented the registry update using the WixQuietExecCmdLine custom action right before the RemoveExistingProducts step. See our [Puppet Agent correction](https://github.com/puppetlabs/puppet-agent/pull/1912/files).
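After a bumpy ride, the solution to our customer problem was to stop registering nssm.exe as an Event Log message source, rename nssm.exe to nssm-pxp-agent.exe in the new package, and remove the old nssm.exe file reference from the Windows Installer Components registry key right before the previous package is uninstalled.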
Ciprian Badescu is a senior software engineer at Puppet.