With self-healing infrastructure, DevOps engineers can rest assured that their operating systems will stay up and running around the clock. Less manual work, more peace of mind.
So, what is self-healing infrastructure, and how can you get started? This article covers both.
Self-healing infrastructure is an automation methodology that allows systems to identify and repair errors and misconfigurations without any human action.
Self-healing infrastructure is the next natural progression in automation. IT automation has become necessary in today’s environment where expectations on IT are high, demand for talent far outpaces the supply, and threat remediation requires immediate action. Self-healing infrastructure is also the basis of AIOps. Without the ability to automatically remediate errors and misconfigurations, an IT group won’t be able to scale.
The reason for implementing self-healing infrastructure should be to remove interruptions from your workday (and your nights). The goal should be to adopt a well-scheduled work cycle where interruptions are the exception, not the norm.
When I was a DevOps engineer, I had a list of projects I was going to complete “when things slowed down.” That list served me well for many years, and I passed the list on, sepia-toned and faded, to my replacement when I moved on. Imagine a world where those projects get worked on, implemented, and checked off… and the phone isn’t ringing at 3 a.m. because a disk drive is full.
Identify common, repetitive support issues and compliance-breaking configuration changes. Those five-minute fixes that require very little engagement for each occurrence, but that add up over time, are a great place to start cutting down on tickets!
A complex, multi-step remediation process isn't the place to start.
Pick your tools, enforce their use, and have the automation stored in a code repository to make the solutions readily available. As your self-healing library expands, there will be more and more opportunity to reuse the code, keeping things simple and standardized.
Don’t exchange being interrupted by errors and outages for being interrupted by debugging automation.
Monolithic self-healing routines are brittle and high maintenance. It’s better to make the self-healing responses modular and, if appropriate, put them in a toolchain.
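One way to picture modular responses is as a chain of small, single-purpose steps that can be reordered and reused. The sketch below is only an illustration under assumed names: `compress_old_logs`, `restart_service`, and `run_chain` are hypothetical, and the step bodies are stand-ins for real work.

```python
from typing import Callable, Dict, List

# Each step is a small function that takes a shared context dict and
# returns True on success. All step names here are hypothetical.
Step = Callable[[Dict[str, str]], bool]


def compress_old_logs(context: Dict[str, str]) -> bool:
    context["logs"] = "compressed"  # stand-in for the real work
    return True


def restart_service(context: Dict[str, str]) -> bool:
    context["service"] = "restarted"  # stand-in for the real work
    return True


def run_chain(steps: List[Step], context: Dict[str, str]) -> List[str]:
    """Run steps in order and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed: List[str] = []
    for step in steps:
        if not step(context):
            break
        completed.append(step.__name__)
    return completed
```

Because each step knows nothing about the others, the same `restart_service` step could appear in several chains, and a failing step halts the chain instead of letting later steps run against a bad state.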
Spending days scripting and testing automation for a task that takes you five minutes once a month isn’t freeing up your time. Target quick hits and repeat offenders.
So what would be a practical, actionable example of a task or scenario that makes sense for creating self-healing infrastructure? Think of ways you can complete this sentence in your environment:
Every time X happens, I have to Y.
It looks like there are two variables to consider here, but there are actually three parts to this self-healing equation. You need to identify the incident that arises (X) and the action taken every time to remediate it (Y), but also the signal that makes you aware of the incident in the first place. More often than not, this comes in the form of an alert from your monitoring system. So the manual, human-based workflow today is 1) I get an alert, 2) the alert tells me (or I divine from the message and the symptoms) what is going on, and 3) I take action to resolve it.
When broken down like this, it becomes a bit easier to conceptualize how to automate these three actions into a self-healing situation. Identify the message or alert that indicates an incident has occurred, then tie an automated action to the incident. Optionally, add confirmation that the incident has, in fact, been resolved, and notify your monitoring platform to acknowledge and resolve the alert.
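The alert, action, confirm, and acknowledge steps described above can be sketched as a small dispatcher. This is a minimal illustration, not a real monitoring integration: `Alert`, `Remediation`, `handle_alert`, and the `resolve` callback are all hypothetical names, and a real system would wire them to its own monitoring platform.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Alert:
    name: str  # the alert type, e.g. "disk_full" (hypothetical)
    host: str  # the system the alert fired on


@dataclass
class Remediation:
    fix: Callable[[Alert], None]     # the automated action (Y)
    verify: Callable[[Alert], bool]  # confirms the incident is resolved


def handle_alert(
    alert: Alert,
    playbook: Dict[str, Remediation],
    resolve: Callable[[Alert], None],
) -> str:
    """Map an alert to its remediation, run it, confirm, then resolve.

    Returns "resolved" if the fix was verified, otherwise "escalated"
    so that a human can take over.
    """
    remediation = playbook.get(alert.name)
    if remediation is None:
        return "escalated"  # no automated response known for this alert
    remediation.fix(alert)
    if remediation.verify(alert):
        resolve(alert)  # tell the monitoring platform to close the alert
        return "resolved"
    return "escalated"
```

Keeping the verification separate from the fix means an alert is only acknowledged when the condition has actually cleared, and anything unknown or unverified still reaches a person.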
A very common example is disk space utilization. Frequently, a human will be woken in the middle of the night by an alert that a disk is 90% full. Then the human will log into a system to delete old files, clean up large binaries left over from other users, compress old logs, or take some other space-saving measures to resolve the alert. A more permanent solution would be to also extend the size of a volume and filesystem to meet increasing demand. An important follow-up action for the next day is to contact the owner of the system and remind them (for the fourth time this week) that their system is filling up filesystems and can they please do something about it.
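As a rough sketch of the disk space case, the check and one of the space-saving measures (deleting old files) could look like the following. The 90% threshold comes from the example above; the 30-day age limit, the function names, and the choice to delete files outright are assumptions made for illustration, and a real cleanup policy would need to be agreed with the system owner.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional


def percent_used(path: str) -> float:
    """Return how full the filesystem containing `path` is, in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total * 100


def delete_old_files(
    directory: str, max_age_days: float, now: Optional[float] = None
) -> List[str]:
    """Delete regular files in `directory` older than `max_age_days`.

    Returns the names of the files that were removed. Subdirectories
    are left alone.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed: List[str] = []
    for entry in sorted(Path(directory).iterdir()):
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            entry.unlink()
            removed.append(entry.name)
    return removed


def heal_if_full(
    mount: str,
    cleanup_dir: str,
    threshold: float = 90.0,  # matches the 90% alert in the example
    max_age_days: float = 30,  # assumed retention period
) -> List[str]:
    """Clean up `cleanup_dir` only when `mount` is at or above `threshold`."""
    if percent_used(mount) < threshold:
        return []
    return delete_old_files(cleanup_dir, max_age_days)
```

The returned list of removed files gives the automation something concrete to report, for example in the next-day follow-up with the system owner.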
Tying together monitoring, ChatOps, and configuration management into a self-healing system can seem daunting, but if you start small, create reusable pieces, and gradually move toward a solution, you can start to move the humans from performing the healing to managing the self-healing process itself.
Hopefully, we’ve been able to pique your interest in self-healing automation for your infrastructure. This is a very basic outline of the theory; the practice itself can vary widely depending on your environment and your needs.
A good plan and a goal can help guide you in designing your self-healing infrastructure and keep it where it should be: a tool that improves your life and enhances the reliability of your IT estate.
Puppet Enterprise is the leading solution for streamlining and automating your IT operations. It helps you keep your infrastructure up and running efficiently and securely, allowing your team to focus on mission-critical tasks. With Puppet, you can easily automate a wide range of tasks and processes, such as provisioning, configuration management, application deployment, and more.
Download a free trial and take advantage of its powerful automation capabilities today!
Technical Product Marketing Director, Puppet by Perforce
Senior Principal Product Marketing Manager, Puppet by Perforce