Blog
May 7, 2026
Creating Successful Migration Workflows with Puppet
How to & Use Cases, Infrastructure Automation
I’ve been doing this for over thirty years. Sysadmin, ops lead, global teams, and more data centre migrations than I’d like to admit. Site to site, P2V, V2V, cloud, hybrid, all of it. Every migration gets sold as a clean, well-planned transition. None of them are. They go wrong in very predictable ways. Not because moving infrastructure is especially difficult, but because nobody ever has a clear, current view of what’s actually running, what’s changed, and what still matters. So people fall back to spreadsheets, SSH sessions, scripts written at 2am, and a lot of “we think this is right”. That’s where migrations fail. Not in the move itself, but in the loss of control around it.
Why Migrations Go Wrong
After you’ve done a few of these, the pattern is obvious. Three things show up every time.
The first is drift. Most environments look well understood on paper. In reality, versions don’t match, configurations have wandered off in their own directions, “identical” servers aren’t, and there are systems no one owns but everyone is afraid to touch. I once worked on an estate where three “identical” app servers were running three different JVM versions, and nobody could tell me which one production traffic was actually landing on. You start the migration with an incomplete picture and everything after that is guesswork. It gets worse once the migration is underway, because people make manual fixes to keep things moving, and nothing lasts as long as a temporary solution.
The second is manual execution. No matter how good the plan looks in the slide deck, at some point it turns into a human running commands on a list of hosts they pasted out of a spreadsheet. Sequencing becomes tribal knowledge. Parallelisation becomes guesswork. The rollback plan, if anyone wrote one down, is usually a single line that says “restore from backup”.
The third is lost accountability. During a migration more people have access, more changes are happening, and less of it gets tracked properly. So, when something breaks at 3am, nobody is quite sure what changed, or who changed it, or how to undo it. By the time you’ve worked it out, the on-call engineer has lost an evening and the project has lost a week of trust.
None of this is new. What’s interesting is that all three problems have the same root cause. There’s no consistent way to know what your estate looks like and operate it predictably while it’s moving.
The Four Things You Need to Stay in Control
Fixing this comes down to four things. You need to know what’s actually out there. You need a way to say what it should look like. You need to be able to operate it safely at scale. And you need to know who’s allowed to do what. Most teams have bits of all four scattered across different tools, which is why nothing quite lines up when it matters. Puppet Enterprise gives you all four in one place. One model, one set of controls, and one audit trail. Not four tools you have to glue together yourself and pray they agree with each other.
Visibility
Most migrations start with a discovery phase that’s stale before the spreadsheet is finished. What you actually want is continuous visibility, and that’s what Facter does. Facter runs on every node, collects system information, and reports it back centrally on every Puppet run, dozens of times a day. Out of the box you get the obvious things (OS, kernel, memory, network), but the part that matters during a migration is custom facts. You can teach Facter to collect anything you care about.
Here’s a small one that reports the version of an in-house payments app by reading a file the deploy pipeline drops on disk:
Facter.add(:payments_version) do
  setcode do
    # Read the version file the deploy pipeline drops on disk; on nodes
    # that don't run the app, the fact simply doesn't exist.
    if File.exist?('/opt/payments/VERSION')
      File.read('/opt/payments/VERSION').strip
    end
  end
end

Once that fact exists, every node running the app reports its version on every Puppet run. Five minutes later you can ask Puppet Enterprise “show me every node where payments_version is older than 4.2” and get a real answer, not a guess. Multiply that across the dozen things you actually care about (JVM version, storage layout, mount points, whether a vendor agent is running, whether a system is genuinely in use) and you’ve replaced your discovery spreadsheet with something that updates itself.
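To make the “ask Puppet Enterprise” step concrete, here’s a hedged sketch of pulling that list programmatically with a PQL query from inside a plan. The plan name and the version string are illustrative, and because PQL compares fact values as strings, matching the exact versions you expect to retire is safer than relying on a numeric “older than”.

# Which nodes still report an old payments_version fact?
plan migration::find_stale_payments() {
  $results = puppetdb_query('inventory[certname] { facts.payments_version = "4.1.0" }')
  return $results.map |$r| { $r['certname'] }
}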
That’s the bit most teams never quite get to. Half the battle in a migration is just knowing what you’ve got.
State
Once you know what you’ve got, the next problem is making sure it behaves the way you expect.
This is the part Puppet has been doing for years. You define the desired state in code, and that code gets applied consistently across the old data centre, the new one, and any cloud environment you’re bringing into the mix. No golden images quietly drifting. No “we rebuilt it by hand and it’s nearly the same”.
The reason this matters during a migration is that it lets you stand up the new environment alongside the old one and keep them honest. Same roles, same profiles, same definitions. The question stops being “did we build this correctly?” and becomes “does it match the defined state?”. The second question has an answer. The first one rarely does.
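What that looks like in code is the usual roles-and-profiles pattern. This is a minimal sketch with hypothetical class names and a pinned JVM package; the detail that matters is that the same role definition is applied on both sides of the move.

# One role per kind of server; the role just composes profiles.
class role::payments_app {
  include profile::base       # users, hardening, monitoring agent
  include profile::java       # the pinned JVM that otherwise drifts
  include profile::payments   # the application itself
}

# A profile pins the detail you'd otherwise rebuild "nearly the same".
class profile::java {
  package { 'java-11-openjdk':
    ensure => '11.0.20.0',
  }
}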
Execution
At some point you actually have to do things. Stop services, drain load balancers, run pre-flight checks, kick off data sync, validate the result, move on. This is the bit that usually collapses into SSH and good intentions.
Puppet Tasks and Plans give you a way to do it properly. Tasks run actions across large numbers of nodes. Plans sequence those actions and react to what they return. You target nodes by what Puppet already knows about them, not by a hand-built host list, which means the targeting stays correct as the estate changes underneath you.
Here’s roughly what a cutover step looks like as a plan:
plan migration::drain_web (
  TargetSpec $nodes,
) {
  $targets = get_targets($nodes)

  # Stop nginx on every target, then wait (up to five minutes) for
  # in-flight connections to drain before declaring the batch done.
  run_command('systemctl stop nginx', $targets)
  run_task('healthcheck::wait_for_drain', $targets, timeout => 300)

  return "drained ${targets.size} nodes"
}

Nothing clever. The point is that it’s the same plan whether you’re draining four nodes or four hundred, it runs through the orchestrator with the same RBAC and the same logging as everything else, and you can call it from another plan that drains the web tier, then the app tier, then flips DNS. The chaos of cutover night turns into something you can rehearse.
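Composed, that sequencing looks roughly like the sketch below. Everything other than migration::drain_web is illustrative: the app-tier plan, the DNS task, and its parameters are stand-ins for whatever your cutover actually involves.

plan migration::cutover (
  TargetSpec $web,
  TargetSpec $app,
) {
  # Drain the web tier with the plan above, then the app tier with a
  # hypothetical sibling plan.
  run_plan('migration::drain_web', nodes => $web)
  run_plan('migration::drain_app', nodes => $app)

  # Flip DNS from wherever the records are managed; the task is a stand-in.
  run_task('dns::flip_records', 'localhost', zone => 'payments.example.com')
}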
Control
The last piece is the one most people only think about after something has already broken. During a migration more people have access, more changes are happening, and the blast radius of a mistake is bigger. You need to know who can define behaviour, who can execute it, and who can see what’s going on.
Puppet Enterprise handles that through RBAC and its reporting layer. You can let a team run specific tasks against specific systems without handing them the keys to the kingdom, and you can give read-only visibility to the people who need to know what’s happening without putting them in the room.
The same controls apply no matter how the work is triggered. A change ticket in ServiceNow, a step in a CI/CD pipeline, an approval workflow, or someone clicking a button in the console all run the same plans against the same nodes. Same RBAC checks, same audit trail. That matters during a migration because cutover work doesn’t all come from one place. Some of it is planned change, some is automated, some is “the network team needs us to drain that rack in the next ten minutes”, and you don’t want each route to have its own access model and its own log.
If you’re running Puppet Enterprise Advanced, the same model extends into compliance. Continuous reporting, CIS benchmarks running on every node, and the ability to show an auditor that the new environment was in the expected state on the day you cut over. During a migration, evidence is the difference between “we think it went well” and “here’s the report”.
A Few Things That Come Along for the Ride
If part of your migration is moving workloads into the cloud, PE Advanced ships with CloudOps and FinOps capabilities that are genuinely useful in flight, not just once you’ve landed.
Once your cloud accounts are connected, you get visibility of what’s actually running, what it’s costing, and what’s been left switched on by mistake. Migrations are very good at producing all three: orphaned instances, oversized VMs that someone picked “to be safe”, and test environments that nobody remembered to turn off.
The bit that makes this useful rather than just another dashboard is that it’s tied to the same node inventory you’re already managing with Puppet. So the cost story doesn’t end up as a separate project six months after cutover, run by people who don’t know which servers were meant to be there in the first place.
A couple of other things in Puppet Enterprise Advanced are worth knowing about for the same reason. The data connector pushes Puppet’s facts, reports, and events into whatever SIEM or observability platform you already run, so the migration shows up in the same dashboards your SOC and SREs are already watching, instead of being a parallel universe nobody looks at. And the ServiceNow integration syncs bi-directionally with your CMDB, which means the live view Puppet has of the estate stops drifting away from the system of record the rest of the business is using. Both of those matter more during a migration than at any other point in the life of an environment, because both are usually where the wheels come off after cutover.
You Don't Have to Throw Away What You've Got
One thing worth saying clearly, because it comes up in every conversation I have about this. You don’t need to rip out your existing tooling to do any of it.
Most environments I see are a mix of on-prem and cloud, multiple operating systems, and whatever automation has grown over the years. Usually that includes Ansible, a pile of homegrown scripts, and a fair bit of “don’t touch that, it works”. Trying to standardise all of that during a migration is a mistake. It adds risk at exactly the moment you want less of it.
A more practical approach is to orchestrate what you already have. Puppet Tasks can call existing scripts and Ansible playbooks, sequence them alongside Puppet-managed actions, and target the right systems based on facts and roles. Instead of maintaining multiple inventories and disconnected workflows, you get one control layer, driving everything you already run.
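As a rough sketch of what that can look like, here’s a plan that runs an existing pre-check script and an existing Ansible playbook, neither of which has to change. The script, playbook path, and control host are placeholders for whatever you already have.

plan migration::sync_storage (
  TargetSpec $nodes,
) {
  # An existing homegrown check, shipped in the migration module's files
  # directory, pushed to and run on every target.
  run_script('migration/presync_check.sh', $nodes)

  # An existing Ansible playbook, run unmodified from the control host
  # it already lives on.
  run_command('ansible-playbook /opt/ansible/sync_storage.yml', 'ansible-control.example.com')
}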
Puppet isn’t replacing those tools. It’s coordinating them. That lets you keep what works, cut the duplication, and converge on a more consistent model over time, which is a much safer place to be in the middle of a migration than mid-rewrite.
Read: Running Ansible Playbooks with Puppet
Final Thoughts
A data centre migration isn’t really a logistics problem. It’s a control problem.
If you know what’s out there, what state it’s in, and who’s changing it, you can move systems safely, validate them properly, and recover when things don’t go to plan. If you can’t, it doesn’t matter how good the plan looked at kickoff. You’ll be discovering the gaps the hard way, one outage at a time.
The goal of a migration isn’t to move things. It’s to only have to move them once.