Exporting and anonymizing data
This document covers using the export, import and anonymization tools for PuppetDB.
The export tool will return an archive of all of your PuppetDB data which can be uploaded to another PuppetDB via the import tool. The export tool also has the ability to anonymize the archive before returning it. This is particularly useful when sharing PuppetDB data that contains sensitive items.
To create an anonymized PuppetDB archive directly, use the Puppet
from any node with puppet-client-tools installed:
$ puppet db export my-puppetdb-export.tar.gz --anonymization moderate
To import an anonymized PuppetDB tarball, use the Puppet
db subcommand from
any node with puppet-client-tools installed:
$ puppet db import my-puppetdb-export.tar.gz
How does it work?
The tool walks through your entire data set, applying different rules to each of the leaf data based on the profile you have chosen. The data structure is left intact, and only the data contents are modified. This maintains the "shape" of the data without exposing the underlying data you may wish to scrub.
We do this by always ensuring we replace data consistently. For example, if a string is replaced with something random, we ensure that all instances of that original string are replaced with the same random string throughout the data.
By keeping its original shape, the data can be anonymized based on your needs but still hold some value to the consumer of your anonymized data.
You may not need to anonymize all data in every case, so we have provided a number of profiles offering varying levels of anonymization.
The profile can be specified on the command line when the command is run. For
example, to choose the
low profile, enter:
$ puppet db export ./my-puppetdb-anonymized-export.tar.gz --anonymization low
full profile will anonymize all data (including node names, resource
types, resource titles, parameter names, values, any log messages, file names,
and file lines) while retaining the data set's shape. The result is a completely
anonymized data set. Report metrics under the
categories are left intact, as these can be inferred from the rest of the data,
but names of metrics under the
time category are anonymized as resource types.
This is useful if you are really concerned about limiting the data you expose, but provides the least utility for the consumer depending on the activity they are trying to test.
moderate profile attempts to be a bit smarter about what it anonymizes and
is the recommended profile for most cases. It sorts and anonymizes data by
Node name: is anonymized by default.
Resource type name: the core types that are built into Puppet are not anonymized, including some common types from the modules:
mysql. Any Puppet Enterprise core type names are also preserved. The goal here is to anonymize any custom or unknown resource type names, as these may contain confidential information.
Resource titles: all titles are anonymized expect for those belonging to Filebucket, Package, Service, and Stage.
Parameter names: are never anonymized.
Parameter values: everything is anonymized except for the values for
Report log messages: are always anonymized.
File names: are always anonymized.
File numbers: are left as they are.
Log messages: are always anonymized.
Metrics: metric names in the
timecategory are anonymized as resource types.
This profile is aimed at hiding security information specifically, but leaving most of the data in its original state. The following categories are anonymized:
Node name: is always anonymized.
Parameter values: only values and messages for parameter names containing the strings
Log messages: are always anonymized.
Verifying your anonymized data
After anonymizing data with the
puppetdb export tool, we strongly
recommend that you analyze the anonymized data before sharing it with another
party to ensure that all sensitive data has been scrubbed.
Simply untar the export file and analyze the contents:
$ tar -xzf my-puppetdb-anonymized-export.tar.gz $ cd puppetdb-bak
Inside this directory there is a directory for each content type (reports,
catalogs, and facts), and each file inside represents a node (and a report
instance for reports). The data is represented as human-readable JSON. You can
open these files and use tools such as
grep to check the status of specific
information you wish to anonymize.