Understanding Environment Caching in Puppet

Recently, puppet got a new way of defining environments. Along with the new environments came a new way of handling the caching of environments inside the puppet master. I found out that what puppet is doing now, and what it did in the past, wasn’t well understood by users, so this post is meant to get rid of some of the mystery.

I’m going to cover what an environment is internally to puppet, what caching an environment is, what it isn’t, and what it used to be. In another post I’ll cover how to evaluate your situation and tune the environment cache and your workflow.

What is an environment?

Seems like such a simple question…but it isn’t. What an environment is in your puppet installation really depends on your workflow. In some workflows the environment is used to represent different stages in a build and deploy pipeline (dev, test, stage, production). In other workflows, I’ve heard of environments used to represent different clusters of servers or data centers. Internal to puppet, the concept of an environment becomes much simpler (but still not very easy). At its absolute simplest, an environment inside puppet is an instance of the Puppet::Node::Environment class. A Puppet::Node::Environment has a name, a set of initial manifests, and a set of modules, each with its own manifests.

In addition to all of these pieces that are directly related to an environment, there are also numerous other pieces that are indirectly related. Custom functions, which are written in ruby, are indirectly associated with the environment. Custom types are in a similar boat. Neither of these is completely under puppet’s control since they are extra ruby code that is loaded into the runtime.

Why does any of this matter? Any time a puppet master needs to construct a catalog for an agent node, it needs to have all of the information about an environment on hand. Scanning the filesystem for the relevant directories and files, as well as parsing and validating puppet code, can all take time and put extra load on the server. If you’ve ever investigated a computer that is performing too much IO, then you’ll know how detrimental unnecessary disk operations can be to overall performance.

Improving Performance of Catalog Creation

Given that a puppet environment is needed for creating a catalog, and given that loading it can be expensive, both in terms of IO operations and in terms of CPU time, what can be done? The answer is to reduce unnecessary disk and CPU operations. But now we are faced with another question: which operations are unnecessary?

A somewhat simplistic algorithm for what puppet does when creating a catalog is (ignoring lots of variations and other complications):

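program_code = []
# Read and parse the environment's initial manifests
environment.manifests.each do |manifest_file|
  program_code << parse(File.read(manifest_file))
end

# Read and parse the manifests of every module in the environment
environment.modules.each do |mod|
  mod.manifests.each do |manifest_file|
    program_code << parse(File.read(manifest_file))
  end
end

# Evaluate the parsed code for this node and its facts
catalog = execute(program_code, node_name, facts)
return catalog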

Every one of these operations is necessary, but the results of some change more often than others. Listing, reading, and parsing the manifest files only need to happen when the file contents change, or when a file is added or removed. All of the other steps need to happen every time, either because the facts change every time, or because evaluating the program code has side effects which could affect the outcome. For example, custom functions that read and write to databases could have different results on every invocation.

The environment caching in puppet takes the approach of preserving the parsed manifest files between uses of an environment. This allows the master to skip reading, parsing, and validating the manifests every time it needs to build a catalog. The last piece of the caching puzzle is when puppet will evict an environment from the cache and start again.
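The split between cached parsing and per-request evaluation can be sketched roughly like this. It is only an illustration: ParsedManifestCache and build_catalog are hypothetical names, not puppet's internal classes, and the "parser" is a stand-in block.

```ruby
# Hypothetical sketch: parse results are kept per environment, evaluation is not.
class ParsedManifestCache
  attr_reader :parse_count

  def initialize(&parser)
    @parser = parser   # stand-in for reading, parsing, and validating manifests
    @parsed = {}       # environment name => parsed code
    @parse_count = 0
  end

  # Returns the parsed code for an environment, parsing only on a cache miss.
  def parsed_code_for(env_name)
    @parsed.fetch(env_name) do
      @parse_count += 1
      @parsed[env_name] = @parser.call(env_name)
    end
  end
end

# Evaluation depends on the node and its facts, so it runs for every request.
def build_catalog(cache, env_name, node_name, facts)
  code = cache.parsed_code_for(env_name)
  { node: node_name, environment: env_name, code: code, facts: facts }
end

cache = ParsedManifestCache.new { |env| "parsed manifests for #{env}" }
build_catalog(cache, "production", "web01", { "os" => "linux" })
build_catalog(cache, "production", "web02", { "os" => "linux" })
puts cache.parse_count  # two catalogs were built, but the manifests were parsed once
```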

Environment Cache Eviction

A cache eviction strategy defines when to remove an entry from a cache. One of the most common strategies is to have a fixed-size cache and use an LRU (Least Recently Used) algorithm to remove entries when more space is needed. Puppet uses a much simpler cache eviction strategy: time-to-live. When an environment is placed into the cache, the time is recorded. After a configured amount of time, the entry will be removed and the environment’s manifests will need to be parsed again.

Right now puppet only decides to evict a specific environment cache entry when that specific environment is requested. If an environment is requested one time and never requested again, it will remain in the cache indefinitely. This doesn’t lead to memory leaks in most puppet master configurations, however, because Passenger, and other Rack servers, will terminate a worker process after a certain number of requests. If this weren’t the case, then over time many puppet master setups would have an ever-growing environment cache, since it is common to use short-lived development or testing environments. These environments would never be freed from memory and the master process would continually grow in size (this is yet another reason why the WEBrick puppet master should not be used in a production scenario).
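To make the "only evicted when requested" behaviour concrete, here is a minimal time-to-live cache sketch. TtlEnvironmentCache is a hypothetical class written for this post, not puppet's implementation, and the clock is injectable only so the example runs without sleeping.

```ruby
# Hypothetical sketch of a time-to-live cache that only checks expiry on request.
class TtlEnvironmentCache
  Entry = Struct.new(:value, :created_at)

  # ttl is in seconds; clock is any callable that returns the current time.
  def initialize(ttl, clock: -> { Time.now.to_f })
    @ttl = ttl
    @clock = clock
    @entries = {}
  end

  # Returns the cached value, rebuilding it with the block if missing or expired.
  def get(name)
    entry = @entries[name]
    if entry.nil? || @clock.call - entry.created_at >= @ttl
      entry = Entry.new(yield, @clock.call)
      @entries[name] = entry
    end
    entry.value
  end

  def size
    @entries.size
  end
end

now = 0.0
cache = TtlEnvironmentCache.new(180, clock: -> { now })
cache.get("production") { "parsed production" }
cache.get("feature_x") { "parsed feature_x" }      # requested once, never again

now = 1_000.0
cache.get("production") { "reparsed production" }  # expired, so it is rebuilt
puts cache.size  # feature_x is still counted: nothing requested it, so nothing evicted it
```

Nothing in this sketch ever walks the cache looking for stale entries, which is why an environment that is never requested again stays in memory until the process exits.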

Right now the master only supports three eviction strategies:

  1. Always (environment_timeout = 0) – any time the environment is requested the cached information is evicted and recreated.
  2. Timeout (environment_timeout = <duration>) – if the cached information is older than the configured timeout it is evicted and recreated, otherwise the cached information is used as-is.
  3. Never (environment_timeout = unlimited) – the cached information will never be evicted.

Each individual environment can have its own eviction strategy configured and so it can be tuned to how that particular environment will be used.
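The three strategies boil down to one decision made when an environment is requested. The helper below is a hypothetical sketch of that decision, not puppet's code; it assumes the timeout has already been converted to a number of seconds, and the environment names are made up.

```ruby
# Hypothetical helper: decides whether a cached environment should be evicted.
# environment_timeout is 0, a number of seconds, or the string "unlimited".
def cache_expired?(environment_timeout, age_in_seconds)
  case environment_timeout
  when "unlimited" then false                # Never: cached data is kept
  when 0 then true                           # Always: evicted on every request
  else age_in_seconds > environment_timeout  # Timeout: evicted once older than the limit
  end
end

# Each environment carries its own setting (example values only).
timeouts = { "production" => "unlimited", "test" => 180, "dev" => 0 }
timeouts.each do |name, timeout|
  puts "#{name}: evict a 10 minute old entry? #{cache_expired?(timeout, 600)}"
end
```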

How did this used to work?

Before directory environments, puppet used a slightly different mechanism for refreshing the parsed manifests. Whenever it encountered a new environment it would create a new Puppet::Node::Environment instance. This instance would never be thrown away. Instead, the Puppet::Node::Environment would track which manifest files it had parsed, along with the mtime of each file and the last time that mtime was checked. Whenever a catalog was requested, it checked whether any mtimes were last checked more than filetimeout seconds ago (filetimeout is a setting that can be specified in puppet.conf). For any recorded mtime that had expired, it would stat the corresponding file. If any of the new mtimes were different from the old mtimes, it would throw away all of the parsed manifests and begin parsing them all again.

That system led to several problems. First, it was much harder to explain and understand what would happen when (not all of the files time out at the same time, so changing one file might cause an immediate reparse while changing another won’t). Second, it relied heavily on stat calls, which we had found were often a bottleneck and something we want to avoid as much as possible in production systems. Third, the internal design relied heavily on tracking and changing a lot of state. It was simply harder for us to understand the code and ensure that it was correct.
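A rough sketch of that older bookkeeping shows how much state was involved. WatchedFiles is a hypothetical class written for illustration, not the original puppet code; the clock and the stat call are injectable so the example runs without touching the disk.

```ruby
# Hypothetical sketch of per-file mtime tracking with a recheck interval.
class WatchedFiles
  Record = Struct.new(:mtime, :checked_at)

  attr_reader :stat_calls

  def initialize(filetimeout, clock: -> { Time.now.to_f }, stat: ->(path) { File.mtime(path) })
    @filetimeout = filetimeout
    @clock = clock
    @stat = stat
    @records = {}
    @stat_calls = 0
  end

  # Remember a parsed file together with its mtime and the time it was checked.
  def watch(path)
    @records[path] = Record.new(read_mtime(path), @clock.call)
  end

  # True if any file whose last check is older than filetimeout has a new mtime.
  def changed?
    now = @clock.call
    changed = false
    @records.each do |path, record|
      next if now - record.checked_at <= @filetimeout
      current = read_mtime(path)
      changed = true if current != record.mtime
      record.mtime = current
      record.checked_at = now
    end
    changed
  end

  private

  def read_mtime(path)
    @stat_calls += 1
    @stat.call(path)
  end
end

now = 0.0
mtimes = { "site.pp" => 100, "nodes.pp" => 100 }
files = WatchedFiles.new(15, clock: -> { now }, stat: ->(path) { mtimes[path] })
files.watch("site.pp")
now = 10.0
files.watch("nodes.pp")    # first checked later, so its check expires later

mtimes["nodes.pp"] = 200   # nodes.pp is modified
now = 20.0
puts files.changed?        # only site.pp is due for a stat, so the change goes unnoticed
now = 30.0
puts files.changed?        # now nodes.pp is due, and the change is detected
```

Every file carries its own timer here, so when a change is picked up depends on which file was edited and when it was last checked.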

The new environment caching system is a lot easier for us to work with. There have been some bugs and difficulties as we have tried to refactor the existing code to use the new system, but in the end it will be much more controlled and understandable to everyone.

In a follow-up post I’m going to cover how to evaluate the puppet master’s workload for tuning the environment cache.
