This proposal is partly the result of the discussions we had at the 2016 TechOps offsite, and of my readings before and after it.
The goal of this proposal is to make an incremental shift in how we organize our puppet code in the tree, to better standardize the code structure and make it simpler to read and to debug. It draws ideas from most puppet best practices and should be immediately applicable to new code we write from now on, without disrupting the existing code.
General principles
- The code should be organized in roles, profiles, and modules. Let's see in detail what each of these things will mean.
- Modules should be basic units of functionality (e.g. "set up, configure and run HHVM") and must not use classes from other modules, and mostly avoid, wherever possible, to use defines from other modules as well. No hiera call, explicit or implicit, should happen within a module. These rules will ensure the amount of WMF-specific code that makes it into modules is minimal, and improve debugging/refactoring/testing of modules as they don't really depend on each other. Keeping up with the HHVM module example, the base class hhvm is a good example of what should be in a module, while hhvm::admin is a good example of what should not be in a module; surely not in this one: it is a class to configure apache to forward requests to HHVM, depends mostly on another module (apache, in fact) and also adds ferm rules, which of course requires the WMF-specific network::constants class.
- Profiles are collections of resources in modules that represent an high-level functionality ("a webserver able to serve mediawiki"), should only have parameters that default to explicit hiera calls. This means that a typical profile will either need all of its hiera variables declared, or to be included by an ENC (like horizon) with its parameters declared. Unless specifically needed, no resource should be added to a profile using the include class method, but with explicit class instantiations. If a profile need another one as a precondition, it must be listed with a require ::profile::foo at the start of the class, but this should be mostly avoided. Most of our current "roles" should be profiles. Following our example, an apache webserver that proxies to a local HHVM installation should be configured via a 'profile::hhvm::web' class; a mediawiki installation served through such a webserver should be configured via a 'profile::mediawiki::web' class.
- Roles should only include profiles; A role can include more than one profile, but no conditionals, hiera calls, etc are allowed. So following our example, we should have a 'role::mediawiki::web' that just includes 'profile::mediawiki::web' and 'profile::hhvm::web'. Inheritance can be used between roles, but it is strongly discouraged: for instance, it should be remembered that inheritance will not work in hiera.
- Any node in site.pp must include exactly one role, with the "role()" function. No exceptions allowed.
- Almost all hiera definitions should go in the "role" heirarchy. Only exceptions can be shared data structures that can be used by many profiles or to feed different modules with their data. Those should go in the common/$site global hierarchy. A good example is, in our codebase, ganglia_clusters, or any list of servers (the memcached hosts for mediawiki, for example).
- Per-host hiera should only be used to allow tweaking some knobs for testing or to maybe declare canaries in a cluster. It should not be used to add/subtract functionality. If you need to do that, add a new role and configure it accordingly, within its own hiera namespace.
- hiera keys SHOULD try to reflect the most specific common shared namespace of the puppet classes trying to look them up (implicitly or explicitly). This should allow easier grepping and avoid conflicts. Global variables SHOULD BE avoided as the profile will by definition be the least specific common namespace.
Expected advantages
This will have a series of advantages, namely:
- Make it easier to write general-purpose, non-wmf-specific modules
- Ease of debug: it will be easy to see where something happens (typically, the profiles that are part of your role), hiera calls will all be explicit in the code and mostly contained into a single file (the one for the role declared in site.pp). It will thus be pretty easy to track down a variable lookup
- Using configurable profiles will allow easier reuse of the same code when we have more than one cluster, see e.g. the use of our current role::elasticsearch in different practical roles
- Labs and production roles might differ without harm; so for instance a production role might include standard and base::firewall, while a labs role might include some different classes.
On the other hand, some drawbacks have to be expected:
- a stricter code structure will sometimes present its hurdles and make it harder to solve some problems
- given the limits we just imposed, sometimes it will be hard to abstract some common code to a separate class and this might cause some more repetitions.
- The distinctions between what is a profile and what is not can be ambiguous and create some controversy.
A practical example
If you look at the role::elasticsearch::* roles, and you rename role::elasticsearch::common to profile::elasticsearch::common, you'll have a pretty good idea of what to expect. Also, the old-style role::etcd can be compared to the newer role::etcd::{common,networking,kubernetes} that were written following more or less the role/profile pattern. It's easy to see how the old-fashioned role::etcd can be molded easily in deriving from the "profile" (role::etcd::common), while accomodating all the use-cases with the original role::etcd would have required adding more and more parameters, and using regexes to distinguish between roles.
Also, since swift has always been considered a benchmark for the complexity of our manifests, an example change molding the swift roles into using the proposed standard has been posted to gerrit and refers to this task.