OPA Design Patterns: Offline Configuration Authorization
An OPA design pattern, as detailed in a previous post, gives you an architectural solution to one or more common policy problems. In this blog post, we describe what we call the Offline Configuration Authorization design pattern for OPA.
Remember that each OPA design pattern covers the following information:
- Name: a name to make it easy to communicate how you use OPA
- Intent: reasons to use this design pattern
- Architecture and software: what software enforces OPA policies and how is OPA deployed and integrated with that software?
- Policy: what kinds of policies do people write and who writes them?
- External Data: what are the data requirements (e.g. sources of data, size, dynamicity, consistency)?
- Availability and Performance: what latency, throughput and availability does this pattern achieve?
- Security: what are the best practices for ensuring OPA policies are enforced properly?
- Concrete Problems: what are some real-world problems that this pattern solves?
Intent
Conceptually, the Offline Configuration pattern checks complex configuration files (e.g. Kubernetes, Terraform, Docker, Python) before they are deployed. Those configuration files could describe the settings for how a custom application should be deployed, or the intricate details of how to set up the compute, networking, storage and other resources that run those applications. The files could also describe how the CI/CD pipeline has been set up. In the end, modern configuration files are almost always structured files that can easily be turned into JSON and verified using OPA policies. Thinking of these configuration artifacts as “files” emphasizes when the verification is performed: offline, before the configuration is used to actually run the software.
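Even older, flat formats convert to JSON with little effort. The following minimal sketch uses Python's standard-library `configparser` to turn a small INI-style file into a JSON document that could be handed to OPA as input. The file contents and section names are hypothetical.

```python
import configparser
import json

# Hypothetical INI-style application configuration.
INI_TEXT = """
[storage]
bucket = logs.hooli.com
encrypted = true

[network]
port = 443
"""

def ini_to_json(text: str) -> str:
    """Parse INI text and re-serialize it as JSON (all values stay strings)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # Each section becomes a nested JSON object of key/value strings.
    data = {section: dict(parser[section]) for section in parser.sections()}
    return json.dumps(data, indent=2, sort_keys=True)
```

INI values come back as strings, so a policy would have to treat "443" and 443 consistently. Formats such as YAML or TOML keep more type information.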
Architecture and Software
The Offline Configuration pattern mostly runs OPA as a CLI tool: OPA runs just long enough to make a batch of policy decisions and then shuts down. That CLI tool can be run in several different ways:
- PR check. As a PR check in a CI/CD pipeline to ensure that all configuration files that merge into the repo meet the organization’s security, compliance and operational policies.
- Periodic scan. As a one-time or periodic scan of a source code repository, to understand what files fail to satisfy a new policy.
- Developer laptop. By a developer on her laptop as part of a unit test suite, so the developer knows long before she even opens a PR whether her configuration will pass those policy checks.
All three of those enforcement points are shown in the following diagram, which depicts a developer’s normal workflow using a CI/CD pipeline.
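A PR check or periodic scan can be a thin wrapper that shells out to the `opa eval` CLI for each file and fails the job when any deny message comes back. The sketch below makes some assumptions. The policy is in a hypothetical package `config` with a `deny` rule. Each input is a JSON file. OPA's JSON output nests values under `result[].expressions[].value`, and an undefined result appears with no `result` key.

```python
import json
import subprocess
from typing import List

def opa_eval_command(policy_path: str, input_path: str,
                     query: str = "data.config.deny") -> List[str]:
    """Build an `opa eval` invocation for one file.

    The package name `config` in the default query is hypothetical.
    """
    return ["opa", "eval", "--format", "json",
            "--data", policy_path, "--input", input_path, query]

def deny_messages(opa_json_output: str) -> List[str]:
    """Extract deny messages from `opa eval --format json` output.

    An undefined query produces no "result" key, treated here as no violations.
    """
    doc = json.loads(opa_json_output)
    messages: List[str] = []
    for result in doc.get("result", []):
        for expr in result.get("expressions", []):
            messages.extend(expr.get("value", []))
    return messages

def check_files(policy_path: str, input_paths: List[str]) -> int:
    """Evaluate each file and return a CI-style exit code (1 if any deny message).

    check=True raises if OPA itself fails (e.g. a policy parse error), so the
    job fails closed instead of silently passing.
    """
    failed = False
    for path in input_paths:
        proc = subprocess.run(opa_eval_command(policy_path, path),
                              capture_output=True, text=True, check=True)
        for msg in deny_messages(proc.stdout):
            print(f"{path}: {msg}")
            failed = True
    return 1 if failed else 0
```

The same script can serve all three enforcement points. A developer runs it locally, a CI job passes its return value to `sys.exit`, and a scheduled job points it at every file in the repository.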
Policy
Generally, each policy statement analyzes a single configuration file (aka “resource”) and makes a decision. In contrast, the classic authorization decision is made about a user, an action and a resource (a triple). For offline configuration, the decision is typically made just about the resource (the file) and ignores who is contributing that file and whether they are updating or creating it.
Some common policies for Offline Configuration are shown below:
- Compute: Ensure none of the binaries have root privileges within the operating system.
- Networking: Port 23 should never be open on production systems, and 443 should always be open for web servers.
- Storage: All storage should be encrypted at rest, unless an exception has been granted.
- Application: Tenants in the Free tier should never have the SSO feature flag enabled.
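The following sketch makes the resource-only decision concrete. It uses Python in place of Rego and implements two of the policies listed above. Each rule is a function of the parsed file alone, with no user or action. The evaluator takes the union of the rules' messages, much as multiple Rego `deny` rules contribute to one set. The field names (`environment`, `open_ports`, `kind`, `encrypted_at_rest`) are a hypothetical schema.

```python
from typing import Callable, Dict, List, Set

Resource = Dict  # one parsed configuration file (hypothetical schema)

def deny_telnet_in_prod(res: Resource) -> Set[str]:
    """Networking: port 23 must never be open on production systems."""
    if res.get("environment") == "production" and 23 in res.get("open_ports", []):
        return {f"{res['name']}: port 23 is open in production"}
    return set()

def deny_unencrypted_storage(res: Resource) -> Set[str]:
    """Storage: all storage must be encrypted at rest."""
    if res.get("kind") == "storage" and not res.get("encrypted_at_rest", False):
        return {f"{res['name']}: storage is not encrypted at rest"}
    return set()

RULES: List[Callable[[Resource], Set[str]]] = [
    deny_telnet_in_prod,
    deny_unencrypted_storage,
]

def evaluate(res: Resource) -> Set[str]:
    """Union of all rule outputs; the only argument is the resource itself."""
    messages: Set[str] = set()
    for rule in RULES:
        messages |= rule(res)
    return messages
```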
In the past, configuration files were relatively simple: often flat key-value pairs, perhaps written in a custom configuration language. The configuration files for cloud-native software systems, by contrast, can be exceedingly complex. Today configuration files often have many distinct sections, which have subsections, sub-subsections and so on, all of which assign values to variables, making JSON/YAML the choice du jour. These files may also include variable substitution, arrays of unknown size, and information encoded into strings that needs to be parsed for validation.
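Policies over such files usually have to search arbitrarily deep structure and parse values out of strings. The following minimal Python sketch shows both. It walks every leaf of a nested document and extracts the numeric port from any "host:port" string stored under a key named `endpoint`. That key name and layout are hypothetical.

```python
from typing import Any, Iterator, List, Tuple

def walk(node: Any, path: Tuple = ()) -> Iterator[Tuple[Tuple, Any]]:
    """Yield (path, value) for every leaf in arbitrarily nested dicts/lists."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from walk(child, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from walk(child, path + (index,))
    else:
        yield path, node

def endpoint_ports(config: Any) -> List[Tuple[Tuple, int]]:
    """Parse the port out of every "host:port" string under an `endpoint` key."""
    found = []
    for path, value in walk(config):
        if path and path[-1] == "endpoint" and isinstance(value, str) and ":" in value:
            # Split on the last colon; values with a non-numeric suffix are skipped.
            port = value.rsplit(":", 1)[1]
            if port.isdigit():
                found.append((path, int(port)))
    return found
```

Returning the path with each value lets a policy message tell the user exactly where in the file the problem is.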
For example, below we see two configuration files, one written for K8s and one written for a custom application.
[Example configuration files: Kubernetes Pod + Service | Application configuration]
What each policy looks like in Rego depends on the schema for the configuration file. As a policy author, you need to know what that schema is and what it means in the real world, and then you write OPA policies to validate the configuration file.
Below are two examples of policy: one for the k8s resources and one for the application. Both policies return a set of messages explaining to the user which of the organization’s rules and regulations the configuration file violates. If no messages are returned, the configuration file meets the policy requirements.
[Example policies. K8s policy: all load balancers must have a costcenter label. App policy: all storage must have the suffix “hooli.com”.]
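The two example policies can be approximated in Python as functions that return a set of messages, where an empty set means the file passes. The Kubernetes check follows the standard Service layout (`kind`, `metadata.labels`, `spec.type`). The application schema is unknown here, so the `storage` list and its `location` field are hypothetical.

```python
from typing import Dict, List, Set

def deny_k8s(resources: List[Dict]) -> Set[str]:
    """K8s policy: every Service of type LoadBalancer needs a `costcenter` label."""
    msgs = set()
    for res in resources:
        if res.get("kind") == "Service" and res.get("spec", {}).get("type") == "LoadBalancer":
            meta = res.get("metadata", {})
            if "costcenter" not in meta.get("labels", {}):
                msgs.add(f"Service {meta.get('name', '?')} is missing a costcenter label")
    return msgs

def deny_app(app_config: Dict) -> Set[str]:
    """App policy: every storage location must end with "hooli.com"."""
    msgs = set()
    # `storage` and `location` are hypothetical field names.
    for store in app_config.get("storage", []):
        location = store.get("location", "")
        if not location.endswith("hooli.com"):
            msgs.add(f"storage {location!r} does not end with hooli.com")
    return msgs
```

A plain suffix test also accepts names such as "evilhooli.com". A stricter policy might require ".hooli.com" or an exact domain match.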
Rule origins. An OPA policy for the offline configuration design pattern routinely has 10s, 100s or 1000s of these deny rules. Those rules come from a variety of places: governmental legislation (e.g. GDPR), industry standards (e.g. PCI-DSS), technology best practices (e.g. MITRE ATT&CK), organization-wide rules and team-level conventions.
Rule Decisions. While the examples above show only deny rules, it is common to see other modalities as well, such as a warn rule that sends a message back to the user but does not cause the policy check as a whole to fail. The rules can also return metadata about the failure as a JSON object that includes the message, a numeric severity, a link to a wiki and whatever additional metadata might help the user correct the problem.
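The Python sketch below models rule results that carry metadata instead of bare strings, and shows how warn and deny results combine into one pass/fail outcome. The field names, the severity scale and the wiki URL are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Finding:
    """One rule result carrying the kind of metadata described above."""
    rule: str        # hypothetical rule identifier
    level: str       # "deny" or "warn"
    msg: str
    severity: int    # higher means more severe (scale is an assumption)
    docs: str        # link to a wiki page explaining the fix

def passes(findings: List[Finding]) -> bool:
    """Warnings are reported to the user, but only deny findings fail the check."""
    return not any(f.level == "deny" for f in findings)

def report(findings: List[Finding]) -> List[str]:
    """Render findings as human-readable lines."""
    return [f"[{f.level.upper()}] {f.rule} (severity {f.severity}): {f.msg} -> {f.docs}"
            for f in findings]
```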
Rule Enforcement. Different organizations enforce rules to differing degrees. At one end of the spectrum, 1000s of rules are run on all applicable files, the results are sorted to identify the most crucial items to fix and only the most severe problems are enforced. At the other end of the spectrum, the administrative team curates the rules, carefully tunes them to match exactly what the organization needs, and provides an exception mechanism so that all rules can always be enforced.
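The "run everything, enforce only the worst" end of that spectrum can be sketched as a sort plus a threshold. In the following Python example, the severity scale and the `enforce_at` cutoff are assumptions that an organization would choose for itself.

```python
from typing import List, NamedTuple, Tuple

class Result(NamedTuple):
    rule: str
    msg: str
    severity: int  # hypothetical scale, e.g. 1 (low) to 10 (critical)

def triage(results: List[Result], enforce_at: int) -> Tuple[List[Result], List[Result]]:
    """Sort by severity (most severe first) and split into blocking vs advisory.

    Only results at or above `enforce_at` fail the check; everything else is
    reported but not enforced.
    """
    ordered = sorted(results, key=lambda r: r.severity, reverse=True)
    blocking = [r for r in ordered if r.severity >= enforce_at]
    advisory = [r for r in ordered if r.severity < enforce_at]
    return blocking, advisory
```

At the curated end of the spectrum, `enforce_at` would be set to the lowest severity so that every rule blocks, and the exception mechanism handles the legitimate outliers.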
External Data
For Offline Configuration, the external data dependencies are relatively easy to include in an OPA bundle because either there are none or they are hand-written files.
The common case requires no external data. Recall that in this design pattern, the policies only check whether the resource has been configured safely; the user responsible for the configuration rarely matters. So any relevant external data needs to contain information about the resource itself. Such data is typically too large (as a whole) to fit into memory and so must be fetched dynamically via OPA (e.g. through http.send). If 1000 OPA rules are checked against all the relevant files in a repository or a local directory, of which there could easily be 1000, making external callouts to retrieve data for all of those resources could impose a substantial load on datastores. As a result, we typically see many Rego rules and almost no external data for this design pattern.
Exceptions. The one exception to this rule of thumb (please forgive the pun) is external data that lists the exceptions that have been granted for OPA policies. Team B may need to disable rule 147, which prohibits changing public cloud security groups, for application 17, because the entire purpose of application 17 is to automatically configure security groups for other applications. Or Team C may need to disable rule 153 for application 2 because it is business-critical but requires running with root permissions.
Feeding that list of exceptions as a JSON file into OPA allows all the rules to be enforced, unless an exception has been granted. That kind of data is usually small and relatively static because it is written by hand and should be pruned as exceptions expire. It can easily be included in the OPA bundle.
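In the following Python sketch, a hand-written exceptions file is represented as JSON. Violations covered by an unexpired exception are dropped, and everything else is still enforced. The field names and the `expires` date are assumptions. The date check is one way to make expired exceptions stop applying even before someone prunes them from the file.

```python
import json
from datetime import date
from typing import List, Set, Tuple

# Hypothetical exceptions file; field names are assumptions.
EXCEPTIONS_JSON = """
[
  {"rule": "147", "app": "17", "expires": "2099-12-31"},
  {"rule": "153", "app": "2",  "expires": "2020-01-01"}
]
"""

def active_exceptions(raw: str, today: date) -> Set[Tuple[str, str]]:
    """Return (rule, app) pairs whose exception has not yet expired."""
    return {
        (e["rule"], e["app"])
        for e in json.loads(raw)
        if date.fromisoformat(e["expires"]) >= today
    }

def enforced(violations: List[Tuple[str, str, str]],
             exceptions: Set[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Keep (rule, app, msg) violations that no active exception covers."""
    return [v for v in violations if (v[0], v[1]) not in exceptions]
```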
Cross-file rules. One other type of external data arises when the policy cross-checks information in different files. For example, the LoadBalancer in the example above has the targetPort set to 9376. Ideally we’d want to check that the application running behind that K8s Service configures port 9376 as well. While these checks are feasible with OPA, at the time of writing we typically do not see them addressed in the Offline pattern because (a) knowing which infrastructure and app configuration files should be cross-checked is non-obvious and (b) checking the files individually is a big enough win on its own.
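The check itself is simple once you know which two files belong together. The following Python sketch takes that pairing as given. It also assumes that the application config lists its ports under a hypothetical `listen_ports` key. Named (string) targetPorts, which Kubernetes also allows, are skipped.

```python
from typing import Dict, List, Set

def target_ports(service: Dict) -> Set[int]:
    """Collect integer targetPort values from a K8s Service manifest."""
    return {p["targetPort"] for p in service.get("spec", {}).get("ports", [])
            if isinstance(p.get("targetPort"), int)}

def deny_port_mismatch(service: Dict, app_config: Dict) -> List[str]:
    """Flag Service targetPorts that the application does not listen on.

    Assumes the caller already knows which Service fronts which app, and that
    the app config uses a hypothetical `listen_ports` list.
    """
    app_ports = set(app_config.get("listen_ports", []))
    name = service.get("metadata", {}).get("name", "?")
    return [f"Service {name} targets port {port}, but the app does not listen on it"
            for port in sorted(target_ports(service) - app_ports)]
```

The pairing step is the hard part identified in (a) above, which this sketch leaves to the caller.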
Performance and availability
This Offline Configuration design pattern typically has fairly lax performance and availability requirements, compared to the other design patterns.
Generally, we see the following requirements for offline configuration checking:
- Latency:
  - Dev laptop: this is the most performance-critical case, because developers will run the checks frequently only if they are fast. Often this can be handled by running policies across just the files that have changed. Sub-second latency is reasonable for a single file.
  - PR checks and periodic scans: both can tolerate 10s or 100s of seconds to examine a single repository.
- Throughput: Parallelizing the requests across files to bring down latency by using multiple CPUs is helpful, but not required.
- Availability: Typically OPA is run as a CLI tool, so the availability of OPA is taken for granted. Availability of the management layer on top of OPA (e.g. to ensure OPA has an up-to-date copy of policies/data and can record decisions if that is important) is the bigger concern; retries along with fail-open and fail-closed have their place.
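Two of the latency tactics above are restricting the check to changed files and fanning the work out across files. Both can be sketched in a few lines of Python. The base branch name and the set of configuration suffixes are assumptions. `check` stands for any per-file function, such as one that runs `opa eval` in a subprocess. A thread pool suffices here because the threads mostly wait on those child processes, which run on separate CPUs.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Set

CONFIG_SUFFIXES = (".yaml", ".yml", ".json", ".tf")  # assumed config file types

def changed_config_files(paths: Iterable[str]) -> List[str]:
    """Keep only changed paths that look like configuration files (deduplicated)."""
    return sorted({p for p in paths if p.endswith(CONFIG_SUFFIXES)})

def git_changed_paths(base: str = "origin/main") -> List[str]:
    """List files changed relative to `base` (the branch name is an assumption)."""
    out = subprocess.run(["git", "diff", "--name-only", f"{base}...HEAD"],
                         capture_output=True, text=True, check=True).stdout
    return [line for line in out.splitlines() if line]

def check_in_parallel(paths: List[str], check: Callable[[str], Set[str]],
                      workers: int = 4) -> Dict[str, Set[str]]:
    """Run `check` on each file concurrently and map file -> deny messages."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(check, paths)))
```

On a laptop, `changed_config_files(git_changed_paths())` keeps the run small. A periodic scan would pass every file in the repository instead.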
Security
Security requirements for Offline Configuration are also quite lax compared to other design patterns. Most teams we see:
- Assume a PR check will be able to securely run OPA as a CLI process, hand it the appropriate inputs, and act appropriately on the OPA policy decisions.
- Assume that any automated or manual process that carries out a periodic scan of a repository is highly trusted, just like a PR check.
- Recognize that policy checks on a developer laptop are a user-experience improvement instead of a security control. The sooner developers see problems, the quicker they can fix them.
Concrete Problems
There are several concrete problems that we see solved with the Offline Configuration design pattern.
Platform GitOps. As a platform engineer embracing GitOps who wants versioning and rollback for the resources on my cloud (k8s, public cloud, etc.), I want to automatically check all the resource files the developers want to merge into the source repository so that when my automated processes instantiate that repository in the cloud, all the resources obey my organization’s configuration requirements without my team having to manually review those files.
Automated Security. As a security engineer, I want a framework for automatically enforcing security policies consistently across all development teams without manual reviews so that the business can deliver value to customers as quickly as developers can write software AND I have ensured the security and integrity of that software and the platform it runs on.
Wiki-as-code. As an application development team that wants to help onboard new developers quickly, I want a framework for writing unit tests against configuration files so that our new developers quickly learn how to configure platforms and apps to work in our environment.
Summary and Resources from Styra
In this post, you’ve learned about the Offline Configuration design pattern:
- Architecture and Software: OPA runs as a CLI on a developer laptop, as a PR check or as a periodic scan of a source code repository.
- Policy: Identify resources that violate policy and explain to the user why they violate policy.
- Data: Typically none.
- Performance and availability: sub-second latency for a single file and 10s to 100s of seconds for a whole repository.
- Security: Handled by the environment.
- Concrete Solutions: Platform GitOps, Automated Security, Wiki-as-code.
To see some of this in action, check out some resources from Styra:
- Styra Academy: learn how to write policy for configuration authorization use cases
- Styra DAS Free, which includes:
  - A general-purpose OPA control plane that helps you write, test, distribute and monitor policy, and record and analyze OPA decisions.
  - Terraform system type: specialized support to help you run policy checks for Terraform plans on laptops and in CI/CD pipelines.