OPA in Production – How Reddit and Miro Built Enterprise Authorization with OPA
Two web-scale companies have recently shared how they solved mission-critical authorization challenges using Open Policy Agent (OPA). These accounts validate what we’ve built with OPA and provide valuable blueprints for engineers tackling similar challenges. We consider them required reading for anyone considering or using OPA at scale.
In this post we review these two case studies to highlight common patterns and important differences.
Reddit – Building Fine-Grained Authorization
“OPA has Go bindings which make it easy to integrate into our Go service”
Evolving Authorization for Our Advertising Platform
Reddit’s advertising team needed to solve the problem of fine-grained, customer-defined authorization on their B2B advertising platform. While most Reddit users have no need for custom authorization policies, advertisers are different. Each advertiser can have many employees with different access rights to the platform. It’s also common for rights to advertising accounts to be delegated to agencies, layering on further requirements.
Expressing policy in such a domain poses an interesting challenge and a compelling case for exactly what OPA offers.
Miro – Deconstructing the Monolith
“Rego lets us write policy code in a way that is constrained to that purpose alone. […] It’s just pure and declarative, and it actually simplifies the authorization logic a lot”
How Miro leverages Open Policy Agent to implement authorization-as-a-service
In recent years, Miro has been in a phase of rapid growth. Miro’s engineering teams need to migrate away from their monolithic architecture to move faster as a larger organization. Authorization functionality is an important piece of that puzzle.
…teams need a solution for authorization because it’s foundational
Breaking out authorization functionality accelerates breaking out other services, as it’s a common foundational dependency. This has meant completely rethinking how authorization is provided to the newly created microservices.
Miro is a real-time platform, which makes the data needed for authorization decisions a fast-moving target. This poses another enterprise-tier challenge and is another great example of how OPA can deliver in a demanding domain.
Why OPA?
Perhaps the most interesting question is… why? Why did both teams – with admittedly very different challenges – choose OPA in the first place?
Decoupling Authorization
Both articles mention the need to decouple authorization logic from services’ business logic. This allows policy updates to happen out-of-band in addition to relieving new services of authorization responsibilities.
Rego: The Way to Go
They also agree on the advantages of writing authorization rules in Rego instead of a general-purpose language. Using a domain-specific policy language allowed Reddit to keep “policy logic strongly isolated from the application logic”. They also stressed the need for flexibility: “the solution must be adaptable”.
The Miro team also makes reference to how Rego helps focus when writing policy code:
“Rego lets us write policy code in a way that is constrained to that purpose alone”
Low Latency
Both Reddit and Miro need fast response times for their use cases. The Reddit team was able to achieve very consistent response times. Miro also went deeper on this aspect of the problem and built a custom system to support their use case:
“we believe we’re going to be able to handle any realistic bursty-event scenario we’re likely to encounter in production when this goes live”
Live Policy Updates
Updating policies without redeploying the authorization service was important to Miro, so they use the Bundle API to pull updated policy bundles from S3. Reddit opted to store Rego code alongside the authorization service’s code and redeploy the service when policies change. They write that this isn’t optimal and that they are considering moving to the Bundle API in the future.
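As a rough sketch of what bundle polling can look like (not Miro’s actual configuration), an authorization service can run OPA in-process via the Go SDK and point it at a bundle service; the bucket URL, bundle resource and decision path below are illustrative placeholders:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"strings"

	"github.com/open-policy-agent/opa/sdk"
)

func main() {
	ctx := context.Background()

	// OPA downloads and refreshes policy bundles based on this config.
	// The service URL and bundle resource are placeholders, not a real bucket.
	config := `{
		"services": {
			"bundle_store": {"url": "https://example-policy-bundles.s3.amazonaws.com"}
		},
		"bundles": {
			"authz": {
				"service": "bundle_store",
				"resource": "bundles/authz.tar.gz",
				"polling": {"min_delay_seconds": 30, "max_delay_seconds": 60}
			}
		}
	}`

	opa, err := sdk.New(ctx, sdk.Options{Config: strings.NewReader(config)})
	if err != nil {
		log.Fatal(err)
	}
	defer opa.Stop(ctx)

	// Evaluate a decision against whatever bundle revision is currently loaded.
	result, err := opa.Decision(ctx, sdk.DecisionOptions{
		Path:  "/authz/allow", // hypothetical rule path inside the bundle
		Input: map[string]interface{}{"user": "alice", "action": "read"},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("allowed:", result.Result)
}
```

Because the bundle plugin polls in the background, new policy revisions are picked up without restarting the service.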
Auditability
Reddit mentions auditability as a reason to adopt OPA and notes that they may adopt the Decision Logging feature in the future.
High decision throughput and frequent data changes
Both teams faced the challenge of building authorization for extremely high-traffic web applications where users’ permissions can change at any moment. This leads to Rego rules that depend heavily on data describing the current context in which the user operates.
Interestingly, both teams arrived at a similar solution: wrap OPA with their own authorization service. Authorization clients don’t communicate with OPA directly; they talk to an in-house authorization service instead. In both cases this service pre-loads information relevant to the authorization request and passes it to OPA in the /v1/data POST request at policy evaluation time (see “input overloading” in the OPA docs).
Input Overloading Model. Credit: OPA Docs
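To make the pattern concrete, here is a minimal, hypothetical Go client that gathers the relevant context up front and sends it all in the input document of a /v1/data request; the policy path and input fields are assumptions for illustration, not either team’s actual schema:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// authzInput bundles everything the policy needs, so OPA itself stores no data.
// The fields are illustrative only.
type authzInput struct {
	User     string   `json:"user"`
	Action   string   `json:"action"`
	Resource string   `json:"resource"`
	Roles    []string `json:"roles"` // pre-loaded from the service's own datastore
}

func isAllowed(in authzInput) (bool, error) {
	body, err := json.Marshal(map[string]interface{}{"input": in})
	if err != nil {
		return false, err
	}

	// POST the overloaded input to OPA's Data API for a single decision.
	resp, err := http.Post("http://localhost:8181/v1/data/authz/allow",
		"application/json", bytes.NewReader(body))
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	var out struct {
		Result bool `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return false, err
	}
	return out.Result, nil
}

func main() {
	allowed, err := isAllowed(authzInput{
		User:     "alice",
		Action:   "update_campaign",
		Resource: "ad_account:42",
		Roles:    []string{"campaign_editor"},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("allowed:", allowed)
}
```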
Using input overloading for submitting data to OPA has several advantages:
- Strong auditability: the data is part of the input, so requests can easily be replayed from decision logs.
- OPA will not consume memory for storing data.
- Data is always fresh.
Some of the downsides to this approach we see are:
- A custom layer is needed to load the data and interact with OPA, which can add latency when using a standalone OPA.
- Fetching data adds latency to requests and introduces new failure scenarios – what happens when the datastore is down?
There are some other solutions to this problem to consider:
- The Bundle API can stream partial updates to OPAs and happens outside the hot path of requests.
- Enterprise OPA from Styra can also consume live data updates from Kafka. Once the data is received and optionally formatted, it is ready for any requests the instance needs to process.
- It’s also possible to query some data sources at policy evaluation time using http.send (and sql.send in Enterprise OPA). This can be helpful when a dataset is too large to load into OPA or to pass in the input.
One step further: Embedding OPA
For their real-time use case, the team at Miro implemented a streaming mode for authorization, in which decisions are cached by client services and those caches are updated via push notifications from the authorization service.
While this adds complexity, it supports their need for sub-millisecond decision making in client services while maintaining very high throughput. To make this possible, the Authz Service embeds OPA using the OPA Go SDK so it can re-evaluate many decisions when the data changes. This way Miro is able to use OPA for a use case it wasn’t designed for, which, as they note in their article, is a testament to OPA’s flexibility.
Miro’s streaming architecture. Credit: Miro
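For a flavour of what embedded evaluation looks like (a simplified sketch using OPA’s rego Go package, not Miro’s implementation), a service can prepare a query once and then evaluate it in-process for many inputs with no network hop; the policy and input shape below are invented for illustration:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/open-policy-agent/opa/rego"
)

// A toy policy (rego.v1 syntax), inlined for illustration; a real service
// would load its own modules, e.g. from a bundle, rather than hard-code them.
const policy = `
package authz

import rego.v1

default allow := false

allow if {
	input.user in input.board.members
}
`

func main() {
	ctx := context.Background()

	// Prepare the query once; parsing and compilation happen here, so each
	// subsequent Eval is a cheap in-memory evaluation.
	query, err := rego.New(
		rego.Query("data.authz.allow"),
		rego.Module("authz.rego", policy),
	).PrepareForEval(ctx)
	if err != nil {
		log.Fatal(err)
	}

	// Re-evaluating a decision is an in-process call, which is what makes
	// recomputing many cached decisions on data changes feasible.
	input := map[string]interface{}{
		"user":  "alice",
		"board": map[string]interface{}{"members": []string{"alice", "bob"}},
	}
	rs, err := query.Eval(ctx, rego.EvalInput(input))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("allowed:", rs.Allowed())
}
```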
This is a custom solution to a problem shaped not only by Miro’s product but also by their ongoing journey to a new architecture. To achieve similar results, we encourage you to try out Enterprise OPA’s Kafka functionality, the gRPC API, or embedding Enterprise OPA in your application.
Conclusions
It’s great to see the value that standardizing on OPA has delivered in these two enterprise use cases. From a flexible data model supporting multi-tenancy at Reddit to an embeddable policy engine delivering low-latency authorization at Miro, it’s fair to say the common factor OPA offers both companies is flexibility.
We’ve built a general-purpose policy engine in OPA, and we’ve been learning from high-scale, production use cases like these for some time here at Styra. Bit by bit, we’ve been building solutions to customer problems into Styra DAS and Enterprise OPA to make these offerings compelling starting points for enterprise teams. Get in touch if you’d like to learn how to hit the ground running with OPA, or if you have a use case you think isn’t supported by OPA and want some expert guidance.