Saul O'Driscoll (dot) com

You don't need to write your own Kubernetes Operator

No I really do, you don’t understand ✨☸️✨

When talking to customers or discovering their infrastructure, I often come across their operators or their desire for one.

To many, it seems like a silver bullet that will tie together Kubernetes and bend it to their will.

I’m not saying that no one should write their own operators… However, teams should carefully consider the implications of the operator journey.

Usually the desire for an operator comes from a place of:

  • Not understanding what can be done with K8s primitives and existing tooling
  • Not having done enough research into the CNCF landscape and what is already out there, which leads to teams reinventing the wheel

Alternative #1 to rolling your own operator: Policies (Kyverno, OPA Gatekeeper)

To give a simple example, let’s take a multi-tenant Kubernetes cluster. A common pattern is for platform teams to provision extra resources in a namespace before it is handed over to the team that will use it.

This can easily be done with Kyverno, which can generate new resources in response to an admission webhook request.

PSA: As of now, OPA Gatekeeper cannot generate Kubernetes resources. OPA Gatekeeper can only block and mutate resources but is nonetheless incredibly powerful and commonplace.
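To make the "mutate" part concrete, here is a minimal sketch of a Gatekeeper mutation, modeled on the Assign examples in the Gatekeeper documentation. The policy name is made up for illustration, and the sketch assumes mutation is enabled in your Gatekeeper installation.

```yaml
apiVersion: mutations.gatekeeper.sh/v1
kind: Assign
metadata:
  name: always-pull-images   # hypothetical name
spec:
  # Resource types this mutator is allowed to modify
  applyTo:
  - groups: [""]
    kinds: ["Pod"]
    versions: ["v1"]
  match:
    scope: Namespaced
    kinds:
    - apiGroups: ["*"]
      kinds: ["Pod"]
  # Set imagePullPolicy on every container of every matched Pod
  location: "spec.containers[name:*].imagePullPolicy"
  parameters:
    assign:
      value: Always
```

No custom controller is involved: the change is applied at admission time, before the Pod is persisted.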

Here’s an example of a ClusterPolicy that generates a PVC upon creation of a namespace. It is taken from a KubeCon 2022 demo repo about supply chain security with Tekton and Kubernetes.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: generate-pvc
  annotations:
    policies.kyverno.io/title: Generate PVC
    policies.kyverno.io/severity: medium
    policies.kyverno.io/subject: Namespace
    kyverno.io/kyverno-version: 1.7.2
    policies.kyverno.io/minversion: 1.6.0
    kyverno.io/kubernetes-version: "1.23"
    policies.kyverno.io/description: >-
      Generate default PVC
spec:
  rules:
  - name: create-pvc
    match:
      resources:
        kinds:
        - Namespace
    exclude:
      resources:
        namespaces:
        - common
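    generate:
      # Create this PVC in every newly created Namespace
      kind: PersistentVolumeClaim
      name: demo-pvc
      namespace: "{{request.namespace}}"
      # Keep the generated PVC in sync with this policy
      synchronize: true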
      data:
        spec:
          accessModes:
          - ReadWriteMany
          resources:
            requests:
              storage: 5Gi

If you take a look at the linked repo, you will see many creative ways of generating resources, labeling namespaces and adding constraints with Kyverno automagically.

Keep in mind that Kyverno can also generate custom resources, not just built-in Kubernetes primitives, so you can combine it with your other Kubernetes tools such as External Secrets Operator, Crossplane or Argo CD.

You aren’t limited to the classic Kubernetes API resources. This can be a very potent combination.
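As a rough sketch of that combination, the policy below follows the same generate pattern as the PVC example but emits an External Secrets Operator resource instead. It assumes ESO is installed, that Kyverno has been granted RBAC permissions to create ExternalSecret objects, and that a ClusterSecretStore already exists. The names team-secrets and shared-store and the remote key are hypothetical, and the ExternalSecret apiVersion depends on the ESO release you run.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: generate-external-secret   # hypothetical name
spec:
  rules:
  - name: create-external-secret
    match:
      any:
      - resources:
          kinds:
          - Namespace
    generate:
      # apiVersion tells Kyverno which custom resource API to use
      apiVersion: external-secrets.io/v1beta1   # assumption: check your ESO version
      kind: ExternalSecret
      name: team-secrets
      # The new Namespace's own name
      namespace: "{{request.object.metadata.name}}"
      synchronize: true
      data:
        spec:
          refreshInterval: 1h
          secretStoreRef:
            kind: ClusterSecretStore
            name: shared-store          # hypothetical store
          target:
            name: team-secrets          # Secret that ESO will create
          data:
          - secretKey: api-token
            remoteRef:
              key: platform/api-token   # hypothetical key in the backing store
```

Kyverno only creates the ExternalSecret; fetching the value and writing the actual Secret stays the job of External Secrets Operator.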

Generating Cloud Resources with Custom Operators / Custom Resources

This is also a common pattern. Platform teams want to create resources beyond the Kubernetes cluster such as S3 buckets or other cloud resources. Often these will be tied to resources within the cluster such as a namespace, deployment or service account.

The idea behind operators is a powerful one. Using something like the Operator SDK to hook into the Kubernetes reconciliation loop gives you a huge benefit: Kubernetes manages the lifecycle of your resources through its API, which is a big reason why many orgs have adopted it.

Writing an operator that generates cloud resources from the ground up requires making calls to cloud provider APIs and covering many edge cases around proper deletion and managing identities / service accounts.

What many don’t know is that there are already tools like Crossplane (extended by Upbound CRDs to provide an even larger array of resources) and others like Pulumi and the somewhat newer Winglang. They all differ slightly and have different use cases, but all have had lots of thought put into their design.

These tools are developed by open source communities that spend many engineering hours considering best-practice implementation details and cover an enormous number of resources that can be generated with YAML (or via programming languages) in your cluster. They are often tried and tested and have an existing community of support (or even enterprise support) around them.
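To give a feel for what this looks like in practice, here is a minimal sketch of an S3 bucket declared as a Crossplane managed resource. It assumes the Upbound AWS S3 provider is installed and that a ProviderConfig named default holds the credentials; the bucket name and region are hypothetical, and the apiVersion may differ depending on the provider version.

```yaml
apiVersion: s3.aws.upbound.io/v1beta1   # assumption: depends on provider version
kind: Bucket
metadata:
  name: team-a-artifacts   # hypothetical bucket name
spec:
  forProvider:
    region: eu-west-1      # hypothetical region
  # Delete the bucket in AWS when this resource is deleted;
  # Orphan would leave the bucket in place instead
  deletionPolicy: Delete
  providerConfigRef:
    name: default
```

The cloud API calls, the deletion handling and the credential wiring mentioned above are all covered by the provider rather than by code you have to write and maintain.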

Things to consider when rolling your own operators

If you still find yourself wanting to roll your own operator, here are some things to consider.

  1. Maintenance Overhead and Costs
  2. Continuous feature delivery and support
  3. Documentation

Maintenance Overhead and Costs

Operators are hard to get right but even harder to maintain. Whether or not they connect to upstream APIs or just manipulate resources within the cluster, they will need tending to.

You may have some solution up and running, but then comes a change to one of the upstream APIs (if it connects to one) which breaks your operator.

This is an endless battle for anyone maintaining an operator. It is important to include this in the total cost of developing and maintaining the operator.

As the operator potentially grows in terms of features and complexity, so will the maintenance overhead and cost. Make sure there is a team and budget for it before building a platform or product around a custom operator.

Continuous feature delivery and support

Agile software development is centered around the idea that software is ever evolving. There will be feature requests, and v1.0 will likely be imperfect and far from the final version.

Users will have requests, there will be edge cases and there will be bugs. This burden will fall upon the people in charge of maintaining the operator.

An unspoken requirement here is that there is enough knowledge within the team to integrate new features and offer support to users. This brings me to my last point.

Documentation

Often the bane of many a developer’s existence, documentation is the lifeblood of most software. Bad documentation can be a death blow to software adoption before it even gets off the ground.

It’s important that good documentation is at the heart of your shiny new operator, because I’ve seen cases where the sole developer of a Kubernetes operator leaves the team and takes all their knowledge with them.

What happened was that the operator, which had turned out to be vital to the functioning of the platform, eventually had to be deprecated because no one wanted to or was able to maintain it.

Conclusion

I’m a huge fan of the operator pattern in Kubernetes and use operators every day. It is just important to understand what you are doing when you embark on the journey of creating and running an operator. Operators can be costly and annoying to maintain if you don’t know what you are getting into.

Here’s an interesting talk about an operator used for Securing 1/3 of Norway’s Annual State Budget. It’s a big one; however, it has a clearly defined scope and a dedicated team of motivated developers behind it. With their operator, called naiserator, the platform team enables users to define a massive Kubernetes application in one massive custom resource, and everything else is taken care of by their nais.io operator. Developers can get up and running on Kubernetes very quickly as long as they bring their own container images.

They knew what they were getting into, pros and cons included, and they have a dedicated team maintaining it. It’s a great example of the right way to execute an operator, even if it’s a somewhat monolithic approach to Kubernetes.