
As we shared in our earlier post on FluxCD, RBC Capital Markets has been on a deliberate journey to modernize our Kubernetes platform. GitOps with FluxCD gave us a solid deployment foundation. But as our platform grew (today we operate over 50 clusters spanning on-premises VMware environments and multiple clouds), we hit a set of problems that no single off-the-shelf tool was designed to solve together: How do you manage the lifecycle of the clusters themselves? How do you ensure every node is reproducible and tamper-evident at boot? And how do you integrate Kubernetes service discovery with enterprise DNS infrastructure without every record change going through a ticket queue?
This post is about the projects that answered those questions for us, and what we learned building with them inside a regulated financial institution.
The challenge: Platform engineering at scale in a regulated environment
Managing 50+ Kubernetes clusters across hybrid infrastructure is not just an operational challenge; in capital markets it is also a compliance challenge. SOX, PCI-DSS, and Basel III create real requirements around auditability, configuration drift prevention, and network segmentation. Our platform teams cannot afford to have snowflake nodes, undocumented cluster state, or manual DNS records that accumulate over years.
When we stepped back and looked at what we were spending engineering effort on, three gaps stood out:
- Node configuration drift: VM-based nodes that had been patched and mutated over time were becoming impossible to reason about.
- Cluster provisioning: spinning up new clusters for trading desks or risk teams was a multi-day manual exercise with no single source of truth.
- DNS integration: every new service or ingress endpoint required a manual ticket to our network team, creating a bottleneck and an audit trail that lived outside our GitOps workflow.
We decided to solve each of these from the ground up, using cloud-native projects where they existed and building our own where they did not.
Kairos: Immutable OS for nodes you can trust
The first piece of the puzzle was node immutability. We evaluated several approaches, but Kairos, a CNCF Sandbox project, aligned most directly with what we needed: a Linux distribution designed from first principles to be immutable, declaratively configured, and reproducible.
With Kairos, every node in our fleet boots from an OCI image. That image is built from a known base (in our case RHEL-derived), baked with our approved security configuration, and published to our internal registry. The cloud-config model lets us define node behaviour (SSH keys, network configuration, SSSD authentication against our Active Directory, and Kubernetes agent registration) as versioned YAML that flows through FluxCD just like any other platform component.
A CI/CD pipeline for operating system images
One of the less-discussed challenges of immutable infrastructure is the discipline it demands around image build and validation. We treat our Kairos images exactly like application container images: every change triggers a GitHub Actions pipeline that builds the image, runs integration tests against a live VM, and publishes a new OCI tag only on a clean pass. Nightly builds catch upstream regressions in base packages or the Kairos framework itself before they reach production.
This means our node image pipeline has the same properties we expect from application CI:
- Every commit is tested end-to-end, not just linted or statically analysed.
- Nightly runs validate that the current pinned base image and package set still produce a bootable, correctly configured node.
- OCI tags are immutable artefacts. A tag that passed integration tests is never modified; rollback is a matter of pointing to a prior tag.
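The publish-on-pass and immutable-tag rules above can be sketched in a few lines of Python. This is an illustrative in-memory model, not our pipeline code and not a real registry client: the TagRegistry class, its method names, and the digest strings are all hypothetical.

```python
class TagRegistry:
    """In-memory stand-in for an OCI registry that enforces immutable tags."""

    def __init__(self):
        self._tags = {}      # tag -> image digest
        self._history = []   # tags in publication order

    def publish(self, tag, digest, tests_passed):
        # Gate 1: only images that passed integration tests are published.
        if not tests_passed:
            return False
        # Gate 2: a tag that has been published is never modified.
        if tag in self._tags:
            raise ValueError(f"tag {tag!r} is immutable and already published")
        self._tags[tag] = digest
        self._history.append(tag)
        return True

    def digest(self, tag):
        return self._tags[tag]

    def rollback_target(self, current_tag):
        # Rollback is just a pointer to the tag published before the current one.
        index = self._history.index(current_tag)
        if index == 0:
            raise LookupError("no earlier tag to roll back to")
        return self._history[index - 1]
```

The point of the sketch is that rollback never rebuilds anything: it only selects an artefact that already passed its tests.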
Kubernetes-native VM provisioning with VirtRigaud
The other half of the VMware story is how we actually provision VMs from our Kairos images. Rather than reaching for imperative vSphere tooling, we use VirtRigaud, a Kubernetes operator that provides declarative VM management across multiple hypervisors (vSphere, Libvirt/KVM, and Proxmox) through a unified CRD API.
The model is straightforward: our Kairos-built OCI image is registered as a VMImage CRD, and VMs are expressed as VirtualMachine CRDs referencing that image. FluxCD reconciles these manifests like any other platform resource. The result is that provisioning a new Kairos node on vSphere is semantically identical to deploying a workload: it is a pull request that is reviewed, merged, and reconciled automatically.
VirtRigaud’s remote provider architecture also fits our security requirements well: provider credentials are isolated to their own pods, and the controller communicates with them over gRPC/TLS rather than embedding hypervisor credentials centrally.
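Because VMs reference images by name, one check that fits naturally into pull-request review is confirming that every VirtualMachine points at a VMImage that exists in the same change set. The sketch below assumes manifests have already been parsed into dictionaries. The kind names come from the text above, but the spec.imageRef.name field path is an assumption made for illustration and may not match VirtRigaud's actual schema.

```python
def unresolved_image_refs(manifests):
    """Return (vm_name, image_name) pairs whose image has no matching VMImage."""
    image_names = {
        m["metadata"]["name"] for m in manifests if m["kind"] == "VMImage"
    }
    missing = []
    for m in manifests:
        if m["kind"] != "VirtualMachine":
            continue
        # Hypothetical field path; adjust to the real CRD schema.
        ref = m["spec"]["imageRef"]["name"]
        if ref not in image_names:
            missing.append((m["metadata"]["name"], ref))
    return missing
```

A check like this would run before merge, so a dangling reference is caught in review rather than during reconciliation.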
The operational shift this created was significant:
- Drift is eliminated by design. There is no apt or yum running on production nodes. If a configuration change is needed, a new image is built and integration-tested, and nodes are rolled.
- Audit trails become trivial. Because every node’s configuration is an OCI digest in a registry and every VM is a versioned CRD in Git, we can answer “what was running on that node on that date?” with precision.
- VMware integration is fully GitOps-native. Nodes are provisioned, updated, and decommissioned through the same GitOps workflow as everything else on the platform.
The learning curve was real: getting kernel modules, NetworkManager, and enterprise authentication (SSSD/AD) right inside an immutable image took iteration. But once solved, the result is a node foundation we can genuinely trust, which matters when regulators ask questions.
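To make the audit question concrete ("what was running on that node on that date?"), here is a minimal sketch of the lookup, assuming each node roll is recorded as a timestamp and an image digest. The NodeImageHistory class and its method names are hypothetical; in practice the same information would come from Git history and the registry.

```python
from bisect import bisect_right
from datetime import datetime


class NodeImageHistory:
    """Records which image digest each node was rolled to, and when."""

    def __init__(self):
        self._events = {}  # node name -> list of (timestamp, digest), kept sorted

    def record(self, node, when, digest):
        events = self._events.setdefault(node, [])
        events.append((when, digest))
        events.sort()

    def digest_at(self, node, when):
        # Find the most recent roll at or before the requested time.
        events = self._events.get(node, [])
        timestamps = [t for t, _ in events]
        index = bisect_right(timestamps, when)
        if index == 0:
            return None  # node had not been provisioned yet
        return events[index - 1][1]
```

The lookup is only this simple because nodes are never mutated between rolls; with in-place patching, a digest would not describe what was actually running.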
k0rdent: Cluster lifecycle management as a platform
Immutable nodes solved the “what is running” problem. But we still needed to answer “how do clusters get created, updated, and decommissioned?” consistently across our entire fleet.
k0rdent, built on Cluster API (CAPI), gave us a Kubernetes-native control plane for managing Kubernetes clusters. Rather than treating cluster provisioning as a bespoke scripting exercise, k0rdent models clusters as CRDs. Combined with k0smotron for in-cluster control planes, we can now express our entire cluster topology declaratively, and FluxCD reconciles that state continuously.
Our choice of Kubernetes distribution for workload clusters was k0s, a CNCF Sandbox project. k0s is a fully self-contained, single-binary Kubernetes distribution with no host OS dependencies beyond the kernel. That property matters a great deal when your nodes are running an immutable OS: k0s installs cleanly into a Kairos image without requiring package managers, systemd unit file manipulation at runtime, or any of the host-level assumptions that kubeadm-based installations make. The combination of Kairos and k0s gives us a full node-to-cluster stack where every component is declaratively expressed, OCI-packaged, and reproducible from a clean boot.
k0smotron extends this further by allowing Kubernetes control planes to run as workloads inside the management cluster, meaning even the control plane is expressed as a CRD, reconciled by FluxCD, with no out-of-band state.
The architecture we settled on organizes clusters into a hub-and-spoke model:
- A management cluster runs k0rdent, k0smotron, and the CAPI controllers.
- Workload clusters run k0s, provisioned and decommissioned through CRD manifests stored in Git.
- MetalLB handles load-balancing on bare-metal segments; Traefik provides ingress with consistent configuration across all spoke clusters.
Beyond day-one provisioning, this approach transformed how we handle day-two operations:
- Cluster upgrades are a pull request. The desired Kubernetes version is updated in a manifest, reviewed, and FluxCD applies it. There is no “who ran what command on which cluster” ambiguity.
- Cluster templates let us standardize configurations for common use cases (trading desk clusters, risk compute clusters, tooling clusters) and spin up new instances in minutes rather than days.
- Compliance posture is consistent by default. Because every cluster is expressed as code, our CEL-based admission webhooks and RBAC policies are applied uniformly at cluster creation time rather than bolted on after the fact.
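The template and upgrade workflow above can be illustrated with a small sketch. It is not k0rdent's API: the template names, field names, and version strings below are hypothetical placeholders, chosen only to show how a template constrains what a cluster definition may contain and how an upgrade reduces to a reviewable diff.

```python
import copy

# Hypothetical templates; the fields and values are illustrative only.
TEMPLATES = {
    "trading-desk": {"kubernetesVersion": "v1.31.2", "workers": 6},
    "tooling": {"kubernetesVersion": "v1.31.2", "workers": 3},
}


def render_cluster(name, template, overrides=None):
    """Build a cluster definition from a template, rejecting unknown fields."""
    spec = copy.deepcopy(TEMPLATES[template])
    for key, value in (overrides or {}).items():
        if key not in spec:
            raise KeyError(f"{key!r} is not a field of template {template!r}")
        spec[key] = value
    return {"name": name, "template": template, "spec": spec}


def changed_fields(old, new):
    """Return {field: (old_value, new_value)} for a pull-request style diff."""
    return {
        key: (old["spec"].get(key), value)
        for key, value in new["spec"].items()
        if old["spec"].get(key) != value
    }
```

Rejecting fields the template does not define is one way to keep "just a YAML file" from becoming its own source of drift.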
We are also using k0rdent as the foundation for a spot-computing scheduler that allows donated physical server capacity to be absorbed dynamically into our platform, a capability we plan to share more about in a future post.
bindy: Kubernetes-native DNS operations
The last gap, and the one where no existing project fully covered our requirements, was DNS. In capital markets, DNS is not a commodity concern. Our trading applications, market data feeds, and risk systems use DNS extensively, and the enterprise infrastructure that serves them has been built and maintained over decades.
At RBC Capital Markets, that infrastructure is Infoblox, an enterprise DDI platform that is deeply integrated into our network operations. The integration model, however, was built for a world before Kubernetes: every DNS record request went through a ticketing workflow, routed to the network team, and processed on a timescale measured in hours or days. As our platform scaled to 50+ clusters, each spinning up dozens of services and ingress endpoints, that provisioning lag became a genuine operational bottleneck, and the paper trail for DNS changes lived entirely outside our GitOps audit trail.
bindy was built by Erick Bourgeois to bridge this gap. It is a Kubernetes operator, written in Rust using kube-rs, that manages DNS zones and records as first-class Kubernetes resources. The core design philosophy was to make DNS a GitOps citizen, with the same reconciliation guarantees we apply to everything else on the platform:
- Zones and records are CRDs. A DNSZone or ARecord manifest in Git is the source of truth, reconciled continuously by bindy’s controllers.
- RFC 2136 dynamic updates allow bindy to push record changes to the DNS backend without manual intervention or ticket queues.
- bindcar, a sidecar REST API, provides an RNDC interface that bindy’s controllers use for zone lifecycle operations (zone creation, deletion, reload) alongside dynamic updates.
- Multi-controller architecture with strict write boundaries prevents split-brain scenarios. Selection controllers and sync controllers are separated; sync state is stored on the synced resource to support force-reconciliation patterns.
The impact has been immediate. DNS records for new services are created automatically as part of the same GitOps workflow that deploys the service itself; provisioning time drops from hours to seconds, and the audit trail is Git history, not a ticket system. The rigid integration boundary that previously required human coordination on every DNS change is replaced by a reconciliation loop.
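The heart of such a reconciliation loop is a diff between the records declared in Git and the records the DNS backend currently serves. The Python sketch below shows that step under simplifying assumptions; it is not bindy's code (bindy is written in Rust), and the function names and record representation are hypothetical. Records are modelled as a mapping from (name, type) to a set of values.

```python
def plan_updates(desired, actual):
    """Compute the additions and deletions that make `actual` match `desired`."""
    adds, deletes = [], []
    for key in sorted(desired.keys() | actual.keys()):
        name, rtype = key
        want = desired.get(key, set())
        have = actual.get(key, set())
        adds += [(name, rtype, value) for value in sorted(want - have)]
        deletes += [(name, rtype, value) for value in sorted(have - want)]
    return adds, deletes


def apply_updates(actual, adds, deletes):
    """Return the backend state after applying a plan (stand-in for a dynamic update)."""
    result = {key: set(values) for key, values in actual.items()}
    for name, rtype, value in deletes:
        result[(name, rtype)].discard(value)
    for name, rtype, value in adds:
        result.setdefault((name, rtype), set()).add(value)
    # Drop record sets that became empty.
    return {key: values for key, values in result.items() if values}
```

Running the planner again after applying its output yields an empty plan, which is the idempotence property a reconciler relies on: repeated runs converge rather than accumulate changes.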
bindy is currently being expanded to support compliance scoring (a CRD-based model for zone health) and a future MCP server interface for integration with AI-driven platform tooling.
How the three fit together
What makes this stack coherent is that each layer builds on the same foundational principle: everything is code, reconciled continuously, with no manual state.
Git (source of truth)
└── FluxCD (reconciliation engine)
    ├── k0rdent / CAPI manifests → cluster lifecycle
    ├── Kairos cloud-config → node configuration
    └── bindy CRDs → DNS records
Kairos ensures every node boots from a known, auditable image. k0rdent ensures every cluster is expressed and managed declaratively. bindy ensures every DNS record is a versioned artefact. FluxCD ties them together as the single reconciliation plane. The result is a platform where drift (at the node, cluster, or network level) is structurally prevented rather than operationally managed.
Challenges and lessons learned
Building this platform taught us several things we wish we had known earlier:
- Immutable OS adoption requires patience with enterprise integration. SSSD, NetworkManager, and corporate CA trust chains all need explicit attention when baking immutable images. Document everything; the day-two operator who debugs a boot failure at 2 AM is often not the person who built the image.
- CRD-based cluster management shifts responsibility left. When cluster provisioning is a pull request, platform teams need to invest in review processes and template governance up front, or the simplicity of “just a YAML file” becomes its own source of drift.
- Building operators in Rust is the right long-term call, but the ecosystem is still maturing. kube-rs is excellent, but patterns for multi-controller architectures with reflector/store caching require deliberate design decisions that the community is still converging on.
Looking ahead
Our platform continues to evolve. Some of the areas we are actively developing:
- SPIRE/SPIFFE integration for workload identity across all 50+ clusters, replacing certificate-per-service approaches with a hub-and-spoke SPIRE architecture that satisfies our zero-trust requirements.
- Foundry, an internal self-service API layer, built in Rust, that will surface cluster and DNS provisioning capabilities to development teams through a governed, event-driven interface.
- Kairos-based spot computing using k0smotron and Kata Containers to absorb donated physical server capacity dynamically.
We are proud to be building on and contributing back to the CNCF ecosystem, and we look forward to continuing to share what we learn. If you are working through similar challenges in a regulated environment, we would love to connect: find us in the Kairos, k0rdent, and FluxCD Slack communities, or reach out directly on LinkedIn.
Erick Bourgeois is Director and Head of Kubernetes Platform Engineering at RBC Capital Markets, managing 50+ Kubernetes clusters across multi-cloud and on-premises environments. He is a KubeCon and FluxCon speaker, FINOS Common Cloud Control member, and open-source developer at github.com/firestoned.
