Oracle Cloud Infrastructure Kubernetes Engine (OKE) now lets you launch **managed node pools** directly into OCI Compute Clusters with RDMA networking. Worker nodes get ultra-low-latency communication with each other, and you keep the operational simplicity of managed nodes.
This is a major win for distributed AI training, fine-tuning, multi-node inference, and other HPC-style workloads on Kubernetes.
Why RDMA Matters for Kubernetes
- Ultra-low latency (single-digit microseconds) between nodes
- High-bandwidth, direct memory access between GPUs across hosts
- Significantly better scaling efficiency for multi-node AI/ML jobs
- Keeps expensive GPUs utilized instead of waiting on network transfers
Key Benefits of This Release
- Full managed node pool experience (auto-scaling, upgrades, node replacement, cordon/drain)
- No more need for self-managed nodes just to get RDMA
- OKE automatically enables the required HPC plugins for RDMA
- Perfect for large-scale distributed training and inference
How to Use RDMA with OKE Managed Node Pools
Prerequisites
- Use an enhanced OKE cluster
- The Compute Cluster must exist and be in ACTIVE state
- Use an RDMA-capable bare metal shape
- Place the node pool in the same availability domain as the Compute Cluster
- Do not specify fault domains (the Compute service manages them)
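To make the checklist above concrete, here is a minimal pre-flight sketch in plain Python. It is not part of the OCI SDK: `NodePoolPlan`, `check_prerequisites`, and the field names are hypothetical, and the set of RDMA-capable shapes is supplied by the caller (the shape name in the usage lines is only an example value, so confirm it against the current OCI shape list).

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class NodePoolPlan:
    """Hypothetical summary of a planned node pool (not an OCI SDK type)."""
    cluster_type: str            # "ENHANCED" or "BASIC"
    compute_cluster_state: str   # lifecycle state, e.g. "ACTIVE"
    compute_cluster_ad: str      # availability domain of the Compute Cluster
    placement_ad: str            # availability domain chosen for the node pool
    shape: str                   # compute shape for the worker nodes
    fault_domains: Optional[List[str]] = None


def check_prerequisites(plan: NodePoolPlan, rdma_shapes: Set[str]) -> List[str]:
    """Return a list of unmet prerequisites; an empty list means the plan passes."""
    problems = []
    if plan.cluster_type.upper() != "ENHANCED":
        problems.append("cluster must be an enhanced OKE cluster")
    if plan.compute_cluster_state.upper() != "ACTIVE":
        problems.append("Compute Cluster must be in ACTIVE state")
    if plan.shape not in rdma_shapes:
        problems.append(f"shape {plan.shape} is not in the RDMA-capable list")
    if plan.placement_ad != plan.compute_cluster_ad:
        problems.append("placement must use the Compute Cluster's availability domain")
    if plan.fault_domains:
        problems.append("do not specify fault domains")
    return problems


if __name__ == "__main__":
    # Example values only; replace with your own cluster details.
    plan = NodePoolPlan("ENHANCED", "ACTIVE", "AD-1", "AD-2", "BM.GPU.H100.8")
    for problem in check_prerequisites(plan, {"BM.GPU.H100.8"}):
        print("unmet:", problem)
```

Running a check like this before creating the node pool catches mismatches early, such as a placement in the wrong availability domain.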
Required IAM Policy
allow any-user to {COMPUTE_CLUSTER_LAUNCH_INSTANCE}
in compartment <compartment_name>
where all {request.principal.type = 'nodepool',
target.resource.id = '<compute_cluster_OCID>'}
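If you manage several Compute Clusters, it can help to generate the statement rather than hand-edit it. The sketch below is a hypothetical helper (`render_policy` is not an OCI API) that fills in the two placeholders and joins the conditions with the `all { ... }` grouping that OCI policy syntax uses for multiple conditions. The compartment name and OCID in the usage lines are made-up placeholders.

```python
def render_policy(compartment_name: str, compute_cluster_ocid: str) -> str:
    """Build the node pool launch policy as a single-line statement."""
    if not compartment_name.strip():
        raise ValueError("compartment name must not be empty")
    # Light sanity check only; this does not verify that the OCID exists.
    if not compute_cluster_ocid.startswith("ocid1."):
        raise ValueError("expected an OCID starting with 'ocid1.'")
    return (
        "Allow any-user to {COMPUTE_CLUSTER_LAUNCH_INSTANCE} "
        f"in compartment {compartment_name} "
        "where all {request.principal.type = 'nodepool', "
        f"target.resource.id = '{compute_cluster_ocid}'}}"
    )


if __name__ == "__main__":
    # Placeholder values for illustration.
    print(render_policy("ml-compartment", "ocid1.computecluster.oc1..example"))
```

Scoping the statement to one Compute Cluster OCID keeps the grant narrow: node pools can launch into that cluster and nothing else.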
Creating a Managed Node Pool with RDMA (Console)
- In the OCI Console, go to your enhanced OKE cluster
- Create a new managed node pool
- Under Advanced Options → Add a Compute Cluster:
  - Select the compartment
  - Select the Compute Cluster
- Choose an RDMA-supported shape
- Configure placement in the matching availability domain
- Create the node pool
OKE will automatically launch instances into the Compute Cluster with RDMA enabled.
Best Practices
- Use enhanced clusters for all new workloads
- Start with smaller clusters to validate performance gains
- Monitor GPU utilization and inter-node communication metrics
- Combine with OKE autoscaling for dynamic workloads
- Choose the Compute Cluster carefully: it cannot be changed after the node pool is created
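One simple way to validate performance gains on a small cluster is to compare measured throughput against ideal linear scaling. The sketch below assumes you have already measured training throughput (for example, samples per second) at two node counts. The function name and the numbers in the usage lines are illustrative, not benchmark results.

```python
def scaling_efficiency(base_nodes: int, base_throughput: float,
                       nodes: int, throughput: float) -> float:
    """Ratio of measured throughput to ideal linear scaling from the baseline.

    1.0 means perfect linear scaling; lower values mean time lost to
    communication or other overhead.
    """
    if base_nodes <= 0 or nodes <= 0:
        raise ValueError("node counts must be positive")
    if base_throughput <= 0:
        raise ValueError("baseline throughput must be positive")
    ideal = base_throughput * (nodes / base_nodes)
    return throughput / ideal


if __name__ == "__main__":
    # Made-up measurements: 1 node at 1000 samples/s, 4 nodes at 3600 samples/s.
    eff = scaling_efficiency(1, 1000.0, 4, 3600.0)
    print(f"scaling efficiency: {eff:.0%}")
```

Tracking this ratio as you add nodes shows whether inter-node communication is becoming the bottleneck before you commit to a larger cluster.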
Conclusion
With RDMA support for managed node pools, OKE now delivers the best of both worlds: the operational simplicity and automation of managed Kubernetes nodes combined with the ultra-low latency networking required for large-scale AI and HPC workloads.
Whether you’re doing distributed training, multi-node inference, or any communication-intensive workload, you can now take full advantage of OCI’s high-performance Compute Clusters without sacrificing managed node benefits.