
RoleBasedGroup
Kubernetes-native
LLM Inference Orchestration

Production-grade orchestration for multi-role AI workloads, coordinating distributed, stateful inference with seamless role collaboration.
Any Inference Engine × Any Architecture
PD Disaggregated · EPD Disaggregated · AF Disaggregated · Tensor Parallel
All with a single declarative API (v1alpha2).

v1alpha2 API · Apache 2.0 · Kubernetes 1.22+
Architecture Overview
RoleBasedGroup Architecture Diagram
RoleBasedGroup.yaml
apiVersion: workloads.x-k8s.io/v1alpha2
kind: RoleBasedGroup
metadata:
  name: llm-inference
spec:
  roles:
    - name: router
      replicas: 1
      standalonePattern: ...
    - name: prefill
      replicas: 2
      leaderWorkerPattern:
        size: 4  # 1 leader + 3 workers
    - name: decode
      replicas: 4
      dependencies: ["prefill"]
      leaderWorkerPattern:
        size: 2

Built for Production LLM Serving

Everything you need to run distributed inference at scale, with the simplicity of native Kubernetes APIs (v1alpha2).

Multi-Role Orchestration

Define complex topologies with gateway, router, prefill, and decode roles. Use standalonePattern for single-pod instances or leaderWorkerPattern for distributed tensor parallelism.

Exclusive Topology Scheduling

Pods of different roles are automatically scheduled on different nodes or topology zones using the exclusive-topology annotation for isolation.
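A minimal sketch of what this looks like in practice. The annotation key shown here is illustrative, not the exact RBG name; consult the RBG documentation for the key your version expects:

```yaml
apiVersion: workloads.x-k8s.io/v1alpha2
kind: RoleBasedGroup
metadata:
  name: isolated-inference
  annotations:
    # Hypothetical annotation key for illustration; check the RBG docs
    # for the exact name. The value is a standard topology label.
    rbgs.workloads.x-k8s.io/exclusive-topology: kubernetes.io/hostname
spec:
  roles:
    - name: prefill
      replicas: 2
      leaderWorkerPattern:
        size: 4
```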

In-Place Update

An efficient in-place update mechanism minimizes disruption during configuration changes: update pod specifications without recreating entire role groups.
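In-place behavior is configured per role via rolloutStrategy, as in the quick-start example later on this page:

```yaml
roles:
  - name: backend
    replicas: 2
    rolloutStrategy:
      type: RollingUpdate
      rollingUpdate:
        # Patch pods in place when only mutable fields (e.g. the container
        # image) change, falling back to pod recreation otherwise.
        type: InPlaceIfPossible
```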

Coordinated Updates

Use the CoordinatedPolicy CRD for coordinated rolling updates. maxSkew bounds how far apart the roles' update progress may drift during a rollout.
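As a rough sketch of the idea; the field names below are assumptions for illustration, not the exact CoordinatedPolicy schema, so check the RBG API reference before use:

```yaml
# Illustrative only: field names are assumed, not the actual schema.
apiVersion: workloads.x-k8s.io/v1alpha2
kind: CoordinatedPolicy
metadata:
  name: llm-update-policy
spec:
  targetRef:
    kind: RoleBasedGroup
    name: llm-inference
  # Keep the rollout progress of all roles within this bound of each other.
  maxSkew: "10%"
```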

Independent Autoscaling

Each role can scale independently based on its own metrics, with flexible HPA integration via scalingAdapter for per-role autoscaling policies.
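The scalingAdapter flag appears in the quick-start example below; pairing it with a standard autoscaling/v2 HPA might look like this. The scaleTargetRef kind and name for a role's adapter are assumptions here; verify them against your RBG version:

```yaml
roles:
  - name: decode
    replicas: 4
    scalingAdapter:
      enable: true   # exposes a scale target an HPA can drive
---
# Illustrative HPA; the scaleTargetRef below is an assumption.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: decode-hpa
spec:
  scaleTargetRef:
    apiVersion: workloads.x-k8s.io/v1alpha2
    kind: ScalingAdapter
    name: llm-inference-decode
  minReplicas: 2
  maxReplicas: 16
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```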

Pre-Warmup

Pre-warmup mechanism for faster service initialization. Roles can be pre-provisioned and warmed up before serving requests, reducing cold start latency.
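Conceptually, this might be expressed per role roughly as follows; the preWarmup field here is hypothetical and shown only to convey the shape of the feature, so consult the RBG docs for the actual API:

```yaml
roles:
  - name: prefill
    replicas: 2
    # Hypothetical field, for illustration only: provision pods and warm
    # them up (e.g. load model weights) before they receive traffic.
    preWarmup:
      enabled: true
```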

Ecosystem Integrations

RBG integrates with leading AI inference runtimes and frameworks, providing a complete solution for production LLM serving.

API Version: v1alpha2
Compatibility: Kubernetes 1.22+
License: Apache 2.0

NVIDIA Dynamo

NVIDIA Dynamo runtime with tensor parallelism across multiple workers.

Gateway
Router
Prefill Workers
Decode Workers
KV Cache Transfer
Tensor Parallelism · Leader-Worker Pattern · KV Cache Transfer · Automatic Failover

Mooncake

Mooncake integration for high-performance KV cache transfer via RDMA in PD-disaggregated deployments.

Prefill Node
Mooncake Agent
RDMA Network
Decode Node
RDMA Transfer · Zero-Copy KV Cache · Topology Awareness · High Throughput

llm-d

Kubernetes-native distributed inference framework with smart routing.

Inference Gateway
Scheduler
Prefill Pool
Decode Pool
KV Cache Store
Distributed Serving · Smart Routing · KV-Cache Aware · Auto-scaling

OME

Open Model Engine - High-performance inference framework with multi-GPU support.

Model Loader
Batch Manager
GPU Pool
KV Cache Manager
Response Handler
Multi-GPU Support · Dynamic Batching · KV Cache Optimization · Quantization

Get Started in Minutes

Deploy your first inference service with just a few commands. RBG integrates seamlessly with your existing Kubernetes workflow.

# Install RBG controller
kubectl apply --server-side -f https://raw.githubusercontent.com/sgl-project/rbg/main/deploy/kubectl/manifests.yaml

# Wait for controller to be ready
kubectl wait deploy/rbgs-controller-manager -n rbgs-system \
--for=condition=available --timeout=5m

Deployment Patterns

From single-node development to production-grade disaggregated serving, RBG supports all inference topologies with native Kubernetes APIs.

Aggregated Standalone

Deploy LLM inference on a single node when the model fits in one node's memory. The simplest setup for development and testing.

Use case: Models under 70B parameters on single or multi-GPU nodes

Aggregated Standalone.yaml
apiVersion: workloads.x-k8s.io/v1alpha2
kind: RoleBasedGroup
metadata:
  name: sglang-agg-inference
spec:
  roles:
  - name: backend
    replicas: 1
    scalingAdapter:
      enable: true
    rolloutStrategy:
      type: RollingUpdate
      rollingUpdate:
        type: InPlaceIfPossible
    standalonePattern:
      template:
        spec:
          containers:
          - name: backend
            image: lmsysorg/sglang:v0.5.9
            command:
            - python3
            - -m
            - sglang.launch_server
            - --model-path
            - "Qwen/Qwen3-0.6B"
            - --port
            - "8001"
            - --tp-size
            - "1"
            resources:
              limits:
                nvidia.com/gpu: "1"

RBG in ACK Serving Stack

RoleBasedGroup is a core component of Alibaba Cloud's ACK Serving Stack, providing the orchestration layer for production-grade LLM inference services.

ACK Serving Stack Architecture
RBG in ACK Serving Stack Architecture

Native Integration

Seamlessly integrated with ACK's container runtime and GPU management capabilities.

Performance Optimized

Leverages ACK's high-performance network and storage stack for optimal inference throughput.

Enterprise Ready

Built on ACK's enterprise-grade security, observability, and multi-cluster capabilities.

Use Cases

RBG powers production LLM inference across diverse industries and use scenarios.

Cloud Service Providers

Deploy and manage large-scale LLM inference services for cloud customers. Scale from hundreds to thousands of instances with coordinated updates.

Multi-region · Auto-scaling · High availability

Enterprise AI Platforms

Run private LLM inference infrastructure with topology-aware placement for optimal GPU utilization and minimal latency.

Private cloud · GPU optimization · Security

AI Research Labs

Experiment with novel inference architectures like PD-disaggregation. Quickly iterate on multi-role topologies for research.

Experimentation · Custom architectures · Flexible deployment

Model Hosting Services

Host multiple LLM models with isolated deployments. RoleBasedGroupSet enables managing identical inference clusters.

Multi-model · Isolation · Resource sharing
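A RoleBasedGroupSet stamps out identical RoleBasedGroup replicas, in the spirit of a Deployment over ReplicaSets. A sketch, with the template field name assumed by analogy to the RBG API:

```yaml
apiVersion: workloads.x-k8s.io/v1alpha2
kind: RoleBasedGroupSet
metadata:
  name: hosted-models
spec:
  replicas: 3          # three identical inference clusters
  template:            # assumed field name: an embedded RoleBasedGroup spec
    spec:
      roles:
        - name: backend
          replicas: 1
          standalonePattern: ...
```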

How to Contribute

We welcome contributions from the community! Whether you're fixing a bug, adding a feature, or improving documentation, your help makes RBG better.

1

Fork & Clone

Fork the repository on GitHub and clone it to your local machine.

2

Create a Branch

Create a feature branch for your changes. Follow the naming convention: feature/your-feature-name.

git checkout -b feature/your-feature-name
3

Make Changes

Implement your feature or fix. Write clean, well-documented code following our style guidelines.

4

Write Tests

Add tests for your changes. Ensure all existing tests pass before submitting.

5

Submit PR

Push your changes and create a Pull Request. Include a clear description of your changes.
