
RoleBasedGroup
Kubernetes-native
LLM Inference Orchestration

Production-grade orchestration for multi-role AI workloads, coordinating distributed, stateful inference with seamless role collaboration.
Any Inference Engine × Any Architecture
PD Disaggregated · EPD Disaggregated · AF Disaggregated · Tensor Parallel
All with a single declarative API (v1alpha2).

v1alpha2 API · Apache 2.0 · Kubernetes 1.22+
Architecture Overview
RoleBasedGroup Architecture Diagram
RoleBasedGroup.yaml
apiVersion: workloads.x-k8s.io/v1alpha2
kind: RoleBasedGroup
metadata:
  name: llm-inference
spec:
  roles:
    - name: router
      replicas: 1
      standalonePattern: ...
    - name: prefill
      replicas: 2
      leaderWorkerPattern:
        size: 4  # 1 leader + 3 workers
    - name: decode
      replicas: 4
      dependencies: ["prefill"]
      leaderWorkerPattern:
        size: 2

Built for Production LLM Serving

Everything you need to run distributed inference at scale, with the simplicity of native Kubernetes APIs (v1alpha2).

Multi-Role Orchestration

Define complex topologies with gateway, router, prefill, and decode roles. Use standalonePattern for single-pod instances or leaderWorkerPattern for distributed tensor parallelism.

Exclusive Topology Scheduling

Pods of different roles are automatically scheduled on different nodes or topology zones using the exclusive-topology annotation for isolation.
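A minimal sketch of what this looks like in practice. The annotation key shown here is illustrative, not the exact RBG name; consult the RBG documentation for the key your version expects:

```yaml
apiVersion: workloads.x-k8s.io/v1alpha2
kind: RoleBasedGroup
metadata:
  name: isolated-inference
  annotations:
    # Hypothetical annotation key for illustration; check the RBG docs
    # for the exact name. The value is a standard topology label.
    rbgs.workloads.x-k8s.io/exclusive-topology: kubernetes.io/hostname
spec:
  roles:
    - name: prefill
      replicas: 2
      leaderWorkerPattern:
        size: 4
```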

In-Place Update

An efficient in-place update mechanism minimizes disruption during configuration changes: update pod specifications without recreating entire role groups.
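In-place behavior is configured per role via rolloutStrategy, as in the quick-start example later on this page:

```yaml
roles:
  - name: backend
    replicas: 2
    rolloutStrategy:
      type: RollingUpdate
      rollingUpdate:
        # Patch pods in place when only mutable fields (e.g. the container
        # image) change, falling back to pod recreation otherwise.
        type: InPlaceIfPossible
```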

Coordinated Updates

Use the CoordinatedPolicy CRD for coordinated rolling updates. maxSkew bounds how far apart the roles' update progress may drift during a rollout.
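As a rough sketch of the idea; the field names below are assumptions for illustration, not the exact CoordinatedPolicy schema, so check the RBG API reference before use:

```yaml
# Illustrative only: field names are assumed, not the actual schema.
apiVersion: workloads.x-k8s.io/v1alpha2
kind: CoordinatedPolicy
metadata:
  name: llm-update-policy
spec:
  targetRef:
    kind: RoleBasedGroup
    name: llm-inference
  # Keep the rollout progress of all roles within this bound of each other.
  maxSkew: "10%"
```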

Independent Autoscaling

Each role can scale independently based on its own metrics, with flexible HPA integration via scalingAdapter for per-role autoscaling policies.
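The scalingAdapter flag appears in the quick-start example below; pairing it with a standard autoscaling/v2 HPA might look like this. The scaleTargetRef kind and name for a role's adapter are assumptions here; verify them against your RBG version:

```yaml
roles:
  - name: decode
    replicas: 4
    scalingAdapter:
      enable: true   # exposes a scale target an HPA can drive
---
# Illustrative HPA; the scaleTargetRef below is an assumption.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: decode-hpa
spec:
  scaleTargetRef:
    apiVersion: workloads.x-k8s.io/v1alpha2
    kind: ScalingAdapter
    name: llm-inference-decode
  minReplicas: 2
  maxReplicas: 16
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```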

Pre-Warmup

Pre-warmup mechanism for faster service initialization. Roles can be pre-provisioned and warmed up before serving requests, reducing cold start latency.
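Conceptually, this might be expressed per role roughly as follows; the preWarmup field here is hypothetical and shown only to convey the shape of the feature, so consult the RBG docs for the actual API:

```yaml
roles:
  - name: prefill
    replicas: 2
    # Hypothetical field, for illustration only: provision pods and warm
    # them up (e.g. load model weights) before they receive traffic.
    preWarmup:
      enabled: true
```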

Ecosystem Integrations

RBG integrates with leading AI inference runtimes and frameworks, providing a complete solution for production LLM serving.

API Version: v1alpha2
Compatibility: Kubernetes 1.22+
License: Apache 2.0

NVIDIA Dynamo

NVIDIA Dynamo runtime with tensor parallelism across multiple workers.

Gateway
Router
Prefill Workers
Decode Workers
KV Cache Transfer
Tensor Parallelism · Leader-Worker Pattern · KV Cache Transfer · Automatic Failover

Mooncake

Mooncake integration for high-performance KV cache transfer via RDMA in PD-disaggregated deployments.

Prefill Node
Mooncake Agent
RDMA Network
Decode Node
RDMA Transfer · Zero-Copy KV Cache · Topology Awareness · High Throughput

llm-d

Kubernetes-native distributed inference framework with smart routing.

Inference Gateway
Scheduler
Prefill Pool
Decode Pool
KV Cache Store
Distributed Serving · Smart Routing · KV-Cache Aware · Auto-scaling

OME

Open Model Engine - High-performance inference framework with multi-GPU support.

Model Loader
Batch Manager
GPU Pool
KV Cache Manager
Response Handler
Multi-GPU Support · Dynamic Batching · KV Cache Optimization · Quantization

Get Started in Minutes

Deploy your first inference service with just a few commands. RBG integrates seamlessly with your existing Kubernetes workflow.

# Install RBG controller
kubectl apply --server-side -f https://raw.githubusercontent.com/sgl-project/rbg/main/deploy/kubectl/manifests.yaml

# Wait for controller to be ready
kubectl wait deploy/rbgs-controller-manager -n rbgs-system \
--for=condition=available --timeout=5m

Deployment Patterns

From single-node development to production-grade disaggregated serving, RBG supports all inference topologies with native Kubernetes APIs.

Aggregated Standalone

Deploy LLM inference on a single node when the model fits in one node's memory. The simplest setup for development and testing.

Use case: Models under 70B parameters on single or multi-GPU nodes

Aggregated Standalone.yaml
apiVersion: workloads.x-k8s.io/v1alpha2
kind: RoleBasedGroup
metadata:
  name: sglang-agg-inference
spec:
  roles:
  - name: backend
    replicas: 1
    scalingAdapter:
      enable: true
    rolloutStrategy:
      type: RollingUpdate
      rollingUpdate:
        type: InPlaceIfPossible
    standalonePattern:
      template:
        spec:
          containers:
          - name: backend
            image: lmsysorg/sglang:v0.5.9
            command:
            - python3
            - -m
            - sglang.launch_server
            - --model-path
            - "Qwen/Qwen3-0.6B"
            - --port
            - "8001"
            - --tp-size
            - "1"
            resources:
              limits:
                nvidia.com/gpu: "1"

RBG in ACK Serving Stack

RoleBasedGroup is a core component of Alibaba Cloud's ACK Serving Stack, providing the orchestration layer for production-grade LLM inference services.

ACK Serving Stack Architecture
RBG in ACK Serving Stack Architecture

Native Integration

Seamlessly integrated with ACK's container runtime and GPU management capabilities.

Performance Optimized

Leverages ACK's high-performance network and storage stack for optimal inference throughput.

Enterprise Ready

Built on ACK's enterprise-grade security, observability, and multi-cluster capabilities.

Use Cases

RBG powers production LLM inference across diverse industries and use scenarios.

Cloud Service Providers

Deploy and manage large-scale LLM inference services for cloud customers. Scale from hundreds to thousands of instances with coordinated updates.

Multi-region · Auto-scaling · High availability

Enterprise AI Platforms

Run private LLM inference infrastructure with topology-aware placement for optimal GPU utilization and minimal latency.

Private cloud · GPU optimization · Security

AI Research Labs

Experiment with novel inference architectures like PD-disaggregation. Quickly iterate on multi-role topologies for research.

Experimentation · Custom architectures · Flexible deployment

Model Hosting Services

Host multiple LLM models with isolated deployments. RoleBasedGroupSet enables managing identical inference clusters.

Multi-model · Isolation · Resource sharing
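A RoleBasedGroupSet stamps out identical RoleBasedGroup replicas, in the spirit of a Deployment over ReplicaSets. A sketch, with the template field name assumed by analogy to the RBG API:

```yaml
apiVersion: workloads.x-k8s.io/v1alpha2
kind: RoleBasedGroupSet
metadata:
  name: hosted-models
spec:
  replicas: 3          # three identical inference clusters
  template:            # assumed field name: an embedded RoleBasedGroup spec
    spec:
      roles:
        - name: backend
          replicas: 1
          standalonePattern: ...
```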

How to Contribute

We welcome contributions from the community! Whether you're fixing a bug, adding a feature, or improving documentation, your help makes RBG better.

1

Fork & Clone

Fork the repository on GitHub and clone it to your local machine.

2

Create a Branch

Create a feature branch for your changes. Follow the naming convention: feature/your-feature-name.

git checkout -b feature/your-feature-name
3

Make Changes

Implement your feature or fix. Write clean, well-documented code following our style guidelines.

4

Write Tests

Add tests for your changes. Ensure all existing tests pass before submitting.

5

Submit PR

Push your changes and create a Pull Request. Include a clear description of your changes.
