Content

1 Key Takeaways
2 What Is vLLM?
3 Why Use gpt-oss?
4 Running vLLM on an AMD-Powered GPU Server
5 Setting Up the Environment for vLLM with Docker
6 Deploying gpt-oss on vLLM
7 Interacting with the Deployed gpt-oss 120b Model
8 Closing Thoughts

Vijona

2 hours ago

Running gpt-oss 120b with vLLM on AMD GPUs

One of the biggest considerations for anyone starting with large-scale LLM technology is compute capacity. VRAM, throughput, hardware architecture, and software stacks can vary greatly from one system to another, which can quickly become overwhelming. This becomes especially important when deploying LLMs. The goal is usually to achieve strong model quality at a reasonable cost, and finding that balance is where the real challenge begins.

In this tutorial, we take a closer look at the AMD Instinct MI300X GPU running gpt-oss 120b. This high-performance GPU is one of AMD’s flagship accelerators and offers substantial processing power. With 192 GB of HBM3 memory, it can deliver up to 653.7 TFLOPS of peak TF32 matrix performance and a peak theoretical memory bandwidth of 5.3 TB/s. This makes it a strong platform for testing and serving LLMs. For this example, we use OpenAI’s gpt-oss 120b, a powerful language model known for its agentic and coding capabilities.
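As a rough sanity check on why a single 192 GB accelerator is enough for this model, the weight footprint can be estimated from the parameter count and the bits stored per parameter. The sketch below is a back-of-the-envelope calculation, not a measurement. The parameter count (about 117 billion) and the storage width (about 4.25 bits per parameter, as if every weight were stored in MXFP4) are assumptions drawn from OpenAI's public release material rather than from this guide, and the estimate ignores the KV cache, activations, and any layers kept at higher precision.

```python
def weight_memory_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate weight footprint in GB (10^9 bytes)."""
    total_bits = params_billions * 1e9 * bits_per_param
    return total_bits / 8 / 1e9


HBM_GB = 192  # MI300X memory capacity stated in the text

# Assumed figures, not taken from this guide: ~117B total parameters and
# ~4.25 bits per parameter if every weight were stored in MXFP4.
ASSUMED_PARAMS_B = 117
ASSUMED_BITS = 4.25

weights_gb = weight_memory_gb(ASSUMED_PARAMS_B, ASSUMED_BITS)
headroom_gb = HBM_GB - weights_gb

print(f"Estimated weights: {weights_gb:.1f} GB")
print(f"Headroom for KV cache and activations: {headroom_gb:.1f} GB")
```

Under these assumptions the weights occupy well under half of the card's memory, which leaves the remainder for the KV cache that vLLM uses to serve many requests at once.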

This guide explains how to use vLLM with AMD GPUs. By the end, you will understand what vLLM is, why gpt-oss is useful, and how to run gpt-oss 120b with vLLM on an AMD-powered GPU server.

Key Takeaways

vLLM is a powerful open-source tool for serving LLMs at scale on AMD GPUs.
gpt-oss 120b is a highly capable open-weight LLM for agentic and coding tasks, and it works efficiently with vLLM.
AMD MI300X GPU servers are well suited for serving gpt-oss 120b at scale.

What Is vLLM?

vLLM is an open-source inference engine built for fast and memory-efficient serving of large language models. It improves GPU memory usage, which helps deliver faster responses, higher throughput, and lower latency compared with many other serving options. Its main features include the PagedAttention algorithm, continuous batching, and compatibility with popular model ecosystems such as Hugging Face. These capabilities make vLLM a strong choice for serving large models.
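To make the idea behind PagedAttention more concrete, here is a toy block allocator. It is only an illustration of the general concept (the KV cache is split into fixed-size blocks that are handed out on demand and tracked in a per-request block table). It is not vLLM's implementation, and the class and method names are hypothetical.

```python
class PagedKVCache:
    """Toy block allocator illustrating the idea behind paged KV caching."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables: dict[str, list[int]] = {}  # sequence id -> physical blocks
        self.lengths: dict[str, int] = {}  # sequence id -> tokens stored

    def append_token(self, seq_id: str) -> int:
        """Reserve space for one more token and return its physical block id."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # last block is full, or none allocated yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1]

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=2)
for _ in range(3):
    cache.append_token("request-a")  # 3 tokens need 2 blocks
cache.append_token("request-b")      # 1 token needs 1 block
print(cache.block_tables, "free:", cache.free_blocks)

cache.release("request-a")           # blocks become reusable immediately
print("free after release:", sorted(cache.free_blocks))
```

Because memory is reserved one small block at a time instead of one large contiguous region per request, finished requests free their blocks right away and new requests can join the running batch, which is what makes continuous batching practical.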

Why Use gpt-oss?

gpt-oss, available in 20b and 120b variants, is OpenAI’s flagship open-weight LLM release. Both versions are among the strongest agentic and coding models in their respective size classes. At release, gpt-oss 120b was competitive with o4-mini on established reasoning benchmarks, while the 20b version performed similarly to o3-mini on common benchmarks and could run on edge devices with only 16 GB of memory.

gpt-oss 120b is recommended because its weights are openly available under the Apache 2.0 license and it is suitable for fine-tuning across many use cases. It also delivers strong performance in reasoning and coding tasks. In the benchmark results published at release, the model performs comparably with strong reasoning models such as o3 and o4-mini (with tools) on the Codeforces benchmark. For these reasons, gpt-oss 120b is an excellent starting point for serving coding-focused models with vLLM.

Running vLLM on an AMD-Powered GPU Server

To begin, create an AMD MI300X-powered GPU server through your preferred cloud or infrastructure provider. Select a location where AMD GPU resources are available, choose AMD as the GPU platform, and select a single MI300X GPU. Then choose your SSH key from the available options for your account or team.

After confirming the configuration, create the GPU server. It may take a few moments until the machine is fully available.

Setting Up the Environment for vLLM with Docker

Once the GPU server is ready, connect to it via SSH from your local terminal. Then move into the directory where you want to work. From there, you can start using vLLM with Docker.

Deploying gpt-oss on vLLM

First, define an alias for the Docker run options, then use it to download and start the container. Paste the following commands into the terminal. This container image is intended for an MI300X GPU.



alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 32G -v /data:/data -v $HOME:/myhome -w /myhome'

drun rocm/vllm-dev:open-mi300-08052025
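The alias packs many options into one line, so the following sketch lists each flag next to a short note and then reassembles the same command. The notes are a general reading of standard Docker and ROCm conventions rather than statements from the image's maintainers, and the helper name `build_command` is hypothetical.

```python
# Each flag from the alias, paired with a short note on its purpose.
DOCKER_FLAGS = [
    ("-it", "interactive shell with a terminal"),
    ("--network=host", "share the host network, so port 8000 is reachable directly"),
    ("--device=/dev/kfd", "expose the ROCm compute interface to the container"),
    ("--device=/dev/dri", "expose the GPU render nodes"),
    ("--group-add=video", "join the group that commonly owns the GPU device files"),
    ("--ipc=host", "share the host IPC namespace, including shared memory"),
    ("--cap-add=SYS_PTRACE", "allow ptrace, useful for debuggers and profilers"),
    ("--security-opt seccomp=unconfined", "disable the default seccomp filter"),
    ("--shm-size 32G", "requested /dev/shm size for the container"),
    ("-v /data:/data", "mount the host /data directory"),
    ("-v $HOME:/myhome", "mount the home directory at /myhome"),
    ("-w /myhome", "start in /myhome"),
]
IMAGE = "rocm/vllm-dev:open-mi300-08052025"


def build_command(image: str) -> str:
    """Reassemble the full docker command that the alias expands to."""
    flags = " ".join(flag for flag, _ in DOCKER_FLAGS)
    return f"sudo docker run {flags} {image}"


for flag, note in DOCKER_FLAGS:
    print(f"{flag:<36} # {note}")
print(build_command(IMAGE))
```

Printing the command this way is only meant for reading and reviewing the options. To start the container, use the alias exactly as given.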

The download and startup process may take some time. When it completes, you should be inside the container. From there, you can deploy the gpt-oss 120b model on vLLM. Paste the following commands into the terminal to serve the model on the AMD MI300X-powered GPU server.



export VLLM_ROCM_USE_AITER=1
export VLLM_USE_AITER_UNIFIED_ATTENTION=1
export VLLM_ROCM_USE_AITER_MHA=0

vllm serve openai/gpt-oss-120b --compilation-config '{"full_cuda_graph": true}'

This starts the vLLM server and begins downloading the model files into the container. If everything runs correctly, you should see a confirmation message in the terminal once startup completes. After that, the served model can be reached at http://0.0.0.0:8000 or http://localhost:8000 through any OpenAI-compatible client, such as cURL or OpenAI’s Python library.
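Because the model download can take a while, it can help to check from a second terminal whether the server is already answering. The sketch below queries the model list endpoint of the OpenAI-compatible API using only the Python standard library. The helper names are hypothetical, and the base URL assumes the default port mentioned above.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # default address mentioned above


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style model list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url: str = BASE_URL, timeout: float = 5.0) -> list[str]:
    """Return the served model ids, or an empty list if the server is not reachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
            return parse_model_ids(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        return []


if __name__ == "__main__":
    models = list_served_models()
    print("Server ready:" if models else "Server not reachable yet:", models)
```

Once the server has finished loading, the returned list should contain `openai/gpt-oss-120b`, the model name passed to `vllm serve`.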

Interacting with the Deployed gpt-oss 120b Model

Next, you need a way to interact with the deployed model. There are several possible approaches, but this section covers two common methods: cURL and OpenAI’s Python library. First, use cURL. Open a new terminal window and connect to the remote machine via SSH. Then paste the following command into the terminal. This example asks the model to tell a joke.


curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Tell me a joke." }
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'

This should return output similar to the following:


{"id":"[anonymized]","object":"chat.completion","created":1762542942,"model":"openai/gpt-oss-120b","choices":[{"index":0,"message":{"role":"assistant","content":"analysisUser asks for a joke. Provide a joke. Keep it appropriate.assistantfinalSure, here's a classic one for you:\n\n**Why don’t scientists trust atoms?**\n\n*Because they make up everything!*","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":85,"total_tokens":134,"completion_tokens":49,"prompt_tokens_details":null},"prompt_logprobs":null,"kv_transfer_params":null}

This method can be used for many different tasks, including code completion, tool-calling workflows, and complex function calls. Try your own prompts to explore what the model can do.
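For trying several prompts in a row, the same request can be wrapped in a small function. The sketch below mirrors the cURL call with Python's standard library only. The function names `build_chat_request` and `ask` are hypothetical, and `ask` needs the vLLM server from the previous section to be running.

```python
import json
import urllib.request

CHAT_URL = "http://localhost:8000/v1/chat/completions"  # same endpoint as the cURL call


def build_chat_request(prompt: str, max_tokens: int = 100,
                       temperature: float = 0.7) -> urllib.request.Request:
    """Build the POST request carrying the same JSON body as the cURL example."""
    payload = {
        "model": "openai/gpt-oss-120b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt and return the assistant message text (requires the running server)."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Inspect the request without sending it.
request = build_chat_request("Write a Python function that reverses a string.")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Calling `ask("Tell me a joke.")` against the running server should return the same kind of text that appears in the `content` field of the cURL response.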

If you prefer Python, you can use OpenAI’s Python library. In a separate terminal window from the running vLLM server, paste the following commands to create a virtual environment, install the required packages, and start Jupyter Lab.


python3 -m venv venv
source venv/bin/activate
pip install openai jupyter
jupyter lab --allow-root

Use a browser feature in your development environment, such as Cursor or VS Code’s simple browser, to access the Jupyter Lab interface locally. Once it is running, create and open a new Jupyter Notebook. In the first code cell, paste the following Python code.



from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ]
)
print("Chat response:", chat_response)

If everything is working correctly, you should receive output similar to this:



Chat response: ChatCompletion(id='chatcmpl-b600ce13dfd041a4a934ebe7826c8a44', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='analysisThe user wants a joke. Provide a joke. Should be appropriate. Simple.assistantfinalWhy don’t scientists trust atoms?\n\nBecause they **make up** everything!', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[], reasoning_content=None), stop_reason=None)], created=1762543674, model='openai/gpt-oss-120b', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=41, prompt_tokens=85, total_tokens=126, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None, kv_transfer_params=None)
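In the sample output, the `content` field carries the model's short reasoning and its final answer in one string, separated by the text `assistantfinal`, while `reasoning_content` is empty. If you only want the answer, a simple string split is enough for output shaped like this. The sketch below is a heuristic based solely on the sample shown here: the marker strings are inferred from that output, and the function name is hypothetical.

```python
def split_reasoning(content: str) -> tuple[str, str]:
    """Split raw content into (reasoning, final_answer) using the markers seen in the sample."""
    marker = "assistantfinal"
    reasoning, found, final = content.partition(marker)
    if not found:
        return "", content  # no marker: treat everything as the final answer
    if reasoning.startswith("analysis"):
        reasoning = reasoning[len("analysis"):]
    return reasoning.strip(), final.strip()


raw = (
    "analysisThe user wants a joke. Provide a joke. Should be appropriate. Simple."
    "assistantfinalWhy don’t scientists trust atoms?\n\nBecause they **make up** everything!"
)
reasoning, answer = split_reasoning(raw)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

With the client from the previous cell, the raw string would come from `chat_response.choices[0].message.content`.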

With this Python-based setup, vLLM interaction can be integrated into many applications and workflows, including custom agents. This approach is as flexible as cURL while also allowing you to use the broader Python ecosystem.
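As one example of such an integration, a multi-turn conversation only requires keeping the message history and sending it with every request. The sketch below shows that pattern with a stand-in backend so it runs without a server. The `ChatSession` class and `fake_send` function are hypothetical names used for illustration.

```python
from typing import Callable

Message = dict[str, str]


class ChatSession:
    """Keeps the running message history for a multi-turn conversation."""

    def __init__(self, send: Callable[[list[Message]], str],
                 system_prompt: str = "You are a helpful assistant.") -> None:
        self._send = send
        self.messages: list[Message] = [{"role": "system", "content": system_prompt}]

    def ask(self, user_text: str) -> str:
        """Append the user turn, call the backend, and record the reply."""
        self.messages.append({"role": "user", "content": user_text})
        reply = self._send(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def fake_send(messages: list[Message]) -> str:
    """Stand-in backend so the sketch runs without a server."""
    return f"(reply to: {messages[-1]['content']})"


session = ChatSession(fake_send)
session.ask("Tell me a joke.")
session.ask("Explain why it is funny.")
print([m["role"] for m in session.messages])
```

To talk to the deployed model instead, pass a function that forwards the history to the client created earlier, for example `lambda msgs: client.chat.completions.create(model="openai/gpt-oss-120b", messages=msgs).choices[0].message.content`.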

Closing Thoughts

Running vLLM with gpt-oss on AMD MI300X-powered GPU servers is relatively straightforward thanks to the work of the vLLM and ROCm communities. With access to suitable GPU infrastructure, users can launch this powerful model quickly on high-performance hardware.

Source: digitalocean.com
