Running gpt-oss 120b with vLLM on AMD GPUs
One of the biggest considerations for anyone starting with large-scale LLM technology is compute capacity. VRAM, throughput, hardware architecture, and software stacks can vary greatly from one system to another, which can quickly become overwhelming. This becomes especially important when deploying LLMs. The goal is usually to achieve strong model quality at a reasonable cost, and finding that balance is where the real challenge begins.
In this tutorial, we take a closer look at the AMD Instinct MI300X GPU running gpt-oss 120b. This high-performance GPU is one of AMD’s flagship accelerators and offers substantial processing power. With 192 GB of HBM3 memory, it delivers up to 653.7 TFLOPS of peak theoretical TF32 matrix performance and a peak theoretical memory bandwidth of 5.3 TB/s. This makes it a strong platform for testing and serving LLMs. For this example, we use OpenAI’s gpt-oss 120b, a powerful language model known for its agentic and coding capabilities.
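As a rough sanity check on why a single MI300X is enough for this model, the sketch below estimates the memory needed for the weights alone. It relies on two figures that are not stated in this tutorial and should be read as assumptions: roughly 117 billion total parameters, and about 4.25 bits per weight (a 4-bit value plus one 8-bit scale shared by each block of 32 weights), applied uniformly to every parameter. Treat the result as a back-of-the-envelope estimate, not a measurement. The KV cache and activations need additional memory on top of it.

```python
# Back-of-the-envelope estimate of weight memory on one MI300X.
# ASSUMPTIONS (not from this tutorial): ~117e9 parameters and ~4.25 bits
# per weight, applied uniformly to all parameters.

HBM_CAPACITY_GB = 192            # MI300X memory capacity, from the text above
ASSUMED_PARAMS = 117e9           # assumed total parameter count
ASSUMED_BITS_PER_WEIGHT = 4.25   # assumed: 4-bit value + 8-bit scale per 32 weights


def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Return the memory needed to hold the weights, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


weights_gb = weight_memory_gb(ASSUMED_PARAMS, ASSUMED_BITS_PER_WEIGHT)
headroom_gb = HBM_CAPACITY_GB - weights_gb

print(f"Estimated weights: {weights_gb:.1f} GB")
print(f"Headroom for KV cache and activations: {headroom_gb:.1f} GB")
```

Under these assumptions the weights occupy roughly a third of the GPU's memory, which leaves most of the capacity for the KV cache that serves concurrent requests.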
This guide explains how to use vLLM with AMD GPUs. By the end, you will understand what vLLM is, why gpt-oss is useful, and how to run gpt-oss 120b with vLLM on an AMD-powered GPU server.
Key Takeaways
- vLLM is a powerful open-source tool for serving LLMs at scale on AMD GPUs.
- gpt-oss 120b is a highly capable open-weight LLM with strong agentic and coding abilities, and it works efficiently with vLLM.
- AMD MI300X GPU servers are well suited for serving gpt-oss 120b at scale.
What Is vLLM?
vLLM is an open-source inference engine built for fast and memory-efficient serving of large language models. It manages GPU memory more efficiently, especially the key-value (KV) cache, which helps deliver higher throughput and lower latency than many other serving options. Its main features include the PagedAttention algorithm, continuous batching, and compatibility with popular model ecosystems such as Hugging Face. These capabilities make vLLM a strong choice for serving large models.
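To make the idea behind PagedAttention more concrete, here is a toy sketch of paged KV-cache bookkeeping: cache memory is split into fixed-size blocks, and each sequence holds a table of the blocks it uses, so memory is allocated on demand instead of being reserved up front. This is not vLLM's actual implementation. The class name, method names, and the block size of 16 tokens are all hypothetical choices made for illustration.

```python
# Toy illustration of paged KV-cache bookkeeping (NOT vLLM's real code).
# All names here (PagedKVCache, BLOCK_SIZE, ...) are hypothetical.

BLOCK_SIZE = 16  # tokens per block; an assumed value for illustration


class PagedKVCache:
    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.num_tokens = {}    # sequence id -> tokens stored so far

    def append_token(self, seq_id: str) -> None:
        """Reserve cache space for one more token of a sequence."""
        table = self.block_tables.setdefault(seq_id, [])
        count = self.num_tokens.get(seq_id, 0)
        if count == len(table) * BLOCK_SIZE:  # every allocated block is full
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.num_tokens[seq_id] = count + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.num_tokens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4)
for _ in range(17):   # 17 tokens need two 16-token blocks
    cache.append_token("a")
for _ in range(3):    # 3 tokens need one block
    cache.append_token("b")
print(len(cache.block_tables["a"]), len(cache.block_tables["b"]), len(cache.free_blocks))
cache.free("a")       # blocks used by "a" become reusable immediately
print(len(cache.free_blocks))
```

Because a finished sequence returns its blocks to the pool right away, a scheduler can admit new requests as soon as space frees up, which is what makes continuous batching practical.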
Why Use gpt-oss?
gpt-oss, available in 20b and 120b variants, is OpenAI’s flagship open-weight LLM release. Both versions are among the strongest agentic and coding models in their respective size classes. At release, OpenAI reported that gpt-oss 120b was competitive with o4-mini on established reasoning benchmarks, while the 20b version performed similarly to o3-mini on common benchmarks and could run on edge devices with only 16 GB of memory.
gpt-oss 120b is recommended because its weights are openly available under the Apache 2.0 license and it is suitable for fine-tuning across many use cases. It also performs strongly on reasoning and coding tasks: in OpenAI’s published results on the Codeforces benchmark, the model with tools performs comparably with strong reasoning models such as o3 and o4-mini. For these reasons, gpt-oss 120b is an excellent starting point for serving coding-focused models with vLLM.
Running vLLM on an AMD-Powered GPU Server
To begin, create an AMD MI300X-powered GPU server through your preferred cloud or infrastructure provider. Select a location where AMD GPU resources are available, choose AMD as the GPU platform, and select a single MI300X GPU. Then choose your SSH key from the available options for your account or team.
After confirming the configuration, create the GPU server. It may take a few moments until the machine is fully available.
Setting Up the Environment for vLLM with Docker
Once the GPU server is ready, connect to it via SSH from your local terminal. Then move into the directory where you want to work. From there, you can start using vLLM with Docker.
Deploying gpt-oss on vLLM
First, define an alias for the docker run command with the device and shared-memory options the GPU needs, then use it to pull and start the container. Paste the following commands into the terminal. This container image is intended for an MI300X GPU.
alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 32G -v /data:/data -v $HOME:/myhome -w /myhome'
drun rocm/vllm-dev:open-mi300-08052025
The download and startup process may take some time. When it completes, you should be inside the container. From there, you can deploy the gpt-oss 120b model on vLLM. Paste the following commands into the terminal to serve the model on the AMD MI300X-powered GPU server.
export VLLM_ROCM_USE_AITER=1
export VLLM_USE_AITER_UNIFIED_ATTENTION=1
export VLLM_ROCM_USE_AITER_MHA=0
vllm serve openai/gpt-oss-120b --compilation-config '{"full_cuda_graph": true}'
This starts the vLLM server and begins downloading the model files into the container. If everything runs correctly, you should see a startup confirmation message in the terminal. After that, the served model can be reached at 0.0.0.0:8000 or localhost:8000 by any client that speaks the OpenAI API, such as cURL or OpenAI’s Python library.
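Because the model download can take a while, it can help to wait until the server actually lists the model before sending requests. The following is a minimal sketch that polls the server's model listing endpoint using only the Python standard library. The function names, the timeout, and the polling interval are hypothetical choices, and the sketch assumes the server exposes an OpenAI-style `/v1/models` route on port 8000.

```python
import json
import time
import urllib.error
import urllib.request


def parse_model_ids(body: str) -> list:
    """Extract model ids from an OpenAI-style model listing (JSON text)."""
    return [entry["id"] for entry in json.loads(body).get("data", [])]


def wait_for_model(base_url: str, model: str,
                   timeout_s: float = 1800, poll_s: float = 10) -> bool:
    """Poll base_url + '/models' until `model` is listed or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
                if model in parse_model_ids(resp.read().decode("utf-8")):
                    return True
        except (urllib.error.URLError, OSError, ValueError):
            pass  # server not reachable yet, or reply was not valid JSON; retry
        time.sleep(poll_s)
    return False


# Parsing demo on a hand-written sample payload (not real server output).
sample = '{"object": "list", "data": [{"id": "openai/gpt-oss-120b", "object": "model"}]}'
print(parse_model_ids(sample))
```

On the GPU server, calling `wait_for_model("http://localhost:8000/v1", "openai/gpt-oss-120b")` from a second terminal would block until the model is listed or the timeout passes.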
Interacting with the Deployed gpt-oss 120b Model
Next, you need a way to interact with the deployed model. There are several possible approaches, but this section covers two common methods: cURL and OpenAI’s Python library. First, use cURL. Open a new terminal window and connect to the remote machine via SSH. Then paste the following command into the terminal. This example asks the model to tell a joke.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Tell me a joke." }
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'
This should return output similar to the following:
{"id":"[anonymized]","object":"chat.completion","created":1762542942,"model":"openai/gpt-oss-120b","choices":[{"index":0,"message":{"role":"assistant","content":"analysisUser asks for a joke. Provide a joke. Keep it appropriate.assistantfinalSure, here's a classic one for you:\n\n**Why don’t scientists trust atoms?**\n\n*Because they make up everything!*","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":85,"total_tokens":134,"completion_tokens":49,"prompt_tokens_details":null},"prompt_logprobs":null,"kv_transfer_params":null}
This method can be used for many different tasks, including code completion, tool-calling workflows, and complex function calls. Try your own prompts to explore what the model can do.
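In the sample response above, the content field carries both the model's reasoning and its final answer, separated by the literal markers "analysis" and "assistantfinal". If your server returns content in that same shape, a small helper can separate the two. This is a sketch based only on the sample output shown here; other server versions or settings may format the response differently, and the function name is hypothetical.

```python
ANALYSIS_PREFIX = "analysis"       # marker seen at the start of the sample content
FINAL_MARKER = "assistantfinal"    # marker seen before the final answer


def split_reasoning(content: str):
    """Split raw content into (reasoning, final_answer).

    If the final-answer marker is absent, the whole text is treated
    as the answer and the reasoning is returned as an empty string.
    """
    if FINAL_MARKER not in content:
        return "", content.strip()
    reasoning, final = content.split(FINAL_MARKER, 1)
    if reasoning.startswith(ANALYSIS_PREFIX):
        reasoning = reasoning[len(ANALYSIS_PREFIX):]
    return reasoning.strip(), final.strip()


raw = ("analysisUser asks for a joke. Provide a joke. Keep it appropriate."
       "assistantfinalSure, here's a classic one for you:")
reasoning, answer = split_reasoning(raw)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

Plain string splitting keeps the sketch dependency-free, but it would misfire if the answer text itself contained the marker, so treat it as a convenience for exploration only.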
If you prefer Python, you can use OpenAI’s Python library. In a separate terminal window from the running vLLM server, paste the following commands to create a virtual environment, install the required packages, and start Jupyter Lab.
python3 -m venv venv
source venv/bin/activate
pip install openai jupyter
jupyter lab --allow-root
Use a browser feature in your development environment, such as Cursor or VS Code’s simple browser, to access the Jupyter Lab interface locally. Once it is running, create and open a new Jupyter Notebook. In the first code cell, paste the following Python code.
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ]
)
print("Chat response:", chat_response)
If everything is working correctly, you should receive output similar to this:
Chat response: ChatCompletion(id='chatcmpl-b600ce13dfd041a4a934ebe7826c8a44', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='analysisThe user wants a joke. Provide a joke. Should be appropriate. Simple.assistantfinalWhy don’t scientists trust atoms?\n\nBecause they **make up** everything!', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[], reasoning_content=None), stop_reason=None)], created=1762543674, model='openai/gpt-oss-120b', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=41, prompt_tokens=85, total_tokens=126, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None, kv_transfer_params=None)
With this Python-based setup, vLLM interaction can be integrated into many applications and workflows, including custom agents. This approach is as flexible as cURL while also allowing you to use the broader Python ecosystem.
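As one example of building on this setup, the sketch below wraps the chat call in a small class that keeps the conversation history, so each request includes the earlier turns. The class and method names are hypothetical. It expects a client object with the same `chat.completions.create` interface used in the notebook cell, and it does not import the OpenAI library itself.

```python
class ChatSession:
    """Keeps the running message list so each request includes prior turns."""

    def __init__(self, client, model, system_prompt="You are a helpful assistant."):
        self.client = client   # e.g. the OpenAI client pointed at the vLLM server
        self.model = model
        self.messages = [{"role": "system", "content": system_prompt}]

    def ask(self, user_text):
        """Send one user turn and record the assistant's reply in the history."""
        self.messages.append({"role": "user", "content": user_text})
        response = self.client.chat.completions.create(
            model=self.model,
            messages=list(self.messages),  # pass a copy of the history
        )
        reply = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

With the `client` object created earlier, `session = ChatSession(client, "openai/gpt-oss-120b")` followed by repeated `session.ask(...)` calls would carry context from one question to the next. A real agent would also need to cap the history length so that it stays within the model's context window.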
Closing Thoughts
Running vLLM with gpt-oss on AMD MI300X-powered GPU servers is relatively straightforward thanks to the work of the vLLM and ROCm communities. With access to suitable GPU infrastructure, users can launch this powerful model quickly on high-performance hardware.


