Build Faster Agentic LLM Workflows with Asynchronous Python Calls
Large language models can be difficult to run reliably in production because they may introduce inaccurate answers, inconsistent behavior, or noticeable response delays. As models become more advanced, it can be tempting to add more context and more detailed instructions to every prompt and expect the model to handle everything in one step. However, when a single system prompt is sent to a relatively large model, response time can often range from 3 to 15 seconds. For time-sensitive agents, such as voice-based phone assistants, that pause is too long.
A faster approach is to split the overall task into several smaller tasks and allow your LLM workflow to process those tasks in parallel. This makes it easier for smaller models to complete each task accurately. It can also significantly reduce latency and, in many cases, lower your operating costs.
There are several ways to run agentic workflows in parallel. In many agentic systems, however, the main bottleneck is the actual call to the LLM. You can reduce this bottleneck by sending multiple LLM API requests at the same time with Python’s asyncio and aiohttp libraries.
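The latency difference between sequential and concurrent requests can be illustrated with a small sketch. It assumes a hypothetical fake_llm_call() coroutine that only sleeps to imitate network wait time; no real LLM API is contacted, and the delays are made-up values.

```python
import asyncio
import time


async def fake_llm_call(name, delay):
    """Hypothetical stand-in for a network-bound LLM request (no API is called)."""
    await asyncio.sleep(delay)
    return f"{name} done"


async def run_sequentially(calls):
    # Each call waits for the previous one to finish, so the delays add up.
    return [await fake_llm_call(name, delay) for name, delay in calls]


async def run_concurrently(calls):
    # All calls start together, so total time is close to the slowest call.
    return await asyncio.gather(*(fake_llm_call(name, delay) for name, delay in calls))


def timed(coroutine_function, calls):
    """Run a coroutine function to completion and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coroutine_function(calls))
    return results, time.perf_counter() - start


calls = [("pricing", 0.2), ("scheduling", 0.2), ("listing", 0.2)]
seq_results, seq_seconds = timed(run_sequentially, calls)
con_results, con_seconds = timed(run_concurrently, calls)
print(f"sequential: {seq_seconds:.2f}s, concurrent: {con_seconds:.2f}s")
```

With three simulated calls of equal length, the sequential run should take roughly three times as long as the concurrent run. Real LLM endpoints will vary, because the server may also queue or batch concurrent requests.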
In this tutorial, you will use Python to create agentic workflows that send multiple LLM requests asynchronously. This makes it possible to build faster and more accurate LLM applications that can be deployed on a virtual server, GPU server, or serverless environment as an API endpoint.
You will create a customer service workflow for a real estate agency’s phone system. The workflow will answer questions about property listings, schedule appointments, and connect callers to human representatives. It will do this by sending several asynchronous requests to different LLM endpoints, reducing response time and giving you more flexibility when choosing the right model for each task.
Key Takeaways
- Using asynchronous requests to small, fast models can help you build effective, low-latency agentic workflows for applications where response speed matters.
- Multi-model workflows let you customize each prompt for a specific model. This can reduce overall compute requirements and lower the cost of both self-hosted LLM deployments and third-party API usage.
Prerequisites
To follow this tutorial, you will need:
- A local development environment for Python 3.
- Access to an LLM API. In this tutorial, you will use the Mistral-Small-3.2-24B model deployed on a GPU server using vLLM. To serve a model yourself, follow a vLLM setup guide for your preferred GPU environment.
- Familiarity with Python async and await syntax. This is optional, but helpful.
Step 1 — Setting Up Your Environment
First, prepare your environment and install the required dependencies. In this tutorial, requests will be sent to a Mistral-Small-3.2-24B model deployed on an H200 GPU server. You can either send requests to your own deployed model or use a third-party LLM API. This tutorial only sends requests to one LLM, but you can use different models from different providers for each prompt if needed.
Create a virtual environment and install the necessary dependency:
python3 -m venv venv
source venv/bin/activate
pip install aiohttp
Because the model requests will be made asynchronously, you need to define asynchronous functions that can manage those calls. You will use Python’s asyncio and aiohttp libraries. If you use another LLM API, install the provider’s client library if it is required to communicate with the endpoint.
Create a new file named agentic_workflows.py. Add the following code to import the required dependencies and define the API endpoint.
Replace your_server_ip with the IP address of your GPU server deployment, or define the client for the LLM API you are using:
agentic_workflows.py
import asyncio
import aiohttp
# Your vLLM server URL
VLLM_SERVER_URL = "http://your_server_ip:8000/v1/chat/completions"
Step 2 — Writing the Asynchronous Call Logic
Your asynchronous call logic should accept a list of dictionaries. Each dictionary contains the model ID and the prompt that should be sent to the LLM. The function should send all requests without waiting for each individual response before starting the next one. Each prompt will include the conversation history and the system prompt for that specific request. You will also add a synchronous call_models function that can be imported and used in your main script.
Add the following asynchronous functions to agentic_workflows.py:
agentic_workflows.py
async def _call_single_model(call_spec):
    """Make an async call to the GPU server with a given model and messages"""
    model_id = call_spec["model_id"]
    messages = call_spec["messages"]
    payload = {
        "model": model_id,
        "messages": messages,
        "max_tokens": 100,
        "temperature": 0.1
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(VLLM_SERVER_URL, json=payload) as response:
            # Raise on HTTP errors instead of failing later on a missing "choices" key
            response.raise_for_status()
            result = await response.json()
            message = result["choices"][0]["message"]["content"]
            return message


async def _call_models_async(call_list):
    """Call multiple models asynchronously and return responses in the same order"""
    # Create one coroutine per model call
    tasks = [_call_single_model(call_spec) for call_spec in call_list]
    # Run all calls concurrently and return responses in order
    responses = await asyncio.gather(*tasks)
    return responses


def call_models(call_list):
    """Synchronous wrapper around the asynchronous model calls"""
    return asyncio.run(_call_models_async(call_list))
aiohttp.ClientSession() creates an asynchronous client session that sends POST requests to the API endpoint and handles HTTP responses without blocking the rest of the program. This allows several LLM API calls to run at the same time instead of one after another.
The _call_models_async() function collects the request details you provide and sends all API calls without waiting for each response before starting the next call. asyncio.gather() returns the responses in a list in the same order in which the requests were passed in, regardless of which request finishes first. The call_models() function is a synchronous wrapper, making the asynchronous workflow easier to call from other code. Because it relies on asyncio.run(), call it from regular synchronous code rather than from inside an already running event loop.
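The ordering behavior of asyncio.gather() can be checked in isolation. The sketch below is not part of agentic_workflows.py: fake_call() is a hypothetical stand-in that sleeps instead of contacting an API, and the use of return_exceptions=True is an optional addition that keeps one failed request from discarding the other responses.

```python
import asyncio


async def fake_call(label, delay, fail=False):
    """Hypothetical stand-in for _call_single_model; sleeps instead of calling an API."""
    await asyncio.sleep(delay)
    if fail:
        raise RuntimeError(f"{label} failed")
    return label


async def gather_in_order():
    # The slowest call is listed first, yet it still comes back first in the list.
    results = await asyncio.gather(
        fake_call("pricing", 0.15),
        fake_call("scheduling", 0.05, fail=True),
        fake_call("listing", 0.01),
        return_exceptions=True,  # failures are returned as exception objects
    )
    # Replace failures with None so the caller can decide how to route them.
    return [None if isinstance(result, Exception) else result for result in results]


ordered = asyncio.run(gather_in_order())
print(ordered)  # ['pricing', None, 'listing']
```

If you adopt this pattern, the routing logic would need to treat None as a failed prompt, for example by falling back to a human representative.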
Step 3 — Writing Your Prompts
Now that the basic asynchronous API request logic is in place, you need to define the prompts. This workflow uses three separate system prompts. One prompt handles pricing questions, one handles scheduling requests, and one identifies other listing-related questions that should be forwarded to a human representative. You can add more prompts later, including prompts with more autonomous decision-making, but these three are enough for this example.
agentic_workflows.py
SYSTEM_PROMPT_CONFIGURATIONS = {
    "pricing_prompt": {
        "model_id": "mistralai/Mistral-Small-3.2-24B-Instruct-2506",
        "prompt": "You are a customer service agent. Determine if the user's most recent request is asking about the price of a listing. If they are asking about the price of a listing AND if they have included the listing_id, return only the listing_id of the item they are asking about in the following format: 'listing_id: XXXXXX'. \nIf they are asking about the pricing of a listing AND did NOT mention the specific listing_id number, ask them for the listing id number. If they are requesting something other than the price of a listing: return only the word 'false'."
    },
    "scheduling_prompt": {
        "model_id": "mistralai/Mistral-Small-3.2-24B-Instruct-2506",
        "prompt": "You are a customer service agent. Determine if the user's most recent request is asking to schedule a call with a real estate agent. If they are trying to schedule a call, return the date and time they would like to schedule the call in the following format: 'date: YYYY-MM-DD, time: HH:MM'. If they are trying to schedule a call but they did not mention the specific date and time they are available, ask them what day and time they are available. Do not provide any additional information, such as the agent's availability. If they are not asking to schedule a call, return only the word 'false'."
    },
    "listing_prompt": {
        "model_id": "mistralai/Mistral-Small-3.2-24B-Instruct-2506",
        "prompt": "You are a customer service agent. Determine if the user is asking a question about a listing. If they are asking a question about a listing, return only the word 'true'. Otherwise, return only the word 'false'."
    }
}
These system prompt configurations include the prompt name for later reference, the model ID that should receive the prompt, and the prompt text that will be combined with the conversation history.
Step 4 — Processing User Input Through the Models
Next, create a function that accepts the user’s conversation, adds each system prompt to that conversation, and sends the asynchronous requests to the correct model for each prompt using the call_models() function created earlier. This example assumes that the conversation history is stored in the frontend in the same format accepted by the API and is sent together with the user’s latest message.
Add the following function to agentic_workflows.py:
agentic_workflows.py
def run_agentic_workflow(conversation_history):
    model_calls_list = []
    prompt_names = []  # Keep track of prompt order for response mapping
    for prompt_name, config in SYSTEM_PROMPT_CONFIGURATIONS.items():
        model_id = config["model_id"]
        system_prompt = config["prompt"]
        # Construct the full message prompt: system prompt + conversation history
        full_messages = [{"role": "system", "content": system_prompt}] + conversation_history
        model_calls_list.append({
            "model_id": model_id,
            "messages": full_messages
        })
        prompt_names.append(prompt_name)
    prompt_responses = call_models(model_calls_list)
This function builds a list that contains one dictionary per system prompt. Each dictionary contains the model ID and the full message set, which includes the system prompt plus the conversation history. The function then passes the list to call_models(), which sends one request per system prompt concurrently.
After the models are called, the responses are stored in the prompt_responses variable in the same order in which the requests were sent. You then need to map those responses to specific variables.
Continue extending the run_agentic_workflow function:
agentic_workflows.py
    # Map responses to their respective prompts
    pricing_response = prompt_responses[prompt_names.index("pricing_prompt")]
    scheduling_response = prompt_responses[prompt_names.index("scheduling_prompt")]
    listing_response = prompt_responses[prompt_names.index("listing_prompt")]
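As the number of prompts grows, looking up each response by index can become repetitive. One possible alternative, shown here as a standalone sketch with made-up sample replies rather than real model output, is to pair the names and responses into a dictionary:

```python
# Sample values for illustration only; in the workflow these come from
# SYSTEM_PROMPT_CONFIGURATIONS and call_models().
prompt_names = ["pricing_prompt", "scheduling_prompt", "listing_prompt"]
prompt_responses = ["listing_id: 123456", "false", "true"]

# zip() pairs each name with the response at the same position.
responses_by_name = dict(zip(prompt_names, prompt_responses))

pricing_response = responses_by_name["pricing_prompt"]
scheduling_response = responses_by_name["scheduling_prompt"]
listing_response = responses_by_name["listing_prompt"]
print(pricing_response)  # listing_id: 123456
```

This relies on the responses arriving in the same order as the requests, which is the behavior described for call_models() above.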
Step 5 — Defining the Workflow Logic
Finally, add a logic layer to this function. This layer reviews the responses and decides what should happen next. It can look up a listing price in a database, route the call to a human representative, schedule an appointment with a real estate agent, or ask the user a follow-up question for more information.
Continue adding to the run_agentic_workflow function in agentic_workflows.py:
agentic_workflows.py
    # Route 1: Handle pricing inquiries
    if pricing_response.lower() != "false":
        if pricing_response.startswith("listing_id:"):
            # Extract listing ID from response (works with or without a space after the colon)
            listing_id = pricing_response.split("listing_id:", 1)[1].strip()
            # Simulate database/API lookup
            # You can add database querying logic here to look for a target listing ID. For our example, we will use a simple Python dictionary
            example_price_database = {
                "123456": "$350,000",
                "654321": "$450,000",
                "112233": "$550,000"
            }
            found_price = example_price_database.get(listing_id)
            if found_price:
                final_response = f"The price for listing {listing_id} is {found_price}."
            else:
                final_response = "We are unable to find that listing ID in our records. Are you sure you have the correct listing ID?"
        else:
            final_response = pricing_response  # Response asking for listing ID
    # Route 2: Handle scheduling requests
    elif scheduling_response.lower() != "false":
        if scheduling_response.startswith("date:"):
            final_response = "Perfect! I've scheduled a call for you on that date and time. A sales representative will reach out to you at that time."
            # In production: Add logic to actually book the appointment, and consider customizing the message to confirm the date and time selected.
        else:
            final_response = scheduling_response  # Response asking for specific time
    # Route 3: Handle general listing questions
    elif listing_response.lower() != "false":
        final_response = "Please hold while I transfer you to a specialist for further assistance."
        # In production: Add logic to transfer chat to human representative
        # You could alternatively add logic to access listing details and answer the user's specific question
    else:
        final_response = "I apologize, I'm not sure how I can help with that. Let me transfer you to a human representative who can better assist you."
        # In production: Add logic to transfer chat to human representative
    return final_response
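The routing code above relies on the model returning the exact formats requested in the prompts. Models sometimes add punctuation or change capitalization, so a more tolerant parser may help. The following is a minimal sketch, not part of the tutorial's code: the helper names parse_listing_id(), parse_schedule(), and confirmation_message() are hypothetical, and it assumes listing IDs consist only of digits, as in the example database.

```python
import re
from datetime import datetime

# Patterns for the two structured formats requested in the system prompts.
LISTING_RE = re.compile(r"listing_id:\s*(\d+)", re.IGNORECASE)
SCHEDULE_RE = re.compile(
    r"date:\s*(\d{4}-\d{2}-\d{2}),\s*time:\s*(\d{1,2}:\d{2})", re.IGNORECASE
)


def parse_listing_id(response):
    """Return the listing ID as a string, or None if the format is not present."""
    match = LISTING_RE.search(response)
    return match.group(1) if match else None


def parse_schedule(response):
    """Return a datetime for 'date: YYYY-MM-DD, time: HH:MM', or None if invalid."""
    match = SCHEDULE_RE.search(response)
    if not match:
        return None
    try:
        return datetime.strptime(f"{match.group(1)} {match.group(2)}", "%Y-%m-%d %H:%M")
    except ValueError:
        return None  # For example, month 13 or hour 25


def confirmation_message(response):
    """Build a confirmation that repeats the parsed date and time back to the caller."""
    when = parse_schedule(response)
    if when is None:
        return None
    return f"I've scheduled a call for you on {when:%Y-%m-%d} at {when:%H:%M}."


print(parse_listing_id("Listing_ID: 123456."))
print(confirmation_message("date: 2025-07-04, time: 14:30"))
```

Validating the date with datetime.strptime() also catches impossible values before they reach a booking system.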
Step 6 — Testing Your Workflow
Now that the code for a basic agentic workflow is complete, you can test how it handles the different categories of customer questions. This step walks through a conversation about the price of a listing; you can test scheduling requests and general listing questions the same way. In this example, the user does not provide the listing ID in the first request, so the model asks a follow-up question. The user then provides the ID, and the workflow returns the correct price.
Create a new file named test_workflow.py and add the following test code:
test_workflow.py
from agentic_workflows import run_agentic_workflow
conversation_history = [
    {"role": "user", "content": "Hi, can you tell me the price of one of your listings?"},
]
final_response = run_agentic_workflow(conversation_history)
print(f"Response: {final_response}")
Run the test script:
python3 test_workflow.py
You will see output similar to this:
Output
Response: Could you please provide the listing_id of the item you're asking about?
Next, test a complete conversation where the user provides the listing ID:
test_workflow.py
conversation_history = [
    {"role": "user", "content": "Hi, can you tell me the price of one of your listings?"},
    {"role": "assistant", "content": "Could you please provide the listing_id of the item you're asking about?"},
    {"role": "user", "content": "Yes, the listing ID is 123456"},
]
final_response = run_agentic_workflow(conversation_history)
print(f"Response: {final_response}")
Run the script again:
python3 test_workflow.py
You will see the pricing response:
Output
Response: The price for listing 123456 is $350,000.
Conclusion
In this tutorial, you created a parallel agentic workflow from the ground up with Python. You wrote asynchronous functions that send multiple LLM requests at the same time, configured several system prompts for different tasks, and added routing logic to process user requests.
To make this framework ready for production, you will need to test it, rewrite prompts, change models, and test it again with different conversations until the accuracy and speed match your requirements. With this type of workflow, each prompt should be tested individually and improved continuously. In this example, the Mistral-Small-3.2-24B model was selected because it provides low latency, with a median response time below 0.5 seconds, follows the specific instructions well, and can be deployed on a single GPU. GPT-OSS-120b, GPT-OSS-20b, and Llama 3.1 8b were also tested, but the Mistral model performed best for this particular use case. Selecting the right model for each prompt or task is an iterative process.
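When comparing models or prompts, it helps to measure latency the same way each time. The sketch below shows one way to compute a median response time. It uses a hypothetical fake_model_call() that only sleeps; to measure a real endpoint, you would pass in a coroutine function that performs the actual request.

```python
import asyncio
import statistics
import time


async def fake_model_call():
    """Hypothetical stand-in for one LLM request; replace with a real call to measure it."""
    await asyncio.sleep(0.02)
    return "false"


async def median_latency(call, runs=5):
    """Await call() several times in sequence and return the median duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        await call()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


median_seconds = asyncio.run(median_latency(fake_model_call))
print(f"median latency: {median_seconds:.3f}s")
```

The median is less sensitive than the mean to an occasional slow request, which makes it a reasonable first metric for comparing candidates.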
If you want agents with more autonomy, you can adjust how independently each prompt operates by providing additional context or access to more tools. You can also chain follow-up LLM calls together for reasoning or other purposes. However, when a prompt requires several API calls in sequence, overall latency increases because the workflow must wait for each sequential step instead of a single group of concurrent model calls.
The Mistral model in this tutorial was deployed with vLLM, so the prompt format did not need to be changed for different models. However, OpenAI, Claude, and open-source models may require different prompt formats when requests are sent to their APIs. If you use different model providers for each call, you may need to write Python functions that convert prompts into the format required by the model being called. For example, you might need a function that converts a vLLM-formatted prompt into a Claude-compatible prompt format.
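As an illustration of such a conversion, the sketch below moves system messages out of the message list and into a separate top-level system field. It assumes the target API expects that layout, which is how Anthropic's Messages API is documented at the time of writing; check the provider's current documentation before relying on it. The function name to_claude_request() and the model ID placeholder are hypothetical.

```python
def to_claude_request(messages, model, max_tokens=100):
    """Convert OpenAI-style chat messages into a request body with a top-level system field.

    Assumption: the target API takes the system prompt separately and accepts
    only "user" and "assistant" roles inside "messages".
    """
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    chat_messages = [m for m in messages if m["role"] != "system"]
    request = {"model": model, "max_tokens": max_tokens, "messages": chat_messages}
    if system_parts:
        request["system"] = "\n\n".join(system_parts)
    return request


vllm_messages = [
    {"role": "system", "content": "You are a customer service agent."},
    {"role": "user", "content": "Hi, can you tell me the price of one of your listings?"},
]
# "your-claude-model-id" is a placeholder, not a real model name.
claude_request = to_claude_request(vllm_messages, model="your-claude-model-id")
print(claude_request["system"])
```

The endpoint URL, authentication headers, and response structure also differ between providers, so the request and response handling in _call_single_model() would need a matching adapter.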
Some models may respond much more slowly than others, which means the workflow may have to wait for the slowest response before all results are available. With Python’s asyncio library, you can process model responses as soon as they arrive by using asyncio.as_completed() instead of asyncio.gather(). Then you can check the earliest responses to determine whether they contain the route you want to continue with and proceed without waiting for every remaining response. This would also require updates to the logic layer.
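A rough sketch of this idea follows. It uses a hypothetical fake_prompt_call() that sleeps instead of calling an API, with made-up delays and replies. Note that it accepts the first reply that is not 'false', which is a different policy from the fixed pricing, scheduling, listing priority used in Step 5.

```python
import asyncio


async def fake_prompt_call(name, delay, reply):
    """Hypothetical stand-in for one prompt's LLM request; returns (name, reply)."""
    await asyncio.sleep(delay)
    return name, reply


async def first_matching_route(specs):
    """Return the first (name, reply) whose reply is not 'false', or None if all are."""
    tasks = [asyncio.create_task(fake_prompt_call(*spec)) for spec in specs]
    try:
        # as_completed yields awaitables in the order the tasks finish.
        for finished in asyncio.as_completed(tasks):
            name, reply = await finished
            if reply.strip().lower() != "false":
                return name, reply
        return None
    finally:
        # Cancel requests that are still pending; this is a no-op for finished tasks.
        for task in tasks:
            task.cancel()


specs = [
    ("pricing_prompt", 0.3, "false"),
    ("scheduling_prompt", 0.05, "date: 2025-07-04, time: 14:30"),
    ("listing_prompt", 0.1, "false"),
]
route = asyncio.run(first_matching_route(specs))
print(route)  # ('scheduling_prompt', 'date: 2025-07-04, time: 14:30')
```

Here the scheduling reply arrives first and is accepted, so the workflow does not wait for the slower pricing call. If several prompts can match the same message, you would need an explicit rule for which one wins.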
When you add more system prompts, the main bottleneck you are likely to encounter is rate limiting from the LLM API or processing limits on your GPU. You can address this by hosting your own model on a GPU server, increasing API rate limit quotas, distributing requests across multiple APIs, and adding retry logic, timeout handling, and error handling for situations where rate limits occur.
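The retry and timeout handling mentioned above could look something like the following sketch. RateLimitError, call_with_retries(), and flaky_call() are hypothetical names introduced for illustration; in a real workflow you would raise or map the rate-limit error from your HTTP client's response, for example on an HTTP 429 status.

```python
import asyncio
import random


class RateLimitError(Exception):
    """Hypothetical error representing a rate-limited request (for example, HTTP 429)."""


async def call_with_retries(make_call, max_attempts=4, timeout=10.0, base_delay=0.5):
    """Await make_call() with a per-attempt timeout and exponential backoff between retries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout)
        except (RateLimitError, asyncio.TimeoutError):
            if attempt == max_attempts:
                raise  # Give up and let the caller decide how to respond
            # Double the wait after each failure and add jitter to spread out retries.
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))


attempts = {"count": 0}


async def flaky_call():
    """Simulated request that is rate limited twice before succeeding."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("simulated rate limit")
    return "ok"


result = asyncio.run(call_with_retries(flaky_call, base_delay=0.01))
print(result, attempts["count"])  # ok 3
```

For a latency-sensitive phone agent, keep max_attempts and timeout small, because each retry adds to the time the caller spends waiting.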


