Programmatic tool calling is quickly becoming one of the hottest topics among LLM builders on Reddit, X and dev forums. Developers are moving beyond classic “JSON tool calls” toward agents that can write small programs, orchestrate tools, and execute complex workflows safely and efficiently.
If stalled tool calls, hallucinated arguments, and slow, expensive chains sound familiar, you are not alone. Threads with hundreds of upvotes complain that tool calling is “hard business”: models call the wrong tool, over-fetch data, hit rate limits, or simply fabricate results when tools fail. At the same time, new approaches like Anthropic’s Programmatic Tool Calling (PTC) and “code-mode” tools are gaining attention because they cut token usage and improve reliability by letting the model generate code instead of brittle JSON blobs.
What is programmatic tool calling?
Programmatic tool calling is a pattern where the LLM writes and runs small programs to orchestrate tools, instead of just emitting one-off JSON function calls.
In traditional tool calling (OpenAI-style function calling, Claude tools, etc.), the model selects a tool and returns JSON arguments that your backend executes. Programmatic tool calling adds an extra layer: the model generates a mini-program (in Python, a DSL like PTC-Lisp, or another sandboxed language) that can:
- Call multiple tools in sequence or conditionally
- Loop, filter, aggregate and transform data
- Handle partial failures and retries deterministically
Developers on Reddit and X are excited about this because it lets the model express richer logic in fewer tokens and reduces the need to stuff thousands of raw records into the context window.
Why is everyone talking about it now?
Across Reddit communities like r/LLMDevs, r/LangChain and r/ClaudeAI, several “hot themes” emerge around tool calling:
- Reliability: models often choose the wrong tool or hallucinate parameters; people share long debugging threads on prompt tweaks and schema design.
- Latency and cost: multiple back‑and‑forth tool calls lead to slow UX and high GPU bills.
- Context bloat: developers push huge datasets into the context instead of using tools properly for filtering and aggregation.
- Orchestration complexity: people chain tools in LangChain, custom frameworks, or bespoke orchestrators, and it quickly becomes spaghetti.
Programmatic tool calling is seen as a way to address these pain points: instead of asking the model to micromanage every tool call step, you let it produce a compact program that runs in a sandbox, hits your tools efficiently, and only returns summarized results to the LLM.
This is also why there is so much energy around new libraries implementing PTC patterns in various languages and frameworks.
Classic tool calling vs programmatic tool calling
Here is a high‑level comparison of the two approaches:
| Aspect | Classic tool calling | Programmatic tool calling |
|---|---|---|
| Model output | JSON function call | Short program (Python, DSL, etc.) |
| Orchestration | Mostly handled by LLM + client loop | Mostly handled by generated program in sandbox |
| Workflow depth | One or few tools per turn | Arbitrary sequences, loops, branching |
| Token usage | Higher for many tool calls and large results | Lower, as tools compute and summarize server-side |
| Reliability | Prone to hallucinated arguments | More deterministic, typed signatures and retries possible |
| Complexity | Simpler to start, harder at scale | More setup initially, cleaner for complex agents |
On Regolo, you can run both patterns on the same GPU-backed LLMs; the difference is mostly at the application layer.
Core architecture on Regolo GPUs
To make these ideas concrete, imagine a backend running on Regolo’s LLM as a Service. You have:
- A hosted model (e.g. a 70B+ instruction-tuned LLM) on NVIDIA H100/A100/L40S GPUs, exposed via HTTP API.
- A set of tools in your app: database queries, internal APIs, search engines, etc.
- A small “tool runtime” that executes either:
- single tool calls (classic), or
- generated mini-programs (programmatic).
Your goal is to:
- Keep GPU utilization high (batching, KV caching) to reduce cost.
- Minimize tokens by having tools do heavy data processing outside the LLM.
- Maintain strict data privacy and zero data retention, which Regolo supports by design with European, privacy-first infrastructure.
In practice, this means designing your app so that the LLM does “high-level reasoning”, while your tools and program runtime do “heavy lifting”. Regolo’s GPU clusters handle the LLM inference efficiently; your code handles tool execution in your VPC or backend.
Example 1: Classic JSON tool calling on Regolo
Let’s start with a simple, OpenAI-style pattern and then evolve it into programmatic tool calling.
Assume Regolo exposes a chat completion endpoint similar in spirit to other major APIs, with support for tool (function) descriptions. (Adapt the base URL and auth headers to your actual Regolo endpoint.)
```python
import os
import json
import requests

REGOLO_API_KEY = os.environ["REGOLO_API_KEY"]
REGOLO_BASE_URL = "https://api.regolo.ai/v1/chat/completions"

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_user_orders",
            "description": "Fetch last N orders for a given user id",
            "parameters": {
                "type": "object",
                "properties": {
                    "user_id": {"type": "string"},
                    "limit": {"type": "integer", "minimum": 1, "maximum": 50}
                },
                "required": ["user_id"]
            }
        }
    }
]

def call_regolo(messages, tools=None, tool_choice="auto"):
    payload = {
        "model": "regolo-llm-large",
        "messages": messages,
    }
    if tools is not None:
        payload["tools"] = tools
        payload["tool_choice"] = tool_choice
    resp = requests.post(
        REGOLO_BASE_URL,
        headers={
            "Authorization": f"Bearer {REGOLO_API_KEY}",
            "Content-Type": "application/json",
        },
        data=json.dumps(payload),
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def get_user_orders(user_id: str, limit: int = 10):
    # Your actual business logic here (DB query, microservice call, etc.)
    return [
        {"id": "order_123", "total": 42.5},
        {"id": "order_456", "total": 19.9},
    ][:limit]

def chat_with_tools(user_message: str):
    messages = [
        {"role": "system", "content": "You are a helpful support assistant."},
        {"role": "user", "content": user_message},
    ]
    first = call_regolo(messages, tools=tools, tool_choice="auto")
    msg = first["choices"][0]["message"]
    # "tool_calls" may be absent or null when the model answers directly.
    if not msg.get("tool_calls"):
        return msg["content"]
    tool_results = []
    for tool_call in msg["tool_calls"]:
        name = tool_call["function"]["name"]
        args = json.loads(tool_call["function"]["arguments"])
        if name == "get_user_orders":
            result = get_user_orders(**args)
        else:
            result = {"error": "unknown tool"}
        tool_results.append({
            "tool_call_id": tool_call["id"],
            "role": "tool",
            "name": name,
            "content": json.dumps(result)
        })
    messages.append(msg)
    messages.extend(tool_results)
    second = call_regolo(messages)
    return second["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat_with_tools("Show me my last 2 orders. My id is user_789."))
```
This pattern mirrors what many developers already use and what X/Reddit threads often refer to as the “two‑step loop”: model proposes a tool call, client executes, then model finalizes. It works, but as workflows grow more complex, the orchestration logic in chat_with_tools becomes harder to maintain.
Example 2: Moving to programmatic tool calling
Now let’s let the model write a mini-program describing how to use tools, instead of a single JSON call. Inspired by recent PTC implementations, we can define a tiny DSL where the model returns something like:
```lisp
(seq
  (define orders (get_user_orders "user_789" 20))
  (define filtered (filter_by_min_total orders 30))
  (summarize_orders filtered))
```
To keep things simple in Python, we will instead ask the model to return a restricted subset of Python using only whitelisted functions. Do not run arbitrary model-generated Python in production: always sandbox and validate it. This example is intentionally minimal for clarity.
Step 1: Define a “tool runtime”
```python
import textwrap

TOOLS_RUNTIME = {
    "get_user_orders": get_user_orders,  # defined in Example 1
    "filter_by_min_total": lambda orders, min_total: [
        o for o in orders if o["total"] >= min_total
    ],
    "summarize_orders": lambda orders: {
        "count": len(orders),
        "total_amount": sum(o["total"] for o in orders)
    },
}

def execute_program(program_source: str, tools_runtime: dict):
    local_env = {}
    # No builtins: the program can only see the whitelisted tools.
    safe_globals = {"__builtins__": {}}
    safe_globals.update(tools_runtime)
    program_source = textwrap.dedent(program_source)
    exec(program_source, safe_globals, local_env)
    if "result" not in local_env:
        raise ValueError("Program must set a `result` variable.")
    return local_env["result"]
```
The rule for the model will be: “write a Python script that ends with a variable named result containing the final output”. The script may call only the tools we expose in TOOLS_RUNTIME.
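To make the contract concrete, here is a hypothetical program the model might return for “summarize my orders over 30”, executed with the runtime above. The runtime pieces are restated in minimal form so the snippet runs standalone:

```python
import textwrap

# Minimal restatement of the Step 1 runtime so this snippet is self-contained.
def get_user_orders(user_id, limit=10):
    # Stubbed data; in production this would query your database.
    return [{"id": "order_123", "total": 42.5},
            {"id": "order_456", "total": 19.9}][:limit]

TOOLS_RUNTIME = {
    "get_user_orders": get_user_orders,
    "filter_by_min_total": lambda orders, min_total: [
        o for o in orders if o["total"] >= min_total],
    "summarize_orders": lambda orders: {
        "count": len(orders),
        "total_amount": sum(o["total"] for o in orders)},
}

def execute_program(program_source, tools_runtime):
    safe_globals = {"__builtins__": {}}  # only whitelisted tools visible
    safe_globals.update(tools_runtime)
    local_env = {}
    exec(textwrap.dedent(program_source), safe_globals, local_env)
    return local_env["result"]

# Hypothetical model output: fetch, filter, summarize, assign to `result`.
model_program = """
orders = get_user_orders("user_789", 20)
big = filter_by_min_total(orders, 30)
result = summarize_orders(big)
"""

print(execute_program(model_program, TOOLS_RUNTIME))
# {'count': 1, 'total_amount': 42.5}
```

Note that all the looping and filtering happens inside the program; the LLM never sees the raw order records.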
Step 2: Ask the LLM for a program, not a final answer
```python
PROGRAMMING_PROMPT = """
You are an AI that writes short Python programs to solve tasks by calling tools.

Available functions:
- get_user_orders(user_id: str, limit: int) -> list[dict]
- filter_by_min_total(orders: list[dict], min_total: float) -> list[dict]
- summarize_orders(orders: list[dict]) -> dict

Rules:
- Do NOT print anything.
- Do NOT import modules.
- Use only the functions listed above and basic Python control flow.
- Always assign the final answer to a variable named `result`.
"""

def plan_with_program(user_message: str):
    messages = [
        {"role": "system", "content": PROGRAMMING_PROMPT},
        {"role": "user", "content": user_message},
    ]
    resp = call_regolo(messages)
    program_source = resp["choices"][0]["message"]["content"]
    return program_source

def run_agent_with_program(user_message: str):
    program_source = plan_with_program(
        f"Write a program that solves this request: {user_message}"
    )
    result = execute_program(program_source, TOOLS_RUNTIME)
    explanation_messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"User request: {user_message}"},
        {"role": "user", "content": f"Tool results: {json.dumps(result)}"},
    ]
    resp = call_regolo(explanation_messages)
    return resp["choices"][0]["message"]["content"]
```
Now the agent does three steps:
- Ask the LLM for a program.
- Execute that program locally against tools, without extra LLM calls.
- Ask the LLM to express the final result to the user.
Developers on Reddit, X and other communities highlight that this pattern dramatically reduces the number of LLM calls and token usage because all looping and filtering happens in the program, not in natural language. Running this on Regolo means your GPU budget is spent on reasoning and summarization, not on blindly streaming large datasets back and forth.
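Alongside the token savings, the earlier “always sandbox and validate” advice deserves a concrete sketch. Before executing a generated program, you can statically check it with Python’s `ast` module. The whitelist and node checks below are illustrative, not an exhaustive sandbox:

```python
import ast

# Names the generated program is allowed to call; `len` and `sum` are
# included here as examples of basic helpers you might also expose.
ALLOWED_CALLS = {"get_user_orders", "filter_by_min_total",
                 "summarize_orders", "len", "sum"}
# Banning Attribute access also blocks escape tricks like ''.__class__.
FORBIDDEN_NODES = (ast.Import, ast.ImportFrom, ast.Attribute,
                   ast.Global, ast.Nonlocal)

def validate_program(source: str) -> None:
    """Raise ValueError if the program strays outside the whitelist."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, FORBIDDEN_NODES):
            raise ValueError(f"forbidden construct: {type(node).__name__}")
        if isinstance(node, ast.Call):
            # Only plain-name calls to whitelisted functions are allowed.
            if not isinstance(node.func, ast.Name) or node.func.id not in ALLOWED_CALLS:
                raise ValueError("call to non-whitelisted function")

validate_program("result = summarize_orders(get_user_orders('u1', 5))")  # ok
try:
    validate_program("import os\nresult = os.listdir('.')")
except ValueError as e:
    print("rejected:", e)
```

Running validation before `exec` turns a whole class of runtime surprises into clean, explainable rejections that can be fed back to the model.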
Example 3: Multi-step workflow with sub-agents
Some of the most interesting community demos combine programmatic tool calling with sub-agents: smaller specialists that the main agent can invoke.
Here is a simple pattern:
```python
def search_knowledge_base(query: str):
    # Search your KB (e.g. Elasticsearch / vector DB) and return docs
    return [{"title": "Refund policy", "content": "..."}, ...]

def escalate_to_human(ticket_id: str, summary: str):
    # Create escalation in your ticketing system
    return {"status": "created", "ticket_id": ticket_id}

AGENT_TOOLS = {
    "get_user_orders": get_user_orders,
    "filter_by_min_total": TOOLS_RUNTIME["filter_by_min_total"],
    "summarize_orders": TOOLS_RUNTIME["summarize_orders"],
    "search_knowledge_base": search_knowledge_base,
    "escalate_to_human": escalate_to_human,
}

PROGRAMMING_PROMPT_MULTI = """
You are an AI that writes short Python programs to solve customer support tasks.

Available functions:
- get_user_orders(user_id: str, limit: int)
- filter_by_min_total(orders, min_total: float)
- summarize_orders(orders)
- search_knowledge_base(query: str)
- escalate_to_human(ticket_id: str, summary: str)

Rules:
- Use the tools as needed.
- You may branch based on tool results.
- Always set a final `result` dict with keys:
  - 'answer': str (what to say to the user)
  - 'actions': list (descriptions of any actions you took)
"""

def execute_support_program(program_source: str):
    return execute_program(program_source, AGENT_TOOLS)
```
With the right prompt, the LLM can generate programs that first search your knowledge base, then only call escalate_to_human if the confidence is low or a policy requires escalation. All of this logic is captured in a small script, while Regolo provides the GPU horsepower to interpret the user request and generate the program and final explanation.
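A hypothetical program the model might emit for a refund question, shown with the tools stubbed so the sketch runs standalone:

```python
import textwrap

# Stubbed tools so this runs standalone; real implementations would
# query your knowledge base and ticketing system.
AGENT_TOOLS = {
    "search_knowledge_base": lambda query: [
        {"title": "Refund policy", "content": "Refunds within 30 days."}],
    "escalate_to_human": lambda ticket_id, summary: {
        "status": "created", "ticket_id": ticket_id},
}

def execute_support_program(source):
    env = {"__builtins__": {}}  # only whitelisted tools visible
    env.update(AGENT_TOOLS)
    local_env = {}
    exec(textwrap.dedent(source), env, local_env)
    return local_env["result"]

# Hypothetical model output: search first, escalate only if nothing found.
program = """
docs = search_knowledge_base("refund policy")
if docs:
    result = {"answer": docs[0]["content"], "actions": ["searched KB"]}
else:
    escalate_to_human("T-1", "No policy found for refund question")
    result = {"answer": "A human agent will follow up.", "actions": ["escalated"]}
"""

print(execute_support_program(program))
# {'answer': 'Refunds within 30 days.', 'actions': ['searched KB']}
```

The branch on `docs` is exactly the kind of logic that would otherwise cost an extra LLM round trip in classic tool calling.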
Getting started with Regolo for programmatic tool calling
To implement these patterns in your own stack:
- Choose a Regolo model that fits your use case (generalist vs code-heavy, context length, latency).
- Implement a thin client in your preferred language to call the Regolo chat/completion API.
- Start with classic tool calling to validate your tools and schemas.
- Introduce a program-writing step (like plan_with_program) with a restricted runtime and sandboxed execution.
- Add error feedback and retries so the LLM can iteratively improve its programs.
- Monitor latency, token usage and GPU costs; adjust prompts to keep programs short and tool-heavy.
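The error-feedback step above can be sketched as a small retry loop. The planner and executor here are mocked (a real setup would pass `plan_with_program` and `execute_program` from Example 2):

```python
def run_with_retries(user_message, plan, execute, max_attempts=3):
    """Ask the model for a program; on failure, feed the error back and retry."""
    prompt = f"Write a program that solves this request: {user_message}"
    for _ in range(max_attempts):
        program = plan(prompt)
        try:
            return execute(program)
        except Exception as exc:
            # Turn the failure into feedback the model can act on.
            prompt = (f"Your previous program failed with: {exc!r}\n"
                      f"Original request: {user_message}\n"
                      f"Previous program:\n{program}\n"
                      f"Write a corrected program.")
    raise RuntimeError(f"no valid program after {max_attempts} attempts")

# Mocked planner: returns a broken program once, then a valid one.
attempts = []
def fake_plan(prompt):
    attempts.append(prompt)
    return "result = undefined_tool()" if len(attempts) == 1 else "result = 42"

def fake_execute(program):
    local_env = {}
    exec(program, {"__builtins__": {}}, local_env)
    return local_env["result"]

print(run_with_retries("answer 42", fake_plan, fake_execute))  # 42
```

Capping attempts keeps a confused model from burning GPU budget in an endless repair loop.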
FAQ
What is the main advantage of programmatic tool calling over classic function calling?
The main advantage is that the LLM expresses complex workflows as short programs that can call multiple tools, loop and branch, instead of juggling many separate tool calls. This reduces token usage, increases reliability, and makes orchestration easier to reason about.
Is it safe to let an LLM write and run code?
It can be safe if you strictly sandbox execution: limit the language surface area, restrict imports, whitelist functions, enforce timeouts and memory limits, and treat errors as feedback to the LLM rather than system crashes. This is exactly the approach taken by emerging PTC runtimes.
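One concrete piece of that sandboxing is a hard time budget per program. A minimal Unix-only sketch using `signal.SIGALRM` (a real deployment would add process isolation and memory limits on top):

```python
import signal

class ProgramTimeout(Exception):
    pass

def _on_alarm(signum, frame):
    raise ProgramTimeout("generated program exceeded its time budget")

def run_with_timeout(fn, seconds=2):
    """Run fn() but abort with ProgramTimeout after `seconds` (Unix only)."""
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)
    try:
        return fn()
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)

print(run_with_timeout(lambda: sum(range(10))))  # 45
```

A `ProgramTimeout` is then just another error to feed back to the model, not a crash.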
Do I need a special model to use programmatic tool calling?
No, you can typically use any strong instruction-following LLM that can follow your “write code in this format” instructions. Some providers add explicit PTC support, but the pattern can be implemented on top of generic models as shown in this article.
How does this affect GPU cost?
By reducing the number of LLM turns and shrinking prompts, programmatic tool calling lowers total token throughput per task, which translates into fewer GPU-seconds per workflow. High-throughput inference techniques like batching and KV caching on Regolo further amplify these savings.
Can I combine PTC with RAG (Retrieval-Augmented Generation)?
Yes. A common pattern is to expose retrieval as a tool (e.g. search_knowledge_base) and let the program decide when and how to call it, possibly multiple times, before summarizing results for the user. This keeps raw documents outside the LLM context and lets your retrieval layer stay independent.
Does Regolo store my prompts or tool results?
Regolo is designed as a Zero Data Retention, Data Privacy First platform with compute and data residency in Europe. Requests are processed for inference and not retained for training, which aligns well with privacy-sensitive use cases like internal tools and enterprise agents.
Github Codes
You can download the code from our GitHub repo: just copy the .env.example files and fill them in with your credentials. If you need help, you can always reach out to our team on Discord 🤙
🚀 Ready? Start your free trial today
- Discord – Share your thoughts
- GitHub Repo – Code of blog articles ready to start
- Follow Us on X @regolo_ai
- Open discussion on our Subreddit Community
Built with ❤️ by the Regolo team. Questions? regolo.ai/contact or chat with us on Discord