
Over the past few weeks, we’ve been running open-weight large language models through Deep Agents harness evaluations, and the initial results show they are a viable option both instead of and alongside closed frontier models. GLM-5 (z.ai) and MiniMax M2.7 each score similarly to closed frontier models on core agent tasks such as file operations, tool use, and instruction following.

This isn’t surprising if you’ve been following open model progress via the large set of open benchmarks such as SWE-Rebench and Terminal Bench 2.0. Tool calling is reliable and instruction following is consistent. For developers deploying agents in production, open models now offer a level of consistency and predictability that makes real-world workflows much more viable.

Why open models

When exploring open models, builders and customers tend to focus on a few key factors: cost, latency, and task performance.

In the limit, it would be great to use the smartest frontier model at the highest reasoning level for every task. In practice, two constraints make that unworkable: cost and latency. Closed frontier models can cost 8–10x more for high-throughput workloads, and they're often too slow for the response times users expect in interactive products.

| Model | Type | Input ($/M tokens) | Output ($/M tokens) |
|---|---|---|---|
| Claude Opus 4.6 (Anthropic) | Closed | $5.00 | $25.00 |
| Claude Sonnet 4.6 (Anthropic) | Closed | $3.00 | $15.00 |
| GPT-5.4 (OpenAI) | Closed | $2.50 | $15.00 |
| GLM-5 (Baseten) | Open | $0.95 | $3.15 |
| MiniMax M2.7 (OpenRouter) | Open | $0.30 | $1.20 |

To put the pricing in context: an application outputting 10M tokens/day costs roughly $250/day on Opus 4.6 versus ~$12/day for MiniMax M2.7, counting output tokens alone. That's about an $87k annual difference.
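The arithmetic behind those figures can be reproduced in a few lines. This is a minimal sketch that uses only the output-token prices from the table above and ignores input tokens; the function name is illustrative.

```python
# Output-token prices in USD per million tokens, copied from the pricing table.
OUTPUT_PRICE_PER_M = {
    "Claude Opus 4.6": 25.00,
    "GLM-5 (Baseten)": 3.15,
    "MiniMax M2.7 (OpenRouter)": 1.20,
}


def daily_output_cost(model: str, output_tokens_per_day: int) -> float:
    """Daily spend on output tokens only (input tokens are not counted)."""
    return OUTPUT_PRICE_PER_M[model] * output_tokens_per_day / 1_000_000


tokens_per_day = 10_000_000
opus = daily_output_cost("Claude Opus 4.6", tokens_per_day)
minimax = daily_output_cost("MiniMax M2.7 (OpenRouter)", tokens_per_day)
annual_difference = (opus - minimax) * 365

print(f"Opus: ${opus:.2f}/day, MiniMax: ${minimax:.2f}/day")
print(f"Annual difference: ${annual_difference:,.0f}")
```

A real workload also pays for input tokens, so the true gap depends on your input-to-output ratio.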

Open models tend to be smaller than closed frontier models, and can be accelerated on specialized inference infrastructure — providers like Groq, Fireworks, and Baseten optimize for latency and throughput far beyond what most teams could achieve on their own. OpenRouter data show GLM-5 on Baseten averaging 0.65s latency and 70 tokens/second, compared to 2.56s and 34 tokens/second for Claude Opus 4.6. For latency-sensitive products, that gap is hard to engineer around.
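To see what that gap means for a single response, here is a rough back-of-the-envelope sketch. It assumes the reported latency is the delay before the first token and that tokens then stream at the reported rate. The 500-token response length is an arbitrary illustration, not a figure from our evals.

```python
def estimated_response_seconds(
    latency_s: float, tokens_per_second: float, output_tokens: int
) -> float:
    """Simple model: initial latency plus streaming time for the output."""
    return latency_s + output_tokens / tokens_per_second


OUTPUT_TOKENS = 500  # illustrative response length (assumption)

# Latency and throughput figures are the OpenRouter numbers quoted above.
glm5_seconds = estimated_response_seconds(0.65, 70, OUTPUT_TOKENS)
opus_seconds = estimated_response_seconds(2.56, 34, OUTPUT_TOKENS)

print(f"GLM-5 on Baseten: {glm5_seconds:.1f}s")
print(f"Claude Opus 4.6:  {opus_seconds:.1f}s")
```

Under these assumptions the 500-token response takes roughly 7.8s on GLM-5 versus roughly 17.3s on Opus 4.6. Agent loops chain many such calls, so the difference compounds per step.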

How we evaluated

We've written about our eval methodology in depth in How we build evals for Deep Agents. We run evals using hosted inference providers, but Deep Agents can be run using fully local and private models via Ollama, vLLM, etc.

For open models, we ran seven eval categories: file operations, tool use, retrieval, conversation, memory, summarization, and “unit tests”. These cover tasks that exercise fundamentals: can the model reliably call tools, follow structured instructions, and operate on files? These are the capabilities that gate whether a model is usable in an agentic harness at all.

Each eval case defines success assertions (hard-fail checks that determine correctness) and efficiency assertions (soft checks that measure how the model got there). We report four metrics:

  • Correctness: the fraction of tests the model solved, `passed / total`. A score of 0.68 means 68% of test cases were solved correctly. This is the primary quality signal.
  • Solve rate: a combined measure of accuracy and speed. For each test, we compute `expected_steps / wall_clock_seconds`; failed tests contribute zero. The final score is the average across all tests. Higher is better: a model that solves tasks both correctly and quickly scores highest.
  • Step ratio: how many agentic steps the model actually took compared to how many we expected, aggregated across all tests as `total_actual_steps / total_expected_steps`. A value of 1.0 means the model used exactly the expected number of steps. Above 1.0 means it needed more (less efficient); below 1.0 means it needed fewer steps than initially expected.
  • Tool call ratio: same idea as step ratio, but counting individual tool calls instead of steps. 1.0 is on-budget, above is over-budget, below is under-budget.

Step ratio and tool call ratio are efficiency metrics. They don't affect whether a test passes, but they reveal how economically a model reaches the answer. A model that solves a task in 2 steps instead of the expected 5 is both correct and efficient.
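The four metrics can be sketched directly from their definitions. This is not our grading code: the `EvalResult` record and its field names are hypothetical, and the three sample results are made up to keep the arithmetic easy to follow.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Hypothetical per-test record; field names are illustrative."""
    passed: bool
    expected_steps: int
    actual_steps: int
    expected_tool_calls: int
    actual_tool_calls: int
    wall_clock_seconds: float


def summarize(results: list[EvalResult]) -> dict[str, float]:
    total = len(results)
    return {
        # Fraction of tests solved.
        "correctness": sum(r.passed for r in results) / total,
        # Per-test expected_steps / wall_clock_seconds, zero for failures, averaged.
        "solve_rate": sum(
            r.expected_steps / r.wall_clock_seconds if r.passed else 0.0
            for r in results
        ) / total,
        # Ratios are aggregated over all tests, not averaged per test.
        "step_ratio": sum(r.actual_steps for r in results)
        / sum(r.expected_steps for r in results),
        "tool_call_ratio": sum(r.actual_tool_calls for r in results)
        / sum(r.expected_tool_calls for r in results),
    }


results = [
    EvalResult(True, 4, 4, 3, 3, 8.0),    # on budget
    EvalResult(True, 5, 2, 5, 2, 10.0),   # solved in fewer steps than expected
    EvalResult(False, 3, 6, 2, 4, 20.0),  # failed, and over budget
]
summary = summarize(results)
print(summary)
```

Here two of three tests pass, so correctness is 2/3 and solve rate is (0.5 + 0.5 + 0) / 3. The step ratio comes out at exactly 1.0 even though no single test ran like the average one, which is why the aggregate ratios are best read next to correctness and not in isolation.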

Findings from our evals

These are early results; we’re actively maintaining and expanding our eval set. You can view recent runs in real time both in our GitHub repo and at this shared LangSmith project.

Open models

View CI run (click model names to view individual evals)

| Model | Correctness | Passed | Solve Rate | Step Ratio | Tool Call Ratio |
|---|---|---|---|---|---|
| baseten:zai-org/GLM-5 | 0.64 | 94 of 138 | 1.17 | 1.02 | 1.06 |
| ollama:minimax-m2.7:cloud | 0.57 | 85 of 138 | 0.27 | 1.02 | 1.04 |

Per-category correctness:

| Model | Conversation | File Ops | Memory | Retrieval | Summarization | Tool Use | Unit Test |
|---|---|---|---|---|---|---|---|
| baseten:zai-org/GLM-5 | 0.38 | 1 | 0.44 | 1 | 0.6 | 0.82 | 1 |
| ollama:minimax-m2.7:cloud | 0.14 | 0.92 | 0.38 | 0.8 | 0.6 | 0.87 | 0.92 |

Frontier models

View CI run (click model names to view individual evals)

| Model | Correctness | Passed | Solve Rate | Step Ratio | Tool Call Ratio |
|---|---|---|---|---|---|
| anthropic:claude-opus-4-6 | 0.68 | 100 of 138 | 0.38 | 0.99 | 1.02 |
| google_genai:gemini-3.1-pro-preview | 0.65 | 96 of 138 | 0.26 | 0.99 | 1.01 |
| openai:gpt-5.4 | 0.61 | 91 of 138 | 0.61 | 1.05 | 1.15 |

Per-category correctness:

| Model | Conversation | File Ops | Memory | Retrieval | Summarization | Tool Use | Unit Test |
|---|---|---|---|---|---|---|---|
| anthropic:claude-opus-4-6 | 0.05 | 1 | 0.67 | 1 | 1 | 0.87 | 1 |
| google_genai:gemini-3.1-pro-preview | 0.24 | 0.92 | 0.62 | 1 | 0.8 | 0.79 | 0.92 |
| openai:gpt-5.4 | 0.29 | 1 | 0.44 | 1 | 0.8 | 0.76 | 1 |

For Gemini 3+, this is `high`.

For OpenAI, this is `medium`.

For Claude, this is without extended thinking.

DIY: Run Deep Agent evals locally

Our CI runs the same evaluation suite across 52 models organized into groups, including an `open` group (`baseten:zai-org/GLM-5`, `ollama:minimax-m2.7:cloud`, `ollama:nemotron-3-super`) that runs on every eval workflow. You can target any model group:

Run evals against all open models: pytest tests/evals --model-group open

Run against a specific model: pytest tests/evals --model baseten:zai-org/GLM-5

This makes it straightforward to compare open models against each other and against closed frontier models on the same tasks, using the same grading criteria.
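If you want a similar `--model` / `--model-group` switch in your own eval suite, one way to wire it up is with pytest's standard `pytest_addoption` and `pytest_generate_tests` hooks in a `conftest.py`. The sketch below is an assumption about how such a setup could look, not a copy of our conftest. The `MODEL_GROUPS` registry and `resolve_models` helper are hypothetical, and only the members of the `open` group come from the text above.

```python
# conftest.py (hypothetical sketch)

# Hypothetical group registry; only the "open" group's members come from the article.
MODEL_GROUPS = {
    "open": [
        "baseten:zai-org/GLM-5",
        "ollama:minimax-m2.7:cloud",
        "ollama:nemotron-3-super",
    ],
}


def resolve_models(model=None, model_group=None):
    """Turn the CLI options into the list of model identifiers to test."""
    if model and model_group:
        raise ValueError("pass either --model or --model-group, not both")
    if model:
        return [model]
    if model_group:
        if model_group not in MODEL_GROUPS:
            raise ValueError(f"unknown model group: {model_group!r}")
        return list(MODEL_GROUPS[model_group])
    # Assumed default: run every model in every group.
    return [m for group in MODEL_GROUPS.values() for m in group]


def pytest_addoption(parser):
    parser.addoption("--model", default=None, help="single model identifier")
    parser.addoption("--model-group", default=None, help="named group of models")


def pytest_generate_tests(metafunc):
    # Parametrize any test that asks for a `model` argument.
    if "model" in metafunc.fixturenames:
        models = resolve_models(
            metafunc.config.getoption("--model"),
            metafunc.config.getoption("--model-group"),
        )
        metafunc.parametrize("model", models)
```

With this in place, a test written as `def test_file_ops(model): ...` runs once per selected model, so every model is graded by the same assertions.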

Using open models in Deep Agents SDK

Swapping to an open model is a one-line change:

GLM-5:

pip install langchain-baseten

from deepagents import create_deep_agent

agent = create_deep_agent(model="baseten:zai-org/GLM-5")

MiniMax M2.7:

pip install langchain-openrouter

from deepagents import create_deep_agent

agent = create_deep_agent(model="openrouter:minimax/minimax-m2.7")

That's it. The harness handles the rest — it detects the model's context window size, disables unsupported modalities, and injects the right identity into the system prompt so the agent knows what it's working with.

The same open model is often available through multiple providers. Pick the one that matches your constraints. For example, GLM-5 is available as `baseten:zai-org/GLM-5`, `fireworks:fireworks/glm-5`, or `ollama:glm-5` for self-hosted. Same model, same harness, different infrastructure.
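Because the provider is just a prefix in the model string, the choice can live in configuration and stay out of your code. A minimal sketch, assuming a hypothetical `GLM5_DEPLOYMENT` environment variable and made-up deployment labels; only the three model strings come from the text above.

```python
import os

# Deployment labels are hypothetical; the model strings are the ones listed above.
GLM5_BY_DEPLOYMENT = {
    "baseten": "baseten:zai-org/GLM-5",
    "fireworks": "fireworks:fireworks/glm-5",
    "self-hosted": "ollama:glm-5",
}


def pick_glm5(deployment: str) -> str:
    """Map a deployment label to the provider-prefixed model identifier."""
    try:
        return GLM5_BY_DEPLOYMENT[deployment]
    except KeyError:
        known = ", ".join(sorted(GLM5_BY_DEPLOYMENT))
        raise ValueError(f"unknown deployment {deployment!r}; expected one of: {known}") from None


# GLM5_DEPLOYMENT is an assumed variable name, not something the SDK reads.
model = pick_glm5(os.environ.get("GLM5_DEPLOYMENT", "baseten"))
print(model)

# The resulting string is what you would pass to the SDK:
# agent = create_deep_agent(model=model)
```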

LangChain provides support for the most popular open model providers. The providers we have tested for this release are: Baseten, Fireworks, Groq, OpenRouter, and Ollama (cloud).

Harness-level adjustments for your model

Open models have different context windows, different tool-calling formats, and different failure modes than closed frontier models. The Deep Agents harness absorbs these differences so you don't have to:

  • Model identity injection — the system prompt is patched at runtime with the model's name, provider, context limit, and supported modalities. The agent knows what it is and what it can do.
  • Context management — compression, offloading, and summarization thresholds adapt to the model's actual context window, not a hardcoded default. A model with a 4K context gets more aggressive compaction than Opus with 1M.
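To make the context-management point concrete, here is a simplified sketch of a threshold that scales with the context window. It is not the harness's implementation: the 80% trigger fraction and the 1,024-token output reserve are assumed values chosen for illustration, and the function names are hypothetical.

```python
def compaction_threshold(
    context_window_tokens: int,
    trigger_fraction: float = 0.8,       # assumed value
    reserved_output_tokens: int = 1024,  # assumed value
) -> int:
    """Token count at which the conversation history should be compacted."""
    usable = max(context_window_tokens - reserved_output_tokens, 0)
    return int(usable * trigger_fraction)


def should_compact(used_tokens: int, context_window_tokens: int) -> bool:
    return used_tokens >= compaction_threshold(context_window_tokens)


history_tokens = 3_000
small_window = 4_096      # a 4K-context model
large_window = 1_000_000  # a 1M-context model

print(should_compact(history_tokens, small_window))  # small window: compact now
print(should_compact(history_tokens, large_window))  # large window: plenty of room
```

The same 3,000-token history triggers compaction on the 4K model and not on the 1M model, which is the behavior a hardcoded threshold cannot give you.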

Deep Agents CLI

Each model is also available in the Deep Agents CLI. The Deep Agents CLI is our open-source coding agent and alternative to Claude Code.

In addition to all the capabilities in Deep Agents SDK, the CLI supports runtime model swapping. We introduced a new middleware (`ConfigurableModelMiddleware`) that lets you switch models mid-session with the `/model` slash command, without restarting the agent. This enables patterns like starting a task with a frontier model for planning, then switching to a cheaper open model for execution.

What’s next

Some things we’re excited to share soon:

  • Documenting harness tuning patterns for specific open model families
  • Testing multi-model subagent configurations (e.g., frontier closed model orchestrator + open model subagents)

Open models work for agents today. We want to show the design patterns that help us engineer a good harness and build targeted evals that measure what matters for your task.

Deep Agents is open source. Try it with your preferred open model and come build great evals and agents with us.



Open Models have crossed a threshold — Arc Codex