What Is an AI Harness? — The Digital Flora

A powerful model sitting on its own can't do much real work. Put the strongest LLM in an empty chat box and ask it to refactor your repo, reconcile an invoice, or clean up your inbox — it will talk about the task beautifully and accomplish none of it. The missing piece isn't a better model. It's the software wrapped around it: the harness.

A model on its own is a Ferrari engine sitting on a table — the harness is the chassis, wheels and steering that turn it into a car

Engine vs. car

Think of a Ferrari engine on a table. Enormous power, zero usefulness. To become a car it needs a chassis, wheels, a steering wheel, a seat, warning lights, and a seatbelt. An LLM is the same: raw capability that only becomes a work machine once something gives it hands, eyes, and guardrails.

Simon Willison put it cleanly: an AI agent = LLM + harness. The agent isn't just a model that chats. It's a model placed inside a loop that can act, call tools, hold memory, ask for permission, and observe the result of what it just did before deciding the next step.

A harness is not "an environment for the AI"

It's worth separating words that get blurred together:

Environment — where the AI works: files, terminal, the web, a database, email, Figma, Notion, Drive.
Sandbox — the safety boundary that limits damage if it gets something wrong.
Runtime — where code actually runs and APIs actually get called.
Harness — the active layer that wires all of that to the LLM: it teaches the model to observe, call a tool, ask before risky moves, and keep its bearings across a long task.

The environment just lets the AI see data. The harness puts the AI's hands on the wheel — and decides how hard it's allowed to turn.

What's inside a decent harness

Tool loop — the think → act → observe → repeat cycle. This is the heartbeat.
Context management — the model's working memory is finite. When the history gets too long, the harness has to trim, compress, or summarize the parts that matter and drop the rest.
Permissions — checkpoints that pause and ask before anything irreversible.
Long-term memory — preferences, project structure, past decisions, recurring conventions that should survive across sessions.
System prompt — the soft constitution: the AI's role, its limits, and how it should prioritize.

A bloated context full of noise vs. a clean, well-managed one — the same model behaves very differently

Skills: expertise loaded on demand

A Skill is a bundle of instructions, scripts, and templates for one specific kind of work — and it's only loaded when it's actually needed. Instead of stuffing every bit of knowledge into the context up front, the harness waits for the right trigger and then pulls in the matching Skill.

It's like handing the AI a trade certificate on demand, but only when that trade is genuinely required. You save context, and the AI looks markedly more professional inside each domain.

MCP: the arm that reaches real services

MCP (Model Context Protocol) — a standard from Anthropic — lets a harness connect the model to outside services in a uniform way. Gmail, Notion, Drive, Figma, a company database, an internal system: any of them can show up as a data source through MCP.

The two answer different questions:

A Skill answers "what method should the AI use to do this?"
An MCP answers "where does the AI actually read and write real data?"

A strong setup is the harness plus the right Skills plus the right MCP for the services you actually use.

Same model, different results

Claude inside Claude Code is not the same experience as Claude in an empty chat. Cursor, Windsurf, Aider, Cline, ChatGPT with Code Interpreter — they often run models you already know, yet the outcomes differ, because the harness differs. One manages context better. One reads the repo more cleverly. One asks for permission more sensibly. One has a slow tool loop that wanders off and loses the plot.

When the AI "gets dumber," diagnose the right layer

When an AI suddenly performs badly, the reflex is to swap the model. Usually that's aiming at the wrong thing. Walk the layers first:

Context — if it forgets the request, re-introduces a bug you already fixed, or answers too generically, the context has overflowed or gone stale. A fresh chat or better compression beats a new model.
Domain Skill — if you ask it to review a contract, design a permission model, or write an SEO proposal without giving it the rules of that trade, it falls back on generic templates. A good Skill won't turn a weak model into an expert, but it removes a lot of guessing with checklists, formats, and examples.
Real data (MCP) — if it has no MCP into your CRM, Gmail, Drive, or database, it simply doesn't know the real state of things. It only knows what got pasted into the chat.
Permissions — if every action is blocked, it takes detours or stalls halfway. A good harness sorts actions into safe, ask first, and never.

Only then ask whether the model is really the bottleneck. Swapping the model is like replacing the engine in a car that has a flat tire and a crooked steering wheel.

A diagnostic staircase: context, then Skill, then data, then permissions — and only at the top, the model

From Model Wars to Harness Wars

2023–2024 was the Model Wars: who has the stronger model, the higher benchmark.
2025–2026 is the Harness Wars: who packages a model into a product that actually works.

The value of Cursor or Claude Code isn't just the model underneath. It's the system that reads the repo, edits files, runs tests, asks for permission, remembers the goal, and iterates through errors. The harness is where product experience accumulates: how to pick files into context, how to write a diff, how to call the terminal without wrecking the project, how to summarize a session, when to load a Skill, how to connect an MCP.

The 2026 formula

It's no longer just "AI agent = LLM + harness." It's:

A good AI agent = LLM + harness + the right Skills + the right MCP.

No LLM → no brain.
No harness → a brain with no steering.
No Skill → no method for the trade.
No MCP → no real data.

Each missing piece makes the AI "dumb" in a different way.

So the real skill isn't picking a model. It's looking at a failure and asking: is this a reasoning problem, a context problem, a tooling problem, a permission problem, a data problem, or an instruction problem? Once you can tell those apart, you stop reflexively swapping tools — and start fixing the right layer.

Một model dù mạnh nhưng chỉ có một mình thật ra làm được rất ít việc trong thực tế. Cho con LLM xịn nhất vào một khung chat trống rồi bảo nó refactor repo, đối chiếu một cái hoá đơn, hay dọn hộp thư của bạn — nó sẽ nói về công việc đó rất hay và chẳng làm xong cái nào. Thứ còn thiếu không phải là một model tốt hơn. Đó là chiến giáp bọc quanh nó: cái harness.

Model đứng một mình là động cơ Ferrari đặt trên bàn — harness mới là khung gầm, bánh xe và vô lăng biến nó thành chiếc xe

Động cơ và chiếc xe

Hình dung một động cơ Ferrari đặt trên bàn. Sức mạnh khủng khiếp, độ hữu dụng bằng không. Để thành chiếc xe, nó cần khung gầm, bánh xe, vô lăng, ghế lái, đèn cảnh báo và dây an toàn. LLM cũng vậy: năng lực thô của nó như một thiên tài chỉ có bộ não(đúng nghĩa đen), nó chỉ thật sự mạnh mẽ và hữu ích khi có thứ gì đó trao cho nó đôi tay, đôi mắt và hàng rào an toàn.

Simon Willison gói gọn rất gọn: AI agent = LLM + harness. Agent không chỉ là model biết nói chuyện. Nó là model được đặt trong một vòng lặp của các hành động liên tiếp, gọi tools, có memory, xin permission, và quan sát kết quả của việc nó vừa làm trước khi quyết định bước tiếp theo.

Harness không phải là "môi trường cho AI"

Có vài khái niệm hay bị gộp làm một, nên tách biệt rõ hơn:

Environment (môi trường) — nơi AI làm việc: file, terminal, web, database, email, Figma, Notion, Drive.
Sandbox — lớp giới hạn an toàn, chặn thiệt hại nếu nó làm sai.
Runtime — nơi code thực sự chạy và API thực sự được gọi.
Harness — lớp chủ động nối tất cả những thứ trên với LLM: dạy model cách quan sát, gọi công cụ, hỏi trước những hành động rủi ro, và giữ phương hướng xuyên suốt một task dài.

Môi trường chỉ cho AI nhìn thấy dữ liệu. Harness mới là thứ đặt tay AI lên vô lăng — và quyết định nó được phép đánh lái mạnh tới đâu.

Bên trong một harness tử tế

Tool loop — vòng lặp nghĩ → hành động → quan sát → lặp lại. Đây là nhịp tim.
Context management — bộ nhớ làm việc của model là hữu hạn. Khi lịch sử dài quá, harness phải cắt, nén, hoặc tóm tắt phần quan trọng và bỏ phần thừa.
Permissions — các điểm dừng để hỏi trước khi làm điều không thể hoàn tác.
Long-term memory — sở thích, cấu trúc dự án, quyết định cũ, quy ước lặp lại — những thứ nên sống sót qua nhiều phiên.
System prompt — bản hiến pháp mềm: vai trò của AI, giới hạn, và cách nó ưu tiên.

Context phình to đầy nhiễu so với context sạch, được quản lý tốt — cùng một model nhưng hành xử rất khác

Skills: chuyên môn nạp khi cần

Skill là một package gồm hướng dẫn, script và template cho một công việc cụ thể — và chỉ được nạp khi thực sự cần. Thay vì nhồi toàn bộ kiến thức vào context ngay từ đầu, harness xác định đúng lúc rồi mới kéo Skill tương ứng vào.

Giống như đưa cho AI một danh sách chứng chỉ hành nghề nghề. nó biết nó có thể làm những gì, thậm chí làm rất nhiều công việc khác nhau, nhưng chỉ khi cái nghề đó thật sự cần dùng. ta mới call nó. Bạn tiết kiệm được context, và AI trông chuyên nghiệp hơn hẳn trong từng lĩnh vực và công việc cụ thể.

MCP: cánh tay vươn ra dịch vụ thật

MCP (Model Context Protocol) — một chuẩn do Anthropic đưa ra — giúp harness kết nối model với các dịch vụ bên ngoài theo một cách thống nhất. Gmail, Notion, Drive, Figma, database công ty, hệ thống nội bộ: bất kỳ cái nào cũng có thể xuất hiện như một nguồn dữ liệu qua MCP.

Hai thứ trả lời hai câu hỏi khác nhau:

Skill trả lời: "AI nên làm việc này theo phương pháp nào?"
MCP trả lời: "AI thực sự đọc và ghi dữ liệu thật ở đâu?"

Một bộ setup mạnh là harness cộng đúng Skills cộng đúng MCP cho những dịch vụ bạn đang thực sự dùng.

Cùng model, kết quả khác nhau

Claude trong Claude Code không phải là Claude trong một khung chat trống. Cursor, Windsurf, Aider, Cline, ChatGPT với Code Interpreter — chúng thường chạy những model bạn đã quen, nhưng kết quả khác nhau, vì harness khác nhau. Có cái quản context tốt hơn. Có cái đọc repo khéo hơn. Có cái hỏi quyền hợp lý hơn. Có cái vòng lặp tool chậm, dễ lạc và quên mất mục tiêu.

Khi AI "ngu đi", hãy chẩn đoán đúng lớp

Khi AI bỗng làm việc tệ, phản xạ thường là đổi model. Đa phần là nhắm sai chỗ. Hãy đi qua từng lớp trước:

Context — nếu nó quên yêu cầu, lặp lại lỗi bạn đã sửa, hay trả lời chung chung, thì context đã tràn hoặc lỗi thời. Một chat mới hoặc nén context tốt hơn ăn đứt việc đổi model.
Domain Skill — nếu bạn bảo nó review hợp đồng, thiết kế mô hình permission, hay viết proposal SEO mà không đưa quy tắc nghề, nó sẽ dựa vào mẫu chung chung. Skill tốt không biến model yếu thành chuyên gia, nhưng cắt giảm rất nhiều phần suy đoán nhờ checklist, format và ví dụ.
Dữ liệu thật (MCP) — nếu nó không có MCP tới CRM, Gmail, Drive hay database, nó đơn giản là không biết trạng thái thật. Nó chỉ biết những gì được dán vào chat.
Permissions — nếu mọi thao tác đều bị chặn, nó đi đường vòng hoặc đứng khựng giữa chừng. Harness tốt phân loại hành động thành an toàn, hỏi trước, và không bao giờ.

Chỉ sau đó mới hỏi xem model có thật sự là nút thắt không. Đổi model lúc này giống như thay động cơ cho một chiếc xe đang xẹp lốp và lệch vô lăng.

Cầu thang chẩn đoán: context, rồi Skill, rồi dữ liệu, rồi permissions — và chỉ ở bậc trên cùng mới là model

Từ Model Wars sang Harness Wars

2023–2024 là thời Model Wars: ai có model mạnh hơn, benchmark cao hơn.
2025–2026 là thời Harness Wars: ai đóng gói model thành một sản phẩm thực sự làm được việc.

Giá trị của Cursor hay Claude Code không chỉ nằm ở model bên dưới. Nó nằm ở hệ thống đọc repo, sửa file, chạy test, hỏi quyền, nhớ mục tiêu, và lặp qua từng lỗi. Harness là nơi kinh nghiệm sản phẩm tích lũy: cách chọn file vào context, cách viết diff, cách gọi terminal mà không phá project, cách tóm tắt một phiên làm việc, khi nào nạp Skill, cách nối một MCP.

Công thức 2026

Không còn chỉ là "AI agent = LLM + harness" nữa. Mà là:

Một AI agent tốt = LLM + harness + đúng Skills + đúng MCP.

Thiếu LLM → không có bộ não.
Thiếu harness → bộ não không có tay lái.
Thiếu Skill → không có phương pháp nghề.
Thiếu MCP → không có dữ liệu thật.

Mỗi mảnh thiếu làm AI "ngu" theo một kiểu khác nhau.

Vậy nên kỹ năng thật sự không phải là chọn model. Mà là nhìn vào một lỗi và hỏi: đây là lỗi suy luận, lỗi context, lỗi công cụ, lỗi quyền hạn, lỗi dữ liệu, hay lỗi hướng dẫn? Khi phân biệt được những thứ đó, bạn thôi đổi công cụ theo phản xạ — và bắt đầu sửa đúng lớp.