What we learned building an AI exam marking assistant with The Mercian Trust

High-stakes decisions demand trust, transparency and clear accountability. Whether in education, healthcare, professional services or the public sector, organisations are increasingly asking the same question: how can AI support complex, judgement-led work without undermining professional confidence?

Exam marking brings this challenge into sharp focus. It requires concentration, consistency and professional judgement, often under tight deadlines. Through a proof of concept with The Mercian Trust, we explored whether AI agents could play a practical supporting role in the marking process: not replacing teachers, but acting as an assistant that applies marking criteria consistently, explains its decisions and reduces teachers' workloads.

This work was deliberately framed as a time-boxed proof of concept. The goal was not to build a finished product, but to understand what responsible, trustworthy AI actually requires when decisions matter and accountability cannot be automated away.

Key takeaways

Trust in AI comes from clear reasoning, not just accurate outputs
Simpler designs are easier to govern and scale in practice
Subject-specific patterns outperform generic solutions
AI works best when it supports professional judgement, rather than replacing it

Below we share what this looked like in practice and the design choices that made the difference, both in education and beyond.

Start with trust, not automation

Early testing showed that AI could mark real exam papers using genuine marking schemes and produce results closely aligned with those of human markers. In one case, the difference was a single mark. On its own, though, that was not the breakthrough.

What mattered far more to teachers was understanding why a mark had been awarded. They wanted to see clear links to the marking criteria, explanations for why higher marks were not awarded and feedback that could be reused in conversations with students.

The key insight was simple – accuracy on its own is not enough. Explainability is what builds trust. A system that behaves like a “black box”, however fast or efficient, is unlikely to be accepted in assessment-led contexts if it cannot provide feedback and explain its decisions.
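To make the idea concrete, here is a minimal Python sketch of a mark that always travels with its reasoning. It is an illustration only, not the implementation used in the proof of concept; the class names, field names and the is_explainable check are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CriterionEvidence:
    """One link between the marking scheme and the student's answer (hypothetical shape)."""
    criterion: str     # wording taken from the marking scheme
    met: bool          # whether the answer satisfied this criterion
    explanation: str   # plain-language reason a teacher can check


@dataclass
class MarkingResult:
    """A mark bundled with the evidence behind it (hypothetical shape)."""
    mark: int
    max_mark: int
    evidence: list[CriterionEvidence] = field(default_factory=list)
    student_feedback: str = ""

    def why_not_higher(self) -> list[str]:
        """Explanations for every criterion the answer did not meet."""
        return [e.explanation for e in self.evidence if not e.met]

    def is_explainable(self) -> bool:
        """Reject 'black box' results: a mark needs evidence, and any mark
        below the maximum needs a reason why higher marks were withheld."""
        if not self.evidence:
            return False
        return self.mark == self.max_mark or bool(self.why_not_higher())
```

A result that fails is_explainable could be held back rather than shown to a teacher, so that every mark on screen comes with criteria links and a reason higher marks were not awarded.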

Why simpler AI designs proved more reliable

One of the strongest technical learnings was that simpler AI agent designs consistently outperformed more complex ones.

Early approaches broke the process into many small, highly specialised AI agents. While this looked flexible on paper, it introduced fragility and made behaviour harder to predict and govern.

The most reliable pattern emerged when a single coordinating AI agent owned the end-to-end flow, with subject-specific logic applied only where it genuinely added value.
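The shape of that pattern can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the system we built: MarkingCoordinator, shared_marker and history_marker are hypothetical names, and the marker functions are placeholders where a real system would call a model.

```python
from typing import Callable

# A marker takes (answer, marking_scheme) and returns a result dictionary.
Marker = Callable[[str, str], dict]


def shared_marker(answer: str, scheme: str) -> dict:
    # Placeholder for the common marking step used by most subjects.
    return {"approach": "shared", "answer_length": len(answer)}


def history_marker(answer: str, scheme: str) -> dict:
    # Placeholder for subject-specific logic, added only where it earns its place.
    return {"approach": "history", "answer_length": len(answer)}


class MarkingCoordinator:
    """One agent owns the end-to-end flow; subjects plug in at a single point."""

    def __init__(self) -> None:
        self._subject_markers: dict[str, Marker] = {}
        self.audit_log: list[str] = []

    def register(self, subject: str, marker: Marker) -> None:
        self._subject_markers[subject.lower()] = marker

    def mark(self, subject: str, answer: str, scheme: str) -> dict:
        # Pick the subject-specific marker if one exists, else the shared one.
        marker = self._subject_markers.get(subject.lower(), shared_marker)
        # Every decision passes through this method, so it is easy to audit.
        self.audit_log.append(f"{subject}: used {marker.__name__}")
        result = marker(answer, scheme)
        result["subject"] = subject
        return result
```

Because there is a single entry point, there is one place to log, test and constrain behaviour, which is much harder when responsibility is spread across many small agents.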

From a leadership perspective, this reinforced a key point – AI systems scale best when they are easy to reason about, audit and control.

Subject specificity is a strength, not a constraint

Assessment criteria do not generalise well. Different subjects interpret evidence, structure answers, and apply criteria in fundamentally different ways.

Attempts to force one generic marking approach across subjects diluted quality. When the system was designed around a specific subject and exam board, outcomes improved significantly, both in accuracy and in the quality of the feedback.

The implication for organisations considering the use of AI is clear. Scalability comes from repeating a proven pattern within clear boundaries, not from building one system that tries to do everything.

AI works best as support, not a decision-maker

A consistent pattern emerged during marking. The system was very good at identifying the correct level of performance, or band, but needed careful constraints when assigning a precise mark within that band.

This mirrors real teaching practice. Determining whether work sits at the top or bottom of a band often relies on professional judgement and experience, not rigid rules.

When conservative guardrails were applied, the AI behaved more like an experienced teaching assistant than an autonomous decision-maker. This was the role teachers were most comfortable with and trusted most.
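One possible form such a guardrail could take is sketched below in Python. The specific rules here are assumptions made for illustration, not the guardrails used in the proof of concept: the proposed mark is kept inside the identified band, and anything clamped or sitting at the edge of a band is flagged for a teacher to decide. Band, GuardedMark, apply_guardrail and edge_margin are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    level: int
    low: int    # lowest mark in the band
    high: int   # highest mark in the band


@dataclass(frozen=True)
class GuardedMark:
    mark: int
    needs_teacher_review: bool
    note: str


def apply_guardrail(band: Band, proposed_mark: int, edge_margin: int = 1) -> GuardedMark:
    """Constrain a proposed mark to its band and flag borderline cases."""
    # Never let the precise mark contradict the band that was identified.
    clamped = max(band.low, min(band.high, proposed_mark))
    if clamped != proposed_mark:
        return GuardedMark(clamped, True, "proposed mark was outside the band")
    # With the default margin of 1, only the band's lowest and highest
    # marks count as borderline and are left to professional judgement.
    near_edge = (clamped - band.low < edge_margin) or (band.high - clamped < edge_margin)
    if near_edge:
        return GuardedMark(clamped, True, "mark sits at the edge of the band")
    return GuardedMark(clamped, False, "mark sits comfortably within the band")
```

For a band covering marks 9 to 12, a proposed mark of 14 would be reduced to 12 and flagged, a proposed 9 would be kept but flagged as borderline, and a proposed 10 would pass through without a flag. In every case the teacher keeps the final say.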

Why these lessons apply far beyond education

While this work focused on exam marking, the lessons learnt extend far beyond schools and other educational institutions.

In any human-centred domain, the success of AI depends less on technical ambition and more on thoughtful design. Systems need to be explainable, governable, and shaped around how people actually work.

AI delivers the most value where it removes unnecessary effort, increases consistency, explains its reasoning, and leaves final accountability with professionals.

Designing for adoption, not just capability

Our proof of concept with The Mercian Trust showed that responsibly designed AI agents can meaningfully support expert-led work. The next step is not more automation for its own sake but deeper validation with end users, clearer boundaries, and careful scaling.

For organisations exploring AI today, the question is no longer whether it can work. It is whether it is being designed in a way that people will trust, adopt and genuinely benefit from.

Let’s talk

Our work with The Mercian Trust focused on learning what responsible, trusted AI actually looks like in practice.

If you are facing similar questions around workload, consistency or adoption, we would love to discuss how AI could support work in your organisation.

Get in touch to explore what this could mean in your context.

Written by Michael Chambers, Lead Developer