Learning sometimes happens, despite our best efforts.

A disorienting moment

  • Current AI systems pass medical licensing exams at the 90th percentile (Gilson et al., 2023)
  • AI produces care plans, reflective portfolios, and clinical case analyses indistinguishable from competent practitioners
  • AI tutors offer 24/7 access, adapt to the student's level, and give more consistent feedback than most human assessors (Jurenka et al., 2024)
  • Failure to generate MSc-level outputs is a failure of imagination, not of model capability

Sanctuary strategies

  • Denial: "AI will never replace clinical judgement" — probably true; doesn't resolve the problem that AI can already complete almost all assessments
  • Retreat: "Focus on what only humans can do" — the boundary keeps moving as models improve, creating a shrinking perimeter
  • Restriction: "Prevent students from using it" — no policy or detector has changed this and none are likely to (see "the internet")
  • Resignation: "Our roles will inevitably diminish" — performance is not a zero-sum game

None of these strategies answers the question of what we need students to do — and why.

Four premises

  1. Learning results from what the student does and thinks, and only from what the student does and thinks (Simon, in Ambrose, 2010)
  2. The aim of teaching is to create conditions that influence what students do and think (Ramsden, 2003)
  3. Assessment aims to infer learning from observable patterns in what students do and think
  4. Certification attempts to make predictions — and judgements — about students' future doing and thinking

AI has not changed any of these premises.

Becoming a nurse

  • This is the language of the profession: care, fitness to practise, professional identity, clinical judgement (formation concepts, not assessment outcomes)
  • Becoming is transformational; the person who completes the programme is different from the person who began it (Barnett, 2009)
  • The educational infrastructure built to support this aspiration is designed around artifact production, not formation
  • The system has always been measuring a proxy for the developmental process it actually cares about

We cannot observe learning directly

  • Knowledge is built through effortful retrieval, not exposure; neural connections strengthen through repeated, effortful activation; conditions that feel harder produce better long-term outcomes (Bjork & Bjork, 2009; Willingham, 2009)
  • Student engagement is therefore a reasonable proxy for improved learning outcomes (Kuh, 2007)
  • This engagement produces artifacts that serve as evidence of the engagement (Carless, 2007)
  • The work of learning (the "desirable difficulty") and the artifact it produced were reliably connected, so the proxy held

You already know what this feels like.

The swampy lowland of assessment

  • Assessment has always had architectural flaws, accepted as manageable imprecision (e.g. arbitrary cut-scores, inter-rater variability, order-effects)
  • Most feedback doesn't translate to improved performance (Hattie & Timperley, 2007)
  • High-stakes judgements are variously affected by extraneous factors (Danziger et al., 2011)
  • What we call "cheating" is defined relative to what we expect students to do themselves, rather than an inherent property of a task
  • Students were already optimising for grades over learning

This is diagnostic information about the system, not about students.

The proxy became a target

  • When a measure becomes a target, it ceases to be a good measure (Goodhart, 1984)
  • We conflated extraneous load ("What is the submission date?") with germane load ("What does good look like?") (Sweller, 1988)
  • Providing model answers, micro-criteria rubrics, word counts per section, and detailed assessment briefs removes cognitive struggle
  • "In the varied topography of professional practice... there are those who choose the swampy lowlands" (Schön, 1983)

Students focused on the mechanics of artifact production are responding rationally to what we taught them to care about.

The proxy has failed

  • 95% of UK students report using AI for submitted and assessed work (HEPI, 2026)
  • A student can now produce a polished, well-referenced output without reading, engaging, or understanding (Dawson, Bearman & Dollinger, 2024; Corbin, Dawson & Liu, 2025)
  • Fluency now has low information value; it is noise in the system
  • By optimising the system to value the artifact, we've created a structural problem that incentivises misaligned behaviour
  • The conditions that made the proxy work no longer hold

The link between the work of becoming a nurse and the evidence of that becoming is permanently broken.

Two responses

  • Discursive responses: updated policies, AI declarations, and traffic light systems are attempts to change the language
    • Defensive positions that leave the foundational assumptions intact and maintain the status quo (Corbin, Dawson & Liu, 2025)
    • The logic of artifact-protection leads to arms-race dynamics
  • Structural responses: change the underlying conditions that determine what students think and do

Discursive changes cannot address structural problems.

A more accurate AI detector won't help

  • A perfect detector tells you whether AI produced the artifact, not whether the student was cognitively engaged
    • Outsource your admin to AI (probably fine)
    • Outsource your thinking to AI (less fine)
  • Those are different questions, and detection answers the wrong one
  • AI can generate either avoidance or genuine struggle; the tool is the same, the output may even be the same, but the formation is not

"Did they use AI?" -> "What did they use AI for?"

What value do I add?

  • AI is stateless: no memory across sessions, no accumulated understanding of this patient, this ward, this moment
  • AI is static: knowledge frozen at training; it cannot update from the unfolding encounter (see "continuous learning")
  • AI has no professional context: no embodied experience of practice, no stakes, no accountability
  • AI lacks the sense of meaning that comes from being present, responsible, and changed by what happens

"What can only humans do?" -> "What do humans contribute within human-AI coalitions?"

Agents raise the ceiling

  • Agents take actions: search, synthesise, draft, critique, iterate — doing real work in collaboration with the student, not just in response to them
  • Students who can direct an agent well have access to a scaffold that surfaces harder problems, challenges their reasoning, and raises the ceiling on what they can engage with
  • The question is not whether students will have agents (they will) but whether we are designing experiences that support effective learning
  • The challenge is in how we help students use their agents in ways that deepen the struggle rather than remove it

Sending someone else to the gym to do your workouts is absurd.

Conditions supporting cognitive struggle

  • Problem-driven inquiry: creates situations of genuine uncertainty from the beginning (no clean solutions)
  • Collaborative construction: makes engagement visible and challengeable (defend your position)
  • Facilitation over instruction: holds the process without providing the answers (struggle stays with the student)
  • Metacognitive reflection: makes developing judgement visible to the student and to the assessor

These are also the conditions under which AI use becomes transformative rather than substitutive.

What is "the work"?

  • "The work" is not what you produce; it is what happens to you while you engage with the struggle that builds new understanding
  • The student who is engaged — directing, evaluating, being challenged, being changed — is doing the work
  • Our task is not to prevent AI from doing the work; it is to design conditions where the student cannot outsource the struggle

The question has always been the same. AI has simply made it impossible to defer.

I want to start with a brief orientation — not to alarm, but because the argument I'm going to make only lands if we're honest about what we're dealing with. Generative AI can now pass medical licensing examinations at the 90th percentile. It writes care plans that are structurally and clinically coherent. It produces reflective portfolios that meet standard assessment criteria. It gives feedback on clinical reasoning — at any hour, without fatigue, without the variability that characterises human assessors. None of this is speculative. These are current capabilities, available to any student with an internet connection. Failure to generate MSc-level outputs is more a failure of imagination than of model capability. The usual response to this list is to note what AI cannot do — it cannot be present in the clinical encounter, it cannot build therapeutic relationships, it cannot hold a patient's hand. All of that is true. But it sidesteps the question, because none of those things are what most of our assessment instruments measure. What our assessment instruments measure, AI can now produce.

When people encounter the capabilities I just described, the responses tend to cluster into four types. I want to name them because they're all understandable, and they're all insufficient. Denial holds that AI cannot replicate the essentially human qualities that define professional practice. This is probably true in some domains. But it doesn't help us right now, because the problem is not that AI might replicate clinical judgement — it's that AI can replicate the artifacts we're currently using as evidence of things like clinical reasoning. Retreat focuses on what humans uniquely offer and tries to rebuild assessment around those things. The problem is that this boundary keeps shrinking. Each new model capability forces a smaller claim. "At least AI can't produce accurate citations." Then it can. "At least AI can't write genuine reflections." Then it can. "At least AI can't reason clinically about complex cases." Then it can. Defending a shrinking perimeter is exhausting and ultimately unwinnable. Restriction tries to prohibit AI use through policy and detection. The data consistently shows this doesn't work. What the high AI-use figures communicate is not that students are fundamentally dishonest — it's that they've made a rational calculation about what the system rewards. Resignation is the idea that we render unto AI the things that belong to AI, which feels unsatisfying. The question underneath all four of these responses is one we've been avoiding: what are we actually trying to develop in nursing students — and are we designing for that?

These four premises describe the project of nursing education, and they hold regardless of whether AI is in the room. Premise 1 is neurological and constructivist: no one can learn on behalf of the student. Formation requires the student to be the one doing the cognitive work. Premise 2 contains a challenge. If learning can happen without a teacher — and it can — then the teacher's presence is not automatically valuable. Teaching must add something the student could not access otherwise. AI is now moving into this space too: it can scaffold, provide frameworks, cover curriculum. The question of what the teacher uniquely contributes becomes more pressing, not less. Premise 3 is the epistemological problem: we cannot observe learning directly. We can only observe behaviour and infer understanding from it. Premise 4 is the regulatory claim: NMC registration, degree certification — forward-looking claims about fitness to practise, based on backward-looking observations. The reason AI has disrupted this framework is not that it changed any of these premises. It changed what we were using as evidence for premise 3. We were observing products (artifacts) rather than behaviours (doing and thinking), and AI can produce those products without the student doing the thinking.

The goal is not that students know about nursing. It is that they become nurses. Making the claim that the system is designed around artifact production is contentious and one of the weaker claims I make, but I'm going to do my best to defend it.

This was not a failure of intelligence. It was a rational solution to a real problem. We cannot observe learning directly. All we can do is observe the behaviour of students and infer understanding from it. Artifacts were meant to be evidence of the engagement that produced them — a window into the learner's developing understanding. To write a good essay, a student had to read widely, engage with the material, organise their thinking, and commit to a position. The engagement was bundled into the economics of production. Nobody needed to ask what the work was, because the work and the artifact were reliably connected. The science of learning is unambiguous here. Memory is not built through exposure to content. It is built through effortful retrieval, genuine uncertainty, and the struggle to make sense of something that does not yet make sense. At the neurological level, learning requires synaptic change: Long-Term Potentiation, the strengthening of neural connections through repeated, effortful activation. Robert Bjork calls these conditions "desirable difficulties" — spacing, retrieval practice, interleaving, the productive struggle. They feel harder. They produce substantially better long-term learning. Conditions that feel easier — re-reading, worked examples, fluency — tend to produce the illusion of learning, not the thing itself.

  • Not essays with word counts and formatting requirements
  • Difficult situations, challenged thinking, decisions under uncertainty, living with the consequences
  • Think about the professional moment that genuinely changed how you practise — probably uncomfortable, probably triggered by something specific, probably impossible to submit for grading
  • The gap between that experience and a formal assessed task is not an indictment of how we teach; it is information about what we have been asking assessment to do

Take a moment to think about your own professional development. Not what you studied — what changed you. There will be a moment — probably more than one — where your clinical understanding genuinely shifted. A patient you didn't know how to help. A decision you got wrong and had to sit with. A colleague who challenged your reasoning in a way you couldn't dismiss. Those moments were probably uncomfortable. They were almost certainly not assessable. Now think about what a formal assessed task looks like: a submission brief, assessment criteria, a word count, a deadline. The texture is completely different. That gap is not evidence that educators have failed their students. It is evidence that assessment has been asked to do something it was never well-designed to do — to stand in for experiences that are genuinely hard to structure, observe, and measure. The model for what we are actually trying to produce has always been available. We just haven't built the system around it.

AI has not broken assessment. It has made it impossible to ignore what was already broken. The 40% pass mark, inter-rater variability in marking, grade aggregation across qualitatively different tasks: these were known, documented architectural flaws, accepted as manageable imprecision because the alternative seemed worse. Students were already optimising for grades rather than for learning. They were already producing artifacts that met the criteria without reliably doing the intellectual work those criteria were designed to evidence. AI has not introduced that behaviour. It has made it more efficient, more visible, and impossible to manage with existing tools. 95% of students report using AI for submitted work. That is not a compliance problem to be solved. No policy, no detection tool, no updated academic misconduct procedure will change that figure in any meaningful way. What it communicates is that a large majority of students have concluded — rationally — that AI serves their purposes better than the system designed for them. The concept of "cheating" is not a fixed standard. It is relative to a set of expectations we defined — expectations written for a world where producing an artifact required the engagement it was meant to evidence. When the conditions of production change, what we ask students to do themselves can change with it.

Over time, we stopped thinking of the artifact as a proxy and started treating it as the thing itself. The drift is visible everywhere. Students spend considerable time — often the majority of their questions before submission — asking about line spacing, word count, reference formatting, and so on. We tend to treat this as a communication problem: "Provide clearer instructions!" It is more accurately a signal about what we have taught them to care about. Cognitive science distinguishes two kinds of difficulty. Extraneous load — navigation friction, ambiguous instructions, unclear deadlines — adds effort without contributing to learning. Removing it is good pedagogy. Germane load — the effort required to determine what matters, what good looks like, what position to defend — is the learning. When we organised Blackboard sites clearly and made deadlines unambiguous, we reduced extraneous load. When we specified word counts for each assignment section, released model answers, and broke rubrics into micro-criteria, we reduced germane load. We thought we were being helpful. We were removing the mechanism. The student no longer needed to determine what a good answer looks like. We told them. And in doing so, we also provided a complete specification that an AI can now follow without any student learning involved. In The Reflective Practitioner, Donald Schön distinguishes the work that takes place on a high, hard ground, where we can make use of research-based theory and technique, from the work in a swampy lowland, where situations are confusing 'messes' incapable of technical solution and we muddle through on experience, trial and error, and intuition. We have spent decades perfecting the map — and now find that the territory has been bypassed entirely, because the map is so accurate a machine can follow it.

Generative AI has not created a cheating problem. It has severed the inferential chain between the document and the person. A student can now produce a polished, well-referenced, structurally coherent output — one that satisfies every criterion on a rubric — without having read a single source, grappled with a single idea, or developed any understanding of the subject. The proxy has collapsed. The assessment system built on that proxy is now exposed as measuring something other than what it claimed to measure. This is not a claim about the prevalence of academic dishonesty. It is a structural observation: the mechanism that made the document valid evidence has been disrupted, not because students have changed, but because the conditions that made the proxy work no longer hold. The assessment instrument has lost the construct validity it was depending on. The correlation between artifact quality and intellectual engagement — the assumption the entire system was built on — no longer holds. And fluency, which was once a reasonable signal of genuine thinking, is now noise. As models improve, output quality converges on expert-level across every artifact we care to measure.

Most of the responses currently being produced — policies, frameworks, AI use declarations — share a common purpose: to restore faith in the artifact as valid evidence. To reconnect the link that has been broken. Phillip Dawson and colleagues have distinguished between two kinds of attempt. Discursive responses change the language: new policies, updated principles statements, AI use declarations, traffic light frameworks specifying what students may and may not delegate. Structural responses change the underlying conditions that determine what students do and why. Most of what institutions are currently producing is discursive. These documents represent real effort from people responding thoughtfully to a genuinely difficult situation. But they share a common limitation: they leave the foundational assumption intact. The artifact is still what is being protected. Tighter lines are being drawn around it. If the goal is to preserve the integrity of the artifact, and AI continues to improve at producing artifacts, the trajectory is escalating. More restriction. More detection attempts. More sophisticated circumvention. More costly enforcement. The dominant question in the educator-student encounter becomes: did you use AI for this? The relationship becomes progressively adversarial. At the end of that trajectory is an arms race that no one wins: mounting costs on both sides, and a relationship between educators and students structured around suspicion rather than formation. The trust that nursing education — a discipline premised on preparing people for human care — depends on begins to erode.

This reframes the question fundamentally. The concern that students are "using AI to do the work" is right if the work is the artifact. It is wrong if the work is the cognitive engagement. AI can generate the conditions for genuine struggle — harder problems, faster feedback, access to complexity previously beyond the student's reach. The artifact they produce may look similar. The formation is not. The frame shifts from "did you use AI?" — a question about tool use — to "were you genuinely grappling?" — a question about process. That is a harder question to answer, but it is the right one. And it is the question that changes what assessment needs to look like. Students who use AI as an answer-machine are not doing the work, regardless of how polished their output is. Students who use AI as a struggle-generator — to get into harder territory, to be challenged, to be wrong in productive ways — are doing the work. The same tool, two completely different relationships to learning.

What AI systems lack is not capability — it is the conditions under which capability becomes meaningful. AI is stateless: no memory of prior encounters, no accumulated understanding of how a patient's condition has evolved, no professional judgement refined by consequence. It is static: knowledge frozen at training, unable to update from what is happening in the room. Most importantly, AI has no professional context. It does not know what it cost to break bad news that morning, or what the team dynamic is, or what clinical intuition tells you when a patient's story doesn't quite cohere. Context is the bottleneck — not because AI lacks information, but because meaning is created in the situated encounter, and AI cannot be situated. This reframes the question: rather than asking what only humans can do, we ask what humans contribute within a system where AI is also participating. The answer is context — and the teacher is the one who holds it.

The shift from AI as tool to AI as agent is significant. A chatbot answers questions. An agent takes actions — it can search, synthesise, draft, evaluate, iterate, and persist across a student's learning over time. Students who are already using agents are not just getting answers faster; they are working with a collaborator that has context, memory, and the ability to do substantial cognitive work. This will accelerate. The relevant question is not how to prevent it but how to design for it well. There is something worth naming here about our stance toward agents. They are not conscious, they do not have interests in any philosophically meaningful sense — but they are increasingly capable participants in the learning encounter. An open, curious stance — one that asks what role agents might play in formation, rather than simply how to restrict them — is more likely to produce good pedagogy than a defensive one. The challenge agents create is the same challenge AI has always created: how do we ensure the student is the one doing the cognitive work that produces formation? The conditions that answer that question are the same conditions that have always supported genuine learning.

Nobody would pay someone to go to the gym on their behalf. The reason that analogy is obvious is that we understand, intuitively, that the point of the gym is the change it produces in you — not the certificate of attendance. We have somehow lost that clarity when it comes to learning.

If formation is the development of practical wisdom, what structures make that possible? Problem-based learning was designed around the answer before AI arrived. Its structural features — problem-driven inquiry, collaborative knowledge construction, facilitation over instruction, and metacognitive reflection — are the same conditions under which AI integration becomes educationally productive rather than substitutive. This alignment is structural, not retrospective. Problem-driven inquiry puts students in the swampy lowland from the start. Collaborative construction makes engagement visible and challengeable. Facilitation holds the process without providing the answers. Metacognitive reflection makes students aware of their own developing judgement. And AI raises the ceiling on what students can engage with — giving access to wicked problems that were previously beyond their reach, because AI can surface complexity, suggest connections, and scaffold inquiry that a facilitator alone could not manage.

The questions we need to ask are not how to update our AI policy, or how to detect AI use more reliably. The work has always been the cognitive struggle. Writing a good essay was work because of the grappling it required, not because of the document it produced. Clinical placement was work because it put students in situations they had to navigate, not because they completed a competency log. The artifact was always downstream. We mistook it for the thing. Formation is not content accumulation. It is transformation. The student who has genuinely grappled with complex clinical situations, who has been challenged and been wrong and had to revise their thinking — that student is a different person. The artifact they produced along the way was a trace of that process. It was never the process itself. This talk began by asking what colleagues mean when they say they do not want students using AI to do "the work." The answer they gave — the essay, the care plan, the portfolio — was always wrong. The work was never the document. The work was the development that used to produce the document. We need a more honest account of what development actually involves. And we need to put the friction back in the right places — not to make things harder, but to make the struggle genuine.