About this essay
- Author: Michael Rowe (ORCID; mrowe@lincoln.ac.uk)
- Affiliation: University of Lincoln
- Created: May 18, 2026
- Version: 0.8 (last updated: May 19, 2026)
- Keywords: AI agents, context engineering, doctoral research, research harness, supervision
- License: Creative Commons Attribution 4.0 International
Abstract
Current discussion of AI use in doctoral research clusters around two responses: one located with institutions through policy, the other with students through judgement and AI literacy. This paper develops a third. The characteristic problems of AI use in doctoral research are not primarily problems that better policy or stronger researcher judgement can resolve; they are problems defined by the absence of an explicit operating context for working with an AI agent. The paper develops a conceptual framework for specifying that context, drawing on the software engineering practice of agent harnessing and adapting it for doctoral inquiry. A research harness consists of seven components: a knowledge base, interpretive permissions, tools, authority, a scope register, a process record, and an amendment protocol. Together these constitute a working specification of the operating context within which an AI agent contributes to a research process. The harness can be entered at a minimal level and developed iteratively alongside the project. The paper sets out what the harness is, what its components do, what material form it takes, and what it implies for supervisors, doctoral programmes, and institutions.
Introduction
AI agents (systems that can take action across a project’s materials and across sessions, rather than answering questions in isolated exchanges) have recently become capable enough to contribute substantively to doctoral research, and doctoral researchers have begun using them, to varying effect (Jensen et al., 2025; Walton et al., 2025). The arrangements under which this is happening are largely ad hoc: individual exchanges between researcher and agent are not governed by a coherent specification of what the agent is doing in the project. Institutional policies tend to operate either as rules applied to all students or as declaration requirements at the point of writing up, and supervisors are left without a structured object to engage with. The patterns that emerge from this situation are recognisable and recurring, and they are not adequately addressed by waiting for better models or by training researchers to use the tools available to them more critically.
This paper develops the research harness as a conceptual framework for that operating context: a structured specification, negotiated between researcher and supervisor, of what an AI agent is for in a given doctoral project and how it is permitted to operate. The terminology is borrowed from software engineering, where engineers encountered structurally similar problems with ad hoc agent use and developed a particular kind of response (Lopopolo, 2026). The engineering practice does not transfer wholesale, but its underlying logic — that an agent operating without a specified context will produce predictable problems, and that specifying the context is the response — does transfer, with adaptation. The paper sets out what that adaptation looks like, presenting the harness as an artefact with seven identifiable components, a particular material form, and a defined relationship to existing research practice, and naming what the framework implies for supervisors, doctoral programmes, and institutions.
The paper does not specify what any given researcher’s harness should contain, nor does it argue that the harness should be required by any particular institution or programme. Those are decisions for researchers, supervisors, programmes, and institutions to make. What the paper provides is the conceptual structure within which such decisions can be made with intention. The harness is offered as something to think with: a working specification that develops alongside the doctoral project rather than a complete document required at the outset. A researcher beginning the harness can write a single sentence under each component and have something usable; the components extend as the work demands.
Examples used throughout the paper draw on a range of plausible doctoral AI use, including some that fall closer to what readers may consider acceptable use and some that fall further away. The paper is not taking a position on which uses are acceptable; it is using the breadth of examples to test whether the framework’s structure holds across the kinds of situations doctoral researchers may actually encounter. The paper also takes one commitment as given throughout: the doctoral researcher remains in the loop. AI agents contribute to the inquiry rather than conduct it, and the researcher remains the analyst, the judge, and the author of the work. Remaining in the loop does not mean confirming every operation: the agent may act autonomously on work that is reversible and inspectable, but the substantive judgements remain the researcher’s.
1. Ad hoc AI use and its characteristic problems
The question of how AI should be used in doctoral research has several reasonable responses, but most current discussion clusters around two. One locates the response with institutions, where AI use is governed centrally through policy: declaration requirements, attribution rules, prohibitions, and training programmes, among others. The other locates the response with students, where AI use is a matter of researcher judgement, developed through AI literacy training, individual conscientiousness, and appeals to uphold academic integrity. The arguments between these positions are familiar, and have produced limited progress, because both treat AI use as primarily a question of what should be permitted rather than of how the agent actually operates within a project.
This paper offers a third, pragmatic response, which recognises that AI is already a part of the PhD process. The characteristic problems of AI use in doctoral research are not, primarily, problems that better policy or stronger researcher judgement can resolve. They are problems of working with an AI agent in the absence of a defined operating context. Even a researcher who has rules about when to use AI for writing, or how to use it for reading, is still operating ad hoc in the sense that matters, because individual exchanges with the agent are not governed by a coherent specification of what the agent is doing across the project as a whole. The response such problems call for is the development of that context: a specification, negotiated between researcher and supervisor and made explicit, of what the agent is for and how it is permitted to operate within the research project.
What ad hoc AI use looks like
The typical pattern is unstructured. A doctoral researcher begins using an AI tool for a discrete task: summarising a paper, drafting an abstract, suggesting wording for a difficult passage. The use proves helpful, and the researcher begins using the tool for adjacent tasks. Over weeks or months, the boundary of what the tool is doing expands without any explicit decision having been made about that expansion. The tool is now contributing to the literature review, helping to think through the methodology, suggesting analytic codes, refining the framing of arguments. Each individual use seems modest. The cumulative effect is that the AI is part of how the research is being done, but there is no record of how it became part, and no specification of what it is and is not contributing.
This is not for want of attempts. Institutions have introduced declaration requirements, attribution rules, and policies that distinguish acceptable from unacceptable AI use (e.g., Nikolic et al., 2024; Perkins et al., 2024; Yusuf et al., 2024). Doctoral programmes have added AI literacy to their training. Supervisors have provided guidance to their students about when and how AI tools should be brought into the work. None of these is wrong, and each addresses some aspect of the situation. What they share is that they operate either at the institutional level by specifying rules applied to all students, or at the level of individual exchanges, specifying what the student is allowed to do at the point of use. Neither level gives the researcher and supervisor a shared specification that captures what AI is doing across the full scope and duration of the project. The structural gap is between the institutional rule and the individual exchange, and it is at this intermediate level that the patterns described below tend to emerge.
The patterns are reported by doctoral researchers and supervisors in conversation, and are recognisable to people working in this space across a range of doctoral contexts: quantitative and qualitative, lab-based and field-based, biomedical and social. They are not problems of any particular research approach but of working with an AI agent in the absence of structure. They are presented below as illustrations rather than as claims about prevalence, and not as a definitive list. Where adjacent literatures address the underlying phenomena, they are cited (on cognitive offloading: Gerlich, 2025; Lodge & Loble, 2026; Yan et al., 2025; on epistemic risk from unstructured AI use: Vendrell & Johnston, 2026).
The characteristic patterns
One pattern is that AI tools enable certain kinds of work to proceed faster than the thinking that work depends on. A researcher can read more papers, generate more drafts, and code more transcripts in the same period of time. But a faster pace does not always serve the work. Doctoral research has historically operated on slow loops because thinking takes time, and the friction of slow work was a condition under which understanding developed. When the friction is removed, the work outpaces the thinking, and the researcher finds themselves further into their project than their judgement can manage.
A related but distinct pattern is cognitive offloading: the delegation of the thinking itself rather than the production of outputs (Gerlich, 2025; Lodge & Loble, 2026; Vendrell & Johnston, 2026). An agent that produces a competent analytic memo from a set of interview transcripts is doing more than transcript management; it is interpreting. A researcher who accepts the memo without doing the interpretive work themselves has offloaded something the doctoral project was supposed to help them develop. The output may be perfectly serviceable but the development of research competence and doctoral identity that the work was meant to produce may not be happening.
There is also the absence of project-level continuity in the agent’s accessible context. The accumulated thinking of a research project (what has been tried, what has been ruled out, what has been provisionally decided) does not naturally accumulate in the agent’s working memory as the project develops. Some AI tools retain a degree of memory across sessions, and some allow researchers to attach materials to a project workspace. But the cumulative reasoning of the inquiry, including the moves the researcher chose not to make and the paths they chose not to pursue, exists primarily in the researcher’s head and notes unless they actively curate it for the agent’s use. A particularly consequential form of this absence is drift: the agent, lacking project memory, suggests directions and lines of inquiry that feel coherent in the session but accumulate over time into a project trajectory the researcher did not deliberately choose. Each individual suggestion is defensible, but the cumulative effect is a project pulled away from its original aim.
Alongside this is the amplification of confirmation bias. AI tools, as currently constituted, tend to be useful to the person prompting them, including by agreeing with them. Current models exhibit what researchers and engineers call sycophancy: a disposition to support the framing of the question they are asked rather than push back against it (Sharma et al., 2023). A researcher who asks an agent whether their interpretation is supported by the data will, more often than not, receive support. A researcher who asks whether a particular framework is appropriate will, more often than not, receive arguments for its appropriateness. This is not a quality problem with the agent; it is the agent doing what it is built to do, and future models may behave differently. But in research, where the discipline of holding interpretations open to disconfirmation is central, an agent oriented toward supporting the researcher’s framing can undermine that discipline without ever being explicitly asked to.
A further difficulty is the experience of receiving different answers to the same question depending on how, or when, it is asked. The researcher who prompts an agent to “summarise the main themes” of a set of interviews will get a different summary than the researcher who prompts the same agent to “identify the most analytically interesting patterns.” Both summaries may be defensible; they may also be incoherent with each other. A researcher working with one or more agents across months — and across the underlying model changes that occur over the duration of a project — will have produced many such prompts, often without recording what they asked. The cumulative analytic position is harder to defend because its parts were produced under inconsistent framings.
Lastly, there is the difficulty of saying, later, who contributed which insight. The researcher who has had a productive exchange with an agent may not be able, weeks later, to say which contributions to their thinking came from the agent and which came from themselves (Walton et al., 2025). The thesis presents a single voice but the development of that voice was distributed in ways the thesis cannot represent. This is not a problem only for institutional integrity policies but also for the researcher’s own sense of what their work is and what parts of it they own.
These patterns overlap with and reinforce one another. The conditions producing them are stable across contexts: a capable AI agent, a researcher with substantial work to do, and no specification governing what the agent does within the project.
2. The engineering harness
The temptation, on encountering the patterns described above, is to keep the problem within the first two positions: respond at the level of the model or at the level of the researcher. If the problem is in the model, the response is to wait for better models. If the problem is in the student, the response is more or different training, clearer policy, and stronger AI literacy. Both responses seem reasonable, but both are insufficient because neither addresses the structural inadequacy of the operating context. The agent does what it is asked, within the constraints it has been given. When the constraints are minimal and the operating context is poorly specified, the agent’s behaviour is the result of that minimal specification, not a failure of the agent or the researcher.
Software engineers working with AI agents encountered a structurally similar problem. When models became capable enough to write substantial amounts of code without continuous human direction, engineers found that an agent given a vague brief produced confident output that did not cohere with the rest of the system; an agent given the same brief across different sessions produced inconsistent results; an agent asked to validate its own work tended toward optimism; an agent left running for several hours might end up far from where the developer had asked it to go. What engineers developed in response was a specification of the operating context within which the agent works, known as an agentic harness.
In its technical form, an engineering harness performs several roles. It provides a sandbox, which is an isolated execution environment in which the agent can act without unauthorised access to the host system or external networks. It exposes a tooling interface consisting of controlled APIs through which the agent reads, writes, runs code, or makes network calls, with each call intercepted, executed in the sandbox, and recorded. It handles state management and telemetry by logging the agent’s prompts, reasoning, tool calls, and results so the work is reconstructable across sessions and the agent has external memory. And it runs an evaluation engine: fast programmatic checks (tests, builds, type checks) and, where appropriate, model-based evaluation, which tell the agent whether its work is functioning. Together these roles hold the agent in place conceptually, making its behaviour predictable enough to be useful and constrained enough to be safe.
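The tooling-interface and telemetry roles can be illustrated with a minimal sketch. It is not drawn from any particular harness implementation: the `ToolHarness` class, the `word_count` tool, and the log format are all hypothetical, and the sketch does not attempt real sandboxing, which would need operating-system or container isolation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToolHarness:
    """Hypothetical wrapper: every tool call is checked, executed, and logged."""
    tools: dict[str, Callable[..., Any]]
    log: list[dict[str, Any]] = field(default_factory=list)

    def call(self, name: str, **kwargs: Any) -> Any:
        # Refuse anything that was not explicitly exposed to the agent.
        if name not in self.tools:
            self.log.append({"tool": name, "args": kwargs, "status": "refused"})
            raise PermissionError(f"tool not exposed: {name}")
        result = self.tools[name](**kwargs)
        # Record the call so the session can be reconstructed later.
        self.log.append(
            {"tool": name, "args": kwargs, "status": "ok", "result": result}
        )
        return result


def word_count(text: str) -> int:
    """Illustrative tool: count whitespace-separated words."""
    return len(text.split())


harness = ToolHarness(tools={"word_count": word_count})
count = harness.call("word_count", text="a vague brief produces confident output")
```

The point of the sketch is the shape rather than the detail: the agent never calls a tool directly, so every action passes through one place where it can be refused and recorded.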
The form a harness takes in engineering practice varies. In some setups it lives in a single document; in others it is distributed across configuration files, project rules, and tool definitions. Practitioner accounts describe how this is done in practice (Lopopolo, 2026). What matters for the present argument is the underlying logic: a specification of operating context with identifiable parts, designed to address the kinds of problems that emerge when capable agents work without one. The question this paper takes up is whether the same response can be adapted for doctoral research. The next section sets the components side by side and identifies which is which.
3. Mapping the components
The table below maps each engineering component to a proposed research equivalent and notes the key adaptation involved. The order of rows follows the order in which the research components are developed in the next section.
| Engineering component | Research equivalent | Key adaptation |
|---|---|---|
| Accessible context | Knowledge base | If material is not in the agent’s accessible context, it does not exist for the agent. Likewise, the knowledge base is the curated body of material the researcher has made available. The agent does not expand it on its own initiative; tool-retrieved material may enter accessible context for a session, but only the researcher decides what enters the persistent knowledge base. |
| Agent instructions | Interpretive permissions | Engineering instructions govern coding conventions and architectural choices: the conventions the agent follows in producing code. Research interpretive permissions govern analytic norms: what counts as a legitimate inference, what is overreach, and what constitutes a category error in the tradition the research operates within. |
| Tools | Tools | Tools describe what the agent can do. The adaptation is in what the actions operate on. Engineering tools operate against a codebase; research tools operate against literature, transcripts, datasets, and analytic memos. The specification names categories of capability rather than specific products, because available tools change over time. |
| Authority | Authority | Engineering permissions distinguish autonomous action from confirmation-required action. Research authority adds a third category — reserved action — for moves that require human judgement regardless of context, such as authorising a change in theoretical framing. |
| Scope confirmation | Scope register | Engineering harnesses handle scope expansion by requiring the agent to flag potential additions for explicit confirmation. The research equivalent goes a step further: the scope register captures the off-topic material in a designated location so it is preserved for later consideration rather than discarded. The mechanism is closely related; the artefact differs. |
| State management and telemetry | Process record | The process record provides continuity across sessions in a system where the agent has no persistent memory. The agent builds and maintains the record under the researcher’s direction and the researcher reviews and curates it. The adaptation is in what counts as relevant to record: research operations and decisions rather than code changes and build outcomes. |
| Escalation | Amendment protocol | Engineering escalation handles both moment-by-moment confirmation requests and significant direction changes within a single component. In the research harness these are separated: moment-by-moment confirmation belongs to authority’s supervised category; significant changes — modifications to the harness itself, shifts in theoretical framing, expansions in scope — are handled by the amendment protocol. |
| Evaluation engine | — | Engineering evaluation verifies output against external standards in seconds (tests, type checkers, builds). Research outcome validation is slower (peer review, supervisory judgement, scholarly reception) and evaluative rather than functional. There is no direct research equivalent of fast outcome verification; the research harness substitutes process-level feedback distributed across other components. |
A short clarification on the evaluation row (the last row of the table), which is the row most likely to be misread. The claim is not that doctoral research lacks feedback but that engineering’s fast outcome verification, in which a test suite tells the agent in seconds whether the output is functioning, does not transfer. Some highly specified research workflows may admit faster process validation: an agent given clear inclusion criteria for a literature review could check retrievals against those criteria in real time. What does not transfer is the kind of fast, functional verification that lets the agent self-correct continuously against an external standard.
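The literature-review case can be sketched as a simple programmatic check. The criteria used here (a publication-year floor, peer-review status, and required title terms) are assumptions chosen for illustration, as are the `Retrieval` type and the `meets_criteria` function; a real review protocol would define its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Retrieval:
    """Hypothetical record for one retrieved item."""
    title: str
    year: int
    peer_reviewed: bool


def meets_criteria(item: Retrieval, earliest_year: int, required_terms: set[str]) -> bool:
    """Return True only if the item satisfies every stated inclusion criterion."""
    title_words = set(item.title.lower().split())
    return (
        item.year >= earliest_year
        and item.peer_reviewed
        and required_terms <= title_words  # every required term appears in the title
    )


retrievals = [
    Retrieval("Doctoral supervision and AI agents", 2025, True),
    Retrieval("Doctoral supervision in the 1990s", 1998, True),
]
screened = [
    r.title for r in retrievals
    if meets_criteria(r, earliest_year=2020, required_terms={"doctoral", "supervision"})
]
```

A check of this kind validates the process (was the stated criterion applied?) rather than the outcome (is the resulting review any good?), which is the distinction the paragraph above draws.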
4. The components of the research harness
This section presents each proposed component of the research harness, in a suggested order of development. Initial construction tends to be linear because later components depend on what has been settled earlier, but the components overlap and the harness develops iteratively across the project. The first iteration can be deliberately thin: a single sentence under each component, capturing what the researcher can honestly commit to at this stage of the work. The structure provides the prompts; the content matures as the work encounters cases the initial version did not anticipate. Each extension is itself a small piece of doctoral work. The harness develops in relationship with the supervisor; none of the components below is the student’s job alone, and many of them only become meaningful when read against a supervisor’s response.
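As an illustration of how thin a first iteration can be, the sketch below holds one sentence per component and reports which components have not yet been addressed. The sentences are hypothetical examples rather than recommended content, the reference to supervisor agreement in the last entry is an assumption, and nothing in the framework requires the harness to be held in code.

```python
# The seven components, in the suggested order of development.
COMPONENTS = (
    "knowledge base",
    "interpretive permissions",
    "tools",
    "authority",
    "scope register",
    "process record",
    "amendment protocol",
)

# Hypothetical first-iteration entries: one sentence per component.
thin_harness = {
    "knowledge base": "The agent works only with material I have added to the project folder.",
    "interpretive permissions": "Do not generalise from this sample to a wider population.",
    "tools": "The agent may search the defined corpus and read anonymised transcripts.",
    "authority": "Drafting the discussion section is reserved for me.",
    "scope register": "Off-topic material is noted in one file for later consideration.",
    "process record": "The agent logs operations and decisions; I review the log.",
    "amendment protocol": "Changes to framing or scope are agreed with my supervisor first.",
}


def unaddressed(harness: dict[str, str]) -> list[str]:
    """List components with no entry yet, in the suggested order of development."""
    return [name for name in COMPONENTS if not harness.get(name, "").strip()]
```

Calling `unaddressed(thin_harness)` returns an empty list here; for a harness with only a `"tools"` entry it would return the other six components, which gives researcher and supervisor a concrete prompt for what to discuss next.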
Knowledge base
The knowledge base is the body of material the agent can operate within. The governing principle, taken directly from the engineering harness, is that if material is not in the accessible context, it does not exist for the agent. The agent does not retrieve material on its own initiative, and while tool actions may bring external material into a session, only the researcher decides what enters the persistent knowledge base. In this way, the agent’s contributions remain traceable to material the researcher has explicitly made available. This is a working principle rather than a guarantee, as the agent may still draw on its general training in ways the researcher cannot fully prevent. But the curation of the knowledge base, combined with the researcher’s review of agent outputs, is what keeps contributions auditable in practice.
The contents of the knowledge base are largely familiar to any researcher. The research question, in explicit and versioned form. The theoretical framework, with the traditions being drawn on named clearly. The methodology, specified in enough detail to be operational. The literature the researcher is working with. The data or data collection plan, including the artefacts the agent is permitted to operate on, such as transcripts, datasets, or field notes. The protocol, the ethics approval, and the consent forms; anything that establishes what is permitted with respect to participants. Most of this content already exists somewhere in the project’s documentation. The literature review is still the chapter it has always been; what changes is that the underlying material also serves as accessible context for the agent. The protocol is still the regulatory document it has always been; what changes is that it also functions as a set of operating constraints on what the agent is permitted to do with participant data. The harness extends what existing research artefacts do rather than replacing them.
The knowledge base must also specify what is excluded: material that exists in the project but that the agent is not permitted to access. Sensitive participant data, identifiable health records, audio recordings of interviews, and other material covered by ethics approvals may be material the researcher has decided the agent should not see, or that participating institutions have not authorised to be processed by external AI services. The knowledge base specifies these exclusions explicitly. Whether the harness handles this through file-level tagging, separate working environments, the use of locally running models, or simply through the researcher’s discipline depends on the project’s circumstances. The principle is that the constraint is stated and the researcher’s choice about how to honour it is recorded.
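One way to picture stated inclusions and exclusions is the sketch below. The `KnowledgeBase` class, the file names, and the rule that exclusions take precedence over inclusions are all assumptions made for illustration; a project might equally honour the same constraints through separate environments or researcher discipline.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeBase:
    """Hypothetical register of what the agent may and may not read."""
    included: set[str] = field(default_factory=set)
    # Each exclusion records how the researcher has chosen to honour it.
    excluded: dict[str, str] = field(default_factory=dict)

    def agent_may_read(self, item: str) -> bool:
        # Assumption: exclusions win over inclusions.
        if item in self.excluded:
            return False
        # Anything not explicitly included does not exist for the agent.
        return item in self.included


kb = KnowledgeBase(
    included={"research_question_v3.md", "protocol.pdf", "transcripts_anonymised/"},
    excluded={"interview_audio/": "kept on an encrypted drive, never uploaded"},
)
```

Here `kb.agent_may_read("protocol.pdf")` is true, while both the excluded audio folder and any unlisted item return false; the reason attached to each exclusion is the recorded choice about how the constraint is honoured.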
Interpretive permissions
The interpretive permissions component governs how the agent reasons over, and makes claims from, material in the knowledge base. It is the research equivalent of the engineering agent’s instructions: a set of operational conventions and out-of-bounds rules, not a philosophical specification of every legitimate inference. The component is offered with the same caveat applied to the harness as a whole; it does not need to be exhaustive at the outset, and the researcher who attempts to specify every possible inference in advance is likely to produce a document that fails on contact with the work.
The starting point is to name the tradition the research is operating within. Naming a tradition already does substantial implicit work as it commits the agent to an extensive body of norms that are part of the agent’s general knowledge, and that the researcher does not need to articulate from scratch. An agent told that this is a phenomenological study in the Heideggerian tradition will draw on different conventions than an agent told that this is post-positivist mixed-methods clinical research.
The researcher then adds project-specific operational rules. These are not a complete account of how the agent should interpret the material; they are the things the researcher knows the agent must not do in this project. Concrete examples for a qualitative study might include: treat participant accounts as constructed narratives rather than direct evidence of mental states; do not generalise from this sample to a wider population; flag any reading that resonates more with practitioner intuition than with the data. For a quantitative study they might include: do not present statistical association as causal effect without additional warrant; report confidence intervals alongside point estimates; do not generalise subgroup findings beyond the sample they emerged from. These are operational instructions the agent must follow, in addition to what the named tradition already implies.
The interpretive permissions also specify how the agent presents its conclusions to the researcher. A useful rule, which addresses the amplification of confirmation bias described in section 1, is that the agent should offer options rather than recommendations: where a methodological or interpretive choice is in front of the researcher, the agent surfaces the available moves and the considerations for each, and asks the researcher to choose. The researcher’s choice, together with a brief rationale, is entered into the process record. This shifts the texture of the exchange away from the agent proposing what the researcher approves and toward the researcher deciding within options the agent has helped to surface.
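The options-rather-than-recommendations rule can be sketched as a small data structure. The names (`Option`, `record_choice`), the fields, and the example question are hypothetical; the only features taken from the text are that the agent surfaces options with considerations, and that the researcher’s choice and a brief rationale enter the process record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """One available move, with the considerations the agent has surfaced."""
    label: str
    considerations: str


def record_choice(
    question: str,
    options: list[Option],
    chosen: str,
    rationale: str,
    process_record: list[dict],
) -> dict:
    """Append the researcher's decision to the process record and return the entry."""
    labels = [o.label for o in options]
    if chosen not in labels:
        raise ValueError(f"choice must be one of the surfaced options: {labels}")
    if not rationale.strip():
        raise ValueError("a brief rationale is required")
    entry = {
        "question": question,
        "options": labels,
        "chosen": chosen,
        "rationale": rationale,
    }
    process_record.append(entry)
    return entry


process_record: list[dict] = []
options = [
    Option("merge codes", "Simplifies the scheme but loses a distinction participants drew."),
    Option("keep separate", "Preserves the distinction at the cost of sparse categories."),
]
entry = record_choice(
    "Should 'workload' and 'time pressure' be merged?",
    options,
    chosen="keep separate",
    rationale="Participants treated these as different experiences.",
    process_record=process_record,
)
```

Requiring a rationale before the entry is accepted is a design choice in this sketch, intended to keep the decision with the researcher rather than reducing it to approval of an agent proposal.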
There is a fair question about whether asking a doctoral researcher to articulate such rules is reasonable. On first attempt, most researchers will find it unfamiliar work. Some commitments they thought were stable will turn out to be vague; some assumptions they had not consciously held will emerge as load-bearing. This is part of what the harness is for. It is not asking the researcher to specify every interpretive move in advance, but to start with the central commitments of the tradition plus the project-specific rules they can name now, and to extend the component as the work encounters cases the initial version did not anticipate. Each extension is itself a small piece of supervised doctoral work, and the supervisor’s involvement in working out these extensions is part of how the component matures.
The interpretive permissions also specify what the agent should do when it encounters interpretive territory the component does not cover. The rule is to flag the matter to the researcher rather than make an interpretive call the harness has not authorised. This is the same rule that operates in engineering harnesses for out-of-distribution situations: where the specification is silent, the agent should escalate rather than improvise. In research this matters because the cases the initial component does not cover are often precisely the cases the researcher most needs to think through deliberately.
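The flag-when-silent rule can be expressed as a lookup with a default. The rule names below paraphrase the quantitative examples given earlier in this section, and the `Ruling` type and `rule_on` function are hypothetical; the sketch assumes interpretive moves can be named in advance, which in practice will only ever be partly true.

```python
from enum import Enum


class Ruling(Enum):
    PERMITTED = "permitted"
    PROHIBITED = "prohibited"
    FLAG = "flag to researcher"


# Hypothetical project-specific rules for a quantitative study.
RULES = {
    "present_association_as_causal": Ruling.PROHIBITED,
    "generalise_subgroup_beyond_sample": Ruling.PROHIBITED,
    "report_interval_with_point_estimate": Ruling.PERMITTED,
}


def rule_on(move: str) -> Ruling:
    """Where the component is silent, escalate rather than improvise."""
    return RULES.get(move, Ruling.FLAG)
```

With these rules, `rule_on("present_association_as_causal")` is prohibited, while an unlisted move such as `"impute_missing_outcome_data"` is flagged. Each flagged case is also a candidate for extending the component with the supervisor.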
Tools
The tools component covers what the agent is capable of doing in support of the project. A tool, in the sense the harness uses, is a capability the agent can invoke: web search, file reading, dataset interrogation, retrieval from a reference manager, code execution. The specification operates at the level of capabilities, not at the level of specific commercial tools, applications, or platforms. A researcher conducting a qualitative study might give their agent the capability to search across a defined corpus of literature, to read and analyse interview transcripts, to draft working notes on emerging themes, to surface patterns across coded segments, and to retrieve information from specified external sources. A researcher conducting a quantitative study might give their agent capabilities to interrogate a dataset, run specified analyses, generate visualisations, and search for relevant methodological literature. What matters is that these capabilities are explicit, including those the researcher has decided to exclude, such as access to sensitive participant data, audio recordings, or open web retrieval during the analysis phase. These exclusions are part of the specification, not omissions from it.
The specification operates at the level of capabilities rather than specific implementations for two reasons. The first is durability: the landscape of available tools changes over the duration of a doctoral project, and a specification tied to particular commercial products commits the harness to choices that may be superseded. The second is principle: the harness is not a configuration file for a particular AI system. It is a specification of operating context that can be implemented in different ways, with different AI models, different agent configurations, different institutional rules and regulations, and different disciplinary traditions. A researcher whose tool of choice gains a new feature, or who switches platforms, can preserve the harness intact by mapping new tools onto the categories the specification covers.
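Mapping a new or updated tool onto capability categories might look like the following sketch. The category names and the `check_product` function are hypothetical; the aim is only to show that the specification stays fixed while the products mapped onto it change.

```python
# Hypothetical capability categories for a qualitative project.
GRANTED = {"corpus_search", "transcript_reading", "note_drafting"}
EXCLUDED = {"open_web_retrieval", "audio_access"}


def check_product(product_capabilities: set[str]) -> dict[str, set[str]]:
    """Sort a product's capabilities against the specification."""
    return {
        # Covered by the specification and available to the agent.
        "usable": product_capabilities & GRANTED,
        # Explicitly excluded, so they need switching off in this product.
        "must_disable": product_capabilities & EXCLUDED,
        # Not yet covered: a prompt to extend the specification deliberately.
        "unspecified": product_capabilities - GRANTED - EXCLUDED,
    }


# A hypothetical platform that has just gained code execution and web access.
report = check_product(
    {"corpus_search", "note_drafting", "open_web_retrieval", "code_execution"}
)
```

In this example `report["unspecified"]` contains `"code_execution"`: the new feature is neither granted nor excluded, so it surfaces as a decision for the researcher rather than silently widening what the agent can do.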
The tools component interacts with the data exclusions discussed under the knowledge base. What the agent is capable of doing has implications for what data is sent where, and the harness needs to be internally coherent about both. The specifics of how this is operationalised — whether through local model deployment, file-level tagging, separate working environments, or researcher discipline — depend on the project’s circumstances and are taken up in the section on form.
Authority
The authority component specifies what the agent is allowed to do with its capabilities, and under what conditions. Where the tools component settles capability (“can”), authority settles permission (“may”).
Authority is structured around three categories of action. Autonomous actions are operations the agent performs without confirmation, where the work is reversible and inspectable and the researcher will review it before the project is committed to it: searching the literature within the knowledge base for material relevant to a specified question, summarising the contents of a paper, suggesting initial codes for an interview transcript, drafting a methodological note. Supervised actions are operations the agent begins but must surface to the researcher before completing or acting on: applying a new code category that the existing scheme does not capture, drawing on a theoretical framework not yet in the knowledge base, or producing a synthesis across multiple working notes that would be treated as a project artefact rather than a working note. This is where most of the substantive collaboration between researcher and agent takes place; the agent is doing more than executing, it is proposing, and the researcher is judging. Reserved actions are operations the agent does not perform, regardless of capability or context: drafting the discussion section of the thesis, deciding on the final coding scheme used in a published analysis, signing off on the interpretation of a key finding, and any action that commits the project to a stated position in print. The contents of this category are not fixed for the duration of the process; revisions to what falls into it are amendments to the harness, handled through the amendment protocol.
The three categories reject the binary that often dominates AI-use discussions, where the agent either can or cannot do something. The supervised category, in which the agent proposes and the researcher judges, is the middle ground that binary leaves out. The parallel is to the way research has long thought about delegation: some moves a researcher may make unsupervised; some require supervisor confirmation; some are outside the scope of what the researcher may delegate at all. The authority component applies this thinking to agent action.
The decision about which actions fall into which category is itself a research decision. A researcher who places the generation of initial codes in the autonomous category and one who places it in the supervised category have made different methodological choices, and the harness records those choices for inspection. A supervisor reviewing the authority component can see what the student is delegating and ask whether the delegations are appropriate to the tradition the work is operating within. A committee reviewing a proposal can see whether the architecture is proportionate to the project’s claims.
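The distinction between "can" and "may" can be sketched as a lookup from actions to the three categories. The action names and their assignments below are hypothetical, loosely following the examples in the text. Treating an unlisted action as supervised is an assumption of this sketch, not something the framework prescribes.

```python
from enum import Enum


class Authority(Enum):
    AUTONOMOUS = "autonomous"  # agent proceeds; researcher reviews afterwards
    SUPERVISED = "supervised"  # agent proposes; researcher confirms first
    RESERVED = "reserved"      # agent does not perform the action


# Hypothetical assignments; in a working harness these are research decisions.
AUTHORITY = {
    "summarise_paper": Authority.AUTONOMOUS,
    "suggest_initial_codes": Authority.AUTONOMOUS,
    "apply_new_code_category": Authority.SUPERVISED,
    "draft_discussion_section": Authority.RESERVED,
}


def authority_for(action: str) -> Authority:
    """Look up an action; unlisted actions default to SUPERVISED (an assumption)."""
    return AUTHORITY.get(action, Authority.SUPERVISED)


def may_proceed(action: str, researcher_confirmed: bool = False) -> bool:
    """Return True only if the authority component permits the action now."""
    level = authority_for(action)
    if level is Authority.AUTONOMOUS:
        return True
    if level is Authority.SUPERVISED:
        return researcher_confirmed
    return False  # reserved: not performed, whatever the confirmation
```

Moving an action from one category to another means editing the `AUTHORITY` table, which in the framework's terms is an amendment to the harness rather than a one-off choice.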
Scope register
The scope register is the harness’s mechanism for managing scope drift. The problem it addresses is one engineers face too: an agent working on a defined task encounters material or opportunities that could expand what is being done, and the researcher needs to decide deliberately whether to incorporate the expansion rather than absorbing it in the moment. Engineering harnesses handle this through scope-expansion confirmation: the agent flags additions for explicit confirmation rather than acting on them silently. The research register draws on the same mechanism, with a step added. The off-topic material is preserved rather than discarded, because in research something that initially appears off-topic may in fact be a signal of a productive line of inquiry.
The pattern in research has the same shape with different content. A literature search aimed at one question surfaces papers relevant to a different question. A coding session focused on one theme identifies a pattern that points toward a different theme. A reading of one transcript raises a methodological question the researcher had not anticipated. Without a mechanism to capture such material, the agent has two options, neither of which is acceptable: pursue the off-topic material and drift away from declared scope, or ignore it and lose value the researcher might otherwise have captured.
The scope register is the third option. It is a designated location where the agent records material that is interesting but off-topic, with enough context for the researcher to return to later. The agent does not pursue the material and the researcher does not need to deal with it in the moment. The material is preserved, available for the researcher to engage with when they are ready or when the project’s direction shifts to make it relevant. Some registered items would be immediately dismissed on review while others might be integrated into the project.
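A register entry needs little more than a date, the task in hand, the material noticed, and enough context to return to it. The sketch below assumes a plain-text register file; the field names and the entry layout are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import Path


@dataclass
class ScopeItem:
    noticed_on: date
    during_task: str      # what the agent was doing when the item surfaced
    description: str      # the off-topic material itself
    why_interesting: str  # context for the researcher's later review
    status: str = "open"  # set to "dismissed" or "integrated" on review


def format_item(item: ScopeItem) -> str:
    """Render one register entry as plain text."""
    return (
        f"{item.noticed_on.isoformat()} [{item.status}]\n"
        f"During: {item.during_task}\n"
        f"Noticed: {item.description}\n"
        f"Why it may matter: {item.why_interesting}\n"
    )


def register(item: ScopeItem, path: Path) -> None:
    """Append the entry to the register file; the material is not pursued here."""
    with path.open("a", encoding="utf-8") as handle:
        handle.write(format_item(item) + "\n")
```

Appending rather than overwriting keeps earlier items available for review, which is what allows an item to be dismissed or integrated later instead of being lost in the moment.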
Process record
The process record is the versioned log of what the agent has been asked to do, what it has produced, and what the researcher has decided about its contributions. Its primary function, as in the engineering harness, is external memory for the agent: current AI agents have no persistent memory across sessions, and the record is what allows continuity when a new session begins. The record is built by the agent under the researcher’s direction — the researcher does not type up every exchange, but asks the agent to update the record with what has been done — and the researcher reviews and curates what is captured.
The record captures decisions and substantive contributions rather than every exchange: what the researcher asked the agent; what the agent produced in response; what the researcher accepted, modified, or rejected, and on what grounds; and the rationales the researcher offered when choosing among the options the agent surfaced under the interpretive permissions rule. Decisions made about the harness itself (shifts in authority, additions to the knowledge base, extensions to the interpretive permissions) are recorded with the reasoning that produced them. Where the supervisor has been involved in a decision, that input is part of the record too. The record cannot resolve attribution, since distributed cognition is a feature of supervised work generally, but it makes the texture of contributions traceable in a way that traditional research artefacts do not.
In practice, much of the work of maintaining the record can be delegated to the agent itself. At the end of a session the researcher can ask the agent to draft an update (what was attempted, what was produced, what was decided) and then review, edit, and commit the result. This pattern reduces the administrative tax of keeping a record from scratch without removing the deliberative work that makes the record useful. The caveat is that the agent should draft surface form, not substantive rationale. A model asked to justify why an interpretive permission was extended, or why an autonomous action was reclassified as supervised, will produce a plausible, academic-sounding justification, which relieves the researcher of having to defend the choice. The rationale is what the harness asks the researcher to supply, and it is the point at which automation undermines the record rather than helping it.
The record also functions as a discipline against the cognitive offloading described in section 1. Maintaining a record of what the researcher decided to do with the agent’s output, and why, asks the researcher to make those decisions deliberately rather than absorbing the output uncritically. The record makes the agent’s role inspectable: a supervisor can see what the agent contributed to which parts of the work, and a committee or examiner can see how AI use has evolved across the project. The level of detail the record captures is itself a decision for the researcher and supervisor to negotiate; a record that captures every prompt becomes unusable, and a record that captures only major decisions misses the texture of how the work developed.
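The division of labour described above, in which the agent drafts the surface form and the researcher supplies the rationale, can be sketched as an entry that cannot be committed until the rationale is filled in. The names `RecordEntry` and `commit` are hypothetical, and the layout follows the Asked, Produced, Decided, Rationale pattern used in the appendix.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecordEntry:
    entry_date: str
    asked: str
    produced: str
    decided: str
    rationale: Optional[str] = None  # left empty in the agent's draft


def commit(entry: RecordEntry, researcher_rationale: str) -> str:
    """Render the entry once the researcher has supplied the rationale."""
    if not researcher_rationale.strip():
        raise ValueError("the researcher must supply the rationale before committing")
    entry.rationale = researcher_rationale.strip()
    return (
        f"{entry.entry_date}\n"
        f"Asked: {entry.asked}\n"
        f"Produced: {entry.produced}\n"
        f"Decided: {entry.decided}\n"
        f"Rationale: {entry.rationale}\n"
    )
```

A check of this kind cannot tell whether the rationale is the researcher's own thinking; it only makes the empty slot visible, which is the most a format can do.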
Amendment protocol
The amendment protocol handles the kind of escalation engineering harnesses also address, but at a different scale. Moment-by-moment escalation — where the agent encounters a decision it is not authorised to make and surfaces it to the researcher — is handled by the supervised category in authority. The amendment protocol handles the harder case: where what needs to change is not the current task but the harness itself.
The amendment protocol distinguishes two kinds of departure from the harness. The first is an exception: an action by the agent or the researcher that steps outside what the harness specifies in a particular instance, without any intention to change what the harness says going forward. The agent produces output that exceeds its interpretive permissions; the researcher accepts material the harness should have flagged for supervisor review; an autonomous action is taken on something the harness designates as reserved. Exceptions are recorded in the process record and addressed at the time. They are not in themselves harness failures; they are events the harness is designed to detect.
The second is an amendment: a deliberate change to the harness itself. The researcher expands the knowledge base to incorporate a body of literature not previously included; extends the interpretive permissions to address a kind of inference the project has encountered that the initial version did not anticipate; shifts an action from supervised to autonomous, or vice versa. The operational change may be small (adding a folder of papers or revising a sentence) but the harness specification changes with it, and that change is recorded with its rationale and, where the change is substantive, discussed with the supervisor.
The distinction matters because the responses are different. Treating an amendment as an exception lets the harness drift silently as exceptions accumulate without being acknowledged as changes to what the harness says. Treating an exception as an amendment rewrites the harness every time the agent or researcher steps outside it, which removes the constraint the harness is meant to provide. The amendment protocol specifies how each is identified and handled, so that exceptions are detected and amendments are made deliberately.
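The two responses can be sketched as a routing decision plus a simple check for silent drift. Everything named here is hypothetical, and the threshold of three recurring exceptions is an arbitrary assumption chosen for illustration, not a recommendation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Departure:
    description: str
    changes_harness_going_forward: bool  # the researcher's judgement
    rationale: str = ""


def route(departure: Departure) -> str:
    """Classify a departure as an 'exception' or an 'amendment'."""
    if departure.changes_harness_going_forward:
        if not departure.rationale.strip():
            raise ValueError("an amendment is recorded with its rationale")
        return "amendment"
    return "exception"  # logged in the process record, harness unchanged


def drift_warning(exception_kinds: list[str], threshold: int = 3) -> list[str]:
    """List kinds of exception that recur at least `threshold` times.

    Recurring exceptions may be amendments that were never acknowledged.
    """
    counts = Counter(exception_kinds)
    return sorted(kind for kind, n in counts.items() if n >= threshold)
```

The first function keeps a single exception from rewriting the harness; the second surfaces the opposite failure, in which exceptions of the same kind accumulate until the harness has changed in practice without changing on paper.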
The seven components together specify the operating context within which an AI agent contributes to the research. The specification is built in the order described, evolves through the project’s duration via the amendment protocol, and is documented continuously through the process record. The components are parts of one specification, not separate documents — though how that specification is materially organised is a question the next section takes up.
5. What the harness looks like
This section addresses what the harness looks like as a material object. The form parallels an engineering harness, adapted for the research context, and varies with the researcher’s technical comfort and the depth of AI use the project involves; the basic shape is consistent across implementations.
The basic form
An engineering harness is, in most implementations, a small set of structured-text files held in the project workspace. The agent loads these as context when it begins working in the project. Markdown — a plain-text format that humans can read directly and that AI agents can parse without special handling — is the dominant format for human-readable content; more structured formats such as YAML are sometimes used where content needs to be parsed by the system, typically for tool definitions or configuration values. Beyond this, the form is unremarkable: a folder of files, named clearly, sitting alongside the project’s other materials.
These formats are well-suited to a researcher-built harness. Markdown is readable without specialised tooling, which preserves the human-readable property the harness depends on for supervisory review. Where structured formats are needed, YAML accepts comments alongside the data, allows multi-line text without special handling (useful for long-form interpretive permissions or extended instructions), and uses indentation rather than punctuation as its structuring device, which makes it easier to edit by hand than punctuation-heavy alternatives. Neither format requires the researcher to learn anything they could not pick up in an afternoon.
A research harness, in its basic form, looks the same as an engineering one. A folder, held in the project workspace or in a directory the researcher uses for the project, contains structured-text files for each of the components developed in the previous section. The knowledge base may be a single document or a directory containing the proposal, theoretical framework, methodology, ethics approval, and curated literature. Interpretive permissions, tools, authority, scope register, process record, and amendment protocol are typically single files each. Plain markdown handles almost all of this. The structured formats engineers sometimes use for tool definitions are rarely needed for a research harness, because the things the agent needs to parse — what tools it can use, what authority each carries — are minimal compared with engineering. The folder is versioned, either through a tool like Git or through dated copies of the files, so the harness’s evolution is preserved alongside the project. For a researcher already using Git for their thesis, the harness can sit in the same repository; for a researcher who is not, dated copies are sufficient.
This basic form is accessible to most doctoral researchers. It requires no software development capability, no integration with any particular AI system, and no infrastructure beyond what most researchers already use for notes and writing. A researcher with a folder of markdown files and a willingness to maintain them has a working harness.
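For researchers who would rather automate the dated copies mentioned above than make them by hand, a few lines of Python suffice. This is an optional sketch: the function name `snapshot` and the `harness-YYYY-MM-DD` naming scheme are assumptions, and copying the folder manually achieves the same result.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Optional


def snapshot(harness_dir: Path, archive_dir: Path, on: Optional[date] = None) -> Path:
    """Copy the harness folder to archive_dir/<name>-YYYY-MM-DD and return the copy."""
    stamp = (on or date.today()).isoformat()
    target = archive_dir / f"{harness_dir.name}-{stamp}"
    if target.exists():
        # Refuse to overwrite, so an earlier version is never silently lost.
        raise FileExistsError(f"a snapshot for {stamp} already exists: {target}")
    shutil.copytree(harness_dir, target)
    return target
```

Calling `snapshot(Path("harness"), Path("harness-archive"))` after each substantive change leaves one dated folder per version, which a supervisor can open without any tooling.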
How the harness is used
When the researcher begins working with the agent on the project, the harness is what the agent loads before doing anything else. How this loading happens depends on the AI tool.
Most current consumer AI tools support some form of persistent project context. A researcher using a system that supports project workspaces uploads or links the harness folder once; the agent has access across sessions without the researcher providing the files each time. A researcher using a system that requires the harness to be supplied per session pastes the relevant components into the conversation at the start. The mechanism differs; the principle is the same. The harness is a stable body of material the agent is briefed on, not a configuration of the AI system itself.
What the agent loads in a given session may be the whole harness or a relevant subset. A session focused on coding interview transcripts may load the knowledge base, the interpretive permissions, the authority component, and the scope register, but not the amendment protocol. A session focused on amending the harness loads the amendment protocol explicitly. This is the same way engineering harnesses operate: the agent loads what is relevant to the work at hand and the rest is available on demand.
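Selective loading can be sketched as a mapping from kinds of session to the components each one needs. The session names and the `=== name ===` separator are hypothetical; the file names follow the folder sketch in the appendix, and the text produced would be pasted or uploaded by whatever means the AI tool provides.

```python
from pathlib import Path

# Hypothetical session profiles: which components each kind of session loads.
SESSION_PROFILES = {
    "coding": [
        "knowledge-base",
        "interpretive-permissions.md",
        "authority.md",
        "scope-register.md",
    ],
    "amendment": ["amendment-protocol.md"],
}


def read_component(path: Path) -> str:
    """Read a single file, or every markdown file under a directory."""
    if path.is_dir():
        files = sorted(path.rglob("*.md"))
        return "\n\n".join(f.read_text(encoding="utf-8").strip() for f in files)
    return path.read_text(encoding="utf-8").strip()


def build_context(harness_dir: Path, session: str) -> str:
    """Assemble the text the agent is briefed on for one kind of session."""
    parts = []
    for name in SESSION_PROFILES[session]:
        parts.append(f"=== {name} ===\n{read_component(harness_dir / name)}")
    return "\n\n".join(parts)
```

The profiles are themselves a small piece of the specification: deciding that a coding session does not need the amendment protocol is a choice the researcher can revisit.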
More technical implementations
For researchers with greater technical comfort, the components can be expressed in more sophisticated forms. The same content can live as system-level instructions for an AI tool, as configuration files — structured-text files separate from the user-facing settings — that the tool reads automatically, as custom skills or commands defined within a particular AI platform, or as project-level rules the system applies when working in the project. These implementations have advantages: the agent has the harness available without the researcher providing it each session, some constraints can be more reliably enforced, and the agent can act within the harness more autonomously because the operating context is built into how the system engages with the project.
These implementations are not conceptually different from the basic form. They are different ways of making the same content available to the agent. The substance — what the knowledge base contains, what the interpretive permissions say, how authority is structured — does not change with the implementation. A researcher whose tools component lives in a configuration file is making the same kind of statement as a researcher whose tools component is a markdown document; the configuration file is more easily enforced by the system, but the underlying choice about what the agent is allowed to do is identical. A doctoral researcher should not feel that adopting a harness requires becoming a software developer, and should also not feel that a non-technical harness is somehow inferior. The decision about how technical to make the implementation is a matter of fit, not of correctness.
The harness, in this sense, is a research artefact first and a technical artefact second. Its form is contingent on what the researcher needs and can practically maintain.
6. Implications
What follows is an indication of what the harness implies for the audiences most directly affected by it, alongside questions it raises.
For supervisors, the most immediate implication is that they would have something structured to engage with. The current situation, in which AI use in doctoral research is either prohibited, permitted without structure, or governed by policies that cannot be operationalised, leaves supervisors without a basis for the kind of substantive conversation the topic requires. The harness changes this. A supervisor whose student is building a harness has a structured object to read, push back on, and revise alongside the student. The supervisory conversation about AI use becomes a conversation about specific claims (what the interpretive permissions say, what the authority component allows, what falls into the reserved category) rather than a conversation about AI use in general. This is a different kind of supervisory work, and it is unfamiliar. Supervisors will need to develop their own thinking about the components the harness specifies, and many will be doing so for the first time. The harness in this sense functions developmentally for the supervisor as well as the student. Supervisors who engage with their students’ harnesses will find themselves articulating commitments about AI use in research that they have not previously had to make explicit. The harness is principally a governance instrument, but the act of building and reviewing it has consequences for how supervisors and students develop their thinking about AI in doctoral inquiry. This side effect is worth noting because it may make the harness useful even where AI use turns out to be modest.
For doctoral programmes and committees, the harness offers a structured object that proposal assessment, milestone reviews, and viva preparation can engage with. The argument the paper has made is not that current proposals are inadequate, but that the harness adds something PhD proposals do not currently require. Programmes for which a file-based implementation is too heavy a lift can require the harness as a set of commitments embedded in the proposal itself — interpretive principles, areas of permitted delegation, agreed limits — without requiring the implementation Section 5 describes. A programme that chose to require a fuller harness as part of the proposal would be requiring its students to externalise their thinking about AI use in advance. A programme that chose to require harness updates at progression milestones would be capturing how the student’s thinking about AI use has developed across the doctoral journey. These are choices a programme makes deliberately. What the framework offers is a coherent object that can be required, reviewed, and assessed if the programme chooses to do so.
For institutions, the harness offers a different kind of governance object from current policy. Most existing schemes, including the prohibition/permission binary and the traffic-light or tiered-permission models in wider use, specify when in the research process a researcher may use AI and when they may not. They presuppose that some parts of the work can be AI-free. The harness assumes the opposite: that AI use will be present across the work, and specifies, instead, what the agent can and cannot do independently within that work. The shift is from governing the researcher’s access to AI to governing the agent’s operation within research. This is a different kind of regulatory object, and one that institutions can recognise without committing to specific tools, platforms, or implementations. The institutional question the framework does not resolve is whether the harness should be voluntary, recommended, or required. Voluntary adoption respects researcher autonomy and avoids imposing infrastructure on those who are not using AI substantively. Required adoption ensures consistency and signals institutional seriousness about AI governance. The middle position — recommended with structured support — may be the right one for most institutions during the period when the framework is being developed and refined. But this is a judgement institutions make on their own terms, not a question this paper can settle.
A practical question often raised about the harness is the administrative burden it adds to an already burdensome doctoral process. Much of the routine work (updating the process record, drafting amendments, flagging exceptions) can be delegated to the agent itself, following the pattern described in the process record component, in which the agent drafts the change and the researcher reviews, edits, and commits. This reduces the blank-page tax of administrative logging without removing the deliberative friction that is part of what the harness is for. The boundary described there (agent drafts surface form, researcher supplies rationale) applies here too: rubber-stamping an AI-generated justification for a methodological shift is exactly the kind of cognitive offloading the rest of the harness is designed to resist. The harness adds work, but much of that work can be borne by the agent under researcher direction, leaving the researcher the deliberative work the harness is for.
The conceptual framework opens questions it does not close. What specific forms the harness should take in different doctoral traditions is one — the paper has been deliberate about specifying the components without specifying their content, but the territory of articulating interpretive permissions for particular traditions, authority structures for different methodological approaches, and process records for different kinds of inquiry is open. How the harness should be assessed is another: a supervisor or committee reviewing a harness needs criteria for what counts as a good harness, what counts as an inadequate one, and what counts as a harness that has developed appropriately across doctoral inquiry. How the framework relates to other AI-use frameworks already in circulation (on academic integrity, on research ethics in the AI era, on responsible AI use in scientific work) is a third, since these address concerns the harness handles differently or does not address at all. And the framework’s claims about what the harness does in practice — give supervisors something to engage with, support more deliberate delegation, shift what doctoral inquiry can attempt — are at this stage conceptual claims, awaiting empirical work with researchers and supervisors who adopt the framework. These are not deficiencies in the conceptual argument; they are the legitimate work that follows from establishing a framework that did not previously exist.
7. What the harness does not do
Every governance instrument carries the risk of being asked to do more than it can. It is worth saying clearly what the harness does not address, to bound the case rather than to undermine it.
A well-constructed harness does not guarantee good research. It specifies the operating context within which the agent contributes but does not specify what counts as good research more broadly. A researcher operating within a well-constructed harness can still produce work that is methodologically thin, theoretically underdeveloped, or substantively unimportant. The harness sits alongside the things that determine a doctoral project’s quality — the strength of the research question, the appropriateness of the methodology, the researcher’s judgement, the supervisor’s engagement — addressing a specific aspect of how the work is conducted. It is necessary for coherent AI use, not sufficient for good doctoral research.
Nor does the harness substitute for supervision. A harness without supervisory engagement is a document that constrains the agent but does not support the supervisory conversation it presupposes. The interpretive permissions may be technically sound but unexamined; authority may be coherent but unchallenged; the process record may accumulate without anyone reviewing it. None of these is a harness problem, but each is what happens when the harness is treated as a substitute for the supervisory relationship rather than an instrument within it. Institutions that require harnesses without resourcing the supervision they presuppose will produce harnesses that exist as compliance artefacts rather than working instruments.
Validation of the agent’s contributions and adjudication of the quality of the research the harness bounds sit outside what the harness does. Engineering harnesses can rely on fast outcome validation; research harnesses cannot. What the harness substitutes for that is process-level feedback — the scope register, the amendment protocol, the supervisor’s review — which catches drift and surfaces matters for attention but does not test the agent’s contributions against truth. A finding that emerges from agent-supported analysis remains subject to peer review, replication where possible, and scholarly reception over time. The harness can be inspected, but inspection does not show whether the inquiry is good: a reader of the interpretive permissions can see what the researcher claims the tradition’s analytic norms are, but not whether the researcher has characterised them correctly. The harness gives supervisors, committees, and examiners something to engage with substantively. It does not replace their judgement, and it cannot substitute for the long-horizon validation that scholarly work always faces.
Nothing in the harness’s structure prevents it from being gamed; like any written artefact, it can be produced perfunctorily. Interpretive permissions can be vague enough to commit the researcher to nothing in particular. Authority can be permissive enough to authorise whatever the researcher wants to do. The process record can be a sparse and uninformative log. This is worth acknowledging but not overstating, as it is a feature of all governance documents, not a harness-specific weakness. The protection against it is the same protection that applies to other research artefacts: supervision that engages with the document substantively, committees that ask about specifics, records reviewed alongside the work they document.
Questions of equity sit outside what the harness can address. It assumes the researcher has access to AI tools and the time to develop a working harness for them. Neither is universally available. AI tools vary substantially in cost and capability. The time required to externalise interpretive permissions and maintain a process record falls more heavily on researchers without protected time (those balancing the doctorate against significant clinical, teaching, or family responsibilities). The harness offers a defensible artefact that some researchers can produce, but it does not, on its own, address the question of who is positioned to produce it. Institutions adopting the framework will need to attend to whether they are creating conditions for harness development or assuming them.
Conclusion
AI agents are now contributing to doctoral research, and they are doing so under arrangements that produce a recognisable set of problems. Those problems are not adequately addressed by institutional policy or by improvements in researcher judgement alone. They are produced by the absence of a specification of the operating context within which the agent contributes, and the response they call for is the development of that specification. The research harness is the artefact that specification produces. It is a structured set of components (knowledge base, interpretive permissions, tools, authority, scope register, process record, amendment protocol) that together define what an AI agent is doing in a doctoral-level project. The harness is built by the researcher in collaboration with their supervisor, evolves across the duration of the project, and is recorded continuously so that the agent’s role remains inspectable. It is the kind of object that supervisors and committees can engage with substantively, that institutions can recognise as a legitimate form of AI use governance, and that researchers can use to delegate agent activity within bounds they have deliberately set.
The harness framework is offered here as a conceptual contribution, not as a worked implementation. What any given harness contains, in any given doctoral project, remains a research decision. What form the harness takes (folder of markdown files, custom skills in a particular AI tool, or configuration files for a more technical implementation) remains a matter of fit between the researcher and the technology they are working with. This paper specifies the structure of the artefact, not its content, and further lines of inquiry are necessary to extend the conceptual framework into other research and educational contexts. The most important thing the framework offers is a shared object that captures what AI is doing in a particular doctoral project, across its duration. The current alternatives leave the people responsible for doctoral inquiry without a basis for the kind of substantive engagement the situation requires. The harness changes that. Whether it changes it well, and for which kinds of doctoral inquiry, is a question the framework opens and the practice of using it will need to answer.
Appendix: a sketch of the harness as a folder
The following is offered as illustration only, not as a model harness to be adopted. The contents are stylised and would be substantially fuller in a working harness; the purpose is to show what a harness looks like as a set of files a researcher and supervisor can read.
A possible folder structure:
/harness/
    knowledge-base/
        research-question.md
        theoretical-framework.md
        methodology.md
        ethics-approval.pdf
        participant-information.md
        literature/
            [curated literature, organised by topic]
        data/
            [permitted analytic artefacts; sensitive data excluded]
    interpretive-permissions.md
    tools.md
    authority.md
    scope-register.md
    process-record.md
    amendment-protocol.md
A sketched fragment of interpretive-permissions.md might read:
Tradition: constructivist grounded theory (Charmaz).
Operational rules:
- Treat participant accounts as situated constructions, not as direct reports of inner states.
- Do not generalise from this sample to a wider population.
- Flag any reading that appears to confirm a hypothesis the researcher has previously surfaced; surface counter-evidence explicitly.
- When offering an interpretive move, present options with brief considerations and ask the researcher to choose. Record the researcher’s rationale in the process record.
A sketched fragment of process-record.md might read:
2026-04-18
Asked: Code transcript 03 for instances of practitioner-led decision-making.
Produced: Twelve coded segments; flagged three as borderline.
Decided: Accepted nine; sent the three borderline cases back with a note on the criteria they did not meet; logged the criteria in the interpretive permissions as a refinement.
Rationale: The borderline cases reflected practitioner descriptions of decision-making rather than observed decision-making; the distinction had not been articulated in the initial permissions and was added.
A sketched fragment of amendment-protocol.md might read:
2026-05-02 — Amendment
Change: Added the working papers on shared decision-making (Stiggelbout et al.; Elwyn et al.) to the knowledge base.
Reason: Analysis surfaced a recurring pattern around how practitioners describe shared decision-making that the original literature did not cover.
Authority impact: None. Existing supervised actions cover engagement with the new material.
Supervisor review: Discussed in supervision 2026-04-30; supervisor confirmed the addition.
The sketch is illustrative. A working harness will be longer, less tidy, and shaped to the project it serves. What matters is that the components are present, readable, and versioned, so the harness can do the work the rest of the paper has described.
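As one possible aid to keeping the components present and readable, a short check could be run before a supervision meeting. This is a sketch under assumptions: the function name `missing_components` is hypothetical, the paths mirror the illustrative folder in this appendix, and "present" is interpreted narrowly as "exists and is not empty". It does not check versioning, which would depend on whatever version-control arrangement the researcher uses.

```python
from pathlib import Path

# The seven components named in the paper, mapped to the hypothetical
# file and folder names used in this appendix.
COMPONENTS = {
    "knowledge base": "knowledge-base",
    "interpretive permissions": "interpretive-permissions.md",
    "tools": "tools.md",
    "authority": "authority.md",
    "scope register": "scope-register.md",
    "process record": "process-record.md",
    "amendment protocol": "amendment-protocol.md",
}


def missing_components(root: Path) -> list[str]:
    """Return the names of components that are absent or empty under root."""
    missing = []
    for name, relative in COMPONENTS.items():
        path = root / relative
        if path.is_dir():
            # A folder counts as present only if it contains something.
            present = any(path.iterdir())
        elif path.is_file():
            # A file counts as present only if it has non-whitespace text.
            present = bool(path.read_text(encoding="utf-8").strip())
        else:
            present = False
        if not present:
            missing.append(name)
    return missing
```

An empty list from `missing_components(Path("harness"))` says only that each component has some content; whether that content is adequate is a matter for the researcher and supervisor to judge.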
References
Gerlich, M. (2025). AI tools in society: impacts on cognitive offloading and the future of critical thinking. Societies, 15(1), 6. https://doi.org/10.3390/soc15010006
Jensen, L. X., Bearman, M., Boud, D., & Konradsen, F. (2025). Feedback encounters in doctoral supervision: the role of generative AI chatbots. Assessment & Evaluation in Higher Education. https://doi.org/10.1080/02602938.2025.2478155
Lodge, J. M., & Loble, L. (2026). Artificial intelligence, cognitive offloading and implications for education. University of Technology Sydney. https://doi.org/10.71741/4PYXMBNJAQ.31302475
Lopopolo, R. (2026, February 13). Harness engineering: leveraging Codex in an agent-first world. OpenAI Engineering Blog. https://openai.com/index/harness-engineering/
Nikolic, S., Sandison, C., Haque, R., Daniel, S., Grundy, S., Belkina, M., Lyden, S., Hassan, G. M., & Neal, P. (2024). ChatGPT, Copilot, Gemini, SciSpace and Wolfram versus higher education assessments: an updated multi-institutional study of the academic integrity impacts of generative artificial intelligence (GenAI) on assessment, teaching and learning in engineering. Australasian Journal of Engineering Education, 29, 1–18. https://doi.org/10.1080/22054952.2024.2372154
Perkins, M., Furze, L., Roe, J., & MacVaugh, J. (2024). The Artificial Intelligence Assessment Scale (AIAS): a framework for ethical integration of generative AI in educational assessment. Journal of University Teaching and Learning Practice, 21(8). https://doi.org/10.53761/q3azde36
Roe, J. (2025). How to use generative AI in educational research. Cambridge University Press. https://doi.org/10.1017/9781009675338
Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S. R., Cheng, N., Durmus, E., Hatfield-Dodds, Z., Johnston, S. R., Kravec, S., Maxwell, T., McCandlish, S., Ndousse, K., Rausch, O., Schiefer, N., Yan, D., Zhang, M., & Perez, E. (2023). Towards understanding sycophancy in language models [Preprint]. arXiv:2310.13548. https://arxiv.org/abs/2310.13548
Vendrell, M., & Johnston, S.-K. (2026). Scaffolding critical thinking with generative AI: design principles for integrating large language models in higher education. Computers and Education: Artificial Intelligence, 9, 100572. https://doi.org/10.1016/j.caeai.2026.100572
Walton, J., Bearman, M., Crawford, N., Tai, J., & Boud, D. (2025). How university students work on assessment tasks with generative artificial intelligence: matters of judgement. Assessment & Evaluation in Higher Education. https://doi.org/10.1080/02602938.2025.2570328
Yan, L., Pammer-Schindler, V., Mills, C., Nguyen, A., & Gašević, D. (2025). Beyond efficiency: empirical insights on generative AI’s impact on cognition, metacognition and epistemic agency in learning. British Journal of Educational Technology. https://doi.org/10.1111/bjet.70000
Yusuf, A., Pervin, N., & Román-González, M. (2024). Generative AI and the future of higher education: a threat to academic integrity or reformation? Evidence from multicultural perspectives. International Journal of Educational Technology in Higher Education, 21(1), 1–24. https://doi.org/10.1186/s41239-024-00453-6