Elizabeth Watkins, Emanuel Moss, and Ramesh Manuvinakurike are research scientists at Intel Labs specializing in human-AI collaboration.
Highlights
- Intel Labs research published at the ACM CHI 2025 Human-Centered Explainable AI Workshop found that multi-agent artificial intelligence (AI) systems create novel challenges for explainability, transparency, and trustworthiness.
- Chain-of-thought (CoT) reasoning may create “explanations without explainability” — text that appears to explain the inner workings of agentic AI pipelines, but is actually erroneous or misleading, and not an insight into reasoning at all.
- Surprisingly, the research found that reasoning-enabled LLM agents may perform worse than simpler LLMs.
A recent study from Intel Labs found that multi-agent AI systems can create new challenges in explainability, reliability, and trustworthiness. When examining the chain-of-thought reasoning of agentic systems, our research team found that while these systems produce convincing, logical-sounding output explaining how they arrived at a conclusion, the underlying reasoning process may be erroneous and misleading. In Thoughts without Thinking: Reconsidering the Explanatory Value of Chain-of-Thought Reasoning in LLMs through Agentic Pipelines, published at the ACM CHI 2025 Human-Centered Explainable AI Workshop in Japan, we examined reasoning in agentic pipelines. Our study reveals how some strategies for fostering trust, like chain-of-thought reasoning, can paradoxically hinder it.
Today’s most cutting-edge AI systems are already moving beyond single large language models (LLMs) towards multi-agent systems. These systems are composed of individual LLMs built as agents with specialized meta-prompts containing behavioral instructions, then tied together into pipelines where the output of one agent becomes the input of another. Each LLM agent has a specialized role: some perceive inputs, some are planners, some check retrieval-augmented generation (RAG) documents, and one even performs a safety check. Queries, data, inputs, and outputs produced by each agent cascade through these pipelines. With little human oversight over each agent in the line, questions of trust and transparency take on new urgency. While these agentic pipelines promise more thorough, comprehensive, and grounded outputs, they also introduce novel challenges.
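As a rough illustration of this chaining pattern, here is a minimal Python sketch in which each agent is an LLM call with its own specialized system prompt and the output of one agent feeds the next. The Agent class, the call_llm stub, and the agent prompts are illustrative placeholders rather than the actual MARIE pipeline.

```python
from dataclasses import dataclass


def call_llm(system: str, user: str) -> str:
    # Placeholder: swap in a real chat-completion call to your model-serving endpoint.
    return f"[{system[:24]}...] response to: {user[:60]}"


@dataclass
class Agent:
    name: str
    system_prompt: str  # the "meta prompt" carrying this agent's behavioral instructions

    def run(self, message: str) -> str:
        return call_llm(system=self.system_prompt, user=message)


PIPELINE = [
    Agent("perception", "Describe the user's query and any multimodal context."),
    Agent("planner", "Break the request into steps and decide which agents or tools to call."),
    Agent("retriever", "Ground the plan against retrieval-augmented generation (RAG) documents."),
    Agent("safety", "Review the draft answer and flag unsafe or unsupported content."),
]


def run_pipeline(user_query: str) -> str:
    message = user_query
    for agent in PIPELINE:
        # Each handoff: the previous agent's output becomes the next agent's input.
        message = agent.run(message)
    return message


print(run_pipeline("What tool should I use for this step?"))
```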
Chain-of-Thought Does Not Necessarily Equal Explainability
Chain-of-thought reasoning has been touted as a way to bring explainability and transparency to generative AI-based systems. CoT means instructing LLMs to write out their reasoning, the pathway they take towards an output, much like asking a student to think out loud while working toward an answer. In theory, this kind of transparency should help users trust system outputs, since they can see and verify the steps that led the system to its conclusion. But in real-world agentic pipelines, where multiple LLMs receive content, reason about it, and then generate new content, CoT still needs work before it can become the panacea that’s been promised.
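For concreteness, here is a hypothetical example of the same question posed with and without a chain-of-thought instruction; the prompt wording is illustrative, not the phrasing used in our study.

```python
QUESTION = "What tool should I use for this step?"

# Direct prompt: ask for the answer only.
direct_prompt = f"Answer concisely: {QUESTION}"

# CoT prompt: ask the model to write out its reasoning before the answer.
cot_prompt = (
    "Think step by step and write out your reasoning before giving a final answer.\n"
    f"Question: {QUESTION}\n"
    "Reasoning:"
)

# A CoT response interleaves reasoning text with the answer, e.g.
#   "Step 1: the procedure calls for ... Step 2: therefore ... Final answer: ..."
# The key caveat: this text reads like an explanation, but it is generated text,
# not a faithful trace of the model's internal computation.
```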
On our Intel Labs team, we’re building MARIE, a task guidance system designed to support semiconductor fabrication technicians in their manual, hands-on work. In this setting we put agentic architecture to the test, conducting experiments to evaluate how CoT works in real-world scenarios. We assembled task-based questions (for example, “What tool should I use for this step?”) and organizational/social queries (for example, “Who do I contact for IT help?”) sourced from interviews with actual technicians working in our fabs. We used these questions as a type of “participatory benchmark” to evaluate whether CoT reasoning led to more useful answers for technicians, and/or to clear explanations of how the system arrived at its outputs. For our experiment, we used Assembly101, a task-based dataset built around hands-on physical toy construction, as a stand-in for fab data to test how well CoT can support fab technicians.
Figure 1. A sample agentic flow depending on the type of input query from the user. Once the question is ingested, the lead planner generates a plan that includes a sequence of agentic calls. The output of each agent in the flow is shown here.
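A minimal sketch of the evaluation loop described above might look like the following, assuming each benchmark question is tagged by type and scored by a judge function (human or LLM). The field names, the score scale, and the two sample questions are illustrative, not the study's actual rubric.

```python
from statistics import mean

# Questions tagged by type, mirroring the two categories drawn from technician interviews.
benchmark = [
    {"type": "task", "question": "What tool should I use for this step?"},
    {"type": "org-soc", "question": "Who do I contact for IT help?"},
]


def evaluate(pipeline, judge, questions):
    """Run every benchmark question through the pipeline and average the judge's scores per type."""
    scores = {"task": [], "org-soc": []}
    for item in questions:
        answer = pipeline(item["question"])
        scores[item["type"]].append(judge(item["question"], answer))  # e.g., a 1-5 rating
    return {qtype: mean(vals) for qtype, vals in scores.items() if vals}
```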
Surprising Results: Explanations without Explainability
In many cases, CoT gave us explanations that sounded plausible but were actually incorrect or misleading. We call these “explanations without explainability.” For example, one system described toy dump truck components using terms such as clutch and transmission, which are relevant to actual full-sized vehicles. By completely missing the context of toys and play, this misunderstanding led the agent to erroneously redefine the scope, details, and task at hand. One hypothesis is that the system’s CoT reasoning was compelled to keep producing text until it filled its context window, which encouraged the production of superfluous, incorrect, and misleading information that might not have been generated otherwise. Another hypothesis points toward the tendency of LLMs to fall victim to the Einstellung paradigm, in which a focus on familiar or common approaches directs us (or our language-model counterparts) away from the right strategy. One version of this is the fallacy: “Most cars that won’t start have dead batteries. Therefore, if the car won’t start, replace the battery.” We saw this type of reasoning repeatedly when we ran our participatory benchmark through the agentic pipeline.
These CoT shortcomings are serious obstacles to establishing trust in human-AI collaboration. Everyday users will inevitably investigate how AI systems arrive at an output, whether out of simple curiosity or because of hallucinations or errors in the results. If they find CoT rationales that are misleading or incorrect, not only will their trust be damaged, but that confidence in the AI will be difficult to restore.
Agentic Pipelines: Power Demands Careful Oversight
The multi-agent pipeline is a sophisticated architecture that combines perception, planning, and action agents into a team of LLMs. The system handles tasks and leverages external resources: it interprets multimodal and visual inputs, checks relevant documents, and even reviews responses for safety via a designated agent. CoT reasoning shines a light on each of these agents, compelling each one to explain what it “said” to the next agent in the chain.
This setup should enhance transparency: each agent’s outputs are put on display, which should reveal key information about what the system as a whole is producing. But in practice, the handoffs between agents resemble a game of telephone. Misinterpretations, errors, and misunderstandings cascade through the system and grow worse as they build towards the final output. With each link in the chain dependent on LLM-generated output, CoT rationales can be especially muddying, as errors surface and compound in surprising and hard-to-detect ways.
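One simple way to make those handoffs inspectable, sketched below under the same assumptions as the pipeline example above, is to record every inter-agent message so a reviewer can replay the chain and see where a misinterpretation first entered. This is an illustrative pattern, not the instrumentation used in our experiments.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffTrace:
    hops: list = field(default_factory=list)

    def record(self, agent_name: str, incoming: str, outgoing: str) -> None:
        self.hops.append({"agent": agent_name, "in": incoming, "out": outgoing})


def run_pipeline_with_trace(agents, user_query: str):
    """Run the chain while logging every handoff so a reviewer can see where an error first entered."""
    trace = HandoffTrace()
    message = user_query
    for agent in agents:  # agents follow the Agent interface from the earlier sketch
        output = agent.run(message)
        trace.record(agent.name, incoming=message, outgoing=output)
        message = output
    return message, trace  # inspect trace.hops hop by hop, like replaying the game of telephone
```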
The Illusion of Thought: CoT Hinders Explainability in Three Ways
In addition to a quantitative analysis of how the agentic pipeline performed on questions from the participatory benchmark dataset, we conducted a qualitative content analysis of problematic outputs alongside our LLM-as-a-judge approach. In this analysis, a theme emerged: CoT creates an illusion of thoughtful reasoning. Even when outputs are completely wrong, the model’s wordy explanations can persuade readers that the answer is accurate simply because it sounds plausible.
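For readers unfamiliar with the LLM-as-a-judge pattern, a minimal sketch follows; the rubric wording and the 1-to-5 scale are assumptions for illustration, not the exact instrument used in the study.

```python
JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Rate the answer from 1 (wrong or unhelpful) to 5 (accurate, helpful, comprehensive).
Reply with the number only."""


def llm_judge(question: str, answer: str, call_llm) -> int:
    """Ask a grader model for a 1-5 score; fall back to 0 if the reply contains no digit."""
    reply = call_llm(system="You are a strict, consistent grader.",
                     user=JUDGE_PROMPT.format(question=question, answer=answer))
    digits = [ch for ch in reply if ch.isdigit()]
    return int(digits[0]) if digits else 0
```

A judge like this can be passed into the evaluation loop sketched earlier alongside human reviewers, which is how the quantitative and qualitative views complement each other.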
In one example, the system incorrectly stated that it could not respond in multiple languages, even though test prompts showed that it could. The CoT not only insisted on this incorrect claim but went further by justifying its position. This illustrates three ways that a system’s CoT can hinder explainability:
- The system takes as fact something that is not true, resulting in a misleading explanation.
- It makes a logical error known as hasty generalization based on what’s easy to parse in its accessible documents, resulting in a partial explanation.
- This then requires users to navigate extra text from the CoT material and analyze it to find accurate or useful information.
This plausible-sounding CoT, an illusion of thought, is dangerous for trustworthiness. It may exacerbate an ongoing trust issue people have with LLMs: a convincing, smart-sounding system that gets things wrong may corrode not only users’ trust in the system but also their confidence in how they use that system.
One of the most striking findings, drawn from the quantitative analysis but validated by the qualitative analysis, was that non-CoT models outperformed their reasoning-enabled counterparts. Responses from models like DeepSeek, which emphasize CoT generation, were often less accurate, helpful, or comprehensive than those from simpler LLMs.
Figure 2. Reviewer scores for answers as rated by humans and by the LLM-as-a-judge on task and organizational/social (Org-Soc) questions. The reviewer scores for the non-reasoning models are better than those of their reasoning-enabled (DeepSeek) counterparts, and the thought reviewer scores are only weakly correlated with the answer reviewer scores.
Towards Human-Centered Trust
This research reveals how important it is to look carefully at explainability and to think through what real people find trustworthy, reliable, and transparent in their real-world tasks. Our creation of a participatory benchmarking dataset and our use of both human experts and an LLM-as-a-judge were critical to our findings, and they revealed the complexity of evaluating agentic outputs. We need diverse evaluators, including real humans and domain experts, and context-aware benchmarks that reflect real-world use.