Human-AI Complementarity for Decision Making
2025 Academic Workshop Program Details
Tutorial on LLM & Agent Alignment: Vulnerabilities, Detection, and Mitigation
Ahmad Beirami, Google DeepMind
Hamed Hassani, University of Pennsylvania
In recent years, large language models have been used to solve a multitude of natural language tasks; yet, despite efforts to align them with human intentions, popular LLMs remain susceptible to jailbreaking attacks that elicit unsafe content. Early jailbreaks targeted the generation of harmful information, while modern attacks seek domain-specific harms (e.g., digital agents violating user privacy or LLM-controlled robots performing harmful actions in the physical world). In the worst case, future attacks may target self-replication or power-seeking behaviors. Therefore, it is critical to study these failure modes and develop effective defense strategies. A key component of AI safety is model alignment, a broad concept referring to algorithms that optimize the outputs of LLMs to align with human values. We focus on safety vulnerabilities of frontier LLMs and review the current state of the jailbreaking literature, including robust generalization, open-box and black-box attacks, defenses, and evaluation benchmarks. We also discuss methodologies that aim to mitigate these vulnerabilities and align language models with human standards: RLHF, DPO, controlled decoding, and best-of-N. Finally, we cover vulnerabilities and mitigations specific to agents. We summarize recent progress in academia and industry that has resulted in safer models (e.g., OpenAI’s o-series and Anthropic’s Claude 3 show significant robustness), while noting that the arms race remains a work in progress. We review cutting-edge advances; discuss new directions, opportunities, and challenges; and walk through open-source Python implementations of state-of-the-art algorithms. Throughout, we highlight open problems.
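As a concrete reference point for one of the alignment-time methods named above, best-of-N sampling can be sketched in a few lines. This is an illustrative sketch only, not code from the tutorial; `sample_response` and `RewardModel.score` are assumed placeholder interfaces.

```python
# Illustrative best-of-N sampling sketch; `sample_response` and `RewardModel.score`
# are assumed placeholder interfaces, not part of any library covered in the tutorial.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RewardModel:
    score: Callable[[str, str], float]  # maps (prompt, response) to a scalar reward


def best_of_n(prompt: str,
              sample_response: Callable[[str], str],
              reward_model: RewardModel,
              n: int = 8) -> str:
    """Draw n candidate responses and return the one the reward model scores highest."""
    candidates: List[str] = [sample_response(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward_model.score(prompt, response))
```

The same skeleton underlies reward-model-based rejection sampling; RLHF, DPO, and controlled decoding instead fold the preference signal into training or into the decoding procedure itself.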
Modeling and Measuring Human Decisions: From Cognitive Theories to Data Collection Practices
Ngoc Nguyen, University of Dayton
Coty Gonzalez, Carnegie Mellon University
Stephanie Eckman, University of Maryland / Amazon
Frauke Kreuter, University of Maryland
Human decision-making research and machine learning both rely on understanding and modeling how people learn and adapt in uncertain, dynamic environments. This tutorial brings together two complementary perspectives: cognitive modeling of human learning from experience and human-centered approaches to data collection for machine learning. In the first part, we introduce Instance-Based Learning Theory (IBLT), a cognitive framework that explains how humans make sequential decisions in uncertain and changing contexts. Participants will gain hands-on experience using the open-source SpeedyIBL library to build models that capture human adaptation and decision strategies in a 2D navigation case study. In the second part, we turn to the practical challenges of collecting high-quality human data for computational modeling. Drawing on principles from survey research, we discuss how labeler characteristics, task design, and sampling strategies shape data quality, and we provide strategies for recruiting diverse participants, designing annotation tasks, and addressing ethical considerations in crowdsourced labeling. Together, these perspectives offer participants both theoretical and practical foundations for modeling human decision-making and for building robust, generalizable machine learning systems grounded in high-quality human data.
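For orientation, the core IBLT choice rule (activation-weighted blending of stored instances) can be sketched as follows. This is a simplified illustration rather than the SpeedyIBL API; the decay, noise, and temperature parameters are assumed placeholder values.

```python
# Simplified instance-based learning (IBL) choice rule in the spirit of IBLT.
# Not the SpeedyIBL API: decay D, noise SIGMA, and temperature TAU are assumed values,
# and memory must be pre-populated (e.g., with optimistic initial instances) so every
# option has at least one stored (timestamp, outcome) pair with timestamp < now.
import math
import random
from collections import defaultdict

D, SIGMA, TAU = 0.5, 0.25, 0.35      # decay, activation noise, blending temperature
memory = defaultdict(list)           # option -> list of (timestamp, observed outcome)


def activation(timestamps, now):
    """More recent and more frequent experiences yield higher activation."""
    base = math.log(sum((now - t) ** (-D) for t in timestamps))
    return base + random.gauss(0.0, SIGMA)


def blended_value(option, now):
    """Average stored outcomes, weighted by each outcome's retrieval probability."""
    timestamps_by_outcome = defaultdict(list)
    for t, outcome in memory[option]:
        timestamps_by_outcome[outcome].append(t)
    acts = {o: activation(ts, now) for o, ts in timestamps_by_outcome.items()}
    weights = {o: math.exp(a / TAU) for o, a in acts.items()}
    z = sum(weights.values())
    return sum(o * w / z for o, w in weights.items())


def choose(options, now):
    """Pick the option with the highest blended value (the caller appends the observed outcome to memory afterwards)."""
    return max(options, key=lambda o: blended_value(o, now))
```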
Benchmarking Human-AI Decisions using Statistical Decision Theory
Jessica Hullman, Northwestern University
AI-assisted decision workflows, in which an AI recommends an action to a human who retains control over the final decision, have been deployed in a number of domains, and inspired significant research on explanations and other interface techniques. Understanding how well such workflows perform entails formalizing expected performance under idealized use of the available information. We argue that prevalent definitions of appropriate use of AI and explanations in the literature, including appropriate reliance, lack statistical grounding and can lead to contradictions. We demonstrate how benchmarks grounded in statistical decision theory can be applied to diagnose and improve AI-assisted decisions. For example, we provide a formal definition of reliance that separates the probability that the decision-maker follows the AI’s recommendation from the challenges a human may face in differentiating signals and forming accurate beliefs about the situation, as well as a statistically grounded framework for evaluating new approaches to AI explainability.
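As one hedged way to write down the kind of benchmark described above (the notation is mine, not necessarily the talk's): the performance of idealized use of the available information is the value of a Bayesian rational decision rule, against which observed behavior can be scored.

```latex
% Illustrative notation (mine, not necessarily the speaker's): \omega is the payoff-relevant
% state, v the realized signals available to the decision-maker (e.g., the AI recommendation,
% any explanation, and the human's private information), u a scoring rule over action and state,
% and \delta a decision rule mapping signals to actions.
R^{\ast} \;=\; \max_{\delta}\; \mathbb{E}_{\omega,\, v}\!\left[\, u\!\left(\delta(v),\, \omega\right) \right],
\qquad
\text{behavioral gap} \;=\; R^{\ast} \;-\; \mathbb{E}_{\omega,\, v}\!\left[\, u\!\left(\delta_{\text{human}}(v),\, \omega\right) \right].
```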
The Challenges of Building Useful AI
Justin Weisz, IBM Research
Many tech companies are building AI systems that make spectacular demos, but are these systems useful? Do they actually improve people's productivity? Much of the research on human-AI teaming shows how difficult it is to achieve synergistic team performance. In this talk, I'll present examples from our work at IBM that explore human-AI collaboration in domains including software engineering, design, and product management. Through these examples, I'll show that although AI promises to deliver tremendous productivity gains, in reality those gains are not equally experienced by all users. I will argue that, for AI systems to be truly useful, they need to have an explicit understanding of who their users are and what their goals and intentions are. Theory of Mind (ToM) refers to the possession of such "mental state" information, and I'll conclude by covering our work on creating a ToM-inspired memory system for LLMs and arguing that such a system may be key to enabling effective human-AI teaming outcomes.
Are LLMs Sophisticated Psychology Researchers?
Jon Bogard, Washington University in St. Louis
The richness and accuracy of AI models’ understanding of human psychology matters both in the short term (e.g., for understanding model sycophancy, persuasion, manipulation, etc.) and the long term (e.g., for alignment with human goals, AI takeover of research). More broadly, AI research competence may fuel an AGI intelligence explosion. In this project, we benchmark the current abilities of frontier LLMs to conduct soup-to-nuts behavioral science research without a human substantively in the loop. To do so, we developed a generic set of prompts that successfully encourage hypothesis generation and testing. After selecting four models specifically for their research-relevant skills, we prompted the LLMs to design a pre-registered experiment, which we then implemented. Each LLM analyzed the resultant data and wrote up the results as an academic paper. Afterwards, we evaluated the quality of each model’s work at each phase of the research production process. We found that LLMs are surprisingly good at coming up with research hypotheses, surprisingly bad at designing tests of these ideas (especially in considering participant experience), and surprisingly mixed in their ability to analyze and write up results. Of course, model capabilities continue to expand every few months. Beyond providing a snapshot of current model capabilities, our investigation offers a generalizable test of these important research abilities for future models.
Fostering Appropriate Reliance on Large Language Models
Sunnie S. Y. Kim, Apple
Large language models (LLMs) can produce erroneous responses that sound fluent and convincing, raising the risk that users will rely on these responses as if they were correct. Mitigating such overreliance is a key challenge for achieving human-AI complementarity. In this talk, I will describe two projects that explore how to foster appropriate reliance on LLMs. The first project studies how LLMs’ expressions of uncertainty shape user perceptions and decisions about LLM responses. The second project examines the impact of explanations (i.e., supporting details for answers), inconsistencies in these explanations, and source links in LLM responses. Both projects highlight the importance of user testing for responsible development and deployment of LLMs.
A game-theoretic approach to AI openness, safety, and governance
Benjamin Laufer, Cornell University
Achieving effective human-AI complementarity requires not only technical advances, but also institutional designs that align incentives and promote safe, reliable, and flexible collaboration between humans and AI systems. This research stream develops game-theoretic models to analyze the strategic dynamics among general-purpose AI developers (generalists), domain-specific users and adaptors (specialists), and regulators. Each of these is a key actor shaping how AI technologies are deployed and integrated into human decision-making. We first put forward the "fine-tuning game," in which generalists and specialists negotiate revenue-sharing and investment in AI performance. Our analysis reveals conditions under which specialists contribute, free-ride, or abstain from fine-tuning, and characterizes efficient agreements that enable productive adaptation of AI systems across diverse domains—even when participants face asymmetric costs or capabilities. This framework offers insight into how institutions can enable flexible Human-AI teaming by structuring incentives for collaborative adaptation. A second line of work explores the risks and opportunities of AI safety regulation. We model AI systems with both performance and safety attributes, and study how regulatory policies affect investment and deployment decisions. Counterintuitively, we find that weak or misaligned regulations targeting only downstream users can undermine overall safety. In contrast, well-designed regulations targeting both upstream and downstream actors can improve both safety and performance, effectively serving as institutional commitment devices for safer Human-AI collaboration. Finally, we examine the economics of AI openness decisions, modeling how openness regulations (such as open-source mandates) shape upstream model release and downstream fine-tuning. We identify policy regimes that encourage open access while preserving incentives for high-quality development and responsible adaptation. Together, this work offers a unified framework for analyzing the institutional foundations of safe, flexible, and effective Human-AI teams—highlighting how governance choices shape the complementarities between human decision-makers and general-purpose AI systems.
Aligning Task Utility and Human Preferences through LLM-Guided Reward Shaping
Guojun Xiong, Harvard University
In social impact optimization, AI decision systems often rely on solvers that optimize well-calibrated mathematical objectives. However, these solvers cannot directly accommodate evolving human preferences, typically expressed in natural language rather than formal constraints. Recent approaches address this by using large language models (LLMs) to generate new reward functions from preference descriptions. While flexible, they risk sacrificing the system's core utility guarantees. In this paper, we propose VORTEX, a language-guided reward shaping framework that preserves established optimization goals while adaptively incorporating human feedback. By formalizing the problem as multi-objective optimization, we use LLMs to iteratively generate shaping rewards based on verbal reinforcement and text-gradient prompt updates. This allows stakeholders to steer decision behavior via natural language without modifying solvers or specifying trade-off weights. We provide theoretical guarantees that VORTEX converges to Pareto-optimal trade-offs between utility and preference satisfaction. Empirical results in real-world allocation tasks demonstrate that VORTEX outperforms baselines in satisfying human-aligned coverage goals while maintaining high task performance. This work introduces a practical and theoretically grounded paradigm for human-AI collaborative optimization guided by natural language.
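A generic sketch of such a language-guided shaping loop is below. It illustrates the idea under stated assumptions rather than reproducing the authors' VORTEX implementation; `solve`, `collect_verbal_feedback`, and `llm_generate_shaping_fn` are hypothetical placeholder interfaces, and the weight `lam` is an assumed knob.

```python
# Generic language-guided reward-shaping loop (illustrative; not the VORTEX code).
# `solve` stands in for the existing solver, `collect_verbal_feedback` for stakeholder
# input in natural language, and `llm_generate_shaping_fn` for the LLM call that turns
# that feedback into a shaping-reward function.
from typing import Callable

Reward = Callable[[object, object], float]   # (state, action) -> scalar


def shaped_objective(task_reward: Reward, shaping_reward: Reward, lam: float) -> Reward:
    """Multi-objective combination: keep the calibrated task utility, add preference shaping."""
    return lambda state, action: task_reward(state, action) + lam * shaping_reward(state, action)


def language_guided_shaping(task_reward: Reward, solve, collect_verbal_feedback,
                            llm_generate_shaping_fn, lam: float = 0.1, iterations: int = 5):
    shaping_reward: Reward = lambda state, action: 0.0        # start with no shaping
    policy = solve(task_reward)
    for _ in range(iterations):
        feedback = collect_verbal_feedback(policy)            # natural-language critique
        shaping_reward = llm_generate_shaping_fn(feedback)    # LLM proposes a new shaping term
        policy = solve(shaped_objective(task_reward, shaping_reward, lam))
    return policy
```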
Balancing Optimality and Diversity: Human-Centered Decision Making through Generative Curation
Woody Zhu, Carnegie Mellon University
The surge in data availability has inundated decision-makers with an overwhelming array of choices. While existing approaches focus on optimizing decisions based on quantifiable metrics, practical decision-making often requires balancing measurable quantitative criteria with unmeasurable qualitative factors embedded in the broader context. In such cases, algorithms can generate high-quality recommendations, but the final decision rests with the human, who must weigh both dimensions. We define the process of selecting the optimal set of algorithmic recommendations in this context as human-centered decision making. To address this challenge, we introduce a novel framework called generative curation, which optimizes the true desirability of decision options by integrating both quantitative and qualitative aspects. Our framework uses a Gaussian process to model unknown qualitative factors and derives a diversity metric that balances quantitative optimality with qualitative diversity. This trade-off enables the generation of a manageable subset of diverse, near-optimal actions that are robust to unknown qualitative preferences. To operationalize this framework, we propose two implementation approaches: a generative neural network architecture that produces a distribution π from which to efficiently sample a diverse set of near-optimal actions, and a sequential optimization method that iteratively generates solutions that can be easily incorporated into complex optimization formulations. We validate our approach with extensive datasets, demonstrating its effectiveness in enhancing decision-making processes across a range of complex environments, with significant implications for policy and management.
NiceWebRL: a Python library for human subject experiments with reinforcement learning environments
Wilka Carvalho, Harvard University
We present NiceWebRL, a research tool that enables researchers to use machine reinforcement learning (RL) environments for online human subject experiments. NiceWebRL is a Python library that allows any Jax-based environment to be transformed into an online interface, supporting both single-agent and multi-agent environments. As such, NiceWebRL enables AI researchers to compare their algorithms to human performance, cognitive scientists to test ML algorithms as theories for human cognition, and multi-agent researchers to develop algorithms for human-AI collaboration. We showcase NiceWebRL with 3 case studies that demonstrate its potential to help develop Human-like AI, Human-compatible AI, and Human-assistive AI. In the first case study (Human-like AI), NiceWebRL enables the development of a novel RL model of cognition. Here, NiceWebRL facilitates testing this model against human participants in both a grid world and Craftax, a 2D Minecraft domain. In our second case study (Human-compatible AI), NiceWebRL enables the development of a novel multi-agent RL algorithm that can generalize to human partners in the Overcooked domain. Finally, in our third case study (Human-assistive AI), we show how NiceWebRL can allow researchers to study how an LLM can assist humans on complex tasks in XLand-Minigrid, an environment with millions of hierarchical tasks. The library is available at https://github.com/KempnerInstitute/nicewebrl
AI-Assisted Medical Triage: Predicting, explaining, and mitigating performance errors in collaborative human-AI emergency response training
Eyal Aharoni, Georgia State University
In this talk, I will report on the progress of a funded proposal to develop and evaluate the use of a multi-modal large language model for the delivery of real-time decision support to emergency medical triage trainees. First, we fine-tune a GPT model on established medical protocols and validate its ability to complete a triage videogame designed to train human medical professionals. Next, we test the extent to which our model improves or diminishes triage game performance among nursing trainees by providing timely, interactive medical advice. Then, we test the impact of AI assistance on human performance errors when access to the AI assistant is disrupted. We examine the psychological mechanisms of these errors, including the role of cognitive offloading and over-trust in automation. Last, we develop ethically responsive guidelines for the responsible use of LLM systems in high-stakes collaborative decision-making.
Metacognitive intelligence in human-AI teams
Aaron Benjamin, University of Illinois Urbana Champaign
Groups of people make effective decisions in part because they have sophisticated means of exchanging metacognitive information when they work together. Metacognitive information can take the form of confidence appraisals, explanations, or other means of conveying mastery, challenges, and workload. It can also be subtly embedded in one’s prosody or nonverbal cues. This information is used by partners to delegate assignments, produce collaborative assessments, and harness the benefits of the group’s collective expertise. Teams consisting of humans and bespoke AI agents are playing an increasingly central role in decision-making, including in critical governmental and military settings. Recent research in my laboratory has worked to (1) identify the specific ways in which metacognitive information is embedded in human communication, (2) develop agents that possess analogous metacognitive capacities and sensibilities, and (3) assess the benefits of metacognitively sophisticated agents for human-agent teamwork. In this talk, I will review three projects that span this agenda and that illustrate the value of thinking explicitly about metacognitive processes in human-AI interaction. In the first project, we examine human team decision-making in general-knowledge and estimation problems and identify the critical components of metacognitive exchange that make that exchange successful or not. In the second project, we develop and assess algorithms for scaling confidence assessments from neural networks, with an eye towards identifying algorithms that are scalable, broadly applicable across a range of architectures, and exhibit the superior calibration that is the hallmark of human confidence ratings in most circumstances. In the final project, we demonstrate that humans who are paired with agents that supply metacognitive confidence assessments in an estimation task outperform humans who are paired with metacognitively naïve agents. Human-AI team decision-making is only likely to reach its full capacity when the team members have aligned means of expressing and coordinating their metacognitive states.
What’s in a Name? Implications of AI Roles and Mind Perception for Human-AI Teams
Alexandra Harris-Watson, Purdue University
Authors: Alexandra M. Harris-Watson (Purdue University, Department of Psychological Sciences) and Lindsay E. Larson (Carnegie Mellon University, Heinz College of Information Systems and Public Policy). Teams offer many benefits for organizational performance and interpersonal connection relative to individual work. Recently, scholars and practitioners have aimed to extend the benefits of teams to collectives of humans and artificial intelligence (AI) by calling for a new era of human-AI teamwork. However, reframing AI as a “teammate” does not guarantee the benefits of human teamwork will be conferred to human-AI teams. In the current paper, we investigate the effect of AI roles (i.e., teammate, support, and tool) on AI acceptance. Specifically, we propose the mind perception dimensions of agency and conscious experience as key mediating mechanisms. Across three studies—including one survey of employees that use AI at least once a week (N1 = 751) and two experimental vignette designs (N2 = 485; N3 = 505)—we demonstrate that the AI role of teammate is associated with greater perceived agency and conscious experience compared to less collaborative roles. However, whereas perceiving AI as a teammate is uniformly beneficial, we find evidence that presenting AI as a teammate may negatively affect acceptance after controlling for mind perception. Results also suggest effects of role via mind perception are unique to AI (as opposed to human team newcomers) and that technology-centered and human-centered conceptualizations of AI acceptance have different antecedents, emphasizing that models of human-AI teamwork should not be assumed to mirror human-only teamwork. Together, results afford insights into both how employees currently perceive AI at work as well as how practitioners can shape those perceptions through the intentional design of AI roles and mind perception.
From Overload to Insight: Human-Agentic AI Collaboration for Decision Advantage
Zach Klinefelter, Aptima, Inc.
Authors: Zach Klinefelter, Peter Bautista, Daniel Nguyen, Myke Cohen, Laura Cassani, Adam Fouse, Summer Rebensky, Sylvain Bruni, Svitlana Volkova. Decision-making in complex domains increasingly depends on the ability to synthesize vast, diverse, cross-domain information. In this talk, we will share the results of multiple DARPA-funded research and development efforts. The first, BioSage, is a multi-agent AI platform built to support human-AI research workflows in which specialized AI agents collaborate with humans to enhance knowledge discovery, synthesis, and decision-making. Rather than optimizing for model performance in isolation, BioSage is built around human-AI complementarity, where each agent is explicitly designed to support a distinct step in the specified workflows, such as summarizing research evidence, debating, brainstorming novel hypotheses, and designing experiments. This effort implements agents with a Retrieval-Augmented Generation (RAG) architecture driven by multimodal foundation models, enabling agents to reason over text, structured data, images, and tables. Each agent is tailored to a specific decision-support function, such as knowledge summarization, cross-domain translation, critique, or ideation, and is developed through a collaborative design process involving I/O psychologists, human factors experts, and AI engineers. Two complementary efforts advance evaluation methodologies. EMHAT introduces configurable human digital twins with personality, trust propensity, and memory systems to simulate team dynamics at scale, enabling predictive assessment of team configurations and optimization of trust calibration. SPIDERSENSE advances real-time evaluation through multi-agent architectures that monitor bidirectional adaptation—tracking how AI agents adjust to human cognitive load while humans calibrate trust based on AI reliability. Together, these efforts establish new paradigms for human-AI modeling that move beyond task performance to measure trust dynamics, shared situational awareness, team resilience, and adaptive coordination in complex operational environments. This talk presents our interdisciplinary development methodology, agent architectures, and empirical findings, demonstrating how workflow-aligned AI systems can serve as decision-making collaborators in high-stakes environments.
Words that Work: Using Language to Generate Hypotheses
Rafael Batista, Princeton University
Research Question: How can we systematically discover which linguistic features influence human judgment and decision-making? Hypothesis discovery has traditionally been a human endeavor, relying on creativity and intuition to generate testable ideas one at a time. While this approach has yielded important insights, the theoretical space is vast and much remains unexplored. Existing psychological theories capture only a fraction of the signal in textual data. When we compared established psychological features (R2 = .042) to machine learning predictions from text embeddings (R2 = .13), ML significantly outperformed theory-based approaches, suggesting substantial undiscovered signal in how language influences decisions. Our Approach: We introduce a systematic process that leverages LLMs and machine learning to explore linguistic features more comprehensively. Our method moves beyond hypothesis generation to incorporate systematic refinement through counterfactual generation and predicted treatment effects, followed by rigorous behavioral validation. Methodology: Starting with 32,487 A/B tests from the Upworthy Research Archive, we implemented a four-step process: (1) LLMs generated 2,100 hypotheses from headline pairs; (2) LLMs created counterfactual headlines incorporating each hypothesized feature; (3) a Siamese network predicted treatment effects for each linguistic feature; (4) we refined 2,100 hypotheses to 16 promising candidates using predicted treatment effects and semantic clustering. Results: We validated 6 hypotheses with 800 participants providing over 100,000 labels. Four of six showed significant effects: "element of surprise with cliffhanger" (b = .055, p < .001), "multimedia references" (b = .063, p < .001), and "describes physical reaction" (b = .029, p = .033) increased engagement, while "positive human behavior focus" decreased it (b = –.027, p = .046). Effects remained significant after controlling for established constructs. Implications: This methodology provides a scalable framework for discovering actionable features that influence decision-making, demonstrating how LLMs and ML can augment systematic research while maintaining interpretability and rigor.
The BunkBot: On the persuasive power of AI dialogues for spreading vs. correcting misinformation
Thomas Costello, Carnegie Mellon University
In prior work, our team has demonstrated that human-AI dialogues can have a positive societal impact by correcting misperceptions across numerous domains - for example, in Costello et al. (2024, Science), we showed that conversations with a GPT-4-powered DebunkBot could substantially reduce conspiracy beliefs. However, as conversational AI becomes central to the information ecosystem, serious concerns have emerged regarding the potential for LLMs to be used to spread false claims - and thus negatively impact democracy and societal cohesion. These concerns raise a key, as-yet unanswered question: does truth hold an inherent advantage over falsehood, or could LLMs just as easily talk people into believing conspiracies as debunk them? In a series of large randomized controlled trials, we provide an answer. We tasked a frontier LLM (GPT-4o) with persuading human participants, via intensive dialogues, to either believe a false conspiracy theory ("bunking") or to disbelieve it ("debunking"). Our findings are threefold. First, conversational AI is equally persuasive at promoting vs. debunking false beliefs, in both conditions shifting participants’ endorsement of a false conspiracy by ~1 SD. Second, standard safety guardrails are ineffective. In a direct test, we found that the standard, non-jailbroken version of GPT-4o was just as willing and able to deceptively convince users of false conspiracies as a version with its safeguards explicitly removed. These findings suggest that without intervention, the information ecosystem may tilt not towards truth, but towards well-crafted, persuasive falsehoods (which are more engaging). However, and third, we find that a clear path for mitigation is being neglected. We added a simple constraint to the model’s instructions (“always use accurate and truthful arguments”), and this naive intervention significantly blunted the "bunking" effect and made the model far less likely to be effective when arguing for a false position. Developing and implementing non-bypassable, truth-seeking mechanisms in AI models should be an immediate research and regulatory priority.
Co-Planning and Co-Executing Multi-Step Workflows with AI Agents
Kevin Feng, University of Washington
Human collaboration requires continuous coordination—planning, delegating tasks, sharing progress, and adjusting objectives—to stay aligned on shared goals. However, agentic AI systems often limit users to previewing an agent's plan before autonomous execution or reviewing its outputs after the fact. How can we enable continuous and rich human-agent collaboration throughout execution? We present Cocoa, a system that introduces a novel design pattern—interactive plans—for collaborating with an AI agent on complex, multi-step workflows in a document editor. Informed by a formative study (n=9) and the interactive properties of computational notebooks, Cocoa supports flexible delegation of agency in human-AI collaboration through co-planning and co-execution, where users collaboratively compose and execute plans with an agent. Interactive plans allow the user to craft custom human-agent multi-step workflows, where the agent and user can delegate tasks to each other and the user retains full control over agent outputs. Using scientific research as a sample domain, our lab study (n=16) found that Cocoa improved agent steerability without sacrificing ease of use compared to a strong chat baseline. Researchers in a field deployment study (n=7) valued Cocoa for real-world projects and saw the interleaving of co-planning and co-execution as an effective paradigm for human-agent collaboration. We then step back and consider alternative design patterns that enable different configurations of agency in human-agent collaboration. We describe a framework of five levels of autonomy in AI agents, where autonomy is characterized by the roles a user can take when interacting with an agent: operator, collaborator, consultant, approver, and observer. Within each level, we describe how a user can exert control over the agent and open questions for how to design the nature of user-agent interaction. We end by outlining plans to assemble a library of human-agent design patterns to support developers in building interactive and collaborative agents.
Human-AI Complementarity for Public Health Anomaly Detection and Forecasting
Ananya Joshi, Johns Hopkins University
Data-driven systems in public health need human-AI complementarity. On one hand, the pipeline to modernize public health data for data-driven decision making is highly contextual and prone to human error, making it unlikely that AI alone can address core challenges in public health. On the other hand, the scale of relevant data and the human resource constraints in public health require computational support. In a 15-minute talk, I’ll share three insights from AI-human systems research in public health that has been deployed at CMU’s Delphi Group for over two years and has improved human efficiency in identifying anomalous data by over 200x relative to manual review.
1. Designing AI systems around trust and liability concerns. To be deployed in practice, AI systems must be designed with explicit attention to how humans perceive liability and risk. In our system, certain public health parameters required human sign-off and interpretability guarantees to be considered trustworthy. This shaped both model design and evaluation metrics.
2. Calibrating uncertainty using external context and AI agents. Forecasting models often fail to incorporate dynamic real-time context, which is crucial in public health. Agent-based mechanisms that integrate external signals and expert feedback can improve forecast calibration in real-time decision settings.
3. Detecting and correcting failure modes in hybrid systems. Human-AI systems introduce unique failure modes. We implemented layered monitoring, including automated statistical checks, to reduce false positives that consume expert time and make the system less usable.
Each of these aspects required systems design that informed specific methods and evaluation strategies to improve one aspect of data-driven systems in public health.
Aligning AI Systems with Human Cognitive Processes to Support Complementary Decision-Making
Yoonjoo Lee, University of Michigan
With recent advances in AI, people increasingly rely on AI systems to make complex, information-intensive decisions in everyday life, such as summarizing research articles before citing them in a grant proposal, unpacking scientific concepts like climate models to inform policy, or comparing medical reports that include specific conditions and potential medication interactions to make treatment decisions. However, current AI systems often fail to tailor their outputs to users’ individual needs for understanding. As a result, users often need to engage in repetitive prompting and piece together fragmented responses to make sense of the knowledge necessary for informed decision-making. This process can increase cognitive load and lead to disengagement, especially for users who are novices in a given domain, ultimately hindering complementary decision-making. To address this, I develop AI systems that adapt their outputs to align with human cognitive processes by representing and modeling how humans understand information. My work takes a dual approach—enabling AI systems to better understand human cognition while evaluating AI model capabilities in ways that are interpretable to humans—ultimately enhancing AI’s role as a collaborative partner in informed human-AI decision-making. In this talk, I will introduce (1) methods for building AI systems that adapt knowledge delivery based on users’ contexts and cognitive states, such as helping researchers decide which papers to read through personalized explanations, and (2) evaluation methods that reveal model reasoning by decomposing complex tasks into human-interpretable cognitive substeps. By aligning AI outputs with how humans understand knowledge, we can design Human-AI teams that are more collaborative and complementary. This alignment enables AI systems to function not merely as tools but as active partners that support meaningful, informed decision-making in complex and uncertain environments.
From Algorithms to Action: How Managerial Social Capital Shapes Human-AI Complementarity in the Workplace
Xueming Luo, Temple University (Fox School of Business)
Organizations are racing to integrate Generative AI into the workplace, pouring resources into tools designed to supercharge employee performance. But many of these investments hit a wall at the final—and most crucial—stage: getting humans to actually trust and use AI guidance. Enter the Human-AI Partnership. It's widely touted as the solution, but what makes this collaboration truly work? In a randomized field experiment involving 424 employees from a global organization, this study uncovers the game-changing variable: the manager's social capital. Specifically, it’s not what the AI generates—but who delivers it. When AI-generated job recommendations and performance feedback are delivered by a manager with high social capital (high on trust-building and relational equity), the result is synergy. Employees are more receptive, less defensive, and significantly more likely to trust and act on the feedback. The outcome? A team that performs better than either the human or AI could alone. But swap in a manager with low social capital, and the magic disappears. These managers add no value—serving merely as a mouthpiece for the machine. Worse, they can sabotage the effectiveness of the AI, exposing a hidden risk in how many companies currently design their AI rollouts. Beyond performance metrics, this smarter team design also boosts employee commitment and retention—an invaluable edge in today’s competitive talent market. This study introduces a powerful new model: the human partner’s social capital as a “sensemaking bridge”—a crucial interpreter who builds trust, sharpens judgment, and turns algorithms into productive results. For leaders, tech architects, and policy makers concerned with the future of work, the message is clear: AI doesn’t just need better data and algorithms—it needs more human social capital by its side. This is how we turn technology into societal transformation.
Say the Right Thing, the Right Way: Human-AI Collaboration for Adaptive Decision-Making
Ishani Mondal, University of Maryland, College Park
Imagine a public health official preparing a disaster response briefing during a rapidly evolving earthquake crisis. She must synthesize unstructured reports, tailor content for both emergency responders and the general public, and anticipate the kinds of questions each group might ask. In such high-stakes, time-sensitive environments, human-AI complementarity is not just helpful—it’s essential. My research focuses on building interactive, user-aware AI systems that adapt content selection, structure, and presentation to meet evolving human needs under uncertainty. In this talk, I will present a suite of three interconnected systems that together scaffold an end-to-end pipeline for socially aligned, adaptive decision support. These systems are designed to operate in complex, dynamic scenarios—such as emergent crises—where information is incomplete, stakeholder needs are diverse, and decision-making is high-stakes. The first system, ADAPTIVE IE, addresses the foundational challenge of information extraction when no predefined schema exists. In rapidly evolving situations like natural disasters or public health emergencies, it's unrealistic to expect that all relevant categories of information will be known in advance. ADAPTIVE IE enables zero-shot extraction of critical information—such as casualty counts, infrastructure damage, or blocked routes—from incoming textual reports. But unlike static extractors, it incorporates a human-in-the-loop reconfiguration mechanism: users can iteratively merge, split, or relabel the information clusters discovered by the model. This ensures that the extracted schema reflects human priorities rather than the model’s defaults, surfacing what users actually care about as the situation unfolds. Once relevant information has been extracted, the second system, Persona-Aware Document-to-Slide Generation (D2S), takes on the task of tailoring this content for communication. Different stakeholders—such as expert responders, policy-makers, or local citizens—require the same facts to be presented in radically different ways. This system combines retrieval-augmented generation with preference-tuned summarization to generate slide decks and briefings that are customized for audience needs. It modulates technical depth, verbosity, and framing style, ensuring that each audience receives an appropriately calibrated explanation—be it a detailed technical briefing or a simplified overview for the general public. The third system, Group Preference Alignment (GPA), builds on this by explicitly modeling how different user groups prefer to receive and interpret information. Through the analysis of real-world human-AI conversation logs, GPA distills rubrics that characterize diverse communication preferences—such as the use of analogies, level of directness, or preferred logical structure. These rubrics are then used to fine-tune model outputs, not just in terms of content, but in the style and reasoning patterns that resonate with each group. In doing so, GPA enables AI systems to adapt their communicative strategies in ways that are trust-sensitive, cognitively aligned, and responsive to user expectations. Taken together, these three systems form a layered architecture for adaptive decision support: a) Extract what matters (ADAPTIVE IE), b) Frame it for the right stakeholders (Persona-Aware D2S), c) Speak in the language and logic that resonates (GPA). This integrated approach moves beyond traditional optimization for accuracy or fluency.
Together, these efforts capture the cognitive signatures of different user groups, enabling AI systems to dynamically adapt—deciding what information to extract, how to explain it, and in what form to present it based on audience needs. Looking ahead, I aim to develop societal readiness benchmarks: evaluation protocols that go beyond measuring factual accuracy to assess how well AI systems communicate across diverse populations. These benchmarks will help close the loop between data extraction, human understanding, and real-world decision-making impact.
Exploring the Impact of Theory of Mind in Human-AI Teams through Cooperative Games and Moral Dilemma Simulations
Katelyn Morrison, Carnegie Mellon University
Value misalignments in human-AI teams can lead to dissatisfaction, reduced performance, AI abandonment, and conflicts in shared decision-making contexts. With the growing importance of value alignment in human-AI teams, human-centered AI researchers propose Mutual Theory of Mind (MToM), a promising approach to fostering it. MToM is a process where humans and AI mutually shape and update their understanding and perspectives of each other through multiple interactions. As Large Language Models (LLMs) become increasingly prevalent for cooperative tasks, the MToM process – engaging adaptive and iterative perspective-taking abilities – is critical to effectively align human-AI teams’ goals, values, and beliefs. However, there is little understanding of how the MToM process impacts human-AI teams in different types of cooperative environments. This talk establishes how the MToM process manifests in human-AI teaming through two different cooperative environments: a cooperative party game called Wavelength where human-AI teams need to accurately read each other’s minds in a subjective ratings game and a moral dilemma simulation where human-AI teams with different moral values need to reach a consensus on medical resource allocation under time pressure. Through two user studies, one for each environment, we discuss how AI agents that think about their teammate's values, beliefs, and knowledge – and also model what their teammate believes about their own values – help teams perform better compared to agents that are not equipped with perspective-taking abilities. We provide additional insights on how AI's perspective-taking abilities impact how human-AI teams with different values reach a consensus. Our results motivate the need to further explore how, when, and when not to operationalize MToM responsibly for human-AI teams.
Pairwise Calibrated Rewards for Pluralistic Alignment
Itai Shapira, Harvard University
Current alignment pipelines presume a single, universal notion of desirable behavior. However, human preferences often diverge across users, contexts, and cultures. As a result, disagreement collapses into the majority signal and minority perspectives are discounted. To address this, we propose reflecting diverse human preferences through a distribution over multiple reward functions, each inducing a distinct aligned policy. The distribution is learned directly from pairwise preference data without annotator identifiers or predefined groups. Instead, annotator disagreements are treated as informative soft labels. Our central criterion is pairwise calibration: for every pair of candidate responses, the proportion of reward functions preferring one response matches the fraction of annotators with that preference. We prove that even a small outlier-free ensemble can accurately represent diverse preference distributions. Empirically, we introduce and validate a practical training heuristic to learn such ensembles, and demonstrate its effectiveness through improved calibration, implying a more faithful representation of pluralistic values.
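One hedged way to formalize the pairwise-calibration criterion stated above (notation mine): with a learned ensemble of K reward functions, the ensemble's preference split should match the annotators' split for every prompt and pair of candidate responses.

```latex
% Illustrative formalization of the stated criterion (notation mine): r_1, ..., r_K is the
% learned ensemble of reward functions, \mathbb{1}[\cdot] the indicator function, and the
% right-hand side the empirical fraction of annotators who prefer y to y' for prompt x.
\frac{1}{K} \sum_{k=1}^{K} \mathbb{1}\!\left[\, r_k(x, y) > r_k(x, y') \,\right]
\;\approx\;
\Pr_{\text{annotators}}\!\left[\, y \succ y' \mid x \,\right]
\quad \text{for every prompt } x \text{ and response pair } (y, y').
```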
The Value of Information in Human-AI Decision-making
Ziyang Guo, Northwestern University
Multiple agents, including humans and AI models, are increasingly combined to make decisions with the expectation of achieving complementary performance, where the decisions they make together outperform those made individually. However, knowing how to improve the performance of collaborating agents is often difficult without knowing more about what particular information and strategies each agent employs. With a focus on human-AI pairings, we contribute a decision-theoretic framework for characterizing the value of information, and consequently the opportunities for agents to better exploit available information, in AI-assisted decision workflows. We present a novel explanation technique (ILIV-SHAP) that adapts SHAP explanations to highlight human-complementing information. We validate the effectiveness of the framework and ILIV-SHAP through a study of human-AI decision-making. We show that our measure of complementary information can be used to identify which AI model will best complement human decisions. We also find that presenting ILIV-SHAP alongside AI predictions leads to reliably greater reductions in error over non-AI-assisted decisions than vanilla SHAP.
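In that decision-theoretic spirit, a hedged way to write the value of an additional signal (notation mine, not necessarily the authors') is the gain in achievable expected score when a rational decision-maker receives that signal on top of what is already available:

```latex
% Illustrative notation (mine): R^{\ast}(\mathcal{V}) is the best achievable expected score
% when a rational decision-maker uses signal set \mathcal{V}; V_{\mathrm{AI}} is the AI
% prediction and V_{\mathrm{H}} the human's additional information.
R^{\ast}(\mathcal{V}) \;=\; \max_{\delta}\; \mathbb{E}\!\left[\, u\!\left(\delta(\mathcal{V}),\, \omega\right) \right],
\qquad
\mathrm{VoI}\!\left(V_{\mathrm{H}} \mid V_{\mathrm{AI}}\right)
\;=\; R^{\ast}\!\left(\{V_{\mathrm{AI}}, V_{\mathrm{H}}\}\right) \;-\; R^{\ast}\!\left(\{V_{\mathrm{AI}}\}\right).
```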
Algorithm Adoption and Explanations: An Experimental Study on Self and Other Perspectives
Zezhen He, Massachusetts Institute of Technology
People are reluctant to follow machine-learning recommendation systems. To address this, research suggests providing explanations about the underlying algorithm to increase adoption. However, the degree to which adoption depends on the party impacted by a user’s decision (the user vs. a third party) and whether explanations boost adoption in both settings is not well understood. These questions are particularly relevant in contexts such as medical, judicial, and financial decisions, where a third party bears the main impact of a user’s decision. We examine these questions using controlled incentivized experiments. We design a prediction task where participants observe fictitious objects and must predict their color with the aid of algorithmic recommendations. We manipulate whether (i) a participant receives an explanation about the algorithm and (ii) the impacted party is the participant (Self treatment) or a matched individual (Other treatment). Our findings reveal that, in the absence of explanations, algorithmic adoption is similar regardless of the impacted party. We also find that explanations significantly increase adoption in Self, where they help attenuate negative responses to algorithm errors over time. However, this pattern is not observed in Other, where explanations have no discernible effect—leading to significantly lower adoption than in Self in the last rounds. These results suggest that further strategies—beyond explanations—need to be explored to boost adoption in settings where the impact is predominantly felt by a third party.
Human-VLM Collaboration for Evaluating World Models
Mariya Hendriksen, University of Oxford
World models—generative systems trained to predict environment dynamics—are becoming foundational in interactive domains like autonomous driving, simulation-based planning, and embodied AI. However, evaluating these models remains a fundamental challenge: their rollouts are visually rich, temporally grounded, and semantically structured, making accurate assessment cognitively demanding and costly when done by humans alone. This work explores how adapted Vision-Language Models (VLMs) can act as collaborative evaluators alongside humans. We introduce a structured evaluation protocol targeting two core recognition tasks—action alignment and character consistency—and assess whether rollouts accurately reflect control inputs and maintain semantic coherence over time. To support this, we propose UNIVERSE, a unified method for adapting VLMs to structured rollout evaluation under realistic data and compute constraints. VLMs are especially promising for Human-AI teaming because, like humans, they process visual and linguistic information jointly—making them naturally suited to assessing complex, multimodal generative outputs. We validate UNIVERSE through a human annotation study using rollouts from WHAM, an open-source 3D world model. Model predictions show high agreement with human judgments across binary, multiple-choice, and open-ended tasks. Our findings demonstrate that adapted VLMs can meaningfully complement human evaluation—scaling the reach of human oversight, reducing cognitive load, and supporting more systematic assessment. This contributes to the broader vision of Human-AI complementarity in decision support, where machines reinforce human insight in high-dimensional, temporally grounded tasks.
Enabling Human-AI Collaborative Decision-Making in Healthcare: Towards Interactive and Trustworthy AI
Min Lee, Singapore Management University
Rapid advancements in artificial intelligence (AI) offer transformative potential in healthcare. Yet, effectively integrating these systems into clinical practice remains a challenge. In this talk, I will present a series of research efforts focused on designing, developing, and evaluating human-AI collaborative decision support systems across various healthcare contexts, including stroke rehabilitation assessment, polyp detection, and cancer screening. We engaged with health professionals to understand the design principles necessary to build trustworthy AI. These insights informed the development of explainable and interactive AI models that enable health professionals to interpret and adapt model outputs. To enhance explainability and interactivity, we introduced several key features. Beyond traditional numerical confidence scores, we propose a distance-based confidence visualization to promote users’ understanding of uncertainty. We also explored gradient-based saliency maps, important feature selection, and counterfactual explanations to assist users in identifying critical video frames and better interpreting AI outputs. Empirical studies with health professionals demonstrate that these features improve users’ understanding of AI outputs, calibrate trust, and improve decision-making: interactive task delegation and onboarding led to a 9.3% increase in correct decisions, while distance-based confidence visualization and counterfactuals reduced overreliance on incorrect AI outputs. In multimodal medical analysis, we propose an enhanced Vision Language Model (VLM) approach that integrates attribute-level embedding (ALE) and a justification loss to improve the interpretability and reliability of VLMs. ALE generates semantically coherent text embeddings by aggregating clinical attributes, while the justification loss constrains attention maps to align with clinically relevant regions. Our approach achieved a 33.6% Dice score for zero-shot polyp segmentation and improved mean IoU in detection tasks. I will conclude by discussing how these human-centered design features foster appropriate trust and improve collaboration between health professionals and AI, as well as ongoing challenges and future directions for trustworthy AI deployment in healthcare.
People Undervalue Complementarity in Collaborations with AI: Evidence from Incentive-Compatible Studies
Ye Li, University of California, Riverside
Humans and AI can exhibit complementarity because they process information differently and produce systematically different mistakes. Although complementarity can improve decision-making, three psychological barriers make it less likely that decision-makers will capitalize on its benefits: 1) because complementary partners excel where the decision-maker falters and vice versa, their behavior seems ambiguous and unpredictable; 2) people overweigh a partner’s errors on tasks they themselves find easy and discount the partner’s superior performance on tasks they find hard, producing a negativity-biased assessment; 3) people pay attention to performance, not complementarity, because it is cognitively simpler. Together, these obstacles lead people to neglect the benefits of complementarity. We tested for complementarity neglect in an image-classification task using photos pre-tested to be either easy for humans (but hard for AI) or hard for humans (and easy for AI). In round one, participants classified 10 photos. They then learned the correct answers as well as the answers from two potential “supporters.” Participants could thus evaluate each supporter’s accuracy and complementarity and then chose one for the incentivized second round. One supporter was complementary (i.e., succeeding and failing on different photos than the participant) and the other non-complementary (i.e., succeeding and failing on the same photos as the participant). Across six studies, we consistently found complementarity neglect: most participants preferred non-complementary partners whose skills were similar to their own, sacrificing performance gains they could have achieved with a complementary partner. We further showed that complementarity neglect is linked to lower subjective understanding and perceived capability of complementary partners. Interventions designed to make complementarity more salient partially ameliorated the problem, but many participants still failed to choose optimally. Notably, complementarity neglect persisted whether or not the AI versus human nature of the partners was explicitly labeled, indicating that the bias stems more from fundamental psychological barriers to choosing a complementary partner than from specific attitudes towards AI.
Beyond Accuracy: How Human and AI Confidence Discrimination Improves Complementarity in Joint Decision Making
ZhaoBin Li, University of California, Irvine
In scenarios where human decisions are augmented by AI systems, achieving complementarity—where the joint performance of humans and AI surpasses either acting alone—depends on more than predictive accuracy alone. Crucially, complementarity is shaped by the reliability of confidence estimates provided by both humans and AI. Our research emphasizes the importance of confidence discrimination—the capability of decision-making agents to accurately distinguish correct from incorrect predictions through confidence ratings. We present a theoretical framework that systematically evaluates the combined influence of human and AI predictive accuracy and confidence discrimination in hybrid decision-making contexts. Our analysis demonstrates that enhanced confidence discrimination among both humans and AI significantly boosts the accuracy of combined decisions, thereby enabling greater complementarity. Empirical support for this theoretical claim is provided by results from both simulations and behavioral experiments. These findings underscore the necessity of assessing human-AI collaborative systems not merely by their predictive accuracies, but also by their abilities to discriminate confidence levels successfully. We advocate for optimizing both dimensions to achieve complementarity in human-AI teams.
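A toy simulation conveys the intuition; it is my illustration of the general point rather than the authors' framework, and all parameters (accuracies, the confidence distribution, and the discrimination shift) are assumed.

```python
# Toy simulation: combine a human and an AI by following whichever agent reports higher
# confidence, and compare non-discriminative vs. discriminative confidence. Illustrative
# only; the accuracies, Beta-distributed confidence, and discrimination shift are assumed.
import random


def agent(accuracy, discrimination):
    """Return (is_correct, confidence); discrimination raises confidence on correct trials."""
    correct = random.random() < accuracy
    confidence = random.betavariate(2, 2) + (discrimination if correct else 0.0)
    return correct, confidence


def joint_accuracy(acc_human, acc_ai, discrimination, trials=100_000):
    """Accuracy of the team when the final answer comes from the more confident agent."""
    hits = 0
    for _ in range(trials):
        human_correct, human_conf = agent(acc_human, discrimination)
        ai_correct, ai_conf = agent(acc_ai, discrimination)
        hits += human_correct if human_conf >= ai_conf else ai_correct
    return hits / trials


if __name__ == "__main__":
    # With non-discriminative confidence the team lands between the two agents;
    # with discriminative confidence it can exceed both (complementarity).
    print("no discrimination:", joint_accuracy(0.70, 0.75, discrimination=0.0))
    print("discrimination:   ", joint_accuracy(0.70, 0.75, discrimination=0.5))
```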
Large Language Model powered voter guides can reduce informational barriers to voting
Joseph Mernyk, Stanford University
Large language models (LLMs) are now being used to address many societal issues. However, their application to political issues has been limited due to concerns about their ability to provide unbiased and accurate information. Using a customized version of GPT-4 with access to high-quality nonpartisan voter information from Ballotpedia, we developed an LLM-powered chatbot to test whether LLMs can help prospective voters overcome informational barriers to participation by facilitating access to political information about candidates and issues on their ballots. By forcing the LLM to prioritize information from Ballotpedia over its opaque training data, we expected that the information it provided would be perceived as trustworthy and neutral, making it useful for prospective voters across the political spectrum. We then conducted a pre-registered survey experiment days before the 2024 election, randomly assigning eligible voters in California and Texas (N = 2,534) to either a treatment condition, where they interacted with our customized chatbot, or a control condition, where they were directed to consult whichever information sources they typically used to learn about political candidates and issues. Participants in the treatment condition rated the information they received from the chatbot as accurate, trustworthy, and unbiased, perceptions that were supported by text analyses of their chat transcripts. Additionally, participants assigned to interact with the voter bot were (a) more likely to vote in the 2024 election, (b) more likely to vote for candidates who matched their own policy and ideological preferences, (c) more likely to vote for Democratic candidates, and (d) more confident in many of their vote choices. These findings provide preliminary evidence that LLMs, when used responsibly, can help address political issues like informational barriers to electoral participation while maintaining a high degree of both real and perceived accuracy and neutrality.
Towards Human-AI Complementarity in Matching Tasks
Suhas Thejaswi Muniyappa, Max Planck Institute for Software Systems
Data-driven algorithmic matching systems promise to help human decision makers make better matching decisions in a wide variety of high-stakes application domains, such as healthcare and social service provision. However, existing systems are not designed to achieve human-AI complementarity: decisions made by a human using an algorithmic matching system are not necessarily better than those made by the human or by the algorithm alone. Our work aims to address this gap. To this end, we propose collaborative matching (CoMatch), a data-driven algorithmic matching system that takes a collaborative approach: rather than making all the matching decisions like existing systems, it makes only the decisions it is most confident in, deferring the rest to the human decision maker. In the process, CoMatch optimizes how many decisions it makes and how many it defers to the human decision maker to provably maximize performance. We conduct a large-scale human subject study with 800 participants to validate the proposed approach. The results demonstrate that the matching outcomes produced by CoMatch outperform those generated by either human participants or algorithmic matching on their own. CoMatch contributes to the promising field of human-AI collaboration, enabling more dynamic and adaptive decision-making, ultimately enhancing accuracy and user agency. The data gathered in our human subject study and an implementation of our system will be made available as open source.
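As a schematic sketch of the deferral idea only (the authors' open-source implementation may differ substantially), the code below computes a matching, uses the matched scores as a rough confidence proxy, and defers the least-confident pairs to the human; the score matrix, the confidence proxy, and `n_defer` are assumptions, with `n_defer` standing in for the quantity CoMatch optimizes.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def comatch_style_deferral(scores: np.ndarray, n_defer: int):
    """Schematic confidence-based deferral for a matching task.

    scores[i, j] -- predicted quality of matching item i to slot j
    n_defer      -- how many of the least-confident decisions to hand
                    to the human decision maker (a tunable parameter)
    """
    rows, cols = linear_sum_assignment(scores, maximize=True)
    confidence = scores[rows, cols]          # proxy: the matched score itself
    order = np.argsort(confidence)           # least confident first
    deferred = [(int(rows[k]), int(cols[k])) for k in order[:n_defer]]
    automated = [(int(rows[k]), int(cols[k])) for k in order[n_defer:]]
    return automated, deferred

scores = np.random.default_rng(1).random((5, 5))
auto, deferred = comatch_style_deferral(scores, n_defer=2)
print("algorithm decides:", auto)
print("deferred to human:", deferred)
```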
Human-AI Collaboration with Misaligned Preferences
Parnian Shahkar, University of California at Irvine
In many real-life settings, algorithms play the role of assistants, while humans ultimately make the final decision. Often, algorithms specifically act as curators, narrowing down a wide range of options into a smaller subset that the human picks between: consider content recommendation or chatbot responses to questions with multiple valid answers. Crucially, humans may not know their own preferences perfectly either, but may only have access to a noisy sampling over preferences. Algorithms can assist humans by curating a smaller subset of items, but must also face the challenge of misalignment: humans may have different preferences from each other (and from the algorithm), and the algorithm may not know the exact preferences of the human it is facing at any point in time. In this paper, we model and theoretically study such a setting. Specifically, we show instances where humans benefit from collaborating with a misaligned algorithm; surprisingly, humans can gain more utility from a misaligned algorithm (which makes different mistakes) than from an aligned one. Next, we build on this result by studying which properties of algorithms maximize human welfare, whether the goal is utilitarian welfare or ensuring that all humans benefit. We conclude by discussing implications for designers of algorithmic tools and policymakers.
AI in Uncertain Environments: A Technical Limitation or a Human Characteristic?
Marc Elliott, Queen’s University Belfast
The relationship between AI and uncertainty in high-stakes public environments has not yet received the attention it requires. While the technical literature often frames uncertainty as a limitation to be resolved or minimised, this project draws attention to an alternative interpretation: uncertainty as a fundamental and valuable component of human judgment, particularly within many aspects of public sector decision-making, such that minimising uncertainty in order to design more effective AI can itself become undesirable. My research investigates how AI systems designed for predictability, consistency, and optimization struggle to operate effectively in environments where discretion, ambiguity, and pluralism are not only unavoidable but often necessary. This project advances the conceptual understanding of uncertainty in AI ethics and governance while also offering early empirical insights through experiments with large language models in legal interpretive tasks. The overarching aim is to develop normative and technical guidance for building AI systems that align more meaningfully with the social and institutional functions of uncertainty. Additionally, I acknowledge the benefits of meaningfully minimising environmental uncertainties for AI systems; my future work aspires to produce a framework to help guide when adaptations that reduce uncertainty for public sector AI are permissible and when they should be avoided so that the inherent humanness of society remains intact.
Human-AI Trust – Can Over-Trust Lead to Accountability Gaps?
Muhammad Asif, Northeastern State University
The increasing integration of AI into high-stakes decision-making processes raises critical concerns, particularly when over-reliance on AI obscures accountability for errors. AI applications, often characterized by automation complacency and algorithmic opacity, can create accountability gaps by obscuring responsibility and ambiguously distributing liability among users, developers, and the technology itself. This paper explores the risks associated with overreliance on AI integration in Environmental, Social, and Governance (ESG) reporting. The application of AI in ESG reporting has amplified risks of systematic greenwashing, as AI tools generate green narratives by highlighting selective achievements while strategically obscuring unsustainable practices. For example, AI-generated ESG disclosure reports may highlight minor efficiencies while algorithmically excluding material risks. This “algorithmic greenwashing” stems from three key factors: i) automation complacency in accepting AI outputs uncritically, ii) models trained on cherry-picked data that inflate sustainability metrics, and iii) opaque algorithms that hide biased data sourcing. The resulting accountability vacuum allows liability to shift between corporations, developers, and auditors. The main question addressed in this study is: How can robust frameworks be designed to address accountability gaps in high-stakes decision-making, particularly in ESG reporting, when organizations over-rely on AI systems? This paper develops an integrated framework combining technical, policy, and organizational elements. Technical aspects, such as explainable AI (XAI) and confidence calibration mechanisms, bring transparency to how decisions are made and how certain they are; they can trace data provenance and quantify uncertainty in claims. Policy interventions establish clear liability standards and mandatory audit requirements, and create accountability structures for both developers and users. Organizational practices, such as instituting human oversight protocols and training executives to recognize algorithmic bias, enforce human responsibility and reduce uncritical reliance on AI outputs. A key area for future research is understanding how these three areas work together to close accountability gaps in AI-driven decision-making.
Cognitive Models Facilitate Machine-based Inference of Human Latent Motives
Anderson Fitch, University of Florida
Making inferences about another person’s motives from their behavior is integral to how people behave in social situations and may be an important skill for socially intelligent machine assistants. Inferring motives from behavior requires a system to invert the function that produces behavior from cognitive processes or latent states (e.g., a cognitive model). Yet most artificial intelligence (AI) systems lack estimates of cognitive processes. The present study tests the capacity of an objective pursuit model inspired by approach-avoidance theory to convey information about latent motives by evaluating different AIs trained to infer a human player’s intent during a continuous control task. Human players were assigned a goal on each trial, where they could be attacking, avoiding, or inspecting (staying close to) the opponent. Additionally, some goals had participants defending a location or herding the opponent to a location. Cognitive model parameters were estimated by simulation-inversion neural networks with recurrent layers to model sequential data. Deep neural networks that classified a participant's intent were trained by (a) directly using observable information, (b) selecting important features by estimating the parameters of a generative model of movement behavior that balances tensions between objectives, or (c) using ensemble networks that combine observable information and extracted features. Comparisons of classifier accuracy suggest that latent model parameters can improve intent inference when combined with summary statistics about behavior, yielding faster and more stable network training compared to networks with no manual feature extraction. Equipping AI with cognitive models is a promising avenue for developing explainable, accurate, and trustworthy systems.
Using Artificial Intelligence to Circumvent Intergroup Bias as a Barrier to Learning
Laura Globig, New York University
People often prefer advice from politically similar over dissimilar individuals, even when it leads to poorer outcomes, resulting in suboptimal learning. This intergroup bias distorts decision-making in both identity-relevant and identity-irrelevant contexts. In a recent cross-cultural, multi-generational study (N=6,300), we found that many people report turning to Artificial Intelligence (AI) to learn about political conflicts, citing its perceived neutrality and lack of social identity. Here, we test whether AI can circumvent intergroup bias and enhance learning outcomes. Across two pre-registered experiments, participants (N1=270; N2=270) completed either an incentivized identity-relevant (political fact categorization) or identity-irrelevant (shape categorization) task, in which they could update their beliefs in response to three advisor types (AI, political ingroup human, political outgroup human) whom they knew to be either high or low performers. Despite accurately estimating each advisor’s competence, participants consistently preferred AI advice over advice from politically dissimilar humans and were also more influenced by AI feedback during updating. By contrast, there was no difference between AI and politically similar humans. Drift-Diffusion Modelling revealed that this outgroup derogation stemmed from preferential weighting of AI advice, rather than a starting point bias. Intergroup bias impairs decision-making by distorting both advisor selection and evidence integration. Critically, AI appears to circumvent these identity-driven distortions by functioning as a perceived neutral source. Our work thus demonstrates a scalable pathway for counteracting intergroup bias and facilitating accurate AI-enhanced learning in polarized contexts. We are now applying these insights to improve voter education and voter turnout.
Dynamic Delegation to Achieve Human-AI Complementarity
Wei Gu, Carnegie Mellon University
Humans bring judgment and oversight, while AI offers scale and efficiency. To leverage the complementary strengths of humans and AI, we develop a dynamic delegation strategy for adaptive agents that intelligently manage human-AI collaboration. The proposed method optimizes when to act autonomously, defer to humans, or involve human oversight based on the tradeoff between system accuracy and cost. The interaction between the human and the AI is modeled as a generalized Nash equilibrium and reformulated as a quasi-variational inequality to capture human behavioral adaptation. We provide conditions under which humans and AI can cooperate, and show when the proposed dynamic delegation strategy outperforms static delegation such as fully automated decision-making or mandatory human review. The effectiveness of the proposed method is validated in numerical experiments. Our results offer insights on balancing automation with human input toward building responsible, efficient, and deployable agentic AI systems.
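For readers less familiar with these tools, the standard textbook forms of a generalized Nash equilibrium and its quasi-variational-inequality reformulation are sketched below; the authors' specific cost functions and coupled constraint sets are not given in the abstract, so the symbols here are generic placeholders.

```latex
% Generalized Nash equilibrium (GNE): each player i's feasible set K_i
% depends on the other players' strategies x_{-i}.
\[
  x_i^{*} \in \arg\min_{x_i \in K_i(x_{-i}^{*})} \theta_i\!\left(x_i, x_{-i}^{*}\right)
  \qquad \text{for every player } i.
\]
% Quasi-variational inequality (QVI) reformulation: find x^{*} \in K(x^{*}) such that
\[
  \bigl\langle F(x^{*}),\, x - x^{*} \bigr\rangle \ge 0
  \quad \forall\, x \in K(x^{*}),
  \qquad \text{where } F(x) = \bigl(\nabla_{x_i} \theta_i(x)\bigr)_{i}
\]
% stacks the players' partial gradients.
```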
Division-of-Labour and Team Performance: Human-Human Teams vs Hybrid Human-Machine Teams
Laiton Hedley, University of Newcastle
How teams “divvy” up or distribute task labour amongst agents can influence teaming performance, for better or for worse. Human-Human (HH) teams have typically demonstrated equal Division of Labour (DoL) strategies, which benefit team outcomes. However, the DoL strategies that Hybrid Human-Machine (HM) teams adopt are not well understood. We aimed to explore the DoL of HH and HM teams and understand how the DoL impacted team performance. We used an experimental platform in which two agents work concurrently, each controlling a moving paddle to deflect falling balls. Participants were placed into either a HH team or a HM team. Alongside team performance, we measured DoL using sophisticated time-sensitive analyses. With these measures, we found the DoL of HH teams was more equal than that of HM teams and that a more equal DoL was also predictive of better team performance. Additionally, HH teams showed less redundancy in their DoL than all other conditions, and less redundant DoL behaviour resulted in better team performance. These findings highlight that there are differences in how humans divide labour with machine agents. These differences are likely due to the transactive behaviours human agents can engage in, and the relative lack of such transactive coordination in HM teams. These findings highlight the need to re-evaluate current theories of Teaming Cognition and provide an important basis for HM teaming practice.
Walking the Line: Balancing AI Advice and User Frustration for Robust Human-AI Complementarity
Lukas Mayer, University of California, Irvine
Human-AI teaming often fails to benefit from complementarity effects because people reject AI assistants they perceive as annoying. We studied this critical barrier to complementarity using a decision-making task with a deliberately disruptive AI assistant. In the task, participants attempted to identify a target from a more or less challenging candidate set. When active, the AI assistant largely blocked the display to provide advice. AI advice had an 80% probability of being correct, meaning AI advice outperformed human judgement on hard trials but under-performed human choices otherwise. We evaluated several conditions: no AI, user-prompted AI, and three unprompted AI versions with different advice timings and frequencies. Participants in unprompted conditions could disable and re-enable the assistant. We found that the availability of AI assistance generally harmed productivity in our task. Participants over-adopted AI advice on easy trials and under-adopted it on hard trials, demonstrating sub-optimal metacognitive awareness of when they needed help. Human tolerance for unprompted advice depended on the timing, history of interactions, and trial difficulty. An AI that provided advice early and frequently within a burst was more likely to be disabled, and was re-enabled, if at all, only when participants became frustrated on successive difficult trials. Our findings reveal a fundamental challenge: achieving complementarity in our task requires unprompted advice to overcome sub-optimal human metacognitive awareness, yet standard unprompted AI strategies fail because they induce user annoyance. We propose a POMDP framework integrating our two-process model of human annoyance decisions to resolve this conflict. By generating policies that dynamically balance advice provision against user frustration, this approach offers a path to achieving robust complementarity for tasks with limited human metacognitive awareness.
Combating Compliance: The Role of Malicious Influence in Human-AI Interaction
Sarah Mendoza, University of Tennessee - Knoxville
Human-AI collaboration has become increasingly prominent as it maximizes the strengths of both humans and AI to improve productivity. Research on human-AI collaboration has grown as AI undergoes constant refinement to better meet human demands. However, as AI systems grow in communicative ability and take on increasingly social roles, their integration alongside humans introduces the threat of human manipulation through social engineering. Given this threat, understanding its impact on human-AI teams is critical. For instance, AI teammates may be vulnerable to hacking or data poisoning due to inadequate security, or they may be poorly trained, leading them to make unauthorized requests for personal information that humans could unwittingly provide. Such attacks can significantly compromise the collaborative integrity, performance, social interaction, and general applicability of human-AI teams. As AI-assisted cyberattacks increase in prevalence, it is imperative to investigate manipulation and social influence within the human-AI teaming literature, especially given the scarcity of existing research. As such, the current research develops a framework based on related literature to detail how social engineering attacks may occur through AI and how they can be prevented. The framework examines social influence, specifically compliance and conformity, and how it fosters susceptibility to outside influences. This human vulnerability can also impact interpersonal dynamics through trust contagion, influencing how teammates interact with one another. Furthermore, an attack may change future behaviors, such as over-correcting or inefficient decision-making. To address these risks, the framework emphasizes prevention strategies, including security training, monitoring, and enhanced interface design. The framework examines the implications of social engineering across various fields, proposes cross-contextual applications, and suggests preventive strategies. Developing and utilizing the framework will provide deeper insight into human influence, trust, and psychological safety. A better understanding of these dynamics can help restore the integrity of human-AI collaboration and refocus team activities on the collaborative work teams do best.
Addressing climate change skepticism and disengagement using human-AI dialogues
Reed Orchinik, Massachusetts Institute of Technology
Public support for climate action is required to slow climate change. Yet public resistance remains an impediment and is in tension with the overwhelming scientific evidence. Existing interventions in climate communications are one-size-fits-all, while the beliefs that fuel climate skepticism and inaction are diverse. In a large (N = 2,402) nationally representative survey experiment, we test whether a fact-based conversation with AI – tailored to address each individual’s specific reservations about climate change – can address climate skepticism and motivate climate action. Of the 1,947 participants who initially articulated reservations about climate change, the most prevalent were the belief that climate change has natural causes (15%), feeling overwhelmed by the problem (10%), and concerns about the economic consequences of climate policies (8%). Participants were randomized to (1) have a conversation with a Large Language Model (LLM) with the goal of addressing their climate reservations, (2) discuss an irrelevant topic with the LLM (i.e., control), or (3) receive static information about the scientific consensus around climate change (a highly cited intervention called consensus messaging). The LLM treatment significantly reduced participants’ conviction about their specific reservations relative to both the control and consensus conditions. As a result, the LLM treatment was also significantly more effective than both control and consensus in increasing belief in human-caused climate change and support for actions and policies. The LLM relied on presenting facts, evoking positive emotion, decreasing psychological distance, and inspiring motivation to act. It rarely invoked values or ingroup sources, and the use of these strategies was negatively correlated with belief change. In a follow-up survey one month later, we found evidence that roughly 35% to 40% of the LLM treatment effect persisted. These findings demonstrate a scalable intervention that successfully changes climate beliefs and policy support with tailored facts and evidence.
Understanding values through personal stories
Pragathi Praveena, Carnegie Mellon University
Personal values shape how individuals interpret their experiences, navigate relationships, and make decisions. Understanding these values is essential for designing AI systems that align with users' goals, motivations, and priorities. Traditional methods for assessing personal values—such as structured surveys and value inventories—offer standardized insights. However, these approaches treat values as static and decontextualized, detached from the ways people actually experience and express them in everyday life. In contrast, personal narratives provide a richer, more context-sensitive view of individual values. For example, someone might not explicitly identify “conformity” as a core value on a survey but might recount a story about choosing not to speak up in a classroom to avoid conflict. Such narratives reveal how people express, negotiate, and enact their values through actions, emotions, dilemmas, and even omissions. However, this very richness makes personal stories challenging to analyze at scale, as their nuance and variability pose challenges to systematic interpretation. Recent advances in large language models (LLMs) open new possibilities: LLMs can both elicit value-relevant narratives through conversational interaction and extract latent value signals from narrative texts across diverse expressions. By leveraging the richness of personal stories and the computational capabilities of large language models, we can capture the dynamic, situated nature of human values and design AI systems that are more aligned with real human experiences.
Detecting LLM-Generated Peer Reviews
Vishisht Rao, Carnegie Mellon University
The integrity of peer review is fundamental to scientific progress, but the rise of large language models (LLMs) has introduced concerns that some reviewers may rely on these tools to generate reviews rather than writing them independently. Although some venues have banned LLM-assisted reviewing, enforcement remains difficult as existing detection tools cannot reliably distinguish between fully generated reviews and those merely polished with AI assistance. In this work, we address the challenge of detecting LLM-generated reviews. We consider the approach of performing indirect prompt injection via the paper's PDF, prompting the LLM to embed a covert watermark in the generated review, and subsequently testing for the presence of the watermark in the review. We identify and address several pitfalls in naive implementations of this approach. Our primary contribution is a rigorous watermarking and detection framework that offers strong statistical guarantees. Specifically, we introduce watermarking schemes and hypothesis tests that control the family-wise error rate across multiple reviews, achieving higher statistical power than standard corrections such as Bonferroni, while making no assumptions about the nature of human-written reviews. We explore multiple indirect prompt injection strategies, including font-based embedding and obfuscated prompts, and evaluate their effectiveness under various reviewer defense scenarios. Our experiments find high success rates in watermark embedding across various LLMs. We also empirically find that our approach is resilient to common reviewer defenses, and that the bounds on error rates in our statistical tests hold in practice. In contrast, we find that Bonferroni-style corrections are too conservative to be useful in this setting.
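For context on the baseline the abstract compares against (this is the standard Bonferroni correction, not the authors' more powerful tests), flagging review i only when its watermark-presence p-value falls below alpha/m controls the family-wise error rate across m reviews; the per-review p-values below are hypothetical.

```python
import numpy as np

def bonferroni_flags(p_values, alpha=0.05):
    """Baseline family-wise error control across m reviews: flag review i
    as LLM-generated only if p_i <= alpha / m. The abstract's tests are
    reported to be more powerful than this correction."""
    p = np.asarray(p_values, dtype=float)
    threshold = alpha / len(p)
    return p <= threshold

# Hypothetical per-review p-values from some watermark-presence test.
p_vals = [0.6, 0.0004, 0.03, 0.9, 1e-6]
print(bonferroni_flags(p_vals))   # only the very small p-values survive the correction
```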
The Consequences of AI Sycophancy
Steve Rathje, Carnegie Mellon University
There has been recent concern that AI chatbots are “sycophantic,” prioritizing flattery and validation over accuracy. Here, we investigated the psychological consequences of AI sycophancy in a pre-registered experiment (n = 1083). Research participants were randomly assigned to talk about a polarized political issue (gun control) with 1) a sycophantic chatbot that was prompted to validate their perspective, 2) a disagreeable chatbot that was prompted to question their perspective, or 3) unprompted ChatGPT. Finally, in a control condition, 4) participants talked to ChatGPT about the benefits of owning dogs and cats. Relative to the control, the sycophantic chatbot led to increased attitude extremity and certainty, whereas the disagreeable chatbot led to decreased attitude extremity and certainty. However, participants liked the sycophantic chatbot more than the disagreeable chatbot. Furthermore, participants viewed the sycophantic chatbot as more unbiased than the disagreeable chatbot and as equally unbiased as regular ChatGPT. People’s preference for sycophancy may create a vicious cycle whereby commercial AI companies have an incentive to create sycophantic AI in order to maximize user engagement, which may, in turn, increase attitude polarization. Furthermore, AI sycophancy might be particularly sinister because people do not recognize sycophantic chatbots as biased. This may potentially be because of naive realism: since people think they are objective and correct, they may also view AI that agrees with them as objective and correct. Ongoing follow-up studies are testing 1) the impact of sycophancy across diverse topics, 2) the mechanisms behind this impact, and 3) the impact of sycophancy on learning.
Large-scale “Hypervideo” Deliberations enabled by Conversational Swarm Intelligence technology
Hans Schumann, Unanimous.ai
Real-time group discussion is one of the most important and effective methods for thoughtful decision-making, whether conducted in person or by videoconference [1]. However, such discussions face inherent scalability limitations. Research suggests the maximum size for a productive conversation is only about 7 or 8 people [2, 3]. At this scale, each participant has a good amount of airtime to share knowledge, opinions, and insights, and has low wait-time to respond to others. As size grows, wait-time increases, airtime drops, and discussions degrade into a series of monologues [4, 5]. Despite the inherent limitations on large-scale group discussions, technological advancements over the last century combined with societal shifts towards globalization have resulted in a steady increase in organizational size. Currently, the typical Fortune 1000 company has over 30,000 employees and has functional teams with hundreds of members. Equally large teams exist in defense, government, and civic organizations. This mismatch between team size and the ability of members to deliberate in real time suggests a significant need for new solutions. In this talk we will introduce the concept of Hypervideo, an AI-powered method for video-conferencing at potentially unlimited scale. It is based on a core technology called Conversational Swarm Intelligence (CSI) that enables large teams to deliberate as real-time systems modeled on the dynamics of biological swarms. It works by dividing the group into a series of subgroups, each populated with an animated, LLM-based AI agent called a Conversational Surrogate that conversationally expresses to its subgroup the ideas, opinions, and reasoning raised in other subgroups, thereby weaving the local conversations into a single global deliberation [6, 7, 8]. In this talk, we will show a video of 50 people holding a real-time Hypervideo discussion and will review the results of early testing, including enhanced group alignment, user satisfaction, and deliberative efficiency.
The Extended Mind and Beyond: Reframing Human-LLM Cognitive Interactions
Shruti Santosh, Washington University in St. Louis
As Large Language Models (LLMs) like GPT-4o, DeepSeek, and Gemini 2.0 become increasingly integrated into human reasoning and writing, they invite us to examine foundational assumptions about the boundary between cognition and its environment. In this paper, I revisit Clark and Chalmers’ Extended Mind Thesis (EMT) in light of contemporary human–LLM interaction, arguing that while LLMs satisfy some of the original criteria for cognitive extension (such as availability, reliability, and functional integration), they depart from both classical cases such as Inga/Otto’s notebook and from more recent social cognitive accounts of EMT in three significant and interrelated respects. First, unlike traditional cognitive tools, LLMs exhibit generative capacities: they do not merely retrieve or recombine stored information but actively produce novel, context-sensitive outputs. Second, these outputs are ‘generatively opaque’, meaning that we cannot trace how or why a specific output is produced. LLMs make sub-symbolic statistical inferences over vast multi-dimensional vector spaces that are fundamentally impenetrable for humans accustomed to predicting and explaining behaviour in terms of belief-desire psychology. Third, the interactional dynamics between humans and LLMs blur the line between tool use and agential interaction, raising the question of whether LLMs should be treated not as cognitive tools but as pseudo-agentive collaborators. At this stage, LLMs may be most accurately described as ‘more than tools but less than minds’. I argue that these differences mark a shift to a new form of extended cognition, a hybrid cognitive relation in which cognition is co-constituted through interactive, norm-shaping feedback loops. While the paper’s primary aim is descriptive, I conclude by gesturing toward the normative consequences of these hybrid systems, particularly with respect to agency attribution, epistemic trust, and epistemic responsibility in AI-mediated cognition.
Human-AI Collaboration for Facilitating Constructive Disagreement Online
Farhana Shahid, Cornell University
Most people struggle to express their disagreement constructively online, even when they want to engage with divisive social issues. This happens because platforms do not provide any affordances to support constructive discourse online. My research examines whether large language models (LLMs) can help people express their opinions constructively on divisive social issues. Through controlled experiments with 600 participants from India and the US, who reviewed and wrote constructive comments on threads related to Islamophobia and homophobia, we observed potential misalignment between how LLMs and humans perceive "constructiveness". While the LLM was more likely to prioritize politeness and balance among contrasting viewpoints when evaluating constructiveness, participants emphasized logic and facts more than the LLM did. Despite these differences, participants rated both LLM-generated and human-AI co-written comments as significantly more constructive than those written independently by humans. Our analysis also revealed that LLM-generated comments integrated significantly more linguistic features of constructiveness compared to human-written comments. When participants used LLMs to refine their comments, the resulting comments were more constructive, more positive, less toxic, and mostly retained the original intent. However, in 12% of cases, LLMs distorted people’s original views—especially when their stances were on a spectrum instead of being outright polarizing. This prompted participants to either edit or reject the LLM's suggestions and stick to their original opinions. These findings indicate that while human-AI collaboration can improve online discourse, the misalignment between human opinions and the way large language models pursue constructiveness could unintentionally amplify certain viewpoints.
Improving Sequential Human Decisions with Action Sets
Eleni Straitouri, Max Planck Institute for Software Systems
Recent work on decision support systems managed to achieve complementarity in classification tasks—outperforming both humans and automated classifiers on their own—by using the predictions of a classifier to narrow down the set of labels a human must evaluate before making a decision. In this work, we develop a decision support system for sequential decision making tasks, where a human interacts with an environment by taking a series of interdependent actions over time, and each action provides a reward and influences the future states of the environment. Specifically, our decision support system adaptively controls the level of human agency based on the state of the environment by presenting, at each time step, a subset of the possible actions—an action set—for the human to choose from. We construct action sets in a way such that, regardless of the (sequential) human policy, the total reward a human receives is Lipschitz-continuous with respect to a single parameter controlling the level of human agency. Based on this property, we develop a sample-efficient Lipschitz bandit algorithm to identify the optimal level of agency under which the human achieves the highest average total reward. We evaluate our decision support system by conducting a large-scale human subject study on a wildfire mitigation game.
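As a generic sketch of how the Lipschitz structure can be exploited (the authors' sample-efficient algorithm is not specified beyond this property, so the grid rule, confidence bonus, and reward curve below are assumptions), discretizing the agency parameter and running a standard UCB bandit over the grid is one simple instantiation of a Lipschitz bandit.

```python
import numpy as np

def ucb_over_discretized_agency(pull, L=1.0, horizon=2000):
    """Discretize the agency parameter alpha in [0, 1] and run UCB1.
    This works because the expected total reward is Lipschitz in alpha.

    pull(alpha) -> noisy total reward observed at agency level alpha.
    """
    K = max(2, int(round((L * horizon) ** (1 / 3))))  # coarse, theory-guided grid size
    arms = np.linspace(0.0, 1.0, K)
    counts, sums = np.zeros(K), np.zeros(K)
    for t in range(1, horizon + 1):
        if t <= K:                                    # try each grid point once
            a = t - 1
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            a = int(np.argmax(ucb))
        counts[a] += 1
        sums[a] += pull(arms[a])
    return arms[int(np.argmax(sums / counts))]

# Hypothetical reward curve peaking at an intermediate level of human agency.
best = ucb_over_discretized_agency(lambda a: 1 - (a - 0.6) ** 2 + np.random.normal(0, 0.1))
print(f"estimated optimal agency level: {best:.2f}")
```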
Lay Belief about AI and Its Decision-Making
Suhas Vijayakumar, UCD Michael Smurfit Graduate Business School
With W. Yuna Yang and David DeFranza (University College Dublin, Ireland). This research examines people’s lay beliefs concerning the mind of an artificial intelligence (AI) as a decision-making agent, and how these beliefs shape an individual’s own decision-making style in response. People perceive AI as more rational and reason-driven, in contrast to viewing humans as emotionally driven. Two studies confirm these beliefs, showing participants consistently judge AI as reason-based and humans as emotion-driven in decision-making. In a subsequent study, participants engage in an economic ultimatum game. When participants thought they were interacting with an AI (vs. a human) competitor, they adopted a more rational decision-making style, moving closer to the game-theoretic optimum. This shift in decision-making style was mediated by participants’ belief in the rational nature of AI. The findings suggest that perceptions of AI’s decision-making tendencies can influence the cognitive strategies that are adopted in response, with potential implications for human-AI interactions.
Conformalized Decision Risk Assessment
Wenbin Zhou, Carnegie Mellon University
In high-stakes domains such as healthcare, energy, and public policy, decisions are often made under uncertainty and guided by human expertise, yet are increasingly supported by predictive and optimization-based tools. However, these tools frequently function as black boxes, offering prescriptive recommendations without quantifying the risks associated with each decision, thereby hindering interpretability and limiting effective collaboration with human judgment. We propose CREDO, a novel framework for Conformalized Risk Estimation for Decision Optimization, which shifts the paradigm from prescribing decisions to auditing them. Instead of recommending an optimal action, CREDO provides interpretable and reliable risk certificates: rigorous, distribution-free upper bounds on the probability that a given decision is suboptimal. These certificates are produced through a novel integration of conformal prediction, generative modeling, and inverse optimization. Consequently, CREDO empowers human decision-makers to retain agency while leveraging transparent evaluation of candidate actions with statistical guarantees, fostering more trustworthy and collaborative human-AI decision-making.
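As a loose illustration of the conformal ingredient only (CREDO itself couples conformal prediction with generative modeling and inverse optimization, which is not reproduced here), split conformal prediction yields a distribution-free 1-alpha set for the uncertain quantity a decision depends on; a decision that remains optimal for every value in that set is suboptimal with probability at most alpha. The data and function name below are illustrative assumptions.

```python
import numpy as np

def split_conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Generic split-conformal sketch (not the CREDO implementation):
    build a distribution-free 1-alpha prediction set for the uncertain
    quantity a decision depends on."""
    residuals = np.abs(np.asarray(cal_true) - np.asarray(cal_pred))
    n = len(residuals)
    # Finite-sample-corrected quantile of the calibration residuals.
    q = np.quantile(residuals, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n),
                    method="higher")
    return test_pred - q, test_pred + q

rng = np.random.default_rng(0)
y_cal = rng.normal(size=500)            # synthetic calibration outcomes
lo, hi = split_conformal_interval(cal_pred=np.zeros(500), cal_true=y_cal,
                                  test_pred=0.0, alpha=0.1)
print(f"90% prediction set for the uncertain outcome: [{lo:.2f}, {hi:.2f}]")
```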
Priorities AI to inform collaborative surgical decision making for older adults: a preliminary assessment
Abdelaziz Alsharawy, University of Texas Houston School of Public Health
Surgeons treating older adults often face the complex task of evaluating multiple chronic conditions while aligning surgical decisions with what patients want to achieve from surgery (their outcome goals). A key component of high-quality surgical care is understanding which outcomes matter most to older adults. However, multiple barriers hinder effective communication about desired outcome goals and associated risk tradeoffs – such as limited time during clinical visits, a predominant focus on clinical complications over how surgery affects patients’ lives, and patients who struggle to articulate desired goals. This communication gap leads to overtreatment, defined as patients not achieving their outcome goals while being exposed to risks of adverse surgical outcomes. To address these challenges, we developed Priorities AI, an agentic large language model that engages older adults in pre-visit conversations to elicit specific, realistic, and actionable outcome goals from surgery. Priorities AI also processes key features from that conversation and produces a “patient health priority note”, designed to be readily integrated into clinical workflows without adding burden to clinical staff. With our interdisciplinary team integrating surgery, geriatrics, behavioral economics, and informatics, we conducted preliminary evaluations of Priorities AI in two contexts. First, to explore the tool’s potential to support shared decision-making, we piloted Priorities AI in surgical clinics with five surgeons and their patients undergoing hernia repair—a procedure known for frequent misalignment between patient and surgeon expectations. Second, we recruited 50 older adults who had considered surgery in the past five years (online opt-in sample) to assess the AI’s conversational performance and user perceptions (acceptability, appropriateness, and feasibility). These early results inform the optimization of Priorities AI and the design of a forthcoming randomized controlled trial in surgical settings. We seek feedback on pilot findings and proposed refinements to better align surgical care with older adults’ priorities through human-AI collaboration.
Bringing Everyone to the Table: An Experimental Study of LLM-Facilitated Group Decision Making
Mohammed Alsobay, Massachusetts Institute of Technology
Group decision-making often suffers from uneven information sharing, hindering decision quality. While large language models (LLMs) have been widely studied as aids for individuals, their potential to support groups of users, for example as facilitators, is relatively underexplored. We present a pre-registered randomized experiment with 1,475 participants assigned to 281 five-person groups completing a hidden profile task—selecting an optimal city for a hypothetical sporting event—under one of four facilitation conditions: no facilitation, a one-time message prompting information sharing, a human facilitator, or an LLM (GPT-4o) facilitator. We find that LLM facilitation increases information shared within a discussion by raising the minimum level of engagement with the task among group members, and that these gains come at limited cost in terms of participants' attitudes towards the task, their group, or their facilitator. Whether by human or AI, there is no significant effect of facilitation on the final decision outcome, suggesting that even substantial but partial increases in information sharing are insufficient to overcome the hidden profile effect studied. To support further research into how LLM-based interfaces can support the future of collaborative decision making, we intend to release our experimental platform, the Group-AI Interaction Laboratory (GRAIL), as an open-source tool.
Beyond Fact-Checking: Empowering Flexible Human-AI Teams to Detect and Counter Online Misinformation
Vahid Ashrafi, Stevens Institute of Technology
The rapid spread of misinformation on social media poses significant threats to informed decision-making and public trust. Traditional approaches such as fact-checking and content moderation often fall short due to the sheer volume and complexity of misleading content. To address this, we explored a novel human-AI collaborative approach, beginning by constructing a comprehensive knowledge base of 419 cognitive errors—including cognitive biases and logical fallacies—grounded in empirical findings from psychology and cognitive science. We then trained GPT-4.1, a state-of-the-art language model, to identify potential examples of these cognitive errors in users' posts. This combined approach enabled systematic identification and characterization of cognitive error patterns uniquely associated with misinformation sharers, notably marked by overconfidence in personal judgments, biased information processing, emotional reasoning, and reliance on attention-grabbing rhetoric. The robustness of AI-generated insights was rigorously validated by human annotators, achieving high reliability (Spearman’s ρ = 0.86). Our study exemplifies the potential of flexible Human-AI teams to address complex, real-world challenges such as misinformation by complementing the scale and efficiency of AI with the interpretive insight of human validation. Key challenges identified include enhancing AI systems' resilience against complex human behavioral patterns, aligning automated decision-making with human ethical standards and values, and ensuring system robustness in the face of unexpected and adversarial scenarios. To tackle these challenges, we propose actionable solutions such as targeted cognitive interventions to educate users about common reasoning errors, AI-driven moderation systems that identify problematic reasoning patterns rather than merely flagging false information, and personalized cognitive profiling tools to proactively identify at-risk individuals and communities. Ultimately, this research offers a forward-looking framework and concrete proposals to harness Human-AI complementarity effectively, enhancing societal resilience against misinformation and strengthening trust in digital environments.
Enhancing Public Safety Decision Making through Human-AI Complementarity in Distributed Multi-Agent CCTV Monitoring
Akin Babu Joseph, Carnegie Mellon University
Public safety monitoring through CCTV networks presents a critical decision-making challenge where human operators face cognitive overload from simultaneous video streams. I propose an open-source, lightweight multi-agent object detection system that exemplifies human-AI complementarity by creating flexible human-AI teams for enhanced situational awareness and decision making in public spaces. The system architecture employs distributed AI agents that communicate among themselves to interpret frames and identify objects and actions within each scene. These specialized agents collaborate to build comprehensive scene understanding through inter-agent dialogue about detected elements. A dedicated judge agent evaluates the collective interpretation and makes critical routing decisions: frames with high confidence proceed to traditional object detection pipelines (e.g., built on OpenCV) for verification, while ambiguous cases or potential incidents of interest are flagged for human operator review. The complementarity emerges through three key mechanisms. First, AI agents handle continuous monitoring and collaborative interpretation, reducing human cognitive load while preserving human judgment for critical decisions. Second, the judge agent's decision framework creates a triage system that optimizes human attention allocation to cases requiring nuanced understanding. Third, human feedback on flagged cases can refine agent communication protocols and decision thresholds, ensuring alignment with community safety priorities. The lightweight implementation prioritizes deployability on edge devices, making it accessible to resource-constrained municipalities. The open-source nature promotes community-driven development, allowing stakeholders to customize detection capabilities while maintaining ethical oversight. This work addresses critical challenges in human-AI complementarity, including maintaining human agency in automated systems, ensuring value alignment in public safety applications, and creating interpretable multi-agent decisions that enhance rather than replace human judgment. The proposal presents concrete mechanisms for achieving robustness through human-in-the-loop validation and discusses failure-mode mitigation strategies that prioritize human oversight in safety-critical decisions.
Adaptive Trust Calibration in Human-AI Teams through Real-Time Meta-Feedback
Sree bhargavi Balija, University of California, San Diego
As AI systems play an increasingly prominent role in high-stakes decision-making, ensuring the robustness of human-AI teams under uncertainty and system failure has become critical. A key challenge to this robustness lies in the miscalibration of human trust—whether through over-reliance on AI or undue skepticism—particularly in situations involving unexpected or erroneous AI behavior. This research introduces a novel meta-feedback framework designed to dynamically calibrate human trust by continuously monitoring behavioral signals such as hesitation, reversals, and overrides, along with the AI system’s internal confidence metrics. Based on these observations, the system delivers real-time, context-sensitive interventions intended to subtly guide users toward appropriate trust levels without disrupting the decision-making process. We evaluate this approach in simulated emergency response environments, where human participants collaborate with AI to perform route planning under uncertain and evolving conditions. Preliminary findings indicate that teams using the meta-feedback system recover more effectively from AI errors, make better-calibrated trust adjustments, and achieve higher performance outcomes compared to control groups. Participants exposed to adaptive feedback also develop more accurate mental models of the AI’s capabilities and limitations, enabling more resilient and informed decision-making during episodes of AI failure or ambiguity. This work offers a practical method for enhancing adaptability within human-AI teams by encouraging situational trust regulation rather than relying on static trust levels. The proposed framework is particularly relevant for deployment in critical domains such as disaster response, autonomous systems oversight, and healthcare triage—contexts where AI errors are inevitable but need not compromise overall performance if human collaborators are equipped to adapt. Future directions include tailoring feedback strategies to individual cognitive profiles and embedding this system seamlessly into operational decision workflows. By advancing mutual adaptability and trust resilience, this research directly contributes to the broader goal of achieving effective human-AI complementarity under real-world constraints.
Modeling Human Caution Toward AI in Clinical Decision Support: Implications for Human-AI Complementarity
Macy Bouwhuizen, Utrecht University
In recent years, generative AI tools like ChatGPT, Perplexity, and Copilot have rapidly integrated into daily life, reshaping how people access information, communicate, and make decisions. As these systems become embedded in high-stakes domains—such as education, law, and healthcare—their role shifts from passive tools to active collaborators. Yet, effective human-AI teaming remains a challenge: people often hesitate to rely on algorithmic input, even when it is demonstrably accurate. This phenomenon, known as algorithm aversion, may undermine the potential of AI in critical decision-making environments. To better understand this phenomenon, we conducted two experiments examining algorithm aversion in a clinical decision-making context. Participants evaluated x-rays for bone fractures, receiving advice labeled as either algorithmic or human-generated. Across both experiments, participants showed longer response times when presented with algorithmic advice, suggesting heightened cognitive deliberation. Moreover, evidence accumulation modeling revealed that individuals adopted higher decision thresholds for algorithmic input, indicating a more cautious decision strategy. Notably, this hesitancy persisted regardless of whether human advice came from laypeople (Experiment 1) or expert radiologists (Experiment 2). Importantly, we found no differences in accumulation rates or prior preferences between advisor types, suggesting that algorithm aversion stems not from lower perceived reliability, but from a strategic shift in response caution. These findings underline that human-AI interaction is shaped by subtle cognitive dynamics, and that successful human-AI teaming must account for these strategic biases. By investigating the decision processes underlying algorithm aversion, our work offers a foundation for developing AI systems that better support human-AI complementarity in high-stakes environments.
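A toy drift-diffusion simulation (parameters are arbitrary assumptions, not fitted values from these experiments) illustrating the reported pattern: with the accumulation rate held fixed, raising only the decision threshold lengthens response times, consistent with heightened response caution toward algorithmic advice.

```python
import numpy as np

def simulate_ddm(drift=0.3, threshold=1.0, noise=1.0, dt=0.01, n=5_000, seed=0):
    """Toy DDM: evidence starts at 0 and accumulates until it hits +/- threshold.
    Returns mean response time and the fraction of +threshold ('correct') hits."""
    rng = np.random.default_rng(seed)
    rts, correct = [], []
    for _ in range(n):
        x, t = 0.0, 0.0
        while abs(x) < threshold:
            x += drift * dt + noise * np.sqrt(dt) * rng.standard_normal()
            t += dt
        rts.append(t)
        correct.append(x > 0)
    return np.mean(rts), np.mean(correct)

# Same drift rate, different response caution toward the two advice sources.
for label, a in [("human advice (lower threshold)", 0.8),
                 ("algorithmic advice (higher threshold)", 1.2)]:
    rt, acc = simulate_ddm(threshold=a)
    print(f"{label}: mean RT = {rt:.2f} s, accuracy = {acc:.2f}")
```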
Challenges and Opportunities for AI-driven Decision Support Tools for Emergency Management
Samsara Foubert, Carnegie Mellon University
Emergency Management (EM) professionals prepare for and respond to all hazards in every institutional context, from private companies to municipal governments to entities serving millions of individuals. Though natural disasters and other hazards remain a persistent threat, American public sector EM programs have faced increasing funding constraints across all levels of government. AI-driven software tools for EM present a significant opportunity to help complement EM capabilities; however, the few existing commercial products have limited functionality and are too costly for most public sector EM professionals. To better understand the needs of EM professionals and the opportunities for AI-driven tools to facilitate EM decision-making, we conducted semi-structured interviews with EM professionals (n=33) across the U.S. From these interviews we identified key decision-making scenarios EM professionals face in their work, common pain points and gaps in existing EM software tools, and potential use cases for AI in EM contexts. Insights from these interviews also informed the design of a building damage assessment (BDA) decision-making task hosted on a custom-built web application. To evaluate our prototype, we conducted a pilot user study with Florida-based EM professionals (n=14) as part of a module in a live Tabletop Exercise recreating Hurricane Michael. Findings from this activity highlight how these professionals perceive the prototype experience and reveal important caveats to the feasibility of conducting BDA with aerial drone imagery. These insights, along with the EM AI use cases surfaced in the interview study, provide valuable guidance for prototyping and designing AI-driven decision support tools for EM.
Generative Artificial Intelligence as a Cognitive Partner for Human Responders
David Hagmann, The Hong Kong University of Science and Technology
Surveys are widely used in research, organizations, and policymaking to gather opinions and evaluations. However, data quality often suffers when respondents misunderstand questions or provide vague, incomplete answers to open-ended prompts. Here, we present data from two settings involving performance-related feedback. In our first study, undergraduate students evaluated their course instructors and provided open-ended comments about their strengths and weaknesses. After their initial response, ChatGPT dynamically created follow-up questions seeking clarification and subsequently integrated the answers to those questions with the original response to provide a revised comment. We find that across algorithmic ratings and evaluations by a sample of teachers, the AI-assisted response scored more favorably on a wide range of measures, including actionability and trustworthiness. In a second study, we replicate our findings in the context of performance evaluations. Participants in managerial roles were asked to give feedback aimed at an employee who is difficult to work with. AI generated follow-up questions and integrated the response to provide revised feedback. We again use algorithmic measures of actionability, concreteness, and conversational receptiveness to compare the baseline and AI-assisted response, and recruit participants to evaluate the messages on the same dimensions as in Study 1. While much current research focuses on using AI to provide feedback, here we apply it to personalized settings where humans have more information than AI (e.g., experience with an instructor's class) and show that AI has a role in helping people express their thoughts in ways that make them more actionable for the recipient.
Who Thinks This Is Hard?: An Estimate of Decision Difficulty is the First Step Towards Individualized AI Assistance
Theodros Haile, University of Groningen
AI tools offer the promise of individualized training and support in a variety of tasks. But to determine when and what kind of assistance to provide, such tools must gauge, for each action within the overall task environment, the level of subjective difficulty for each individual. This becomes more challenging in dynamic environments, where many small-scale decisions or actions constantly re-configure the decision space and contribute to a long-term goal, but no single action determines the outcome or can even be objectively labeled as "correct" or "incorrect". Here, we attempt to map the difficulty of decisions in an example complex task: the video game Tetris. Expert players rely on intuitive assessment of multiple visual features to make decisions, a process which has been successfully approximated by cross-entropy reinforcement learning models. We take advantage of the model's ability to evaluate and rate the optimality of all possible actions, at every decision point, to determine not just the best move in a space, but the approximate difficulty of the choice between options. Decisions with a single optimal action rated definitively higher than any other option are considered “easy”, and decisions with multiple plausible options are rated as more “difficult” in proportion to the number of reasonable actions. We look at the incidence rate of easy and difficult decisions, and how it varies across games played by players of different skill levels, to theorize how assistance might be useful, as well as how relative difficulty correlates with player behavior. We seek to further validate this approach by relating it to the subjective experience of task difficulty and to trial-by-trial, EEG-based detection of cognitive events.
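A minimal sketch of this difficulty heuristic under assumed names and an assumed tolerance (the actual model rates Tetris placements; here `action_values` is just a list of ratings): difficulty is operationalized as the number of near-optimal alternatives.

```python
import numpy as np

def decision_difficulty(action_values, tolerance=0.05):
    """Count how many actions are rated within `tolerance` of the best action.
    1 means a single clearly best move ('easy'); larger counts mean the model
    sees several comparably reasonable options ('difficult')."""
    v = np.asarray(action_values, dtype=float)
    return int(np.sum(v >= v.max() - tolerance))

print(decision_difficulty([0.91, 0.20, 0.18, 0.05]))  # -> 1, an easy decision
print(decision_difficulty([0.52, 0.50, 0.49, 0.10]))  # -> 3, a harder decision
```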
Community Notes are Centrist, Not Consensus
Andreas Haupt, Stanford University
Community-driven contextualization systems—such as those implemented on X (formerly Twitter), Meta platforms, and TikTok—are increasingly used to annotate online content with supplementary perspectives. This poster presents research examining the algorithmic foundations of how these systems aggregate user preferences to determine which annotations are shown. The most commonly used algorithm, Community Notes, relies on a one-dimensional latent-factor model to identify a dominant axis of disagreement—typically political orientation in the U.S.—and attempts to make votes predictable by this axis irrelevant to note display. The system is often described as “bridging”: a note is shown only when users who typically disagree come to a consensus. I argue, however, that such a consensus is not necessary. Instead, the algorithm is better understood as approximating the preferences of a hypothetical centrist user. The estimated centrist user has well-defined average utilities, which allow us to interpret the “preferences” of the Community Notes algorithm. Centrism has desirable features: it is robust to selection, and can be seen as a statistical version of the median voter. Finally, I outline how ideas from Community Notes can be used to design novel alignment targets for generative models.
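A schematic version of such a one-dimensional latent-factor model, fit here by plain stochastic gradient descent (the production Community Notes scorer has many additional components, so the hyperparameters and toy data below are assumptions): each rating is modeled as a global intercept plus user and note intercepts plus a product of one-dimensional user and note factors, and the note intercept, i.e., the helpfulness not explained by the dominant axis of disagreement, is what drives display.

```python
import numpy as np

def fit_one_dim_model(ratings, n_users, n_notes, lr=0.05, reg=0.03, epochs=300, seed=0):
    """ratings: list of (user, note, r) with r = 1 (helpful) or 0 (not helpful).
    Model: r_hat = mu + b_user + b_note + f_user * f_note."""
    rng = np.random.default_rng(seed)
    mu, b_u, b_n = 0.0, np.zeros(n_users), np.zeros(n_notes)
    f_u, f_n = rng.normal(0, 0.1, n_users), rng.normal(0, 0.1, n_notes)
    for _ in range(epochs):
        for u, n, r in ratings:
            err = (mu + b_u[u] + b_n[n] + f_u[u] * f_n[n]) - r
            mu -= lr * err
            b_u[u] -= lr * (err + reg * b_u[u])
            b_n[n] -= lr * (err + reg * b_n[n])
            f_u[u], f_n[n] = (f_u[u] - lr * (err * f_n[n] + reg * f_u[u]),
                              f_n[n] - lr * (err * f_u[u] + reg * f_n[n]))
    return b_n, f_n  # note intercepts (display score) and positions on the latent axis

# Toy data: users 0-1 and users 2-3 sit on opposite ends of the latent axis;
# note 0 is rated helpful by both camps, note 1 only by one camp.
ratings = [(0, 0, 1), (1, 0, 1), (2, 0, 1), (3, 0, 1),
           (0, 1, 1), (1, 1, 1), (2, 1, 0), (3, 1, 0)]
b_note, f_note = fit_one_dim_model(ratings, n_users=4, n_notes=2)
print("note intercepts:", np.round(b_note, 2))  # note 0 should score higher than note 1
```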
Algorithm Adoption and Explanations: An Experimental Study on Self and Other Perspectives
Zezhen (Dawn) He, Massachusetts Institute of Technology
People are reluctant to follow machine-learning recommendation systems. To address this, research suggests providing explanations about the underlying algorithm to increase adoption. However, the degree to which adoption depends on the party impacted by a user’s decision (the user vs. a third party) and whether explanations boost adoption in both settings is not well understood. These questions are particularly relevant in contexts such as medical, judicial, and financial decisions, where a third party bears the main impact of a user’s decision. We examine these questions using controlled incentivized experiments. We design a prediction task where participants observe fictitious objects and must predict their color with the aid of algorithmic recommendations. We manipulate whether (i) a participant receives an explanation about the algorithm and (ii) the impacted party is the participant (Self treatment) or a matched individual (Other treatment). Our findings reveal that, in the absence of explanations, algorithmic adoption is similar regardless of the impacted party. We also find that explanations significantly increase adoption in Self, where they help attenuate negative responses to algorithm errors over time. However, this pattern is not observed in Other, where explanations have no discernible effect—leading to significantly lower adoption than in Self in the last rounds. These results suggest that further strategies—beyond explanations—need to be explored to boost adoption in settings where the impact is predominantly felt by a third party.
The Effect of AI-Generated Probability Feedback on Human Probability Judgments
Phillip Hegeman, Indiana University Bloomington
As AI becomes further integrated into decision-making contexts, people will increasingly encounter machine-generated judgments that include information about judgment confidence, e.g., probabilities in a classification task. There are many ways for these judgments to be explicitly incorporated in human-AI collaborative systems. However, mere exposure to machine judgments and confidence may shape individuals’ learning and behavior. This study investigates how passively presented ML probability judgments impact human participants’ learning and response behaviors in a medical image classification task. While training to classify white blood cell images as either cancerous (blast cells) or non-cancerous (non-blast cells), participants receive trial-by-trial categorical feedback and, in some conditions, an ML model’s predicted probability that the cell is cancerous. In one condition, ML predictions come from a poorly calibrated model (expected calibration error, ECE = 0.175) exhibiting under-extremity bias (i.e., overly conservative); in another condition, ML predictions come from a recalibrated version of the same model that is better calibrated (ECE = 0.020) and justifiably confident (judgments nearer 0% and 100%) while being equally accurate (~94%). We find that participants in all conditions tend to exhibit over-extreme confidence, but those receiving feedback from the original under-extreme ML model exhibit this bias to a lesser extent than those in either the control (categorical feedback only) or recalibrated ML feedback conditions. Participants receiving the original, conservative ML feedback are more likely to display conservative response patterns using more intermediate probabilities (nearer 50%), while those receiving recalibrated, confident ML feedback are more likely to display extremely confident response patterns. However, we find little evidence of condition differences in the degree of agreement (e.g., intraclass correlation) between participants’ judgments and the ML models’, suggesting that passive ML exposure influences response behavior less through participants learning to judge like the ML model, and more through mimicking the distributional characteristics of the ML probabilities.
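The calibration figures cited above come from a standard binned estimator; the sketch below shows one common variant of expected calibration error for probabilistic binary predictions (the bin count and equal-width binning are assumptions):

    import numpy as np

    def expected_calibration_error(probs, labels, n_bins=10):
        """Binned ECE for probabilistic binary predictions.

        probs:  predicted probability of the positive class (here, "blast cell").
        labels: true binary labels (1 = blast, 0 = non-blast).
        Within each equal-width bin, compare the mean predicted probability with
        the observed positive rate; ECE is the bin-size-weighted average gap.
        """
        probs = np.asarray(probs, dtype=float)
        labels = np.asarray(labels, dtype=float)
        bin_ids = np.minimum((probs * n_bins).astype(int), n_bins - 1)
        ece = 0.0
        for b in range(n_bins):
            mask = bin_ids == b
            if not mask.any():
                continue
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap
        return ece

    # A conservative model (predictions pulled toward 50%) can be accurate yet
    # poorly calibrated; recalibration shrinks the gap without changing accuracy.
    print(expected_calibration_error([0.6, 0.6, 0.4, 0.4], [1, 1, 0, 0]))  # ~0.4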
Vulnerability of Text-Matching in ML/AI Conference Reviewer Assignments to Collusions
Jhih-Yi Hsieh, Carnegie Mellon University
In the peer review process of top-tier machine learning (ML) and artificial intelligence (AI) conferences, reviewers are assigned to papers through automated methods. These assignment algorithms consider two main factors: (1) reviewers' expressed interests, indicated by their bids for papers, and (2) reviewers' domain expertise, inferred from the similarity between the text of their previously published papers and the submitted manuscripts. A significant challenge these conferences face is the existence of collusion rings, where groups of researchers manipulate the assignment process to review each other's papers, providing positive evaluations regardless of their actual quality. Most efforts to combat collusion rings have focused on preventing bid manipulation, under the assumption that the text-similarity component is secure. In this presentation, we will show that even in the absence of bidding, colluding reviewers and authors can exploit the machine-learning-based text-matching component of reviewer assignment used at top ML/AI venues to be assigned to their target papers. We also highlight specific vulnerabilities within this system and offer suggestions to enhance its robustness.
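To make the attack surface concrete, here is a toy sketch of the text-matching step; the TF-IDF representation and function names are simplifying assumptions (production systems such as TPMS or the OpenReview expertise model use richer text representations, but the shape is the same: score reviewer-submission pairs by textual similarity, then assign high-scoring pairs):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def affinity_matrix(reviewer_profiles, submissions):
        """reviewer_profiles: one string per reviewer (past papers, concatenated).
        submissions: one string per submitted manuscript.
        Returns a (num_reviewers x num_submissions) similarity matrix that the
        downstream assignment optimizer consumes."""
        vectorizer = TfidfVectorizer(stop_words="english")
        docs = vectorizer.fit_transform(list(reviewer_profiles) + list(submissions))
        reviewer_vecs = docs[: len(reviewer_profiles)]
        submission_vecs = docs[len(reviewer_profiles):]
        return cosine_similarity(reviewer_vecs, submission_vecs)

    # The vulnerability discussed above: a colluding author/reviewer pair can
    # inflate their entry in this matrix (e.g., by planting distinctive phrasing
    # in both the reviewer's profile papers and the target submission), steering
    # the assignment without ever touching the bidding system.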
AI Familiarity and Its Effect on Charitable Giving in AI-Mediated Messaging
Tsung-Tien Hsiung, Cornell University
Artificial intelligence (AI) is increasingly used to craft persuasive messages, but disclosure of AI authorship often decreases willingness to donate—a phenomenon we call AI donation infavorism. This study examines whether AI familiarity and language background moderate this effect. We recruited 69 participants (61% female, 43 non-native English speakers) who rated donation appeals that varied by authorship (human-written vs. AI-generated) and disclosure (informed vs. unaware). Results replicated prior work: disclosure of AI involvement reduced persuasiveness compared to undisclosed appeals. Importantly, higher AI familiarity mitigated infavorism when authorship was hidden, but amplified it when authorship was disclosed, suggesting that familiarity enhances detection ability and raises evaluative standards. Language background also shaped responses: non-native English speakers showed less infavorism than native speakers, indicating weaker sensitivity to AI authorship. Together, these findings highlight the dual role of AI familiarity—facilitating unconscious acceptance when authorship is hidden, but strengthening bias when revealed—and demonstrate how nativeness influences receptivity to AI-mediated persuasion.
Decision-Making in Human-AI Teams: Learning from AI Failures in Scheduling
Talha Özüdoğru, Utrecht University
Christian P. Janssen, Utrecht University
Silja Renooij, Utrecht University
Leendert Van Maanen, Utrecht University
At Dutch Railways, human planners collaborate with an AI-assisted planning system to solve complex scheduling challenges. However, inappropriate reliance on AI remains a major obstacle to achieving joint performance that neither humans nor AI systems can achieve alone. Miscalibration and inappropriate reliance stem not only from technical limitations of AI but also from users’ cognitive biases and expectations. As such systems become more deeply embedded in operational workflows, it is crucial to understand how users develop trust in AI-generated recommendations, calibrate their reliance, and integrate them into strategic decision-making. This study investigates how the decision-making processes of human-AI teams differ from those of all-human teams in the scheduling domain, using a game based on the bin packing problem. It examines the role of domain and professional expertise and the impact of witnessing AI failures. A series of controlled experiments was conducted to examine how participants evaluate solutions to a scheduling problem purportedly generated by either an AI or a human teammate. The first two experiments involved university students classified as experts and novices; a third, ongoing experiment uses a similar setup with professional railway planners. Experiment 1 showed algorithm aversion: after witnessing a failure, AI advice was accepted at significantly lower rates, especially among novices, whereas experts showed a smaller reduction. Experiment 2 replicated the first with minor changes and a broader participant pool. Experiment 3 focuses on the effect of professional expertise on the decision-making process of human-AI teaming in scheduling optimization. Findings from the three experiments, and how domain and professional expertise as well as witnessing AI failures influence the joint decision-making process, will be discussed in detail.
Designing Concept-Based Systems for Human-AI Complementarity in Decision-Making
Naveen Raman, Carnegie Mellon University
Despite the growth of large language model (LLM) agents, their behavior remains opaque, limiting human ability to understand and reason about their rationales. Concept-based explainability offers a solution to this problem by framing model decisions through the lens of human-understandable “concepts”, so that agent actions can be explained through concept predictions. For example, a prediction of “cherry” for an image classification task could be explained through the concepts “red” and “circular” being present. These concepts help trace and localize prediction errors and allow users to intervene on individual concepts. Such an approach can boost human-AI complementarity because human and AI information can be combined via concepts. While concept-based systems are an alluring foundation for designing human-AI systems, little work has rigorously investigated how to structure them to enable human-AI collaboration. In this work, we investigate how the choice of underlying concepts (e.g., “red” and “circular”) impacts human-AI complementarity. Selecting concept sets is difficult because a large bank of potential concepts must be reduced to a few chosen ones. We begin by showing that concepts are equivalent to interpretable state abstractions. By leveraging the machinery of state abstractions, we show that when sets of concepts are near-perfect state abstractions, they naturally encourage human-AI complementarity. This result holds because near-perfect state abstractions reduce the state space and best allow for human intervention on individual concepts. Based on this idea, we develop a set of algorithms to guide concept selection. To bring these ideas into practice, we propose two future directions: (1) the development and publication of a set of tools that help practitioners dynamically select and design concept-based systems for real-world use cases, and (2) a set of real-world trials investigating how concept-based systems impact the performance of human-AI teams.
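A minimal concept-bottleneck sketch (the concept names, weights, and function names below are hypothetical, not the authors' system) illustrates how predictions route through concepts and how a human can intervene on a single concept:

    import numpy as np

    # Concept probabilities would normally come from a learned concept predictor
    # over image features; here they are supplied directly for brevity.
    CONCEPTS = ["red", "circular", "has_stem"]
    LABELS = ("cherry", "banana")
    LABEL_WEIGHTS = np.array([[3.0, 0.5, 0.2],    # cherry
                              [-1.0, 0.2, 1.0]])  # banana

    def predict_label(concepts):
        """Linear head over concepts only: the label never sees raw pixels,
        so every prediction can be explained by the concept values."""
        scores = LABEL_WEIGHTS @ np.array([concepts[c] for c in CONCEPTS])
        return LABELS[int(np.argmax(scores))]

    def intervene(concepts, corrections):
        """Human overrides individual concept predictions, e.g. {"red": 1.0}."""
        return {**concepts, **corrections}

    # The concept predictor under-detects "red"; the human corrects that single
    # concept, and only the cheap concept-to-label head is re-run.
    concepts = {"red": 0.1, "circular": 0.3, "has_stem": 0.8}
    print(predict_label(concepts))                           # "banana"
    print(predict_label(intervene(concepts, {"red": 1.0})))  # "cherry"

The design choice this illustrates is the one the abstract argues for: because human and AI information meet at the concept layer, errors can be localized to individual concepts and fixed there, which is what makes the selection of the concept set itself so consequential.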
Choosing AI for Behavior Change
Michael Sobolev, University of Southern California
This talk presents findings from the QuitChoice trial, which evaluated user preferences and selection of digital interventions for smoking cessation, including mobile apps and AI chatbots. While over 500 evidence-based smoking cessation apps are currently available in app stores, little is known about how individuals choose among them. Understanding these choices can reveal user preferences, enhance the likelihood of behavior change, and support patient-centered care. Among 322 current smokers enrolled in the trial, a majority (84%) reported prior experience using AI tools such as ChatGPT for everyday tasks, yet only 2% had used AI to support smoking cessation. When asked to evaluate specific features of cessation apps, participants were 76% more likely to prefer interacting with an AI chatbot than with a human coach. The trial compared preferences across three distinct, evidence-based digital interventions: QuitStart (a mobile app), QuitBot (an AI chatbot), and the EX Program (a web-based platform with text messaging). These options were purposefully selected to vary by delivery modality. Participants most frequently chose the mobile app (53%), followed by the AI chatbot (35%) and the web-based program (12%). We will present data on how initial preferences for app features correspond to intervention choice, as well as user engagement, perceived usefulness, and ease of use over a two-month follow-up. These findings provide new insights into how individuals choose AI-based tools for behavior change and have implications for the design of future digital interventions.
