Derek Leben

Associate Teaching Professor
Tepper School of Business
Fall 2024
70-332 Business, Society, and Ethics (14-week course)
Research Question(s):
- To what extent does debating with generative AI (as compared to debating with a peer) impact…
  - a) students’ development of analytical reasoning skills (i.e., argumentation and evaluation)?
  - b) students’ perception of the feedback received?
Teaching Intervention with Generative AI (genAI):
Leben provided suggestions and tips for engaging with an instructor-customized “ethics debate coach” genAI tool (a customized ChatGPT) about arguments written by the students themselves. Next, Leben had students prompt the genAI tool to: a) generate objections to their argumentative paper from both the same and different normative ethical frameworks, and b) engage in debate with them about their arguments.
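The actual tool was a customized ChatGPT configured by the instructor, and its exact instructions are not reproduced here. Purely as an illustration of how such a debate coach might be set up programmatically, the sketch below uses the OpenAI Chat Completions API; the system prompt, model choice, and function names are assumptions for illustration, not Leben’s configuration.

```python
# Illustrative sketch only: a comparable "ethics debate coach" built with the
# OpenAI Chat Completions API. The system prompt, model choice, and names are
# assumptions, not the instructor's actual customized ChatGPT configuration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are an ethics debate coach. The student will paste an argumentative "
    "paper defending a policy using a normative ethical framework. First, "
    "raise objections from the same framework and from a different framework. "
    "Then debate the student's replies, one exchange at a time."
)

def coach_reply(history: list[dict], student_message: str) -> str:
    """Send the running conversation plus the student's latest message."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}] + history + [
        {"role": "user", "content": student_message}
    ]
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content

# Example first turn: the student asks for objections to their draft.
# print(coach_reply([], "Here is my paper: ..."))
```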
Study Design:
Leben taught three course sections, providing the same classroom instruction on leveraging normative ethical frameworks to design policies across all sections. Students in all sections drafted an argumentative paper proposing a policy supported by a normative ethical framework. Then, within each section, Leben randomly assigned students to one of two study conditions. In the genAI treatment condition, Leben implemented the genAI intervention described above. In the peer condition, students worked with peers to elicit objections and engage in debate. The cycle of drafting a paper, receiving feedback, and revising was repeated for two paper assignments, with students remaining in the same study condition throughout. Leben then compared data sources across the two groups (students who debated with a peer vs. with genAI).
Sample size: genAI treatment condition (83 students); peer control condition (86 students)
Data Sources:
- Rubric scores for students’ strength of argumentation and use of evidence (“evaluation”) on the final versions of two major writing assignments (i.e., argumentative papers).
- Post-survey of students’ perceptions of the feedback they received and of the overall feedback-and-revision experience.
Findings:
- RQ1a: Both the peer and genAI groups significantly improved in their argumentation (Figure 1) and evaluation (Figure 2) performance from paper 1 to paper 2. The peer group outperformed the genAI group on argumentation, but not on evaluation. The degree of improvement over time for each skill did not depend on whether students debated with a peer or with genAI.

Figure 1. For the argumentation section of the rubric, there was no significant time × condition interaction, indicating that condition did not affect the degree of improvement, F(1, 167) = 0.33, p = .57. There was a significant main effect of time, F(1, 167) = 18.27, p < .001, ηp² = .10, and a significant main effect of condition, F(1, 167) = 4.98, p = .03, ηp² = .03. Error bars are 95% confidence intervals for the means.

Figure 2. For the evaluation section of the rubric, there was no significant time × condition interaction, indicating that condition did not affect the degree of improvement, F(1, 167) = 0.96, p = .33. There was a significant main effect of time, F(1, 167) = 16.57, p < .001, ηp² = .09, but no significant main effect of condition, F(1, 167) = 0.10, p = .75. Error bars are 95% confidence intervals for the means.
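Figures 1 and 2 report a 2 (condition: peer vs. genAI, between subjects) × 2 (time: paper 1 vs. paper 2, within subjects) mixed-design ANOVA with partial eta squared effect sizes. The report does not state which software produced these statistics; as a hedged sketch, results of this form could be computed from long-format rubric scores with pingouin’s mixed_anova, assuming hypothetical column names and data file.

```python
# Hedged sketch: computing statistics of the kind shown in Figures 1-2
# (mixed ANOVA with F, p, and partial eta squared). Column names and the
# data file are hypothetical; the report does not state the software used.
import pandas as pd
import pingouin as pg

# Long format: one row per student per paper.
# Assumed columns: student_id, condition ("peer"/"genAI"),
# time ("paper1"/"paper2"), argumentation (rubric score).
scores = pd.read_csv("rubric_scores_long.csv")

aov = pg.mixed_anova(
    data=scores,
    dv="argumentation",   # repeat with dv="evaluation" for Figure 2
    within="time",        # paper 1 vs. paper 2
    subject="student_id",
    between="condition",  # peer vs. genAI
)
# The output table includes F, degrees of freedom, p-unc, and np2 (partial
# eta squared) for the condition effect, the time effect, and the interaction.
print(aov[["Source", "DF1", "DF2", "F", "p-unc", "np2"]])
```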
- RQ1b: Regardless of condition or paper assignment, students on average reported high satisfaction and comfort with the feedback they received (a mean rating of approximately 6.25 on a 7-point Likert scale).
Eberly Center’s Takeaway:
RQ1a&b: Students interacting with peers consistently outperformed those working with genAI on argumentation, but not evaluation. This could suggest that genAI feedback, even from a chatbot carefully customized for this purpose, may not be as helpful for argumentation. Student perceptions, however, were similarly positive for both peer and genAI feedback quality. Another possibility is that genAI’s output, which was multiple paragraphs of text, was more overwhelming or less prioritized compared to peers’ spoken feedback. It should be noted that 6 students (~3% of participants) switched treatments between papers but their rubric scores were still included with their originally assigned group, which adds noise to the data. These findings may suggest that using genAI for feedback on specialized ethics arguments is not yet a reasonable alternative to peer-to-peer interactions, even with careful upfront fine-tuning.