Introduction
This is my individual report on the group project that I did with Louisa and Dean. The original presentation can be found here.
Our project attempted to evaluate whether LLMs can adequately recognize and simulate cognitive biases. We tested the models specifically on hostile attribution bias (HAB), but we hoped that our findings would generalize beyond HAB and shed light on LLMs' ability to recognize other types of bias.
Group formation
Louisa, Dean, and I were assigned to work together by Professor Ji. Shortly after the groups were assigned, Louisa approached us with a coherent project proposal and an experiment already in mind. This proposal, on which she had done previous work (a fact our professor was aware of), became the basis of our group project.
We were initially unsure what we could add, since the proposal was already so well thought out. However, we found a sensible division of work: building on Louisa's existing proposal, I chose to write about the potential consequences and implications of the project's experiment, Dean researched the project's background, and Louisa handled the experiment itself.
Project background
Our project tests two LLMs on their ability to recognize and simulate cognitive bias. Cognitive biases are a psychological phenomenon: any pattern of subjective, inaccurate thinking that does not reflect reality. For example, someone who automatically assumes that a phone call must signal an emergency is exhibiting a cognitive bias. Attribution biases are a class of cognitive biases that concern inaccurate interpretations of other people's behavior. Hostile attribution bias (HAB) is a specific attribution bias in which the subject tends to think that others are being hostile towards them, even when their behavior is ambiguous or benign. Children with high levels of HAB are more likely to be aggressive themselves.
We used two models to evaluate HAB: GPT-3.5-Turbo and StableBeluga, a model based on Llama 2. GPT-3.5-Turbo is optimized for conversation, whereas StableBeluga has fewer parameters and is tuned for harmlessness. Both models are considered fairly capable compared to other LLMs.
Methods
We based our evaluations on one fictional scenario, which we posed to both models: a friend slips on ice and knocks you to the ground in the process. We gave each model a primer on what it means to have high, moderate, and low levels of HAB. Then we asked the models to do two things: first, simulate responses to the scenario that demonstrate varying levels of HAB, and second, rate responses that we fed them according to the level of HAB they perceived.
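To make the two tasks concrete, below is a minimal sketch of how the GPT-3.5-Turbo half of this protocol could be driven through the OpenAI Python SDK. The primer wording, the prompts, and the helper names (HAB_PRIMER, simulate_response, rate_response) are illustrative assumptions of mine, not the exact prompts used in our experiment.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative primer; the experiment's actual primer text may differ.
HAB_PRIMER = (
    "Hostile attribution bias (HAB) is the tendency to interpret others' "
    "ambiguous or benign behavior as intentionally hostile. A high-HAB person "
    "assumes hostility; a moderate-HAB person is suspicious but uncertain; "
    "a low-HAB person assumes benign intent."
)

SCENARIO = "A friend slips on ice and knocks you to the ground in the process."


def simulate_response(level: str) -> str:
    """Task 1: ask the model to produce a reaction at a given HAB level."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": HAB_PRIMER},
            {"role": "user", "content": (
                f"Scenario: {SCENARIO}\n"
                f"Write a short first-person reaction that demonstrates a "
                f"{level} level of hostile attribution bias."
            )},
        ],
    )
    return completion.choices[0].message.content


def rate_response(response_text: str) -> str:
    """Task 2: ask the model to rate the HAB level of a given reaction."""
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": HAB_PRIMER},
            {"role": "user", "content": (
                f"Scenario: {SCENARIO}\n"
                f"Reaction: {response_text}\n"
                f"Rate the hostile attribution bias shown in this reaction as "
                f"high, moderate, or low, and briefly explain why."
            )},
        ],
    )
    return completion.choices[0].message.content


if __name__ == "__main__":
    for level in ("high", "moderate", "low"):
        simulated = simulate_response(level)
        print(f"--- {level} HAB simulation ---\n{simulated}\n")
        print(f"Rating of that simulation:\n{rate_response(simulated)}\n")
```

The same two prompts could be sent to StableBeluga through whichever interface hosts it; only the client call would change, not the structure of the protocol.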
Results
Both GPT-3.5-Turbo and StableBeluga performed well on the second task, rating the HAB level of given responses. They were less adept, though still competent, at the first task: they could generate high-HAB responses well, but they sometimes failed to simulate moderate and low levels of HAB as distinct states.
The images below show an example of GPT-3.5-Turbo's response generation. In order, they show the model's attempts at simulating high, moderate, and low HAB responses. As is apparent, the moderate and low HAB simulations are very similar and would not read as distinct in a clinical context.
Repeated attempts to get the models to respond to the scenario yielded similar results: generally promising, but unable to produce sufficiently distinct responses. Despite this shortcoming, we concluded that fine-tuning would be unwarranted and infeasible, given the prohibitive costs and the difficulty of sourcing an appropriate dataset.
Implications of the experiment
We were impressed by the models' ability to rate and simulate HAB, but we were ultimately cautious about applying this ability to real-world scenarios where LLMs remain relatively untested. HAB mostly comes up in clinical contexts, and using LLMs to replace or augment human raters or therapists could be dangerous or misleading to patients in need of help. We also do not believe the models should currently be used for cognitive bias evaluation in judicial or medical settings. However, we were optimistic about more casual, low-stakes uses of this ability, such as detailed, automated personality tests: there is considerable public interest in popular-psychology frameworks like the MBTI, and LLMs provide a unique opportunity to capitalize on that interest and greatly expand the level of detail that personality tests can offer.