Yizhou Lyu’s Final Project Report

Title: Can large language models understand and evaluate hostile attribution biases?

Introduction

Historically, machines have been built to be cold, unbiased, and objective. Cognitive biases, systematic distortions of perception and judgment that shape an individual's subjective experience, were long thought to be uniquely human. Among these biases, hostile attribution bias refers to the tendency to interpret social situations as intentionally hostile. For example, imagine two people walking down a crowded street and bumping into each other as they pass. One of them might interpret the collision as an accident caused by the crowd, while the other might perceive it as a deliberate hostile act intended to hurt him. The second person shows higher hostile attribution bias: his appraisal system consistently biases his social judgments (Epps and Kendall, 1995; Dodge, 2006; Klein Tuente et al., 2019). Unlike humans, machines are commonly believed to be unbiased: when making judgments, they analyze all the available evidence and information. In this project, we examine large language models, artificial intelligence systems extensively trained on human-generated text that have demonstrated great potential in understanding and generating human-like text. We ask whether these models can also understand and evaluate cognitive biases in human-like ways. Specifically, this project tests large language models' ability to simulate, or take the perspective of, people with different levels of hostile attribution bias when responding to social situations, and their ability to detect the presence of hostile attribution bias in those responses.

To examine large language models' proficiency in understanding and evaluating cognitive biases, we used two auto-regressive language models: Stable Beluga 13B and OpenAI's GPT-3.5 turbo. We prompted both models to attribute causes to 15 realistic social scenarios from the Ambiguous Intentions Hostility Questionnaire (AIHQ; Combs et al., 2007), emulating the perspectives associated with high, moderate, and low hostile attribution bias. We then fed these responses back to the models and prompted them to rate the hostility level of each generated response on a 1 to 5 scale. Both models understood the concept of hostile attribution bias with little instruction or prompting: the generated responses exhibited the intended levels of hostile attribution bias, and the models' hostility ratings of these responses aligned with how much hostility each response was meant to exhibit. This finding suggests a promising application of large language models in clinical settings, where they could automate the assessment of cognitive biases or gauge the severity of various psychological conditions, extending beyond hostile attribution bias.

Methods

Language models. Stable Beluga 13B and GPT-3.5 turbo are both auto-regressive language models, meaning that they are trained to generate text by predicting the next word from the words before it. Both are trained on large corpora of text sourced from the internet and written articles and are designed to generate human-like text based on the prompts they receive. Stable Beluga 13B is a Llama 2 13B model fine-tuned on an Orca-style dataset, which improves the model's ability to follow instructions and generate safe, helpful responses aligned with human intentions. OpenAI's GPT-3.5 turbo is a member of the Generative Pre-trained Transformer family that balances performance and speed, responding faster than its predecessors while maintaining competent performance. The two models were chosen for comparison because they share the same fundamental next-word prediction approach and both perform reliably at generating and classifying text; differences in their training data and fine-tuning procedures might nevertheless produce variations in performance, making the comparison interesting. For Stable Beluga 13B, we used the model available on Huggingface; for GPT-3.5 turbo, we used the completion API available from OpenAI.
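For readers who want to reproduce the setup, the sketch below shows one way the two models could be queried. The Huggingface model identifier, the prompt template, the decoding parameters, and the helper names (query_beluga, query_gpt35) are illustrative assumptions, not the exact code used in this project.

```python
# Minimal sketch of querying the two models (identifiers, prompt template,
# and decoding parameters are illustrative assumptions, not the exact setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from openai import OpenAI

BELUGA_ID = "stabilityai/StableBeluga-13B"  # assumed Huggingface model id
tokenizer = AutoTokenizer.from_pretrained(BELUGA_ID)
beluga = AutoModelForCausalLM.from_pretrained(
    BELUGA_ID, torch_dtype=torch.float16, device_map="auto"
)

def query_beluga(system: str, user: str, max_new_tokens: int = 300) -> str:
    """Query Stable Beluga 13B with an instruction-style prompt format."""
    prompt = f"### System:\n{system}\n\n### User:\n{user}\n\n### Assistant:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(beluga.device)
    out = beluga.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=True, temperature=0.7
    )
    # Return only the newly generated tokens, not the echoed prompt.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_gpt35(system: str, user: str) -> str:
    """Query GPT-3.5 turbo through the chat completions endpoint."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
        temperature=0.7,
    )
    return resp.choices[0].message.content
```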

AIHQ scenarios. The AIHQ comprises 15 negative social situations that differ in their level of intentionality: 5 accidental, 5 ambiguous, and 5 intentional scenarios (Table 1). For each situation, respondents answer an open-ended hostile attribution question: "What do you think was the real reason why the social situation happened?" Two human raters are typically recruited to rate these open-ended responses for the hostility of the attributed intent on a scale of 1 to 5, with "1" given to responses that appraise the situation as an accident and "5" given to responses that appraise it as a purposeful action with harmful intentions. Previous research has found that AIHQ scores are positively correlated with paranoia and hostility, two core aspects of hostile attribution bias (Combs et al., 2007).

Accidental
A friend of yours slips on the ice, knocking you onto the ground.
While walking outside during the rain, a car swerves to avoid hitting a cat, and drives into a puddle, splashing water onto you.
You are on a bus sitting in an aisle seat. A person gets on the bus at the next stop, begins walking as the bus moves, and steps on your foot.
You're dancing at a club, and someone bumps into you from behind.
You're at a bar watching a football game and having a drink. Suddenly, the home team scores, people begin to cheer, and someone hits your arm, spilling the drink onto your clothes.

Ambiguous
You've been at a new job for three weeks. One day, you see one of your new co-workers on the street. You start to walk up to this person and start to say hello, but she/he passes by you without saying hello.
You have an appointment with an important person. When you arrive at your appointment, the secretary informs you that the person is not in; they took the day off.
You walk past a bunch of teenagers at a mall and you hear them start to laugh.
You are supposed to meet a new friend for lunch at a restaurant, but she/he never shows up.
You call a friend and leave a message on their answering machine, asking them to call you back. One week passes and they have not called you back.

Intentional
Your neighbors are playing loud music. You knock on the door and ask them to turn it down. Fifteen minutes later, the music is loud again.
While driving, the person in the car behind you honks their horn and then cuts you off.
You've been looking for a parking spot for a while when you see one up ahead. You put your signal on, proceed toward the spot, but someone passes your car and takes the parking space.
A day before meeting someone for a date, she/he calls to cancel. This is the third straight time they've done that.
Someone jumps in front of you on the grocery line and says, "I'm in a rush."

Table 1. AIHQ scenarios. The 15 AIHQ scenarios used in this project, grouped by level of intentionality.

Generating. We aimed to probe the capabilities of Stable Beluga 13B and GPT-3.5 turbo in generating responses from the perspective of individuals with different levels of bias. We prompted both models to attribute causes to each of the 15 AIHQ scenarios from the perspective of individuals with high, moderate, and low hostile attribution bias. The prompt gave a brief definition of what it means to have different levels of hostile attribution bias; for example, high hostile attribution bias refers to someone who is likely to see others' actions as having hostile intentions, while low hostile attribution bias refers to someone who is likely to see others' actions as benign. One scenario was given to the model in each query, and we prompted the models to think, act, and respond in the manner of an individual with high, moderate, or low hostile attribution bias. The full prompts are included in Table S1.1 and Table S1.2 in the Appendix.
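As a rough illustration of this procedure, the sketch below loops over scenarios and bias levels. The prompt wording here is a paraphrase of the definitions above, not the exact text in Tables S1.1 and S1.2, and query_gpt35 / query_beluga are the assumed helpers from the earlier sketch.

```python
# Sketch of the generation loop. Prompt wording is a paraphrase of the
# definitions above, not the exact prompts in Tables S1.1 / S1.2.
SCENARIOS = [
    "A friend of yours slips on the ice, knocking you onto the ground.",
    # ... the remaining 14 AIHQ scenarios from Table 1
]

BIAS_DEFINITIONS = {
    "high": "is likely to see others' actions as having hostile intentions",
    "moderate": "sometimes sees others' actions as hostile and sometimes as benign",
    "low": "is likely to see others' actions as benign",
}

N_PER_CELL = 30  # responses generated per scenario per bias level

def generation_prompt(scenario: str, level: str) -> tuple[str, str]:
    system = (
        f"You are a person with {level} hostile attribution bias, meaning you "
        f"{BIAS_DEFINITIONS[level]}. Think, act, and respond as this person would."
    )
    user = (
        f"Scenario: {scenario}\n"
        "What do you think was the real reason why this happened, "
        "and what would you do?"
    )
    return system, user

generated = []
for scenario in SCENARIOS:
    for level in BIAS_DEFINITIONS:
        for _ in range(N_PER_CELL):
            system, user = generation_prompt(scenario, level)
            reply = query_gpt35(system, user)  # or query_beluga(system, user)
            generated.append({"scenario": scenario, "level": level, "response": reply})
```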

Evaluating. We then examined Stable Beluga 13B and GPT-3.5 turbo's abilities to evaluate the level of hostile attribution bias in the generated responses. Because GPT-3.5 turbo performed better at emulating individuals with different levels of hostile attribution bias, we rated only the GPT-3.5 turbo generated responses. GPT-3.5 turbo generated 30 responses to each of the 15 scenarios at each of the 3 levels of hostile attribution bias. The prompt gave a brief description of what type of response should be rated as high or low in hostile attribution, similar to the instructions given to human raters. We prompted the models to output only a numerical answer from 1 to 5, with full prompts included in Table S2 in the Appendix.
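The sketch below illustrates this rating step, building on the generation sketch above. The rating prompt is an illustrative paraphrase of Table S2, and the regex fallback is our own safeguard for cases where a model returns extra text around the number.

```python
# Sketch of the rating step. The rating prompt paraphrases Table S2; the
# regex parse is a safeguard for outputs that include extra text.
import re
from typing import Optional

RATING_SYSTEM = (
    "You rate how hostile the attribution in a response is, on a scale from 1 "
    "(the situation is appraised as an accident) to 5 (the situation is "
    "appraised as a purposeful action with harmful intentions). "
    "Answer with a single number from 1 to 5 and nothing else."
)

def parse_rating(text: str) -> Optional[int]:
    """Extract the first digit 1-5 from the model output, if any."""
    match = re.search(r"[1-5]", text)
    return int(match.group()) if match else None

for item in generated:  # the list built in the generation sketch above
    user = f"Scenario: {item['scenario']}\nResponse: {item['response']}"
    item["gpt_rating"] = parse_rating(query_gpt35(RATING_SYSTEM, user))
    item["beluga_rating"] = parse_rating(
        query_beluga(RATING_SYSTEM, user, max_new_tokens=5)
    )
```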

Results

For response generation, Stable Beluga 13B and GPT-3.5 turbo were both able to output sensible responses to the social situations that exhibited different levels of hostile attribution bias. We examine the responses generated for the scenario "A friend of yours slips on the ice, knocking you onto the ground" as an example (Table 2).

High hostile attribution bias

Stable Beluga 13B
Reason: The friend's action may have accidentally led to you falling, causing concern about a betrayal or deceit.
What you would do: Your defensiveness would likely cause you to react negatively, such as yelling at your friend or believing they intentionally knocked you down.

GPT-3.5 turbo
Reason: My friend purposely pushed me onto the ground because they wanted to hurt me.
What I would do: I would confront my friend and express my anger towards them for deliberately pushing me onto the ground. I may even consider ending our friendship because of their hurtful actions.

Moderate hostile attribution bias

Stable Beluga 13B
Reason: You might interpret your friend's slip on the ice as them doing it intentionally to hurt you or get revenge for something you might have said in the past.
What you would do: First, you would take a moment to assess the situation and the possible reasons for your friend slipping. If it appears to be an accident and not malicious, then you would apologize for misunderstanding their intentions and assist them in getting up, offering to help walk them home or wherever they need to go. If it does seem to be a malicious act, you would confront your friend and address the issue at hand.

GPT-3.5 turbo
Reason for why it happened: There could be many reasons for this situation. Perhaps my friend didn't see the ice and slipped accidentally. Or maybe they were trying to be funny and pushed me as a joke but didn't realize the ice was slippery.
What I would do: I would first check to make sure my friend is okay and not hurt. Then, I would ask them if they meant to do that or if it was an accident. If they apologize and explain it was an accident, I would brush it off and laugh it off with them. If they were trying to be funny, I would playfully scold them for not being more careful and then also laugh it off. Either way, I would make sure to check in with my friend and make sure they are okay before moving on.

Low hostile attribution bias

Stable Beluga 13B
Reason: The ice is slippery, causing the accident.
What you would do: Offer help and support to your friend, asking if they need assistance getting up or calling for medical attention if necessary. Once your friend is stable, you might also reassure them that it was an accident and it's important not to blame oneself for such accidents. Furthermore, you could propose the idea of contacting the relevant authorities or local authorities about the dangerous road conditions.

GPT-3.5 turbo
Reason: My friend slipped on the ice, causing us both to fall.
Response: In this situation, I would assume that my friend slipped accidentally and did not have any intent to harm me. I would first check to make sure my friend is okay and then help them up. I would also ask if they need any assistance or medical attention. I would not be angry or upset with my friend because I understand that accidents happen.
Table 2. Stable Beluga 13B and GPT-3.5 turbo generated responses. Sample responses to the scenario "A friend of yours slips on the ice, knocking you onto the ground". Each response was randomly picked from the 30 responses generated for this scenario at each of the three levels of hostile attribution bias.

The responses generated by Stable Beluga 13B sometimes showed an inconsistent portrayal of individuals with different degrees of hostile attribution bias. For example, when the model was prompted to respond from the perspective of an individual with high hostile attribution bias, it began by suggesting that the friend's action might have "accidentally" led to the fall, an appraisal characteristic of low hostile attribution bias that does not align with a high-bias perspective. This initial low-bias attribution was also incongruous with the subsequent sentences of the response, which discussed the possibility of betrayal and intention behind the friend's action, characteristics of a person with high hostile attribution bias. The contradiction demonstrates a lack of consistency in the model's understanding of how people with different levels of hostile attribution bias would react to hypothetical social situations, and suggests that Stable Beluga 13B may struggle to maintain a coherent representation of mental states or psychological perspectives, particularly when a nuanced understanding of different levels of cognitive bias is required.

In contrast, the responses generated by GPT-3.5 turbo demonstrated a consistent adherence to the beliefs and behaviors expected from people with different levels of hostile attribution bias. The output for a high hostile attribution bias perspective used phrases like "purposely", "they wanted to hurt me", and "deliberately pushing", which directly align with the thinking style of a person who tends to make hostile attributions. The low hostile attribution bias counterpart started by assuming that the friend "slipped accidentally" and "did not have any intent to harm me", conveying a non-hostile interpretation of the situation. Beyond simply classifying the situation as intentional or hostile, GPT-3.5 turbo also suggested plausible subsequent actions in line with the specified level of bias: it proposed ending the friendship over the hurtful action in the high hostile attribution bias response, whereas it suggested offering medical assistance to the friend in the low hostile attribution bias response. These responses mirror realistic human reactions to the situation, indicating a sophisticated ability to simulate the perspective and behavior of people with different levels of hostile attribution bias.

In evaluating the responses, both Stable Beluga 13B and GPT-3.5 turbo showed comparable reliability in assessing the level of hostile attribution bias exhibited by the responses to the social scenarios (Fig. 1). Both consistently assigned significantly higher scores, indicating a higher level of hostile attribution bias, to responses intended to simulate the perspective of people with high hostile attribution bias; this also validates the effectiveness of GPT-3.5 turbo in generating distinguishable responses. However, neither model accurately distinguished between moderate and low levels of hostile attribution bias, often assigning similar ratings to both. This could be attributed to GPT-3.5 turbo's tendency to produce moderate-level and low-level responses that closely resemble one another in language and sentiment. The example above, where a friend slipped on the ice, illustrates this point: both the moderate and the low hostile attribution bias responses used phrases like "slipped accidentally" and "check to make sure my friend is okay", indicating similarly benign attributions.

Fig. 1. Hostile attribution bias ratings of the responses generated by GPT-3.5 turbo. A. Ratings of the GPT-3.5 turbo generated responses by Stable Beluga 13B. B. Ratings of the same GPT-3.5 turbo generated responses by GPT-3.5 turbo.
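As a minimal sketch of the analysis behind Fig. 1, the snippet below aggregates the ratings by intended bias level, assuming the records produced by the rating sketch above. The specific comparison test (Mann-Whitney U on high versus low) is our choice for illustration, not necessarily the analysis reported here.

```python
# Sketch of the analysis behind Fig. 1: mean rating per intended bias level,
# plus a simple test of whether high-bias responses were rated higher.
# Assumes the `generated` records from the rating sketch above.
import pandas as pd
from scipy.stats import mannwhitneyu

df = pd.DataFrame(generated)

# Mean hostility rating assigned by each rater model at each intended level.
print(df.groupby("level")[["gpt_rating", "beluga_rating"]].mean())

# Do high-bias responses receive higher ratings than low-bias responses?
high = df.loc[df["level"] == "high", "gpt_rating"].dropna()
low = df.loc[df["level"] == "low", "gpt_rating"].dropna()
stat, p = mannwhitneyu(high, low, alternative="greater")
print(f"High vs. low (GPT-3.5 turbo ratings): U = {stat:.1f}, p = {p:.3g}")
```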

Discussion

In this project, we explored the capability of large language models to both emulate human-like biases in responses to social situations and evaluate the biases within the generated responses. Using the text generation functionality of Stable Beluga 13B and GPT-3.5 turbo, we found that both models understood the task with minimal instruction, successfully generating and evaluating responses in a way analogous to human reasoning. Compared with Stable Beluga 13B, GPT-3.5 turbo generated responses that more accurately and consistently reflected the intended levels of hostile attribution bias when ascribing reasons to social events. Both models were equally proficient at classifying the level of bias in the responses on a 1 to 5 scale. These results highlight the potential for large language models to take over the role of human raters in scoring open-ended questions in questionnaires like the AIHQ. More broadly, one could consider using language models to identify other cognitive biases or psychological disorders, expanding beyond the hostile attribution bias studied in this project.

Certain limitations of this study should be noted, and they provide guidance for future research. One significant limitation is that we did not evaluate the language models on responses written by real people. In this project, we used two large language models to rate responses that were themselves generated by a language model, which may not accurately reflect the models' ability to assess human-written content, even though the machine-generated responses seem plausible and human-like. To more accurately assess the models' proficiency in clinical settings, I contacted and met with Dr. Dawn Neumann, who agreed to share a dataset containing 80 subjects' AIHQ responses, along with ratings of these responses given by two human raters. Unfortunately, the Data Use Agreement process has been prolonged, and I anticipate gaining access to the data in January. By using large language models to evaluate actual human responses, I can directly compare the model ratings with the human ratings to see how well a language model resembles human raters in evaluating attribution bias. A strong correlation between the two would enhance the validity and credibility of using large language models as substitutes for human raters in scoring open-ended AIHQ questions.
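A minimal sketch of this planned comparison is shown below. The file name and column names are hypothetical placeholders for the dataset from Dr. Neumann's lab, which has not yet been received; Pearson and Spearman correlations are one reasonable choice of agreement measures.

```python
# Sketch of the planned comparison with human raters. The file name and
# column names are hypothetical placeholders for the unreceived dataset.
import pandas as pd
from scipy.stats import pearsonr, spearmanr

human_df = pd.read_csv("aihq_human_ratings.csv")  # hypothetical file
# Assumed columns: "human_rating" (mean of the two human raters) and
# "model_rating" (the language model's 1-5 rating of the same response).
paired = human_df.dropna(subset=["human_rating", "model_rating"])

r, p_r = pearsonr(paired["human_rating"], paired["model_rating"])
rho, p_rho = spearmanr(paired["human_rating"], paired["model_rating"])
print(f"Pearson r = {r:.2f} (p = {p_r:.3g}); Spearman rho = {rho:.2f} (p = {p_rho:.3g})")
```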

Going forward, the potential applications of large language models may extend beyond clinical settings. Given their demonstrated proficiency in simulating and evaluating cognitive biases like hostile attribution bias, these models could be used in personality assessments and could even be leveraged to devise strategies for mitigating these biases. Ultimately, they could alter the way people perceive and interpret their everyday experiences. For example, large language models could serve as AI chatbots to which individuals recount negative social events from their day and, in turn, receive feedback that challenges and reframes their initial attributions of the causes of those events. Over time, such human-AI interactions could fundamentally change the way certain individuals process and understand social events, potentially diminishing undesirable cognitive biases like extremely high hostile attribution bias and thereby fostering a more positive outlook on life for those seeking change.
