Reading Deshpande et al.’s paper on toxicity in ChatGPT has only raised more questions about how ChatGPT arrives at its responses when assigned certain personas. Some results are unsurprising: when ChatGPT takes on the persona of Adolf Hitler, whom people almost universally agree was a bad person, it is no shock that its responses while embodying him are toxic. What is more surprising is when ChatGPT returns negative responses while assigned the personas of people the public perceives more neutrally. Did ChatGPT, as Muhammad Ali, curse out aliens because that was how Muhammad Ali was known to speak? Or was that choice influenced by stereotypes people may hold about black people or about Muhammad Ali himself? Similarly, was Steve Jobs known for disliking the EU, or for holding opinions that might suggest as much, or did ChatGPT simply default to a response in which Steve Jobs would be critical of the EU?
Looking at some of the sample responses in the appendix, it is clear that certain negative responses about a place or group of people are rooted in stereotypes associated with them, but is that because the persona actually subscribed to those stereotypes, or because ChatGPT found that they were the most common complaints about the topic and voiced them in the persona’s style? Furthermore, Deshpande et al. point out that even when ChatGPT is not actively encouraged to respond toxically, and is not predisposed to do so by embodying an unpleasant persona, it can still respond critically about a topic. Does that mean the LLM has found something about the persona that would indicate such an opinion or bias, or might ChatGPT simply tend towards negativity?