Jiaming Feng’s Final Project Report

Project Report: Multimodal Evaluation of LLM’s Standardized Test Performance

For the final project of our Perspectives on LLMs class, Fady Adal, Lihao Sun, Yushu Qiu, and I (Jiaming Feng) worked together to design an evaluation of GPT-4’s ability to answer standardized test questions that involve multimodal capabilities. Fady did background research on the topic; Lihao built the server and gathered data through the OpenAI API; Yushu performed the data analysis; I provided suggestions on evaluation criteria, explanations of our findings, and directions for further research. In this report I will first briefly summarize our group’s findings and results, and then elaborate on several considerations about evaluation and future research that we could not cover in full detail during the in-class presentation due to time constraints.

The initial idea came from a simple combination of two recent papers: Yang et al. (2023) on the promising future of Large Multimodal Models (LMMs), and Zhong et al. (2023) on AGIEval, a benchmark for evaluating LLMs’ capabilities on standardized tests. Since standardized tests contain more than the purely textual questions that are the main focus of Zhong et al. (2023), we asked: how well does ChatGPT do on questions with graphs, charts, tables, and other visual components? We thought this would be doable within the timespan of this class, and significant enough given the novelty of these recent advances. Below is a brief summary of our methodology and results.

We define multimodal questions as those that include more than purely textual instructions. We then categorized both the subtypes of questions and the types of reasoning abilities involved, following pre-existing guidelines and parts of the discussion in Zhong et al. (2023). The categories of questions include tables, geometry, analytic geometry, science diagrams, and data visualization. The cognitive abilities involved include quantitative skills (e.g. numerical/statistical calculations), logical reasoning (deduction, pattern identification, understanding logical connections), and reference to prior knowledge. Some questions might fall under more than one category.

With these labels in place, we gathered test questions from our sources, performed manual data cleaning and extraction, queried the model, recorded responses, and analyzed the data. We chose three main sources of standardized test questions: the GRE quantitative section (from ETS official guides), the AMC 12 (from an open database), and the GRE Physics subject test (from officially released tests). Sample questions as well as the analyzed data can be found in our group’s Google Slides presentation. We have three key findings: (1) GPT-4 performs worse on graph-type questions than on more textual questions; (2) GPT-4 is more effective at analyzing simple visualizations (e.g. tables) than complex ones (e.g. analytic geometry); (3) GPT-4 seems to perform better on problems that require substantial domain knowledge than on those demanding higher-order cognitive reasoning. For (2), a good illustration is that GPT-4 does significantly better on table questions in GRE Math, with an accuracy of around 90% compared to roughly 40% in the other categories.
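
For concreteness, here is a minimal sketch of the kind of query pipeline involved; it is not our actual server code, and the model name, helper name (ask_gpt4v), and exact message schema are my assumptions based on the OpenAI Python SDK rather than our implementation.

```python
# A minimal sketch (not our actual server code) of sending one multimodal
# question (text + screenshot) to GPT-4V through the OpenAI Python SDK (v1.x).
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask_gpt4v(prompt_text: str, image_path: str) -> str:
    """Send one question (text plus image) and return the raw model reply."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model name; check current docs
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt_text},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        max_tokens=500,
    )
    return response.choices[0].message.content


# Hypothetical usage:
# answer = ask_gpt4v("Answer this GRE quantitative question; reply with the letter of your choice.",
#                    "gre_table_q1.png")
```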

The results match our expectations, as table questions demand less multimodal capacity than, say, text recognition in an image. One of the criteria for evaluating LLMs’ multimodal capabilities is their ability to operate on things “beyond static 2-D visual representation and explicit semantic information.” Table questions, in this sense, contain explicit semantic information and might not push GPT-4V’s capacity. We expect future studies with more data and more rigorous categorizations and correlations to validate our observation that GPT-4V is better at questions requiring domain knowledge than at those requiring complex reasoning. While vast amounts of domain knowledge are readily available in training data, the fine-tuning or prompting mechanisms that higher-level cognitive reasoning likely depends on may take longer to show their effect. I also suggested the possibility that more complex questions like analytic geometry, besides requiring more capacity to work beyond static visual representation or explicit semantic information, also require some sort of pragmatic capacity (see Alikhani et al., 2023). For example, analytic geometry problems often contain figures with imperfections (“not drawn to scale”); we as humans might know to prioritize the verbal instructions, but an LLM’s ability to do so might fall under one of the categories of image-text coherence. In future studies, we also hope to perform consistency analysis (since currently several identical prompts might yield different answers), analysis of wrong answers (to see if there are systematic weaknesses or mistake patterns), and comparison with other LLMs.
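
As a hypothetical illustration of the consistency analysis we have in mind, the sketch below reuses the ask_gpt4v helper from the earlier snippet, queries the same question several times, and reports how often the most common answer appears; the regex-based extraction of a choice letter is an assumption for multiple-choice replies, not something we have implemented.

```python
# Sketch of a consistency check: the same multimodal prompt, repeated queries.
import re
from collections import Counter


def consistency_check(prompt_text: str, image_path: str, n_trials: int = 5) -> dict:
    """Query an identical question n_trials times and measure answer agreement."""
    answers = []
    for _ in range(n_trials):
        raw = ask_gpt4v(prompt_text, image_path)   # helper from the earlier sketch
        match = re.search(r"\b([A-E])\b", raw)     # crude extraction of the chosen option
        answers.append(match.group(1) if match else raw.strip())
    counts = Counter(answers)
    modal_answer, modal_count = counts.most_common(1)[0]
    return {
        "answers": answers,
        "modal_answer": modal_answer,
        "agreement": modal_count / n_trials,       # 1.0 means fully consistent
    }
```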

As we hope to arrive at a more comprehensive and rigorous framework for evaluating LLMs’ multimodal capabilities, there are more factors to consider before we scale up our current study. In the following I will discuss three factors: (1) bias vs. interference; (2) the effect of adversarial choices; and (3) the concern of “cheating” on training data, and how each relates to our study.

In two recent studies focusing on the illusions and hallucinations of GPT-4V (Zhang et al., 2023; Cui et al., 2023), the authors propose a distinction between bias and interference as two potential factors that affect LLM output. Bias is a preference for certain outputs due to imbalances in the training data; these might include all sorts of ethnic, social, political, and gender stereotypes (e.g. Garg et al., 2018). As examples of bias, Zhang et al. (2023) and Cui et al. (2023) find that GPT-4V performs better when the image contains information from a Western cultural background or contains text; one of the findings in Zhong et al. (2023) is that GPT-4 performs better on standardized test questions in English than in other languages. Interference, on the other hand, has to do with the way information is presented to the LLM and with the specific details of prompting. The popular practice of “prompt engineering” largely takes advantage of the effect of interference. Although we did not have the time to obtain rigorous results, we were curious to see whether manipulating the following modes of presentation would yield different outputs: (1) giving a full screenshot of the entire problem; (2) providing the verbal component as text and the visual component as an image, while still presenting the entire problem at once; (3) presenting the textual instructions and the visual component separately; or even (4) presenting multiple problems at once. A sketch of how the first three modes might be constructed is given below. In any case, the attempt to discern bias from interference would be important in evaluating LLM performance.
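
The sketch below shows one way the message payloads for modes (1)–(3) might be built; mode (4) would simply concatenate several such payloads. The message schema matches the earlier snippet, and the function and variable names are hypothetical.

```python
# Sketch of building message payloads for the presentation modes discussed above.
def build_messages(mode: int, question_text: str,
                   full_screenshot_b64: str, figure_only_b64: str) -> list:
    """Return chat messages for presentation modes (1)-(3) described in the text."""
    def img(b64: str) -> dict:
        return {"type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"}}

    if mode == 1:   # one screenshot containing both the text and the figure
        return [{"role": "user", "content": [img(full_screenshot_b64)]}]
    if mode == 2:   # typed question text plus the figure, in a single turn
        return [{"role": "user",
                 "content": [{"type": "text", "text": question_text},
                             img(figure_only_b64)]}]
    if mode == 3:   # text and figure sent as two separate user turns
        return [{"role": "user", "content": [{"type": "text", "text": question_text}]},
                {"role": "user", "content": [img(figure_only_b64)]}]
    raise ValueError("mode must be 1, 2, or 3")
```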

One specific assumption in our study, related to the idea of interference, is that since we took the input directly from the question sources, the answer options given for Multiple-Choice Questions (MCQs) played little role in affecting the output, and that they are equally “fair” choices. I suggest paying closer attention to this assumption in future studies. When humans answer MCQs, several factors affect how well we do: the overall number of choices, the number of choices that make sense (i.e. are not easily ruled out with relevant common sense), and the deviation between choices (in a very broad sense, e.g. whether one of the choices is perceived to be significantly different from the others). There is also previous research (e.g. Jia & Liang, 2017) on the effect of adversarial examples on reading comprehension systems. A major area of interest is whether we can manipulate the choices in MCQs (e.g. give one more or one fewer choice, add an (ir)relevant choice, etc.) and make systematic observations about the resulting output. This move has the additional benefit of evaluating LLMs’ ability to leverage common sense as part of multimodal evaluation, if they can identify in their output that some choices do not make sense. For example, I once asked ChatGPT an elementary math question that asks for the number of chickens among a group of chickens and rabbits. I provided four answer choices, three of which do not make sense (one is negative, another exceeds the total number of animals, and another is not an integer). In that particular scenario, ChatGPT did not pursue the route of eliminating wrong answers on its first attempt, even though that would be the more commonsensical approach for humans.
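
As a hypothetical illustration of this kind of choice manipulation, the sketch below builds two answer sets for a chicken-and-rabbit problem, one with plausible distractors and one with the “nonsensical” distractors described above. The specific numbers and distractor recipes are my own assumptions for illustration, not the prompts used in the Appendix.

```python
# Sketch of generating manipulated MCQ choice sets for a chicken-and-rabbit problem.
def chicken_rabbit_choices(heads: int, legs: int) -> dict:
    """Build two answer-choice sets for 'how many chickens?' given heads and legs.

    From c + r = heads and 2c + 4r = legs, the correct count is c = (4*heads - legs) / 2.
    """
    correct = (4 * heads - legs) // 2
    plausible = [correct - 2, correct, correct + 1, correct + 3]   # all look reasonable
    nonsensical = [correct, -3, heads + 5, correct + 0.5]          # negative, too large, non-integer
    return {"plausible": plausible, "nonsensical": nonsensical, "correct": correct}


# Example: chicken_rabbit_choices(10, 26) gives correct = 7 chickens (and 3 rabbits).
```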

Finally, as Yann LeCun reminds us in a Twitter post: “Beware of ChatGPT cheating on dataset!” In a recent discussion, ChatGPT was thought to perform remarkably well on the popular “Chihuahuas vs. Blueberry Muffins” visual discrimination task, but in one of the responses it was caught cheating: “This is an example of a visual pun or meme known as ‘puppy or muffin’,” suggesting that the task is present in the training data. A similar thing happened when I experimented with the chicken-and-rabbit problem (see Appendix): ChatGPT recognized the archetype of the question and even gave specific numbers before I provided any. So another important variable to rule out is the possibility of LLMs “cheating” on the dataset; we have to find ways to prompt them to actually work through the questions.

In summary, our group’s project points to exciting new directions for evaluating LLMs’ multimodal abilities on standardized tests. We drew conclusions based on statistical patterns in GPT-4’s performance on multimodal standardized test questions, and laid out considerations for future evaluations. It has been an insightful learning experience to keep discovering ways to make our evaluations more rigorous and comprehensive.

References

Malihe Alikhani et al., Image–text coherence and its implications for multimodal AI. Frontiers in Artificial Intelligence 6:1048874, doi:10.3389/frai.2023.1048874, 2023.

Chenhang Cui et al., Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges. arXiv:2311.03287, 2023.

Nikhil Garg et al., Word embeddings quantify 100 years of gender and ethnic stereotypes. PNAS, doi:10.1073/pnas.1720347115, 2018.

Robin Jia & Percy Liang, Adversarial Examples for Evaluating Reading Comprehension Systems. arXiv:1707.07328, 2017.

Zhengyuan Yang et al., The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision). arXiv:2309.17421, 2023.

Yichi Zhang et al., Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans? arXiv:2311.00047, 2023.

Wanjun Zhong et al., AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models. arXiv:2304.06364, 2023.

Appendix

Figure 1: My first attempt to evaluate the effect of different choices in MCQs, which ended surprisingly prematurely because ChatGPT recognized the question archetype right away.
