For a version with graphs, you can use this link:
https://publuu.com/flip-book/330094/760112
Project Introduction:
Our group’s final project focused on providing a multimodal evaluation of the cognitive capabilities of Large Language Models (LLMs), specifically their ability to understand and analyze graphs. Given time constraints, we primarily evaluated GPT-4, as it is currently one of the most popular LLMs and OpenAI recently released its multimodal API. Our group utilized standardized tests, including the quantitative sections of the GRE Test, the AMC-12, and the GRE Physics test, as the framework for our evaluation. The results show that GPT-4’s graphical analysis ability is not yet satisfactory, as it failed to answer many questions requiring complex graphical analysis.
Background and Motivation:
Recently, OpenAI unveiled the new multimodal API for GPT-4. With this update, GPT-4 can now read, understand, and interpret graphical data, extending its proficiency beyond just textual analysis to include visual information. This advancement necessitates new evaluation criteria. Prior assessments of LLMs like GPT-4 were predominantly centered on their text-processing abilities. However, it’s not assured that GPT-4 will maintain the same proficiency in handling graphical inputs. Responding to these new requirements, our group has set out to develop specialized evaluation methods specifically for GPT-4’s image processing capabilities.
Our project is primarily inspired by the paper “AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models.” This paper proposes the novel concept of employing standardized test questions, such as those from the SAT Math test, as benchmarks to assess the abilities of LLMs in understanding and solving problems. After reviewing the paper, our group recognized several advantages of using standardized tests as an evaluation standard. Firstly, standardized tests come with official answers, providing an effective and objective metric (the LLM’s answer accuracy) for assessment. This approach offers more convincing results because it limits the need for subjective opinions in the evaluation process, in contrast to other forms of testing, such as English essay writing, which are subject to human judgment and potential biases. Secondly, these questions demand a certain level of cognitive capability: many cannot be answered through mere knowledge memorization and instead require the LLM to engage in at least some degree of reasoning based on an understanding of the question. This allows us to more accurately evaluate the cognitive abilities of GPT-4.
Methodology:
Due to time constraints, we were unable to review a vast array of exams. Consequently, we chose three particular exams to analyze: the GRE Physics test, the quantitative sections of the GRE Test, and AMC-12. These were deemed most suitable, and we utilized their officially provided sample questions and answers, which were publicly accessible.
(Sample Question from AMC 12)
Our primary reason for selecting these exams was their rich inclusion of graphical questions. Since our objective is to assess GPT-4’s proficiency with graphs, we specifically chose questions featuring graphs, and these exams offer a diverse selection of such questions, each demanding varied cognitive abilities. The second reason for our choice is the high credibility of these exams: their widespread use and recognition make GPT-4’s performance on them meaningful. Lastly, we considered the balance between domain knowledge memorization and cognitive complexity in these exams. Our review showed that the AMC-12 demands the highest cognitive capabilities with the least reliance on domain knowledge, the quantitative section of the GRE Test falls in the middle on both dimensions, and the GRE Physics test requires the most domain knowledge with comparatively the least cognitive complexity. This range of requirements helps us better understand the extent to which GPT-4’s responses depend on memorized knowledge versus cognitive capabilities.
To account for the varying cognitive capabilities needed to analyze different graph types, we categorized each graph with specific labels based on its type. Here is the list of graph labels we used:
1. Geometry: Includes basic shapes, complex figures, and theorems.
2. Analytic Geometry: Graphs of equations, coordinate systems, and geometric properties of algebraic expressions.
3. Science Diagrams: Illustrations of scientific concepts, including biology, physics, and chemistry diagrams.
4. Data Visualizations: Graphs and charts like bar graphs, line charts, pie charts, scatter plots, etc.
5. Tables: Data presented in a tabular format, including statistical data and comparative information.
Among these various graph types, those such as Geometry necessitated a higher level of cognitive complexity, particularly in understanding and making inferences from the graph. In contrast, graph types like tables demanded a lower level of cognitive complexity. For these, GPT-4 primarily needed to extract data and apply its analysis without truly understanding the graph.
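To make this labeling scheme concrete, the sketch below shows how a single question could be recorded together with its graph-type label. The structure and field names are purely illustrative assumptions, not our exact internal format.

# Illustrative only: a hypothetical record for one labeled question.
GRAPH_TYPES = [
    "Geometry",
    "Analytic Geometry",
    "Science Diagrams",
    "Data Visualizations",
    "Tables",
]

question_record = {
    "exam": "GRE Quantitative",          # source exam (hypothetical value)
    "question_id": "gre-quant-017",      # hypothetical identifier
    "graph_type": "Tables",              # one of the five labels above
    "image_path": "images/gre-quant-017.png",
    "question_text": "...",
    "options": ["A", "B", "C", "D"],
    "official_answer": "C",
}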
Based on this design, our group selected approximately 250 questions from the three exams and submitted them to GPT-4’s multimodal API. We provided GPT-4 with our instructions, the graph, the question, and the answer options together in a single prompt. Here is a sample instruction our group used: “According to the image below, give the value as required. End your response with ‘Final Answer: [Only the answer value without any symbols].’”
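For readers who want to reproduce this setup, the following is a minimal sketch of submitting one such question through the OpenAI Python SDK. Only the instruction string matches what is quoted above; the model name, token limit, and helper functions are illustrative assumptions and may not reflect our exact configuration.

import base64
import re

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

INSTRUCTION = (
    "According to the image below, give the value as required. "
    "End your response with 'Final Answer: [Only the answer value without any symbols].'"
)

def encode_image(path):
    # Encode the question's graph so it can be sent inline as a data URL.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def ask_question(image_path, question_text, options_text):
    # Send the instruction, question, options, and graph in a single user message.
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model name; adjust to the current multimodal model
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"{INSTRUCTION}\n\n{question_text}\n\n{options_text}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{encode_image(image_path)}"}},
            ],
        }],
    )
    return response.choices[0].message.content

def extract_final_answer(reply):
    # Pull out whatever follows "Final Answer:" so responses can be scored automatically.
    match = re.search(r"Final Answer:\s*(.+)", reply)
    return match.group(1).strip() if match else None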
Results and Findings:
(The General Accuracy of GPT-4’s Performance on Three Exams)
GPT-4 reaches a somewhat satisfactory performance on the GRE Physics test, with an accuracy rate of around 70%. The GRE Physics test, as mentioned, requires the least cognitive complexity and relies the most on domain knowledge. As cognitive complexity increases, GPT-4’s performance gradually decreases. For the quantitative section of the GRE Test, the accuracy is only around 53%, nearly a 20-percentage-point drop from the GRE Physics test. For the AMC-12, the exam that requires the highest cognitive complexity, the accuracy is fairly low, at around 32% overall. Based on this trend, the first conclusion our group draws is that GPT-4 performs much better on exams that require substantial domain knowledge than on those requiring higher cognitive complexity.
(GPT-4’s Accuracy on GRE Test Quantitative Section’s Questions with Different Graph Types)
The graph above depicts the accuracy rates achieved by GPT-4 across various graph types in the quantitative section of the GRE Test. We focused exclusively on this exam’s results because of its diverse array of questions with different graph types. In contrast, the GRE Physics test and the AMC-12 exhibit a more homogeneous distribution of graph types, making it difficult to draw graph-type-specific conclusions from them. It is clear that GPT-4 performs significantly better on questions involving tables (with an approximate accuracy rate of 91%), which are relatively simple, than on more complex types such as geometry graphs. As hypothesized, questions featuring tables do not demand an in-depth understanding and analysis from GPT-4. The primary task for GPT-4’s vision model in these instances is to extract data from the table and then analyze it as textual data. Hence, table questions may be essentially “pseudo-graph” questions, data tables in nature. For questions with more complex graphs, GPT-4 must understand spatial relationships between elements, such as multiple triangles within the same figure. These cannot simply be converted into tabular or textual data for analysis. Therefore, our second conclusion is that GPT-4 exhibits markedly better performance on “pseudo-graph” questions than on those necessitating a significant level of graphical understanding.
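As an illustration of how such a breakdown can be computed, the short sketch below groups a hypothetical results log by graph type; the file name and column names are assumptions, not our actual data files.

import pandas as pd

# Hypothetical log with one row per question:
# columns: exam, graph_type, official_answer, model_answer
df = pd.read_csv("results.csv")
df["correct"] = (
    df["model_answer"].str.strip().str.upper()
    == df["official_answer"].str.strip().str.upper()
)

# Accuracy per graph type within the GRE quantitative section.
gre_quant = df[df["exam"] == "GRE Quantitative"]
print(gre_quant.groupby("graph_type")["correct"].mean().sort_values())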
(Sample Question 1’s Graph)
To illustrate GPT-4’s limitations in understanding graphics, consider Sample Question 1. This question involves squares PQRV and RVST, each with sides of 6, and asks whether the area of the shaded region is equal to 36. The correct answer is ‘equal.’ However, GPT-4’s response contains several errors. It recommends subtracting the area of a triangle PVS from square PQRV. GPT-4 made two mistakes here: first, PVS is not a valid triangle in the figure; second, it should have used PQST, not PQRV, in the subtraction. The idea of subtracting PST from PQST is not logical, which shows a misunderstanding of basic geometric concepts and an inability to correctly identify simple shapes such as triangles. This result is particularly concerning given the simplicity of the geometric problem, and the issue is compounded by the fact that GPT-4’s mistakes are not isolated slips but multiple, fundamental misunderstandings.
The first two conclusions, which are the major conclusions we drew, have raised concerns within our group about GPT-4’s capabilities in conducting graphical analysis. GPT-4’s strong performance on table questions or those that heavily rely on domain knowledge memorization does not compellingly endorse its multimodal abilities. This performance might be attributable more to GPT-4’s extensive knowledge base and its proficiency in extracting information from graphs, rather than its ability to analyze them as graphs. The notable decrease in accuracy for questions that demand a high level of multimodal ability leads us to surmise that GPT-4’s proficiency in complex graphic analysis might still be lacking. Specifically, it appears not to be at a level where it can consistently perform well on standardized tests that require such advanced capabilities.
Beyond the two primary conclusions identified by our group, I also observed two minor findings. Although these patterns are not backed by a substantial number of instances, I believe it’s important to include them in this report, as they could potentially guide future research directions.
(Sample Question 2’s Graph)
GPT-4 has occasionally provided incorrect answers even when applying correct logic, as seen in Sample Question 2. This question required calculating f(f(-1)) using a graph. GPT-4 correctly identified the first step, determining f(-1) as 2, but then incorrectly calculated f(2) as -1 instead of the correct 1. This instance reveals GPT-4’s inconsistencies, which are particularly notable because the error occurred immediately after a similar successful step. This raises two possibilities. GPT-4 might be prone to simple errors, similar to those a student might make, which calls into question its reliability for tasks requiring even basic cognitive skills. Alternatively, GPT-4’s grasp of the logic might not rely on extensive multimodal capabilities at all, since it could deduce the required steps from the text alone, such as ‘f(f(-1))’ in the question, while identifying point values on a graph might in general remain a hard task for it. Further research is necessary to determine which scenario more accurately reflects GPT-4’s potential flaws.
(GPT-4’s Performance on GRE Test Quantitative Section’s Questions with Different Option Types)
In our analysis, we found that GPT-4’s performance varies across different question formats in the quantitative section of the GRE Test. Notably, it performs poorly on multiple-choice questions, which aligns with the general perception that these are challenging for many students. Surprisingly, GPT-4 excels at fill-in-the-blank questions, achieving almost 80% accuracy, significantly higher than its roughly 50% accuracy on single-choice questions, even though fill-in-the-blank questions are typically deemed the most difficult. One possible explanation is that the presence of options does not significantly influence GPT-4’s performance, as it tackles questions step by step, unlike students who may need to guess based on the options. However, this does not account for its lower success rate on multiple-choice questions if options truly have no impact on its performance. Another theory is that the options might actually constrain GPT-4’s answer-formulation process. We observed many cases during the project in which GPT-4 either declares none of the options correct, identifies all of them as wrong while providing its own answer, or offers multiple correct answers when only one is requested. This suggests that GPT-4 might inherently tend to formulate its own solutions, going beyond the provided options. In fill-in-the-blank questions, without the constraint of options, GPT-4 might then perform more effectively.
Further Directions:
Our group envisions several potential future directions that, due to time constraints, we are unable to pursue during our final project.
One key direction involves expanding our question database. Currently, our dataset comprises approximately 250 questions, but a more extensive collection would undoubtedly enhance the robustness of our conclusions. Additionally, incorporating questions from various exams and tapping into diverse sources would greatly bolster the credibility of our findings.
The second direction involves enhancing the granularity of our analysis by assigning cognitive ability labels to each question. This approach aims to deepen our understanding of the correlation between GPT-4’s accuracy and the cognitive complexity of the questions. Currently, as mentioned in the previous section, our methodology labels the exams, not questions, based on levels of domain knowledge memorization and cognitive complexity. This can be overly broad, as the questions within a single exam can vary significantly. During our project, we actually attempted to categorize questions under specific cognitive skills, namely Quantitative Skills, Logical Reasoning, and Referring to Memory/Prior Knowledge. However, our data analysis did not reveal a clear relationship between GPT-4’s performance and these cognitive skills. This absence of a distinct correlation during this stage, however, does not necessarily imply that such a relationship does not exist. I have two hypotheses here: firstly, our dataset may not have a sufficient variety of questions under each cognitive skill label; secondly, many questions had multiple cognitive skill labels, and this overlap complicates the analysis, as an error in such a question impacts all its associated labels, leading to a homogenization of accuracy across different labels. To address this, expanding our question database to include more questions uniquely categorized under a single cognitive skill label would be advantageous. This refinement in our approach would allow for a more rigorous examination of the relationship between GPT-4’s performance and the cognitive complexity of questions.
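To illustrate the overlap problem concretely, the toy sketch below shows how a single incorrect answer on a multi-label question lowers the accuracy of every label it carries, which pulls the per-label accuracies toward one another. The data here is invented purely for illustration.

from collections import defaultdict

# Invented toy data: (cognitive-skill labels on a question, whether GPT-4 answered correctly)
results = [
    ({"Quantitative Skills", "Logical Reasoning"}, False),
    ({"Quantitative Skills"}, True),
    ({"Logical Reasoning", "Referring to Memory/Prior Knowledge"}, True),
    ({"Quantitative Skills", "Referring to Memory/Prior Knowledge"}, False),
]

totals, correct = defaultdict(int), defaultdict(int)
for labels, is_correct in results:
    for label in labels:  # one wrong answer counts against every label it carries
        totals[label] += 1
        correct[label] += int(is_correct)

for label, n in totals.items():
    print(label, correct[label] / n)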
The third direction is to experiment with the few-shot learning method. In the AGIEval paper, the authors report an increase in accuracy when GPT-4 is provided with a few examples of questions and answers before responding. Our current evaluation primarily focuses on the zero-shot approach, but it would be intriguing to investigate whether GPT-4’s capability to learn from a small number of examples also extends to improving its graphical analysis skills. Moreover, the AGIEval paper introduces an evaluation of the ‘Chain-of-Thought’ reasoning process, which prompts GPT-4 to process questions in a step-by-step manner, providing explanations at each stage. While the responses we have received from GPT-4 suggest that it already employs a step-by-step reasoning approach, the AGIEval paper indicates an increase in accuracy when this method is explicitly implemented. Exploring whether this improvement in accuracy is consistent in different contexts would be a worthwhile addition to our research.
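As a sketch of what this could look like in practice, the snippet below assembles a few-shot, chain-of-thought style prompt by prepending worked examples and an explicit step-by-step instruction to the target question. The example questions and the exact wording are placeholders, not the prompts used in AGIEval or in our project.

# Placeholder worked examples; in practice these would be real solved questions with full reasoning.
FEW_SHOT_EXAMPLES = [
    ("Example question 1 ...", "Step-by-step reasoning ... Final Answer: 12"),
    ("Example question 2 ...", "Step-by-step reasoning ... Final Answer: B"),
]

def build_few_shot_prompt(question_text):
    # Explicit chain-of-thought instruction, then worked examples, then the new question.
    parts = ["Think step by step and explain your reasoning before giving the final answer."]
    for q, a in FEW_SHOT_EXAMPLES:
        parts.append(f"Question: {q}\nAnswer: {a}")
    parts.append(f"Question: {question_text}\nAnswer:")
    return "\n\n".join(parts)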
The fourth direction we want to explore is the observed correlation between the type of question options and GPT-4’s accuracy, as noted in the previous section. We have observed that GPT-4 tends to perform better on fill-in-the-blank questions than on multiple-choice questions. Our first objective is to verify whether this trend persists with a larger number of questions. Secondly, we plan to investigate whether the higher accuracy on fill-in-the-blank questions is simply because these questions are written to be inherently easier, to compensate for the difficulty of the format. This will involve converting some of the single-choice questions that GPT-4 answered incorrectly into the fill-in-the-blank format to see whether accuracy improves. If the trend persists, we intend to investigate why, especially since it contradicts our initial expectations. For example, we want to test our hypothesis that the absence of predefined options in fill-in-the-blank questions might offer GPT-4 greater freedom to generate answers, possibly leading to enhanced performance. This aspect of our research could provide valuable insights into the mechanics of GPT-4’s reasoning and response generation.
Conclusions:
In conclusion, our project has offered an evaluation of GPT-4’s capabilities in processing and interpreting graphical data, utilizing a diverse array of standardized tests. While we observed a notable proficiency in tasks that rely heavily on domain knowledge and less cognitive complexity, the model demonstrated limitations in handling questions that require a deeper level of multimodal understanding and graphical analysis. We plan to work on the potential future directions we have identified – expanding our question database, refining the granularity of cognitive complexity analysis, experimenting with few-shot learning methods, and exploring the impact of question format on accuracy – to enhance our understanding of this topic.
AI Usage:
I used GPT-4 for grammar revisions and sentence refinement. The initial draft is my own work, and all ideas within it were generated by my group and me.