MultiModal-Eval: A Standardized Framework for Evaluating LLMs’ Cognitive Capabilities (COURSE REPORT VERSION)

Pre-preprint is available at: https://drive.google.com/file/d/1UYbU_txAfmuvWOyHUT2rWG9fNyzdOZtd/view?usp=sharing

1 Introduction

The increasing prevalence of Large Language Models (LLMs) and Artificial Intelligence (AI) in contemporary society has led to the emergence of a wide array of evaluation benchmarks. Despite these numerous efforts, we are faced with ever-expanding datasets encompassing an ever more diverse range of tasks. The prevailing assumption that more data or more tasks equates to a more comprehensive assessment prompts a significant question: can we realistically catalog every conceivable task in the world to evaluate LLMs? And if such comprehensive cataloging is unattainable, how can we confidently assert that we are effectively evaluating the “abilities” of LLMs?

To address this, it is essential to understand what we aim to assess in AI evaluation. There are two categories of assessment for LLMs or AI:

• Basic Processing Capabilities: (e.g., Optical Character Recognition accuracy in LLMs supporting graphical inputs, analogous to the physical functionality tests in humans)

• High-level Cognitive Abilities: (e.g., the logical reasoning required to solve a mathematical problem).

Standardized human examinations primarily assess cognitive complexity, presuming intact basic processing capabilities. These tests, designed for various ages and knowledge levels, encompass two dimensions: the required domain knowledge (as specified in official examination guides) and the involved cognitive complexity (similar to the perceived “difficulty” of a test). The key to differentiating exam difficulties lies in their nature: whether they are qualification exams or competitive exams. For instance, qualification tests like the SAT are generally less cognitively challenging than competitive exams like the American Mathematics Competitions (AMC), which serve as selective benchmarks for the USA Mathematical Olympiad. This is not to say that every AMC question is more difficult than those in SAT Math, but the overall difficulty of the AMC is notably higher, despite both covering only pre-calculus high school math. It is crucial to recognize that extensive domain knowledge does not necessarily equate to cognitive complexity. Often, questions that seem incomprehensible to the uninitiated can, once the basic concepts are learned, reveal themselves to involve straightforward cognitive processes.

In this paper, I aim to:

• Develop a refined approach for evaluating the cognitive abilities of Large Language Models (LLMs) using standardized exams that assess both domain knowledge and cognitive complexity. This approach also proposes using Bloom’s Taxonomy as a framework for classifying AGI cognitive abilities. We plan to test the cognitive abilities of various MultiModal LLMs (GPT-4V, Google Bard, and Gemini) using these metrics.

• Investigate the impact of persona prompting on LLMs’ cognitive performance. This includes examining how introducing specific personas in prompts affects the problem-solving and reasoning capabilities of LLMs.

• Implement multistage adaptive testing (MST), in which LLMs respond to successive question sets chosen according to their performance on previous sets. This method tailors each question set to the LLM’s specific knowledge and ability level, filtering out questions that are far too challenging or too simple for its domain knowledge. The objective is to provide a rapid and informative assessment of LLMs’ cognitive abilities, streamlining the evaluation process while still effectively measuring high-level cognitive functions. A minimal sketch of the routing logic follows this list.
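To make the adaptive flow concrete, the following is a minimal sketch of one way the MST routing could work. It is not the finalized protocol of this paper: the three-module structure, the accuracy thresholds, and the `ask_model` callable are hypothetical placeholders.

```python
# Minimal sketch of multistage adaptive testing (MST) for an LLM.
# The modules, thresholds, and ask_model() are hypothetical placeholders,
# not the finalized protocol of this paper.

from typing import Callable, Dict, List


def run_mst(
    ask_model: Callable[[str], str],   # sends one question, returns the model's answer
    modules: Dict[str, List[dict]],    # question sets keyed by difficulty: "easy" / "medium" / "hard"
    route_up: float = 0.7,             # advance to a harder set above this accuracy
    route_down: float = 0.4,           # fall back to an easier set below this accuracy
) -> List[dict]:
    """Administer question sets in stages, routing by accuracy on the previous set."""
    order = ["easy", "medium", "hard"]
    stage = 1                          # start in the middle module
    history = []

    for _ in range(len(order)):        # one module per stage
        level = order[stage]
        questions = modules[level]
        correct = 0
        for q in questions:
            answer = ask_model(q["prompt"])
            is_correct = answer.strip() == q["gold"]
            correct += is_correct
            history.append({"level": level, "id": q["id"], "correct": is_correct})
        accuracy = correct / len(questions)

        # Route the next stage based on performance on this set.
        if accuracy >= route_up and stage < len(order) - 1:
            stage += 1
        elif accuracy <= route_down and stage > 0:
            stage -= 1
        # otherwise stay at the same difficulty level

    return history
```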

2 Background [Course Report Version]

2.1 Existing Large MultiModal Models

The landscape of AI and machine learning has been significantly reshaped by the advent of Large MultiModal Models (LMMs). These models, epitomized by GPT-4V, Google Bard, and Google Gemini, represent the pinnacle of current AI technology in terms of their ability to process and synthesize vast amounts of information across various modalities.

• GPT-4V: Developed by OpenAI, GPT-4V is an advanced iteration of the Generative Pretrained Transformer series. This model showcases remarkable capabilities in natural language understanding and generation, making it one of the most sophisticated LLMs available. GPT-4V’s architecture allows it to process and generate text with a deep understanding of context, nuance, and even creative elements. Its applications range from composing intricate textual content to providing insightful responses in conversational AI.

• Google Bard: This model, developed by Google, is designed to leverage the vast informational resources of the internet. Google Bard is adept at understanding and generating natural language, making it an invaluable tool for information retrieval, summarization, and even creative endeavors. What sets Bard apart is its ability to integrate real-time data from the web, allowing it to provide up-to-date responses and insights on a wide array of topics.

• Google Gemini: As another significant contribution by Google, Gemini represents a stride forward in multimodal AI capabilities. Unlike its predecessors, which focused solely on text, Gemini is designed to understand and generate outputs in multiple modalities, including text, images, and potentially audio and video. This capacity for multimodal understanding enables more comprehensive interactions with users, allowing for more dynamic and contextually rich AI applications.

The development and implementation of these LMMs have profound implications for various fields, from education and business to creative arts and scientific research. They exemplify the ongoing evolution in AI, moving towards more integrated, intelligent, and responsive systems. As these models continue to advance, they promise to further blur the lines between human and machine capabilities in processing, understanding, and creating content across multiple formats.

2.2 Existing Evaluation of LMMs

A recent popular multimodal evaluation benchmark is “MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI”. MMMU integrates a broad spectrum of multi-discipline tasks requiring college-level knowledge and deliberate reasoning. Its questions span 30 subjects and 183 subfields, featuring 30 highly diverse image types such as charts, diagrams, maps, tables, music sheets, and chemical structures. Unlike existing benchmarks, MMMU emphasizes advanced perception and reasoning with domain-specific knowledge, challenging models to undertake tasks typical of experts. However, several weaknesses in this benchmark merit attention:

• Vastness is NOT Depth: The breadth of the dataset does not inherently imply depth of cognitive challenge.

• Subjective Difficulty Assignment: The difficulty level is determined by student annotators without well-defined metrics of difficulty, leading to significant variability in perceived challenge.

• Misconception of Complexity: MMMU’s emphasis on “college level” content as a marker of depth or complexity is misleading. Requiring specific domain knowledge does not automatically imply increased cognitive complexity.

These observations necessitate a careful reevaluation of how we assess AI, particularly in the context of multimodal evaluations like MMMU, to ensure a holistic and accurate understanding of AI capabilities.

2.3 Bloom’s Taxonomy: A Categorization of Cognitive Processes & Knowledge Domains

Bloom’s Taxonomy, a framework first introduced by Benjamin Bloom in 1956 and later revised, categorizes cognitive processes into six levels: Remember, Understand, Apply, Analyze, Evaluate, and Create. In recent years, this taxonomy has gained relevance in assessing not just human cognition but also the cognitive abilities of advanced computational systems such as Large Language Models (LLMs). LLMs, with their expansive knowledge bases and intricate processing capabilities, present a unique platform for examining cognitive processes in a non-human entity. These cognitive process categories, along with the Knowledge Dimensions, provide a comprehensive structure for assessing the capabilities of LLMs.

• Remembering: Involves recognizing or recalling knowledge from memory. This is about memory retrieval of previously learned material.

• Understanding: Constructing meaning from various types of functions, whether written or graphic. It encompasses interpreting, exemplifying, classifying, summarizing, inferring, comparing, and explaining.

• Applying: Carrying out or using a procedure in a given situation. It relates to applying what has been learned in practical contexts.

• Analyzing: Breaking down material into its constituent parts and understanding its structure. This includes differentiating, organizing, and attributing.

• Evaluating: Making judgments based on criteria and standards through checking and critiquing.

• Creating: Putting elements together to form a coherent or functional whole; it involves generating, planning, and producing, and is the most complex mental function in this taxonomy.

The Knowledge Dimensions are:

• Factual Knowledge: Basic elements and terminology.

• Conceptual Knowledge: Interrelationships among basic elements within a larger structure.

• Procedural Knowledge: How to do something, methods of inquiry, and criteria for using skills, techniques, and methods.

• Metacognitive Knowledge: Knowledge of cognition in general as well as awareness and knowledge of one’s own cognition.

By categorizing LLM abilities using Bloom’s Taxonomy, we can create a structured approach to evaluating and understanding their cognitive capabilities. In the context of LLMs, I aim to show that the first five cognitive processes of Bloom’s Taxonomy (Remember, Understand, Apply, Analyze, Evaluate) can be evaluated well with standardized examinations. LLMs have demonstrated remarkable abilities in areas such as causal understanding, logical deduction, and counterfactual reasoning, which align with these categories. The “Create” category and the “Metacognitive” dimension pose greater challenges, as they involve higher levels of abstraction and self-awareness that are not traditionally within the scope of standardized tests. This framing can guide both the development of more advanced models and the creation of examinations that can effectively measure these abilities.

3 Methodology

3.1 Verification of Hypothesis: LMMs ✓Domain Knowledge, ✗Cognitive Complexity

The inherent design and functioning of LMMs (Large MultiModal Models) lead to the hypothesis that their performance diminishes as cognitive complexity increases: LMMs excel in scenarios demanding extensive domain knowledge but limited cognitive complexity. To test this hypothesis, a structured pipeline is proposed: selection and extraction of standardized test questions (with graphs), followed by manual data cleaning, model query, and a detailed response analysis. This approach aims to evaluate the LMMs’ proficiency in handling various levels of cognitive demand while leveraging their vast knowledge base.
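As an overview of how these stages fit together, here is a skeleton of the pipeline expressed as a single loop. It is a sketch under stated assumptions: the function names (`load_questions`, `query_model`, `grade_response`) and record fields are hypothetical placeholders for the steps described above, not the actual harness.

```python
# Skeleton of the proposed evaluation pipeline: question selection/extraction,
# manual cleaning (assumed to have produced the input file), model query, and
# response analysis. All function and field names are illustrative placeholders.

import json


def load_questions(path: str) -> list[dict]:
    """Load manually cleaned questions (text, image path, gold answer, tags)."""
    with open(path) as f:
        return json.load(f)


def run_pipeline(question_file: str, query_model, grade_response) -> list[dict]:
    """Query the model on every question and record graded results for analysis."""
    results = []
    for q in load_questions(question_file):
        reply = query_model(q["prompt"], q["image_path"])
        results.append({
            "id": q["id"],
            "exam": q["exam"],
            "cognitive": q["cognitive_tag"],
            "image_type": q["image_type"],
            "correct": grade_response(reply, q["gold"]),
        })
    return results
```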

3.1.1 Standardized Test Questions Selection

• GRE General Test (Quantitative Section) (Sources: Educational Testing Service – Official GRE Quantitative Reasoning Practice Questions, Second Edition, McGraw-Hill Education, 2017; The Official Guide to the GRE revised General Test, 2nd Edition) The GRE General Test’s Quantitative Section, as a qualification exam, is an ideal candidate for evaluating LMMs’ domain knowledge. Though a qualification test for graduate studies, its questions predominantly measure high-school-level mathematics skills, focusing on arithmetic, algebra, geometry, and data analysis. This section reflects high school domain knowledge and basic cognitive complexity, providing a benchmark for assessing LLMs’ ability to apply mathematical concepts in structured problem-solving scenarios.

• American Mathematics Contest (AMC) 12 (Sources: All questions with graphs from 2000-2023; taken from AOPS database.) The AMC 12 Competition, aimed at high school students, tests mathematical creativity and problem-solving skills in areas like number theory, algebra, geometry, and combinatorics at a pre-calculus level. It often presents unique problems that require innovative solving methods. By including AMC, the study aims to thoroughly evaluate LLMs’ capabilities in tackling complex and imaginative mathematical challenges, indicating high-level cognitive complexity within a relatively narrow domain knowledge framework.

• GRE Subject Test – Physics (Sources: officially released 2001 and 2008 tests) The GRE Physics Subject Test, an extensive assessment of undergraduate-level physics knowledge administered as a qualification exam, spans topics such as mechanics, electromagnetism, thermodynamics, quantum mechanics, and special relativity. Its inclusion in the study will shed light on LLMs’ capability to process and apply advanced physics concepts. This test draws on a broad domain knowledge base (which also encompasses high school mathematics) but is relatively less demanding in cognitive complexity: once a test-taker knows the relevant concepts, arriving at the correct answer does not require a highly complex thought process.

In summary:

• Domain Knowledge: AMC 12 ≈ GRE Quantitative < GRE Physics Subject

• Cognitive Complexity: GRE Physics Subject ≈ GRE Quantitative < AMC 12

3.1.2 Model Query with Prompts

The following are the baseline prompts fed to the model (a query-and-parsing sketch follows this list):

• “Fill in the Blank” Questions:

– “According to the image below, give the value as required. End your response with ’Final Answer: [Only the answer value without any symbols]’. <the question>”

• Choices that can have one or more correct answers:

– “According to the image below, select all the answer choices that apply. You may choose one or multiple choices. End your response with ’Final Answer: ([Your Choice 1]), ([Your Choice 2])’ or more choices. <the question>”

• Choices that have one single correct answer:

– “According to the image below, choose the best answer out of choices. End your response with ’Final Answer: ([Your Choice])’. <the question>”
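To make the query step concrete, the sketch below shows one plausible way to send a baseline prompt together with its question image to GPT-4V via the OpenAI Python client, and to parse the “Final Answer” line from the reply. The model name, image encoding, and regular expression are illustrative assumptions rather than the exact setup used for the reported results.

```python
# Sketch of the model-query step: send a baseline prompt plus a question image
# to GPT-4V and extract the "Final Answer" line. The model name, encoding, and
# parsing regex are illustrative assumptions, not the exact harness used here.

import base64
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SINGLE_CHOICE_PROMPT = (
    "According to the image below, choose the best answer out of choices. "
    "End your response with 'Final Answer: ([Your Choice])'. {question}"
)


def query_gpt4v(question: str, image_path: str) -> str:
    """Send one baseline prompt with its question image and return the raw reply."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # vision-capable model available at the time of writing
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": SINGLE_CHOICE_PROMPT.format(question=question)},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        max_tokens=1024,
    )
    return response.choices[0].message.content


def extract_final_answer(reply: str) -> str:
    """Pull whatever follows 'Final Answer:' on the last matching line, e.g. '(B)' or '42'."""
    matches = re.findall(r"Final Answer:\s*(.+)", reply)
    return matches[-1].strip() if matches else ""
```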

3.1.3 Categorizations

The images present in these three exams fall into the following types:

• Geometry: Includes basic shapes, complex figures, and theorems.

• Analytic Geometry: Graphs of equations, coordinate systems, and geometric properties of algebraic expressions.

• Science Diagrams: Illustrations of scientific concepts, including biology, physics, and chemistry diagrams.

• Data Visualizations: Graphs and charts like bar graphs, line charts, pie charts, scatter plots, etc.

• Tables: Data presented in a tabular format, including statistical data and comparative information.

Since in this section we only want to verify the hypothesis, the cognitive abilities are simplified into the following categories (an aggregation sketch follows the list):

• Quantitative Skills: This aligns with the “Apply” level of Bloom’s Taxonomy. It involves using procedures in a given situation, like performing numerical calculations, statistical analysis, and solving mathematical problems.

• Logical Reasoning: This corresponds to the “Analyze” stage in Bloom’s Taxonomy. It requires breaking down material into constituent parts and understanding its structure, such as identifying patterns, deducing conclusions, and understanding logical connections.

• Referring to Memory / Prior Knowledge: This is associated with the “Remember” level of Bloom’s Taxonomy. It involves recalling or recognizing knowledge from memory, such as utilizing previously acquired knowledge or theorems in problem-solving and analysis.
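Once every graded question carries an image-type tag and one of these simplified cognitive categories, the per-category accuracy tables reported in Section 4 reduce to a group-by. The sketch below assumes a flat list of graded records with illustrative field names.

```python
# Sketch of the response-analysis step: aggregate graded records into
# accuracy-rate / count tables per cognitive skill and per image type.
# The record fields below are illustrative assumptions.

import pandas as pd

# Each record: exam, question id, category tags, and whether GPT-4V was correct.
records = [
    {"exam": "GRE Quant", "qid": 1, "cognitive": "Quantitative Skills",
     "image_type": "Tables", "correct": True},
    {"exam": "GRE Quant", "qid": 2, "cognitive": "Logical Reasoning",
     "image_type": "Geometry", "correct": False},
    # ... one record per graded question
]

df = pd.DataFrame(records)

def accuracy_table(frame: pd.DataFrame, by: str) -> pd.DataFrame:
    """Accuracy rate and question count grouped by the given tag column."""
    out = frame.groupby(by)["correct"].agg(accuracy_rate="mean", count="size")
    return out.sort_values("accuracy_rate", ascending=False).round(4)

print(accuracy_table(df[df["exam"] == "GRE Quant"], "cognitive"))
print(accuracy_table(df[df["exam"] == "GRE Quant"], "image_type"))
```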

3.2 Persona Prompting

We try different prompting strategies and examine whether they help improve accuracy.

3.2.1 “Domain Expert” Persona

TO BE DONE IN THE ACTUAL PAPER

3.3 Analysis of Existing LLMs with Bloom’s Taxonomy

TO BE DONE IN THE ACTUAL PAPER

3.4 Multistage Adaptive Testing (MST)

TO BE DONE IN THE ACTUAL PAPER

4 Results

4.1 Verified: LMMs ✓Domain Knowledge, ✗Cognitive Complexity

4.1.1 GRE Quantitative

Category Accuracy Rate Count
Quantitative Skills 0.5915 71
Memory 0.5 18
Logical Reasoning 0.4928 69

Table 1: GRE Quantitative: Accuracy Rate and Count by Cognitive Skill Category

Category Accuracy Rate Count
Tables 0.9091 22
Geometry 0.4583 24
Data Visualizations 0.4286 42
Analytic Geometry 0.3636 11

Table 2: GRE Quantitative: Accuracy Rate and Count by Image Type

One probable reason that LMMs are especially good at Tables is that these questions require only OCR-style extraction of data from the table, with little processing or understanding of genuinely graphical information. Another finding is that GPT-4V does a better job on questions without answer choices, and its accuracy decreases when it is required to select multiple correct answers.
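One way to grade the multiple-answer questions is strict set matching on the parsed choices, as sketched below. The parsing convention follows the “Final Answer: (A), (B)” format from the baseline prompts, and the helper names are illustrative rather than the exact grading code used here.

```python
# Sketch: grade a multiple-answer response by strict set match on the
# parsed choices, following the "Final Answer: (A), (B)" convention from
# the baseline prompts. The helper names are illustrative assumptions.

import re


def parse_choices(final_answer: str) -> set[str]:
    """Extract choice letters such as 'A', 'B' from '(A), (B)' style text."""
    return set(re.findall(r"\(([A-J])\)", final_answer.upper()))


def grade_multi_answer(final_answer: str, gold: set[str]) -> bool:
    """Credit is given only when the predicted set exactly matches the gold set."""
    return parse_choices(final_answer) == gold


# Example: a response ending with "Final Answer: (A), (C)" graded against {"A", "C"}.
print(grade_multi_answer("(A), (C)", {"A", "C"}))  # True
```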

4.1.2 GRE Physics Subject Test

Notably, the GRE Physics Subject Test exhibits a much higher overall accuracy.

Category Accuracy Rate Count
Memory 0.7209 43
Logical Reasoning 0.7 40
Quantitative Skills 0.5714 14

Table 3: GRE Physics Sub: Accuracy Rate and Count by Cognitive Skill Category

4.1.3 AMC 12

Notably, AMC 12 questions are generally arranged in ascending order of difficulty, with questions 1-10 being relatively easy, 11-20 of medium difficulty, and 21-25 the most cognitively challenging. From the graph of accuracy by question number, we can conclude that GPT-4V performs worse on the harder questions that require greater cognitive complexity.
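The per-band breakdown behind this observation can be computed by bucketing question numbers into the three difficulty bands described above; the snippet below is a minimal sketch, with the record format (question number plus a correctness flag) as an illustrative assumption.

```python
# Sketch: bucket AMC 12 questions into the difficulty bands described above
# (1-10 easy, 11-20 medium, 21-25 hard) and compute accuracy per band.
# The record format is an illustrative assumption.

import pandas as pd

amc = pd.DataFrame([
    {"qnum": 3, "correct": True},
    {"qnum": 14, "correct": False},
    {"qnum": 23, "correct": False},
    # ... one row per graded AMC 12 question
])

amc["band"] = pd.cut(
    amc["qnum"],
    bins=[0, 10, 20, 25],
    labels=["easy (1-10)", "medium (11-20)", "hard (21-25)"],
)

print(amc.groupby("band", observed=True)["correct"].agg(accuracy_rate="mean", count="size"))
```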

4.1.4 Summary

This part of the study aims to evaluate whether accuracy diminishes as cognitive complexity increases, under the hypothesis that LMMs excel in scenarios demanding extensive domain knowledge but limited cognitive complexity. The collected data and analysis verify this hypothesis.

• Domain Knowledge: AMC 12 (32.43%) ≈ GRE Quantitative (53.80%) < GRE Physics Subject (70.45%)

• Cognitive Complexity: GRE Physics Subject (70.45%) ≈ GRE Quantitative (53.80%) < AMC 12 (32.43%)

Category Accuracy Rate Count
Science Diagrams 0.7045 44

Table 4: GRE Physics Sub: Accuracy Rate and Count by Image Type

More conclusions can be made from this part of the study.

Cognitive Skills Performance:

• GRE Quantitative and Physics showed moderate success in “Quantitative Skills” (0.5915 and 0.5714, respectively), suggesting LLMs’ decent capability in numerical reasoning.

• Performance in “Memory” (prior knowledge) was moderate to strong across GRE Quantitative and Physics (0.5 and 0.7209, respectively), indicating a reliably large base of domain knowledge.

• “Logical Reasoning” scores were lower, especially on GRE Quantitative (0.4928), pointing to the challenges LLMs face in complex problem-solving.

Image Types Performance:

• LLMs excelled at interpreting “Tables” in GRE Quantitative (0.9091), highlighting their strength in structured data analysis, for the reason detailed above.

• Performance in “Geometry” and “Analytic Geometry” was weaker across exams, with AMC 12 showing particularly low scores (0.3378 and 0.3333, respectively).

• Results for “Data Visualizations” and “Science Diagrams” were mixed: GRE Physics showed competency on science diagrams (0.7045), while performance on data visualizations in GRE Quantitative was weaker (0.4286).

Category Accuracy Rate Count
Logical Reasoning 0.3415 82
Quantitative Skills 0.338 71
Memory 0.25 32

Table 5: AMC 12: Accuracy Rate and Count by Cognitive Skill Category

Category Accuracy Rate Count
Tables 1.0 1
Geometry 0.3378 74
Analytic Geometry 0.3333 6
Data Visualizations 0.0 1

Table 6: AMC 12: Accuracy Rate and Count by Image Type

General Accuracy:


• GRE Physics exhibited the highest overall accuracy (0.7045), suggesting a stronger alignment with LLMs’ capabilities in physics domain knowledge.

• AMC 12 had the lowest general accuracy (0.3243), reflecting the higher cognitive complexity that LLMs struggle with in creative and complex mathematical problem-solving.

5 Evaluation

5.1 Qualitative Observations with Examples

5.1.1 Behavior of Guessing & Partially Incorrect Approach

In the following instance of solving a problem, GPT-4V demonstrated a partially correct approach, yet its final answer was incorrect.

Strengths:

• Accurately identified the problem as a geometry question, finding relevant formulas effectively.

• Correctly recognized the relationship between the areas of the inner square and the surrounding triangles, indicating an understanding of fundamental geometric relationships.

• Exhibited logical thinking in breaking down the problem into smaller, solvable components.

Weaknesses:

• Misinterpreted the line segments in the diagram, which led to an erroneous calculation of the side length of the inner square. This highlights a limitation in visual interpretation and spatial reasoning.

• Relied on estimation and guessing for the final answer rather than precise calculation, indicating a potential shortfall in handling complex, multi-step problems.

• Demonstrated a gap in problem-solving consistency, as the initial correct approach did not translate into a correct final answer, suggesting a need for improved integration of different cognitive processes.


6 Discussion & Future Directions

The studies proposed in the Introduction will be implemented in this section. TO BE DONE IN THE ACTUAL PAPER

7 Conclusion

TO BE DONE IN THE ACTUAL PAPER

Limitations

TO BE DONE IN THE ACTUAL PAPER

Ethics Statement

TO BE DONE IN THE ACTUAL PAPER

Acknowledgements

This research was incubated in the UChicago course COGS 20100, “Perspectives on large language models: computational, cognitive, social,” under the guidance and supervision of Prof. Eugene Yu Ji. Special thanks to my teammates in the course: Fady Adal, Jiaming “Jimmy” Feng, and Yushu Qiu. This research is generously funded by the UChicago CS Career Advancement Center.

