Re-evaluating GPT-4’s bar exam performance

Eric Martínez - forthcoming - Artificial Intelligence and Law: 1-24.

Abstract

Perhaps the most widely touted of GPT-4's at-launch, zero-shot capabilities has been its reported 90th-percentile performance on the Uniform Bar Exam. This paper begins by investigating the methodological challenges in documenting and verifying the 90th-percentile claim, presenting four sets of findings that indicate that OpenAI's estimates of GPT-4's UBE percentile are overinflated. First, although GPT-4's UBE score nears the 90th percentile when examining approximate conversions from February administrations of the Illinois Bar Exam, these estimates are heavily skewed towards repeat test-takers who failed the July administration and score significantly lower than the general test-taking population. Second, data from a recent July administration of the same exam suggests GPT-4's overall UBE performance was below the 69th percentile, and ~48th percentile on essays. Third, based on official NCBE data and several conservative statistical assumptions, GPT-4's performance against first-time test-takers is estimated to be ~62nd percentile, including ~42nd percentile on essays. Fourth, when examining only those who passed the exam (i.e., licensed or license-pending attorneys), GPT-4's performance is estimated to drop to ~48th percentile overall, and ~15th percentile on essays. In addition to investigating the validity of the percentile claim, the paper also investigates the validity of GPT-4's reported scaled UBE score of 298. The paper successfully replicates the MBE score, but highlights several methodological issues in the grading of the MPT + MEE components of the exam, which call into question the validity of the reported essay score. Finally, the paper investigates the effect of different hyperparameter combinations on GPT-4's MBE performance, finding no significant effect of adjusting temperature settings, and a significant effect of few-shot chain-of-thought prompting over basic zero-shot prompting. Taken together, these findings carry timely insights for the desirability and feasibility of outsourcing legally relevant tasks to AI models, as well as for the importance of AI developers implementing rigorous and transparent capabilities evaluations to help secure safe and trustworthy AI.
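To make the percentile arithmetic concrete, the sketch below converts a scaled score to a percentile under a normal approximation of the score distribution, a common approach when only a mean and standard deviation are published. Note that this is not the paper's actual code, and the mean and standard deviation used here are hypothetical placeholders, not the official NCBE or Illinois figures the paper relies on.

```python
# Minimal sketch: score-to-percentile conversion under a normal
# approximation. The mean and standard deviation are HYPOTHETICAL
# placeholders for illustration, not official NCBE statistics.
from scipy.stats import norm

def percentile_of_score(score: float, mean: float, sd: float) -> float:
    """Return the percentile (0-100) of `score` in Normal(mean, sd**2)."""
    return 100 * norm.cdf(score, loc=mean, scale=sd)

# GPT-4's reported scaled UBE score of 298 against an illustrative
# test-taker distribution:
print(f"~{percentile_of_score(298, mean=280.0, sd=25.0):.0f}th percentile")
```

As the paper's findings suggest, the choice of comparison population (February vs. July test-takers, first-timers only, or passers only) changes the distribution plugged into this conversion, and hence the resulting percentile.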

Similar books and articles

Instructions for Authors. [author unknown] - 2002 - Artificial Intelligence and Law 10 (4): 303-308.
Instructions for Authors. [author unknown] - 2002 - Artificial Intelligence and Law 10 (1): 219-224.
Instructions for Authors. [author unknown] - 2001 - Artificial Intelligence and Law 9 (4): 315-320.
Index of Key Words. [author unknown] - 1997 - Artificial Intelligence and Law 5 (4): 347-347.
Instructions for Authors. [author unknown] - 2004 - Artificial Intelligence and Law 12 (4): 447-452.
Correction to: Reasoning with inconsistent precedents. Ilaria Canavotto - forthcoming - Artificial Intelligence and Law: 1-4.
Editors' introduction. Henry Prakken & Giovanni Sartor - 1996 - Artificial Intelligence and Law 4 (3-4): 157-161.
