Singh, A. K., Neeshu, S., Verma, V. K., & Prasanth, G. (2025). Beyond Accuracy: A Psychometric Benchmark and Stability Analysis of 15 Large Language Models on NEET-PG Medicine Questions (2021--2025). Journal of Contemporary Clinical Practice, 11(12), 783-789.
MLA
Singh, Anupam K., et al. "Beyond Accuracy: A Psychometric Benchmark and Stability Analysis of 15 Large Language Models on NEET-PG Medicine Questions (2021--2025)." Journal of Contemporary Clinical Practice 11.12 (2025): 783-789.
Chicago
Singh, Anupam K., Neeshu S., Vinay K. Verma, and Prasanth G. "Beyond Accuracy: A Psychometric Benchmark and Stability Analysis of 15 Large Language Models on NEET-PG Medicine Questions (2021--2025)." Journal of Contemporary Clinical Practice 11, no. 12 (2025): 783-789.
Harvard
Singh, A. K., Neeshu, S., Verma, V. K. and Prasanth, G. (2025) 'Beyond Accuracy: A Psychometric Benchmark and Stability Analysis of 15 Large Language Models on NEET-PG Medicine Questions (2021--2025)', Journal of Contemporary Clinical Practice, 11(12), pp. 783-789.
Vancouver
Singh AK, Neeshu S, Verma VK, Prasanth G. Beyond Accuracy: A Psychometric Benchmark and Stability Analysis of 15 Large Language Models on NEET-PG Medicine Questions (2021--2025). Journal of Contemporary Clinical Practice. 2025 Dec;11(12):783-789.
Background: Large language models (LLMs) are increasingly utilized in medical education and assessment, yet evaluations often rely solely on raw accuracy without accounting for stochastic variability or item difficulty. This study benchmarks 15 contemporary LLMs on the National Eligibility-cum-Entrance Test for Postgraduate (NEET-PG) Medicine examination, assessing accuracy, run-to-run stability, and response bias using psychometric methods. Methods: We evaluated models from major families (GPT, Gemini, Claude, Llama, DeepSeek, Kimi) using NEET-PG Medicine recall questions (n=121) from 2021--2025. Each model attempted every question three times to quantify reproducibility. Performance was analyzed using generalized linear mixed-effects models (GLMM) to adjust for year and difficulty, and Rasch Item Response Theory (IRT) to estimate latent ability. We further assessed run-to-run stability and option-selection bias. Results: GLMM-adjusted probabilities of a correct response varied widely, with GPT 5.2 (0.95; 95% CI 0.93--0.97) and Gemini 3 Flash (0.94; 95% CI 0.91--0.96) emerging as the top performers. While Claude Opus 4.5 demonstrated high run-to-run stability (98% identical responses), Claude 3.5 Sonnet showed substantial variability (41%). Systematic option-choice bias was observed, particularly in Llama models, which under-selected option A (-15.7%) and over-selected D (+8.8%). A majority-vote ensemble achieved 87.6% accuracy but did not surpass the single best model. Rasch analysis confirmed a strong concordance with GLMM rankings (Spearman rho = 1.00). Conclusion: Newer LLMs demonstrate high proficiency on NEET-PG Medicine items, but raw accuracy masks significant differences in reliability and response bias. High-performing models are generally more stable and calibrated, whereas others exhibit marked stochasticity and letter preferences. These findings emphasize that clinical and educational adoption of LLMs requires rigorous psychometric validation beyond simple pass rates.
Keywords
Large Language Models
NEET-PG
Medical Education
Psychometric Benchmark
Item Response Theory
Run-to-Run Stability
INTRODUCTION
Large language models (LLMs) such as ChatGPT have rapidly entered medical learning and assessment workflows. Their ability to generate fluent, context-aware answers has raised a practical question for educators and regulators: can these systems reliably solve high-stakes multiple-choice medical examinations, and if so, what does that imply for training, test security, and future assessment design? Early benchmarking using United States Medical Licensing Examination (USMLE)--style items showed that ChatGPT could reach or approach passing-level performance on some components, highlighting both potential educational utility and the need for careful evaluation of limitations and failure modes [1,2].
At the same time, "accuracy" alone is not sufficient for judging readiness or usefulness in medicine. LLMs can produce confident but incorrect content, and performance can vary with prompting and sampling, creating run-to-run variability that is uncommon in traditional test-takers. Work in clinical question-answering benchmarks has therefore emphasized structured evaluation frameworks, including attention to factuality, reasoning quality, and potential harm, rather than relying only on raw scores [3]. Systematic reviews also note fast growth of LLM use in medical education, but point out that study designs are heterogeneous and often lack robust methods to address uncertainty, reproducibility, and generalizability across settings and exam types [4].
In India, the National Eligibility-cum-Entrance Test for Postgraduate (NEET-PG), conducted under the National Board of Examinations in Medical Sciences (NBEMS), functions as a single-window entrance examination for admission to MD/MS/PG Diploma and related NBEMS pathways, making it a high-stakes national assessment [5]. Although there are emerging reports evaluating LLMs on NEET-PG, most have focused on a small number of systems or limited item sets, leaving uncertainty about comparative performance across model families and over time as models evolve [6]. Broader work comparing LLMs across different countries' licensing examinations further supports that performance is context-dependent and may not transfer cleanly across curricula or item styles [7].
Therefore, robust benchmarking on NEET-PG requires methods that (i) account for repeated runs and stochastic variability, (ii) adjust for differences in question difficulty and year, and (iii) provide interpretable effect sizes for model comparisons. Psychometric approaches such as item-response theory (including Rasch models) offer a principled way to place "ability" (model performance) and "difficulty" (question hardness) on a common scale, complementing accuracy-based analyses [8].
In this study, we evaluate multiple contemporary LLMs across NEET-PG Medicine questions from 2021--2025, quantify accuracy and stability across repeated runs, and compare models using adjusted statistical estimates to support reproducible, publication-grade conclusions.
MATERIALS AND METHODS
Question source and answer key
NEET-PG Medicine recall questions for 2021--2025 were obtained from MedicoApps (courtesy: Dr A.G.) and stored as question images. A consolidated answer key was created and then independently verified by two internal medicine specialists (Dr A.K.S. and Dr N.S.; >15 years' experience each). Any disagreements were resolved by discussion and review of standard references, and the final key was treated as the ground truth.
Model evaluation workflow
Fifteen large language models (LLMs) across major families (OpenAI GPT series, Google Gemini series, Anthropic Claude series, Meta Llama series, DeepSeek, and Kimi) were evaluated. Each question was submitted to each model via API calls (OpenRouter), using a standardized instruction to select a single best option (A--D). Each model--question pair was run three times to capture run-to-run variability. For models without native image input, a standardized OCR harness (Mistral OCR) was used to convert the question image to text, after which the same instruction format was applied. Model outputs were parsed to extract the chosen option (A--D), as illustrated in the sketch below. GPT-3.5 was not evaluated in 2024 (planned missingness), and no imputation was performed.
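As an illustration of the parsing step, the sketch below shows one simple way to extract a single A--D choice from a model's free-text reply; `parse_option` is a hypothetical helper written for clarity, not the exact harness used in this study.

```r
# Minimal R sketch of option parsing (hypothetical helper, not the study's exact harness):
# return the first standalone A-D letter found in the reply, or NA if none is present.
parse_option <- function(reply) {
  m <- regmatches(reply, regexpr("\\b[A-Da-d]\\b", reply))
  if (length(m) == 0) return(NA_character_)  # unparseable or refused response
  toupper(m)
}

parse_option("The single best answer is (C) because ...")  # "C"
parse_option("Unable to determine without the image.")     # NA
```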
Data structure
Data were stored in long format, with one row per model response per question per run per year. Variables included exam year (`neet_year`), model, question identifier (`question_code`), run (1--3), selected option (`option_selected`), correct option (`correct_option`), and correctness (`marks`, 1/0).
Statistical analysis
We first computed unadjusted accuracy for each model as the proportion of correct responses, and additionally computed majority-vote accuracy at the question level (>= 2 of 3 runs correct). Run-to-run stability was summarized as the proportion of questions for which all three runs selected the identical option (`p_all_same`). To evaluate response-letter tendencies, we tabulated the proportion of A/B/C/D selections per model. For the primary adjusted comparison, correctness (`marks`) was modeled using a generalized linear mixed-effects model (GLMM) with logit link and binomial outcome: `marks ~ model + neet_year + (1 | question_code)`, where `question_code` was included as a random intercept to account for item difficulty; models were fitted in R using `lme4` [9] and estimated marginal means and contrasts were obtained using `emmeans` [10]. Adjusted performance was reported as marginal probabilities of a correct response with Wald-type 95% confidence intervals, and pairwise comparisons were performed on the log-odds scale with multiplicity correction (Holm adjustment for comparisons versus a reference model; Tukey adjustment for all-pairs comparisons). For psychometric benchmarking, we additionally fit a one-parameter logistic Rasch IRT model treating each model--run as a "person" and each question as an "item," using a person-by-item matrix of binary correctness with missing values retained as NA; items with no response variability (all correct or all incorrect among observed responses) were excluded as non-identifiable. Rasch modeling used the `mirt` package [11], and interpretation followed standard IRT guidance in medical education measurement [12].
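For transparency, a minimal R sketch of the primary analyses described above is given below. It assumes a long-format data frame `d` with the variables listed in the Data structure section and a persons-by-items 0/1 matrix `resp` for the Rasch step; object names and the reference-level label are illustrative, not the study's exact scripts.

```r
# Minimal R sketch of the GLMM and Rasch analyses (illustrative, not the study's exact code).
library(lme4)     # glmer
library(emmeans)  # estimated marginal means and contrasts
library(mirt)     # Rasch (1-PL) IRT

# GLMM: correctness on the logit scale, with question as a random intercept
fit <- glmer(marks ~ model + neet_year + (1 | question_code),
             data = d, family = binomial)

# Adjusted marginal probabilities of a correct response per model (Wald-type CIs)
emm <- emmeans(fit, ~ model, type = "response")

# Contrasts vs the reference model (Holm) and all pairwise comparisons (Tukey),
# computed on the log-odds scale and reported as odds ratios
contrast(emm, method = "trt.vs.ctrl", ref = "GPT 5.2", adjust = "holm")
pairs(emm, adjust = "tukey")

# Rasch model: each model-run is a "person", each question an "item";
# `resp` is a binary persons x items matrix with NAs retained
# (items with no response variability are removed beforehand)
rasch_fit <- mirt(resp, model = 1, itemtype = "Rasch")
theta <- fscores(rasch_fit)                                       # latent ability estimates
b     <- coef(rasch_fit, IRTpars = TRUE, simplify = TRUE)$items   # item difficulty parameters
```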
RESULTS
Dataset completeness and structure
Across NEET-PG years 2021--2025, model outputs were successfully captured for all planned model--year combinations, with each model answering all questions for a given year and each model--question pair repeated across three independent runs. There were 121 unique questions across the five years. The year-wise row counts per model matched the expected design of three runs per question (e.g., 57 rows/model in 2021, consistent with 19 questions x 3 runs). As planned, GPT-3.5 was not evaluated in 2024, resulting in one structurally missing model--year cell that was not treated as data loss (Table 1).
Year-adjusted model performance (mixed-effects model)
Unadjusted accuracy for each model, by exam year, is shown in Figure 1.
In the mixed-effects logistic regression adjusting for exam year and question-level difficulty, estimated marginal probabilities (Table 2) showed clear separation in performance. The highest adjusted probability of a correct response was observed for GPT 5.2 (0.954; 95% CI 0.927--0.971) and Gemini 3 Flash (0.941; 95% CI 0.910--0.962). Several models formed a second high-performing tier with adjusted probabilities in the 0.85--0.89 range, including GPT-3.5 (0.892; 95% CI 0.839--0.929), Claude Opus 4.5 (0.880; 95% CI 0.829--0.917), DeepSeek V3.2 (0.872; 95% CI 0.819--0.912), Kimi K2 (0.867; 95% CI 0.812--0.908), and Gemini 3 Pro (0.854; 95% CI 0.795--0.897). Lower adjusted probabilities were observed for models such as GPT-4 Turbo (0.429; 95% CI 0.341--0.522) and Claude 3.5 Sonnet (0.391; 95% CI 0.306--0.483).
A likelihood ratio test comparing the model containing only year and question effects with the model that additionally included a model term showed that adding the model term significantly improved fit (chi^2(14) = 902.59, p < 0.0001), supporting overall differences in performance between models.
Pairwise model comparisons
Pairwise contrasts were then derived from the fitted model to quantify between-model differences while controlling for multiple comparisons.
Against the reference (GPT 5.2) (Holm-adjusted): Gemini 3 Flash did not differ significantly from GPT 5.2 (adjusted p = 0.347; OR = 0.78, 95% CI 0.36--1.70). In contrast, all other models showed significantly lower odds of correct responses compared with GPT 5.2 after adjustment (e.g., GPT-3.5: adjusted p = 0.0016; OR = 0.40, 95% CI 0.18--0.89; Claude 3.5 Sonnet: adjusted p < 0.0001; OR = 0.031, 95% CI 0.016--0.063). (Table 3)
All-pairs comparisons (Tukey-adjusted) highlighted additional structure within the high-performing tier. For example, GPT 5.2 had significantly higher odds of correct responses than Gemini 3 Pro (OR = 3.52, 95% CI 1.53--8.11; adjusted p = 3.1 x 10^-5), Kimi K2 (OR = 3.15, 95% CI 1.36--7.30; adjusted p = 3.4 x 10^-4), DeepSeek V3.2 (OR = 3.01, 95% CI 1.30--6.99; adjusted p = 8.5 x 10^-4), and Claude Opus 4.5 (OR = 2.81, 95% CI 1.21--6.54; adjusted p = 0.003). GPT 5.2 vs Gemini 3 Flash remained non-significant after Tukey adjustment (adjusted p approx 1.00). In addition, Gemini 3 Flash outperformed Gemini 3 Pro (OR = 2.74, 95% CI 1.23--6.08; adjusted p = 0.0017). Several "near-neighbor" comparisons were not significant (e.g., DeepSeek V3.2 vs DeepSeek V3), suggesting similar performance within some families of models.
Run-to-run stability and majority-vote accuracy
Run-to-run stability (proportion of questions with identical option selection across all three runs) varied substantially across models. Claude Opus 4.5 showed the highest stability (`p_all_same` = 0.983), followed by GPT 5.2 (0.942) and Gemini 3 Flash (0.934), whereas Claude 3.5 Sonnet exhibited low stability (0.413). Majority-vote scoring (>= 2 of 3 runs correct) produced a similar ordering, with GPT 5.2 achieving the highest majority-vote accuracy (0.901) and Gemini 3 Flash next (0.884), while Claude 3.5 Sonnet remained lowest (0.405). Collectively, these findings indicate that some models behave near-deterministically across reruns, whereas others show substantial response variability, which can influence both reproducibility and the potential benefit of rerun-and-vote strategies (Figure 2).
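A minimal R sketch of these two summaries, assuming the long-format data frame `d` described in Methods, is shown below; it directly implements the definitions of `p_all_same` and majority-vote accuracy used here.

```r
# Minimal R sketch of run-to-run stability and majority-vote accuracy
# (illustrative; assumes the long-format data frame `d` from Methods).
library(dplyr)

per_question <- d %>%
  group_by(model, question_code) %>%
  summarise(
    all_same  = n_distinct(option_selected) == 1,  # identical option across all 3 runs
    maj_right = sum(marks) >= 2,                   # majority vote: >= 2 of 3 runs correct
    .groups = "drop"
  )

per_question %>%
  group_by(model) %>%
  summarise(p_all_same        = mean(all_same),
            majority_vote_acc = mean(maj_right))
```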
Option-choice patterns
The distribution of correct options in the answer key was non-uniform (A most frequent), and models showed distinct option-selection fingerprints. For example, Claude Opus 4.5 selected option A in 38.8% of responses and D in 14.3%, while Claude 3.5 Sonnet showed a more balanced A/B split (A 30.9%, B 28.4%). These distributions were examined to identify response-letter preferences that might contribute to performance differences, especially on items where the model is uncertain.
Rasch modelling
Rasch modeling yielded model-level ability estimates (theta) on a common latent scale. Higher theta indicated greater probability of correct responses across the pooled item set. Ability estimates showed clear separation across models, with the top-performing models demonstrating positive theta values and lower-performing models clustered at negative theta values.
Item difficulty parameters (b) covered a wide range, indicating that the benchmark included both very easy and very difficult items, supporting discrimination across model abilities.
When grouped by model family, the ability estimates demonstrated substantial within-family variability, indicating that performance differences were driven by specific model versions rather than family membership alone. Rasch ability estimates showed strong concordance with GLMM-adjusted accuracy (Pearson r = 0.986, 95% CI 0.957--0.995; p < 0.001). Rank ordering was identical across the two approaches (Spearman rho = 1.00; p < 0.001).
Cognitive cluster (model response similarity)
Model--model similarity was calculated as the proportion of shared questions for which two models selected the same final answer, where the final answer for each model--question was taken as the majority vote across its 3 runs (ties counted as "no majority" and excluded from agreement for that model on that item). Across all model pairs, agreement ranged from 39.8% to 92.4% (median 55.8%, IQR 49.1%--76.3%), indicating substantial heterogeneity in how similarly models answered the same questions.
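The agreement matrix can be computed as in the sketch below, assuming a data frame `final` with one majority-vote answer per model per question (columns `question_code`, `model`, `final_answer`, with NA where no majority exists); names are illustrative.

```r
# Minimal R sketch of the model-model agreement matrix (illustrative).
library(dplyr)
library(tidyr)

wide <- final %>%
  pivot_wider(id_cols = question_code, names_from = model, values_from = final_answer)

models <- setdiff(names(wide), "question_code")
agree  <- matrix(NA_real_, length(models), length(models),
                 dimnames = list(models, models))

for (i in models) for (j in models) {
  ok <- !is.na(wide[[i]]) & !is.na(wide[[j]])      # questions both models answered with a majority
  agree[i, j] <- mean(wide[[i]][ok] == wide[[j]][ok])
}
```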
The highest agreement was observed within the Llama family (Llama 4 Maverick vs Llama 3.2 90B: 92.4% agreement, n=119 questions). Other strong within-family clustering included DeepSeek V3.2 vs DeepSeek V3: 86.1% (n=115) and Gemini 3 Flash vs Gemini 3 Pro: 83.9% (n=118). Among cross-family pairs, the strongest similarity was between the two highest-performing systems: GPT 5.2 vs Gemini 3 Flash: 89.3% agreement (n=121). Additional high cross-family similarities included GPT-3.5 vs Kimi K2: 86.2% (n=87) and DeepSeek V3.2 vs Kimi K2: 84.7% (n=118).
On average, within-family similarity was higher than between-family similarity (mean 63.2% vs 59.8%). Models with higher accuracy also tended to be more "central" (i.e., agree more with others): the correlation between a model's raw accuracy and its mean similarity to other models was r = 0.95, suggesting that stronger models converge more often on the same answers.
Option bias (model preference vs ground truth)
Option bias was computed as: (model's proportion choosing A/B/C/D) - (true answer-key proportion A/B/C/D), pooling across all years and runs (valid responses A--D).
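A minimal sketch of this calculation, assuming the long-format data frame `d` and an answer key `key` with one `correct_option` per unique question (names illustrative):

```r
# Minimal R sketch of option bias: each model's letter share minus the answer-key share,
# in percentage points (illustrative; assumes `d` and `key` as described above).
library(dplyr)

key_prop <- prop.table(table(key$correct_option))   # ground-truth shares of A/B/C/D

d %>%
  filter(option_selected %in% c("A", "B", "C", "D")) %>%
  count(model, option_selected) %>%
  group_by(model) %>%
  mutate(model_prop = n / sum(n),
         bias_pp    = 100 * (model_prop - as.numeric(key_prop[option_selected]))) %>%
  ungroup()
```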
The ground-truth distribution across the 121 questions was: A 40.5%, B 26.4%, C 15.7%, D 17.4%. Several models showed systematic deviations from this distribution, indicating response-letter preference beyond what the key would predict.
The largest biases were seen in the Llama models, which strongly under-selected option A and over-selected option D (values in percentage points, pp):
Llama 3.2 90B: A -15.7 pp, D +8.8 pp
Llama 4 Maverick: A -15.4 pp, D +9.1 pp
Claude Sonnet 4 showed a distinct pattern with B under-selection (-10.2 pp) and relative over-selection of C (+5.2 pp) and D (+6.3 pp). GPT-4 Turbo also demonstrated notable letter imbalance (A -7.7 pp, C +8.3 pp, D -4.1 pp).
In contrast, the best-performing models had minimal letter skew. For example:
GPT 5.2: A -1.1 pp, B -1.1 pp, C 0.0 pp, D +2.2 pp
Gemini 3 Flash: A -1.4 pp, B 0.0 pp, C +0.3 pp, D +1.1 pp
This indicates that higher performance was generally associated with more "calibrated" option usage, while some lower/mid performers exhibited stronger defaulting to specific letters.
Family-wise accuracy trends (2021--2025)
Unadjusted accuracy was calculated as the mean of marks (0/1) per model within each year (all runs included). The year-wise leaders were:
2021: Gemini 3 Flash 93.0%
2022: GPT 5.2 95.8%
2023: GPT 5.2 87.2% and Gemini 3 Flash 87.2% (tie)
2024: GPT 5.2 89.6%
2025: GPT 5.2 88.3%
Within the OpenAI family, GPT 5.2 remained consistently high across all years (87.7%--95.8%). Older OpenAI models were substantially lower but improved over time (e.g., GPT-4o: 45.6% in 2021 -> 70.0% in 2025; GPT-4 Turbo: 36.8% -> 55.0%). GPT-3.5 showed stable mid--high performance where evaluated (82.5% in 2021, 84.7% in 2022, 79.5% in 2023, 81.7% in 2025) and was not run in 2024.
Gemini models showed a clear hierarchy: Gemini 3 Flash stayed near the top but declined by 2025 (93.0% in 2021 -> 80.0% in 2025), while Gemini 3 Pro declined from 87.7% -> 70.0%. DeepSeek models stayed in a relatively tight high band with a dip in 2023 (e.g., DeepSeek V3.2: 83.3% in 2022 -> 73.1% in 2023 -> 83.3% in 2025). Kimi K2 performed consistently well and improved late (77.2% in 2021 -> 85.0% in 2025). Llama models showed high year-to-year variability (spike in 2022 with lower performance in 2021/2023), suggesting stronger sensitivity to the specific year's question set.
Ensemble effect (majority vote)
The ensemble was computed as a majority vote across the top 5 models by overall accuracy (GPT 5.2, Gemini 3 Flash, GPT-3.5, Claude Opus 4.5, DeepSeek V3.2). For each question, each model contributed its own within-model majority answer across 3 runs, and then an across-model vote was taken (ties counted as incorrect). (Figure 3)
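A minimal sketch of this ensemble, reusing the per-model majority answers `final` and the answer key `key` from the sketches above (names illustrative):

```r
# Minimal R sketch of the top-5 majority-vote ensemble (illustrative).
library(dplyr)

top5 <- c("GPT 5.2", "Gemini 3 Flash", "GPT-3.5", "Claude Opus 4.5", "DeepSeek V3.2")

ensemble <- final %>%
  filter(model %in% top5) %>%
  group_by(question_code) %>%
  summarise(vote = {
    tab <- sort(table(final_answer), decreasing = TRUE)
    if (length(tab) > 1 && tab[1] == tab[2]) NA_character_ else names(tab)[1]  # tie -> no answer
  }, .groups = "drop") %>%
  left_join(key, by = "question_code")

# Ties (NA votes) are counted as incorrect
mean(!is.na(ensemble$vote) & ensemble$vote == ensemble$correct_option)
```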
The ensemble achieved 87.60% accuracy, ranking 3rd overall by raw accuracy:
GPT 5.2: 89.81%
Gemini 3 Flash: 87.88%
Ensemble (top-5): 87.60%
The ensemble was therefore approximately 2.2 percentage points below the best single model (GPT 5.2), but it outperformed GPT-3.5 by approximately 5.6 points (87.60% vs 82.02%) and exceeded all models ranked below Gemini 3 Flash. This pattern suggests that voting can reduce some idiosyncratic errors from individual models but cannot fully surpass the strongest model when the top systems already agree on most items.
Table 1. Dataset completeness by year and model. Counts of model responses by NEET-PG year and model; GPT-3.5 was not evaluated in 2024.
Model 2021 2022 2023 2024 2025
Claude 3.5 Sonnet 57 72 78 96 60
Claude Opus 4.5 57 72 78 96 60
Claude Sonnet 4 57 72 78 96 60
DeepSeek V3 57 72 78 96 60
DeepSeek V3.2 57 72 78 96 60
Gemini 2.0 Flash 57 72 78 96 60
Gemini 3 Flash 57 72 78 96 60
Gemini 3 Pro 57 72 78 96 60
GPT 5.2 57 72 78 96 60
GPT-3.5 57 72 78 -- 60
GPT-4 Turbo 57 72 78 96 60
GPT-4o 57 72 78 96 60
Kimi K2 57 72 78 96 60
Llama 3.2 90B 57 72 78 96 60
Llama 4 Maverick 57 72 78 96 60
Table 2. GLMM-adjusted accuracy by model. Adjusted probability of a correct response for each model from the mixed-effects logistic regression, with standard errors and 95% confidence intervals.
Model Adjusted Probability Standard Error 95% CI
GPT 5.2 0.95 0.01 (0.93--0.97)
Gemini 3 Flash 0.94 0.01 (0.91--0.96)
GPT-3.5 0.89 0.02 (0.84--0.93)
Claude Opus 4.5 0.88 0.02 (0.83--0.92)
DeepSeek V3.2 0.87 0.02 (0.82--0.91)
Kimi K2 0.87 0.02 (0.81--0.91)
Gemini 3 Pro 0.85 0.03 (0.80--0.90)
DeepSeek V3 0.83 0.03 (0.77--0.88)
Llama 4 Maverick 0.57 0.05 (0.48--0.66)
GPT-4o 0.57 0.05 (0.48--0.66)
Llama 3.2 90B 0.55 0.05 (0.46--0.64)
Claude Sonnet 4 0.52 0.05 (0.43--0.61)
Gemini 2.0 Flash 0.47 0.05 (0.38--0.56)
GPT-4 Turbo 0.43 0.05 (0.34--0.52)
Claude 3.5 Sonnet 0.39 0.05 (0.31--0.48)
Table 3. Model contrasts against the top reference model (GPT 5.2). Comparisons of each model against the reference, reporting log-odds estimates, standard errors, test statistics, p-values, and odds ratios with 95% confidence intervals.
Comparison Estimate (log-odds) SE z ratio p-value Odds Ratio (95% CI)
Claude 3.5 Sonnet -- GPT 5.2 -3.47 0.24 -14.51 < 0.0001 0.03 (0.02--0.06)
GPT-4 Turbo -- GPT 5.2 -3.31 0.24 -13.89 < 0.0001 0.04 (0.02--0.07)
Gemini 2.0 Flash -- GPT 5.2 -3.13 0.24 -13.20 < 0.0001 0.04 (0.02--0.09)
Claude Sonnet 4 -- GPT 5.2 -2.93 0.24 -12.37 < 0.0001 0.05 (0.03--0.11)
Llama 3.2 90B -- GPT 5.2 -2.80 0.24 -11.84 < 0.0001 0.06 (0.03--0.12)
GPT-4o -- GPT 5.2 -2.74 0.24 -11.58 < 0.0001 0.06 (0.03--0.13)
Llama 4 Maverick -- GPT 5.2 -2.72 0.24 -11.51 < 0.0001 0.07 (0.03--0.13)
DeepSeek V3 -- GPT 5.2 -1.41 0.24 -5.77 < 0.0001 0.24 (0.12--0.50)
Gemini 3 Pro -- GPT 5.2 -1.26 0.25 -5.12 < 0.0001 0.28 (0.14--0.58)
Kimi K2 -- GPT 5.2 -1.15 0.25 -4.64 < 0.0001 0.32 (0.15--0.65)
DeepSeek V3.2 -- GPT 5.2 -1.10 0.25 -4.44 < 0.0001 0.33 (0.16--0.68)
Claude Opus 4.5 -- GPT 5.2 -1.03 0.25 -4.14 0.0001 0.36 (0.17--0.74)
GPT-3.5 -- GPT 5.2 -0.91 0.27 -3.36 0.001 0.40 (0.18--0.89)
Gemini 3 Flash -- GPT 5.2 -0.25 0.27 -0.94 0.34 0.78 (0.36--1.70)
DISCUSSION
In this multi-year benchmark of NEET-PG Medicine recall questions (121 unique items across 2021--2025; 19--32 questions/year; 3 runs/model/question), we found a wide spread in unadjusted accuracy across contemporary LLMs, from 41.9% (Claude 3.5 Sonnet) to 89.8% (GPT 5.2). The top tier comprised GPT 5.2 (89.8%) and Gemini 3 Flash (87.9%), followed by GPT-3.5 (82.0%), Claude Opus 4.5 (80.2%), and DeepSeek V3.2 (79.3%).
Accuracy
Unadjusted accuracy spanned a wide range (roughly 0.42 to 0.90), and the ranking pattern largely persisted after GLMM adjustment for exam year and question-level difficulty, suggesting that model-to-model differences were not explained only by differing mixes of "easy" and "hard" items. These findings align with the broader literature showing that newer generations of large language models can reach or exceed passing-threshold performance on standardized medical MCQ-style and licensing examinations, while earlier or less capable variants remain substantially below that level [13--18].
Repeatability
A key practical message from our study is that accuracy alone is not the whole story. Re-running each model three times revealed measurable stochasticity. Some models were highly repeatable (high "all-three-runs identical" proportion), whereas others varied more. Claude Opus 4.5 showed the highest stability (`p_all_same` = 0.983), followed by GPT 5.2 (0.942) and Gemini 3 Flash (0.934), whereas Claude 3.5 Sonnet exhibited low stability (0.413). In real clinical education or decision-support workflows, this variability matters because the same prompt can yield different final answers. In other words, an LLM can be "good on average" but still unpredictable for an individual question. This observation is consistent with concerns raised in health-care focused reviews of conversational LLMs, which emphasize reliability, safety, and the need for structured evaluation beyond headline accuracy.
Cognitive clustering and what it implies
The "cognitive cluster" heatmap (response-similarity matrix) adds an important layer beyond accuracy: high-performing models tended to answer similarly, suggesting convergence toward comparable solution heuristics (or overlapping training signals). On run-1 responses, the highest pairwise similarity was between GPT 5.2 and Gemini 3 Flash (0.92), while closely related open-weight models also clustered (e.g., Llama 4 Maverick vs Llama 3.2 90B: 0.86). In contrast, similarity between top and bottom performers was lower (e.g., GPT 5.2 vs Claude 3.5 Sonnet: 0.45). Quantitatively, the mean within-group similarity among the top 8 models was 0.78, compared with 0.45 among the lower 7, with between-group similarity approximately 0.48. Practically, this means that when models are strong, they often agree---helpful for confidence, but it also implies their errors can be correlated, limiting the upside of naive ensembling.
Ensemble effect: helpful, but not magic
A simple majority-vote ensemble (one vote per model per item, taken as each model's within-model majority across its three runs) achieved 87.6% accuracy. This exceeded most individual models, but did not surpass the best single model (GPT 5.2). This is exactly what we would expect when strong models are already highly correlated in their answers: majority vote improves robustness against idiosyncratic failures of weaker members, but it cannot reliably beat the best model unless the ensemble contains multiple high performers that make different (uncorrelated) mistakes. Conceptually, this is similar to the "self-consistency" idea in LLM reasoning, where aggregating multiple samples can improve reliability---yet gains depend on diversity in errors rather than sheer number of votes [19].
Option-choice bias: systematic "letter preferences"
The option-bias heatmap (model preference minus ground-truth distribution of correct options) showed that some models had non-trivial response-letter preferences. The most prominent pattern was in the Llama family: Llama 3.2 90B and Llama 4 Maverick under-selected option A by 15.7 and 15.4 percentage points, respectively, with compensatory over-selection of option D (+8.8 and +9.1 pp). In contrast, the best model (GPT 5.2) showed minimal bias (roughly within a few percentage points across options; e.g., A -1.1 pp, B -1.1 pp, C 0.0 pp, D +2.2 pp). These patterns matter because NEET-style MCQs do not have uniform answer-key distributions; if a model's "default guess" is skewed, observed accuracy may partially reflect a systematic letter preference rather than knowledge alone. Reporting option-choice bias therefore strengthens interpretability and reproducibility of exam-style evaluations.
Family-wise evolution across years
The family-wise year plots highlight two points. First, within-family heterogeneity is large: the OpenAI family spans approximately 44.6% (GPT-4 Turbo) to approximately 89.8% (GPT 5.2), and Gemini spans approximately 47.7% (Gemini 2.0 Flash) to approximately 87.9% (Gemini 3 Flash). Second, rank ordering is broadly stable, but not identical each year---e.g., Gemini 3 Flash led in 2021 (0.93), while GPT 5.2 led in 2022/2024/2025. This reinforces why year adjustment and question-level modeling (as we did with the GLMM and Rasch analyses) are useful: raw accuracies alone can be influenced by variation in item difficulty across years.
Why did GPT-3.5 perform better than GPT-4o / GPT-4 Turbo in this dataset, and why did Gemini 3 Flash outperform Gemini 3 Pro?
An unexpected finding in our benchmark was that GPT-3.5 outperformed GPT-4o / GPT-4 Turbo on this NEET-PG Medicine recall set. Several methodological and model-behavior factors could plausibly explain this. First, input modality differed: GPT-3.5 was evaluated through an OCR-to-text harness, whereas GPT-4o/4 Turbo were typically given the raw image. Prior work shows that LLM exam performance can differ materially between text-only and image-based items, and that models may do substantially better once imaging-dependent questions are excluded. This suggests a simple mechanism: OCR conversion may have produced a cleaner, standardized text prompt that reduced visual parsing burden and improved answer selection for GPT-3.5, while vision-capable models had to cope with layout artifacts, low resolution, or distractors in the image itself [20]. Second, format sensitivity may have played a role: more capable models sometimes provide hedged outputs (multiple options, conditional answers, or lengthy reasoning) that can increase "non-standard" responses when strict A--D parsing is enforced; in contrast, GPT-3.5 may guess more directly, which can help on MCQs when uncertainty is high. Third, LLM services can exhibit behavior drift over time (including instruction-following and formatting reliability), and this can affect comparative performance even when model names remain constant [21]. Importantly, the direction of our finding differs from several domain question-bank studies where GPT-4-class models typically outperform GPT-3.5 and show greater consistency; therefore, our result likely reflects a combination of evaluation harness effects (vision vs OCR, parsing rules, prompt constraints) and domain/item idiosyncrasies, rather than a general reversal of capability [22].
A similar logic may explain why Gemini 3 Flash exceeded Gemini 3 Pro here. Google's developer documentation positions Gemini 3 Flash as having "Pro-level intelligence" while being optimized for speed/efficiency, and the public product note frames Flash as a fast, high-quality model intended for broad deployment [23,24]. In a strict MCQ setting, a faster, more decisive model that produces cleaner A--D outputs may score better than a "Pro" variant that is more verbose, more cautious, or more likely to emit explanations that complicate deterministic parsing---especially when the evaluation emphasizes a single-letter final answer.
Open/"open-weight" models vs proprietary models (DeepSeek, Kimi, Llama in this question bank) and related literature
Across our NEET-PG question bank, open/alternative model families (DeepSeek, Kimi, Llama) showed meaningful but uneven competitiveness relative to proprietary systems. In our results, DeepSeek and Kimi often reached the upper-middle tier (approaching several proprietary models), whereas Llama variants were more variable and also showed prominent option-letter bias, suggesting differences in instruction following and/or default guessing behavior. This pattern is consistent with emerging medical benchmarking literature: several recent studies comparing proprietary vs open(-weight) models on specialty MCQs report that proprietary frontier models still tend to lead on average, but that certain open(-weight) systems can be competitive depending on the task, domain, and evaluation setup (including quantization, prompting style, and multimodal handling) [25,26]. Additionally, multiple medical-education evaluations of DeepSeek-family models have reported performance that can be close to commercial systems on some exam-style question sets, supporting the plausibility of our finding that DeepSeek/Kimi can perform strongly in recall-MCQ formats, albeit with domain- and setup-specific variability [27,28]. Taken together, our results suggest that "open vs proprietary" is not a binary outcome in NEET-PG--style MCQs: some open(-weight) models can be competitive, but proprietary leaders remain at the top, and model behaviors (stability, option bias, response formatting) can meaningfully influence measured accuracy.
Limitations and future work
Key limitations include: (1) reliance on recall papers and a finite, specialty-specific question set; (2) potential exposure of questions/answers on the open internet (training contamination risk); (3) OCR mediation for non-vision models (possible transcription noise); (4) multiple-choice format without evaluating explanation quality or citation faithfulness; and (5) only three repeats per item, which captures some variability but not the full range of stochastic behavior. Future studies should use locked/held-out question banks, include image-heavy items and free-response reasoning, evaluate calibration and error taxonomy, and test "assisted" modes (retrieval, tools, structured prompting) while keeping comparisons fair.
How to interpret these findings for medical use
Our results support a cautious but optimistic view. High-end LLMs can answer a large proportion of NEET-PG Medicine MCQs correctly, consistent with prior evidence that LLMs demonstrate strong performance on licensing and benchmark medical QA tasks [13--18]. However, even the best model was wrong about 10% of the time on this dataset, and response similarity indicates that models can share blind spots. For education, this supports use as a tutoring/feedback tool with guardrails (explanations, references, and human verification). For clinical deployment, MCQ accuracy is only a proxy for competence; real practice demands calibrated uncertainty, source-grounded reasoning, patient context integration, and safety behaviors (referral thresholds, "do not guess" policies).
Acknowledgments
We thank Dr. Abhishek Kumar (Product Head, Dynoble Medical; CEO, MedicoApps) for providing the NEET-PG recall questions.
CONCLUSION
Newer LLMs demonstrate high proficiency on NEET-PG Medicine items, but raw accuracy masks significant differences in reliability and response bias. High-performing models are generally more stable and calibrated, whereas others exhibit marked stochasticity and letter preferences. These findings emphasize that clinical and educational adoption of LLMs requires rigorous psychometric validation beyond simple pass rates.
REFERENCES
1. Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023;9:e45312. doi:10.2196/45312.
2. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepano C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198. doi:10.1371/journal.pdig.0000198.
3. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. 2023. doi:10.1038/s41586-023-06291-2. PMID:37438534.
4. Lucas B, et al. Use and evaluation of ChatGPT and other large language models in medical education: a systematic review. BMC Med Educ. 2024. PMID:38425229.
5. National Board of Examinations in Medical Sciences (NBEMS). Information Bulletin (NBEMS examinations list including NEET-PG as a single-window entrance examination). 2021.
6. Paul S, et al. Comparative Analysis of the Performance of ChatGPT-3.5, ChatGPT-4, and Bard AI on the NEET-PG Examination. Cureus. 2024. PMID:38425265.
7. Chen Y, et al. Evaluating large language model performance in cross-cultural medical licensing examinations. BMC Med Educ. 2024. PMID:39136023.
8. De Champlain AF. A primer on classical test theory and item response theory for assessments in medical education. Med Educ. 2010;44(1):109--117. doi:10.1111/j.1365-2923.2009.03425.x. PMID:20078762.
9. Bates D, Machler M, Bolker B, Walker S. Fitting Linear Mixed-Effects Models Using lme4. J Stat Softw. 2015;67(1):1--48. doi:10.18637/jss.v067.i01.
10. Lenth RV, Piaskowski J. emmeans: Estimated Marginal Means, aka Least-Squares Means. R package (CRAN). 2025.
11. Chalmers RP. mirt: A Multidimensional Item Response Theory Package for the R Environment. J Stat Softw. 2012;48(6):1--29. doi:10.18637/jss.v048.i06.
12. De Champlain AF. A primer on classical test theory and item response theory for assessment in medical education. Med Educ. 2010;44(1):109--117. doi:10.1111/j.1365-2923.2009.03425.x.
13. Gilson A, Safranek CW, Huang T, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023;9:e45312.
14. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198.
15. Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172--180.
16. Liu M, Wei J, Wen H, et al. Performance of ChatGPT across different versions in medical licensing examinations: systematic review. J Med Internet Res. 2024;26:e60807.
17. Chen R, Tan Y, Dong H, et al. Cross-cultural evaluation of large language model performance on medical licensing examinations. BMC Med Educ. 2024;24:830.
18. Sharma R, Jain R, Nagar J, et al. Comparative analysis of ChatGPT, Google Bard and other AI models on NEET-PG medicine questions. Cureus. 2024.
19. Wang X, Wei J, Schuurmans D, et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv. 2022:2203.11171.
20. Hayden N, et al. Performance of GPT-4 with Vision on text- and image-based ACR Diagnostic Radiology In-Training Examination questions. Radiology. 2024. doi:10.1148/radiol.240153.
21. Chen L, Zaharia M, Zou J. How is ChatGPT's behavior changing over time? arXiv [Preprint]. 2023 (revised 2023 Oct 31). arXiv:2307.09009. doi:10.48550/arXiv.2307.09009.
22. Demir S. Comparative evaluation of GPT-3.5 and GPT-4 in an ophthalmology question bank. Rom J Ophthalmol. 2024;68(4):367--371. doi:10.22336/rjo.2024.67.
23. Luo L, et al. A systematic comparison between GPT-4, GPT-4o and GPT-3.5. Sci Rep. 2025;15:7402. doi:10.1038/s41598-025-88251-3.
24. Google AI for Developers. Gemini 3 Developer Guide (Gemini API documentation) [Internet]. 2025 [cited 2025 Dec 25].
25. Google. Gemini 3 Flash: frontier intelligence built for speed [Internet]. 2025 Dec 17 [cited 2025 Dec 25].
26. Safavi-Naini SAA, Ali S, Shahab O, Shahhoseini Z, Savage T, Rafiee S, et al. Benchmarking proprietary and open-source language and vision-language models for gastroenterology clinical reasoning. npj Digit Med. 2025 Nov 27. doi:10.1038/s41746-025-02174-0.
27. Wu S, Koo M, Blum L, Black A, Kao L, Fei Z, et al. Benchmarking open-source large language models, GPT-4 and Claude 2 on multiple-choice questions in nephrology. NEJM AI. 2024;1(2):AIdbp2300092. doi:10.1056/AIdbp2300092.
28. Hu N, et al. Evaluation of AI models for radiology exam preparation: DeepSeek vs ChatGPT-3.5. 2025. doi:10.1080/10872981.2025.2589679.