arXiv 2026; avg 5.82; interest 7.30; 169 HF; tags: MLLM evaluation, video understanding, grounded evidence

This paper examines whether MLLMs genuinely ground personality judgments in behavior or rely on superficial first impressions. It introduces Grounded Personality Reasoning, the MM-OCEAN video dataset, and a multi-tier benchmark showing that many correct Big Five ratings are not supported by retrieved behavioral cues.

Source-first digest for monthly 2026_05 rank 21, rank_id p014.

Motivation / Background

Perception or Prejudice asks whether multimodal large language models can justify personality judgments with observable behavioral evidence, rather than merely matching the surface patterns that correlate with Big Five labels. The paper argues that rating-only apparent personality recognition is too weak for high-stakes human-facing use: a model can predict the right ordinal trait score while failing to retrieve the gaze, posture, voice, expression, or temporal cue that would make the judgment trustworthy.

The paper's core benchmark contribution is Grounded Personality Reasoning (GPR). GPR turns personality perception into a three-part chain: predict the Big Five rating, explain the rating through evidence-grounded reasoning, and answer structured cue-grounding questions that force localization of the supporting behavior. The paper then builds MM-OCEAN, a video benchmark with human-verified atomic observations, evidence-grounded trait analyses, and targeted MCQs. The main claim set and source support levels are summarized in Table 1.

Claims And Evidence

Support scores in Table 1 are internal source-support scores, not independent reproduction scores. A score of 5 means the claim is directly backed by source definitions, source-reported data, equations, figures, or appendix checks. A score of 4 means the paper gives clear evidence but the conclusion depends on benchmark design, judge reliability, or non-causal observational grouping.

| Claim id | Main claim | Support | Evidence anchors |
| --- | --- | --- | --- |
| C1 | Rating-only personality benchmarks can over-credit models that produce the right Big Five score for the wrong reason; GPR is designed to require rating, reasoning, and cue grounding together. | 5 | problem framing, task equations, failure metrics |
| C2 | MM-OCEAN is a source dataset for GPR: 1,104 short videos, about 13.5K human-verified observations, 5,520 trait analyses, and 5,320 retained cue-grounding MCQs. | 5 | dataset construction, dataset statistics, Figure 1, Figure 2 |
| C3 | The annotation pipeline deliberately separates observation from interpretation, then uses human verification, text-leakage filtering, and expert review to reduce shortcut questions. | 5 | pipeline stages, annotation protocol, Figure 3, Figure 6 |
| C4 | The central empirical finding is a field-wide Prejudice Gap: mean PR is 51.3%, mean HR is 10.4%, and the best HR reaches only 33.5%. | 5 | headline results, Figure 4, Figure 11, Figure 12 |
| C5 | The largest open-vs-closed gap is not rating or verbal explanation, but cue retrieval: the paper reports a -26.6 percentage point open frontier gap on T3. | 5 | category diagnostics, Table 5, Figure 4, Figure 16 |
| C6 | HR, RGM, PR, CR, and IR are more diagnostic than single-task scores because they expose different failure modes: confident raters, cautious reasoners, confabulation, and integration failure. | 5 | failure metrics, archetypes, Figure 5, Figure 13 |
| C7 | The Task 2 AI judge is useful but not definitive; the paper supports it with consistency and cross-judge checks while still listing it as a limitation. | 4 | judge protocol, Figure 7, Table 6 |
| C8 | Practical interpretation should stay conservative because MM-OCEAN uses short, mostly Western-context English videos, MCQ-based grounding, and apparent-personality labels from First Impressions V2. | 5 | limitations and ethics, Table 6 |

Core Technical Idea

The paper reframes personality assessment as evidence-grounded social cognition. The motivating failure is not merely low accuracy; it is a model that says "low extraversion" or "high agreeableness" without locating the behavioral evidence that warrants that judgment. This matters because apparent personality judgments are socially sensitive, and the paper explicitly ties trustworthiness to an evidence trail.

The formal task is built around a short video \(V = (V_{\text{vis}}, V_{\text{aud}}, V_{\text{txt}})\), the Big Five trait set \(\mathcal{T} = \{E, A, C, N, O\}\), and a five-point ordinal label set \(\mathcal{L} = \{1,2,3,4,5\}\). Table 2 and the equations below capture the source paper's key formal objects:

$$ \begin{aligned} \text{T1 (rating)}\quad & \hat{y}_i \in \mathcal{L},\quad \forall i \in \mathcal{T},\\ \text{T2 (reasoning)}\quad & (\hat{\mathcal{O}}, \hat{\mathcal{R}}) = f_\theta(V),\quad \hat{\mathcal{O}} = \{o_k\}_{k=1}^{K},\quad \hat{\mathcal{R}} = \{r_i \mid i \in \mathcal{T}\},\\ \text{T3 (grounding)}\quad & \hat{a}_q \in \{\texttt{A},\texttt{B},\texttt{C},\texttt{D},\texttt{E},\texttt{F}\},\quad \forall q \in \mathcal{Q}. \end{aligned} $$

Each observation \(o_k = (d_k, t_k^s, t_k^e, \text{desc}_k, b_k)\) records a perceptual dimension, timestamps, a description, and a body-part tag. Each trait reasoning chain \(r_i = (\ell_i, \mathcal{E}_i, \text{rat}_i)\) must cite at least one valid observation id. That grounding constraint is the key difference from ordinary apparent-personality regression.
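As a rough illustration of the grounding constraint just described, the sketch below models observations and trait reasoning chains as Python dataclasses and checks that every chain cites at least one valid observation id. The field names (`obs_id`, `evidence_ids`, and so on), the example dimension and body-part strings, and the `grounding_errors` helper are assumptions made for this digest; the paper's released schema may differ.

```python
from dataclasses import dataclass

TRAITS = ("E", "A", "C", "N", "O")  # Big Five trait set T
LEVELS = (1, 2, 3, 4, 5)            # five-point ordinal label set L


@dataclass(frozen=True)
class Observation:
    """One atomic observation o_k = (d_k, t_k^s, t_k^e, desc_k, b_k)."""
    obs_id: str       # hypothetical id that reasoning chains cite
    dimension: str    # perceptual dimension d_k
    t_start: float    # start timestamp in seconds
    t_end: float      # end timestamp in seconds
    description: str  # non-interpretive description
    body_part: str    # body-part tag b_k


@dataclass(frozen=True)
class TraitReasoning:
    """One reasoning chain r_i = (level, cited evidence, rationale)."""
    trait: str
    level: int
    evidence_ids: tuple
    rationale: str


def grounding_errors(observations, chains):
    """Return a list of constraint violations (empty list means valid)."""
    known = {o.obs_id for o in observations}
    errors = []
    for c in chains:
        if c.trait not in TRAITS:
            errors.append(f"{c.trait}: unknown trait")
        if c.level not in LEVELS:
            errors.append(f"{c.trait}: level {c.level} outside 1..5")
        # Grounding constraint: at least one cited id must exist.
        if not any(e in known for e in c.evidence_ids):
            errors.append(f"{c.trait}: cites no valid observation id")
    return errors


# Illustrative values only, loosely modeled on the digest's worked example.
obs = [Observation("o1", "gaze", 4.9, 8.7,
                   "gaze drifts down and left while speech continues", "eyes")]
grounded = TraitReasoning("E", 2, ("o1",), "sustained gaze aversion")
ungrounded = TraitReasoning("E", 2, ("o9",), "seems quiet")
print(grounding_errors(obs, [grounded, ungrounded]))
```

Under these assumptions, the first chain passes and the second is flagged because `o9` is not among the known observation ids.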

| Object | Source label | Digest interpretation |
| --- | --- | --- |
| \( \hat{y}_i \in \mathcal{L} \) | eq:t1 | T1 asks for one ordinal Big Five level per trait. |
| \( (\hat{\mathcal{O}}, \hat{\mathcal{R}})=f_\theta(V) \) | eq:t2 | T2 asks for open-ended observations and trait rationales. |
| \( \hat{a}_q \in \{\texttt{A},\dots,\texttt{F}\} \) | eq:t3 | T3 asks targeted MCQs over cue-grounding categories. |
| \( \operatorname{Acc}_{T1}, \operatorname{MAE}_{T1} \) | eq:t1_metrics | Rating quality is exact ordinal match plus distance from the human label. |
| \( S_{T2} \) and \( \overline{S}_{T2} \) | eq:t2_avg4 | Reasoning quality is judged on evidence coverage, coherence, grounding, and direction. |
| \( \operatorname{Acc}_{T3}^{(c)} \) | eq:t3_metrics | Grounding can be aggregated overall or by cognitive category. |
| \( \operatorname{RGM}(m) \) and \( \Delta_{Tk} \) | eq:rgm, eq:delta | RGM detects rating-vs-grounding rank mismatch; \(\Delta\) compares open and closed frontiers. |
| \( \text{PR}, \text{CR}, \text{IR}, \text{HR} \) | eq:PR, eq:HR | Failure-mode rates decompose whether rating, reasoning, and cue retrieval succeed together. |
| \( \sigma(m) \) | eq:sigma | Option-letter skew diagnoses positional bias in six-way MCQs. |

Table 2. Key equations. These equations come from equations.json and the source Markdown's task and evaluation sections.
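The T1 metrics row can be made concrete with a small sketch. It assumes per-video predictions and human labels arrive as dictionaries keyed by trait letter, and it computes exact-match accuracy and mean absolute error over the five traits of one video. The function name `t1_metrics` and the example ratings are invented for illustration, and how the paper averages across videos is not restated here.

```python
def t1_metrics(pred, gold):
    """Exact-match accuracy and mean absolute error over one video's traits.

    pred, gold: dicts mapping trait letter -> ordinal level in 1..5.
    """
    traits = sorted(gold)
    n = len(traits)
    hits = sum(pred[t] == gold[t] for t in traits)
    abs_err = sum(abs(pred[t] - gold[t]) for t in traits)
    return hits / n, abs_err / n


# Illustrative ratings, not taken from the paper.
gold = {"E": 2, "A": 4, "C": 3, "N": 3, "O": 4}
pred = {"E": 2, "A": 4, "C": 4, "N": 1, "O": 4}
acc, mae = t1_metrics(pred, gold)
print(acc, mae)
```

Here three of five traits match exactly, and the two misses are off by 1 and 2 levels, so both accuracy and MAE come out at 0.6 for this made-up video.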

Figure 1. MM-OCEAN high-level pipeline. The source figure links multimodal inputs, a multi-agent human-collaborative pipeline, text-only filtering, expert review, and the three evaluation tasks. See Figure 1 for the end-to-end benchmark flow.

Figure 2. Dataset structure. The source figure shows the benchmark scope, task structure, seven cue-grounding categories, and atomic-observation density. It complements Table 3 by showing that visual grounding includes both category design and localized observation geometry.

| Component | What the paper evaluates | Why it matters |
| --- | --- | --- |
| T1: ordinal personality rating | Big Five level prediction over \(\mathcal{L}\) | Keeps continuity with older APR benchmarks. |
| T2: open-ended rating reasoning | Evidence coverage, logical coherence, grounding accuracy, and directional accuracy | Tests whether the verbal explanation is behaviorally supported. |
| T3: structured cue grounding | Seven MCQ categories spanning reasoning and visual grounding | Forces retrieval of specific temporal, spatial, expression, and causal cues. |
| Cross-task rates | PR, CR, IR, HR, and RGM | Detects cases where a model gets one stage right but the full chain fails. |

Table 3. Task design digest. The main innovation is the chain, not any one task alone.

The benchmark construction pipeline has five stages, shown in Figure 3. An Observer drafts atomic non-interpretive cues; human annotators verify, correct, localize, or delete those cues; a Psychologist writes Big Five analyses with cited evidence; an Examiner creates seven MCQs per video; and an Aligner plus expert review filters invalid or shortcut-prone questions.

Figure 3. Five-stage annotation pipeline. The figure makes the separation between observation, interpretation, question generation, alignment, and human review explicit. This separation supports claims C2 and C3 in Table 1.

Method Details

MM-OCEAN draws from ChaLearn First Impressions V2, using about 15-second single-speaker clips with crowd-sourced Big Five scores and ASR transcripts. The released benchmark contains 1,104 videos, about 13.5K verified observations, 5,520 trait-level analyses, and 5,320 retained cue-grounding MCQs, averaging 4.8 MCQs per video after filtering. Continuous Big Five scores are discretized into the five ordinal levels used by T1.

The human verification layer is a substantive part of the method. The source reports 24 trained annotators, 45,609 Observer-drafted clues judged, 36,677 bounding boxes drawn, 78.2% accepted as correct, 14.6% corrected, 5.9% deleted, and 605 bonus clues added. The web interface in Figure 6 supports frame-level timestamping and bounding boxes for retained Expression and Action observations.

Figure 6. Annotation web tool. The interface has a frame-accurate video scrubber, observation list, edit controls, and bounding-box overlay. It is evidence for the paper's claim that annotators refine both temporal and spatial grounding.

The diagnostic layer is the most important technical move. The paper first converts each per-video task outcome into binary success indicators:

$$ r_k = \mathbb{1}[R_k \geq \theta_k],\quad R_1 = \frac{1}{|\mathcal{T}|}\sum_i \mathbb{1}[\hat{y}_i = y_i^\star],\quad R_2 = \frac{S_{T2}}{10},\quad R_3 = \frac{1}{|\mathcal{Q}_V|}\sum_q \mathbb{1}[\hat{a}_q = a_q^\star]. $$

The defaults are \(\theta_1=\theta_3=0.5\), meaning majority-correct for T1 and T3, and \(\theta_2=0.7\), meaning an acceptable T2 judge score (\(S_{T2} \geq 7\) out of 10). The paper then defines the main failure and success rates over these indicators:

$$ \begin{aligned} \text{PR}(m) &= \Pr[r_3=0 \mid r_1=1],& \text{CR}(m) &= \Pr[r_2=0 \mid r_1=1],\\ \text{IR}(m) &= \Pr[r_1=0 \mid r_3=1],& \text{HR}(m) &= \Pr[r_1=1 \wedge r_2=1 \wedge r_3=1]. \end{aligned} $$

Table 4 summarizes how these rates should be read in the digest.

| Metric | Meaning | Desired direction |
| --- | --- | --- |
| PR: Prejudice Rate | Correct rating but failed cue retrieval | Lower |
| CR: Confabulation Rate | Correct rating but failed reasoning quality | Lower |
| IR: Integration-failure Rate | Correct cue retrieval but wrong rating | Lower |
| HR: Holistic-Grounding Rate | Rating, reasoning, and grounding all succeed | Higher |
| RGM: Rating-Grounding Misalignment | Average T2/T3 rank minus T1 rank | Near zero or negative is better than strongly positive |

Table 4. Diagnostic metrics. These metrics turn benchmark evaluation into a failure taxonomy rather than a single leaderboard.
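The binary indicators and the four rates can be traced through a short sketch. It assumes each video has already been reduced to the three normalized scores \(R_1, R_2, R_3\) in [0, 1], applies the default thresholds, and then estimates PR, CR, IR, and HR as empirical frequencies. The helper names and the four example videos are hypothetical; only the thresholds and the conditional definitions follow the equations in this section.

```python
import math


def success_flags(r1, r2, r3, thetas=(0.5, 0.7, 0.5)):
    """Turn normalized task scores R_k into binary indicators r_k."""
    return tuple(int(score >= theta) for score, theta in zip((r1, r2, r3), thetas))


def _conditional_rate(flags, event, given):
    """Empirical Pr[event | given]; NaN when the conditioning set is empty."""
    pool = [f for f in flags if given(f)]
    if not pool:
        return math.nan
    return sum(1 for f in pool if event(f)) / len(pool)


def failure_rates(flags):
    """Compute PR, CR, IR, HR from a list of (r1, r2, r3) indicator tuples."""
    flags = list(flags)
    hr = sum(1 for f in flags if f == (1, 1, 1)) / len(flags) if flags else math.nan
    return {
        "PR": _conditional_rate(flags, lambda f: f[2] == 0, lambda f: f[0] == 1),
        "CR": _conditional_rate(flags, lambda f: f[1] == 0, lambda f: f[0] == 1),
        "IR": _conditional_rate(flags, lambda f: f[0] == 0, lambda f: f[2] == 1),
        "HR": hr,
    }


# Four made-up videos as (R1, R2, R3) scores for one model.
scores = [(0.6, 0.8, 0.75), (0.8, 0.9, 0.25), (0.6, 0.5, 0.0), (0.2, 0.8, 1.0)]
flags = [success_flags(*s) for s in scores]
rates = failure_rates(flags)
print(flags, rates)
```

In this toy set, three videos have a correct rating and two of those fail cue retrieval, so PR is 2/3; one of the three fails reasoning, so CR is 1/3; of the two videos with successful grounding, one has a wrong rating, so IR is 1/2; and only one of four videos clears the full chain, so HR is 1/4. Note that PR, CR, and IR are conditional rates while HR is a joint rate over all videos, which is why they do not sum to one.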

Task 2 uses GPT-4o-mini as an AI-as-Judge over four dimensions: Evidence Coverage, Logical Coherence, Grounding Accuracy, and Directional Accuracy. The judge sees the model's T2 output, the ground-truth trait level, and the human-verified observations, but not the video. The paper supports this judge with two checks: a "confidently wrong" split where T2 scores drop when T1 is wrong, and a 200-video cross-judge subset with Claude Haiku 4.5 and Gemini 2.5 Flash-Lite. The reported Spearman rank correlations are 0.94 and 0.92.
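For readers who want to reproduce a cross-judge check of this kind, the sketch below computes a Spearman rank correlation from scratch, using average ranks for ties and a Pearson correlation over the ranks. It is a generic implementation rather than the paper's evaluation code, the judge scores are invented, and the digest does not specify here exactly which quantities the paper ranks.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho as the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return math.nan  # undefined when one side is constant
    return cov / math.sqrt(vx * vy)


# Hypothetical T2 scores from a primary judge and a second judge.
judge_a = [7.5, 6.0, 8.2, 4.1, 5.5]
judge_b = [7.0, 6.4, 8.8, 6.1, 3.9]
print(spearman(judge_a, judge_b))
```

The two made-up judges disagree only on the order of the two lowest-scored items, which gives a correlation of 0.9 for this example.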

Figure 7. AI-as-Judge consistency check. The figure shows that the T2 judge gives lower scores when T1 is wrong, with a reported cross-model delta cluster. This supports but does not fully eliminate judge-risk concerns.

Experiments And Results

The evaluation covers 27 MLLMs across 12 families, including 13 proprietary and 14 open-source models. The key result is that rating accuracy alone substantially overstates competence. The paper reports mean PR of 51.3%, mean HR of 10.4%, and best HR of 33.5% from Gemini 3 Flash. In other words, more than half of correct ratings are not paired with correct cue retrieval, and even the strongest model only clears the full rating-reasoning-grounding chain on about one third of samples.

The paper's main quantitative claims are condensed in Table 5. Figure 4, Figure 11, and Figure 12 provide the visual diagnostics behind those claims.

| Result family | Source-reported number | Digest interpretation |
| --- | --- | --- |
| Mean Prejudice Rate | 51.3% | Most correct ratings lack correct cue retrieval. |
| Mean Holistic-Grounding Rate | 10.4% | Full-chain success is rare across the model field. |
| Best HR | 33.5% | Even the best model leaves most samples not fully grounded. |
| Closed frontier PR | about 14.5% | Frontier closed models reduce but do not remove ungrounded correct ratings. |
| Open frontier PR | about 47.0% | Open models retain a much larger cue-grounding deficit. |
| Open-vs-closed T1 gap | -5.6 percentage points | The rating gap has largely narrowed across ecosystems. |
| Open-vs-closed T2 gap | -3.6 percentage points | Verbal explanation quality is closer than grounding. |
| Open-vs-closed T3 gap | -26.6 percentage points | Cue retrieval is the main remaining ecosystem gap. |
| Hardest T3 categories | Spatial Localization 30.7%, Micro-expression Localization 34.6% | Fine-grained perceptual localization is the bottleneck. |
| Easiest T3 category | Temporal-Causal Reasoning 64.8% | Higher-level reasoning categories are easier than visual localization. |

Table 5. Main results digest. These values come from the source Markdown's benchmarking section and appendix summaries.

Figure 4. Per-category T3 radar. The radar is used in the paper to show that the largest ecosystem differences concentrate in visual-grounding categories. This supports the T3 gap row in Table 5.

Figure 5. Rating-grounding archetypes. The source figure separates confident raters from cautious reasoners. It is the figure-level evidence for the RGM-based interpretation in Table 4.

The per-category analysis shows that MLLMs are much better at verbal or temporal reasoning than at localizing fine visual evidence. The paper reports a Top-3 closed advantage of +19.5 percentage points on Spatial Localization and +21.8 points on Temporal-Spatial Joint, compared with +6 to +11 points across reasoning-cluster categories. Even Gemini 3.1 Pro reaches only 57% on Spatial Localization and 71% on Temporal-Spatial Joint.

Figure 8. Per-trait T1 difficulty. The source reports Neuroticism as the hardest Big Five trait, with 37.7% mean T1 accuracy and MAE 0.87. Figure 8 supports the claim that internal-state inference is harder than surface trait recognition.

Figure 9. Open-source size scaling. Scaling from <=8B to 9-32B helps T3 by about +17 percentage points, but scaling further to about 100B+ does not improve T3 in the reported grouping. Figure 9 is evidence for the practical takeaway that data and post-training quality matter more than raw size for cue grounding.

Figure 10. Reasoning-capable versus non-reasoning models. The paper reports reasoning-capable models leading by +18.3 percentage points on T3 and +11.5 on HR, but it correctly marks this as observational because the subsets differ by family, size, generation, and training data.

The paper uses RGM and HR to show that the failure modes are not uniform. "Confident raters" rank high on T1 but fall on T2/T3; "cautious reasoners" rate less accurately but retrieve and reason about evidence better. The source example is useful: Llama-4-Maverick-FP8 ranks 4 on T1 but 17/19 on T2/T3, while Gemini 2.5 Flash ranks 25 on T1 but 5 on HR. Figure 13 visualizes these rank shifts.
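The rank arithmetic behind the "confident rater" label can be sketched as follows. This assumes RGM is the mean of the T2 and T3 ranks minus the T1 rank, as Table 4 describes it, and it reads "17/19 on T2/T3" as rank 17 on T2 and rank 19 on T3. The paper's exact definition, including any normalization, may differ, so the function below is an interpretation rather than the source formula.

```python
def rgm(rank_t1, rank_t2, rank_t3):
    """Rating-Grounding Misalignment under the Table 4 reading.

    Ranks use 1 = best. A large positive value means the model ranks
    much better on rating (T1) than on reasoning and grounding (T2/T3).
    """
    return (rank_t2 + rank_t3) / 2 - rank_t1


# Ranks quoted above for Llama-4-Maverick-FP8: T1 = 4, T2 = 17, T3 = 19.
llama_rgm = rgm(4, 17, 19)
print(llama_rgm)
```

Under this reading the model sits 14 rank positions worse on reasoning and grounding than on rating, which is the strongly positive pattern Table 4 flags as undesirable. The digest quotes only T1 and HR ranks for Gemini 2.5 Flash, so the same calculation is not attempted for it.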

Figure 11. T1 accuracy versus Prejudice Rate. Only five of 27 models reach the source figure's trustworthy zone with high T1 and low PR. Figure 11 is direct visual evidence for a field-wide Prejudice Gap.

Figure 12. Failure-mode fingerprint. This heatmap puts PR, CR, IR, and HR side by side. It supports claim C6 in Table 1 by showing that different models fail in different parts of the chain.

Figure 13. T1 to HR rank reordering. The rank-slope plot shows which models drop when the benchmark switches from T1-only rating to holistic grounded reasoning. It is the visual counterpart to RGM.

The appendix adds several diagnostics that make the benchmark more useful than the headline leaderboard alone. The threshold sweep reports HR rank correlation from 0.925 to 1.000 across 27 threshold settings, except a harsher T2 threshold that makes the top ranks noisier. The paper also reports that T2 Evidence Coverage is the hardest judge dimension, with mean 5.14, while Grounding Accuracy and Logical Coherence are easier. A positional-bias statistic \(\sigma\) flags models that overuse MCQ option letters:

$$ \sigma(m) = \sqrt{ \frac{1}{6} \sum_{\ell \in \{\texttt{A},\dots,\texttt{F}\}} \left(p_\ell(m)-\bar{p}\right)^2 } \times 100,\quad \bar{p}=16.\overline{6}\%. $$
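A minimal sketch of the statistic above, assuming a model's T3 answers are available as a flat sequence of option letters and that \(p_\ell(m)\) is the share of answers falling on letter \(\ell\). The function name `option_skew` and the choice to ignore unparseable answers are assumptions of this sketch, not details from the paper.

```python
from collections import Counter
from math import sqrt

LETTERS = "ABCDEF"  # six-way MCQ option set


def option_skew(answers):
    """Positional-bias sigma in percentage points.

    Standard deviation of the six option-letter shares around the
    uniform share 1/6, multiplied by 100. Answers outside A..F are
    ignored (a choice made for this sketch).
    """
    counts = Counter(a for a in answers if a in LETTERS)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no valid option letters in answers")
    p_bar = 1 / len(LETTERS)
    squared = sum((counts[letter] / n - p_bar) ** 2 for letter in LETTERS)
    return sqrt(squared / len(LETTERS)) * 100


balanced = option_skew(list("ABCDEF") * 10)  # every letter used equally
always_a = option_skew(["A"] * 12)           # degenerate "always A" model
print(balanced, always_a)
```

A perfectly balanced answer distribution gives \(\sigma = 0\), while a model that always answers A reaches the maximum of \(100\sqrt{5}/6 \approx 37.3\).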

Figure 14. Generation-over-time effects. The source reports T3 improving more steeply across model generations than T1. For example, GPT-4o to GPT-5.5 gives a +34.5 point T3 jump in the reported comparison.

Figure 15. Positional bias versus T3 accuracy. The paper reports a negative correlation of about -0.68 between option-letter skew and T3 accuracy, and every model with \(\sigma>10\) falls in the bottom third on T3.

Figure 16. Inter-model rank correlation. Top models agree more on which videos are easy or hard for T1 than on which MCQs they answer in T3. This supports the paper's argument that T3 is the sharper discriminator.

The worked example makes the core failure concrete. On a video where the ground-truth trait is Low Extraversion, both GPT-4o and Gemini 3 Flash produce the right rating and plausible verbal explanations. But the cue-grounding MCQ asks about the behavioral window from 4.9 to 8.7 seconds, where the subject's gaze drifts down and left while speech continues. Gemini 3 Flash selects the low-extraversion answer tied to internally focused processing; GPT-4o selects the wrong option. GPT-4o's outcome is exactly the source definition of prejudice: correct rating, failed cue retrieval.

Practical Takeaways

The practical lesson is not that MLLMs are ready for personality scoring. It is the opposite: the benchmark shows why rating-only evaluations are unsafe as evidence of trustworthy social understanding. Table 6 collects the paper's main caveats.

| Caveat | Why it matters for use |
| --- | --- |
| Apparent personality only | The labels come from First Impressions V2 and should not be treated as objective psychometric ground truth. |
| Short single-speaker English clips | Cross-cultural, multilingual, multi-person, and long-context behavior are not covered. |
| Western-context source data | The dataset may inherit cultural and linguistic bias. |
| AI-as-Judge for T2 | The paper validates judge stability, but high-stakes use would need human or multi-judge review. |
| MCQ-based grounding | T3 measures retrieval over predefined cues; high PR may partly reflect MCQ design choices. |
| Misuse risks | Automated personality assessment can enable discriminatory screening, surveillance, or over-claimed validity. |

Table 6. Limitations and responsible-use caveats. These caveats come from the discussion, datasheet, and ethics sections.

For a researcher building future MLLMs, the paper points to a concrete improvement target: strengthen fine-grained visual and temporal cue grounding, especially spatial localization, temporal-spatial joint reasoning, micro-expression localization, and evidence coverage. For an evaluator, the takeaway is to report full-chain success, not just T1 rating. For an application developer, the takeaway is to avoid treating a plausible personality explanation as evidence unless the model can cite and localize the behavior that supports it.

Reference Coverage

Anchor coverage for evidence links: problem framing, task formulation, dataset pipeline, dataset statistics, annotation protocol, failure metrics, judge robustness, prejudice gap, category diagnostics, archetypes, appendix diagnostics, worked example, and limitations and ethics.

Anchor coverage for tables: claims, key equations, task design, diagnostic metrics, results summary, and limitations.

Anchor coverage for figures: Figure 1, Figure 2, Figure 3, Figure 4, Figure 5, Figure 6, Figure 7, Figure 8, Figure 9, Figure 10, Figure 11, Figure 12, Figure 13, Figure 14, Figure 15, and Figure 16.