arXiv20262026avg 3.25interest 6.500 HF VLM reasoninganomaly detectiongrounded explanations

This paper addresses weak performance of large language and multimodal models on time-series anomaly detection by building VisAnomBench, a benchmark with curated anomaly explanations from public datasets. Fine-tuning yields VisAnomReasoner, a parameter-efficient VLM that improves anomaly localization and outperforms baselines on VisAnomBench and a cross-benchmark TSB-AD-U evaluation.

Source-first digest for checked paper rank 49, rank_id p055.

Motivation / Background

Time-series anomaly detection is usually evaluated as a numeric labeling problem: identify abnormal points or intervals, then report precision, recall, or F1. The paper argues that this is too thin for operational use because practitioners often need to know why a region is abnormal, whether the predicted interval is temporally aligned, and whether the explanation is grounded in the visible trend rather than a generic alarm.

The paper's response is to treat anomaly detection as plot-grounded vision-language reasoning. A model receives a rendered univariate time-series plot and optional context, then outputs structured anomaly tags, interval indices, and a step-by-step explanation. Figure 1 shows the intended behavior: VisAnomReasoner localizes the anomalous interval and ties the decision to observable periodicity and amplitude deviations, while comparison methods overflag spikes or produce generic rationales.

Figure 1. Qualitative anomaly reasoning example.
Figure 1. Qualitative anomaly reasoning example. Original caption: Qualitative Examples of Anomaly Reasoning. VisAnomReasoner precisely localizes the anomalous interval with visually grounded, structured reasoning, whereas other methods exhibit coarser localization or produce numerous spurious intervals with less grounded explanations. I place it here because it makes the task definition concrete: the model must localize an interval and justify it from the plot.

Claims And Evidence

Claim id Main claim Support Evidence anchors
C1 Time-series anomaly detection can be reformulated as a plot-grounded vision-language task that jointly predicts anomaly intervals and natural-language reasoning. 4 task formulation, dataset construction, qualitative example
C2 VisAnomBench supplies explanation-augmented supervision at meaningful scale by augmenting public anomaly datasets with reward-selected VLM explanations. 4 dataset construction, dataset statistics, source composition, multi-model supervision, reward objective
C3 VisAnomReasoner outperforms 15 general VLM, small VLM, foundation-model, specialized-model, and classical-detector baselines on VisAnomBench. 5 VisAnomBench results, main benchmark table
C4 The 7B variant generalizes to the separate TSB-AD-U benchmark better than the reported baselines. 4 TSB-AD-U results, TSB-AD-U table
C5 Explanation-augmented supervision improves anomaly localization beyond interval-only fine-tuning. 4 SFT bar chart, reasoning ablation, explanation preference
C6 A small, parameter-efficient VLM can be competitive with much larger VLMs for this task. 4 model and LoRA setup, VisAnomBench results, TSB-AD-U results, efficiency note
C7 The generated explanations are more visually grounded and useful than baseline explanations. 3 qualitative example, additional qualitative example, explanation preference, limitations

Scores are support-from-paper scores, not independent reproduction scores. I cap the explanation-quality and generalization claims where the evidence relies on model-judged explanations, synthetic supervision, one univariate plotting setup, or a fixed subset of TSB-AD-U.

Core Technical Idea

The method keeps the raw time series outside the language context window by rendering it as an image. For a univariate series \(x=\{x_1,\ldots,x_T\}\), ground-truth anomalies are intervals \(\mathcal{A}^\star=\{(s_i^\star,e_i^\star)\}_{i=1}^m\). A VLM receives the plot \(I(x)\) and optional context \(C\), then returns predicted intervals \(A\) and a reasoning trace \(E\):

$$ (I(x), C)\xrightarrow{\;\mathcal{M}_\theta\;}(A,E). $$

The structured target contains <anomaly> tags for the anomaly decision, <index> tags for interval localization, and <think> tags for the step-by-step explanation. This makes the paper less about a new time-series detector architecture and more about an instruction-tuned chart-reasoning detector: anomaly detection is supervised as a visual reasoning response.

The benchmark construction is the other core component. Instead of asking humans to write rationales for thousands of time-series windows, the authors generate candidate outputs from four large VLMs, then select the candidate with the best reward. The reward combines interval overlap and explanation quality:

$$ \begin{aligned} \mathcal{R}(A,E) =\;& \lambda_{\text{ano}}\mathcal{S}_{\text{ano}}(A,\mathcal{A}^{\star}) + \lambda_{\text{vis}}\mathcal{S}_{\text{vis}}(E) \\ &+ \lambda_{\text{axi}}\mathcal{S}_{\text{axi}}(E) + \lambda_{\text{cla}}\mathcal{S}_{\text{cla}}(E). \end{aligned} $$

The anomaly score is a length-weighted, range-based F1 over interval overlap. The explanation-quality terms judge visual groundedness, axis awareness, and clarity. In the appendix, the weights are \(\lambda_{\text{ano}}=0.3\), \(\lambda_{\text{vis}}=0.3\), \(\lambda_{\text{axi}}=0.1\), and \(\lambda_{\text{cla}}=0.3\), with Qwen2.5-VL-72B used as the explanation-quality judge.

Method Details

Benchmark Construction

VisAnomBench draws windows from four public univariate anomaly-detection sources: KPI, GutenTAG, UCR-EGI, and UCR-TSAD. The preprocessing keeps windows with anomaly ratio at most 10 percent, centers the anomaly interval between 30 percent and 70 percent of the segment, and requires segment length of at least 200 points. Candidate examples are filtered when multiple VLMs fail to identify the anomaly or consistently disagree with the provided interval, which is meant to remove label-noise cases rather than visually difficult ones.

Table 1 gives the source composition of the benchmark; Table 2 shows that the selected supervision targets come from all four generator models rather than a single teacher.

Benchmark TS count Domains Avg length Anomaly ratio Explanation count Avg explanation words
KPI 160 1 1,777 4.3% 640 170
GutenTAG 810 10 5,000 2.2% 3,240 114
UCR-EGI 2,097 5 8,268 5.8% 8,388 112
UCR-TSAD 249 8 2,755 6.9% 996 112

Table 1. VisAnomBench source statistics. The table supports C2 by showing that the benchmark combines multiple public anomaly sources with explanation targets. The paper also reports 2,576 training series and 740 held-out test series after filtering.

Generator selected as supervision source Grok-4-Fast LLaMA-4-Maverick Gemma-3-27B-IT Qwen2.5-VL-32B
Count 605 1,007 549 415
Percent 23.5% 39.1% 21.3% 16.1%

Table 2. Composition of reward-selected supervision. LLaMA-4-Maverick contributes the largest share, but every generator contributes selected examples. This is relevant because the benchmark is synthetic-supervision-heavy.

Model And Training

VisAnomReasoner fine-tunes Qwen2.5-VL-3B and Qwen2.5-VL-7B into 3B and 7B anomaly-reasoning variants. Training is supervised next-token prediction over the selected structured outputs. The paper uses LoRA with 95,178,752 trainable parameters, reported as 1.13 percent of the backbone, and applies LoRA adapters to the vision encoder and projection modules including q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, and down_proj.

The practical implication is that the learned detector is still a small VLM prompted on a plot image; it is not a direct numerical sequence model. That is also the main limitation: the model can reason over what the plotted image makes visible, but it may miss or hallucinate details if the rendering hides the relevant deviation.

Metrics

The main VisAnomBench evaluation uses interval-level true positives, false positives, false negatives, precision, recall, F1, and an Overlap score. A predicted interval is a true positive if it overlaps a ground-truth interval; extra intervals that do not overlap any ground truth are false positives.

$$ \mathrm{Precision}=\frac{N_{\mathrm{TP}}}{N_{\mathrm{TP}}+N_{\mathrm{FP}}},\quad \mathrm{Recall}=\frac{N_{\mathrm{TP}}}{N_{\mathrm{TP}}+N_{\mathrm{FN}}},\quad \mathrm{F}_1=2\frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}. $$

The Overlap metric measures boundary quality and penalizes overextended predictions:

$$ \mathrm{Overlap}= \frac{\lvert\mathcal{T}_{\mathcal{A}^{\star}}\cap\mathcal{T}_A\rvert} {\max(\lvert\mathcal{T}_{\mathcal{A}^{\star}}\rvert,\lvert\mathcal{T}_A\rvert)}. $$

For TSB-AD-U, the paper reports standard point-wise precision/recall/F1 and affiliation metrics, where affiliation scores are more tolerant of small event-boundary offsets.

Experiments And Results

VisAnomBench

Table 3 is the main quantitative evidence for the paper. VisAnomReasoner 7B has the best precision, recall, and F1; VisAnomReasoner 3B has the best Overlap. The nearest baseline precision and F1 are IForest at 48.92 and 48.30, so the 7B model improves precision by 23.17 percentage points and F1 by 25.64 percentage points. The tradeoff is that this is a benchmark built from the same explanation-supervision pipeline the model is trained on, so it is strong in-domain evidence rather than independent deployment proof.

Method Category TP FP FN Precision Recall F1 Overlap
Grok-4-Fast (314B) Large VLM 242 837 517 22.43 31.88 26.33 3.26
LLaMA-4-Maverick (17B) Large VLM 442 1,128 317 28.15 58.23 37.96 12.92
Qwen2.5-VL-7B Small VLM 517 1,784 242 22.47 68.12 33.79 14.04
Idefics3-8B Small VLM 123 2,057 636 5.64 16.21 8.37 2.25
SmolVLM-7B Small VLM 49 1,416 710 3.34 6.46 4.41 1.29
LLaVA-7B Small VLM 9 523 750 1.69 1.19 1.39 0.23
TimesFM (200M) Foundation model 196 543 563 26.52 25.82 26.17 2.03
Chronos (120M) Foundation model 200 538 559 27.10 26.35 26.72 2.27
AnomLLM (GPT-4o) Specialized large model 149 862 610 14.74 19.63 16.84 4.04
LLM-TSAD (GPT-4o) Specialized large model 477 5,621 282 7.82 66.53 14.00 20.81
LLMAD (GPT-4o) Specialized large model 349 402 410 46.47 45.98 46.23 17.31
VLM4TS (GPT-4o) Specialized large model 435 642 324 40.39 57.31 47.39 17.94
Sub-PCA Classical detector 329 411 430 44.46 43.35 43.90 17.41
Matrix Profile Classical detector 290 450 469 39.19 38.21 38.69 8.33
IForest Classical detector 362 378 397 48.92 47.69 48.30 20.53
VisAnomReasoner (3B) Ours 564 240 195 70.15 74.30 72.17 27.07
VisAnomReasoner (7B) Ours 576 223 183 72.09 75.88 73.94 25.35

Table 3. Main VisAnomBench anomaly-detection results. Values are percentages except TP/FP/FN counts. This table was recovered from latex_flattened/main.flattened.tex because the corresponding paper.md table block is empty.

The failure mode for many baselines is overflagging. Qwen2.5-VL-7B detects 517 true intervals but produces 1,784 false positives; LLM-TSAD detects 477 true intervals but produces 5,621 false positives. This explains why recall can look respectable while precision and F1 collapse. The Overlap column also shows that detecting many points is not enough: LLM-TSAD's high recall gives only 20.81 Overlap because its intervals are too broad or too numerous.

TSB-AD-U Generalization

Table 4 tests the same model family on TSB-AD-U. VisAnomReasoner 7B is strongest on standard recall and F1 and all three affiliation metrics. Qwen2.5-VL-7B remains a strong baseline in standard precision and F1, but the 7B fine-tuned model improves over it by 9.57 points in standard precision, 12.06 in standard recall, and 13.39 in standard F1. This supports cross-benchmark transfer, though the paper notes all experiments use the fixed univariate TSB-AD-U subset.

Method Category Std Precision Std Recall Std F1 Aff Precision Aff Recall Aff F1
Grok-4-Fast (314B) Large VLM 27.95 15.53 16.78 32.43 33.24 32.20
LLaMA-4-Maverick (17B) Large VLM 63.35 46.49 48.89 76.07 79.18 76.75
Qwen2.5-VL-7B Small VLM 66.18 48.85 49.52 75.90 74.93 74.36
Idefics3-8B Small VLM 45.12 35.11 34.64 50.83 59.02 53.03
SmolVLM-7B Small VLM 35.07 12.48 15.80 37.57 35.84 35.75
LLaVA-7B Small VLM 14.06 0.72 1.35 14.06 8.16 10.15
TimesFM (200M) Foundation model 32.32 43.21 32.19 71.09 57.23 60.08
Chronos (120M) Foundation model 47.18 55.37 47.23 74.36 75.61 73.31
AnomLLM (GPT-4o) Specialized large model 12.97 10.54 11.63 13.05 7.08 8.42
LLM-TSAD (GPT-4o) Specialized large model 31.74 28.40 29.98 66.69 52.76 53.69
LLMAD (GPT-4o) Specialized large model 32.96 37.54 22.91 67.50 73.37 64.16
VLM4TS (GPT-4o) Specialized large model 29.78 40.10 19.84 72.52 51.38 54.06
Sub-PCA Classical detector 11.80 16.82 12.28 48.22 31.71 34.99
Matrix Profile Classical detector 9.40 15.84 9.91 44.90 41.44 41.90
IForest Classical detector 12.44 17.89 13.05 50.44 41.51 42.67
VisAnomReasoner (3B) Ours 77.93 40.46 49.59 95.05 71.69 78.36
VisAnomReasoner (7B) Ours 75.75 60.91 62.91 95.27 81.51 85.58

Table 4. TSB-AD-U benchmark results. Values are percentages. The 3B model has the highest standard precision, while the 7B model has the strongest overall standard F1 and affiliation scores.

Ablations And Explanation Quality

Figure 2 compares base Qwen2.5-VL models against the supervised fine-tuned variants. The visual summary is consistent with the numeric tables: fine-tuning mostly fixes false positives, so precision, F1, and Overlap jump much more than recall.

Figure 2. SFT improvements over base Qwen2.5-VL.
Figure 2. Supervised fine-tuning gains. Original caption: Comparison of base Qwen2.5-VL models and their supervised fine-tuned variants (VisAnomReasoner). Arrows denote relative improvements due to supervised fine-tuning. The largest gains are precision, F1, and Overlap, which matches the paper's claim that supervised explanations help reduce overflagging.
Mode TP FP FN Precision Recall F1
Base 517 1,784 242 22.47 68.12 33.79
No reasoning 516 393 243 56.76 67.98 61.87
With reasoning 576 223 183 72.09 75.88 73.94

Table 5. Effect of reasoning supervision. Interval-only fine-tuning reduces false positives but does not recover more true anomalies. Adding reasoning traces improves both TP and FP, raising F1 from 61.87 to 73.94.

The explanation-quality result is less definitive but still informative: GPT-4o chooses the VisAnomReasoner 7B explanation over the base Qwen2.5-VL-7B explanation in 69.6 percent of 740 paired comparisons, chooses the base model in 29.7 percent, and ties in 0.7 percent. The paper also says the authors manually inspected a random subset of selected and rejected candidates during benchmark construction, but it does not report a full human evaluation.

Training data TP FP FN Precision Recall F1
SFT with single-model data 535 357 224 59.98 70.49 64.81
SFT with VisAnomBench 576 223 183 72.09 75.88 73.94

Table 6. Effect of multi-model supervision. The full reward-selected VisAnomBench supervision beats training on a single generator's data. This supports the use of multiple generators, but it is not a complete bias audit because only one single-model setting is reported.

Qualitative And Deep TSAD Checks

Figure 3 gives a second qualitative case. It is useful because it shows a different failure mode: LLMAD predicts 147 intervals, LLaMA-4-Maverick gives a plausible but early interval, and VisAnomReasoner keeps the interval closer to the ground truth.

Figure 3. Additional qualitative anomaly-reasoning examples.
Figure 3. Additional qualitative examples. Original caption: Additional Qualitative Examples of Anomaly Detection and Reasoning Across Models. VisAnomReasoner precisely localizes the anomalous interval with visually grounded, structured reasoning; LLaMA 4 Maverick correctly identifies the anomaly but has lower localization accuracy, while the other methods continue to overflag anomaly intervals.

Table 7 checks that the result is not only against VLM and classical detector baselines. The reported deep-learning TSAD baselines remain far below VisAnomReasoner on interval-level F1.

Model TP FP FN Precision Recall F1
CS-LSTM 175 556 584 23.94 23.06 23.49
Anomaly-Transformer 86 551 673 13.50 11.33 12.32
Image-Embedding-CAE 211 466 471 31.17 30.94 31.05
AE 297 439 459 40.35 39.29 39.81
CNN 237 481 522 33.01 31.23 32.09
LSTM 209 525 550 28.47 27.54 28.00
VisAnomReasoner (7B) 576 223 183 72.09 75.88 73.94

Table 7. Comparison with deep-learning TSAD baselines. These baselines are not vision-language methods, but their poor interval-level scores strengthen the paper's argument that standard sequence models do not directly solve this benchmark.

Efficiency

The paper reports that VisAnomReasoner uses LoRA rather than full-model training, with 95,178,752 trainable parameters or 1.13 percent of the backbone. For inference, it reports 16.5 plus/minus 2.3 seconds per input series and says runtime is largely insensitive to sequence length because the model reads a rendered plot rather than raw tokens. VLM4TS is reported as much slower, up to 452.6 seconds per series, while GPT-4o prompting methods average 3.8 seconds but have high variance and cannot process sequences longer than 14K points.

Practical Takeaways