MinT: Managed Infrastructure for Training and Serving Millions of LLMs

Source-first digest for monthly 2026_05 rank 25, rank_id p005.

Motivation / Background

MinT treats post-training as an infrastructure workload rather than as a sequence of isolated checkpoint conversions. The target setting is many trained LLM policies over a small number of expensive resident base-model deployments. Full fine-tuning makes every policy a full checkpoint; merge-based LoRA saves training memory but still tends to materialize a merged full checkpoint for inference. MinT instead makes the LoRA adapter revision the behavior-carrying object that crosses rollout, update, export, evaluation, serving, and rollback. This source claim is the starting point for the managed-policy evidence used in the support table.

The paper's contribution is best read as three separate scale claims. Scale Up keeps the adapter-revision path usable for dense and MoE frontier models, including distributed Megatron training and sparse-route handling. Scale Down removes full-checkpoint materialization from training-serving handoff and reuses idle time across concurrent GRPO policies. Scale Out separates durable policy addressability from CPU-cache, GPU-batch, and cold-load residency. Those distinctions matter because the paper does not claim one engine keeps 1,000,000 adapters active; it claims MinT can build, name, audit, and selectively serve revisions from a million-scale catalog through bounded working sets.

Claims And Evidence

Support scores are support-from-paper scores, not independent reproduction scores.

Claim id	Main claim	Support	Evidence anchors
C1	MinT's central abstraction is a managed adapter revision plus policy record, not a merged full checkpoint.	4	motivation, lifecycle mechanics, key equations and symbols
C2	Adapter-only handoff and concurrent multi-policy training materially reduce systems overhead while keeping the resident base allocation fixed.	5	scale-down evidence, handoff figure, utilization figure, scale-down table
C3	The same lifecycle runs across dense SFT/DPO/GRPO, MoE RL, GLM-5.1 Generative UI, and a 1.04T Kimi K2 countdown-task path.	4	scale-up evidence, dense curves, MoE curves, GLM curves, AutoResearch trace, scale-up table
C4	The million-scale serving claim is an artifact-backed addressability result with bounded selected-revision execution, not simultaneous residency.	5	serving evidence, serving bounds, limitations
C5	The paper identifies two concrete scale-out bottlenecks, cold activation interference and MoE tensor fanout, then gives measured controls for both.	4	rollout control, packed loader, serving evidence
C6	The strongest evidence is systems evidence for LoRA policy infrastructure; broad end-task quality claims remain benchmark- and recipe-specific.	3	scale-up table, limitations, reference coverage

Core Technical Idea

MinT separates a policy record from an adapter revision. The adapter revision is a fixed exported LoRA payload in serving layout. The policy record is service-owned state: compatible base version, LoRA rank, target modules, latest training checkpoint, rollout records, optimizer state, exported revisions, and cache/residency state. That distinction lets the system answer four questions independently: which base can run the behavior, which checkpoint resumes training, which revision should be evaluated or served, and where the adapter bytes currently live.
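
The split between the two objects can be sketched as a pair of data classes. This is a minimal illustration, not MinT's schema: every class name, field name, storage URI, and target-module name below is a hypothetical stand-in chosen to mirror the four questions listed above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class AdapterRevision:
    """Immutable exported LoRA payload in serving layout (hypothetical fields)."""
    revision_id: str
    base_version: str
    storage_uri: str  # where the adapter bytes live in shared storage


@dataclass
class PolicyRecord:
    """Mutable, service-owned state for one policy (hypothetical fields)."""
    name: str
    base_version: str                           # which base can run the behavior
    lora_rank: int
    target_modules: List[str]
    latest_checkpoint: Optional[str] = None     # which checkpoint resumes training
    serving_revision: Optional[str] = None      # which revision is evaluated or served
    revisions: Dict[str, AdapterRevision] = field(default_factory=dict)
    residency: Dict[str, str] = field(default_factory=dict)  # revision_id -> "gpu" | "cpu" | "cold"

    def export(self, rev: AdapterRevision) -> None:
        # A revision is only attachable to the base it was trained against.
        if rev.base_version != self.base_version:
            raise ValueError("revision was exported against a different base")
        self.revisions[rev.revision_id] = rev
        self.residency[rev.revision_id] = "cold"

    def rollback(self, revision_id: str) -> None:
        # Rollback re-points serving at an existing revision; no bytes are rewritten.
        if revision_id not in self.revisions:
            raise KeyError(revision_id)
        self.serving_revision = revision_id


record = PolicyRecord("finance-sft", "qwen3-4b", 32, ["q_proj", "v_proj"])
record.export(AdapterRevision("r1", "qwen3-4b", "store://adapters/r1"))
record.export(AdapterRevision("r2", "qwen3-4b", "store://adapters/r2"))
record.serving_revision = "r2"
record.rollback("r1")
```

Freezing the revision while leaving the record mutable is the design point of the sketch: rollback changes a pointer in the record and never touches the exported payload.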

The runtime has four linked mechanisms. First, the service plane creates pollable operation ids and commits outputs only after metadata is written, so retries do not select uncommitted attempt files. Second, training workers keep the base model resident and time-slice policy states. Third, export converts trainer-side LoRA tensors into PEFT/vLLM-compatible serving layout without copying the base checkpoint. Fourth, serving actors map a requested policy revision through GPU-batch, CPU-cache, and cold shared-storage states before generation.
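
The first mechanism, commit-after-metadata visibility, can be illustrated with a toy in-memory store. The class, method names, and file paths are assumptions made for this sketch; the digest does not describe MinT's actual service-plane API.

```python
from typing import Dict, Optional


class OperationStore:
    """Toy in-memory stand-in for a service plane (all names hypothetical)."""

    def __init__(self) -> None:
        self._attempts: Dict[str, Dict[int, str]] = {}  # op_id -> attempt -> output path
        self._committed: Dict[str, int] = {}            # op_id -> committed attempt

    def write_attempt(self, op_id: str, attempt: int, path: str) -> None:
        # Attempt files may exist on storage without being visible to readers.
        self._attempts.setdefault(op_id, {})[attempt] = path

    def commit(self, op_id: str, attempt: int) -> None:
        # The metadata write is the commit point; the first commit wins.
        if attempt not in self._attempts.get(op_id, {}):
            raise KeyError("cannot commit an attempt that wrote no output")
        self._committed.setdefault(op_id, attempt)

    def poll(self, op_id: str) -> Optional[str]:
        # Pollers only ever see the committed attempt's output.
        attempt = self._committed.get(op_id)
        return None if attempt is None else self._attempts[op_id][attempt]


store = OperationStore()
store.write_attempt("op-1", 0, "/tmp/op-1/attempt-0")  # simulated crash before commit
store.write_attempt("op-1", 1, "/tmp/op-1/attempt-1")  # retry
pending = store.poll("op-1")
store.commit("op-1", 1)
visible = store.poll("op-1")
```

Here `pending` is `None` even though two attempt files exist, and `visible` resolves to the retried attempt only after its metadata is committed.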

Source formula or symbol	Digest role	Meaning
\(W\) and \(L_i\)	Adapter-only handoff	\(W\) is the resident base; \(L_i\) is the trained LoRA adapter that MinT exports and serves instead of moving a full merged \(W'_i\).
\(r\) and \(L_r\)	Serving selection	A request resolves to exported revision \(r\); if \(L_r\) is in a GPU slot, vLLM decodes immediately, otherwise MinT promotes or cold-loads it.
\(A \rightarrow B\) policy switch	Time-sliced training	The base remains resident while MinT saves policy \(A\)'s adapter, optimizer moments, scheduler position, gradients, and rollout records before restoring policy \(B\)'s state.
\(10^6\) catalog	Scale-out addressability	The million count is a durable catalog/audit scale, not the number of adapters resident in one CPU cache or active in one GPU batch.
\(\lceil 2300/64 \rceil = 36\)	Fleet-sizing sketch	The appendix uses the 64-adapter same-batch window to size ideal placement for a 2300-distinct-adapter active-wave model.

The extracted equations.json contains no display equations for this paper, so Table 1 (above) records the key inline symbols and source-side formulas that define the system semantics. It does not claim a formal optimization objective.
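
The fleet-sizing row is a plain ceiling division, worked here as a small check. The function name is hypothetical, and reading the result as a count of same-batch windows under ideal placement is this sketch's interpretation of the row.

```python
def windows_needed(distinct_adapters: int, batch_window: int) -> int:
    """Ceiling division: fewest same-batch windows that cover every adapter."""
    if distinct_adapters < 0 or batch_window <= 0:
        raise ValueError("expected a non-negative count and a positive window")
    return -(-distinct_adapters // batch_window)


ideal = windows_needed(2300, 64)   # 36
slack = ideal * 64 - 2300          # 4 unused adapter slots in the last window
```

The figure is a lower bound: it assumes every window is filled perfectly, which weak request locality would not allow in practice.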

Method Details

The training side uses time-sliced multi-LoRA training. A resident trainer restores one policy's LoRA tensors and optimizer state, runs the update, checkpoints the result, then yields to another policy without reloading the base model. Workers advertise supported rank and target-module limits; smaller adapters can be padded/masked inside configured maximum slots. For Megatron MoE runs, dense LoRA tensors follow tensor-parallel shards, expert LoRA tensors are keyed by expert id, and export gathers or deduplicates tensors into the serving layout.
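
A minimal sketch of the time-slicing pattern follows, assuming a toy trainer where adapters are short lists of floats and the update is plain SGD. None of these names or the update rule come from the paper; the point is only that switching policies swaps per-policy state while the base load count stays at one.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PolicyState:
    """Per-policy state that moves while the base stays resident (toy values)."""
    adapter: List[float]
    optimizer_moments: List[float]
    scheduler_step: int = 0


@dataclass
class ResidentTrainer:
    base_loads: int = 1  # the base is loaded once and never reloaded below
    saved: Dict[str, PolicyState] = field(default_factory=dict)
    active: str = ""

    def switch_to(self, name: str, fresh: PolicyState) -> PolicyState:
        # Restore the saved state if the policy has run before, else adopt `fresh`.
        state = self.saved.setdefault(name, fresh)
        self.active = name
        return state

    def update(self, lr: float, grad: List[float]) -> None:
        # Placeholder update: plain SGD applied to the active adapter only.
        state = self.saved[self.active]
        state.adapter = [w - lr * g for w, g in zip(state.adapter, grad)]
        state.scheduler_step += 1


trainer = ResidentTrainer()
trainer.switch_to("A", PolicyState([0.0, 0.0], [0.0, 0.0]))
trainer.update(0.1, [1.0, -1.0])
trainer.switch_to("B", PolicyState([0.0, 0.0], [0.0, 0.0]))
trainer.update(0.1, [2.0, 2.0])
resumed = trainer.switch_to("A", PolicyState([9.0, 9.0], [0.0, 0.0]))
```

When policy A is resumed, the `fresh` argument is ignored and the state saved after A's first update is returned, which is the behavior a time-sliced worker needs.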

The serving side uses shared-base multi-LoRA sampling. The service maps a user-facing policy name to an exported adapter revision and sends that revision to a vLLM sampler that already holds the base model. If the adapter is active in the GPU batch, the request uses the hot path. If it is CPU-cached, the actor promotes it. If it is absent, MinT schedules a cold load from shared storage before generation. Cold loading is not hidden inside normal decoding latency; it is an explicit service stage with deduplication and backpressure.
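
The three residency outcomes can be sketched with two LRU tiers over cold storage. The class and its eviction policy are assumptions for illustration; the digest does not state which eviction policy MinT uses, and deduplication and backpressure are not modeled here.

```python
from collections import OrderedDict
from typing import List


class ServingActor:
    """Toy residency model: GPU batch slots over a CPU cache over cold storage."""

    def __init__(self, gpu_slots: int, cpu_slots: int) -> None:
        self.gpu: "OrderedDict[str, None]" = OrderedDict()
        self.cpu: "OrderedDict[str, None]" = OrderedDict()
        self.gpu_slots, self.cpu_slots = gpu_slots, cpu_slots
        self.cold_loads: List[str] = []  # cold loads are recorded, not hidden

    def _admit(self, tier: "OrderedDict[str, None]", limit: int, rev: str) -> None:
        tier[rev] = None
        tier.move_to_end(rev)
        while len(tier) > limit:
            tier.popitem(last=False)  # evict the least recently used entry

    def resolve(self, rev: str) -> str:
        if rev in self.gpu:
            self.gpu.move_to_end(rev)
            return "hot"
        if rev in self.cpu:
            path = "promote"
        else:
            self.cold_loads.append(rev)  # explicit service stage before generation
            self._admit(self.cpu, self.cpu_slots, rev)
            path = "cold_load"
        self._admit(self.gpu, self.gpu_slots, rev)
        return path


actor = ServingActor(gpu_slots=2, cpu_slots=3)
paths = [actor.resolve(r) for r in ["a", "b", "a", "c", "a", "b"]]
```

In this trace the final request for `b` is a promotion rather than a cold load: it was evicted from the two GPU slots when `c` arrived but stayed in the larger CPU cache.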

MinT also records sparse-model provenance. For Qwen3-style MoE, rollout records can store selected expert ids so training can score tokens along the route that generated them; missing or unmappable routes are masked from the replayed policy-gradient term. For GLM-5-style dynamic sparse attention, the system stores backend/model path and correction policy, then uses IcePop-style rollout correction to zero unsafe importance weights outside a trusted probability-ratio band. This is a mitigation, not full DSA route replay.
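
Both mitigations amount to zeroing per-token weights, which the sketch below shows in plain Python. The band limits `low=0.5` and `high=2.0` are placeholder assumptions, not values from the paper, and folding route masking and ratio masking into one function is a convenience of this sketch; the source applies them to different model families.

```python
import math
from typing import List


def corrected_weights(
    train_logprobs: List[float],
    rollout_logprobs: List[float],
    route_ok: List[bool],
    low: float = 0.5,   # assumed lower edge of the trusted ratio band
    high: float = 2.0,  # assumed upper edge of the trusted ratio band
) -> List[float]:
    """Per-token importance weights, zeroed where replay is not trustworthy."""
    if not (len(train_logprobs) == len(rollout_logprobs) == len(route_ok)):
        raise ValueError("all per-token inputs must have the same length")
    weights = []
    for lp_train, lp_rollout, ok in zip(train_logprobs, rollout_logprobs, route_ok):
        ratio = math.exp(lp_train - lp_rollout)  # trainer prob / rollout prob
        keep = ok and low <= ratio <= high
        weights.append(ratio if keep else 0.0)
    return weights


weights = corrected_weights(
    train_logprobs=[-1.0, -1.0, -0.2, -1.0],
    rollout_logprobs=[-1.0, -3.0, -0.3, -1.0],
    route_ok=[True, True, True, False],
)
```

The second token is dropped because its ratio (about 7.4) leaves the band, and the fourth is dropped because its route could not be mapped, even though its ratio is exactly 1.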

Experiments And Results

Scale Down: Adapter Handoff And Utilization

The scale-down experiments isolate systems overhead. Adapter handoff compares a MinT adapter load against merge-and-load paths. Concurrent training compares three GRPO policies run sequentially against the same three policies overlapped under one resident base allocation. The abstract reports handoff-step reductions of 18.3x for Qwen3-4B and 2.85x for Qwen3-30B; the table and figures below show the underlying file sizes, load/materialization times, cold first-sample times, wall-clock schedule times, and unchanged peak memory.

The handoff bars in Figure 1 show why moving adapter revisions is operationally different from moving full checkpoints.

**Figure 1. Adapter handoff breakdown.** The source caption says adapter handoff avoids merge-and-load stages that dominate the full-checkpoint path. The concrete values are summarized in Table 2.

The schedule timeline in Figure 2 shows the second part of the scale-down claim: concurrent GRPO policies fill idle resident-base windows rather than requiring another base copy.

**Figure 2. Concurrent utilization.** The source caption links the timeline, average GPU utilization, low-utilization samples, and peak memory. The wall-time rows in Table 2 are the digest's compact evidence view.

Experiment	Model	MinT / adapter path	Baseline path	Reported effect
Handoff file scale	Qwen3-4B	rank-32 adapter, 252 MiB, 0.036 s load/materialization	full model, 8.061 GB, 71.820 s materialization	adapter-only handoff avoids full checkpoint transfer; abstract reports 18.3x measured handoff-step reduction
Handoff file scale	Qwen3-30B	rank-16 adapter, 1.692 GB, 46.455 s load/materialization	full model, 61.084 GB, 402.245 s materialization	abstract reports 2.85x measured handoff-step reduction
Concurrent GRPO	Qwen3-4B	3 policies in 1736.1 s, peak 65.6 GiB	sequential 3081.2 s, peak 65.6 GiB	1345.1 s saved, 1.77x speedup, peak memory unchanged
Concurrent GRPO	Qwen3-30B	3 policies in 7008.4 s, peak 68.0 GiB	sequential 10130.0 s, peak 68.0 GiB	3121.6 s saved, 1.45x speedup, peak memory unchanged
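
The saved-time and speedup columns of the concurrent GRPO rows follow directly from the two wall-clock times. The helper below is a hypothetical convenience for reproducing that arithmetic.

```python
from typing import Tuple


def schedule_gain(sequential_s: float, concurrent_s: float) -> Tuple[float, float]:
    """Return (seconds saved, speedup), rounded as the table reports them."""
    if concurrent_s <= 0:
        raise ValueError("concurrent wall time must be positive")
    saved = round(sequential_s - concurrent_s, 1)
    speedup = round(sequential_s / concurrent_s, 2)
    return saved, speedup


gain_4b = schedule_gain(3081.2, 1736.1)     # Qwen3-4B row
gain_30b = schedule_gain(10130.0, 7008.4)   # Qwen3-30B row
```

Both rows reproduce: 1345.1 s saved at 1.77x for Qwen3-4B, and 3121.6 s saved at 1.45x for Qwen3-30B.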

Scale Up: Dense, MoE, UI, And AutoResearch Runs

The scale-up evidence is not one benchmark. It is a compatibility argument across training paradigms and model placements. Dense runs cover SFT, DPO, and GRPO under one adapter lifecycle. MoE runs add expert-route replay and distributed placement. GLM-5.1 A2UI adds a product-facing GRPO workload. AutoResearch adds proxy-to-full benchmark promotion logic around maintained adapter recipes.

Figure 3 shows dense-model traces for SFT, DPO, and GRPO, while Table 3 captures the endpoint evidence.

**Figure 3. Dense learning traces.** The source caption keeps each native metric separate: SFT loss, DPO reward margin, and GRPO train accuracy are not forced onto one score axis.

Figure 4 gives the sparse-model scale-up evidence, including Qwen3-30B/235B AIME24 and a Kimi K2 1.04T countdown-task curve.

**Figure 4. MoE RL curves.** The source text reports a 235B-A22B peak mean@1 of 0.967 on AIME24 and a Kimi K2 run with 1.04T total and 32.6B active parameters, both through the same rollout-update-export-evaluate loop.

Figure 5 covers the GLM-5.1 Generative UI GRPO case.

**Figure 5. GLM-5.1 A2UI curves.** Each step samples 32 prompts with 8 responses per prompt, or 256 responses. The text reports the 5-step reward mean rising from about 0.21 to 0.27 and hard failures falling to 5 of 256 responses at step 40.

Figure 6 shows the AutoResearch LawBench trace.

**Figure 6. AutoResearch trace.** The full-benchmark confirmation matters because a high proxy candidate can fail on the full task set; the digest records those values in Table 3.

Evidence group	Setup	Main source result	Interpretation
Dense SFT	Qwen3-4B finance suite	Fineval 0.4226 to 0.7811; FPB 0.6906 to 0.8804; TFNS 0.5959 to 0.9095	SFT updates move through the adapter lifecycle and produce held-out gains.
Dense DPO	chat pairs	reward margin -0.03 to 30.88	Preference optimization uses the same export/serving path.
Dense GRPO	Qwen3-8B DAPO-AIME24	train accuracy EMA 0.188 to 0.719; peak 0.841 at step 92; step-240 eval 20/30, 19/30, 18/30 on AIME24/25/26	Rollout-based RL can update and export the same adapter object.
MoE GRPO	Qwen3-30B/235B and Kimi K2	235B-A22B reaches 0.967 peak mean@1 on AIME24; Kimi K2 uses 1.04T total parameters with 32.6B active	Adapter/base split survives tensor/expert-parallel placement.
Route consistency	Qwen3 MoE	out-of-route scoring ratios: 0.0013% with R3 on 30B, 0.0097% without R3 on 30B, 0.0253% with R3 on 235B	Route metadata controls a specific MoE scoring mismatch channel.
GLM-5.1 A2UI	40 GRPO steps	reward mean about 0.21 to 0.27; hard failures down to 5/256 at step 40	The lifecycle supports a Generative UI workload, not only math.
AutoResearch LawBench	Qwen3-4B recipes	base 0.4628; v10 0.4889; discarded v11 0.4858 full despite high proxy; maintained v23 0.5079 full	Proxy screening is guarded by full confirmation before recipe promotion.
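
The AutoResearch row implies a two-stage gate: a candidate recipe must clear proxy screening and then beat the maintained recipe on the full benchmark. The rule below is an assumed reading of that row, not the paper's stated procedure. The full-benchmark scores come from Table 3, while the `passed_proxy` flags are assumptions, since the digest only says that v11 had a high proxy score.

```python
from typing import List, Tuple


def promote(maintained_full: float, candidate_full: float, passed_proxy: bool) -> bool:
    """Assumed gate: proxy screening first, then strict full-benchmark improvement."""
    return passed_proxy and candidate_full > maintained_full


# (recipe, full LawBench score from Table 3, assumed proxy-screen outcome)
candidates: List[Tuple[str, float, bool]] = [
    ("v10", 0.4889, True),
    ("v11", 0.4858, True),  # high proxy, but below v10 on the full benchmark
    ("v23", 0.5079, True),
]

maintained = 0.4628  # base model score
kept: List[str] = []
for recipe, full_score, passed_proxy in candidates:
    if promote(maintained, full_score, passed_proxy):
        maintained = full_score
        kept.append(recipe)
```

Under this rule v11 is discarded because 0.4858 does not exceed the 0.4889 already held by v10, which matches the outcome the table reports.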

Scale Out: Policy-Population Serving

The serving experiments split addressability from execution. A durable catalog can contain 1,000,000 adapter revisions, but one serving actor keeps only selected revisions in CPU cache and a smaller set in the GPU batch. The paper's strongest scale-out evidence is therefore a bundle: million-catalog build/audit, cache and same-batch bounds, warm selected-revision SLO, cold-load accounting, packed representation, and readiness control.

Tier	Evidence	Measured bound	Interpretation
Catalog artifact	packed Qwen3-30B rank-1 pool	1,000,000 / 1,000,000 built; 0 errors; 256 / 256 audit OK across 100 shards	Million scale is addressability, not one-engine residency.
CPU adapter cache	one 4-GPU serving actor	369 loaded adapters at a 512-adapter hotset; 550 loaded under 2048-adapter pressure	One actor can keep hundreds of adapters nearby, but weak locality raises p95.
GPU batch window	distinct-adapter probe	64 distinct adapters in the tested decoding batch	Batch execution is the smallest adapter-diversity window.
Warm selected-revision serving	64 warmed revisions, Poisson traffic	at 0.5, 1, and 2 rps, 100% of requests meet TTFT \(\le 5\) s; at 4 rps, SLO attainment drops to 72.1%	The warm serving knee is between 2 and 4 rps for this workload.
Cold-load path	warm/cold N=64 and N=16 probes	warm p95 21.35 s; cold-cache p95 199.81 s; N=16 staircase 1.375 to 23.267 s	Different missing adapters serialize through engine load work before decoding.

The mixed warm/cold control rows in Table 5 show why MinT distinguishes registered, prewarming, and ready states.

Policy	Existing warm traffic	New-adapter path	Interpretation
Admission off	post TTFT p95 24.03 s; \(>20s\) stalls: 10	e2e p95 59.18 s; user TTFT p95 22.19 s; load p95 47.37 s	Fast exposure, but cold first-touch disrupts warm tenants.
Admission on	post TTFT p95 9.71 s; \(>20s\) stalls: 0	e2e p95 314.79 s; user TTFT p95 10.68 s; load p95 294.96 s	Admission protects warm tenants, but new users wait behind activation.
Two-phase readiness	post TTFT p95 9.63 s; \(>20s\) stalls: 0	ready-path TTFT p95 4.60 s; load p95 0.00 s; prewarm span 409.04 s	First user requests arrive after activation, so they do not load adapters.
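
The two-phase readiness row can be read as a small state machine in which a revision becomes routable only after prewarming completes. The state names follow the registered, prewarming, and ready states named above; the class and method names are hypothetical.

```python
from enum import Enum
from typing import Dict


class State(Enum):
    REGISTERED = "registered"
    PREWARMING = "prewarming"
    READY = "ready"


class ReadinessGate:
    """Toy gate: user traffic is admitted only for revisions in READY."""

    def __init__(self) -> None:
        self._state: Dict[str, State] = {}

    def register(self, rev: str) -> None:
        self._state[rev] = State.REGISTERED

    def _advance(self, rev: str, expected: State, target: State) -> None:
        # Enforce the one-way order registered -> prewarming -> ready.
        if self._state.get(rev) is not expected:
            raise RuntimeError(f"{rev} is not in state {expected.value}")
        self._state[rev] = target

    def start_prewarm(self, rev: str) -> None:
        self._advance(rev, State.REGISTERED, State.PREWARMING)

    def finish_prewarm(self, rev: str) -> None:
        self._advance(rev, State.PREWARMING, State.READY)

    def routable(self, rev: str) -> bool:
        return self._state.get(rev) is State.READY


gate = ReadinessGate()
gate.register("r7")
before = gate.routable("r7")
gate.start_prewarm("r7")
during = gate.routable("r7")
gate.finish_prewarm("r7")
after = gate.routable("r7")
```

Because the load happens during the prewarming state, the first user request that reaches a ready revision finds it already resident, which is consistent with the 0.00 s load p95 reported for the ready path.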

The packed-loader rows in Table 6 show that MoE LoRA cold load is object-bound, not only byte-bound.

Metric	Original	Packed	Effect
File size	110.75 MB	105.58 MB	1.05x smaller
Tensor objects	37,248	672	55.4x fewer
Read tensors	0.3669 s	0.0067 s	54.8x faster
Build loader objects	0.7540 s	0.0256 s	29.5x faster
N=4 live load	1.363 s	0.156 s	8.7x faster
N=8 live load	1.361 s	0.159 s	8.6x faster
N=16 live load	1.388 s	0.164 s	8.5x faster
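
The gap between a 1.05x byte reduction and a 55.4x object reduction suggests that per-object overhead dominates. A minimal sketch of the general packing idea is shown below, assuming one contiguous float buffer plus an offset index. The digest does not describe MinT's packed format, so the layout and the tensor names are illustrative only.

```python
from array import array
from typing import Dict, List, Tuple

Index = Dict[str, Tuple[int, int]]  # tensor name -> (offset, length) in elements


def pack(tensors: Dict[str, List[float]]) -> Tuple[array, Index]:
    """Concatenate many small tensors into one buffer with an offset index."""
    buffer = array("f")
    index: Index = {}
    for name in sorted(tensors):
        values = tensors[name]
        index[name] = (len(buffer), len(values))
        buffer.extend(values)
    return buffer, index


def view(buffer: array, index: Index, name: str) -> List[float]:
    """Recover one tensor from the packed buffer by slicing."""
    offset, length = index[name]
    return buffer[offset:offset + length].tolist()


# Four hypothetical per-expert tensors become one object to read and register.
experts = {f"expert.{i}.lora_A": [float(i), i + 0.5] for i in range(4)}
buffer, index = pack(experts)
recovered = view(buffer, index, "expert.2.lora_A")
```

The byte count is nearly unchanged by packing, but the loader handles one buffer and one index instead of one object per expert tensor, which is the effect the tensor-object row measures.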

The evidence boundary is explicit. The 1M result is a built-and-audited catalog plus selected-revision serving measurements. It is not a claim that every revision is resident, active, or warm. Warm selected-revision serving is measured for a particular Qwen3-30B rank-1 LoRA workload with prompt length 1024 and max output length 64. End-task quality evidence comes from named cookbook tasks and benchmarks, while the paper's main novelty is infrastructure: policy records, adapter revisions, export, cache state, cold-load scheduling, and readiness.

Practical Takeaways

MinT is most relevant when an organization expects many LoRA policies over a smaller set of expensive base models. The practical win is not only smaller files; it is making adapter revision the stable service unit that can be resumed, exported, evaluated, rolled back, cached, prewarmed, and served without turning every variant into another full-model deployment.

For training systems, the useful pattern is to keep the base resident and make policy state the object that moves. That requires versioned policy records, committed operation visibility, serving-compatible export, and exact attribution of rollout/evaluation scores to exported revisions. For serving systems, the useful pattern is to treat cold adapter loading as service work. A request-name catalog, CPU cache, GPU batch, and readiness state are different resources with different limits.

The main caution is that scale-out claims need careful wording. MinT's million-scale evidence is durable addressability over a real packed catalog, while runtime capacity depends on selected-revision locality, cold-load rate, batch adapter diversity, CPU cache, and prewarm policy. The paper is valuable because it makes those boundaries visible instead of compressing them into a single "serves millions of adapters" slogan.

Reference Coverage

This digest explicitly links all anchors used above: motivation, lifecycle mechanics, key equations and symbols, scale-down evidence, handoff figure, utilization figure, scale-down table, scale-up evidence, dense curves, MoE curves, GLM curves, AutoResearch trace, scale-up table, serving evidence, serving bounds, rollout control, packed loader, limitations, and reference coverage.