PredictLM v1: 0.751 cls / 0.609 reg on OpenML via test-time training

Today we're releasing PredictLM v1: two open-weight tabular foundation models (13M Mini, 26M Base, both Apache-2.0) plus a test-time training (TTT) inference recipe. On a locked 25-dataset OpenML benchmark, the recommended Duo + TTT recipe hits 0.751 mean classification accuracy / 0.609 mean regression R², a +7.8 pp / +7.3 pp lift over Mini-v1 used zero-tuning (+6.6 pp / +2.0 pp over Base used zero-tuning). The recipe runs on a laptop.

This post explains how we got there. It is mostly a negative-results write-up.

The numbers

Locked 25-dataset eval (20 datasets loaded successfully: 10 cls + 10 reg; seed=42; 1500-row cap per dataset; contamination-audited to be disjoint from training data):

Recipe	cls mean acc	reg mean R²
Mini-v1 alone (zero-tuning)	0.673	0.536
Base alone (zero-tuning)	0.685	0.589
Mini-v1 + TTT	0.742	0.595
Base + TTT	0.748	0.608
Duo + TTT (w = 0.40)	0.751	0.609
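
The headline lifts can be re-derived from the table above. The short check below is only arithmetic on the published means; the dictionary keys and the helper name lift_pp are illustrative and are not part of the predictlm package.

```python
# Mean scores copied from the results table: (cls accuracy, reg R²).
results = {
    "Mini-v1 alone": (0.673, 0.536),
    "Base alone": (0.685, 0.589),
    "Mini-v1 + TTT": (0.742, 0.595),
    "Base + TTT": (0.748, 0.608),
    "Duo + TTT": (0.751, 0.609),
}


def lift_pp(recipe, baseline):
    """Return (cls, reg) differences in percentage points, rounded to 0.1."""
    return tuple(
        round((a - b) * 100, 1)
        for a, b in zip(results[recipe], results[baseline])
    )


print(lift_pp("Duo + TTT", "Mini-v1 alone"))      # (7.8, 7.3)
print(lift_pp("Duo + TTT", "Base alone"))         # (6.6, 2.0)
print(lift_pp("Mini-v1 + TTT", "Mini-v1 alone"))  # (6.9, 5.9)
```

The quoted +7.8 pp / +7.3 pp is therefore measured against Mini-v1 without tuning; against Base without tuning the same recipe gains less, especially on regression.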

The Duo+TTT recipe does test-time fine-tuning on the user's in-context examples (15 inner Adam steps, lr 1e-4) for both Mini and Base, then averages their softmax predictions with weight 0.40 on Mini. Implementation is in the predictlm Python package — one call:

from predictlm import PredictLM, duo_ttt_predict

# X_train, y_train, X_test are your own table: in-context rows, their labels, and the rows to predict.
mini = PredictLM.from_pretrained("zerooneresearch/predictlm-mini-13m")
base = PredictLM.from_pretrained("zerooneresearch/predictlm-base-26m")
preds = duo_ttt_predict(mini, base, X_train, y_train, X_test, w=0.40)  # w is the weight on Mini

Mini-v1 alone with TTT (no Base needed) gets 0.742 / 0.595, within 0.009 cls and 0.014 reg of the full recipe.
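
For intuition, the Duo blending step amounts to a weighted average of the two models' class probabilities. The sketch below assumes each model returns one probability row per test example; duo_average and the toy probabilities are hypothetical and do not show the package's internals.

```python
def duo_average(p_mini, p_base, w=0.40):
    """Blend per-row class probabilities: weight w on Mini, 1 - w on Base."""
    return [
        [w * a + (1 - w) * b for a, b in zip(row_mini, row_base)]
        for row_mini, row_base in zip(p_mini, p_base)
    ]


# Toy probabilities for two test rows and two classes (made up for the example).
p_mini = [[0.70, 0.30], [0.40, 0.60]]
p_base = [[0.40, 0.60], [0.20, 0.80]]

blended = duo_average(p_mini, p_base)
labels = [row.index(max(row)) for row in blended]  # argmax per row
print(labels)  # [0, 1]
```

Because both inputs are valid probability rows and the weights sum to 1, each blended row still sums to 1.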

Eleven architecture experiments that didn't matter

We spent the two weeks before release trying to lift the underlying Mini-v1 model itself. We ran eleven architecture / training experiments. They are documented honestly below.

#	Experiment	cls	reg	Outcome
1	Dual-teacher distillation (Base + TabICLv2)	−3 to −5 pp	n/a	Failed: flat α diluted reg signal
2	ISAB datapoint attention	≈ baseline	≈ baseline	Smoke neutral; folded into #4
3	DiffFormer attention	≈ baseline	≈ baseline	Smoke win → NaN at step 31K
4	ISAB+DIFF 57M from teacher	−9 to −27 pp	−9 to −27 pp	Bigger student capped by 26M teacher
5	MPS throughput mitigations	n/a	n/a	Throughput regressed; reverted
6	V8+DIFF 39M no-teacher	-5pp	-5pp	Removing distillation hurt transfer
7	OPCD-EMA (on-policy context distillation + EMA self-anchor)	0.658	0.539	Statistical tie with baseline
8	Mini-v2 Muon optimizer (broken setup, `n_heads=4`)	0.653	0.458	Mismatched arch with teacher
9	Mini-v3 Muon (controlled, `n_heads=8`, warm-start)	0.680 peak	0.526	+0.007 cls at peak; overfit by 31K
10	Mini-Real (Real-TabPFN continued pretraining + L2-SP)	0.673	0.536	L2-SP λ=1e-3 froze the model
11	TabM BatchEnsemble FFN (k=4 rank-1 adapters)	0.669	0.529	OOM'd at step 9K; slots hadn't differentiated

The common pattern: changing the model class couldn't beat the original distillation baseline by more than +0.007. We were trying to lift a curve that was already at its asymptote for this scale / data regime.

After we exhausted architecture experiments, we tried inference-time tricks: feature-permutation TTA, context-bagging, isotonic calibration, a Duo ensemble of Mini + Base. The Duo gave +0.014 cls. The other tricks gave +0.002 to +0.005 each. We locked a "ship floor" at Duo + TTA = 0.689 cls / 0.576 reg and moved to launch prep.

Then, two days before release, we tried TTT.

What TTT actually does

Test-time training is a published recipe (Gozeten et al. 2025; productized in TabPFN-2.5). For each evaluation task — that is, for each user query — fine-tune the model for a handful of inner steps on the user's own in-context examples, then predict. The model specializes to that task before predicting, then resets.

Concretely:

1. Split the user's training table (the rows and labels they pass to .fit) into a tuning-train split (80%) and a tuning-val split (20%).
2. Run forward(tuning-train as context, tuning-val as query) and compute the loss against the tuning-val labels.
3. Backpropagate and take one Adam step with lr 1e-4.
4. Repeat steps 2 and 3 for 15 inner steps in total.
5. Predict on the actual test set using the full training table as context.
6. Restore the original weights for the next user query.
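
A minimal, dependency-free sketch of the loop above. Everything here is illustrative: ToyICLModel is a hypothetical kernel-style stand-in for the real network, gradients come from finite differences rather than backprop, and ttt_predict is not the predictlm implementation. Only the 80/20 split, the 15 Adam steps at lr 1e-4, the full-table prediction, and the weight reset mirror the recipe as described.

```python
import math
import random


class ToyICLModel:
    """Hypothetical in-context model: kernel-weighted mean of context targets.

    The only trainable weights are per-feature scales.
    """

    def __init__(self, n_features):
        self.scales = [1.0] * n_features

    def predict(self, ctx_X, ctx_y, query_X):
        preds = []
        for q in query_X:
            weights = [
                math.exp(-sum((s * (a - b)) ** 2 for s, a, b in zip(self.scales, q, row)))
                for row in ctx_X
            ]
            total = sum(weights) or 1e-12
            preds.append(sum(w * t for w, t in zip(weights, ctx_y)) / total)
        return preds


def mse(model, ctx_X, ctx_y, query_X, query_y):
    preds = model.predict(ctx_X, ctx_y, query_X)
    return sum((p - t) ** 2 for p, t in zip(preds, query_y)) / len(query_y)


def numeric_grad(model, loss_fn, h=1e-5):
    """Central finite differences; a real model would use backprop instead."""
    grads = []
    for i in range(len(model.scales)):
        keep = model.scales[i]
        model.scales[i] = keep + h
        up = loss_fn()
        model.scales[i] = keep - h
        down = loss_fn()
        model.scales[i] = keep
        grads.append((up - down) / (2 * h))
    return grads


def ttt_predict(model, X, y, X_test, steps=15, lr=1e-4, seed=42):
    original = list(model.scales)              # snapshot for the reset
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(0.8 * len(idx))                  # 80% tuning-train, 20% tuning-val
    tr, va = idx[:cut], idx[cut:]
    X_tr, y_tr = [X[i] for i in tr], [y[i] for i in tr]
    X_va, y_va = [X[i] for i in va], [y[i] for i in va]

    def loss_fn():
        return mse(model, X_tr, y_tr, X_va, y_va)

    m = [0.0] * len(original)                  # Adam first-moment estimates
    v = [0.0] * len(original)                  # Adam second-moment estimates
    b1, b2, eps = 0.9, 0.999, 1e-8
    try:
        for t in range(1, steps + 1):
            grads = numeric_grad(model, loss_fn)
            for i, g in enumerate(grads):      # one Adam step
                m[i] = b1 * m[i] + (1 - b1) * g
                v[i] = b2 * v[i] + (1 - b2) * g * g
                m_hat = m[i] / (1 - b1 ** t)
                v_hat = v[i] / (1 - b2 ** t)
                model.scales[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
        return model.predict(X, y, X_test)     # full training table as context
    finally:
        model.scales[:] = original             # restore before the next query


# Toy regression task: only the first feature matters.
rng = random.Random(0)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(40)]
y = [2.0 * a for a, _ in X]
X_test = [[0.5, 0.0], [-0.5, 0.3]]

model = ToyICLModel(n_features=2)
preds = ttt_predict(model, X, y, X_test)
```

The try/finally around the inner loop is one way to guarantee the reset even if a step fails; with a real network the same idea would typically be a saved copy of the weights that is loaded back after predicting.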

The model never changes between users. Per-task specialization happens at inference time on each user's data. The cost is ~15× the forward pass per prediction (~1–60 seconds depending on dataset size). The benefit, measured on our 25-dataset benchmark, is +0.069 mean classification accuracy on Mini-v1 alone.

Per-dataset breakdown

Of the 20 successfully loaded eval datasets, 19 improved with TTT and 1 was essentially flat (−0.006). Top winners:

Dataset	Type	Baseline	+ TTT	Δ
sgemm_gpu_kernel_performance	reg	0.681	0.839	+0.159
elevators	reg	0.712	0.859	+0.147
mfeat-morphological	cls	0.633	0.762	+0.129
first-order-theorem-proving	cls	0.327	0.456	+0.129
optdigits	cls	0.809	0.936	+0.127

TTT helped most on tasks with richer features (median n_feat among winners: 47, vs 8 among the 5 neutral / tiny-improvement datasets) and on tasks below the accuracy ceiling. It's near-neutral on near-perfect tasks (solar_flare at 0.847, energy_efficiency at 0.988) — there is nothing left for it to fix.

Why the leverage was in the wrong place

Looking back, every one of our eleven architecture experiments was trying to make the generic prior better. We were lifting the curve everywhere by at most 0.007 cls and calling it a result. TTT instead specializes the prior per task: the shipped weights stay the same between queries, and they are tuned temporarily on each user's data right before predicting. The model that adapts wins.

This is a known result in the in-context-learning literature. "Specialization after Generalization" argues that TTT helps most when the base model is underparameterized, which is exactly our 13M to 26M regime. We just didn't try it until we had run out of other ideas.

If we were starting over, we would try TTT in week one, not two days before release.

Data-pipeline bugs we found and fixed

A side audit of the training data pipeline found six bugs that all 11 architecture experiments had been training against. Most were in standardization or sampling logic and silently capped the loss floor for every training run. The four most important:

Synthetic-task X and y standardization used joint context+query statistics instead of context-only. The model was being trained on artificially-easy tasks where the context and query distributions exactly matched. Affected 85% of training batches. Fixed.
Real-corpus classification tasks reported n_classes as the unique count in a sub-sample, but kept the original sparse class indices. When the model later saw class label 5 with n_classes=3, the loss computation either hit an out-of-bounds index error or trained against malformed targets. Fixed.
Real-corpus small-dataset slicing used the requested n_context instead of the actual size, leaking query rows into the X-standardization stats. Fixed.
Real-corpus classification sampling was uniform-random, not stratified. Rare classes could be absent from context but present in query → impossible ICL on those rows. Fixed.
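
The fixes for the first, second, and fourth bugs can be sketched in a few lines. This is an assumption-laden illustration, not the actual pipeline code: the function names, the list-of-lists data layout, and the 80% context fraction are all made up for the example.

```python
import math
import random
from collections import defaultdict


def standardize_with_context_stats(ctx, qry):
    """Z-score both splits using mean/std computed on the context rows only."""
    n_feat = len(ctx[0])
    means = [sum(r[j] for r in ctx) / len(ctx) for j in range(n_feat)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in ctx) / len(ctx)) or 1.0
        for j in range(n_feat)
    ]

    def scale(rows):
        return [[(r[j] - means[j]) / stds[j] for j in range(n_feat)] for r in rows]

    return scale(ctx), scale(qry)


def remap_labels(labels):
    """Map sparse class ids (e.g. 0, 2, 5) to contiguous 0..n_classes-1."""
    classes = sorted(set(labels))
    lookup = {c: i for i, c in enumerate(classes)}
    return [lookup[c] for c in labels], len(classes)


def stratified_split(labels, context_frac=0.8, seed=42):
    """Split indices class by class so every class has at least one context row."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, c in enumerate(labels):
        by_class[c].append(i)
    ctx, qry = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        k = max(1, round(context_frac * len(idxs)))
        ctx += idxs[:k]
        qry += idxs[k:]
    return ctx, qry


# Context-only statistics: the query row is scaled with the context mean and std.
ctx_scaled, qry_scaled = standardize_with_context_stats([[1.0], [3.0]], [[5.0]])

# Sparse ids {0, 2, 5} become dense ids {0, 1, 2}, so n_classes matches the targets.
dense, n_classes = remap_labels([0, 0, 0, 0, 5, 5, 2, 0, 0, 5])

# Every class that appears in the query split also appears in the context split.
ctx_idx, qry_idx = stratified_split(dense)
```

Computing the statistics on the context rows alone means a shifted query distribution stays visibly shifted after scaling, which is the situation the model faces at inference time.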

The fixes were applied late in the training cycle. None of them moved the eval needle by more than ~0.01 in isolation. The plateau wasn't really about data quality. It was about the leverage point.

On the 0.75 / 0.75 target

We aimed for 0.75 accuracy on both classification and regression.

Classification: 0.751 achieved. The Duo + TTT recipe lands just over the line.
Regression: 0.609 — significantly below 0.75.

The regression gap is structural for our 13M+26M scale. Even hosted TabPFN-2.5 (~100M parameters, proprietary) reports 0.662 R² on a comparable benchmark. To hit 0.75 R² on diverse tabular regression you currently need TabPFN-3-class compute or specialized regression-only architectures.

We aren't going to claim 0.75 R² in this release. Honest number: +7.3 percentage points of R² over Mini-v1 with zero-tuning (+2.0 over Base with zero-tuning), on the same eval.

What ships

Two model checkpoints, both Apache-2.0:

predictlm-mini-13m: 13.5M parameters, 54 MB inference weights. Trainable end-to-end on a single Tesla T4 in 3.3 hours for ~$1.30.
predictlm-base-26m: 26.2M parameters, 105 MB inference weights. The teacher that Mini was distilled from.

One Python package: pip install predictlm exposes PredictLM.from_pretrained(...), a sklearn-style .fit(X, y).predict(X_test) API, and .fit_and_predict_with_ttt(...) plus the module-level duo_ttt_predict(...) for the full ship recipe.

One MCP server: pip install predictlm-mcp wires PredictLM into Claude / Cursor / Continue as an MCP tool. LLM agents can hand off "given this table, predict the next column" to PredictLM in a single tool call.

All three under Apache-2.0. No registration, no API key, no proprietary kernel dependency. CPU and MPS both work for inference.

What we'd do next

Train a Large-50M+ with the bug-fixed data pipeline on cloud GPUs. The Mini-v1 → Base-26M gap was only +0.012 cls / +0.053 reg, so doubling again is unlikely to close the reg gap on its own — but combined with TTT it could reach 0.62-0.65 R².
Per-task isotonic calibration done correctly (we tried it; the pooled global version made things worse). Probably a 0.005 to 0.01 reduction in ECE without hurting accuracy.
Retrieval-augmented inference — at predict time, retrieve K most-similar tasks from a public-tasks index and prepend their rows to the user's context. ~+0.01 cls in the TabPFN-v2 ablation. Cheap to add.

What we're not going to do

We are not going to make wild accuracy claims to compete with proprietary 100M+ models. Our differentiation is open weights, Apache-2.0, EU-sovereign, runs on a laptop, calibrated uncertainty for free. Those are the wedges. Raw accuracy will improve as we have more compute.

Acknowledgments

The TTT recipe formulation is straight from Gozeten et al. 2025 and the TabPFN-2.5 paper. Real-TabPFN inspired the continued-pretraining attempt that didn't work for us. TabM's BatchEnsemble FFN inspired the attempt that crashed our Mac Studio. All three would have worked at proper scale; we ran them on what we had.

The work was done by one person on a Mac Studio over two weeks.