Today we're releasing PredictLM v1: two open-weight tabular foundation models (13M Mini, 26M Base, both Apache-2.0) plus a test-time training (TTT) inference recipe. On a locked 25-dataset OpenML benchmark, the recommended Duo + TTT recipe hits 0.751 mean classification accuracy / 0.609 mean regression R², a +7.8 pp / +7.3 pp lift over zero-tuning Mini-v1. The recipe runs on a laptop.
This post explains how we got there. It is mostly a negative-results paper.
The numbers
Locked 25-dataset eval (20 datasets loaded successfully: n=10 cls + 10 reg; seed=42, 1500-row cap per dataset, contamination-audited disjoint from training data):
| Recipe | cls mean acc | reg mean R² |
|---|---|---|
| Mini-v1 alone (zero-tuning) | 0.673 | 0.536 |
| Base alone (zero-tuning) | 0.685 | 0.589 |
| Mini-v1 + TTT | 0.742 | 0.595 |
| Base + TTT | 0.748 | 0.608 |
| Duo + TTT (w = 0.40) | 0.751 | 0.609 |
The Duo+TTT recipe does test-time fine-tuning on the user's in-context examples (15 inner Adam steps, lr 1e-4) for both Mini and Base, then averages their softmax predictions with weight 0.40 on Mini. Implementation is in the predictlm Python package — one call:
```python
from predictlm import PredictLM, duo_ttt_predict

# X_train, y_train and X_test are the user's own table and labels.
mini = PredictLM.from_pretrained("zerooneresearch/predictlm-mini-13m")
base = PredictLM.from_pretrained("zerooneresearch/predictlm-base-26m")
preds = duo_ttt_predict(mini, base, X_train, y_train, X_test, w=0.40)
```

Mini-v1 alone with TTT (no Base needed) gets 0.742 / 0.595, within 0.009 cls / 0.014 reg of the full recipe.
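The blending step of the Duo recipe is a weighted average of the two models' class probabilities. Here is a minimal NumPy sketch of that one step, assuming each model has already produced an `(n_rows, n_classes)` probability array. `blend_duo` is a hypothetical helper name used for illustration; it is not the package's `duo_ttt_predict`.

```python
import numpy as np

def blend_duo(p_mini, p_base, w=0.40):
    """Weighted average of two probability arrays; `w` is the weight on Mini."""
    p_mini = np.asarray(p_mini, dtype=float)
    p_base = np.asarray(p_base, dtype=float)
    if p_mini.shape != p_base.shape:
        raise ValueError("both models must score the same rows and classes")
    return w * p_mini + (1.0 - w) * p_base

# Illustrative numbers only: Mini prefers class 0, Base prefers class 1.
p_mini = [[0.7, 0.3]]
p_base = [[0.4, 0.6]]
blended = blend_duo(p_mini, p_base)
print(blended)                 # approximately [[0.52 0.48]]
print(blended.argmax(axis=1))  # [0]
```

Because both inputs are rows of probabilities and the weights sum to 1, each blended row still sums to 1.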
Eleven architecture experiments that didn't matter
We spent the two weeks before release trying to lift the underlying Mini-v1 model itself. We ran eleven architecture / training experiments. They are documented honestly below.
| # | Experiment | cls | reg | Outcome |
|---|---|---|---|---|
| 1 | Dual-teacher distillation (Base + TabICLv2) | -3 to -5pp | n/a | Failed: flat α diluted reg signal |
| 2 | ISAB datapoint attention | ≈ baseline | ≈ baseline | Smoke neutral; folded into #4 |
| 3 | DiffFormer attention | ≈ baseline | ≈ baseline | Smoke win → NaN at step 31K |
| 4 | ISAB+DIFF 57M from teacher | -9 to -27pp | -9 to -27pp | Bigger student capped by 26M teacher |
| 5 | MPS throughput mitigations | n/a | n/a | Throughput regressed; reverted |
| 6 | V8+DIFF 39M no-teacher | -5pp | -5pp | Removing distillation hurt transfer |
| 7 | OPCD-EMA (on-policy context distillation + EMA self-anchor) | 0.658 | 0.539 | Statistical tie with baseline |
| 8 | Mini-v2 Muon optimizer (broken setup, n_heads=4) | 0.653 | 0.458 | Mismatched arch with teacher |
| 9 | Mini-v3 Muon (controlled, n_heads=8, warm-start) | 0.680 peak | 0.526 | +0.007 cls at peak; overfit by 31K |
| 10 | Mini-Real (Real-TabPFN continued pretraining + L2-SP) | 0.673 | 0.536 | L2-SP λ=1e-3 froze the model |
| 11 | TabM BatchEnsemble FFN (k=4 rank-1 adapters) | 0.669 | 0.529 | OOM'd at step 9K; slots hadn't differentiated |
The common pattern: changing the model class couldn't beat the original distillation baseline by more than +0.007. We were trying to lift a curve that was already at its asymptote for this scale / data regime.
After we exhausted architecture experiments, we tried inference-time tricks: feature-permutation TTA, context-bagging, isotonic calibration, a Duo ensemble of Mini + Base. The Duo gave +0.014 cls. The other tricks gave +0.002 to +0.005 each. We locked a "ship floor" at Duo + TTA = 0.689 cls / 0.576 reg and moved to launch prep.
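For readers unfamiliar with feature-permutation TTA, the idea is to run the same in-context prediction under several random column orders and average the results. The sketch below is a generic illustration under assumptions, not the code used for the numbers above: `predict_proba` is a hypothetical callable standing in for one forward pass, and `first_column_model` is a toy predictor that exists only to make the example runnable.

```python
import numpy as np

def permutation_tta(predict_proba, X_ctx, y_ctx, X_qry, n_perms=8, seed=0):
    """Average predictions over random column orders.

    `predict_proba(X_ctx, y_ctx, X_qry)` is a hypothetical callable that
    returns an (n_query, n_classes) probability array.
    """
    rng = np.random.default_rng(seed)
    total = None
    for _ in range(n_perms):
        perm = rng.permutation(X_ctx.shape[1])
        # The same permutation is applied to context and query columns.
        proba = predict_proba(X_ctx[:, perm], y_ctx, X_qry[:, perm])
        total = proba if total is None else total + proba
    return total / n_perms

def first_column_model(X_ctx, y_ctx, X_qry):
    # Toy predictor that only looks at whichever feature lands in column 0,
    # so its output depends on the column order.
    p1 = 1.0 / (1.0 + np.exp(-X_qry[:, 0]))
    return np.column_stack([1.0 - p1, p1])

rng = np.random.default_rng(0)
X_ctx, y_ctx = rng.normal(size=(20, 4)), rng.integers(0, 2, size=20)
X_qry = rng.normal(size=(5, 4))
avg = permutation_tta(first_column_model, X_ctx, y_ctx, X_qry)
print(avg.shape)  # (5, 2)
```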
Then, two days before release, we tried TTT.
What TTT actually does
Test-time training is a published recipe (Gozeten et al. 2025; productized in TabPFN-2.5). For each evaluation task — that is, for each user query — fine-tune the model for a handful of inner steps on the user's own in-context examples, then predict. The model specializes to that task before predicting, then resets.
Concretely:
- Split the user's training table (rows + labels they pass to `.fit`) into a tuning-train split (80%) and tuning-val split (20%).
- `forward(tuning-train as context, tuning-val as query)` → loss against tuning-val labels.
- Backward + Adam step with lr 1e-4.
- Repeat 15 times.
- Predict on the actual test set using the full train as context.
- Restore original weights for the next user query.
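The steps above can be sketched end to end in plain NumPy. This is an illustration under stated assumptions, not the PredictLM implementation. `ToyICLClassifier` is a hypothetical stand-in for the transformer: a nearest-class-mean classifier whose only weights are per-feature scales. Adam is written out by hand with the usual defaults (β1 = 0.9, β2 = 0.999). The 80/20 split is drawn once and reused for all 15 steps, which is a guess, since the list does not say whether it is resampled. Labels are assumed to be integers `0..n_classes-1`.

```python
import numpy as np

class ToyICLClassifier:
    """Hypothetical in-context model: class means come from the context rows,
    and the per-feature scales `s` are the only trainable weights."""

    def __init__(self, n_features, n_classes):
        self.s = np.ones(n_features)
        self.n_classes = n_classes

    def _forward(self, X_ctx, y_ctx, X_qry):
        # Fall back to the global mean if a class is missing from the context.
        mu = np.stack([
            X_ctx[y_ctx == c].mean(axis=0) if np.any(y_ctx == c) else X_ctx.mean(axis=0)
            for c in range(self.n_classes)
        ])
        sq = (X_qry[:, None, :] - mu[None, :, :]) ** 2  # (n_qry, n_classes, n_features)
        logits = -(sq * self.s ** 2).sum(axis=-1)
        z = logits - logits.max(axis=1, keepdims=True)  # stable log-softmax
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return logp, sq

    def predict_proba(self, X_ctx, y_ctx, X_qry):
        return np.exp(self._forward(X_ctx, y_ctx, X_qry)[0])

    def loss_and_grad(self, X_ctx, y_ctx, X_qry, y_qry):
        logp, sq = self._forward(X_ctx, y_ctx, X_qry)
        n = len(y_qry)
        loss = -logp[np.arange(n), y_qry].mean()        # cross-entropy on the query rows
        d_logits = np.exp(logp)
        d_logits[np.arange(n), y_qry] -= 1.0
        d_logits /= n
        # Chain rule: d logits / d s = -2 * s * sq.
        grad = (d_logits[:, :, None] * (-2.0 * self.s * sq)).sum(axis=(0, 1))
        return loss, grad

def ttt_predict(model, X_train, y_train, X_test, steps=15, lr=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X_train))
    n_tune = int(0.8 * len(idx))                        # 80% tuning-train, 20% tuning-val
    tr, va = idx[:n_tune], idx[n_tune:]
    original = model.s.copy()                           # snapshot for the final restore
    m, v = np.zeros_like(model.s), np.zeros_like(model.s)
    b1, b2, eps = 0.9, 0.999, 1e-8
    try:
        for t in range(1, steps + 1):
            _, g = model.loss_and_grad(X_train[tr], y_train[tr], X_train[va], y_train[va])
            m = b1 * m + (1 - b1) * g
            v = b2 * v + (1 - b2) * g ** 2
            m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
            model.s = model.s - lr * m_hat / (np.sqrt(v_hat) + eps)
        # Predict on the real test rows with the full training table as context.
        return model.predict_proba(X_train, y_train, X_test)
    finally:
        model.s = original                              # reset for the next query

rng = np.random.default_rng(0)
y_all = rng.integers(0, 2, size=120)
X_all = rng.normal(size=(120, 3)) + 2.0 * y_all[:, None]
model = ToyICLClassifier(n_features=3, n_classes=2)
proba = ttt_predict(model, X_all[:100], y_all[:100], X_all[100:])
print(proba.shape)                          # (20, 2)
print(np.array_equal(model.s, np.ones(3)))  # True: weights were restored
```

The `try`/`finally` is a design choice of this sketch: it guarantees the weights are restored even if a step fails, so one query cannot leak its specialization into the next.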
The model never changes between users. Per-task specialization happens at inference time on each user's data. The cost is ~15× the forward pass per prediction (~1–60 seconds depending on dataset size). The benefit, measured on our 25-dataset benchmark, is +0.069 mean classification accuracy on Mini-v1 alone.
Per-dataset breakdown
Of 20 successfully loaded eval datasets, 19 improved with TTT and 1 was essentially flat (−0.006). Top winners:
| Dataset | Type | Baseline | + TTT | Δ |
|---|---|---|---|---|
| sgemm_gpu_kernel_performance | reg | 0.681 | 0.839 | +0.159 |
| elevators | reg | 0.712 | 0.859 | +0.147 |
| mfeat-morphological | cls | 0.633 | 0.762 | +0.129 |
| first-order-theorem-proving | cls | 0.327 | 0.456 | +0.129 |
| optdigits | cls | 0.809 | 0.936 | +0.127 |
TTT helped most on tasks with richer features (median n_feat among winners: 47, vs 8 among the 5 neutral / tiny-improvement datasets) and on tasks below the accuracy ceiling. It's near-neutral on near-perfect tasks (solar_flare at 0.847, energy_efficiency at 0.988) — there is nothing left for it to fix.
Why the leverage was in the wrong place
Looking back, every one of our eleven architecture experiments was trying to make the generic prior better. We were lifting the curve everywhere by 0.007 cls and calling it a result. TTT instead specializes the prior per task — it leaves the prior alone but tunes it on each user's data right before predicting. The model that adapts wins.
This is a known result in the in-context-learning literature. The paper "Specialization after Generalization" argues that TTT helps most when the base model is underparameterized, which is exactly our 26M regime. We just didn't try it until we had run out of other ideas.
If we were starting over, we would try TTT in week one, not week three.
Data-pipeline bugs we found and fixed
A side audit of the training data pipeline found six bugs that all 11 architecture experiments had been training against. Most were in standardization or sampling logic and silently capped the loss floor for every training run. The four most important:
- Synthetic-task X and y standardization used joint context+query statistics instead of context-only. The model was being trained on artificially-easy tasks where the context and query distributions exactly matched. Affected 85% of training batches. Fixed.
- Real-corpus classification tasks reported `n_classes` as the unique count in a sub-sample, but kept the original sparse class indices. When the model later saw class label `5` and `n_classes=3`, the loss computation either OOB-tripped or trained against malformed targets. Fixed.
- Real-corpus small-dataset slicing used the requested `n_context` instead of the actual size, leaking query rows into the X-standardization stats. Fixed.
- Real-corpus classification sampling was uniform-random, not stratified. Rare classes could be absent from context but present in query → impossible ICL on those rows. Fixed.
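The intent behind these fixes can be illustrated with three small NumPy helpers. They are sketches under assumptions, not the actual pipeline code, and all three function names are hypothetical. Note that the stratified split only approximates the requested context size because of per-class rounding.

```python
import numpy as np

def standardize_context_only(X_ctx, X_qry, eps=1e-8):
    """Scale both splits with statistics taken from the context rows alone."""
    mean = X_ctx.mean(axis=0)
    std = X_ctx.std(axis=0) + eps
    return (X_ctx - mean) / std, (X_qry - mean) / std

def remap_labels(y):
    """Map sparse class ids (e.g. 2, 5, 9) to dense ids 0..n_classes-1."""
    classes, y_dense = np.unique(np.asarray(y).ravel(), return_inverse=True)
    return y_dense, len(classes)

def stratified_context_query(y, n_context, seed=0):
    """Return (context_idx, query_idx) with every class present in the context."""
    y = np.asarray(y)
    n_context = min(n_context, len(y))  # clamp to the actual dataset size
    rng = np.random.default_rng(seed)
    ctx = []
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))
        k = max(1, round(len(idx) * n_context / len(y)))  # at least one row per class
        ctx.extend(idx[:k].tolist())
    ctx = np.array(sorted(ctx))
    qry = np.setdiff1d(np.arange(len(y)), ctx)
    return ctx, qry

# Toy table with sparse labels and one rare class.
rng = np.random.default_rng(0)
y_raw = np.array([5] * 90 + [2] * 8 + [9] * 2)
X = rng.normal(loc=3.0, scale=2.0, size=(100, 4))

y, n_classes = remap_labels(y_raw)
ctx, qry = stratified_context_query(y, n_context=50)
X_ctx, X_qry = standardize_context_only(X[ctx], X[qry])

print(n_classes, y.max())                    # 3 2
print(np.unique(y[ctx]).tolist())            # [0, 1, 2]
print(np.allclose(X_ctx.mean(axis=0), 0.0))  # True
```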
The fixes were applied late in the training cycle. None of them moved the eval needle by more than ~0.01 in isolation. The plateau wasn't really about data quality. It was about the leverage point.
On the 0.75 / 0.75 target
We aimed for 0.75 accuracy on both classification and regression.
- Classification: 0.751 achieved. The Duo + TTT recipe lands just over the line.
- Regression: 0.609 — significantly below 0.75.
The regression gap is structural for our 13M+26M scale. Even hosted TabPFN-2.5 (~100M parameters, proprietary) reports 0.662 R² on a comparable benchmark. To hit 0.75 R² on diverse tabular regression you currently need TabPFN-3-class compute or specialized regression-only architectures.
We aren't going to claim 0.75 R² in this release. Honest number: +7.3 percentage points of R² over zero-tuning Mini-v1, on the same eval.
What ships
Two model checkpoints, both Apache-2.0:
- predictlm-mini-13m: 13.5M parameters, 54 MB inference weights. Trainable end-to-end on a single Tesla T4 in 3.3 hours for ~$1.30.
- predictlm-base-26m: 26.2M parameters, 105 MB inference weights. Serves as the distillation teacher for Mini.
One Python package: pip install predictlm exposes PredictLM.from_pretrained(...), a sklearn-style .fit(X, y).predict(X_test) API, and .fit_and_predict_with_ttt(...) plus the module-level duo_ttt_predict(...) for the full ship recipe.
One MCP server: pip install predictlm-mcp wires PredictLM into Claude / Cursor / Continue as an MCP tool. LLM agents can hand off "given this table, predict the next column" to PredictLM in a single tool call.
All three under Apache-2.0. No registration, no API key, no proprietary kernel dependency. CPU and MPS both work for inference.
What we'd do next
- Train a Large-50M+ with the bug-fixed data pipeline on cloud GPUs. The Mini-v1 → Base-26M gap was only +0.012 cls / +0.053 reg, so doubling again is unlikely to close the reg gap on its own — but combined with TTT it could reach 0.62-0.65 R².
- Per-task isotonic calibration done correctly (we tried it; the pooled global version made things worse). Probably a 0.005 to 0.01 reduction in ECE without hurting accuracy.
- Retrieval-augmented inference — at predict time, retrieve K most-similar tasks from a public-tasks index and prepend their rows to the user's context. ~+0.01 cls in the TabPFN-v2 ablation. Cheap to add.
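For the calibration item, ECE (expected calibration error) is the average gap between a model's confidence and its accuracy, weighted by how many predictions fall into each confidence bin. Here is a minimal sketch of the common equal-width-bin estimator, assuming an `(n_rows, n_classes)` probability array and integer labels. The function name and the choice of 10 bins are illustrative and not part of the predictlm package.

```python
import numpy as np

def expected_calibration_error(proba, y_true, n_bins=10):
    """Equal-width-bin ECE on top-1 confidence; lower is better."""
    proba = np.asarray(proba, dtype=float)
    y_true = np.asarray(y_true)
    conf = proba.max(axis=1)                        # confidence in the predicted class
    correct = (proba.argmax(axis=1) == y_true).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            # Bin weight times |accuracy - mean confidence| inside the bin.
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

# Overconfident toy case: 90% confidence but only 1 of 2 predictions is right.
overconfident = expected_calibration_error([[0.9, 0.1], [0.9, 0.1]], [0, 1])
print(round(overconfident, 3))  # 0.4
```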
What we're not going to do
We are not going to make wild accuracy claims to compete with proprietary 100M+ models. Our differentiation is open weights, Apache-2.0, EU-sovereign, runs on a laptop, calibrated uncertainty for free. Those are the wedges. Raw accuracy will improve as we have more compute.
Acknowledgments
The TTT recipe formulation is straight from Gozeten et al. 2025 and the TabPFN-2.5 paper. Real-TabPFN inspired the continued-pretraining attempt that didn't work for us. TabM's BatchEnsemble FFN inspired the attempt that crashed our Mac Studio. All three would have worked at proper scale; we ran them on what we had.
The work was done by one person on a Mac Studio over two weeks.