Models we shipped — and the numbers behind them.

The problem
The rules engine fired on everything that looked vaguely wrong. False positives buried the genuine cases, analysts burned hours on ghosts, and the metric everyone watched — catch rate — was quietly falling.
What we built
A deep-learning classifier trained on the client's own labelled history and tuned explicitly for the asymmetric cost of a wrong “yes”, wrapped in an MLOps pipeline with drift detection and a retraining loop.
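One common way to tune for an asymmetric cost is to choose the decision threshold that minimises total cost on a labelled validation set, rather than defaulting to 0.5. The sketch below illustrates that idea only; it is not the production code, and the function name `pick_threshold` and the cost weights `cost_fp` and `cost_fn` are hypothetical.

```python
def pick_threshold(scores, labels, cost_fp=5.0, cost_fn=1.0):
    """Return (threshold, total_cost) minimising cost on a validation set.

    scores: model scores in [0, 1]; labels: 1 = genuine case, 0 = not.
    cost_fp and cost_fn are illustrative weights, not values from the project.
    """
    # Try every observed score as a cut-off, plus "never flag anything".
    candidates = sorted(set(scores)) + [float("inf")]
    best = None
    for t in candidates:
        # A wrong "yes": flagged (score >= t) but the label is 0.
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        # A missed case: not flagged (score < t) but the label is 1.
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        cost = cost_fp * fp + cost_fn * fn
        # Strict "<" keeps the lowest threshold among equally cheap ones.
        if best is None or cost < best[1]:
            best = (t, cost)
    return best


scores = [0.2, 0.5, 0.7, 0.8]
labels = [0, 1, 0, 1]
strict = pick_threshold(scores, labels, cost_fp=5.0, cost_fn=1.0)
lenient = pick_threshold(scores, labels, cost_fp=1.0, cost_fn=5.0)
```

With a wrong “yes” weighted five times heavier than a miss, the chosen threshold moves up to 0.8; reversing the weights moves it down to 0.5. In practice the weights would come from what an analyst-hour and a missed case actually cost the business.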
The result
Accuracy climbed past 98% while the false-positive rate fell under 1.5%. Analysts stopped chasing noise, real cases surfaced faster, and the model now scores every event in real time.
“The accuracy they quoted is still the accuracy we see in production. That sentence sounds obvious until you've been burned by everyone who couldn't deliver it.” — Maya Ellison, VP Operations

The idea
Reuse model depth recursively instead of stacking parameters — adapting compute per token, so the model spends effort only where reasoning actually requires it.
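To make the idea concrete, here is a toy sketch of weight-shared depth with per-token early exit. It is an assumption-heavy illustration, not the released architecture: `shared_block`, its scalar parameters `w` and `b`, and the convergence-based halting rule are all hypothetical stand-ins for whatever the real model uses.

```python
import math


def shared_block(h, w=0.5, b=1.0):
    # One set of parameters, reused at every depth step
    # instead of stacking a new layer per step.
    return math.tanh(w * h + b)


def recursive_forward(tokens, max_steps=8, tol=1e-3):
    """Apply the shared block to each token until its state stops changing.

    Returns (final_states, steps_used). Tokens whose state settles quickly
    exit early; harder ones get more passes, up to max_steps.
    """
    outputs, steps_used = [], []
    for h in tokens:
        steps = 0
        for _ in range(max_steps):
            new_h = shared_block(h)
            steps += 1
            settled = abs(new_h - h) < tol  # assumed halting rule
            h = new_h
            if settled:
                break
        outputs.append(h)
        steps_used.append(steps)
    return outputs, steps_used


states, steps = recursive_forward([0.895, -5.0])
```

In this toy, the first token starts near the block's fixed point and exits after a single pass, while the second needs several more. The parameter count stays the same either way; only the compute spent per token changes.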
Why it matters
It's the kind of research most consultancies cite but never produce. The same rigor goes into every fine-tune and retrieval pipeline we ship, which is why our clients' production systems run leaner.
In the open
Released publicly, with architecture and findings documented for the community — not locked behind a sales motion.
More of what we've put into the world.

Grounded retrieval over 2M documents
A production RAG system answering from a firm's own corpus — with citations, guardrails, and zero hallucinated precedent.

A 7B model that beat the giant
We fine-tuned and distilled an open model on a client's support history — matching a frontier API on their task at a fraction of the cost per call.

Damage detection at the dock
A vision model flagging shipment damage from a phone photo — turning a manual inspection queue into an instant decision.

From notebooks to a real pipeline
We replaced a sprawl of one-off scripts with a versioned, monitored pipeline — the foundation every model the team ships now runs on.
“Most agencies show you a slide. Kaylo showed up obsessing over our false-positive rate. Six weeks later it was running in production.”
“They treated our messy data like a feature, not a problem. The fine-tuned model they shipped does the work of a team, and we own the weights.”