Updates / 2026-01-14

I think I found the angle

I spent time this week comparing ways to generate text and interpret inputs, and a clearer shape started to form. I keep coming back to a simple idea: the model can be a tool in the stack instead of the brain that decides the action. I think a lot of alignment problems show up because we push language models into roles that make auditing hard. We can inspect token probabilities and read a rationale, yet still lack a reliable way to verify why a decision happened in the first place. Anthropic’s public work on reward hacking and misalignment makes that gap feel more urgent.

The alternative I am testing breaks decision-making into smaller steps: a network of interconnected, auditable classifiers turns messy prose into structured data, and the decision is made over that data. That keeps the outcomes deterministic while preserving the flexibility of language-model output where it belongs. The model handles the voice and the code after the decision has been made, which keeps the action path clean and inspectable.
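
A minimal sketch of the shape I mean, in Python. Everything here is illustrative: the classifiers are keyword stand-ins for trained models, and the names (Intent, Risk, decide, render_reply) are placeholders I made up for this post, not anything from Control OS.

```python
from dataclasses import dataclass
from enum import Enum


class Intent(Enum):
    REFUND_REQUEST = "refund_request"
    QUESTION = "question"
    OTHER = "other"


class Risk(Enum):
    LOW = "low"
    HIGH = "high"


@dataclass
class StructuredInput:
    """Messy prose reduced to auditable, structured fields."""
    intent: Intent
    risk: Risk


def classify_intent(text: str) -> Intent:
    # Stand-in for a trained intent classifier.
    lowered = text.lower()
    if "refund" in lowered:
        return Intent.REFUND_REQUEST
    if "?" in text:
        return Intent.QUESTION
    return Intent.OTHER


def classify_risk(text: str) -> Risk:
    # Stand-in for a trained risk classifier.
    return Risk.HIGH if "legal" in text.lower() else Risk.LOW


def decide(structured: StructuredInput) -> str:
    # Deterministic decision over structured fields: this path is the one you audit.
    if structured.risk is Risk.HIGH:
        return "escalate_to_human"
    if structured.intent is Intent.REFUND_REQUEST:
        return "open_refund_ticket"
    return "answer_directly"


def render_reply(action: str, original_text: str) -> str:
    # Only here would a language model be called, to phrase a decision already made.
    return f"[LLM renders a reply for action '{action}']"


if __name__ == "__main__":
    message = "I want a refund for my last order."
    structured = StructuredInput(
        intent=classify_intent(message),
        risk=classify_risk(message),
    )
    action = decide(structured)  # deterministic and inspectable
    print(action, "->", render_reply(action, message))
```

The point of the sketch is the ordering: classification first, a plain deterministic decision second, and the generative model last, where the worst it can do is phrase things badly rather than choose the wrong action.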

That leaves the hard part: training a lot of classifiers and arranging them into a stack that stays faithful to the mission kernel. I am in the weeds now, but for the first time in a while the structure feels like it is clicking into place.

Author

Anthony Cote

Builder of Control OS and the Law of Instrumental Integrity. I share the real-time decisions, friction, and progress here as the work unfolds.