Exploring how to steer language models toward safety, truth, and human-aligned outcomes — without retraining them.
Named after a Navy navigator's call sign.
Three principles that make GATOR tick.
GATOR seeks to identify alignment directions in a model's internal geometry — directions that could form a navigational compass in the model's latent space for steering toward safe, truthful outputs.
A lightweight governor module aims to nudge the model's hidden states toward alignment directions during generation. Think of it as a navigator correcting course in real time.
The base model's weights are never changed. GATOR operates as an overlay that can be applied or removed at any time, leaving the original model completely intact.
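To make the first two principles concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not GATOR's actual implementation: it assumes an alignment direction can be estimated as the normalized difference between the mean hidden state on aligned examples and the mean on unaligned ones, and that the governor adds a scaled copy of that direction to a hidden state. The function names (`alignment_direction`, `govern`) and the `strength` parameter are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def alignment_direction(aligned_states, unaligned_states):
    """Assumed estimator: normalized difference of the two class means."""
    mu_aligned = mean_vector(aligned_states)
    mu_unaligned = mean_vector(unaligned_states)
    return unit([a - u for a, u in zip(mu_aligned, mu_unaligned)])


def govern(hidden, direction, strength):
    """Return a nudged copy of `hidden`; the input is never mutated."""
    return [h + strength * d for h, d in zip(hidden, direction)]


# Toy 2-dimensional hidden states, made up for illustration only.
aligned = [[1.0, 2.0], [3.0, 2.0]]
unaligned = [[0.0, 2.0], [-2.0, 2.0]]

direction = alignment_direction(aligned, unaligned)
steered = govern([0.5, 0.5], direction, strength=2.0)
print(direction, steered)
```

Because `govern` returns a new vector and touches no model weights, setting `strength` to zero, or simply not calling it, gives back the base model's behavior. That mirrors the overlay idea in the third principle.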
Alignment corrections happen during generation, not before it.
Alignment directions have semantic meaning you can inspect.
The governor can be applied or removed without affecting the base model.
Designed to work with any transformer architecture.
Select a prompt and see how GATOR's response evolves across training steps — from raw base model toward governed output. Live inference coming soon.
GATOR is early-stage research. Here's what we know doesn't work yet.
The truth pole can override the safety pole on adversarial prompts. When "give a complete answer" and "refuse harmful requests" conflict, truthfulness sometimes wins.
Some responses oscillate across training steps — correct at one checkpoint, wrong at the next. This is most visible in math and reasoning tasks where the governor can find the right answer but hasn't yet learned to hold it.
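The first limitation above can be pictured geometrically. The sketch below is a hypothetical illustration, not GATOR's method: it assumes the safety and truth poles are each represented by a steering vector, and it shows one possible way to stop the truth vector from pushing against the safety vector, by removing the opposing component before the two are combined. The name `safety_first` is invented for this example.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def safety_first(safety_vec, truth_vec):
    """Combine two steering vectors so truth never opposes safety.

    If the truth vector has a negative component along the safety
    vector, that component is projected out before the vectors are added.
    """
    safety_sq = dot(safety_vec, safety_vec)
    if safety_sq == 0:
        raise ValueError("safety vector must be non-zero")
    overlap = dot(truth_vec, safety_vec)
    if overlap < 0:
        scale = overlap / safety_sq
        truth_vec = [t - scale * s for t, s in zip(truth_vec, safety_vec)]
    return [s + t for s, t in zip(safety_vec, truth_vec)]


# Toy vectors, made up for illustration only.
safety = [1.0, 0.0]
conflicting_truth = [-2.0, 1.0]   # pushes against safety along axis 0
compatible_truth = [1.0, 1.0]     # no conflict, left unchanged

print(safety_first(safety, conflicting_truth))
print(safety_first(safety, compatible_truth))
```

In the conflicting case, a plain sum would give a vector that points away from the safety direction. After the projection, the combined vector keeps the full safety component and only the part of the truth vector that does not oppose it.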
The roadmap from proof-of-concept toward a scalable alignment pipeline.
Lock aligned prompts and redirect training budget to the ones still failing.
Enforce explicit ordering so safety always takes precedence on adversarial inputs.
Scale to large adversarial datasets with fully automated train-evaluate-lock loops.
Prove geometric steering transfers across model families and scales.
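The locking and automation items on this roadmap can be sketched as a simple loop. This is a minimal sketch under assumptions, not the project's pipeline: `train_step` and `is_aligned` are hypothetical callbacks supplied by the caller, and the budget is assumed to be a number of training steps split evenly among the prompts that are still failing.

```python
def allocate_budget(prompt_ids, locked, total_steps):
    """Split `total_steps` evenly across prompts that are not yet locked."""
    active = [p for p in prompt_ids if p not in locked]
    if not active:
        return {}
    base, extra = divmod(total_steps, len(active))
    # The first `extra` active prompts receive one additional step.
    return {p: base + (1 if i < extra else 0) for i, p in enumerate(active)}


def train_evaluate_lock(prompt_ids, train_step, is_aligned,
                        steps_per_round, max_rounds):
    """Run train, evaluate, lock rounds; return the set of locked prompts."""
    locked = set()
    for _ in range(max_rounds):
        budget = allocate_budget(prompt_ids, locked, steps_per_round)
        if not budget:
            break  # every prompt is locked
        for prompt_id, steps in budget.items():
            train_step(prompt_id, steps)
        for prompt_id in budget:
            if is_aligned(prompt_id):
                locked.add(prompt_id)
    return locked


# Toy stand-ins for training and evaluation, made up for illustration.
progress = {"a": 0, "b": 0, "c": 0}
steps_needed = {"a": 2, "b": 5, "c": 100}


def toy_train_step(prompt_id, steps):
    progress[prompt_id] += steps


def toy_is_aligned(prompt_id):
    return progress[prompt_id] >= steps_needed[prompt_id]


locked = train_evaluate_lock(["a", "b", "c"], toy_train_step, toy_is_aligned,
                             steps_per_round=6, max_rounds=3)
print(sorted(locked), progress)
```

Each round, prompts that already pass evaluation drop out of the allocation, so the same per-round budget is concentrated on fewer and fewer failing prompts.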
See how GATOR steers model activations in an interactive 3D visualization.