Engineering the score: how Horion weighs metrics, logs, and traces.
The first score we shipped, nobody trusted. Here is how we rebuilt the rubric — pillar by pillar — and what changed in PR reviews.
The first time we tried to grade an instrumented service, we got a number that no one trusted. Engineers looked at the dashboard, shrugged, and went back to reading Datadog directly. The score was there, but it never showed up in a real argument.
That's the test we now use to evaluate any scoring system: does it survive contact with a code review? If two engineers can't argue about a PR by pointing at the score as shared evidence, the score is decoration.
The three pillars
Horion grades a service on three pillars: metrics, logs, and traces. Each pillar produces a 0–100 sub-score, and the overall score is the weighted mean of the three. The weights are configurable per service, but the defaults (35 / 30 / 35 for metrics, logs, and traces) landed after we tested them against a few dozen real codebases.
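To make the arithmetic concrete, here is a minimal sketch of that weighted mean. The function name `overall_score` and the dictionary layout are illustrative assumptions, not Horion's actual API. Only the three pillars and the 35 / 30 / 35 defaults come from the description above.

```python
# Hypothetical sketch: names and data layout are assumptions, not Horion's API.

# Default weights, read in the order the pillars are listed above.
DEFAULT_WEIGHTS = {"metrics": 35, "logs": 30, "traces": 35}


def overall_score(sub_scores, weights=None):
    """Return the weighted mean of per-pillar sub-scores (each 0-100)."""
    weights = DEFAULT_WEIGHTS if weights is None else weights
    if set(sub_scores) != set(weights):
        raise ValueError("sub_scores and weights must cover the same pillars")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Dividing by the total lets per-service weights use any scale.
    weighted_sum = sum(sub_scores[pillar] * weights[pillar] for pillar in weights)
    return weighted_sum / total_weight


# (80 * 35 + 60 * 30 + 100 * 35) / 100
print(overall_score({"metrics": 80, "logs": 60, "traces": 100}))  # 81.0
```

Normalizing by the sum of the weights means a per-service override such as 1 / 1 / 2 works without anyone having to make the weights add up to 100.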
What goes into a pillar is more interesting than the weight. For metrics, the rubric checks for:
- A clear set of SLIs declared per endpoint.
- Histogram-based latency instead of averages.
- Cardinality budgets that don't blow up under traffic.
- Metric names that follow a stable naming convention.
Each criterion is a yes/no check with an explanation. The pillar score is the share of criteria that pass, scaled to 0–100. No magic.
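As a rough illustration of "share of criteria that pass", the sketch below scores the four metrics criteria listed above. The `Criterion` class, the criterion names, and the reasons are hypothetical placeholders invented for the example. The real rubric's identifiers may look different.

```python
# Hypothetical sketch: the class, criterion names, and reasons are made up.
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str     # short, stable identifier
    passed: bool  # the yes/no outcome
    reason: str   # the explanation shown to reviewers


def pillar_score(criteria):
    """Return the share of passing criteria on a 0-100 scale."""
    if not criteria:
        raise ValueError("a pillar needs at least one criterion")
    passed = sum(1 for criterion in criteria if criterion.passed)
    return 100 * passed / len(criteria)


metrics_criteria = [
    Criterion("slis-declared", True, "every endpoint declares SLIs"),
    Criterion("histogram-latency", True, "latency is recorded as histograms"),
    Criterion("cardinality-budget", False, "a label exceeds its cardinality budget"),
    Criterion("stable-naming", True, "metric names follow the convention"),
]

print(pillar_score(metrics_criteria))  # 75.0
```

Because every criterion counts equally, a one-criterion change in a four-criterion pillar always moves that pillar by 25 points, which is part of what keeps the number easy to explain.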
Why the score finally argued back
The unlock wasn't a smarter model. It was making the rubric legible. Every criterion has a short name, a short reason, and a link to the offending file and line. When the engine drops a service from 78 to 64, you can read the diff and see exactly which three criteria flipped.
A score nobody can read is a score nobody can fight.
That changed how reviewers used Horion. The score stopped being a status light and started being a checklist people could push back on — sometimes correctly.
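Reading which criteria flipped between two runs can be sketched as a plain comparison by criterion name. Everything here is an assumption made for illustration: the `Result` record, the `flipped` helper, and the file paths are not taken from Horion itself.

```python
# Hypothetical sketch: record shape, helper name, and paths are assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    name: str      # short criterion name
    passed: bool   # yes/no outcome
    reason: str    # short reason shown to the reviewer
    location: str  # offending file and line, e.g. "api/handlers.py:88"


def flipped(before, after):
    """Return results from `after` whose pass/fail outcome differs from `before`."""
    before_by_name = {result.name: result for result in before}
    changes = []
    for result in after:
        previous = before_by_name.get(result.name)
        # Criteria that exist only in one run are ignored in this sketch.
        if previous is not None and previous.passed != result.passed:
            changes.append(result)
    return changes


before = [
    Result("histogram-latency", True, "latency uses histograms", "api/handlers.py:88"),
    Result("stable-naming", True, "names follow the convention", "api/metrics.py:12"),
]
after = [
    Result("histogram-latency", False, "latency reported as an average", "api/handlers.py:91"),
    Result("stable-naming", True, "names follow the convention", "api/metrics.py:12"),
]

for result in flipped(before, after):
    status = "pass" if result.passed else "fail"
    print(f"{result.name}: now {status} ({result.reason}) at {result.location}")
```

A reviewer who disagrees with a flip has a specific name, reason, and location to contest, rather than a bare number.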
What we threw away
A few things we tried and removed:
- Free-form LLM grading of telemetry quality. Too noisy across runs.
- Composite scores that mixed pillars before showing them. People couldn't tell why the number moved.
- A "code quality" pillar. Out of scope. Horion is about observability, not linting.
The current rubric is boring on purpose. Boring rubrics are the ones engineers trust to argue with.