Ultimately the Survivors Do Not Prevail

The full paper, with methodology and complete results, is available here. Simulation code on GitHub at captainpete/zombie-survival.

In 2018, I wrote a post called Indirect Programming that ended here:

“Classic reward functions are one-dimensional scalar values. As a language, these have even greater problems with expressibility than early programming languages.”

The argument was that programming machine behaviour through a reward signal is blunter than a language. You describe the outcome you want. The agent reverse-engineers a policy to produce it. What it finds is only loosely related to what you had in mind.

In 2020, a collaborator and I ran an experiment that sharpens this. We built a zombie survival game and trained reinforcement learning agents inside it. By holding the reward function static across four scenarios, we found that small changes to the environment produced large changes in emergent behavior. If you want agents to act differently, you’ll have more luck shaping the environment than the reward function.

The game

The game has seven agents: two survivors (green), and five zombies (red). The world is a 2×2 metre arena with hard walls. Contact is physical. Agents are solid bodies with penalty-based collision forces. If a zombie touches a survivor, that’s a bite. Survivors lose ten points per bite; zombies gain ten. The zombie apocalypse is a zero-sum game.
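The zero-sum contact reward is simple enough to sketch. The helper names and the circle-overlap bite test below are illustrative, not the repository's actual implementation:

```python
import numpy as np

BITE_REWARD = 10.0  # points transferred per zombie-survivor contact

def contact_rewards(positions, radii, survivor_ids, zombie_ids):
    """Zero-sum bite rewards: a zombie touching a survivor gains +10
    while the survivor loses 10, so team totals always cancel."""
    rewards = {i: 0.0 for i in survivor_ids + zombie_ids}
    for s in survivor_ids:
        for z in zombie_ids:
            dist = np.linalg.norm(positions[s] - positions[z])
            if dist <= radii[s] + radii[z]:  # bodies overlap: that's a bite
                rewards[z] += BITE_REWARD
                rewards[s] -= BITE_REWARD
    return rewards
```

Because every bite moves points from one team to the other, the rewards sum to zero in every timestep.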

The agents learned through MADDPG: Multi-Agent Deep Deterministic Policy Gradient. The key idea is that each agent’s critic has access to the full joint state during training (the positions and velocities of all seven agents), so stable policies can emerge even as everyone’s behaviour is changing simultaneously. At deployment, execution is decentralised: each agent acts on its own observations only.1
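The centralised-critic / decentralised-actor split comes down to what each network is allowed to see. A minimal shape-level sketch (the dimensions here are illustrative, not the paper's):

```python
import numpy as np

N_AGENTS, OBS_DIM, ACT_DIM = 7, 4, 2  # 7 agents; position + velocity; 2-D action

def critic_input(all_obs, all_actions):
    """Centralised training: each agent's critic conditions on the joint
    observations and actions of every agent, which gives it a stationary
    learning target even while the other policies keep changing."""
    return np.concatenate(list(all_obs) + list(all_actions))

def actor_input(own_obs):
    """Decentralised execution: the actor sees only its own observation."""
    return own_obs
```

With these toy dimensions the critic input is 7 × (4 + 2) = 42 numbers wide, while each actor still acts on its own 4 numbers.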

We designed four scenarios, each more mechanically complex than the last.

Baseline: what happens when you just let it run

The baseline scenario is fully observable and fully transparent. Every agent can see every other agent’s position and velocity, indexed by agent number. The reward is contact-based and nothing else.

All seven agents initialise from rest. Zombies begin converging.


Across eight independent training runs, the results were highly variable. The per-agent reward curves show a period of instability around episodes 2,000–6,000 followed by a gradual settling:

[Figure: per-agent reward curves over training in the baseline scenario]

The pattern that emerged was what we called target fixation. A subset of zombies would lock onto one survivor and stop switching. The targeted survivor would be overwhelmed; the other would be largely ignored and consequently learn to move faster (it had more room to explore without being bitten). This created a positive feedback loop: the less-targeted survivor got faster relative to the more-targeted one, which reinforced the zombies’ preference for the already-targeted one. None of this was designed. It fell out of gradient descent over the reward signal.

The high variance across runs reflects the sensitivity of this dynamic to initialisation. Which survivor gets fixated on is arbitrary. How quickly it stabilises depends on early random policy differences. The outcome type is consistent across runs.

The observation space is code

The second scenario changes almost nothing about the reward. It changes the observation.

In the baseline, agents were listed in each observation vector by fixed index: agent 0, agent 1, agent 2, and so on. This meant agents could learn policies tied to specific identities: “agent 2 is always the zombie in the bottom-left.” The anonymity scenario sorted all agents by distance before constructing the observation. Same physical world, same rewards, but now the observation said: nearest agent, second-nearest, third-nearest. Agents became exchangeable within their teams.
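The sorting trick is small enough to sketch directly (function and field names assumed, not the repository's code):

```python
import numpy as np

def anonymised_observation(own_pos, own_vel, others):
    """Build an observation with the other agents sorted by distance,
    so policies generalise over 'nearest agent' rather than 'agent 2'."""
    ranked = sorted(others, key=lambda o: np.linalg.norm(o["pos"] - own_pos))
    parts = [own_pos, own_vel]
    for o in ranked:
        parts.append(o["pos"] - own_pos)  # relative position, nearest first
        parts.append(o["vel"])
    return np.concatenate(parts)
```

The observation layout is unchanged; only the slot-to-agent assignment now depends on distance instead of identity, which is what makes agents exchangeable within a team.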

Anonymity: zombies rank by distance rather than fixed index.


The effect on learning variance was immediate. Across ten runs of each:

[Figure: learning curves, baseline vs. anonymity scenario, ±1 standard deviation across runs]

The shaded bands are ±1 standard deviation across runs. The anonymity scenario (green) converges more tightly and more stably. Survivors develop general evasion strategies. Zombie coverage becomes more even.

This mattered to us more than it might appear. We hadn’t changed the reward, the architecture, the algorithm, or the hyperparameters. We changed the representation. How you describe state to the agent is as consequential as how you shape reward, arguably more so, because it determines what distinctions the agent can even learn to make. Feeding identity-indexed observations when only distance matters is the machine learning equivalent of keeping dead variables in scope: the network learns to distinguish things that shouldn’t be distinguished.2

The health mechanic

The third scenario adds one mechanic: health. Survivors start at full health; each zombie contact decays it by a small multiplicative factor (γ = 0.99 per bite). Lower health means lower maximum speed. A bitten survivor slows down.
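A sketch of the mechanic. The multiplicative decay is from the paper; the linear scaling of top speed with health is an assumption, since the text says only that lower health means lower maximum speed:

```python
GAMMA = 0.99  # multiplicative health decay per bite (from the paper)

def apply_bite(health):
    """Each contact shaves 1% off the survivor's remaining health."""
    return health * GAMMA

def max_speed(health, base_speed=1.0):
    """Assumed linear scaling: a half-health survivor moves at half speed."""
    return base_speed * health
```

Even a gentle-looking decay compounds: after fifty bites a survivor is down to 0.99⁵⁰ ≈ 0.605 of full health, and correspondingly slower, which is exactly the feedback loop the zombies exploited.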

The reward function is unchanged. Ten points per bite, same as before.

Normal pursuit. Bites begin; S0 health degrades with each contact.


The zombies learned collaborative immobilisation. Rather than distributing across both survivors, they coordinated to pile onto the weakest: the one who had already taken bites and was already slowing. Once that survivor was slow enough to bite reliably, all five zombies would redirect, biting repeatedly and slowing it further, a feedback loop that ended with one survivor pinned and health-depleted.

The survivors developed their own response: increasing separation. Here’s the mean distance between the two survivors over training, compared to the anonymity scenario:

[Figure: mean distance between the two survivors over training, health scenario vs. anonymity scenario]

The healthy survivor learned to put distance between itself and the compromised one. In a trained policy, when S0 starts taking bites and slowing, S1 reads this as a cue to retreat, because S0’s zombie cluster has become a hazard too. Saving S1 meant abandoning S0.

None of this required reward changes. We changed the state dynamics, and the reward function, still just counting bites, implied different optimal behaviour under those new dynamics. The environment did the rest.

Consider what it would take to produce the same coordination through reward shaping. You might try rewarding zombies for being near each other, or penalising survivors for proximity to each other’s zombie clusters. You’d be writing terms you don’t fully understand, hoping gradient descent finds the same pattern. And you’d probably get something close but wrong, something that looks like collaborative immobilisation under training conditions and breaks in deployment. The mechanic implied the strategy without any of this. The reward just said “biting is good.”

The social dilemma

The fourth scenario gives survivors firearms. Two pellets per shot, a 0.2-second reload, Gaussian spread (σ = 3°), with firing probability gated by health via a sigmoid.3 The intuition going in was that armed survivors might fare better. They’re outnumbered; a gun is a force multiplier.
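The pellet spread can be sketched as follows; the constants are the paper's, but the sampling details beyond them are assumed:

```python
import numpy as np

N_PELLETS = 2     # pellets per shot
RELOAD_S = 0.2    # seconds between shots
SPREAD_DEG = 3.0  # Gaussian std-dev of per-pellet angular error

def pellet_directions(aim_angle, rng):
    """Sample unit-vector headings for one shot: each pellet deviates
    from the aim angle by Gaussian noise with sigma = 3 degrees."""
    noise = rng.normal(0.0, np.deg2rad(SPREAD_DEG), size=N_PELLETS)
    angles = aim_angle + noise
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)
```

A 3° sigma is tight at point-blank range and forgiving at distance, which matters later: it rewards shooting at close, slow targets.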

Both survivors armed. Firing at the approaching cluster.


What emerged instead was a textbook social dilemma. Once survivors learned to aim at all, an increasing fraction of their hits landed on the other survivor. P(defect|hit) below is the paper’s metric: for each survivor, the fraction of successful hits landing on their teammate rather than a zombie, over the first 20,000 training episodes:

[Figure: P(defect|hit) per survivor over the first 20,000 training episodes]
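The metric itself is just a conditional frequency; a sketch with hypothetical hit labels:

```python
def p_defect_given_hit(hits):
    """hits: a label per successful hit, 'teammate' or 'zombie'.
    Returns the fraction of hits that landed on the other survivor."""
    if not hits:
        return 0.0
    return sum(1 for h in hits if h == "teammate") / len(hits)
```

A value of 0.5 means half of a survivor's successful hits struck their teammate rather than a zombie.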

No reward signal pointed toward defection. The environment got there on its own.

Here’s the reasoning gradient descent found. Both survivors are competing to escape the same zombie cluster. If the other survivor is slower or more exposed, they draw zombie attention, which temporarily reduces pressure on you. Shooting them hurts them, makes them slower, makes them a more attractive target for zombies, and buys you breathing room. The environment created a competitive dynamic that the reward, which only counted bites, had not anticipated and didn’t prevent.

A role split also emerged: one survivor tended to hold ground and fire more aggressively while the other retreated. The fighters never became accurate against moving zombie targets. But their aim against the stationary, or predictably retreating, other survivor improved over time. Hitting something that isn’t trying to avoid you is easier.

What the agents were actually reading

In Indirect Programming I framed reward shaping as a language problem: reward functions are too inexpressive to describe what we want, and meta-RL might improve them as compilers improved programming languages.

These experiments suggest the framing is slightly off. The reward function is the language you write in. What the agents actually learn from is the environment.

The anonymity change touched only the observation space. The agents got a cleaner signal and converged faster. The health mechanic changed state dynamics and nothing in the reward; the agents produced collaborative immobilisation anyway. Adding weapons made defection locally rational, and the agents found that over 50,000 episodes with no reward pointing them there.

The reward function counted bites. The environment described everything the agents could do and what each action caused. They were paying more attention to that.

If you want a cooperative outcome, reward shaping is the indirect route; the direct route is designing an environment where defection stops paying off. You can’t fully separate the two, and the environment is always part of what you’re specifying whether you think about it that way or not. Practitioners usually have the emphasis backwards.

Footnotes

  1. MADDPG addresses the core non-stationarity problem in multi-agent RL. If each agent uses a standard single-agent algorithm, the environment appears non-stationary from any agent’s perspective, because every other agent’s policy is changing simultaneously. Centralised critics during training, which can condition on all agents’ actions, give each agent a stable target to train against. See Lowe et al. (2017).

  2. The mechanism is straightforward: if identity (agent index) predicts anything about behaviour, the network will learn to use it. During training it does predict something, because agents develop specialised roles through the randomness of early learning. Those roles are correlated with index, so the network incorporates index as a signal. This makes the learned policy brittle to any permutation of agents, and also to any new agent configuration at deployment. Sorting by distance removes the spurious correlation and forces a more general policy.

  3. The sigmoid gate: p(fire) = health / (1 + exp(20 × (0.5 − trigger))) where trigger is the agent’s continuous action for firing. This gave a smooth transition rather than a hard health cutoff, so a partially healthy survivor fires with reduced probability instead of stopping cold.
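As a quick check on the gate's behaviour, the formula transcribes directly (this is the footnote's equation verbatim, not the repository code):

```python
import math

def fire_probability(health, trigger):
    """p(fire) = health / (1 + exp(20 * (0.5 - trigger))): a steep
    sigmoid in the trigger action, scaled linearly by health, so a
    wounded survivor fires with reduced probability rather than not at all."""
    return health / (1.0 + math.exp(20.0 * (0.5 - trigger)))
```

At full health and full trigger the gate saturates near 1; at half health it tops out near 0.5; with the trigger released it is effectively zero.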

@misc{hollows2020ultimate,
  author = {Hollows, Peter},
  title  = {{Ultimately the Survivors Do Not Prevail}},
  year   = {2020},
  month  = dec,
  url    = {https://dojo7.com/2020/12/20/zombie-marl/}
}