The Literal Listener

The full paper, with methodology and results, is available here.

Chris Potts’ RSA framework asks what makes an utterance a good one. The answer isn’t “the most probable thing to say given the context.” A pragmatic speaker (S₁) reasons about a literal listener (L₀). The literal listener just interprets utterances at face value; the pragmatic speaker picks whichever utterance, given that behavior, is most likely to achieve its goal.1 Meaning lives in the gap between intent and what the listener does with it.
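The recursion is easy to make concrete. Here is a toy numeric instance of it; the states, utterances, and rationality parameter are illustrative, not from the paper:

```python
# Minimal RSA sketch: a literal listener L0 interprets utterances at face
# value; a pragmatic speaker S1 picks utterances by how L0 would respond.
# truth[u][s] = 1 iff utterance u is literally true of state s.

def rsa(truth, prior, lam=1.0):
    utterances = list(truth)
    states = list(prior)
    # Literal listener: L0(s|u) proportional to truth(u, s) * P(s)
    L0 = {}
    for u in utterances:
        scores = {s: truth[u][s] * prior[s] for s in states}
        z = sum(scores.values())
        L0[u] = {s: v / z for s, v in scores.items()}
    # Pragmatic speaker: S1(u|s) proportional to L0(s|u) ** lambda
    S1 = {}
    for s in states:
        scores = {u: L0[u][s] ** lam for u in utterances}
        z = sum(scores.values())
        S1[s] = {u: v / z for u, v in scores.items()}
    return L0, S1
```

Run on the classic scalar-implicature setup ("some" is true of both worlds, "all" only of the all-world), the speaker in the some-but-not-all world prefers "some", because that is the utterance the literal listener does most with.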

This is a clean formalism for question generation. To ask a good question is to reason about a listener: what do they know, what can they answer, what will they reveal if you ask just this rather than that? My final project for Chris Potts’ XCS224U (Winter 2022) tried to instantiate that computationally.

The finding, plainly: the approach worked by the numbers, but it did so by gaming the literal listener rather than by learning to interrogate a topic.

The standard approach, and its problem

Most question-generation research trains a model to predict p(q|c): the probability of a question q given a context c. You need example questions to train on. The model learns to produce questions that look like the ones in the dataset.

This creates a circularity. QuAC2 was collected by crowdworkers asking conversational questions about Wikipedia articles. Training on QuAC teaches a model to ask conversational questions like “was it successful?” and “what happened after that?” because that’s what the annotators asked. If you want a model that asks better questions, you don’t have a reference corpus of better. You can only grade questions by how much they resemble what was asked.

The more fundamental problem: comparing generated questions to reference questions doesn’t measure whether the questions are doing their job, which is to recover information the student doesn’t have.

The distractible teacher

The student-teacher game gives the student a section of a Wikipedia article as context, and the teacher a different, hidden section from the same article. The student asks questions; the teacher answers from the hidden section; after several turns you measure how much of the hidden knowledge was recovered.

My version adds a distractor. The teacher is also given a random, unrelated Wikipedia article.3 Before answering, the teacher privately scores two candidate answers, one from the hidden knowledge and one from the distractor, and returns whichever it is more confident in. The student never sees the distractor and doesn’t know this is happening.
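The selection step is simple to sketch. Here `qa` stands in for an extractive QA model (such as the RoBERTa teacher described in footnote 3) that returns an answer span and a confidence; the names are illustrative, not the project’s actual code:

```python
# Sketch of the distractible teacher's answer selection. The teacher
# privately scores a candidate answer from the hidden knowledge and one
# from the distractor article, and returns whichever it is more
# confident in. The student never sees which passage the span came from.

def answer(question, hidden_knowledge, distractor, qa):
    """Return the (span, confidence) pair the QA model trusts more."""
    span_h, conf_h = qa(question, hidden_knowledge)
    span_d, conf_d = qa(question, distractor)
    if conf_h >= conf_d:
        return span_h, conf_h
    return span_d, conf_d
```

A vague question tends to get a plausible-looking span from either passage, so the distractor wins often; a specific, on-topic question is one the distractor usually can’t answer confidently.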

Vague or off-topic questions (“what else is interesting about this?”) are easy for the distractor to satisfy, because any passage can produce a span that sounds like a plausible answer to a vague question. Specific, on-topic questions are harder for the distractor to match. The distractor puts a price on vagueness.

Measuring quality without reference questions

Grading a question directly requires knowing what a good question looks like. Grading a line of questioning by its results doesn’t: you just run the game and measure what came out.

Two metrics. Topicality: embed the hidden knowledge and the collected answers using Sentence-BERT, take the cosine similarity.4 If the conversation extracted on-topic information, the answers will cluster near the hidden knowledge in embedding space. Teacher patience: mean QA confidence across all answers in the dialogue, raised to a power.5 Confident answers mean the questions were clear and answerable; low confidence means the teacher was mostly guessing.

Neither requires annotated questions. Any Wikipedia article can become a game. The English Wikipedia snapshot contains 6 million articles; after filtering for suitable sections, that’s 511,768 game settings, none of which needed human annotation to generate training signal from.
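Under the definitions in footnotes 4 and 5, the combined game score can be sketched like this, with `embed` standing in for a sentence encoder such as Sentence-BERT (a minimal sketch, not the project’s actual code):

```python
import math

def topicality(embed, hidden_knowledge, answers):
    """Cosine similarity between the embedded hidden knowledge and the
    embedded concatenation of the collected answers."""
    # Deduplicate answers while preserving turn order, so repetition
    # can't inflate the score.
    seen, unique = set(), []
    for a in answers:
        if a not in seen:
            seen.add(a)
            unique.append(a)
    k, a = embed(hidden_knowledge), embed(" ".join(unique))
    dot = sum(x * y for x, y in zip(k, a))
    norm = math.sqrt(sum(x * x for x in k)) * math.sqrt(sum(y * y for y in a))
    return dot / norm

def teacher_patience(confidences, alpha=2.0):
    """Mean QA confidence across the dialogue, raised to a power to
    sharpen the penalty for low-confidence answers."""
    return (sum(confidences) / len(confidences)) ** alpha

def game_score(embed, hidden_knowledge, answers, confidences):
    """Final reward: topicality times teacher patience."""
    return topicality(embed, hidden_knowledge, answers) * teacher_patience(confidences)
```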

Two bets

The first: run the game over the Wikipedia corpus and use the resulting scores as a training signal. Use one model as an oracle to label another, at scale, without human annotation.6

The second: fine-tune the student using reinforcement learning. REINFORCE with a self-critical baseline, treating the game score as reward. My advisor suggested this was unlikely to work. RL on language models was generally considered unstable and unreliable.7
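The estimator itself is compact. A minimal sketch of the self-critical REINFORCE update, assuming a sampled question’s log-probability under the policy and game scores for both the sampled and greedy rollouts (names are illustrative):

```python
# Self-critical REINFORCE: the model's own greedy rollout serves as the
# baseline, so the advantage is positive only when a sampled question
# scores better in the game than the model's current best guess.

def self_critical_loss(sample_logprob, sample_reward, greedy_reward):
    """Score-function estimator with a greedy baseline.

    Minimising this loss increases the probability of sampled questions
    that beat the greedy rollout and decreases it for those that don't.
    """
    advantage = sample_reward - greedy_reward
    return -advantage * sample_logprob
```

The baseline needs no extra learned critic, which is part of why self-critical training was attractive for a small project: one extra greedy decode per example buys variance reduction.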

What the model learned

RL fine-tuning improved both metrics. The QuAC-tuned model improved teacher patience substantially. The Wikipedia-tuned model improved patience even more on out-of-distribution evaluation, and added modest topicality gains over the baseline.

Looking at what the model actually produced, the questions had converged on patterns like:

“what was the name of the album?” / “…the competition?” / “…the award?” / “…the building?”

Specific, short-answer, nameable-things questions. A RoBERTa model trained on SQuAD handles these confidently, because SQuAD is full of factual questions with single definite answers. Named entities are topic-specific enough that the distractor article usually can’t produce a confident answer for them, so the teacher stays on-topic. These questions score well on both metrics.

The pragmatic speaker had modeled the literal listener accurately. It had learned what kind of questions a SQuAD-trained QA model is good at answering, and was producing those.

That’s different from learning to interrogate a topic. The student adapted to the teacher’s preferences, not to the information the teacher was providing. Answers accumulated over the dialogue, but the student wasn’t building on them. It found a fixed strategy that worked for the teacher and ran it.

The conclusion from the paper put it this way: “I did not show that the questioner’s understanding was improved by answers provided in conversation. The student was able to adapt to the teacher, but not the information provided.”

The earlier failure

An intermediate experiment optimized topicality alone, without the patience constraint.

That model found a more direct route. It started generating questions that weren’t recognizably English. The teacher, forced to produce a span in response to garbled input, would retrieve something from the hidden knowledge when the distractor was no help. Averaging over many such exchanges, topicality was reasonable.

Adding patience fixed this. The student had to ask things a QA model could actually parse. But patience encoded the SQuAD model’s preferences directly into the reward, which is why the final model converged on SQuAD-style questions rather than anything else.

Without patience: the student found adversarial prompts. With patience: the student found the mode of the teacher’s training distribution. Neither is quite what a pragmatic questioner is supposed to do.

What it means

The RSA framing predicted this outcome, in retrospect. A pragmatic speaker reasons about a literal listener. Train it on a reward derived from that literal listener’s behavior, and it will learn to exploit the literal listener’s preferences. The literal listener here was a QA model with a strong prior toward factual, named-entity questions. The student found that prior and stayed there.

The deeper issue: any proxy for “question quality” derived from a QA model’s responses will encode that model’s biases. You can’t fully decouple the evaluation from the evaluator. Choosing a different teacher model would have produced different questions, possibly better in some ways, but still shaped by whatever that teacher found easy.

Footnotes

  1. The RSA framework formalizes pragmatic reasoning recursively. A literal listener L₀ interprets utterances as a posterior over world states: L₀(s|u) ∝ P(u|s)·P(s). A pragmatic speaker S₁ chooses utterances the literal listener would most usefully respond to: S₁(u|s) ∝ L₀(s|u)^λ. The framework descends from Grice’s cooperative maxims (1975) and was given a neural treatment by Monroe et al., “Colors in Context: A Pragmatic Neural Model for Grounded Language Understanding,” TACL, 2017. The distractor setting specifically echoes the color reference game setup in that paper, where a speaker must identify a color uniquely given distractors.

  2. Choi et al., “QuAC: Question Answering in Context,” EMNLP, 2018. The dataset contains 98,407 QA pairs from 13,594 dialogues. Crowdworkers found people-focused articles easier to discuss than arbitrary concepts, so the dataset has a strong bias toward articles about individuals, particularly male entertainers. This shaped what the baseline model learned to ask.

  3. The teacher is deepset/roberta-base-squad2, a RoBERTa model fine-tuned on SQuAD 2.0. SQuAD 2.0 includes unanswerable questions, so the model returns a confidence score alongside each answer span, reflecting how certain it is that the answer is actually in the passage. This made it usable as a patience oracle: a low score means the teacher was guessing.

  4. Sentence-BERT (all-mpnet-base-v2) from sbert.net. Topicality: T(K, A) = S_C(S(K), S(A)), where K is the hidden knowledge, A the collected answers, S the sentence encoder, and S_C cosine similarity. Answers are concatenated in turn order, with duplicates removed to avoid inflating the score through repetition. Topicality is computed per-game rather than per-answer because sentence embeddings aren’t additively composable.

  5. Teacher patience: TP(A) = [(1/|A|) Σ_{a∈A} a_score]^α, with α = 2. Raising to a power sharpens the penalty for low-confidence answers. The final game score is T(K, A) · TP(A).

  6. This was an unusual thing to do for a language model fine-tuning experiment at the time — using a model as an oracle to generate training signal at scale, without human annotation. Ed. 2024: using a model as a labeling oracle to generate training signal from unlabeled data is now a routine part of language model training. Most large-scale fine-tuning pipelines do something in this family, from reward models scoring synthetic completions to frontier models labeling examples for smaller distillation targets.

  7. My advisor’s skepticism was reasonable given what RL on language models had produced up to that point. Ed. 2024: reinforcement learning from human feedback (RLHF) became the standard approach for aligning large language models. InstructGPT (Ouyang et al., 2022) and the training recipes that followed rely on RL fine-tuning of language models as a core component. The evidence changed quickly and decisively.

@misc{hollows2022theliter,
  author = {Hollows, Peter},
  title  = {{The Literal Listener}},
  year   = {2022},
  month  = apr,
  url    = {https://dojo7.com/2022/04/08/the-literal-listener/}
}